GROUNDHOG: Grounding Large Language Models to Holistic Segmentation

Summary and Highlight (TL;DR)

We present GROUNDHOG, a multimodal large language model developed by grounding large language models to holistic segmentation. GROUNDHOG is flexible and diagnosable, reduces object hallucination, and works plug-in-and-play with any segmentation foundation model (e.g., SAM).

GROUNDHOG: Grounding LLMs to Holistic Segmentation

Model Architecture

Key Idea: GROUNDHOG formulates the grounding process as an entity segment selection problem, which involves (1) proposing entity segmentation masks, where each mask encapsulates a region with discernible semantic content, and (2) recognizing the retrieved entities by understanding both the visual and the language context.
Details: GROUNDHOG incorporates a masked feature extractor that takes an input image and a set of class-agnostic entity mask proposals, and converts each mask's features into visual entity tokens for an MLLM backbone. This MLLM then connects groundable phrases to unified grounding masks by retrieving and merging the entity masks. To enable holistic entity mask proposals, our default mask proposal model is an enhanced Mask2Former with 50 additional queries each for segmenting parts and text regions, alongside the original 200 entity queries.
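
As a rough illustration of how a mask's features might become a single visual entity token, the sketch below averages the feature vectors that fall inside each binary mask. The pooling operator (plain masked averaging), the nested-list data layout, and the names `masked_average_pool` and `entity_tokens` are assumptions made for this example; they are not taken from the GROUNDHOG code.

```python
from typing import List

Feature = List[float]             # one C-dimensional feature vector
FeatureMap = List[List[Feature]]  # H x W grid of feature vectors
Mask = List[List[int]]            # H x W binary mask (1 = inside the entity)


def masked_average_pool(features: FeatureMap, mask: Mask) -> Feature:
    """Average the feature vectors at the pixels where the mask is 1."""
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for feat, inside in zip(feat_row, mask_row):
            if inside:
                count += 1
                for c in range(channels):
                    total[c] += feat[c]
    if count == 0:
        return total  # empty mask: fall back to a zero vector (assumption)
    return [value / count for value in total]


def entity_tokens(features: FeatureMap, masks: List[Mask]) -> List[Feature]:
    """Produce one pooled token per class-agnostic mask proposal."""
    return [masked_average_pool(features, m) for m in masks]
```

Because the pooled token depends only on the image features and the binary mask, the example needs nothing from the proposal model beyond the masks themselves.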

Pointer Input

We introduce a pointer token <PTR>, which refers to a specific point or region in the image input. <PTR> serves as a placeholder token that is replaced by the visual token of the mask proposal corresponding to the actual pointer input. For example, a user instruction can be formed as "What is that <PTR>?", which asks the model for a referring expression about a specific region.
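
The placeholder substitution can be pictured as a simple splice over the token sequence. The following is a minimal sketch under stated assumptions: tokens are plain strings, each visual token is a list of floats, and `splice_pointer_tokens` is a hypothetical helper name, not part of any released API.

```python
from typing import List, Union

PTR = "<PTR>"
Token = Union[str, List[float]]  # text token, or a visual entity token


def splice_pointer_tokens(
    text_tokens: List[str],
    pointer_embeddings: List[List[float]],
) -> List[Token]:
    """Replace each <PTR> placeholder, in order, with its visual entity token."""
    n_slots = text_tokens.count(PTR)
    if n_slots != len(pointer_embeddings):
        raise ValueError(
            f"expected {n_slots} pointer embeddings, got {len(pointer_embeddings)}"
        )
    remaining = iter(pointer_embeddings)
    return [next(remaining) if tok == PTR else tok for tok in text_tokens]


# Example: the instruction "What is that <PTR>?" with one pointed-at region.
mixed = splice_pointer_tokens(["What", "is", "that", PTR, "?"], [[0.1, 0.2]])
```

In a real model the text tokens would be embedded too; the sketch keeps them as strings so the position of the substituted visual token stays visible.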

M3G2: Dataset for Visually Grounded Instruction Tuning

We introduce the M3G2 dataset for Multi-Modal Multi-Grained Grounding. M3G2 is a comprehensive dataset consisting of 36 sub-problems, derived and augmented from 27 existing datasets with grounded vision-language annotations. The dataset is categorized into four main types: (1) Grounded Image Captioning (GCAP), (2) Referential Expression Segmentation (RES), (3) Grounded Visual Question Answering (GVQA), and (4) Referential Dialogue (RD).

Results and Applications

Grounded image captioning.

Grounded image captioning with short descriptions.
Grounded image captioning with detailed descriptions.

Referential expression segmentation.

Referential dialogue.

Grounded visual question answering.

Less Hallucination, Diagnosability, and Plug-in-and-Play with SAM

Less Hallucination

We assessed object hallucination on the POPE benchmark, which includes binary questions about object existence.
Thanks to the varied task distribution and the inclusion of negative question-answering samples in the M3G2 dataset, GROUNDHOG reduces object hallucination. GROUNDHOG consistently outperforms other models in both accuracy and F1 score across all splits, particularly on the more challenging ones. It shows an absolute improvement of 5.2% in accuracy on the Popular split and 4.0% on the Adversarial split over the previously best-performing model. This suggests that our model's enhanced grounding capability plays a significant role in mitigating the object hallucination problem.
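
For readers unfamiliar with the two metrics, the sketch below computes accuracy and F1 over binary existence answers. It assumes answers are already normalized to the strings "yes" and "no" and that "yes" is the positive class; the function name `binary_metrics` and the toy data are illustrative only and are not POPE results.

```python
from typing import Dict, List


def binary_metrics(predictions: List[str], labels: List[str]) -> Dict[str, float]:
    """Accuracy and F1 for yes/no answers, treating "yes" as the positive class."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, g in pairs if p == "yes" and g == "yes")
    fp = sum(1 for p, g in pairs if p == "yes" and g == "no")
    fn = sum(1 for p, g in pairs if p == "no" and g == "yes")
    correct = sum(1 for p, g in pairs if p == g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "f1": f1}


# Toy example with made-up answers.
toy = binary_metrics(
    ["yes", "yes", "no", "no", "yes"],
    ["yes", "no", "no", "yes", "yes"],
)
```

A model that hallucinates objects tends to answer "yes" too often, which lowers precision and therefore F1 even when recall stays high.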

Diagnosability and Explainability

GROUNDHOG enables diagnosability through the decoupled design of entity proposal and selection. This is exemplified in the case on the left, which illustrates the mask proposal scoring and selective merging process of our model. We show the top-4 masks, where the higher-score masks are labeled in green and the lower-score masks are labeled in red. Users can easily see that the failure is due to the MLLM's inability to recognize the word "KWIK", even though the word was successfully localized and proposed as an entity candidate.
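
The scoring-then-merging step described here can be sketched as a threshold followed by a pixelwise union. The 0.5 threshold, the nested-list mask layout, and the name `merge_selected_masks` are assumptions for illustration; the source does not state how scores are turned into a selection.

```python
from typing import List, Tuple

Mask = List[List[int]]  # H x W binary mask


def merge_selected_masks(
    masks: List[Mask],
    scores: List[float],
    threshold: float = 0.5,  # assumed cut-off, not from the source
) -> Tuple[Mask, List[int]]:
    """Union the proposals whose score passes the threshold.

    Returns the merged mask and the indices of the selected proposals,
    so a failure can be traced to either the proposal or the selection step.
    """
    selected = [i for i, s in enumerate(scores) if s >= threshold]
    height, width = len(masks[0]), len(masks[0][0])
    merged = [[0] * width for _ in range(height)]
    for i in selected:
        for y in range(height):
            for x in range(width):
                if masks[i][y][x]:
                    merged[y][x] = 1
    return merged, selected


proposals = [
    [[1, 0], [0, 0]],
    [[0, 1], [0, 0]],
    [[0, 0], [1, 1]],
]
merged, selected = merge_selected_masks(proposals, [0.9, 0.2, 0.7])
```

Inspecting `selected` alongside the raw scores is what makes the diagnosis possible: a correct region that was proposed but scored low points to the selection step rather than the proposal step.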

Plug-in-and-Play with any segmentation foundation model

GROUNDHOG supports plug-in-and-play with any segmentation foundation model, e.g., SAM, as the model conditions the entity features solely on the binary masks without using any embeddings from the mask proposal model. For the pointer-to-mask conversion, we show the best-matched mask proposal from our Mask2Former+ model in comparison to the mask from SAM. The SAM-generated mask offers a more precise representation of the specified region, leading to a more accurate caption.
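
One plausible way to pick the best-matched proposal for a pointer region is to compare binary masks by intersection-over-union (IoU). The sketch below assumes IoU as the matching criterion and uses hypothetical helper names (`iou`, `best_matching_proposal`); the source does not specify the criterion.

```python
from typing import List

Mask = List[List[int]]  # H x W binary mask


def iou(a: Mask, b: Mask) -> float:
    """Intersection-over-union of two binary masks of equal size."""
    inter = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            if pa and pb:
                inter += 1
            if pa or pb:
                union += 1
    return inter / union if union else 0.0


def best_matching_proposal(pointer_mask: Mask, proposals: List[Mask]) -> int:
    """Index of the proposal that overlaps the pointer region the most."""
    if not proposals:
        raise ValueError("no mask proposals to match against")
    return max(range(len(proposals)), key=lambda i: iou(pointer_mask, proposals[i]))


pointer = [[1, 1], [0, 0]]
candidates = [
    [[1, 0], [0, 0]],  # IoU 1/2
    [[1, 1], [1, 0]],  # IoU 2/3
    [[0, 0], [1, 1]],  # IoU 0
]
best = best_matching_proposal(pointer, candidates)
```

Since the matching operates on binary masks alone, the candidates could come from any segmentation model, which is the property that makes swapping in a different mask source straightforward.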


    title={GROUNDHOG: Grounding Large Language Models to Holistic Segmentation},
    author={Zhang, Yichi and Ma, Ziqiao and Gao, Xiaofeng and Shakiah, Suhaila and Gao, Qiaozi and Chai, Joyce},
    booktitle={Conference on Computer Vision and Pattern Recognition 2024},