GROUNDHOG: Grounding Large Language Models to Holistic Segmentation

M3G2: Dataset for Visually Grounded Instruction Tuning

Task schema overview.

Dataset statistics.

Grounded Image Captioning

Grounded image captioning with detailed descriptions on the PNG dataset.

Grounded image captioning with short descriptions on the Flickr30K-Entity dataset.

Referring Expression Segmentation

Referring expression segmentation with RefCOCO+ dataset.

Referring expression segmentation with PhraseCut dataset.

Referring expression segmentation with gRefCOCO dataset.

Referring expression segmentation with D-Cube dataset.

Referring expression segmentation with ReasonSeg dataset.

Referring expression segmentation with RIO dataset.

Referring expression segmentation with SK-VG dataset.

Grounded Visual Question Answering

Grounded visual question answering with chain-of-thought reasoning on the GQA dataset.

Grounded visual question answering with VizWiz dataset.

Grounded visual question answering with TextVQA-X dataset.

Grounded visual question answering with VQS dataset.

Grounded visual question answering with Shikra-Binary dataset.

Grounded visual question answering with EntityCount dataset.

Grounded visual question answering with FoodSeg-QA dataset.

Grounded visual question answering with LVIS dataset.

Referential Dialogue

Referential dialogue with RefCOCO+ dataset.

Referential dialogue with VG dataset.

Referential dialogue with V7W dataset.

Referential dialogue with PointQA-Local dataset.

Referential dialogue with PointQA-Twice dataset.

Referential dialogue with VCR dataset.

Referential dialogue with Shikra-RD dataset.

Referential dialogue with SVIT dataset.

Referential dialogue with GuessWhat dataset.

Referential dialogue for region matching with VG dataset.

Referential dialogue with HierText dataset.

BibTeX


@inproceedings{zhang2024groundhog,
    title={GROUNDHOG: Grounding Large Language Models to Holistic Segmentation},
    author={Zhang, Yichi and Ma, Ziqiao and Gao, Xiaofeng and Shakiah, Suhaila and Gao, Qiaozi and Chai, Joyce},
    booktitle={Conference on Computer Vision and Pattern Recognition 2024},
    year={2024}
}
