Chain-of-Caption: Training-free Improvement of Multimodal Large Language Model on Referring Expression Comprehension

Queen Mary, University of London

Abstract

Given a textual description, the task of referring expression comprehension (REC) involves localising the referred object in an image. Multimodal large language models (MLLMs) have achieved high accuracy on REC benchmarks through scaling up the model size and training data. Moreover, the performance of MLLMs can be further improved using techniques such as chain-of-thought and tool use, which provide additional visual or textual context to the model. In this paper, we analyse how various techniques that provide additional visual and textual context to the MLLM via tool use affect performance on the REC task. Furthermore, we propose a training-free framework named Chain-of-Caption to improve the REC performance of MLLMs. We perform experiments on the RefCOCO/RefCOCOg/RefCOCO+ and Ref-L4 datasets and show that individual textual or visual contexts can improve REC performance without any fine-tuning. By combining multiple contexts, our training-free framework achieves a 5% to 30% performance gain over the baseline model in accuracy at various Intersection over Union (IoU) thresholds.
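For reference, accuracy at an IoU threshold counts a prediction as correct when its overlap with the ground-truth box exceeds the threshold. The following is a minimal sketch of this standard metric (boxes given as (x1, y1, x2, y2); the function names are illustrative, not part of our released code):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(predictions, ground_truths, threshold=0.5):
    """Fraction of predicted boxes whose IoU with the ground truth meets the threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)
```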

Proposed method

We propose to tackle the referring expression comprehension task with an MLLM in a two-stage process. (1) A grounded description is generated for the input image using the MLLM; each line contains an object description paired with its bounding box coordinates. (2) We use the VQA and captioning capabilities of the MLLM to refine the predicted bounding box in a process named Chain-of-Caption, sketched below.
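The sketch below illustrates one plausible way to wire up this two-stage pipeline. The MLLM interface `query_mllm`, the parsing helper, the prompts, and the refinement loop are hypothetical placeholders for illustration only; they are not the exact prompts or APIs used in our experiments.

```python
import re

def query_mllm(image, prompt):
    """Placeholder for a call to the multimodal LLM (e.g. its chat/VQA interface)."""
    raise NotImplementedError

def parse_grounded_description(text):
    """Parse lines of the form '<object description>: [x1, y1, x2, y2]' into (description, box) pairs."""
    pairs = []
    for line in text.splitlines():
        match = re.match(r"(.+?):\s*\[([\d.,\s]+)\]", line)
        if match:
            box = [float(v) for v in match.group(2).split(",")]
            pairs.append((match.group(1).strip(), box))
    return pairs

def chain_of_caption(image, referring_expression, num_rounds=2):
    # Stage 1: generate a grounded description of the image, one object
    # description with bounding box coordinates per line.
    grounded = query_mllm(
        image, "Describe every object in the image with its bounding box, one per line.")
    candidates = parse_grounded_description(grounded)

    # Select an initial candidate box by asking the MLLM a VQA-style question
    # about which listed object matches the referring expression.
    answer = query_mllm(
        image,
        f"Which of these objects is '{referring_expression}'? Answer with its index.\n"
        + "\n".join(f"{i}: {d} {b}" for i, (d, b) in enumerate(candidates)))
    box = candidates[int(re.search(r"\d+", answer).group())][1]

    # Stage 2: Chain-of-Caption refinement. Caption the current box, then ask
    # the MLLM to adjust the coordinates so the region matches the expression.
    for _ in range(num_rounds):
        caption = query_mllm(image, f"Describe the content of the region {box}.")
        refined = query_mllm(
            image,
            f"The region {box} is described as '{caption}'. "
            f"Adjust the coordinates so the region contains '{referring_expression}'. "
            "Answer with [x1, y1, x2, y2].")
        box = [float(v) for v in re.findall(r"[\d.]+", refined)[:4]]
    return box
```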


Results

Our proposed method, Chain-of-Caption, improves prediction accuracy, especially at high IoU thresholds, producing bounding boxes that fit the target more tightly. Figure legend: ground-truth bounding box vs. predicted bounding box.
