LocateBench: Evaluating the Locating Ability of Visual Language Models

University of Southern California

Abstract

The ability to locate an object in an image according to natural language instructions is crucial for many real-world applications. However, there is no high-quality benchmark dedicated to evaluating this ability. Therefore, we propose LocateBench, a benchmark dataset for evaluating commercially available proprietary large visual language models. We experiment with several prompting approaches and find that even the best proprietary large visual language models still lag behind human performance by 10% in accuracy. We expect our benchmark to guide the future development of visual language models.

[Figure: Example LocateBench questions, e.g., "Which one contains the bunch of bananas that has only one sticker?", "Which one contains the tallest oval-shaped vase?", "Which one contains the third fridge counting from the left?", and "Which one contains the fruit that is to the left of the yellow one?"]
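As a rough illustration of how such multiple-choice locating questions might be posed to a proprietary visual language model, the sketch below sends an image together with a question and lettered candidate options to the OpenAI chat API. The model name, option format, and prompt wording are assumptions made for illustration, not the paper's exact prompting protocol.

```python
# Hypothetical sketch: posing a LocateBench-style multiple-choice locating
# question to a proprietary VLM via the OpenAI API. Model name, prompt wording,
# and option encoding are illustrative assumptions, not the paper's protocol.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def encode_image(path: str) -> str:
    """Base64-encode an image file so it can be sent inline with the prompt."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def ask_locating_question(image_path: str, question: str, options: list[str]) -> str:
    """Ask the model to pick which lettered option answers the locating question."""
    letters = "ABCD"
    option_text = "\n".join(f"{letters[i]}. {opt}" for i, opt in enumerate(options))
    prompt = f"{question}\n{option_text}\nAnswer with a single letter (A, B, C, or D)."
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model; the paper evaluates several proprietary VLMs
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encode_image(image_path)}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()


# Example usage (hypothetical file name and candidate options):
# answer = ask_locating_question(
#     "kitchen.jpg",
#     "Which one contains the third fridge counting from the left?",
#     ["region A", "region B", "region C", "region D"],
# )
```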