The ability to locate an object in an image according to natural language instructions is crucial for many real-world applications, yet no high-quality benchmark is dedicated to evaluating this ability. We therefore propose LocateBench, a benchmark dataset for evaluating proprietary large visual language models. We experiment with several prompting approaches and find that even the best proprietary large visual language models still lag behind human performance by 10% in accuracy. We expect our benchmark to guide the future development of visual language models.