V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
Penghao Wu
UC San Diego
pew011@ucsd.edu

Saining Xie
New York University
saining.xie@nyu.edu
[Figure 1 illustration: V*, LLM-guided search over a 1920×830 image with a Visual Working Memory (VWM) holding the <question>, <global image>, <searched targets>, and <target locations> blocks. Question: "What color is the liquid in the glass?" Initial answer: "Sorry, the current visual information is not enough. I need to find the glass first." Contextual cue: "The glass is most likely to appear on the dining table." Final answer: "The liquid in the glass is green." GPT-4V answer: "The liquid in the glass appears to be a shade of pink."]
Figure 1. The visual search mechanism enables humans to identify a target within a multitude of stimuli, streamlining the organization of information critical for problem-solving and reasoning. In this work, we explore this core mechanism in the context of MLLMs, addressing its absence, which currently impedes precise visual grounding, especially for high-resolution images. In this example, the VQA LLM could not immediately answer the question, thus activating V*, an LLM-guided visual search process that uses common sense and contextual cues to search for the required details. Throughout this informed search, it builds a visual working memory (VWM), tokenizing the overall context and the areas of interest related to the targets, which are then re-fed to the VQA LLM, enabling it to accurately answer the question.
Abstract
When we look around and perform complex tasks, how we see and selectively process what we see is crucial. However, the lack of this visual search mechanism in current multimodal LLMs (MLLMs) hinders their ability to focus on important visual details, especially when handling high-resolution and visually crowded images. To address this, we introduce V*, an LLM-guided visual search mechanism that employs the world knowledge in LLMs for efficient visual querying. When combined with an MLLM, this mechanism enhances collaborative reasoning, contextual understanding, and precise targeting of specific visual elements. This integration results in a new MLLM meta-architecture, named Show, sEArch, and TelL (SEAL). We further create V*Bench, a benchmark specifically designed to evaluate MLLMs in their ability to process high-resolution images and focus on visual details. Our study highlights the necessity of incorporating visual search capabilities into multimodal systems. The code is available.

Work done during an internship at NYU.
1. Introduction
One of the hallmarks of human intelligence is being able
to process and integrate multi-sensory information to per-
form complex tasks. A salient aspect of our cognitive rea-
soning process involving visual information is the ability
to conduct visual search – the process of efficiently recog-
nizing and localizing key objects within intricate real-world
scenes. This mechanism plays a fundamental role in the
interaction with the environment and happens everywhere,
from finding keys on a cluttered table to searching for a
friend in the crowd. Besides, it is also an indispensable step
for complex tasks that require multiple reasoning steps. The
intricacy of visual search has been studied for a long time
in cognitive science and vision science [37, …, 52].
While visual search seems intuitive for humans, it is ac-
tually a complex process underpinned by a series of com-
plex behaviors. To accomplish this task efficiently, top-
down feature guidance and contextual scene guidance are
two fundamental factors, guiding humans’ visual search
process [51]. The top-down feature guidance directs hu-
mans’ attention to items with specific features or attributes
(e.g. color, shape, and orientation) based on the specifica-

tion of the target object or knowledge about its general cat-
egory. The contextual scene guidance is based on the fact
that objects are usually well-organized in structured scenes
in real-world scenarios. Therefore, one can use the seman-
tics of the scene, object co-occurrence, and other physical
constraints based on common sense knowledge to pay at-
tention to specific regions, accelerating the search process.
As an important step towards achieving artificial general
intelligence, multimodal LLMs (MLLMs) [1, …]
try to emulate humans’ ability to integrate multimodal in-
formation and perform general tasks. Significant advances
have been made in this domain, leveraging the strong rea-
soning capabilities of large language models. However, a
key limitation of current MLLMs is their dependence on
pre-trained (and often frozen) vision encoders, such as the
CLIP [39] image encoder. This dependency forms a major
bottleneck for visual information processing. The vision en-
coder is often trained on images with low resolution, such as
224×224 or 336×336 pixels. During deployment, images
are also often resized to a lower resolution. As a result, the
encoder may overlook important details in high-resolution
images. Additionally, current MLLMs struggle to identify
which essential visual details are missing or unclear in the
images they process, nor can they proactively seek out or
request this missing information.
Inspired by human capabilities, we propose SEAL
(Show, SEArch, and TelL), a general meta-architecture
to integrate an LLM-guided visual search mechanism into
MLLMs to address the aforementioned visual limitations.
The SEAL framework consists of a VQA LLM and a vi-
sual search model. Unlike typical MLLM models that might
refuse to answer or make uninformed guesses (i.e. halluci-
nations) due to insufficient information from the vision en-
coder, the VQA LLM in SEAL can explicitly pinpoint the
visual details that are missing, thus creating target objects
for focus. Then, using the rich world knowledge and com-
mon sense of language models, the visual search compo-
nent locates these identified elements, adding them to a Vi-
sual Working Memory (VWM). This additional visual data
in the VWM enables the VQA Language Model to provide
more accurate and informed responses. SEAL’s adaptabil-
ity allows it to work with various MLLM base models; in
our case, we use LLaVA [28] as both the VQA LLM and
the MLLM in the visual search model. With this new visual
search capability, the MLLM is better equipped to handle
situations that require accurate visual grounding in high-
resolution images, as highlighted in our comparison (Fig. 2).
As humans’ visual search process is guided by top-down
feature guidance and contextual scene guidance, we design
an informed visual search algorithm dubbed V* with a visual search model following similar principles. For hu-
mans, such guidance largely comes from their knowledge
and experiences about the physical world. Thus, our visual
search model is built atop another MLLM which contains
vast common sense knowledge about the world and can ef-
fectively reason about the possible locations of the target in
the scene based on this knowledge.
The existing MLLM benchmarks [10, …] primarily
focus on providing comprehensive evaluations across var-
ious task categories, and do not adequately challenge or
expose the specific limitations of current paradigms men-
tioned above. To bridge this gap and evaluate our pro-
posed framework, we introduce V*Bench, a new dedicated VQA benchmark that focuses on detailed visual grounding on high-resolution images. V*Bench is a vision-
focused benchmark, requiring multimodal models to accu-
rately ground specific visual information that could be eas-
ily overlooked by a standard, static vision encoder lacking
visual search capabilities. In a world increasingly domi-
nated by rich and complex visual content like images and
videos, it’s crucial for MLLMs to be able to actively fo-
cus on critical visual information for complex reasoning
tasks. This benchmark aims to highlight the significance
of this fundamental mechanism and guide the evolution of
MLLMs towards mirroring the multimodal processing and
reasoning aptitudes inherent in human cognition.
In summary, our contributions are threefold: 1) We pro-
pose SEAL, an MLLM meta-architecture designed to ac-
tively reason about and search for needed visual informa-
tion, a vital capability for vision-intensive multimodal tasks,
especially when dealing with high-resolution images. 2) We
develop a visual search algorithm V* that utilizes the common sense understanding inherent in LLMs to perform efficient informed searches across images of any resolution.
3) We introduce V*Bench to thoroughly evaluate the ability
of MLLMs in accurately processing and grounding detailed
visual information in high-resolution images.
2. Related Work
2.1. Computational Models for Visual Search
Inspired by guiding factors in humans’ visual search pro-
cess, several computational models have been proposed to
mimic the human visual search process. Sclar et al. [41] propose a Bayesian searcher combined with a saliency map as a prior. Torralba et al. [46] combine the local saliency map with global scene priors to form a scene-modulated saliency map. IVSN [59] uses convolutional networks to compute a similarity map between the search image and the target template and performs the search greedily. Yang et al. [55] use inverse reinforcement learning (IRL) to learn
the reward function and policy of human visual search.
Nevertheless, such models mainly focus on mimicking the human gaze trajectory, without requiring accurate localization of the target object. They also usually adopt a fixed-size gazing window, while our visual search model tackles

[Figure 2 panels: questions with paired GPT-4V and SEAL answers, e.g., "What is the color of the child's shoes?" (GPT-4V: white; SEAL: green), "What is the color of the mug?" (GPT-4V: white; SEAL: blue), "What is the instrument held by an ape?" (GPT-4V: a saxophone; SEAL: a guitar), and "What animal is drawn on that red signicade?" (GPT-4V: a rooster or chicken; SEAL: a tiger), along with questions about a flag on a roof, a mug logo, a vending machine, a shooting player's number, a little doll, a bag of bottles, a wait-time board, and a conference mug.]
Figure 2. Examples on which GPT-4V fails (accessed Oct 31, 2023) while SEAL with the V* visual search mechanism succeeds. Even though GPT-4V has a much more powerful LLM (GPT-4) than ours (Vicuna-7B), it still occasionally struggles in scenarios that demand extensive visual processing. These situations require precise visual grounding in high-resolution images, a task where the visual search mechanism becomes essential. Best viewed on screen with zoom. Image sources are provided in the Appendix.
images of any resolution in a hierarchical process. Besides, their use of categorical information about the target objects and contextual scene information is limited to simple statistics and does not generalize to broader domains.
Our visual search model utilizes the rich common sense knowledge of an LLM to expedite the search process. We note that our active search strategy is linked to System II cognitive processes [16]: for complex tasks, dynamically allocating computation for visual search becomes necessary. Our approach can also be thought of as a visual counterpart to the chain-of-thought (CoT) technique used in LLMs [49].
2.2. Multimodal LLMs
Propelled by the success of large language models, vision-language model research has begun to explore how to equip LLMs with additional visual input to solve various multimodal tasks. Currently, MLLMs can be categorized into two types: end-to-end models and LLM tool-using systems.
End-to-end MLLMs. End-to-end MLLMs [1, …, 28, …] connect the pre-trained LLM with a vision encoder through projection or alignment modules, and the whole system is jointly trained in an end-to-end manner. These models aim to project the visual features into the language input embedding space or an intermediate feature space, enabling the LLM to process visual information and perform vision-language tasks. While vision encoders like CLIP [39], which are pre-trained through image-text alignment, can translate visual features into a form of 'language tokens' understandable by LLMs, this process introduces an information bottleneck. The conversion and projection of visual features often lead to inherent information loss, especially since vision encoders are typically constrained to low-resolution images. Consequently, these models may struggle to provide accurate results or might produce hallucinated answers if crucial visual information is poorly captured or inadequately focused upon.
LLM-tool-using systems. LLM-tool-using systems or LLM-based agents treat the LLM as a black box and give it access to vision expert systems to perform certain vision-language tasks through reasoning [14, …, 54, …]. Such systems utilize different kinds of vision experts to provide the needed information about the visual input in the form of text. They usually adopt captioning and detection models to create general textual information about an image, which is then provided to the LLM. Based on the description of the image and a certain question or task instruction, the LLM further decides, through reasoning, what visual information is needed and which visual experts to call. The LLM terminates the process and provides the final answer when it judges the information to be sufficient. However, one main problem of such systems is that, because the whole system runs on text only, certain visual information might be inevitably ignored or distorted when translated into text. Moreover, as the vision experts are not perfect themselves, errors cascade, and the complex and lengthy process makes the whole system prone to failure.
3. Method
Our proposed Show, sEArch, and TelL (SEAL) framework is a general meta-architecture for MLLMs. It is comprised of a VQA LLM and a visual search model, which collaborate and interact through the visual working memory (VWM). An illustration of the SEAL framework is shown in Fig. 3. In
this work, we provide an instantiation of SEAL to validate
its effectiveness and choose the LLaVA-7B model as the
MLLM in the SEAL framework. We now elaborate on the
model structures of each of these two parts. The training
data curation process for the visual search model and the
training details are provided in the Appendix.
3.1. VQA LLM with Visual Working Memory
3.1.1 Model Structure
Modern MLLMs usually have three components: a vision
encoder, a projection module, and an LLM. The type of
projection module varies across different models, including options like the Resampler [1, …], Q-Former [8, …], and the linear layer [5, …]. The placement of the projected vision tokens within the LLM also differs among models, such as in the input layer [8, …] or in middle cross-attention layers [1, …]. Despite these variations, most of these mod-
els adopt pre-trained CLIP as their vision encoder. When
tackling high-resolution and visually-crowded images, the
visual features extracted by CLIP may not capture the nec-
essary information required to answer the question.
Algorithm 1: SEAL Working Pipeline

Function SEAL-VQA(I, T, δ):
    list of needed target objects L ← VQA-LLM(I, T)
    Initialize VWM
    VWM.add(I), VWM.add(T)
    for target in L do
        Priority queue q
        q.add((I, ∞))
        search result ← VisualSearch(q, target, δ)
        if search result is None then
            VWM.add("{target} not existent in the image")
        else
            Crop the object patch from I
            VWM.add("{target} <object patch> at location [x1, y1, x2, y2]")
    response ← VQA-LLM(VWM)
    return response
The visual search mechanism is not always engaged. The model first evaluates whether the encoder's initial (global) visual features suffice to answer the question. If not, it explicitly lists all the needed but missing information as a list of target objects. Then, it initializes a visual working memory (VWM). The VWM has four blocks: the <question> block contains the initial textual question; <global image> contains the initial image; <searched targets> stores the target object crops after search; and <target location> stores the coordinates of the searched targets. Next, the visual search model searches over the image and localizes each required target. A region containing the identified target is then cropped from the whole image. The cropped targets, along with their coordinates, are added to the VWM. After that, the VQA LLM processes the data contained in the VWM to generate the response accordingly. The working pipeline of the SEAL framework is illustrated in Algorithm 1.
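To make the VWM layout concrete, here is a minimal Python sketch of the four blocks as a small container; the class and method names are ours, not the released implementation.

from dataclasses import dataclass, field

@dataclass
class VisualWorkingMemory:
    # Holds the four VWM blocks: <question>, <global image>,
    # <searched targets>, and <target locations>.
    question: str
    global_image: object = None
    searched_targets: list = field(default_factory=list)
    target_locations: list = field(default_factory=list)

    def add_target(self, name, crop, box):
        # Store one searched target crop together with its [x1, y1, x2, y2] box.
        self.searched_targets.append((name, crop))
        self.target_locations.append(box)

    def add_missing(self, name):
        # Record that a required target could not be found in the image.
        self.searched_targets.append((name, None))
        self.target_locations.append(None)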
In this work, we choose the CLIP ViT-L/14 model [39] as the visual feature extractor, with input resized and padded to 224×224. We use it to process both the initial image and the crops of searched targets. To adapt the visual features for input into the LLM, we consider two types of projection modules: the linear layer and the resampler. The linear layer projection module keeps the number of visual tokens from the vision encoder, while the cross-attention-based resampler projection reduces the number of tokens (i.e., from 256 to 32). To manage the token length corresponding to different contents in the VWM, we have designed a simple scheme to flexibly switch between these two projection modules. In

[Figure 3 content: the VQA LLM (vision encoder, resampler/linear projections, language model) reading the VWM blocks (<question>, <global image>, <searched targets>, <target locations>), and the V* search pipeline in which a LoRA-tuned MLLM with a vision backbone answers prompts such as "What is the most likely location of the orange luggage?" and "Please locate the person in the image.", producing contextual cues and <LOC> outputs with coordinates [x1, y1, x2, y2].]
Figure 3. An instantiation of the proposed SEAL framework. The left section represents the VQA LLM, which utilizes all the data within the Visual Working Memory (VWM) to respond to questions. On the right, we illustrate the operational pipeline of the V* visual search algorithm.
scenarios where the input comprises only the initial image
feature without any searched targets, we apply the linear
layer projection to maintain all visual tokens. When one
or two searched targets are present in the VWM, it’s pre-
sumed that the model needs to focus on these targets. In
such cases, we use the linear layer projection for the visual
features of these targets and employ the resampler to sub-
sample the global image features. For situations where the
VWM holds more than two searched targets, the resampler
is used for all visual features to reduce computational cost.
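A minimal sketch of this switching rule, assuming linear_proj and resampler_proj are the two projection modules described above (the function and argument names are ours):

def project_vwm_features(global_feat, target_feats, linear_proj, resampler_proj):
    # global_feat: visual tokens of the initial image (e.g., 256 CLIP tokens).
    # target_feats: list of visual-token tensors, one per searched target crop.
    num_targets = len(target_feats)
    if num_targets == 0:
        # Only the global image: keep all of its visual tokens.
        return linear_proj(global_feat), []
    if num_targets <= 2:
        # Focus on the targets: full tokens for the targets, subsampled global context.
        return resampler_proj(global_feat), [linear_proj(f) for f in target_feats]
    # Many targets: subsample everything to bound the total token length.
    return resampler_proj(global_feat), [resampler_proj(f) for f in target_feats]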
3.2. Data Curation for VQA LLM
Since our VQA LLM will now work with the VWM that has
searched targets, we need to perform additional instruction
tuning to train the VQA LLM. We describe the training data
as follows. More details can be found in the Appendix.
Negative data for target object reasoning (100k). The
VQA LLM must first identify the target objects that are
1) required to answer the question, and 2) missing or not
clear enough in the initial global image features. To facili-
tate this, we construct (image, question, answer) data where
the question pertains to one or two objects not present in the
image. Additionally, we construct questions about details of
certain objects, deliberately made too small to be captured
by the CLIP encoder. This is achieved by choosing objects
with bounding box sizes smaller than 20×20. The appro-
priate response to such questions is a straightforward ac-
knowledgment that the question cannot be answered, along
with a clear enumeration of all the additional target objects
required. We construct 100k samples on COCO2017 [25] with questions generated by GPT-3.5.
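As a rough illustration of how the "too small to be captured" questions are selected, the snippet below keeps COCO annotations whose boxes are smaller than 20×20; the field layout follows the COCO convention and the helper name is ours.

def find_tiny_objects(coco_annotations, max_side=20):
    # COCO boxes are stored as [x, y, width, height].
    tiny = []
    for ann in coco_annotations:
        _, _, w, h = ann["bbox"]
        if w < max_side and h < max_side:
            tiny.append(ann)
    return tiny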
VQA data (167k). This data consists of three parts: GQA data (70k) from [15], VQA data focused on object attributes (51k), and VQA data focused on spatial relationships (46k).
In the GQA subset, we utilize the original dataset’s GT an-
notations about specific objects mentioned in the questions.
We select a portion of this data, treating the mentioned ob-
jects as search targets in the VWM during training. Ad-
ditionally, we rephrase the short answers in GQA into full
sentences using GPT-3.5. For the object attribute data, we
utilize the VAW [38] data, transforming them into question-
answer pairs in a standard format that inquire about certain
object attributes, and consider these objects as search tar-
gets. Regarding the spatial relationship data, we use the
COCO2017 dataset to generate questions about the relative
spatial positioning of two objects within an image, treating
these two objects as the search targets.
LLaVA Instruction Tuning (120k). To maintain the gen-
eral multimodal question answering and instruction follow-
ing capabilities, we also include the LLaVA-80K instruction
tuning data, of which the image sources are also COCO.
Additionally, we identify object entities in the questions that
match with COCO categories and have box annotations.
These matched objects are then designated as the search tar-
gets, creating an additional set of 40k samples.
3.3. V*: LLM-guided Visual Search
3.3.1 Problem Formulation
At a high level, the objective of visual search shares simi-
larities with the task of referring expression comprehension
(REC) [32] in computer vision. REC aims to localize a spe-
cific object in the image as described by a textual referring
expression. However, unlike REC, which is restricted to im-
ages of a specific size, visual search must adapt to images of
any resolution. Sometimes, a thorough search across the en-
tire image is needed to find the target object. Consequently,
visual search efficiency matters: an effective visual search
algorithm should not only locate the target accurately but
also do so as quickly as possible.

[Figure 4 content: block diagrams of the two decoders, built from self-attention, cross-attention, and MLP layers over the image tokens and the v_tl / v_cl embeddings; the target branch outputs coordinates with confidence scores, and the search cue branch outputs a heatmap via a dot product with upsampled image tokens.]
Figure 4. Detailed structure of the target localization decoder D_tl and the search cue localization decoder D_cl.
3.3.2 Model Structure
Similar to how people often zoom in on their phones for
a clearer view, when dealing with a high-resolution image,
it’s possible that the target object cannot be precisely identi-
fied and located if only the entire image is viewed as a small
thumbnail. To address this, one straightforward approach is
to patchify an image into uniformly sized small patches and
perform the localization on each patch exhaustively. This
brute-force strategy is often used in aerial image detection
and whole slide image analysis [6, …]. However, it tends to
be too inefficient for effectively managing images with very
high resolutions – we need a smarter solution.
Drawing inspiration from how humans utilize contextual
scene and top-down feature guidance in their visual search
process, we’ve incorporated similar concepts into the design
of the visual search model in V*. This process utilizes an
MLLM that encapsulates a vast amount of common sense
knowledge, serving as heuristic guidance. In order to local-
ize and crop the searched targets for VWM, it’s also nec-
essary to enhance the MLLM with additional localization
capabilities, comparable to those mentioned in [20, …].
Our visual search model consists of an MLLM and a localization module with an image backbone and two decoders, i.e., a target localization decoder D_tl and a search cue localization decoder D_cl. The MLLM has an additional localization ability, with a localization token <LOC> added to its vocabulary. Given an image and a textual expression of an object or region, the textual expression is first transformed into a fixed-format instruction (i.e., "Please locate the [object] in the image.") and then fed into the MLLM together with the image. The MLLM outputs the localization token <LOC>, which contains contextual and location-related information about the queried textual expression. We process the <LOC> token embedding v_loc with two separate MLPs to get two additional embeddings, v_tl and v_cl. The image tokens from the visual encoder are then combined with v_tl and v_cl, processed by decoders D_tl and D_cl respectively, and output target coordinates (with confidence scores) and a search cue heatmap respectively. D_cl resembles the mask decoder in SAM [19], and D_tl is implemented with two linear heads, one for coordinate prediction and the other for confidence score prediction. The detailed structure of these two modules is shown in Fig. 4.
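The forward pass of the visual search model can be summarized as follows; mllm, mlp_tl, mlp_cl, decoder_tl, and decoder_cl stand in for the components named above, and the exact tensor interfaces are assumptions rather than the released code.

def locate(mllm, image_backbone, mlp_tl, mlp_cl, decoder_tl, decoder_cl, image, expression):
    # Produce target coordinates (with a confidence score) and a search cue heatmap.
    instruction = f"Please locate the {expression} in the image."
    v_loc = mllm(image, instruction)            # embedding of the emitted <LOC> token
    v_tl, v_cl = mlp_tl(v_loc), mlp_cl(v_loc)   # branch-specific embeddings
    image_tokens = image_backbone(image)
    coords, confidence = decoder_tl(image_tokens, v_tl)   # box [x1, y1, x2, y2] + score
    heatmap = decoder_cl(image_tokens, v_cl)              # search cue heatmap
    return coords, confidence, heatmap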
3.3.3 Search Algorithm
With this visual search model, our V* algorithm works as follows. Given an image and a textual expression of the target object, the MLLM in V* first attempts to locate the target directly. In this step, we obtain the target coordinates and the search cue heatmap from v_loc corresponding to the target object. When no object is located (i.e., the confidence score falls below a threshold), we examine the heatmap for possible target-specific cues.
The search cue heatmap highlights regions that could potentially contain the queried target object. When the target-specific cue is prominent (i.e., when the highest value in the heatmap exceeds the threshold δ), we use it to guide the search directly. Otherwise, we ask the MLLM what the most likely location of the target object in the image is. This requires the MLLM to utilize its common sense knowledge and integrate it with the image's context to provide a contextual cue about the target's whereabouts. Upon receiving a description of the region where the target object is likely located, we then prompt the MLLM to locate the described area with the D_cl decoder and produce a search cue heatmap corresponding to the contextual cue.
Then, we use a simple strategy and recursively divide the image into four non-overlapping, equal-sized patches.¹ In order to maintain a square-like aspect ratio for each patch during the search, we adjust our division strategy based on the image's orientation. For landscape images (i.e., where the width is greater than twice the height), we divide the image vertically. Conversely, for portrait images (i.e., where the height exceeds twice the width), we divide it horizontally. In all other cases, we split the image both horizontally and vertically. This approach to patching is depicted in Fig. 5. Subsequently, we assign search priority scores to these patches. The search priority score is calculated from the search cue heatmap (either target-specific or contextual). Based on the priority scores, the patches are then cropped and processed sequentially. This recursive procedure is repeated until the target object is located or the size of the current patch becomes smaller than a predetermined threshold. The overall process of the V* algorithm is illustrated in Algorithm 2.
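A sketch of the aspect-ratio-aware split and the heatmap-based priority scoring described above; splitting extreme aspect ratios into two halves and scoring a patch by its maximum heatmap value are our simplifying assumptions, not necessarily the exact rules used in the paper.

import numpy as np

def split_image(width, height):
    # Return sub-image boxes (left, top, right, bottom) chosen by aspect ratio.
    if width > 2 * height:    # landscape: divide with a vertical cut
        return [(0, 0, width // 2, height), (width // 2, 0, width, height)]
    if height > 2 * width:    # portrait: divide with a horizontal cut
        return [(0, 0, width, height // 2), (0, height // 2, width, height)]
    # near-square: divide into a 2 x 2 grid of four equal patches
    return [(x, y, x + width // 2, y + height // 2)
            for y in (0, height // 2) for x in (0, width // 2)]

def patch_priorities(heatmap, boxes):
    # Score each sub-image by the strongest search-cue activation inside it.
    return [float(np.max(heatmap[top:bottom, left:right]))
            for (left, top, right, bottom) in boxes]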
Connection to the A* algorithm. The naming of our LLM-guided visual search algorithm V* is inspired by its similarities to the informed search algorithm A*. A* is designed for pathfinding, aiming to identify the shortest route between a starting point and a goal by using a heuristic to approximate the cost. In the context of our LLM-guided visual search, V* can be seen as a unique variant of A*, where sub-images are treated as nodes. The cost function g(n) is set as a uniform positive constant for all nodes n, and the heuristic function h(n) is defined as the negative of the priority score derived from the search cue heatmap. While the A* algorithm's objective is to find a path with minimal cost from start to goal, our focus with V* is solely on minimizing the total number of steps required to locate the goal.

¹ A corner case is that this simple strategy might fail when targets are located at the boundaries of patches. One can use overlapping patches or variable-sized patches based on the heatmap distribution, if necessary.

[Figure 5 illustration: portrait, landscape, and square images with their hierarchical sub-images.]
Figure 5. Images are recursively divided into four patches based on their aspect ratio. Landscape images are divided vertically. Portrait images are divided horizontally.
Algorithm 2: V*: LLM-guided Visual Search

Function VisualSearch(q, s, δ):
    Current image I_p = q.pop()
    target coordinates & confidence, search cue heatmap ← VisualSearchModel(instruction = "Please locate the s in the image.", image = I_p)
    if target confidence is high then
        return target coordinates
    if heatmap.max() < δ then
        contextual cue ← VisualSearchModel(instruction = "What is the most likely location of the s in the image?", image = I_p)
        _, search cue heatmap ← VisualSearchModel(instruction = "Please locate the 'contextual cue' in the image.", image = I_p)
    Divide I_p into sub-images, calculate the priority for each sub-image based on the heatmap, and add (sub-image, priority) pairs to q
    while q is not empty do
        search result ← VisualSearch(q, s, δ)
        if search result is not None then
            return search result
    return None
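For concreteness, the best-first bookkeeping implied by Algorithm 2 and the A* analogy can be written with a standard heap, where each node is scored by f(n) = g(n) + h(n) with a constant step cost g(n) and h(n) equal to the negative priority; the helper names are ours.

import heapq
import itertools

_counter = itertools.count()  # tie-breaker so patch boxes never get compared directly

def push_subimages(queue, boxes, priorities, step_cost=1.0):
    # heapq pops the smallest key, so store f(n) = g(n) + h(n) with h(n) = -priority(n).
    for box, priority in zip(boxes, priorities):
        heapq.heappush(queue, (step_cost - priority, next(_counter), box))

def pop_next(queue):
    # Return the next sub-image to explore, or None when the frontier is empty.
    if not queue:
        return None
    _, _, box = heapq.heappop(queue)
    return box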
4. Benchmark
To quantitatively evaluate MLLMs’ ability in challenging
scenarios where the image contains abundant and complex
information and the visual information needed might not
be easily found, we build a benchmark, V*Bench, based on 191 high-resolution images from the SA-1B dataset [19], with an average image resolution of 2246×1582.
Our benchmark contains two sub-tasks: attribute recog-
nition and spatial relationship reasoning. The attribute
recognition task has 115 samples and requires the model
to recognize a certain type of attribute (e.g. color, mate-
rial) of an object. The spatial relationship reasoning task
has 76 samples and asks the model to determine the rela-
tive spatial relationship between two objects. These tasks
focus on evaluating the detailed visual analysis capability
of the multimodal models. Both the test images and ques-
tions have been carefully selected and crafted by human an-
notators to ensure that it is difficult to directly “guess” the
Method | Attribute (%) | Spatial (%) | Overall (%)
Human | 98.26 | 100.00 | 98.95
Random Guess | 26.73 | 50.00 | 35.99
Open-source end-to-end MLLMs
BLIP2 [23] | 26.95 | 53.94 | 37.69
MiniGPT-4 [63] | 30.43 | 50.00 | 38.22
LLaVA [28] | 23.47 | 53.94 | 35.59
InstructBLIP [8] | 25.21 | 47.36 | 34.02
Otter [22] | 26.95 | 56.57 | 38.74
LLaVA-1.5 [27] | 43.47 | 56.57 | 48.68
LLM tool-using pipelines
MM-React [53] | 34.78 | 51.31 | 41.36
VisualChatGPT [54] | 30.43 | 48.68 | 37.69
Visprog [12] | 31.30 | 56.57 | 41.36
Commercial chatbot systems
Bard [11] | 31.30 | 46.05 | 37.17
Gemini Pro [9] | 40.86 | 59.21 | 48.16
GPT-4V [35] | 51.30 | 60.52 | 54.97
SEAL (Ours) | 74.78 | 76.31 | 75.39

Table 1. Evaluation of multimodal systems on V*Bench. We find our SEAL model outperforms leading-edge systems such as GPT-4V and Gemini by a large margin, even though we only use a Vicuna-7B LLM. This result demonstrates the importance of integrating a visual search mechanism into MLLMs.
correct answer without accurate visual grounding of the rel-
evant objects in the image. Examples of our benchmark can
be found in the Appendix.
For a quantitative comparison of open-source MLLM
models on our benchmark, we construct multiple choice
options for each question. We formulate four options for
open-ended questions and two for binary questions. To en-
sure clarity, these multiple choices are carefully crafted and
reviewed by human annotators for any potential ambiguity.
5. Experiments
5.1. Evaluation on V*Bench
In this work, we implement the VQA LLM in our SEAL framework with Vicuna-7B [61] as the language model. We evaluate it against other open-source end-to-end MLLMs and LLM-tool-using systems on the proposed V*Bench. For end-to-end models, we include representative methods [8, …] and use the likelihood approach to evaluate their performance following [2, …]: we select the choice with the highest log-likelihood as the model's prediction. For the LLM-tool-using systems, we evaluate MM-React [53], VisualChatGPT [54], and Visprog [12]. Additionally, we also evaluate commercial multimodal chatbots: Bard [11], Gemini Pro [9], and GPT-4V [35]. For the LLM-tool-using systems and the multimodal chatbots, we prompt them to directly answer with the option, as the likelihood approach does not apply to them, and we ask

Experiment ID | LLM | VWM | Search | Attribute | Spatial | Overall
1 | Vicuna-7B | ✗ | N/A | 38.26 | 55.26 | 45.02
2 | Vicuna-7B | ✓ | Querying Detection (GD [29]) | 62.60 | 61.84 | 62.30
3 | Vicuna-7B | ✓ | Querying Detection (OWL-ViT [34]) | 60.86 | 65.78 | 62.82
4 | Vicuna-7B | ✓ | V* Search | 74.78 | 76.31 | 75.39

Table 2. Ablation studies on the necessity of the visual search mechanism. LLaVA* denotes our VQA model without the visual search mechanism. Detection (GD) and (OWL-ViT) denote replacing the visual search model with GroundingDINO and OWL-ViT respectively.
Strategy | Search Length ↓
Random-DFS | 8.94
Random-BFS | 7.18
Sequential-DFS | 11.39
Sequential-BFS | 6.62
LLM-guided visual search | 4.65
w/o target-specific cue | 5.22
w/o contextual cue | 5.36

Table 3. Evaluation of different search strategies on V*Bench.

[Figure 6 panels: image with target, LLM-guided cues, and human fixation.]
Figure 6. Comparison with the human fixation on COCO-Search18 [7]. Humans tend to focus on center regions or salient objects, while our model focuses on a larger contextual region.
Strategy | Search Length ↓
Random-DFS | 9.97
Random-BFS | 4.90
Sequential-DFS | 9.82
Sequential-BFS | 4.20
Human Fixation (γ=0.9) | 2.52
Human Fixation (γ=0.8) | 2.70
LLM-guided visual search | 2.80

Table 4. Comparison with the human fixation on COCO-Search18.
them to choose the most likely option when they find that it is impossible to answer the question or that none of the options is correct. We evaluate Bard and GPT-4V through the web chatbot (accessed Oct 31, 2023) and evaluate Gemini Pro through the API (accessed Dec 16, 2023).
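The likelihood-based protocol for the open-source models amounts to scoring every candidate option under the model and picking the best one; a hedged sketch with a generic score_option callable (not a specific library API):

def pick_choice(score_option, image, question, options):
    # score_option(image, question, option) should return the summed log-probability
    # the MLLM assigns to the option text given the image and the question.
    scores = [score_option(image, question, opt) for opt in options]
    best_index = max(range(len(options)), key=lambda i: scores[i])
    return options[best_index]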
As shown in Table 1, the performance of most MLLMs is merely close to random guessing. The
GPT-4V and Gemini systems can better handle some rela-
tively easy scenarios in the attribute recognition task, but
the overall performance is still not satisfactory. It’s also
noteworthy that the LLaVA-1.5 model, compared to the ini-
tial LLaVA model, shows a significant improvement in the
attribute recognition task. This enhancement could be par-
tially attributed to the adoption of a new vision encoder with
a higher training resolution (CLIP-ViT-L-336px). How-
ever, there is still a considerable gap in performance when
compared to our visual search strategy. With the visual
search process, our model greatly improves performance.
Nonetheless, considering that humans can achieve near-
perfect results, there remains considerable potential for fur-
ther improvement for MLLMs. Our visual search incurs an
average time cost of 6.0 seconds per target on one A100
GPU. This is a reasonable trade-off, as, akin to human vi-
sual search and reasoning, allocating more computational
resources is necessary for tackling challenging tasks.
5.2. Ablation Study
We conducted ablation experiments to verify the effective-
ness of our key designs. First, we start with the LLaVA
model that has the same structure (without the VWM) as
our VQA LLM and train it on the same training data. Then,
we replace the visual search mechanism with open-world
detectors GroundingDINO [29] and OWL-ViT [34] and use
the detection results to fill in the VWM. The experiment
results are shown in Table 2. We can see that, although we
include attribute recognition and spatial relationship reason-
ing data in the VQA LLM training data, the MLLM without
the visual search mechanism (ID 1) still struggles. Direct
querying detection models (ID 2&3) as a substitute for the
search process also results in significantly inferior perfor-
mance compared to V*. Moreover, off-the-shelf detectors
would encounter practical difficulties when applied to im-
ages of very high resolution.
5.3. Visual Search Evaluation
First, we record all 245 target object locations in V*Bench. We then evaluate different search strategies in terms of search length. The search length here is defined as the number of search steps from the initial image to the patch where the target is located. We only include samples that can be successfully located after the search for evaluation. We compare our LLM-guided V* algorithm with two

Model | V*Bench | MME | POPE | MMBench | SEED-Bench (Img) | MM-Vet | LLaVA-Bench (Wild)
LLaVA* (7B) | 45.0 | 1051.2 | 76.5 | 34.4 | 41.8 | 30.4 | 62.6
SEAL (7B) | 75.3 (+30.30) | 1128.9 (+77.70) | 82.4 (+5.85) | 33.1 (-1.36) | 41.7 (-0.17) | 27.7 (-2.70) | 59.1 (-3.50)

Table 5. When tested on a broader range of multimodal benchmarks, the addition of the visual search module mostly maintains the overall multimodal capability and enhances performance on object hallucination benchmarks like POPE [24].

[Figure 7 panels: three search examples with their cues, e.g., searching for the guard (contextual cues: "The guard is most likely to appear next to the building." / "The guard is most likely to appear near the entrance of the building."), the guitar (contextual cue: "The guitar is most likely to appear on the stage.", followed by a target-specific cue), and the bicycle (contextual cue: "The bicycle is most likely to appear on the street.").]
Figure 7. Examples of the LLM-guided visual search process. Each row in each example represents a step in the visual search process, and the heatmap of the contextual cue or target-specific cue is shown on the right.
baselines. The Random baseline adopts a random search strategy that picks a random sub-image to explore, and the Sequential baseline searches the sub-images sequentially, following a reverse raster-scan order. These two strategies are evaluated in breadth-first search (BFS) and depth-first search (DFS) settings respectively. As shown in Table 3, V* greatly reduces the average search length, and both the target-specific and contextual search cues are helpful. We provide visualizations of the search process in Fig. 7.
To further study the efficiency of the V* algorithm and draw parallels with cognitive science research on visual search, we compare our search outcomes with human behaviors using the COCO-Search18 dataset [7]. COCO-Search18 records people's eye fixations when searching for a specific target object in natural scene images. We use the validation set and select samples where a visual search is needed to successfully locate the target. We convert the ground-truth human fixation sequence on each sample to a 2D heatmap and use it as guidance during the search. Specifically, the fixation sequence is an ordered sequence of points on the image, and we convert it to a dense 2D heatmap by adding Gaussian distributions centered at each fixation point to assign scores to each pixel. To account for the order of the points in the fixation sequence, the contribution of the i-th fixation point is weighted by γ^i, where 0 < γ < 1. We then use this heatmap generated from human fixations to guide our search process and compare it with V* in terms of search length. Interestingly, the V* algorithm achieves efficiency similar to that of human fixations (Table 4). Examples are shown in Fig. 6.
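The fixation-to-heatmap conversion can be sketched as below: each fixation contributes an isotropic Gaussian, down-weighted by γ^i according to its position in the sequence (the Gaussian width σ is a free parameter we pick for illustration).

import numpy as np

def fixations_to_heatmap(fixations, height, width, gamma=0.9, sigma=20.0):
    # fixations: ordered list of (x, y) points; earlier fixations receive larger weights.
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width), dtype=np.float64)
    for i, (fx, fy) in enumerate(fixations):
        weight = gamma ** i
        heatmap += weight * np.exp(-((xs - fx) ** 2 + (ys - fy) ** 2) / (2 * sigma ** 2))
    return heatmap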
5.4. General Multimodal Benchmarks Evaluation
To verify that adding the visual search ability does not impede general multimodal ability, we evaluate our model on several multimodal benchmarks, including MME [10], POPE [24], MMBench [30], SEED-Bench [21], MM-Vet [57], and LLaVA-Bench (In-the-Wild) [28]. For a fair comparison, we compare with the LLaVA model trained on our VQA training data. In Table 5, we show that with the visual search mechanism, performance on the comprehensive MME benchmark improves and the hallucination problem on POPE is alleviated. For the larger-scale benchmarks MMBench and SEED-Bench, the performance basically remains the same. There is a slight decline in performance on MM-Vet and LLaVA-Bench (In-the-Wild). This could be attributed to their smaller scale and their use of a GPT-4-based evaluation method, which may introduce more uncertainty and potential biases. Moreover, certain questions in these benchmarks trigger a visual search for targets that are items in diagrams. This often results in the model failing to locate objects accurately, because it was trained on common objects. Overall, while most common multimodal benchmarks focus on large, prominent visual elements, our model, supplemented with the visual search mechanism, still upholds its general multimodal capabilities.
6. Conclusion
We introduce the SEAL MLLM framework, featuring the LLM-guided visual search algorithm V* for accurate visual grounding in high-resolution images. Our new benchmark V*Bench highlights the critical role of visual search capabilities in MLLMs. At present, our visual search model is primarily tailored to natural images and common objects. To extend its applicability to document and diagram images, long-form videos, or open-world environments, additional training and new algorithm designs are necessary. Moreover, exploring architectural improvements, such as integrating convolution-based models for more efficient processing of images of any resolution, could further enhance the efficiency of the search process.

References
[1]
Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Men-
sch, Katherine Millican, Malcolm Reynolds, Roman Ring,
Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong,
Sina Samangooei, Marianne Monteiro, Jacob Menick, Se-
bastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sa-
hand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira,
Oriol Vinyals, Andrew Zisserman, and Karen Simonyan.
Flamingo: a visual language model for few-shot learning.
InNeurIPS, 2022.,,
[2]
biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan-
tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Lan-
guage models are few-shot learners. 2020.
[3]
stuff: Thing and stuff classes in context. InCVPR, 2018.
[4]
Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-
end object detection with transformers. InECCV, 2020.
[5]
Feng Zhu, and Rui Zhao. Shikra: Unleashing multi-
modal llm’s referential dialogue magic.arXiv preprint
arXiv:2306.15195, 2023.
[6]
Chen, Andrew D Trister, Rahul G Krishnan, and Faisal Mah-
mood. Scaling vision transformers to gigapixel images via
hierarchical self-supervised learning. InCVPR, 2022.
[7]
Minh Hoai, and Gregory Zelinsky. Coco-search18 fixation
dataset for predicting goal-directed attention control.Scien-
tific reports, 2021.,
[8]
Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and
Steven Hoi. InstructBLIP: Towards general-purpose vision-
language models with instruction tuning. InNeurIPS, 2023.
2,,,
[9]
[10]
Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang,
Xiawu Zheng, et al. Mme: A comprehensive evaluation
benchmark for multimodal large language models.arXiv
preprint arXiv:2306.13394, 2023.,
[11]
[12]
ming: Compositional visual reasoning without training. In
CVPR, 2023.
[13]
Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen.
LoRA: Low-rank adaptation of large language models. In
ICLR, 2022.
[14]
Sun, David A Ross, Cordelia Schmid, and Alireza Fathi.
Avis: Autonomous visual information seeking with large
language models. InNeurIPS, 2023.
[15]
dataset for real-world visual reasoning and compositional
question answering. InCVPR, 2019.
[16] Daniel Kahneman. Thinking, fast and slow. Macmillan, 2011.
[17]
Synnaeve, Ishan Misra, and Nicolas Carion. Mdetr – mod-
ulated detection for end-to-end multi-modal understanding.
InICCV, 2021.
[18]
Tamara Berg. ”ReferItGame: Referring to objects in pho-
tographs of natural scenes”. InEMNLP, 2014.
[19]
Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White-
head, Alexander C. Berg, Wan-Yen Lo, Piotr Dollar, and
Ross Girshick. Segment anything. InICCV, 2023.,,
[20]
Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation
via large language model.arXiv preprint arXiv:2308.00692,
2023.
[21]
iao Ge, and Ying Shan. Seed-bench: Benchmarking mul-
timodal llms with generative comprehension.arXiv preprint
arXiv:2307.16125, 2023.,,
[22]
Pu, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Mimic-
it: Multi-modal in-context instruction tuning.arXiv preprint
arXiv:2306.05425, 2023.,,
[23]
BLIP-2: Bootstrapping language-image pre-training with
frozen image encoders and large language models. InICML,
2023.,,,
[24]
Zhao, and Ji-Rong Wen. Evaluating object hallucination in
large vision-language models. InEMNLP, 2023.
[25]
Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence
Zitnick. Microsoft coco: Common objects in context. In
ECCV, 2014.
[26]
Piotr Doll´ar. Focal loss for dense object detection. InICCV,
2017.
[27]
Improved baselines with visual instruction tuning.arXiv
preprint arXiv:2310.03744, 2023.,
[28]
Visual instruction tuning. InNeurIPS, 2023.,,,,
[29]
Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun
Zhu, et al. Grounding dino: Marrying dino with grounded
pre-training for open-set object detection.arXiv preprint
arXiv:2303.05499, 2023.
[30]
Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He,
Ziwei Liu, et al. Mmbench: Is your multi-modal model an
all-around player?arXiv preprint arXiv:2307.06281, 2023.

[31]
Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao.
Chameleon: Plug-and-play compositional reasoning with
large language models.arXiv preprint arXiv:2304.09842,
2023.
[32]
Camburu, Alan L Yuille, and Kevin Murphy. Generation
and comprehension of unambiguous object descriptions. In
CVPR, 2016.
[33]
Camburu, Alan L Yuille, and Kevin Murphy. Generation
and comprehension of unambiguous object descriptions. In
CVPR, 2016.
[34]
Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh
Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran
Shen, et al. Simple open-vocabulary object detection with
vision transformers. InECCV, 2022.
[35] OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[36]
power of tiling for small object detection. InCVPR Work-
shops, 2019.
[37]
world visual search in human occipitotemporal cortex.Pro-
ceedings of the National Academy of Sciences, 2011.
[38]
hen, Quan Tran, and Abhinav Shrivastava. Learning to pre-
dict visual attributes in the wild. InCVPR, 2021.
[39]
Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn-
ing transferable visual models from natural language super-
vision. InICML, 2021.,
[40]
Wen, Baixue Zheng, Baishan Guo, Rui Wang, Aaron Mar-
quez, Rama Kovvuri, Abhishek Kadian, et al. Paco: Parts
and attributes of common objects. InCVPR, 2023.
[41]
Solovey, and Juan Esteban Kamienkowski. Modeling human
visual search: A combined bayesian searcher and saliency
map approach for eye movement guidance in natural scenes.
InNeurIPS Workshop SVRHM, 2020.
[42]
Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365:
A large-scale, high-quality dataset for object detection. In
ICCV, 2019.
[43]
picture, 2011. Produced by CoMix Wave Films. Distributed
by Sentai Filmworks.
[44]
2019. Produced by CoMix Wave Films. Distributed by Toho.
[45]
Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov,
Naejin Kong, Harshith Goka, Kiwoong Park, and Victor
Lempitsky. Resolution-robust large mask inpainting with
fourier convolutions. InWACV, 2022.
[46]
John M Henderson. Contextual guidance of eye movements
and attention in real-world scenes: the role of global features
in object search.Psychological Review, 2006.,
[47]
halli, and Ying Shan. What makes for good visual to-
kenizers for large language models?arXiv preprint
arXiv:2305.12223, 2023.
[48]
man. Statistical learning speeds visual search: More effi-
cient selection, or faster response?Journal of Experimental
Psychology: General, 2023.
[49]
Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al.
Chain-of-thought prompting elicits reasoning in large lan-
guage models.NeurIPS, 2022.
[50]
are looking for?Annual review of vision science, 2020.
[51]
guide attention in visual search.Nature Human Behaviour,
2017.
[52] ˜o, Karla K Evans, and
Michelle R Greene. Visual search in scenes involves se-
lective and nonselective pathways.Trends in Cognitive Sci-
ences, 2011.
[53]
Zecheng Tang, and Nan Duan. Visual ChatGPT: Talking,
drawing and editing with visual foundation models.arXiv
preprint arXiv:2303.04671, 2023.,
[54]
Zecheng Tang, and Nan Duan. Visual ChatGPT: Talking,
drawing and editing with visual foundation models.arXiv
preprint arXiv:2303.04671, 2023.,
[55]
oung Ahn, Gregory Zelinsky, Dimitris Samaras, and Minh
Hoai. Predicting goal-directed human attention using inverse
reinforcement learning. InCVPR, 2020.
[56]
Wang, Hammad Ayyubi, Kai-Wei Chang, and Shih-Fu
Chang. IdealGPT: Iteratively decomposing vision and lan-
guage reasoning via large language models. InEMNLP,
2023.
[57]
Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang.
Mm-vet: Evaluating large multimodal models for integrated
capabilities.arXiv preprint arXiv:2308.02490, 2023.
[58]
Chen Change Loy. Contextual object detection with
multimodal large language models.arXiv preprint
arXiv:2305.18279, 2023.
[59]
Qi Zhao, and Gabriel Kreiman. Finding any waldo with zero-
shot invariant and efficient visual search.Nature communi-
cations, 2018.
[60]
Zoom to learn, learn to zoom. InCVPR, 2019.

[61]
Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan
Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez,
and Ion Stoica. Judging LLM-as-a-judge with MT-bench and
chatbot arena. InNeurIPS, 2023.,
[62]
Wenxuan Zhang, and Mohamed Elhoseiny. ChatGPT asks,
blip-2 answers: Automatic questioning towards enriched vi-
sual descriptions.arXiv preprint arXiv:2303.06594, 2023.
[63]
hamed Elhoseiny. MiniGPT-4: Enhancing vision-language
understanding with advanced large language models.arXiv
preprint arXiv:2304.10592, 2023.,,,

A. Implementation Details
A.1. Data Curation for VQA LLM
For the GQA part of the 167k VQA data, our target is to find ques-
tions where the annotated objects mentioned in the question are
critical to correctly answer the question. Therefore, we first evalu-
ate InstructBLIP on GQA questions with annotated objects in the
question and only keep questions that can be correctly answered by
it. Then we use the image inpainting model LaMa [45] to erase all
the mentioned objects in the corresponding images and re-evaluate
the InstructBLIP model with the modified images and only keep
the questions that cannot be correctly answered after this modification. Through this process, we have selected a subset of GQA questions where the annotated objects are important, and we use them to construct our VQA data.
For the VAW object attribution part of the 167k VQA data, we
create open-ended questions and binary questions about objects’
attributes. For the open-ended questions, we consider attribute
types including ‘color’, ‘material’, ‘hair color’, ‘pattern’, ‘face ex-
pression', 'pose', 'activity', 'opaqueness', and 'texture'. For the bi-
nary questions, we additionally include attribute types ‘state’ and
‘optical property’. All these attribute types follow the definition
from the VAW dataset. We use fixed templates for both types of
questions. For the open-ended questions, the question template is
“What is the [attribute type] of the [object name]?” and for the
binary questions, the template is “Is the [attribute type] of the [ob-
ject name] [attribute value]?” The corresponding answer to open-
ended questions is “The [attribute type] of [object name] is [at-
tribute value].” and the answer to binary questions is “Yes/No, the
[attribute type] of [object name] is/is not [attribute value].” Be-
sides, we use the same strategy as the GQA part to filter the ques-
tions with the InstructBLIP model and the object erasing process.
For the additional 40K data created from LLaVA-80K instruc-
tion tuning data, we first extract all the noun phrases in the ques-
tions/instructions of the LLaVA-80K data. Then we choose noun
phrases that are matched with the object category names defined
by COCO. Note that we augment some original category names
with more common synonyms (e.g. add ‘man’ and ‘woman’ for
the ‘person’ category). Then we check whether there exist anno-
tated instances of this category in this corresponding image. If
so, we keep this sample and use the annotated instances together
with their bounding box information as the target objects for our
training.
A.2. Data Curation for Visual Search Model
The training data of our visual search model includes two parts.
The first part is the detection and segmentation data and the second
part is the VQA data which includes possible locations QAs and
LLaVA-80K instruction tuning data.
The COCO-Stuff [3], LVIS-PACO part [40], refCOCO(+/g)
[18,], and refCLEF [18] datasets are used as both detection and
segmentation data. Objects365 v2 [42] and GoldG [17] datasets
are only used as detection data.
The textual contextual cues of our visual search model are in
the form of possible location expressions about the target objects.
So we construct (image, question, answer) pairs about objects’
possible locations. We randomly sample a subset of images from
COCO2017. For each image, we randomly sample 2 objects that
are absent in this image but appear in the five images that are most
similar to it (based on CLIP embeddings). The question is always "What is the most likely position of [object]?". Then we
provide the image information (5 captions and a list of existing
objects) to GPT-3.5 and ask it to provide the possible location of
the absent objects and use its response as the answer. The com-
plete prompt is shown in Table.
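A sketch of the absent-object sampling behind these location QAs; clip_embeddings and objects_in are hypothetical stand-ins for precomputed CLIP image embeddings and COCO annotations.

import numpy as np

def sample_absent_objects(image_id, clip_embeddings, objects_in, k_neighbors=5, n_samples=2):
    # Rank the other images by cosine similarity of their CLIP embeddings.
    query = clip_embeddings[image_id]
    others = [i for i in clip_embeddings if i != image_id]
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    neighbors = sorted(others, key=lambda i: cosine(query, clip_embeddings[i]), reverse=True)[:k_neighbors]

    # Candidate objects appear in the neighbor images but not in this image.
    present = set(objects_in(image_id))
    candidates = sorted({obj for i in neighbors for obj in objects_in(i)} - present)
    rng = np.random.default_rng(0)
    picked = rng.choice(candidates, size=min(n_samples, len(candidates)), replace=False)
    return [f"What is the most likely position of the {obj}?" for obj in picked]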
A.3. Model Training
For the VQA LLM, we use the Vicuna-7b-1.3 [61] as the language
model. Following the common practice of current MLLMs, the
training process has two stages, a feature alignment stage and an
instruction tuning stage. For the alignment stage, the linear projection module and the resampler projection module are separately trained, with the language model and vision encoder frozen, on image-text pairs from the 558K LAION-CC-SBU subset used in LLaVA. For the instruction tuning stage,
we use the constructed 387k data to train the LLM along with the
projection modules, and only the vision encoder is frozen in this
stage.
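A possible way to express this two-stage freezing scheme in PyTorch is sketched below; the module names (vision_encoder, llm, linear_proj, resampler) are placeholders rather than the actual implementation.

# Illustrative freezing scheme for the two training stages described above.
import torch.nn as nn

def configure_stage(model: nn.Module, stage: str, proj: str = "linear"):
    """`model` is assumed to expose submodules named vision_encoder, llm,
    linear_proj, and resampler (hypothetical names)."""
    for p in model.parameters():
        p.requires_grad = False                          # start fully frozen
    if stage == "align":
        # the two projection modules are aligned in separate runs
        trainable = [model.linear_proj if proj == "linear" else model.resampler]
    elif stage == "instruction_tuning":
        # LLM and both projection modules are trained; the vision encoder stays frozen
        trainable = [model.llm, model.linear_proj, model.resampler]
    else:
        raise ValueError(f"unknown stage: {stage}")
    for module in trainable:
        for p in module.parameters():
            p.requires_grad = True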
For the alignment stage, the linear projection module and the resampler projection module are separately trained with batch size 256. The linear projection module is trained for 1 epoch with learning rate $10^{-3}$ and the resampler module is trained for 5 epochs with learning rate $2\times10^{-4}$. For the instruction tuning stage, the model is trained for 3 epochs with learning rate $2\times10^{-5}$ and batch size 128. To reduce the computational cost, during the
training, we use the linear projection module to project the search target’s feature only when there is a single target object; otherwise, we use the linear projection module to project the global image feature and the resampler for the searched objects. The exact
input sequence for the LLM constructed from the VWM is:
<Image>
Additional visual information to focus on:
{Target Object 1’s Name} <Object> at location [x1, y1, x2, y2];
{Target Object 2’s Name} <Object> at location [x1, y1, x2, y2];
· · ·
Question
Here, <Image> denotes the feature tokens of the image and <Object> denotes the feature tokens of the target object stored in the VWM.
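The following sketch shows how such an input sequence could be assembled from the VWM entries; build_vwm_prompt is an illustrative helper, not the released code.

# Sketch of flattening the VWM into the LLM input shown above. The placeholders
# <Image> and <Object> are later replaced by visual feature tokens.
def build_vwm_prompt(question, targets):
    """targets: list of (name, [x1, y1, x2, y2]) entries stored in the VWM."""
    lines = ["<Image>", "Additional visual information to focus on:"]
    for name, (x1, y1, x2, y2) in targets:
        lines.append(f"{name} <Object> at location [{x1}, {y1}, {x2}, {y2}];")
    lines.append(question)
    return "\n".join(lines)

# Per the training-time rule above: with a single target, its feature is projected by
# the linear projection module; with multiple targets, the global image feature goes
# through the linear projection and each searched object through the resampler.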
For the visual search model, we adopt the LLaVA-7B-v1.1 as
the MLLM. We use the OWL-ViT-B-16’s vision encoder as the
image backbone. The $D_{cl}$ is trained with a segmentation loss which consists of a binary cross-entropy loss and a DICE loss. During inference, the logits output from the $D_{cl}$ are used as the search cue heatmap. The $D_{tl}$ is trained with a set prediction loss similar to DETR [4] with focal loss [26] for coordinate regression. The whole model is trained for 100K steps with batch size 64 and learning rate $10^{-4}$. The sampling ratio of general detection/segmentation datasets (Objects365 v2, COCO-Stuff, LVIS-PACO), referring detection/segmentation datasets (refCOCO, refCOCO+, refCOCOg, refCLEF, GoldG), and VQA data is 15:8:15.
During training, the pre-trained MLLM is trained with LoRA [13] with the word embeddings layer being trainable. The image encoder of the localization module and the coordinates MLP in $D_{tl}$ are frozen. The confidence score MLP and the $D_{cl}$ are trainable.
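For reference, a common way to combine the two segmentation loss terms mentioned above is sketched below in PyTorch; the equal loss weights and the smoothing constant are assumptions.

# Sketch of a BCE + soft DICE segmentation loss over heatmap logits.
import torch
import torch.nn.functional as F

def dice_loss(logits, targets, eps=1.0):
    # soft DICE over per-pixel probabilities; logits/targets: [B, H, W]
    probs = logits.sigmoid().flatten(1)
    targets = targets.flatten(1)
    inter = (probs * targets).sum(-1)
    union = probs.sum(-1) + targets.sum(-1)
    return (1.0 - (2.0 * inter + eps) / (union + eps)).mean()

def segmentation_loss(logits, targets, w_bce=1.0, w_dice=1.0):
    bce = F.binary_cross_entropy_with_logits(logits, targets)
    return w_bce * bce + w_dice * dice_loss(logits, targets)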
A.4. Visual Search Process
During the search process, we first set a relatively high threshold and terminate when a target is located with a confidence score higher than this threshold. If the whole search process completes without any target being found, we lower the threshold, take the target with the highest confidence score encountered during the whole search, and accept it if its confidence score passes the lowered threshold. When visual search is needed to find a certain target, locating all of its instances in a high-resolution image would require scanning the whole image exhaustively, which would make the search strategy less meaningful. Therefore,
our current visual search process focuses on finding a single tar-
get object instead of locating all targets exhaustively. However,
if we successfully locate multiple targets directly on the global
image without the need for further search, we add all of them to
the VWM. When evaluated on V*Bench, the target with the highest confidence score is accepted as the searched target if no target passes the threshold during the search process. In our implementation, the higher threshold is set to 0.5 and the lower threshold is set to 0.3. The threshold $\delta$ is set to $\max(3.0, 6.0\times0.7^{l})$, where $l$ is the image sub-dividing level.
To let the MLLM in the visual search model generate the search cue heatmap corresponding to the contextual cue, we extract the noun phrase from the textual contextual cue (a possible location expression) and prompt the MLLM to locate that phrase and output the corresponding heatmap.
B. V*Bench Examples and Two Special Subsets
Some examples of our proposed V*Bench are shown in Fig. 8. Besides the regular attribute recognition and spatial relationship reasoning sub-tasks, we additionally collect two special subsets and add them to V*Bench for exploratory study. The first subset
contains 30 VQA samples which require the model to recognize and understand textual characters or digits on certain objects in the image; we denote this subset as OCR. To better expose
the problem of current MLLMs and even the leading multimodal
system GPT-4V, we collect 17 samples on which GPT-4V fails but our simple model with the visual search mechanism succeeds, and denote them as GPT-4V-hard. We also evaluate LLaVA-1.5, MM-REACT, GPT-4V, and SEAL on these two subsets, and the results are shown in Fig. 10. As the MM-React system relies on
an external OCR detection model and GPT-4V also likely has an OCR module, their performance on the OCR task is decent. However, for MM-React, the external OCR model detects all the text in the image and provides it to the LLM in a bottom-up manner. Therefore, it is easy to choose the correct option merely from the detected text when the image contains only a small amount of text. When there is a lot of text in the image, or the question requires the model to fully understand the context of the
text or its location, the system would fail. An example is shown in Fig. 9.

Question: Is the red balloon above the white balloon?
Options:
• The red balloon is below the white balloon.
• The red balloon is above the white balloon.

Question: What is the color of the clock?
Options:
• The color of the clock is green.
• The color of the clock is black.
• The color of the clock is red.
• The color of the clock is yellow.

Question: What is the material of the stool?
Options:
• The material of the stool is plastic.
• The material of the stool is wood.
• The material of the stool is steel.
• The material of the stool is bamboo.

Question: Is the broom on the left or right side of the folded chair?
Options:
• The broom is on the left side of the folded chair.
• The broom is on the right side of the folded chair.

Figure 8. Examples of V*Bench. The top row belongs to the attribute recognition task while the bottom row belongs to the spatial relationship reasoning task. The correct option is in green.
We also provide the image sources of examples in Figure 2 of the main text below:
Row-1, Column-1:
Row-1, Column-2: Japanese animated film Children Who Chase Lost Voices [43]
Row-1, Column-3:
Row-1, Column-4:
Row-2, Column-1: SR-RAW dataset [60]
Row-2, Column-2: Personal Photo in Orlando Disney
Row-2, Column-3:
Row-2, Column-4: SR-RAW dataset [60]
Row-3, Column-1: SR-RAW dataset [60]
Row-3, Column-2:
Row-3, Column-3: SAM dataset [19]
Row-3, Column-4: SR-RAW dataset [60]
And the image source of the example in Figure 1 of the main text is the Japanese animated film Weathering with You [44].
C. Learning Spatial Relationship from Coordinates
We provide the numerical coordinates of the search targets to the VQA LLM as the spatial information about the searched targets. We find that, though it seems intuitive and simple to recognize relative spatial relationships from coordinates, it is not trivial for the VQA LLM to understand the numerical coordinates and determine the spatial relationship between search targets by comparing them. We conduct additional experiments in which we train the VQA LLM only on the constructed 46K spatial-relationship-related VQA data; the loss curve is shown in Fig. 11.

Question: What is the number on that blue board?
Options:
• The number on that blue board is 2050.
• The number on that blue board is 2013.
• The number on that blue board is 2030.
• The number on that blue board is 2023.

Figure 9. An example on which the MM-React system with an external OCR detection model fails. The correct option is in green and the option chosen by MM-React is in red.

Figure 10. Comparison between SEAL and top MLLM systems on the OCR and GPT4V-hard sub-tasks.

Figure 11. The loss curve of training on spatial relationship VQA data. The “grokking” happens after a certain number of optimization steps.
We can see that, instead of decreasing gradually, the loss suddenly drops to 0 after a certain number of optimization steps, suggesting that the model has learned to correctly compare the numerical coordinates for determining spatial relationships. As this kind of grokking requires a certain number of optimization steps, one might need to increase the ratio of the spatial-relationship-related VQA data when mixing it with a larger amount of general multimodal instruction tuning data (e.g. the training data of LLaVA-1.5 [27]) to ensure the model can correctly understand the numerical coordinates.
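For intuition, a relation between two boxes can be derived by comparing box centers, as in the illustrative rule below; this is not necessarily the exact procedure used to label the 46K spatial-relationship data.

# Illustrative rule for deriving a spatial relation from two bounding boxes.
def spatial_relation(box_a, box_b):
    """Relation of object A to object B from [x1, y1, x2, y2] boxes
    (image coordinates, y grows downward)."""
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    horizontal = "left of" if ax < bx else "right of"
    vertical = "above" if ay < by else "below"
    # report the axis with the larger center separation as the dominant relation
    return horizontal if abs(ax - bx) >= abs(ay - by) else vertical

print(spatial_relation([0, 0, 10, 10], [100, 0, 110, 10]))  # 'left of'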
”””
You are an AI visual assistant that can analyze a single image. You receive five captions, each describing the same
image you are observing. And you will also be given a list of objects which are in the image.
Captions:
{}
Objects:{}
You will be given two target objects, and your task is to use your common sense knowledge and the information
about the image to describe the most possible location of the given target objects in the image. Your answer must
be a short expression describing the possible location referring to some other existing entities in the image. Answer
concisely in less than 10 words.
Examples:
Target Object: bird
Most Possible Location: in the sky
Target Object: flag
Most Possible Location: on the roof of the building
###
Target Object 1:{}
Target Object 2:{}
Output Format
Most Possible Location of Target Object 1:
Most Possible Location of Target Object 2:
”””
Table 6. Prompt for GPT-3.5 to generate a possible location expression for an object that is absent from the image.
”””
You are an AI visual assistant that can analyze a single image. You receive five captions, each describing the same
image you are observing.
Captions:
{}
You will be given 1 target object and you need to imagine certain attributes of it using your imagination and assume
it is in the image.
You need to ask 2 questions about the provided target object for each of the following question types:
Type 1: certain object attributes
Type 2: relative positions between objects
Type 3: interactions between objects
Your question could also involve other objects or provided information about the image. Do not ask questions about
the existence of the objects. Your questions should not contain any uncertainty about the presence of the target
object. Make sure each question involves the target object provided below. Ask short and natural questions of less
than 20 words.
Target Object:
{}
###
Output format:
Type 1
Question 1:
Question 2:
Type 2
Question 1:
Question 2:
Type 3
Question 1:
Question 2:
”””
Table 7. Prompt for GPT-3.5 to generate question-answer pairs about one target object which is absent in the image.
”””
You are an AI visual assistant that can analyze a single image. You receive five captions, each describing the same
image you are observing.
Captions:
{}
You will be given 2 target objects and you need to imagine certain attributes of them using your imagination and
assume they are in the image.
You need to ask 2 questions around the provided target objects for each of the following question types:
Type 1: direct attribute questions about the two target objects or simple logical comparison of the attributes
Type 2: positions of the 2 target objects or relative positions between them or other objects
Type 3: interactions between the two target objects while other objects might be used as reference
Target Objects:
{}
Your question could also involve other objects or provided information about the image. Do not ask questions about
the existence of the objects. Your questions should not contain any uncertainty about the presence of the target
objects. Ask short and natural questions of less than 20 words. Each question must contain both the{}and the{}.
###
Output format:
Type 1
Question 1:
Question 2:
Type 2
Question 1:
Question 2:
Type 3
Question 1:
Question 2:
”””
Table 8. Prompt for GPT-3.5 to generate question-answer pairs about two target objects which are absent in the image.
”””
Assume there is an object of type{}in an image, you need to come up with two visual questions asking about the
visual details of the{}. Make sure the questions are so detailed that it is very hard to answer if the{}is very small
in the image. Do not ask about the existence of the object.
Examples:
Object: shirt; Question: what is the text printed on the shirt?
Object: cup; Question: Does the cup have a dotted pattern?
Ask 2 reasonable questions about the{}and each question should be less than 20 words.
Object:{}
Question 1:
Question 2:
”””
Table 9. Prompt for GPT-3.5 to generate question-answer pairs about the details of a small object in the image.