[Paper Review] Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization (ICLR 2026)

A recent ICLR 2026 paper proposes a more direct approach to hybrid retrieval: instead of fusing outputs, GQR performs test-time optimization on the query embedding itself. This post breaks down the GQR architecture, the experimental results on the ViDoRe benchmark, and why this approach can be an efficient option for high-performance RAG pipelines.

Guided Query Refinement (GQR) — ICLR 2026 — arXiv | GitHub

Motivation

Hybrid retrieval has become the default strategy in modern retrieval systems, especially in multimodal and long-document settings.

Typical hybrid systems combine:

Dense retrievers (semantic similarity)
Sparse retrievers (lexical matching)
Sometimes vision-based retrievers

However, most systems combine these models at the score level:

Linear interpolation
Rank fusion
Reciprocal Rank Fusion (RRF)

Score fusion combines decisions after retrieval.

GQR modifies the query representation before retrieval.

This shift from output fusion to representation refinement is the core conceptual contribution of the paper.

Core Idea

Given a primary retriever (e.g., dense or multimodal encoder) and a complementary retriever (e.g., sparse BM25), instead of fusing their retrieval results, GQR:

Uses the complementary retriever to score documents
Derives a refinement signal from those scores
Updates the query embedding of the primary retriever
Performs retrieval again using the refined query

Formally:

Original query embedding: q
Refined embedding: q′=q+αΔq

Where ΔqΔq is computed from complementary retrieval signals.

This happens at test time, without retraining the base retriever.

Paper foundations

The paper states explicitly that the query encoder weights are frozen and the query embedding is treated as a learnable parameter. The optimization target is the embedding vector itself, not the query text. Like training a neural network, GQR updates the query’s numerical representation via backpropagation.

Consensus distribution : The auxiliary retriever yields a distribution p2(d∣e^q_2) over candidates; the primary yields p1(d∣z^{(t)}). Their consensus is:

The primary model adjusts its query embedding z^{(t)} so that p_1 moves toward p_{\mathrm{avg}}. The objective minimizes \mathrm{KL}(p_{\mathrm{avg}} \| p_1) with respect to z ; the gradient step z^{(t+1)} = z^{(t)} — \alpha \nabla_z \mathcal{L} updates only the query vector, not model weights.

Two Retrievers in Hybrid Search

Each retriever encodes the same query and scores documents, but their outputs differ because they use different encoders, modalities, or scoring mechanisms (e.g., dot-product vs MaxSim).

Where they differ

(A) Representation level — Query embeddings $e_1^q$ and $e_2^q$ live in different embedding spaces.

(B) Score level — For each document, similarity scores differ (e.g., retriever 1: 0.78, 0.34, 0.69; retriever 2: 0.62, 0.68, 0.87), so score distributions differ.

© Ranking level — Different scores produce different rankings (e.g., retriever 1: {d1, d3, d2}; retriever 2: {d3, d2, d1}).

Why GQR needs them different

If both retrievers produced identical outputs, there would be no complementary signal and no refinement. GQR works because the auxiliary retriever captures information the primary misses (e.g., primary for visual alignment, auxiliary for lexical match). The disagreement between them carries useful signal.

Typical setup

Primary Auxiliary Vision-centric multimodal encoder Text-only dense retriever Late interaction (MaxSim) Single-vector dot-product Larger index Lightweight index

Important nuance

GQR does not combine both outputs. The auxiliary retriever’s distribution over documents is a teacher signal; the primary query embedding is updated so its distribution moves toward the auxiliary. The final ranking uses only the primary retriever. The auxiliary is guidance only.

Why This Matters

Score fusion assumes:

Query embedding is fixed and correct
Errors can be fixed by mixing scores

Dense retrievers often fail because the query embedding:

Misses lexical cues
Underrepresents rare tokens
Is biased toward training distribution

GQR treats complementary retrieval as a form of weak supervision to reshape the query representation.

This reframes hybrid retrieval as a test-time optimization problem.

Multimodal Setting

The paper extends this idea to multimodal retrieval.

In multimodal hybrid systems:

Vision-based retrievers capture layout and visual semantics
Text-based sparse retrievers capture lexical precision

Visual retrievers (e.g., ColPali) are strong at layout and visual features but weak at exact lexical matching — specific words, numbers, or rare tokens. When a BM25-style auxiliary retriever signals “this document contains that exact term,” GQR injects that information into the primary model’s query vector, improving retrieval accuracy without changing the encoder.

Naive fusion struggles when the modalities disagree. GQR enables:

Text-based signals to guide visual query embeddings
Or vice versa

This is useful in:

Visually rich documents
Long PDFs
Chart/table-heavy pages
Multimodal QA benchmarks

Test-Time Optimization Perspective

GQR reframes retrieval as a query adaptation problem at inference time.

It is related to:

Test-time adaptation
Meta-learning
Query expansion
Prompt refinement

But differs in key ways:

It does not generate synthetic text (unlike HyDE); it modifies the query embedding via backpropagation, not LLM text generation
It does not retrain models; encoder weights stay frozen
It operates directly in embedding space
It uses real complementary scores

Algorithmic Outline

High-level procedure:

Encode query using primary retriever → q
Run complementary retriever → get document scores
Construct refinement direction Δq
Update query embedding
Re-run primary retriever with refined query

Computational overhead is modest (one extra retrieval step).

Code

The official implementation (GitHub) provides a clean PyTorch implementation of GQR and baseline fusion methods.

Fusion baselines (post-hoc)

Baseline methods in fusion_methods.py fuse rankings or scores after retrieval. The query embedding is never modified:

def sim_score_fusion(r1, r2, normal_func=normalize_min_max, alpha=0.5):
    # r1, r2: retrieval results {query: [(doc_id, score), ...]}
    for doc in union:
        agg_score = alpha * get_score(doc, curr_r1) + (1-alpha) * get_score(doc, curr_r2)
        raw_score_averages[doc] = agg_score
    # Returns fused ranking — query embedding unchanged

Score fusion normalizes outputs from each retriever (min-max or softmax) and takes a weighted average. The retriever models and query representations stay fixed.

Retriever

The Retriever class in retriever.py loads precomputed embeddings and computes similarity. It supports single-vector (dot product) and multi-vector (MaxSim, for ColPali-style late interaction):

def get_topk(d_embs, q_embs, k, is_multi=False):
    if is_multi:
        similarity = score_multi_vector(q_embs, d_embs, batch_size=8)
    else:
        similarity = torch.matmul(q_embs, d_embs.T)
    top_vals, top_idx = torch.topk(similarity, k=k, dim=-1)
    return top_vals, top_idx


def compute_scores(self, q, d):
    if self.is_multi:
        similarity = score_multi_vector(q, d, batch_size=16)
        similarity = similarity / q.size(1)
    else:
        similarity = torch.matmul(q, d.T)
    return similarity

For multi-vector embeddings, MaxSim computes ∑imax⁡jqiTdj∑imaxjqiTdj. The implementation uses pad_sequence for variable-length token sequences and a 4D einsum (bnd,csd->bcns) to compute token-level similarity in batches; this supports ColPali-style late interaction. Importantly, MaxSim is differentiable: the gradient flows through this scoring function to the query tokens, so GQR can refine each token embedding based on the auxiliary retriever's signal. This is crucial for multimodal late-interaction models. Single-vector uses L2-normalized dot product. get_topk builds the candidate pool; compute_scores is called inside the optimization loop to obtain p1p1.

Loss functions

The code supports multiple loss functions. KL divergence is the default; Jensen-Shannon divergence (js_divergence) is a symmetric alternative that can behave more stably and improve optimization robustness:

def kl_divergence(p, q, eps=1e-8):
    return torch.sum(p * torch.log((p + eps) / (q + eps)))

def js_divergence(p, q, eps=1e-8):
    m = .5 * (p + q)
    return .5 * (torch.sum(p * torch.log((p + eps) / (m + eps))) + torch.sum(q * torch.log((q + eps) / (m + eps))))

GQR core loop (iterative optimization)

GQR is not a one-shot formula; it performs test-time optimization. The query embedding is declared as a parameter and fine-tuned over n_steps iterations (typically via Adam) so the primary retriever's document distribution moves toward consensus with the auxiliary. The loss is KL or JS divergence. The refined query can then be used for full-index search (higher recall) or candidate re-ranking (lower latency). From query_optimizations.py:

# Query made trainable
qs = [q[i].unsqueeze(0).clone().detach().requires_grad_(True) for i in range(Q)]
opt = optimizer(qs, lr=lr)  # Adam, etc.

# Candidate pool: primary retriever's top-k
top_k = get_topk(d, q, k, is_multi=main_model.is_multi)[1]
set1 = d[top_k[q_id]]  # docs for this query
# Sparse (BM25): slice_sparse_coo_tensor for efficient candidate-pool score extraction
if f_d.is_sparse:
    set2 = slice_sparse_coo_tensor(f_d, slice_indices=top_k[q_id]).to(device)
else:
    set2 = f_d[top_k[q_id]].to(device)
# Softmax distributions (paper Eq. 8-9)
d2 = F.softmax(feedback_model.compute_scores(q2, set2) / T, -1)  # p_2 from complementary
d1 = F.softmax(main_model.compute_scores(q1, set1) / T, -1)      # p_1 from primary
# Consensus: p_avg = (1-α) p_1 + α p_2
mixture_d = (1 - mixture_alpha) * d1 + mixture_alpha * d2
# Iterative optimization: n_steps gradient steps
for step in range(n_steps):
    d1 = F.softmax(main_model.compute_scores(q1, set1) / T, -1)
    loss = loss_func(mixture_d, d1)  # KL or JS
    opt.zero_grad(set_to_none=True)
    loss.backward(retain_graph=True)
    opt.step()

When the auxiliary retriever is sparse (e.g., BM25), slice_sparse_coo_tensor efficiently slices the sparse document matrix to the candidate pool, integrating dense and sparse signals in a unified tensor workflow without full dense conversion.

With Search vs No Search

Two execution modes control the trade-off between latency and recall:

With Search (optimize_queries_with_search): After refinement, run retrieval again over the full index with the refined query. Higher recall, higher latency.
No Search (optimize_queries_no_search): Keep the candidate pool from the optimization phase; only re-score within that pool. Faster, but recall is limited to the initial candidates.

Selection depends on whether full search or candidate re-ranking is acceptable for the application.

Pseudocode summary

q = encode(query_text)

for t in range(n_steps):
    primary_scores = primary.compute_scores(q, candidates)
    guidance_scores = complementary.compute_scores(query_emb, candidates)
    p1 = softmax(primary_scores / T)
    p2 = softmax(guidance_scores / T)
    p_avg = (1 - alpha) * p1 + alpha * p2
    loss = KL(p_avg, p1)
    q = q - lr * grad(loss, q)
final_results = primary.search(q)

Fusion vs GQR

Baseline fusion GQR Query Fixed Refined Aggregation (primary scores + guidance scores) → rerank guidance → update q → re-retrieve Level Score or rank Representation

GQR keeps the primary retriever fixed and absorbs complementary signals by refining the query representation, which also works with multi-vector retrievers (e.g., MaxSim).

Empirical Findings

The paper evaluates primarily on ViDoRe (Visual Document Retrieval), which includes visually rich documents such as charts, infographics, and complex layouts. With ColPali (vision-centric multi-vector) as the primary retriever and BM25 (sparse text-based) as the auxiliary, GQR consistently outperforms standalone ColPali in nDCG@5 and Recall@5. The largest gains appear on subsets where fine-grained text matters (e.g., ChartQA, InfographicVQA), where the auxiliary retriever supplies lexical signals the vision model tends to miss.

GQR vs traditional fusion

Unlike RRF, which only reorders documents already in the top-k, GQR refines the query representation. In “With Search” mode, the refined query can surface relevant documents that were not in the initial retrieval. Compared with static score fusion (linear interpolation), GQR is more robust across datasets; score fusion often needs manual tuning of weights per domain, whereas GQR adapts via test-time optimization without retraining. In out-of-distribution settings where the primary model was not trained on a given document type (e.g., specialized financial charts), the auxiliary retriever’s signal acts as a correction, improving accuracy without model retraining.

With Search vs No Search

With Search: Re-retrieval over the full index with the refined query. Best for recall improvement.
No Search: Re-scoring only within the initial top-k. Faster and useful for improving precision within the candidate pool.

Modality synergy

Experiments show complementary effects between vision and text:

Lexical guidance for vision: BM25 provides specific terms (e.g., dates, IDs in tables) that visual encoders tend to blur, helping ColPali focus on fine-grained text.
Visual guidance for text: When a dense text retriever is primary and a vision model is auxiliary, the vision signal helps the text model capture layout and visual semantics it would otherwise ignore.

Optimization efficiency

Most performance gains occur within about 10–20 optimization steps; further iterations show diminishing returns. GQR adds inference-time cost but remains faster than LLM-based query rewriting (e.g., HyDE) because it operates directly on embeddings via gradient descent.

Retrieval performance

The paper reports NDCG@5 by primary model, complementary model, and subset (Biomed Lectures, Economics, ESG Human, ESG Full). Deltas are relative to the “No refinement” baseline for each primary.

Primary models: Colnomic-7B (vision), Jina (vision), Llama-Nemo (vision); Jina (text) and Linq-Embed also appear as text primaries. Complementary models: Jina (text), Linq-Embed, Qwen3.

Findings: Jina (text) and Linq-Embed generally yield strong gains as complements; Qwen3 is more mixed and sometimes lowers scores on some subsets. ESG Full often shows the largest delta improvements. Linq-Embed frequently attains the best average across subsets for Colnomic-7B, Jina (vision), and Llama-Nemo. Text-as-complement (e.g., Jina text guiding Colnomic or Jina vision) adds lexical signal; vision-as-complement adds layout and visual semantics for text primaries.

Indicative comparison

Setup Query updated? Relative nDCG@5 (ViDoRe) ColPali (standalone) No Baseline ColPali + BM25 (RRF) No Moderate gain GQR (ColPali, BM25 auxiliary) Yes Highest

Comparison with Traditional Hybrid

Method Fusion Level Query Modified Requires Retraining Linear interpolation Score No No RRF Rank No No Late fusion Score No No GQR Representation Yes No

Conceptual Takeaway

The main contribution is the reframing:

Hybrid retrieval can adapt query representations instead of mixing scores.

Instead of post-hoc score or rank fusion, the paper argues that retrievers can reach consensus on document relevance and refine the query embedding accordingly. This is supported by the consensus distribution, the KL-based optimization, and empirical gains on ViDoRe.

Thank you!

References

Official paper: Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization (ICLR 2026).
Official code: IBM/test-time-hybrid-retrieval — Full PyTorch implementation of GQR, including the optimization loops and multimodal retriever benchmarks.
ColPali: Efficient Document Retrieval with Vision Language Models (ICLR 2025)
Dataset: ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval

'Papers & Research Notes > Paper & Code Review' 카테고리의 다른 글

Recent Multimodal RAG Papers (ColPali, SV-RAG, URaG, MetaEmbed) (0)	2026.02.14
Paper Review: Making LLMs Better Many-to-Many Speech-to-Text Translators with Curriculum Learning (0)	2025.10.11
[논문 리뷰] StyleGAN: A Style-Based Generator Architecture for GANs (0)	2021.03.25
[논문 리뷰] Multiple-Clothing Detection and Fashion Landmark Estimation Using a Single-Stage Detector (0)	2021.03.13
[논문 리뷰] EfficientDet: Scalable and Efficient Object Detection Review (0)	2021.03.06

Air

[Paper Review] Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization (ICLR 2026)

Motivation

Core Idea

Paper foundations

Two Retrievers in Hybrid Search

Where they differ

Why GQR needs them different

Typical setup

Important nuance

Why This Matters

Multimodal Setting

Test-Time Optimization Perspective

Algorithmic Outline

Code

Fusion baselines (post-hoc)

Retriever

Loss functions

GQR core loop (iterative optimization)

With Search vs No Search

Pseudocode summary

Fusion vs GQR

Empirical Findings

GQR vs traditional fusion

With Search vs No Search

Modality synergy

Optimization efficiency

Retrieval performance

Indicative comparison

Comparison with Traditional Hybrid

Conceptual Takeaway

References

'Papers & Research Notes > Paper & Code Review' 카테고리의 다른 글

댓글

티스토리툴바

[Paper Review] Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization (ICLR 2026)

Motivation

Core Idea

Paper foundations

Two Retrievers in Hybrid Search

Where they differ

Why GQR needs them different

Typical setup

Important nuance

Why This Matters

Multimodal Setting

Test-Time Optimization Perspective

Algorithmic Outline

Code

Fusion baselines (post-hoc)

Retriever

Loss functions

GQR core loop (iterative optimization)

With Search vs No Search

Pseudocode summary

Fusion vs GQR

Empirical Findings

GQR vs traditional fusion

With Search vs No Search

Modality synergy

Optimization efficiency

Retrieval performance

Indicative comparison

Comparison with Traditional Hybrid

Conceptual Takeaway

References

'Papers & Research Notes > Paper & Code Review' 카테고리의 다른 글

관련글

댓글

티스토리툴바