January 09, 2026
By Rian Dolphin
I spent some time exploring whether you could automatically optimise query transformation prompts for retrieval systems using LLM-based failure analysis. The idea was appealing: identify why queries fail, generate hypotheses to fix them, test and combine the winners. Derivative-free optimisation without needing gradients through the retrieval pipeline.
It didn't work. But the failure modes are interesting, and there are some useful takeaways.
The goal was simple: given a retrieval system (embedding model + vector search), can we find a prompt that transforms user queries to improve retrieval performance?
The pipeline looked like this:
for each cycle:
1. Run retrieval on training queries
2. Identify failed queries (target doc not in top-k)
3. Ask an LLM to analyse failure patterns
4. Generate hypothesis prompts to address each pattern
5. Test each hypothesis, keep the best
6. Combine successful hypotheses
7. Repeat
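In code, the loop looks roughly like this. It's a simplified sketch rather than the actual implementation; `retrieve`, `apply_prompt`, `analyse_failures`, `generate_hypothesis_prompt`, `evaluate_prompt`, and `combine_prompts` are hypothetical stand-ins for the embedding search and LLM calls.

```python
# Rough sketch of the optimisation loop. All helpers are hypothetical stand-ins:
# retrieve() is the embedding + vector search, the analyse/generate/combine
# helpers are LLM calls, and evaluate_prompt() computes R@k on training queries.
def optimise_prompt(train_queries, corpus, k=10, n_cycles=5):
    best_prompt = None  # None = no transformation (the baseline)
    best_score = evaluate_prompt(best_prompt, train_queries, corpus, k)

    for cycle in range(n_cycles):
        # 1-2. Run retrieval; a query "fails" if its target doc misses the top-k
        failures = [
            q for q in train_queries
            if q.target_doc_id not in retrieve(apply_prompt(best_prompt, q.text), corpus, k)
        ]

        # 3-4. Ask the LLM to cluster the failures into patterns and propose a
        #      transformation prompt aimed at each pattern
        patterns = analyse_failures(failures)
        hypotheses = [generate_hypothesis_prompt(p) for p in patterns]

        # 5. Test each hypothesis prompt; keep the ones that beat the current best
        scored = [(h, evaluate_prompt(h, train_queries, corpus, k)) for h in hypotheses]
        winners = [h for h, score in scored if score > best_score]

        # 6. Combine the winning hypotheses into one prompt and adopt it if it helps
        if winners:
            combined = combine_prompts(winners)
            combined_score = evaluate_prompt(combined, train_queries, corpus, k)
            if combined_score > best_score:
                best_prompt, best_score = combined, combined_score

    return best_prompt, best_score
```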
I tested on three BEIR datasets: SciFact, NFCorpus, and FiQA.
The embedding model was OpenAI's text-embedding-3-small. The LLM for transformations and analysis was Claude Haiku 4.5.
I wanted to test whether we could optimise for embedding separation, in a similar spirit to contrastive learning, and whether that would result in better retrieval. However, we can't directly compute a gradient with respect to the prompt, so we need a proxy metric that we can measure and optimise.
$$\text{separation} = \frac{1}{|Q|} \sum_{q \in Q} \left[ \text{sim}(q, d^+) - \frac{1}{|N_q|} \sum_{d^- \in N_q} \text{sim}(q, d^-) \right]$$
where $d^+$ is the relevant document and $N_q$ is a sample of negatives. The idea is that if the transformed query has higher similarity to the positive and lower similarity to the negatives (as in contrastive learning), retrieval should improve.
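Concretely, with unit-normalised embeddings (so a dot product is a cosine similarity), the metric is just a few lines. A minimal sketch; the variable names are mine, not from the pipeline:

```python
import numpy as np

def separation(query_embs, pos_embs, neg_embs):
    """Mean of [sim(q, d+) - mean sim(q, d-)] over all queries.

    query_embs: (n_queries, dim) unit-normalised transformed-query embeddings
    pos_embs:   (n_queries, dim) embedding of each query's relevant document
    neg_embs:   list of (n_negatives, dim) arrays of sampled negative documents
    """
    gaps = []
    for q, d_pos, d_negs in zip(query_embs, pos_embs, neg_embs):
        pos_sim = float(q @ d_pos)            # sim(q, d+)
        neg_sim = float(np.mean(d_negs @ q))  # mean sim(q, d-) over the sample
        gaps.append(pos_sim - neg_sim)
    return float(np.mean(gaps))
```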
This is wrong, or at least not predictive. Here's why:
| Dataset | Separation Δ | R@10 Δ |
|---|---|---|
| SciFact | +3.1% | -0.2% |
| NFCorpus | +13.4% | -2.2% |
| FiQA | -0.3% | -4.8% |
Separation improved in two of the three datasets while R@10 got worse in all three. On NFCorpus, a 13% improvement in separation corresponded to a 2% drop in retrieval performance. Obviously, there are no significance tests on these numbers, so take them with a pinch of salt.
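For reference, R@10 here is plain Recall@10. A minimal sketch of the metric (not the exact evaluation harness; BEIR qrels handling may differ slightly):

```python
def recall_at_k(ranked_doc_ids, relevant_doc_ids, k=10):
    """Average over queries of the fraction of relevant docs found in the top k.

    ranked_doc_ids:   dict mapping query_id -> list of doc ids, best first
    relevant_doc_ids: dict mapping query_id -> set of relevant doc ids
    """
    scores = []
    for qid, relevant in relevant_doc_ids.items():
        if not relevant:
            continue
        top_k = set(ranked_doc_ids.get(qid, [])[:k])
        scores.append(len(top_k & relevant) / len(relevant))
    return sum(scores) / len(scores)
```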
Across all three datasets, the optimised prompts never beat the better of baseline (no transformation) or naive ("rewrite this query for search"):
| Dataset | Baseline R@10 | Naive R@10 | Optimised R@10 |
|---|---|---|---|
| SciFact | 0.8549 | 0.8348 | 0.8531 |
| NFCorpus | 0.1860 | 0.1967 | 0.1819 |
| FiQA | 0.5218 | 0.4664 | 0.4966 |
The naive baseline behaviour is itself interesting. A simple "rewrite for search" prompt helps on NFCorpus (0.1967 vs 0.1860) but hurts SciFact and FiQA, in FiQA's case by more than five points.
The optimised prompts sit somewhere in between, recovering some of the damage the naive prompt causes but never exceeding the best baseline.
One of the more counterintuitive findings: combining multiple successful hypotheses often produces worse results than any individual hypothesis.
On NFCorpus cycle 0:
| Approach | R@10 |
|---|---|
| H1 alone | 0.167 |
| H2 alone | 0.176 |
| H3 alone | 0.183 |
| Combined | 0.157 |
The combined prompt had a 14% drop from the best individual hypothesis.
I think there are two things going on:
Vector interference: Each transformation strategy pushes the query embedding in a different direction. When you apply multiple strategies, the pushes partially cancel or interfere, landing the embedding somewhere less useful than any single transformation.
Prompt complexity: A multi-step prompt ("first do X, then do Y, then do Z") produces less consistent outputs than a focused single-strategy prompt. The LLM has more room to go off the rails.
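The interference point is easy to show with a toy 2D example (made-up numbers, purely illustrative): two pushes that each move the query closer to the relevant document can, when stacked, overshoot and land somewhere worse than either single transformation.

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

d_pos = np.array([1.0, 0.0])               # relevant document embedding
q = np.array([1.0, 1.0]) / np.sqrt(2)      # original query embedding

push_a = np.array([0.3, -0.5])             # shift from transformation strategy A
push_b = np.array([0.2, -0.6])             # shift from transformation strategy B

print(cos(q, d_pos))                       # baseline: ~0.71
print(cos(q + push_a, d_pos))              # A alone: ~0.98
print(cos(q + push_b, d_pos))              # B alone: ~0.99
print(cos(q + push_a + push_b, d_pos))     # both stacked: ~0.95, below either single strategy
```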
The irony is that the failure analysis itself produces genuinely useful insights. Here's a sample from NFCorpus:
Pattern: "Query Title Metaphor/Catchphrase vs. Nutritional Content"
Frequency: 6 failures
Description: Queries use evocative titles like 'Blocking the First Step
of Heart Disease' or 'Priming the Proton Pump' that don't contain the
actual nutritional/medical topics covered in target documents.
This is correct! The NFCorpus queries are often weird marketing-style titles that don't match the scientific language of the corpus. The LLM correctly identifies this.
The problem is that "fixing" these queries by expanding them doesn't actually help retrieval. The embedding model was trained on natural queries and already does a reasonable job. Our expansions add noise without adding signal that helps ranking.
The BEIR queries aren't "broken." They're the actual queries people used with those corpora, and they're probably already close to the kind of natural queries the embedding model was trained on. We're essentially trying to "fix" queries that were already designed/selected to work with their documents. Of course that doesn't improve retrieval; the baseline is already doing what it can with the query as given.
The approach doesn't work as implemented, but the failure modes are instructive:
1. Embedding separation, the proxy metric, isn't predictive of retrieval performance; it can improve while R@10 gets worse.
2. Combining individually successful hypothesis prompts tends to do worse than the best single hypothesis.
3. The LLM's failure analysis is often accurate, but "fixing" queries that already suit the embedding model adds noise rather than signal.
Sometimes the best outcome of a research project is a clear understanding of why something doesn't work. Or at least, I'll tell myself that this time!