January 09, 2026
By Rian Dolphin
I spent some time exploring whether you could automatically optimise query transformation prompts for retrieval systems using LLM-based failure analysis. The idea was appealing: identify why queries fail, generate hypotheses to fix them, test and combine the winners. Derivative-free optimisation without needing gradients through the retrieval pipeline.
It didn't work. But the failure modes are interesting, and there are some useful takeaways.
The goal was simple: given a retrieval system (embedding model + vector search), can we find a prompt that transforms user queries to improve retrieval performance?
The pipeline looked like this:
for each cycle:
1. Run retrieval on training queries
2. Identify failed queries (target doc not in top-k)
3. Ask an LLM to analyse failure patterns
4. Generate hypothesis prompts to address each pattern
5. Test each hypothesis, keep the best
6. Combine successful hypotheses
7. Repeat
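In code, the loop looks roughly like this. It's a simplified sketch rather than the actual implementation; `retrieve`, `apply_prompt`, `analyse_failures`, `generate_hypothesis_prompt`, `evaluate_prompt`, and `combine_prompts` are hypothetical stand-ins for the embedding search and LLM calls.

```python
# Rough sketch of the optimisation loop. All helpers are hypothetical stand-ins:
# retrieve() is the embedding + vector search, the analyse/generate/combine
# helpers are LLM calls, and evaluate_prompt() computes R@k on training queries.
def optimise_prompt(train_queries, corpus, k=10, n_cycles=5):
    best_prompt = None  # None = no transformation (the baseline)
    best_score = evaluate_prompt(best_prompt, train_queries, corpus, k)

    for cycle in range(n_cycles):
        # 1-2. Run retrieval; a query "fails" if its target doc misses the top-k
        failures = [
            q for q in train_queries
            if q.target_doc_id not in retrieve(apply_prompt(best_prompt, q.text), corpus, k)
        ]

        # 3-4. Ask the LLM to cluster the failures into patterns and propose a
        #      transformation prompt aimed at each pattern
        patterns = analyse_failures(failures)
        hypotheses = [generate_hypothesis_prompt(p) for p in patterns]

        # 5. Test each hypothesis prompt; keep the ones that beat the current best
        scored = [(h, evaluate_prompt(h, train_queries, corpus, k)) for h in hypotheses]
        winners = [h for h, score in scored if score > best_score]

        # 6. Combine the winning hypotheses into one prompt and adopt it if it helps
        if winners:
            combined = combine_prompts(winners)
            combined_score = evaluate_prompt(combined, train_queries, corpus, k)
            if combined_score > best_score:
                best_prompt, best_score = combined, combined_score

    return best_prompt, best_score
```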
I tested on three BEIR datasets: SciFact, NFCorpus, and FiQA.
The embedding model was OpenAI's text-embedding-3-small. The LLM for transformations and analysis was Claude Haiku 4.5.
I wanted to test whether we could optimise for embedding separation, in a similar spirit to contrastive learning, and whether that would result in better retrieval. However, we can't directly compute a gradient with respect to the prompt, so we need a proxy metric that we can measure and optimise.
$$\text{separation} = \frac{1}{|Q|} \sum_{q \in Q} \left[ \text{sim}(q, d^+) - \frac{1}{|N_q|} \sum_{d^- \in N_q} \text{sim}(q, d^-) \right]$$
where $d^+$ is the relevant document and $N_q$ is a sample of negatives. The idea is that if the transformed query has higher similarity to the positive and lower similarity to the negatives (as in contrastive learning), retrieval should improve.
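Concretely, with unit-normalised embeddings (so a dot product is a cosine similarity), the metric is just a few lines. A minimal sketch; the variable names are mine, not from the pipeline:

```python
import numpy as np

def separation(query_embs, pos_embs, neg_embs):
    """Mean of [sim(q, d+) - mean sim(q, d-)] over all queries.

    query_embs: (n_queries, dim) unit-normalised transformed-query embeddings
    pos_embs:   (n_queries, dim) embedding of each query's relevant document
    neg_embs:   list of (n_negatives, dim) arrays of sampled negative documents
    """
    gaps = []
    for q, d_pos, d_negs in zip(query_embs, pos_embs, neg_embs):
        pos_sim = float(q @ d_pos)            # sim(q, d+)
        neg_sim = float(np.mean(d_negs @ q))  # mean sim(q, d-) over the sample
        gaps.append(pos_sim - neg_sim)
    return float(np.mean(gaps))
```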
This is wrong, or at least not predictive. Here's why:
| Dataset | Separation Δ | R@10 Δ |
|---|---|---|
| SciFact | +3.1% | -0.2% |
| NFCorpus | +13.4% | -2.2% |
| FiQA | -0.3% | -4.8% |
Separation improved in two of the three datasets while R@10 got worse in all three. On NFCorpus, a 13% improvement in separation corresponded to a 2% drop in retrieval performance. Obviously, there are no significance tests on these numbers, so take them with a pinch of salt.
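For reference, R@10 here is plain Recall@10. A minimal sketch of the metric (not the exact evaluation harness; BEIR qrels handling may differ slightly):

```python
def recall_at_k(ranked_doc_ids, relevant_doc_ids, k=10):
    """Average over queries of the fraction of relevant docs found in the top k.

    ranked_doc_ids:   dict mapping query_id -> list of doc ids, best first
    relevant_doc_ids: dict mapping query_id -> set of relevant doc ids
    """
    scores = []
    for qid, relevant in relevant_doc_ids.items():
        if not relevant:
            continue
        top_k = set(ranked_doc_ids.get(qid, [])[:k])
        scores.append(len(top_k & relevant) / len(relevant))
    return sum(scores) / len(scores)
```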
Across all three datasets, the optimised prompts never beat the better of baseline (no transformation) or naive ("rewrite this query for search"):
| Dataset | Baseline R@10 | Naive R@10 | Optimised R@10 |
|---|---|---|---|
| SciFact | 0.8549 | 0.8348 | 0.8531 |
| NFCorpus | 0.1860 | 0.1967 | 0.1819 |
| FiQA | 0.5218 | 0.4664 | 0.4966 |
The naive baseline behaviour is itself interesting. A simple "rewrite for search" prompt helps on NFCorpus (0.1967 vs 0.1860) but hurts SciFact and FiQA, in FiQA's case by more than five points.
The optimised prompts sit somewhere in between, recovering some of the damage the naive prompt causes but never exceeding the best baseline.
One of the more counterintuitive findings: combining multiple successful hypotheses often produces worse results than any individual hypothesis.
On NFCorpus cycle 0:
| Approach | R@10 |
|---|---|
| H1 alone | 0.167 |
| H2 alone | 0.176 |
| H3 alone | 0.183 |
| Combined | 0.157 |
The combined prompt had a 14% drop from the best individual hypothesis.
I think there are two things going on:
Vector interference: Each transformation strategy pushes the query embedding in a different direction. When you apply multiple strategies, the pushes partially cancel or interfere, landing the embedding somewhere less useful than any single transformation.
Prompt complexity: A multi-step prompt ("first do X, then do Y, then do Z") produces less consistent outputs than a focused single-strategy prompt. The LLM has more room to go off the rails.
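The interference point is easy to show with a toy 2D example (made-up numbers, purely illustrative): two pushes that each move the query closer to the relevant document can, when stacked, overshoot and land somewhere worse than either single transformation.

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

d_pos = np.array([1.0, 0.0])               # relevant document embedding
q = np.array([1.0, 1.0]) / np.sqrt(2)      # original query embedding

push_a = np.array([0.3, -0.5])             # shift from transformation strategy A
push_b = np.array([0.2, -0.6])             # shift from transformation strategy B

print(cos(q, d_pos))                       # baseline: ~0.71
print(cos(q + push_a, d_pos))              # A alone: ~0.98
print(cos(q + push_b, d_pos))              # B alone: ~0.99
print(cos(q + push_a + push_b, d_pos))     # both stacked: ~0.95, below either single strategy
```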
The irony is that the failure analysis itself produces genuinely useful insights. Here's a sample from NFCorpus:
Pattern: "Query Title Metaphor/Catchphrase vs. Nutritional Content"
Frequency: 6 failures
Description: Queries use evocative titles like 'Blocking the First Step
of Heart Disease' or 'Priming the Proton Pump' that don't contain the
actual nutritional/medical topics covered in target documents.
This is correct! The NFCorpus queries are often weird marketing-style titles that don't match the scientific language of the corpus. The LLM correctly identifies this.
The problem is that "fixing" these queries by expanding them doesn't actually help retrieval. The embedding model was trained on natural queries and already does a reasonable job. Our expansions add noise without adding signal that helps ranking.
The BEIR queries aren't "broken." They're the actual queries people used with those corpora, and they're probably already close to the kind of natural queries the embedding model was trained on. We're essentially trying to "fix" queries that were already designed/selected to work with their documents. Of course that doesn't improve retrieval; the baseline is already doing what it can with the query as given.
The approach doesn't work as implemented, but the failure modes are instructive:
1. Embedding separation, the proxy metric, isn't predictive of retrieval performance; it can improve while R@10 gets worse.
2. Combining individually successful hypothesis prompts tends to do worse than the best single hypothesis.
3. The LLM's failure analysis is often accurate, but "fixing" queries that already suit the embedding model adds noise rather than signal.
Sometimes the best outcome of a research project is a clear understanding of why something doesn't work. Or at least, I'll tell myself that this time!