Advanced RAG: Query Expansion, Hybrid Search, Re-Ranking, and More
Basic RAG works for demos. Production RAG needs more — query expansion to handle ambiguous inputs, hybrid search for keyword precision, re-ranking to filter noise, and feedback loops to improve over time.
This is Part 18 of the AI Agents series. Parts 11–17 built a working RAG pipeline from scratch: embeddings, vector search, ChromaDB, and five chunking strategies. This post covers what separates a prototype from a production system — six techniques that make RAG reliable at scale.
Simple RAG vs Advanced RAG
The basic pipeline covered so far:
Documents → Chunk → Embed → Vector DB
                              │
User query → Embed → Similarity search → Top-K chunks → LLM → Answer
This works. For a demo, for a small document set, for low-stakes queries — it’s fine.
At production scale, this pipeline has predictable failure modes:
- Vague queries retrieve irrelevant chunks
- Semantic search misses exact keyword matches
- Retrieved chunks are noisy — many are technically “similar” but not actually relevant
- No way to know when the system is performing poorly
The six techniques below each address a specific failure mode.
1. Query expansion
Problem: Users write vague or incomplete queries. “Tesla CEO speech 2020” doesn’t specify what aspect of the speech matters. “Machine learning accuracy” could mean a dozen different things. If the query doesn’t match the phrasing in the documents, semantic search misses.
Solution: Before searching, rewrite the query into multiple better-phrased variations. If one phrasing fails to retrieve relevant chunks, another might succeed.
Three approaches, with LLM-based expansion as the default choice and synonym expansion as a last resort:
LLM-based expansion (best): Use an LLM to generate 3–5 semantically similar questions from the original query. The LLM understands intent and produces variations that match how information is actually phrased in documents.
import json
import os

from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

def expand_query(query: str, n: int = 4) -> list[str]:
    """Return the original query plus up to n LLM-generated rephrasings."""
    prompt = f"""Generate {n} different ways to ask the following question.
    Each variation should preserve the original intent but use different phrasing.
    Return a JSON array of strings only, no explanation.
    Question: {query}"""
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
    )
    try:
        variations = json.loads(response.choices[0].message.content)
    except json.JSONDecodeError:
        # The model may wrap the array in prose or a code fence;
        # fall back to searching with the original query only.
        variations = []
    if not isinstance(variations, list):
        variations = []
    return [query] + [v for v in variations if isinstance(v, str)]  # include original

queries = expand_query("Machine learning accuracy")
# Example output (actual variations differ between runs):
# ["Machine learning accuracy",
#  "How do I measure ML model performance?",
#  "What metrics evaluate classification models?",
#  "Confusion matrix precision recall explained",
#  "How accurate is my machine learning model?"]
Synonym expansion (lowest priority): Replace abbreviations and terms with synonyms (CEO → Chief Executive Officer). Simple but risky — incorrect synonyms change meaning. Only use for controlled vocabularies where synonym lists are curated.
Historical queries: If you log user queries, map new queries to historically successful ones. Works well for high-traffic systems where the same questions repeat.
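As a rough illustration of the historical-query idea, the sketch below maps a new query to the closest logged query using plain string similarity from the standard library. The function name `closest_historical_query`, the `min_ratio` cutoff, and the use of `difflib` are assumptions made for this example; a real system would more likely compare embeddings and only keep queries that received positive feedback.

```python
from difflib import SequenceMatcher
from typing import Optional

def closest_historical_query(
    query: str, history: list[str], min_ratio: float = 0.6
) -> Optional[str]:
    """Return the most similar logged query, or None if nothing is close enough.

    `history` is assumed to hold queries that previously led to good answers.
    """
    normalized = query.strip().lower()
    best_match, best_ratio = None, 0.0
    for past in history:
        # ratio() is 1.0 for identical strings and 0.0 for nothing in common
        ratio = SequenceMatcher(None, normalized, past.strip().lower()).ratio()
        if ratio > best_ratio:
            best_match, best_ratio = past, ratio
    return best_match if best_ratio >= min_ratio else None
```

If a match is found, it can be searched alongside the user's original wording rather than replacing it, so a bad match cannot hide the real question.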
After expansion, retrieve against all query variations and merge the results (deduplicating by document ID) before passing to the LLM.
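The merge step described above might look like the following sketch. The `retrieve` callable is a hypothetical stand-in for whatever search function the pipeline uses; it is assumed to return `(doc_id, text)` pairs ordered best first.

```python
from typing import Callable

# Hypothetical retriever signature: (query, top_k) -> [(doc_id, text), ...]
Retriever = Callable[[str, int], list[tuple[str, str]]]

def retrieve_for_variations(
    queries: list[str], retrieve: Retriever, top_k: int = 5
) -> list[tuple[str, str]]:
    """Search with every query variation and merge results, deduplicating by ID."""
    merged: dict[str, str] = {}
    for query in queries:
        for doc_id, text in retrieve(query, top_k):
            # setdefault keeps the first occurrence, so earlier variations
            # (including the original query) decide the ordering
            merged.setdefault(doc_id, text)
    return list(merged.items())
```

Because dictionaries preserve insertion order, chunks found by the original query stay at the front of the merged list.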
2. Hybrid search
Problem: Semantic search finds conceptually similar content but can miss exact keyword matches. A query for “car repair near me” returns documents about “automobile maintenance” and “vehicle servicing” — mathematically similar embeddings — but misses the document that literally says “car repair.”
Solution: Run both semantic search (vector similarity) and keyword search (BM25 sparse retrieval) in parallel, then merge the results.
from rank_bm25 import BM25Okapi

def hybrid_search(query: str, documents: list[str], collection, top_k: int = 5) -> list[str]:
    # Semantic search via ChromaDB
    semantic_results = collection.query(
        query_texts=[query],
        n_results=top_k,
    )
    semantic_docs = semantic_results["documents"][0]

    # Keyword search via BM25 (lowercased so "Car" and "car" match).
    # For large corpora, build the BM25 index once instead of on every call.
    tokenized_docs = [doc.lower().split() for doc in documents]
    bm25 = BM25Okapi(tokenized_docs)
    scores = bm25.get_scores(query.lower().split())
    top_bm25_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    keyword_docs = [documents[i] for i in top_bm25_indices]

    # Merge: union of both result sets; dict.fromkeys drops duplicates
    # while keeping a stable order (a plain set would shuffle results)
    combined = list(dict.fromkeys(semantic_docs + keyword_docs))
    return combined
The merged result set has recall at least as high as either approach alone. Keyword search catches exact matches that semantic search drifts away from. Semantic search catches conceptual matches that keyword search misses entirely.
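A plain union discards the ranking information from both searches. One common alternative, shown here as a sketch and not as the method used in this series, is reciprocal rank fusion (RRF): each document earns `1 / (k + rank)` from every list it appears in, and the totals decide the final order. The function name and the constant `k = 60` are conventional choices assumed for this example.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranked list."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents near the top of any list get a larger contribution
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

semantic_ranking = ["a", "b", "c"]
keyword_ranking = ["c", "a", "d"]
fused = reciprocal_rank_fusion([semantic_ranking, keyword_ranking])
# fused == ["a", "c", "b", "d"]: documents found by both searches rise to the top
```

RRF only needs ranks, not raw scores, which sidesteps the fact that BM25 scores and vector distances are on different scales.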
3. Re-ranking
Problem: Retrieving top-K chunks by similarity score gives you candidates — not a guarantee of relevance. Similarity in embedding space is approximate. Many retrieved chunks will be tangentially related but not actually useful for answering the specific question.
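Solution: After retrieval, run a re-ranker: a cross-encoder model that scores each (query, document) pair together and produces a relevance score. Depending on the model, that score is either a probability or a raw logit; passing a logit through a sigmoid maps it to the 0 to 1 range so a fixed threshold is meaningful.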
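import numpy as np
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], threshold: float = 0.5) -> list[str]:
    pairs = [(query, doc) for doc in candidates]
    # Assumption: this model returns raw logits rather than probabilities,
    # so squash them into (0, 1) before comparing against the threshold.
    logits = np.asarray(reranker.predict(pairs), dtype=float)
    scores = 1 / (1 + np.exp(-logits))
    # Filter below threshold and sort by score descending
    ranked = sorted(
        [(score, doc) for score, doc in zip(scores, candidates) if score >= threshold],
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [doc for _, doc in ranked]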
Cross-encoders are slower than bi-encoders (which power your vector search) but far more accurate — they see the query and document jointly rather than comparing independent embeddings. Use a small, fast cross-encoder (MiniLM class) so re-ranking doesn’t add noticeable latency.
The pipeline with re-ranking:
Query → Retrieve top-50 candidates → Re-rank → Keep top-5 → Send to LLM
Retrieving more candidates (50) before re-ranking improves recall. Filtering to fewer (5) after re-ranking keeps the context window clean and reduces LLM token cost.
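The retrieve-wide, keep-narrow step can be sketched independently of any particular model. In the example below, `score_pairs` is a hypothetical callable that takes (query, document) pairs and returns one score per pair; a cross-encoder's `predict` method would be passed here in practice. The `overlap_scorer` is a toy stand-in so the example runs without downloading a model.

```python
from typing import Callable, Sequence

Pair = tuple[str, str]

def retrieve_then_rerank(
    query: str,
    candidates: Sequence[str],
    score_pairs: Callable[[list[Pair]], Sequence[float]],
    keep: int = 5,
) -> list[str]:
    """Score every candidate against the query and keep the `keep` best."""
    pairs = [(query, doc) for doc in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(scores, candidates), key=lambda item: item[0], reverse=True)
    return [doc for _, doc in ranked[:keep]]

def overlap_scorer(pairs: list[Pair]) -> list[float]:
    """Toy scorer: number of lowercase words shared by query and document."""
    return [
        float(len(set(q.lower().split()) & set(d.lower().split())))
        for q, d in pairs
    ]

top = retrieve_then_rerank(
    "cats dogs",
    ["cats purr", "dogs bark loudly", "cats and dogs"],
    overlap_scorer,
    keep=2,
)
# top == ["cats and dogs", "cats purr"]
```

Cutting by rank (`keep`) guarantees a bounded context size, whereas cutting by threshold can return anything from zero chunks to all fifty; the two can also be combined.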
4. Metadata filtering
Problem: A large document collection mixes content from different sources, departments, years, or document types. A query about “leave policy 2024” may retrieve finance policy documents from 2023 alongside the HR document you actually want.
Solution: Attach structured metadata to every chunk at index time, then apply filters at query time to pre-scope the search.
# Indexing with metadata (one metadata dict per chunk)
collection.upsert(
    documents=chunks,
    metadatas=[
        {"source": "HR_handbook", "year": 2024, "department": "HR"},
        {"source": "Finance_policy", "year": 2023, "department": "Finance"},
        # ...
    ],
    ids=[f"doc_{i}" for i in range(len(chunks))],
)

# Query with metadata filter
results = collection.query(
    query_texts=["What is the leave policy?"],
    n_results=5,
    where={"department": "HR"},  # only search HR documents
)
Useful metadata fields for most document collections:
| Field | Example values | Use case |
|---|---|---|
| department | HR, Finance, Engineering | Scope queries to the right team |
| year | 2023, 2024 | Filter by recency |
| doc_type | policy, report, FAQ | Match query intent to document type |
| source | handbook, contract, email | Exclude irrelevant source types |
Metadata filtering is cheap — it runs before vector search, reducing the candidate pool immediately. For any system with documents from multiple distinct sources, this should be standard.
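Real queries often need more than one condition, for example "HR documents from 2024 onward". The helper below is a sketch of how such a filter could be assembled for the `where` argument. It assumes ChromaDB's operator syntax (`$and`, `$gte`) and the metadata field names from the table above; the function name `build_where` and its parameters are made up for this example.

```python
from typing import Any, Optional

def build_where(
    department: Optional[str] = None,
    year_from: Optional[int] = None,
    doc_type: Optional[str] = None,
) -> Optional[dict[str, Any]]:
    """Build a metadata filter dict, or None when no filter applies."""
    conditions: list[dict[str, Any]] = []
    if department is not None:
        conditions.append({"department": department})
    if year_from is not None:
        conditions.append({"year": {"$gte": year_from}})  # this year or later
    if doc_type is not None:
        conditions.append({"doc_type": doc_type})

    if not conditions:
        return None  # no filtering: search the whole collection
    if len(conditions) == 1:
        return conditions[0]  # a single condition needs no $and wrapper
    return {"$and": conditions}
```

The result would be passed as `where=build_where(department="HR", year_from=2024)` in the query call.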
5. Multi-stage retrieval
Problem: Complex queries can’t be answered with a single retrieval step. “Write a SQL query to find the top 5 customers by revenue last quarter” requires knowing which table has customer data, which columns hold revenue, and ideally a few examples of similar queries.
Solution: Break the retrieval into sequential stages, each building context for the next.
def text_to_sql_retrieval(user_question: str, table_collection, column_collection, example_collection):
    # Stage 1: find the relevant table
    table_results = table_collection.query(
        query_texts=[user_question],
        n_results=2,
    )
    table_context = table_results["documents"][0]

    # Stage 2: find relevant columns within those tables
    column_query = f"{user_question} table: {' '.join(table_context)}"
    column_results = column_collection.query(
        query_texts=[column_query],
        n_results=5,
    )
    column_context = column_results["documents"][0]

    # Stage 3: retrieve similar historical query examples
    example_results = example_collection.query(
        query_texts=[user_question],
        n_results=3,
    )
    examples = example_results["documents"][0]

    # Combine all context for the LLM
    return {
        "tables": table_context,
        "columns": column_context,
        "examples": examples,
    }
Multi-stage retrieval works for any domain where answers depend on hierarchical knowledge: first find the category, then find the specific document, then find the exact passage.
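Once the stages have run, their output still has to be turned into a prompt. The sketch below shows one way to do that for the text-to-SQL case. It assumes a dict with `tables`, `columns`, and `examples` keys like the one returned above; the function name `build_sql_prompt` and the section titles are illustrative choices.

```python
def build_sql_prompt(question: str, context: dict[str, list[str]]) -> str:
    """Assemble staged retrieval context into a single LLM prompt."""
    sections = [
        ("Relevant tables", context.get("tables", [])),
        ("Relevant columns", context.get("columns", [])),
        ("Similar past queries", context.get("examples", [])),
    ]
    parts = []
    for title, items in sections:
        if items:  # skip stages that returned nothing
            bullet_list = "\n".join(f"- {item}" for item in items)
            parts.append(f"{title}:\n{bullet_list}")
    parts.append(f"Question: {question}")
    parts.append("Write a single SQL query that answers the question.")
    return "\n\n".join(parts)
```

Keeping each stage in its own labelled section makes it easier to see, when an answer is wrong, which stage supplied the misleading context.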
6. Feedback loops
Problem: Without measurement, you don’t know if your RAG system is improving or degrading over time. Chunking changes, new documents, and model updates all affect retrieval quality — silently.
Solution: Collect feedback signals and use them to identify failure cases.
Explicit feedback — show thumbs up/down after each answer:
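from datetime import datetime, timezone

# In-memory stand-in for a real feedback table or log store
feedback_store: list[dict] = []

def log_feedback(query: str, answer: str, retrieved_chunks: list[str], rating: int) -> None:
    # rating: 1 = helpful, 0 = not helpful
    feedback_store.append({
        "query": query,
        "answer": answer,
        "chunks": retrieved_chunks,
        "rating": rating,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })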
Implicit feedback — detect when users immediately re-ask the same question (signals the first answer was unsatisfactory):
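def detect_repeat_query(user_id: str, query: str, session_log: list[dict]) -> bool:
    # Filter to this user first, then check their last 5 queries
    # (assumes the current query has not been appended to session_log yet)
    user_queries = [entry["query"] for entry in session_log if entry["user_id"] == user_id]
    normalized = query.strip().lower()
    return any(normalized == past.strip().lower() for past in user_queries[-5:])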
Use feedback data to:
- Identify which queries consistently get thumbs-down → improve chunks for that topic
- Track which retrieved chunks led to good vs bad answers → tune retrieval parameters
- Catch regressions when you update the pipeline
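The first item in that list can be sketched as a small aggregation over the logged feedback. The example assumes entries shaped like the ones written by `log_feedback` (a `query` string and a 0/1 `rating`); the function name `worst_queries` and both thresholds are arbitrary choices for illustration.

```python
from collections import defaultdict

def worst_queries(
    feedback: list[dict], min_votes: int = 3, max_helpful_rate: float = 0.5
) -> list[tuple[str, float, int]]:
    """Return (query, helpful_rate, votes) for queries that are mostly rated unhelpful."""
    ratings: dict[str, list[int]] = defaultdict(list)
    for entry in feedback:
        # Normalize so "Leave policy" and "leave policy " count as one query
        ratings[entry["query"].strip().lower()].append(entry["rating"])

    flagged = []
    for query, votes in ratings.items():
        if len(votes) < min_votes:
            continue  # too few votes to draw a conclusion
        helpful_rate = sum(votes) / len(votes)
        if helpful_rate <= max_helpful_rate:
            flagged.append((query, helpful_rate, len(votes)))
    return sorted(flagged, key=lambda item: item[1])  # worst first
```

Running the same report before and after a pipeline change gives a simple regression check: queries that newly appear in the list are the ones the change made worse.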
The advanced RAG pipeline
Putting it all together:
User query
│
▼
[Query Expansion] → 3–5 query variations
│
▼
[Metadata Filtering] → scope to relevant document subset
│
▼
[Hybrid Search] → semantic + keyword, per variation
│
▼
[Multi-stage Retrieval] → if query is complex
│
▼
[Re-ranking] → score each (query, chunk) pair, filter low-relevance
│
▼
[LLM Generation] → answer grounded in filtered context
│
▼
[Feedback Collection] → explicit + implicit signals
Not every query needs every layer. A simple factual query over a well-structured document set may only need metadata filtering and re-ranking. Add layers where your specific failure mode requires them, not preemptively.