Advanced RAG: Query Expansion, Hybrid Search, Re-Ranking, and More

This is Part 18 of the AI Agents series. Parts 11–17 built a working RAG pipeline from scratch: embeddings, vector search, ChromaDB, and five chunking strategies. This post covers what separates a prototype from a production system — six techniques that make RAG reliable at scale.

Simple RAG vs Advanced RAG

The basic pipeline covered so far:

Documents → Chunk → Embed → Vector DB
                                │
User query → Embed → Similarity search → Top-K chunks → LLM → Answer

This works. For a demo, for a small document set, for low-stakes queries — it’s fine.

At production scale, this pipeline has predictable failure modes:

Vague queries retrieve irrelevant chunks
Semantic search misses exact keyword matches
Retrieved chunks are noisy — many are technically “similar” but not actually relevant
No way to know when the system is performing poorly

The six techniques below each address a specific failure mode.

1. Query expansion

Problem: Users write vague or incomplete queries. “Tesla CEO speech 2020” doesn’t specify what aspect of the speech matters. “Machine learning accuracy” could mean a dozen different things. If the query doesn’t match the phrasing in the documents, semantic search misses.

Solution: Before searching, rewrite the query into multiple better-phrased variations. If one phrasing fails to retrieve relevant chunks, another might succeed.

Three approaches, in order of preference:

LLM-based expansion (best): Use an LLM to generate 3–5 semantically similar questions from the original query. The LLM understands intent and produces variations that match how information is actually phrased in documents.

import os
import json
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

def expand_query(query: str, n: int = 4) -> list[str]:
    prompt = f"""Generate {n} different ways to ask the following question.
Each variation should preserve the original intent but use different phrasing.
Return a JSON array of strings only, no explanation.

Question: {query}"""

    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}]
    )

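    raw = response.choices[0].message.content
    try:
        variations = json.loads(raw)
    except json.JSONDecodeError:
        # The model may wrap the array in prose or a code fence; fall back to the original query
        variations = []
    if not isinstance(variations, list):
        variations = []
    return [query] + variations  # include original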


queries = expand_query("Machine learning accuracy")
# ["Machine learning accuracy",
#  "How do I measure ML model performance?",
#  "What metrics evaluate classification models?",
#  "Confusion matrix precision recall explained",
#  "How accurate is my machine learning model?"]

Historical queries: If you log user queries, map new queries to historically successful ones. Works well for high-traffic systems where the same questions repeat.

Synonym expansion (lowest priority): Replace abbreviations and terms with synonyms (CEO → Chief Executive Officer). Simple but risky, because an incorrect synonym changes the meaning. Only use it for controlled vocabularies where synonym lists are curated.

After expansion, retrieve against all query variations and merge the results (deduplicating by document ID) before passing to the LLM.
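
The merge step can be sketched as follows. This is a minimal illustration, assuming a hypothetical `retrieve` callable that returns (doc_id, text) pairs in ranked order for one query; with ChromaDB you would build those pairs from the `ids` and `documents` fields of the query result.

```python
from typing import Callable

# Hypothetical signature: retrieve(query) returns (doc_id, text) pairs, best first.
Retriever = Callable[[str], list[tuple[str, str]]]

def retrieve_for_all(queries: list[str], retrieve: Retriever) -> list[tuple[str, str]]:
    """Run every query variation and merge results, keeping one copy per document ID."""
    seen: set[str] = set()
    merged: list[tuple[str, str]] = []
    for q in queries:
        for doc_id, text in retrieve(q):
            if doc_id not in seen:  # first occurrence wins, so hits from earlier queries come first
                seen.add(doc_id)
                merged.append((doc_id, text))
    return merged
```

Because `expand_query` puts the original query first, its hits stay at the front of the merged list.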

2. Hybrid search

Problem: Semantic search finds conceptually similar content but can miss exact keyword matches. A query for “car repair near me” may rank documents about “automobile maintenance” and “vehicle servicing” (close in embedding space) above the document that literally says “car repair,” pushing it out of the top-K.

Solution: Run both semantic search (vector similarity) and keyword search (BM25 sparse retrieval) in parallel, then merge the results.

from rank_bm25 import BM25Okapi

def hybrid_search(query: str, documents: list[str], collection, top_k: int = 5):
    # Semantic search via ChromaDB
    semantic_results = collection.query(
        query_texts=[query],
        n_results=top_k
    )
    semantic_docs = semantic_results["documents"][0]

    # Keyword search via BM25 (lowercased whitespace tokens).
    # In production, build the BM25 index once rather than on every query.
    tokenized_docs = [doc.lower().split() for doc in documents]
    bm25 = BM25Okapi(tokenized_docs)
    scores = bm25.get_scores(query.lower().split())
    top_bm25_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    keyword_docs = [documents[i] for i in top_bm25_indices]

    # Merge: order-preserving union (semantic hits first, then new keyword hits).
    # The result can contain up to 2 * top_k documents.
    combined = list(dict.fromkeys(semantic_docs + keyword_docs))
    return combined

The merged result set has recall at least as high as either approach alone. Keyword search catches exact matches that semantic search ranks too low. Semantic search catches conceptual matches that keyword search misses entirely.
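
A union of the two result sets does not combine the rankings from each retriever into a single order. One common alternative, not part of the pipeline above, is reciprocal rank fusion (RRF). The sketch below assumes each retriever returns a ranked list of document IDs; the function name is illustrative and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60, top_k: int = 5) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranked list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)[:top_k]


semantic_ids = ["a", "b", "c"]
keyword_ids = ["b", "d", "a"]
fused = reciprocal_rank_fusion([semantic_ids, keyword_ids])
```

Documents that appear in both lists accumulate score from each, so they tend to rise above documents found by only one retriever.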

3. Re-ranking

Problem: Retrieving top-K chunks by similarity score gives you candidates — not a guarantee of relevance. Similarity in embedding space is approximate. Many retrieved chunks will be tangentially related but not actually useful for answering the specific question.

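Solution: After retrieval, run a re-ranker: a cross-encoder model that scores each (query, document) pair together and produces a relevance score. The score range depends on the model. Some checkpoints emit probabilities between 0 and 1, while others (the MS MARCO cross-encoders are commonly reported to behave this way) emit unbounded logits, so check the range before choosing a threshold.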

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], threshold: float = 0.5, top_n: int = 5) -> list[str]:
    pairs = [(query, doc) for doc in candidates]
    scores = reranker.predict(pairs)

    # Drop candidates below the threshold (expressed in the model's own score scale),
    # sort by score descending, and keep the best top_n
    ranked = sorted(
        [(float(score), doc) for score, doc in zip(scores, candidates) if score >= threshold],
        key=lambda pair: pair[0],
        reverse=True
    )
    return [doc for _, doc in ranked[:top_n]]

Cross-encoders are slower than bi-encoders (which power your vector search) but usually more accurate, because they see the query and document jointly rather than comparing independent embeddings. Use a small, fast cross-encoder (MiniLM class) and cap the number of candidates to keep the added latency manageable.

The pipeline with re-ranking:

Query → Retrieve top-50 candidates → Re-rank → Keep top-5 → Send to LLM

Retrieving more candidates (50) before re-ranking improves recall. Filtering to fewer (5) after re-ranking keeps the context window clean and reduces LLM token cost.
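
Whether the 0.5 threshold in `rerank` is meaningful depends on the score scale. If your re-ranker returns unbounded logits, one option is to squash them with a sigmoid before thresholding. The sketch below uses only the standard library; the helper names and the example scores are made up for illustration.

```python
import math

def sigmoid(x: float) -> float:
    """Map an unbounded logit to the (0, 1) range without overflowing."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def filter_by_probability(scored: list[tuple[float, str]], threshold: float = 0.5) -> list[str]:
    """Keep documents whose sigmoid(logit) is at least `threshold`, best first."""
    kept = [(sigmoid(logit), doc) for logit, doc in scored if sigmoid(logit) >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in kept]


# Illustrative logits, not real model output
example = [(8.6, "relevant"), (-4.3, "off-topic"), (0.2, "borderline")]
kept_docs = filter_by_probability(example)
```

A sigmoid threshold of 0.5 corresponds to a raw logit of 0, so the threshold still needs checking against labeled examples from your own data.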

4. Metadata filtering

Problem: A large document collection mixes content from different sources, departments, years, or document types. A query about “leave policy 2024” may retrieve finance policy documents from 2023 alongside the HR document you actually want.

Solution: Attach structured metadata to every chunk at index time, then apply filters at query time to pre-scope the search.

# Indexing with metadata
collection.upsert(
    documents=chunks,
    metadatas=[
        {"source": "HR_handbook", "year": 2024, "department": "HR"},
        {"source": "Finance_policy", "year": 2023, "department": "Finance"},
        # ...
    ],
    ids=[f"doc_{i}" for i in range(len(chunks))]
)

# Query with metadata filter
results = collection.query(
    query_texts=["What is the leave policy?"],
    n_results=5,
    where={"department": "HR"}  # only search HR documents
)

Useful metadata fields for most document collections:

Field	Example values	Use case
`department`	HR, Finance, Engineering	Scope queries to the right team
`year`	2023, 2024	Filter by recency
`doc_type`	policy, report, FAQ	Match query intent to document type
`source`	handbook, contract, email	Exclude irrelevant source types

Metadata filtering is cheap: it narrows the candidate pool so that similarity search only considers chunks that match the filter. For any system with documents from multiple distinct sources, this should be standard.
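
The “leave policy 2024” example needs two conditions, department and year. In ChromaDB, multiple conditions are combined with the `$and` operator. The helper below is a hypothetical convenience function (not part of ChromaDB) that builds such a filter from keyword arguments.

```python
from typing import Any, Optional

def build_where(**conditions: Any) -> Optional[dict]:
    """Build a Chroma-style `where` filter from keyword conditions.

    Conditions set to None are skipped. Returns None when nothing is set,
    a single-field dict for one condition, and an `$and` clause for several.
    """
    clauses = [{field: value} for field, value in conditions.items() if value is not None]
    if not clauses:
        return None
    if len(clauses) == 1:
        return clauses[0]
    return {"$and": clauses}


where = build_where(department="HR", year=2024)
```

The result can be passed as the `where` argument of `collection.query`; when the helper returns None, leave the argument out.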

5. Multi-stage retrieval

Problem: Complex queries can’t be answered with a single retrieval step. “Write a SQL query to find the top 5 customers by revenue last quarter” requires knowing which table has customer data, which columns hold revenue, and ideally a few examples of similar queries.

Solution: Break the retrieval into sequential stages, each building context for the next.

def text_to_sql_retrieval(user_question: str, table_collection, column_collection, example_collection):
    # Stage 1: find the relevant table
    table_results = table_collection.query(
        query_texts=[user_question],
        n_results=2
    )
    table_context = table_results["documents"][0]

    # Stage 2: find relevant columns within those tables
    column_query = f"{user_question} table: {' '.join(table_context)}"
    column_results = column_collection.query(
        query_texts=[column_query],
        n_results=5
    )
    column_context = column_results["documents"][0]

    # Stage 3: retrieve similar historical query examples
    example_results = example_collection.query(
        query_texts=[user_question],
        n_results=3
    )
    examples = example_results["documents"][0]

    # Combine all context for the LLM
    return {
        "tables": table_context,
        "columns": column_context,
        "examples": examples
    }

Multi-stage retrieval works for any domain where answers depend on hierarchical knowledge: first find the category, then find the specific document, then find the exact passage.
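
The dictionary returned by `text_to_sql_retrieval` still has to be turned into a prompt. One possible way to do that is sketched below; the function name, section titles, and prompt wording are assumptions for illustration.

```python
def build_sql_prompt(user_question: str, context: dict[str, list[str]]) -> str:
    """Format staged retrieval results into a single prompt string."""
    sections = [
        ("Relevant tables", context.get("tables", [])),
        ("Relevant columns", context.get("columns", [])),
        ("Similar past queries", context.get("examples", [])),
    ]
    parts = []
    for title, items in sections:
        if items:  # skip stages that returned nothing
            bullet_list = "\n".join(f"- {item}" for item in items)
            parts.append(f"{title}:\n{bullet_list}")
    parts.append(f"Question: {user_question}\nWrite one SQL query that answers the question.")
    return "\n\n".join(parts)
```

Keeping each stage in its own labeled section makes it easier to see which stage supplied a misleading piece of context when a generated query is wrong.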

6. Feedback loops

Problem: Without measurement, you don’t know if your RAG system is improving or degrading over time. Chunking changes, new documents, and model updates all affect retrieval quality — silently.

Solution: Collect feedback signals and use them to identify failure cases.

Explicit feedback — show thumbs up/down after each answer:

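from datetime import datetime, timezone

feedback_store: list[dict] = []  # in-memory for illustration; use a database in production

def log_feedback(query: str, answer: str, retrieved_chunks: list[str], rating: int):
    # rating: 1 = helpful, 0 = not helpful
    feedback_store.append({
        "query": query,
        "answer": answer,
        "chunks": retrieved_chunks,
        "rating": rating,
        "timestamp": datetime.now(timezone.utc).isoformat()
    })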

Implicit feedback — detect when users immediately re-ask the same question (signals the first answer was unsatisfactory):

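def detect_repeat_query(user_id: str, query: str, session_log: list[dict]) -> bool:
    # Compare against this user's last 5 queries, ignoring case and surrounding whitespace
    user_queries = [entry["query"] for entry in session_log if entry["user_id"] == user_id]
    recent = {q.strip().lower() for q in user_queries[-5:]}
    return query.strip().lower() in recent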

Use feedback data to:

Identify which queries consistently get thumbs-down → improve chunks for that topic
Track which retrieved chunks led to good vs bad answers → tune retrieval parameters
Catch regressions when you update the pipeline
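
The first item in the list can be sketched as a small aggregation over logged feedback. This assumes entries shaped like the ones `log_feedback` writes (a `query` string and a 0/1 `rating`); the function name and the `min_votes` cutoff are illustrative choices.

```python
from collections import defaultdict

def worst_queries(feedback: list[dict], min_votes: int = 3, top_n: int = 10) -> list[tuple[str, float]]:
    """Return (query, helpful_rate) pairs, lowest helpful rate first."""
    ratings: dict[str, list[int]] = defaultdict(list)
    for entry in feedback:
        # Normalize so trivially different spellings count as the same query
        ratings[entry["query"].strip().lower()].append(entry["rating"])
    rates = [
        (query, sum(votes) / len(votes))
        for query, votes in ratings.items()
        if len(votes) >= min_votes  # ignore queries with too few ratings
    ]
    return sorted(rates, key=lambda pair: pair[1])[:top_n]
```

Running the same report before and after a pipeline change gives a rough regression check, provided the query mix is comparable between the two periods.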

The advanced RAG pipeline

Putting it all together:

User query
    │
    ▼
[Query Expansion] → 3–5 query variations
    │
    ▼
[Hybrid Search] → semantic + keyword, per variation
    │
    ▼
[Metadata Filtering] → scope to relevant document subset
    │
    ▼
[Multi-stage Retrieval] → if query is complex
    │
    ▼
[Re-ranking] → score each (query, chunk) pair, filter low-relevance
    │
    ▼
[LLM Generation] → answer grounded in filtered context
    │
    ▼
[Feedback Collection] → explicit + implicit signals

Not every query needs every layer. A simple factual query over a well-structured document set may only need metadata filtering and re-ranking. Add layers where your specific failure mode requires them, not preemptively.
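
One way to keep layers optional is to pass each stage in as a callable and skip the ones that are not supplied. This is a sketch only; `run_pipeline` and its parameters are hypothetical names, and the stage callables stand in for functions like `expand_query` and `rerank` from earlier sections.

```python
from typing import Callable, Optional

def run_pipeline(
    query: str,
    retrieve: Callable[[str], list[str]],
    expand: Optional[Callable[[str], list[str]]] = None,
    rerank: Optional[Callable[[str, list[str]], list[str]]] = None,
) -> list[str]:
    """Retrieve context for a query, applying expansion and re-ranking only if provided."""
    queries = expand(query) if expand else [query]
    candidates: list[str] = []
    for q in queries:
        for chunk in retrieve(q):
            if chunk not in candidates:  # de-duplicate while preserving order
                candidates.append(chunk)
    return rerank(query, candidates) if rerank else candidates
```

Starting with only `retrieve` and adding stages one at a time makes it possible to measure what each layer contributes.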

Full video walkthrough is embedded above.