RAG Deep Dive: Embeddings, Vector Search, and ChromaDB

This is Part 11 of the AI Agents series. Part 10 introduced RAG as a concept alongside ReAct. This post goes deep on how RAG actually works — the architecture, the math behind vector search, and a working implementation.

1. The problem RAG solves

Suppose you’re building an AI assistant for a company called Nerchuko, founded in 2025. A user asks: “How many holidays does Nerchuko offer?”

GPT-4 or any LLM trained before 2025 has never seen this company. It will either hallucinate a number or say it doesn’t know. Neither is acceptable in a real product.

The fix isn’t to retrain the model. It’s to fetch the relevant answer from your own documents at query time and inject it into the prompt as context. The model doesn’t need to have memorized the answer — it just needs to be handed the right page.

That’s RAG: Retrieve, Augment, Generate.

2. The RAG architecture

Your documents (PDFs, text files, handbooks)
        │
        ▼
    [Chunking]
    Split into smaller pieces
        │
        ▼
    [Embedding]
    Convert each chunk to a vector
        │
        ▼
    [Vector Database]
    Store and index all embeddings
        
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ at query time ━━━

User question
        │
        ▼
    [Embed the question]
    Same embedding model as above
        │
        ▼
    [Similarity Search]
    Find top-K most relevant chunks
        │
        ▼
    [Augment]
    Combine question + retrieved chunks into prompt
        │
        ▼
    [LLM]
    Generate a grounded, accurate answer

Three phases: indexing (done offline, and rerun whenever the documents change), retrieval (at query time), and generation (the LLM call).

3. Chunking: why document size matters

Before you can embed a document, you split it into chunks. Chunking matters more than most people expect.

If you embed an entire document as one vector, you get a high-level representation of the whole thing. Fine detail is averaged out. When a user asks a specific question, the similarity score against that blob will be imprecise — you’ll either retrieve the whole document (wasteful) or miss the relevant section entirely.

Smaller, focused chunks produce sharper embeddings. A chunk that covers one topic maps cleanly to questions about that topic.

Common chunking strategies:

| Strategy | When to use |
| --- | --- |
| Fixed size (e.g. 512 tokens) | General-purpose; simple to implement |
| Paragraph / section boundaries | When document structure is consistent |
| Page-by-page | PDFs, textbooks, reports |
| Semantic chunking | Best accuracy; group sentences by meaning |

For a first implementation, paragraph or fixed-size chunking works well. Semantic chunking improves retrieval quality but adds complexity.
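As a rough illustration of paragraph chunking, the sketch below splits text on blank lines. The function name `split_into_paragraphs` is made up for this example, and it assumes paragraphs are separated by at least one blank line, which may not hold for text extracted from PDFs.

```python
import re


def split_into_paragraphs(text: str) -> list[str]:
    """Split text on blank lines and drop empty pieces."""
    pieces = re.split(r"\n\s*\n", text)
    return [piece.strip() for piece in pieces if piece.strip()]


# Sample text reusing policy sentences from this post.
handbook = (
    "Nerchuko employees are entitled to 15 paid holidays per year.\n"
    "\n"
    "Remote work is permitted on all other days subject to manager approval.\n"
    "\n\n"
    "Salary increments are effective from the 1st of January each year."
)

chunks = split_into_paragraphs(handbook)
for chunk in chunks:
    print(chunk)
```

Each returned string can then be embedded as one chunk.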

Overlap: most implementations add a small overlap between consecutive chunks (e.g. 50 tokens) so that sentences near chunk boundaries don’t lose context.
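Fixed-size chunking with overlap can be sketched as a sliding window. This example treats whitespace-separated words as tokens, which is only a stand-in for a real tokenizer. The helper name `chunk_with_overlap` and the tiny sizes are illustrative assumptions.

```python
def chunk_with_overlap(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Slide a window of chunk_size tokens, stepping by chunk_size - overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Words stand in for tokens here; a real pipeline would use the model's tokenizer.
words = "a b c d e f g h i j".split()
chunks = chunk_with_overlap(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

With `chunk_size=4` and `overlap=1`, the last word of each chunk is repeated as the first word of the next, so a sentence that straddles a boundary appears in both.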

4. Embeddings: turning text into vectors

An embedding is a numerical representation of text: a list of floating point numbers (a vector) whose values together encode meaning. Text with similar meaning produces vectors that point in similar directions in high-dimensional space.

For example:

“The company offers 15 paid holidays” and “How many holidays does Nerchuko give employees?” are semantically similar — their vectors will be close.
“Quarterly revenue report” is unrelated — its vector will be far away.

This is what makes retrieval work. You’re not doing keyword search. You’re doing semantic search — finding chunks that mean something similar to the question, even if they use different words.

Embedding models:

| Model | Type | Notes |
| --- | --- | --- |
| `text-embedding-3-small` | OpenAI (paid) | Fast, cheap, good quality |
| `text-embedding-3-large` | OpenAI (paid) | Highest OpenAI quality |
| `text-embedding-ada-002` | OpenAI (paid) | Older, still widely used |
| `all-MiniLM-L6-v2` | Open-source | Small, fast, good for local use |
| `nomic-embed-text` | Open-source | High quality, free |

Critical rule: use the same embedding model to embed documents and to embed queries. If you index documents with text-embedding-3-small and query with all-MiniLM-L6-v2, the vectors live in different spaces and similarity scores are meaningless.
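One way to enforce this rule is to record which model built the index and check it at query time. The sketch below is a hypothetical guard, not part of any library: `IndexInfo` and `check_query_model` are names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexInfo:
    """Hypothetical record saved next to the index when it is built."""
    embedding_model: str
    dimensions: int


def check_query_model(index: IndexInfo, query_model: str, query_vector: list[float]) -> None:
    """Raise ValueError if the query was embedded with a different model."""
    if query_model != index.embedding_model:
        raise ValueError(
            f"index built with {index.embedding_model!r}, query embedded with {query_model!r}"
        )
    if len(query_vector) != index.dimensions:
        raise ValueError(
            f"expected {index.dimensions} dimensions, got {len(query_vector)}"
        )


# all-MiniLM-L6-v2 produces 384-dimensional vectors.
index_info = IndexInfo(embedding_model="all-MiniLM-L6-v2", dimensions=384)
check_query_model(index_info, "all-MiniLM-L6-v2", [0.0] * 384)  # passes silently
```

A dimension check alone is not enough, because two different models can share the same output size, which is why the model name is compared as well.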

Note on model versions: The embedding models listed above are current as of May 2026. The state of the art moves fast. Check the MTEB Leaderboard for current top-performing embedding models ranked by retrieval, clustering, and semantic similarity benchmarks.

5. Similarity search: the math

Once everything is embedded, retrieval is a search problem: find the vectors in the database closest to the query vector.

Cosine Similarity

Measures the angle between two vectors, ignoring their magnitude.

$$\text{cosine\_similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

Score of 1.0: vectors point in the same direction → highly similar content
Score of 0.0: vectors are perpendicular → unrelated
Score of -1.0: vectors point in opposite directions → maximally dissimilar (text embeddings rarely score this low in practice)

Range: -1 to 1. Higher is more similar. This is the most common metric for text embeddings.
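The formula translates directly into a few lines of standard-library Python. This is a minimal sketch for small vectors; production code would typically use a numerical library instead.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # close to 1.0
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 1.0])    # 0.0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # close to -1.0
print(same_direction, perpendicular, opposite)
```

Note that `[1.0, 2.0]` and `[2.0, 4.0]` differ in length but score as identical in direction, which is the sense in which the metric ignores magnitude.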

Euclidean Distance

Measures the straight-line distance between two points in vector space.

$$d(A, B) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}$$

Distance of 0: identical vectors
Larger distance → less similar

Range: 0 to ∞. Lower is more similar. Less common for text, but used in some vector databases.
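The same formula as a small standard-library function, offered as a sketch for illustration:

```python
import math


def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Return the straight-line distance between points a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


distance = euclidean_distance([0.0, 0.0], [3.0, 4.0])
print(distance)  # 5.0 (the 3-4-5 right triangle)
```

Unlike cosine similarity, this metric is sensitive to magnitude: scaling one vector changes the distance even though its direction is unchanged.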

In practice, you pick a metric when setting up your vector database and stay consistent. Cosine similarity is the default for most text RAG systems.
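The two metrics are more closely related than they first appear. Assuming both vectors are scaled to unit length (an assumption here, since not every embedding model normalizes its output), expanding the squared Euclidean distance gives:

$$\|A - B\|^2 = \|A\|^2 + \|B\|^2 - 2\,A \cdot B = 2 - 2\,\text{cosine\_similarity}(A, B)$$

So for unit vectors, sorting by ascending Euclidean distance produces the same ranking as sorting by descending cosine similarity.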

After computing similarity scores, you retrieve the top-K chunks — typically K=3 to K=10 depending on how much context the LLM can handle and how focused you need the answer to be.
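Top-K retrieval can be sketched as a brute-force scan that scores every stored vector and keeps the best K. This is for illustration only: vector databases generally use approximate nearest-neighbor indexes rather than scanning everything. The function name `top_k` and the toy 2-dimensional vectors are assumptions for this example.

```python
import heapq
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: list[list[float]], k: int) -> list[tuple[int, float]]:
    """Return (index, score) pairs for the k vectors most similar to query."""
    scored = ((i, cosine_similarity(query, v)) for i, v in enumerate(vectors))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy vectors; real embeddings have hundreds of dimensions.
stored = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
hits = top_k([1.0, 0.0], stored, k=2)
for index, score in hits:
    print(index, round(score, 3))
```

The returned indices map back to the stored chunks, which are then passed to the LLM as context.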

6. Vector databases

A vector database is a database optimized for storing embeddings and running fast similarity searches across millions of vectors.

Options:

| Database | Type | Notes |
| --- | --- | --- |
| ChromaDB | Open-source | Simple API, good for prototyping |
| FAISS | Open-source (Meta) | Very fast, in-memory, no persistence by default |
| Pinecone | Managed cloud | Production-scale, paid |
| Weaviate | Open-source / cloud | Feature-rich, supports hybrid search |
| Qdrant | Open-source / cloud | Fast; written in Rust |

For learning and local development, ChromaDB is the easiest starting point — minimal setup, Python-native API.

7. Full implementation with ChromaDB

This implementation indexes a small set of company policy documents and answers questions from them.

pip install chromadb sentence-transformers groq

Step 1: Index your documents

import chromadb
from sentence_transformers import SentenceTransformer

# Use a free, local embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Your documents (in a real app, load from PDFs or a database)
documents = [
    "Nerchuko employees are entitled to 15 paid holidays per year.",
    "Holidays include national public holidays and company-specific days announced in January.",
    "Nerchuko employees must work from the office on Tuesdays and Wednesdays.",
    "Remote work is permitted on all other days subject to manager approval.",
    "The annual performance review cycle runs from October to December.",
    "Salary increments are effective from the 1st of January each year."
]

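# Create an in-memory ChromaDB collection (contents are lost when the process exits)
client = chromadb.Client()
# get_or_create_collection avoids an "already exists" error if this code is rerun
collection = client.get_or_create_collection("company_policies")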

# Embed and index all documents
embeddings = embedding_model.encode(documents).tolist()

collection.add(
    documents=documents,
    embeddings=embeddings,
    ids=[f"doc_{i}" for i in range(len(documents))]
)

print(f"Indexed {len(documents)} document chunks.")

Step 2: Retrieve and answer a question

import os
from groq import Groq

groq_client = Groq(api_key=os.environ["GROQ_API_KEY"])

def answer_question(question: str, top_k: int = 3) -> str:
    # Embed the question with the SAME model used for indexing
    query_embedding = embedding_model.encode([question]).tolist()

    # Retrieve top-K most similar chunks
    results = collection.query(
        query_embeddings=query_embedding,
        n_results=top_k
    )

    retrieved_chunks = results["documents"][0]
    context = "\n".join(f"- {chunk}" for chunk in retrieved_chunks)

    # Build the RAG prompt
    prompt = f"""Answer the question using only the information provided below.
If the answer is not in the context, say "I don't have that information in the provided documents."
Do not use any outside knowledge.

Context:
{context}

Question: {question}"""

    response = groq_client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}]
    )

    return response.choices[0].message.content


# Test it
questions = [
    "How many holidays do Nerchuko employees get?",
    "When do salary increments take effect?",
    "What is the remote work policy?",
]

for q in questions:
    print(f"Q: {q}")
    print(f"A: {answer_question(q)}")
    print()

What’s happening:

The question is embedded using the same model that indexed the documents
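ChromaDB ranks the stored document vectors by distance to the query vector, using the metric configured on the collection (squared L2 unless you choose another, such as cosine, when creating it; for unit-length embeddings both give the same ranking)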
The top 3 matching chunks are retrieved
Those chunks are injected into the prompt as context
The LLM extracts the relevant answer from the context — it doesn’t guess
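When debugging retrieval, it helps to look at the distances alongside the retrieved text. The sketch below formats a result dictionary shaped like the one `collection.query` returns (one inner list per query). The dictionary here is hand-built and its distance values are made up for illustration, so treat the numbers as placeholders.

```python
def format_hits(results: dict) -> list[str]:
    """Pair each retrieved chunk with its distance for the first query."""
    documents = results["documents"][0]
    distances = results["distances"][0]
    return [f"{dist:.3f}  {doc}" for doc, dist in zip(documents, distances)]


# Hand-built stand-in for a query result; the distances are NOT real output.
example_results = {
    "documents": [[
        "Nerchuko employees are entitled to 15 paid holidays per year.",
        "Holidays include national public holidays and company-specific days announced in January.",
    ]],
    "distances": [[0.25, 0.5]],
}

lines = format_hits(example_results)
for line in lines:
    print(line)
```

Against a live collection, passing `include=["documents", "distances"]` to `collection.query` should make both keys available. Lower distance means a closer match.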

8. What goes wrong and how to fix it

Wrong chunks are retrieved:

Chunks are too large — try smaller chunks or add overlap
The embedding model is too weak for your domain — try a stronger model
K is too small — increase top-K so more candidate chunks reach the LLM

LLM ignores the context and hallucinates anyway:

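Strengthen the instructions, ideally by moving them into a system message (the example above sends everything as a single user message): "You must only use the context provided. If the answer is not there, say so."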
Try a more instruction-following model
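A minimal sketch of separating the rules from the content: the instructions go in a system message and the retrieved context plus question go in the user message. The helper name `build_messages` is an assumption for this example. The returned list uses the standard chat `role`/`content` shape.

```python
def build_messages(context: str, question: str) -> list[dict[str, str]]:
    """Build a chat message list with the grounding rules in the system role."""
    system_prompt = (
        "You must only use the context provided. "
        "If the answer is not there, say so."
    )
    user_prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


messages = build_messages(
    context="- Nerchuko employees are entitled to 15 paid holidays per year.",
    question="How many holidays do Nerchuko employees get?",
)
for message in messages:
    print(message["role"], "->", message["content"])
```

The resulting list can be passed as the `messages` argument of the chat completion call in place of the single user message.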

Same question gets different chunks on different runs:

Embeddings are deterministic for a given model and input, so if you get different chunks, the likely cause is inconsistent preprocessing (whitespace, encoding). Normalize your text before chunking, and apply the same normalization to queries.
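A small normalization pass might look like the sketch below. The specific rules (NFKC Unicode normalization, unified line endings, collapsed spaces) are one reasonable choice rather than a standard, and `normalize_text` is a name invented for this example.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Make visually identical text byte-identical before chunking or embedding."""
    # NFKC folds compatibility characters, e.g. non-breaking spaces and ligatures.
    text = unicodedata.normalize("NFKC", text)
    # Unify Windows and old Mac line endings.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


raw_a = "Paid\u00a0holidays:  15\r\n"
raw_b = "Paid holidays: 15"
print(normalize_text(raw_a) == normalize_text(raw_b))
```

Both inputs normalize to the same string, so they would produce the same embedding.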

What’s next

Part 12 is a hands-on ChromaDB implementation: persistent clients, collections, metadata filtering, custom embedding models, upsert vs update, and a complete end-to-end RAG pipeline that answers questions from a company handbook using Groq.

Full video walkthrough is embedded above.