RAG Deep Dive: Embeddings, Vector Search, and ChromaDB
RAG is how you give LLMs accurate answers from documents they've never seen. This post covers the full architecture: chunking, embeddings, similarity search, vector databases, and a working implementation with ChromaDB.
This is Part 11 of the AI Agents series. Part 10 introduced RAG as a concept alongside ReAct; this post looks at how each stage works under the hood.
1. The problem RAG solves
Suppose you’re building an AI assistant for a company called Nerchuko, founded in 2025. A user asks: “How many holidays does Nerchuko offer?”
GPT-4 or any LLM trained before 2025 has never seen this company. It will either hallucinate a number or say it doesn’t know. Neither is acceptable in a real product.
The fix isn’t to retrain the model. It’s to fetch the relevant answer from your own documents at query time and inject it into the prompt as context. The model doesn’t need to have memorized the answer — it just needs to be handed the right page.
That’s RAG: Retrieve, Augment, Generate.
2. The RAG architecture
Your documents (PDFs, text files, handbooks)
│
▼
[Chunking]
Split into smaller pieces
│
▼
[Embedding]
Convert each chunk to a vector
│
▼
[Vector Database]
Store and index all embeddings
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ at query time ━━━
User question
│
▼
[Embed the question]
Same embedding model as above
│
▼
[Similarity Search]
Find top-K most relevant chunks
│
▼
[Augment]
Combine question + retrieved chunks into prompt
│
▼
[LLM]
Generate a grounded, accurate answer
Three phases: indexing (done once, offline), retrieval (at query time), and generation (the LLM call).
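The three phases can be sketched end to end in a few lines. This is a toy illustration only: `toy_embed`, `VOCAB`, and the sample chunks are hypothetical stand-ins (a real system would call an embedding model and an LLM), and the score is a plain dot product over word counts.

```python
# Toy sketch of indexing, retrieval, and prompt assembly.
# toy_embed is a stand-in: counting words from a tiny hand-picked vocabulary
# is NOT how a real embedding model works.
VOCAB = ["holidays", "salary", "office"]  # hypothetical vocabulary


def toy_embed(text: str) -> list[float]:
    words = [w.strip(".,?!").lower() for w in text.split()]
    return [float(words.count(term)) for term in VOCAB]


def score(a: list[float], b: list[float]) -> float:
    # Dot product: higher means more shared vocabulary terms
    return sum(x * y for x, y in zip(a, b))


# Phase 1: indexing (done once, offline)
chunks = [
    "Employees get 15 paid holidays per year.",
    "Salary increments take effect in January.",
    "Office attendance is required on Tuesdays.",
]
index = [(chunk, toy_embed(chunk)) for chunk in chunks]


# Phase 2: retrieval (at query time)
def retrieve(question: str, top_k: int = 1) -> list[str]:
    q = toy_embed(question)
    ranked = sorted(index, key=lambda entry: score(q, entry[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


# Phase 3: generation input (the LLM call itself is omitted here)
def build_prompt(question: str, retrieved: list[str]) -> str:
    context = "\n".join(f"- {c}" for c in retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}"


question = "How many holidays do employees get?"
prompt = build_prompt(question, retrieve(question))
print(prompt)
```

Each later section replaces one of these stand-ins with the real thing.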
3. Chunking: why document size matters
Before you can embed a document, you split it into chunks. Chunking matters more than most people expect.
If you embed an entire document as one vector, you get a high-level representation of the whole thing. Fine detail is averaged out. When a user asks a specific question, the similarity score against that blob will be imprecise — you’ll either retrieve the whole document (wasteful) or miss the relevant section entirely.
Smaller, focused chunks produce sharper embeddings. A chunk that covers one topic maps cleanly to questions about that topic.
Common chunking strategies:
| Strategy | When to use |
|---|---|
| Fixed size (e.g. 512 tokens) | General-purpose; simple to implement |
| Paragraph / section boundaries | When document structure is consistent |
| Page-by-page | PDFs, textbooks, reports |
| Semantic chunking | Best accuracy; group sentences by meaning |
For a first implementation, paragraph or fixed-size chunking works well. Semantic chunking improves retrieval quality but adds complexity.
Overlap: most implementations add a small overlap between consecutive chunks (e.g. 50 tokens) so that sentences near chunk boundaries don’t lose context.
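Fixed-size chunking with overlap can be sketched as a sliding window. As a simplifying assumption, this version counts whitespace-separated words instead of model tokens; `chunk_words` is a hypothetical helper, and a real pipeline would count tokens with the embedding model's tokenizer.

```python
def chunk_words(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into chunks of up to `chunk_size` words.

    Consecutive chunks share `overlap` words so sentences near a boundary
    appear in both neighbours.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, chunk_size=4, overlap=1)
print(chunks)
```

With ten words, a window of four, and an overlap of one, the last word of each chunk is repeated as the first word of the next.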
4. Embeddings: turning text into vectors
An embedding is a numerical representation of text: a list of floating-point numbers (a vector) whose direction in high-dimensional space encodes meaning. No single position in the list is interpretable on its own; what matters is where the vector points relative to other vectors. Text with similar meaning produces vectors that point in similar directions.
For example:
- “The company offers 15 paid holidays” and “How many holidays does Nerchuko give employees?” are semantically similar — their vectors will be close.
- “Quarterly revenue report” is unrelated — its vector will be far away.
This is what makes retrieval work. You’re not doing keyword search. You’re doing semantic search — finding chunks that mean something similar to the question, even if they use different words.
Embedding models:
| Model | Type | Notes |
|---|---|---|
| text-embedding-3-small | OpenAI (paid) | Fast, cheap, good quality |
| text-embedding-3-large | OpenAI (paid) | Highest OpenAI quality |
| text-embedding-ada-002 | OpenAI (paid) | Older, still widely used |
| all-MiniLM-L6-v2 | Open-source | Small, fast, good for local use |
| nomic-embed-text | Open-source | High quality, free |
Critical rule: use the same embedding model to embed documents and to embed queries. If you index documents with text-embedding-3-small and query with all-MiniLM-L6-v2, the vectors live in different spaces and similarity scores are meaningless.
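One way to enforce this rule is to record the model name next to the vectors and refuse mismatched queries. The `EmbeddingIndex` class below is a hypothetical sketch, not part of any library. The dimensions used in the example (384 for all-MiniLM-L6-v2, 1536 for text-embedding-3-small) are the models' default output sizes.

```python
from dataclasses import dataclass, field


@dataclass
class EmbeddingIndex:
    """Hypothetical in-memory index that remembers which model built it."""
    model_name: str
    dimension: int
    vectors: list[list[float]] = field(default_factory=list)

    def add(self, vector: list[float]) -> None:
        if len(vector) != self.dimension:
            raise ValueError(f"expected {self.dimension} dimensions, got {len(vector)}")
        self.vectors.append(vector)

    def check_query(self, query_model: str, query_vector: list[float]) -> None:
        """Raise if the query was embedded with a different model than the index."""
        if query_model != self.model_name:
            raise ValueError(
                f"index built with {self.model_name!r}, query uses {query_model!r}"
            )
        if len(query_vector) != self.dimension:
            raise ValueError("query vector has the wrong dimension")


index = EmbeddingIndex(model_name="all-MiniLM-L6-v2", dimension=384)
index.add([0.0] * 384)

try:
    index.check_query("text-embedding-3-small", [0.0] * 1536)
except ValueError as err:
    print(err)
```

The name check matters even when dimensions happen to match: two models with the same output size still place text in unrelated spaces.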
Note on model versions: Embedding models above are accurate as of May 2026. The state of the art moves fast. Check the MTEB Leaderboard for current top-performing embedding models ranked by retrieval, clustering, and semantic similarity benchmarks.
5. Similarity search: the math
Once everything is embedded, retrieval is a search problem: find the vectors in the database closest to the query vector.
Cosine Similarity
Measures the angle between two vectors, ignoring their magnitude.
$$\text{cosine\_similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$$
- Score of 1.0: vectors point in the same direction → highly similar content
- Score of 0.0: vectors are perpendicular → unrelated
- Score of -1.0: vectors point in opposite directions → opposite meaning
Range: -1 to 1. Higher is more similar. This is the most common metric for text embeddings.
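The formula translates directly into code. This is a minimal pure-Python sketch with hand-picked 2D vectors for illustration; real embeddings have hundreds of dimensions and are usually handled with NumPy.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of a and b divided by the product of their lengths."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # close to 1.0
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 1.0])    # 0.0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # close to -1.0
print(same_direction, perpendicular, opposite)
```

Note that `[1.0, 2.0]` and `[2.0, 4.0]` differ in length but score as identical, which is what "ignoring magnitude" means.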
Euclidean Distance
Measures the straight-line distance between two points in vector space.
$$d(A, B) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}$$
- Distance of 0: identical vectors
- Larger distance → less similar
Range: 0 to ∞. Lower is more similar. Less common for text, but used in some vector databases.
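A short sketch of the distance formula, plus one relationship worth knowing: for unit-length vectors, squared Euclidean distance equals 2 − 2 × cosine similarity, so both metrics rank results in the same order. The helper names and the example vectors are illustrative only.

```python
import math


def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalize(v: list[float]) -> list[float]:
    """Scale a non-zero vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0 (the 3-4-5 triangle)

# For unit vectors: d^2 == 2 - 2 * cos
a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])
cos = sum(x * y for x, y in zip(a, b))
d = euclidean_distance(a, b)
print(d ** 2, 2 - 2 * cos)  # the two values agree up to floating-point error
```

This is why the choice of metric matters less when the embedding model already returns normalized vectors, and more when it does not.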
In practice, you pick a metric when setting up your vector database and stay consistent. Cosine similarity is the default for most text RAG systems.
After computing similarity scores, you retrieve the top-K chunks — typically K=3 to K=10 depending on how much context the LLM can handle and how focused you need the answer to be.
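Top-K retrieval can be sketched as a brute-force scan that scores every stored vector and keeps the best K. The chunk labels and 2D unit vectors below are made up for illustration; because they have length 1, the dot product equals cosine similarity. Vector databases avoid this full scan by using approximate indexes.

```python
import heapq


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k(query: list[float], items: list[tuple[str, list[float]]], k: int = 3):
    """Return the k highest-scoring (score, text) pairs, best first."""
    scored = ((dot(query, vec), text) for text, vec in items)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Hypothetical unit-length vectors standing in for real embeddings
items = [
    ("holidays chunk", [1.0, 0.0]),
    ("remote work chunk", [0.6, 0.8]),
    ("salary chunk", [0.0, 1.0]),
]
query = [0.8, 0.6]

results = top_k(query, items, k=2)
for score, text in results:
    print(f"{score:.2f}  {text}")
```

`heapq.nlargest` avoids sorting the whole collection when K is much smaller than the number of chunks.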
6. Vector databases
A vector database is a database optimized for storing embeddings and running fast similarity searches across millions of vectors.
Options:
| Database | Type | Notes |
|---|---|---|
| ChromaDB | Open-source | Simple API, good for prototyping |
| FAISS | Open-source (Meta) | Very fast, in-memory, no persistence by default |
| Pinecone | Managed cloud | Production-scale, paid |
| Weaviate | Open-source / cloud | Feature-rich, supports hybrid search |
| Qdrant | Open-source / cloud | Fast, written in Rust |
For learning and local development, ChromaDB is the easiest starting point — minimal setup, Python-native API.
7. Full implementation with ChromaDB
This implementation indexes a small set of company policy documents and answers questions from them.
pip install chromadb sentence-transformers groq
Step 1: Index your documents
import chromadb
from sentence_transformers import SentenceTransformer

# Use a free, local embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Your documents (in a real app, load from PDFs or a database)
documents = [
    "Nerchuko employees are entitled to 15 paid holidays per year.",
    "Holidays include national public holidays and company-specific days announced in January.",
    "Nerchuko employees must work from the office on Tuesdays and Wednesdays.",
    "Remote work is permitted on all other days subject to manager approval.",
    "The annual performance review cycle runs from October to December.",
    "Salary increments are effective from the 1st of January each year."
]

# Create a ChromaDB collection
client = chromadb.Client()
collection = client.create_collection("company_policies")

# Embed and index all documents
embeddings = embedding_model.encode(documents).tolist()
collection.add(
    documents=documents,
    embeddings=embeddings,
    ids=[f"doc_{i}" for i in range(len(documents))]
)
print(f"Indexed {len(documents)} document chunks.")
Step 2: Retrieve and answer a question
import os
from groq import Groq

groq_client = Groq(api_key=os.environ["GROQ_API_KEY"])

def answer_question(question: str, top_k: int = 3) -> str:
    # Embed the question with the SAME model used for indexing
    query_embedding = embedding_model.encode([question]).tolist()

    # Retrieve top-K most similar chunks
    results = collection.query(
        query_embeddings=query_embedding,
        n_results=top_k
    )
    retrieved_chunks = results["documents"][0]
    context = "\n".join(f"- {chunk}" for chunk in retrieved_chunks)

    # Build the RAG prompt
    prompt = f"""Answer the question using only the information provided below.
If the answer is not in the context, say "I don't have that information in the provided documents."
Do not use any outside knowledge.
Context:
{context}
Question: {question}"""

    response = groq_client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Test it
questions = [
    "How many holidays do Nerchuko employees get?",
    "When do salary increments take effect?",
    "What is the remote work policy?",
]
for q in questions:
    print(f"Q: {q}")
    print(f"A: {answer_question(q)}")
    print()
What’s happening:
- The question is embedded using the same model that indexed the documents
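- ChromaDB searches its index for the stored vectors closest to the query vector. Its default distance metric is squared L2 rather than cosine; to rank by cosine distance, create the collection with metadata={"hnsw:space": "cosine"}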
- The top 3 matching chunks are retrieved
- Those chunks are injected into the prompt as context
- The LLM is instructed to answer from that context only, so it reads the answer off the retrieved chunks instead of relying on what it memorized in training
8. What goes wrong and how to fix it
Wrong chunks are retrieved:
- Chunks are too large — try smaller chunks or add overlap
- The embedding model is too weak for your domain — try a stronger model
- Not enough K — increase top-K so more context reaches the LLM
LLM ignores the context and hallucinates anyway:
- Strengthen the system prompt: "You must only use the context provided. If the answer is not there, say so."
- Try a more instruction-following model
Same question gets different chunks on different runs:
- Embeddings are deterministic — if you get different chunks, the issue is inconsistent preprocessing (whitespace, encoding). Normalize your text before chunking.
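A minimal normalization pass might look like the sketch below. `normalize_text` is a hypothetical helper, and the specific rules (NFKC Unicode normalization, unified line endings, collapsed whitespace) are assumptions about what "normalize" should cover; adjust them to your documents.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Make visually identical text byte-identical before chunking."""
    # Fold compatibility characters (e.g. non-breaking spaces) to plain forms
    text = unicodedata.normalize("NFKC", text)
    # Unify Windows and old Mac line endings
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs into one space
    text = re.sub(r"[ \t]+", " ", text)
    # Drop spaces that hug a line break
    text = re.sub(r" *\n *", "\n", text)
    # Keep at most one blank line between paragraphs
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


raw = "Paid\u00a0holidays:   15\r\n\r\n\r\n\r\nRemote  work "
clean = normalize_text(raw)
print(repr(clean))
```

Run the same function over documents at indexing time and over questions at query time, so both sides of the comparison see the same preprocessing.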
What’s next
Part 12 is a hands-on ChromaDB implementation: persistent clients, collections, metadata filtering, custom embedding models, upsert vs update, and a complete end-to-end RAG pipeline that answers questions from a company handbook using Groq.
Full video walkthrough is embedded above.