LLMs · AI Agents · RAG · 2026-05-28

Building a RAG Pipeline with ChromaDB

A complete hands-on implementation of RAG using ChromaDB — persistent storage, collections, metadata filtering, custom embedding models, and a full end-to-end pipeline that answers questions from a private document.

This is Part 12 of the AI Agents series. Part 11 covered the RAG architecture and the math behind vector search. This post is the implementation — building a working pipeline from scratch with ChromaDB, a real document, and Groq for generation.


1. Setup

pip install chromadb sentence-transformers groq

ChromaDB handles embedding storage and similarity search. sentence-transformers gives you a free, local embedding model. Groq provides the LLM for generation.


2. In-memory vs persistent client

ChromaDB gives you two ways to create a client:

import chromadb

# In-memory: data is gone when the process ends
client = chromadb.Client()

# Persistent: data is saved to disk at the specified path
client = chromadb.PersistentClient(path="./chroma_db")

Use in-memory for experiments. Use PersistentClient for anything you want to keep between runs — indexing large documents is slow and you don’t want to redo it every time.

If you’re working in a hosted notebook environment (Colab, etc.), download the database directory before closing the session or your indexed data is gone.
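One way to package the database directory for download is to zip it first. This is a minimal sketch using only the standard library; the helper name archive_chroma_dir and the archive name are assumptions for illustration, and the path matches the ./chroma_db used above.

```python
import shutil
from pathlib import Path


def archive_chroma_dir(db_path: str = "./chroma_db",
                       archive_name: str = "chroma_db_backup") -> str:
    """Zip the persistent ChromaDB directory and return the archive path."""
    db_dir = Path(db_path)
    if not db_dir.is_dir():
        raise FileNotFoundError(f"No database directory at {db_dir}")
    # make_archive appends ".zip" to archive_name and returns the full path
    return shutil.make_archive(archive_name, "zip", root_dir=db_dir)
```

In a hosted notebook you would then download the resulting zip file through whatever file-download mechanism the environment offers.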


3. Collections

A collection is ChromaDB’s unit of logical separation — like a table in a relational database. Keep different datasets in different collections so you don’t mix embeddings from unrelated domains.

# Will throw an error if the collection already exists
collection = client.create_collection(name="company_handbook")

# Better: create if it doesn't exist, return it if it does
collection = client.get_or_create_collection(name="company_handbook")

Always use get_or_create_collection in any code that might be run more than once.


4. Adding documents

documents = [
    "Nerchuko was founded in 2024 by Ravi and Priya.",
    "Employees are entitled to 12 casual leaves and 6 sick leaves per year.",
    "Work hours are 9 AM to 6 PM Monday through Friday.",
    "New employees must complete onboarding within the first two weeks.",
]

collection.add(
    documents=documents,
    ids=["doc_0", "doc_1", "doc_2", "doc_3"]
)

ChromaDB uses the all-MiniLM-L6-v2 model by default to convert text to embeddings automatically. You don’t need to call an embedding function yourself unless you want a custom model.

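IDs must be unique within a collection. Calling add with an ID that already exists does not overwrite the stored entry; depending on the ChromaDB version, the duplicate is skipped or rejected with an error. To replace an existing entry, use upsert (section 7).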
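Hand-numbered IDs such as doc_0 work for a fixed list, but they shift if the list is reordered. A possible alternative, sketched here under the assumption that identical text should always map to the same ID, is to derive the ID from a hash of the content. The helper make_id is hypothetical and not part of ChromaDB.

```python
import hashlib


def make_id(text: str, prefix: str = "doc") -> str:
    """Return a stable ID derived from the document text."""
    # Same text always yields the same digest, so re-indexing is repeatable
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    return f"{prefix}_{digest}"


docs = [
    "Work hours are 9 AM to 6 PM Monday through Friday.",
    "New employees must complete onboarding within the first two weeks.",
]
ids = [make_id(d) for d in docs]
```

The trade-off: editing a document changes its ID, so the old entry has to be deleted explicitly instead of being replaced in place.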

Metadata lets you attach filterable key-value pairs to each document:

# upsert rather than add: these IDs already exist from the previous call
collection.upsert(
    documents=documents,
    metadatas=[
        {"section": "founding"},
        {"section": "leave_policy"},
        {"section": "work_hours"},
        {"section": "onboarding"},
    ],
    ids=["doc_0", "doc_1", "doc_2", "doc_3"]
)

5. Querying

results = collection.query(
    query_texts=["How many sick leaves do employees get?"],
    n_results=2
)

print(results["documents"])    # the retrieved chunks
print(results["distances"])    # distances to the query (lower = more similar)
print(results["metadatas"])    # metadata for each result

By default, ChromaDB returns IDs, distances, metadatas, and documents. Embeddings are excluded unless you add "embeddings" to the include list, for example include=["documents", "distances", "embeddings"].
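Each field in the result is a list of lists: one inner list per query text, with entries in matching order across fields. The sketch below zips those parallel lists into one record per hit. The helper flatten_results and the sample dictionary are hypothetical; the sample is hand-written to mimic the shape of a query result, not real output.

```python
def flatten_results(results: dict, query_index: int = 0) -> list[dict]:
    """Turn one query's parallel result lists into per-hit records."""
    field_names = {"documents": "document",
                   "distances": "distance",
                   "metadatas": "metadata"}
    hits = []
    for i, doc_id in enumerate(results["ids"][query_index]):
        hit = {"id": doc_id}
        for field, key in field_names.items():
            values = results.get(field)
            # Fields left out of `include` are missing or None
            hit[key] = values[query_index][i] if values else None
        hits.append(hit)
    return hits


# Hand-written stand-in with the same shape as a query result
sample = {
    "ids": [["doc_1", "doc_3"]],
    "documents": [["Leave policy text", "Onboarding text"]],
    "distances": [[0.42, 1.31]],
    "metadatas": [[{"section": "leave_policy"}, {"section": "onboarding"}]],
}

for hit in flatten_results(sample):
    print(hit["id"], hit["distance"], hit["metadata"])
```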

Metadata filtering — restrict search to a specific section:

results = collection.query(
    query_texts=["How many sick leaves do employees get?"],
    n_results=2,
    where={"section": "leave_policy"}
)

This forces ChromaDB to only search within documents where section == "leave_policy". Useful when you have multiple document types in one collection and want to limit the search scope.
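Filters can also match several values at once. The sketch below builds a where dictionary using the $eq and $in operators from Chroma's filter syntax; the helper section_filter is hypothetical and only constructs the dictionary, it does not query anything.

```python
def section_filter(*sections: str) -> dict:
    """Build a `where` filter matching one or more `section` values."""
    if not sections:
        raise ValueError("pass at least one section name")
    if len(sections) == 1:
        # Long form of the shorthand {"section": "leave_policy"}
        return {"section": {"$eq": sections[0]}}
    # $in matches documents whose value is any of the listed ones
    return {"section": {"$in": list(sections)}}


print(section_filter("leave_policy"))
print(section_filter("leave_policy", "work_hours"))
```

The returned dictionary would be passed as the where argument of collection.query, in place of the literal filter shown above.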


6. Distance metrics

By default, ChromaDB uses squared L2 (Euclidean) distance: 0 means identical, higher means less similar, and there is no upper bound.

To use cosine distance (range 0–2, lower is more similar), set it at collection creation:

collection = client.get_or_create_collection(
    name="company_handbook",
    metadata={"hnsw:space": "cosine"}
)

You can’t change the distance metric on an existing collection. Decide upfront.

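For text similarity tasks, cosine distance is generally more robust. L2 is sensitive to vector magnitude, while cosine distance only looks at direction. If your embedding model outputs unit-normalized vectors, both metrics rank results identically, so the choice matters most when vectors are not normalized.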
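The relationship between the two metrics can be checked by hand. The sketch below implements both distances in plain Python on toy 2-dimensional vectors (made up for illustration, not real embeddings) and assumes the squared form of L2.

```python
import math


def squared_l2(a, b):
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; ranges from 0 to 2."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


# Same direction, different magnitude: cosine sees no difference, L2 does
print(cosine_distance([3.0, 4.0], [6.0, 8.0]))
print(squared_l2([3.0, 4.0], [6.0, 8.0]))

# For unit vectors, squared L2 equals 2 * cosine distance
a, b = normalize([3.0, 4.0]), normalize([4.0, 3.0])
print(math.isclose(squared_l2(a, b), 2 * cosine_distance(a, b)))
```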


7. Upsert, update, and delete

# upsert: insert if ID doesn't exist, update if it does
collection.upsert(
    documents=["Nerchuko was founded in 2024 by Ravi, Priya, and Arjun."],
    ids=["doc_0"]
)

# update: only modifies IDs that already exist and never inserts;
# a missing ID is reported as an error, so avoid this for indexing code
collection.update(
    documents=["..."],
    ids=["doc_0"]
)

# delete by ID
collection.delete(ids=["doc_0"])

# delete by metadata filter
collection.delete(where={"section": "founding"})

Prefer upsert over update. It handles both the insert and the update case, so the same indexing code works on a fresh database and on an existing one.


8. Custom embedding models

ChromaDB’s default all-MiniLM-L6-v2 produces 384-dimensional vectors. It’s fast and works well for simple cases. For better retrieval on complex or domain-specific documents, use a larger model.

from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

embedding_fn = SentenceTransformerEmbeddingFunction(
    model_name="Alibaba-NLP/gte-Qwen2-1.5B-instruct"
)

collection = client.get_or_create_collection(
    name="handbook_highq",
    embedding_function=embedding_fn
)

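Critical: if you switch embedding models, create a new collection. Each model defines its own vector space, so mixing vectors from different models produces meaningless similarity scores even when the dimensions happen to match. When the dimensions differ, the query fails outright: a collection built with a 384-dim model cannot be queried with the 1536-dim vectors that gte-Qwen2-1.5B-instruct produces.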
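A simple way to avoid mixing models by accident is to encode the model name in the collection name. The helper below is a hypothetical sketch; the 63-character cap and the restriction to lowercase letters, digits, hyphens, and underscores are assumptions about ChromaDB's naming rules, so check them against the version you run.

```python
import re


def collection_name_for(base: str, model_name: str, max_len: int = 63) -> str:
    """Derive a collection name that includes the embedding model."""
    # Drop the org prefix, e.g. "Alibaba-NLP/" in a Hugging Face model ID
    short = model_name.split("/")[-1]
    raw = f"{base}-{short}".lower()
    # Collapse every run of other characters into a single hyphen
    slug = re.sub(r"[^a-z0-9_-]+", "-", raw)
    # Trim to the assumed length cap and avoid leading/trailing separators
    return slug[:max_len].strip("-_")


print(collection_name_for("handbook", "Alibaba-NLP/gte-Qwen2-1.5B-instruct"))
print(collection_name_for("handbook", "all-MiniLM-L6-v2"))
```

Switching models then naturally lands in a different collection instead of colliding with existing vectors.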

Larger models generally capture context more accurately and tend to separate related chunks from unrelated ones more cleanly, which improves retrieval precision. The cost is slower indexing and higher memory use.

Note on model versions: the gte-Qwen2-1.5B-instruct recommendation is current as of May 2026. New high-quality open-source embedding models are released regularly. Check the MTEB Leaderboard for the current state of the art before committing to a model in production.


9. End-to-end RAG pipeline

Putting it all together: a pipeline that answers questions from a company handbook.

import os
import chromadb
from groq import Groq

# --- Setup ---
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_or_create_collection(name="nerchuko_handbook")
groq_client = Groq(api_key=os.environ["GROQ_API_KEY"])

# --- Document: Nerchuko Employee Handbook (chunked by section) ---
handbook_chunks = [
    "### Company Structure\nNerchuko was founded in 2024 by Ravi Kumar and Priya Singh. The company operates in the AI education space with teams across engineering, content, and operations.",
    "### Work Hours\nStandard work hours are 9 AM to 6 PM, Monday through Friday. Flexible start times between 8–10 AM are permitted with manager approval.",
    "### Leave Policy\nEmployees are entitled to 12 casual leaves and 6 sick leaves per year. Leaves do not carry forward to the next year.",
    "### Remote Work\nEmployees must be present in the office on Tuesdays and Wednesdays. Remote work is permitted on all other working days.",
    "### Onboarding\nNew employees are expected to complete the onboarding module within two weeks of joining. The module covers company tools, processes, and code of conduct.",
    "### Learning and Development\nNerchuko provides access to learning platforms. Priority courses are: Python for Data Science, SQL Fundamentals, and Machine Learning Foundations.",
]

# --- Index: upsert so re-running is safe ---
collection.upsert(
    documents=handbook_chunks,
    ids=[f"chunk_{i}" for i in range(len(handbook_chunks))]
)

print(f"Indexed {len(handbook_chunks)} chunks.")


# --- RAG query function ---
def ask(question: str, top_k: int = 3) -> str:
    # Retrieve relevant chunks
    results = collection.query(
        query_texts=[question],
        n_results=top_k
    )
    retrieved = results["documents"][0]
    context = "\n\n".join(retrieved)

    # Build the RAG prompt
    prompt = f"""Answer the user question using only the context provided below.
If the answer cannot be found in the context, reply exactly: "I am not sure."
Do not use outside knowledge.

Context:
{context}

Question: {question}"""

    response = groq_client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content


# --- Test ---
questions = [
    "How many casual leaves and sick leaves do employees get?",
    "Who founded Nerchuko?",
    "What courses should I prioritize for learning?",
    "What are the onboarding expectations for new employees?",
    "What is the company's revenue last quarter?",  # not in the handbook
]

for q in questions:
    print(f"Q: {q}")
    print(f"A: {ask(q)}")
    print()

The last question (“revenue last quarter”) should return “I am not sure.” The handbook contains no revenue information, so the retrieved context cannot answer it, and the prompt instructs the model to refuse rather than guess. This is the key behavior RAG enables: answers grounded in your documents instead of hallucinations.

Compare that to asking the same questions without RAG: the LLM knows nothing about Nerchuko’s internal policies, so it is likely to confuse the company with other organizations or confidently produce incorrect answers.
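The handbook_chunks list in the pipeline is written by hand, one chunk per ### section. If the handbook were available as a single markdown string, a splitter along these lines could produce the same kind of chunks. This is a sketch under that assumption; the function split_by_heading and the sample text are hypothetical.

```python
import re


def split_by_heading(markdown_text: str, marker: str = "###") -> list[str]:
    """Split text into chunks, each starting at a heading line."""
    # Zero-width lookahead: split before the marker, keep heading with its body
    pattern = rf"^(?={re.escape(marker)} )"
    parts = re.split(pattern, markdown_text, flags=re.MULTILINE)
    return [part.strip() for part in parts if part.strip()]


handbook = (
    "### Work Hours\nStandard work hours are 9 AM to 6 PM.\n\n"
    "### Leave Policy\nEmployees get 12 casual leaves and 6 sick leaves.\n"
)

for chunk in split_by_heading(handbook):
    print(repr(chunk))
```

Keeping the heading inside each chunk gives the embedding model the section title as extra context, which is the same layout the hard-coded chunks use.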


10. What the distance scores tell you

After querying, the distances field in the result tells you how similar each retrieved chunk is to the question.

With the default L2 metric (squared distances):

  • Score near 0 → very close match
  • Score above 1–2 → weak match; the retrieved chunk may not be relevant

These thresholds are rough and depend on the embedding model. If your distances are consistently high (e.g. > 1.5) for questions that should have clear answers, your chunks are likely too large or your embedding model is too weak for the domain. Try smaller chunks or a better model.
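Distances can also be used as a gate before generation. The sketch below drops retrieved chunks whose distance exceeds a cutoff. The helper filter_by_distance is hypothetical, the 1.5 default simply reuses the example figure above and should be tuned per model and metric, and the sample dictionary is hand-written to mimic a query result.

```python
def filter_by_distance(results: dict, max_distance: float = 1.5) -> list[str]:
    """Keep only the first query's documents within max_distance."""
    docs = results["documents"][0]
    dists = results["distances"][0]
    return [doc for doc, dist in zip(docs, dists) if dist <= max_distance]


# Hand-written stand-in with the same shape as a query result
sample_results = {
    "documents": [["Leave policy text", "Remote work text", "Onboarding text"]],
    "distances": [[0.61, 1.48, 1.87]],
}

kept = filter_by_distance(sample_results)
print(kept)
```

Inside a function like ask, an empty list could short-circuit to the “I am not sure.” reply without calling the LLM at all.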


What’s next

Part 13 covers chunking strategies in depth — starting with fixed-size chunking, why it breaks context, and what to use instead.

