RAG Chunking: Sentence-Based Splitting

This is Part 14 of the AI Agents series. Part 13 covered fixed-size chunking and why it produces bad vectors. This post covers the next step up: sentence-based chunking, which preserves semantic completeness in every chunk.

1. The core idea

In sentence-based chunking, each sentence becomes one chunk. Instead of splitting every N characters, you split at sentence boundaries — full stops, question marks, exclamation marks.

Input:

The sun is a star. Earth orbits the sun. The moon orbits earth. Solar energy powers our planet.

Output chunks:

[0] "The sun is a star."
[1] "Earth orbits the sun."
[2] "The moon orbits earth."
[3] "Solar energy powers our planet."

For well-formed prose, every chunk carries a complete thought, so the embedding model has a coherent unit to represent. In practice this tends to give sharper vectors and more precise retrieval than chunks cut mid-sentence.

2. Why this is better than fixed-size chunking

The embedding model needs complete semantic units to produce meaningful vectors. A sentence is the natural unit of meaning in written language — it has a subject, a predicate, and a complete thought.

Fixed-size chunking cuts across that boundary arbitrarily. Sentence-based chunking respects it.

Property	Fixed-size	Sentence-based
Context preserved	No	Yes
Grammar intact	No	Yes
Vector quality	Low	High
Implementation complexity	Trivial	Low–Medium
Handles abbreviations	N/A	Needs care
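
To make the contrast concrete, here is a minimal sketch that chunks the sample text from section 1 both ways. The helper names fixed_size_chunks and simple_sentence_chunks, and the 25-character window, are illustrative assumptions rather than code from earlier parts of the series.

```python
import re

text = (
    "The sun is a star. Earth orbits the sun. "
    "The moon orbits earth. Solar energy powers our planet."
)

def fixed_size_chunks(text: str, size: int) -> list[str]:
    # Cut every `size` characters, ignoring sentence boundaries.
    return [text[i:i + size] for i in range(0, len(text), size)]

def simple_sentence_chunks(text: str) -> list[str]:
    # Split on whitespace that follows ., ! or ? (good enough for this clean input).
    return [s for s in re.split(r'(?<=[.!?])\s+', text) if s]

fixed = fixed_size_chunks(text, 25)       # 25 is an arbitrary window for the demo
sentences = simple_sentence_chunks(text)

print(repr(fixed[0]))      # 'The sun is a star. Earth '  (cuts into the second sentence)
print(repr(sentences[1]))  # 'Earth orbits the sun.'
```

The fixed-size chunk ends partway through the second sentence, while every sentence chunk stands on its own.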

3. The naive approach and why it breaks

The obvious implementation is to split on ., ?, and !. It fails immediately on real text, and splitting on the period alone is enough to show why:

text = "Dr. Smith earned $9.5M from U.S.A. operations. He works at Stanford."
chunks = text.split(".")
# ['Dr', ' Smith earned $9', '5M from U', 'S', 'A', ' operations', ' He works at Stanford', '']

Three problems:

Abbreviations: Dr., Mr., Mrs., Prof. contain dots that don’t end sentences
Acronyms: U.S.A. splits into individual letters
Decimal numbers: 9.5 splits at the decimal point

All of these are dots that should not trigger a sentence split.
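
A slightly less naive variant, sketched here as an assumption rather than a step from the original walkthrough, splits only on whitespace that follows ., ?, or !. That already keeps the decimal intact, because no whitespace follows the dot in 9.5, but the abbreviation and the acronym still break:

```python
import re

text = "Dr. Smith earned $9.5M from U.S.A. operations. He works at Stanford."

# Split on whitespace preceded by sentence-ending punctuation.
chunks = re.split(r'(?<=[.!?])\s+', text)
print(chunks)
# ['Dr.', 'Smith earned $9.5M from U.S.A.', 'operations.', 'He works at Stanford.']
```

So the whitespace rule alone is not enough. The dots in Dr. and U.S.A. need explicit protection before the split.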

4. A smart sentence chunker

import re

def sentence_chunks(text: str) -> list[str]:
    # Protect title abbreviations: Dr. Mr. Mrs. Ms. Prof.
    # ("Miss" is a full word, not an abbreviation, so it is not listed.)
    protected = re.sub(r'\b(Dr|Mr|Mrs|Ms|Prof)\.\s', r'\1<PERIOD> ', text)

    # Protect acronyms: sequences of single uppercase letters separated by dots (U.S.A.)
    protected = re.sub(r'\b([A-Z]\.){2,}', lambda m: m.group().replace('.', '<PERIOD>'), protected)

    # Protect decimal numbers: digits.digits
    protected = re.sub(r'(\d)\.(\d)', r'\1<PERIOD>\2', protected)

    # Split on whitespace that follows sentence-ending punctuation (., !, ?)
    raw_chunks = re.split(r'(?<=[.!?])\s+', protected)

    # Restore protected periods and clean up
    chunks = [
        chunk.replace('<PERIOD>', '.').strip()
        for chunk in raw_chunks
        if chunk.strip()
    ]

    return chunks


# Test
text = (
    "Dr. Smith earned $9.5M from U.S.A. operations. "
    "He joined the company in 2019! "
    "Did Prof. Jones approve the budget? "
    "The growth rate was 12.3% last quarter."
)

chunks = sentence_chunks(text)
for i, chunk in enumerate(chunks):
    print(f"[{i}] {chunk}")

Output:

[0] Dr. Smith earned $9.5M from U.S.A. operations.
[1] He joined the company in 2019!
[2] Did Prof. Jones approve the budget?
[3] The growth rate was 12.3% last quarter.

The abbreviations, the acronym, and both decimals are preserved, and each split yields a complete sentence.
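
One case the chunker above does not handle is an acronym that ends a sentence: because every dot in U.S.A. is protected, "She moved to the U.S.A. Her family stayed in Canada." comes back as a single chunk. The sketch below is one possible tweak, not part of the original code. The name sentence_chunks_v2 and the capital-letter heuristic are assumptions: the acronym's final dot is left unprotected when the next word starts with an uppercase letter.

```python
import re

ABBREVIATIONS = r'\b(Dr|Mr|Mrs|Ms|Prof)\.\s'
ACRONYM = r'\b(?:[A-Z]\.){2,}'

def _protect_acronym(match: re.Match) -> str:
    acronym = match.group(0)
    following = match.string[match.end():]
    # Heuristic (assumption): whitespace + capital letter after the acronym
    # suggests a new sentence, so keep the final dot as a real sentence end.
    if re.match(r'\s+[A-Z]', following):
        return acronym[:-1].replace('.', '<PERIOD>') + '.'
    return acronym.replace('.', '<PERIOD>')

def sentence_chunks_v2(text: str) -> list[str]:
    protected = re.sub(ABBREVIATIONS, r'\1<PERIOD> ', text)
    protected = re.sub(ACRONYM, _protect_acronym, protected)
    protected = re.sub(r'(\d)\.(\d)', r'\1<PERIOD>\2', protected)
    raw_chunks = re.split(r'(?<=[.!?])\s+', protected)
    return [c.replace('<PERIOD>', '.').strip() for c in raw_chunks if c.strip()]

print(sentence_chunks_v2("She moved to the U.S.A. Her family stayed in Canada."))
# ['She moved to the U.S.A.', 'Her family stayed in Canada.']
```

The heuristic has its own failure mode: an acronym followed by a proper noun mid-sentence (for example "U.S.A. Army") would be split incorrectly. For messy text, a trained sentence tokenizer is usually a safer choice than growing the regex list.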

5. Integrating with ChromaDB

Once you have clean sentence chunks, indexing them is identical to Part 12:

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(name="sentence_chunks")

document = """
Dr. Smith leads the AI research team at Nerchuko. The company was founded in 2024.
Nerchuko offers 12 casual leaves and 6 sick leaves per year. Work hours are 9 AM to 6 PM.
Employees in the U.S.A. office follow Eastern Time. The average team size is 8.5 members per pod.
"""

chunks = sentence_chunks(document.strip())

collection.upsert(
    documents=chunks,
    ids=[f"s_{i}" for i in range(len(chunks))]
)

# Query
results = collection.query(
    query_texts=["How many sick leaves do employees get?"],
    n_results=2
)

for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"[{dist:.4f}] {doc}")

Each sentence is now an independent vector in the database. For fact-lookup queries like the one above, retrieval tends to be more precise than with fixed-size chunks, because the best match is a whole sentence rather than a fragment.
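
An optional extension is to record where each sentence came from, so neighbouring sentences can be looked up after retrieval. The sketch below only builds the id and metadata lists. The helper name build_records and the "source" and "position" keys are hypothetical, and the commented-out call assumes the collection object from the snippet above and Chroma's metadatas parameter on upsert.

```python
def build_records(chunks: list[str], source: str) -> tuple[list[str], list[dict]]:
    # One id and one metadata dict per sentence, in document order.
    ids = [f"{source}_s_{i}" for i in range(len(chunks))]
    metadatas = [{"source": source, "position": i} for i in range(len(chunks))]
    return ids, metadatas

chunks = ["Nerchuko was founded in 2024.", "Work hours are 9 AM to 6 PM."]
ids, metadatas = build_records(chunks, source="handbook")

print(ids)           # ['handbook_s_0', 'handbook_s_1']
print(metadatas[1])  # {'source': 'handbook', 'position': 1}

# With the collection from the snippet above (not executed here):
# collection.upsert(documents=chunks, ids=ids, metadatas=metadatas)
```

With the position stored, a retrieved sentence at position i can be expanded with the sentences at i - 1 and i + 1 from the same source.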

6. Limitations of sentence-based chunking

Sentence-based chunking is a significant improvement over fixed-size, but it has its own edge cases:

Very short sentences lack context. "He did it." as a standalone chunk produces a poor vector — who is “he” and what did he do? Adjacent context matters.
Very long sentences can carry multiple distinct facts, diluting the embedding. A 200-word sentence is compressed into a vector of the same size as a 10-word one, so individual facts inside it lose granularity.
Lists and bullet points often don’t end in sentence punctuation, so a punctuation-based splitter will not separate the items and may merge several of them into one chunk.
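
For the short-sentence problem, one simple mitigation is to attach very short sentences to the sentence before them. This is a minimal sketch under stated assumptions: the helper name merge_short_chunks and the 4-word threshold are illustrative, and the right threshold depends on your documents.

```python
def merge_short_chunks(chunks: list[str], min_words: int = 4) -> list[str]:
    merged: list[str] = []
    for chunk in chunks:
        # A short sentence borrows context from the chunk before it.
        # A short first sentence has no predecessor, so it is kept as-is.
        if merged and len(chunk.split()) < min_words:
            merged[-1] = f"{merged[-1]} {chunk}"
        else:
            merged.append(chunk)
    return merged

chunks = [
    "Dr. Smith fixed the outage on Monday.",
    "He did it.",
    "The team celebrated afterwards.",
]
print(merge_short_chunks(chunks))
# ['Dr. Smith fixed the outage on Monday. He did it.', 'The team celebrated afterwards.']
```

The merged chunk now says who "he" is, at the cost of chunks that are no longer strictly one sentence each.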

For documents with consistent prose — articles, handbooks, reports — sentence chunking works well out of the box. For mixed-format documents, you may need to combine strategies.

What’s next

Part 15 covers recursive split chunking — a smarter strategy that tries to split on paragraph boundaries first, then sentences, then words, only falling back to character-level splits as a last resort. This handles mixed-format documents better than either fixed-size or sentence-based chunking alone.

Full video walkthrough is embedded above.