How LLMs Work: From Tokens to AI Agents
Before you can build an AI agent, you need to understand the engine inside it. A ground-up walkthrough of LLMs — tokenization, transformers, training, and the limits that make agents necessary.
Each post pairs with a YouTube video. Open any article in Claude for AI-assisted Q&A.
Practical ways to use ChatGPT, Claude, and Gemini — from clearing doubts and building resumes to brainstorming ML projects and automating content creation. Plus one critical warning about when not to reach for them.
GPT, Claude, and Gemini aren't the only options. A clear breakdown of open-source vs paid models — what they are, how they differ, and a decision framework for choosing the right one for your use case.
Stop using the chat UI — connect to LLMs directly in Python. Covers OpenAI and Groq setup, streaming vs non-streaming, picking the right model for cost, and controlling behavior with system prompts.
Three knobs that control how your LLM behaves — and how much it costs you. Learn what Temperature, Max Tokens, and Context Window actually do, with real examples and code.
Open-source models are free, but hosting them is not. Learn how to use Groq to run Llama and other open models via API, understand free-tier limits, and ship your first Groq-powered call in Python.
Open-source models are free to download but expensive to run. Learn the VRAM math, what quantization actually costs you, and how to pick between Ollama and vLLM for local inference.
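As a rough illustration of the VRAM math mentioned above, the sketch below estimates the memory needed for model weights alone at different precisions. It rests on a common rule of thumb (parameter count times bytes per parameter) and ignores the KV cache, activations, and runtime overhead, so real usage will be higher. The function name `weight_memory_gb` and the 7B example size are illustrative assumptions, not taken from the post.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 10^9 bytes).

    Assumption: every parameter is stored at the same precision. KV cache,
    activations, and framework overhead are NOT included.
    """
    bytes_per_param = bits_per_param / 8
    return params_billions * 1e9 * bytes_per_param / 1e9


# Hypothetical 7B-parameter model at three common precisions.
for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{weight_memory_gb(7, bits):.1f} GB for weights")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts the weight footprint to a quarter. Whether the quality loss is acceptable depends on the model and the task.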
Getting bad or inconsistent outputs from an LLM usually isn't the model's fault; it's the prompt. Learn the two core prompting techniques, zero-shot and few-shot, when to use each, and how few-shot examples unlock custom output formats.
Zero-shot and few-shot get you far. But complex reasoning, math, and open-ended analysis need more — learn the three techniques that make LLMs think before they answer.
An LLM's knowledge stops at its training cutoff and it can't access your private data. ReAct and RAG are the two prompt engineering frameworks that fix both problems — turning a plain LLM into an agent that can act and retrieve.
RAG is how you give LLMs accurate answers from documents they've never seen. This post covers the full architecture: chunking, embeddings, similarity search, vector databases, and a working implementation with ChromaDB.
A complete hands-on implementation of RAG using ChromaDB — persistent storage, collections, metadata filtering, custom embedding models, and a full end-to-end pipeline that answers questions from a private document.
Bad retrieval in RAG almost always traces back to bad chunks. Learn why fixed-size chunking destroys context, when it's acceptable, and what the alternatives are.
Fixed-size chunking can cut text mid-sentence, or even mid-word. Sentence-based chunking fixes that by treating each complete sentence as its own chunk: better context, better vectors, better retrieval.
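To make the contrast concrete, here is a minimal sketch comparing the two approaches on a two-sentence string. The sentence splitter is a naive regex that assumes sentences end in `.`, `!`, or `?` followed by whitespace; it will mis-split abbreviations such as "e.g." and is not the implementation from the post. The function names are illustrative.

```python
import re


def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Cut the text every `size` characters, ignoring word and sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def sentence_chunks(text: str) -> list[str]:
    """Naive sentence splitter: break after ., ! or ? when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


text = "RAG retrieves documents. Chunking decides what gets retrieved."

print(fixed_size_chunks(text, 20))  # first chunk ends in the middle of "documents"
print(sentence_chunks(text))        # one chunk per complete sentence
```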
Recursive character splitting is the most practical chunking strategy for real documents — it respects natural boundaries like paragraphs and sentences, falls back gracefully, and uses overlap to preserve cross-boundary context.
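The fallback behavior described above can be sketched as follows. This is a simplified illustration, not a library implementation: it tries separators from coarsest to finest, packs pieces into chunks up to `max_len` characters, and recurses with finer separators only for pieces that are still too long. It drops the separator at chunk boundaries and omits overlap for brevity. The separator list and the function name are assumptions.

```python
def recursive_split(text: str, max_len: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split text into chunks of at most max_len characters, preferring coarse boundaries."""
    if len(text) <= max_len:
        return [text] if text else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks, current = [], ""
        for piece in text.split(sep):
            candidate = piece if not current else current + sep + piece
            if len(candidate) <= max_len:
                current = candidate  # piece still fits in the current chunk
                continue
            if current:
                chunks.append(current)
            if len(piece) > max_len:
                # Piece alone is too long: fall back to the finer separators.
                chunks.extend(recursive_split(piece, max_len, separators[i + 1:]))
                current = ""
            else:
                current = piece
        if current:
            chunks.append(current)
        return chunks
    # No separator matched: hard cut as a last resort.
    return [text[j:j + max_len] for j in range(0, len(text), max_len)]


doc = "Para one is short.\n\nPara two is a bit longer. It has two sentences."
for chunk in recursive_split(doc, max_len=30):
    print(repr(chunk))
```

In this example the first paragraph fits and stays whole, while the second is too long and gets split at the sentence boundary instead.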
Sliding window chunking ignores paragraph and sentence boundaries entirely. Instead it moves a fixed-size window forward by a configurable stride — creating dense, overlapping chunks that preserve context across every split.
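A minimal sketch of the window-and-stride idea, assuming the window is measured in words (it could equally be characters or tokens). The function name and parameters are illustrative. Choosing a stride smaller than the window is what produces the overlap.

```python
def sliding_window_chunks(words: list[str], window: int, stride: int) -> list[list[str]]:
    """Return overlapping windows of `window` words, advancing `stride` words each step."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(words[start:start + window])
        if start + window >= len(words):
            break  # this window already reaches the end of the text
    return chunks


words = "one two three four five six seven eight nine ten".split()
for chunk in sliding_window_chunks(words, window=4, stride=2):
    print(" ".join(chunk))
```

With a window of 4 and a stride of 2, each chunk shares its last two words with the start of the next one, so no boundary falls between chunks without also appearing inside one.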
Every chunking strategy so far splits by size. Semantic chunking splits by meaning — grouping sentences that discuss the same topic into one chunk, regardless of character or word count.
Basic RAG works for demos. Production RAG needs more — query expansion to handle ambiguous inputs, hybrid search for keyword precision, re-ranking to filter noise, and feedback loops to improve over time.