LLM Parameters Explained: Temperature, Max Tokens, and Context Window

This is Part 5 of the AI Agents series. Part 4 covered making your first API call — client setup, streaming, and system prompts. Now we go one level deeper: the parameters that control how the model responds.

These aren’t optional tweaks. Getting them wrong means your app gives stale, repetitive answers, runs up your bill, or crashes with a token error. Getting them right gives you a model whose behavior you can predict and budget for.

1. Temperature

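Temperature controls how much randomness goes into choosing the next token. It has nothing to do with physical heat: a higher value gives less likely tokens a better chance of being picked, which reads as more “creative” output.

The scale starts at 0. The upper bound depends on the provider: some APIs cap it at 1, while others, including OpenAI’s, allow up to 2: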

Value	Behavior	Use case
`0`	Near-deterministic: the same input gives the same (or almost the same) output	Definitions, factual Q&A, structured data extraction
`0.7–0.9`	Creative: output varies from run to run	Jokes, story generation, brainstorming

Low temperature example: A student app asks “What is Newton’s Second Law?” — the answer should always be F = ma, explained clearly. There’s no reason for variety. Set temperature to 0.

High temperature example: A joke generator that gives the same joke every time someone clicks “Generate” is broken. Set temperature high to get genuine variety.

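from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment (setup covered in Part 4)

# Factual / near-deterministic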
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Define Newton's Second Law."}],
    temperature=0
)

# Creative / varied
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Tell me a funny one-liner joke."}],
    temperature=0.8
)

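Defaults vary by provider and are often around 0.7–1.0 (OpenAI’s Chat Completions API defaults to 1). If you’re building something that needs consistent, factual answers, you need to explicitly set it to 0. Even then, most providers do not guarantee identical output on every run, so treat 0 as “as consistent as possible” rather than a hard promise.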
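To make “randomness in choosing the next token” concrete, here is a minimal sketch of the textbook temperature-scaled softmax. The scores in `logits` are made up for illustration, and treating 0 as “always pick the top token” is an assumption about how providers handle that edge case; real implementations are not visible from the API and may differ.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw token scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        # Assumption: treat 0 as greedy decoding, all probability on the top score.
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max before exp() for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.5]

for t in (0, 0.2, 0.8, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature drops, the top-scoring token takes almost all of the probability; as it rises, the distribution flattens and the other tokens get sampled more often.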

2. Max Tokens

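Models read and write text in tokens, which are chunks of text that are often shorter than a word. A rough rule for English: 1 token ≈ 4 characters, or about three-quarters of a word. A name like “Sri Ramadasu” splits into a handful of tokens (the exact count depends on the tokenizer), and a single sentence is roughly 10–20 tokens. The max_tokens parameter caps how many tokens the model may generate in its response.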
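For quick budgeting, a slightly finer heuristic than counting words is to assume about four characters per token for English text. The sketch below does only that; `estimate_tokens` is a hypothetical helper, not a real tokenizer, and the true count depends on the model’s tokenizer and the language of the text.

```python
import math

def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate for English text. Not a real tokenizer."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)

sentence = "Force equals mass times acceleration."
print(estimate_tokens(sentence))
```

When exact counts matter (for billing or for staying under a limit), use the provider’s own tokenizer or the usage figures returned with each response instead of an estimate.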

Why you need a limit

You might think: set Max Tokens as high as possible and get the most detailed answers. The problem is cost.

LLM providers charge separately for input tokens (what you send) and output tokens (what the model generates) — and output is significantly more expensive:

Token type	GPT-4.1 example pricing (USD, subject to change)
Input (per 1M tokens)	~$2
Output (per 1M tokens)	~$8

In this example, output tokens cost 4× as much as input tokens. An uncapped response on a complex question can easily run 500–1000 tokens. If your app handles thousands of requests a day, this adds up fast.
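The arithmetic is easy to sketch. The prices below are the example figures from the table above; the request volume and token counts are hypothetical, so substitute your own numbers and your provider’s current price list.

```python
# Example prices in USD per 1M tokens, taken from the table above.
INPUT_PRICE_PER_M = 2.00
OUTPUT_PRICE_PER_M = 8.00

def estimate_cost(input_tokens, output_tokens,
                  input_price=INPUT_PRICE_PER_M,
                  output_price=OUTPUT_PRICE_PER_M):
    """Cost in USD of one request with the given token counts."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

requests_per_day = 10_000  # hypothetical traffic

# Same 200-token prompt, with a 1000-token answer vs. a 150-token answer.
uncapped = estimate_cost(200, 1000) * requests_per_day
capped = estimate_cost(200, 150) * requests_per_day

print(f"Uncapped: ${uncapped:.2f}/day, capped: ${capped:.2f}/day")
```

With these assumed numbers the uncapped app costs about $84 a day and the capped one about $16, and almost all of the difference comes from output tokens.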

Fewer tokens also means faster responses — the model stops generating sooner.

Setting the right limit

Don’t pick a number randomly. Think about your use case.

A joke app: a good one-liner is a sentence or two, roughly 20–30 tokens. The longest reasonable joke might be 40 tokens. Set max_tokens=50 and you’re covered without waste.

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Tell me a funny one-liner joke."}],
    temperature=0.8,
    max_tokens=50
)

If the model hits the limit mid-sentence, the response gets cut off abruptly. The limit is a hard cap on generation: it does not make the model plan a shorter answer, it just stops it. For most apps you want complete answers, so leave some headroom.
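One way to notice a cut-off answer in code: the Chat Completions API reports a `finish_reason` on each choice, and the value `"length"` means generation stopped because it ran into the token limit. The sketch below uses a stand-in object shaped like a response so that it runs without an API key; `was_truncated` is a hypothetical helper name.

```python
from types import SimpleNamespace

def was_truncated(response):
    """True if the first choice stopped because it hit the token limit."""
    return response.choices[0].finish_reason == "length"

# Stand-in shaped like a Chat Completions response, for illustration only.
fake_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            finish_reason="length",
            message=SimpleNamespace(content="Why did the"),
        )
    ]
)

if was_truncated(fake_response):
    print("Response hit max_tokens; consider raising the limit or retrying.")
```

With a real `response` from `client.chat.completions.create(...)`, the same check applies unchanged.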

The 90th percentile rule

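Analyze a sample of real responses for your use case and find the length that covers 90% of them. Set that as your limit. About 90% of responses will then finish naturally, and only the longest 10% get truncated, which keeps worst-case cost and latency bounded. If truncation is unacceptable for your app, pick a higher percentile or ask for brevity in the prompt.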
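A minimal sketch of that analysis, using the nearest-rank definition of a percentile. The sample lengths are invented; in practice they could come from logged `response.usage.completion_tokens` values, which is an assumption about how you collect them.

```python
import math

def percentile_nearest_rank(values, pct):
    """Smallest value that is >= pct percent of the sample (nearest-rank method)."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical output-token counts from ten logged responses.
observed_lengths = [18, 22, 25, 27, 30, 31, 33, 36, 41, 95]

limit = percentile_nearest_rank(observed_lengths, 90)
print(limit)
```

Here the single 95-token outlier no longer drives the limit: nine of the ten responses fit, and only the outlier would be truncated.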

3. Context Window

The context window is the maximum number of tokens the model can handle in a single request, counting both the input you send and the output it generates. It’s not a parameter you set; it’s a hard constraint built into each model.

Model	Context window
GPT-4 (standard)	8,192 tokens
GPT-4.1	~1,000,000 tokens

If your input exceeds the model’s context window, the API rejects the request with a context-length error (OpenAI reports it as context_length_exceeded). The call fails entirely.
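A cheap pre-flight check avoids that failed call. On most APIs the window is shared between the prompt and the generated output, so the sketch below reserves room for max_tokens as well. The window sizes are the figures from the table above, and `fits_in_context` is a hypothetical helper that expects token counts you have already measured or estimated.

```python
# Context window sizes in tokens, taken from the table above (approximate for GPT-4.1).
CONTEXT_WINDOWS = {
    "gpt-4": 8_192,
    "gpt-4.1": 1_000_000,
}

def fits_in_context(input_tokens, max_output_tokens, context_window):
    """True if the prompt plus the reserved output budget fits in the window."""
    return input_tokens + max_output_tokens <= context_window

window = CONTEXT_WINDOWS["gpt-4"]

print(fits_in_context(8_000, 150, window))  # prompt leaves room for the answer
print(fits_in_context(8_100, 150, window))  # prompt plus answer overflows
```

If the check fails, trim or split the input before sending it rather than letting the request error out.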

What this means in practice

An 8,192-token limit sounds generous until you try to feed in a 300-page textbook and ask it to generate exam questions. At a few hundred words per page, that textbook runs to well over 100,000 tokens, more than ten times the limit. You’d need to chunk it, retrieve the relevant sections, and only send what matters. That’s the core idea behind RAG (covered later in this series).

Newer models with 1M token windows are tempting. You can fit an entire textbook in. But at $2 per 1M input tokens, a 200-page book of roughly 100,000 tokens costs about $0.20 in input alone, for one question. At a thousand questions a day, that is around $200 a day before any output is generated.

The rule: Only send the context that’s actually necessary. A focused 500-token excerpt usually gets you a better answer than an unfocused 100,000-token dump, and at a fraction of the cost.
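As a toy illustration of sending only what matters, the sketch below ranks text chunks by how many words they share with the question and keeps the best ones that fit a token budget. Everything here is an assumption made for illustration: the helper names, the four-characters-per-token estimate, and the word-overlap scoring, which real retrieval systems typically replace with embeddings.

```python
import math
import re

def estimate_tokens(text):
    """Rough estimate: about four characters per token for English."""
    return math.ceil(len(text) / 4)

def words(text):
    """Lowercased words with punctuation stripped."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def select_chunks(question, chunks, token_budget):
    """Keep the chunks that best overlap the question, within the budget."""
    question_words = words(question)

    def overlap(chunk):
        return len(question_words & words(chunk))

    selected, used = [], 0
    for chunk in sorted(chunks, key=overlap, reverse=True):
        cost = estimate_tokens(chunk)
        if overlap(chunk) == 0 or used + cost > token_budget:
            continue  # irrelevant, or would blow the budget
        selected.append(chunk)
        used += cost
    return selected

chunks = [
    "Newton's second law states that force equals mass times acceleration.",
    "The French Revolution began in 1789.",
    "Photosynthesis converts light into chemical energy.",
]
question = "What is Newton's second law?"

print(select_chunks(question, chunks, token_budget=50))
```

Only the physics chunk shares words with the question, so it is the only one that would be sent as context.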

Putting it together

These three settings work together. Two you choose per request, and one is a limit you work within:

Temperature sets how much the output varies from run to run
Max Tokens caps its length and your cost
Context Window limits how much input and output fit in one request

An API call that sets temperature and max_tokens explicitly, with a prompt short enough to sit comfortably inside the context window:

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful teaching assistant. Answer clearly and concisely."},
        {"role": "user",   "content": "What is Newton's Second Law?"}
    ],
    temperature=0,    # factual, deterministic
    max_tokens=150    # enough for a clear answer, not a textbook
)

What’s next

Part 6 covers using Groq to make API calls with open-source models like Llama and Mistral — same SDK patterns, no credit card required for experimentation.

Full video walkthrough is embedded above.