LLMs · AI Agents · 2026-05-28

Your First LLM API Call: OpenAI, Streaming, and System Prompts

Stop using the chat UI — connect to LLMs directly in Python. Covers OpenAI and Groq setup, streaming vs non-streaming, picking the right model for cost, and controlling behavior with system prompts.

This is Part 4 of the AI Agents series. Parts 1–3 covered how LLMs work, how to use them practically, and when to choose open-source vs paid. Now we write actual code.

This is where the series shifts from theory to building. Everything from here on is hands-on.


Two platforms to know

OpenAI: paid, closed-source models (GPT-4o, GPT-4o-mini, o3). Pay per token. Often at or near the top of public benchmarks.

Groq: inference company that hosts open-source models (Llama, Mistral, Gemma). Free tier available, great for experimenting without a credit card.

This post uses the OpenAI SDK, but the patterns — streaming, system prompts, roles — apply to every provider.


Step 1: Get your API key

Go to the OpenAI developer platform → your profile → API keys → Create a new secret key. Give it a name like test-key.

Copy it immediately. The platform won’t show it again. If it gets compromised, delete it and generate a new one.
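
One common way to keep the key out of source code is an environment variable. A minimal sketch, assuming the key was exported as OPENAI_API_KEY (the variable the OpenAI SDK looks for by default); the helper name load_api_key is hypothetical:

```python
import os


def load_api_key(env=None, var_name="OPENAI_API_KEY"):
    """Return the API key from the environment, failing loudly if it is missing.

    `load_api_key` is a hypothetical helper, not part of the OpenAI SDK.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set; export it before running this script")
    return key
```

Failing early with a clear message tends to be easier to debug than an authentication error from the first API call.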


Step 2: Install the SDK and create a client

pip install openai

import os
from openai import OpenAI

# Read the key from an environment variable instead of hard-coding it in source
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

That client object is your entry point to every model OpenAI offers.
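
Groq exposes an OpenAI-compatible endpoint, so the same SDK can talk to it by changing the base URL and the key. A sketch, assuming a GROQ_API_KEY environment variable; the PROVIDERS table and client_kwargs helper are hypothetical. Groq's compatibility layer is centred on Chat Completions, so check its docs before relying on client.responses there.

```python
import os

# Hypothetical lookup table: which env var holds the key and which base URL to use.
# base_url=None means "use the SDK default", i.e. OpenAI itself.
PROVIDERS = {
    "openai": {"key_var": "OPENAI_API_KEY", "base_url": None},
    "groq": {"key_var": "GROQ_API_KEY", "base_url": "https://api.groq.com/openai/v1"},
}


def client_kwargs(provider, env=None):
    """Build keyword arguments for OpenAI(...) for the chosen provider."""
    env = os.environ if env is None else env
    cfg = PROVIDERS[provider]
    kwargs = {"api_key": env[cfg["key_var"]]}
    if cfg["base_url"] is not None:
        kwargs["base_url"] = cfg["base_url"]
    return kwargs
```

Usage would look like client = OpenAI(**client_kwargs("groq")), with a model name taken from Groq's own model list.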


Step 3: Pick the right model

Two common choices and why cost matters:

Model       | Cost (per 1M tokens) | When to use
gpt-4o      | Higher               | Complex reasoning tasks
gpt-4o-mini | Much cheaper         | Most use cases; start here

For experimentation and most real applications, start with gpt-4o-mini. It’s cheap enough that mistakes don’t hurt. If you hit quality limits on a specific task, upgrade.
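
To see why per-token pricing matters, the arithmetic can be sketched as below. The prices are illustrative placeholders, not quoted rates; take real numbers from the provider's pricing page.

```python
def estimate_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Estimate the dollar cost of one call from token counts and per-1M-token prices."""
    return (
        input_tokens / 1_000_000 * input_price_per_m
        + output_tokens / 1_000_000 * output_price_per_m
    )


# Illustrative prices only (USD per 1M tokens): a cheap model vs. a pricier one.
cheap = estimate_cost(2_000, 500, input_price_per_m=0.15, output_price_per_m=0.60)
pricey = estimate_cost(2_000, 500, input_price_per_m=2.50, output_price_per_m=10.00)
print(f"cheap: ${cheap:.4f}  pricey: ${pricey:.4f}")
```

A single call is cheap either way; the gap shows up once you multiply by thousands of requests per day.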


Step 4: Make your first call

response = client.responses.create(
    model="gpt-4o-mini",
    input="Tell me about Bahubali"
)

print(response.output_text)
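
If you expect to make this call often, a small wrapper keeps it in one place and surfaces token usage, which is what you are billed on. A sketch: ask is a hypothetical helper, and it takes the client as a parameter so it can be exercised with a stand-in object.

```python
def ask(client, prompt, model="gpt-4o-mini"):
    """Send one prompt and return (answer_text, total_tokens).

    `ask` is a hypothetical helper; `client` is expected to be an OpenAI() instance.
    """
    response = client.responses.create(model=model, input=prompt)
    # response.usage carries the token counts for the call; it may be absent
    usage = response.usage
    total_tokens = usage.total_tokens if usage is not None else None
    return response.output_text, total_tokens
```

Calling answer, tokens = ask(client, "Tell me about Bahubali") gives you both the text and a number you can feed into a cost estimate.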

This works — but there’s a problem with how it feels to use.


Streaming vs Non-Streaming

By default, the API waits until the entire response is generated before returning anything. For a long answer, that can mean anywhere from several seconds to tens of seconds of a blank screen, depending on the model and server load. A user may assume the app is broken.

LLMs generate text token-by-token internally — streaming just surfaces that in real time.

Non-streaming (default):

response = client.responses.create(
    model="gpt-4o-mini",
    input="Tell me about Bahubali"
)
print(response.output_text)
# → nothing until generation finishes, then the full answer appears at once

Streaming:

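stream = client.responses.create(
    model="gpt-4o-mini",
    input="Tell me about Bahubali",
    stream=True,
)
for event in stream:
    # Text arrives as a series of delta events; other event types carry metadata
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
# → words appear immediately as they generate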

When to use which

Mode          | Use case
Streaming     | Any user-facing output: chat, Q&A, real-time generation
Non-streaming | Background tasks where nobody is watching: auto-generating emails, batch processing

If you’re building something a human will read, stream it.
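
When you also need the full text after streaming it (for logging or storing the reply), one option is to print each piece as it arrives and collect the pieces along the way. A sketch, assuming events shaped like the Responses API stream events, with a type field and, for text deltas, a delta field; print_and_collect is a hypothetical name.

```python
def print_and_collect(events):
    """Print text deltas as they arrive and return the full text at the end."""
    parts = []
    for event in events:
        # Skip lifecycle events (created, completed, ...) that carry no text
        if getattr(event, "type", None) == "response.output_text.delta":
            print(event.delta, end="", flush=True)
            parts.append(event.delta)
    print()  # finish the line once the stream ends
    return "".join(parts)
```

You would pass it the object returned by client.responses.create(..., stream=True) and keep the return value as the complete answer.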


Model reasoning: small vs large

Ask gpt-4o-mini about “Bahubali”:

Returns the plot of the movie franchise directly.

Ask a reasoning-capable model (like o3) the same question:

Pauses, considers that “Bahubali” could refer to the blockbuster film series or to a figure in Indian religious history — then addresses both before answering.

Neither is wrong. The reasoning model is more thorough but slower and more expensive.

Rule of thumb:

  • Simple, well-defined tasks → small cheap model
  • Ambiguous questions, multi-step reasoning, nuanced decisions → reasoning model

Getting this right keeps your costs down and your margins healthy.
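
The rule of thumb can be written down as a tiny router. This is a deliberately simplified sketch: pick_model and its two flags are hypothetical, and a real application would need task-specific signals to decide what counts as ambiguous.

```python
def pick_model(ambiguous=False, multi_step=False):
    """Return a model name following the rule of thumb above."""
    if ambiguous or multi_step:
        # Reasoning model: slower and pricier, but more thorough
        return "o3"
    # Small model: fast and cheap for simple, well-defined tasks
    return "gpt-4o-mini"
```

Keeping the choice in one function makes it easy to change the default later without touching every call site.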


Controlling behavior with System Prompts

By default, the model answers anything within its knowledge. You can constrain and shape that behavior using roles.

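Two roles matter here (the API defines others, such as assistant for the model’s earlier replies):

Role   | Purpose
user   | The question or prompt
system | Instructions that define the model’s persona, scope, and constraints

In the Responses API, the instructions parameter plays the system role and input carries the user message.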

Problem: Ask a smart model about “Bahubali” and it might go into religious history — not what you want for a movie recommendation app.

Fix: give it a system prompt

response = client.responses.create(
    model="gpt-4o-mini",
    instructions="You are a movie buff. You only have knowledge about films and cinema. Do not discuss religion, history, or anything outside movies.",
    input="Tell me about Bahubali"
)

print(response.output_text)
# → Only movie content, religious history ignored

Ask about “Sri Ramadasu” next — same constraint applies. The model stays in its lane.

System prompts are how you turn a general-purpose LLM into a specialized assistant for your product.
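
If your product has several personas, it can help to build the request from a few parameters instead of hand-writing each system prompt. A sketch under that assumption; build_request and its wording template are hypothetical, not an SDK feature.

```python
def build_request(persona, allowed_topic, question, model="gpt-4o-mini"):
    """Assemble keyword arguments for client.responses.create()."""
    instructions = (
        f"You are {persona}. Only discuss {allowed_topic}. "
        "If asked about anything else, say it is outside your scope."
    )
    return {"model": model, "instructions": instructions, "input": question}
```

The result can be passed straight through, for example client.responses.create(**build_request("a movie buff", "films and cinema", "Tell me about Sri Ramadasu")).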


Alternate syntax: Chat Completions API

There’s a second way to call the API: the Chat Completions endpoint. For basic text generation it behaves the same way, but it takes a messages list instead of separate input and instructions fields:

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a movie buff. Only discuss films."},
        {"role": "user",   "content": "Tell me about Bahubali"}
    ]
)

print(response.choices[0].message.content)

Streaming works here too: pass stream=True to create() and iterate over the chunks it returns.
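
With Chat Completions, each streamed chunk carries its text in choices[0].delta.content rather than in a typed event. A sketch of pulling out just the text, assuming chunks shaped that way; iter_chat_text is a hypothetical helper.

```python
def iter_chat_text(chunks):
    """Yield text pieces from a Chat Completions stream (stream=True)."""
    for chunk in chunks:
        # Some chunks have no choices (for example a trailing usage chunk)
        if not chunk.choices:
            continue
        piece = chunk.choices[0].delta.content
        # The first and last chunks often carry an empty or missing delta
        if piece:
            yield piece
```

To use it, loop over iter_chat_text(client.chat.completions.create(..., stream=True)) and print each piece with end="" and flush=True.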

You’ll see both styles in real codebases. The Responses API (client.responses) is newer and cleaner; Chat Completions is older but more widely documented. Both are fine.


What you can build from here

With client + model + system prompt + streaming, you have the core of almost any LLM-powered feature:

  • A chatbot with a custom persona
  • A document Q&A tool
  • An automated content generator
  • The backbone of an AI agent

The next post in the series connects the pieces — adding memory and tools to turn a single API call into a proper agent.

