LLMs · AI Agents · 2026-05-28

Groq + Open-Source Models: Fast API Inference Without Hosting

Open-source models are free, but hosting them is not. Learn how to use Groq to run Llama and other open models via API, understand free-tier limits, and ship your first Groq-powered call in Python.

This is Part 6 of the AI Agents series. Parts 1–5 covered how LLMs work, practical usage, open-source vs paid models, making API calls, and controlling output with parameters. This post solves a specific problem: how do you use open-source models via API without hosting them yourself?


1. The infrastructure problem with open-source models

Open-source models are free to download. Running them is not.

A 70B parameter model needs significant GPU memory, serious compute, and continuous uptime if your app is live. If you spin up a cloud GPU instance for that, you pay for it around the clock — whether you have traffic or not. At low request volumes, that cost doesn’t make sense.
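
To make that concrete, here is a rough back-of-the-envelope sketch. The memory figure follows directly from the parameter count (70B parameters at 2 bytes each for 16-bit weights). The hourly GPU rate is a made-up placeholder, not a quoted price, and the helper names are only for illustration.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def monthly_cost(hourly_rate_usd: float, hours_per_day: float = 24, days: int = 30) -> float:
    """Cost of keeping an instance running, whether or not any traffic arrives."""
    return hourly_rate_usd * hours_per_day * days


# 70B parameters at 16-bit precision (2 bytes each): weights alone, before
# any memory for the KV cache or activations.
fp16_gb = weight_memory_gb(70e9, 2)

# HYPOTHETICAL hourly rate, for illustration only. Check your provider's pricing.
ASSUMED_HOURLY_RATE_USD = 4.00
always_on = monthly_cost(ASSUMED_HOURLY_RATE_USD)

print(f"Weights at fp16: {fp16_gb:.0f} GB")
print(f"Always-on at ${ASSUMED_HOURLY_RATE_USD:.2f}/h: ${always_on:,.0f}/month")
```

The weights alone come to 140 GB, which is more than a single typical GPU holds, and the monthly bill is the same at zero requests as at a million.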

You need a middle path: open-source models, API access, someone else’s infrastructure.


2. What Groq is and why the speed matters

Groq is an inference company that hosts open-source models and exposes them via API. You send a request, they run it on their hardware, you get a response.

What makes Groq worth knowing about is token generation speed. Groq runs models on LPUs (Language Processing Units) — custom chips designed specifically for token generation. The result is noticeably faster output compared to GPU-based inference at the same model size.

For a user watching words appear on screen, the difference between 200 tokens/sec and 50 tokens/sec is the difference between “this feels fast” and “this feels broken.”
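
A quick calculation shows what those rates mean in wall-clock time. The 500-token response length is an assumed figure for illustration, and the calculation ignores network latency and time to first token.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate num_tokens at a steady generation rate."""
    return num_tokens / tokens_per_second


RESPONSE_TOKENS = 500  # assumed length of a medium-sized answer

for rate in (50, 200):
    seconds = generation_seconds(RESPONSE_TOKENS, rate)
    print(f"{rate} tokens/sec -> {seconds:.1f} s for {RESPONSE_TOKENS} tokens")
```

At 50 tokens/sec the user waits 10 seconds for the full answer; at 200 tokens/sec, 2.5 seconds.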

Groq also offers a free tier, which means you can get started without a credit card.


3. Setup: API key and environment variable

Create an account at console.groq.com — sign in with Google or GitHub. Generate an API key from the dashboard.

Never put the key directly in your code. Store it as an environment variable:

export GROQ_API_KEY="your-key-here"

Then read it in Python:

import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])
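
If the variable is missing, `os.environ["GROQ_API_KEY"]` raises a bare `KeyError`. A small wrapper can fail with a clearer message instead. This is an optional sketch; `load_api_key` is a hypothetical helper name, not part of the Groq SDK.

```python
import os


def load_api_key(var_name: str = "GROQ_API_KEY") -> str:
    """Read the API key from the environment, failing with a readable error."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set. Export it in your shell before running this script."
        )
    return key


# Usage (requires the groq package and a real key):
#   from groq import Groq
#   client = Groq(api_key=load_api_key())
```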

4. Available models

Groq’s production lineup as of mid-2026:

Model ID | Parameters | Use case
llama-3.1-8b-instant | 8B | Fast, lightweight
llama-3.3-70b-versatile | 70B | Strong general-purpose
openai/gpt-oss-20b | 20B | OpenAI open-weight, efficient
openai/gpt-oss-120b | 120B | OpenAI open-weight flagship
whisper-large-v3 | - | Speech-to-text

Preview models (not production-stable, may be removed without notice): Llama 4 Scout, Llama Prompt Guard 2, Qwen3-32B.

Note on model versions: The models above are accurate as of May 2026. Groq’s lineup changes frequently — models get added, deprecated, or moved between preview and production tiers. Always verify the current list at console.groq.com/docs/models before hardcoding a model name in production code.
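
One way to catch a deprecated model early is to check the live model list at startup. The sketch below assumes the client exposes `models.list()` returning an object whose `.data` entries each have an `.id`, mirroring the OpenAI SDK; verify that against the current SDK docs. `find_missing_models` is a hypothetical helper name.

```python
def find_missing_models(client, required_ids):
    """Return the IDs from required_ids that the API does not currently list."""
    # Assumption: client.models.list().data is a list of objects with an .id field.
    available = {model.id for model in client.models.list().data}
    return [model_id for model_id in required_ids if model_id not in available]


# Usage (requires the groq package and a real key):
#   missing = find_missing_models(client, ["llama-3.3-70b-versatile"])
#   if missing:
#       raise SystemExit(f"Models no longer available: {missing}")
```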


5. Your first Groq API call

Install the SDK:

pip install groq

Then send a chat completion request:

import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "user", "content": "Tell me about Baahubali"}
    ]
)

print(response.choices[0].message.content)

The response object is identical in shape to what OpenAI returns. Same .choices[0].message.content path.
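
The same shape also carries token counts and a finish reason, which become useful once rate limits enter the picture. A minimal sketch, assuming the response follows the OpenAI chat completion layout (`usage.prompt_tokens`, `usage.completion_tokens`, `usage.total_tokens`, `choices[0].finish_reason`); `summarize_response` is a hypothetical helper name.

```python
def summarize_response(response) -> dict:
    """Pull the text, finish reason, and token counts out of a chat completion."""
    choice = response.choices[0]
    usage = response.usage
    return {
        "text": choice.message.content,
        # "length" here means the output was cut off by max_tokens
        "finish_reason": choice.finish_reason,
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "total_tokens": usage.total_tokens,
    }
```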


6. OpenAI compatibility: migrate existing code in minutes

Groq exposes an OpenAI-compatible endpoint. If you already have code written against the OpenAI SDK, you can point it at Groq instead by changing the base_url and swapping the key:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1"
)

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "user", "content": "Tell me about Baahubali"}
    ]
)

print(response.choices[0].message.content)

Your request shape, message format, and response parsing all stay the same; only the client configuration and the model name change. This makes it easy to benchmark Groq vs OpenAI on the same task without rewriting your app.
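
To switch providers without touching call sites, the two settings that differ can live in one place. This is a sketch, not a prescribed pattern: `PROVIDERS` and `client_kwargs` are hypothetical names, and `OPENAI_API_KEY` is assumed to be the variable holding your OpenAI key.

```python
import os

# The only per-provider differences: which key to read and which base URL to use.
PROVIDERS = {
    "openai": {"key_env": "OPENAI_API_KEY", "base_url": None},  # None = SDK default
    "groq": {"key_env": "GROQ_API_KEY", "base_url": "https://api.groq.com/openai/v1"},
}


def client_kwargs(provider: str) -> dict:
    """Build keyword arguments for the OpenAI client constructor."""
    config = PROVIDERS[provider]
    kwargs = {"api_key": os.environ[config["key_env"]]}
    if config["base_url"] is not None:
        kwargs["base_url"] = config["base_url"]
    return kwargs


# Usage (requires the openai package and real keys):
#   from openai import OpenAI
#   client = OpenAI(**client_kwargs("groq"))
```

Remember that the model name still has to match the provider: a Groq model ID will not resolve on OpenAI, and vice versa.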


7. Rate limits on the free tier

For llama-3.3-70b-versatile on the free tier:

Limit | Value
Requests per minute | 30
Requests per day | 1,000
Tokens per minute | 12,000
Tokens per day | 100,000

These are hard limits. Hit them and you get a 429 Too Many Requests response.
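
Dividing the limits against each other shows which one binds first. The arithmetic below uses only the numbers in the table above and assumes the token limits count prompt and completion tokens together.

```python
# Free-tier limits for llama-3.3-70b-versatile, copied from the table above
LIMITS = {"rpm": 30, "rpd": 1_000, "tpm": 12_000, "tpd": 100_000}

# Average tokens each request can use if you send the maximum requests per minute
tokens_per_request_at_full_rpm = LIMITS["tpm"] / LIMITS["rpm"]  # 400.0

# Average tokens each request can use if you send the maximum requests per day
tokens_per_request_at_full_rpd = LIMITS["tpd"] / LIMITS["rpd"]  # 100.0

# Minutes of running flat out at the per-minute token limit before the daily cap hits
minutes_to_exhaust_daily_tokens = LIMITS["tpd"] / LIMITS["tpm"]

print(f"Per-request budget at full RPM: {tokens_per_request_at_full_rpm:.0f} tokens")
print(f"Per-request budget at full RPD: {tokens_per_request_at_full_rpd:.0f} tokens")
print(f"Daily tokens gone after {minutes_to_exhaust_daily_tokens:.1f} min at full TPM")
```

In practice the daily token cap is the tightest constraint: a handful of long conversations can use it up well before the request limits matter.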

Design for this from the start:

  • Set max_tokens on every call to cap output length and avoid burning through your TPM budget on one long response
  • Add retry logic with exponential backoff for 429 errors
  • If you need more headroom, Groq’s paid Developer plan has significantly higher limits

The exact numbers can change — always verify at console.groq.com/settings/limits.
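
The first two points in the list above can be sketched as follows. This is one possible shape for the retry loop, not the SDK's own mechanism. It assumes that the exception raised on a 429 carries a numeric `status_code` attribute; `call_with_backoff` and its parameters are hypothetical names.

```python
import random
import time


def call_with_backoff(make_request, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call make_request(), retrying on HTTP 429 with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return make_request()
        except Exception as exc:
            # Assumption: the SDK's HTTP errors expose a numeric status_code.
            is_rate_limited = getattr(exc, "status_code", None) == 429
            if not is_rate_limited or attempt == max_retries:
                raise  # not a rate limit, or out of retries: surface the error
            # Wait 1s, 2s, 4s, ... plus a little randomness so that several
            # clients do not all retry at the same instant.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            sleep(delay)


# Usage (requires a configured client); max_tokens caps the output length:
#   response = call_with_backoff(
#       lambda: client.chat.completions.create(
#           model="llama-3.3-70b-versatile",
#           messages=[{"role": "user", "content": "Tell me about Baahubali"}],
#           max_tokens=300,
#       )
#   )
```

Backoff helps with the per-minute limits. It does not help once a daily limit is exhausted, so cap `max_retries` rather than retrying indefinitely.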


8. Streaming responses

Part 4 covered streaming with the OpenAI SDK. The pattern is identical on Groq:

import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "user", "content": "Tell me about Baahubali"}
    ],
    stream=True
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

With Groq’s fast token generation, streaming feels especially responsive — words appear almost immediately. For any user-facing interface, always stream.
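
In an application you usually want the full text as well as the live output, and time to first token is the number that best reflects how responsive the stream feels. A small sketch that wraps the loop above; `consume_stream` is a hypothetical helper, and the timer starts when iteration begins, so pass the stream in as soon as it is created.

```python
import time


def consume_stream(stream, on_text=None):
    """Collect streamed text. Returns (full_text, seconds_to_first_token)."""
    start = time.perf_counter()
    seconds_to_first_token = None
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue  # defensive: skip any chunk that carries no choices
        delta = chunk.choices[0].delta.content
        if not delta:
            continue  # role-only or final chunks have no text
        if seconds_to_first_token is None:
            seconds_to_first_token = time.perf_counter() - start
        parts.append(delta)
        if on_text is not None:
            on_text(delta)  # e.g. print to the terminal or push to a websocket
    return "".join(parts), seconds_to_first_token


# Usage with the stream created above:
#   text, ttft = consume_stream(stream, on_text=lambda t: print(t, end="", flush=True))
```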


9. Evaluating speed fairly

Groq benchmarks often show faster token generation than other providers. That’s real — LPUs are purpose-built for this.

But the comparison is only fair when you’re looking at the same model. Llama 70B on Groq versus GPT-4o is not a fair speed comparison: the models differ in size and capability, so a speed gap between them says little about the infrastructure underneath.

Evaluate on two dimensions independently:

  1. Speed — where Groq often wins for the same model size
  2. Quality — which depends on the model, not the infrastructure

Groq is an excellent choice when you need open-source model quality with API convenience and low latency. It is not a replacement for frontier closed models on tasks that need them.
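
For the speed dimension, a simple measurement is completion tokens divided by wall-clock time, run with the same model, prompt, and `max_tokens` on each provider. The sketch below assumes an OpenAI-style client and response (`usage.completion_tokens`); `measure_throughput` is a hypothetical helper. The result is end-to-end, so it includes network latency and queueing, not just generation.

```python
import time


def measure_throughput(client, model, prompt, max_tokens=256):
    """Time one non-streaming completion and report output tokens per second."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    elapsed = time.perf_counter() - start
    completion_tokens = response.usage.completion_tokens
    return {
        "model": model,
        "seconds": elapsed,
        "completion_tokens": completion_tokens,
        # Guard against a zero reading from a very coarse clock
        "tokens_per_second": completion_tokens / elapsed if elapsed > 0 else float("inf"),
    }
```

A single run is noisy. Repeat the measurement several times and compare medians, and judge quality separately by reading the outputs.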


What’s next

Part 7 goes one step further: running open-source models entirely locally on your laptop — no API, no cloud, no data leaving your machine. That post covers the VRAM math, quantization, Ollama, and what’s realistically runnable on consumer hardware.

