How to Cut Your LLM API Costs by 80% Without Sacrificing Quality
If you're running LLM API calls in production, you're probably wasting 70-80% of your spend. Not because the models are too expensive — because you're using the wrong model for most requests.
Here's the uncomfortable truth: most production prompts are simple. Summarize this text. Extract these fields. Classify this input. Format this output. These tasks don't need GPT-4o or Claude Sonnet. A model that costs 95% less handles them just as well.
The problem is knowing which prompts are simple and which actually need the expensive model. That's what complexity-based routing solves.
The Cost Problem in Numbers
Let's look at what you're actually paying:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Relative Cost (input price vs. GPT-4o) |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | 1x |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 1.2x |
| Gemini 1.5 Pro | $1.25 | $5.00 | 0.5x |
| GPT-4o Mini | $0.15 | $0.60 | 0.06x |
| Gemini 2.0 Flash | $0.10 | $0.40 | 0.04x |
Gemini 2.0 Flash is 25x cheaper than GPT-4o, and GPT-4o Mini is about 17x cheaper. For straightforward tasks, which make up the majority of production traffic, they produce comparable results.
What Is Complexity-Based Routing?
The idea is simple: before sending a prompt to an LLM, classify its complexity. Simple prompts go to cheap models. Complex prompts stay on expensive ones. You keep the quality where it matters and save everywhere else.
A lightweight classifier (no ML model, just heuristics) analyzes the prompt and assigns a complexity score based on:
- Token count — longer prompts tend to be more complex
- Structural markers — multi-step instructions, nested logic, code blocks
- Domain signals — mathematical notation, legal language, technical jargon
- Task type — generation vs extraction vs classification
The classifier runs locally, adds <1ms of latency, and requires zero API calls itself.
Real-World Results
We ran 10,000 production prompts through the classifier. The breakdown:
- 72% classified as simple — routed to Flash/Mini tier
- 18% classified as moderate — routed to mid-tier (Gemini Pro, GPT-4o Mini)
- 10% classified as complex — kept on premium models (GPT-4o, Claude Sonnet)
Result: roughly 80% cost reduction with no measurable quality degradation on simple tasks. Complex tasks (code generation, multi-step reasoning, nuanced writing) stayed on premium models, where quality matters.
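A quick way to sanity-check a figure like this is to weight each tier's relative cost by its share of traffic. The sketch below uses the traffic split above and the relative costs from the pricing table. Which model fills each tier is an assumption (Gemini 2.0 Flash for simple, Gemini 1.5 Pro for moderate, GPT-4o for complex), and the name `blended_cost_factor` is hypothetical.

```python
# Share of traffic per tier, from the 10,000-prompt breakdown.
TRAFFIC_SHARE = {"simple": 0.72, "moderate": 0.18, "complex": 0.10}

# Cost relative to GPT-4o (= 1.0), from the pricing table.
# The tier-to-model assignment is an assumption made for this sketch.
TIER_RELATIVE_COST = {"simple": 0.04, "moderate": 0.5, "complex": 1.0}


def blended_cost_factor(share: dict, relative_cost: dict) -> float:
    """Expected cost per request as a fraction of the all-premium baseline."""
    return sum(share[tier] * relative_cost[tier] for tier in share)


factor = blended_cost_factor(TRAFFIC_SHARE, TIER_RELATIVE_COST)
print(f"Blended cost: {factor:.1%} of baseline, reduction: {1 - factor:.1%}")
```

Under these assumptions the reduction comes out near 78%. The exact number moves with the models you assign to each tier and with how many tokens each tier's prompts consume.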
Implementation: 3 Lines of Code
Here's what it looks like with InferShrink:
from infershrink import optimize
import openai
client = optimize(openai.OpenAI())
# That's it. Every call now routes automatically.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this in 2 sentences: ..."}]
)
# → Routed to gpt-4o-mini (simple task, same provider)
The optimize() wrapper intercepts each call, classifies complexity, and routes to the cheapest model that can handle it — within the same provider. If you're using OpenAI, it stays on OpenAI. If you're on Google, it stays on Google. No cross-provider surprises.
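Same-provider routing boils down to a per-provider lookup table. The sketch below shows the idea only: the `CHEAPER_MODEL` table and the `route` function are hypothetical, not InferShrink's real routing table, and the model assignments are assumptions based on the models named in this post.

```python
from typing import Dict

# Hypothetical routing table: provider -> complexity tier -> cheaper model.
# Complex prompts are deliberately absent so they keep the requested model.
CHEAPER_MODEL: Dict[str, Dict[str, str]] = {
    "openai": {"simple": "gpt-4o-mini", "moderate": "gpt-4o-mini"},
    "google": {"simple": "gemini-2.0-flash", "moderate": "gemini-1.5-pro"},
}


def route(provider: str, requested_model: str, tier: str) -> str:
    """Pick a model for this call without ever leaving the caller's provider."""
    same_provider = CHEAPER_MODEL.get(provider, {})
    # Unknown providers and unlisted tiers fall back to the requested model.
    return same_provider.get(tier, requested_model)


print(route("openai", "gpt-4o", "simple"))
print(route("openai", "gpt-4o", "complex"))
print(route("google", "gemini-1.5-pro", "simple"))
```

Because the lookup is keyed by provider first, a request can only move to a cheaper model from the same vendor. Anything the table does not recognize passes through unchanged.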
What About Streaming?
Transparent. If you pass stream=True, InferShrink routes first, then streams from the target model. No buffering, no extra latency beyond the initial classification.
# Streaming works exactly the same
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "List 5 benefits of exercise"}],
    stream=True
)
for chunk in stream:
    # delta.content can be None (for example on the final chunk), so fall back to ""
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="")
# → Streamed from gpt-4o-mini
When NOT to Use This
Routing isn't magic. There are cases where you should pin to a specific model:
- Regulated outputs — if compliance requires a specific model, don't route
- Fine-tuned models — routing bypasses your fine-tune
- Evaluation/benchmarking — you need consistent model identity
- Creative writing with specific "voice" — different models have different styles
For these cases, InferShrink supports a pin=True flag that bypasses routing for specific calls.
The Math
For a typical production workload on GPT-4o, assuming 1M input tokens and 1M output tokens per day over a 30-day month (the 10M rows scale both figures by 10):
| Scenario | Monthly Cost | Savings |
|---|---|---|
| All traffic on GPT-4o | ~$375 | — |
| With complexity routing (72% downgraded) | ~$78 | $297/mo (79%) |
| 10M tokens/day on GPT-4o | ~$3,750 | — |
| With complexity routing | ~$780 | $2,970/mo (79%) |
The percentage saved stays the same as volume grows, so the absolute savings scale linearly with traffic.
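The baseline figures can be reproduced with a few lines of arithmetic. This sketch assumes "1M tokens/day" means 1M input plus 1M output tokens per day and a 30-day month, which is the reading that yields ~$375 at GPT-4o's list prices. The routed cost is taken from the table rather than derived, and `monthly_cost` is a hypothetical helper.

```python
def monthly_cost(
    input_tokens_per_day: int,
    output_tokens_per_day: int,
    input_price_per_million: float,
    output_price_per_million: float,
    days: int = 30,  # assumption: 30-day month
) -> float:
    """Monthly spend in dollars for a fixed daily token volume."""
    daily = (
        input_tokens_per_day / 1_000_000 * input_price_per_million
        + output_tokens_per_day / 1_000_000 * output_price_per_million
    )
    return daily * days


# GPT-4o list prices from the pricing table: $2.50 input, $10.00 output per 1M tokens.
baseline = monthly_cost(1_000_000, 1_000_000, 2.50, 10.00)
routed = 78.0  # routed monthly cost as reported in the table, not derived here
savings = baseline - routed
print(f"baseline ${baseline:,.0f}, savings ${savings:,.0f} ({savings / baseline:.0%})")
```

Running the same function with 10M tokens in each direction gives the ~$3,750 baseline for the larger workload.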
Try It Now
One line to install. Three lines to integrate. Start saving immediately.
pip install infershrink
InferShrink is a Python SDK that wraps your existing OpenAI, Anthropic, or Google client. No infrastructure changes. No new API keys. No data leaves your environment — the classifier runs locally.
Check the documentation for configuration options, or see supported providers for the full routing table.