InferShrink
Cut your LLM costs by 80%+ with one line of code
How It Works
Intelligent routing without changing your workflow
01 Classify
Rule-based complexity scoring analyzes your prompts instantly to determine task difficulty.
02 Route
Simple tasks go to cheaper models automatically: gpt-4o → gpt-4o-mini, seamlessly.
03 Track
See your savings in real-time. Every request is logged with cost comparison metrics.
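The classify → route flow above can be sketched in a few lines. This is a minimal, hypothetical illustration of rule-based complexity scoring, not InferShrink's actual rules: the keyword list, thresholds, and function names (`complexity_score`, `route`) are all made up for the example.

```python
# Hypothetical sketch: score a prompt's complexity with cheap heuristics,
# then route low-scoring prompts to a cheaper model. The weights and
# keywords below are illustrative only.

CHEAP_MODEL = "gpt-4o-mini"
PREMIUM_MODEL = "gpt-4o"

COMPLEX_HINTS = ("analyze", "prove", "refactor", "step by step", "architecture")

def complexity_score(prompt: str) -> int:
    """Score 0-3: longer or multi-part prompts and 'hard task' keywords raise it."""
    score = 0
    if len(prompt) > 500:       # long prompts tend to be harder
        score += 1
    if prompt.count("\n") > 5:  # multi-part / structured requests
        score += 1
    if any(hint in prompt.lower() for hint in COMPLEX_HINTS):
        score += 1
    return score

def route(prompt: str, requested_model: str = PREMIUM_MODEL) -> str:
    """Zero-score prompts fall through to the cheap model; the rest keep theirs."""
    return CHEAP_MODEL if complexity_score(prompt) == 0 else requested_model

print(route("What is 2+2?"))                        # → gpt-4o-mini
print(route("Analyze this codebase architecture"))  # → gpt-4o
```

Because scoring is rule-based rather than model-based, classification adds no extra API call and effectively no latency.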
Drop-in Replacement
Works with your existing OpenAI and Anthropic clients
OpenAI
Anthropic
Google
```python
import openai
from infershrink import optimize

client = optimize(openai.Client())

# Use exactly as before — InferShrink handles the rest
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
# Simple question → routed to gpt-4o-mini (95% cheaper)
# Complex tasks stay on gpt-4o automatically
```
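The "95% cheaper" figure is easy to sanity-check against OpenAI's published per-million-token prices (correct at the time of writing, subject to change); the `request_cost` helper here is just for the arithmetic, not part of InferShrink's API:

```python
# Back-of-envelope check of the "~95% cheaper" claim.
# USD per 1M tokens, (input, output), per OpenAI's published pricing.
PRICES = {
    "gpt-4o":      (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# A small Q&A request: 50 input tokens, 20 output tokens
big = request_cost("gpt-4o", 50, 20)
small = request_cost("gpt-4o-mini", 50, 20)
print(f"savings: {1 - small / big:.0%}")  # → savings: 94%
```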
```python
import anthropic
from infershrink import optimize

client = optimize(anthropic.Anthropic())

# claude-opus → claude-sonnet for simple tasks
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello world"}],
)
```
```python
from openai import OpenAI
from infershrink import optimize

# Gemini via OpenAI-compatible endpoint
client = optimize(OpenAI(
    api_key="your-gemini-key",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
))

response = client.chat.completions.create(
    model="gemini-2.5-pro",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
# Simple → routed to gemini-2.5-flash (free tier)
```
Features
Zero dependencies (core)
Same-provider routing
Streaming support
OpenAI + Anthropic + Google
CLI included
511 tests, CI/CD
Pricing
Start free, scale as you grow
| | Dev (No Key) | Starter (Free) | Pro ($19/mo) | Team ($49/mo) |
|---|---|---|---|---|
| Requests/mo | Unlimited | 1,000 | 50,000 | 500,000 |
| Model routing | ✅ | ✅ | ✅ | ✅ |
| Compression | ✅ | — | ✅ | ✅ |
| Retrieval | ✅ | — | ✅ | ✅ |
Get in Touch
Questions, enterprise needs, or just want to chat about LLM costs? We'd love to hear from you.