InferShrink

Cut your LLM costs by 80%+ with one line of code

View on PyPI →

How It Works

Intelligent routing without changing your workflow

01 Classify

Rule-based complexity scoring analyzes your prompts instantly to determine task difficulty.
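A minimal sketch of what rule-based complexity scoring could look like. The markers and weights here are illustrative assumptions, not InferShrink's actual rules:

```python
def complexity_score(prompt: str) -> float:
    """Estimate task difficulty on a 0..1 scale from cheap lexical rules.

    Hypothetical scorer for illustration; InferShrink's real signals
    and weights may differ.
    """
    score = 0.0
    # Longer prompts tend to be harder (caps at 0.4).
    score += min(len(prompt.split()) / 200, 0.4)
    # Keywords that usually signal reasoning-heavy tasks (assumed list).
    hard_markers = ("prove", "refactor", "analyze", "step by step", "derive")
    if any(marker in prompt.lower() for marker in hard_markers):
        score += 0.4
    # Code fences suggest programming tasks.
    if "```" in prompt:
        score += 0.2
    return min(score, 1.0)

complexity_score("What is 2+2?")                      # low score, downgradable
complexity_score("Prove this theorem step by step")   # higher score, stays put
```

Because the rules are plain string checks, scoring adds no API calls and effectively no latency.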

02 Route

Simple tasks are automatically routed to cheaper models, e.g. gpt-4o → gpt-4o-mini, with no code changes.
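Conceptually, routing is a lookup plus a threshold. This sketch uses an assumed model pairing table and cutoff, not the library's real internals:

```python
# Assumed cheaper-sibling pairings for illustration.
CHEAPER_SIBLING = {
    "gpt-4o": "gpt-4o-mini",
    "gemini-2.5-pro": "gemini-2.5-flash",
}

def route(model: str, score: float, threshold: float = 0.5) -> str:
    """Downgrade low-complexity requests; keep hard tasks on the requested model."""
    if score < threshold and model in CHEAPER_SIBLING:
        return CHEAPER_SIBLING[model]
    return model

route("gpt-4o", 0.1)  # returns "gpt-4o-mini"
route("gpt-4o", 0.9)  # returns "gpt-4o"
```

Models without a known cheaper sibling pass through unchanged, which is what makes the wrapper safe as a drop-in.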

03 Track

See your savings in real-time. Every request is logged with cost comparison metrics.
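The cost-comparison metric amounts to pricing the same tokens at both models' rates. The per-million-token prices below are placeholders, not current list prices:

```python
# Placeholder input prices in USD per million tokens (assumed values).
PRICE_PER_MTOK = {"gpt-4o": 2.50, "gpt-4o-mini": 0.15}

def log_savings(requested: str, routed: str, tokens: int) -> dict:
    """Compare actual cost of the routed model vs. the requested one."""
    would_have = tokens / 1e6 * PRICE_PER_MTOK[requested]
    actual = tokens / 1e6 * PRICE_PER_MTOK[routed]
    return {
        "requested": requested,
        "routed": routed,
        "cost_usd": actual,
        "saved_usd": would_have - actual,
    }

entry = log_savings("gpt-4o", "gpt-4o-mini", tokens=1_000)
# entry["saved_usd"] is positive whenever a downgrade happened
```

Aggregating these entries over a billing period is what produces the real-time savings view.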

Drop-in Replacement

Works with your existing OpenAI and Anthropic clients

OpenAI

import openai
from infershrink import optimize

client = optimize(openai.Client())

# Use exactly as before — InferShrink handles the rest
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
# Simple question → routed to gpt-4o-mini (95% cheaper)
# Complex tasks stay on gpt-4o automatically

Anthropic

import anthropic
from infershrink import optimize

client = optimize(anthropic.Anthropic())

# claude-opus → claude-sonnet for simple tasks
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello world"}],
)

Google

from openai import OpenAI
from infershrink import optimize

# Gemini via OpenAI-compatible endpoint
client = optimize(OpenAI(
    api_key="your-gemini-key",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
))

response = client.chat.completions.create(
    model="gemini-2.5-pro",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
# Simple → routed to gemini-2.5-flash (free tier)

Features

Zero dependencies (core)
Same-provider routing
Streaming support
OpenAI + Anthropic + Google
CLI included
511 tests, CI/CD

Pricing

Start free, scale as you grow

               Dev        Starter   Pro       Team
Price          No Key     Free      $19/mo    $49/mo
Requests/mo    Unlimited  1,000     50,000    500,000
Model routing
Compression
Retrieval

Get in Touch

Questions, enterprise needs, or just want to chat about LLM costs?
