# Getting Started

InferShrink is a zero-dependency Python SDK that reduces your LLM costs by automatically routing simple tasks to cheaper models.
## Installation

```bash
pip install infershrink                  # Core (zero deps)
pip install "infershrink[retrieval]"     # + FAISS retrieval
pip install "infershrink[compression]"   # + LLMLingua
pip install "infershrink[all]"           # Everything
```

(The extras are quoted so the brackets survive shells like zsh.)
## Step 1: Wrap your client

Import `optimize` and wrap your existing client instance. This works for both OpenAI and Anthropic clients.

```python
import openai

from infershrink import optimize

client = optimize(openai.Client())
# Done. All calls now auto-route to the cheapest capable model.
```
## Step 2: Use normally

Continue using the client exactly as before. InferShrink intercepts the call, analyzes its complexity, and routes it accordingly.

```python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
# Simple prompt → routed to gpt-4o-mini (95% cheaper)
```
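The routing logic is internal to InferShrink, but the idea can be pictured as a complexity classifier that downgrades the requested model when a prompt looks simple. This is a minimal sketch under assumed thresholds and an assumed model-tier mapping, not the library's actual heuristic:

```python
# Toy sketch of complexity-based routing. The scoring rule, the 1.0
# threshold, and the tier mapping are illustrative assumptions — they
# are NOT InferShrink internals.

def complexity_score(messages):
    """Crude proxy: longer, multi-turn prompts score higher."""
    text = " ".join(m["content"] for m in messages)
    return len(text) / 500 + 0.5 * (len(messages) - 1)

def route(requested_model, messages, threshold=1.0):
    """Downgrade to a cheaper tier when the prompt looks simple."""
    cheap_tier = {"gpt-4o": "gpt-4o-mini"}  # assumed mapping
    if complexity_score(messages) < threshold:
        return cheap_tier.get(requested_model, requested_model)
    return requested_model

print(route("gpt-4o", [{"role": "user", "content": "What is 2+2?"}]))
# A short single-turn prompt falls below the threshold → "gpt-4o-mini"
```

The key design point the real SDK shares with this sketch: your code still names the expensive model, and the downgrade happens transparently inside the wrapper.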
## Step 3: Check savings

View your accumulated savings and routing statistics.

```python
print(client.infershrink_tracker.summary())
```
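Conceptually, the tracker compares what each call cost on the routed model against what it would have cost on the model you asked for. The sketch below shows that bookkeeping with a hypothetical `SavingsTracker` class; the field names and per-1K-token prices are assumptions, not the actual `infershrink_tracker` API or real pricing:

```python
# Illustrative savings bookkeeping. Class name, record/summary fields,
# and prices are assumptions for the sketch, not InferShrink's API.
class SavingsTracker:
    PRICE_PER_1K = {"gpt-4o": 0.005, "gpt-4o-mini": 0.00015}  # assumed USD prices

    def __init__(self):
        self.records = []

    def record(self, requested, routed, tokens):
        # Savings = baseline cost on the requested model minus actual cost.
        baseline = self.PRICE_PER_1K[requested] * tokens / 1000
        actual = self.PRICE_PER_1K[routed] * tokens / 1000
        self.records.append(baseline - actual)

    def summary(self):
        return {"calls": len(self.records),
                "saved_usd": round(sum(self.records), 6)}

tracker = SavingsTracker()
tracker.record("gpt-4o", "gpt-4o-mini", tokens=1200)
print(tracker.summary())
```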
## Streaming

InferShrink fully supports streaming responses; the routing decision is made before the stream begins.

```python
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hi"}],
    stream=True,
)

for chunk in stream:
    # The final chunk's delta has no content; skip it to avoid printing "None".
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
```
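"Route first, then stream" can be sketched with a plain generator: the model is chosen once, up front, and every subsequent chunk carries that decision. The routing rule and the faked chunks here are stand-ins for illustration only:

```python
# Sketch: the routing decision happens before the first chunk is
# yielded, never mid-stream. The length-based rule is a stand-in.
def stream_with_routing(messages, requested_model="gpt-4o"):
    # Decide the model once, up front.
    model = "gpt-4o-mini" if len(messages[-1]["content"]) < 50 else requested_model
    # A real client would yield server-sent chunks; we fake three.
    for token in ["Hel", "lo", "!"]:
        yield {"model": model, "delta": token}

chunks = list(stream_with_routing([{"role": "user", "content": "Hi"}]))
print("".join(c["delta"] for c in chunks), chunks[0]["model"])
# → Hello! gpt-4o-mini
```

Deciding before the first chunk matters because a stream cannot switch models partway through without restarting the request.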
## Full Stack: Retrieval + Compression

For advanced use cases, combine retrieval (RAG) with prompt compression.

```python
import openai

from infershrink import TokenShrink, optimize

# Index local documents, then retrieve context for the question.
ts = TokenShrink()
ts.index("./docs")
result = ts.query("What are the API rate limits?")

client = optimize(openai.Client())
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Answer using this context:\n" + result.context},
        {"role": "user", "content": "What are the API rate limits?"},
    ],
)
```
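The retrieval step above can be pictured as ranking indexed chunks by relevance to the query and keeping the best ones as context. InferShrink's `retrieval` extra uses FAISS for this; the bag-of-words scorer below is only a toy illustration of the idea, with made-up documents:

```python
# Toy retrieval sketch: rank chunks by keyword overlap with the query.
# The real retrieval extra uses FAISS embeddings; this overlap scorer
# and the sample docs are illustrative assumptions.
def retrieve(chunks, query, top_k=1):
    q = set(query.lower().split())
    scored = sorted(chunks,
                    key=lambda c: len(q & set(c.lower().split())),
                    reverse=True)
    return "\n".join(scored[:top_k])

docs = [
    "Billing is monthly via invoice.",
    "API rate limits: 60 requests per minute per key.",
]
context = retrieve(docs, "What are the API rate limits?")
print(context)
```

Embedding-based retrieval replaces the overlap count with vector similarity, which also matches paraphrases that share no keywords with the query.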