Getting Started

InferShrink is a zero-dependency Python SDK that cuts your LLM costs by automatically routing simple requests to cheaper models.

Installation

pip install infershrink                  # Core (zero deps)
pip install "infershrink[retrieval]"     # + FAISS retrieval
pip install "infershrink[compression]"   # + LLMLingua
pip install "infershrink[all]"           # Everything

(The quotes around bracketed extras are needed in shells like zsh, which otherwise treat the brackets as glob patterns.)

Step 1: Wrap your client

Import optimize and wrap your existing client instance. This works for both OpenAI and Anthropic clients.

import openai
from infershrink import optimize

client = optimize(openai.Client())
# Done. All calls now auto-route to cheapest capable model.

Step 2: Use normally

Continue using the client exactly as before. InferShrink intercepts the call, analyzes complexity, and routes accordingly.

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
# Simple → routed to gpt-4o-mini (95% cheaper)
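InferShrink's actual complexity analysis isn't documented here; as a rough illustration of the idea, a routing heuristic might score a prompt on length and keyword signals. The function, keywords, and threshold below are hypothetical, not InferShrink's API:

```python
# Hypothetical sketch of complexity-based routing -- NOT InferShrink's
# real implementation. Scores a prompt and picks a model tier.

HARD_SIGNALS = ("prove", "refactor", "analyze", "step by step")

def route(prompt: str) -> str:
    """Return a model name based on a crude complexity score."""
    score = len(prompt.split())  # longer prompts score higher
    score += sum(10 for kw in HARD_SIGNALS if kw in prompt.lower())
    return "gpt-4o" if score > 50 else "gpt-4o-mini"

print(route("What is 2+2?"))  # short, no hard signals -> gpt-4o-mini
print(route("Analyze this design step by step: " + "word " * 60))  # -> gpt-4o
```

A real router would likely weigh more signals (structure, required reasoning depth, context length), but the shape of the decision is the same: classify first, then dispatch to the cheapest capable model.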

Step 3: Check savings

View your accumulated savings and routing statistics.

print(client.infershrink_tracker.summary())
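Conceptually, a savings figure reduces to comparing what the requested model would have cost against what the routed model actually cost. A toy version of that arithmetic, with illustrative per-token prices (the numbers are made up, not live pricing):

```python
# Toy savings calculation -- prices are illustrative, not current rates.
PRICE_PER_1K = {"gpt-4o": 0.0050, "gpt-4o-mini": 0.00015}  # USD per 1K input tokens

def savings(requested: str, routed: str, tokens: int) -> float:
    """Dollars saved by routing `tokens` input tokens to a cheaper model."""
    would_have = PRICE_PER_1K[requested] * tokens / 1000
    actually = PRICE_PER_1K[routed] * tokens / 1000
    return would_have - actually

print(f"${savings('gpt-4o', 'gpt-4o-mini', 10_000):.4f} saved")  # $0.0485 saved
```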

Streaming

InferShrink fully supports streaming responses. The routing decision is made before the stream begins.

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hi"}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")  # content is None on the final chunk

Full Stack: Retrieval + Compression

For advanced use cases, combine retrieval (RAG) with prompt compression. This requires the retrieval and compression extras (or infershrink[all]).

from infershrink import TokenShrink, optimize
import openai

ts = TokenShrink()
ts.index("./docs")
result = ts.query("What are the API rate limits?")

client = optimize(openai.Client())
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Answer using this context:\n" + result.context},
        {"role": "user", "content": "What are the API rate limits?"},
    ],
)
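The compression extra wraps LLMLingua, which uses a small language model to score and drop low-information tokens. The idea can be sketched with a much cruder stand-in that strips stopwords; this is illustrative only and not how LLMLingua works internally:

```python
# Crude stand-in for prompt compression -- LLMLingua scores tokens with a
# small LM; this simply drops common stopwords, preserving word order.
STOPWORDS = {"the", "a", "an", "of", "to", "is", "are", "that", "this"}

def compress(prompt: str) -> str:
    """Remove stopwords, keeping the order of the remaining words."""
    kept = [w for w in prompt.split() if w.lower() not in STOPWORDS]
    return " ".join(kept)

text = "The rate limits of the API are listed in the docs that follow."
print(compress(text))  # rate limits API listed in docs follow.
```

Even this naive version shortens the prompt while keeping the content words an LLM needs to answer; LLMLingua applies the same principle with learned token-importance scores instead of a fixed list.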