Compression

LLMLingua-2 prompt compression can reduce token count by up to 20x in the best case while preserving most of the meaning. It is an optional feature and requires an extra install step.

Installation

pip install infershrink[compression]

This installs llmlingua. The compression model (~2GB) is not downloaded at install time; it is fetched the first time compression runs.

Quick Start

Compression runs automatically when you use the optimize() wrapper. You can also call it directly:

from infershrink import compress, classify
from infershrink.config import build_config

messages = [
    {"role": "user", "content": "Your very long prompt here... " * 200}
]

# Classify the prompt first; compress() takes the resulting complexity level
result = classify(messages)
config = build_config()

# Compress
compressed = compress(messages, result.complexity, config)

print(f"Original: {compressed.original_tokens} tokens")
print(f"Compressed: {compressed.compressed_tokens} tokens")
print(f"Ratio: {compressed.ratio:.2f}")

How It Works

LLMLingua-2 uses a BERT-based model to identify which tokens can be removed without losing meaning. It's particularly effective on:

Repetitive content — boilerplate, headers, similar paragraphs
Filler words — "the", "a", "that", "which" when redundant
Verbose phrasing — "in order to" → "to"
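
To make the idea of per-token pruning concrete, here is a toy sketch. It is not the LLMLingua-2 model: the `keep_score` function and the `FILLER` word list are hypothetical stand-ins for the learned classifier, which assigns each token a probability of being kept. Only the thresholding step resembles the real approach.

```python
# Toy illustration of "score each token, drop the low-scoring ones".
# FILLER and keep_score are hypothetical stand-ins for a learned model.
FILLER = {"the", "a", "an", "that", "which", "very", "really"}


def keep_score(word: str) -> float:
    """Return a made-up 'keep' probability for a single word."""
    normalized = word.lower().strip(".,;:!?")
    return 0.1 if normalized in FILLER else 0.9


def toy_compress(text: str, threshold: float = 0.5) -> str:
    """Keep only the words whose score meets the threshold, in order."""
    words = text.split()
    return " ".join(w for w in words if keep_score(w) >= threshold)


print(toy_compress("The report that was filed shows a very large increase"))
# prints: report was filed shows large increase
```

A real model scores tokens in context, so the same word may be kept in one sentence and dropped in another; a fixed word list cannot do that.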

Typical savings: 40-70% token reduction on long documents, less on already-concise prompts.
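
Percent reductions and "Nx" factors describe the same thing on different scales, which is easy to misread. The small helper below (names are illustrative, not part of infershrink) converts between them, so the typical 40-70% range can be compared with the 20x best case.

```python
def reduction_to_factor(reduction: float) -> float:
    """Convert a fractional reduction (0.6 = 60% fewer tokens) to an Nx factor."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def factor_to_reduction(factor: float) -> float:
    """Convert an Nx compression factor to a fractional token reduction."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return 1.0 - 1.0 / factor


for reduction in (0.40, 0.70):
    print(f"{reduction:.0%} reduction = {reduction_to_factor(reduction):.2f}x")
print(f"20x = {factor_to_reduction(20):.0%} reduction")
# prints:
# 40% reduction = 1.67x
# 70% reduction = 3.33x
# 20x = 95% reduction
```

In other words, the typical range corresponds to roughly 1.7x to 3.3x, well below the 20x upper bound.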

Configuration

Control compression behavior in your config file or via build_config():

config = build_config({
    "compression": {
        "enabled": True,
        "min_tokens": 500,      # Don't compress short prompts
        "skip_for": ["SECURITY_CRITICAL"]  # Skip for sensitive content
    }
})

Options

enabled — Turn compression on/off (default: true)
min_tokens — Minimum token count to trigger compression (default: 500)
skip_for — Complexity levels to skip (default: ["SECURITY_CRITICAL"])
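
The three options plausibly combine into a single gate. The sketch below is an assumption about how such a gate could work, not infershrink's actual code: `should_compress`, `DEFAULTS`, and the `"OTHER_LEVEL"` complexity name are hypothetical, and complexity levels are represented as plain strings.

```python
from typing import Any, Dict

# Defaults as documented in the options list above.
DEFAULTS: Dict[str, Any] = {
    "enabled": True,
    "min_tokens": 500,
    "skip_for": ["SECURITY_CRITICAL"],
}


def should_compress(token_count: int, complexity: str,
                    compression_cfg: Dict[str, Any]) -> bool:
    """Hypothetical gate: apply enabled, min_tokens, and skip_for in turn."""
    cfg = {**DEFAULTS, **compression_cfg}  # user values override defaults
    if not cfg["enabled"]:
        return False
    if token_count < cfg["min_tokens"]:
        return False
    return complexity not in cfg["skip_for"]


print(should_compress(1200, "OTHER_LEVEL", {}))            # True
print(should_compress(300, "OTHER_LEVEL", {}))             # False: too short
print(should_compress(1200, "SECURITY_CRITICAL", {}))      # False: skipped level
print(should_compress(1200, "OTHER_LEVEL", {"enabled": False}))  # False
```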

When NOT to Compress

Compression is skipped automatically for:

Security-critical content — passwords, PII, sensitive data
Short prompts — anything under the min_tokens threshold
Code blocks — removing tokens can break syntax
Structured data — JSON or XML can be corrupted
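
If you call compress() directly and want your own guard for code and structured data, a pre-check along these lines may help. This is a rough heuristic using only the standard library; `looks_structured` and `has_code_fence` are hypothetical helpers, and how infershrink itself detects these cases is not shown here.

```python
import json
import xml.etree.ElementTree as ET

FENCE = "`" * 3  # a Markdown code fence marker, built without typing it literally


def has_code_fence(text: str) -> bool:
    """True if the text contains a Markdown-style fenced code block marker."""
    return FENCE in text


def looks_structured(text: str) -> bool:
    """True if the whole text parses as JSON or as a single XML document."""
    stripped = text.strip()
    if not stripped:
        return False
    if stripped[0] in "{[":
        try:
            json.loads(stripped)
            return True
        except json.JSONDecodeError:
            pass
    if stripped.startswith("<"):
        try:
            ET.fromstring(stripped)
            return True
        except ET.ParseError:
            pass
    return False


print(looks_structured('{"user": "alice", "id": 7}'))   # True
print(looks_structured("<config><a>1</a></config>"))    # True
print(looks_structured("Summarize the report below."))  # False
```

The check only catches prompts that are entirely JSON or XML; structured data embedded in prose would need a more careful scan.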

CLI Usage

Check whether compression is available:

infershrink status

If installed, you'll see:

Optional features:
  ✓ Compression (LLMLingua)
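
The same check can be approximated from Python by testing whether the llmlingua package is importable. This is a minimal sketch; `compression_available` is an illustrative name, and the CLI may check more than this.

```python
import importlib.util


def compression_available() -> bool:
    """True if the llmlingua package can be found on the import path."""
    return importlib.util.find_spec("llmlingua") is not None


print("Compression available:", compression_available())
```

Using find_spec avoids actually importing the package, so the check stays fast and does not load the model.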

Performance

Latency: 50-200 ms per compression call, depending on prompt length
Memory: ~2GB for the model
GPU: Supported but not required (CPU works fine)
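
These figures depend on hardware and prompt length, so it is worth measuring on your own setup. The sketch below times any zero-argument callable; `time_call` is a hypothetical helper, and the lambda is a stand-in workload that you would replace with your own call to compress().

```python
import statistics
import time
from typing import Callable


def time_call(fn: Callable[[], object], repeats: int = 5) -> float:
    """Return the median wall-clock latency of fn in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Stand-in workload; swap in e.g. lambda: compress(messages, complexity, config)
latency_ms = time_call(lambda: sum(range(100_000)))
print(f"median latency: {latency_ms:.1f} ms")
```

Expect the first real call to be much slower than the rest, since it includes the model download and load; the median over several runs hides that one-off cost.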