Compression
LLMLingua-2 prompt compression reduces token count by up to 20x while preserving meaning. This is an optional feature that requires additional installation.
Installation
pip install "infershrink[compression]"
This installs llmlingua. A ~2GB compression model is downloaded the first time compression runs. The quotes keep shells such as zsh from interpreting the square brackets.
Quick Start
Compression runs automatically when you use the optimize() wrapper, or you can call compress() directly:
from infershrink import compress, classify, Complexity
from infershrink.config import build_config
messages = [
    {"role": "user", "content": "Your very long prompt here... " * 200}
]
# Get complexity first
result = classify(messages)
config = build_config()
# Compress
compressed = compress(messages, result.complexity, config)
print(f"Original: {compressed.original_tokens} tokens")
print(f"Compressed: {compressed.compressed_tokens} tokens")
print(f"Ratio: {compressed.ratio:.2f}")
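To turn the two token counts printed above into a savings figure, a small helper along these lines can be used. It is a sketch only: savings_summary is a hypothetical name, not part of infershrink, and it works from the raw counts rather than compressed.ratio, because the direction of that ratio is not stated here.

```python
def savings_summary(original_tokens: int, compressed_tokens: int) -> tuple[float, float]:
    """Return (percent_saved, reduction_factor) for a pair of token counts."""
    if original_tokens <= 0 or compressed_tokens <= 0:
        raise ValueError("token counts must be positive")
    percent_saved = 100.0 * (1 - compressed_tokens / original_tokens)
    reduction_factor = original_tokens / compressed_tokens
    return percent_saved, reduction_factor


# Illustrative numbers, not measured results.
percent, factor = savings_summary(2000, 800)
print(f"Saved {percent:.0f}% ({factor:.1f}x smaller)")
```

By the same arithmetic, a 20x reduction corresponds to a 95% saving.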
How It Works
LLMLingua-2 uses a BERT-based model to identify which tokens can be removed without losing meaning. It's particularly effective on:
- Repetitive content — boilerplate, headers, similar paragraphs
- Filler words — "the", "a", "that", "which" when redundant
- Verbose phrasing — "in order to" → "to"
Typical savings: 40-70% token reduction on long documents, less on already-concise prompts.
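The idea of keeping only the most informative tokens can be illustrated with a toy sketch. The scores below are assigned by hand purely for illustration; in LLMLingua-2 they would come from the trained model. prune_tokens is a hypothetical helper, not part of llmlingua or infershrink.

```python
def prune_tokens(tokens, keep_scores, keep_rate):
    """Keep the highest-scoring fraction of tokens, preserving original order."""
    if len(tokens) != len(keep_scores):
        raise ValueError("tokens and keep_scores must have the same length")
    n_keep = max(1, round(len(tokens) * keep_rate))
    # Rank positions by score (highest first), then restore document order.
    ranked = sorted(range(len(tokens)), key=lambda i: keep_scores[i], reverse=True)
    kept_positions = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept_positions]


tokens = ["in", "order", "to", "reduce", "the", "cost", "of", "inference"]
# Hand-assigned scores standing in for a model's "keep" probabilities.
scores = [0.10, 0.15, 0.60, 0.95, 0.20, 0.90, 0.40, 0.97]
print(" ".join(prune_tokens(tokens, scores, keep_rate=0.5)))
```

With these scores and a keep rate of 0.5, the four highest-scoring tokens survive in their original order, so "in order to" collapses to "to".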
Configuration
Control compression behavior in your config file or via build_config():
config = build_config({
    "compression": {
        "enabled": True,
        "min_tokens": 500,  # Don't compress short prompts
        "skip_for": ["SECURITY_CRITICAL"]  # Skip for sensitive content
    }
})
Options
- enabled: Turn compression on/off (default: true)
- min_tokens: Minimum token count to trigger compression (default: 500)
- skip_for: Complexity levels to skip (default: ["SECURITY_CRITICAL"])
When NOT to Compress
Compression is skipped automatically for:
- Security-critical content: passwords, PII, sensitive data
- Short prompts: under the min_tokens threshold
- Code blocks: compression can break syntax
- Structured data: JSON or XML may be corrupted
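The security-critical and short-prompt rules correspond to the skip_for and min_tokens options. A minimal sketch of that decision follows, assuming the documented defaults and assuming complexity levels can be compared by name. should_compress is a hypothetical helper written for illustration, not infershrink's actual implementation, and it does not attempt the code-block or structured-data checks.

```python
# Defaults as documented in the Options list.
DEFAULTS = {
    "enabled": True,
    "min_tokens": 500,
    "skip_for": ["SECURITY_CRITICAL"],
}


def should_compress(token_count, complexity, compression_config=None):
    """Return True when a prompt passes the enabled, skip_for and min_tokens checks."""
    cfg = {**DEFAULTS, **(compression_config or {})}
    if not cfg["enabled"]:
        return False
    if complexity in cfg["skip_for"]:
        return False
    return token_count >= cfg["min_tokens"]


# "SOME_LEVEL" is a placeholder name, not a real complexity level.
print(should_compress(1200, "SOME_LEVEL"))
print(should_compress(120, "SOME_LEVEL"))
print(should_compress(1200, "SECURITY_CRITICAL"))
```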
CLI Usage
Check if compression is available:
infershrink status
If installed, you'll see:
Optional features:
✓ Compression (LLMLingua)
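A similar check can be made from Python before calling compress(). This sketch only tests whether the llmlingua package is importable, which is an assumption about what infershrink status reports; compression_available is a hypothetical helper, not part of infershrink.

```python
import importlib.util


def compression_available() -> bool:
    """Return True if the llmlingua package can be imported in this environment."""
    return importlib.util.find_spec("llmlingua") is not None


print("Compression available:", compression_available())
```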
Performance
- Latency: 50-200ms per compression (depends on prompt length)
- Memory: ~2GB for the model
- GPU: Supported but not required (CPU works fine)
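To see where a given workload falls in that latency range, a generic timing wrapper can help. This is a sketch using only the standard library; time_call is a hypothetical helper, and the workload shown is a stand-in so the example runs without the compression model installed.

```python
import time


def time_call(fn, *args, repeats=5, **kwargs):
    """Call fn `repeats` times and return (last_result, best_latency_ms)."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        best = min(best, (time.perf_counter() - start) * 1000.0)
    return result, best


# Stand-in workload; swap in e.g. compress(messages, result.complexity, config).
result, ms = time_call(lambda text: text.split(), "a long prompt " * 1000)
print(f"Best of 5: {ms:.2f} ms")
```

Taking the best of several runs keeps one-off costs, such as loading the model on the first call, out of the reported figure.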