Compression

LLMLingua-2 prompt compression can reduce token count by up to 20x while preserving meaning. It is an optional feature and requires an additional install step.

Installation

pip install infershrink[compression]

This installs llmlingua. The compression model (~2GB) is downloaded the first time compression runs, not at install time.

Quick Start

Compression runs automatically when you use the optimize() wrapper. You can also call compress() directly:

from infershrink import compress, classify
from infershrink.config import build_config

messages = [
    {"role": "user", "content": "Your very long prompt here... " * 200}
]

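# Classify the prompt first: compress() takes the resulting complexity
result = classify(messages)

# Build a config with default settings
config = build_config()

# Compress the messages using the classified complexity and the config
compressed = compress(messages, result.complexity, config)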

print(f"Original: {compressed.original_tokens} tokens")
print(f"Compressed: {compressed.compressed_tokens} tokens")
print(f"Ratio: {compressed.ratio:.2f}")

How It Works

LLMLingua-2 uses a BERT-based model to identify which tokens can be removed without losing meaning. It's particularly effective on:

Typical savings: 40-70% token reduction on long documents, less on already-concise prompts.
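To make the idea of dropping low-value tokens concrete, here is a toy sketch. It is not the LLMLingua-2 model or infershrink's code: the keep scores are hand-written stand-ins for what a token classifier would produce, and prune_tokens and reduction_percent are hypothetical helper names used only for this illustration.

```python
def prune_tokens(tokens, keep_scores, rate):
    """Keep the highest-scoring fraction `rate` of tokens, in original order."""
    if not 0 < rate <= 1:
        raise ValueError("rate must be in (0, 1]")
    n_keep = max(1, round(len(tokens) * rate))
    # Rank positions by score (highest first), then restore original order
    # so the surviving tokens still read left to right.
    ranked = sorted(range(len(tokens)), key=lambda i: keep_scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept]


def reduction_percent(original_tokens, compressed_tokens):
    """Percentage of tokens removed, e.g. 8 -> 4 tokens is a 50% reduction."""
    return 100.0 * (1 - compressed_tokens / original_tokens)


# Hand-written scores (assumption): higher means "more important to keep".
tokens = ["Please", "could", "you", "kindly", "summarize", "the", "quarterly", "report"]
scores = [0.20, 0.10, 0.30, 0.05, 0.95, 0.40, 0.90, 0.92]

pruned = prune_tokens(tokens, scores, rate=0.5)
print(" ".join(pruned))
print(f"Reduction: {reduction_percent(len(tokens), len(pruned)):.0f}%")
```

In the same terms, a 20x compression corresponds to roughly a 95% reduction, while the typical 40-70% range corresponds to keeping 30-60% of the original tokens.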

Configuration

Control compression behavior in your config file or via build_config():

config = build_config({
    "compression": {
        "enabled": True,
        "min_tokens": 500,      # Don't compress short prompts
        "skip_for": ["SECURITY_CRITICAL"]  # Skip for sensitive content
    }
})
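As a rough mental model of how these settings could interact, the sketch below gates compression on the three keys shown above. It is an assumption for illustration only, not infershrink's actual logic: should_compress is a hypothetical helper, complexity is represented as a plain string label, "OTHER" is a placeholder label, and whether the min_tokens threshold is inclusive is a guess.

```python
def should_compress(token_count, complexity, compression_cfg):
    """Return True if a prompt would be compressed under this toy model."""
    # Feature switched off entirely.
    if not compression_cfg.get("enabled", True):
        return False
    # Short prompts are left alone (threshold treated as inclusive here).
    if token_count < compression_cfg.get("min_tokens", 0):
        return False
    # Complexity labels listed in skip_for are never compressed.
    if complexity in compression_cfg.get("skip_for", []):
        return False
    return True


cfg = {"enabled": True, "min_tokens": 500, "skip_for": ["SECURITY_CRITICAL"]}

print(should_compress(1200, "OTHER", cfg))              # long prompt, not skipped
print(should_compress(120, "OTHER", cfg))               # below min_tokens
print(should_compress(1200, "SECURITY_CRITICAL", cfg))  # in skip_for
```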

Options

When NOT to Compress

Compression is skipped automatically for:

CLI Usage

Check if compression is available:

infershrink status

If installed, you'll see:

Optional features:
  ✓ Compression (LLMLingua)
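If you want a similar check from Python, one option is to test whether the llmlingua package can be imported. This is a minimal sketch using only the standard library. It does not call infershrink, and the helper names and the ✗ marker for a missing feature are assumptions made for this example.

```python
import importlib.util


def is_module_available(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def status_line(label, available):
    """Format one feature line in the style of the status output."""
    mark = "✓" if available else "✗"
    return f"  {mark} {label}"


print("Optional features:")
print(status_line("Compression (LLMLingua)", is_module_available("llmlingua")))
```

This only confirms that the package is installed. It does not tell you whether the compression model has already been downloaded.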

Performance