Retrieval

FAISS-powered semantic search lets you index documents and retrieve only the chunks relevant to a query. Combined with deduplication and importance scoring, this can substantially reduce the context passed to the model in RAG applications.

Installation

pip install "infershrink[retrieval]"

This installs faiss-cpu and sentence-transformers. First use downloads a ~400MB embedding model.

Quick Start

from infershrink import TokenShrink

# Initialize and index your docs
ts = TokenShrink()
ts.index("./docs")

# Query with semantic search
result = ts.query("What are the API rate limits?")

print(f"Sources: {result.sources}")
print(f"Stats: {result.savings}")
print(result.context)

How It Works

  1. Indexing: Documents are chunked and embedded using sentence-transformers
  2. Search: Your query is embedded and matched against the index via cosine similarity
  3. Deduplication: Near-duplicate chunks are removed (configurable threshold)
  4. Importance scoring: Each chunk gets a score based on similarity + information density
  5. Compression: If LLMLingua is installed, chunks are compressed adaptively based on importance
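
Steps 2 and 3 above can be illustrated with a minimal pure-Python sketch. This is not the library's implementation (which uses FAISS and real embeddings). The helper names `cosine`, `search`, and `dedup` and the toy 2-D vectors are assumptions made for illustration. Only the defaults (`k=5`, `min_score=0.3`, threshold `0.85`) come from this page.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query_vec, chunk_vecs, k=5, min_score=0.3):
    """Return (index, score) pairs for the top-k chunks scoring at least min_score."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored = [(i, s) for i, s in scored if s >= min_score]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

def dedup(hits, chunk_vecs, threshold=0.85):
    """Keep a hit only if it is not too similar to a hit that was already kept."""
    kept = []
    for i, score in hits:
        if all(cosine(chunk_vecs[i], chunk_vecs[j]) < threshold for j, _ in kept):
            kept.append((i, score))
    return kept

# Toy 2-D "embeddings" (hypothetical); real embeddings have hundreds of dimensions.
chunks = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8], [0.0, 1.0]]
query = [1.0, 0.1]

hits = search(query, chunks, k=3)
unique = dedup(hits, chunks)
print("retrieved:", [i for i, _ in hits])
print("after dedup:", [i for i, _ in unique])
```

In this toy data, chunks 0 and 1 point in almost the same direction, so only the higher-scoring of the two survives deduplication.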

REFRAG-inspired: High-importance chunks (relevant and dense) are preserved more, while low-importance chunks are compressed aggressively.
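
One way to picture the scoring and adaptive compression is the sketch below. The exact formula is not documented on this page, so everything here is an assumption: the distinct-word density proxy, the 0.7 similarity weight, and the linear mapping from importance to a keep ratio between 0.2 and 0.9 are all hypothetical choices.

```python
def density(text):
    """Crude information-density proxy: share of distinct words (0-1). Assumption."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0

def importance(similarity, dens, sim_weight=0.7):
    """Weighted blend of similarity and density; the 0.7 weight is an assumption."""
    return sim_weight * similarity + (1 - sim_weight) * dens

def keep_ratio(score, low=0.2, high=0.9):
    """Map importance (clamped to 0-1) linearly to the fraction of tokens to keep."""
    score = max(0.0, min(1.0, score))
    return low + (high - low) * score

dense = "Rate limits: 100 requests per minute per API key"
filler = "very very very very important important important note note note"

# A relevant, dense chunk keeps more of its tokens than a repetitive, less relevant one.
for text, sim in [(dense, 0.9), (filler, 0.4)]:
    score = importance(sim, density(text))
    print(f"importance={score:.2f} keep={keep_ratio(score):.2f}")
```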

CLI Usage

# Index a directory
infershrink index ./docs

# Query with compression
infershrink query "How do I authenticate?" --compress

# Show importance scores
infershrink query "API limits" --scores

# Raw search (no compression)
infershrink search "rate limits"

Configuration

ts = TokenShrink(
    index_dir=".tokenshrink",  # Where to store the index
    model="all-MiniLM-L6-v2",  # Embedding model
    chunk_size=512,            # Words per chunk
    chunk_overlap=50,          # Overlap between chunks
    compression=True,          # Enable LLMLingua (if installed)
    adaptive=True,             # REFRAG-inspired adaptive compression
    dedup=True,                # Remove near-duplicate chunks
    dedup_threshold=0.85,      # Cosine similarity threshold for dedup
)
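
To show how `chunk_size` and `chunk_overlap` interact, here is a minimal word-based chunker. It assumes the overlap is also measured in words, which this page does not state. The function name `chunk_words` is hypothetical and not part of the library.

```python
def chunk_words(text, chunk_size=512, chunk_overlap=50):
    """Split text into word chunks; consecutive chunks share chunk_overlap words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

# A synthetic 1000-word document: w0 w1 ... w999
doc = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_words(doc)
print(len(chunks), [len(c.split()) for c in chunks])
```

With the defaults, each new chunk starts 462 words after the previous one, so a larger overlap means more chunks (and a larger index) for the same text.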

Query Options

result = ts.query(
    "your question",
    k=5,                 # Number of chunks to retrieve
    min_score=0.3,       # Minimum similarity score
    max_tokens=2000,     # Target token limit
    compress=True,       # Override compression setting
    adaptive=True,       # Override adaptive setting
    dedup=True,          # Override dedup setting
)
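
A target such as `max_tokens` can be thought of as a budget that the best-scoring chunks are packed into. The sketch below is only one plausible strategy, not the library's documented behavior: `fit_budget` is a hypothetical helper, and it counts whitespace-separated words as a rough stand-in for tokens.

```python
def fit_budget(chunks, max_tokens=2000):
    """Greedily select (score, text) chunks, best first, without exceeding max_tokens."""
    selected, used = [], 0
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = len(text.split())  # word count as a rough token estimate (assumption)
        if used + cost > max_tokens:
            continue  # skip chunks that do not fit; a smaller one may still fit
        selected.append(text)
        used += cost
    return selected, used

candidates = [
    (0.9, "a " * 1200),
    (0.8, "b " * 900),
    (0.5, "c " * 700),
]
selected, used = fit_budget(candidates, max_tokens=2000)
print(len(selected), used)
```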

Understanding Results

result = ts.query("rate limits")

# Basics
result.context           # The combined, possibly compressed text
result.sources           # List of source file paths
result.original_tokens   # Token count before compression
result.compressed_tokens # Token count after compression
result.ratio             # Compression ratio (0.3 = 70% reduction)
result.savings           # Human-readable: "Saved 70% (1000 → 300 tokens)"
result.savings_pct       # Just the percentage: 70.0

# Advanced (REFRAG features)
result.chunk_scores      # List of ChunkScore objects
result.dedup_removed     # Number of chunks removed as duplicates

# Each ChunkScore contains:
for cs in result.chunk_scores:
    cs.similarity        # Cosine similarity to query (0-1)
    cs.density           # Information density (0-1)
    cs.importance        # Combined score
    cs.compression_ratio # Adaptive ratio for this chunk
    cs.deduplicated      # True if marked as duplicate
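
The relationship between the token counts, `ratio`, and `savings_pct` can be checked by hand. The sketch below recomputes them from the example figures above (1000 → 300 tokens). `summarize` is a hypothetical helper, and the string format simply mirrors the example shown for `result.savings`.

```python
def summarize(original_tokens, compressed_tokens):
    """Recompute ratio, savings percentage, and a savings string from token counts."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    ratio = compressed_tokens / original_tokens   # 0.3 means 30% of the tokens remain
    savings_pct = (1 - ratio) * 100               # share of tokens removed
    savings = f"Saved {savings_pct:.0f}% ({original_tokens} → {compressed_tokens} tokens)"
    return ratio, savings_pct, savings

ratio, savings_pct, savings = summarize(1000, 300)
print(ratio, savings_pct)
print(savings)
```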

Storage Requirements

Best Practices