Retrieval

FAISS-powered semantic search lets you index documents and retrieve only the chunks relevant to your query. Combined with deduplication and importance scoring, this can substantially reduce context size for RAG applications.

Installation

pip install infershrink[retrieval]

This installs faiss-cpu and sentence-transformers. The first use downloads a ~400MB embedding model.

Quick Start

from infershrink import TokenShrink

# Initialize and index your docs
ts = TokenShrink()
ts.index("./docs")

# Query with semantic search
result = ts.query("What are the API rate limits?")

print(f"Sources: {result.sources}")
print(f"Stats: {result.savings}")
print(result.context)

How It Works

Indexing: Documents are split into chunks and embedded with sentence-transformers.
Search: Your query is embedded and matched against the index via cosine similarity.
Deduplication: Near-duplicate chunks are removed (configurable threshold).
Importance scoring: Each chunk gets a score that combines query similarity and information density.
Compression: If LLMLingua is installed, chunks are compressed adaptively based on importance.
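
The search and deduplication steps above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function names (cosine, search, dedup), the greedy dedup strategy, and the toy 3-dimensional vectors are all assumptions made for the example. Only the use of cosine similarity and the 0.3 and 0.85 defaults come from this page.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query_vec, chunk_vecs, k=5, min_score=0.3):
    """Return (index, score) pairs for the top-k chunks scoring at least min_score."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored = [(i, s) for i, s in scored if s >= min_score]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:k]

def dedup(hits, chunk_vecs, threshold=0.85):
    """Greedy dedup (an assumed strategy): keep a hit only if it is
    below the similarity threshold against every hit already kept."""
    kept = []
    for i, score in hits:
        if all(cosine(chunk_vecs[i], chunk_vecs[j]) < threshold for j, _ in kept):
            kept.append((i, score))
    return kept

# Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions.
chunks = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],  # near-duplicate of chunk 0
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],    # unrelated to the query
]
query = [1.0, 0.1, 0.0]

hits = search(query, chunks)
unique = dedup(hits, chunks)
print([i for i, _ in hits])
print([i for i, _ in unique])
```

In this toy data, chunk 3 falls below min_score and chunk 0 is dropped as a near-duplicate of the higher-ranked chunk 1.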

REFRAG-inspired: High-importance chunks (relevant and dense) are preserved more faithfully, while low-importance chunks are compressed more aggressively.

CLI Usage

# Index a directory
infershrink index ./docs

# Query with compression
infershrink query "How do I authenticate?" --compress

# Show importance scores
infershrink query "API limits" --scores

# Raw search (no compression)
infershrink search "rate limits"

Configuration

ts = TokenShrink(
    index_dir=".tokenshrink",  # Where to store the index
    model="all-MiniLM-L6-v2",  # Embedding model
    chunk_size=512,            # Words per chunk
    chunk_overlap=50,          # Overlap between chunks
    compression=True,          # Enable LLMLingua (if installed)
    adaptive=True,             # REFRAG-inspired adaptive compression
    dedup=True,                # Remove near-duplicate chunks
    dedup_threshold=0.85,      # Cosine similarity threshold for dedup
)
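
To show how chunk_size and chunk_overlap interact, the following sketch splits text into overlapping word windows. It assumes the overlap is counted in words, like chunk_size; chunk_words is a hypothetical helper, not part of the library.

```python
def chunk_words(text, chunk_size=512, chunk_overlap=50):
    """Split text into word-based chunks; consecutive chunks share chunk_overlap words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

text = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_words(text)
print(len(chunks), [len(c.split()) for c in chunks])
```

With the defaults, each new chunk advances by 462 words, so a 1,000-word document yields three chunks.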

Query Options

result = ts.query(
    "your question",
    k=5,                 # Number of chunks to retrieve
    min_score=0.3,       # Minimum similarity score
    max_tokens=2000,     # Target token limit
    compress=True,       # Override compression setting
    adaptive=True,       # Override adaptive setting
    dedup=True,          # Override dedup setting
)
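
How these options interact inside the library is not spelled out here. The sketch below shows one plausible ordering, purely as an illustration: filter by min_score, keep the top k, then stop adding chunks once max_tokens would be exceeded. select_chunks is a hypothetical helper, and word count stands in for a real tokenizer.

```python
def select_chunks(scored_chunks, k=5, min_score=0.3, max_tokens=2000):
    """scored_chunks: list of (text, similarity) pairs. Returns the selected texts."""
    eligible = [c for c in scored_chunks if c[1] >= min_score]
    top = sorted(eligible, key=lambda c: c[1], reverse=True)[:k]
    selected, used = [], 0
    for text, _ in top:
        cost = len(text.split())  # word count as a rough stand-in for tokens
        if used + cost > max_tokens:
            break
        selected.append(text)
        used += cost
    return selected

candidates = [
    ("limits are 100 requests per minute", 0.82),
    ("authentication uses bearer tokens", 0.41),
    ("release notes for version 2", 0.12),
]
picked = select_chunks(candidates, k=2, min_score=0.3, max_tokens=8)
print(picked)
```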

Understanding Results

result = ts.query("rate limits")

# Basics
result.context           # The combined, possibly compressed text
result.sources           # List of source file paths
result.original_tokens   # Token count before compression
result.compressed_tokens # Token count after compression
result.ratio             # Compression ratio (0.3 = 70% reduction)
result.savings           # Human-readable: "Saved 70% (1000 → 300 tokens)"
result.savings_pct       # Just the percentage: 70.0

# Advanced (REFRAG features)
result.chunk_scores      # List of ChunkScore objects
result.dedup_removed     # Number of chunks removed as duplicates

# Each ChunkScore contains:
for cs in result.chunk_scores:
    cs.similarity        # Cosine similarity to query (0-1)
    cs.density           # Information density (0-1)
    cs.importance        # Combined score
    cs.compression_ratio # Adaptive ratio for this chunk
    cs.deduplicated      # True if marked as duplicate
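
The relationship between the token counts, ratio, and savings_pct can be checked by hand. The sketch below reproduces the 1000 → 300 example from the comments above; compression_stats is a hypothetical helper written for this illustration, not a library function.

```python
def compression_stats(original_tokens, compressed_tokens):
    """Derive the ratio and savings percentage from the two token counts."""
    ratio = compressed_tokens / original_tokens if original_tokens else 1.0
    savings_pct = (1 - ratio) * 100
    return ratio, savings_pct

ratio, savings_pct = compression_stats(1000, 300)
summary = f"Saved {savings_pct:.0f}% (1000 → 300 tokens)"
print(ratio, summary)
```

A lower ratio means stronger compression: a ratio of 0.3 keeps 30% of the original tokens.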

Storage Requirements

Embedding model: ~400MB (downloaded on first use)
Index size: Roughly 4KB per chunk indexed
Memory: The loaded model uses ~1GB of RAM during queries
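
For capacity planning, the 4KB-per-chunk figure can be turned into a back-of-the-envelope estimate. This sketch assumes word-based chunking with the default chunk_size and chunk_overlap, with overlap counted in words; estimate_index_kb is a hypothetical helper, and real index sizes will vary.

```python
import math

def estimate_index_kb(total_words, chunk_size=512, chunk_overlap=50, kb_per_chunk=4):
    """Estimate the chunk count and on-disk index size (KB) from a corpus word count."""
    if total_words <= 0:
        return 0, 0
    step = chunk_size - chunk_overlap
    # The first chunk covers chunk_size words; each further chunk adds `step` new words.
    n_chunks = 1 + max(0, math.ceil((total_words - chunk_size) / step))
    return n_chunks, n_chunks * kb_per_chunk

n_chunks, size_kb = estimate_index_kb(1_000_000)
print(f"{n_chunks} chunks, about {size_kb / 1024:.1f} MB")
```

Under these assumptions, a one-million-word corpus comes to roughly 8.5 MB of index, far smaller than the embedding model itself.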

Best Practices

Chunk size: 512 words works well for most content. Use smaller chunks for code and larger ones for prose.
min_score: The default of 0.3 is lenient. Raise it to 0.5 or higher for stricter relevance.
dedup_threshold: The default of 0.85 catches near-duplicates without being overly aggressive.
Index location: Use a persistent directory to avoid re-indexing on every run.