Retrieval
FAISS-powered semantic search lets you index documents and retrieve only the chunks relevant to a query. Combined with deduplication and importance scoring, this cuts the amount of context a RAG application has to send to the model.
Installation
pip install infershrink[retrieval]
This installs faiss-cpu and sentence-transformers. First use downloads a ~400MB embedding model.
Quick Start
from infershrink import TokenShrink
# Initialize and index your docs
ts = TokenShrink()
ts.index("./docs")
# Query with semantic search
result = ts.query("What are the API rate limits?")
print(f"Sources: {result.sources}")
print(f"Stats: {result.savings}")
print(result.context)
How It Works
- Indexing: Documents are chunked and embedded using sentence-transformers
- Search: Your query is embedded and matched against the index via cosine similarity
- Deduplication: Near-duplicate chunks are removed (configurable threshold)
- Importance scoring: Each chunk gets a score based on similarity + information density
- Compression: If LLMLingua is installed, chunks are compressed adaptively based on importance
REFRAG-inspired: high-importance chunks (relevant and dense) are preserved more fully, while low-importance chunks are compressed aggressively.
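The scoring and adaptive compression steps above can be pictured with a small sketch. Everything in it is an assumption made for illustration: the `information_density` proxy, the 0.7 similarity weight, and the 0.2 to 0.9 keep range are hypothetical and are not taken from the library, whose actual formula may differ.

```python
from dataclasses import dataclass


@dataclass
class ScoredChunk:
    text: str
    similarity: float  # cosine similarity to the query, 0-1
    density: float     # information density, 0-1


def information_density(text: str) -> float:
    """Hypothetical density proxy: share of distinct words in the chunk."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0


def importance(chunk: ScoredChunk, sim_weight: float = 0.7) -> float:
    """Assumed weighted blend of similarity and density (not the library's formula)."""
    return sim_weight * chunk.similarity + (1.0 - sim_weight) * chunk.density


def keep_ratio(score: float, low: float = 0.2, high: float = 0.9) -> float:
    """Map an importance score in [0, 1] to the fraction of tokens to keep."""
    score = min(max(score, 0.0), 1.0)  # clamp out-of-range scores
    return low + (high - low) * score


text = "rate limits are 100 requests per minute per key"
chunk = ScoredChunk(text, similarity=0.8, density=information_density(text))
print(f"importance={importance(chunk):.2f}, keep={keep_ratio(importance(chunk)):.2f}")
```

The point of the mapping is that a chunk scoring high on both signals keeps most of its tokens, while a marginal chunk is cut down to a small fraction.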
CLI Usage
# Index a directory
infershrink index ./docs
# Query with compression
infershrink query "How do I authenticate?" --compress
# Show importance scores
infershrink query "API limits" --scores
# Raw search (no compression)
infershrink search "rate limits"
Configuration
ts = TokenShrink(
    index_dir=".tokenshrink",   # Where to store the index
    model="all-MiniLM-L6-v2",   # Embedding model
    chunk_size=512,             # Words per chunk
    chunk_overlap=50,           # Overlap between chunks
    compression=True,           # Enable LLMLingua (if installed)
    adaptive=True,              # REFRAG-inspired adaptive compression
    dedup=True,                 # Remove near-duplicate chunks
    dedup_threshold=0.85,       # Cosine similarity threshold for dedup
)
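To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of word-based chunking with overlap. It assumes the overlap is also counted in words and that chunks are plain sliding windows; the helper name `chunk_words` is hypothetical and the library's own chunker may behave differently.

```python
def chunk_words(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size words that share chunk_overlap words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


doc = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_words(doc)
print(len(chunks))  # 3 windows, starting at words 0, 462 and 924
```

With the defaults each window advances by 462 words, so the last 50 words of one chunk reappear at the start of the next.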
Query Options
result = ts.query(
    "your question",
    k=5,               # Number of chunks to retrieve
    min_score=0.3,     # Minimum similarity score
    max_tokens=2000,   # Target token limit
    compress=True,     # Override compression setting
    adaptive=True,     # Override adaptive setting
    dedup=True,        # Override dedup setting
)
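As a rough picture of how `k` and `min_score` interact, the sketch below ranks chunks by cosine similarity, keeps the top `k`, and drops anything under the score floor. It is a pure-Python stand-in written for illustration, not the FAISS-backed search the library uses; `select_chunks` and the toy two-dimensional vectors are hypothetical, and `max_tokens` is deliberately not modeled.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_chunks(query_vec, chunks, k=5, min_score=0.3):
    """chunks is a list of (text, vector) pairs; returns (score, text), best first."""
    scored = [(cosine(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Take the k best matches, then discard those below the similarity floor.
    return [(score, text) for score, text in scored[:k] if score >= min_score]


chunks = [("limits", [1.0, 0.0]), ("billing", [0.0, 1.0]), ("quotas", [1.0, 1.0])]
for score, text in select_chunks([1.0, 0.0], chunks):
    print(f"{score:.2f} {text}")
```

Here "billing" is orthogonal to the query and falls below `min_score`, so only two of the three chunks come back even though `k` would allow five.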
Understanding Results
result = ts.query("rate limits")
# Basics
result.context # The combined, possibly compressed text
result.sources # List of source file paths
result.original_tokens # Token count before compression
result.compressed_tokens # Token count after compression
result.ratio # Compression ratio (0.3 = 70% reduction)
result.savings # Human-readable: "Saved 70% (1000 → 300 tokens)"
result.savings_pct # Just the percentage: 70.0
# Advanced (REFRAG features)
result.chunk_scores # List of ChunkScore objects
result.dedup_removed # Number of chunks removed as duplicates
# Each ChunkScore contains:
for cs in result.chunk_scores:
    cs.similarity          # Cosine similarity to query (0-1)
    cs.density             # Information density (0-1)
    cs.importance          # Combined score
    cs.compression_ratio   # Adaptive ratio for this chunk
    cs.deduplicated        # True if marked as duplicate
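The relationship between `ratio`, `savings_pct`, and `savings` follows from the two token counts. The helper below is a hypothetical reconstruction based only on the examples in the comments above (0.3 = 70% reduction); the library computes these fields itself, and its handling of an empty result is an assumption here.

```python
def compression_stats(original_tokens: int, compressed_tokens: int):
    """Derive ratio, savings percentage and a summary string from token counts."""
    if original_tokens <= 0:
        # Assumed convention for an empty result: nothing to compress, nothing saved.
        return 1.0, 0.0, "Saved 0% (0 → 0 tokens)"
    ratio = compressed_tokens / original_tokens
    savings_pct = 100 * (original_tokens - compressed_tokens) / original_tokens
    savings = f"Saved {savings_pct:.0f}% ({original_tokens} → {compressed_tokens} tokens)"
    return ratio, savings_pct, savings


ratio, pct, summary = compression_stats(1000, 300)
print(ratio, pct, summary)
```

A lower `ratio` therefore means stronger compression: `savings_pct` is simply `(1 - ratio) * 100`.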
Storage Requirements
- Embedding model: ~400MB (downloaded on first use)
- Index size: Roughly 4KB per chunk indexed
- Memory: The loaded model uses ~1GB of RAM during queries
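The per-chunk figure above allows a back-of-the-envelope estimate of index size before indexing a large corpus. This sketch assumes the default 512-word chunks with a 50-word overlap and treats "roughly 4KB per chunk" as exactly 4096 bytes; both function names are hypothetical.

```python
import math

BYTES_PER_CHUNK = 4 * 1024  # "roughly 4KB per chunk", taken as exactly 4096 bytes


def estimate_chunks(total_words: int, chunk_size: int = 512, chunk_overlap: int = 50) -> int:
    """Number of sliding windows needed to cover total_words."""
    if total_words <= 0:
        return 0
    if total_words <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap
    return 1 + math.ceil((total_words - chunk_size) / step)


def estimate_index_bytes(total_words: int, chunk_size: int = 512, chunk_overlap: int = 50) -> int:
    """Approximate on-disk index size for a corpus of total_words words."""
    return estimate_chunks(total_words, chunk_size, chunk_overlap) * BYTES_PER_CHUNK


size = estimate_index_bytes(1_000_000)
print(f"{size / 1024 / 1024:.1f} MiB")  # about 8.5 MiB for a million words
```

Under these assumptions the index stays small next to the embedding model, so the model download and its memory footprint dominate the resource budget.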
Best Practices
- Chunk size: 512 words works well for most content. Use smaller chunks for code and larger ones for prose.
- min_score: Default 0.3 is lenient. Raise to 0.5+ for stricter relevance.
- dedup_threshold: 0.85 catches near-duplicates without being overly aggressive.
- Index location: Use a persistent directory to avoid re-indexing on every run.
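To see what `dedup_threshold` controls, consider a greedy pass that keeps a chunk only if it is not too similar to any chunk already kept. This is one common way to remove near-duplicates and is offered as an assumption about the general technique, not as the library's implementation; `dedup_indices` and the toy vectors are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dedup_indices(vectors: list[list[float]], threshold: float = 0.85) -> list[int]:
    """Return indices of vectors to keep, scanning in order of appearance."""
    kept: list[int] = []
    for i, vec in enumerate(vectors):
        # Keep this vector only if it is below the threshold against every kept one.
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return kept


vectors = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
print(dedup_indices(vectors))         # [0, 2]: the second vector nearly repeats the first
print(dedup_indices(vectors, 0.999))  # [0, 1, 2]: a stricter threshold keeps all three
```

Lowering the threshold removes more chunks, including ones that merely cover the same topic; raising it toward 1.0 removes only near-verbatim repeats.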