
TurboAgents
Supercharge Your Agents with TurboQuant
Keep your stack. Add compressed retrieval.
TurboQuant-style KV-cache and vector compression for agent and RAG systems. Works with your existing vector store, retrieval layer, and agent runtime. Framework-agnostic. Benchmark-first.
How It Works
Sits Between Your Embeddings and Your Results
TurboAgents applies compressed scoring and reranking between your vector database and your agent or RAG pipeline. Simple by design.
Your Vector DB
Chroma · FAISS · pgvector · LanceDB · SurrealDB
TurboAgents
Compressed scoring & reranking
Your Agent / RAG
Final optimized results
The design principle: keep your framework, keep your vector database, add TurboAgents where retrieval cost and memory pressure start to hurt. No giant migration plan required.
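The compressed-scoring-and-reranking stage can be pictured with plain NumPy. The sketch below is illustrative of the pattern only, not the TurboAgents API: all documents are scored cheaply against an int8 payload, then a short candidate list is reranked with the full-precision vectors. Every function name here is hypothetical.

```python
import numpy as np

def compress_int8(vectors):
    # Per-row symmetric int8 quantization; one float scale per vector.
    scale = np.abs(vectors).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0
    codes = np.round(vectors / scale).astype(np.int8)
    return codes, scale.astype(np.float32)

def search_then_rerank(query, vectors, codes, scales, k=5, shortlist=20):
    # Stage 1: cheap approximate scoring against the compressed payload.
    approx = (codes.astype(np.float32) * scales) @ query
    candidates = np.argsort(-approx)[:shortlist]
    # Stage 2: exact rerank of the shortlist with full-precision vectors.
    exact = vectors[candidates] @ query
    return candidates[np.argsort(-exact)[:k]]

rng = np.random.default_rng(0)
docs = rng.standard_normal((1000, 64)).astype(np.float32)
q = rng.standard_normal(64).astype(np.float32)
codes, scales = compress_int8(docs)
top = search_then_rerank(q, docs, codes, scales)
```

A real deployment would score directly on the compressed codes rather than dequantizing, but the two-stage shape, coarse scan then exact rerank, is the point.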
See the full architecture
Quantization pipeline, adapter integration, and benchmark methodology - all documented.
What TurboAgents Does
Compressed retrieval, vector reranking, KV-style optimization, and benchmark-first validation - all in one package.
KV-Cache Compression
Walsh-Hadamard rotation with PolarQuant-style angle/radius encoding. Compress KV-cache for extended context windows in local and server inference.
Vector Payload Compression
Reduce vector storage costs and scale retrieval systems without replacing your existing vector database backend.
Quality/Latency Benchmarking
Explicit, measurable recall metrics with CLI-driven benchmarks. Know exactly what you trade before you commit.
Multi-Backend Adapters
Validated adapters for Chroma, FAISS, LanceDB, pgvector, and SurrealDB. Preserves your existing infrastructure investment.
MLX & llama.cpp Support
First-class MLX integration with a validated 3.5-bit sweet spot on 3B models. Experimental vLLM support for server-side inference.
Validated Results
Every result is validated, documented, and reproducible. Real recall metrics across all supported backends.
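To make the KV-cache card above concrete, here is a toy sketch of the rotate-then-encode idea: apply a normalized Walsh-Hadamard rotation, then quantize consecutive coordinate pairs as (radius, angle), loosely in the PolarQuant spirit. This is not TurboAgents internals; the function names and bit-widths are illustrative assumptions.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two. The normalized
    # matrix is symmetric and orthogonal, so it is its own inverse.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def polar_encode(x, radius_bits=4, angle_bits=4):
    # Rotate to spread outliers, then encode coordinate pairs in polar form.
    H = hadamard(x.shape[-1])
    pairs = (x @ H).reshape(-1, 2)
    r = np.linalg.norm(pairs, axis=1)
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])          # in [-pi, pi]
    r_max = float(r.max()) or 1.0
    q_r = np.round(r / r_max * (2**radius_bits - 1)).astype(np.uint8)
    q_t = np.round((theta + np.pi) / (2 * np.pi) * (2**angle_bits - 1)).astype(np.uint8)
    return q_r, q_t, r_max

def polar_decode(q_r, q_t, r_max, dim, radius_bits=4, angle_bits=4):
    r = q_r / (2**radius_bits - 1) * r_max
    theta = q_t / (2**angle_bits - 1) * 2 * np.pi - np.pi
    pairs = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    return pairs.reshape(-1, dim) @ hadamard(dim)         # undo the rotation
```

At 4+4 bits per pair this already reconstructs random vectors with high cosine fidelity; the production trade-off is what the benchmarks quantify.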
Validated Results
Backend Benchmark Coverage
Every adapter is benchmarked with real recall metrics.
Full benchmark methodology and results at superagenticai.github.io/turboagents/benchmarks
Run the benchmarks yourself
CLI-driven benchmarks with turboagents bench rag. Compare backends, measure tradeoffs, decide with data.
Native Adapters
Built-In Vector Store Adapters
Drop-in compressed retrieval for the vector databases you already use. Each adapter wraps the native client and adds TurboQuant-style scoring and reranking.
Chroma
TurboChroma
Persistent compressed retrieval with collection-level storage. Wraps the Chroma client and adds quantized scoring without changing your schema.
LanceDB
TurboLanceDB
Local or remote LanceDB with table-level compressed retrieval. URI-based configuration for flexible deployment topologies.
SurrealDB
TurboSurrealDB
WebSocket-connected SurrealDB with namespace and database isolation. Compressed vector storage for multi-tenant agent deployments.
FAISS
TurboFAISS
In-memory compressed retrieval. No persistence needed.
pgvector
TurboPgvector
PostgreSQL-native compressed retrieval for production setups.
Beyond Retrieval
Inference Runtime Engines
TurboAgents compresses KV-cache payloads so local and server-side inference can hold more context. Wrapper-first integration with MLX, llama.cpp, and vLLM - without pretending native kernels exist where they do not.
MLX
Apple Silicon
3.5 bits identified as the best quality/performance tradeoff on Llama-3.2-3B-Instruct-4bit. First-class local serving via turboagents serve --backend mlx.
llama.cpp
Cross-Platform
Wrapper for building and inspecting runtime command paths. Integrates with existing GGUF model workflows for compressed inference on commodity hardware.
vLLM
Server-Side
Plugin scaffold for vLLM-based server deployments. Dry-run mode builds the correct serve commands for production exploration.
❯ # Local MLX inference
❯ turboagents serve --backend mlx --model mlx-community/Qwen3-0.6B-4bit --dry-run
❯ # llama.cpp on GGUF models
❯ turboagents serve --backend llamacpp --model model.gguf --dry-run
❯ # vLLM server-side
❯ turboagents serve --backend vllm --model meta-llama/Llama-3.1-8B-Instruct --dry-run
Explore the runtime engine layer
Architecture docs cover the wrapper-first strategy, runtime detection, and how compression integrates with each inference backend.
Developer Experience
CLI-First Toolkit
Environment inspection, benchmarking, compression, and runtime serving - all from a single CLI entry point.
turboagents doctor
System diagnostics - platform, Python version, optional packages, and adapter status for llama.cpp, MLX, and vLLM.
turboagents bench kv --format json
Synthetic KV-style reconstruction metrics across bit-widths. Measure compression quality before committing.
turboagents bench rag --format markdown
Synthetic retrieval metrics across bit-widths for all supported vector backends.
turboagents compress --input vectors.npy --bits 3.5 --head-dim 128
Direct vector compression to serialized payloads. Specify bit-width, head dimension, and seed for reproducibility.
turboagents serve --backend mlx --model [model] --dry-run
Build and inspect runtime serve commands for MLX, llama.cpp, or vLLM backends. Dry-run by default.
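What a bench like bench kv measures can be approximated in a few lines: sweep bit-widths over the same tensor and compare reconstruction error. This is a hypothetical uniform quantizer for intuition only, not the tool's actual metric; fractional bit-widths such as 3.5 are modeled by rounding the level count.

```python
import numpy as np

def quantize_dequantize(x, bits):
    # Uniform quantization over the tensor's range; fractional bit-widths
    # (e.g. 3.5) are modeled by rounding the number of levels.
    levels = int(round(2 ** bits)) - 1
    lo, hi = x.min(), x.max()
    q = np.round((x - lo) / (hi - lo) * levels)
    return q / levels * (hi - lo) + lo

rng = np.random.default_rng(0)
kv = rng.standard_normal((4, 128))  # stand-in for one KV block
for bits in (2, 3, 3.5, 4, 8):
    mse = float(np.mean((kv - quantize_dequantize(kv, bits)) ** 2))
    print(f"{bits}-bit  mse={mse:.6f}")
```

Error falls monotonically with bit-width; the interesting question the real benchmarks answer is where added bits stop paying for themselves on your model and data.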
Three Ways to Integrate
TurboAgents works as an agent runtime layer, RAG middleware, or standalone evaluation tooling - depending on where you need compression.
Agent Runtime Layer
turboagents.engines.*
Add beneath existing agent frameworks using turboagents.engines.mlx, llamacpp, or vllm. Compress KV-cache payloads so inference holds more context without modifying application-level agent code.
RAG Middleware
turboagents.rag.*
Integrate at the retrieval layer using TurboFAISS, TurboChroma, TurboLanceDB, TurboSurrealDB, or TurboPgvector. Native client or sidecar patterns depending on the backend.
Evaluation Tooling
turboagents bench / compress
Use the CLI for compression fit assessment before deeper integration. Run bench kv, bench rag, or bench paper to understand tradeoffs with your specific data.
See all integration patterns
The architecture docs cover native vs sidecar patterns, engine wrappers, and how to evaluate compression fit for your specific stack.
Get Started
Install in Seconds. Benchmark Immediately.
Three install paths depending on what you need. Then run your first benchmark.
Core install
❯ uv add turboagents
Add retrieval adapters or MLX support
❯ uv add "turboagents[rag]"   # Retrieval adapters
❯ uv add "turboagents[mlx]"   # MLX support
❯ uv add "turboagents[all]"   # Everything
Run your first benchmark
❯ turboagents bench rag --backend chroma
Full setup guide and examples
Adapter configuration, benchmark options, framework integration patterns, and MLX optimization - all in the docs.
Reference Integration
TurboAgents + SuperOptiX
TurboAgents is standalone, but SuperOptiX is the first full reference integration - end-to-end compressed retrieval inside a real agent framework.
turboagents-chroma
Chroma-backed compressed retrieval
turboagents-faiss
FAISS-backed compressed retrieval
turboagents-lancedb
LanceDB-backed compressed retrieval
turboagents-surrealdb
SurrealDB-backed compressed retrieval
❯ # Install SuperOptiX with TurboAgents
❯ uv pip install "superoptix[turboagents]"
TurboAgents can operate as standalone infrastructure or power GEPA vector-store backends and shared RAG retrievers inside SuperOptiX. Four validated backends, full playbook integration via YAML RAG blocks.
Deep Dive
Read the full announcement for the design philosophy and the story behind TurboAgents.
See It in Action
TurboAgents Demo
Watch the walkthrough covering installation, CLI benchmarks, adapter integration, and compressed retrieval in practice.
Ready to Compress Your Retrieval Stack?
Install TurboAgents, run it against your vector store, and see where it helps. Docs, benchmarks, adapters, and examples - everything is ready.

