![Open-source single-GPU reproductions of Cartridges and STILL for neural KV-cache compaction](https://bestaifor.s3.eu-west-1.amazonaws.com/open_source_single_gpu_reproductions_of_cartridges_featured_954b9a265f.jpg)
Neural KV-cache compaction — using learned compression rather than heuristic eviction — is one of the more credible paths to running long-context LLMs without bleeding GPU memory. Cartridges and STILL are two recent papers pushing this frontier, and community single-GPU reproductions have now made both accessible for independent evaluation. The benchmark gains in memory reduction are documented and reproducible. Whether the quality trade-offs hold at production context lengths, and across model families beyond the ones tested, remains a genuinely open question.
Memory in transformer inference is not a single problem. It has a shape. The KV cache — the store of attention key-value pairs for every processed token — grows linearly with sequence length, scales with model depth and head dimension, and sits entirely on GPU VRAM at inference time. For a 70B-parameter model at full precision, the KV cache alone at 128K context can exceed the memory budget of a single H100. You're either paying for more GPUs, or you're cutting context.
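The scaling is easy to make concrete with a back-of-envelope calculator. The sketch below assumes standard multi-head attention (no grouped-query sharing, which would shrink the KV head count) and fp16 storage; the config numbers are illustrative, not exact for any specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Total KV-cache size: 2 (K and V) x layers x KV heads x head_dim x tokens x bytes."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Rough figures for a 70B-class config (80 layers, 64 heads, head_dim 128)
# at 128K context in fp16, assuming full multi-head attention.
gib = kv_cache_bytes(80, 64, 128, 128 * 1024) / 2**30
print(f"{gib:.0f} GiB")  # 320 GiB -- far beyond a single 80 GB H100
```

GQA variants divide the KV head count (often by 8), which is exactly why the KV cache structure differences flagged later in this piece matter for transferability.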
Three strategies have dominated the practical response: token eviction (drop low-scoring tokens from the cache), quantization (compress the numerical precision of stored values), and prefill recomputation (don't store everything, recompute on demand). Each involves a different trade-off between throughput, quality, and engineering complexity.
Neural compaction is a fourth category. Instead of deciding which tokens to evict or how many bits to use, a learned encoder compresses the KV states into a dense latent space — smaller in size, but reconstructable (imperfectly) on demand. Cartridges and STILL are two of the most technically rigorous papers in this space. The fact that they are now reproducible on a single GPU is the specific development worth paying attention to, because it changes who can run the benchmark independently.
Cartridges frames the problem as learned document caching. A trained encoder compresses the KV cache for a chunk of context (a "cartridge") into a compact latent. At inference time, the decoder reconstructs an approximation of the original KV states from the latent before attention is computed. The key design decision is that the encoder is trained jointly with a reconstruction objective and an end-task distillation loss — so it learns what information in the KV states actually matters for downstream generation quality, not just raw reconstruction fidelity.
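A minimal numpy sketch of the idea — not the paper's actual architecture — shows the shape of the computation: a linear map compresses a chunk of KV vectors along the sequence axis into a shorter latent, and a second map reconstructs an approximation before attention runs. The matrices here are random stand-ins for what the paper trains end to end.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, ratio = 64, 32, 4          # chunk length, KV dim, compression ratio
W_enc = rng.normal(size=(T, T // ratio)) / np.sqrt(T)         # learned in practice
W_dec = rng.normal(size=(T // ratio, T)) / np.sqrt(T // ratio)

kv_chunk = rng.normal(size=(T, d))        # K or V states for one context chunk

# Compress along the sequence axis: 64 KV vectors -> 16 latent vectors.
latent = W_enc.T @ kv_chunk               # shape (T // ratio, d)

# Reconstruct an approximation of the original KV states on demand.
kv_approx = W_dec.T @ latent              # shape (T, d)

# Training would fit W_enc / W_dec with a reconstruction loss plus an
# end-task distillation loss, pushing reconstruction error onto tokens
# that matter least for downstream generation quality.
recon_loss = np.mean((kv_chunk - kv_approx) ** 2)
print(latent.shape, kv_approx.shape)
```

The joint objective is the load-bearing design choice: pure reconstruction loss would treat every KV position as equally important, which is exactly what the distillation term corrects.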
The original paper benchmarks on LLaMA-family models and reports 4–8× memory reduction at 4K–16K token contexts, with perplexity increases of less than 0.5 points on standard language modeling benchmarks. The single-GPU reproduction confirms these numbers hold for the 7B model size on A100 hardware.
STILL takes a different entry point. Rather than training an external encoder-decoder, STILL applies structured sparsity directly to the KV cache during inference, guided by a lightweight neural predictor trained to identify which KV positions are recoverable. The predictor adds a small inference-time cost — measured in milliseconds, not seconds — but allows the compression ratio to be dynamically adjusted per layer based on attention pattern statistics gathered during the training phase.
The practical difference from Cartridges: STILL doesn't require storing explicit compressed latents. It discards positions it predicts as recoverable and reconstructs them via a small learned fallback. This makes it more memory-aggressive at the cost of a lower quality floor — the reconstruction is probabilistic, not deterministic.
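The mechanism can be sketched in a few lines — again a toy illustration of the idea, not STILL's actual predictor or fallback module. A scoring head ranks positions, only the top fraction is stored, and dropped positions are filled in at read time (here by copying the nearest kept neighbor, where the real method would use a learned module).

```python
import numpy as np

rng = np.random.default_rng(1)
T, d, keep_ratio = 128, 32, 0.25       # sequence length, KV dim, fraction kept

kv = rng.normal(size=(T, d))
w_pred = rng.normal(size=d)            # stand-in for a trained scoring head

# Predictor scores each position; high score = "hard to reconstruct, keep it".
scores = kv @ w_pred
keep = np.sort(np.argsort(scores)[-int(T * keep_ratio):])

# Only the kept positions are stored; the rest are discarded outright.
kv_kept = kv[keep]

# A small learned fallback fills in dropped positions at read time.
# Nearest kept neighbor is a crude stand-in for that learned module.
reconstructed = np.empty_like(kv)
for t in range(T):
    nearest = keep[np.argmin(np.abs(keep - t))]
    reconstructed[t] = kv[nearest]
reconstructed[keep] = kv_kept

print(kv_kept.shape)   # 4x fewer positions stored than the full cache
```

The per-layer dynamic compression ratio the paper describes would correspond to choosing `keep_ratio` per layer from attention statistics, rather than fixing it globally as this sketch does.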
| Dimension | Cartridges | STILL | Notes |
|---|---|---|---|
| Compression approach | Learned encoder-decoder (explicit latent) | Structured sparsity + neural predictor | Fundamentally different architectures |
| Memory reduction (7B, 8K ctx) | ~6× (paper) / ~5.8× (reproduced) | ~4.5× (paper) / ~4.3× (reproduced) | Reproductions slightly below paper claims |
| Perplexity delta (WikiText-103) | +0.4 ppl (paper) / +0.5 ppl (reproduced) | +0.3 ppl (paper) / +0.4 ppl (reproduced) | Both within acceptable range |
| Training cost (7B model, A100 80GB) | ~18–24h for encoder training | ~8–12h for predictor training | STILL significantly cheaper to fine-tune |
| 32K context stability | Not tested in original paper | Partially tested, degradation noted | Both reproductions flag this as a gap |
| Drop-in compatibility | No — requires training pipeline change | No — requires training pipeline change | Neither replaces standard KV management |
| License | Open (reproduction) | Open (reproduction) | Original paper code: check per-repo |
| Single-GPU verified | Yes (A100 80GB) | Yes (A100 80GB) | RTX 4090 partial support reported |
The reproductions are honest about where the numbers diverge: both methods degrade more than advertised at context lengths above 32K, and both have only been verified on LLaMA-family architectures. Mistral and Qwen results are early-stage, with the reproduction authors flagging attention pattern differences that affect the trained predictors.
The original papers, like most research in this area, were benchmarked in environments that practitioners cannot easily replicate: multi-GPU clusters, custom CUDA kernels, and evaluation pipelines not released at submission time. This is not bad faith — it's the practical reality of research workflows. But it means that the benchmark numbers in the paper are, until reproduced independently, a single data point from a controlled environment.
Single-GPU reproductions change the epistemics here. When a community researcher posts "I ran Cartridges on an A100 80GB and got these perplexity numbers," and those numbers track within 5% of the paper's claims, that's meaningful confirmation. When they diverge — as they do at 32K+ context — that's equally meaningful signal. The benchmark becomes a live thing rather than a static table in a PDF.
This matters specifically for AI engineering teams evaluating whether to integrate neural compaction into their inference stack, because the integration decision is almost never made in a multi-GPU research environment. It's made by engineers who have one or two A100s, a production model, and a real latency budget.
Independent benchmark verification on commodity hardware surfaces failure modes that cluster experiments miss. The 32K context degradation in both reproductions is a good example: research evaluation often stays within the "sweet spot" context range that makes a method look best. Production inference does not have that luxury. The reproduction benchmark runs expose the edges.
There's also a tooling signal here. The reproduction authors have released evaluation harnesses with pluggable backends, which means researchers can now run comparative benchmarks between Cartridges, STILL, and existing approaches (like H2O and SnapKV) on their own hardware and datasets. This is the infrastructure that turns a paper into a field.
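To illustrate what "pluggable backends" buys, here is a hypothetical harness interface — the class and function names are assumptions for illustration, not the API of the released harnesses. The point is that baseline, Cartridges-style, and STILL-style backends can be benchmarked through one loop.

```python
from abc import ABC, abstractmethod

class CompactionBackend(ABC):
    """Hypothetical plug-in interface; the real harnesses define their own."""

    @abstractmethod
    def compress(self, kv_cache):
        """Return a compacted representation of the KV cache."""

    @abstractmethod
    def reconstruct(self, compacted):
        """Return KV states usable by the attention computation."""

class NoOpBackend(CompactionBackend):
    """Baseline: store everything, reconstruct nothing."""
    def compress(self, kv_cache):
        return kv_cache
    def reconstruct(self, compacted):
        return compacted

def run_benchmark(backend, kv_caches):
    """Toy loop: a real harness would score perplexity and memory per backend."""
    return [backend.reconstruct(backend.compress(kv)) for kv in kv_caches]

baseline = run_benchmark(NoOpBackend(), [[1.0, 2.0], [3.0]])
print(baseline)  # [[1.0, 2.0], [3.0]]
```

Swapping in an H2O- or SnapKV-style eviction backend behind the same interface is what makes apples-to-apples comparison on your own hardware tractable.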
Don't use these methods if your context lengths are under 4K tokens. The training overhead and inference reconstruction cost are not justified by the memory savings at short contexts. Standard KV quantization (e.g., INT8 or INT4 KV caches) is cheaper to implement and captures most of the available memory savings at that scale.
Don't assume the benchmark numbers transfer to your model architecture. Both Cartridges and STILL were trained and evaluated primarily on LLaMA-2 and LLaMA-3 variants. Architectures with grouped query attention (GQA), sliding window attention, or non-standard positional embeddings (Mistral, Mixtral) have different KV cache structures that the trained encoders/predictors may not handle without retraining.
Don't deploy without domain-matched training data. The neural compressor learns what information matters from the distribution it was trained on. If your production use case is code generation and the model was trained on web text, the compression will likely drop the wrong tokens. This is not a bug — it's a consequence of learned compression — but it's a failure mode that doesn't appear in the original benchmark suite.
Don't treat the reproduction numbers as production-ready benchmarks. The single-GPU reproductions confirm the core claims, but they use evaluation splits from the same datasets as the original papers. Held-out domain generalization benchmarks are still missing from the public reproduction record.
Don't skip the training infrastructure audit. Both methods require a training phase that modifies the model's inference path. This means your serving infrastructure needs to support the modified forward pass, which is non-trivial to integrate with vLLM, TGI, or other optimized serving backends without additional engineering.
The benchmark surface is about to get crowded. Cartridges and STILL are not the only KV-compaction methods in the pipeline. PyramidKV, MagicPIG, and SnapKV attack the same memory problem from adjacent angles, and cross-method benchmark comparisons on standardized hardware will become the evaluation norm over the next 12 months. The single-GPU reproduction culture emerging around Cartridges and STILL is setting a precedent for what reproducibility means in this subfield.
Serving stack integration is the next bottleneck. Right now, neither method integrates cleanly with vLLM's PagedAttention or SGLang's RadixAttention without custom patches. The community reproductions surface this clearly — the evaluation harnesses are standalone, not serving-stack integrations. Expect the next wave of engineering work to focus on production-compatible implementations rather than new compression architectures.
Multi-modal KV caches will stress-test these methods in new ways. Vision-language models have KV caches with substantially different statistical properties than text-only models — image patch tokens have high spatial redundancy that pure attention-score-based eviction misses. Neural compaction methods that learn from data may actually have an advantage here, but no benchmarks exist yet.
Quantization and neural compaction are likely to converge. Current implementations treat these as separate techniques. The more interesting research direction — already appearing in a few preprints — is training the neural compressor jointly with quantization-aware objectives, so the latent space is optimized for both size and bit-width simultaneously.
The reproducibility norm is shifting. Papers that don't release code or that benchmark only on proprietary hardware will face increasing skepticism from reviewers and practitioners. The Cartridges and STILL reproduction projects are partly a response to that pressure, and they're raising the floor for what "open-source" means in the inference optimization space.
Does neural KV-cache compaction work without retraining the base LLM? Yes, with caveats. The base model weights are frozen. You're training the encoder (Cartridges) or predictor (STILL) as separate modules that intercept the KV cache during the attention computation. However, "no retraining" is slightly misleading — you're still running a supervised training pass on domain data to calibrate the compressor, which requires GPU compute and representative examples. It's closer to adapter training than fine-tuning in terms of cost.
How does this compare to SnapKV and H2O on the same benchmark tasks? The reproduction benchmarks include partial comparisons. On standard language modeling tasks (WikiText-103, PG-19), Cartridges outperforms H2O at equivalent compression ratios by 0.2–0.4 perplexity points. STILL is closer to SnapKV in quality. The more important comparison — on long-document QA tasks like SCROLLS — is less clear because the reproductions don't fully cover that benchmark suite. Current evidence doesn't confirm which method wins at task-specific long-context evaluation.
Can these methods be combined with KV cache quantization? In principle, yes. In practice, the current open-source implementations don't support combined pipelines. Running INT8 quantization on the already-compressed latents is theoretically sound — you'd stack two compression mechanisms — but the interaction effects on quality haven't been benchmarked. This is a real gap in the current reproduction work.
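To make the "stack two compression mechanisms" idea concrete, here is a symmetric per-tensor INT8 round-trip applied to a compressed latent — a sketch of the pipeline the text describes as theoretically sound but unbenchmarked. The latent here is random data standing in for a real compressor's output.

```python
import numpy as np

def int8_roundtrip(latent):
    """Symmetric per-tensor INT8 quantize/dequantize of a compressed latent."""
    scale = np.abs(latent).max() / 127.0
    q = np.clip(np.round(latent / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale, q

rng = np.random.default_rng(2)
latent = rng.normal(size=(16, 32)).astype(np.float32)   # stand-in compressed latent

dequant, q = int8_roundtrip(latent)
err = np.abs(latent - dequant).max()
# Storage drops another 4x (fp32 -> int8); the open question is how this
# quantization noise compounds with the decoder's reconstruction error.
print(q.nbytes, latent.nbytes, f"max abs error {err:.4f}")
```

The unmeasured interaction effect is exactly this: the decoder was trained on clean latents, so even small quantization noise may land in directions the decoder amplifies.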
Is an A100 80GB required, or can this run on consumer hardware? The reproduction authors report partial success on RTX 4090 (24GB VRAM) for 7B models with aggressive batch size reduction. Inference with the trained compressor fits on a 4090; training the encoder or predictor does not, at least not without gradient checkpointing and significant throughput reduction. For 13B+ models, the A100 80GB is effectively the minimum for training.
What's the realistic throughput overhead at serving time? Cartridges adds a decoder step before each attention computation, which the reproduction authors measure at roughly 8–15% latency overhead for 7B models at 8K context. STILL's predictor overhead is lower — around 3–7% — because it's a lightweight scoring pass, not a full decode. Both are within acceptable margins for memory-constrained deployments where the alternative is adding a second GPU.
Are the reproductions peer-reviewed? No. They are community engineering reproductions, not peer-reviewed publications. The value is empirical verification of paper claims on accessible hardware, not novel scientific contribution. Treat the numbers as a second data point, not an authoritative benchmark.
Should I wait for vLLM integration before evaluating these methods? If you're evaluating for production deployment, yes — the standalone evaluation harnesses are too far from a serving stack to give realistic throughput numbers. If you're evaluating for research or feasibility, the current reproductions are sufficient to form a view on quality trade-offs. The serving stack integration question is separable from the compression quality question.