# LLM KV-Cache Experiment
## Overview
This experiment demonstrates space-time tradeoffs in Large Language Model (LLM) attention mechanisms. By varying the KV-cache size, we show how modern AI systems implement Williams' √n pattern through techniques like Flash Attention.
## Background
### The Attention Mechanism
In transformers, attention computes:
```
Attention(Q,K,V) = softmax(QK^T/√d)V
```
For each new token, attention needs the keys (K) and values (V) of all previous tokens, which is why implementations cache them.
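As a minimal illustration (NumPy, single head, illustrative dimensions only), scaled dot-product attention for one new query over the cached keys and values might look like this:
```python
import numpy as np

def attention(q, K, V):
    """Scaled dot-product attention for a single query vector.

    q: (d,)    query for the newest token
    K: (t, d)  cached keys for all previous tokens
    V: (t, d)  cached values for all previous tokens
    """
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)              # similarity of q to each past key
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ V                       # weighted sum of cached values

# Toy example: 8 cached tokens, head dimension 4
rng = np.random.default_rng(0)
K, V = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
q = rng.normal(size=4)
print(attention(q, K, V))
```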
### KV-Cache Strategies
1. **Full Cache O(n)**: Store all past keys/values
- Maximum memory usage
- No recomputation needed
- Used in standard implementations
2. **Flash Attention O(√n)**: Store recent √n tokens
- Balanced memory/compute
- Recompute older tokens as needed
- Used in production LLMs
3. **Minimal Cache O(1)**: Store almost nothing
- Minimum memory usage
- Maximum recomputation
- Used in extreme memory-constrained settings
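A toy sketch of how these three policies might differ in what they keep resident (illustrative only; the experiment script may implement caching differently):
```python
import math

class KVCache:
    """Toy KV-cache that keeps only the most recent `budget(t)` entries.

    Evicted entries would have to be recomputed from the original tokens
    whenever an attention call needs them again.
    """
    def __init__(self, policy="sqrt"):
        self.policy = policy
        self.entries = []                     # (key, value) pairs, oldest first

    def budget(self, t):
        if self.policy == "full":
            return t                          # O(n): keep every past token
        if self.policy == "sqrt":
            return max(1, math.isqrt(t))      # O(√n): keep the last √t tokens
        return 1                              # O(1): keep almost nothing

    def append(self, k, v):
        self.entries.append((k, v))
        t = len(self.entries)
        # Evict the oldest entries beyond the budget.
        self.entries = self.entries[-self.budget(t):]
```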
## Running the Experiment
```bash
python llm_kv_cache_experiment.py
```
The script simulates attention computation for sequence lengths of 512, 1024, and 2048 tokens.
## Surprising Results
Our experiment revealed a counterintuitive finding:
| Cache Size | Memory  | Tokens/sec | Speedup |
|------------|---------|------------|---------|
| O(n) Full  | 12 MB   | 197        | 1.0×    |
| O(√n)      | 1.1 MB  | 1,349      | 6.8×    |
| O(1)       | 0.05 MB | 4,169      | 21.2×   |
**Smaller caches are FASTER!** Why?
1. **Memory bandwidth bottleneck**: Moving 12MB of data is slower than recomputing
2. **Cache locality**: Small working sets fit in L2/L3 cache
3. **Modern CPUs**: Computation is cheap, memory access is expensive
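You can get a feel for the bandwidth effect on your own machine with a rough micro-benchmark (not the experiment itself; array sizes are illustrative):
```python
import time
import numpy as np

big = np.random.rand(1_500_000)    # ~12 MB of float64: spills out of L2/L3 cache
small = np.random.rand(64, 64)     # ~32 KB: fits comfortably in cache

def timeit(fn, reps=200):
    t0 = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - t0) / reps

stream = timeit(lambda: big.sum())        # dominated by memory traffic
compute = timeit(lambda: small @ small)   # dominated by arithmetic
print(f"stream 12 MB: {stream*1e6:.1f} µs, 64x64 matmul: {compute*1e6:.1f} µs")
```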
## Real-World Impact
This pattern is used in:
- **GPT-4**: Flash Attention enables 32K+ context windows
- **Claude**: Efficient attention for 100K+ tokens
- **Llama**: Open models with extended context
- **Mobile LLMs**: Running models on phones with limited memory
## Key Insights
1. Williams' bound assumes uniform memory access
2. Real systems have memory hierarchies
3. Sometimes recomputation is faster than memory access
4. The √n pattern emerges naturally as optimal
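To put numbers on the √n pattern, here is the cache-size arithmetic for the sequence lengths used above, assuming a single head of dimension 64 with fp32 keys and values (these dimensions are illustrative, not necessarily the experiment's):
```python
import math

D_HEAD, BYTES = 64, 4   # assumed head dimension and fp32 element size

for n in (512, 1024, 2048):
    full = n * D_HEAD * 2 * BYTES              # keys + values for every token
    sqrt = math.isqrt(n) * D_HEAD * 2 * BYTES  # keys + values for the last √n tokens
    print(f"n={n:5d}  full={full/1024:7.1f} KiB  "
          f"sqrt={sqrt/1024:6.1f} KiB  ratio={full/sqrt:5.1f}x")
```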
## Production Techniques
- **Flash Attention**: Fuses operations to minimize memory transfers
- **Paged Attention**: Virtual memory for KV-cache
- **Multi-Query Attention**: Shares keys/values across heads
- **Sliding Window**: Fixed-size attention window
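Of these, the sliding window is the simplest to sketch: each query attends only to the most recent `w` positions (a simplified single-head illustration, not any particular library's implementation):
```python
import numpy as np

def sliding_window_attention(Q, K, V, w=4):
    """Causal attention where token i only sees tokens max(0, i-w+1) .. i."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)               # all-pairs scores, shape (n, n)
    idx = np.arange(n)
    # Mask out future tokens and anything older than the window.
    mask = (idx[None, :] > idx[:, None]) | (idx[None, :] <= idx[:, None] - w)
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

x = np.random.default_rng(1).normal(size=(8, 4))   # 8 tokens, head dim 4
print(sliding_window_attention(x, x, x).shape)      # (8, 4)
```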
## Generated Files
- `llm_attention_tradeoff.png`: Performance visualization
- `llm_kv_cache_results.json`: Detailed metrics