# LLM KV-Cache Experiment

## Overview

This experiment demonstrates space-time tradeoffs in Large Language Model (LLM) attention mechanisms. By varying the KV-cache size, we show how modern AI systems implement Williams' √n pattern through techniques like Flash Attention.
## Background

### The Attention Mechanism

In transformers, attention computes:
```
Attention(Q,K,V) = softmax(QK^T/√d)V
```

For each new token, the model needs the keys (K) and values (V) of all previous tokens; the KV-cache stores them so they need not be recomputed at every step.
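
To make this concrete, here is a minimal NumPy sketch of a single decoding step reading from a KV-cache; the name `decode_step` and all shapes are illustrative assumptions, not the experiment's actual code:

```python
import numpy as np

def decode_step(q, kv_cache):
    # q        : (d,) query vector for the newly generated token
    # kv_cache : list of (k, v) pairs, one per previous token, each of shape (d,)
    K = np.stack([k for k, _ in kv_cache])  # (t, d) all cached keys
    V = np.stack([v for _, v in kv_cache])  # (t, d) all cached values
    scores = K @ q / np.sqrt(q.shape[0])    # QK^T / sqrt(d)
    w = np.exp(scores - scores.max())       # numerically stable softmax
    w /= w.sum()
    return w @ V                            # attention output, shape (d,)

rng = np.random.default_rng(0)
cache = [(rng.standard_normal(64), rng.standard_normal(64)) for _ in range(10)]
out = decode_step(rng.standard_normal(64), cache)  # attends over 10 past tokens
```

The cache grows by one (k, v) pair per generated token, which is exactly the O(n) memory that the strategies below try to reduce.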
### KV-Cache Strategies
1. **Full Cache O(n)**: Store all past keys/values
   - Maximum memory usage
   - No recomputation needed
   - Used in standard implementations

2. **Flash Attention O(√n)**: Store the most recent √n tokens
   - Balanced memory/compute
   - Recompute older tokens as needed (cost sketched below)
   - Used in production LLMs

3. **Minimal Cache O(1)**: Store almost nothing
   - Minimum memory usage
   - Maximum recomputation
   - Used in extremely memory-constrained settings
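
The following back-of-the-envelope sketch contrasts the recomputation cost of these three strategies; `recompute_cost` is an assumed name, and the eviction policy (keep only the last `window` entries) is a simplification, not the experiment's code:

```python
import math

def recompute_cost(n, window):
    # At step t there are t past tokens; the min(t, window) most recent are
    # cached, and the remaining t - window must be recomputed from scratch.
    return sum(max(0, t - window) for t in range(n))

n = 2048
for name, window in [("O(n) full", n), ("O(√n)", math.isqrt(n)), ("O(1)", 1)]:
    print(f"{name:10} window={window:5} recomputes={recompute_cost(n, window):,}")
```

Memory falls from n entries to √n to a constant while recomputation grows; this is the tradeoff the experiment measures.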
## Running the Experiment

```bash
python llm_kv_cache_experiment.py
```
The script simulates attention computation for sequences of 512, 1024, and 2048 tokens.
## Surprising Results

Our experiment revealed a counterintuitive finding:

| Cache Size | Memory  | Tokens/sec | Speedup |
|------------|---------|------------|---------|
| O(n) Full  | 12 MB   | 197        | 1.0×    |
| O(√n)      | 1.1 MB  | 1,349      | 6.8×    |
| O(1)       | 0.05 MB | 4,169      | 21.2×   |

**Smaller caches are FASTER!** Why?
1. **Memory bandwidth bottleneck**: Moving 12 MB of cached data is slower than recomputing it (see the benchmark below)
2. **Cache locality**: Small working sets fit in L2/L3 cache
3. **Modern CPUs**: Computation is cheap; memory access is expensive
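
A rough micro-benchmark of point 1 (illustrative only: the array sizes and the NumPy workload are assumptions, and results vary by machine) shows that streaming a full-cache-sized array from memory can cost as much as doing heavier arithmetic on a working set that fits in cache:

```python
import time
import numpy as np

big = np.random.rand(1_500_000)  # ~12 MB of float64, like the full KV-cache
small = np.random.rand(15_000)   # ~120 KB, fits comfortably in L2 cache

def bench(f, reps=500):
    start = time.perf_counter()
    for _ in range(reps):
        f()
    return (time.perf_counter() - start) / reps * 1e3  # ms per call

print(f"stream 12 MB once  : {bench(lambda: big.sum()):.3f} ms")
print(f"sin + sum on 120 KB: {bench(lambda: np.sin(small).sum()):.3f} ms")
```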
## Real-World Impact

This pattern is used in:

- **GPT-4**: Flash Attention enables 32K+ context windows
- **Claude**: Efficient attention for 100K+ tokens
- **Llama**: Open models with extended context
- **Mobile LLMs**: Running models on phones with limited memory
## Key Insights

1. Williams' bound assumes uniform memory access
2. Real systems have memory hierarchies
3. Sometimes recomputation is faster than memory access
4. The √n pattern emerges naturally as optimal (see the sketch below)
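
A toy model makes insight 4 plausible (a hypothetical cost model, not Williams' derivation): checkpoint every w-th token's K/V and recompute the rest from the nearest checkpoint. Memory is then roughly n/w checkpoints plus a w-token working block, and n/w + w is minimized at w = √n:

```python
import math

def memory_cost(n, w):
    # n / w checkpoints kept permanently, plus one block of w tokens
    # reconstructed on demand between checkpoints
    return n / w + w

n = 4096
best = min(range(1, n + 1), key=lambda w: memory_cost(n, w))
print(best, math.isqrt(n))  # both print 64: the optimum sits at sqrt(n)
```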
## Production Techniques

- **Flash Attention**: Fuses operations to minimize memory transfers
- **Paged Attention**: Virtual memory for the KV-cache
- **Multi-Query Attention**: Shares keys/values across heads
- **Sliding Window**: Fixed-size attention window (sketched below)
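
As a concrete example of the last technique, a sliding window restricts each token to the most recent `window` positions. A minimal mask sketch (illustrative, not any particular library's API):

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    # True where position i may attend to position j: j must be causal
    # (j <= i) and lie within the last `window` positions before i.
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

print(sliding_window_mask(6, 3).astype(int))  # banded lower-triangular pattern
```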
## Generated Files

- `llm_attention_tradeoff.png`: Performance visualization
- `llm_kv_cache_results.json`: Detailed metrics