# LLM KV-Cache Experiment

## Overview

This experiment demonstrates space-time tradeoffs in Large Language Model (LLM) attention mechanisms. By varying the KV-cache size, we show how modern AI systems implement Williams' √n pattern through techniques like Flash Attention.
## Background

### The Attention Mechanism

In transformers, attention computes:
```
Attention(Q,K,V) = softmax(QK^T/√d)V
```

For each new token, the model needs the keys (K) and values (V) of all previous tokens; the KV-cache stores them so they need not be recomputed at every step.
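
To make this concrete, here is a minimal NumPy sketch of a single decoding step reading from a KV-cache; the name `decode_step` and all shapes are illustrative assumptions, not the experiment's actual code:

```python
import numpy as np

def decode_step(q, kv_cache):
    # q        : (d,) query vector for the newly generated token
    # kv_cache : list of (k, v) pairs, one per previous token, each of shape (d,)
    K = np.stack([k for k, _ in kv_cache])  # (t, d) all cached keys
    V = np.stack([v for _, v in kv_cache])  # (t, d) all cached values
    scores = K @ q / np.sqrt(q.shape[0])    # QK^T / sqrt(d)
    w = np.exp(scores - scores.max())       # numerically stable softmax
    w /= w.sum()
    return w @ V                            # attention output, shape (d,)

rng = np.random.default_rng(0)
cache = [(rng.standard_normal(64), rng.standard_normal(64)) for _ in range(10)]
out = decode_step(rng.standard_normal(64), cache)  # attends over 10 past tokens
```

The cache grows by one (k, v) pair per generated token, which is exactly the O(n) memory that the strategies below try to reduce.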
### KV-Cache Strategies
1. **Full Cache O(n)**: Store all past keys/values
   - Maximum memory usage
   - No recomputation needed
   - Used in standard implementations

2. **Flash Attention O(√n)**: Store the most recent √n tokens
   - Balanced memory/compute
   - Recompute older tokens as needed (cost sketched below)
   - Used in production LLMs

3. **Minimal Cache O(1)**: Store almost nothing
   - Minimum memory usage
   - Maximum recomputation
   - Used in extremely memory-constrained settings
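
The following back-of-the-envelope sketch contrasts the recomputation cost of these three strategies; `recompute_cost` is an assumed name, and the eviction policy (keep only the last `window` entries) is a simplification, not the experiment's code:

```python
import math

def recompute_cost(n, window):
    # At step t there are t past tokens; the min(t, window) most recent are
    # cached, and the remaining t - window must be recomputed from scratch.
    return sum(max(0, t - window) for t in range(n))

n = 2048
for name, window in [("O(n) full", n), ("O(√n)", math.isqrt(n)), ("O(1)", 1)]:
    print(f"{name:10} window={window:5} recomputes={recompute_cost(n, window):,}")
```

Memory falls from n entries to √n to a constant while recomputation grows; this is the tradeoff the experiment measures.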
## Running the Experiment

```bash
python llm_kv_cache_experiment.py
```
The script simulates attention computation for sequences of 512, 1024, and 2048 tokens.
## Surprising Results

Our experiment revealed a counterintuitive finding:

| Cache Size | Memory  | Tokens/sec | Speedup |
|------------|---------|------------|---------|
| O(n) Full  | 12 MB   | 197        | 1.0×    |
| O(√n)      | 1.1 MB  | 1,349      | 6.8×    |
| O(1)       | 0.05 MB | 4,169      | 21.2×   |

**Smaller caches are FASTER!** Why?
1. **Memory bandwidth bottleneck**: Moving 12 MB of cached data is slower than recomputing it (see the benchmark below)
2. **Cache locality**: Small working sets fit in L2/L3 cache
3. **Modern CPUs**: Computation is cheap; memory access is expensive
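
A rough micro-benchmark of point 1 (illustrative only: the array sizes and the NumPy workload are assumptions, and results vary by machine) shows that streaming a full-cache-sized array from memory can cost as much as doing heavier arithmetic on a working set that fits in cache:

```python
import time
import numpy as np

big = np.random.rand(1_500_000)  # ~12 MB of float64, like the full KV-cache
small = np.random.rand(15_000)   # ~120 KB, fits comfortably in L2 cache

def bench(f, reps=500):
    start = time.perf_counter()
    for _ in range(reps):
        f()
    return (time.perf_counter() - start) / reps * 1e3  # ms per call

print(f"stream 12 MB once  : {bench(lambda: big.sum()):.3f} ms")
print(f"sin + sum on 120 KB: {bench(lambda: np.sin(small).sum()):.3f} ms")
```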
## Real-World Impact

This pattern is used in:

- **GPT-4**: Flash Attention enables 32K+ context windows
- **Claude**: Efficient attention for 100K+ tokens
- **Llama**: Open models with extended context
- **Mobile LLMs**: Running models on phones with limited memory
## Key Insights

1. Williams' bound assumes uniform memory access
2. Real systems have memory hierarchies
3. Sometimes recomputation is faster than memory access
4. The √n pattern emerges naturally as optimal (see the sketch below)
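
A toy model makes insight 4 plausible (a hypothetical cost model, not Williams' derivation): checkpoint every w-th token's K/V and recompute the rest from the nearest checkpoint. Memory is then roughly n/w checkpoints plus a w-token working block, and n/w + w is minimized at w = √n:

```python
import math

def memory_cost(n, w):
    # n / w checkpoints kept permanently, plus one block of w tokens
    # reconstructed on demand between checkpoints
    return n / w + w

n = 4096
best = min(range(1, n + 1), key=lambda w: memory_cost(n, w))
print(best, math.isqrt(n))  # both print 64: the optimum sits at sqrt(n)
```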
## Production Techniques

- **Flash Attention**: Fuses operations to minimize memory transfers
- **Paged Attention**: Virtual memory for the KV-cache
- **Multi-Query Attention**: Shares keys/values across heads
- **Sliding Window**: Fixed-size attention window (sketched below)
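
As a concrete example of the last technique, a sliding window restricts each token to the most recent `window` positions. A minimal mask sketch (illustrative, not any particular library's API):

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    # True where position i may attend to position j: j must be causal
    # (j <= i) and lie within the last `window` positions before i.
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

print(sliding_window_mask(6, 3).astype(int))  # banded lower-triangular pattern
```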
## Generated Files

- `llm_attention_tradeoff.png`: Performance visualization
- `llm_kv_cache_results.json`: Detailed metrics