# LLM KV-Cache Experiment
## Overview

This experiment demonstrates space-time tradeoffs in Large Language Model (LLM) attention mechanisms. By varying the KV-cache size, we show how modern AI systems implement Williams' √n pattern through techniques like Flash Attention.
## Background

### The Attention Mechanism

In transformers, attention computes:

```
Attention(Q,K,V) = softmax(QK^T/√d)V
```

For each new token, we need the K and V matrices of all previous tokens, so the KV-cache grows with sequence length.
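
The minimal sketch below (shapes, names, and random inputs are illustrative assumptions, not the experiment's actual code) shows how this cache grows during incremental decoding:

```python
# Illustrative single-head incremental decoding with a full KV-cache.
# Shapes, names, and random data are assumptions for demonstration only.
import numpy as np

d = 64  # head dimension per token

def attend(q, K, V):
    """Attention of one new query against all cached keys/values."""
    scores = K @ q / np.sqrt(d)              # QK^T / sqrt(d)
    weights = np.exp(scores - scores.max())  # softmax over all past tokens
    weights /= weights.sum()
    return weights @ V                       # weighted sum of cached values

K_cache = np.empty((0, d))                   # keys of all tokens seen so far
V_cache = np.empty((0, d))                   # values of all tokens seen so far
for step in range(8):                        # decode 8 tokens
    q = np.random.rand(d)                    # query for the new token
    k, v = np.random.rand(d), np.random.rand(d)
    K_cache = np.vstack([K_cache, k])        # cache grows by one row per token: O(n)
    V_cache = np.vstack([V_cache, v])
    out = attend(q, K_cache, V_cache)
    print(f"token {step}: {len(K_cache)} cached K/V rows")
```

Note that `K_cache` and `V_cache` gain one row per generated token, which is exactly the O(n) memory that the full-cache strategy pays.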
### KV-Cache Strategies

1. **Full Cache O(n)**: Store all past keys/values
   - Maximum memory usage
   - No recomputation needed
   - Used in standard implementations

2. **Flash Attention O(√n)**: Store the most recent √n tokens
   - Balanced memory/compute
   - Recompute older tokens as needed
   - Used in production LLMs

3. **Minimal Cache O(1)**: Store almost nothing
   - Minimum memory usage
   - Maximum recomputation
   - Used in extremely memory-constrained settings (the three budgets are compared in the sketch below)
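
A minimal sketch of how these three cache budgets could be expressed (the function name and the O(1) window of 4 tokens are assumptions, not taken from `llm_kv_cache_experiment.py`):

```python
# Illustrative cache-budget policies; names and the O(1) window size are assumptions.
import math

def cache_budget(strategy: str, n_tokens: int) -> int:
    """How many past tokens keep their K/V resident in the cache."""
    if strategy == "full":      # O(n): keep everything, never recompute
        return n_tokens
    if strategy == "sqrt":      # O(sqrt(n)): keep only the most recent sqrt(n) tokens
        return max(1, math.isqrt(n_tokens))
    if strategy == "minimal":   # O(1): keep a small constant window (4 is an arbitrary choice)
        return 4
    raise ValueError(f"unknown strategy: {strategy!r}")

for n in (512, 1024, 2048):     # the sequence lengths the experiment simulates
    budgets = {s: cache_budget(s, n) for s in ("full", "sqrt", "minimal")}
    # Tokens outside the budget must have their K/V recomputed when attended to.
    print(n, budgets)
```

For n = 2048 this keeps 2048, 45, and 4 tokens resident, respectively.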
## Running the Experiment

```bash
python llm_kv_cache_experiment.py
```

The script simulates attention computation for sequences of 512, 1024, and 2048 tokens.
## Surprising Results

Our experiment revealed a counterintuitive finding:

| Cache Size | Memory  | Tokens/sec | Speedup |
|------------|---------|------------|---------|
| O(n) Full  | 12 MB   | 197        | 1.0×    |
| O(√n)      | 1.1 MB  | 1,349      | 6.8×    |
| O(1)       | 0.05 MB | 4,169      | 21.2×   |

**Smaller caches are FASTER!** Why?
1. **Memory bandwidth bottleneck**: Moving 12 MB of cache through memory can be slower than recomputing the needed values
2. **Cache locality**: Small working sets fit in the CPU's L2/L3 caches
3. **Modern CPUs**: Computation is cheap; memory access is expensive (illustrated by the micro-benchmark sketch below)
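
A rough, machine-dependent micro-benchmark sketch of that cost gap (array sizes, names, and counts are arbitrary assumptions; absolute numbers will differ on your hardware):

```python
# Compare fetching precomputed values from a large table (cache-miss-heavy gathers)
# with simply recomputing them. Not part of the experiment script; results vary by machine.
import time
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(1_000_000)                  # inputs we could recompute from
table = rng.random(8_000_000)              # ~64 MB of precomputed values, larger than most L3 caches
idx = rng.integers(0, table.size, x.size)  # random positions -> mostly cache misses

def best_of(fn, reps=10):
    times = []
    for _ in range(reps):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return min(times)

t_lookup = best_of(lambda: table[idx])     # "remember": fetch stored values from DRAM
t_recompute = best_of(lambda: x * x)       # "recompute": one multiply per element

print(f"look up 1M stored values : {t_lookup * 1e3:.2f} ms")
print(f"recompute 1M values      : {t_recompute * 1e3:.2f} ms")
```

On most machines the gather ends up several times slower than the multiply, even though it performs no arithmetic at all.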
## Real-World Impact

This pattern is used in:

- **GPT-4**: Flash Attention enables 32K+ context windows
- **Claude**: Efficient attention for 100K+ tokens
- **Llama**: Open models with extended context
- **Mobile LLMs**: Running models on phones with limited memory
## Key Insights

1. Williams' bound assumes uniform memory access
2. Real systems have memory hierarchies
3. Sometimes recomputation is faster than memory access
4. The √n pattern emerges naturally as optimal
## Production Techniques

- **Flash Attention**: Fuses operations to minimize memory transfers
- **Paged Attention**: Virtual-memory-style management of the KV-cache
- **Multi-Query Attention**: Shares keys/values across attention heads
- **Sliding Window**: Fixed-size attention window (sketched below)
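
A small sketch of the sliding-window idea as an attention mask (the window size and helper name are illustrative assumptions; production kernels implement this far more efficiently):

```python
# Boolean mask: True where query position i may attend to key position j.
# Window size and function name are illustrative assumptions.
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    i = np.arange(seq_len)[:, None]   # query positions (rows)
    j = np.arange(seq_len)[None, :]   # key positions (columns)
    causal = j <= i                   # never attend to future tokens
    recent = (i - j) < window         # only the most recent `window` tokens
    return causal & recent

mask = sliding_window_mask(seq_len=8, window=3)
print(mask.astype(int))               # each row has at most 3 ones
```

Because each query attends to at most `window` past positions, the per-token KV-cache footprint and attention cost stay constant as the sequence grows.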
## Generated Files

- `llm_attention_tradeoff.png`: Performance visualization
- `llm_kv_cache_results.json`: Detailed metrics