# Distributed Computing: Space-Time Tradeoffs at Scale

## Overview

Distributed systems make explicit decisions about replication (space) versus computation (time). Every major distributed framework embodies these tradeoffs.
## 1. MapReduce / Hadoop

### Shuffle Phase - The Classic Tradeoff

```java
// Map output: written to local disk (space spent for fault tolerance)
map(key, value):
    for word in value.split():
        emit(word, 1)

// Shuffle: all-to-all communication
// Choice: buffer shuffle data in memory vs spill to disk
// (Hadoop 2.x+ property names, defaults shown)
mapreduce.reduce.shuffle.input.buffer.percent = 0.70  // 70% of reducer heap for shuffle buffers
mapreduce.map.sort.spill.percent = 0.80               // spill the map-side sort buffer at 80% full
```
**Memory Settings Impact:**
- High memory: fast shuffle, but risk of OOM failures
- Low memory: frequent spills, often an order of magnitude slower
- Sweet spot: roughly √(data_size) of memory per node (see the sizing sketch below)
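As a rough illustration of the √n sizing rule, the sketch below derives a per-node shuffle buffer from the total shuffle volume. The node count, data size, and 1 MiB block size are made-up assumptions, not recommendations.

```python
import math

BLOCK_BYTES = 1 << 20                  # 1 MiB shuffle blocks (assumption)
total_shuffle_bytes = 10 * (1 << 40)   # 10 TiB of map output (assumption)
num_nodes = 100

per_node_bytes = total_shuffle_bytes // num_nodes
per_node_blocks = per_node_bytes // BLOCK_BYTES
buffer_blocks = math.isqrt(per_node_blocks)   # hold ~sqrt(n) of the n blocks in memory
buffer_bytes = buffer_blocks * BLOCK_BYTES

print(f"per-node shuffle data: {per_node_bytes / (1 << 30):.0f} GiB "
      f"({per_node_blocks} blocks)")
print(f"sqrt-sized buffer:     {buffer_bytes / (1 << 20):.0f} MiB "
      f"({buffer_blocks} blocks)")
```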
### Combiner Optimization

```java
// Without combiner: send every intermediate record
map: (word, 1), (word, 1), (word, 1)...

// With combiner: local aggregation (compute traded for network/space)
combine: (word, 3)

// Network transfer: often reduced by orders of magnitude for repetitive keys
// CPU cost: local sum computation on each mapper
```
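A minimal, framework-free sketch of the same idea in Python: pre-aggregate counts locally before anything is "sent". The sample text stands in for one mapper's input split, and the record counts stand in for network transfer.

```python
from collections import Counter

def map_phase(text: str):
    """Emit (word, 1) pairs, like the mapper above."""
    return [(word, 1) for word in text.split()]

def combine(pairs):
    """Combiner: sum counts locally before any network transfer."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())

text = "to be or not to be " * 1000   # stand-in for one mapper's input split
raw = map_phase(text)                 # records sent without a combiner
combined = combine(raw)               # records sent with a combiner

print(f"records without combiner: {len(raw)}")       # 6000
print(f"records with combiner:    {len(combined)}")  # 4 distinct words
print(f"reduction factor:         {len(raw) / len(combined):.0f}x")
```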
## 2. Apache Spark

### RDD Persistence Levels

```scala
// MEMORY_ONLY: fast but memory intensive
rdd.persist(StorageLevel.MEMORY_ONLY)
// Space: full dataset in RAM
// Time: memory-speed access

// MEMORY_AND_DISK: spill partitions to disk when RAM runs out
rdd.persist(StorageLevel.MEMORY_AND_DISK)
// Space: min(dataset, available RAM) in memory, the rest on disk
// Time: RAM-speed or disk-speed, depending on where a partition lives

// DISK_ONLY: minimal memory
rdd.persist(StorageLevel.DISK_ONLY)
// Space: O(1) RAM
// Time: always disk I/O

// MEMORY_ONLY_SER: serialized in memory
rdd.persist(StorageLevel.MEMORY_ONLY_SER)
// Space: typically 2-5x smaller via serialization
// Time: CPU cost to deserialize on each access
```
### Broadcast Variables

```scala
// Without broadcast: the closure ships the data with every task
val bigData = loadBigDataset() // ~1 GB
rdd.map(x => doSomething(x, bigData))
// Network: ~1 GB × num_tasks

// With broadcast: send once per executor (node)
val bcData = sc.broadcast(bigData)
rdd.map(x => doSomething(x, bcData.value))
// Network: ~1 GB × num_executors
// Memory: one extra copy held on each executor
```
## 3. Distributed Key-Value Stores

### Redis Eviction Policies

```conf
# No eviction: writes fail when memory is full (pure space)
maxmemory-policy noeviction

# LRU: evicted data must be recomputed or refetched (time for space)
maxmemory-policy allkeys-lru
maxmemory 10gb

# LFU: usually a better hit rate, slightly more bookkeeping
maxmemory-policy allkeys-lfu
```
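The "time for space" half of that tradeoff lives in application code: on a miss, including a miss caused by eviction, the value is recomputed and written back. Below is a minimal cache-aside sketch using the redis-py client; it assumes a Redis server on localhost, and the key scheme, TTL, and `expensive_compute` are placeholders.

```python
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)

def expensive_compute(key: str) -> bytes:
    # Stand-in for the real work (database query, aggregation, render, ...)
    return f"value-for-{key}".encode()

def cached_get(key: str, ttl_seconds: int = 3600) -> bytes:
    """Cache-aside: hit -> return cached bytes; miss (never cached, expired,
    or evicted under maxmemory pressure) -> pay the compute cost again."""
    value = r.get(key)
    if value is None:                     # miss: trade time to refill space
        value = expensive_compute(key)
        r.setex(key, ttl_seconds, value)  # the write may itself trigger an eviction
    return value
```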
### Memcached Slab Allocation

- Fixed-size slabs: internal fragmentation (space wasted inside each chunk)
- Variable-size allocation: external fragmentation (CPU spent compacting)
- In practice: a modest number of slab classes (on the order of √n for n distinct object sizes), generated by a geometric growth factor; see the sketch below
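A small sketch of how geometrically growing slab classes cover a wide range of object sizes with few classes. The 1.25 growth factor matches memcached's default; the minimum and maximum chunk sizes here are assumptions.

```python
def slab_classes(min_chunk: int = 96, max_chunk: int = 1 << 20,
                 growth_factor: float = 1.25) -> list[int]:
    """Chunk sizes grow geometrically, so ~log(max/min)/log(factor) classes
    cover the whole range; each object is rounded up to the next class
    (internal fragmentation traded for O(1) allocation)."""
    sizes, size = [], min_chunk
    while size <= max_chunk:
        sizes.append(size)
        size = int(size * growth_factor)
    return sizes

classes = slab_classes()
print(f"{len(classes)} slab classes")             # 42 classes for 96 B .. 1 MiB
obj = 3000                                        # object size in bytes
chunk = next(s for s in classes if s >= obj)      # class this object lands in
print(f"a {obj} B object uses a {chunk} B chunk "
      f"({chunk - obj} B of internal fragmentation)")
```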
## 4. Kafka / Stream Processing

### Log Compaction

```properties
# Keep every message until retention expires (max space)
cleanup.policy=delete
retention.ms=-1                 # -1 = retain indefinitely

# Keep only the latest record per key (compute to save space)
cleanup.policy=compact
min.compaction.lag.ms=86400000  # do not compact records younger than 1 day

# Compression (CPU for space)
compression.type=lz4            # typically ~4x space reduction
compression.type=zstd           # usually a better ratio, more CPU
```
### Consumer Groups

- Separate groups: each group receives every message (replicated processing)
- Within a group: partitions are assigned so each message is processed once per group
- Tradeoff: redundancy vs coordination (rebalance) overhead; see the assignment sketch below
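A toy round-robin partition assignment, just to make "processed once per group" concrete. Real Kafka assignors (range, sticky, cooperative) are more involved, and the group and consumer names here are invented.

```python
def assign_partitions(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    """Toy round-robin assignor: every partition goes to exactly one consumer
    in the group, so each message is processed once per group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# Two groups over the same 6-partition topic: each group sees all partitions,
# but inside each group the partitions are split, not duplicated.
partitions = list(range(6))
print(assign_partitions(partitions, ["analytics-0", "analytics-1", "analytics-2"]))
print(assign_partitions(partitions, ["billing-0", "billing-1"]))
```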
## 5. Kubernetes / Container Orchestration

### Resource Requests vs Limits

```yaml
resources:
  requests:
    memory: "256Mi"   # guaranteed (space reservation)
    cpu: "250m"       # guaranteed (time reservation)
  limits:
    memory: "512Mi"   # maximum before the container is OOM-killed
    cpu: "500m"       # maximum before CPU throttling
```
### Image Layer Caching

- Base images: shared across containers (deduplicated space)
- Layer reuse: faster container starts
- Tradeoff: registry and node disk space vs image pull time
## 6. Distributed Consensus

### Raft Log Compaction

```go
// Snapshot periodically to bound log size
if logSize > maxLogSize {
    snapshot := createSnapshot(stateMachine) // persist the current state
    truncateLog(snapshot.Index)              // discard entries covered by the snapshot
}
// Space: O(snapshot) instead of O(all operations)
// Time: recovery replays the snapshot plus only the recent log suffix
```
### Multi-Paxos vs Raft

- Multi-Paxos: typically less state retained, but complex recovery
- Raft: more memory (a full replicated log until compaction), simpler recovery
- Tradeoff: space vs implementation complexity
## 7. Content Delivery Networks (CDNs)

### Edge Caching Strategy

```nginx
# Cache everything, for a long time (max space)
proxy_cache_path /var/cache/nginx keys_zone=edge:10m max_size=100g;
proxy_cache_valid 200 30d;

# Cache popular objects only (spend compute tracking popularity)
proxy_cache_path /var/cache/nginx keys_zone=edge:10m max_size=10g;
proxy_cache_min_uses 3;
proxy_cache_valid 200 1h;
```
### Geographic Replication

- Full replication: every edge location stores all content
- Lazy pull: fetch from origin on first demand
- Predictive push: models forecast demand and pre-position content
## 8. Batch Processing Frameworks

### Apache Flink Checkpointing

```java
// Checkpoint frequency (space vs recovery time)
env.enableCheckpointing(10000); // checkpoint every 10 seconds

// State backend choice (legacy API shown)
env.setStateBackend(new FsStateBackend("hdfs://..."));
// vs
env.setStateBackend(new RocksDBStateBackend("file://..."));

// RocksDB backend: spills state to local disk, slower per access, scales beyond RAM
// Heap/Fs backend: fast in-memory access, state bounded by heap size
```
### Watermark Strategies

- Perfect watermarks: buffer all late data (unbounded space in the worst case)
- Heuristic watermarks: drop some late data (accuracy traded for space)
- Allowed lateness: a bounded buffer between the two extremes (see the sketch below)
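A stripped-down sketch of allowed lateness: events are buffered per window until the watermark passes the window end plus the allowed lateness, after which anything later is dropped. The window size, lateness bound, watermark heuristic, and event timestamps are all assumptions.

```python
WINDOW = 10           # seconds per tumbling window (assumption)
ALLOWED_LATENESS = 5  # seconds of lateness tolerated (assumption)

windows: dict[int, list[int]] = {}  # window start -> buffered event timestamps
watermark = 0

def on_event(ts: int) -> None:
    """Buffer an event unless its window is already closed and expired."""
    global watermark
    watermark = max(watermark, ts - 2)          # heuristic: assume events lag by <= 2 s
    start = (ts // WINDOW) * WINDOW
    if watermark > start + WINDOW + ALLOWED_LATENESS:
        print(f"drop late event at t={ts}")     # accuracy traded for a bounded buffer
        return
    windows.setdefault(start, []).append(ts)
    # Close windows the watermark has passed for good, freeing their buffer space
    for s in [s for s in windows if watermark > s + WINDOW + ALLOWED_LATENESS]:
        print(f"window [{s},{s + WINDOW}) closes with {len(windows.pop(s))} events")

for t in [1, 3, 12, 2, 25, 4]:  # the final event at t=4 arrives far too late
    on_event(t)
```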
## 9. Real-World Examples

### Google's MapReduce (2004)

- Problem: processing ~20 TB of web data
- Solution: trade disk space (materialized intermediate output) for fault tolerance
- Impact: 1000 machines × 3 hours instead of 1 machine × 3000 hours

### Facebook's TAO (2013)

- Problem: social graph queries at enormous read volume
- Solution: cache the graph in RAM, replicated to every datacenter
- Tradeoff: an enormous RAM footprint for consistently low-latency reads

### Amazon's Dynamo (2007)

- Problem: shopping cart availability during failures
- Solution: eventually consistent, multi-version storage
- Tradeoff: extra space (version vectors, divergent siblings) for conflict resolution
## 10. Optimization Patterns

### Hierarchical Aggregation

```python
from math import isqrt

# Naive: all-to-one aggregation
def aggregate_naive(workers):
    results = []
    for worker in workers:
        results.extend(worker.compute())
    return aggregate(results)   # bottleneck: one node receives all n results

# Two-level tree aggregation with fan-in ~√n
def aggregate_tree(workers):
    n = len(workers)
    k = isqrt(n) or 1           # group size ≈ √n
    groups = [workers[i:i + k] for i in range(0, n, k)]
    level1 = [aggregate([r for w in group for r in w.compute()]) for group in groups]
    return aggregate(level1)    # only ~√n partial aggregates reach the root

# Space: O(√n) intermediate results at any single aggregator
# Time: 2 rounds here; a deeper tree with constant fan-in gives O(log n) rounds vs O(n)
```
### Bloom Filters in Distributed Joins

```java
// Broadcast join with a Bloom filter built over the small table's join keys
BloomFilter filter = createBloomFilter(smallTable);  // ~10 bits per key at 1% false positives
broadcast(filter);

// Each node filters its partition of the big table locally before the join
bigTable.filter(row -> filter.mightContain(row.key))
        .join(broadcastedSmallTable);

// Space: a few bits per key in the small table (about n·log2(1/ε)/ln 2 bits at error rate ε)
// Benefit: most non-matching rows never cross the network
```
## Key Insights

1. **Every distributed system** trades replication for computation
2. **The √n pattern** appears in:
   - Shuffle buffer sizes
   - Checkpoint frequencies
   - Aggregation tree fan-in
   - Cache sizes
3. **Network is the new disk**:
   - Network transfer is comparable to disk I/O in cost
   - The same space-time tradeoffs apply
4. **Failures force space overhead**:
   - Replication for availability
   - Checkpointing for recovery
   - Logging for consistency
## Connection to Williams' Result

Distributed systems naturally gravitate toward √n designs:

- Shuffle phases: roughly √(data_size) memory per node is a common sweet spot
- Aggregation trees: ~√n fan-in balances per-node load against the number of rounds
- Cache sizing: √(total_data) per node is a common heuristic

These patterns emerge independently across systems, echoing the O(√(t log t)) space bound for simulating time-t computations.
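For reference, the bound invoked above is Williams' time-space simulation theorem, stated informally: a multitape Turing machine computation running in time t(n) can be simulated using space on the order of √(t(n) log t(n)).

```math
\mathrm{TIME}[t(n)] \subseteq \mathrm{SPACE}\!\left[O\!\left(\sqrt{t(n)\,\log t(n)}\right)\right]
```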