# Memory-Aware Query Optimizer

Database query optimizer that explicitly considers memory hierarchies and space-time tradeoffs, based on Williams' theoretical bounds.
## Features

- **Cost Model**: Incorporates L3/RAM/SSD boundaries in cost calculations
- **Algorithm Selection**: Chooses between hash/sort/nested-loop joins based on true memory costs
- **Buffer Sizing**: Automatically sizes buffers to √(data_size) for optimal tradeoffs
- **Spill Planning**: Optimizes when and how to spill to disk
- **Memory Hierarchy Awareness**: Tracks which level (L1-L3/RAM/Disk) each operation will use
- **AI Explanations**: Clear reasoning for all optimization decisions
## Installation

```bash
# From the sqrtspace-tools root directory
pip install -r requirements-minimal.txt
```
## Quick Start

```python
from db_optimizer.memory_aware_optimizer import MemoryAwareOptimizer
import sqlite3

# Connect to database
conn = sqlite3.connect('mydb.db')

# Create optimizer with 10MB memory limit
optimizer = MemoryAwareOptimizer(conn, memory_limit=10*1024*1024)

# Optimize a query
sql = """
SELECT c.name, SUM(o.total)
FROM customers c
JOIN orders o ON c.id = o.customer_id
GROUP BY c.name
ORDER BY SUM(o.total) DESC
"""

result = optimizer.optimize_query(sql)
print(result.explanation)
# "Optimized query plan reduces memory usage by 87.3% with 2.1x estimated speedup.
#  Changed join from nested_loop to hash_join saving 9216KB.
#  Allocated 4 buffers totaling 2048KB for optimal performance."
```
## Join Algorithm Selection

The optimizer selects join algorithms based on memory constraints:

### 1. Hash Join
- **When**: Smaller table fits in memory
- **Memory**: O(min(n, m))
- **Time**: O(n + m)
- **Best for**: Equi-joins with one small table

### 2. Sort-Merge Join
- **When**: Both tables fit in memory for sorting
- **Memory**: O(n + m)
- **Time**: O(n log n + m log m)
- **Best for**: Pre-sorted data, or when the output needs ordering

### 3. Block Nested Loop
- **When**: Limited memory; processes the outer table in √n-sized blocks
- **Memory**: O(√n)
- **Time**: O(n*m/√n) = O(m*√n)
- **Best for**: Memory-constrained environments

### 4. Nested Loop
- **When**: Extreme memory constraints
- **Memory**: O(1)
- **Time**: O(n*m)
- **Last resort**: When memory is critically limited
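The selection rule above can be sketched as follows. This is an illustrative decision function with made-up thresholds, not the actual `MemoryAwareOptimizer` API:

```python
# Hypothetical sketch of memory-driven join selection; the function name,
# parameters, and thresholds are illustrative, not the library's internals.
def choose_join(left_bytes: int, right_bytes: int, memory_limit: int,
                need_sorted_output: bool = False) -> str:
    small, large = sorted((left_bytes, right_bytes))
    if need_sorted_output and small + large <= memory_limit:
        return "sort_merge_join"      # both sides sortable in memory
    if small <= memory_limit:
        return "hash_join"            # build table fits in memory
    if memory_limit >= large ** 0.5:
        return "block_nested_loop"    # sqrt(n)-sized blocks fit
    return "nested_loop"              # O(1)-memory last resort
```

The checks mirror the order above: prefer the O(n + m) hash join whenever the build side fits, fall back to √n blocking, and only degrade to a plain nested loop when even √n of memory is unavailable.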
## Buffer Management

The optimizer automatically calculates optimal buffer sizes:

```python
# Get buffer recommendations
result = optimizer.optimize_query(query)
for buffer_name, size in result.buffer_sizes.items():
    print(f"{buffer_name}: {size / 1024:.1f}KB")

# Output:
# scan_buffer: 316.2KB   # √n sized for sequential scan
# join_buffer: 1024.0KB  # Optimal for hash table
# sort_buffer: 447.2KB   # √n sized for external sort
```
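The √n sizing rule itself is simple arithmetic. A minimal sketch, assuming a hypothetical helper name and a 64KB floor of my own choosing (neither is part of the library):

```python
import math

# Illustrative sqrt(n) buffer sizing; `sqrt_buffer_bytes` and the 64KB
# floor are assumptions for this sketch, not MemoryAwareOptimizer internals.
def sqrt_buffer_bytes(data_bytes: int, floor: int = 64 * 1024) -> int:
    return max(floor, int(math.sqrt(data_bytes)))

# A 100MB input yields a 10KB theoretical buffer, clamped up to the floor;
# a 10GB input yields roughly 101KB.
```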
## Spill Strategies

When memory is exceeded, the optimizer plans spilling:

```python
# Check spill strategy
if result.spill_strategy:
    for operation, strategy in result.spill_strategy.items():
        print(f"{operation}: {strategy}")

# Output:
# JOIN_0: grace_hash_join                 # Partition both inputs
# SORT_0: multi_pass_external_sort        # Multiple merge passes
# AGGREGATE_0: spill_partial_aggregates   # Write intermediate results
```
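To see why a grace hash join is the natural spill strategy for a join: it partitions both inputs so each build-side partition fits in memory. The arithmetic can be sketched as (illustrative helper, not the optimizer's actual planner):

```python
import math

# Hypothetical sketch: pick how many partitions a grace hash join needs so
# each build-side partition fits within the memory limit.
def grace_hash_partitions(build_bytes: int, memory_limit: int) -> int:
    if build_bytes <= memory_limit:
        return 1  # no spill needed: plain in-memory hash join
    return math.ceil(build_bytes / memory_limit)

# e.g. a 1GB build side under a 100MB limit needs 11 partitions
```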
## Query Plan Visualization

```python
# View query execution plan
print(optimizer.explain_plan(result.optimized_plan))

# Output:
# AGGREGATE (hash_aggregate)
#   Rows: 100
#   Size: 9.8KB
#   Memory: 14.6KB (L3)
#   Cost: 15234
#   SORT (external_sort)
#     Rows: 1,000
#     Size: 97.7KB
#     Memory: 9.9KB (L3)
#     Cost: 14234
#     JOIN (hash_join)
#       Rows: 1,000
#       Size: 97.7KB
#       Memory: 73.2KB (L3)
#       Cost: 3234
#       SCAN customers (sequential)
#         Rows: 100
#         Size: 9.8KB
#         Memory: 9.8KB (L2)
#         Cost: 98
#       SCAN orders (sequential)
#         Rows: 1,000
#         Size: 48.8KB
#         Memory: 48.8KB (L3)
#         Cost: 488
```
## Optimizer Hints

Apply hints to SQL queries:

```python
# Optimize for minimal memory usage
hinted_sql = optimizer.apply_hints(
    sql,
    target='memory',
    memory_limit='1MB'
)
# /* SpaceTime Optimizer: Using block nested loop with √n memory ... */
# SELECT ...

# Optimize for speed
hinted_sql = optimizer.apply_hints(
    sql,
    target='latency'
)
# /* SpaceTime Optimizer: Using hash join for minimal latency ... */
# SELECT ...
```
## Real-World Examples

### 1. Large Table Join with Memory Limit
```python
# 1GB tables, 100MB memory limit
sql = """
SELECT l.*, r.details
FROM large_table l
JOIN reference_table r ON l.ref_id = r.id
WHERE l.status = 'active'
"""

result = optimizer.optimize_query(sql)
# Chooses: Block nested loop with 10MB blocks
# Memory: 10MB (fits in L3 cache)
# Speedup: 10x over naive nested loop
```

### 2. Multi-Way Join
```python
sql = """
SELECT *
FROM a
JOIN b ON a.id = b.a_id
JOIN c ON b.id = c.b_id
JOIN d ON c.id = d.c_id
"""

result = optimizer.optimize_query(sql)
# Optimizes join order based on table sizes
# Uses a different algorithm for each join as needed
# Allocates buffers to minimize spilling
```

### 3. Aggregation with Sorting
```python
sql = """
SELECT category, COUNT(*), AVG(price)
FROM products
GROUP BY category
ORDER BY COUNT(*) DESC
"""

result = optimizer.optimize_query(sql)
# Hash aggregation with √n memory
# External sort for final ordering
# Explains the tradeoffs clearly
```
## Performance Characteristics

### Memory Savings
- **Typical**: 50-95% reduction vs the naive approach
- **Best case**: 99% reduction (large self-joins)
- **Worst case**: 10% reduction (already near-optimal plans)

### Speed Impact
- **Nested loop to block nested loop**: 2-10x speedup
- **External sort**: 20-50% overhead vs in-memory sort
- **Overall**: Usually faster despite using less memory

### Memory Hierarchy Benefits
- **L3 vs RAM**: 8-10x latency improvement
- **RAM vs SSD**: 100-1000x latency improvement
- **Optimizer target**: Keep hot data in the faster levels
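Those latency ratios are why the cost model weights accesses by hierarchy level. A toy version of such a weighting (the latency figures below are rough order-of-magnitude assumptions of mine, not measurements from this project):

```python
# Toy hierarchy-weighted cost model. Relative access latencies are ballpark
# assumptions for illustration only, not measured values.
LATENCY = {"L1": 1, "L2": 4, "L3": 10, "RAM": 100, "SSD": 100_000}

def weighted_cost(accesses_by_level: dict) -> int:
    """Sum accesses weighted by the relative latency of their level."""
    return sum(n * LATENCY[level] for level, n in accesses_by_level.items())

# Keeping 1M accesses in L3 instead of RAM cuts the toy cost by 10x:
in_l3 = weighted_cost({"L3": 1_000_000})
in_ram = weighted_cost({"RAM": 1_000_000})
```

Under this model, shrinking a working set enough to move it one level up the hierarchy can outweigh doing extra computation, which is the core space-time tradeoff the optimizer exploits.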
## Integration

### SQLite
```python
conn = sqlite3.connect('mydb.db')
optimizer = MemoryAwareOptimizer(conn)
```

### PostgreSQL (via psycopg2)
```python
# Use EXPLAIN ANALYZE to get statistics
# Apply recommendations via SET commands
```

### MySQL (planned)
```python
# Similar approach with optimizer hints
```
## How It Works

1. **Statistics Collection**: Gathers table sizes, indexes, and cardinalities
2. **Query Analysis**: Parses SQL to extract operations
3. **Cost Modeling**: Estimates cost with memory hierarchy awareness
4. **Algorithm Selection**: Chooses optimal algorithms for each operation
5. **Buffer Allocation**: Sizes buffers using the √n principle
6. **Spill Planning**: Determines a graceful degradation strategy
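The steps above can be sketched end to end at toy scale. Everything here is a hypothetical stand-in for illustration: the function, the regex-based "query analysis", and the size-threshold "cost model" are mine, not the module's real internals:

```python
import math
import re

# Illustrative pipeline mirroring the six steps on a toy scale; all names
# and thresholds are hypothetical, not the real MemoryAwareOptimizer.
def optimize(sql: str, table_bytes: dict, memory_limit: int) -> dict:
    # 1-2. Statistics arrive as table_bytes; a regex stands in for parsing.
    tables = re.findall(r"\b(?:FROM|JOIN)\s+(\w+)", sql, re.I)
    plan = {"joins": [], "buffers": {}, "spills": []}
    for left, right in zip(tables, tables[1:]):
        small = min(table_bytes[left], table_bytes[right])
        # 3-4. Memory-aware "cost model" picks the join algorithm.
        algo = "hash_join" if small <= memory_limit else "block_nested_loop"
        plan["joins"].append((left, right, algo))
        if small > memory_limit:
            # 6. Plan graceful spilling when memory is exceeded.
            plan["spills"].append((left, right, "grace_hash_join"))
    for t in tables:
        # 5. sqrt(n) buffer allocation per scanned table.
        plan["buffers"][t] = int(math.sqrt(table_bytes[t]))
    return plan
```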
## Limitations

- Simplified cardinality estimation
- SQLite-focused (PostgreSQL support planned)
- No runtime adaptation yet
- Requires accurate statistics
## Future Enhancements

- Runtime plan adjustment
- Learned cost models
- PostgreSQL native integration
- Distributed query optimization
- GPU memory hierarchy support
## See Also

- [SpaceTimeCore](../core/spacetime_core.py): Memory hierarchy modeling
- [SpaceTime Profiler](../profiler/): Find queries that need optimization