# SqrtSpace SpaceTime for Python
[PyPI](https://badge.fury.io/py/sqrtspace-spacetime)
[Python Versions](https://pypi.org/project/sqrtspace-spacetime/)
[License](https://git.marketally.com/sqrtspace/sqrtspace-python/src/branch/main/LICENSE)
[Documentation](https://sqrtspace-spacetime.readthedocs.io/en/latest/?badge=latest)
Memory-efficient algorithms and data structures for Python using Williams' √n space-time tradeoffs.
**Paper Repository**: [github.com/sqrtspace/sqrtspace-paper](https://github.com/sqrtspace/sqrtspace-paper)
## Installation
```bash
pip install sqrtspace-spacetime
```
For ML features:
```bash
pip install sqrtspace-spacetime[ml]
```
For all features:
```bash
pip install sqrtspace-spacetime[all]
```
## Core Concepts
SpaceTime puts into practice results from theoretical computer science showing that many algorithms can trade a modest slowdown for dramatically lower memory use. The key insight is using O(√n) memory instead of O(n), where n is the input size.
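To make the tradeoff concrete, here is a minimal, library-independent sketch of the idea behind external sorting (the function name is illustrative and not part of the SpaceTime API): instead of holding all n items in RAM, write ⌈√n⌉-sized sorted runs to disk, keeping only one run in memory at a time; a final merge pass over the runs then produces sorted output.

```python
import math
import pickle
import tempfile

def sqrt_space_runs(items):
    """Illustrative sketch (not part of the SpaceTime API): hold only
    ~sqrt(n) items in memory at once, spilling sorted runs to disk."""
    n = len(items)
    run_size = max(1, math.isqrt(n))  # O(sqrt(n)) working memory
    spill = tempfile.TemporaryFile()
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) == run_size:
            pickle.dump(sorted(buffer), spill)  # write one sorted run
            buffer.clear()
    if buffer:
        pickle.dump(sorted(buffer), spill)
    spill.seek(0)
    return spill  # an external sort would now k-way merge these runs
```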
### Key Features
- **Memory-Efficient Collections**: Arrays and dictionaries that automatically spill to disk
- **External Algorithms**: Sort and group large datasets using minimal memory
- **Streaming Operations**: Process files larger than RAM with elegant API
- **Auto-Checkpointing**: Resume long computations from where they left off
- **Memory Profiling**: Identify optimization opportunities in your code
- **ML Optimizations**: Reduce neural network training memory by up to 90%
## Quick Start
### Basic Usage
```python
from sqrtspace_spacetime import SpaceTimeArray, external_sort, Stream

# Memory-efficient array that spills to disk
array = SpaceTimeArray(threshold=10000)
for i in range(1000000):
    array.append(i)

# Sort large datasets with minimal memory
huge_list = list(range(10000000, 0, -1))
sorted_data = external_sort(huge_list)  # uses only √n memory

# Stream processing
Stream.from_csv('huge_file.csv') \
    .filter(lambda row: row['value'] > 100) \
    .map(lambda row: row['value'] * 1.1) \
    .group_by(lambda row: row['category']) \
    .to_csv('processed.csv')
```
## Examples
### Basic Examples
See [`examples/basic_usage.py`](examples/basic_usage.py) for comprehensive examples of:
- SpaceTimeArray and SpaceTimeDict usage
- External sorting and grouping
- Stream processing
- Memory profiling
- Auto-checkpointing
### FastAPI Web Application
Check out [`examples/fastapi-app/`](examples/fastapi-app/) for a production-ready web application featuring:
- Streaming endpoints for large datasets
- Server-Sent Events (SSE) for real-time data
- Memory-efficient CSV exports
- Checkpointed background tasks
- ML model serving with memory constraints
See the [FastAPI example README](examples/fastapi-app/README.md) for detailed documentation.
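To give a taste of the streaming-endpoint pattern, here is a hedged sketch rather than code from the example app. It assumes, as in the Data Pipeline example below, that a `Stream` can be iterated lazily; the `/export` route, the file path, and the `status` column are placeholders.

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from sqrtspace_spacetime import Stream

app = FastAPI()

@app.get("/export")
def export_active_rows():
    # Rows are pulled lazily from the stream, so memory stays flat
    # no matter how large the underlying CSV is.
    def generate():
        rows = Stream.from_csv("data/huge_file.csv") \
            .filter(lambda row: row["status"] == "active")
        for row in rows:
            yield ",".join(str(v) for v in row.values()) + "\n"
    return StreamingResponse(generate(), media_type="text/csv")
```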
### Machine Learning Pipeline
Explore [`examples/ml-pipeline/`](examples/ml-pipeline/) for ML-specific patterns:
- Training models on datasets larger than RAM
- Memory-efficient feature extraction
- Checkpointed training loops
- Streaming predictions
- Integration with PyTorch and TensorFlow
See the [ML Pipeline README](examples/ml-pipeline/README.md) for complete documentation.
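As one illustration of the streaming-predictions pattern, here is a hedged sketch (not code from the example pipeline): the `predict` callable and the JSONL field names are placeholders; only the `Stream` chaining mirrors the API used elsewhere in this README.

```python
from sqrtspace_spacetime import Stream

def stream_predictions(predict, events_path):
    """Hedged sketch: 'predict' is any single-record inference callable
    (e.g. wrapping a PyTorch forward pass); field names are illustrative."""
    # Records are read lazily, so memory stays bounded even for
    # arbitrarily large input files.
    return Stream.from_jsonl(events_path) \
        .map(lambda record: {
            "id": record["id"],
            "prediction": predict(record["features"]),
        }) \
        .to_csv("predictions.csv")
```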
### Memory-Efficient Collections
```python
from sqrtspace_spacetime import SpaceTimeArray, SpaceTimeDict

# Array that automatically manages memory
array = SpaceTimeArray(threshold=1000)  # keep 1000 items in memory
for i in range(1000000):
    array.append(f"item_{i}")

# Dictionary with LRU eviction to disk
cache = SpaceTimeDict(threshold=10000)
for key, value in huge_dataset:
    cache[key] = expensive_computation(value)
```
### External Algorithms
```python
from sqrtspace_spacetime import external_sort, external_groupby

# Sort 100M items while keeping only ~10K (≈√n) of them in memory
data = list(range(100_000_000, 0, -1))
sorted_data = external_sort(data)

# Group by with aggregation
sales = [
    {'store': 'A', 'amount': 100},
    {'store': 'B', 'amount': 200},
    # ... millions more
]
by_store = external_groupby(
    sales,
    key_func=lambda x: x['store']
)

# Aggregate with minimal memory
from sqrtspace_spacetime.algorithms import groupby_sum
totals = groupby_sum(
    sales,
    key_func=lambda x: x['store'],
    value_func=lambda x: x['amount']
)
```
### Streaming Operations
```python
from sqrtspace_spacetime import Stream

# Process large files efficiently
Stream.from_csv('sales_2023.csv') \
    .filter(lambda row: row['amount'] > 0) \
    .map(lambda row: {
        'month': row['date'][:7],
        'amount': float(row['amount'])
    }) \
    .group_by(lambda row: row['month']) \
    .to_csv('monthly_summary.csv')

# Chain operations
top_products = Stream.from_jsonl('products.jsonl') \
    .filter(lambda p: p['in_stock']) \
    .sort(key=lambda p: p['revenue'], reverse=True) \
    .take(100) \
    .collect()
```
### Auto-Checkpointing
```python
from sqrtspace_spacetime.checkpoint import auto_checkpoint

@auto_checkpoint(total_iterations=1000000)
def process_large_dataset(data):
    results = []
    for i, item in enumerate(data):
        # Process item
        result = expensive_computation(item)
        results.append(result)
        # Yield state for checkpointing
        yield {'i': i, 'results': results}
    return results

# Automatically resumes from checkpoint if interrupted
results = process_large_dataset(huge_dataset)
```
### Memory Profiling
```python
from sqrtspace_spacetime.profiler import profile, profile_memory

@profile(output_file="profile.json")
def my_algorithm(data):
    # Process data
    return results

# Get detailed memory analysis
result, report = my_algorithm(data)
print(report.summary)

# Simple memory tracking
@profile_memory(threshold_mb=100)
def memory_heavy_function():
    # Alerts if memory usage exceeds threshold
    large_list = list(range(10000000))
    return sum(large_list)
```
### ML Memory Optimization
```python
from sqrtspace_spacetime.ml import MLMemoryOptimizer
import torch.nn as nn

# Analyze model memory usage
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)

optimizer = MLMemoryOptimizer()
profile = optimizer.analyze_model(model, input_shape=(784,), batch_size=32)

# Get optimization plan
plan = optimizer.optimize(profile, target_batch_size=128)
print(plan.explanation)

# Apply optimizations
config = optimizer.get_training_config(plan, profile)
```
## Advanced Features
### Memory Pressure Handling
```python
from sqrtspace_spacetime import SpaceTimeArray
from sqrtspace_spacetime.memory import MemoryMonitor, LoggingHandler

# Monitor memory pressure
monitor = MemoryMonitor()
monitor.add_handler(LoggingHandler())

# Your arrays automatically respond to memory pressure,
# spilling to disk when memory is low
array = SpaceTimeArray()
```
### Configuration
```python
from sqrtspace_spacetime import SpaceTimeConfig

# Global configuration
SpaceTimeConfig.set_defaults(
    memory_limit=2 * 1024**3,  # 2 GB
    chunk_strategy='sqrt_n',
    compression='gzip',
    external_storage_path='/fast/ssd/temp'
)
```
### Parallel Processing
```python
from sqrtspace_spacetime.batch import BatchProcessor

processor = BatchProcessor(
    memory_threshold=0.8,
    checkpoint_enabled=True
)

# Process in memory-efficient batches
result = processor.process(
    huge_list,
    lambda batch: [transform(item) for item in batch]
)
print(f"Processed {result.get_success_count()} items")
```
## Real-World Examples
### Processing Large CSV Files
```python
from sqrtspace_spacetime import Stream
from sqrtspace_spacetime.profiler import profile_memory

@profile_memory(threshold_mb=500)
def analyze_sales_data(filename):
    # Stream process to stay under the memory limit
    return Stream.from_csv(filename) \
        .filter(lambda row: row['status'] == 'completed') \
        .map(lambda row: {
            'product': row['product_id'],
            'revenue': float(row['price']) * int(row['quantity'])
        }) \
        .group_by(lambda row: row['product']) \
        .sort(key=lambda group: sum(r['revenue'] for r in group[1]), reverse=True) \
        .take(10) \
        .collect()

top_products = analyze_sales_data('sales_2023.csv')
```
### Training Large Neural Networks
```python
from sqrtspace_spacetime.ml import MLMemoryOptimizer, GradientCheckpointer
import torch.nn as nn

# Memory-efficient training
def train_large_model(model, train_loader, epochs=10):
    # Analyze memory requirements
    optimizer = MLMemoryOptimizer()
    profile = optimizer.analyze_model(model, input_shape=(3, 224, 224), batch_size=32)

    # Get optimization plan
    plan = optimizer.optimize(profile, target_batch_size=128)

    # Apply gradient checkpointing
    checkpointer = GradientCheckpointer()
    model = checkpointer.apply_checkpointing(model, plan.checkpoint_layers)

    # Train with optimized settings
    for epoch in range(epochs):
        for batch in train_loader:
            # Training loop with automatic memory management
            pass
```
### Data Pipeline with Checkpoints
```python
from sqrtspace_spacetime import Stream
from sqrtspace_spacetime.checkpoint import auto_checkpoint

@auto_checkpoint(total_iterations=1000000)
def process_user_events(event_file):
    processed = 0
    for event in Stream.from_jsonl(event_file):
        # Complex processing
        user_profile = enhance_profile(event)
        recommendations = generate_recommendations(user_profile)
        save_to_database(recommendations)
        processed += 1
        # Checkpoint state
        yield {'processed': processed, 'last_event': event['id']}
    return processed

# Automatically resumes if interrupted
total = process_user_events('events.jsonl')
```
## Performance Benchmarks
| Operation | Standard Python | SpaceTime | Memory Reduction | Time Overhead |
|-----------|----------------|-----------|------------------|---------------|
| Sort 10M integers | 400MB | 20MB | 95% | 40% |
| Process 1GB CSV | 1GB | 32MB | 97% | 20% |
| Group by on 1M rows | 200MB | 14MB | 93% | 30% |
| Neural network training | 8GB | 2GB | 75% | 15% |
## API Reference
### Collections
- `SpaceTimeArray`: Memory-efficient list with disk spillover
- `SpaceTimeDict`: Memory-efficient dictionary with LRU eviction
### Algorithms
- `external_sort()`: Sort large datasets with √n memory
- `external_groupby()`: Group large datasets with √n memory
- `external_join()`: Join large datasets efficiently (see the hedged sketch below)
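`external_join()` is not demonstrated elsewhere in this README, so the following is a hedged sketch only: the `left_key`/`right_key` argument names are guesses by analogy with `external_groupby`'s `key_func`, not the documented signature; consult the API documentation for the real interface.

```python
from sqrtspace_spacetime import external_join

# Hypothetical usage: argument names are assumed by analogy with
# external_groupby, not taken from the documented API.
orders = [{'user_id': 1, 'total': 50}, {'user_id': 2, 'total': 75}]
users = [{'id': 1, 'name': 'Ada'}, {'id': 2, 'name': 'Grace'}]

joined = external_join(
    orders,
    users,
    left_key=lambda o: o['user_id'],
    right_key=lambda u: u['id'],
)
```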
### Streaming
- `Stream`: Lazy-evaluation stream processing
- `FileStream`: Stream lines from files
- `CSVStream`: Stream CSV rows
- `JSONLStream`: Stream JSON Lines
### Memory Management
- `MemoryMonitor`: Monitor memory pressure
- `MemoryPressureHandler`: Custom pressure handlers
### Checkpointing
- `@auto_checkpoint`: Automatic checkpointing decorator
- `CheckpointManager`: Manual checkpoint control
### ML Optimization
- `MLMemoryOptimizer`: Analyze and optimize models
- `GradientCheckpointer`: Apply gradient checkpointing
### Profiling
- `@profile`: Full profiling decorator
- `@profile_memory`: Memory-only profiling
- `SpaceTimeProfiler`: Programmatic profiling
## Contributing
We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.
## License
Apache License 2.0. See [LICENSE](LICENSE) for details.
## Citation
If you use SpaceTime in your research, please cite:
```bibtex
@software{sqrtspace_spacetime,
  title  = {SqrtSpace SpaceTime: Memory-Efficient Python Library},
  author = {Friedel Jr., David H.},
  year   = {2025},
  url    = {https://git.marketally.com/sqrtspace/sqrtspace-python}
}
```
## Links
- [Documentation](https://sqrtspace-spacetime.readthedocs.io)
- [PyPI Package](https://pypi.org/project/sqrtspace-spacetime/)
- [Repository](https://git.marketally.com/sqrtspace/sqrtspace-python)
- [Issue Tracker](https://git.marketally.com/sqrtspace/sqrtspace-python/issues)