This document describes the memory monitoring and prediction system implemented for funannotate2's ab initio gene prediction step.
The memory monitoring system provides:
- Memory Usage Prediction - Estimate memory requirements based on contig length
- Real-time Memory Monitoring - Track actual memory usage of subprocess calls
- Memory-aware CPU Allocation - Adjust parallelization based on memory constraints
- Memory Usage Reporting - Generate detailed memory usage reports
The system includes empirical models to predict memory usage for each ab initio tool:
- SNAP: Base 50 MB + 0.5 MB per MB of sequence
- Augustus: Base 100 MB + 2.0 MB per MB of sequence
- GlimmerHMM: Base 30 MB + 0.3 MB per MB of sequence
- GeneMark: Base 80 MB + 1.0 MB per MB of sequence
These models provide rough estimates that can be refined with actual usage data.
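The linear models above can be sketched as a small lookup table plus one function. This is a minimal illustration, not the project's actual code: the names `MEMORY_MODELS`, `BP_PER_MB`, and `predict_memory_mb` are hypothetical, and treating "1 MB of sequence" as 1,048,576 bp is an assumption (it reproduces the 51.4 MB and 105.4 MB figures in the sample log for a 2,847,392 bp contig).

```python
# Coefficients copied from the list above: (base MB, MB per MB of sequence).
MEMORY_MODELS = {
    "snap": (50.0, 0.5),
    "augustus": (100.0, 2.0),
    "glimmerhmm": (30.0, 0.3),
    "genemark": (80.0, 1.0),
}

# Assumption: "1 MB of sequence" is taken to mean 1,048,576 bp.
BP_PER_MB = 1024 * 1024


def predict_memory_mb(tool_name, contig_length):
    """Return the predicted peak memory in MB for one tool and one contig."""
    base_mb, mb_per_seq_mb = MEMORY_MODELS[tool_name.lower()]
    return base_mb + mb_per_seq_mb * (contig_length / BP_PER_MB)


print(round(predict_memory_mb("snap", 2_847_392), 1))
print(round(predict_memory_mb("augustus", 2_847_392), 1))
```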
The monitor uses psutil to track subprocess memory usage in real time:
- Tracks RSS (Resident Set Size) and VMS (Virtual Memory Size)
- Monitors parent process and all child processes
- Samples memory usage at configurable intervals (default: 100ms)
- Calculates peak, average, and duration statistics
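The peak, average, and duration statistics could be derived from the collected samples roughly as follows. This is a sketch under assumptions: the sample layout `(timestamp_seconds, rss_mb, vms_mb)`, the function name `summarize_samples`, and the dictionary keys are all hypothetical.

```python
def summarize_samples(samples):
    """Summarize a list of (timestamp_seconds, rss_mb, vms_mb) samples."""
    if not samples:
        # Nothing was sampled, for example because the process exited immediately.
        return {
            "duration_seconds": 0.0,
            "peak_rss_mb": 0.0,
            "peak_vms_mb": 0.0,
            "avg_rss_mb": 0.0,
            "sample_count": 0,
        }
    times = [t for t, _, _ in samples]
    rss = [r for _, r, _ in samples]
    vms = [v for _, _, v in samples]
    return {
        "duration_seconds": max(times) - min(times),
        "peak_rss_mb": max(rss),
        "peak_vms_mb": max(vms),
        "avg_rss_mb": sum(rss) / len(rss),
        "sample_count": len(samples),
    }
```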
Automatically adjusts CPU allocation based on:
- Available system memory
- Predicted memory usage per process
- User-specified memory limits
- System memory buffer (20% reserved for OS)
The memory monitoring is integrated into:
- runSubprocess() - Optional memory monitoring for individual commands
- abinitio_wrapper() - Memory prediction and logging per contig
- runProcessJob() - Memory-aware CPU allocation for multiprocessing
Memory monitoring is enabled by default in funannotate2 predict:
# Memory monitoring is enabled by default
funannotate2 predict -i input_dir
# Set memory limit with memory monitoring (default behavior)
funannotate2 predict -i input_dir --memory-limit 16
# Disable memory monitoring if needed
funannotate2 predict -i input_dir --disable-memory-monitoring
- --disable-memory-monitoring: Disable memory monitoring and prediction (enabled by default)
- --memory-limit GB: Set memory limit in GB to adjust CPU allocation
With memory monitoring enabled by default, you'll see output like:
Memory monitoring: using 14.4 GB limit (90% of 16.0 GB total)
Memory limit set to 16.0 GB
Memory usage estimate for 150 contigs with tools ['snap', 'augustus']:
Total estimated peak memory: 2847.3 MB
System memory: 14.2 GB available
Processing contig scaffold_1.fasta (length: 2,847,392 bp)
SNAP memory prediction for scaffold_1.fasta: 51.4 MB
Augustus memory prediction for scaffold_1.fasta: 105.4 MB
Memory usage for snap-scaffold_1.fasta:
Process: snap-scaffold_1.fasta
Duration: 12.34 seconds
Peak RSS: 48.2 MB
Peak VMS: 156.7 MB
Average RSS: 42.1 MB
Samples collected: 247
Predict memory usage for an ab initio tool based on contig length.
Parameters:
- tool_name: Name of the ab initio tool ('snap', 'augustus', etc.)
- contig_length: Length of the contig in base pairs
- prediction_data: Optional historical data for improved predictions
Returns: Dictionary with predicted memory usage statistics
Monitor memory usage of a subprocess in real-time.
Parameters:
- process: subprocess.Popen object to monitor
- process_name: Name identifier for the process
Returns: Dictionary containing memory statistics
Estimate total memory usage for running ab initio predictions on multiple contigs.
Parameters:
- contigs: List of contig file paths
- tools: List of ab initio tools to run
- prediction_data: Optional historical data
Returns: Dictionary with total memory estimates
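One way such a total could be assembled is to apply the per-tool linear model to every contig and aggregate the results. The sketch below is an assumption about the aggregation, not the documented behavior: it takes a hypothetical mapping of contig name to length instead of file paths, and the names `estimate_total_memory`, `total_mb`, and `largest_job_mb` are illustrative only.

```python
# Coefficients from the prediction models: (base MB, MB per MB of sequence).
MEMORY_MODELS = {
    "snap": (50.0, 0.5),
    "augustus": (100.0, 2.0),
    "glimmerhmm": (30.0, 0.3),
    "genemark": (80.0, 1.0),
}
BP_PER_MB = 1024 * 1024  # assumption: 1 MB of sequence = 1,048,576 bp


def estimate_total_memory(contig_lengths, tools):
    """contig_lengths maps a contig name to its length in base pairs."""
    per_contig = {}
    for name, length in contig_lengths.items():
        size_mb = length / BP_PER_MB
        per_contig[name] = {
            tool: MEMORY_MODELS[tool][0] + MEMORY_MODELS[tool][1] * size_mb
            for tool in tools
        }
    # Flatten to one predicted value per (contig, tool) job.
    per_job = [mb for preds in per_contig.values() for mb in preds.values()]
    return {
        "per_contig_mb": per_contig,
        "total_mb": sum(per_job),
        "largest_job_mb": max(per_job, default=0.0),
    }
```

The sum is a pessimistic upper bound, since it assumes every job is resident at once; the largest single job is the more useful figure when deciding how many jobs can run in parallel.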
Suggest optimal CPU allocation based on memory constraints.
Parameters:
- total_memory_estimate: Total estimated memory usage in MB
- available_memory_gb: Available system memory in GB
- max_cpus: Maximum number of CPUs available
Returns: Dictionary with CPU allocation suggestions
Get current system memory information.
Returns: Dictionary with system memory statistics
Get the length of a contig from a FASTA file.
Parameters:
contig_file: Path to the contig FASTA file
Returns: Length of the contig in base pairs
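A plain-Python way to measure a contig is to sum the sequence lines and skip the header. This sketch assumes one record per file, as the description suggests; the function name `get_contig_length` is hypothetical and the real implementation may rely on a FASTA parsing library instead.

```python
def get_contig_length(contig_file):
    """Count sequence characters in a single-record FASTA file."""
    length = 0
    with open(contig_file) as handle:
        for line in handle:
            if line.startswith(">"):
                continue  # header line, not sequence
            length += len(line.strip())
    return length
```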
Format memory statistics into a human-readable report.
Parameters:
stats: Memory statistics dictionary
Returns: Formatted string report
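A report in the shape of the sample output shown earlier could be produced like this. The dictionary keys and the function name `format_memory_report` are assumptions chosen for illustration; the real statistics dictionary may use different names.

```python
def format_memory_report(stats):
    """Render a statistics dictionary as a multi-line, human-readable report."""
    lines = [
        f"Process: {stats['process_name']}",
        f"Duration: {stats['duration_seconds']:.2f} seconds",
        f"Peak RSS: {stats['peak_rss_mb']:.1f} MB",
        f"Peak VMS: {stats['peak_vms_mb']:.1f} MB",
        f"Average RSS: {stats['avg_rss_mb']:.1f} MB",
        f"Samples collected: {stats['sample_count']}",
    ]
    return "\n".join(lines)
```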
- Prediction Phase: Before running ab initio tools, estimate memory usage based on contig lengths
- System Check: Assess available system memory and suggest CPU allocation
- Real-time Monitoring: During subprocess execution, sample memory usage at regular intervals
- Reporting: Log memory statistics and generate reports
- Model Updates: Optionally update prediction models with actual usage data
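The model update step is not specified in detail. One simple possibility, offered only as an illustration, is an ordinary least-squares refit of the base and per-MB coefficients from observed peaks. The function name `fit_linear_model` and the observation layout `(sequence_mb, peak_rss_mb)` are hypothetical.

```python
def fit_linear_model(observations):
    """Fit peak_mb = base + slope * sequence_mb by ordinary least squares.

    observations is a list of (sequence_mb, peak_rss_mb) pairs.
    Returns (base_mb, mb_per_sequence_mb).
    """
    if len(observations) < 2:
        raise ValueError("need at least two observations to fit a line")
    xs = [x for x, _ in observations]
    ys = [y for _, y in observations]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all contigs have the same size; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in observations)
    slope = sxy / sxx
    base = mean_y - slope * mean_x
    return base, slope
```

With only a handful of contigs the fit can be noisy, so it may be safer to blend the fitted coefficients with the defaults than to replace them outright.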
The memory monitor:
- Creates a psutil.Process object for the subprocess
- Samples memory usage every 100 ms (configurable)
- Tracks both the main process and all child processes
- Handles process termination gracefully
- Calculates statistics from all samples
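The sampling loop described above can be sketched with a background thread. This is not the project's implementation: `MemorySampler` and `psutil_rss_reader` are hypothetical names, and the reader is passed in as a callable so the sampling logic can be exercised without psutil. The psutil calls used (`Process`, `children(recursive=True)`, `memory_info().rss`, `NoSuchProcess`) are part of psutil's public API.

```python
import threading
import time


class MemorySampler:
    """Sample a memory reader on a background thread until stopped."""

    def __init__(self, read_rss_bytes, interval=0.1):
        # read_rss_bytes: callable returning bytes, or None once the process is gone.
        self._read = read_rss_bytes
        self._interval = interval
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.samples = []

    def _run(self):
        while not self._stop_event.is_set():
            rss = self._read()
            if rss is None:
                break  # process terminated; stop sampling quietly
            self.samples.append((time.monotonic(), rss))
            self._stop_event.wait(self._interval)

    def start(self):
        self._thread.start()
        return self

    def stop(self):
        self._stop_event.set()
        self._thread.join()
        return self.samples


def psutil_rss_reader(pid):
    """Build a reader that sums RSS over a process and all of its children."""
    import psutil  # third-party dependency, imported lazily

    proc = psutil.Process(pid)

    def read():
        try:
            procs = [proc] + proc.children(recursive=True)
        except psutil.NoSuchProcess:
            return None
        total = 0
        for p in procs:
            try:
                total += p.memory_info().rss
            except psutil.NoSuchProcess:
                continue  # a child exited between listing and reading
        return total

    return read
```

With a `subprocess.Popen` object named `popen`, usage would look like `sampler = MemorySampler(psutil_rss_reader(popen.pid)).start()`, followed by `popen.wait()` and `sampler.stop()`.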
The system adjusts CPU allocation by:
- Estimating memory usage per parallel process
- Calculating how many processes can fit in available memory
- Leaving a 20% buffer for the operating system
- Ensuring at least 1 CPU is allocated
- Not exceeding the user-specified maximum
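The steps above can be sketched as a single function. The name `suggest_cpus`, its parameters, and the treatment of 1 GB as 1024 MB are assumptions made for illustration; only the 20% buffer, the minimum of one CPU, and the user-specified maximum come from the description.

```python
def suggest_cpus(per_process_mb, available_gb, max_cpus, buffer_fraction=0.20):
    """Return how many parallel processes fit in memory, between 1 and max_cpus."""
    # Reserve a share of available memory for the operating system.
    usable_mb = available_gb * 1024 * (1 - buffer_fraction)
    if per_process_mb <= 0:
        return max(1, max_cpus)  # no estimate available, so memory is not the limit
    fit = int(usable_mb // per_process_mb)
    return max(1, min(max_cpus, fit))


print(suggest_cpus(2000, 16, 8))
```

With 16 GB available and 2,000 MB per process, about 13,107 MB is usable after the buffer, so six processes fit even though eight CPUs were requested.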
Run the test suite to verify functionality:
python test_memory_monitoring.py
This will test:
- Memory prediction models
- System memory information
- CPU allocation suggestions
- Total memory estimation
- Real-time memory monitoring
The memory monitoring system requires:
- psutil - For system and process memory monitoring
- json - For saving/loading memory statistics
- time - For timing and sampling
- threading - For concurrent memory monitoring
Potential improvements include:
- Machine Learning Models - Use actual usage data to train better prediction models
- Memory Profiling - Detailed analysis of memory allocation patterns
- Dynamic Scheduling - Adjust CPU allocation during runtime based on actual usage
- Memory Limits - Hard memory limits with process termination
- Historical Analysis - Long-term memory usage trends and optimization
- Tool-specific Tuning - Fine-tune memory models for different ab initio tools
- psutil not available: Install with pip install psutil
- Permission errors: Some systems may restrict process monitoring
- Inaccurate predictions: Models are empirical and may need tuning for your data
- Memory monitoring overhead: Monitoring adds small CPU/memory overhead
Memory monitoring has minimal performance impact:
- ~1-2% CPU overhead for sampling
- ~1-5 MB memory overhead for the monitor
- Sampling interval can be adjusted to reduce overhead