## Overview
Identify and optimize historical data that is unlikely to change, preparing it for efficient game integration with pre-computed aggregations and optimized formats.
## Context
Much of our alignment research data (especially papers from 2020-2023) is essentially frozen - these papers won't change. We can heavily optimize this data for game performance.
## Data Categories by Update Frequency
### Static (Never Updates)
- Published papers (2020-2023)
- Historical blog posts
- Archived forum discussions
- Past funding events
### Semi-Static (Rare Updates)
- Paper metadata (citations may increase)
- Author profiles (minor corrections)
- Organization info (annual updates)
### Dynamic (Frequent Updates)
- Recent papers (2024-2025)
- Active forum discussions
- Ongoing funding rounds
- Weekly alignment research dataset updates
## Optimization Opportunities
### 1. Pre-Compute Aggregations
For static data (2020-2023), compute once and cache:
- Papers per year/quarter/month
- Top authors by publication count
- Topic distribution over time
- Citation networks (who cites whom)
- Collaboration networks (co-authorship)
- Keyword frequency trends
- Average reading time by category
- Research impact scores
Output: `data/serveable/analytics/aggregated/static/`
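As a minimal sketch, the first few aggregations could be pre-computed along these lines. The field names (`date`, `authors`, `topics`) are assumptions about the record schema, not confirmed by the pipeline:

```python
from collections import Counter

def compute_aggregations(papers):
    """Pre-compute simple aggregations for a list of paper dicts.

    Assumed (hypothetical) schema per record:
      'date'    -- ISO date string, e.g. '2021-05-01'
      'authors' -- list of author names
      'topics'  -- list of topic strings
    """
    # Papers per year: slice the ISO date string for the year component.
    papers_per_year = Counter(p["date"][:4] for p in papers)
    # Top authors by publication count (flatten all author lists).
    author_counts = Counter(a for p in papers for a in p["authors"])
    # Topic distribution across the whole corpus.
    topic_counts = Counter(t for p in papers for t in p["topics"])
    return {
        "papers_per_year": dict(papers_per_year),
        "top_authors": author_counts.most_common(10),
        "topic_distribution": dict(topic_counts),
    }
```

The same pattern extends to per-quarter/month bins and to co-authorship pairs for the collaboration network.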
### 2. Generate Indexes
Create lookup indexes for fast game queries:
- `paper_by_id.json` - direct ID lookup
- `papers_by_year.json` - grouped by year
- `papers_by_author.json` - author's publication list
- `papers_by_topic.json` - topic-based filtering
- `papers_by_impact.json` - sorted by importance
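A sketch of how the first three indexes might be generated (again assuming hypothetical `id`, `date`, and `authors` fields on each record):

```python
import json
from collections import defaultdict
from pathlib import Path

def generate_indexes(papers, out_dir):
    """Write lookup indexes for a list of paper dicts to out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Direct ID lookup: full record keyed by its id.
    by_id = {p["id"]: p for p in papers}

    # Year and author indexes map to lists of paper ids, not full records,
    # to keep the index files small.
    by_year = defaultdict(list)
    by_author = defaultdict(list)
    for p in papers:
        by_year[p["date"][:4]].append(p["id"])
        for author in p["authors"]:
            by_author[author].append(p["id"])

    (out / "paper_by_id.json").write_text(json.dumps(by_id))
    (out / "papers_by_year.json").write_text(json.dumps(by_year))
    (out / "papers_by_author.json").write_text(json.dumps(by_author))
```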
### 3. Optimize File Formats
For static data:
- Minified JSON (no whitespace)
- Gzip compression for large files
- Split large files into chunks
- Create manifest files for versioning
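Minification and compression can be combined in one pass with the standard library alone; a sketch (file names are illustrative):

```python
import gzip
import json
from pathlib import Path

def optimize_file(src, dst):
    """Minify a JSON file and write it gzip-compressed to dst."""
    data = json.loads(Path(src).read_text())
    # separators=(",", ":") strips all whitespace from the output.
    minified = json.dumps(data, separators=(",", ":"))
    with gzip.open(dst, "wt", encoding="utf-8") as f:
        f.write(minified)
```

Chunking and manifest generation would layer on top of this: split the record list into fixed-size slices, write each slice through `optimize_file`, and record the chunk names and hashes in a manifest.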
### 4. Create Summary Datasets
For game UI/displays:
- Top 100 most important papers
- Timeline highlights (major breakthroughs)
- Key researcher profiles
- Organization overviews
- Quick stats dashboard
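A top-N summary could be derived from the enriched records like this (the `impact_score` and `title` fields are hypothetical names for whatever the enrichment pipeline produces):

```python
def build_summary(papers, top_n=100):
    """Build a compact summary dataset for game UI displays."""
    # Rank by an assumed impact score; records without one sort last.
    ranked = sorted(papers, key=lambda p: p.get("impact_score", 0), reverse=True)
    return {
        # Keep only the fields the UI needs, not full records.
        "top_papers": [{"id": p["id"], "title": p["title"]} for p in ranked[:top_n]],
        "stats": {"total_papers": len(papers)},
    }
```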
## Implementation
Script: `scripts/optimization/optimize_static_data.py`
```python
import logging

def get_logger(name):
    # Stand-in for the project's logging helper (assumed to live elsewhere).
    return logging.getLogger(name)

class StaticDataOptimizer:
    def __init__(self, cutoff_date='2024-01-01'):
        self.cutoff_date = cutoff_date
        self.logger = get_logger('static_optimization')

    def identify_static_data(self):
        # Find all records dated before the cutoff
        pass

    def compute_aggregations(self, records):
        # Pre-compute all aggregations
        pass

    def generate_indexes(self, records):
        # Create lookup indexes
        pass

    def optimize_formats(self, data):
        # Minify and compress
        pass

    def create_summaries(self, records):
        # Generate summary datasets
        pass
```

Workflow: `.github/workflows/optimize-static-data.yml`
Run monthly (since this data doesn't change):
```yaml
name: Optimize Static Data

on:
  schedule:
    - cron: '0 3 1 * *'  # First day of month, 3am
  workflow_dispatch:

jobs:
  optimize:
    steps:
      - name: Identify static data
      - name: Compute aggregations
      - name: Generate indexes
      - name: Optimize formats
      - name: Create summaries
      - name: Validate outputs
      - name: Commit optimized data
```

## Specific Optimizations
### Alignment Research (2020-2023)
Current: ~600 records, ~18 MB raw
After optimization:
- Aggregated stats: <100KB
- Indexes: <500KB
- Compressed papers: ~5MB
- Summary dataset: <200KB
Total size reduction: 70-80%
## Benefits for pdoom1 Game
- Faster load times (smaller files)
- Instant lookups (pre-built indexes)
- Rich analytics (pre-computed stats)
- Better UX (summary datasets for UI)
- Reduced bandwidth (compression)
## Implementation Steps
- Create optimization script
- Implement static data detection (by date)
- Add aggregation computations
  - Papers per period
  - Author statistics
  - Topic trends
  - Citation networks
- Generate all indexes
- Implement format optimization
  - JSON minification
  - Gzip compression
  - File chunking
- Create summary datasets
- Add validation checks
- Test with game integration
- Document optimization process
- Set up monthly workflow
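The date-based static/dynamic split in the second step could be as simple as the following sketch, which relies on ISO date strings comparing correctly as plain strings (`date` is again an assumed field name):

```python
def identify_static_data(records, cutoff="2024-01-01"):
    """Split records into static (pre-cutoff) and dynamic (post-cutoff) sets.

    ISO-8601 date strings sort lexicographically in chronological order,
    so a plain string comparison against the cutoff is sufficient.
    """
    static, dynamic = [], []
    for record in records:
        (static if record["date"] < cutoff else dynamic).append(record)
    return static, dynamic
```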
## Acceptance Criteria
- Static data identified and separated
- All aggregations computed and validated
- Indexes generated and tested
- File sizes reduced by 70%+
- Summary datasets accurate
- Optimized data loads correctly in game
- Monthly workflow runs successfully
- Documentation complete
## Performance Targets
| Metric | Before | After | Improvement |
|---|---|---|---|
| File Size | 18 MB | <5 MB | 72% reduction |
| Load Time | ~2s | <0.5s | 75% faster |
| Query Time | Linear scan | Index lookup | 100x faster |
| Aggregation Time | On-demand | Pre-computed | Instant |
## Timeline
Estimated effort: 6-8 hours
Priority: Low-Medium (optimization, not blocker)
## Dependencies
- Implement data cleaning and enrichment pipeline for transformed zone #14: Ideally have enrichment pipeline first
- Transform alignment research data to timeline events for pdoom1 game integration #13: Timeline transformation provides context
## Notes
- Focus on 2020-2023 data initially (most stable)
- Can extend to 2024 data in future
- Consider using tools like jq for JSON optimization
- May want to use MessagePack or Protocol Buffers for even smaller sizes
- Keep original enriched data intact (optimization is separate)
## Related Issues
- Transform alignment research data to timeline events for pdoom1 game integration #13: Transform to timeline events
- Implement data cleaning and enrichment pipeline for transformed zone #14: Cleaning and enrichment pipelines
- Implement serveable zone and pdoom1 data sync mechanism #15: Serveable zone and sync