Optimize static historical data for game integration #16

@PipFoweraker

Description

Overview

Identify and optimize historical data that is unlikely to change, preparing it for efficient game integration with pre-computed aggregations and optimized formats.

Context

Much of our alignment research data (especially papers from 2020-2023) is essentially frozen: these papers won't change, so we can optimize them heavily for game performance.

Data Categories by Update Frequency

Static (Never Updates)

  • Published papers (2020-2023)
  • Historical blog posts
  • Archived forum discussions
  • Past funding events

Semi-Static (Rare Updates)

  • Paper metadata (citations may increase)
  • Author profiles (minor corrections)
  • Organization info (annual updates)

Dynamic (Frequent Updates)

  • Recent papers (2024-2025)
  • Active forum discussions
  • Ongoing funding rounds
  • Weekly alignment research dataset updates
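
In practice the static/dynamic boundary is just a date cutoff (semi-static records are distinguished by record type rather than date). A minimal sketch of the check, assuming a hypothetical 'published' ISO-date field on each record:

from datetime import date

STATIC_CUTOFF = date(2024, 1, 1)  # records published before this are treated as frozen

def is_static(record):
    # 'published' field name is an assumption about the enriched schema
    return date.fromisoformat(record['published']) < STATIC_CUTOFF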

Optimization Opportunities

1. Pre-Compute Aggregations

For static data (2020-2023), compute once and cache:

  • Papers per year/quarter/month
  • Top authors by publication count
  • Topic distribution over time
  • Citation networks (who cites whom)
  • Collaboration networks (co-authorship)
  • Keyword frequency trends
  • Average reading time by category
  • Research impact scores

Output: data/serveable/analytics/aggregated/static/
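
As a sketch of the compute-once idea, here are two of these aggregations in plain Python; the 'year' and 'authors' record fields are assumptions about the enriched schema:

import json
from collections import Counter
from pathlib import Path

def aggregate_basic_stats(records, out_dir='data/serveable/analytics/aggregated/static'):
    # Compute once, then cache to the static output directory
    stats = {
        'papers_per_year': dict(Counter(r['year'] for r in records)),
        'top_authors': Counter(
            a for r in records for a in r.get('authors', [])
        ).most_common(20),
    }
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    with open(f'{out_dir}/basic_stats.json', 'w') as f:
        json.dump(stats, f, separators=(',', ':'))  # minified (see section 3)
    return stats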

2. Generate Indexes

Create lookup indexes for fast game queries:

  • paper_by_id.json - Direct ID lookup
  • papers_by_year.json - Grouped by year
  • papers_by_author.json - Author's publication list
  • papers_by_topic.json - Topic-based filtering
  • papers_by_impact.json - Sorted by importance
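
A sketch of how two of these indexes could be built; storing only paper IDs keeps the index files small, and the 'id', 'year', and 'authors' field names are assumptions:

import json
from collections import defaultdict

def build_indexes(records):
    by_year, by_author = defaultdict(list), defaultdict(list)
    for r in records:
        by_year[str(r['year'])].append(r['id'])
        for author in r.get('authors', []):
            by_author[author].append(r['id'])
    with open('papers_by_year.json', 'w') as f:
        json.dump(by_year, f, separators=(',', ':'))
    with open('papers_by_author.json', 'w') as f:
        json.dump(by_author, f, separators=(',', ':'))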

3. Optimize File Formats

For static data:

  • Minified JSON (no whitespace)
  • Gzip compression for large files
  • Split large files into chunks
  • Create manifest files for versioning
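
The first two steps need only the standard library; a minimal example (file paths are illustrative):

import gzip
import json

def minify_and_gzip(src_path, dest_path):
    with open(src_path) as f:
        data = json.load(f)
    # separators=(',', ':') strips all whitespace from the serialized JSON
    payload = json.dumps(data, separators=(',', ':')).encode('utf-8')
    with gzip.open(dest_path, 'wb') as f:
        f.write(payload)

minify_and_gzip('papers_2020_2023.json', 'papers_2020_2023.json.gz')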

4. Create Summary Datasets

For game UI/displays:

  • Top 100 most important papers
  • Timeline highlights (major breakthroughs)
  • Key researchers profiles
  • Organization overviews
  • Quick stats dashboard
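
For example, the top-100 list reduces to a sort over a precomputed score; the 'impact', 'id', and 'title' field names are assumptions about the schema:

def top_100_papers(records):
    # Keep only the fields the game UI needs for list rendering
    ranked = sorted(records, key=lambda r: r.get('impact', 0), reverse=True)
    return [
        {'id': r['id'], 'title': r['title'], 'impact': r.get('impact', 0)}
        for r in ranked[:100]
    ]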

Implementation

Script: scripts/optimization/optimize_static_data.py

import logging

def get_logger(name):
    # Stand-in for the project's shared logging helper; swap in the real import
    return logging.getLogger(name)


class StaticDataOptimizer:
    def __init__(self, cutoff_date='2024-01-01'):
        self.cutoff_date = cutoff_date
        self.logger = get_logger('static_optimization')

    def identify_static_data(self):
        """Find all records published before the cutoff date."""
        pass

    def compute_aggregations(self, records):
        """Pre-compute all aggregations (counts, trends, networks)."""
        pass

    def generate_indexes(self, records):
        """Create lookup indexes (by id, year, author, topic, impact)."""
        pass

    def optimize_formats(self, data):
        """Minify JSON and gzip-compress large files."""
        pass

    def create_summaries(self, records):
        """Generate summary datasets for the game UI."""
        pass
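
For reference, the intended end-to-end flow might chain the phases like this (a sketch; the methods above are still stubs):

optimizer = StaticDataOptimizer(cutoff_date='2024-01-01')
static_records = optimizer.identify_static_data()
optimizer.compute_aggregations(static_records)
optimizer.generate_indexes(static_records)
optimizer.create_summaries(static_records)
optimizer.optimize_formats(static_records)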

Workflow: .github/workflows/optimize-static-data.yml

Run monthly; since this data rarely changes, a low-frequency schedule is sufficient:

name: Optimize Static Data

on:
  schedule:
    - cron: '0 3 1 * *'  # first day of month, 03:00 UTC
  workflow_dispatch:

jobs:
  optimize:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # The --phase flag is a placeholder; the script's CLI is not defined yet
      - name: Identify static data
        run: python scripts/optimization/optimize_static_data.py --phase identify
      - name: Compute aggregations
        run: python scripts/optimization/optimize_static_data.py --phase aggregate
      - name: Generate indexes
        run: python scripts/optimization/optimize_static_data.py --phase index
      - name: Optimize formats
        run: python scripts/optimization/optimize_static_data.py --phase compress
      - name: Create summaries
        run: python scripts/optimization/optimize_static_data.py --phase summarize
      - name: Validate outputs
        run: python scripts/optimization/optimize_static_data.py --phase validate
      - name: Commit optimized data
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add data/serveable/
          git diff --cached --quiet || git commit -m "chore: refresh optimized static data"
          git push

Specific Optimizations

Alignment Research (2020-2023)

Current: ~600 records, ~18 MB raw
After optimization:

  • Aggregated stats: <100 KB
  • Indexes: <500 KB
  • Compressed papers: ~5 MB
  • Summary dataset: <200 KB

Total size reduction: 70-80%

Benefits for pdoom1 Game

  1. Faster load times (smaller files)
  2. Instant lookups (pre-built indexes)
  3. Rich analytics (pre-computed stats)
  4. Better UX (summary datasets for UI)
  5. Reduced bandwidth (compression)

Implementation Steps

  • Create optimization script
  • Implement static data detection (by date)
  • Add aggregation computations
    • Papers per period
    • Author statistics
    • Topic trends
    • Citation networks
  • Generate all indexes
  • Implement format optimization
    • JSON minification
    • Gzip compression
    • File chunking
  • Create summary datasets
  • Add validation checks
  • Test with game integration
  • Document optimization process
  • Set up monthly workflow

Acceptance Criteria

  • Static data identified and separated
  • All aggregations computed and validated
  • Indexes generated and tested
  • File sizes reduced by 70%+
  • Summary datasets accurate
  • Optimized data loads correctly in game
  • Monthly workflow runs successfully
  • Documentation complete

Performance Targets

Metric             Before        After          Improvement
File Size          18 MB         <5 MB          72% reduction
Load Time          ~2s           <0.5s          75% faster
Query Time         Linear scan   Index lookup   100x faster
Aggregation Time   On-demand     Pre-computed   Instant

Timeline

Estimated effort: 6-8 hours
Priority: Low-Medium (optimization, not blocker)

Notes

  • Focus on 2020-2023 data initially (most stable)
  • Can extend to 2024 data in future
  • Consider using tools like jq for JSON optimization
  • May want to use MessagePack or Protocol Buffers for even smaller sizes
  • Keep original enriched data intact (optimization is separate)
