Optimize static historical data for game integration #16

@PipFoweraker

Description

Overview

Identify and optimize historical data that is unlikely to change, preparing it for efficient game integration with pre-computed aggregations and optimized formats.

Context

Much of our alignment research data (especially papers from 2020-2023) is essentially frozen: these papers won't change, so we can optimize them heavily for game performance.

Data Categories by Update Frequency

Static (Never Updates)

  • Published papers (2020-2023)
  • Historical blog posts
  • Archived forum discussions
  • Past funding events

Semi-Static (Rare Updates)

  • Paper metadata (citations may increase)
  • Author profiles (minor corrections)
  • Organization info (annual updates)

Dynamic (Frequent Updates)

  • Recent papers (2024-2025)
  • Active forum discussions
  • Ongoing funding rounds
  • Weekly alignment research dataset updates
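
In practice the static/dynamic boundary is just a date cutoff (semi-static records are distinguished by record type rather than date). A minimal sketch of the check, assuming a hypothetical 'published' ISO-date field on each record:

from datetime import date

STATIC_CUTOFF = date(2024, 1, 1)  # records published before this are treated as frozen

def is_static(record):
    # 'published' field name is an assumption about the enriched schema
    return date.fromisoformat(record['published']) < STATIC_CUTOFF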

Optimization Opportunities

1. Pre-Compute Aggregations

For static data (2020-2023), compute once and cache:

  • Papers per year/quarter/month
  • Top authors by publication count
  • Topic distribution over time
  • Citation networks (who cites whom)
  • Collaboration networks (co-authorship)
  • Keyword frequency trends
  • Average reading time by category
  • Research impact scores

Output: data/serveable/analytics/aggregated/static/
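
As a sketch of the compute-once idea, here are two of these aggregations in plain Python; the 'year' and 'authors' record fields are assumptions about the enriched schema:

import json
from collections import Counter
from pathlib import Path

def aggregate_basic_stats(records, out_dir='data/serveable/analytics/aggregated/static'):
    # Compute once, then cache to the static output directory
    stats = {
        'papers_per_year': dict(Counter(r['year'] for r in records)),
        'top_authors': Counter(
            a for r in records for a in r.get('authors', [])
        ).most_common(20),
    }
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    with open(f'{out_dir}/basic_stats.json', 'w') as f:
        json.dump(stats, f, separators=(',', ':'))  # minified (see section 3)
    return stats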

2. Generate Indexes

Create lookup indexes for fast game queries:

  • paper_by_id.json - Direct ID lookup
  • papers_by_year.json - Grouped by year
  • papers_by_author.json - Author's publication list
  • papers_by_topic.json - Topic-based filtering
  • papers_by_impact.json - Sorted by importance
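
A sketch of how two of these indexes could be built; storing only paper IDs keeps the index files small, and the 'id', 'year', and 'authors' field names are assumptions:

import json
from collections import defaultdict

def build_indexes(records):
    by_year, by_author = defaultdict(list), defaultdict(list)
    for r in records:
        by_year[str(r['year'])].append(r['id'])
        for author in r.get('authors', []):
            by_author[author].append(r['id'])
    with open('papers_by_year.json', 'w') as f:
        json.dump(by_year, f, separators=(',', ':'))
    with open('papers_by_author.json', 'w') as f:
        json.dump(by_author, f, separators=(',', ':'))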

3. Optimize File Formats

For static data:

  • Minified JSON (no whitespace)
  • Gzip compression for large files
  • Split large files into chunks
  • Create manifest files for versioning
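
The first two steps need only the standard library; a minimal example (file paths are illustrative):

import gzip
import json

def minify_and_gzip(src_path, dest_path):
    with open(src_path) as f:
        data = json.load(f)
    # separators=(',', ':') strips all whitespace from the serialized JSON
    payload = json.dumps(data, separators=(',', ':')).encode('utf-8')
    with gzip.open(dest_path, 'wb') as f:
        f.write(payload)

minify_and_gzip('papers_2020_2023.json', 'papers_2020_2023.json.gz')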

4. Create Summary Datasets

For game UI/displays:

  • Top 100 most important papers
  • Timeline highlights (major breakthroughs)
  • Key researchers profiles
  • Organization overviews
  • Quick stats dashboard
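
For example, the top-100 list reduces to a sort over a precomputed score; the 'impact', 'id', and 'title' field names are assumptions about the schema:

def top_100_papers(records):
    # Keep only the fields the game UI needs for list rendering
    ranked = sorted(records, key=lambda r: r.get('impact', 0), reverse=True)
    return [
        {'id': r['id'], 'title': r['title'], 'impact': r.get('impact', 0)}
        for r in ranked[:100]
    ]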

Implementation

Script: scripts/optimization/optimize_static_data.py

import logging

def get_logger(name):
    # Stand-in for the project's shared logging helper; swap in the real import
    return logging.getLogger(name)


class StaticDataOptimizer:
    def __init__(self, cutoff_date='2024-01-01'):
        self.cutoff_date = cutoff_date
        self.logger = get_logger('static_optimization')

    def identify_static_data(self):
        """Find all records published before the cutoff date."""
        pass

    def compute_aggregations(self, records):
        """Pre-compute all aggregations (counts, trends, networks)."""
        pass

    def generate_indexes(self, records):
        """Create lookup indexes (by id, year, author, topic, impact)."""
        pass

    def optimize_formats(self, data):
        """Minify JSON and gzip-compress large files."""
        pass

    def create_summaries(self, records):
        """Generate summary datasets for the game UI."""
        pass
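
For reference, the intended end-to-end flow might chain the phases like this (a sketch; the methods above are still stubs):

optimizer = StaticDataOptimizer(cutoff_date='2024-01-01')
static_records = optimizer.identify_static_data()
optimizer.compute_aggregations(static_records)
optimizer.generate_indexes(static_records)
optimizer.create_summaries(static_records)
optimizer.optimize_formats(static_records)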

Workflow: .github/workflows/optimize-static-data.yml

Run monthly; since this data rarely changes, a low-frequency schedule is sufficient:

name: Optimize Static Data

on:
  schedule:
    - cron: '0 3 1 * *'  # first day of month, 03:00 UTC
  workflow_dispatch:

jobs:
  optimize:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # The --phase flag is a placeholder; the script's CLI is not defined yet
      - name: Identify static data
        run: python scripts/optimization/optimize_static_data.py --phase identify
      - name: Compute aggregations
        run: python scripts/optimization/optimize_static_data.py --phase aggregate
      - name: Generate indexes
        run: python scripts/optimization/optimize_static_data.py --phase index
      - name: Optimize formats
        run: python scripts/optimization/optimize_static_data.py --phase compress
      - name: Create summaries
        run: python scripts/optimization/optimize_static_data.py --phase summarize
      - name: Validate outputs
        run: python scripts/optimization/optimize_static_data.py --phase validate
      - name: Commit optimized data
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add data/serveable/
          git diff --cached --quiet || git commit -m "chore: refresh optimized static data"
          git push

Specific Optimizations

Alignment Research (2020-2023)

Current: ~600 records, ~18 MB raw
After optimization:

  • Aggregated stats: <100 KB
  • Indexes: <500 KB
  • Compressed papers: ~5 MB
  • Summary dataset: <200 KB

Total size reduction: 70-80%

Benefits for pdoom1 Game

  1. Faster load times (smaller files)
  2. Instant lookups (pre-built indexes)
  3. Rich analytics (pre-computed stats)
  4. Better UX (summary datasets for UI)
  5. Reduced bandwidth (compression)

Implementation Steps

  • Create optimization script
  • Implement static data detection (by date)
  • Add aggregation computations
    • Papers per period
    • Author statistics
    • Topic trends
    • Citation networks
  • Generate all indexes
  • Implement format optimization
    • JSON minification
    • Gzip compression
    • File chunking
  • Create summary datasets
  • Add validation checks
  • Test with game integration
  • Document optimization process
  • Set up monthly workflow

Acceptance Criteria

  • Static data identified and separated
  • All aggregations computed and validated
  • Indexes generated and tested
  • File sizes reduced by 70%+
  • Summary datasets accurate
  • Optimized data loads correctly in game
  • Monthly workflow runs successfully
  • Documentation complete

Performance Targets

Metric             Before        After          Improvement
File Size          18 MB         <5 MB          72% reduction
Load Time          ~2s           <0.5s          75% faster
Query Time         Linear scan   Index lookup   100x faster
Aggregation Time   On-demand     Pre-computed   Instant

Timeline

Estimated effort: 6-8 hours
Priority: Low-Medium (optimization, not blocker)

Notes

  • Focus on 2020-2023 data initially (most stable)
  • Can extend to 2024 data in future
  • Consider using tools like jq for JSON optimization
  • May want to use MessagePack or Protocol Buffers for even smaller sizes
  • Keep original enriched data intact (optimization is separate)
