Run full LME-S 500q benchmark on GPU workstation #19

## Context
The KS76/77 changes (universal prompt, temporal boost, importance scoring, abstention fix) are merged to master. On a 52-question sample, compared with the old baseline:

- **Abstention: 19% → 0%** (10/52 → 0/52; 10 answers recovered)
- **Zero regressions** on previously correct answers
- **29 previously wrong answers changed**, with visibly more specific output (correctness not yet judged)
- Projected ~2.5x improvement over the 24.2% overall baseline
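
As a quick sanity check on the figures above, the arithmetic can be reproduced in a few lines. This is only a sketch: the constants come from this issue, and reading "~2.5x" as a multiplier on the 24.2% overall accuracy is an assumption about what the projection means.

```python
# Figures copied from this issue.
SAMPLE_SIZE = 52          # questions in the partial sample
OLD_ABSTENTIONS = 10      # abstentions before KS76/77
BASELINE_OVERALL = 24.2   # percent, overall baseline
PROJECTED_FACTOR = 2.5    # the "~2.5x" projection (assumed to scale overall accuracy)


def pct(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


old_abstention_rate = pct(OLD_ABSTENTIONS, SAMPLE_SIZE)
projected_overall = BASELINE_OVERALL * PROJECTED_FACTOR

print(f"Old abstention rate: {old_abstention_rate:.1f}%")  # 19.2%
print(f"Projected overall:   {projected_overall:.1f}%")    # 60.5%
```

Under that reading, the projection corresponds to roughly 60.5% overall, which is the number the full 500q run would need to confirm.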

## Task
Run the full 500-question LME-S benchmark with the GPT-4o judge on a machine with enough resources. The dev machine (CPU-only, 16 GB RAM) cannot sustain the run alongside other workloads.

### Steps
1. Set up the GPU workstation with Ollama and qwen2.5:1.5b (optionally qwen2.5:3b as well)
2. Build latest master: `cargo build --release -p shrimpk-daemon`
3. Start the daemon: `shrimpk-daemon --port 11435`
4. Run generation: `python benchmarks/run_longmemeval.py` (500 questions; ~11 h on CPU)
5. Run the GPT-4o evaluation on the generated results
6. Compare against the baseline (24.2% overall, 25.3% task-avg)
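
Because the generation step runs for hours, it may be worth confirming that the daemon is actually listening before step 4 starts. The helper below is a hypothetical pre-flight check, not part of the repo: `wait_for_port` is an illustrative name, and it only tests that a TCP connection succeeds, making no assumption about the daemon's API.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=60.0, interval_s=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError if nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            time.sleep(interval_s)
    return False


# Example usage (11435 is the daemon port from step 3):
#   if not wait_for_port("127.0.0.1", 11435, timeout_s=30.0):
#       raise SystemExit("shrimpk-daemon is not reachable on port 11435")
```

The same check could be pointed at the Ollama port (11434 by default, assuming the workstation keeps Ollama's default configuration).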

### Partial results (52/500 sample)
```
Old baseline: 13/52 correct (25.0%)
Abstentions:  10/52 → 0/52
Regressions:  0
Recovered:    10 answers
Changed:      29 answers (quality TBD)
```

### Success criteria
- Full 500q scored with GPT-4o judge
- Per-category breakdown (IE, KU, TR, NR, AB)
- Comparison table vs 24.2% baseline
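
The breakdown and comparison table could be produced along these lines. This is a sketch under stated assumptions: the record shape (`category` and `correct` fields) is hypothetical and not the actual output format of `run_longmemeval.py` or the judge, and "task-avg" is assumed to mean the unweighted mean of per-category accuracies. Only the two baseline figures come from this issue; no per-category baseline is listed here, so those cells are left as `n/a`.

```python
from collections import defaultdict

BASELINE_OVERALL = 24.2   # percent, from this issue
BASELINE_TASK_AVG = 25.3  # percent, from this issue


def summarize(records):
    """Compute per-category accuracy, overall accuracy, and task-avg (all in percent).

    Each record is assumed to look like {"category": "IE", "correct": True}.
    """
    if not records:
        raise ValueError("no judged records to summarize")
    counts = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for rec in records:
        entry = counts[rec["category"]]
        entry[0] += int(rec["correct"])
        entry[1] += 1
    by_category = {c: 100.0 * k / n for c, (k, n) in sorted(counts.items())}
    correct = sum(k for k, _ in counts.values())
    total = sum(n for _, n in counts.values())
    overall = 100.0 * correct / total
    # Assumption: task-avg is the unweighted mean over categories.
    task_avg = sum(by_category.values()) / len(by_category)
    return by_category, overall, task_avg


def format_table(by_category, overall, task_avg):
    """Render a markdown comparison table against the baseline figures."""
    rows = ["| Metric | Baseline | New | Delta |", "|---|---|---|---|"]
    for cat, acc in by_category.items():
        rows.append(f"| {cat} | n/a | {acc:.1f}% | n/a |")
    rows.append(
        f"| Overall | {BASELINE_OVERALL:.1f}% | {overall:.1f}% "
        f"| {overall - BASELINE_OVERALL:+.1f} |"
    )
    rows.append(
        f"| Task-avg | {BASELINE_TASK_AVG:.1f}% | {task_avg:.1f}% "
        f"| {task_avg - BASELINE_TASK_AVG:+.1f} |"
    )
    return "\n".join(rows)
```

Calling `format_table(*summarize(records))` on the judged records would yield a table that can be pasted straight into this issue.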

## Labels
benchmark, infrastructure
