Context
KS76/77 changes (universal prompt, temporal boost, importance scoring, abstention fix) are merged to master. A 52-question sample shows promising results:
- Abstention: 19% → 0% (10 recovered answers in sample)
- Zero regressions on previously correct answers
- 29 previously wrong answers changed, with visibly improved specificity (correctness not yet judged)
- Projected ~2.5x improvement over 24.2% baseline
Task
Run the full 500-question LME-S benchmark with the GPT-4o judge on a machine with enough resources. The dev machine (CPU-only, 16 GB RAM) cannot sustain the run alongside other workloads.
Steps
- Set up GPU workstation with Ollama + qwen2.5:1.5b (and optionally 3b)
- Build latest master:
    cargo build --release -p shrimpk-daemon
- Start daemon:
    shrimpk-daemon --port 11435
- Run generation (500q, ~11h on CPU):
    python benchmarks/run_longmemeval.py
- Run GPT-4o evaluation on results
- Compare vs baseline (24.2% overall, 25.3% task-avg)
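Because the generation step runs for hours, it may be worth confirming that something is listening on the daemon port before launching it. The following is a minimal sketch, assuming the daemon runs on localhost and accepts plain TCP connections on the port given above. The helper name `port_is_open` is hypothetical and not part of the repository.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 11435 is the port passed to shrimpk-daemon in the steps above;
    # 127.0.0.1 is an assumption about where the daemon runs.
    if port_is_open("127.0.0.1", 11435):
        print("daemon port reachable")
    else:
        print("nothing listening on 11435; start shrimpk-daemon first")
```

This only checks that the port accepts connections; it does not verify that the process behind it is the daemon or that the model is loaded.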
Partial results (52/500 sample)
Old baseline: 13/52 correct (25.0%)
Abstentions: 10/52 → 0/52
Regressions: 0
Recovered: 10 answers
Changed: 29 answers (quality TBD)
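A 52-question sample leaves a lot of statistical uncertainty, which is part of the case for the full run. As a rough illustration, the sketch below computes a 95% Wilson score interval for the old baseline on the sample (13/52) and, for comparison, for a 500-question run at the same rate as the 24.2% baseline (121/500, used here only to show how the interval narrows). The Wilson interval is a swapped-in standard technique, not something the benchmark scripts are known to compute.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # 13/52 is the old baseline on the sample, taken from this issue.
    lo, hi = wilson_interval(13, 52)
    print(f"n=52:  13/52 = {13 / 52:.1%}, interval {lo:.1%} to {hi:.1%}")

    # 121/500 is 24.2% of 500; illustrative only, to compare interval widths.
    lo, hi = wilson_interval(121, 500)
    print(f"n=500: 121/500 = {121 / 500:.1%}, interval {lo:.1%} to {hi:.1%}")
```

The interval at n=52 spans well over twenty percentage points, so the sample can suggest a direction but cannot pin down the size of the improvement.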
Success criteria
- Full 500q scored with GPT-4o judge
- Per-category breakdown (IE, KU, TR, NR, AB)
- Comparison table vs 24.2% baseline
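The per-category breakdown and baseline comparison could be produced along these lines. This is a sketch under stated assumptions: the record layout (dicts with `category` and `correct` keys) is hypothetical and may not match what the judge step actually emits, and "task-avg" is assumed to mean the unweighted mean of per-category accuracies. The two baseline figures come from this issue; the demo records are toy data, not results.

```python
from collections import defaultdict

BASELINE_OVERALL = 0.242   # from this issue
BASELINE_TASK_AVG = 0.253  # from this issue


def summarize(records):
    """Return (per-category accuracy, overall accuracy, task-averaged accuracy)."""
    if not records:
        raise ValueError("no records to summarize")
    counts = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for rec in records:
        bucket = counts[rec["category"]]
        bucket[0] += int(bool(rec["correct"]))
        bucket[1] += 1
    by_cat = {cat: k / n for cat, (k, n) in counts.items()}
    overall = sum(k for k, _ in counts.values()) / sum(n for _, n in counts.values())
    # Assumption: task average = unweighted mean over categories.
    task_avg = sum(by_cat.values()) / len(by_cat)
    return by_cat, overall, task_avg


def print_table(records):
    by_cat, overall, task_avg = summarize(records)
    print(f"{'category':<10}{'accuracy':>10}")
    for cat in sorted(by_cat):
        print(f"{cat:<10}{by_cat[cat]:>10.1%}")
    print(f"{'overall':<10}{overall:>10.1%}  (baseline {BASELINE_OVERALL:.1%})")
    print(f"{'task-avg':<10}{task_avg:>10.1%}  (baseline {BASELINE_TASK_AVG:.1%})")


if __name__ == "__main__":
    # Toy records for illustration only; not benchmark output.
    demo = [
        {"category": "IE", "correct": True},
        {"category": "IE", "correct": False},
        {"category": "KU", "correct": True},
    ]
    print_table(demo)
```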
Labels
benchmark, infrastructure