
Run full LME-S 500q benchmark on GPU workstation #19

@Liorrr

Context

The KS76/77 changes (universal prompt, temporal boost, importance scoring, abstention fix) are merged to master. A 52-question sample shows promising early results:

  • Abstention: 19% → 0% (10 recovered answers in sample)
  • Zero regressions on previously correct answers
  • 29 previously-wrong answers changed with visibly improved specificity
  • Projected ~2.5x improvement over the 24.2% baseline (unconfirmed until the changed answers are judged)

Task

Run the full 500-question LME-S benchmark with a GPT-4o judge on a machine with enough resources. The dev machine (CPU-only, 16 GB RAM) chokes when the benchmark runs alongside other workloads.

Steps

  1. Set up GPU workstation with Ollama + qwen2.5:1.5b (and optionally 3b)
  2. Build latest master: cargo build --release -p shrimpk-daemon
  3. Start daemon: shrimpk-daemon --port 11435
  4. Run generation: python benchmarks/run_longmemeval.py (500q, ~11h on CPU)
  5. Run GPT-4o evaluation on results
  6. Compare vs baseline (24.2% overall, 25.3% task-avg)
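
Since generation takes hours, it may be worth confirming the daemon is actually reachable between steps 3 and 4. A minimal sketch, assuming the daemon accepts plain TCP connections on localhost at the port given in step 3; the helper name `is_port_open` is hypothetical and not part of the repo:

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumption: shrimpk-daemon listens on 127.0.0.1:11435 (port from step 3).
    if is_port_open("127.0.0.1", 11435):
        print("daemon reachable, OK to start generation")
    else:
        print("daemon not reachable on 11435, start it before step 4")
```

This only checks that something is listening on the port, not that the daemon is healthy or that Ollama has the model loaded.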

Partial results (52/500 sample)

Old baseline: 13/52 correct (25.0%)
Abstentions:  10/52 → 0/52
Regressions:  0
Recovered:    10 answers
Changed:      29 answers (quality TBD)
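
For reference, the arithmetic behind these sample figures can be checked with a few lines. This is only a sanity check on the numbers quoted in this issue; reading "~2.5x" as 2.5 times the 24.2% baseline is an assumption, and nothing here is a judged result:

```python
# Figures copied from the 52-question sample in this issue.
SAMPLE_SIZE = 52
BASELINE_CORRECT = 13
ABSTENTIONS_BEFORE = 10
CHANGED = 29
BASELINE_OVERALL_PCT = 24.2
PROJECTED_FACTOR = 2.5  # assumption: "~2.5x" means 2.5 times the baseline


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


sample_baseline_pct = pct(BASELINE_CORRECT, SAMPLE_SIZE)
abstention_before_pct = pct(ABSTENTIONS_BEFORE, SAMPLE_SIZE)

# Answers that were not correct before, versus answers that changed
# (former abstentions plus previously-wrong answers that now differ).
previously_not_correct = SAMPLE_SIZE - BASELINE_CORRECT
touched = ABSTENTIONS_BEFORE + CHANGED

implied_target_pct = round(PROJECTED_FACTOR * BASELINE_OVERALL_PCT, 1)

print(f"sample baseline:    {sample_baseline_pct}%")
print(f"abstention before:  {abstention_before_pct}%")
print(f"not correct before: {previously_not_correct}, changed since: {touched}")
print(f"implied target:     {implied_target_pct}%")
```

On these numbers, the 10 former abstentions and the 29 changed answers add up to all 39 answers that were not correct before, so the GPT-4o judge alone determines how much of the projection holds.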

Success criteria

  • Full 500q scored with GPT-4o judge
  • Per-category breakdown (IE, KU, TR, NR, AB)
  • Comparison table vs 24.2% baseline
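
A possible shape for the breakdown and comparison table, as a sketch only. It assumes the judged output can be reduced to `(category, is_correct)` pairs and that "task-avg" means the unweighted mean of per-category accuracies; the function names and the toy rows are hypothetical, not real benchmark output:

```python
from collections import defaultdict

# Baseline figures quoted in this issue.
BASELINE_OVERALL = 24.2
BASELINE_TASK_AVG = 25.3


def summarize(judged):
    """judged: iterable of (category, is_correct) pairs.

    Returns (per_category, overall, task_avg), all in percent.
    overall weights every question equally; task_avg weights every
    category equally (assumed definition of "task-avg").
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for category, is_correct in judged:
        totals[category] += 1
        hits[category] += int(is_correct)
    if not totals:
        raise ValueError("no judged answers supplied")
    per_category = {c: 100.0 * hits[c] / totals[c] for c in totals}
    overall = 100.0 * sum(hits.values()) / sum(totals.values())
    task_avg = sum(per_category.values()) / len(per_category)
    return per_category, overall, task_avg


def format_table(per_category, overall, task_avg):
    """Render a plain-text table; per-category baselines are not in this issue."""
    lines = [f"{'Metric':<10}{'New':>8}{'Baseline':>10}{'Delta':>8}"]
    for c in sorted(per_category):
        lines.append(f"{c:<10}{per_category[c]:>8.1f}{'n/a':>10}{'n/a':>8}")
    for name, new, base in (
        ("Overall", overall, BASELINE_OVERALL),
        ("Task-avg", task_avg, BASELINE_TASK_AVG),
    ):
        lines.append(f"{name:<10}{new:>8.1f}{base:>10.1f}{new - base:>+8.1f}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Toy rows for illustration only, not benchmark results.
    toy = [("IE", True), ("IE", False), ("KU", True),
           ("TR", False), ("TR", False), ("TR", True)]
    print(format_table(*summarize(toy)))
```

Reporting both averages matters because categories differ in size: a gain concentrated in one large category moves the overall number more than the task average.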

Labels

benchmark, infrastructure
