HOMEGOLF-15: Drive the local-cluster HOMEGOLF score-improvement loop to operational closure #633

@AtlantisPleb

Summary

Use the reachable local clustered HOMEGOLF devices as the primary proving ground for the Rust-native Parameter Golf lane until system changes are no longer blocked by operational failures and are visibly improving retained local scores.

This is the main tracking issue for the next loop.

Why

Current retained reality:

  • Psionic's retained real full-validation PGOLF score is still 6.306931747817168
  • recent local clustered HOMEGOLF runs exposed real operational blockers instead of a clean score-improvement loop
  • 20260328f, 20260328k, and 20260328l exported artifacts but still did not close into detached artifact_score_report.json receipts
  • 20260328j died with cudaMalloc failed: out of memory
  • 20260328m hit NaN on step 2 and then panicked during EMA final-surface application

That means the immediate job is not “turn the big GPUs back on and hope.” The immediate job is to drive out Rust runtime, score-closeout, and operator-kink failures on the local clustered HOMEGOLF lane until system changes produce retained score movement.

Intent

The operating rule from here is:

  • local clustered HOMEGOLF devices first for Rust/runtime/ops closure
  • require retained local score improvement from those system changes
  • promote only de-risked candidates to H100
  • use 8xH100 only after the current Rust lane stops tripping over avoidable blockers
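The operating rule above can be read as a small promotion gate. The following is a minimal sketch only: `CandidateStatus`, `PromotionVerdict`, `promotion_verdict`, and all field names are hypothetical and do not come from the Psionic codebase, and the requirement that a single-H100 confirmation precedes 8xH100 is an assumption drawn from the "single-H100 and later 8xH100" ordering in this issue.

```rust
/// Where a candidate is allowed to run next (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PromotionVerdict {
    /// Keep iterating on the local clustered HOMEGOLF devices.
    StayLocal,
    /// De-risked locally; eligible for a single-H100 run.
    PromoteSingleH100,
    /// Confirmed on a single H100; eligible for 8xH100.
    Promote8xH100,
}

/// Summary of a candidate's state (hypothetical fields).
pub struct CandidateStatus {
    /// Count of open operational blockers, e.g. score-closeout, OOM, NaN, queue-truth.
    pub open_blockers: usize,
    /// True if a retained local clustered score receipt improved.
    pub local_score_improved: bool,
    /// True if a single-H100 run already confirmed the candidate (assumed step).
    pub single_h100_confirmed: bool,
}

/// Apply the local-first rule: no promotion while blockers remain
/// or while the retained local score has not improved.
pub fn promotion_verdict(c: &CandidateStatus) -> PromotionVerdict {
    if c.open_blockers > 0 || !c.local_score_improved {
        return PromotionVerdict::StayLocal;
    }
    if c.single_h100_confirmed {
        PromotionVerdict::Promote8xH100
    } else {
        PromotionVerdict::PromoteSingleH100
    }
}
```

Under this sketch, a candidate with any open blocker stays local regardless of its score, which matches the "local first" ordering of the rule.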

Scope

This master task tracks the issue stack required to enter an honest improvement loop:

  • fix detached score-closeout so artifact-only runs actually produce score receipts
  • stabilize the local clustered competitive runner under explicit grad-accum, raw, and EMA postures
  • align the exact competitive lane to the public 11L winner family where Psionic is still behind
  • build a retained local-cluster scoreboard and stop-condition loop
  • define the single-H100 and later 8xH100 promotion gate
  • prove whether XTRAIN improves the score path; if it does not, keep it off the critical path
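One possible shape for the retained local-cluster scoreboard and stop-condition loop listed above is sketched here. Everything in it is an assumption made for illustration: the `RunOutcome`, `LoopVerdict`, and `Scoreboard` types are hypothetical, the stop condition (a fixed number of consecutive runs without a retained improvement) is one choice among many, and the sketch assumes a lower PGOLF score is better, which this issue does not state.

```rust
/// Outcome of one local clustered run (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub enum RunOutcome {
    /// The run closed out with a detached score receipt.
    Scored(f64),
    /// The run exported artifacts but produced no score receipt.
    ArtifactOnly,
    /// The run died on an operational failure (OOM, NaN, panic, ...).
    Failed(String),
}

/// What the loop should do after recording a run (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum LoopVerdict {
    KeepIterating,
    /// The best retained score improved.
    Improved { best: f64 },
    /// Too many consecutive runs without a retained improvement.
    Stop,
}

pub struct Scoreboard {
    baseline: f64,
    best: Option<f64>,
    stale_runs: usize,
    max_stale_runs: usize,
}

impl Scoreboard {
    pub fn new(baseline: f64, max_stale_runs: usize) -> Self {
        Scoreboard { baseline, best: None, stale_runs: 0, max_stale_runs }
    }

    /// Record one run. Only a finite score below the current best counts
    /// as an improvement (assumes lower is better); artifact-only runs,
    /// failures, and non-improving scores all count as stale.
    pub fn record(&mut self, outcome: &RunOutcome) -> LoopVerdict {
        let current_best = self.best.unwrap_or(self.baseline);
        match outcome {
            RunOutcome::Scored(score) if score.is_finite() && *score < current_best => {
                self.best = Some(*score);
                self.stale_runs = 0;
                LoopVerdict::Improved { best: *score }
            }
            _ => {
                self.stale_runs += 1;
                if self.stale_runs >= self.max_stale_runs {
                    LoopVerdict::Stop
                } else {
                    LoopVerdict::KeepIterating
                }
            }
        }
    }
}
```

Treating artifact-only runs as stale rather than neutral keeps a run that never closes into a score receipt from looking like progress on the scoreboard.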

Acceptance Criteria

  • the subordinate issue stack is closed
  • the local clustered HOMEGOLF lane can run without the current score-closeout, OOM, NaN, or queue-truth failures
  • at least one retained local clustered HOMEGOLF score receipt improves because of concrete Rust-system or model-lane changes
  • the current best local clustered candidate has an explicit promotion verdict for single-H100 and later 8xH100

References

  • docs/2026-03-28-parameter-golf-winner-gap-and-psionic-path-audit.md
  • docs/2026-03-28-homegolf-xtrain-pgolf-explicit-grad-accum-queue-correction-audit.md
  • docs/HOMEGOLF_TRACK.md

Labels: backend (Backend work), master-task (Tracking issue), planning (Planning work), qa (Quality and validation work), roadmap (Roadmap work)
