Skip to content

data: rerun pd-allyl/rh-conjugate/heck-relay with n=10 (confirms no real improvement)#9

Merged
ericchansen merged 1 commit into
mainfrom
data/metals-noise-honest-redo
May 28, 2026
Merged

data: rerun pd-allyl/rh-conjugate/heck-relay with n=10 (confirms no real improvement)#9
ericchansen merged 1 commit into
mainfrom
data/metals-noise-honest-redo

Conversation

@ericchansen
Copy link
Copy Markdown
Owner

Summary

Reruns pd-allyl, rh-conjugate, and heck-relay with the new --n-evals 10 flag (q2mm#286, now on master) so the "within noise" caveats from #6 become statistically defensible verdicts.

Companion docs PR: ericchansen/q2mm#TBD (paired link added once opened)

Verdicts (now decisive)

System Mean Δ% CI₉₅ Verdict
pd-allyl −0.029 % ±0.34 % NOT SIGNIFICANT
rh-conjugate −0.080 % ±1.18 % NOT SIGNIFICANT
heck-relay* −0.59 % ±3.26 % NOT SIGNIFICANT

* heck-relay run with --ratio-tol none (ratio = 1.378, formally fails default gate); even with the gate bypassed, the JaxLoss surrogate broke down (2 non-finite line-search values) and the result is inside the noise band.

What this PR contains

Regenerated artifacts (5 files):

  • benchmarks/pd-allyl-amination/convergence/{validation_results,paper_metrics}.json
  • benchmarks/rh-1,4-conjugate-addition/convergence/{validation_results,paper_metrics,rh-conjugate_optimized.fld}
  • benchmarks/heck-relay/convergence/{validation_results,paper_metrics,heck-relay_optimized.fld}

(pd-allyl's pd-allyl_optimized.fld is bit-identical to the previous version — both this run and #6 reverted to initial parameters because the surrogate step worsened OF, so the saved FF is the same.)

Why these verdicts matter

The earlier #6 results reported "within noise" for these three systems and couldn't say whether there was a real signal hiding under the per-call GPU noise. With n=10 + Student-t 95% CI we can now confidently say there isn't:

  • pd-allyl: CI excludes any improvement larger than ~0.3 %
  • rh-conjugate: CI excludes any improvement larger than ~1.2 %
  • heck-relay: CI excludes any improvement larger than ~3.3 %

All well below any publishable improvement claim. The published Wahlers/Rosales FFs sit at JaxLoss local minima for our engine; further improvement requires the engine-parity work in q2mm#284, not optimizer tweaking.

Provenance (per JSON)

Wall time

  • pd-allyl: ~21 min opt + ~16 min post-eval
  • rh-conjugate: ~10 min opt + ~13 min post-eval
  • heck-relay: ~24 min opt + ~38 min post-eval
  • Total: ~2.0 hr on RTX 5090

Audit-orphans CI

This PR is the first real test of the audit-orphans.yml workflow merged in #8 — every directory modified is already referenced in q2mm/docs/systems/*.md, so it should pass.

…aluation

Re-runs the three "within noise" published-FF systems from q2mm-data#6
with the new --n-evals 10 flag (q2mm#286, landed on master).  The
n=10 samples give a Student-t 95% CI on the improvement that's tight
enough to make confident scientific verdicts:

| System          | Mean Δ%  | CI₉₅   | Verdict        |
|-----------------|---------:|-------:|----------------|
| pd-allyl        | -0.029%  | ±0.34% | NOT SIGNIFICANT |
| rh-conjugate    | -0.080%  | ±1.18% | NOT SIGNIFICANT |
| heck-relay\*    | -0.59%   | ±3.26% | NOT SIGNIFICANT |

\* heck-relay run with --ratio-tol none (ratio=1.378, formally fails
default gate); even with the gate bypassed, the JaxLoss surrogate
broke down (2 non-finite line-search values) and the result is
inside the noise band.

These are statistically defensible "no improvement" verdicts, not
"within noise so we can't tell" verdicts.  The CI₉₅ excludes any
improvement larger than ~0.3 %, ~1.2 %, and ~3.3 % for pd-allyl,
rh-conjugate, and heck-relay respectively — well below any
publishable improvement claim.

Provenance:
- q2mm git_sha: 86d8483 (master, post #286)
- q2mm-data git_sha: a3cc8d7 (main, post #8)
- n_evals: 10
- ratio_tol: 0.15 (default) for pd-allyl/rh-conjugate; null for heck-relay

Wall time:
- pd-allyl:      ~21 min opt + 16 min post-eval
- rh-conjugate:  ~10 min opt + 13 min post-eval
- heck-relay:    ~24 min opt + 38 min post-eval
- Total:        ~2.0 hr GPU on RTX 5090

Companion docs update lives in ericchansen/q2mm
docs/systems/{pd-allyl,rh-conjugate,heck-relay}.md.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ericchansen added a commit to ericchansen/q2mm that referenced this pull request May 28, 2026
…e/heck-relay (n=10)

Reruns the three "within noise" published-FF systems with the
--n-evals 10 statistical evaluation (q2mm#286, landed on master).
The n=10 samples give a Student-t 95% CI tight enough to make
confident scientific verdicts where the earlier single-call PR
#283 could only flag the results as "within noise":

| System          | Mean Δ%  | CI₉₅   | Verdict        |
|-----------------|---------:|-------:|----------------|
| pd-allyl        | -0.029%  | ±0.34% | NOT SIGNIFICANT |
| rh-conjugate    | -0.080%  | ±1.18% | NOT SIGNIFICANT |
| heck-relay*     | -0.59%   | ±3.26% | NOT SIGNIFICANT |

(*) heck-relay run with --ratio-tol none; even with the gate
bypassed JaxLoss broke down (2 non-finite line-search values).

Each per-system page is rewritten:

- The earlier "within noise floor, cannot claim" caveat is replaced
  with a "Confirmed: published Wahlers/Rosales FF sits at a JaxLoss
  local minimum" success callout — the CI excludes any improvement
  larger than the per-system noise floor, so this is now a
  defensible "no real improvement available", not "we can't tell".
- Metric tables updated to show mean ± CI₉₅ %, not single-call
  values.
- The 4602-ratio non-determinism caveat on rh-conjugate is removed
  (with n=10 the ratio is stable at 1.01) and #278 stays closed.
- heck-relay's "keep default ratio_tol=0.15" recommendation is
  strengthened: with statistical rigor in place, --ratio-tol none
  demonstrably doesn't unlock useful optimization.

Companion data PR with the regenerated JSON + FFs:
ericchansen/q2mm-data#9.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ericchansen added a commit to ericchansen/q2mm that referenced this pull request May 28, 2026
…e/heck-relay (n=10)

Reruns the three "within noise" published-FF systems with the
--n-evals 10 statistical evaluation (q2mm#286, landed on master).
The n=10 samples give a Student-t 95% CI tight enough to make
confident scientific verdicts where the earlier single-call PR
#283 could only flag the results as "within noise":

| System          | Mean Δ%  | CI₉₅   | Verdict        |
|-----------------|---------:|-------:|----------------|
| pd-allyl        | -0.029%  | ±0.34% | NOT SIGNIFICANT |
| rh-conjugate    | -0.080%  | ±1.18% | NOT SIGNIFICANT |
| heck-relay*     | -0.59%   | ±3.26% | NOT SIGNIFICANT |

(*) heck-relay run with --ratio-tol none; even with the gate
bypassed JaxLoss broke down (2 non-finite line-search values).

Each per-system page is rewritten:

- The earlier "within noise floor, cannot claim" caveat is replaced
  with a "Confirmed: published Wahlers/Rosales FF sits at a JaxLoss
  local minimum" success callout — the CI excludes any improvement
  larger than the per-system noise floor, so this is now a
  defensible "no real improvement available", not "we can't tell".
- Metric tables updated to show mean ± CI₉₅ %, not single-call
  values.
- The 4602-ratio non-determinism caveat on rh-conjugate is removed
  (with n=10 the ratio is stable at 1.01) and #278 stays closed.
- heck-relay's "keep default ratio_tol=0.15" recommendation is
  strengthened: with statistical rigor in place, --ratio-tol none
  demonstrably doesn't unlock useful optimization.

Companion data PR with the regenerated JSON + FFs:
ericchansen/q2mm-data#9.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@ericchansen ericchansen merged commit 82d8819 into main May 28, 2026
1 check passed
@ericchansen ericchansen deleted the data/metals-noise-honest-redo branch May 28, 2026 03:09
ericchansen added a commit to ericchansen/q2mm that referenced this pull request May 28, 2026
…e/heck-relay (n=10)

Reruns the three "within noise" published-FF systems with the
--n-evals 10 statistical evaluation (q2mm#286, landed on master).
The n=10 samples give a Student-t 95% CI tight enough to make
confident scientific verdicts where the earlier single-call PR
#283 could only flag the results as "within noise":

| System          | Mean Δ%  | CI₉₅   | Verdict        |
|-----------------|---------:|-------:|----------------|
| pd-allyl        | -0.029%  | ±0.34% | NOT SIGNIFICANT |
| rh-conjugate    | -0.080%  | ±1.18% | NOT SIGNIFICANT |
| heck-relay*     | -0.59%   | ±3.26% | NOT SIGNIFICANT |

(*) heck-relay run with --ratio-tol none; even with the gate
bypassed JaxLoss broke down (2 non-finite line-search values).

Each per-system page is rewritten:

- The earlier "within noise floor, cannot claim" caveat is replaced
  with a "Confirmed: published Wahlers/Rosales FF sits at a JaxLoss
  local minimum" success callout — the CI excludes any improvement
  larger than the per-system noise floor, so this is now a
  defensible "no real improvement available", not "we can't tell".
- Metric tables updated to show mean ± CI₉₅ %, not single-call
  values.
- The 4602-ratio non-determinism caveat on rh-conjugate is removed
  (with n=10 the ratio is stable at 1.01) and #278 stays closed.
- heck-relay's "keep default ratio_tol=0.15" recommendation is
  strengthened: with statistical rigor in place, --ratio-tol none
  demonstrably doesn't unlock useful optimization.

Companion data PR with the regenerated JSON + FFs:
ericchansen/q2mm-data#9.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ericchansen added a commit that referenced this pull request May 28, 2026
…nt fix

Companion to q2mm fix branch fix/mm3-non-smooth-gradient (commit
78e72fa, PR #TBD).  Re-runs the convergence pipeline with --n-evals
10 against q2mm patched for the angle-term gradient correctness bug
documented in q2mm#284.

Results — two previously "no improvement" verdicts now SIGNIFICANT:

| System          | Pre-fix Δ%       | Post-fix Δ%      | Verdict       |
|-----------------|------------------|------------------|---------------|
| ch3f            | 99.83 % (det.)   | 99.83 % (det.)   | unchanged ✅  |
| rh-enamide      | 44.66 % ± 0.29 % | 44.73 % ± 0.29 % | unchanged ✅  |
| pd-allyl        | -0.029 % ± 0.34% | -0.01 % ± 0.40 % | still NS ❌   |
| rh-conjugate    | -0.080 % ± 1.18% | 18.00 % ± 4.17 % | NEWLY ✅      |
| heck-relay*     | -0.59 % ± 3.26 % | 52.82 % ± 1.54 % | NEWLY ✅      |

(*) heck-relay run with --ratio-tol none; with the fix the ratio
actually drops from 1.378 → 1.085, so the gate would now pass at
default tolerance.  Bypass retained here for direct comparison
against the pre-fix #9 baseline.

What this PR contains

Per-system, the convergence/ directory now has:
- <system>_optimized.fld — optimized force field
- validation_results.json — n=10 mean+CI numbers, full provenance
- paper_metrics.json — paper-comparable Seminario vs. optimized stats

Provenance (every JSON):
- q2mm git_sha: 78e72fa (the fix branch's HEAD)
- q2mm-data git_sha: a3cc8d7 (main, post-#8)
- n_evals: 10
- ratio_tol: 0.15 (default) for 4 systems; null for heck-relay

pd-allyl's pd-allyl_optimized.fld is bit-identical to the previous
version — the surrogate-guided step still worsened the real OF
slightly (within noise), so ScipyOptimizer reverted to initial
params.  Even the fix doesn't unlock pd-allyl: its FF really does
sit at a JaxLoss local minimum, distinct from the rh-conjugate /
heck-relay cases where the clip-arccos bug was preventing the
optimizer from finding real descent directions.

Wall time on RTX 5090:
- ch3f:        ~3 s (deterministic, n=5)
- rh-enamide:  ~26 min (opt + n=5 post-eval)
- pd-allyl:    ~50 min (opt + n=10 post-eval)
- rh-conjugate: ~36 min (opt + n=10 post-eval)
- heck-relay:  ~98 min (opt + n=10 post-eval)
- Total:       ~3.5 hr GPU

The audit-orphans CI workflow (q2mm-data#8) is expected to pass
since every directory modified is already referenced in
q2mm/docs/systems/*.md.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant