data: rerun pd-allyl/rh-conjugate/heck-relay with n=10 (confirms no real improvement) by ericchansen · Pull Request #9 · ericchansen/q2mm-data

ericchansen · 2026-05-28T00:32:12Z

Summary

Reruns pd-allyl, rh-conjugate, and heck-relay with the new --n-evals 10 flag (q2mm#286, now on master) so the "within noise" caveats from #6 become statistically defensible verdicts.

Companion docs PR: ericchansen/q2mm#TBD (paired link added once opened)

Verdicts (now decisive)

System	Mean Δ%	CI₉₅	Verdict
pd-allyl	−0.029 %	±0.34 %	NOT SIGNIFICANT
rh-conjugate	−0.080 %	±1.18 %	NOT SIGNIFICANT
heck-relay*	−0.59 %	±3.26 %	NOT SIGNIFICANT

* heck-relay run with --ratio-tol none (ratio = 1.378, formally fails default gate); even with the gate bypassed, the JaxLoss surrogate broke down (2 non-finite line-search values) and the result is inside the noise band.
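
For readers unfamiliar with the gate, here is a minimal sketch of how a ratio tolerance check like this could work. This PR does not spell out the gate's definition, so the rule below (accept when the ratio is within `ratio_tol` of 1, with `None` disabling the check) is an assumption, and `passes_ratio_gate` is a hypothetical name rather than a q2mm function.

```python
def passes_ratio_gate(ratio, ratio_tol=0.15):
    """Assumed gate: accept when ratio is within ratio_tol of 1.0.

    ratio_tol=None models `--ratio-tol none`, i.e. the gate is bypassed.
    This is an illustrative guess, not q2mm's actual implementation.
    """
    if ratio_tol is None:
        return True
    return abs(ratio - 1.0) <= ratio_tol


print(passes_ratio_gate(1.378))        # heck-relay at default tolerance -> False
print(passes_ratio_gate(1.378, None))  # heck-relay with the gate bypassed -> True
```

Under that assumed rule, heck-relay's ratio of 1.378 misses the default tolerance of 0.15 by a wide margin, which matches the "formally fails default gate" note above.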

What this PR contains

Regenerated artifacts (5 files):

benchmarks/pd-allyl-amination/convergence/{validation_results,paper_metrics}.json
benchmarks/rh-1,4-conjugate-addition/convergence/{validation_results,paper_metrics,rh-conjugate_optimized.fld}
benchmarks/heck-relay/convergence/{validation_results,paper_metrics,heck-relay_optimized.fld}

(pd-allyl's pd-allyl_optimized.fld is bit-identical to the previous version — both this run and #6 reverted to initial parameters because the surrogate step worsened OF, so the saved FF is the same.)
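
A bit-identical claim like this can be checked by comparing file digests. The sketch below is one way to do it, not the procedure used for this PR; the helper names and the temporary demo files are hypothetical stand-ins for the old and new `.fld` files.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def bit_identical(path_a, path_b):
    """True when both files have exactly the same byte content."""
    return sha256_of(path_a) == sha256_of(path_b)


# Demo with throwaway files (placeholders, not real force-field content).
with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp) / "old.fld"
    new = Path(tmp) / "new.fld"
    old.write_bytes(b"placeholder force field\n")
    new.write_bytes(b"placeholder force field\n")
    print(bit_identical(old, new))  # True
```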

Why these verdicts matter

The earlier #6 results reported "within noise" for these three systems and couldn't say whether there was a real signal hiding under the per-call GPU noise. With n=10 + Student-t 95% CI we can now confidently say there isn't:

pd-allyl: CI excludes any improvement larger than ~0.3 %
rh-conjugate: CI excludes any improvement larger than ~1.2 %
heck-relay: CI excludes any improvement larger than ~3.3 %

All well below any publishable improvement claim. The published Wahlers/Rosales FFs sit at JaxLoss local minima for our engine; further improvement requires the engine-parity work in q2mm#284, not optimizer tweaking.
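
As a rough illustration of how n=10 samples turn into a mean, a Student-t CI₉₅, and a verdict, here is a minimal sketch. It assumes Δ% is positive for an improvement and that a result counts as significant only when the whole interval lies above zero; neither convention is stated explicitly in this PR. The function names and the sample values are hypothetical and are not data from these runs.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 9 degrees of freedom (n = 10).
T_CRIT_95_DF9 = 2.262


def mean_ci95(samples):
    """Return (mean, half_width) of a Student-t 95% CI for exactly 10 samples."""
    if len(samples) != 10:
        raise ValueError("critical value is hard-coded for n = 10")
    mean = statistics.fmean(samples)
    # Standard error of the mean, using the sample (n - 1) standard deviation.
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, T_CRIT_95_DF9 * sem


def verdict(mean, half_width):
    """Assumed rule: significant only if the CI lower bound is above zero."""
    return "SIGNIFICANT" if mean - half_width > 0 else "NOT SIGNIFICANT"


# Hypothetical per-evaluation improvements in percent (illustrative only).
deltas = [0.2, -0.4, 0.1, -0.3, 0.5, -0.6, 0.0, 0.3, -0.2, 0.1]
mean, half = mean_ci95(deltas)
print(f"{mean:+.3f} % ± {half:.2f} % -> {verdict(mean, half)}")
```

Applying the same rule to the table above, `verdict(-0.029, 0.34)` gives NOT SIGNIFICANT because the interval straddles zero.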

Provenance (per JSON)

q2mm git_sha: 86d8483 (master, post #286)
q2mm-data git_sha: a3cc8d7 (main, post #8, "ci: run audit-orphans.sh on every PR + weekly cron")
n_evals: 10
ratio_tol: 0.15 (default) for pd-allyl/rh-conjugate; null for heck-relay
jax_devices: ["cuda:0"]

Wall time

pd-allyl: ~21 min opt + ~16 min post-eval
rh-conjugate: ~10 min opt + ~13 min post-eval
heck-relay: ~24 min opt + ~38 min post-eval
Total: ~2.0 hr on RTX 5090

Audit-orphans CI

This PR is the first real test of the audit-orphans.yml workflow merged in #8 — every directory modified is already referenced in q2mm/docs/systems/*.md, so it should pass.

…aluation

Re-runs the three "within noise" published-FF systems from q2mm-data#6 with the new --n-evals 10 flag (q2mm#286, landed on master). The n=10 samples give a Student-t 95% CI on the improvement that's tight enough to make confident scientific verdicts:

| System       | Mean Δ% | CI₉₅   | Verdict         |
|--------------|--------:|-------:|-----------------|
| pd-allyl     | -0.029% | ±0.34% | NOT SIGNIFICANT |
| rh-conjugate | -0.080% | ±1.18% | NOT SIGNIFICANT |
| heck-relay\* | -0.59%  | ±3.26% | NOT SIGNIFICANT |

\* heck-relay run with --ratio-tol none (ratio=1.378, formally fails default gate); even with the gate bypassed, the JaxLoss surrogate broke down (2 non-finite line-search values) and the result is inside the noise band.

These are statistically defensible "no improvement" verdicts, not "within noise so we can't tell" verdicts. The CI₉₅ excludes any improvement larger than ~0.3 %, ~1.2 %, and ~3.3 % for pd-allyl, rh-conjugate, and heck-relay respectively, well below any publishable improvement claim.

Provenance:
- q2mm git_sha: 86d8483 (master, post #286)
- q2mm-data git_sha: a3cc8d7 (main, post #8)
- n_evals: 10
- ratio_tol: 0.15 (default) for pd-allyl/rh-conjugate; null for heck-relay

Wall time:
- pd-allyl: ~21 min opt + 16 min post-eval
- rh-conjugate: ~10 min opt + 13 min post-eval
- heck-relay: ~24 min opt + 38 min post-eval
- Total: ~2.0 hr GPU on RTX 5090

Companion docs update lives in ericchansen/q2mm docs/systems/{pd-allyl,rh-conjugate,heck-relay}.md.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…e/heck-relay (n=10)

Reruns the three "within noise" published-FF systems with the --n-evals 10 statistical evaluation (q2mm#286, landed on master). The n=10 samples give a Student-t 95% CI tight enough to make confident scientific verdicts where the earlier single-call PR #283 could only flag the results as "within noise":

| System       | Mean Δ% | CI₉₅   | Verdict         |
|--------------|--------:|-------:|-----------------|
| pd-allyl     | -0.029% | ±0.34% | NOT SIGNIFICANT |
| rh-conjugate | -0.080% | ±1.18% | NOT SIGNIFICANT |
| heck-relay*  | -0.59%  | ±3.26% | NOT SIGNIFICANT |

(*) heck-relay run with --ratio-tol none; even with the gate bypassed JaxLoss broke down (2 non-finite line-search values).

Each per-system page is rewritten:
- The earlier "within noise floor, cannot claim" caveat is replaced with a "Confirmed: published Wahlers/Rosales FF sits at a JaxLoss local minimum" success callout: the CI excludes any improvement larger than the per-system noise floor, so this is now a defensible "no real improvement available", not "we can't tell".
- Metric tables updated to show mean ± CI₉₅ %, not single-call values.
- The 4602-ratio non-determinism caveat on rh-conjugate is removed (with n=10 the ratio is stable at 1.01) and #278 stays closed.
- heck-relay's "keep default ratio_tol=0.15" recommendation is strengthened: with statistical rigor in place, --ratio-tol none demonstrably doesn't unlock useful optimization.

Companion data PR with the regenerated JSON + FFs: ericchansen/q2mm-data#9.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…nt fix

Companion to q2mm fix branch fix/mm3-non-smooth-gradient (commit 78e72fa, PR #TBD). Re-runs the convergence pipeline with --n-evals 10 against q2mm patched for the angle-term gradient correctness bug documented in q2mm#284.

Results: two previously "no improvement" verdicts now SIGNIFICANT:

| System       | Pre-fix Δ%        | Post-fix Δ%      | Verdict      |
|--------------|-------------------|------------------|--------------|
| ch3f         | 99.83 % (det.)    | 99.83 % (det.)   | unchanged ✅ |
| rh-enamide   | 44.66 % ± 0.29 %  | 44.73 % ± 0.29 % | unchanged ✅ |
| pd-allyl     | -0.029 % ± 0.34 % | -0.01 % ± 0.40 % | still NS ❌  |
| rh-conjugate | -0.080 % ± 1.18 % | 18.00 % ± 4.17 % | NEWLY ✅     |
| heck-relay*  | -0.59 % ± 3.26 %  | 52.82 % ± 1.54 % | NEWLY ✅     |

(*) heck-relay run with --ratio-tol none; with the fix the ratio actually drops from 1.378 → 1.085, so the gate would now pass at default tolerance. Bypass retained here for direct comparison against the pre-fix #9 baseline.

What this PR contains

Per-system, the convergence/ directory now has:
- <system>_optimized.fld: optimized force field
- validation_results.json: n=10 mean+CI numbers, full provenance
- paper_metrics.json: paper-comparable Seminario vs. optimized stats

Provenance (every JSON):
- q2mm git_sha: 78e72fa (the fix branch's HEAD)
- q2mm-data git_sha: a3cc8d7 (main, post-#8)
- n_evals: 10
- ratio_tol: 0.15 (default) for 4 systems; null for heck-relay

pd-allyl's pd-allyl_optimized.fld is bit-identical to the previous version: the surrogate-guided step still worsened the real OF slightly (within noise), so ScipyOptimizer reverted to initial params. Even the fix doesn't unlock pd-allyl: its FF really does sit at a JaxLoss local minimum, distinct from the rh-conjugate / heck-relay cases where the clip-arccos bug was preventing the optimizer from finding real descent directions.

Wall time on RTX 5090:
- ch3f: ~3 s (deterministic, n=5)
- rh-enamide: ~26 min (opt + n=5 post-eval)
- pd-allyl: ~50 min (opt + n=10 post-eval)
- rh-conjugate: ~36 min (opt + n=10 post-eval)
- heck-relay: ~98 min (opt + n=10 post-eval)
- Total: ~3.5 hr GPU

The audit-orphans CI workflow (q2mm-data#8) is expected to pass since every directory modified is already referenced in q2mm/docs/systems/*.md.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

ericchansen mentioned this pull request May 28, 2026

docs: confirm 'no real improvement' verdicts for metal-TS systems (n=10) ericchansen/q2mm#287

Merged

ericchansen merged commit 82d8819 into main May 28, 2026
1 check passed

ericchansen deleted the data/metals-noise-honest-redo branch May 28, 2026 03:09

ericchansen mentioned this pull request May 28, 2026

data: regenerate all 5 systems with MM3 angle gradient fix (2 newly significant) #10

Merged

ericchansen merged 1 commit into main from data/metals-noise-honest-redo
