data: rerun pd-allyl/rh-conjugate/heck-relay with n=10 (confirms no real improvement)#9
Merged
Merged
Conversation
…aluation Re-runs the three "within noise" published-FF systems from q2mm-data#6 with the new --n-evals 10 flag (q2mm#286, landed on master). The n=10 samples give a Student-t 95% CI on the improvement that's tight enough to make confident scientific verdicts: | System | Mean Δ% | CI₉₅ | Verdict | |-----------------|---------:|-------:|----------------| | pd-allyl | -0.029% | ±0.34% | NOT SIGNIFICANT | | rh-conjugate | -0.080% | ±1.18% | NOT SIGNIFICANT | | heck-relay\* | -0.59% | ±3.26% | NOT SIGNIFICANT | \* heck-relay run with --ratio-tol none (ratio=1.378, formally fails default gate); even with the gate bypassed, the JaxLoss surrogate broke down (2 non-finite line-search values) and the result is inside the noise band. These are statistically defensible "no improvement" verdicts, not "within noise so we can't tell" verdicts. The CI₉₅ excludes any improvement larger than ~0.3 %, ~1.2 %, and ~3.3 % for pd-allyl, rh-conjugate, and heck-relay respectively — well below any publishable improvement claim. Provenance: - q2mm git_sha: 86d8483 (master, post #286) - q2mm-data git_sha: a3cc8d7 (main, post #8) - n_evals: 10 - ratio_tol: 0.15 (default) for pd-allyl/rh-conjugate; null for heck-relay Wall time: - pd-allyl: ~21 min opt + 16 min post-eval - rh-conjugate: ~10 min opt + 13 min post-eval - heck-relay: ~24 min opt + 38 min post-eval - Total: ~2.0 hr GPU on RTX 5090 Companion docs update lives in ericchansen/q2mm docs/systems/{pd-allyl,rh-conjugate,heck-relay}.md. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ericchansen
added a commit
to ericchansen/q2mm
that referenced
this pull request
May 28, 2026
…e/heck-relay (n=10) Reruns the three "within noise" published-FF systems with the --n-evals 10 statistical evaluation (q2mm#286, landed on master). The n=10 samples give a Student-t 95% CI tight enough to make confident scientific verdicts where the earlier single-call PR #283 could only flag the results as "within noise": | System | Mean Δ% | CI₉₅ | Verdict | |-----------------|---------:|-------:|----------------| | pd-allyl | -0.029% | ±0.34% | NOT SIGNIFICANT | | rh-conjugate | -0.080% | ±1.18% | NOT SIGNIFICANT | | heck-relay* | -0.59% | ±3.26% | NOT SIGNIFICANT | (*) heck-relay run with --ratio-tol none; even with the gate bypassed JaxLoss broke down (2 non-finite line-search values). Each per-system page is rewritten: - The earlier "within noise floor, cannot claim" caveat is replaced with a "Confirmed: published Wahlers/Rosales FF sits at a JaxLoss local minimum" success callout — the CI excludes any improvement larger than the per-system noise floor, so this is now a defensible "no real improvement available", not "we can't tell". - Metric tables updated to show mean ± CI₉₅ %, not single-call values. - The 4602-ratio non-determinism caveat on rh-conjugate is removed (with n=10 the ratio is stable at 1.01) and #278 stays closed. - heck-relay's "keep default ratio_tol=0.15" recommendation is strengthened: with statistical rigor in place, --ratio-tol none demonstrably doesn't unlock useful optimization. Companion data PR with the regenerated JSON + FFs: ericchansen/q2mm-data#9. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ericchansen
added a commit
to ericchansen/q2mm
that referenced
this pull request
May 28, 2026
…e/heck-relay (n=10) Reruns the three "within noise" published-FF systems with the --n-evals 10 statistical evaluation (q2mm#286, landed on master). The n=10 samples give a Student-t 95% CI tight enough to make confident scientific verdicts where the earlier single-call PR #283 could only flag the results as "within noise": | System | Mean Δ% | CI₉₅ | Verdict | |-----------------|---------:|-------:|----------------| | pd-allyl | -0.029% | ±0.34% | NOT SIGNIFICANT | | rh-conjugate | -0.080% | ±1.18% | NOT SIGNIFICANT | | heck-relay* | -0.59% | ±3.26% | NOT SIGNIFICANT | (*) heck-relay run with --ratio-tol none; even with the gate bypassed JaxLoss broke down (2 non-finite line-search values). Each per-system page is rewritten: - The earlier "within noise floor, cannot claim" caveat is replaced with a "Confirmed: published Wahlers/Rosales FF sits at a JaxLoss local minimum" success callout — the CI excludes any improvement larger than the per-system noise floor, so this is now a defensible "no real improvement available", not "we can't tell". - Metric tables updated to show mean ± CI₉₅ %, not single-call values. - The 4602-ratio non-determinism caveat on rh-conjugate is removed (with n=10 the ratio is stable at 1.01) and #278 stays closed. - heck-relay's "keep default ratio_tol=0.15" recommendation is strengthened: with statistical rigor in place, --ratio-tol none demonstrably doesn't unlock useful optimization. Companion data PR with the regenerated JSON + FFs: ericchansen/q2mm-data#9. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ericchansen
added a commit
to ericchansen/q2mm
that referenced
this pull request
May 28, 2026
…e/heck-relay (n=10) Reruns the three "within noise" published-FF systems with the --n-evals 10 statistical evaluation (q2mm#286, landed on master). The n=10 samples give a Student-t 95% CI tight enough to make confident scientific verdicts where the earlier single-call PR #283 could only flag the results as "within noise": | System | Mean Δ% | CI₉₅ | Verdict | |-----------------|---------:|-------:|----------------| | pd-allyl | -0.029% | ±0.34% | NOT SIGNIFICANT | | rh-conjugate | -0.080% | ±1.18% | NOT SIGNIFICANT | | heck-relay* | -0.59% | ±3.26% | NOT SIGNIFICANT | (*) heck-relay run with --ratio-tol none; even with the gate bypassed JaxLoss broke down (2 non-finite line-search values). Each per-system page is rewritten: - The earlier "within noise floor, cannot claim" caveat is replaced with a "Confirmed: published Wahlers/Rosales FF sits at a JaxLoss local minimum" success callout — the CI excludes any improvement larger than the per-system noise floor, so this is now a defensible "no real improvement available", not "we can't tell". - Metric tables updated to show mean ± CI₉₅ %, not single-call values. - The 4602-ratio non-determinism caveat on rh-conjugate is removed (with n=10 the ratio is stable at 1.01) and #278 stays closed. - heck-relay's "keep default ratio_tol=0.15" recommendation is strengthened: with statistical rigor in place, --ratio-tol none demonstrably doesn't unlock useful optimization. Companion data PR with the regenerated JSON + FFs: ericchansen/q2mm-data#9. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ericchansen
added a commit
that referenced
this pull request
May 28, 2026
…nt fix Companion to q2mm fix branch fix/mm3-non-smooth-gradient (commit 78e72fa, PR #TBD). Re-runs the convergence pipeline with --n-evals 10 against q2mm patched for the angle-term gradient correctness bug documented in q2mm#284. Results — two previously "no improvement" verdicts now SIGNIFICANT: | System | Pre-fix Δ% | Post-fix Δ% | Verdict | |-----------------|------------------|------------------|---------------| | ch3f | 99.83 % (det.) | 99.83 % (det.) | unchanged ✅ | | rh-enamide | 44.66 % ± 0.29 % | 44.73 % ± 0.29 % | unchanged ✅ | | pd-allyl | -0.029 % ± 0.34% | -0.01 % ± 0.40 % | still NS ❌ | | rh-conjugate | -0.080 % ± 1.18% | 18.00 % ± 4.17 % | NEWLY ✅ | | heck-relay* | -0.59 % ± 3.26 % | 52.82 % ± 1.54 % | NEWLY ✅ | (*) heck-relay run with --ratio-tol none; with the fix the ratio actually drops from 1.378 → 1.085, so the gate would now pass at default tolerance. Bypass retained here for direct comparison against the pre-fix #9 baseline. What this PR contains Per-system, the convergence/ directory now has: - <system>_optimized.fld — optimized force field - validation_results.json — n=10 mean+CI numbers, full provenance - paper_metrics.json — paper-comparable Seminario vs. optimized stats Provenance (every JSON): - q2mm git_sha: 78e72fa (the fix branch's HEAD) - q2mm-data git_sha: a3cc8d7 (main, post-#8) - n_evals: 10 - ratio_tol: 0.15 (default) for 4 systems; null for heck-relay pd-allyl's pd-allyl_optimized.fld is bit-identical to the previous version — the surrogate-guided step still worsened the real OF slightly (within noise), so ScipyOptimizer reverted to initial params. Even the fix doesn't unlock pd-allyl: its FF really does sit at a JaxLoss local minimum, distinct from the rh-conjugate / heck-relay cases where the clip-arccos bug was preventing the optimizer from finding real descent directions. Wall time on RTX 5090: - ch3f: ~3 s (deterministic, n=5) - rh-enamide: ~26 min (opt + n=5 post-eval) - pd-allyl: ~50 min (opt + n=10 post-eval) - rh-conjugate: ~36 min (opt + n=10 post-eval) - heck-relay: ~98 min (opt + n=10 post-eval) - Total: ~3.5 hr GPU The audit-orphans CI workflow (q2mm-data#8) is expected to pass since every directory modified is already referenced in q2mm/docs/systems/*.md. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Reruns pd-allyl, rh-conjugate, and heck-relay with the new
--n-evals 10flag (q2mm#286, now on master) so the "within noise" caveats from #6 become statistically defensible verdicts.Companion docs PR: ericchansen/q2mm#TBD (paired link added once opened)
Verdicts (now decisive)
* heck-relay run with
--ratio-tol none(ratio = 1.378, formally fails default gate); even with the gate bypassed, the JaxLoss surrogate broke down (2 non-finite line-search values) and the result is inside the noise band.What this PR contains
Regenerated artifacts (5 files):
benchmarks/pd-allyl-amination/convergence/{validation_results,paper_metrics}.jsonbenchmarks/rh-1,4-conjugate-addition/convergence/{validation_results,paper_metrics,rh-conjugate_optimized.fld}benchmarks/heck-relay/convergence/{validation_results,paper_metrics,heck-relay_optimized.fld}(pd-allyl's
pd-allyl_optimized.fldis bit-identical to the previous version — both this run and #6 reverted to initial parameters because the surrogate step worsened OF, so the saved FF is the same.)Why these verdicts matter
The earlier #6 results reported "within noise" for these three systems and couldn't say whether there was a real signal hiding under the per-call GPU noise. With n=10 + Student-t 95% CI we can now confidently say there isn't:
All well below any publishable improvement claim. The published Wahlers/Rosales FFs sit at JaxLoss local minima for our engine; further improvement requires the engine-parity work in q2mm#284, not optimizer tweaking.
Provenance (per JSON)
86d8483(master, post #286)a3cc8d7(main, post ci: run audit-orphans.sh on every PR + weekly cron #8)nullfor heck-relay["cuda:0"]Wall time
Audit-orphans CI
This PR is the first real test of the
audit-orphans.ymlworkflow merged in #8 — every directory modified is already referenced inq2mm/docs/systems/*.md, so it should pass.