feat(systems)!: default starting_point to qfuerza#290
Open
ericchansen wants to merge 7 commits into
Open
Conversation
Adds a starting_point parameter to load_system that, when set to
"qfuerza", overwrites the published OPT bond/angle scalars with
QFUERZA (Farrugia 2025) Hessian-derived values while leaving the
published OPT topology, frozen MM3 backbone, vdW, stretch-bend,
Urey-Bradley, and torsion rows untouched. Torsions are zeroed by
qfuerza_into per Farrugia 2025; for systems where published torsions
are already zero (e.g. rh-enamide Donoghue 2008), this is a no-op.
The default "published" path is unchanged — existing baselines and
behavior are preserved.
A per-param-type audit is attached to SystemData.metadata under
starting_point_audit, classifying every scalar as qfuerza_overwritten,
retained_published, or frozen. This makes leakage (vdW, unmatched
bond/angle rows, etc.) honest and visible rather than hidden.
The qfuerza_fresh strategy (CH3F) treats "qfuerza" as a no-op since
its starting FF is already QFUERZA-derived.
scripts/regenerate_convergence_results.py grows a --starting-point
{published,qfuerza} flag; output subdirectory becomes from-qfuerza/
instead of convergence/ when qfuerza is selected, preserving baseline
artifacts.
Anchors: Farrugia 2025 (DOI 10.1021/acs.jctc.5c01751), AGENTS.md §6.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Documents results of starting q2mm optimization from QFUERZA Hessian-derived bond/angle values (instead of published OPT values) on all 5 TS systems. Headline result: mixed. rh-enamide and pd-allyl converge to the same basin as published-start runs. pd-conjugate, rh-conjugate, and heck-relay land in different (worse) basins, with heck-relay failing entirely due to JaxLoss surrogate divergence at the poor starting FF. The page is explicit that this is NOT a from-scratch FF generation: QFUERZA only overwrites bond/angle scalars on top of the published OPT topology, frozen/active partition, vdW, SB, and atom-type rows. Per-row audit numbers are reported per system. Data lives in ericchansen/q2mm-data/benchmarks/<system>/from-qfuerza/. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Adds a new starting_point="qfuerza" option to the benchmark system loader so optimizations can start from QFUERZA Hessian-derived bond/angle values (while preserving published topology/frozen partition), and wires this through the convergence regeneration script, tests, and documentation.
Changes:
- Extend
load_system()with astarting_pointparameter plus a per-parameter “starting point audit” recorded in system metadata. - Add
--starting-point {published,qfuerza}toscripts/regenerate_convergence_results.py, including provenance/output subdirectory routing. - Add regression tests and a new benchmark documentation page for QFUERZA-recovery runs.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
q2mm/diagnostics/systems.py |
Adds starting_point support, QFUERZA overwrite hook, and audit metadata for scalar provenance. |
scripts/regenerate_convergence_results.py |
Adds CLI flag, provenance fields, logging, and output subdir selection for starting point runs. |
test/test_systems.py |
Adds TestStartingPoint coverage for backward-compatibility, overwrite behavior, and audit consistency. |
docs/benchmarks/qfuerza-recovery.md |
Documents methodology and results for QFUERZA-recovery validation across TS systems. |
properdocs.yml |
Adds navigation entry for the new benchmark page. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
PR #290 first iteration shipped 5 systems with a silent-failure optimizer protocol: L-BFGS-B exited at n_iterations<=2 with negligible OF change for every system, scipy reported success=True, and the misleading results made it into the comparison doc. This commit adds the source-code, CLI, agent-skill and documentation guardrails that prevent that pattern from recurring. Source code - q2mm/models/forcefield.py: add ForceField.get_fractional_bounds (fc_fraction, eq_fraction) returning a (val +/- frac*|val|) box per parameter, intersected with DEFAULT_BOUNDS sanity envelope. Sign-aware for TSFF negative force constants; falls back to sanity bounds when |val| < 1e-6 (frozen-at-zero torsions). - q2mm/optimizers/scipy_opt.py: ScipyOptimizer accepts fc_fraction and eq_fraction; uses them for the bounds list when set. Adds a WARNING in _run_minimize when n_iterations<=2 and |delta|/init<1% — the silent-failure fingerprint. - scripts/regenerate_convergence_results.py: add --ftol, --fc-fraction, --eq-fraction CLI flags (defaults preserve backward-compatible behavior for the existing convergence/ baselines). Emit batch-level ERROR + non-zero exit when every optimized system fails the no-progress check. Tests - test/test_models.py: 5 new tests covering fractional bounds for positive/negative FCs, sanity-envelope intersection, zero-value fallback, and the None-passthrough. Agent skills - .copilot/skills/q2mm-benchmark/SKILL.md: forces the agent through the audit gate (run FIRST system, inspect n_iterations & improvement_pct, STOP if either fails) before launching any q2mm batch >30 min. - .copilot/skills/q2mm-analysis-design/SKILL.md: forces the agent to restate the user's question and mock comparison tables BEFORE writing a benchmark analysis doc. Documentation - AGENTS.md: new section 11 "Benchmark Pre-Flight Checklist". Three new Common Pitfalls rows (n_iterations<=2 silent exit, default sanity bounds for from-poor-start runs, batch never optimized). Recommended invocation for from-QFUERZA runs: python scripts/regenerate_convergence_results.py \ --starting-point qfuerza --ratio-tol none \ --ftol 1e-12 --fc-fraction 0.20 --eq-fraction 0.05 Heck-relay specifically: tighter bounds (--fc-fraction 0.05 per AGENTS.md memory) due to the fragile TS landscape with large negative force constants. Refs: #290 (iterating) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replaces the headline-objective-only results table with the analysis the user actually asked for: published-paper vs q2mm@published vs q2mm@QFUERZA R²/RMSD per system, plus per-parameter motif-grouped comparison tables and a physical-chemistry walkthrough. Re-ran all 5 systems with the Phase 0 protocol fix (ftol=1e-12, fractional bounds: fc_fraction=0.20, eq_fraction=0.05, heck-relay overridden to fc_fraction=0.05). Real OF improvement now achieved on 4/5 systems (was 1/5 with the previous loose-factr protocol). Key new findings documented: - pd-allyl/pd-conjugate: q2mm objective at QFUERZA-optimized is LOWER than at published-optimized. R²/RMSD tables explain why: q2mm's MM3 implementation lacks MacroModel/MM3* cross-terms, so the FF that fits q2mm best is NOT the one that fit MacroModel best. Backend parity is the limiting factor for parameter-recovery validation. - rh-conjugate & heck-relay: optimizer produces NEGATIVE force constants. Optimizer bounds enforce magnitude but not sign — a ±20% bound around a near-zero positive fc can cross sign boundary. Documented as a known issue requiring a positive-fc constraint. - heck-relay: JaxLoss surrogate explodes at QFUERZA start due to bond R² = −6228; L-BFGS-B exits in 0 iterations. Documented as the same JaxLoss-fragility pattern from AGENTS.md §9. New analysis scripts: - scripts/compare_opt_rows.py — loader-bypassing per-param diff that auto-roundtrips published FF through ForceField.to_mm3_fld() so atom tokens and row ordering match the optimizer-saved file - scripts/build_qfuerza_recovery_tables.py — cross-system R²/RMSD rollup builder for the recovery doc - scripts/compare_qfuerza_to_published.py — earlier loader-based diff, superseded by compare_opt_rows.py but kept for the FF-object motif breakdown path Drive-by: fix Wahlers 2021 DOI in validation/published_ffs/README.md (was pointing to 10.1021/acs.joc.0c02918; correct is .1c00136). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Five fixes from copilot-pull-request-reviewer on the QFUERZA-recovery PR plus the test-core (3.10) CI failure: 1. test/test_systems.py — mark `test_qfuerza_is_noop_for_qfuerza_fresh_strategy` with `@pytest.mark.jax` so the core suite (which excludes JAX) skips it rather than failing the unconditional `from JaxEngine` import. Fixes the test-core CI failure introduced when JaxEngine was added to the test in PR #290. 2. test/test_systems.py — use `np.testing.assert_array_equal` for the frozen-scalar bit-identical check; the previous `np.allclose(..., atol=0.0)` still applied the default `rtol=1e-05`, so the test would have admitted a relative drift of up to 0.001%. The intent is true bit-identity for backbone params; the new assertion enforces it. 3. q2mm/diagnostics/systems.py — `load_system` now raises `ValueError` when `starting_point` is anything other than `"published"` or `"qfuerza"`. The `StartingPoint = Literal[...]` annotation is a static hint only; without a runtime guard a typo like `"qferza"` silently went through the `published` branch and produced misleading results. 4. q2mm/diagnostics/systems.py — refactor `_audit_starting_point` to derive per-scalar type labels from `ForceField._PARAM_SLOTS` (via a new `_build_param_type_labels` helper) instead of hand-rolling the collection/attr enumeration. Any future change to the parameter vector layout is now reflected automatically; unknown attrs raise `KeyError` rather than silently producing wrong labels. The Urey-Bradley tail still has to be appended explicitly because `_ub_angles` is a property, not in `_PARAM_SLOTS`. 5. scripts/regenerate_convergence_results.py — reflow the docstring so the `` `--starting-point published` `` and `` `--starting-point qfuerza` `` inline-code spans no longer break across newlines. Plus a new `test_unknown_starting_point_raises` test for #3. Validation: - `ruff check q2mm/ test/ scripts/` — clean - `ruff format --check q2mm test scripts examples` — clean - `pytest test/test_systems.py -x -q -m "not (openmm or tinker or jax or jax_md or psi4)"` — 10 passed, 1 deselected (the jax-marked test) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Five additional findings from copilot-pull-request-reviewer on the guardrails commit e81cf00. 1. q2mm/models/forcefield.py — guard `get_fractional_bounds()` against degenerate `lo >= hi` returns when the current param value lies outside the `DEFAULT_BOUNDS` sanity envelope. The naive `max(sanity_lo, val − f·|val|), min(sanity_hi, val + f·|val|)` formula can flip `lo > hi` for vdw_epsilon/ub_k/bond_k values outside the envelope (SciPy would then reject the bounds). Fall back to sanity bounds when the intersection is empty so L-BFGS-B can pull the value back into a physical region. New test `test_fractional_bounds_value_outside_sanity_falls_back` covers the bond_k = 5000 case (sanity envelope ±3600). 2-4. .copilot/skills/q2mm-benchmark/SKILL.md — the skill referenced CLI flags `--bounds-fraction-fc/--bounds-fraction-eq/--factr`, but the script added in this PR exposes `--fc-fraction/--eq-fraction/--ftol`. Following the skill would have failed with "unrecognized arguments" for any future agent. Updated all four occurrences (Step 2 prose, pre-flight checklist, Step 4 example command, quick-reference Q&A) to match the real CLI. 5. docs/benchmarks/qfuerza-recovery.md — the fractional-bounds formula was shown as `fc ∈ [fc₀·(1−f), fc₀·(1+f)]`, which is only correct for positive `fc₀`. The implementation is sign-aware (`val ± f·|val|`), which matters for TSFF negative force constants. Updated the doc to describe the actual sign-aware bound. Validation: - `ruff check q2mm/ test/ scripts/` — clean - `ruff format --check q2mm test scripts examples` — clean - `pytest test/test_models.py -k fractional_bounds` — 6/6 passed Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Flip the canonical default for `load_system()` and `regenerate_convergence_results.py --starting-point` from `"published"` to `"qfuerza"`, and rename the canonical output subdirectory accordingly (`convergence/` now holds QFUERZA-start runs; `from-published/` holds the opt-in publication-baseline runs). Reframe the QFUERZA-recovery doc, AGENTS.md §6 / §11, and the q2mm-benchmark / q2mm-analysis-design skills around the QFUERZA-as-defined story: QFUERZA fills in 100% of the bond/angle scalars it is defined to estimate, given a force-field skeleton (atom types, OPT topology, frozen/active partition, vdW/SB defaults) that the chemist provides via a `.fld` file. This is true for every QFUERZA workflow on every system — there is no `.fld`-free path because those decisions are chemistry calls a tool can't make. Document known limitations honestly: 3 of the 5 TS systems (rh-conjugate, pd-conjugate, heck-relay) don't converge cleanly from QFUERZA with default bounds yet; the doc lists the per-system workarounds (`--fc-fraction 0.05` or `--starting-point published`). Also: - Add a `--starting-point` flag to `q2mm-benchmark` CLI so the user CLI exposes the same lever as the regeneration script. - Pin `scripts/compare_opt_rows.py` to `starting_point="published"` (the script's whole purpose is to diff the publication baseline against an optimizer's output, so it must load literature values). - Pin `test/integration/test_heck_validation.py` to `starting_point="published"` (the test asserts bit-identical OPT values against the literature `.fld`). - Replace `test_default_starting_point_is_published` with `test_default_starting_point_is_qfuerza`; add a separate test that pins `starting_point="published"` and verifies zero QFUERZA overwrite. BREAKING CHANGE: `starting_point` now defaults to `"qfuerza"` on `load_system()` and `--starting-point qfuerza` on `regenerate_convergence_results.py`. Callers that depended on the publication-baseline path must pass `starting_point="published"` (loader) or `--starting-point published` (CLI). The output subdirectory naming also flips: canonical default now writes to `convergence/`; opt-in publication-baseline writes to `from-published/`. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment on lines
+252
to
+263
| import sys as _sys | ||
| import tempfile as _tempfile | ||
|
|
||
| _sys.path.insert(0, str(Path(__file__).resolve().parent.parent)) | ||
| from q2mm.diagnostics.systems import load_system | ||
|
|
||
| # ``compare_opt_rows.py`` exists to diff the published baseline FF | ||
| # against an optimizer's output, so we must load the literature OPT | ||
| # values verbatim — not the QFUERZA-derived default. | ||
| sd = load_system(args.system, starting_point="published") | ||
| published_path = Path(_tempfile.mkstemp(prefix=f"pub-{args.system}-", suffix=".fld")[1]) | ||
| sd.forcefield.to_mm3_fld(str(published_path)) |
Comment on lines
+106
to
+108
| def _category(row: Row) -> str: | ||
| return f"{row.kind}_eq" if row.kind == "bond" else f"{row.kind}_eq" | ||
|
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
TL;DR
Make QFUERZA the canonical default starting point for TS-system optimizations.
starting_point="qfuerza"is now the default onload_system();--starting-point qfuerzais the default onregenerate_convergence_results.py. Output subdir naming follows:convergence/for the canonical default,from-published/for the opt-in publication-baseline path.A
.fldskeleton (atom types, OPT-row topology, frozen/active partition, vdW, stretch-bend) must still be provided — those are chemistry decisions a tool cannot automate. QFUERZA fills in 100% of the bond/angle scalars it is defined to estimate, given that skeleton. For our 5 TS systems we use the literature.fldfiles (Donoghue 2008, Rosales 2020, Wahlers 2022) as the skeleton.What this PR changes
API / CLI
q2mm.diagnostics.systems.load_system(starting_point=…): default flipped from"published"to"qfuerza". Accepts"published"to retain the literature OPT values verbatim (publication-baseline reproduction).scripts/regenerate_convergence_results.py --starting-point: default flipped to"qfuerza". Output subdir isconvergence/for the canonical default,from-published/for--starting-point published.q2mm-benchmarkuser CLI: new--starting-pointflag for symmetry (default"qfuerza").Guardrails
ForceField.get_fractional_bounds(): builds sign-aware[v − f·|v|, v + f·|v|]bounds intersected withDEFAULT_BOUNDS. Documented degenerate-bounds guard.scripts/regenerate_convergence_results.py:--fc-fraction/--eq-fraction/--ftol/--ratio-tolflags. Batch-level ERROR + non-zero exit when no system shows real progress.q2mm/optimizers/scipy_opt.py::_run_minimize: WARNING whenn_iterations<=2and real-OF delta < 1%.Documentation
docs/benchmarks/qfuerza-recovery.md(454 lines) — QFUERZA-recovery validation. New framing: "QFUERZA workflow (canonical default)" with explicit skeleton-vs-scalars decomposition; per-system R²/RMSD vs paper, q2mm-from-published, q2mm-from-QFUERZA; per-parameter abs deviation tables; physical-chemistry walkthrough of top deviations; "Known limitations" section listing the 3/5 systems that diverge from QFUERZA default and the workarounds.AGENTS.md: new "Starting Point" subsection under §6; §11 pre-flight checklist updated to flip default framing..copilot/skills/q2mm-benchmark/SKILL.md+q2mm-analysis-design/SKILL.md: process skills that walk through the canonical workflow before launching any batch >30 min or writing any comparison doc.Tests
test/test_systems.py::TestStartingPoint(11 tests, all passing): asserts default is"qfuerza"; pinning"published"yields zero QFUERZA overwrite; QFUERZA touches OPT scalars only (frozen MM3 backbone bit-identical); reference data is independent ofstarting_point; audit classification is consistent with_PARAM_SLOTS.test/test_models.py::TestGetFractionalBounds: verifies sign-aware fractional-bound construction including the degenerate-bounds guard.test/integration/test_heck_validation.py: pinned tostarting_point="published"(the test asserts bit-identical literature OPT values).Results
Per-system QFUERZA-from-canonical-default convergence (full discussion in
docs/benchmarks/qfuerza-recovery.md):fc(unphysical, see §3.4)Known limitations are documented honestly in the new "Known limitations" section of
qfuerza-recovery.mdwith concrete workarounds (--fc-fraction 0.05or--starting-point published). Follow-up work tracked: positive-fcsign constraint in the optimizer, q2mm/MacroModel MM3 backend parity audit, heck-relay JaxLoss pre-conditioning.Breaking change
starting_pointnow defaults to"qfuerza"onload_system()and--starting-point qfuerzaonregenerate_convergence_results.py. Callers that depended on the publication-baseline path must passstarting_point="published"(loader) or--starting-point published(CLI). The output subdirectory naming also flips: canonical default now writes toconvergence/; opt-in publication-baseline writes tofrom-published/.Related
from-qfuerza/→convergence/andconvergence/→from-published/directory renames, with updated audit-orphans workflow paths and CONVERGENCE_README.