
Add --force_math_sdp flag and test_numerical_stability command#198

Open
davidoj wants to merge 5 commits into main from safe-numerics-diagnostic

Conversation


davidoj commented Mar 19, 2026

Summary

  • Adds --force_math_sdp flag to IndexConfig, available on build, score, trackstar, and other commands. Disables flash and memory-efficient SDPA backends to ensure gradient consistency across different padding lengths.
  • Adds a bergson test_numerical_stability CLI command that automatically tests escalating configurations (default → --force_math_sdp → --precision fp32 → both) and reports the minimum flags needed for a given model.
    • How it works: for each trial it
      1. Picks two random documents from the dataset
      2. Runs the shorter one alone → collects its gradient
      3. Runs the shorter one batched with the longer one (which adds padding) → collects the short doc's gradient again
      4. Computes cosine similarity between the two gradients
    • Then reports mean/std/min/max across all trials
  • Adds tests verifying the flag persists through collect_gradients and that gradients are consistent when using math-only SDPA.
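The trial procedure above boils down to a cosine-similarity check between two gradient vectors, plus summary statistics across trials. A minimal self-contained sketch (pure Python, illustrative names only; the actual bergson implementation may differ):

```python
import math

def cosine_similarity(a, b):
    # Plain cosine similarity between two flat gradient vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def run_trial(grad_alone, grad_batched):
    # One stability trial: the same document's gradient computed alone
    # vs. batched with a longer (padded) partner should be near-identical.
    return cosine_similarity(grad_alone, grad_batched)

def summarize(sims):
    # Mean/std/min/max across all trials, as reported by the CLI command.
    mean = sum(sims) / len(sims)
    var = sum((s - mean) ** 2 for s in sims) / len(sims)
    return {"mean": mean, "std": math.sqrt(var),
            "min": min(sims), "max": max(sims)}
```

A cosine similarity near 1.0 for every trial means padding had no effect on the gradient; the large negative minimums in the example report below the test plan indicate gradients that flipped direction entirely.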

Test plan

  • Unit tests for apply_force_math_sdp (enable/disable)
  • Integration test: force_math_sdp persists after collect_gradients, gradients consistent across batch composition
  • Full test suite passes (151 passed, 2 skipped)
  • pre-commit run --all-files passes
  • Manual test: bergson test_numerical_stability --model EleutherAI/pythia-14m

🤖 Generated with Claude Code

Example:

bergson test_numerical_stability --model EleutherAI/pythia-160m
[...]

============================================================
Report for EleutherAI/pythia-160m
============================================================
  FAIL  defaults (precision=bf16)  (min cos_sim=-0.422857)
  FAIL  --force_math_sdp (precision=bf16)  (min cos_sim=-0.059151)
  FAIL  --precision fp32  (min cos_sim=0.248888)
  PASS  --precision fp32 --force_math_sdp  (min cos_sim=0.999886)

RESULT: Gradients require non-default settings for consistency.
  Minimum required: --precision fp32 --force_math_sdp

  Add to your bergson commands:
    bergson build <run_path> --model EleutherAI/pythia-160m --force_math_sdp --precision fp32
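The escalation in the report above amounts to a cheapest-first search for the first passing configuration. A sketch with illustrative names and an assumed pass threshold (the real command's threshold may differ):

```python
# Configurations ordered cheapest-first; the first whose worst-case
# cosine similarity clears the threshold is the "minimum required".
# The threshold value is an assumption, not necessarily what bergson uses.
PASS_THRESHOLD = 0.999

CONFIGS = [
    "defaults (precision=bf16)",
    "--force_math_sdp (precision=bf16)",
    "--precision fp32",
    "--precision fp32 --force_math_sdp",
]

def minimum_required(min_cos_sims):
    """min_cos_sims[i] is the min cosine similarity observed for CONFIGS[i]."""
    for config, sim in zip(CONFIGS, min_cos_sims):
        if sim >= PASS_THRESHOLD:
            return config
    return None  # no configuration passed
```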

davidoj and others added 5 commits March 19, 2026 01:54
Adds a --force_math_sdp flag to IndexConfig that disables flash and
memory-efficient SDPA backends, forcing the math-only kernel. Some
models produce inconsistent per-example gradients across different
padding lengths when using optimized attention backends.

Also adds `bergson test_numerical_stability` which automatically tests
escalating configurations (default → math SDP → fp32 → both) and
reports the minimum flags needed for gradient consistency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Use device_map instead of .to(device) for model loading
- Assert row is dict for HF dataset iteration type narrowing
- Fix docstring line length

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents the --force_math_sdp flag and bergson test_numerical_stability
command with performance benchmarks on Pythia-160M, OLMo-2-1B, and
OLMo-2-7B showing overhead varies from +0.8% to +43.2% depending on
model and precision.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

davidoj commented Mar 19, 2026

Benchmarking the two defensive interventions

| Model | Settings | Build time | vs bf16 baseline |
|---|---|---|---|
| Pythia-160M | bf16 | 30.2s | (baseline) |
| Pythia-160M | bf16 + --force_math_sdp | 30.4s | +0.8% |
| Pythia-160M | fp32 | 35.6s | +17.9% |
| Pythia-160M | fp32 + --force_math_sdp | 39.6s | +31.1% |
| OLMo-2-1B | bf16 | 43.1s | (baseline) |
| OLMo-2-1B | bf16 + --force_math_sdp | 53.6s | +24.5% |
| OLMo-2-1B | fp32 | 132.8s | +208.1% |
| OLMo-2-1B | fp32 + --force_math_sdp | 141.8s | +229.0% |
| OLMo-2-7B | bf16 | 105.5s | (baseline) |
| OLMo-2-7B | bf16 + --force_math_sdp | 151.1s | +43.2% |
| OLMo-2-7B | fp32 | 569.2s | +439.5% |
| OLMo-2-7B | fp32 + --force_math_sdp | 603.6s | +472.1% |

fp32 is generally much more expensive than math SDPA alone. We could just default to math SDPA.
