
Add --force_math_sdp flag and test_numerical_stability command#198

Open
davidoj wants to merge 5 commits into main from safe-numerics-diagnostic

Conversation


davidoj commented Mar 19, 2026

Summary

  • Adds --force_math_sdp flag to IndexConfig, available on build, score, trackstar, and other commands. Disables flash and memory-efficient SDPA backends to ensure gradient consistency across different padding lengths.
  • Adds a bergson test_numerical_stability CLI command that automatically tests escalating configurations (default → --force_math_sdp → --precision fp32 → both) and reports the minimum flags needed for a given model.
    • How it works: for each trial it
      1. Picks two random documents from the dataset
      2. Runs the shorter one alone → collects its gradient
      3. Runs the shorter one batched with the longer one (which adds padding) → collects the short doc's gradient again
      4. Computes cosine similarity between the two gradients
    • Then reports mean/std/min/max across all trials
  • Adds tests verifying the flag persists through collect_gradients and that gradients are consistent when using math-only SDPA.
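The trial procedure above boils down to a cosine-similarity check between two gradient vectors, plus summary statistics across trials. A minimal self-contained sketch (pure Python, illustrative names only; the actual bergson implementation may differ):

```python
import math

def cosine_similarity(a, b):
    # Plain cosine similarity between two flat gradient vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def run_trial(grad_alone, grad_batched):
    # One stability trial: the same document's gradient computed alone
    # vs. batched with a longer (padded) partner should be near-identical.
    return cosine_similarity(grad_alone, grad_batched)

def summarize(sims):
    # Mean/std/min/max across all trials, as reported by the CLI command.
    mean = sum(sims) / len(sims)
    var = sum((s - mean) ** 2 for s in sims) / len(sims)
    return {"mean": mean, "std": math.sqrt(var),
            "min": min(sims), "max": max(sims)}
```

A cosine similarity near 1.0 for every trial means padding had no effect on the gradient; the large negative minimums in the example report below the test plan indicate gradients that flipped direction entirely.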

Test plan

  • Unit tests for apply_force_math_sdp (enable/disable)
  • Integration test: force_math_sdp persists after collect_gradients, gradients consistent across batch composition
  • Full test suite passes (151 passed, 2 skipped)
  • pre-commit run --all-files passes
  • Manual test: bergson test_numerical_stability --model EleutherAI/pythia-14m

🤖 Generated with Claude Code

Example:

bergson test_numerical_stability --model EleutherAI/pythia-160m
[...]

============================================================
Report for EleutherAI/pythia-160m
============================================================
  FAIL  defaults (precision=bf16)  (min cos_sim=-0.422857)
  FAIL  --force_math_sdp (precision=bf16)  (min cos_sim=-0.059151)
  FAIL  --precision fp32  (min cos_sim=0.248888)
  PASS  --precision fp32 --force_math_sdp  (min cos_sim=0.999886)

RESULT: Gradients require non-default settings for consistency.
  Minimum required: --precision fp32 --force_math_sdp

  Add to your bergson commands:
    bergson build <run_path> --model EleutherAI/pythia-160m --force_math_sdp --precision fp32
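The escalation in the report above amounts to a cheapest-first search for the first passing configuration. A sketch with illustrative names and an assumed pass threshold (the real command's threshold may differ):

```python
# Configurations ordered cheapest-first; the first whose worst-case
# cosine similarity clears the threshold is the "minimum required".
# The threshold value is an assumption, not necessarily what bergson uses.
PASS_THRESHOLD = 0.999

CONFIGS = [
    "defaults (precision=bf16)",
    "--force_math_sdp (precision=bf16)",
    "--precision fp32",
    "--precision fp32 --force_math_sdp",
]

def minimum_required(min_cos_sims):
    """min_cos_sims[i] is the min cosine similarity observed for CONFIGS[i]."""
    for config, sim in zip(CONFIGS, min_cos_sims):
        if sim >= PASS_THRESHOLD:
            return config
    return None  # no configuration passed
```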

davidoj and others added 5 commits March 19, 2026 01:54
Adds a --force_math_sdp flag to IndexConfig that disables flash and
memory-efficient SDPA backends, forcing the math-only kernel. Some
models produce inconsistent per-example gradients across different
padding lengths when using optimized attention backends.

Also adds `bergson test_numerical_stability` which automatically tests
escalating configurations (default → math SDP → fp32 → both) and
reports the minimum flags needed for gradient consistency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Use device_map instead of .to(device) for model loading
- Assert row is dict for HF dataset iteration type narrowing
- Fix docstring line length

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents the --force_math_sdp flag and bergson test_numerical_stability
command with performance benchmarks on Pythia-160M, OLMo-2-1B, and
OLMo-2-7B showing overhead varies from +0.8% to +43.2% depending on
model and precision.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

davidoj commented Mar 19, 2026

Benchmarking the two defensive interventions

| Model | Settings | Build time | vs bf16 baseline |
|---|---|---|---|
| Pythia-160M | bf16 | 30.2s | (baseline) |
| Pythia-160M | bf16 + --force_math_sdp | 30.4s | +0.8% |
| Pythia-160M | fp32 | 35.6s | +17.9% |
| Pythia-160M | fp32 + --force_math_sdp | 39.6s | +31.1% |
| OLMo-2-1B | bf16 | 43.1s | (baseline) |
| OLMo-2-1B | bf16 + --force_math_sdp | 53.6s | +24.5% |
| OLMo-2-1B | fp32 | 132.8s | +208.1% |
| OLMo-2-1B | fp32 + --force_math_sdp | 141.8s | +229.0% |
| OLMo-2-7B | bf16 | 105.5s | (baseline) |
| OLMo-2-7B | bf16 + --force_math_sdp | 151.1s | +43.2% |
| OLMo-2-7B | fp32 | 569.2s | +439.5% |
| OLMo-2-7B | fp32 + --force_math_sdp | 603.6s | +472.1% |

fp32 is generally much more expensive than math SDPA alone. We could just default to math SDPA.
