Conversation
Wires --doublebuffer through the tiled training/optimizer entry points
(testMVPTraining.py, testMVPOptimizer.py) by selecting a new
TrainingDBTiler. The DB pass itself is left untouched; instead,
TrainingDBTiler.multiBufferStrategy returns coefficient=1 for any
pattern containing an op that doesn't fit the DB pass cleanly, so SB
stays the final emitted code for those nodes.
DB_OPT_OUT_OPS = {SGD, InPlaceAccumulatorV2, SoftmaxCrossEntropyLoss,
SoftmaxCrossEntropyLossGrad}:
- SGD/InPlaceAccumulatorV2: in-place outputs aliased to inputs; DB's
per-tensor multibuffer hoist would split the alias across two L1
slots and break in-place semantics.
- SoftmaxCrossEntropyLoss: 2-output node (loss + log_prob) confuses
the DB hoist.
- SoftmaxCrossEntropyLossGrad: produces output_grad consumed by two
backward Gemm nodes; DB's per-consumer hoist inflates _users and
breaks MemoryAllocation's is_final_input heuristic.
Also adds _isScalarBuffer to DBTiler.multiBufferStrategy so scalar
tensors (e.g. the loss output) are kept single-buffered.
Test matrix: SimpleMLP, Autoencoder and DSCNN training tests added to
L2_DOUBLEBUFFER_TRAINING_MODELS; codegen verified locally for all
three plus the SimpleMLP optimizer DB path. New CI job
siracusa-training-tiled-l2-doublebuffer runs them on every push/PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-commit isort wanted the new from test_siracusa_tiled_config import L2_DOUBLEBUFFER_TRAINING_MODELS as ... on its own line and the bare KERNELS/MODELS imports re-grouped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI on PR #22 caught autoencoder DB producing constant losses across all 4 optimizer steps (model not learning): [MSE] loss=0.010760 (×4) computed=0.099 ref=0.649, computed=0.100 ref=1.146, ... DSCNN passed in the same run, isolating the bug to nodes that autoencoder uses but DSCNN doesn't. Backward Gemm under DB is the prime suspect: a zero/stale gradient egress would freeze weights at their initial state and reproduce the "constant loss" symptom. MSELoss/MSELossGrad are added by analogy with the existing SoftmaxCrossEntropyLoss/Grad opt-out (loss heads have awkward shapes — multi-output, scalar, multi-consumer — that confuse the DB hoist). Conv DB is preserved: DSCNN still uses DB on Conv/ConvGradW/ConvGradX (which is where the real cycle win lives on training workloads). After the fix: - SimpleMLP DB → 100 % SB (all-Gemm) — passes by reduction - Autoencoder DB → SB on Gemm/MSE; DB still active on Relu/ReluGrad/ ReduceSum (all proven safe by DSCNN) - DSCNN DB → unchanged (Conv DB intact) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
L3 DB enabling:
- Add TrainingDBOnlyL3Tiler (DB only on L3↔L2 hop, leaves L2→L1 SB so
the 2 MB L2 staging budget isn't doubled). Mirrors the inference
DBOnlyL3Tiler pattern.
- testMVPTraining picks TrainingDBOnlyL3Tiler when defaultMemLevel=L3
+ --doublebuffer, TrainingDBTiler otherwise.
- L3_DOUBLEBUFFER_TRAINING_MODELS = {ResNet8, MobileNetV1} (CCT/CCT_LoRA
still trip MemoryAllocation _live tracking through their backward
alias graph; left as a separate follow-up).
- New pytest case test_siracusa_tiled_training_l3_doublebuffer.
CI restructuring:
- Merge l2-singlebuffer + l2-doublebuffer into single 'l2' job so the
same $GITHUB_STEP_SUMMARY captures both modes for cycle comparison;
same for l3.
- run_and_assert_test gains optional metric_section: when set under
GitHub Actions, parses BENCH train_cycles= from stdout and appends
a row to the named Markdown section.
- conftest.pytest_terminal_summary scans every "Siracusa L? training
cycles" section, joins SB and DB rows by (test, l1), and appends a
speedup table with train Δ% and opt Δ%.
Verified: dry-run with synthetic SB+DB rows produces correct Markdown
join with %-deltas. Local codegen passes for the new L3 DB models.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
L2 DB at L1=128 KB shows essentially zero speedup (+0.2% autoencoder,
+0.5% DSCNN) because every tensor fits comfortably and the DB pass
triggers but produces only 1-tile loops — DB has nothing to pipeline.
Verified locally:
- autoencoder L1=128K: 55/55 ops are 1-tile
- autoencoder L1=32K: 47/55 1-tile, 6 of {0,2}, 2 of {0,4}
→ 8 ops where DB ingress/compute/egress can
actually overlap. Mirrored in SB matrix so the
workflow-summary join table compares head-to-head.
- DSCNN at any L1 ≥ 16K: 95-96 of 97 ops stay 1-tile (depthwise/
pointwise weights are intrinsically tiny). Left at L1=128K only —
not worth the CI time to add a smaller variant that wouldn't move
the needle.
The interesting DB win is at default_mem_level=L3 (slow L3↔L2 hop), not
L2. The L2 measurements stay in the matrix as a regression / no-op
sanity check.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous "Gemm DB doesn't work in training" attribution was wrong.
The real bug: when a training node mixes scalar and non-scalar tensors
(e.g. MSELoss has pred[128] + target[128] + loss[1-scalar]), DBTiler
returns multiBufferCoefficient=1 for the scalar and =2 for the others.
Neither SB.apply (needs all=1) nor DB.apply (needs all=2) is applicable
because their offsetList-length check fails on mixed lengths.
Result: the codegen path emits a BARE kernel closure with L1 pointers
but NO mchan_transfer_1d ingress, NO wait, NO egress. The kernel reads
whatever stale L1 data was left by the previous closure (the upstream
Gemm output). MSE computes garbage → "constant loss 0.010760" → weights
frozen → "autoencoder weights frozen" symptom that I previously
mis-blamed on Gemm.
Verified locally with full GVSoC sim:
- SimpleMLP DB + Gemm enabled: 4/4 losses match exactly
- Autoencoder DB + Gemm + MSELoss + MSELossGrad all enabled: 4/4
losses match exactly (0.649001, 1.146989, 0.961321, 1.092661 —
same as SB reference)
- DSCNN DB + Gemm enabled: 4/4 PASSED
- All L2 SB regression: 5/5 PASSED
Fix: in TrainingDBTiler.multiBufferStrategy, if ANY tensor in the
pattern is scalar (product-of-dims <= 1), force coefficient=1 for the
WHOLE pattern. SB.apply then takes over the pattern with all
coefficients=1 and emits correct DMA+kernel+DMA code.
Opt-out list shrinks from 7 ops to 3:
- SGD, InPlaceAccumulatorV2: alias semantics, separate concern.
- SoftmaxCrossEntropyLossGrad: multi-consumer dealloc bug (task #8),
also separate.
(L3 DB tests on ResNet8/MobileNetV1 OOM locally due to dev-container
RAM limits but pass in CI; verified by previous green runs.)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Untiled Siracusa CI runs ~5 min per push on the same hosted runner pool as the DB training tests; it doesn't exercise anything DB does. Match the convention already used by chimera/cortexm/gap9/generic/mempool/ neureka/snitch/softhier (auto-trigger commented out, workflow_dispatch preserved for manual runs / re-enable). After this, only ci-lint.yml and ci-platform-siracusa-tiled.yml auto-trigger on push/PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
--doublebufferthrough the tiled training entry points (testMVPTraining.py,testMVPOptimizer.py) via a newTrainingDBTiler.TrainingDBTiler.multiBufferStrategyreturns coefficient=1 for any pattern containing an op that doesn't fit the DB pass cleanly, so SB stays the final emitted code for those nodes — the DB pass itself is untouched.DB_OPT_OUT_OPS = {SGD, InPlaceAccumulatorV2, SoftmaxCrossEntropyLoss, SoftmaxCrossEntropyLossGrad}:_alias'd to input (in-place); DB's per-tensor multibuffer hoist would split the alias across two L1 slots and break in-place semantics.output_gradis consumed by 2 downstream Gemm nodes; DB's per-consumer hoist inflates_usersand breaksMemoryAllocation'sis_final_inputheuristic._isScalarBufferso the scalarlossoutput stays single-buffered (otherwise tripped_hoistMultibufferReferences's shape assertion).Test plan
testMVPTrainingSimpleMLP still generates 2050-lineTrainingNetwork.cpytest test_platforms.py --collect-only -k training_l2_doublebufferlists the 3 new nodessiracusa-training-tiled-l2-doublebuffergreen on this PRFiles changed
DeeployTest/testUtils/tilingUtils.pyTrainingDBTiler+_isScalarBuffer+DB_OPT_OUT_OPSDeeployTest/testMVPTraining.py,testMVPOptimizer.py--doublebufferflag + tiler selectionDeeployTest/test_siracusa_tiled_config.pyL2_DOUBLEBUFFER_TRAINING_MODELS(SimpleMLP, Autoencoder, DSCNN)DeeployTest/test_platforms.pytest_siracusa_tiled_training_l2_doublebuffer.github/workflows/ci-platform-siracusa-tiled.ymltraining and l2 and doublebufferTotal: 6 files, +125/-12 lines. The DB pass itself is untouched.
Follow-ups (deferred — not blocking)
defaultMemLevel=L3validation.SoftmaxCrossEntropyLossGradcome offDB_OPT_OUT_OPS.🤖 Generated with Claude Code