fix: miscalculation of num_steps when using num_epoch and lmdb #5488

Open
OutisLi wants to merge 1 commit into deepmodeling:master from OutisLi:pr/epoch

@OutisLi OutisLi commented Jun 3, 2026

Summary by CodeRabbit

  • Bug Fixes

    • Improved batch count estimation accuracy for mixed batch mode in data readers
    • Enhanced distributed training batch sampler calculations for more accurate per-rank batch counts
    • Fixed training step calculation to align with actual dataloader batch behavior
  • Tests

    • Added validation tests for dataset batch counting under distributed sampling configurations

Copilot AI review requested due to automatic review settings June 3, 2026 06:04
@dosubot dosubot Bot added the bug label Jun 3, 2026
@github-actions github-actions Bot added the Python label Jun 3, 2026

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adjust LMDB batch counting to reflect auto-probability expansion and align distributed/training step calculations, with added regression tests.

Changes:

  • Update total_batch/index semantics to track sampler-expanded batch counts.
  • Use DataLoader length for LR step calculations when training on LmdbDataset.
  • Add tests covering total_batch alignment and distributed sampler length with auto-prob expansion.
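
To illustrate why the source of the batch count matters for step calculation, here is a minimal sketch. It assumes the total step count is simply epochs times batches per epoch, rounded up; `FixedLengthLoader`, `steps_for_epochs`, and the numbers 5 and 7 are hypothetical and are not taken from the deepmd code.

```python
import math


class FixedLengthLoader:
    """Hypothetical stand-in for a dataloader; only its length matters here."""

    def __init__(self, num_batches: int) -> None:
        self._num_batches = num_batches

    def __len__(self) -> int:
        return self._num_batches


def steps_for_epochs(num_epoch: float, loader) -> int:
    """Assumed rule: total steps = epochs * batches per epoch, rounded up."""
    return math.ceil(num_epoch * len(loader))


# Illustrative numbers only: an estimate that ignores sampler expansion (5)
# versus the length the loader actually reports after expansion (7).
reader_estimate = 5
loader = FixedLengthLoader(7)

steps_from_loader = steps_for_epochs(10, loader)
steps_from_estimate = math.ceil(10 * reader_estimate)
print(steps_from_loader, steps_from_estimate)
```

With these made-up numbers, the estimate-based schedule would stop well before ten real epochs have elapsed, which is the kind of mismatch the change list describes.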

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| source/tests/pt/test_lmdb_dataloader.py | Adds tests validating expanded batch counts and distributed sampler __len__() behavior. |
| deepmd/pt/utils/lmdb_dataset.py | Changes index/total_batch to be derived from the batch sampler length (including expansion). |
| deepmd/pt/train/training.py | Uses len(training_dataloader) to compute batch counts for LR scheduling with LMDB datasets. |
| deepmd/dpmodel/utils/lmdb_data.py | Updates total_batch calculation and adjusts distributed sampler __len__() to include expansion via SameNlocBatchSampler. |

Comment on lines +314 to +318
         return [self.total_batch]

     @property
     def total_batch(self) -> int:
-        return self._reader.total_batch
+        return len(self._batch_sampler)
Comment on lines 1312 to 1320
         """Number of batches for this rank."""
-        total = 0
-        for nloc, indices in self._reader.nloc_groups.items():
-            bs = self._reader.get_batch_size_for_nloc(nloc)
-            total += (len(indices) + bs - 1) // bs
+        total = len(
+            SameNlocBatchSampler(
+                self._reader,
+                shuffle=False,
+                block_targets=self._block_targets,
+            )
+        )
         return math.ceil(total / self._world_size)
Comment on lines +627 to +634
    def test_total_batch_matches_auto_prob_sampler(self, auto_prob_lmdb):
        ds = LmdbDataset(
            auto_prob_lmdb,
            type_map=["O", "H"],
            batch_size=4,
            auto_prob_style="prob_sys_size;0:1:0.5;1:3:0.5",
        )
        assert ds.total_batch == len(ds._batch_sampler)

coderabbitai Bot commented Jun 3, 2026

Walkthrough

The PR refactors batch count computation across the LMDB data pipeline. LmdbDataReader introduces a total_batch property that accounts for mixed batching modes, LmdbDataset derives its batch counts from the sampler length rather than the reader, and Trainer now uses dataloader lengths instead of dataset-reported totals. DistributedSameNlocBatchSampler.__len__ adopts a consistent expansion-based estimation path. Tests validate the alignment.

Changes

Batch Count Estimation Alignment

| Layer / File(s) | Summary |
| --- | --- |
| Reader-level batch counting (deepmd/dpmodel/utils/lmdb_data.py) | LmdbDataReader.total_batch computes batch counts as ceil(nframes/batch_size) for mixed batches or summed per-nloc counts otherwise; index returns [total_batch]. DistributedSameNlocBatchSampler.__len__ replaces direct per-nloc estimation with a SameNlocBatchSampler instantiation and ceil(total/world_size) division. |
| Dataset-level batch alignment (deepmd/pt/utils/lmdb_dataset.py) | LmdbDataset.total_batch now derives from len(self._batch_sampler) instead of reader; index returns [total_batch] directly, aligning reported counts with sampler behavior. |
| Trainer batch count resolution (deepmd/pt/train/training.py) | Single-task and multi-task trainer initialization now use len(self.training_dataloader) and len(self.training_dataloader[model_key]) to compute batch totals, instead of reading training_data.total_batch. |
| Batch alignment validation (source/tests/pt/test_lmdb_dataloader.py) | Two new tests in TestAutoProbDataset verify that ds.total_batch equals len(ds._batch_sampler) and that distributed sampler length equals ceil(global_batches/2) for world_size=2, rank=0. |
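
The two reader-level counting rules summarized above can be sketched as follows. This is a simplified illustration rather than the code in lmdb_data.py: the function names, the `nloc_groups` contents, and the constant batch size of 4 are assumptions made for the example. Only the formulas, ceil(nframes/batch_size) and the per-nloc sum of `(len(indices) + bs - 1) // bs`, come from the PR.

```python
import math


def mixed_batch_count(nframes: int, batch_size: int) -> int:
    """Mixed batches: frames of any nloc share batches, so one ceil suffices."""
    return math.ceil(nframes / batch_size)


def per_nloc_batch_count(nloc_groups, batch_size_for_nloc) -> int:
    """Same-nloc batches: each group is rounded up on its own, then summed."""
    total = 0
    for nloc, indices in nloc_groups.items():
        bs = batch_size_for_nloc(nloc)
        total += (len(indices) + bs - 1) // bs  # integer ceil(len / bs)
    return total


# Hypothetical data: 5 frames with 3 atoms and 3 frames with 6 atoms.
nloc_groups = {3: [0, 1, 2, 3, 4], 6: [5, 6, 7]}
nframes = sum(len(indices) for indices in nloc_groups.values())

mixed = mixed_batch_count(nframes, 4)                        # ceil(8 / 4)
grouped = per_nloc_batch_count(nloc_groups, lambda nloc: 4)  # ceil(5/4) + ceil(3/4)
print(mixed, grouped)
```

Because each group rounds up separately, the per-nloc count can exceed the mixed count for the same frames, so the two modes cannot share one formula.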

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • deepmodeling/deepmd-kit#5413: Both PRs modify LMDB batching and length semantics. This PR changes LmdbDataReader/LmdbDataset total_batch/index and the distributed __len__, while #5413 changes LmdbDataReader frame retention and batch-size computation/block targets, so the batch-count and index fixes here are closely tied to that altered batching behavior.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 27.27% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title directly addresses the main issue: fixing miscalculation of num_steps when using num_epoch with LMDB datasets, which is the core change reflected across all modified files. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@deepmd/dpmodel/utils/lmdb_data.py`:
- Around line 1311-1320: The current __len__() returns ceil(total /
self._world_size) for every rank which mismatches the strided partitioning in
__iter__(); change __len__() to compute a rank-aware batch count by getting
total = len(SameNlocBatchSampler(self._reader, shuffle=False,
block_targets=self._block_targets)), then compute base = total //
self._world_size and remainder = total % self._world_size and return base + (1
if self._rank < remainder else 0) so that __len__() matches the actual number of
batches produced by the __iter__() strided partitioning (alternatively, adjust
_partition_batches() to pad/repeat batches so every rank emits
ceil(total/world_size) and keep current __len__()).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 8ecc555b-6099-4600-a3c5-6d331cce1d8d

📥 Commits

Reviewing files that changed from the base of the PR and between 27a18b6 and 48ecc68.

📒 Files selected for processing (4)
  • deepmd/dpmodel/utils/lmdb_data.py
  • deepmd/pt/train/training.py
  • deepmd/pt/utils/lmdb_dataset.py
  • source/tests/pt/test_lmdb_dataloader.py

Comment on lines 1311 to 1320
     def __len__(self) -> int:
         """Number of batches for this rank."""
-        total = 0
-        for nloc, indices in self._reader.nloc_groups.items():
-            bs = self._reader.get_batch_size_for_nloc(nloc)
-            total += (len(indices) + bs - 1) // bs
+        total = len(
+            SameNlocBatchSampler(
+                self._reader,
+                shuffle=False,
+                block_targets=self._block_targets,
+            )
+        )
         return math.ceil(total / self._world_size)

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Keep __len__() consistent with the batches this rank actually yields.

__iter__() uses strided partitioning (all_batches[self._rank :: self._world_size]), so ranks after the remainder get fewer batches when the global count is not divisible by world_size. __len__() now returns ceil(total / world_size) for every rank, which overstates shorter ranks and no longer matches the iterator. That matters now that deepmd/pt/train/training.py Lines 652 and 681 derive num_epoch -> num_steps from len(self.training_dataloader). Either pad/repeat in _partition_batches() so each rank really emits ceil(total / world_size) batches, or make __len__() rank-aware and resolve a shared epoch length elsewhere.
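
The mismatch can be reproduced with plain lists. The sketch below is a standalone illustration, not the sampler itself: `strided_share`, `ceil_len`, and `rank_aware_len` are hypothetical helpers, and integers stand in for batches. It only assumes the `all_batches[rank::world_size]` slicing quoted above.

```python
import math


def strided_share(all_batches: list, rank: int, world_size: int) -> list:
    """Batches one rank would iterate under strided partitioning."""
    return all_batches[rank::world_size]


def ceil_len(total: int, world_size: int) -> int:
    """Length reported identically on every rank: ceil(total / world_size)."""
    return math.ceil(total / world_size)


def rank_aware_len(total: int, rank: int, world_size: int) -> int:
    """Length that matches the strided slice for this specific rank."""
    base, remainder = divmod(total, world_size)
    return base + (1 if rank < remainder else 0)


all_batches = list(range(7))  # 7 global batches, not divisible by 2
world_size = 2
for rank in range(world_size):
    actual = len(strided_share(all_batches, rank, world_size))
    print(rank, actual, ceil_len(len(all_batches), world_size),
          rank_aware_len(len(all_batches), rank, world_size))
```

With 7 batches on 2 ranks, rank 1 iterates 3 batches while the ceil-based length reports 4. The rank-aware formula matches the slice on both ranks, but it yields different lengths per rank, which is why the comment also mentions resolving a shared epoch length elsewhere.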



codecov Bot commented Jun 3, 2026

Codecov Report

❌ Patch coverage is 76.92308% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.36%. Comparing base (27a18b6) to head (48ecc68).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| deepmd/pt/train/training.py | 0.00% | 2 Missing ⚠️ |
| deepmd/dpmodel/utils/lmdb_data.py | 88.88% | 1 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5488   +/-   ##
=======================================
  Coverage   81.36%   81.36%           
=======================================
  Files         868      868           
  Lines       96567    96568    +1     
  Branches     4233     4234    +1     
=======================================
+ Hits        78570    78573    +3     
+ Misses      16697    16695    -2     
  Partials     1300     1300           
