
Conversation

@ChiahsinChu (Contributor) commented Mar 20, 2025

Overview

This PR adds data modifier plugin functionality to the PyTorch implementation of DeepMD. This feature allows on-the-fly data modification during training and inference, enabling advanced data manipulation.
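
For orientation, enabling a modifier in a training input would look roughly like the sketch below. This is illustrative only: the "modifier" block under the model section and the forwarding of its remaining keys to the modifier constructor follow the descriptions in this PR, while the surrounding structure and the chosen values are assumptions.

model_params = {
    # ... the usual "descriptor" / "fitting_net" sections ...
    "modifier": {
        "type": "scaling_tester",  # a name registered via BaseModifier.register
        # any other keys are passed to the modifier's constructor
    },
}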

Key Changes

1. Added Data Modifier to Training Pipeline

  • File: deepmd/pt/entrypoints/main.py
    • Added imports for data modifier functionality (get_data_modifier)
    • Added modifier initialization in get_trainer() function
    • Added modifier parameter to data loader initialization for both training and validation datasets
    • Enhanced model freezing process to include modifier handling with temporary file management

2. Added Data Modifier to Inference

  • File: deepmd/pt/infer/deep_eval.py
    • Added modifier loading and handling in DeepEval class
    • Enhanced model loading process to handle extra files containing modifier data
    • Added modifier application in inference methods to modify model predictions

3. Implemented Data Modifier Framework

  • File: deepmd/pt/modifier/__init__.py (entirely new)
    • Created base class BaseModifier with registration system (a registration sketch follows this list)
    • Implemented three specific modifier types:
      • ModifierRandomTester: Applies random scaling to energy/force/virial data for testing
      • ModifierZeroTester: Zeroes out energy/force/virial data for testing
      • ModifierScalingTester: Applies scaled model predictions as data modifications
    • Added comprehensive argument parsing for modifier configuration
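
As a rough illustration of the registration pattern (the @BaseModifier.register decorator and the forward signature follow this PR's tests and review excerpts; the class body, plugin name, and output shape are assumptions):

import torch

from deepmd.pt.modifier import BaseModifier


@BaseModifier.register("constant_shift")  # hypothetical plugin name
class ConstantShiftModifier(BaseModifier):
    """Shift every energy label down by a constant (illustrative)."""

    def __init__(self, shift: float = 1.0) -> None:
        super().__init__()  # assumes the base constructor takes no required arguments
        self.shift = shift

    def forward(self, coord, atype, box=None, fparam=None, aparam=None,
                do_atomic_virial=False):
        # modify_data subtracts these outputs from the matching labels, so
        # returning `shift` per frame lowers each energy label by `shift`.
        nframes = coord.shape[0]
        return {"energy": torch.full((nframes, 1), self.shift)}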

4. Added Data Modifier Tests

  • File: source/tests/pt/test_data_modifier.py (entirely new)
    • Created comprehensive test suite for data modifier functionality
    • Tests include:
      • Modifier initialization and data modification verification
      • Ensuring data modification is applied only once
      • Testing inference with data modification by verifying scaled model predictions
    • Added helper methods for test data management and comparison (a miniature example follows this list)
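
A miniature of the test pattern, reusing the hypothetical ConstantShiftModifier sketched earlier; whether modify_data accepts a plain NumPy frame dict exactly as written is an assumption, though the key names ("coord", "atype", "box", "energy", "find_energy") follow the conventions visible elsewhere in this PR:

import unittest

import numpy as np


class TestConstantShift(unittest.TestCase):
    def test_energy_shifted(self):
        modifier = ConstantShiftModifier(shift=2.0)
        frame = {
            "coord": np.zeros((1, 3)),            # one frame, one atom
            "atype": np.zeros((1, 1), dtype=int),
            "box": None,
            "energy": np.array([[5.0]]),
            "find_energy": 1.0,                   # label-present flag
        }
        modifier.modify_data(frame, data_sys=None)
        # the modifier predicted 2.0 per frame; modify_data subtracts it
        np.testing.assert_allclose(frame["energy"], [[3.0]])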

Summary by CodeRabbit

  • New Features

    • Pluggable data modifier API: create, attach to loaders, propagate through training/inference, and embed into exported models.
  • Behavior

    • Modifiers are applied during data loading with optional preloading and caching; modifier outputs can be subtracted from model predictions during training and inference.
  • Tests

    • End-to-end tests covering zeroing, scaling, deterministic/random modifiers and frozen-model inference integration.


coderabbitai bot (Contributor) commented Mar 20, 2025

Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Adds a data-modifier subsystem: BaseModifier class and factory, dataset/loader integration and caching, propagation into training/inference/freeze flows (serialize as extra file in .pth), ModelWrapper support to subtract modifier outputs at runtime, and unit tests for modifier behaviors.
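
The label-side contract can be pictured with a small sketch (an illustration, not the shipped code; the real modify_data in deepmd/pt/modifier/base_modifier.py additionally handles tensor conversion, output-key validation, and shape checks):

def apply_modifier_to_frame(frame: dict, pred: dict) -> None:
    # Subtract the modifier's prediction from each training label that is
    # present, guarded by the frame's find_* flags.
    for key in ("energy", "force", "virial"):
        if frame.get(f"find_{key}") == 1.0 and key in pred:
            frame[key] = frame[key] - pred[key].reshape(frame[key].shape)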

Changes

  • Modifier package (deepmd/pt/modifier/__init__.py): new factory get_data_modifier(...) exported; __all__ updated to include BaseModifier and get_data_modifier.
  • BaseModifier implementation (deepmd/pt/modifier/base_modifier.py): new BaseModifier (torch.nn.Module) with an abstract forward, serialize/deserialize, and a modify_data helper that applies/subtracts modifier outputs from frame data.
  • DeePMD data core (deepmd/utils/data.py): adds modifier caching fields (use_modifier_cache, _modified_frame_cache), changes the get_item/get_single_frame signatures to accept num_worker, applies the modifier (via ThreadPoolExecutor) and caches the results, and adds preload_and_modify_all_data_torch(num_worker).
  • Dataset & DataLoader wiring (deepmd/pt/utils/dataset.py, deepmd/pt/utils/dataloader.py): DeepmdDataSetForLoader and DpLoaderSet accept an optional modifier and forward it into dataset creation; DpLoaderSet.preload_and_modify_all_data_torch() added; NUM_WORKERS propagated to item calls.
  • Training preparation (deepmd/pt/train/training.py): attaches per-model data requirements and invokes preload_and_modify_all_data_torch() on training/validation datasets; removes _validation_data and _data_requirement from the single_model_stat signature.
  • Entrypoint & packaging (deepmd/pt/entrypoints/main.py): imports get_data_modifier, instantiates the modifier on the device, passes it into loader construction, and serializes it into the scripted model's extra_files as data_modifier.pth when freezing.
  • Model wrapper / runtime (deepmd/pt/train/wrapper.py): ModelWrapper accepts and stores an optional modifier; on the inference-only path it subtracts modifier predictions from model outputs when provided.
  • Inference loading & runtime (deepmd/pt/infer/deep_eval.py, deepmd/pt/infer/inference.py): the .pth loader reads the data_modifier.pth extra file and reconstructs the modifier in memory; DeepEval stores the modifier and wraps the model with it; Tester loads jitable modifiers when present.
  • Tests (source/tests/pt/test_data_modifier.py): new tests adding three modifier implementations (random, zero, scaling), plugin registrations, and unit tests covering modification, caching, single-application guarantees, and inference integration.
  • Paddle utils update (deepmd/pd/utils/dataset.py): propagates NUM_WORKERS to Paddle dataset call sites.
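
Where the list above mentions per-frame modification and caching, the mechanism is roughly the following sketch (a minimal illustration, not the shipped code: the function and variable names are assumptions, while modify_data, the thread pool, the cache, and the deep-copy detail come from this PR and its reviews):

from concurrent.futures import ThreadPoolExecutor


def preload_and_modify_all(frames: list, modifier, num_worker: int = 1) -> dict:
    """Apply the modifier to every frame once, in parallel, and cache the results."""
    cache: dict = {}

    def _modify(idx: int) -> None:
        frame = frames[idx]
        modifier.modify_data(frame, data_sys=None)  # data_sys is unused by the base helper
        cache[idx] = frame  # the shipped code deep-copies frames before caching

    with ThreadPoolExecutor(max_workers=max(1, num_worker)) as pool:
        list(pool.map(_modify, range(len(frames))))
    return cache

Later get_item calls can then serve cache[idx] instead of re-applying the modifier, which is what guarantees the modification happens only once.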

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor CLI as Entrypoint
  participant Factory as get_data_modifier
  participant Loader as DpLoaderSet / Dataset
  participant Data as DeepmdData
  participant Trainer as ModelWrapper
  participant Freezer as Script Export
  participant Eval as DeepEval

  CLI->>Factory: request modifier from model params
  Factory-->>CLI: BaseModifier instance (jitable or not)
  CLI->>Loader: construct loaders with modifier
  Loader->>Data: attach modifier to dataset
  Loader->>Data: preload_and_modify_all_data_torch(num_worker)
  Data->>Data: apply modifier per-frame (threaded) and cache
  Loader-->>CLI: loaders ready

  CLI->>Trainer: instantiate ModelWrapper with modifier
  Trainer->>Trainer: (inference) compute model_pred
  Trainer->>Trainer: compute modifier_pred and subtract from model_pred

  CLI->>Freezer: freeze -> serialize modifier into extra_files["data_modifier.pth"]
  Eval->>Freezer: load .pth with extra_files
  Freezer-->>Eval: provide `data_modifier.pth` bytes
  Eval->>Factory: torch.jit.load bytes -> modifier instance
  Eval->>Trainer: instantiate ModelWrapper with loaded modifier

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • iProzd
  • njzjz
  • wanghan-iapcm

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): docstring coverage is 66.10%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check (✅ Passed): the title 'feat(pt): add plugin for data modifier' accurately describes the main change, adding a data modifier plugin system to the PyTorch implementation.

📜 Recent review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eba1d24 and 6aefe98.

📒 Files selected for processing (2)
  • deepmd/pd/utils/dataset.py
  • deepmd/pt/utils/dataset.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • deepmd/pd/utils/dataset.py
🧰 Additional context used
🧬 Code graph analysis (1)
deepmd/pt/utils/dataset.py (2)
deepmd/pt/modifier/base_modifier.py (1)
  • BaseModifier (30-187)
deepmd/pd/utils/dataset.py (1)
  • DeepmdDataSetForLoader (17-56)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (28)
  • GitHub Check: Test C++ (false, false, false, true)
  • GitHub Check: Test C++ (false, true, true, false)
  • GitHub Check: Test C++ (true, true, true, false)
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Test C++ (true, false, false, true)
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
  • GitHub Check: Test Python (6, 3.13)
  • GitHub Check: Test Python (5, 3.10)
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Test Python (4, 3.13)
  • GitHub Check: Test Python (4, 3.10)
  • GitHub Check: Test Python (1, 3.10)
  • GitHub Check: Test Python (1, 3.13)
  • GitHub Check: Test Python (5, 3.13)
  • GitHub Check: Test Python (6, 3.10)
  • GitHub Check: Test Python (2, 3.13)
  • GitHub Check: Test Python (3, 3.13)
  • GitHub Check: Test Python (2, 3.10)
  • GitHub Check: Test Python (3, 3.10)
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Build C++ (cpu, cpu)
  • GitHub Check: Analyze (python)
  • GitHub Check: Analyze (c-cpp)
🔇 Additional comments (4)
deepmd/pt/utils/dataset.py (4)

12-17: LGTM: Imports correctly support modifier functionality.

The new imports for BaseModifier and NUM_WORKERS are necessary for the modifier integration and parallel data processing.


74-75: Method delegation verified.

DeepmdData.preload_and_modify_all_data_torch exists with the correct signature accepting num_worker: int, and the delegation with max(1, NUM_WORKERS) is proper.


25-30: DeepmdData constructor properly accepts the modifier parameter.

Verification confirms that DeepmdData.__init__ in deepmd/utils/data.py includes modifier: Any | None = None as a parameter and documents it in the docstring as "Data modifier that has the method modify_data". The parameter is correctly stored and passed in deepmd/pt/utils/dataset.py at lines 40-43. No issues found.


54-54: NUM_WORKERS and get_item_torch signature are properly implemented.

The change correctly passes max(1, NUM_WORKERS) to get_item_torch. Verification confirms:

  1. NUM_WORKERS is defined in deepmd/pt/utils/env.py and guaranteed to be non-negative (defaults to min(4, ncpus) or is set to 0 when using non-fork multiprocessing).
  2. DeepmdData.get_item_torch accepts the num_worker parameter (signature: def get_item_torch(self, index: int, num_worker: int = 1,) -> dict:).

The max(1, NUM_WORKERS) pattern safely handles the case where NUM_WORKERS=0 by ensuring at least 1 worker is used.



@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
deepmd/pt/modifier/base_modifier.py (2)

41-44: Consider simplifying the box assignment.
A ternary operator can reduce verbosity here:

- if data["box"] is None:
-     box = None
- else:
-     box = data["box"][:get_nframes, :]
+ box = None if data["box"] is None else data["box"][:get_nframes, :]
🧰 Tools
🪛 Ruff (0.8.2)

41-44: Use ternary operator box = None if data["box"] is None else data["box"][:get_nframes, :] instead of if-else-block

Replace if-else-block with box = None if data["box"] is None else data["box"][:get_nframes, :]

(SIM108)


47-47: Remove or use the nframes variable.
Currently, nframes = coord.shape[0] is not used, which may confuse future maintainers.

🧰 Tools
🪛 Ruff (0.8.2)

47-47: Local variable nframes is assigned to but never used

Remove assignment to unused variable nframes

(F841)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between baaaa17 and e6bca31.

📒 Files selected for processing (4)
  • deepmd/pt/modifier/__init__.py (1 hunks)
  • deepmd/pt/modifier/base_modifier.py (1 hunks)
  • deepmd/pt/train/training.py (7 hunks)
  • deepmd/pt/utils/stat.py (1 hunks)
🧰 Additional context used
🧬 Code Definitions (2)
deepmd/pt/modifier/__init__.py (1)
deepmd/pt/modifier/base_modifier.py (1)
  • BaseModifier (9-56)
deepmd/pt/train/training.py (3)
deepmd/pt/modifier/base_modifier.py (2)
  • BaseModifier (9-56)
  • modify_data (14-56)
deepmd/pd/train/training.py (1)
  • get_additional_data_requirement (1163-1187)
deepmd/pd/utils/stat.py (1)
  • make_stat_input (40-85)
🪛 Ruff (0.8.2)
deepmd/pt/modifier/base_modifier.py

41-44: Use ternary operator box = None if data["box"] is None else data["box"][:get_nframes, :] instead of if-else-block

Replace if-else-block with box = None if data["box"] is None else data["box"][:get_nframes, :]

(SIM108)


47-47: Local variable nframes is assigned to but never used

Remove assignment to unused variable nframes

(F841)

⏰ Context from checks skipped due to timeout of 90000ms (29)
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Build C library (2.14, >=2.5.0rc0,<2.15, libdeepmd_c_cu11.tar.gz)
  • GitHub Check: Test C++ (false)
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Analyze (python)
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
  • GitHub Check: Test C++ (true)
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Analyze (c-cpp)
  • GitHub Check: Build C++ (cuda, cuda)
  • GitHub Check: Test Python (1, 3.12)
  • GitHub Check: Build C++ (cpu, cpu)
  • GitHub Check: Test Python (1, 3.9)
🔇 Additional comments (11)
deepmd/pt/modifier/__init__.py (1)

1-8: Good job exposing the BaseModifier API.
This new __init__.py cleanly re-exports the BaseModifier class and ensures users can import it directly from deepmd.pt.modifier.

deepmd/pt/utils/stat.py (2)

50-51: Conditional logging is well-handled.
Only logging when nbatches > 0 helps keep logs cleaner in scenarios where no batches are processed.


56-59: Logic for handling nbatches == -1 is clear and correct.
This new condition ensures the entire dataset is used when nbatches is -1. No issues found.

deepmd/pt/train/training.py (8)

39-41: Import of BaseModifier is appropriate.
This import makes the newly introduced functionality available where needed.


140-149: Modifier parameter handling is well-structured.
The assertion preventing usage in multi-task scenarios is clear and avoids incompatible configurations.


231-231: Defaulting modifier to None is appropriate.
Makes the modifier usage optional without complicating the training interface.


239-250: Verify data modification logic.
Applying modifier.modify_data to every system might lead to repeated transformations if single_model_stat is called multiple times. Confirm this matches your intended workflow.


345-345: Single-model signature usage is consistent.
Passing modifier=self.modifier ensures the same modifier instance is applied throughout the training flow.


384-384: Multi-task signature usage is consistent.
Again, passing modifier=self.modifier allows uniform data processing across tasks if needed.


1075-1081: Storing the data_modifier state is a good idea.
Consider providing a loading mechanism in the future so that data_modifier can be restored automatically.


1389-1400: Factory function for data modifiers looks good.
Encapsulates logic for dynamically obtaining modifier classes, making the code more extensible.
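
Based on the review notes here and later in the thread, the factory is approximately the following sketch (the deep copy, the required "type" key, the from None suppression, and the plugin lookup are described in the reviews; exposing get_class_by_type on BaseModifier and the exact error message are assumptions):

import copy

from deepmd.pt.modifier import BaseModifier


def get_data_modifier_sketch(modifier_params: dict) -> BaseModifier:
    params = copy.deepcopy(modifier_params)  # avoid mutating the caller's config
    try:
        modifier_type = params.pop("type")
    except KeyError:
        raise ValueError("data modifier config requires a 'type' key") from None
    cls = BaseModifier.get_class_by_type(modifier_type)  # plugin registry lookup
    return cls(**params)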

codecov bot commented Mar 20, 2025

Codecov Report

❌ Patch coverage is 42.20430% with 215 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.95%. Comparing base (b98f6c5) to head (6aefe98).

Files with missing lines Patch % Lines
source/tests/pt/test_data_modifier.py 3.46% 195 Missing ⚠️
deepmd/pt/modifier/base_modifier.py 72.58% 17 Missing ⚠️
deepmd/pt/modifier/__init__.py 81.81% 2 Missing ⚠️
deepmd/utils/data.py 96.55% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##            devel    #4661      +/-   ##
==========================================
- Coverage   82.15%   81.95%   -0.21%     
==========================================
  Files         709      712       +3     
  Lines       72468    72824     +356     
  Branches     3616     3616              
==========================================
+ Hits        59535    59681     +146     
- Misses      11769    11981     +212     
+ Partials     1164     1162       -2     

☔ View full report in Codecov by Sentry.

@wanghan-iapcm (Collaborator) left a comment

please add UT for the implementation.

@wanghan-iapcm requested review from iProzd and njzjz on March 21, 2025 03:53
@njzjz linked an issue on Mar 21, 2025 that may be closed by this pull request
@njzjz requested a review from Copilot on March 21, 2025 17:19
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR introduces a new plugin for modifying data in the PyTorch backend and integrates it into the training workflow. Key changes include:

  • Creation of a new BaseModifier class and registration in the modifier package.
  • Integration of the data modifier into the training process, including saving its state.
  • Minor adjustments to the statistics data preparation in the utils module.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

Reviewed files
  • deepmd/pt/modifier/__init__.py: Exposes BaseModifier for external use.
  • deepmd/pt/modifier/base_modifier.py: Adds a new BaseModifier class for data modification.
  • deepmd/pt/train/training.py: Integrates the data modifier into training data preparation and model saving.
  • deepmd/pt/utils/stat.py: Tweaks logging and batch calculation in the statistics utility.
Comments suppressed due to low confidence (2)

deepmd/pt/modifier/base_modifier.py:9

  • Ensure that make_base_modifier() returns a valid class to use for multiple inheritance with torch.nn.Module. If it does not, consider revising the inheritance structure or renaming for clarity.
class BaseModifier(torch.nn.Module, make_base_modifier()):

deepmd/pt/modifier/base_modifier.py:40

  • The variable get_nframes is explicitly set to None, which will slice the full array; if a limit on the number of frames was intended, assign get_nframes an appropriate value.
coord = data["coord"][:get_nframes, :]

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

♻️ Duplicate comments (2)
deepmd/pt/utils/dataset.py (1)

74-75: Remove or implement the commented-out clear_modified_frame_cache method.

Similar to the dataloader, this commented-out method suggests incomplete implementation. Either implement the functionality if needed, or remove the comment to reduce clutter.

This is the same issue as in deepmd/pt/utils/dataloader.py lines 245-247. Consider a unified approach to cache management across both files.

deepmd/pt/modifier/base_modifier.py (1)

50-50: Remove or use the commented-out nframes variable.

The variable nframes is calculated but immediately commented out and never used. Either remove the comment and use the variable, or delete the line entirely to reduce clutter.

Based on learnings, if this is intentionally kept for future use, please clarify with a TODO comment.

🧹 Nitpick comments (2)
deepmd/pt/modifier/base_modifier.py (1)

17-17: Clarify the purpose of the unused data_sys parameter.

The data_sys parameter is flagged as unused by static analysis. If it's intended for future use or required by subclasses, consider adding a docstring note or a TODO comment. Otherwise, consider removing it to reduce the parameter surface.

deepmd/pt/utils/dataloader.py (1)

245-247: Remove or implement the commented-out clear_modified_frame_cache method.

The commented-out method suggests incomplete implementation. Either implement the cache-clearing functionality if it's needed, or remove the comment to keep the codebase clean.

If cache clearing is required in the future, consider opening an issue to track the feature.

📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5369e80 and 859292f.

📒 Files selected for processing (6)
  • deepmd/pt/entrypoints/main.py (4 hunks)
  • deepmd/pt/modifier/__init__.py (1 hunks)
  • deepmd/pt/modifier/base_modifier.py (1 hunks)
  • deepmd/pt/train/training.py (2 hunks)
  • deepmd/pt/utils/dataloader.py (4 hunks)
  • deepmd/pt/utils/dataset.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • deepmd/pt/modifier/__init__.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2024-10-16T21:50:10.680Z
Learnt from: njzjz
Repo: deepmodeling/deepmd-kit PR: 4226
File: deepmd/dpmodel/model/make_model.py:370-373
Timestamp: 2024-10-16T21:50:10.680Z
Learning: In `deepmd/dpmodel/model/make_model.py`, the variable `nall` assigned but not used is intentional and should not be flagged in future reviews.

Applied to files:

  • deepmd/pt/modifier/base_modifier.py
🧬 Code graph analysis (4)
deepmd/pt/train/training.py (3)
deepmd/pt/utils/dataloader.py (1)
  • preload_and_modify_all_data (241-243)
deepmd/pt/utils/dataset.py (1)
  • preload_and_modify_all_data (71-72)
deepmd/utils/data.py (1)
  • preload_and_modify_all_data (498-513)
deepmd/pt/entrypoints/main.py (3)
deepmd/pt/modifier/base_modifier.py (1)
  • BaseModifier (12-60)
deepmd/utils/plugin.py (1)
  • get_class_by_type (144-154)
deepmd/pt/utils/dataloader.py (1)
  • DpLoaderSet (65-247)
deepmd/pt/utils/dataset.py (2)
deepmd/pt/modifier/base_modifier.py (1)
  • BaseModifier (12-60)
deepmd/pt/utils/dataloader.py (1)
  • preload_and_modify_all_data (241-243)
deepmd/pt/utils/dataloader.py (3)
deepmd/pt/modifier/base_modifier.py (1)
  • BaseModifier (12-60)
deepmd/pt/utils/dataset.py (1)
  • preload_and_modify_all_data (71-72)
deepmd/utils/data.py (1)
  • preload_and_modify_all_data (498-513)
🪛 Ruff (0.14.8)
deepmd/pt/entrypoints/main.py

116-116: Avoid specifying long messages outside the exception class

(TRY003)

deepmd/pt/modifier/base_modifier.py

17-17: Unused method argument: data_sys

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (28)
  • GitHub Check: Test Python (6, 3.13)
  • GitHub Check: Test Python (4, 3.13)
  • GitHub Check: Test Python (5, 3.13)
  • GitHub Check: Test Python (6, 3.10)
  • GitHub Check: Test Python (2, 3.13)
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
  • GitHub Check: Test Python (5, 3.10)
  • GitHub Check: Test C++ (false, true, true, false)
  • GitHub Check: Test C++ (true, true, true, false)
  • GitHub Check: Test Python (2, 3.10)
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Test Python (1, 3.13)
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Test Python (4, 3.10)
  • GitHub Check: Test C++ (false, false, false, true)
  • GitHub Check: Test Python (3, 3.10)
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Test Python (3, 3.13)
  • GitHub Check: Test Python (1, 3.10)
  • GitHub Check: Test C++ (true, false, false, true)
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Build C++ (cpu, cpu)
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Analyze (c-cpp)
  • GitHub Check: Analyze (python)
🔇 Additional comments (9)
deepmd/pt/entrypoints/main.py (3)

111-121: LGTM: Data modifier instantiation is well-structured.

The get_data_modifier function correctly validates the presence of the type field, instantiates the modifier using the plugin system, and moves it to the appropriate device.


129-133: LGTM: Modifier wiring integrates cleanly.

The modifier is correctly extracted from model parameters and instantiated only when present, maintaining backward compatibility for configurations without modifiers.


164-180: LGTM: Modifier is correctly passed to both training and validation data loaders.

The modifier is properly threaded through to both DpLoaderSet instances, ensuring consistent data preprocessing across training and validation.

deepmd/pt/train/training.py (2)

353-355: LGTM: Preload and modify step correctly positioned.

The preload_and_modify_all_data() calls are correctly placed after data requirements are added by single_model_stat and before data loaders are constructed. This ensures the modifier operates on the complete, configured data before training begins.


394-396: LGTM: Multi-model path mirrors single-model preload logic.

The multi-model code path correctly applies preload_and_modify_all_data() to each model's training and validation data, maintaining consistency with the single-model approach.

deepmd/pt/utils/dataloader.py (2)

82-102: LGTM: Modifier parameter correctly threaded through DpLoaderSet.

The modifier parameter is properly added to the constructor signature, stored, and forwarded to each DeepmdDataSetForLoader instance. The optional typing (BaseModifier | None) maintains backward compatibility.


241-243: LGTM: Preload method delegates correctly.

The preload_and_modify_all_data() method properly iterates over all systems and delegates to each system's preload method, enabling batch preprocessing before training.

deepmd/pt/utils/dataset.py (2)

22-40: LGTM: Modifier integration is clean and complete.

The modifier parameter is properly added to the constructor, documented, stored as an instance attribute, and forwarded to the underlying DeepmdData instance. This enables the data modification pipeline while maintaining a clear separation of concerns.


71-72: LGTM: Preload method correctly delegates to the data system.

The preload_and_modify_all_data() method properly delegates to self._data_system.preload_and_modify_all_data(), maintaining the abstraction layers and enabling efficient batch preprocessing.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

🧹 Nitpick comments (2)
deepmd/pt/modifier/base_modifier.py (2)

43-43: Clarify or remove the get_nframes pattern.

Setting get_nframes = None and then using it in slicing operations (e.g., [:get_nframes, :]) is equivalent to [:None, :] which means "all frames". This pattern appears to be a placeholder for future functionality but currently adds no value and may confuse readers.

If this is intentional for future extension, consider adding a comment explaining the planned usage. Otherwise, simplify to direct slicing:

-        get_nframes = None
-        coord = data["coord"][:get_nframes, :]
+        coord = data["coord"]

54-59: Consider adding shape validation before reshape.

The code assumes that the outputs from forward() are compatible with the target shapes in data. If a subclass returns tensors with incompatible shapes, the reshape() operations will fail with potentially unclear error messages.

Consider adding assertions to provide clearer error messages:

if "find_energy" in data and data["find_energy"] == 1.0:
    expected_shape = data["energy"].shape
    assert tot_e.numel() == np.prod(expected_shape), \
        f"Energy shape mismatch: forward returned {tot_e.shape}, expected {expected_shape}"
    data["energy"] -= tot_e.reshape(expected_shape)
📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 859292f and c69102c.

📒 Files selected for processing (4)
  • deepmd/pt/modifier/base_modifier.py (1 hunks)
  • deepmd/pt/utils/dataloader.py (4 hunks)
  • deepmd/pt/utils/dataset.py (2 hunks)
  • source/tests/pt/test_data_modifier.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2024-10-16T21:50:10.680Z
Learnt from: njzjz
Repo: deepmodeling/deepmd-kit PR: 4226
File: deepmd/dpmodel/model/make_model.py:370-373
Timestamp: 2024-10-16T21:50:10.680Z
Learning: In `deepmd/dpmodel/model/make_model.py`, the variable `nall` assigned but not used is intentional and should not be flagged in future reviews.

Applied to files:

  • deepmd/pt/modifier/base_modifier.py
🧬 Code graph analysis (3)
deepmd/pt/utils/dataset.py (2)
deepmd/pt/modifier/base_modifier.py (1)
  • BaseModifier (12-59)
deepmd/utils/data.py (2)
  • DeepmdData (34-1069)
  • preload_and_modify_all_data (498-513)
deepmd/pt/modifier/base_modifier.py (1)
deepmd/dpmodel/modifier/base_modifier.py (1)
  • make_base_modifier (17-77)
deepmd/pt/utils/dataloader.py (3)
deepmd/pt/modifier/base_modifier.py (1)
  • BaseModifier (12-59)
deepmd/pt/utils/dataset.py (1)
  • preload_and_modify_all_data (71-72)
deepmd/utils/data.py (1)
  • preload_and_modify_all_data (498-513)
🪛 Ruff (0.14.8)
source/tests/pt/test_data_modifier.py

81-81: Unused method argument: data_sys

(ARG002)


141-141: Unused method argument: data_sys

(ARG002)

deepmd/pt/modifier/base_modifier.py

17-17: Unused method argument: data_sys

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (28)
  • GitHub Check: Build C++ (cpu, cpu)
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Test C++ (false, true, true, false)
  • GitHub Check: Analyze (c-cpp)
  • GitHub Check: Test C++ (false, false, false, true)
  • GitHub Check: Test C++ (true, false, false, true)
  • GitHub Check: Analyze (python)
  • GitHub Check: Test C++ (true, true, true, false)
  • GitHub Check: Test Python (5, 3.10)
  • GitHub Check: Test Python (6, 3.13)
  • GitHub Check: Test Python (2, 3.13)
  • GitHub Check: Test Python (3, 3.13)
  • GitHub Check: Test Python (1, 3.13)
  • GitHub Check: Test Python (4, 3.13)
  • GitHub Check: Test Python (4, 3.10)
  • GitHub Check: Test Python (5, 3.13)
  • GitHub Check: Test Python (6, 3.10)
  • GitHub Check: Test Python (2, 3.10)
  • GitHub Check: Test Python (3, 3.10)
  • GitHub Check: Test Python (1, 3.10)
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
🔇 Additional comments (9)
deepmd/pt/modifier/base_modifier.py (2)

1-15: LGTM!

The imports and class structure are clean. The BaseModifier correctly inherits from torch.nn.Module and the dpmodel base class.


36-41: Good defensive check.

The early return when no find_* flags are present avoids unnecessary computation.

source/tests/pt/test_data_modifier.py (3)

28-35: LGTM!

The plugin registrations for test modifiers are correctly structured.


159-173: Good test setup.

The test configuration properly loads a base config and customizes training parameters for testing.


175-223: Excellent test coverage.

The tests effectively validate:

  1. That data modification is applied during initialization (zero_tester)
  2. That modification happens exactly once and persists through training (random_tester)

This covers the key behaviors of the modifier system.

deepmd/pt/utils/dataloader.py (2)

28-30: Clean integration of modifier support.

The modifier parameter is properly typed, documented in the docstring (implicitly through the system parameter), and correctly propagated to the dataset constructor.

Also applies to: 82-102


241-243: LGTM!

The preload_and_modify_all_data method correctly delegates to each underlying system, providing a convenient way to preload and modify all data in the loader set.

deepmd/pt/utils/dataset.py (2)

12-14: Excellent integration of modifier support.

The modifier parameter is properly:

  • Typed with BaseModifier | None
  • Documented in the docstring
  • Stored as an instance variable
  • Propagated to the DeepmdData constructor

The implementation is clean and consistent with the overall modifier pattern.

Also applies to: 22-44


71-72: LGTM!

The preload_and_modify_all_data method correctly delegates to the underlying data system, maintaining a clean separation of concerns.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
deepmd/pt/modifier/base_modifier.py (1)

93-107: Clarify the obscure get_nframes = None usage.

Line 93 sets get_nframes = None and then uses it in slicing operations like data["coord"][:get_nframes, :]. When get_nframes is None, the slice [:None, :] is equivalent to [:, :] (selecting all frames).

This pattern is intentional but not immediately obvious to readers.

Consider adding a comment to clarify:

-        get_nframes = None
+        get_nframes = None  # None in slice [:None] means select all frames
         coord = data["coord"][:get_nframes, :]

Or use a more explicit pattern:

-        get_nframes = None
-        coord = data["coord"][:get_nframes, :]
-        atype = data["atype"][:get_nframes, :]
+        # Process all frames in the batch
+        coord = data["coord"]
+        atype = data["atype"]
📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c69102c and d804866.

📒 Files selected for processing (3)
  • deepmd/pt/modifier/base_modifier.py (1 hunks)
  • deepmd/utils/data.py (3 hunks)
  • source/tests/pt/test_data_modifier.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2024-10-16T21:50:10.680Z
Learnt from: njzjz
Repo: deepmodeling/deepmd-kit PR: 4226
File: deepmd/dpmodel/model/make_model.py:370-373
Timestamp: 2024-10-16T21:50:10.680Z
Learning: In `deepmd/dpmodel/model/make_model.py`, the variable `nall` assigned but not used is intentional and should not be flagged in future reviews.

Applied to files:

  • deepmd/pt/modifier/base_modifier.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (3)
deepmd/pt/modifier/base_modifier.py (1)
  • modify_data (66-117)
deepmd/pt/utils/dataset.py (1)
  • preload_and_modify_all_data (71-72)
deepmd/pt/utils/dataloader.py (1)
  • preload_and_modify_all_data (241-243)
🪛 Ruff (0.14.8)
source/tests/pt/test_data_modifier.py

52-52: Unused method argument: atype

(ARG002)


53-53: Unused method argument: box

(ARG002)


54-54: Unused method argument: fparam

(ARG002)


55-55: Unused method argument: aparam

(ARG002)


59-59: Unused method argument: data_sys

(ARG002)


89-89: Unused method argument: atype

(ARG002)


90-90: Unused method argument: box

(ARG002)


91-91: Unused method argument: fparam

(ARG002)


92-92: Unused method argument: aparam

(ARG002)


96-96: Unused method argument: data_sys

(ARG002)

deepmd/pt/modifier/base_modifier.py

66-66: Unused method argument: data_sys

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (28)
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Analyze (python)
  • GitHub Check: Analyze (c-cpp)
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Test Python (3, 3.13)
  • GitHub Check: Test Python (2, 3.10)
  • GitHub Check: Test Python (1, 3.10)
  • GitHub Check: Test Python (5, 3.10)
  • GitHub Check: Test Python (2, 3.13)
  • GitHub Check: Test Python (6, 3.13)
  • GitHub Check: Test Python (5, 3.13)
  • GitHub Check: Test Python (6, 3.10)
  • GitHub Check: Test Python (4, 3.10)
  • GitHub Check: Test Python (1, 3.13)
  • GitHub Check: Build C++ (cpu, cpu)
  • GitHub Check: Test Python (4, 3.13)
  • GitHub Check: Test Python (3, 3.10)
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Test C++ (true, true, true, false)
  • GitHub Check: Test C++ (false, true, true, false)
  • GitHub Check: Test C++ (false, false, false, true)
  • GitHub Check: Test C++ (true, false, false, true)
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
🔇 Additional comments (8)
deepmd/utils/data.py (3)

142-147: LGTM! Safe initialization of modifier caching.

The caching mechanism is initialized correctly with appropriate guards. The _modified_frame_cache is only created when a modifier exists, and all cache accesses in get_single_frame properly check self.modifier is not None before accessing the cache.


387-394: LGTM! Efficient cache lookup with proper guards.

The early return from cache is well-guarded with all necessary conditions and provides a good optimization for repeated frame access.


498-514: LGTM! Well-implemented preload method with progress logging.

The preload method correctly handles early returns, efficiently skips already-cached frames, and provides useful progress feedback. The implementation aligns well with the broader data loading pipeline shown in the related files.

deepmd/pt/modifier/base_modifier.py (1)

1-65: LGTM! Well-structured base class with proper abstractions.

The class correctly inherits from torch.nn.Module and uses @abstractmethod to enforce subclass implementation of forward(). The serialization methods follow standard patterns.

source/tests/pt/test_data_modifier.py (4)

29-36: LGTM! Plugin registrations correctly implemented.

Both modifier plugins are properly registered with the modifier_args_plugin registry and return empty argument lists, which is appropriate for these test modifiers.


39-73: LGTM! Test modifier correctly implements custom logic.

The ModifierRandomTester class correctly:

  • Sets modifier_type = "random_tester" matching its registration name (line 47)
  • Overrides modify_data with custom randomization logic (lines 59-73)
  • The forward method returns minimal data since modify_data is completely overridden and never calls the base implementation

The unused parameters flagged by static analysis (atype, box, fparam, aparam in forward, and data_sys in modify_data) are required by the base class interface but not used in this simple test implementation. This is acceptable for test code.


76-110: LGTM! Test modifier correctly implements zeroing behavior.

The ModifierZeroTester class correctly sets modifier_type = "zero_tester" matching its registration and implements the expected zeroing logic in modify_data. Like ModifierRandomTester, it completely overrides the base modify_data implementation, so the minimal forward method is acceptable.


113-187: LGTM! Comprehensive test coverage for modifier integration.

The test cases effectively verify:

  1. test_init_modify_data - confirms that zero_tester correctly zeros out training and validation data (lines 130-147)
  2. test_full_modify_data - validates that random_tester produces consistent results before and after trainer.run(), which tests the caching mechanism (lines 149-178)

The tests cover the critical integration points between modifiers, data loading, and the training pipeline. The cleanup in tearDown is thorough.

@ChiahsinChu force-pushed the devel-modifier-plugin branch from 149ddf3 to d1c269d on December 24, 2025 03:24
@ChiahsinChu requested a review from njzjz on December 24, 2025 03:24
- Add data modifier support in model inference pipeline
- Enable saving and loading data modifiers with frozen models
- Add ModifierScalingTester for scaling model predictions as data modification
- Update test cases to verify data modifier functionality in inference
- Enhance modifier argument registration with documentation

This allows data modifiers to be applied during model inference and
preserves them when saving frozen models for consistent behavior
across training and inference stages.
@ChiahsinChu force-pushed the devel-modifier-plugin branch from d1c269d to e898764 on December 24, 2025 03:25
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

Caution: Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
deepmd/pt/train/training.py (1)

244-254: Remove obsolete commented-out code.

The signature of single_model_stat was changed to remove _validation_data and _data_requirement parameters, but the function body still contains commented-out code referencing these removed parameters. Since the data requirement logic has been moved outside this function (lines 340-349 for single-task, lines 384-395 for multi-task), this commented-out code is no longer needed and should be removed for consistency.

🔎 Proposed fix
     def single_model_stat(
         _model: Any,
         _data_stat_nbatch: int,
         _training_data: DpLoaderSet,
         _stat_file_path: str | None,
         finetune_has_new_type: bool = False,
     ) -> Callable[[], Any]:
-        # _data_requirement += get_additional_data_requirement(_model)
-        # _training_data.add_data_requirement(_data_requirement)
-        # if _validation_data is not None:
-        #     _validation_data.add_data_requirement(_data_requirement)

         @functools.lru_cache
         def get_sample() -> Any:
🧹 Nitpick comments (3)
deepmd/pt/modifier/base_modifier.py (1)

115-115: Remove commented-out code.

This appears to be leftover debugging code. The static analysis also flagged this as commented-out code.

🔎 Proposed fix
-        # model = self.dp.to(DEVICE)
         prec = NP_PRECISION_DICT[RESERVED_PRECISION_DICT[GLOBAL_PT_FLOAT_PRECISION]]
source/tests/pt/test_data_modifier.py (1)

89-91: Consider removing unnecessary __new__ overrides.

The __new__ methods in all three test modifiers simply delegate to super().__new__(cls) without any customization. These are likely artifacts from the plugin pattern but are redundant here since the base class already handles this.

🔎 Proposed fix for ModifierRandomTester (apply similar pattern to others)
 @BaseModifier.register("random_tester")
 class ModifierRandomTester(BaseModifier):
-    def __new__(cls, *args, **kwargs):
-        return super().__new__(cls)
-
     def __init__(
         self,
         seed: int = 1,

Also applies to: 133-134, 172-173

deepmd/pt/entrypoints/main.py (1)

387-390: Variable shadowing is acceptable here but could be clearer.

The model parameter (a string path) is reassigned to tester.model (a torch.nn.Module). While this works, using distinct names like model_path and model would improve readability. This is a minor stylistic point.

📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 149ddf3 and e898764.

📒 Files selected for processing (11)
  • deepmd/pt/entrypoints/main.py
  • deepmd/pt/infer/deep_eval.py
  • deepmd/pt/infer/inference.py
  • deepmd/pt/modifier/__init__.py
  • deepmd/pt/modifier/base_modifier.py
  • deepmd/pt/train/training.py
  • deepmd/pt/train/wrapper.py
  • deepmd/pt/utils/dataloader.py
  • deepmd/pt/utils/dataset.py
  • deepmd/utils/data.py
  • source/tests/pt/test_data_modifier.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • deepmd/pt/infer/inference.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2024-10-16T21:50:10.680Z
Learnt from: njzjz
Repo: deepmodeling/deepmd-kit PR: 4226
File: deepmd/dpmodel/model/make_model.py:370-373
Timestamp: 2024-10-16T21:50:10.680Z
Learning: In `deepmd/dpmodel/model/make_model.py`, the variable `nall` assigned but not used is intentional and should not be flagged in future reviews.

Applied to files:

  • deepmd/pt/modifier/base_modifier.py
🧬 Code graph analysis (7)
deepmd/pt/entrypoints/main.py (2)
deepmd/pt/modifier/__init__.py (1)
  • get_data_modifier (17-23)
deepmd/pt/infer/inference.py (1)
  • Tester (28-76)
deepmd/utils/data.py (4)
deepmd/pt/modifier/base_modifier.py (1)
  • modify_data (88-186)
source/tests/pt/test_data_modifier.py (2)
  • modify_data (114-128)
  • modify_data (153-167)
deepmd/pt/utils/dataloader.py (1)
  • preload_and_modify_all_data (241-243)
deepmd/pt/utils/dataset.py (1)
  • preload_and_modify_all_data (71-72)
deepmd/pt/modifier/base_modifier.py (3)
deepmd/dpmodel/modifier/base_modifier.py (1)
  • make_base_modifier (17-77)
deepmd/utils/data.py (1)
  • DeepmdData (35-1061)
source/tests/pt/test_data_modifier.py (5)
  • forward (102-112)
  • forward (141-151)
  • forward (187-209)
  • modify_data (114-128)
  • modify_data (153-167)
deepmd/pt/train/training.py (6)
deepmd/pt/loss/property.py (1)
  • label_requirement (209-221)
deepmd/pt/loss/tensor.py (1)
  • label_requirement (173-207)
deepmd/pd/train/training.py (1)
  • get_additional_data_requirement (1222-1246)
deepmd/pt/utils/dataloader.py (2)
  • add_data_requirement (215-218)
  • preload_and_modify_all_data (241-243)
deepmd/pt/utils/dataset.py (2)
  • add_data_requirement (55-69)
  • preload_and_modify_all_data (71-72)
deepmd/utils/data.py (1)
  • preload_and_modify_all_data (499-514)
source/tests/pt/test_data_modifier.py (4)
deepmd/pt/infer/deep_eval.py (2)
  • DeepEval (87-863)
  • eval (321-395)
deepmd/pt/modifier/base_modifier.py (3)
  • BaseModifier (30-186)
  • forward (76-85)
  • modify_data (88-186)
deepmd/utils/data.py (2)
  • DeepmdData (35-1061)
  • add (150-203)
deepmd/pt/train/wrapper.py (1)
  • forward (155-205)
deepmd/pt/infer/deep_eval.py (1)
deepmd/pt/train/wrapper.py (1)
  • ModelWrapper (16-217)
deepmd/pt/modifier/__init__.py (1)
deepmd/utils/plugin.py (1)
  • get_class_by_type (144-154)
🪛 GitHub Check: CodeQL
deepmd/pt/train/training.py

[notice] 253-254: Commented-out code
This comment appears to contain commented-out code.

🪛 Ruff (0.14.10)
deepmd/pt/modifier/base_modifier.py

88-88: Unused method argument: data_sys

(ARG002)


162-165: Avoid specifying long messages outside the exception class

(TRY003)


171-174: Avoid specifying long messages outside the exception class

(TRY003)


180-183: Avoid specifying long messages outside the exception class

(TRY003)

source/tests/pt/test_data_modifier.py

89-89: Unused static method argument: args

(ARG004)


89-89: Unused static method argument: kwargs

(ARG004)


104-104: Unused method argument: coord

(ARG002)


105-105: Unused method argument: atype

(ARG002)


106-106: Unused method argument: box

(ARG002)


107-107: Unused method argument: fparam

(ARG002)


108-108: Unused method argument: aparam

(ARG002)


109-109: Unused method argument: do_atomic_virial

(ARG002)


114-114: Unused method argument: data_sys

(ARG002)


133-133: Unused static method argument: args

(ARG004)


133-133: Unused static method argument: kwargs

(ARG004)


143-143: Unused method argument: coord

(ARG002)


144-144: Unused method argument: atype

(ARG002)


145-145: Unused method argument: box

(ARG002)


146-146: Unused method argument: fparam

(ARG002)


147-147: Unused method argument: aparam

(ARG002)


148-148: Unused method argument: do_atomic_virial

(ARG002)


153-153: Unused method argument: data_sys

(ARG002)


172-172: Unused static method argument: args

(ARG004)


172-172: Unused static method argument: kwargs

(ARG004)

deepmd/pt/modifier/__init__.py

22-22: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (28)
  • GitHub Check: Test Python (6, 3.10)
  • GitHub Check: Test Python (6, 3.13)
  • GitHub Check: Test Python (5, 3.10)
  • GitHub Check: Test Python (3, 3.10)
  • GitHub Check: Test Python (4, 3.13)
  • GitHub Check: Test Python (5, 3.13)
  • GitHub Check: Test Python (3, 3.13)
  • GitHub Check: Test Python (2, 3.10)
  • GitHub Check: Test Python (1, 3.10)
  • GitHub Check: Test Python (2, 3.13)
  • GitHub Check: Test Python (1, 3.13)
  • GitHub Check: Test Python (4, 3.10)
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Build C++ (cpu, cpu)
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Test C++ (true, false, false, true)
  • GitHub Check: Test C++ (false, false, false, true)
  • GitHub Check: Test C++ (true, true, true, false)
  • GitHub Check: Test C++ (false, true, true, false)
  • GitHub Check: Analyze (python)
  • GitHub Check: Analyze (c-cpp)
🔇 Additional comments (22)
deepmd/pt/modifier/__init__.py (1)

17-23: LGTM!

The factory function correctly deep-copies the input to avoid mutation, validates the required "type" key, and delegates to the plugin registry. The from None pattern cleanly suppresses the original KeyError traceback.

deepmd/pt/modifier/base_modifier.py (1)

87-186: Solid implementation of the data modification pipeline.

The modify_data method properly handles tensor conversion, calls the abstract forward method, validates output keys, and applies the modifications. The data_sys parameter is unused here but is part of the interface contract for subclasses that may need access to the data system.

source/tests/pt/test_data_modifier.py (2)

296-362: Well-structured inference test with clear verification logic.

The test properly:

  1. Creates a baseline model and freezes it for use as the modifier.
  2. Creates a modified model with the scaling modifier.
  3. Verifies the relationship: output_model - sfactor * output_modifier = output_ref.

The test provides good coverage of the modifier-in-inference path; a numerical form of the check follows below.
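
The relation in item 3 can be expressed as a one-line numerical check (the function and argument names here are illustrative):

import numpy as np


def check_scaling_relation(out_model, out_modifier, out_ref, sfactor, atol=1e-8):
    # output_model - sfactor * output_modifier should recover output_ref
    np.testing.assert_allclose(out_model - sfactor * out_modifier, out_ref, atol=atol)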


216-295: Good test coverage for modifier behavior.

The parameterized tests effectively cover:

  • test_init_modify_data: Verifies modifier is applied during initialization.
  • test_full_modify_data: Ensures modifier is applied exactly once (not reapplied on subsequent accesses).

The use of @parameterized with different batch sizes provides good coverage across edge cases.

deepmd/utils/data.py (3)

143-148: Good initialization of modifier caching infrastructure.

The code properly:

  1. Defaults apply_modifier_at_load to True.
  2. Allows modifiers to override this behavior via their own attribute.
  3. Initializes the cache only when a modifier is present.

Minor note: The _modified_frame_cache is only initialized when modifier is not None, so ensure all cache accesses are guarded by the same condition (which they are in lines 388-394 and 490-495).


490-495: Correct modifier application and caching logic.

The implementation correctly:

  1. Always applies the modifier when present.
  2. Only caches when apply_modifier_at_load is True.
  3. Uses copy.deepcopy to ensure cached frames are independent snapshots.

This addresses the previous review feedback about shallow vs. deep copy.


499-514: Useful preloading method for performance optimization.

The preload_and_modify_all_data method:

  1. Correctly guards against no-op scenarios (no modifier or not applying at load).
  2. Provides progress logging every 100 frames.
  3. Leverages get_single_frame which handles caching internally.

One consideration: For large datasets, memory usage could be significant since all frames are cached. This is a reasonable trade-off documented by the method's purpose.

deepmd/pt/infer/deep_eval.py (1)

174-189: Clean implementation of modifier loading from frozen model (a loading sketch follows this list).

The code correctly:

  1. Extracts the modifier from extra_files embedded in the .pth file.
  2. Uses io.BytesIO for in-memory loading, avoiding temp file cleanup issues.
  3. Handles both cases: modifier present and absent.
  4. Passes the modifier to ModelWrapper for inference-time application.
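
A minimal loading sketch, assuming a frozen model produced as in the freeze step (torch.jit's _extra_files argument is the real API; the file names and the absent-modifier handling are illustrative):

import io

import torch

extra_files = {"data_modifier.pth": ""}  # request this entry at load time
model = torch.jit.load("frozen_model.pth", _extra_files=extra_files)
raw = extra_files["data_modifier.pth"]   # populated with the embedded bytes, if any
modifier = torch.jit.load(io.BytesIO(raw)) if raw else None
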
deepmd/pt/utils/dataset.py (2)

22-44: Clean integration of modifier into dataset constructor.

The modifier parameter is properly:

  1. Type-hinted with BaseModifier | None.
  2. Stored as an instance attribute.
  3. Forwarded to the underlying DeepmdData constructor.
  4. Documented in the docstring.

71-72: Simple delegation pattern for preload method.

The preload_and_modify_all_data method correctly delegates to the underlying data system. This maintains the layered architecture where DeepmdDataSetForLoader is a thin wrapper.

deepmd/pt/train/wrapper.py (2)

17-62: Good extension of ModelWrapper to support modifiers.

The constructor properly:

  1. Accepts an optional modifier parameter.
  2. Stores it as an instance attribute.
  3. Keeps it separate from the model/loss module dictionaries.

The modifier is typed as torch.nn.Module | None, which is appropriate since modifiers extend torch.nn.Module via BaseModifier.


189-195: Correct application of modifier during inference.

The modifier is applied only during inference (self.inference_only or inference_only), not during training. This is intentional: during training, the modifier adjusts the training data labels (via modify_data), while during inference, the modifier output is subtracted from the model output to produce the corrected prediction.

The subtraction model_pred[k] = model_pred[k] - v correctly reverses the modification applied to training labels.
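
In sketch form (mirroring the line quoted above; an illustration rather than the wrapper's verbatim code):

def subtract_modifier(model_pred: dict, modifier_pred: dict) -> dict:
    # Inference-only correction: remove the modifier's contribution from
    # each matching output key.
    for k, v in modifier_pred.items():
        if k in model_pred:
            model_pred[k] = model_pred[k] - v
    return model_pred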

deepmd/pt/entrypoints/main.py (2)

118-122: Clean integration of modifier creation in trainer preparation.

The code properly:

  1. Extracts modifier params from model configuration.
  2. Creates the modifier only when parameters are present.
  3. Moves the modifier to the target device.

This follows the pattern established in deepmd/pt/infer/inference.py.


391-407: Excellent use of in-memory buffer for modifier serialization.

This implements the suggestion from the past review to use io.BytesIO instead of temporary files:

  1. Creates an in-memory buffer.
  2. Serializes the JIT-scripted modifier to the buffer.
  3. Embeds the buffer contents in extra_files.

This approach is cleaner and avoids temp file cleanup issues entirely.
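
On the saving side, a self-contained sketch of the same mechanism (torch.jit.save's _extra_files parameter is the real API; the Identity modules and file names are stand-ins):

import io

import torch

modifier = torch.nn.Identity()                    # stand-in for a real modifier
frozen = torch.jit.script(torch.nn.Identity())    # stand-in for the frozen model

buffer = io.BytesIO()
torch.jit.save(torch.jit.script(modifier), buffer)  # serialize the modifier in memory
torch.jit.save(
    frozen,
    "frozen_model.pth",                           # illustrative output path
    _extra_files={"data_modifier.pth": buffer.getvalue()},
)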

deepmd/pt/utils/dataloader.py (4)

28-30: LGTM!

The import of BaseModifier is clean and necessary for the type annotation in the DpLoaderSet constructor.


89-89: LGTM!

The optional modifier parameter is appropriately typed and defaults to None, making the data modification feature opt-in.


97-102: LGTM!

The modifier is correctly forwarded to each DeepmdDataSetForLoader instance during construction, ensuring consistent modifier application across all systems.


241-243: LGTM!

The convenience method correctly delegates preloading and modification to each system. Based on the training flow in deepmd/pt/train/training.py, this method is called after data requirements are added and before model statistics are computed, ensuring statistics are computed on modified data.

deepmd/pt/train/training.py (4)

340-349: LGTM! Correct ordering of data preparation steps.

The sequence is now properly ordered:

  1. Construct data requirements from loss labels and model parameters
  2. Add requirements to training and validation datasets
  3. Preload and modify all data
  4. Compute model statistics (line 350)

This ensures that model statistics are computed on the modified data, which addresses the concern raised in previous reviews.


350-358: LGTM!

The call to single_model_stat is consistent with the updated signature, correctly omitting the removed _validation_data and _data_requirement parameters.


384-395: LGTM! Multi-task path mirrors single-task logic correctly.

The data preparation steps are correctly applied per model key, maintaining the proper sequence: construct requirements → add to datasets → preload and modify → compute statistics. This ensures each model branch's statistics are computed on appropriately modified data.


396-406: LGTM!

The per-model-key call to single_model_stat is consistent with the updated signature and correctly parameterized for the multi-task scenario.

@ChiahsinChu force-pushed the devel-modifier-plugin branch from 18508da to e4d5ce2 on December 24, 2025 at 04:05
Contributor

@coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
deepmd/pt/train/training.py (3)

244-267: Document the precondition for single_model_stat.

The refactored single_model_stat function now assumes that data requirements have already been added to _training_data before invocation. Consider adding a docstring or inline comment documenting this precondition to help future maintainers understand the expected calling pattern.

Suggested documentation
 def single_model_stat(
     _model: Any,
     _data_stat_nbatch: int,
     _training_data: DpLoaderSet,
     _stat_file_path: str | None,
     finetune_has_new_type: bool = False,
 ) -> Callable[[], Any]:
+    """Compute or load model statistics from training data.
+    
+    Note: Assumes data requirements have already been added to _training_data
+    and data has been preloaded/modified if needed.
+    """
     @functools.lru_cache
     def get_sample() -> Any:

335-344: Data requirement and modification flow looks correct.

The sequence is proper: data requirements are computed and attached, data is preloaded and modified, then statistics are computed. This correctly addresses the earlier concern about computing statistics on modified data.

Consider adding a brief comment before the preload_and_modify_all_data() calls to document their purpose, such as:

     training_data.add_data_requirement(data_requirement)
     if validation_data is not None:
         validation_data.add_data_requirement(data_requirement)
+    # Preload and apply modifiers to all data before computing statistics
     training_data.preload_and_modify_all_data()

This would help future maintainers understand why this step occurs before single_model_stat.


379-390: Multi-task data requirement and modification flow is consistent.

The multi-task path correctly mirrors the single-task logic, applying the same data requirement and modification flow for each model key. The per-model data requirement calculation ensures each task gets the correct data.

As with the single-task path, consider adding a comment before the preload_and_modify_all_data() calls to document the purpose of this step in the data preparation pipeline.

📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e898764 and 18508da.

📒 Files selected for processing (1)
  • deepmd/pt/train/training.py
🧰 Additional context used
🧬 Code graph analysis (1)
deepmd/pt/train/training.py (7)
deepmd/pt/loss/ener_spin.py (1)
  • label_requirement (287-340)
deepmd/pt/loss/tensor.py (1)
  • label_requirement (173-207)
deepmd/pt/loss/ener.py (2)
  • label_requirement (410-485)
  • label_requirement (602-615)
deepmd/pd/train/training.py (2)
  • get_additional_data_requirement (1222-1246)
  • single_model_stat (209-239)
deepmd/pt/utils/dataset.py (2)
  • add_data_requirement (55-69)
  • preload_and_modify_all_data (71-72)
deepmd/pt/utils/dataloader.py (2)
  • add_data_requirement (215-218)
  • preload_and_modify_all_data (241-243)
deepmd/utils/finetune.py (1)
  • get_has_new_type (43-45)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (28)
🔇 Additional comments (1)
deepmd/pt/train/training.py (1)

1365-1398: LGTM: Additional data requirement extraction is well-implemented.

The function correctly extracts additional data requirements (fparam, aparam, spin) from the model, properly handling optional parameters with defaults and required parameters. The defensive check for callable has_spin is good practice.

Contributor

@coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
deepmd/pt/modifier/base_modifier.py (2)

117-119: Simplify unnecessary tiling operation.

Since nframes is hardcoded to 1, the tiling and reshaping on line 119 is redundant. The operation np.tile(data["atype"], 1).reshape(1, -1) is equivalent to data["atype"].reshape(1, -1).

🔎 Proposed simplification
 nframes = 1
 natoms = len(data["atype"])
-atom_types = np.tile(data["atype"], nframes).reshape(nframes, -1)
+atom_types = data["atype"].reshape(nframes, -1)

147-147: Consider documenting why do_atomic_virial is hardcoded to False.

The parameter is always set to False when calling forward(). If this is intentional (e.g., atomic virial modifications are not supported in the current implementation), consider adding a comment explaining this limitation.

📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 18508da and e4d5ce2.

📒 Files selected for processing (2)
  • deepmd/pt/modifier/base_modifier.py
  • deepmd/pt/train/training.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2024-10-16T21:50:10.680Z
Learnt from: njzjz
Repo: deepmodeling/deepmd-kit PR: 4226
File: deepmd/dpmodel/model/make_model.py:370-373
Timestamp: 2024-10-16T21:50:10.680Z
Learning: In `deepmd/dpmodel/model/make_model.py`, the variable `nall` assigned but not used is intentional and should not be flagged in future reviews.

Applied to files:

  • deepmd/pt/modifier/base_modifier.py
🧬 Code graph analysis (1)
deepmd/pt/train/training.py (7)
deepmd/pt/loss/dos.py (1)
  • label_requirement (243-266)
deepmd/pt/loss/ener_spin.py (1)
  • label_requirement (287-340)
deepmd/pt/loss/property.py (1)
  • label_requirement (209-221)
deepmd/pt/loss/tensor.py (1)
  • label_requirement (173-207)
deepmd/pd/train/training.py (2)
  • get_additional_data_requirement (1222-1246)
  • single_model_stat (209-239)
deepmd/pt/utils/dataloader.py (2)
  • add_data_requirement (215-218)
  • preload_and_modify_all_data (241-243)
deepmd/utils/data.py (1)
  • preload_and_modify_all_data (499-514)
🪛 Ruff (0.14.10)
deepmd/pt/modifier/base_modifier.py

  • 88-88: Unused method argument: data_sys (ARG002)
  • 161-164: Avoid specifying long messages outside the exception class (TRY003)
  • 170-173: Avoid specifying long messages outside the exception class (TRY003)
  • 179-182: Avoid specifying long messages outside the exception class (TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (28)
🔇 Additional comments (9)
deepmd/pt/modifier/base_modifier.py (6)

31-35: LGTM!

The initialization is clean and sets appropriate defaults for the base modifier.


37-50: LGTM!

Serialization correctly includes class metadata and version information.


52-72: LGTM!

The deserialization correctly removes metadata keys before passing to the constructor. This pattern works well for the base class and provides a foundation for subclasses.


74-85: LGTM!

The abstract forward method is properly declared and provides clear interface expectations for subclasses.


88-88: Unused parameter data_sys is likely intentional.

Static analysis flags data_sys as unused. This is acceptable for a base class method signature that subclasses might override and use, or for future extensibility.


159-185: LGTM!

The error checking for missing keys in modifier_data is comprehensive and provides clear diagnostic messages. The reshape operations ensure shape compatibility.

deepmd/pt/train/training.py (3)

335-344: LGTM! Data modification flow is correctly ordered.

The data requirements are properly constructed from loss labels and additional model requirements, then applied to both training and validation datasets. Calling preload_and_modify_all_data() before single_model_stat() ensures that statistics are computed on modified data, which is the correct behavior for the modifier plugin.


379-390: LGTM! Multi-task data modification flow mirrors single-task correctly.

The multi-task path properly handles per-model data requirements and modification, maintaining the correct ordering (requirements → preload/modify → statistics → data loader) for each model branch.


1365-1398: LGTM! Well-implemented helper function.

The get_additional_data_requirement function properly aggregates optional data requirements (fparam, aparam, spin). The fparam handling is particularly well done, supporting optional defaults when has_default_fparam() returns True, which is more sophisticated than the equivalent PaddlePaddle implementation.

# The prefix sum stores the range of indices contained in each directory, which is needed by get_item method
self.prefix_sum = np.cumsum(frames_list).tolist()

self.apply_modifier_at_load = True
Collaborator

The cache may not work for large datasets due to memory limitations. I would suggest providing users with an option to control this feature.

Contributor Author

I have made the cache flag user-configurable. However, there is still an issue with CUDA re-initialization when the modifier involves torch models. For now, I skip the UT for this case.

Contributor Author

The CUDA re-init issue has been fixed.

…to save the data modification before training or to perform modification on-the-fly.
Contributor

@coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
deepmd/utils/data.py (1)

490-496: Consider returning the cached copy to avoid an extra deep copy.

The current pattern applies the modifier to frame_data, then deep copies it into the cache, and returns the original frame_data. This means non-cached frames are returned directly (mutable), while subsequent calls return cached copies. For consistency and to avoid the extra deep copy operation:

🔎 Proposed optimization
         if self.modifier is not None:
             # Apply modifier if it exists
             self.modifier.modify_data(frame_data, self)
             if self.use_modifier_cache:
                 # Cache the modified frame to avoid recomputation
-                self._modified_frame_cache[index] = copy.deepcopy(frame_data)
+                cached_frame = copy.deepcopy(frame_data)
+                self._modified_frame_cache[index] = cached_frame
+                return cached_frame
         return frame_data
📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e4d5ce2 and 432be58.

📒 Files selected for processing (4)
  • deepmd/pt/modifier/base_modifier.py
  • deepmd/pt/train/training.py
  • deepmd/utils/data.py
  • source/tests/pt/test_data_modifier.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2024-10-16T21:50:10.680Z
Learnt from: njzjz
Repo: deepmodeling/deepmd-kit PR: 4226
File: deepmd/dpmodel/model/make_model.py:370-373
Timestamp: 2024-10-16T21:50:10.680Z
Learning: In `deepmd/dpmodel/model/make_model.py`, the variable `nall` assigned but not used is intentional and should not be flagged in future reviews.

Applied to files:

  • deepmd/pt/modifier/base_modifier.py
🧬 Code graph analysis (2)
deepmd/pt/modifier/base_modifier.py (1)
deepmd/dpmodel/modifier/base_modifier.py (1)
  • make_base_modifier (17-77)
deepmd/utils/data.py (4)
deepmd/pt/modifier/base_modifier.py (1)
  • modify_data (90-187)
source/tests/pt/test_data_modifier.py (2)
  • modify_data (123-137)
  • modify_data (165-179)
deepmd/pt/utils/dataset.py (1)
  • preload_and_modify_all_data (71-72)
deepmd/pt/utils/dataloader.py (1)
  • preload_and_modify_all_data (241-243)
🪛 Ruff (0.14.10)
deepmd/pt/modifier/base_modifier.py

  • 90-90: Unused method argument: data_sys (ARG002)
  • 163-166: Avoid specifying long messages outside the exception class (TRY003)
  • 172-175: Avoid specifying long messages outside the exception class (TRY003)
  • 181-184: Avoid specifying long messages outside the exception class (TRY003)

source/tests/pt/test_data_modifier.py

  • 96-96: Unused static method arguments: args, kwargs (ARG004)
  • 113-118: Unused method arguments: coord, atype, box, fparam, aparam, do_atomic_virial (ARG002)
  • 123-123: Unused method argument: data_sys (ARG002)
  • 142-142: Unused static method arguments: args, kwargs (ARG004)
  • 155-160: Unused method arguments: coord, atype, box, fparam, aparam, do_atomic_virial (ARG002)
  • 165-165: Unused method argument: data_sys (ARG002)
  • 184-184: Unused static method arguments: args, kwargs (ARG004)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (28)
🔇 Additional comments (23)
deepmd/pt/modifier/base_modifier.py (7)

30-38: LGTM! Clean base class initialization.

The class properly inherits from both torch.nn.Module and the dynamically created base modifier. The use_cache parameter is correctly stored for controlling modification caching behavior.


39-52: LGTM! Serialization provides essential metadata.

The serialize method correctly includes class, type, and version metadata needed for deserialization and plugin registry lookup.


54-74: LGTM! Deserialization properly handles metadata removal.

The fix from past reviews is correctly implemented - metadata keys are popped before passing to constructor.
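A toy round trip showing the pattern. The exact metadata keys ("@class", "type", "@version") follow DeePMD-kit's usual serialization convention and are assumptions here, as is the ToyModifier class:

class ToyModifier:
    def __init__(self, scale: float = 1.0) -> None:
        self.scale = scale

    def serialize(self) -> dict:
        return {"@class": "Modifier", "type": "toy", "@version": 1,
                "scale": self.scale}

    @classmethod
    def deserialize(cls, data: dict) -> "ToyModifier":
        data = data.copy()
        # Strip metadata so only constructor kwargs remain.
        for key in ("@class", "type", "@version"):
            data.pop(key, None)
        return cls(**data)

assert ToyModifier.deserialize(ToyModifier(2.0).serialize()).scale == 2.0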


76-88: LGTM! Abstract forward method with JIT export.

The abstract method is correctly declared with @abstractmethod and @torch.jit.export decorators. The return type annotation dict[str, torch.Tensor] properly describes the expected output format.
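A sketch of such a declaration; the argument list mirrors the signature discussed in this review, and Optional types keep it TorchScript-friendly. The class name is hypothetical:

from abc import abstractmethod
from typing import Optional

import torch

class ForwardSketch(torch.nn.Module):
    @torch.jit.export
    @abstractmethod
    def forward(
        self,
        coord: torch.Tensor,
        atype: torch.Tensor,
        box: Optional[torch.Tensor] = None,
        fparam: Optional[torch.Tensor] = None,
        aparam: Optional[torch.Tensor] = None,
        do_atomic_virial: bool = False,
    ) -> dict[str, torch.Tensor]:
        """Return the modifier's contributions, e.g. energy/force/virial."""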


89-116: The data_sys parameter is unused but part of the API contract.

The static analysis correctly identifies data_sys as unused (ARG002). However, this parameter is intentionally part of the method signature to provide subclasses access to the DeepmdData instance when they override modify_data. This is acceptable for an extensible API.


117-159: LGTM! Robust input tensor construction.

The implementation correctly:

  • Handles single-frame data with nframes=1
  • Reshapes coordinates and box to expected formats
  • Uses appropriate precision and device settings
  • Handles optional fparam and aparam inputs
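A sketch of that construction. Key names ("coord", "box", "atype", "fparam", "aparam") match the frame dicts used in this PR; the concrete shapes shown ((1, natoms, 3) coordinates, (1, 3, 3) box) are plausible assumptions, not confirmed by the review:

import numpy as np
import torch

def build_single_frame_inputs(data: dict, device: str = "cpu",
                              dtype: torch.dtype = torch.float64):
    nframes = 1  # modify_data always handles one frame at a time
    atype = torch.as_tensor(
        np.asarray(data["atype"]).reshape(nframes, -1), device=device
    )
    coord = torch.as_tensor(
        np.asarray(data["coord"]).reshape(nframes, -1, 3),
        dtype=dtype, device=device,
    )
    box = None
    if data.get("box") is not None:
        box = torch.as_tensor(
            np.asarray(data["box"]).reshape(nframes, 3, 3),
            dtype=dtype, device=device,
        )
    # fparam/aparam are optional extras; pass them through only when present.
    fparam = data.get("fparam")
    aparam = data.get("aparam")
    return coord, atype, box, fparam, aparam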

161-187: LGTM! Proper validation of modifier outputs.

The implementation correctly validates that required outputs are present before applying modifications. The KeyError messages are descriptive and helpful for debugging, which justifies the slightly long exception messages flagged by static analysis.

source/tests/pt/test_data_modifier.py (9)

64-91: LGTM! Plugin registration for test modifiers.

The modifier argument definitions are properly registered with the plugin system and include appropriate documentation strings. The use_cache parameter is consistently available across all modifier types.


94-138: LGTM! ModifierRandomTester implementation is correct.

The test modifier properly:

  • Uses a seeded RNG for deterministic behavior
  • Implements the abstract forward method (empty return is valid for test purposes since modify_data is overridden)
  • Correctly scales energy, force, and virial in modify_data

The unused forward parameters are acceptable since this test modifier overrides modify_data directly.


140-180: LGTM! ModifierZeroTester implementation is correct.

The zeroing logic using data["energy"] -= data["energy"] is a valid way to zero out values while preserving the array structure and dtype.


182-223: LGTM! ModifierScalingTester demonstrates model-based modification.

This test modifier shows the intended use case of applying a frozen model's predictions as data modifications. The implementation correctly:

  • Loads a JIT-scripted model
  • Calls the model with proper arguments
  • Scales the outputs by sfactor

225-273: LGTM! Test setup and basic modification test.

The parameterized test setup is well-structured. test_init_modify_data correctly verifies that the zero_tester produces all-zero outputs.


275-312: LGTM! Test verifies modification is applied only once.

test_full_modify_data correctly validates that data modification is idempotent - comparing data before and after trainer.run() ensures the modifier is not applied multiple times.


314-381: LGTM! Inference test validates end-to-end modifier integration.

The test correctly verifies that output_model - sfactor * output_modifier equals the reference model output, confirming proper modifier integration during inference.


383-398: LGTM! Robust tearDown with error handling.

The cleanup properly handles potential failures during file removal, ensuring all cleanup attempts are made even if individual deletions fail.


400-499: LGTM! Well-designed helper methods for test data management.

The helper methods get_dataset_nframes, get_sampled_data, and check_sampled_data are well-structured and provide clear abstractions for test data handling.

deepmd/utils/data.py (4)

5-5: LGTM! Correct import for deep copying.

The copy module import is properly added to support the copy.deepcopy() call for cache immutability.


143-148: LGTM! Conditional cache initialization.

The cache is only initialized when a modifier is present, avoiding unnecessary memory allocation for non-modifier use cases. The use_modifier_cache flag correctly respects the modifier's preference when available.


388-395: LGTM! Efficient cache lookup with early return.

The cache check correctly requires both use_modifier_cache being true and the modifier being present before attempting the cache lookup.


498-513: LGTM! Preload method with progress logging.

The method properly:

  • Short-circuits when caching is disabled or no modifier exists
  • Iterates through all frames and populates the cache
  • Provides periodic progress updates every 100 frames
deepmd/pt/train/training.py (3)

335-353: LGTM! Correct ordering: requirements → preload → statistics.

The implementation correctly:

  1. Adds data requirements to training/validation data
  2. Calls preload_and_modify_all_data() to apply modifiers
  3. Then computes statistics via single_model_stat

This ensures model statistics are computed on the modified data, which is the intended behavior.


379-401: LGTM! Multi-task path mirrors single-task logic.

The per-model-key handling correctly applies the same pattern: add requirements, preload and modify, then compute statistics. The validation data check (if validation_data[model_key] is not None) properly handles cases where validation data may not exist for all model keys.


244-267: LGTM! Simplified single_model_stat signature.

The function signature has been correctly simplified by removing _validation_data and _data_requirement parameters, as data requirements are now attached before this function is called and validation data is not needed for computing statistics.

Contributor

@coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
deepmd/utils/data.py (1)

501-513: Exceptions from modifier will be silently swallowed.

The executor.submit() returns a Future, but without calling .result() on it, any exception raised by modify_data is silently lost. While the with block will wait for completion, it won't propagate exceptions.

This could lead to data being cached/returned in an inconsistent state if the modifier fails.

🔎 Proposed fix to capture and propagate exceptions
         if self.modifier is not None:
             with ThreadPoolExecutor(max_workers=num_worker) as executor:
                 # Apply modifier if it exists
-                executor.submit(
+                future = executor.submit(
                     self.modifier.modify_data,
                     frame_data,
                     self,
                 )
+            # Wait for completion and propagate any exceptions
+            future.result()
             if self.use_modifier_cache:
                 # Cache the modified frame to avoid recomputation
                 self._modified_frame_cache[index] = copy.deepcopy(frame_data)
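A minimal, runnable demonstration of the failure mode and the fix:

from concurrent.futures import ThreadPoolExecutor

def failing_modifier() -> None:
    raise RuntimeError("modifier failed")

with ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(failing_modifier)
# The with-block waited for the task, yet the exception is still trapped
# inside the Future at this point.
try:
    future.result()  # only now does the RuntimeError surface
except RuntimeError as err:
    print(f"propagated: {err}")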
🧹 Nitpick comments (1)
deepmd/utils/data.py (1)

515-530: Consider batching ThreadPoolExecutor creation for preload efficiency.

Currently, each frame creates a new ThreadPoolExecutor in get_single_frame. For large datasets, this overhead adds up. Consider creating the executor once in this method and passing it down, or processing frames in parallel batches.

However, per the PR description, the ThreadPoolExecutor is specifically used to avoid CUDA re-initialization, so the current approach may be intentional for isolation.

📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 432be58 and 20b8739.

📒 Files selected for processing (6)
  • deepmd/pd/utils/dataset.py
  • deepmd/pt/train/training.py
  • deepmd/pt/utils/dataloader.py
  • deepmd/pt/utils/dataset.py
  • deepmd/utils/data.py
  • source/tests/pt/test_data_modifier.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • deepmd/pt/utils/dataloader.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2024-10-08T15:32:11.479Z
Learnt from: njzjz
Repo: deepmodeling/deepmd-kit PR: 4144
File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
Timestamp: 2024-10-08T15:32:11.479Z
Learning: Refactoring between test classes `TestInferDeepPotDpaPt` and `TestInferDeepPotDpaPtNopbc` is addressed in PR #3905.

Applied to files:

  • source/tests/pt/test_data_modifier.py
🧬 Code graph analysis (4)
deepmd/pt/train/training.py (2)
deepmd/pd/train/training.py (1)
  • get_additional_data_requirement (1222-1246)
deepmd/pt/utils/dataset.py (2)
  • add_data_requirement (58-72)
  • preload_and_modify_all_data_torch (74-75)
deepmd/pt/utils/dataset.py (4)
deepmd/pt/modifier/base_modifier.py (1)
  • BaseModifier (30-187)
deepmd/utils/data.py (4)
  • DataRequirementItem (1080-1160)
  • DeepmdData (35-1077)
  • get_item_torch (256-270)
  • preload_and_modify_all_data_torch (515-530)
deepmd/pd/utils/dataset.py (1)
  • DeepmdDataSetForLoader (17-56)
deepmd/pt/utils/dataloader.py (1)
  • preload_and_modify_all_data_torch (241-243)
deepmd/pd/utils/dataset.py (1)
deepmd/utils/data.py (1)
  • get_item_paddle (272-287)
deepmd/utils/data.py (4)
source/tests/pt/test_data_modifier.py (2)
  • modify_data (123-137)
  • modify_data (165-179)
deepmd/pt/modifier/base_modifier.py (1)
  • modify_data (90-187)
deepmd/pt/utils/dataloader.py (1)
  • preload_and_modify_all_data_torch (241-243)
deepmd/pt/utils/dataset.py (1)
  • preload_and_modify_all_data_torch (74-75)
🪛 Ruff (0.14.10)
source/tests/pt/test_data_modifier.py

(same ARG002/ARG004 unused-argument findings as listed in the earlier review)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (28)
🔇 Additional comments (16)
deepmd/pd/utils/dataset.py (1)

8-10: Review NUM_WORKERS default value logic.

NUM_WORKERS is properly defined and exported in deepmd/pd/utils/env.py at line 28. However, the default value calculation appears problematic: min(0, ncpus) will always evaluate to 0 regardless of the number of CPUs available. This should likely be max(0, ncpus) to meaningfully default to the number of CPUs, unless defaulting to 0 as a sentinel value (treated as 1 by the dataloader) is intentional.
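A two-line illustration of the flagged expression:

ncpus = 8
assert min(0, ncpus) == 0  # current default: always 0, whatever ncpus is
assert max(0, ncpus) == 8  # the likely intent: scale with available CPUs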

deepmd/pt/train/training.py (2)

334-360: LGTM - Data modification correctly applied before statistics computation.

The data modification workflow is properly sequenced:

  1. Build data_requirement from loss labels and model requirements
  2. Add requirements to training/validation data
  3. Preload and apply modifiers via preload_and_modify_all_data_torch()
  4. Then compute statistics via single_model_stat()

This addresses the previous Copilot comment about ordering - statistics are now computed on modified data.


378-401: Multi-task data modification follows consistent pattern.

The multi-task path correctly mirrors the single-task implementation, applying data modification per model key before computing statistics.

source/tests/pt/test_data_modifier.py (7)

64-91: Plugin registration functions are well-documented.

The registration functions properly document parameters with clear descriptions. The use_cache parameter aligns with the broader modifier framework.


94-138: Test modifier implementation is appropriate.

The unused parameters flagged by static analysis (args, kwargs in __new__, and coord, atype, etc. in forward) are required for interface compliance with BaseModifier. The forward method returning an empty dict is intentional since this modifier only operates via modify_data for testing the data pipeline.


140-179: Zero tester modifier correctly implements data zeroing.

The implementation properly subtracts values from themselves to zero out data, which is a valid approach for testing that modifications are applied.
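The in-place zeroing trick in a runnable form:

import numpy as np

e = np.array([1.5, -2.0])
e -= e  # in-place zeroing; shape and dtype are preserved
assert e.dtype == np.float64 and np.all(e == 0.0)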


182-222: Scaling tester provides realistic model-based modification.

This modifier appropriately loads a frozen model and applies scaled predictions, testing the integration of model inference within the modifier framework.


253-312: Test coverage appropriately validates modifier behavior.

The tests verify:

  1. test_init_modify_data: Modifications are applied during trainer initialization
  2. test_full_modify_data: Modifications are applied exactly once (comparing before/after training)

Both use parameterized batch sizes and cache settings for thorough coverage.


314-381: Inference test validates end-to-end modifier integration.

The test properly validates the mathematical relationship: output_model = output_ref - sfactor * output_modifier, confirming that modifier predictions are correctly applied during inference.


383-398: Robust tearDown implementation.

The try-except block ensures cleanup continues even if individual file removals fail, addressing the past review suggestion.

deepmd/pt/utils/dataset.py (3)

24-47: Clean modifier integration into dataset wrapper.

The modifier is properly propagated to the underlying DeepmdData system, maintaining a clean delegation pattern.


52-56: Correct NUM_WORKERS propagation.

The NUM_WORKERS parameter is now properly passed to get_item_torch, enabling parallel data modification as designed.


73-75: Preload method delegates correctly.

Simple delegation to the underlying data system's preload method, maintaining the wrapper pattern.

deepmd/utils/data.py (3)

143-148: Cache initialization correctly handles optional modifier settings.

The code properly checks for use_cache attribute on the modifier and initializes the frame cache dictionary when a modifier is present.


256-287: Consistent interface updates for parallel data modification.

Both get_item_torch and get_item_paddle properly pass the num_worker parameter to get_single_frame, maintaining API consistency.


398-406: Efficient cache lookup with early return.

The cache check occurs before any data loading, avoiding unnecessary I/O when modified frames are already cached.

@ChiahsinChu force-pushed the devel-modifier-plugin branch from 20b8739 to 23eb6a5 on December 25, 2025 at 09:55
Contributor

@coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (5)
source/tests/pt/test_data_modifier.py (4)

65-83: Minor inconsistency in return type annotations.

modifier_random_tester() and modifier_zero_tester() use list while modifier_scaling_tester() uses list[Argument]. Consider using the more specific type hint consistently.

🔎 Proposed fix
 @modifier_args_plugin.register("random_tester", doc=doc_random_tester)
-def modifier_random_tester() -> list:
+def modifier_random_tester() -> list[Argument]:
     doc_seed = "Random seed used to initialize the random number generator for deterministic scaling factors."
     doc_use_cache = "Whether to cache modified frames to improve performance by avoiding recomputation."
     return [
         Argument("seed", int, optional=True, doc=doc_seed),
         Argument("use_cache", bool, optional=True, doc=doc_use_cache),
     ]


 @modifier_args_plugin.register("zero_tester", doc=doc_zero_tester)
-def modifier_zero_tester() -> list:
+def modifier_zero_tester() -> list[Argument]:
     doc_use_cache = "Whether to cache modified frames to improve performance by avoiding recomputation."
     return [
         Argument("use_cache", bool, optional=True, doc=doc_use_cache),
     ]

174-179: Consider clearer zeroing syntax.

Using data["energy"] -= data["energy"] works but data["energy"] *= 0 or explicit zero assignment would be more readable. The current approach does preserve shape/dtype which may be intentional.

🔎 Proposed alternative
         if "find_energy" in data and data["find_energy"] == 1.0:
-            data["energy"] -= data["energy"]
+            data["energy"] *= 0
         if "find_force" in data and data["find_force"] == 1.0:
-            data["force"] -= data["force"]
+            data["force"] *= 0
         if "find_virial" in data and data["find_virial"] == 1.0:
-            data["virial"] -= data["virial"]
+            data["virial"] *= 0

378-381: Consider specifying tolerance for floating-point comparison.

The np.testing.assert_allclose uses default rtol=1e-7 which may be too strict for GPU/floating-point arithmetic after model inference and scaling operations. Consider specifying explicit tolerances.

🔎 Proposed fix
         for ii in range(3):
             np.testing.assert_allclose(
-                model_pred[ii], model_pred_ref[ii] - sfactor * modifier_pred[ii]
+                model_pred[ii], model_pred_ref[ii] - sfactor * modifier_pred[ii],
+                rtol=1e-5, atol=1e-8
             )

489-498: Potential KeyError if test_data is missing system or frame IDs.

The method assumes test_data contains the same system IDs and frame IDs as ref_data. If data collection is incomplete, test_data[sid] or test_sys[fid] could raise KeyError. Consider adding defensive checks or a clearer assertion message.

🔎 Proposed fix
         for sid in ref_data.keys():
+            assert sid in test_data, f"System ID {sid} missing from test data"
             ref_sys = ref_data[sid]
             test_sys = test_data[sid]
             for fid in ref_sys.keys():
+                assert fid in test_sys, f"Frame ID {fid} missing from test data for system {sid}"
                 # compare common elements
                 try:
                     ref_label = to_numpy_array(ref_sys[fid][label_kw])
                     test_label = to_numpy_array(test_sys[fid][label_kw])
                 except KeyError:
                     continue
                 np.testing.assert_allclose(ref_label, test_label)
deepmd/utils/data.py (1)

502-509: Consider simplifying ThreadPoolExecutor usage for single-task execution.

The code creates a thread pool with max_workers=num_worker but submits only a single task. While the with statement correctly waits for task completion via shutdown(wait=True), consider whether a simpler threading approach would be clearer:

if self.modifier is not None:
    import threading
    t = threading.Thread(target=self.modifier.modify_data, args=(frame_data, self))
    t.start()
    t.join()
    if self.use_modifier_cache:
        self._modified_frame_cache[index] = copy.deepcopy(frame_data)

This more directly expresses the intent: run the modifier in a separate thread to avoid CUDA re-initialization issues.

📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 20b8739 and 23eb6a5.

📒 Files selected for processing (6)
  • deepmd/pd/utils/dataset.py
  • deepmd/pt/train/training.py
  • deepmd/pt/utils/dataloader.py
  • deepmd/pt/utils/dataset.py
  • deepmd/utils/data.py
  • source/tests/pt/test_data_modifier.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • deepmd/pt/utils/dataloader.py
  • deepmd/pd/utils/dataset.py
🧰 Additional context used
🧬 Code graph analysis (3)
deepmd/pt/train/training.py (5)
deepmd/pt/loss/property.py (1)
  • label_requirement (209-221)
deepmd/pt/loss/tensor.py (1)
  • label_requirement (173-207)
deepmd/pd/train/training.py (1)
  • get_additional_data_requirement (1222-1246)
deepmd/pt/utils/dataloader.py (2)
  • add_data_requirement (215-218)
  • preload_and_modify_all_data_torch (241-243)
deepmd/pt/utils/dataset.py (2)
  • add_data_requirement (58-72)
  • preload_and_modify_all_data_torch (74-75)
deepmd/pt/utils/dataset.py (3)
deepmd/pt/modifier/base_modifier.py (1)
  • BaseModifier (30-187)
deepmd/utils/data.py (2)
  • DataRequirementItem (1080-1160)
  • DeepmdData (35-1077)
deepmd/pd/utils/dataset.py (1)
  • DeepmdDataSetForLoader (17-56)
deepmd/utils/data.py (5)
data/raw/copy_raw.py (1)
  • copy (11-71)
source/tests/pt/test_data_modifier.py (2)
  • modify_data (123-137)
  • modify_data (165-179)
deepmd/pt/modifier/base_modifier.py (1)
  • modify_data (90-187)
deepmd/pt/utils/dataloader.py (1)
  • preload_and_modify_all_data_torch (241-243)
deepmd/pt/utils/dataset.py (1)
  • preload_and_modify_all_data_torch (74-75)
🪛 Ruff (0.14.10)
source/tests/pt/test_data_modifier.py

(same ARG002/ARG004 unused-argument findings as listed in the earlier review)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
🔇 Additional comments (15)
source/tests/pt/test_data_modifier.py (3)

1-57: LGTM!

The module docstring is comprehensive and imports are well-organized. The license header and documentation clearly explain the test scope.


94-138: LGTM!

The test modifier implementation is correct. The unused arguments in __new__, forward(), and modify_data() are expected since this is a test stub implementing the BaseModifier interface. The deterministic random scaling logic with seeded RNG is appropriate for reproducible tests.


182-222: LGTM!

The ModifierScalingTester correctly loads a frozen model and applies scaling to predictions. The tuple handling at lines 218-219 properly handles different model return formats.

deepmd/pt/utils/dataset.py (4)

12-17: LGTM!

The imports are appropriate for the modifier functionality being added.


25-47: LGTM!

The modifier parameter is properly threaded through to the underlying DeepmdData instance, and _natoms_vec initialization is consistent with the Paddle backend implementation.


54-54: LGTM!

Passing NUM_WORKERS to get_item_torch enables parallel modifier execution during data loading.


74-75: LGTM!

The preload method provides a clean interface for eager modifier application and caching.

deepmd/pt/train/training.py (3)

335-344: LGTM!

The data requirements are properly assembled from loss and model requirements, and preloading with modifier application occurs before computing statistics—ensuring stats reflect the modified data.


379-390: LGTM!

The multi-task path correctly mirrors the single-task logic, building per-model data requirements and preloading modified data before computing statistics for each model.


1365-1398: LGTM!

The helper function correctly builds additional data requirements for fparam, aparam, and spin based on model configuration, with appropriate must/default settings.

deepmd/utils/data.py (5)

5-5: LGTM!

The copy module import is necessary for the deepcopy operation used in frame caching.


143-148: LGTM!

Cache initialization correctly defaults to enabled and respects the modifier's use_cache preference when available.


256-287: LGTM!

The num_worker parameter is consistently added to both PyTorch and Paddle item accessors, with proper documentation.


510-512: LGTM!

The caching logic correctly uses deepcopy to ensure cached frames are immutable, addressing the concern raised in past reviews about shallow copy issues.


515-530: LGTM!

The preload method efficiently populates the cache by delegating to get_single_frame, with helpful progress logging for large datasets.

Contributor

@coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
deepmd/utils/data.py (2)

502-514: Consider optimizing ThreadPoolExecutor usage.

The code creates a ThreadPoolExecutor with max_workers=num_worker but submits only a single task. While this works (and reportedly fixes CUDA re-initialization issues), most worker threads remain idle.

If the goal is thread isolation, consider using max_workers=1. If parallel execution is intended for future enhancements, document the design intent.

🔎 Potential optimization
-        if self.modifier is not None:
-            with ThreadPoolExecutor(max_workers=num_worker) as executor:
-                # Apply modifier if it exists
-                future = executor.submit(
-                    self.modifier.modify_data,
-                    frame_data,
-                    self,
-                )
-            # Wait for completion and propagate any exceptions
-            future.result()
+        if self.modifier is not None:
+            with ThreadPoolExecutor(max_workers=1) as executor:
+                future = executor.submit(
+                    self.modifier.modify_data,
+                    frame_data,
+                    self,
+                )
+                # Wait for completion and propagate any exceptions
+                future.result()

517-533: Consider parallelizing preload for large datasets.

The method sequentially processes frames, which may be slow for large datasets. If thread safety permits, consider using ThreadPoolExecutor or multiprocessing to preload multiple frames in parallel.

from concurrent.futures import ThreadPoolExecutor, as_completed

def preload_and_modify_all_data_torch(self, num_worker: int) -> None:
    """Preload all frames and apply modifier to cache them.

    This method is useful when use_modifier_cache is True and you want to
    avoid applying the modifier repeatedly during training.
    """
    if not self.use_modifier_cache or self.modifier is None:
        return

    log.info("Preloading and modifying all data frames...")

    # Option: parallelize frame loading
    with ThreadPoolExecutor(max_workers=num_worker) as executor:
        futures = [
            executor.submit(self.get_single_frame, i, 1)
            for i in range(self.nframes)
            if i not in self._modified_frame_cache
        ]
        for i, future in enumerate(as_completed(futures)):
            future.result()  # propagate exceptions
            if (i + 1) % 100 == 0:
                log.info(f"Processed {i + 1}/{len(futures)} frames")

    log.info("All frames preloaded and modified.")

Note: Only implement if thread safety of the modifier and caching mechanism is verified.

📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 23eb6a5 and eba1d24.

📒 Files selected for processing (2)
  • deepmd/utils/data.py
  • source/tests/pt/test_data_modifier.py
🧰 Additional context used
🧬 Code graph analysis (1)
deepmd/utils/data.py (3)
deepmd/pt/modifier/base_modifier.py (1)
  • modify_data (90-187)
deepmd/pt/utils/dataset.py (1)
  • preload_and_modify_all_data_torch (74-75)
deepmd/pt/utils/dataloader.py (1)
  • preload_and_modify_all_data_torch (241-243)
🪛 Ruff (0.14.10)
source/tests/pt/test_data_modifier.py

(same ARG002/ARG004 unused-argument findings as listed in the earlier review)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
🔇 Additional comments (10)
deepmd/utils/data.py (3)

143-148: LGTM!

The cache initialization logic correctly defaults to True and respects the modifier's use_cache attribute when available. The cache dictionary is initialized only when a modifier is present, which is appropriate.


256-287: LGTM!

The method signature changes to accept num_worker are consistent across both get_item_torch and get_item_paddle. Delegating to a centralized get_single_frame method improves maintainability.


401-406: LGTM!

The early cache check correctly returns cached modified frames when use_modifier_cache is enabled, avoiding redundant modifier computations.

source/tests/pt/test_data_modifier.py (7)

64-91: LGTM!

The modifier argument registration functions are correctly implemented with appropriate documentation. The arguments align with the corresponding modifier class constructors.


94-138: LGTM!

The ModifierRandomTester implementation is correct for testing purposes. The deterministic random scaling using a seeded RNG ensures reproducible tests.

Note: Static analysis warnings about unused arguments are false positives—these parameters are required by the base class interface.


140-179: LGTM!

The ModifierZeroTester correctly zeros out energy, force, and virial data for testing. The data[key] -= data[key] pattern is a clean way to zero arrays in-place.

Note: Static analysis warnings about unused arguments are false positives—these parameters are required by the base class interface.


182-222: LGTM!

The ModifierScalingTester correctly implements model-based data modification using a frozen model. The implementation properly handles potential tuple returns from the model and applies scaling to the predictions.


231-251: LGTM!

The test setup correctly configures single-step training for fast test execution. The tearDown method appropriately handles cleanup failures with try-except blocks.

Note: File cleanup operates on the current directory, which is acceptable for unit tests but assumes tests run sequentially in isolated environments.

Also applies to: 386-401


253-313: LGTM!

Both test methods correctly verify the modifier behavior:

  • test_init_modify_data ensures modification is applied during initialization
  • test_full_modify_data ensures modification is applied only once (idempotency)

The parameterized testing across different batch sizes and cache settings provides good coverage.


314-384: LGTM!

The test_inference method provides comprehensive validation of modifier integration during inference, correctly verifying that output_model ≈ output_ref - sfactor * output_modifier.

The helper methods are well-designed:

  • get_dataset_nframes: Simple frame counting
  • get_sampled_data: Robust data collection with duplicate handling
  • check_sampled_data: Clean comparison logic with appropriate error handling

Also applies to: 403-502

Development

Successfully merging this pull request may close these issues.

[Feature Request] Data modifier in pytorch

4 participants