feat: Add MLflow artifact upload for traces and logs #440
Conversation
- Add mlflow_artifacts.py with functions to collect and upload trace/log files
- Add upload_mlflow_artifacts() wrapper in global_vars.py
- Integrate artifact upload in trainer.py before MLflow run ends
- Add mlflow_upload_traces and mlflow_upload_logs config options
- Add unique timestamp-based output directories for multi-node consistency
- Pass MLflow environment variables through Docker container
Pull request overview
This PR adds functionality to automatically upload PyTorch profiler trace files and training log files to MLflow as artifacts when MLflow tracking is enabled. The implementation introduces a new module for artifact collection and upload, integrates it into the training lifecycle, and updates example scripts to support consistent output directories across multi-node training runs.
Key changes:
- New artifact upload module with functions to collect and upload trace/log files to MLflow
- Integration of artifact uploads before MLflow run completion in the trainer
- Configuration options to control trace and log uploads (defaulting to enabled)
- Shell script improvements for timestamp-based output directories with multi-node consistency
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 16 comments.
Show a summary per file
| File | Description |
|---|---|
| primus/backends/megatron/training/mlflow_artifacts.py | New module implementing trace/log file discovery and MLflow artifact upload functionality |
| primus/backends/megatron/training/global_vars.py | Adds global variable for exp_root_path and wrapper function for artifact uploads |
| primus/modules/trainer/megatron/trainer.py | Integrates artifact upload calls before MLflow run termination in two exit paths |
| primus/configs/modules/megatron/primus_megatron_module.yaml | Adds mlflow_upload_traces and mlflow_upload_logs config options (both default to true) |
| examples/run_slurm_pretrain.sh | Implements timestamp-based output directory naming and exports timestamp for multi-node consistency |
| examples/run_pretrain.sh | Adds conditional timestamp generation to support both single-node and multi-node scenarios, fixes typo in log message |
| examples/run_local_pretrain.sh | Adds MLflow environment variables and Primus path variables to Docker container environment |
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Force-pushed from 3c149be to 13dfa81.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
The experiment name contains square brackets like [deepseek_v2_lite-pretrain_...]-rank[0] which are interpreted as glob pattern character classes, causing glob.glob to return empty results even though files exist. Fixed by using glob.escape() on directory paths before using them with glob.glob().
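The glob pitfall described above is easy to reproduce in isolation. This minimal sketch (using a hypothetical directory name, not the PR's actual paths) shows why an unescaped `[0]` makes `glob.glob` miss files that exist, and how `glob.escape()` fixes it:

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Directory name contains square brackets, like "...-rank[0]" in the PR.
    trace_dir = os.path.join(root, "exp-rank[0]")
    os.makedirs(trace_dir)
    open(os.path.join(trace_dir, "step1.pt.trace.json"), "w").close()

    # Unescaped: "[0]" is a character class matching the single char "0",
    # so the pattern effectively looks for "exp-rank0/" and matches nothing.
    unescaped = glob.glob(os.path.join(trace_dir, "*.json"))

    # Escaped: glob.escape() neutralizes the brackets in the directory part,
    # so only the trailing "*" stays special and the file is found.
    escaped = glob.glob(os.path.join(glob.escape(trace_dir), "*.json"))

    print(len(unescaped), len(escaped))  # → 0 1
```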
Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.
```python
        else artifact_path
    )

    mlflow_writer.log_artifact(log_file, artifact_path=artifact_subpath)
```
Copilot (AI), Dec 19, 2025:
The loop has no per-file logging for uploaded log files. Trace files log each upload ("[MLflow] Uploaded trace file: ..."), but log files are uploaded silently, which makes debugging harder when a log file upload fails inside the try-except block. Consider adding similar logging for each log file upload for consistency and better observability.
Suggested change:
```python
mlflow_writer.log_artifact(log_file, artifact_path=artifact_subpath)
log_rank_0(f"[MLflow] Uploaded log file: {log_file}")
```
```python
log_rank_0("[MLflow] Starting artifact upload to MLflow...")
log_rank_0(f"[MLflow] tensorboard_dir: {tensorboard_dir}")
log_rank_0(f"[MLflow] exp_root_path: {exp_root_path}")
log_rank_0(f"[MLflow] upload_traces: {upload_traces}, upload_logs: {upload_logs}")
```
Copilot (AI), Dec 19, 2025:
The log message shows only the count and paths but doesn't indicate which rank is uploading. In a distributed setting, this could be confusing when debugging which rank is actually uploading artifacts. Consider adding rank information to these log messages, especially since only rank world_size - 1 uploads artifacts according to the docstring.
| """ | ||
| if mlflow_writer is None: | ||
| return 0 | ||
|
|
Copilot (AI), Dec 19, 2025:
Missing log message when starting to search for log files. The trace file upload function logs "Searching for trace files in: {path}" but the log file upload function has no equivalent message. This inconsistency makes it harder to debug when log files aren't being found. Consider adding a similar log message before calling _get_all_log_files for consistency.
| log_rank_0(f"[MLflow] Searching for log files in: {exp_root_path}") |
```python
    uploads them to MLflow. In distributed settings, only rank 0 (or the
    last rank where MLflow writer is initialized) should call this.
```
Copilot (AI), Dec 19, 2025:
The docstring mentions "only rank 0 (or the last rank where MLflow writer is initialized)" but according to the code in global_vars.py line 60, MLflow writer is initialized only on rank world_size - 1 (the last rank), not rank 0. The mention of "rank 0" could be misleading. Consider updating the docstring to clearly state that only the last rank (world_size - 1) uploads artifacts, or explain when rank 0 would be used.
Suggested change:
```python
    uploads them to MLflow. In distributed settings, this should be called
    only on the rank where the MLflow writer is initialized (typically the
    last rank, i.e., world_size - 1).
```
```python
- Upload profiler trace files from all profiled ranks (including multi-node)
- Upload log files from all levels and all ranks
- Supports both local and distributed training scenarios
```
Copilot (AI), Dec 19, 2025:
The module docstring claims to "Upload profiler trace files from all profiled ranks (including multi-node)" but the implementation only uploads files visible from the filesystem where the MLflow writer rank (world_size - 1) runs. In multi-node setups with local storage (non-shared filesystem), this will only upload files from the node where the last rank runs, not from all nodes. Consider clarifying the documentation to state that shared storage is required for multi-node artifact collection, or implement a gathering mechanism for artifacts from all nodes.
Suggested change:
```python
- Upload profiler trace files from all profiled ranks that are accessible from the
  filesystem of the MLflow writer rank (e.g., via shared storage in multi-node setups)
- Upload log files from all levels and all ranks that write to a filesystem visible
  to the MLflow writer rank
- Supports both local and distributed training scenarios when a shared or otherwise
  common filesystem is used for artifact directories
```
```python
from primus.modules.module_utils import log_rank_0, warning_rank_0


def _get_all_trace_files(tensorboard_dir: str) -> list:
```
Copilot (AI), Dec 19, 2025:
The return type hint uses generic list instead of the more specific list[str] (Python 3.9+) or List[str] from typing. Consider using more specific type hints to improve type checking and code clarity.
```python
    return unique_files


def _get_all_log_files(exp_root_path: str) -> list:
```
Copilot (AI), Dec 19, 2025:
The return type hint uses generic list instead of the more specific list[str] (Python 3.9+) or List[str] from typing. Consider using more specific type hints to improve type checking and code clarity.
feat: Add MLflow artifact upload for traces and logs
Adds functionality to automatically upload profiler trace files and training log files
to MLflow as artifacts when MLflow tracking is enabled.
Features
- Profiler trace files are uploaded to MLflow under artifacts/traces/
- Training log files are uploaded to MLflow under artifacts/logs/

Config Options
```yaml
mlflow_upload_traces: true  # Upload profiler trace files to MLflow
mlflow_upload_logs: true    # Upload training log files to MLflow
```
Files Changed
- primus/backends/megatron/training/mlflow_artifacts.py - New file with trace/log collection and upload functions
- primus/backends/megatron/training/global_vars.py - Add upload_mlflow_artifacts() wrapper
- primus/modules/trainer/megatron/trainer.py - Integrate artifact upload before MLflow run ends
- primus/configs/modules/megatron/primus_megatron_module.yaml - Add config options
- examples/run_pretrain.sh - Add timestamp-based output directories
- examples/run_slurm_pretrain.sh - Share timestamp across nodes for multi-node runs
- examples/run_local_pretrain.sh - Pass MLflow environment variables to container

Usage
When MLflow is enabled, artifacts are automatically uploaded at the end of training:
- tensorboard_dir → MLflow artifacts/traces/
- exp_root_path/logs/ → MLflow artifacts/logs/
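The mapping above amounts to path arithmetic: each discovered file keeps its sub-directory structure relative to the scanned root, relocated under the MLflow artifact root. This hypothetical helper (the name and signature are illustrative, not part of the PR) computes the `artifact_path` that would be passed to `mlflow_writer.log_artifact`:

```python
import os


def artifact_subpath(file_path: str, local_root: str, artifact_root: str) -> str:
    """Map a local file to its MLflow artifact directory, preserving any
    sub-directory structure under the scanned root.

    Hypothetical helper for illustration; not the PR's actual code.
    """
    rel_dir = os.path.dirname(os.path.relpath(file_path, local_root))
    return os.path.join(artifact_root, rel_dir) if rel_dir else artifact_root


# A trace file directly under tensorboard_dir lands in traces/:
print(artifact_subpath("/out/tb/step1.pt.trace.json", "/out/tb", "traces"))
# → traces

# A log file nested under exp_root_path/logs/ keeps its rank subdirectory:
print(artifact_subpath("/out/logs/rank0/run.log", "/out/logs", "logs"))
# → logs/rank0
```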