feat: Add MLflow artifact upload for traces and logs #440

gphuang · 2025-12-18T09:10:45Z

feat: Add MLflow artifact upload for traces and logs

Adds functionality to automatically upload profiler trace files and training log files
to MLflow as artifacts when MLflow tracking is enabled.

Features

Upload PyTorch profiler trace files to MLflow artifacts/traces/
Upload training log files to MLflow artifacts/logs/
Unique timestamp-based output directories for multi-node consistency
Pass MLflow environment variables through Docker container

Config Options

mlflow_upload_traces: true # Upload profiler trace files to MLflow
mlflow_upload_logs: true # Upload training log files to MLflow

Files Changed

primus/backends/megatron/training/mlflow_artifacts.py - New file with trace/log collection and upload functions
primus/backends/megatron/training/global_vars.py - Add upload_mlflow_artifacts() wrapper
primus/modules/trainer/megatron/trainer.py - Integrate artifact upload before MLflow run ends
primus/configs/modules/megatron/primus_megatron_module.yaml - Add config options
examples/run_pretrain.sh - Add timestamp-based output directories
examples/run_slurm_pretrain.sh - Share timestamp across nodes for multi-node runs
examples/run_local_pretrain.sh - Pass MLflow environment variables to container

Usage

When MLflow is enabled, artifacts are automatically uploaded at the end of training:

Trace files from tensorboard_dir → MLflow artifacts/traces/
Log files from exp_root_path/logs/ → MLflow artifacts/logs/
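
The flow above can be sketched roughly as follows. This is an illustration only, not the code in mlflow_artifacts.py: `collect_files`, `upload_artifacts`, and `RecordingWriter` are hypothetical names, the `*.pt.trace.json` and `*.log` suffixes are assumptions about how the files are named, and the writer is assumed to expose `log_artifact(local_path, artifact_path=...)` in the same way the `mlflow` module does.

```python
import glob
import os
from typing import List, Optional


def collect_files(root: str, pattern: str) -> List[str]:
    """Recursively collect files under root that match pattern, sorted and de-duplicated."""
    if not root or not os.path.isdir(root):
        return []
    # Escape the directory so characters such as '[' are matched literally.
    search = os.path.join(glob.escape(root), "**", pattern)
    return sorted(set(glob.glob(search, recursive=True)))


def upload_artifacts(
    writer,
    tensorboard_dir: Optional[str],
    exp_root_path: Optional[str],
    upload_traces: bool = True,
    upload_logs: bool = True,
) -> int:
    """Upload traces under 'traces' and logs under 'logs'; return the number of files uploaded."""
    if writer is None:  # MLflow is not enabled on this rank
        return 0
    count = 0
    if upload_traces and tensorboard_dir:
        # Assumed trace suffix; the real module may use other patterns.
        for path in collect_files(tensorboard_dir, "*.pt.trace.json"):
            writer.log_artifact(path, artifact_path="traces")
            count += 1
    if upload_logs and exp_root_path:
        # Assumed log suffix; logs are expected under <exp_root_path>/logs.
        for path in collect_files(os.path.join(exp_root_path, "logs"), "*.log"):
            writer.log_artifact(path, artifact_path="logs")
            count += 1
    return count


class RecordingWriter:
    """Hypothetical stand-in for the MLflow writer that only records calls."""

    def __init__(self) -> None:
        self.calls = []

    def log_artifact(self, local_path: str, artifact_path: Optional[str] = None) -> None:
        self.calls.append((local_path, artifact_path))
```

Passing a recording stand-in instead of the real writer makes it possible to check which files would be uploaded without contacting an MLflow server.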

- Add mlflow_artifacts.py with functions to collect and upload trace/log files
- Add upload_mlflow_artifacts() wrapper in global_vars.py
- Integrate artifact upload in trainer.py before MLflow run ends
- Add mlflow_upload_traces and mlflow_upload_logs config options
- Add unique timestamp-based output directories for multi-node consistency
- Pass MLflow environment variables through Docker container

Copilot

Pull request overview

This PR adds functionality to automatically upload PyTorch profiler trace files and training log files to MLflow as artifacts when MLflow tracking is enabled. The implementation introduces a new module for artifact collection and upload, integrates it into the training lifecycle, and updates example scripts to support consistent output directories across multi-node training runs.

Key changes:

New artifact upload module with functions to collect and upload trace/log files to MLflow
Integration of artifact uploads before MLflow run completion in the trainer
Configuration options to control trace and log uploads (defaulting to enabled)
Shell script improvements for timestamp-based output directories with multi-node consistency

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 16 comments.

File	Description
primus/backends/megatron/training/mlflow_artifacts.py	New module implementing trace/log file discovery and MLflow artifact upload functionality
primus/backends/megatron/training/global_vars.py	Adds global variable for exp_root_path and wrapper function for artifact uploads
primus/modules/trainer/megatron/trainer.py	Integrates artifact upload calls before MLflow run termination in two exit paths
primus/configs/modules/megatron/primus_megatron_module.yaml	Adds mlflow_upload_traces and mlflow_upload_logs config options (both default to true)
examples/run_slurm_pretrain.sh	Implements timestamp-based output directory naming and exports timestamp for multi-node consistency
examples/run_pretrain.sh	Adds conditional timestamp generation to support both single-node and multi-node scenarios, fixes typo in log message
examples/run_local_pretrain.sh	Adds MLflow environment variables and Primus path variables to Docker container environment

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Copilot · 2025-12-18T10:20:26Z

@gphuang I've opened a new pull request, #441, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

The experiment name contains square brackets like [deepseek_v2_lite-pretrain_...]-rank[0] which are interpreted as glob pattern character classes, causing glob.glob to return empty results even though files exist. Fixed by using glob.escape() on directory paths before using them with glob.glob().
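
The failure mode and the fix can be reproduced with a small standalone sketch. The directory name and the `find_traces` helper below are hypothetical and only mimic the shape of the experiment name mentioned above; `glob.escape()` and `glob.glob()` are the standard library calls named in the commit message.

```python
import glob
import os
import tempfile
from typing import List


def find_traces(directory: str, escape: bool = True) -> List[str]:
    """Return *.json files directly inside directory, optionally escaping the path first."""
    base = glob.escape(directory) if escape else directory
    return sorted(glob.glob(os.path.join(base, "*.json")))


# Hypothetical experiment directory whose name contains square brackets.
root = tempfile.mkdtemp()
exp_dir = os.path.join(root, "[demo-pretrain]-rank[0]")
os.makedirs(exp_dir)
trace = os.path.join(exp_dir, "trace.json")
open(trace, "w").close()

# Without escaping, "[...]" is parsed as a character class and the directory is not matched.
unescaped = find_traces(exp_dir, escape=False)
# With glob.escape(), the brackets are matched literally and the file is found.
escaped = find_traces(exp_dir, escape=True)
```

Only the directory part is escaped, so the wildcard in the file pattern keeps its meaning.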

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.

Copilot · 2025-12-19T08:37:45Z

primus/backends/megatron/training/mlflow_artifacts.py

+                else artifact_path
+            )
+
+            mlflow_writer.log_artifact(log_file, artifact_path=artifact_subpath)


The logging for uploaded log files is missing in the loop. While trace files log each upload with "Uploaded trace file: {filename}", log files have no per-file logging. This inconsistency makes debugging harder when log file uploads fail silently within the try-except block. Consider adding similar logging for each log file upload for consistency and better observability.

Suggested change

-            mlflow_writer.log_artifact(log_file, artifact_path=artifact_subpath)
+            mlflow_writer.log_artifact(log_file, artifact_path=artifact_subpath)
+            log_rank_0(f"[MLflow] Uploaded log file: {log_file}")

Copilot · 2025-12-19T08:37:46Z

primus/backends/megatron/training/mlflow_artifacts.py

+    log_rank_0("[MLflow] Starting artifact upload to MLflow...")
+    log_rank_0(f"[MLflow] tensorboard_dir: {tensorboard_dir}")
+    log_rank_0(f"[MLflow] exp_root_path: {exp_root_path}")
+    log_rank_0(f"[MLflow] upload_traces: {upload_traces}, upload_logs: {upload_logs}")


The log message shows only the count and paths but doesn't indicate which rank is uploading. In a distributed setting, this could be confusing when debugging which rank is actually uploading artifacts. Consider adding rank information to these log messages, especially since only rank world_size - 1 uploads artifacts according to the docstring.

Copilot · 2025-12-19T08:37:46Z

primus/backends/megatron/training/mlflow_artifacts.py

+    """
+    if mlflow_writer is None:
+        return 0
+


Missing log message when starting to search for log files. The trace file upload function logs "Searching for trace files in: {path}" but the log file upload function has no equivalent message. This inconsistency makes it harder to debug when log files aren't being found. Consider adding a similar log message before calling _get_all_log_files for consistency.

Suggested change

+    log_rank_0(f"[MLflow] Searching for log files in: {exp_root_path}")

Copilot · 2025-12-19T08:37:46Z

primus/backends/megatron/training/mlflow_artifacts.py

+    uploads them to MLflow. In distributed settings, only rank 0 (or the
+    last rank where MLflow writer is initialized) should call this.


The docstring mentions "only rank 0 (or the last rank where MLflow writer is initialized)" but according to the code in global_vars.py line 60, MLflow writer is initialized only on rank world_size - 1 (the last rank), not rank 0. The mention of "rank 0" could be misleading. Consider updating the docstring to clearly state that only the last rank (world_size - 1) uploads artifacts, or explain when rank 0 would be used.

Suggested change

-    uploads them to MLflow. In distributed settings, only rank 0 (or the
-    last rank where MLflow writer is initialized) should call this.
+    uploads them to MLflow. In distributed settings, this should be called
+    only on the rank where the MLflow writer is initialized (typically the
+    last rank, i.e., world_size - 1).
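
A rough sketch of the rank gating this comment refers to, which also shows one way to include the rank in upload log messages. The helper names are hypothetical, and reading `RANK` and `WORLD_SIZE` from the environment is an assumption based on torchrun-style launchers; the PR itself may obtain the rank differently.

```python
import os
from typing import Mapping


def is_mlflow_writer_rank(rank: int, world_size: int) -> bool:
    """True only for the last rank, which is assumed to own the MLflow writer."""
    return rank == world_size - 1


def describe_upload(env: Mapping[str, str] = os.environ) -> str:
    """Build a log line that names the rank, using RANK/WORLD_SIZE from env."""
    rank = int(env.get("RANK", "0"))
    world_size = int(env.get("WORLD_SIZE", "1"))
    if is_mlflow_writer_rank(rank, world_size):
        return f"[MLflow] rank {rank}/{world_size}: uploading artifacts"
    return f"[MLflow] rank {rank}/{world_size}: skipping upload (no MLflow writer)"
```

In a single-process run the only rank is both rank 0 and the last rank, which may be why the original docstring mentioned rank 0.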

Copilot · 2025-12-19T08:37:46Z

primus/backends/megatron/training/mlflow_artifacts.py

+- Upload profiler trace files from all profiled ranks (including multi-node)
+- Upload log files from all levels and all ranks
+- Supports both local and distributed training scenarios


The module docstring claims to "Upload profiler trace files from all profiled ranks (including multi-node)" but the implementation only uploads files visible from the filesystem where the MLflow writer rank (world_size - 1) runs. In multi-node setups with local storage (non-shared filesystem), this will only upload files from the node where the last rank runs, not from all nodes. Consider clarifying the documentation to state that shared storage is required for multi-node artifact collection, or implement a gathering mechanism for artifacts from all nodes.

Suggested change

-- Upload profiler trace files from all profiled ranks (including multi-node)
-- Upload log files from all levels and all ranks
-- Supports both local and distributed training scenarios
+- Upload profiler trace files from all profiled ranks that are accessible from the
+  filesystem of the MLflow writer rank (e.g., via shared storage in multi-node setups)
+- Upload log files from all levels and all ranks that write to a filesystem visible
+  to the MLflow writer rank
+- Supports both local and distributed training scenarios when a shared or otherwise
+  common filesystem is used for artifact directories

Copilot · 2025-12-19T08:37:47Z

primus/backends/megatron/training/mlflow_artifacts.py

+from primus.modules.module_utils import log_rank_0, warning_rank_0
+
+
+def _get_all_trace_files(tensorboard_dir: str) -> list:


The return type hint uses generic list instead of the more specific list[str] (Python 3.9+) or List[str] from typing. Consider using more specific type hints to improve type checking and code clarity.

Copilot · 2025-12-19T08:37:47Z

primus/backends/megatron/training/mlflow_artifacts.py

+    return unique_files
+
+
+def _get_all_log_files(exp_root_path: str) -> list:


The return type hint uses generic list instead of the more specific list[str] (Python 3.9+) or List[str] from typing. Consider using more specific type hints to improve type checking and code clarity.
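
For illustration, a helper with the more specific return annotation might look like the sketch below. `list_log_files` is a hypothetical stand-in rather than the actual `_get_all_log_files`, and the `.log` suffix is an assumption; `List[str]` from `typing` is used so the snippet also runs on Python versions older than 3.9, where `list[str]` is not available at runtime.

```python
import os
from typing import List


def list_log_files(exp_root_path: str) -> List[str]:
    """Return sorted paths of *.log files under <exp_root_path>/logs (hypothetical helper)."""
    logs_dir = os.path.join(exp_root_path, "logs")
    found: List[str] = []
    # os.walk yields nothing if logs_dir does not exist, so the result is an empty list.
    for dirpath, _dirnames, filenames in os.walk(logs_dir):
        found.extend(os.path.join(dirpath, name) for name in filenames if name.endswith(".log"))
    return sorted(found)
```

With the element type spelled out, a type checker can flag callers that treat the result as anything other than a list of path strings.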

Copilot AI review requested due to automatic review settings December 18, 2025 09:10

Copilot started reviewing on behalf of gphuang December 18, 2025 09:11

Copilot AI reviewed Dec 18, 2025

gphuang requested a review from Copilot December 18, 2025 10:10

Copilot started reviewing on behalf of gphuang December 18, 2025 10:11

Copilot AI reviewed Dec 18, 2025

Copilot AI mentioned this pull request Dec 18, 2025

Move MLflow import to function scope to avoid import-time dependencies #441

Closed

docs: Clarify MLflow upload defaults are opt-out when MLflow enabled

13dfa81

Copilot AI review requested due to automatic review settings December 18, 2025 10:30

Copilot started reviewing on behalf of gphuang December 18, 2025 10:31

gphuang force-pushed the feat/6-enable-mlflow-uploading branch from 3c149be to 13dfa81 Compare December 18, 2025 10:33

Update primus/modules/trainer/megatron/trainer.py

1f2e136

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot AI reviewed Dec 18, 2025

Update examples/run_pretrain.sh

d30b920

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings December 18, 2025 10:37

Copilot started reviewing on behalf of gphuang December 18, 2025 10:38

Update primus/backends/megatron/training/mlflow_artifacts.py

b2da61b

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot AI reviewed Dec 18, 2025

gphuang mentioned this pull request Dec 18, 2025

feat: Add TraceLens integration for trace analysis with MLflow upload #439

Open

gphuang and others added 2 commits December 18, 2025 15:15

Merge branch 'main' into feat/6-enable-mlflow-uploading

476c05d

Copilot AI review requested due to automatic review settings December 19, 2025 08:26

gphuang marked this pull request as ready for review December 19, 2025 08:26

gphuang requested review from Xiaoming-AMD, limou102 and wenxie-amd as code owners December 19, 2025 08:26

Copilot started reviewing on behalf of gphuang December 19, 2025 08:27

Copilot AI reviewed Dec 19, 2025
