Separate infrastructure logs from training progress #953

tyler-griggs · 2026-01-25T08:09:23Z

Summary

Redirects vLLM and worker infrastructure logs to a log file (/tmp/skyrl-logs/{run_name}/infra.log)
Keeps training progress (config, steps, metrics, rewards) on stdout for visibility
Adds SKYRL_LOG_DIR and SKYRL_LOG_LEVEL environment variables

How it works

vLLM engines and workers call redirect_actor_output_to_file() in their __init__
Training entrypoint does NOT redirect, so training progress reaches stdout
SKYRL_LOG_LEVEL=DEBUG shows all logs on stdout for debugging

Test plan

Run bash examples/gsm8k/run_gsm8k.sh and verify training progress on stdout
Verify vLLM logs go to /tmp/skyrl-logs/gsm8k_test/infra.log
Test SKYRL_LOG_LEVEL=DEBUG shows all logs

Known limitations

Ray (raylet) system logs still appear on stdout (minimal, not the noisy vLLM output)

🤖 Generated with Claude Code

- Add SKYRL_LOG_DIR and SKYRL_LOG_LEVEL environment variables - Create ray_logging.py with redirect_actor_output_to_file() helper - Redirect vLLM engine and worker output to log file - Keep training progress (from skyrl_entrypoint) on stdout - Add documentation in docs/LOGGING_IMPROVEMENTS.md Infrastructure logs (vLLM model loading, KV cache, etc.) now go to /tmp/skyrl-logs/{run_name}/infra.log while training progress remains on stdout for visibility. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

vercel · 2026-01-25T08:09:27Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Review	Updated (UTC)
skyrl-docs	Ready	Preview, Comment	Jan 25, 2026 8:09am

gemini-code-assist

Code Review

This pull request introduces a mechanism to separate infrastructure logs from training progress logs, redirecting the former to a file, which improves log clarity. However, the implementation of the log file path construction in initialize_ray is vulnerable to path traversal. An attacker can control the log file location via the run_name configuration, potentially leading to arbitrary file creation or append in sensitive system directories, and given that log content can be influenced by model outputs, this poses a significant security risk, including potential remote code execution. Additionally, the log redirection does not respect the SKYRL_LOG_LEVEL=DEBUG setting, and the log file setup in initialize_ray could be more robust by being conditional on the log level to prevent empty log files and misleading messages in debug mode.

gemini-code-assist · 2026-01-25T08:12:11Z

skyrl-train/skyrl_train/utils/utils.py

+    log_dir = Path(SKYRL_LOG_DIR) / cfg.trainer.run_name
+    log_dir.mkdir(parents=True, exist_ok=True)
+    log_file = str(log_dir / "infra.log")


The initialize_ray function constructs the log_dir and log_file paths using SKYRL_LOG_DIR and cfg.trainer.run_name without proper validation. This creates a path traversal vulnerability where a malicious run_name (e.g., ../../etc/cron.daily) could lead to arbitrary file creation in sensitive system directories. Since log content can be influenced by model outputs, this poses a significant risk, including potential arbitrary code execution if the file is placed in an executable location (like cron.daily). To remediate this, ensure the constructed path is resolved and verified to be within the intended base directory. Additionally, the current implementation sets up log file redirection regardless of SKYRL_LOG_LEVEL. When SKYRL_LOG_LEVEL is DEBUG, an empty log file is still created and a potentially misleading message is printed. The log file setup should be conditional on SKYRL_LOG_LEVEL not being DEBUG to avoid this.

gemini-code-assist · 2026-01-25T08:12:11Z

skyrl-train/skyrl_train/utils/ray_logging.py

+def redirect_actor_output_to_file():
+    """
+    Redirect stdout and stderr to log file to prevent Ray from forwarding to driver.
+
+    Call this at the very start of any Ray actor/remote function where you want
+    to suppress output from appearing on the driver's stdout. The output will
+    instead be written to the log file specified by SKYRL_LOG_FILE.
+
+    Note: Do NOT call this in skyrl_entrypoint() - training progress should
+    go to stdout.
+    """
+    log_file = os.getenv("SKYRL_LOG_FILE")
+    if log_file:
+        log_fd = open(log_file, "a", buffering=1)  # noqa: SIM115
+        os.dup2(log_fd.fileno(), sys.stdout.fileno())
+        os.dup2(log_fd.fileno(), sys.stderr.fileno())


The current implementation unconditionally redirects stdout/stderr if SKYRL_LOG_FILE is set. However, the documentation and PR description state that setting SKYRL_LOG_LEVEL=DEBUG should show all logs on stdout. To align with this behavior, you should avoid redirection when the log level is 'DEBUG'.

def redirect_actor_output_to_file(): """ Redirect stdout and stderr to log file to prevent Ray from forwarding to driver. Call this at the very start of any Ray actor/remote function where you want to suppress output from appearing on the driver's stdout. The output will instead be written to the log file specified by SKYRL_LOG_FILE. Note: Do NOT call this in skyrl_entrypoint() - training progress should go to stdout. """ from skyrl_train.env_vars import SKYRL_LOG_LEVEL if SKYRL_LOG_LEVEL == "DEBUG": return log_file = os.getenv("SKYRL_LOG_FILE") if log_file: log_fd = open(log_file, "a", buffering=1) # noqa: SIM115 os.dup2(log_fd.fileno(), sys.stdout.fileno()) os.dup2(log_fd.fileno(), sys.stderr.fileno())

vercel bot deployed to Preview January 25, 2026 08:09 View deployment

gemini-code-assist bot reviewed Jan 25, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Separate infrastructure logs from training progress #953

Separate infrastructure logs from training progress #953

Uh oh!

tyler-griggs commented Jan 25, 2026

Uh oh!

vercel bot commented Jan 25, 2026 •

edited

Loading

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Jan 25, 2026

Uh oh!

gemini-code-assist bot Jan 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Separate infrastructure logs from training progress #953

Are you sure you want to change the base?

Separate infrastructure logs from training progress #953

Uh oh!

Conversation

tyler-griggs commented Jan 25, 2026

Summary

How it works

Test plan

Known limitations

Uh oh!

vercel bot commented Jan 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Jan 25, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jan 25, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

vercel bot commented Jan 25, 2026 •

edited

Loading