-
Notifications
You must be signed in to change notification settings - Fork 233
Separate infrastructure logs from training progress #953
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- Add SKYRL_LOG_DIR and SKYRL_LOG_LEVEL environment variables
- Create ray_logging.py with redirect_actor_output_to_file() helper
- Redirect vLLM engine and worker output to log file
- Keep training progress (from skyrl_entrypoint) on stdout
- Add documentation in docs/LOGGING_IMPROVEMENTS.md
Infrastructure logs (vLLM model loading, KV cache, etc.) now go to
/tmp/skyrl-logs/{run_name}/infra.log while training progress remains
on stdout for visibility.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a mechanism to separate infrastructure logs from training progress logs, redirecting the former to a file, which improves log clarity. However, the implementation of the log file path construction in initialize_ray is vulnerable to path traversal. An attacker can control the log file location via the run_name configuration, potentially leading to arbitrary file creation or append in sensitive system directories, and given that log content can be influenced by model outputs, this poses a significant security risk, including potential remote code execution. Additionally, the log redirection does not respect the SKYRL_LOG_LEVEL=DEBUG setting, and the log file setup in initialize_ray could be more robust by being conditional on the log level to prevent empty log files and misleading messages in debug mode.
| log_dir = Path(SKYRL_LOG_DIR) / cfg.trainer.run_name | ||
| log_dir.mkdir(parents=True, exist_ok=True) | ||
| log_file = str(log_dir / "infra.log") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The initialize_ray function constructs the log_dir and log_file paths using SKYRL_LOG_DIR and cfg.trainer.run_name without proper validation. This creates a path traversal vulnerability where a malicious run_name (e.g., ../../etc/cron.daily) could lead to arbitrary file creation in sensitive system directories. Since log content can be influenced by model outputs, this poses a significant risk, including potential arbitrary code execution if the file is placed in an executable location (like cron.daily). To remediate this, ensure the constructed path is resolved and verified to be within the intended base directory. Additionally, the current implementation sets up log file redirection regardless of SKYRL_LOG_LEVEL. When SKYRL_LOG_LEVEL is DEBUG, an empty log file is still created and a potentially misleading message is printed. The log file setup should be conditional on SKYRL_LOG_LEVEL not being DEBUG to avoid this.
| def redirect_actor_output_to_file(): | ||
| """ | ||
| Redirect stdout and stderr to log file to prevent Ray from forwarding to driver. | ||
| Call this at the very start of any Ray actor/remote function where you want | ||
| to suppress output from appearing on the driver's stdout. The output will | ||
| instead be written to the log file specified by SKYRL_LOG_FILE. | ||
| Note: Do NOT call this in skyrl_entrypoint() - training progress should | ||
| go to stdout. | ||
| """ | ||
| log_file = os.getenv("SKYRL_LOG_FILE") | ||
| if log_file: | ||
| log_fd = open(log_file, "a", buffering=1) # noqa: SIM115 | ||
| os.dup2(log_fd.fileno(), sys.stdout.fileno()) | ||
| os.dup2(log_fd.fileno(), sys.stderr.fileno()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current implementation unconditionally redirects stdout/stderr if SKYRL_LOG_FILE is set. However, the documentation and PR description state that setting SKYRL_LOG_LEVEL=DEBUG should show all logs on stdout. To align with this behavior, you should avoid redirection when the log level is 'DEBUG'.
def redirect_actor_output_to_file():
"""
Redirect stdout and stderr to log file to prevent Ray from forwarding to driver.
Call this at the very start of any Ray actor/remote function where you want
to suppress output from appearing on the driver's stdout. The output will
instead be written to the log file specified by SKYRL_LOG_FILE.
Note: Do NOT call this in skyrl_entrypoint() - training progress should
go to stdout.
"""
from skyrl_train.env_vars import SKYRL_LOG_LEVEL
if SKYRL_LOG_LEVEL == "DEBUG":
return
log_file = os.getenv("SKYRL_LOG_FILE")
if log_file:
log_fd = open(log_file, "a", buffering=1) # noqa: SIM115
os.dup2(log_fd.fileno(), sys.stdout.fileno())
os.dup2(log_fd.fileno(), sys.stderr.fileno())
Summary
/tmp/skyrl-logs/{run_name}/infra.log)SKYRL_LOG_DIRandSKYRL_LOG_LEVELenvironment variablesHow it works
redirect_actor_output_to_file()in their__init__SKYRL_LOG_LEVEL=DEBUGshows all logs on stdout for debuggingTest plan
bash examples/gsm8k/run_gsm8k.shand verify training progress on stdout/tmp/skyrl-logs/gsm8k_test/infra.logSKYRL_LOG_LEVEL=DEBUGshows all logsKnown limitations
(raylet)system logs still appear on stdout (minimal, not the noisy vLLM output)🤖 Generated with Claude Code