fix(dashboard): cap job log reads to prevent dashboard agent OOM #61537
yuanjiewei wants to merge 3 commits into ray-project:master
Conversation
get_logs() previously called f.read() on the entire job driver log file
with no size limit. When a job produces tens of GiB of stdout/stderr,
a single GET /api/jobs/{id}/logs request loads the full file into memory,
then serializes it to JSON (~3x amplification), causing the dashboard
agent to OOM and crash the head node.
Add a configurable cap (JOB_LOG_MAX_READ_BYTES, default 16 MiB). Files
within the cap are returned fully. Oversized files return only the tail
with a truncation notice, bounding peak memory to ~3x the cap.
The cap is tunable via RAY_JOB_LOG_MAX_READ_BYTES env var. Streaming
endpoints (tail_logs, gRPC StreamLog) are unaffected.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Code Review
This pull request introduces a crucial fix to prevent out-of-memory errors in the dashboard agent by capping the size of job logs read into memory. However, the current implementation of the truncation logic introduces a potential Denial of Service (DoS) vulnerability. Specifically, seeking to an arbitrary byte offset in a text-mode file can lead to a UnicodeDecodeError if the offset lands in the middle of a multi-byte UTF-8 character. This can be mitigated by opening the file in binary mode and using safe decoding with error handling. Additionally, a minor refactoring in one of the tests is suggested to improve maintainability by avoiding logic duplication. Otherwise, the changes are well-tested and effectively address the reported issue.
```python
with open(log_path, "r") as f:
    f.seek(file_size - JOB_LOG_MAX_READ_BYTES)
    f.readline()  # skip partial first line
    tail = f.read()
```
The new log truncation logic in get_logs() uses f.seek(file_size - JOB_LOG_MAX_READ_BYTES) on a file opened in text mode ("r"). This seeks to an arbitrary byte offset in the file. If the log file contains multi-byte UTF-8 characters and the seek lands in the middle of a character sequence, subsequent calls to f.readline() or f.read() will raise a UnicodeDecodeError. Since this exception is not caught, it will propagate and cause the request to fail (500 error) or potentially crash the dashboard agent process. An attacker who can control the content of the job logs (e.g., by submitting a job that prints specific UTF-8 sequences) can trigger this error to prevent users from viewing the logs or to cause a Denial of Service (DoS) on the dashboard agent.
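The torn-character failure is easy to reproduce outside Ray. A minimal standalone demonstration (the temp file and byte offset are arbitrary): seeking a text-mode file into the middle of a multi-byte character makes `read()` raise, while binary mode plus `errors="replace"` degrades gracefully.

```python
import os
import tempfile

# Write a log whose second character is the 2-byte sequence 'é' (0xC3 0xA9).
with tempfile.NamedTemporaryFile("wb", delete=False, suffix=".log") as f:
    f.write("héllo\n".encode("utf-8") * 100)
    path = f.name

# Byte offset 2 lands between the two bytes of the first 'é'.
try:
    with open(path, "r", encoding="utf-8") as f:
        f.seek(2)
        f.read()  # decoder starts on a continuation byte
    raised = False
except UnicodeDecodeError:
    raised = True
assert raised

# Binary mode never raises; the torn character decodes to U+FFFD instead.
with open(path, "rb") as f:
    f.seek(2)
    tail = f.read().decode("utf-8", errors="replace")
assert "\ufffd" in tail

os.unlink(path)
```

Note that `TextIOWrapper.seek()` is only defined for offsets previously returned by `tell()`; arbitrary byte offsets are exactly the undefined-behavior case the suggested binary-mode fix avoids.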
```diff
-with open(log_path, "r") as f:
-    f.seek(file_size - JOB_LOG_MAX_READ_BYTES)
-    f.readline()  # skip partial first line
-    tail = f.read()
+with open(log_path, "rb") as f:
+    f.seek(file_size - JOB_LOG_MAX_READ_BYTES)
+    f.readline()  # skip partial first line
+    tail_bytes = f.read()
+    tail = tail_bytes.decode("utf-8", errors="replace")
```
```python
def test_cap_is_configurable_via_env(self, client, tmp_log_dir):
    # Re-import to pick up the env var
    from ray.dashboard.modules.job import job_log_storage_client as mod

    original = mod.JOB_LOG_MAX_READ_BYTES
    # Simulate the env-based init
    mod.JOB_LOG_MAX_READ_BYTES = int(
        os.environ.get("RAY_JOB_LOG_MAX_READ_BYTES", 16 * 1024 * 1024)
    )
    try:
        assert mod.JOB_LOG_MAX_READ_BYTES == 4096

        content = "a" * 999 + "\n"  # 1000 bytes per line
        content = content * 10  # 10,000 bytes total > 4096
        log_path = tmp_log_dir("env-job", content)

        with patch.object(client, "get_log_file_path", return_value=log_path):
            result = client.get_logs("env-job")

        assert result.startswith("[LOG TRUNCATED")
    finally:
        mod.JOB_LOG_MAX_READ_BYTES = original
```
While this test correctly verifies the functionality and cleans up after itself, it duplicates the initialization logic from job_log_storage_client.py to set mod.JOB_LOG_MAX_READ_BYTES. This makes the test brittle; if the initialization logic or default value changes in the main module, this test would need a separate update and could hide bugs.
A more robust approach is to use importlib.reload() to re-initialize the module with the patched environment variable. This directly tests the module's actual initialization code. The suggested change implements this approach for a more maintainable test.
```python
def test_cap_is_configurable_via_env(self, client, tmp_log_dir):
    import importlib
    from ray.dashboard.modules.job import job_log_storage_client as mod

    original = mod.JOB_LOG_MAX_READ_BYTES
    try:
        # Reload the module to apply the environment variable patch.
        importlib.reload(mod)
        assert mod.JOB_LOG_MAX_READ_BYTES == 4096

        content = "a" * 999 + "\n"  # 1000 bytes per line
        content = content * 10  # 10,000 bytes total > 4096
        log_path = tmp_log_dir("env-job", content)

        with patch.object(client, "get_log_file_path", return_value=log_path):
            result = client.get_logs("env-job")

        assert result.startswith("[LOG TRUNCATED")
    finally:
        # Restore the original value to ensure test isolation.
        mod.JOB_LOG_MAX_READ_BYTES = original
```
…te UTF-8
f.seek() with a byte offset on a text-mode file can land in the middle
of a multi-byte UTF-8 character, causing readline()/read() to raise
UnicodeDecodeError. Switch to binary mode ("rb") for the truncation
path and decode with errors="replace", consistent with the existing
fast_tail_last_n_lines in utils.py.
Also use errors="replace" for the small-file path to be defensive.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```python
with open(log_path, "rb") as f:
    f.seek(file_size - JOB_LOG_MAX_READ_BYTES)
    f.readline()  # skip partial first line
    tail = f.read().decode("utf-8", errors="replace")
```
Unbounded f.read() in truncation branch undermines OOM cap
Low Severity
f.read() on the truncation path reads from the seek position to the actual EOF with no size limit. Since file_size was captured earlier by os.path.getsize(), an actively-written job log can grow between the size check and the read, causing f.read() to return significantly more than JOB_LOG_MAX_READ_BYTES. Passing the cap as an argument — f.read(JOB_LOG_MAX_READ_BYTES) — would make the protection watertight against concurrent appends.
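The race described above can be shown with a standalone sketch (hypothetical file and sizes, not Ray code): if the log grows between the `os.path.getsize()` snapshot and the read, an unbounded `f.read()` returns the extra bytes, while `f.read(cap)` stays bounded.

```python
import os
import tempfile

CAP = 1024

# A 4096-byte "log file" at the time of the size check.
with tempfile.NamedTemporaryFile("wb", delete=False) as f:
    f.write(b"x" * 4096)
    path = f.name

file_size = os.path.getsize(path)  # snapshot: 4096

# Simulate a concurrent append landing after the size check.
with open(path, "ab") as f:
    f.write(b"y" * 8192)

with open(path, "rb") as f:
    f.seek(file_size - CAP)
    unbounded = f.read()  # reads to the *new* EOF: CAP + 8192 bytes
    f.seek(file_size - CAP)
    bounded = f.read(CAP)  # never exceeds the cap

assert len(unbounded) == CAP + 8192
assert len(bounded) == CAP

os.unlink(path)
```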
Description
`JobLogStorageClient.get_logs()` calls `f.read()` on the entire job driver log file with no size limit. When a long-running job writes tens of GiB of stdout/stderr, a single `GET /api/jobs/{id}/logs` request loads the full file into memory. The response is then serialized via `dataclasses.asdict()` → `json.dumps()`, amplifying memory to ~3× the file size, causing the dashboard agent to OOM and crash the head node.
Changes:
- Add a `JOB_LOG_MAX_READ_BYTES` constant (default 16 MiB), configurable via the `RAY_JOB_LOG_MAX_READ_BYTES` env var
- Files within the cap are returned in full; oversized files return only the tail (read in `"rb"` mode + `decode("utf-8", errors="replace")`) with a `[LOG TRUNCATED ...]` prefix
- Streaming endpoints (`tail_logs()`, gRPC `StreamLog`) are unaffected
Root cause chain:
Production impact: A 32 GB log file caused the dashboard agent to consume 139 GB RSS on a 222 GB host, triggering the Ray memory monitor to kill running actors/tasks.
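A rough standalone illustration of the serialization amplification (not Ray code, and sizes are arbitrary): the JSON payload is a second full-size copy of the log text, plus escaping overhead, and it coexists with the original string at peak.

```python
import json

# ~1 MB of newline-separated "log" text.
log = ("x" * 99 + "\n") * 10_000

# json.dumps materializes a complete serialized copy: every "\n" becomes
# the two characters "\\n", and the string is wrapped in quotes and a key.
payload = json.dumps({"logs": log})

assert len(payload) > len(log)
```

At peak, the original string, any intermediate buffers, and the payload are all live, which is why memory scales as a small multiple of the file size rather than 1×.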
Related issues
None found. The closest existing issue (#41040) was about dashboard agent memory from metrics cardinality, not job log reads.
Additional information
Testing: Added `test_job_log_storage_client.py` with 6 tests covering: … `UnicodeDecodeError` …
Run tests with: