
fix(dashboard): cap job log reads to prevent dashboard agent OOM#61537

Open
yuanjiewei wants to merge 3 commits into ray-project:master from yuanjiewei:fix/cap-job-log-reads

Conversation

@yuanjiewei
Contributor

@yuanjiewei yuanjiewei commented Mar 6, 2026

Description

JobLogStorageClient.get_logs() calls f.read() on the entire job driver log file with no size limit. When a long-running job writes tens of GiB of stdout/stderr, a single GET /api/jobs/{id}/logs request loads the full file into memory. The response is then serialized via dataclasses.asdict() and json.dumps(), amplifying memory to ~3× the file size and causing the dashboard agent to OOM and crash the head node.

Changes:

  • Add JOB_LOG_MAX_READ_BYTES constant (default 16 MiB), configurable via RAY_JOB_LOG_MAX_READ_BYTES env var
  • Files within the cap are returned fully (no behavior change)
  • Oversized files return only the tail in binary-safe mode ("rb" + decode("utf-8", errors="replace")) with a [LOG TRUNCATED ...] prefix
  • Peak memory bounded to ~3× the cap (~48 MiB) regardless of log file size
  • Streaming endpoints (tail_logs(), gRPC StreamLog) are unaffected

Root cause chain:

GET /api/jobs/{id}/logs
  → JobHead.get_job_logs()               # job_head.py
    → JobAgentSubmissionClient            # gRPC to agent
      → JobAgent.get_job_logs()           # job_agent.py
        → JobLogStorageClient.get_logs()
          → f.read()                      # ← unbounded
  → dataclasses.asdict(response)          # copy #2
  → json.dumps()                          # copy #3

Production impact: A 32 GB log file caused the dashboard agent to consume 139 GB RSS on a 222 GB host, triggering the Ray memory monitor to kill running actors/tasks.

Related issues

None found. The closest existing issue (#41040) was about dashboard agent memory from metrics cardinality, not job log reads.

Additional information

Testing: Added test_job_log_storage_client.py with 6 tests covering:

  • Small file returned fully
  • Missing file returns empty string
  • Large file truncated with notice
  • Truncated output starts on clean line boundary
  • Multi-byte UTF-8 does not raise UnicodeDecodeError
  • Cap is configurable via env var

Run tests with:

pytest python/ray/dashboard/modules/job/tests/test_job_log_storage_client.py -v

get_logs() previously called f.read() on the entire job driver log file
with no size limit. When a job produces tens of GiB of stdout/stderr,
a single GET /api/jobs/{id}/logs request loads the full file into memory,
then serializes it to JSON (~3x amplification), causing the dashboard
agent to OOM and crash the head node.

Add a configurable cap (JOB_LOG_MAX_READ_BYTES, default 16 MiB). Files
within the cap are returned fully. Oversized files return only the tail
with a truncation notice, bounding peak memory to ~3x the cap.

The cap is tunable via RAY_JOB_LOG_MAX_READ_BYTES env var. Streaming
endpoints (tail_logs, gRPC StreamLog) are unaffected.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@yuanjiewei yuanjiewei requested a review from a team as a code owner March 6, 2026 09:48
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a crucial fix to prevent out-of-memory errors in the dashboard agent by capping the size of job logs read into memory. However, the current implementation of the truncation logic introduces a potential Denial of Service (DoS) vulnerability. Specifically, seeking to an arbitrary byte offset in a text-mode file can lead to a UnicodeDecodeError if the offset lands in the middle of a multi-byte UTF-8 character. This can be mitigated by opening the file in binary mode and using safe decoding with error handling. Additionally, a minor refactoring in one of the tests is suggested to improve maintainability by avoiding logic duplication. Otherwise, the changes are well-tested and effectively address the reported issue.

Comment on lines +38 to +41
    with open(log_path, "r") as f:
        f.seek(file_size - JOB_LOG_MAX_READ_BYTES)
        f.readline()  # skip partial first line
        tail = f.read()

Severity: medium (security)

The new log truncation logic in get_logs() uses f.seek(file_size - JOB_LOG_MAX_READ_BYTES) on a file opened in text mode ("r"). This seeks to an arbitrary byte offset in the file. If the log file contains multi-byte UTF-8 characters and the seek lands in the middle of a character sequence, subsequent calls to f.readline() or f.read() will raise a UnicodeDecodeError. Since this exception is not caught, it will propagate and cause the request to fail (500 error) or potentially crash the dashboard agent process. An attacker who can control the content of the job logs (e.g., by submitting a job that prints specific UTF-8 sequences) can trigger this error to prevent users from viewing the logs or to cause a Denial of Service (DoS) on the dashboard agent.

Suggested change

    # before:
    with open(log_path, "r") as f:
        f.seek(file_size - JOB_LOG_MAX_READ_BYTES)
        f.readline()  # skip partial first line
        tail = f.read()

    # after:
    with open(log_path, "rb") as f:
        f.seek(file_size - JOB_LOG_MAX_READ_BYTES)
        f.readline()  # skip partial first line
        tail_bytes = f.read()
        tail = tail_bytes.decode("utf-8", errors="replace")
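A small standalone repro of the failure mode this suggestion guards against, using a throwaway temp file: a text-mode seek to a byte offset inside a multi-byte character makes the next read raise, while binary mode with errors="replace" never does.

```python
import os
import tempfile

# Write a file containing only multi-byte UTF-8 characters ("é" is 2 bytes).
path = os.path.join(tempfile.mkdtemp(), "job.log")
with open(path, "w", encoding="utf-8") as f:
    f.write("é" * 10)  # 20 bytes on disk

# Text-mode seek to an odd byte offset lands mid-character; the strict
# UTF-8 decoder then fails on the orphaned continuation byte.
try:
    with open(path, "r", encoding="utf-8") as f:
        f.seek(1)  # middle of the first "é"
        f.read()
        outcome = "no error"
except UnicodeDecodeError:
    outcome = "UnicodeDecodeError"

# Binary mode with errors="replace" substitutes U+FFFD instead of raising.
with open(path, "rb") as f:
    f.seek(1)
    safe = f.read().decode("utf-8", errors="replace")

print(outcome)  # UnicodeDecodeError
```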

Comment on lines +83 to +104
    def test_cap_is_configurable_via_env(self, client, tmp_log_dir):
        # Re-import to pick up the env var
        from ray.dashboard.modules.job import job_log_storage_client as mod

        original = mod.JOB_LOG_MAX_READ_BYTES
        # Simulate the env-based init
        mod.JOB_LOG_MAX_READ_BYTES = int(
            os.environ.get("RAY_JOB_LOG_MAX_READ_BYTES", 16 * 1024 * 1024)
        )
        try:
            assert mod.JOB_LOG_MAX_READ_BYTES == 4096

            content = "a" * 999 + "\n"  # 1000 bytes per line
            content = content * 10  # 10,000 bytes total > 4096
            log_path = tmp_log_dir("env-job", content)

            with patch.object(client, "get_log_file_path", return_value=log_path):
                result = client.get_logs("env-job")

            assert result.startswith("[LOG TRUNCATED")
        finally:
            mod.JOB_LOG_MAX_READ_BYTES = original

Severity: medium

While this test correctly verifies the functionality and cleans up after itself, it duplicates the initialization logic from job_log_storage_client.py to set mod.JOB_LOG_MAX_READ_BYTES. This makes the test brittle; if the initialization logic or default value changes in the main module, this test would need a separate update and could hide bugs.

A more robust approach is to use importlib.reload() to re-initialize the module with the patched environment variable. This directly tests the module's actual initialization code. The suggested change implements this approach for a more maintainable test.

    def test_cap_is_configurable_via_env(self, client, tmp_log_dir):
        import importlib
        from ray.dashboard.modules.job import job_log_storage_client as mod

        original = mod.JOB_LOG_MAX_READ_BYTES
        try:
            # Reload the module to apply the environment variable patch.
            importlib.reload(mod)
            assert mod.JOB_LOG_MAX_READ_BYTES == 4096

            content = "a" * 999 + "\n"  # 1000 bytes per line
            content = content * 10  # 10,000 bytes total > 4096
            log_path = tmp_log_dir("env-job", content)

            with patch.object(client, "get_log_file_path", return_value=log_path):
                result = client.get_logs("env-job")

            assert result.startswith("[LOG TRUNCATED")
        finally:
            # Restore the original value to ensure test isolation.
            mod.JOB_LOG_MAX_READ_BYTES = original

yuanjiewei and others added 2 commits March 6, 2026 18:28
…te UTF-8

f.seek() with a byte offset on a text-mode file can land in the middle
of a multi-byte UTF-8 character, causing readline()/read() to raise
UnicodeDecodeError. Switch to binary mode ("rb") for the truncation
path and decode with errors="replace", consistent with the existing
fast_tail_last_n_lines in utils.py.

Also use errors="replace" for the small-file path to be defensive.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

    with open(log_path, "rb") as f:
        f.seek(file_size - JOB_LOG_MAX_READ_BYTES)
        f.readline()  # skip partial first line
        tail = f.read().decode("utf-8", errors="replace")

Unbounded f.read() in truncation branch undermines OOM cap

Low Severity

f.read() on the truncation path reads from the seek position to the actual EOF with no size limit. Since file_size was captured earlier by os.path.getsize(), an actively-written job log can grow between the size check and the read, causing f.read() to return significantly more than JOB_LOG_MAX_READ_BYTES. Passing the cap as an argument — f.read(JOB_LOG_MAX_READ_BYTES) — would make the protection watertight against concurrent appends.
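The race the reviewer describes is easy to demonstrate in isolation: capture a file's size, append to it (as a live job log would), and compare an unbounded read against a capped one from the same seek position. The file path here is a throwaway temp file.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "grow.log")
with open(path, "wb") as f:
    f.write(b"a" * 100)

cap = 64
size = os.path.getsize(path)  # 100, captured before the read

# The file grows after the size check (simulating a concurrent append).
with open(path, "ab") as f:
    f.write(b"b" * 1_000_000)

with open(path, "rb") as f:
    f.seek(size - cap)
    unbounded = len(f.read())  # reads to the *new* EOF

with open(path, "rb") as f:
    f.seek(size - cap)
    bounded = len(f.read(cap))  # never exceeds the cap

print(unbounded, bounded)  # 1000064 64
```

The unbounded read returns roughly the entire appended megabyte; the capped read stays at 64 bytes regardless of how much the file grew.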


@ray-gardener ray-gardener bot added dashboard Issues specific to the Ray Dashboard core Issues that should be addressed in Ray Core observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling community-contribution Contributed by the community labels Mar 6, 2026