
fix(dashboard): cap job log reads to prevent dashboard agent OOM#61537

Open
yuanjiewei wants to merge 3 commits into ray-project:master from yuanjiewei:fix/cap-job-log-reads

Conversation

@yuanjiewei
Contributor

@yuanjiewei yuanjiewei commented Mar 6, 2026

Description

JobLogStorageClient.get_logs() calls f.read() on the entire job driver log file with no size limit. When a long-running job writes tens of GiB of stdout/stderr, a single GET /api/jobs/{id}/logs request loads the full file into memory. The response is then serialized via dataclasses.asdict() and json.dumps(), amplifying memory to ~3× the file size and causing the dashboard agent to OOM and crash the head node.

Changes:

  • Add JOB_LOG_MAX_READ_BYTES constant (default 16 MiB), configurable via RAY_JOB_LOG_MAX_READ_BYTES env var
  • Files within the cap are returned fully (no behavior change)
  • Oversized files return only the tail in binary-safe mode ("rb" + decode("utf-8", errors="replace")) with a [LOG TRUNCATED ...] prefix
  • Peak memory bounded to ~3× the cap (~48 MiB) regardless of log file size
  • Streaming endpoints (tail_logs(), gRPC StreamLog) are unaffected

Root cause chain:

GET /api/jobs/{id}/logs
  → JobHead.get_job_logs()               # job_head.py
    → JobAgentSubmissionClient            # gRPC to agent
      → JobAgent.get_job_logs()           # job_agent.py
        → JobLogStorageClient.get_logs()
          → f.read()                      # ← unbounded
  → dataclasses.asdict(response)          # copy #2
  → json.dumps()                          # copy #3

Production impact: A 32 GB log file caused the dashboard agent to consume 139 GB RSS on a 222 GB host, triggering the Ray memory monitor to kill running actors/tasks.

Related issues

None found. The closest existing issue (#41040) was about dashboard agent memory from metrics cardinality, not job log reads.

Additional information

Testing: Added test_job_log_storage_client.py with 6 tests covering:

  • Small file returned fully
  • Missing file returns empty string
  • Large file truncated with notice
  • Truncated output starts on clean line boundary
  • Multi-byte UTF-8 does not raise UnicodeDecodeError
  • Cap is configurable via env var

Run tests with:

pytest python/ray/dashboard/modules/job/tests/test_job_log_storage_client.py -v

get_logs() previously called f.read() on the entire job driver log file
with no size limit. When a job produces tens of GiB of stdout/stderr,
a single GET /api/jobs/{id}/logs request loads the full file into memory,
then serializes it to JSON (~3x amplification), causing the dashboard
agent to OOM and crash the head node.

Add a configurable cap (JOB_LOG_MAX_READ_BYTES, default 16 MiB). Files
within the cap are returned fully. Oversized files return only the tail
with a truncation notice, bounding peak memory to ~3x the cap.

The cap is tunable via RAY_JOB_LOG_MAX_READ_BYTES env var. Streaming
endpoints (tail_logs, gRPC StreamLog) are unaffected.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@yuanjiewei yuanjiewei requested a review from a team as a code owner March 6, 2026 09:48
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a crucial fix to prevent out-of-memory errors in the dashboard agent by capping the size of job logs read into memory. However, the current implementation of the truncation logic introduces a potential Denial of Service (DoS) vulnerability. Specifically, seeking to an arbitrary byte offset in a text-mode file can lead to a UnicodeDecodeError if the offset lands in the middle of a multi-byte UTF-8 character. This can be mitigated by opening the file in binary mode and using safe decoding with error handling. Additionally, a minor refactoring in one of the tests is suggested to improve maintainability by avoiding logic duplication. Otherwise, the changes are well-tested and effectively address the reported issue.

Comment on lines +38 to +41
    with open(log_path, "r") as f:
        f.seek(file_size - JOB_LOG_MAX_READ_BYTES)
        f.readline()  # skip partial first line
        tail = f.read()

Severity: medium (security)

The new log truncation logic in get_logs() uses f.seek(file_size - JOB_LOG_MAX_READ_BYTES) on a file opened in text mode ("r"). This seeks to an arbitrary byte offset in the file. If the log file contains multi-byte UTF-8 characters and the seek lands in the middle of a character sequence, subsequent calls to f.readline() or f.read() will raise a UnicodeDecodeError. Since this exception is not caught, it will propagate and cause the request to fail (500 error) or potentially crash the dashboard agent process. An attacker who can control the content of the job logs (e.g., by submitting a job that prints specific UTF-8 sequences) can trigger this error to prevent users from viewing the logs or to cause a Denial of Service (DoS) on the dashboard agent.

Suggested change

    # before:
    with open(log_path, "r") as f:
        f.seek(file_size - JOB_LOG_MAX_READ_BYTES)
        f.readline()  # skip partial first line
        tail = f.read()

    # after:
    with open(log_path, "rb") as f:
        f.seek(file_size - JOB_LOG_MAX_READ_BYTES)
        f.readline()  # skip partial first line
        tail_bytes = f.read()
        tail = tail_bytes.decode("utf-8", errors="replace")
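A small standalone repro of the failure mode this suggestion guards against, using a throwaway temp file: a text-mode seek to a byte offset inside a multi-byte character makes the next read raise, while binary mode with errors="replace" never does.

```python
import os
import tempfile

# Write a file containing only multi-byte UTF-8 characters ("é" is 2 bytes).
path = os.path.join(tempfile.mkdtemp(), "job.log")
with open(path, "w", encoding="utf-8") as f:
    f.write("é" * 10)  # 20 bytes on disk

# Text-mode seek to an odd byte offset lands mid-character; the strict
# UTF-8 decoder then fails on the orphaned continuation byte.
try:
    with open(path, "r", encoding="utf-8") as f:
        f.seek(1)  # middle of the first "é"
        f.read()
        outcome = "no error"
except UnicodeDecodeError:
    outcome = "UnicodeDecodeError"

# Binary mode with errors="replace" substitutes U+FFFD instead of raising.
with open(path, "rb") as f:
    f.seek(1)
    safe = f.read().decode("utf-8", errors="replace")

print(outcome)  # UnicodeDecodeError
```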

Comment on lines +83 to +104
    def test_cap_is_configurable_via_env(self, client, tmp_log_dir):
        # Re-import to pick up the env var
        from ray.dashboard.modules.job import job_log_storage_client as mod

        original = mod.JOB_LOG_MAX_READ_BYTES
        # Simulate the env-based init
        mod.JOB_LOG_MAX_READ_BYTES = int(
            os.environ.get("RAY_JOB_LOG_MAX_READ_BYTES", 16 * 1024 * 1024)
        )
        try:
            assert mod.JOB_LOG_MAX_READ_BYTES == 4096

            content = "a" * 999 + "\n"  # 1000 bytes per line
            content = content * 10  # 10,000 bytes total > 4096
            log_path = tmp_log_dir("env-job", content)

            with patch.object(client, "get_log_file_path", return_value=log_path):
                result = client.get_logs("env-job")

            assert result.startswith("[LOG TRUNCATED")
        finally:
            mod.JOB_LOG_MAX_READ_BYTES = original

Severity: medium

While this test correctly verifies the functionality and cleans up after itself, it duplicates the initialization logic from job_log_storage_client.py to set mod.JOB_LOG_MAX_READ_BYTES. This makes the test brittle; if the initialization logic or default value changes in the main module, this test would need a separate update and could hide bugs.

A more robust approach is to use importlib.reload() to re-initialize the module with the patched environment variable. This directly tests the module's actual initialization code. The suggested change implements this approach for a more maintainable test.

    def test_cap_is_configurable_via_env(self, client, tmp_log_dir):
        import importlib
        from ray.dashboard.modules.job import job_log_storage_client as mod

        original = mod.JOB_LOG_MAX_READ_BYTES
        try:
            # Reload the module to apply the environment variable patch.
            importlib.reload(mod)
            assert mod.JOB_LOG_MAX_READ_BYTES == 4096

            content = "a" * 999 + "\n"  # 1000 bytes per line
            content = content * 10  # 10,000 bytes total > 4096
            log_path = tmp_log_dir("env-job", content)

            with patch.object(client, "get_log_file_path", return_value=log_path):
                result = client.get_logs("env-job")

            assert result.startswith("[LOG TRUNCATED")
        finally:
            # Restore the original value to ensure test isolation.
            mod.JOB_LOG_MAX_READ_BYTES = original

yuanjiewei and others added 2 commits March 6, 2026 18:28
…te UTF-8

f.seek() with a byte offset on a text-mode file can land in the middle
of a multi-byte UTF-8 character, causing readline()/read() to raise
UnicodeDecodeError. Switch to binary mode ("rb") for the truncation
path and decode with errors="replace", consistent with the existing
fast_tail_last_n_lines in utils.py.

Also use errors="replace" for the small-file path to be defensive.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

    with open(log_path, "rb") as f:
        f.seek(file_size - JOB_LOG_MAX_READ_BYTES)
        f.readline()  # skip partial first line
        tail = f.read().decode("utf-8", errors="replace")

Unbounded f.read() in truncation branch undermines OOM cap

Low Severity

f.read() on the truncation path reads from the seek position to the actual EOF with no size limit. Since file_size was captured earlier by os.path.getsize(), an actively-written job log can grow between the size check and the read, causing f.read() to return significantly more than JOB_LOG_MAX_READ_BYTES. Passing the cap as an argument — f.read(JOB_LOG_MAX_READ_BYTES) — would make the protection watertight against concurrent appends.
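The race the reviewer describes is easy to demonstrate in isolation: capture a file's size, append to it (as a live job log would), and compare an unbounded read against a capped one from the same seek position. The file path here is a throwaway temp file.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "grow.log")
with open(path, "wb") as f:
    f.write(b"a" * 100)

cap = 64
size = os.path.getsize(path)  # 100, captured before the read

# The file grows after the size check (simulating a concurrent append).
with open(path, "ab") as f:
    f.write(b"b" * 1_000_000)

with open(path, "rb") as f:
    f.seek(size - cap)
    unbounded = len(f.read())  # reads to the *new* EOF

with open(path, "rb") as f:
    f.seek(size - cap)
    bounded = len(f.read(cap))  # never exceeds the cap

print(unbounded, bounded)  # 1000064 64
```

The unbounded read returns roughly the entire appended megabyte; the capped read stays at 64 bytes regardless of how much the file grew.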


@ray-gardener ray-gardener bot added dashboard Issues specific to the Ray Dashboard core Issues that should be addressed in Ray Core observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling community-contribution Contributed by the community labels Mar 6, 2026