[Data][LLM] Support openai's nested image_url format in PrepareImageStage by GuyStone · Pull Request #1 · GuyStone/ray

GuyStone · 2025-09-16T02:00:16Z

Why are these changes needed?

Related issue number

Checks

I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: czgdp1807 <gdp.1807@gmail.com>

…es in Ray Serve docs (ray-project#56131) Signed-off-by: Potato <tanxinyu@apache.org> Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com> Co-authored-by: Douglas Strodtman <douglas@anyscale.com>

…e reused by cache_stopped_nodes (ray-project#56007) Signed-off-by: Rueian <rueian@anyscale.com>

…56133) ## Why are these changes needed? This reverts PR ray-project#52380. When working with large data blocks, this log can dump an entire block to the terminal, which is both spammy and insecure. ## Related issue number Fixes ray-project#56092 ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Praveen Gorthy <praveeng@anyscale.com> Signed-off-by: Praveen <gorthypraveen@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

and add skip-on-release-tests tag for skipping steps to run on release tests Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

minor typo   ## Why are these changes needed?  ## Related issue number  ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: weiliango <weiliang.dev@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

…s resource limits assignment (ray-project#56051) ## Why are these changes needed? The performance tips documentation for setting resource limits in ExecutionOptions is no longer correct and gives an error when directly setting them in 2.49 after ray-project#54694. Update the documentation to show how to correctly set them. ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [x] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Signed-off-by: Jack Gammack <49536617+JackGammack@users.noreply.github.com>

…on (ray-project#56066) This PR addresses various grammar, punctuation, and formatting issues throughout the Ray Data documentation in `doc/source/data/` to improve clarity and readability. ## Changes Made **Grammar Fixes:** - Fixed verb agreement errors in `key-concepts.rst` ("define" → "defines", "translate" → "translates") - Corrected missing articles and prepositions ("sharding your dataset" → "sharding of your dataset") - Fixed awkward phrasing in `saving-data.rst` ("have the control" → "have control") - Improved sentence flow in multiple files ("like following" → "as follows") **Formatting Improvements:** - Restructured bullet list formatting in `aggregations.rst` for better readability - Added missing punctuation and commas for proper sentence structure - Improved note formatting and punctuation consistency **Files Modified:** - `doc/source/data/key-concepts.rst` - 3 grammar corrections - `doc/source/data/user-guide.rst` - 1 verb form correction - `doc/source/data/aggregations.rst` - Bullet list formatting improvement - `doc/source/data/joining-data.rst` - 2 grammar and punctuation fixes - `doc/source/data/comparisons.rst` - 1 preposition correction - `doc/source/data/data-internals.rst` - 1 punctuation fix - `doc/source/data/saving-data.rst` - 1 phrasing improvement ## Review Methodology The review was conducted manually across all 45 files in the `doc/source/data/` directory, focusing specifically on: - Typos and spelling errors - Grammar and syntax issues - RST formatting consistency - Punctuation and capitalization The approach was conservative, making only clear corrections without rewriting content for style, preserving the technical accuracy and existing tone of the documentation. ## Impact These changes improve the overall quality and professionalism of the Ray Data documentation while maintaining all technical content and existing structure. The fixes address common grammatical issues that could distract readers from the technical content. 
--------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>

…int tool (ray-project#55417) ## Overview We scanned the Ray Data code with PyLint and found some defects. Here are some scan results based on Ray 2.46: "ray/python/ray/data/read_api.py:3214:4: R1705: Unnecessary "elif" after "return", remove the leading "el" from "elif" (no-else-return) ray/python/ray/data/datasource/file_based_datasource.py:276:20: R1730: Consider using 'num_threads = min(num_threads, len(read_paths))' instead of unnecessary if block (consider-using-min-builtin) R1705: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it (no-else-return)" Scanning the latest master branch yields similar results. ## Why are these changes needed? The modifications in this PR do not affect code logic or functionality, nor do they affect existing unit test cases. The aim is to reduce code complexity and redundant code without changing the code logic, and to improve the readability of the Ray code. ## Related issue number Closes ray-project#53881 ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [x] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
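To illustrate the two PyLint messages quoted above (R1705 `no-else-return` and R1730 `consider-using-min-builtin`), here is a minimal sketch. The function names are hypothetical and are not taken from `read_api.py` or `file_based_datasource.py`; the point is only that the "after" forms behave identically to the "before" forms.

```python
def clamp_threads_before(num_threads, read_paths):
    # R1730: an if block that only caps a value can be a min() call.
    if num_threads > len(read_paths):
        num_threads = len(read_paths)
    return num_threads


def clamp_threads_after(num_threads, read_paths):
    return min(num_threads, len(read_paths))


def sign_before(x):
    # R1705: "elif"/"else" after "return" is unnecessary.
    if x > 0:
        return "positive"
    elif x < 0:
        return "negative"
    else:
        return "zero"


def sign_after(x):
    if x > 0:
        return "positive"
    if x < 0:
        return "negative"
    return "zero"
```

Both rewrites only flatten control flow, which is why such changes should not alter existing test results.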

…oject#56168)

…project#56101) ## Why are these changes needed? 1. A new `//src/ray/raylet_client:raylet_client_interface` target containing only the `RayletClientInterface`. 2. A new `//src/ray/raylet_client:raylet_client_pool` target moved from the node_manager. 3. A new `//src/ray/raylet_client:node_manager_client` target moved from the node_manager. 4. Remove `using` statements in the `raylet_client.h` that allow others to omit `ray::` implicitly. There are no behavioral changes. ## Related issue number  ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Rueian <rueian@anyscale.com>

…roject#56077) Since we introduced panel groups to Default (ray-project#55620) & Data (ray-project#55495) dashboards, applications consuming Grafana dashboards can comfortably embed the full dashboard on any UI now (and the other dashboards are pretty usable even without them). Added a `"supportsFullGrafanaView"` tag to the `rayMeta` list in Default Dashboard to indicate to consumers that we support full Grafana dashboard embedding from now on. --------- Signed-off-by: anmol <anmol@anyscale.com> Co-authored-by: anmol <anmol@anyscale.com>

…or (ray-project#56050) ## Why are these changes needed? This is a followup from ray-project#54244 - Restrict `TryInitiateShutdown`, `TryTransitionToDisconnecting`, and `TryTransitionToShutdown` to private once all production code calls `RequestShutdown`. - Minimize API surface and prevent misuse; with a single entry point, internal transitions need not be externally callable. - Update tests to exercise only `RequestShutdown` ## Related issue number Closes ray-project#55739 --------- Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>

and perform more aggressive checks, so that people do not forget the declaration when adding new tags. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

…-project#56105) Historically, `ParquetDatasource` has fetched each file's Parquet footer metadata to obtain granular metadata about the dataset. While laudable in principle, this is inefficient in practice and results in extremely poor performance on very large datasets (tens to hundreds of thousands of files). This change revisits this approach by - Removing metadata fetching as a step (and deprecating involved components) - Cleaning up and streamlining some of the other utilities, further optimizing performance ## Related issue number ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Closes ray-project#55142

Creating asyncio tasks from one thread onto an event loop running on another thread is not thread-safe (though permissible) in asyncio. Currently this happens in two places in the router:
1. while handling assign_request
2. during creation of `_report_cached_metrics_forever`

This PR fixes that, so that task creation happens in a thread-safe manner. I validated that this does not break bulk task cancellation by rerunning the repro script from ray-project#52591.

--------- Signed-off-by: abrar <abrar@anyscale.com>
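For readers unfamiliar with the problem described above, the following is a minimal sketch of thread-safe scheduling onto an event loop owned by another thread. It uses the standard-library `asyncio.run_coroutine_threadsafe`; whether the router fix uses this call or `loop.call_soon_threadsafe` is not stated in the commit message, and `work` is a hypothetical coroutine.

```python
import asyncio
import threading


async def work(x):
    # Hypothetical coroutine standing in for the router's real work.
    await asyncio.sleep(0)
    return x * 2


def start_loop_in_thread():
    """Run a fresh event loop forever on a background thread."""
    loop = asyncio.new_event_loop()
    thread = threading.Thread(target=loop.run_forever, daemon=True)
    thread.start()
    return loop, thread


def submit_from_other_thread(loop, x):
    # Unlike loop.create_task() called from a foreign thread, this hands the
    # coroutine to the loop through its thread-safe queue and returns a
    # concurrent.futures.Future that the calling thread can wait on.
    future = asyncio.run_coroutine_threadsafe(work(x), loop)
    return future.result(timeout=5)


loop, thread = start_loop_in_thread()
result = submit_from_other_thread(loop, 21)

# Stopping the loop must also be requested in a thread-safe way.
loop.call_soon_threadsafe(loop.stop)
thread.join(timeout=5)
loop.close()
```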

…t#56155) Signed-off-by: dayshah <dhyey2019@gmail.com>

…essions in the release test (ray-project#56104)

Signed-off-by: dayshah <dhyey2019@gmail.com>

Used across GCS, Raylet, worker, so should be in `common/`. Also moved implementations to `.cc` file (aside from two single-line functions). --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>

## Why are these changes needed? Current tests are set up only to test the code when `DeploymentMode == EveryNode`. In this case, we have proxies on each node. When the mode is overridden with `HeadOnly` for any reason, the test suite fails. This change enables assertions in both deployment modes. ## Related issue number ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: omkar <omkar@anyscale.com> Signed-off-by: Omkar Kulkarni <omkar@anyscale.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>

…56023) Add ownership and a README describing how to modify the proto files in the public directory. This is related to recent work to define proto exposure via directory structure and to set expectations for maintainers and users of these protos. Test: - CI Signed-off-by: Cuong Nguyen <can@anyscale.com>

…ject#55762) The purpose of this change is to add metrics for monitoring the state of GCS and the raylet. It does so by repurposing RayConfig::instance().event_stats_metrics(). This environment variable previously enabled metric emission from all services and all instances of classes that wrapped calls to EventStats. That included classes such as instrumented_io_context, which is used prolifically throughout the code base. The main change is therefore to rename RAY_event_stats_metrics to RAY_emit_main_service_metrics and bring its usage back up to the main classes of GCS and Raylet, which then pass this config into the main io_contexts used by those services. Usage of EventStats becomes opt-in, defaulting to false for all other code paths. Over time, as we identify paths that need more monitoring, we can opt those code paths in; where we find this kind of monitoring isn't needed, we can move them off instrumented_io_context or event_stats completely (since the overhead isn't buying anything particularly useful). This PR also includes some cleanup and some metric type changes so that the types better match what the metrics are intended to measure. Specifically, operation_run_time_ms and operation_queue_time are now HISTOGRAM instead of GAUGE: knowing the run time or queue time of only the last event is less useful than a histogram, which gives a proper distribution for QoS. GAUGE does make sense for values that are absolute at a point in time (like queue length or CPU utilization). --------- Signed-off-by: zac <zac@anyscale.com>
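The GAUGE versus HISTOGRAM argument above can be seen in a few lines. This is a toy sketch with hypothetical classes, not Ray's metrics implementation; the bucket boundaries and sample run times are made up for illustration.

```python
from bisect import bisect_left


class ToyGauge:
    """Keeps only the most recently reported value."""

    def __init__(self):
        self.value = None

    def set(self, value):
        self.value = value


class ToyHistogram:
    """Counts observations per bucket; bucket i covers (b[i-1], b[i]]."""

    def __init__(self, boundaries):
        self.boundaries = list(boundaries)
        # One extra bucket for values above the last boundary.
        self.counts = [0] * (len(self.boundaries) + 1)

    def observe(self, value):
        self.counts[bisect_left(self.boundaries, value)] += 1


run_times_ms = [2, 3, 250, 4, 2]  # made-up samples with one slow outlier

gauge = ToyGauge()
histogram = ToyHistogram(boundaries=[1, 10, 100])
for t in run_times_ms:
    gauge.set(t)
    histogram.observe(t)

# The gauge ends at the last sample and hides the 250 ms outlier,
# while the histogram still shows one observation above 100 ms.
```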

…po root (ray-project#55989) Currently, this uses Bazel runfiles, which causes a problem when `run_release_test` is run as a Bazel binary: files in the working directory that are not included in the Bazel binary's data don't get packaged into the zip file when submitting as an Anyscale job. This switches to a path based on the Bazel workspace directory, which points to the source code and doesn't suffer from missing files in the zip file. --------- Signed-off-by: kevin <kevin@anyscale.com>
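As a rough sketch of the idea above: `bazel run` exports the `BUILD_WORKSPACE_DIRECTORY` environment variable, which points at the source checkout rather than the runfiles tree. Whether this PR reads that exact variable is an assumption, and `repo_root` is a hypothetical helper.

```python
import os
from pathlib import Path


def repo_root(env=None):
    """Return the source checkout root, preferring Bazel's workspace dir.

    `env` is injectable for testing; it defaults to the process environment.
    """
    env = os.environ if env is None else env
    workspace = env.get("BUILD_WORKSPACE_DIRECTORY")
    if workspace:
        # Set by `bazel run`; points at the real source tree, so files that
        # are not declared as Bazel data dependencies are still visible.
        return Path(workspace)
    # Assumption: outside Bazel, the script is started from the repo root.
    return Path.cwd()
```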

Make the ReturnWorkerLease RPC idempotent and fault tolerant. Added corresponding C++ and Python integration tests. This solves the issue mentioned in ray-project#55469, as we now use the lease ID rather than the worker ID to track granted leases on the raylet side. A retry of ReturnWorkerLease therefore cannot cause a premature return of an ongoing lease on the same worker: the retry and the current lease request carry different lease IDs, so the retry can simply be discarded. Signed-off-by: joshlee <joshlee@anyscale.com>
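The idempotency argument above can be sketched with a toy lease table. This is an illustration in Python with hypothetical names, not the raylet's C++ data structures.

```python
class ToyLeaseTable:
    """Tracks granted leases keyed by lease ID rather than worker ID."""

    def __init__(self):
        self._granted = {}  # lease_id -> worker_id

    def grant(self, lease_id, worker_id):
        self._granted[lease_id] = worker_id

    def return_lease(self, lease_id):
        # Idempotent: an unknown lease ID (for example a retried RPC for a
        # lease that was already returned) is discarded without side effects.
        return self._granted.pop(lease_id, None) is not None

    def active_leases(self, worker_id):
        return [l for l, w in self._granted.items() if w == worker_id]


table = ToyLeaseTable()
table.grant("lease-1", "worker-A")
first_return = table.return_lease("lease-1")    # original RPC
table.grant("lease-2", "worker-A")              # same worker, new lease
retried_return = table.return_lease("lease-1")  # late retry of the old RPC
```

Had the table been keyed by worker ID, the late retry would have matched `worker-A` and returned the ongoing `lease-2`.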

## Why are these changes needed? It is not used any more. ## Related issue number  ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>

…6152) Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>

…ectories (ray-project#56128) Signed-off-by: Potato <tanxinyu@apache.org> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>

…#56129) Signed-off-by: Potato <tanxinyu@apache.org> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

…ject#56179) Signed-off-by: joshlee <joshlee@anyscale.com>

…ay-project#55750)

…56495) Removes the Python version check for LLM compilation, since compilation already uses the `--python <ver>` flag. Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>

…resent (ray-project#56435) Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>

Initial user guide for GPU objects. Missing a couple things that we can add in follow-ups: - installation instructions - full API reference - performance numbers --------- Signed-off-by: Stephanie wang <smwang@cs.washington.edu> Signed-off-by: Stephanie Wang <smwang@cs.washington.edu> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: Qiaolin Yu <liin1211@outlook.com> Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>

…#56489)   ## Why are these changes needed?  This PR updates the batch inference release tests to make them easier to run and clearer: * Sets the group name to `batch-inference`, removing the need to list each test individually. * Renames batch_inference_hetero → image_embedding_from_jsonl and batch_inference → image_classification for clarity. * Sets the image and text embedding workloads to run weekly for consistent signal. ## Related issue number  ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

upgrade uv binary --------- Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>

Signed-off-by: joshlee <joshlee@anyscale.com>

…-project#56448) Signed-off-by: dayshah <dhyey2019@gmail.com>

Signed-off-by: joshlee <joshlee@anyscale.com>

…y-project#56483) Signed-off-by: Ibrahim Rabbani <irabbani@irabbani-JMY3JQDQW0.local> Signed-off-by: israbbani <israbbani@gmail.com> Co-authored-by: Ibrahim Rabbani <irabbani@irabbani-JMY3JQDQW0.local> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>

so that it is easier to detect the ray version in the image Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

The existing one seemed to do nothing... swapped to using the recommendation from this [stack overflow post](https://stackoverflow.com/questions/55965712/how-do-i-add-clang-formatting-to-pre-commit-hook). --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>

building ray img lockfiles for all supported python versions --------- Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com> Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>

added more tests for asynchronous inference for the below cases: - metrics - health checks - cancel tasks --------- Signed-off-by: harshit <harshit@anyscale.com>

Introduce proxy actor interface. Signed-off-by: Omkar Kulkarni <omkar@anyscale.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Should fix the Windows test; I am 90% sure. I could not test this manually because I was unsuccessful in running test_logging on Windows using this runbook https://www.notion.so/anyscale-hq/How-to-debug-Windows-tests-20e027c809cb803b92c8c796266b7852?source=copy_link. I am sure there is a way, but I am not investing more time in this. --------- Signed-off-by: abrar <abrar@anyscale.com>

The cpp api is only tested on `:ray: core: cpp worker tests`, but we still build it on most CI steps. For example, this commit was only broken for the cpp api and nothing else, but almost every single CI step broke: https://buildkite.com/ray-project/premerge/builds/48767 This sets `RAY_DISABLE_EXTRA_CPP` in the test containers so the cpp api doesn't need to be rebuilt on every test step. This should make CI a bit faster when making core C++ changes that cause the cpp api to rebuild. It will still be built when we build the wheels, so any compilation errors in the cpp api will be caught there. Signed-off-by: dayshah <dhyey2019@gmail.com>

…ject#56440) ## Why are these changes needed? The check is redundant here, since the `initial_size` can't be smaller than `min_size` (which must be bigger than 1) ## Related issue number ray-project#56370 (comment) ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>

Making `gcs` contain only the GCS component's files. --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>

…-project#56503)   ## Why are these changes needed? in ray-project#56428 I accidentally added the wrong throughput graph. This is row throughput I wanted.  ## Related issue number  ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Signed-off-by: Alan Guo <aguo@anyscale.com>

## script used for benchmarking

```python
import logging
import time
from typing import Optional

import psutil

from ray import serve
# The original script imported this from `python.ray._common.test_utils`,
# which only resolves when run from the repository root.
from ray._common.test_utils import wait_for_condition
from ray.util.state import list_actors

logger = logging.getLogger("ray.serve")


@serve.deployment(max_ongoing_requests=1000)
class MemoryLeakTest:
    async def __call__(self):
        logger.info("MemoryLeakTest")
        return "MemoryLeakTest"


app = serve.run(MemoryLeakTest.bind(), logging_config={
    "encoding": "JSON",
})


def get_replica_pid() -> Optional[int]:
    all_current_actors = list_actors(filters=[("state", "=", "ALIVE")])
    for actor in all_current_actors:
        if "MemoryLeakTest" in actor["name"]:
            return actor["pid"]
    return None


wait_for_condition(get_replica_pid)
print(get_replica_pid())


# Track the memory of the replica in a loop, in MB.
def track_memory():
    pid = get_replica_pid()
    if pid is not None:
        process = psutil.Process(pid)
        return process.memory_info().rss / 1024 / 1024
    return None


while True:
    memory_mb = track_memory()
    print(f"\rMemory usage: {memory_mb:.2f} MB", end="", flush=True)
    time.sleep(.1)
```

Simulated load using `ab -n 500 -c 1 http://127.0.0.1:8000/` and used [memray](https://bloomberg.github.io/memray/tutorials/1.html) to profile the proxy process, following the instructions from [here](https://docs.ray.io/en/latest/ray-observability/user-guides/debug-apps/debug-memory.html#memory-profiling-ray-tasks-and-actors).

### On master
<img width="1164" height="628" alt="image" src="https://github.com/user-attachments/assets/50d22e10-3206-4aeb-9585-97245523a5cb" />

### With fix
<img width="1161" height="621" alt="image" src="https://github.com/user-attachments/assets/19224538-cbd7-4be7-b830-29e1b468625f" />

When we reduce the garbage collection (GC) frequency to every 10k allocations, proxy memory peaks at **1.3 GB** for my test workload. By contrast, under the default GC frequency (700 allocations), peak RSS memory is **700 MB**.

The higher memory footprint with less frequent GC occurs because this workload involves large object transactions. With GC running only after 10k allocations, these large objects remain in RSS longer, inflating memory usage until a collection cycle is triggered. Importantly, I found no evidence of a memory leak under sustained load. With the fix, memory stabilizes at around **700 MB**, and even without the fix, usage plateaus at **1.3 GB** rather than growing unbounded. The reduced GC frequency was added in ray-project#49720 as a performance optimization, so we are taking a slight hit in RPS in exchange for stable memory usage with larger payloads.

--------- Signed-off-by: abrar <abrar@anyscale.com>
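For context on the "10k allocations" versus "700 allocations" comparison: the sketch below assumes the GC frequency in question corresponds to CPython's generation-0 threshold, which is adjustable through the standard `gc` module. How the Serve proxy configures it is not shown in this commit message, and `set_gen0_threshold` is a hypothetical helper.

```python
import gc


def set_gen0_threshold(gen0_threshold):
    """Set the generation-0 GC threshold and return the previous thresholds.

    A higher value means fewer collection cycles (less CPU overhead) but
    lets unreachable cyclic garbage stay in memory longer between cycles.
    """
    previous = gc.get_threshold()
    gc.set_threshold(gen0_threshold, *previous[1:])
    return previous


previous = set_gen0_threshold(10_000)  # collect far less often
raised = gc.get_threshold()

gc.set_threshold(*previous)            # restore the interpreter's setting
restored = gc.get_threshold()
```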

using `--check` feature to verify llm lock files are unchanged --------- Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com> Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>

this allows using `base-extra` or `base-extra-testdeps` or other base variations for building ray images. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

Non-GCS component files have been moved; no longer need the nesting. --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>

…#56551) they are only used within the class Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

make the check stricter Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

…-project#56458) ## Why are these changes needed? This PR addresses Problem 2 raised in issue ray-project#44226. The main aim is to enable KubeRay to check the status of only DECLARATIVE Serve apps. The solution builds on top of ray-project#45522. Based on my current understanding, KubeRay should only operate on DECLARATIVE Serve apps, so my solution involves two key steps: 1. This PR: update the /api/serve/applications/ endpoint to read the APIType from the request body and pass it on to the controller (controller.get_serve_instance_details). 2. Next: modify KubeRay to explicitly pass Declarative as the APIType when calling /api/serve/applications/. ## Related issue number ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: jugalshah291 <shah.jugal291@gmail.com> Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>
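A rough sketch of the filtering idea described above. The enum, the request field name `api_type`, and the helper are all hypothetical stand-ins for illustration; they are not the dashboard's or the Serve controller's actual code.

```python
from enum import Enum


class ToyAPIType(str, Enum):
    # Hypothetical stand-in for Serve's API type enum.
    IMPERATIVE = "imperative"
    DECLARATIVE = "declarative"


def filter_applications(apps, request_body):
    """Return only the apps matching the API type named in the request.

    `apps` maps app name -> ToyAPIType. If the request names no API type,
    every app is returned, which keeps existing callers working.
    """
    raw = request_body.get("api_type")  # hypothetical field name
    if raw is None:
        return dict(apps)
    wanted = ToyAPIType(raw.lower())  # raises ValueError on unknown values
    return {name: kind for name, kind in apps.items() if kind == wanted}


apps = {
    "from_config": ToyAPIType.DECLARATIVE,
    "from_serve_run": ToyAPIType.IMPERATIVE,
}
declarative_only = filter_applications(apps, {"api_type": "Declarative"})
everything = filter_applications(apps, {})
```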

…tage Signed-off-by: Guy Stone <guys@spotify.com>
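The commit above carries no description, so here is a hedged sketch of what "nested image_url format" refers to. In OpenAI-style chat messages, an image content part is written as `{"type": "image_url", "image_url": {"url": ...}}`, with the URL nested one level down. The helper below is hypothetical and is not the PrepareImageStage implementation; it only shows one way to accept both the flat and the nested shape.

```python
def extract_image_ref(content_part):
    """Return the image reference from a chat content part, or None.

    Accepts a flat value ({"type": "image_url", "image_url": "<url>"}) as
    well as the nested OpenAI form ({"image_url": {"url": "<url>"}}).
    """
    kind = content_part.get("type")
    if kind not in ("image", "image_url"):
        return None  # text or any other non-image part
    value = content_part[kind]
    if isinstance(value, dict):
        # Nested form: the actual reference lives under the "url" key.
        return value["url"]
    return value


nested = {"type": "image_url", "image_url": {"url": "https://example.com/a.png"}}
flat = {"type": "image_url", "image_url": "https://example.com/a.png"}
text = {"type": "text", "text": "What is in this image?"}
```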

… condition (ray-project#55367)

## Why are these changes needed?

Workers crash with a fatal `RAY_CHECK` failure when the plasma store connection is broken during shutdown, causing the following error:

```
RAY_CHECK failed: PutInLocalPlasmaStore(object, object_id, true) Status not OK: IOError: Broken pipe
```

Stacktrace:

```
core_worker.cc:720 C Check failed: PutInLocalPlasmaStore(object, object_id, true) Status not OK: IOError: Broken pipe
*** StackTrace Information ***
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x141789a) [0x7924dd2c689a] ray::operator<<()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray6RayLogD1Ev+0x479) [0x7924dd2c9319] ray::RayLog::~RayLog()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x95cc8a) [0x7924dc80bc8a] ray::core::CoreWorker::CoreWorker()::{lambda()#13}::operator()()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core11TaskManager27MarkTaskReturnObjectsFailedERKNS_17TaskSpecificationENS_3rpc9ErrorTypeEPKNS5_12RayErrorInfoERKN4absl12lts_2023080213flat_hash_setINS_8ObjectIDENSB_13hash_internal4HashISD_EESt8equal_toISD_ESaISD_EEE+0x679) [0x7924dc868f29] ray::core::TaskManager::MarkTaskReturnObjectsFailed()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core11TaskManager15FailPendingTaskERKNS_6TaskIDENS_3rpc9ErrorTypeEPKNS_6StatusEPKNS5_12RayErrorInfoE+0x416) [0x7924dc86f186] ray::core::TaskManager::FailPendingTask()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x9a90e6) [0x7924dc8580e6] ray::core::NormalTaskSubmitter::RequestNewWorkerIfNeeded()::{lambda()#1}::operator()()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray3rpc14ClientCallImplINS0_23RequestWorkerLeaseReplyEE15OnReplyReceivedEv+0x68) [0x7924dc94aa48] ray::rpc::ClientCallImpl<>::OnReplyReceived()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFvvEZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E9_M_invokeERKSt9_Any_data+0x15) [0x7924dc79e285] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0xd9b4c8) [0x7924dcc4a4c8] EventTracker::RecordExecution()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0xd4648e) [0x7924dcbf548e] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0xd46906) [0x7924dcbf5906] boost::asio::detail::completion_handler<>::do_complete()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x13f417b) [0x7924dd2a317b] boost::asio::detail::scheduler::do_run_one()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x13f5af9) [0x7924dd2a4af9] boost::asio::detail::scheduler::run()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x13f6202) [0x7924dd2a5202] boost::asio::io_context::run()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker12RunIOServiceEv+0x91) [0x7924dc793a61] ray::core::CoreWorker::RunIOService()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0xcba0b0) [0x7924dcb690b0] thread_proxy
/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7924dde71ac3]
/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7924ddf03850]
```

Stack trace flow:

1. Task lease request fails -> `NormalTaskSubmitter::RequestNewWorkerIfNeeded()` callback.
2. Triggers `TaskManager::FailPendingTask()` -> `MarkTaskReturnObjectsFailed()`.
3. System attempts to store error objects in plasma via `put_in_local_plasma_callback_`.
4. Plasma connection is broken (raylet/plasma store already shut down).
5. `RAY_CHECK_OK()` in the callback causes fatal crash instead of graceful handling.

Root Cause:

This is a shutdown ordering race condition:

1. Raylet shuts down first: The raylet stops its IO context ([main_service_.stop()](https://github.com/ray-project/ray/blob/77c5475195e56a26891d88460973198391d20edf/src/ray/object_manager/plasma/store_runner.cc#L146)) which closes plasma store connections.
2. Worker still processes callbacks: Core worker continues processing pending callbacks on separate threads.
3. Broken connection: When the callback tries to store error objects in plasma, the connection is already closed.
4. Fatal crash: The `RAY_CHECK_OK()` treats this as an unexpected error and crashes the process.

Fix:

1. Shutdown-aware plasma operations
   - Add `CoreWorker::IsShuttingDown()` method to check shutdown state.
   - Skip plasma operations entirely when shutdown is in progress.
   - Prevents attempting operations on already-closed connections.
2. Targeted error handling for connection failures
   - Replace blanket `RAY_CHECK_OK()` with specific error type checking.
   - Handle connection errors (Broken pipe, Connection reset, Bad file descriptor) as warnings during shutdown scenarios.
   - Maintain `RAY_CHECK_OK()` for other error types to catch real issues.

---------

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
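
The control flow of the fix described above can be modelled in a few lines. This is only a Python sketch of the logic, not the C++ change itself: `put_error_object`, `PutOutcome`, `FatalCheckError`, and the callables passed in are hypothetical names, and matching on message text is an assumption made for illustration (the commit only says that broken pipe, connection reset, and bad file descriptor errors are downgraded to warnings).

```python
from enum import Enum, auto
from typing import Callable

# Connection-failure messages named in the commit message.
CONNECTION_ERROR_MARKERS = ("Broken pipe", "Connection reset", "Bad file descriptor")


class PutOutcome(Enum):
    STORED = auto()
    SKIPPED_SHUTDOWN = auto()
    WARNED_CONNECTION_ERROR = auto()


class FatalCheckError(RuntimeError):
    """Hypothetical stand-in for a failed RAY_CHECK_OK()."""


def put_error_object(
    put_in_local_plasma: Callable[[], None],
    is_shutting_down: Callable[[], bool],
    log_warning: Callable[[str], None] = print,
) -> PutOutcome:
    # 1. Shutdown-aware: do not touch plasma once shutdown has started.
    if is_shutting_down():
        return PutOutcome.SKIPPED_SHUTDOWN
    try:
        put_in_local_plasma()
    except OSError as err:
        message = str(err)
        # 2. Targeted handling: a closed connection is expected when the
        #    raylet wins the shutdown race, so warn instead of crashing.
        if any(marker in message for marker in CONNECTION_ERROR_MARKERS):
            log_warning(f"Skipping plasma put, connection lost: {message}")
            return PutOutcome.WARNED_CONNECTION_ERROR
        # Any other failure is still treated as fatal.
        raise FatalCheckError(message) from err
    return PutOutcome.STORED
```

The second branch matters because the shutdown flag alone cannot close the race: the raylet may close the connection after the flag was checked but before the put runs.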

czgdp1807 and others added 30 commits September 2, 2025 10:17

Enable ruff lint for python/ray/tests/ (ray-project#56079)

327bb28

Signed-off-by: czgdp1807 <gdp.1807@gmail.com>

[core][autoscaler] fix races when waiting for stopping aws nodes to b…

de87f6b

…e reused by cache_stopped_nodes (ray-project#56007) Signed-off-by: Rueian <rueian@anyscale.com>

[release test] split images into its own group (ray-project#56122)

7cd7ba7

and add skip-on-release-tests tag for skipping steps to run on release tests Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

[Core] Change the size of test_gpu_objects_gloo.py to large (ray-pr…

eb50be6

…oject#56168)

[ci] declare test rule tags in test_rules file (ray-project#56127)

01a5435

and perform more aggressive checks, so that people do not forget to declare new tags when adding them. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

[core] Revert granted check change from ray-project#55806 (ray-projec…

44f1c7f

…t#56155) Signed-off-by: dayshah <dhyey2019@gmail.com>

[Serve.llm][PD] changed LMCache dependency to use 0.3.3 to avoid regr…

0c1ae99

…essions in the release test (ray-project#56104)

[core] Fix test_kill_raylet_signal_log on mac (ray-project#56151)

7e48cc4

Signed-off-by: dayshah <dhyey2019@gmail.com>

[core] Move gcs_pb_utils to common (ray-project#56160)

acea036

Used across GCS, Raylet, worker, so should be in `common/`. Also moved implementations to `.cc` file (aside from two single-line functions). --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>

[core] Fix flaky ShutdownCoordinator test under tsan (ray-project#5…

bd684c7

…6152) Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>

[core] Revert lease spec optimization from ray-project#55806 (ray-pro…

252403f

…ject#56179) Signed-off-by: joshlee <joshlee@anyscale.com>

lk-chen and others added 29 commits September 12, 2025 14:46

[LLM][Serve] Allow setting data_parallel_size=1 in engine_kwargs (r…

ef9168e

…ay-project#55750)

[ci] removing python ver check for llm lockfile compile (ray-project#…

3054b71

…56495) Remove the Python version check for LLM lockfile compilation, since compilation already uses the `--python <ver>` flag. Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>

[Data.llm] Fix multimodal image extraction when no system prompt is p…

1028dcc

…resent (ray-project#56435) Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>

[ci] upgrading uv ver 0.8.17 (latest) (ray-project#56494)

e26d21a

upgrade uv binary --------- Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>

[core] Fix ASAN issues in object manager test (ray-project#56492)

ce4e473

Signed-off-by: joshlee <joshlee@anyscale.com>

[core] Don't hold shared ptr to client in actor submitter queues (ray…

895d78b

…-project#56448) Signed-off-by: dayshah <dhyey2019@gmail.com>

[core] Fixing timeout in test_object_spilling_3.py (ray-project#56512)

25bb624

Signed-off-by: joshlee <joshlee@anyscale.com>

[core] Fix UBSAN errors in object_manager_test (ray-project#56521)

8259540

Signed-off-by: joshlee <joshlee@anyscale.com>

[image] add label for ray version and commit (ray-project#56493)

33200dd

so that it is easier to detect the ray version in the image Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

[ci][deps] raydepsets: building ray img lockfiles (ray-project#56444)

f7ddcbe

building ray img lockfiles for all supported python versions --------- Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com> Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>

add more tests for async inf (ray-project#56408)

03e5cd9

Added more tests for asynchronous inference, covering metrics, health checks, and task cancellation. --------- Signed-off-by: harshit <harshit@anyscale.com>

[SERVE] Proxy Actor Interface (ray-project#56288)

97e2b32

Introduce proxy actor interface. Signed-off-by: Omkar Kulkarni <omkar@anyscale.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

[core] Move gcs_client out of gcs directory (ray-project#56515)

72eb7a6

Making `gcs` contain only the GCS component's files. --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>

[ci] updating raydepsets llm check (ray-project#56439)

ab72665

using `--check` feature to verify llm lock files are unchanged --------- Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com> Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>

[image] allow using explicit base type (ray-project#56545)

ac90fa0

this allows using `base-extra` or `base-extra-testdeps` or other base variations for building ray images. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

[core] Remove gcs_server directory nesting (ray-project#56516)

dc01c8a

Non-GCS component files have been moved; no longer need the nesting. --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>

[image] change tag methods of container class to private (ray-project…

212ce29

…#56551) they are only used within the class Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

[image] add ray-llm image type check (ray-project#56542)

e5e4ae3

make the check stricter Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

[Data][LLM] Support openai's nested image_url format in PrepareImageS…

9ab7a16

…tage Signed-off-by: Guy Stone <guys@spotify.com>
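
The PR description is empty, so the following is only a sketch of what supporting the nested format could look like, not the code in this commit. In the OpenAI chat format an image content part is `{"type": "image_url", "image_url": {"url": "..."}}`; the flat form handled here (`{"type": "image", "image": "..."}`) and the helper name `extract_image_refs` are assumptions for illustration.

```python
from typing import Any, Dict, List


def extract_image_refs(messages: List[Dict[str, Any]]) -> List[Any]:
    """Collect image references from chat messages (hypothetical helper)."""
    refs: List[Any] = []
    for message in messages:
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain-string content carries no image parts
        for part in content:
            if not isinstance(part, dict):
                continue
            kind = part.get("type")
            if kind not in ("image", "image_url"):
                continue
            value = part.get(kind)
            # Nested OpenAI form: {"image_url": {"url": "..."}}.
            if isinstance(value, dict):
                value = value.get("url")
            if value is None:
                raise ValueError(f"Image part has no usable reference: {part!r}")
            refs.append(value)
    return refs
```

Unwrapping the nested dictionary in one place lets the rest of the stage keep treating every image as a single reference, whichever form the caller used.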

[Data][LLM] Support openai's nested image_url format in PrepareImageStage #1

GuyStone wants to merge 634 commits into master from openai-nested-imageurl

GuyStone commented Sep 16, 2025

20 participants
