[Data][LLM] Support openai's nested image_url format in PrepareImageStage#1
Draft
Signed-off-by: czgdp1807 <gdp.1807@gmail.com>
…es in Ray Serve docs (ray-project#56131) Signed-off-by: Potato <tanxinyu@apache.org> Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com> Co-authored-by: Douglas Strodtman <douglas@anyscale.com>
…e reused by cache_stopped_nodes (ray-project#56007) Signed-off-by: Rueian <rueian@anyscale.com>
…56133) ## Why are these changes needed? This reverts PR ray-project#52380. When working with large data blocks, this log can dump the entire block to the terminal, which is both spammy and insecure. ## Related issue number Fixes ray-project#56092 --------- Signed-off-by: Praveen Gorthy <praveeng@anyscale.com> Signed-off-by: Praveen <gorthypraveen@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
and add skip-on-release-tests tag for skipping steps to run on release tests Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
minor typo --------- Signed-off-by: weiliango <weiliang.dev@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…s resource limits assignment (ray-project#56051) ## Why are these changes needed? The performance tips documentation for setting resource limits in ExecutionOptions is no longer correct and gives an error when directly setting them in 2.49 after ray-project#54694. Update the documentation to show how to correctly set them. ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [x] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Signed-off-by: Jack Gammack <49536617+JackGammack@users.noreply.github.com>
…on (ray-project#56066)
This PR addresses various grammar, punctuation, and formatting issues throughout the Ray Data documentation in `doc/source/data/` to improve clarity and readability.
## Changes Made
**Grammar Fixes:**
- Fixed verb agreement errors in `key-concepts.rst` ("define" → "defines", "translate" → "translates")
- Corrected missing articles and prepositions ("sharding your dataset" → "sharding of your dataset")
- Fixed awkward phrasing in `saving-data.rst` ("have the control" → "have control")
- Improved sentence flow in multiple files ("like following" → "as follows")
**Formatting Improvements:**
- Restructured bullet list formatting in `aggregations.rst` for better readability
- Added missing punctuation and commas for proper sentence structure
- Improved note formatting and punctuation consistency
**Files Modified:**
- `doc/source/data/key-concepts.rst` - 3 grammar corrections
- `doc/source/data/user-guide.rst` - 1 verb form correction
- `doc/source/data/aggregations.rst` - Bullet list formatting improvement
- `doc/source/data/joining-data.rst` - 2 grammar and punctuation fixes
- `doc/source/data/comparisons.rst` - 1 preposition correction
- `doc/source/data/data-internals.rst` - 1 punctuation fix
- `doc/source/data/saving-data.rst` - 1 phrasing improvement
## Review Methodology
The review was conducted manually across all 45 files in the `doc/source/data/` directory, focusing specifically on:
- Typos and spelling errors
- Grammar and syntax issues
- RST formatting consistency
- Punctuation and capitalization
The approach was conservative, making only clear corrections without rewriting content for style, preserving the technical accuracy and existing tone of the documentation.
## Impact
These changes improve the overall quality and professionalism of the Ray Data documentation while maintaining all technical content and existing structure. The fixes address common grammatical issues that could distract readers from the technical content.
--------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
…int tool (ray-project#55417)
## Overview
We scanned the Ray Data code with Pylint and found some defects. Here are some scan results based on Ray 2.46:
- `ray/python/ray/data/read_api.py:3214:4: R1705: Unnecessary "elif" after "return", remove the leading "el" from "elif" (no-else-return)`
- `ray/python/ray/data/datasource/file_based_datasource.py:276:20: R1730: Consider using 'num_threads = min(num_threads, len(read_paths))' instead of unnecessary if block (consider-using-min-builtin)`
- `R1705: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it (no-else-return)`
Scanning the latest master branch yields similar results.
## Why are these changes needed?
The modifications in this PR do not affect the code logic or functionality, nor do they affect existing unit test cases. The aim is to reduce code complexity and redundant code without changing the logic, and to improve the readability of the Ray code.
## Related issue number
Closes ray-project#53881
## Checks
- [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [x] Unit tests
- [x] Release tests
- [ ] This PR is not tested :(
---------
Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
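The two Pylint checks quoted in the commit message above (R1705 `no-else-return` and R1730 `consider-using-min-builtin`) can be illustrated with a small before/after sketch. The function names and bodies here are hypothetical and are not taken from `read_api.py` or `file_based_datasource.py`; they only show the shape of the refactor and that behavior is unchanged.

```python
def describe_before(n: int) -> str:
    # Shape flagged by R1705: "elif"/"else" following a "return".
    if n < 0:
        return "negative"
    elif n == 0:
        return "zero"
    else:
        return "positive"


def describe_after(n: int) -> str:
    # Same logic with the redundant "elif"/"else" removed.
    if n < 0:
        return "negative"
    if n == 0:
        return "zero"
    return "positive"


def clamp_threads_before(num_threads: int, read_paths: list) -> int:
    # Shape flagged by R1730: an if block that re-implements min().
    if num_threads > len(read_paths):
        num_threads = len(read_paths)
    return num_threads


def clamp_threads_after(num_threads: int, read_paths: list) -> int:
    # Equivalent one-liner using the builtin.
    return min(num_threads, len(read_paths))
```

Both pairs return the same values for every input, which is why such changes should not require new unit tests.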
…project#56101) ## Why are these changes needed? 1. A new `//src/ray/raylet_client:raylet_client_interface` target containing only the `RayletClientInterface`. 2. A new `//src/ray/raylet_client:raylet_client_pool` target moved from the node_manager. 3. A new `//src/ray/raylet_client:node_manager_client` target moved from the node_manager. 4. Remove `using` statements in the `raylet_client.h` that allow others to omit `ray::` implicitly. There are no behavioral changes. ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Rueian <rueian@anyscale.com>
…roject#56077) Since we introduced panel groups to Default (ray-project#55620) & Data (ray-project#55495) dashboards, applications consuming Grafana dashboards can comfortably embed the full dashboard on any UI now (and the other dashboards are pretty usable even without them). Added a `"supportsFullGrafanaView"` tag to the `rayMeta` list in Default Dashboard to indicate to consumers that we support full Grafana dashboard embedding from now on. --------- Signed-off-by: anmol <anmol@anyscale.com> Co-authored-by: anmol <anmol@anyscale.com>
…or (ray-project#56050) ## Why are these changes needed? This is a followup from ray-project#54244 - Restrict `TryInitiateShutdown`, `TryTransitionToDisconnecting`, and `TryTransitionToShutdown` to private once all production code calls `RequestShutdown`. - Minimize API surface and prevent misuse; with a single entry point, internal transitions need not be externally callable. - Update tests to exercise only `RequestShutdown` ## Related issue number Closes ray-project#55739 --------- Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
and perform more aggressive checks, so that people do not forget to declare new tags when adding them. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…-project#56105)
Historically, `ParquetDatasource` has fetched each file's Parquet footer metadata to obtain granular metadata about the dataset. While laudable in principle, this is inefficient in practice and results in extremely poor performance on very large datasets (tens to hundreds of thousands of files).
This change revisits this approach by:
- Removing metadata fetching as a step (and deprecating the components involved)
- Cleaning up and streamlining some of the other utilities, further optimizing performance
---------
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Closes ray-project#55142
Creating asyncio tasks from one thread on an event loop running in another thread is not thread-safe (though asyncio permits it). Currently this happens in two places in the router:
1. while handling assign_request
2. while creating `_report_cached_metrics_forever`
This PR fixes that, so that task creation happens in a thread-safe manner. I validated that this does not break bulk task cancellation by rerunning the repro script from ray-project#52591
---------
Signed-off-by: abrar <abrar@anyscale.com>
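As a minimal sketch of the thread-safety concern described above, the following uses only the standard library. It is not the router's code and not necessarily the approach this PR takes; `handle_request` and `start_loop_in_thread` are hypothetical names. It shows one standard way to schedule a coroutine on a loop owned by another thread, `asyncio.run_coroutine_threadsafe`.

```python
import asyncio
import threading


async def handle_request(x: int) -> int:
    # Stand-in for work that must run on the loop's own thread.
    await asyncio.sleep(0)
    return x * 2


def start_loop_in_thread() -> asyncio.AbstractEventLoop:
    # Run a dedicated event loop in a background thread.
    loop = asyncio.new_event_loop()
    threading.Thread(target=loop.run_forever, daemon=True).start()
    return loop


loop = start_loop_in_thread()

# Calling loop.create_task(...) from this (foreign) thread would not be
# thread-safe. run_coroutine_threadsafe hands the coroutine to the loop's
# thread and returns a concurrent.futures.Future we can wait on here.
future = asyncio.run_coroutine_threadsafe(handle_request(21), loop)
result = future.result(timeout=5)

# Stopping the loop from another thread must also go through a
# thread-safe call.
loop.call_soon_threadsafe(loop.stop)
```

`loop.call_soon_threadsafe` is the lower-level alternative when the caller does not need the result back.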
…t#56155) Signed-off-by: dayshah <dhyey2019@gmail.com>
…essions in the release test (ray-project#56104)
Signed-off-by: dayshah <dhyey2019@gmail.com>
Used across GCS, Raylet, worker, so should be in `common/`. Also moved implementations to `.cc` file (aside from two single-line functions). --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
## Why are these changes needed? The current tests are set up only to test the code when `DeploymentMode == EveryNode`. In this case, we have proxies on each node. When the mode is overridden with `HeadOnly` for any reason, the test suite fails. This change enables assertions in both deployment modes. --------- Signed-off-by: omkar <omkar@anyscale.com> Signed-off-by: Omkar Kulkarni <omkar@anyscale.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>
…56023) Add ownership and README for how to modify the proto files in the public directory. This is related to a recent work to define proto exposure via directory structure and set expectations for maintainer/users of these proto. Test: - CI Signed-off-by: Cuong Nguyen <can@anyscale.com>
…ject#55762)
The purpose of this change is to add metrics for monitoring the state of GCS and raylet. It does so by repurposing RayConfig::instance().event_stats_metrics(). This OS environment variable previously enabled metric emission from all services and all instances of classes which wrapped calls to EventStats. This included things like instrumented_io_context, which is a fairly prolific class throughout the code base.
The main thing we do in this change is rename RAY_event_stats_metrics to RAY_emit_main_service_metrics and bring its usage back up to the main classes of GCS and Raylet, which then pass this config into the main io_contexts used by those services. We then make usage of EventStats opt-in, defaulting to false for all other code paths. As time goes on and we identify paths we want more monitoring on, we can opt in those code paths, and where we don't care to have this kind of monitoring, move them off instrumented_io_context or event_stats completely (since we're not using the overhead for anything particularly useful).
This PR also includes some cleanup here and there and some metric type changes to better match what the metrics seem intended to do. Specifically, operation_run_time_ms and operation_queue_time have been updated to be HISTOGRAM instead of GAUGE. The reason is that knowing the run time or queue time of the last event isn't as useful as a histogram view, which gives proper distributions on QoS. GAUGE does make sense for values which are absolute at a certain time (like queue length or CPU utilization).
---------
Signed-off-by: zac <zac@anyscale.com>
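The GAUGE versus HISTOGRAM reasoning above can be sketched in plain Python. The `Gauge` and `Histogram` classes below are hypothetical toy stand-ins, not Ray's metric types, and the bucket boundaries and sample run times are made up for illustration.

```python
from bisect import bisect_left


class Gauge:
    """Toy gauge: remembers only the most recent value."""

    def __init__(self):
        self.value = None

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Toy histogram: counts observations per bucket."""

    def __init__(self, boundaries):
        self.boundaries = sorted(boundaries)
        # One bucket per boundary (value <= boundary) plus an overflow bucket.
        self.counts = [0] * (len(self.boundaries) + 1)

    def observe(self, value: float) -> None:
        self.counts[bisect_left(self.boundaries, value)] += 1


run_times_ms = [2, 3, 250, 4, 3]  # made-up samples with one slow outlier

gauge = Gauge()
histogram = Histogram(boundaries=[5, 50, 500])
for ms in run_times_ms:
    gauge.set(ms)
    histogram.observe(ms)

print(gauge.value)       # only the last sample survives
print(histogram.counts)  # the slow outlier stays visible in its bucket
```

With a gauge, the 250 ms outlier is overwritten by the next sample; the histogram keeps it, which is what makes distribution-style metrics more useful for run time and queue time.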
…po root (ray-project#55989) Currently, this uses Bazel runfiles, which causes a problem: when `run_release_test` is called as a binary with Bazel, files in the working directory that are not included in the Bazel binary data don't get packaged into the zip file when submitting as an Anyscale job. This switches to a path based on the Bazel workspace directory, which points to the source code and doesn't have the issue of missing files in the zip file. --------- Signed-off-by: kevin <kevin@anyscale.com>
Making the ReturnWorkerLease RPC idempotent and fault tolerant. Added corresponding C++ and Python integration tests. This solves the issue mentioned in ray-project#55469, as we now use the lease ID rather than the worker ID to track granted leases on the raylet side. Hence, a retry of ReturnWorkerLease will not cause a pre-emptive return of an ongoing lease on the same worker: the retry and the current lease request carry different lease IDs, so the retry can simply be discarded. Signed-off-by: joshlee <joshlee@anyscale.com>
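The idea of keying granted leases by lease ID rather than worker ID can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the raylet's C++ implementation; `LeaseTable`, `grant`, and `return_lease` are hypothetical names.

```python
class LeaseTable:
    """Toy table of granted leases, keyed by lease ID (hypothetical)."""

    def __init__(self):
        self._granted = {}  # lease_id -> worker_id

    def grant(self, lease_id: str, worker_id: str) -> None:
        self._granted[lease_id] = worker_id

    def return_lease(self, lease_id: str) -> bool:
        # Idempotent: returning an unknown or already-returned lease is a
        # no-op, so a retried RPC cannot affect a newer lease on the worker.
        return self._granted.pop(lease_id, None) is not None

    def is_granted(self, lease_id: str) -> bool:
        return lease_id in self._granted


table = LeaseTable()
table.grant("lease-1", "worker-A")
first_return = table.return_lease("lease-1")    # original RPC
table.grant("lease-2", "worker-A")              # same worker, new lease
retried_return = table.return_lease("lease-1")  # late retry of the old RPC
```

If the table were keyed by worker ID instead, the late retry would match `worker-A` and release `lease-2` while it is still in use.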
## Why are these changes needed? It is not used any more. Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
…6152) Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
…ectories (ray-project#56128) Signed-off-by: Potato <tanxinyu@apache.org> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
…#56129) Signed-off-by: Potato <tanxinyu@apache.org> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ject#56179) Signed-off-by: joshlee <joshlee@anyscale.com>
…56495) Remove the Python version check for LLM compilation; compilation already uses the `--python <ver>` flag. Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…resent (ray-project#56435) Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Initial user guide for GPU objects. Missing a couple things that we can add in follow-ups: - installation instructions - full API reference - performance numbers --------- Signed-off-by: Stephanie wang <smwang@cs.washington.edu> Signed-off-by: Stephanie Wang <smwang@cs.washington.edu> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: Qiaolin Yu <liin1211@outlook.com> Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
…#56489)
## Why are these changes needed?
This PR updates the batch inference release tests to make them easier to run and clearer:
* Sets the group name to `batch-inference`, removing the need to list each test individually.
* Renames batch_inference_hetero → image_embedding_from_jsonl and batch_inference → image_classification for clarity.
* Sets the image and text embedding workloads to run weekly for consistent signal.
---------
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
upgrade uv binary --------- Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
…-project#56448) Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
…y-project#56483) Signed-off-by: Ibrahim Rabbani <irabbani@irabbani-JMY3JQDQW0.local> Signed-off-by: israbbani <israbbani@gmail.com> Co-authored-by: Ibrahim Rabbani <irabbani@irabbani-JMY3JQDQW0.local> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
so that it is easier to detect the ray version in the image Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
The existing one seemed to do nothing... swapped to using the recommendation from this [stack overflow post](https://stackoverflow.com/questions/55965712/how-do-i-add-clang-formatting-to-pre-commit-hook). --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
building ray img lockfiles for all supported python versions --------- Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com> Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
added more tests for asynchronous inference for the below cases: - metrics - health checks - cancel tasks --------- Signed-off-by: harshit <harshit@anyscale.com>
Introduce proxy actor interface. Signed-off-by: Omkar Kulkarni <omkar@anyscale.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Should fix the Windows test; I am 90% sure. I could not test this manually because I was unsuccessful in running test_logging on Windows using this runbook https://www.notion.so/anyscale-hq/How-to-debug-Windows-tests-20e027c809cb803b92c8c796266b7852?source=copy_link. I am sure there is a way, but I am not investing more time into this. --------- Signed-off-by: abrar <abrar@anyscale.com>
The cpp api is only tested on `:ray: core: cpp worker tests`, but we still build it on most ci steps. For example, this commit was only broken for the cpp api and nothing else, but almost every single ci step broke: https://buildkite.com/ray-project/premerge/builds/48767 This sets `RAY_DISABLE_EXTRA_CPP` in the test containers so the cpp api doesn't need to get rebuilt on every test step. This should make ci a bit faster when making core cpp changes that cause the cpp api to rebuild. It'll still get built when we build the wheels, so any compilation errors for the cpp api will get verified there. Signed-off-by: dayshah <dhyey2019@gmail.com>
…ject#56440) ## Why are these changes needed? The check is redundant here, since the `initial_size` can't be smaller than `min_size` (which must be bigger than 1). ## Related issue number ray-project#56370 (comment) Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
Making `gcs` contain only the GCS component's files. --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…-project#56503) ## Why are these changes needed? In ray-project#56428 I accidentally added the wrong throughput graph. Row throughput is the one I wanted. ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Signed-off-by: Alan Guo <aguo@anyscale.com>
## script used for benchmarking
```python
import logging
import time
from typing import Optional

import psutil

from ray import serve
from ray._common.test_utils import wait_for_condition
from ray.util.state import list_actors

logger = logging.getLogger("ray.serve")


@serve.deployment(max_ongoing_requests=1000)
class MemoryLeakTest:
    async def __call__(self):
        logger.info("MemoryLeakTest")
        return "MemoryLeakTest"


app = serve.run(MemoryLeakTest.bind(), logging_config={"encoding": "JSON"})


def get_replica_pid() -> Optional[int]:
    """Return the PID of the live MemoryLeakTest replica, if any."""
    all_current_actors = list_actors(filters=[("state", "=", "ALIVE")])
    for actor in all_current_actors:
        if "MemoryLeakTest" in actor["name"]:
            return actor["pid"]
    return None


wait_for_condition(get_replica_pid)
print(get_replica_pid())


def track_memory() -> Optional[float]:
    """Return the replica's RSS in MB, or None if no replica is found."""
    pid = get_replica_pid()
    if pid is not None:
        process = psutil.Process(pid)
        return process.memory_info().rss / 1024 / 1024
    return None


# Track the memory of the replica in a loop, in MB.
while True:
    memory_mb = track_memory()
    if memory_mb is not None:  # skip formatting when the replica is gone
        print(f"\rMemory usage: {memory_mb:.2f} MB", end="", flush=True)
    time.sleep(0.1)
```
Simulated load using `ab -n 500 -c 1 http://127.0.0.1:8000/`.
Used [memray](https://bloomberg.github.io/memray/tutorials/1.html) to
profile the proxy process, following the instructions
[here](https://docs.ray.io/en/latest/ray-observability/user-guides/debug-apps/debug-memory.html#memory-profiling-ray-tasks-and-actors).
### On master
<img width="1164" height="628" alt="image"
src="https://github.com/user-attachments/assets/50d22e10-3206-4aeb-9585-97245523a5cb"
/>
### With fix
<img width="1161" height="621" alt="image"
src="https://github.com/user-attachments/assets/19224538-cbd7-4be7-b830-29e1b468625f"
/>
When we reduce the garbage collection (GC) frequency to every 10k
allocations, proxy memory peaks at **1.3 GB** for my test workload. By
contrast, under the default GC frequency (700 allocations), peak RSS
memory is **700 MB**.
The higher memory footprint with less frequent GC occurs because this
workload involves large object transactions. With GC running only after
10k allocations, these large objects remain in RSS longer, inflating
memory usage until a collection cycle is triggered.
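As a rough illustration of what "GC frequency" means here, the sketch below uses CPython's standard `gc` module. It assumes the frequency in question is the generation-0 allocation threshold; the text does not say how Serve configures it, and the default threshold differs between Python versions, so the original value is read rather than hard-coded.

```python
import gc

# Current (gen0, gen1, gen2) thresholds; gen0 is the allocation count
# that triggers a young-generation collection.
original = gc.get_threshold()

# Collect less often: run a gen0 collection only after ~10k allocations.
gc.set_threshold(10_000, original[1], original[2])
raised = gc.get_threshold()

# Restore the interpreter's previous setting.
gc.set_threshold(*original)
```

A higher gen0 threshold means fewer collector runs (less CPU overhead) at the cost of unreachable objects in reference cycles staying in memory longer, which matches the trade-off described above.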
Importantly, I found no evidence of a memory leak under sustained load.
With the fix, memory stabilizes at around **700 MB**, and even without
the fix, usage plateaus at **1.3 GB** rather than growing unbounded.
This feature was added in ray-project#49720
as a performance optimization, so we are taking a slight hit in RPS in exchange for
stable memory usage with larger payloads.
---------
Signed-off-by: abrar <abrar@anyscale.com>
using `--check` feature to verify llm lock files are unchanged --------- Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com> Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
this allows using `base-extra` or `base-extra-testdeps` or other base variations for building ray images. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Non-GCS component files have been moved; no longer need the nesting. --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…#56551) they are only used within the class Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
make the check stricter Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…-project#56458)
## Why are these changes needed?
As part of this PR I am trying to address Problem 2 raised in issue ray-project#44226. The main aim is to enable KubeRay to check the status of only DECLARATIVE Serve apps. The solution builds on top of ray-project#45522.
Based on my current understanding, KubeRay should only operate on DECLARATIVE Serve apps, so my solution involves two key steps:
- This PR: update the `/api/serve/applications/` endpoint to read the APIType from the request body and pass it on to the controller (`controller.get_serve_instance_details`).
- Next: modify KubeRay to explicitly pass Declarative as the APIType when calling `/api/serve/applications/`.
## Checks
- [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [x] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(
---------
Signed-off-by: jugalshah291 <shah.jugal291@gmail.com> Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>
…tage Signed-off-by: Guy Stone <guys@spotify.com>
GuyStone pushed a commit that referenced this pull request on Sep 16, 2025
… condition (ray-project#55367)

## Why are these changes needed?

Workers crash with a fatal `RAY_CHECK` failure when the plasma store connection is broken during shutdown, causing the following error:

```
RAY_CHECK failed: PutInLocalPlasmaStore(object, object_id, true) Status not OK: IOError: Broken pipe
```

Stacktrace:

```
core_worker.cc:720 C Check failed: PutInLocalPlasmaStore(object, object_id, true) Status not OK: IOError: Broken pipe
*** StackTrace Information ***
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x141789a) [0x7924dd2c689a] ray::operator<<()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray6RayLogD1Ev+0x479) [0x7924dd2c9319] ray::RayLog::~RayLog()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x95cc8a) [0x7924dc80bc8a] ray::core::CoreWorker::CoreWorker()::{lambda()#13}::operator()()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core11TaskManager27MarkTaskReturnObjectsFailedERKNS_17TaskSpecificationENS_3rpc9ErrorTypeEPKNS5_12RayErrorInfoERKN4absl12lts_2023080213flat_hash_setINS_8ObjectIDENSB_13hash_internal4HashISD_EESt8equal_toISD_ESaISD_EEE+0x679) [0x7924dc868f29] ray::core::TaskManager::MarkTaskReturnObjectsFailed()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core11TaskManager15FailPendingTaskERKNS_6TaskIDENS_3rpc9ErrorTypeEPKNS_6StatusEPKNS5_12RayErrorInfoE+0x416) [0x7924dc86f186] ray::core::TaskManager::FailPendingTask()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x9a90e6) [0x7924dc8580e6] ray::core::NormalTaskSubmitter::RequestNewWorkerIfNeeded()::{lambda()#1}::operator()()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray3rpc14ClientCallImplINS0_23RequestWorkerLeaseReplyEE15OnReplyReceivedEv+0x68) [0x7924dc94aa48] ray::rpc::ClientCallImpl<>::OnReplyReceived()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFvvEZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E9_M_invokeERKSt9_Any_data+0x15) [0x7924dc79e285] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0xd9b4c8) [0x7924dcc4a4c8] EventTracker::RecordExecution()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0xd4648e) [0x7924dcbf548e] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0xd46906) [0x7924dcbf5906] boost::asio::detail::completion_handler<>::do_complete()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x13f417b) [0x7924dd2a317b] boost::asio::detail::scheduler::do_run_one()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x13f5af9) [0x7924dd2a4af9] boost::asio::detail::scheduler::run()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x13f6202) [0x7924dd2a5202] boost::asio::io_context::run()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker12RunIOServiceEv+0x91) [0x7924dc793a61] ray::core::CoreWorker::RunIOService()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0xcba0b0) [0x7924dcb690b0] thread_proxy
/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7924dde71ac3]
/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7924ddf03850]
```

Stack trace flow:

1. Task lease request fails -> `NormalTaskSubmitter::RequestNewWorkerIfNeeded()` callback.
2. Triggers `TaskManager::FailPendingTask()` -> `MarkTaskReturnObjectsFailed()`.
3. System attempts to store error objects in plasma via `put_in_local_plasma_callback_`.
4. Plasma connection is broken (raylet/plasma store already shut down).
5. `RAY_CHECK_OK()` in the callback causes fatal crash instead of graceful handling.

Root Cause:

This is a shutdown ordering race condition:

1. Raylet shuts down first: The raylet stops its IO context ([main_service_.stop()](https://github.com/ray-project/ray/blob/77c5475195e56a26891d88460973198391d20edf/src/ray/object_manager/plasma/store_runner.cc#L146)) which closes plasma store connections.
2. Worker still processes callbacks: Core worker continues processing pending callbacks on separate threads.
3. Broken connection: When the callback tries to store error objects in plasma, the connection is already closed.
4. Fatal crash: The `RAY_CHECK_OK()` treats this as an unexpected error and crashes the process.

Fix:

1. Shutdown-aware plasma operations
   - Add `CoreWorker::IsShuttingDown()` method to check shutdown state.
   - Skip plasma operations entirely when shutdown is in progress.
   - Prevents attempting operations on already-closed connections.
2. Targeted error handling for connection failures
   - Replace blanket `RAY_CHECK_OK()` with specific error type checking.
   - Handle connection errors (Broken pipe, Connection reset, Bad file descriptor) as warnings during shutdown scenarios.
   - Maintain `RAY_CHECK_OK()` for other error types to catch real issues.

---------

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
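The two-part fix described in this commit message can be illustrated with a small Python analogue. This is a sketch of the pattern only, not the actual C++ change: `WorkerSketch`, `begin_shutdown`, `put_error_object`, and the injected `put_fn` are hypothetical names, and the three `errno` values are assumed stand-ins for the "Broken pipe", "Connection reset", and "Bad file descriptor" cases listed above.

```python
import errno
import logging

logger = logging.getLogger("plasma_sketch")

# Assumed mapping of the connection failures named above to errno codes.
CONNECTION_ERRNOS = {errno.EPIPE, errno.ECONNRESET, errno.EBADF}


class WorkerSketch:
    """Hypothetical stand-in for a worker; this is not Ray's CoreWorker."""

    def __init__(self, put_fn):
        self._put_fn = put_fn  # callable(object_id, payload) that writes to a store
        self._shutting_down = False

    def begin_shutdown(self):
        self._shutting_down = True

    def is_shutting_down(self):
        return self._shutting_down

    def put_error_object(self, object_id, payload):
        """Return True if the object was stored, False if skipped or dropped."""
        # Part 1: skip the store operation entirely once shutdown has begun.
        if self.is_shutting_down():
            return False
        try:
            self._put_fn(object_id, payload)
            return True
        except OSError as exc:
            # Part 2: connection failures become warnings instead of crashes.
            if exc.errno in CONNECTION_ERRNOS:
                logger.warning("Dropping %s, store connection lost: %s", object_id, exc)
                return False
            # Any other error still propagates so real bugs stay visible.
            raise
```

The shutdown flag covers the common case, and the errno check covers the window in which the store has already closed its side but the worker has not yet observed shutdown.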