[Data][LLM] Support openai's nested image_url format in PrepareImageStage#1

Draft
GuyStone wants to merge 634 commits into master from openai-nested-imageurl

Conversation

@GuyStone
Owner

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

czgdp1807 and others added 30 commits September 2, 2025 10:17
Signed-off-by: czgdp1807 <gdp.1807@gmail.com>
…es in Ray Serve docs (ray-project#56131)

Signed-off-by: Potato <tanxinyu@apache.org>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: Douglas Strodtman <douglas@anyscale.com>
…e reused by cache_stopped_nodes (ray-project#56007)

Signed-off-by: Rueian <rueian@anyscale.com>
…56133)

## Why are these changes needed?
This reverts PR ray-project#52380.

When working with large data blocks, this log can dump the entire block to
the terminal, which is both spammy and insecure.

## Related issue number

Fixes ray-project#56092

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Praveen Gorthy <praveeng@anyscale.com>
Signed-off-by: Praveen <gorthypraveen@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
and add skip-on-release-tests tag for skipping steps to run on release
tests

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
minor typo

## Why are these changes needed?

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: weiliango <weiliang.dev@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…s resource limits assignment (ray-project#56051)

## Why are these changes needed?

The performance tips documentation for setting resource limits in
ExecutionOptions is no longer correct and gives an error when directly
setting them in 2.49 after ray-project#54694. Update the documentation to show how
to correctly set them.

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: Jack Gammack <49536617+JackGammack@users.noreply.github.com>
…on (ray-project#56066)

This PR addresses various grammar, punctuation, and formatting issues
throughout the Ray Data documentation in `doc/source/data/` to improve
clarity and readability.

## Changes Made

**Grammar Fixes:**
- Fixed verb agreement errors in `key-concepts.rst` ("define" →
"defines", "translate" → "translates")
- Corrected missing articles and prepositions ("sharding your dataset" →
"sharding of your dataset")
- Fixed awkward phrasing in `saving-data.rst` ("have the control" →
"have control")
- Improved sentence flow in multiple files ("like following" → "as
follows")

**Formatting Improvements:**
- Restructured bullet list formatting in `aggregations.rst` for better
readability
- Added missing punctuation and commas for proper sentence structure
- Improved note formatting and punctuation consistency

**Files Modified:**
- `doc/source/data/key-concepts.rst` - 3 grammar corrections
- `doc/source/data/user-guide.rst` - 1 verb form correction  
- `doc/source/data/aggregations.rst` - Bullet list formatting
improvement
- `doc/source/data/joining-data.rst` - 2 grammar and punctuation fixes
- `doc/source/data/comparisons.rst` - 1 preposition correction
- `doc/source/data/data-internals.rst` - 1 punctuation fix
- `doc/source/data/saving-data.rst` - 1 phrasing improvement

## Review Methodology

The review was conducted manually across all 45 files in the
`doc/source/data/` directory, focusing specifically on:
- Typos and spelling errors
- Grammar and syntax issues
- RST formatting consistency
- Punctuation and capitalization

The approach was conservative, making only clear corrections without
rewriting content for style, preserving the technical accuracy and
existing tone of the documentation.

## Impact

These changes improve the overall quality and professionalism of the Ray
Data documentation while maintaining all technical content and existing
structure. The fixes address common grammatical issues that could
distract readers from the technical content.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
…int tool (ray-project#55417)

## Overview
We scanned the Ray Data code with PyLint and found some defects. Here are
some of the results, based on Ray 2.46:

- `ray/python/ray/data/read_api.py:3214:4`: R1705: Unnecessary "elif" after
"return", remove the leading "el" from "elif" (no-else-return)
- `ray/python/ray/data/datasource/file_based_datasource.py:276:20`: R1730:
Consider using 'num_threads = min(num_threads, len(read_paths))' instead
of unnecessary if block (consider-using-min-builtin)
- R1705: Unnecessary "else" after "return", remove the "else" and
de-indent the code inside it (no-else-return)

Scanning the latest master branch yields similar results.
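
As a hedged illustration of the two PyLint messages quoted above, the sketch below shows the kind of rewrite they ask for. The functions `classify` and `pick_num_threads` are hypothetical and are not taken from the Ray code base.

```python
def classify(n: int) -> str:
    # R1705 (no-else-return): once a branch returns, a following
    # `elif`/`else` adds nesting without changing behavior.
    if n < 0:
        return "negative"
    if n == 0:  # previously: elif n == 0
        return "zero"
    return "positive"  # previously: else: return "positive"


def pick_num_threads(num_threads: int, read_paths: list) -> int:
    # R1730 (consider-using-min-builtin): replaces
    #     if num_threads > len(read_paths):
    #         num_threads = len(read_paths)
    return min(num_threads, len(read_paths))
```

Both rewrites keep the original control flow and only remove redundant structure.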


## Why are these changes needed?
The modifications in this PR do not affect the code logic or functionality,
nor do they affect existing unit tests. The aim is to reduce code
complexity and redundancy without changing the logic, and to improve the
readability of the Ray code.

## Related issue number

 Closes ray-project#53881 

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [x] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
…project#56101)

## Why are these changes needed?

1. A new `//src/ray/raylet_client:raylet_client_interface` target
containing only the `RayletClientInterface`.
2. A new `//src/ray/raylet_client:raylet_client_pool` target moved from
the node_manager.
3. A new `//src/ray/raylet_client:node_manager_client` target moved from
the node_manager.
4. Remove `using` statements in the `raylet_client.h` that allow others
to omit `ray::` implicitly.

There are no behavioral changes.

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Rueian <rueian@anyscale.com>
…roject#56077)

Since we introduced panel groups to Default
(ray-project#55620) & Data
(ray-project#55495) dashboards, applications
consuming Grafana dashboards can comfortably embed the full dashboard on
any UI now (and the other dashboards are pretty usable even without
them).

Added a `"supportsFullGrafanaView"` tag to the `rayMeta` list in Default
Dashboard to indicate to consumers that we support full Grafana
dashboard embedding from now on.

---------

Signed-off-by: anmol <anmol@anyscale.com>
Co-authored-by: anmol <anmol@anyscale.com>
…or (ray-project#56050)

## Why are these changes needed?

This is a followup from ray-project#54244

- Restrict `TryInitiateShutdown`, `TryTransitionToDisconnecting`, and
`TryTransitionToShutdown` to private once all production code calls
`RequestShutdown`.
- Minimize API surface and prevent misuse; with a single entry point,
internal transitions need not be externally callable.
- Update tests to exercise only `RequestShutdown`

## Related issue number

Closes ray-project#55739

---------

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
and perform more aggressive checks, so that people do not forget the
declaration when adding new tags.

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…-project#56105)

Historically, `ParquetDatasource` has fetched each file's Parquet footer
metadata to obtain granular metadata about the dataset.

While laudable in principle, this is inefficient in practice and results
in extremely poor performance on very large datasets (tens to hundreds of
thousands of files).

This change revisits this approach by 

- Removing metadata fetching as a step (and deprecating involved
components)
- Cleaning up and streamlining some of the other utilities further
optimizing performance


## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Closes ray-project#55142

Creating asyncio tasks from one thread on an event loop that runs in
another thread is permitted by asyncio, but it is not thread-safe.

Currently this happens in two places in the router:
1. during handling of assign_request
2. during creation of `_report_cached_metrics_forever`

This PR fixes that, so that task creation happens in a thread-safe manner.
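
One standard asyncio way to schedule work on a loop owned by another thread is `asyncio.run_coroutine_threadsafe`. The sketch below illustrates that pattern only. `handle_request` and `start_loop_in_thread` are hypothetical names, and the router may use a different mechanism (for example `loop.call_soon_threadsafe`).

```python
import asyncio
import threading


async def handle_request(x: int) -> int:
    # Stand-in for work that must run on the loop's own thread.
    await asyncio.sleep(0)
    return x * 2


def start_loop_in_thread() -> asyncio.AbstractEventLoop:
    # Run a dedicated event loop in a background thread.
    loop = asyncio.new_event_loop()
    threading.Thread(target=loop.run_forever, daemon=True).start()
    return loop


loop = start_loop_in_thread()

# Calling loop.create_task(...) from this thread would touch the loop's
# internals without synchronization. run_coroutine_threadsafe hands the
# coroutine to the loop thread and returns a concurrent.futures.Future.
future = asyncio.run_coroutine_threadsafe(handle_request(21), loop)
result = future.result(timeout=5)

# Stopping the loop from another thread also has to go through the
# thread-safe entry point.
loop.call_soon_threadsafe(loop.stop)
```

The returned future supports `cancel()`, so cancellation requested from the calling thread is also forwarded to the loop safely.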

I validated that this does not break bulk task cancellation, by
rerunning the repro script from
ray-project#52591

---------

Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
Used across GCS, Raylet, worker, so should be in `common/`.

Also moved implementations to `.cc` file (aside from two single-line
functions).

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
## Why are these changes needed?

The current tests are set up to exercise the code only when
`DeploymentMode == EveryNode`, in which case there is a proxy on each
node. When the mode is overridden to `HeadOnly` for any reason, the test
suite fails.

This change enables assertions in both deployment modes.

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: omkar <omkar@anyscale.com>
Signed-off-by: Omkar Kulkarni <omkar@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>
…56023)

Add ownership and a README describing how to modify the proto files in the
public directory. This is related to recent work to define proto exposure
via directory structure and to set expectations for maintainers and users
of these protos.

Test:
- CI

Signed-off-by: Cuong Nguyen <can@anyscale.com>
…ject#55762)

The purpose of this change is to add metrics for monitoring the state of
GCS and raylet. It does so by repurposing
RayConfig::instance().event_stats_metrics(). This OS environment
variable previously enabled metric emission from all services and all
instances of classes that wrapped calls to EventStats. This included
things like instrumented_io_context, which is used widely throughout the
code base.

The main thing this change does is rename RAY_event_stats_metrics to
RAY_emit_main_service_metrics and move its usage back up to the main
classes of GCS and Raylet, which then pass this config into the main
io_contexts used by those services. Usage of EventStats then becomes
opt-in, defaulting to false for all other code paths. Over time, as we
identify paths that need more monitoring, we can opt those code paths in;
where we don't need this kind of monitoring, we can move them off
instrumented_io_context or event_stats completely (since we're not really
using the overhead for anything particularly useful).

This PR also includes some clean up here and there and some metric type
changes to make more sense with what they seem to intend to do.
Specifically, operation_run_time_ms and operation_queue_time have been
updated to be HISTOGRAM instead of GAUGE. The reason for this is that
knowing the run time or queue time of the last event isn't quite as
useful as knowing the histogram view which would give proper
distributions on QoS. GAUGE does make sense for values which are
absolute at a certain time (like queue length or CPU utilization).
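
The difference between the two metric types can be sketched with a toy model. This is plain Python for illustration, not Ray's metrics implementation. The classes `LastValueGauge` and `BucketHistogram`, the bucket boundaries, and the sample values are all assumptions.

```python
from bisect import bisect_left


class LastValueGauge:
    """Keeps only the most recent observation."""

    def __init__(self):
        self.value = None

    def record(self, v: float) -> None:
        self.value = v


class BucketHistogram:
    """Counts observations per bucket; the last bucket is the overflow."""

    def __init__(self, boundaries):
        self.boundaries = sorted(boundaries)
        self.counts = [0] * (len(self.boundaries) + 1)

    def record(self, v: float) -> None:
        # bisect_left places v in the first bucket whose upper bound is >= v.
        self.counts[bisect_left(self.boundaries, v)] += 1


gauge = LastValueGauge()
hist = BucketHistogram([1, 10, 100])  # upper bounds in ms (assumed)

# Hypothetical run times in ms, including one slow outlier.
for run_time_ms in [0.5, 0.7, 2, 3, 250, 0.9]:
    gauge.record(run_time_ms)
    hist.record(run_time_ms)
```

After these samples the gauge holds only the last value, so the 250 ms outlier is invisible, while the histogram still shows one observation in the overflow bucket.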

---------

Signed-off-by: zac <zac@anyscale.com>
…po root (ray-project#55989)

Currently, this uses Bazel runfiles, which causes a problem when
`run_release_test` is invoked as a Bazel binary: files in the working
directory that are not included in the Bazel binary's data are not
packaged into the zip file when submitting an Anyscale job. This change
switches to a path based on the Bazel workspace directory, which points to
the source code and so does not miss files in the zip file.

---------

Signed-off-by: kevin <kevin@anyscale.com>
Make the ReturnWorkerLease RPC idempotent and fault tolerant, and add
corresponding C++ and Python integration tests.
This solves the issue described in ray-project#55469 because we now use the lease
ID rather than the worker ID to track granted leases on the raylet side.
A retried ReturnWorkerLease therefore cannot prematurely return an ongoing
lease on the same worker: the retry and the current lease carry different
lease IDs, so the retry can simply be discarded.
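
The idea of keying returns by lease ID can be sketched as follows. This is a simplified Python model for illustration only, not the raylet's C++ implementation. `LeaseTable`, its methods, and the ID strings are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class LeaseTable:
    # Maps lease_id -> worker_id for leases that are currently granted.
    granted: Dict[str, str] = field(default_factory=dict)

    def grant(self, lease_id: str, worker_id: str) -> None:
        self.granted[lease_id] = worker_id

    def return_lease(self, lease_id: str) -> bool:
        """Release the lease if it is known; ignore unknown or stale IDs.

        Returns True when a lease was released and False when the call
        was a duplicate or stale retry.
        """
        return self.granted.pop(lease_id, None) is not None


table = LeaseTable()
table.grant("lease-1", "worker-A")
first = table.return_lease("lease-1")  # releases lease-1

# The same worker is leased again under a new lease ID.
table.grant("lease-2", "worker-A")

# A late retry of the first return arrives; it must not touch lease-2.
retry = table.return_lease("lease-1")
```

If the table were keyed by worker ID instead, the late retry would match `worker-A` and release the new lease, which is the failure mode described above.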

Signed-off-by: joshlee <joshlee@anyscale.com>

## Why are these changes needed?

It is not used any more.

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
…6152)

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
…ectories (ray-project#56128)

Signed-off-by: Potato <tanxinyu@apache.org>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
…#56129)

Signed-off-by: Potato <tanxinyu@apache.org>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
lk-chen and others added 29 commits September 12, 2025 14:46
…56495)

removing python ver check for llm compilation

already use --python <ver> flag on compilation

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…resent (ray-project#56435)

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Initial user guide for GPU objects. Missing a couple things that we can add in
follow-ups:
- installation instructions
- full API reference
- performance numbers

---------

Signed-off-by: Stephanie wang <smwang@cs.washington.edu>
Signed-off-by: Stephanie Wang <smwang@cs.washington.edu>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
…#56489)

## Why are these changes needed?

This PR updates the batch inference release tests to make them easier to
run and clearer:
* Sets the group name to `batch-inference`, removing the need to list
each test individually.
* Renames batch_inference_hetero → image_embedding_from_jsonl and
batch_inference → image_classification for clarity.
* Sets the image and text embedding workloads to run weekly for
consistent signal.

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
upgrade uv binary

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
…y-project#56483)

Signed-off-by: Ibrahim Rabbani <irabbani@irabbani-JMY3JQDQW0.local>
Signed-off-by: israbbani <israbbani@gmail.com>
Co-authored-by: Ibrahim Rabbani <irabbani@irabbani-JMY3JQDQW0.local>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
so that it is easier to detect the ray version in the image

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
The existing one seemed to do nothing... swapped to using the
recommendation from this [stack overflow
post](https://stackoverflow.com/questions/55965712/how-do-i-add-clang-formatting-to-pre-commit-hook).

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
building ray img lockfiles for all supported python versions

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
added more tests for asynchronous inference for the below cases:
- metrics
- health checks
- cancel tasks

---------

Signed-off-by: harshit <harshit@anyscale.com>
Introduce proxy actor interface.

Signed-off-by: Omkar Kulkarni <omkar@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
This should fix the Windows test; I am 90% sure. I could not test it
manually because I was unable to run test_logging on Windows using this
runbook:
https://www.notion.so/anyscale-hq/How-to-debug-Windows-tests-20e027c809cb803b92c8c796266b7852?source=copy_link.
There is surely a way, but I am not investing more time into this.

---------

Signed-off-by: abrar <abrar@anyscale.com>
The cpp api is only tested on `:ray: core: cpp worker tests`, but we
still build it on most ci steps. For example, this commit was broken only
for the cpp api and nothing else, but almost every single ci step broke:
https://buildkite.com/ray-project/premerge/builds/48767

This sets `RAY_DISABLE_EXTRA_CPP` in the test containers so the cpp api
doesn't need to get rebuilt on every test step. This should make ci a
bit faster when making core cpp changes that cause the cpp api to
rebuild. It'll still get built when we build the wheels so any
compilation errors for the cpp api will get verified there.

Signed-off-by: dayshah <dhyey2019@gmail.com>
…ject#56440)

## Why are these changes needed?
The check is redundant here, since the `initial_size` can't be smaller
than `min_size` (which must be bigger than 1).

## Related issue number
ray-project#56370 (comment)
<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
Making `gcs` contain only the GCS component's files.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…-project#56503)

## Why are these changes needed?
In ray-project#56428 I accidentally added the wrong throughput graph. This is the
row throughput graph I wanted.

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: Alan Guo <aguo@anyscale.com>
## script used for benchmarking
```python
import logging
import time
from typing import Optional

import psutil

from ray import serve
from ray._common.test_utils import wait_for_condition
from ray.util.state import list_actors

logger = logging.getLogger("ray.serve")


@serve.deployment(max_ongoing_requests=1000)
class MemoryLeakTest:
    async def __call__(self):
        logger.info("MemoryLeakTest")
        return "MemoryLeakTest"


# Deploy the app with JSON-encoded Serve logs.
app = serve.run(MemoryLeakTest.bind(), logging_config={
    "encoding": "JSON",
})


def get_replica_pid() -> Optional[int]:
    """Return the PID of the live MemoryLeakTest replica, or None."""
    all_current_actors = list_actors(filters=[("state", "=", "ALIVE")])
    for actor in all_current_actors:
        if "MemoryLeakTest" in actor["name"]:
            return actor["pid"]
    return None


# Block until the replica actor is alive.
wait_for_condition(get_replica_pid)

print(get_replica_pid())


def track_memory() -> Optional[float]:
    """Return the replica's RSS in MB, or None if it is not running."""
    pid = get_replica_pid()
    if pid is not None:
        process = psutil.Process(pid)
        return process.memory_info().rss / 1024 / 1024
    return None


# Track the memory of the replica in a loop.
while True:
    memory_mb = track_memory()
    if memory_mb is not None:  # skip the print if the replica disappeared
        print(f"\rMemory usage: {memory_mb:.2f} MB", end="", flush=True)
    time.sleep(.1)
```

Simulated load using `ab -n 500 -c 1 http://127.0.0.1:8000/`.

Used [memray](https://bloomberg.github.io/memray/tutorials/1.html) to
profile the proxy process, following the instructions
[here](https://docs.ray.io/en/latest/ray-observability/user-guides/debug-apps/debug-memory.html#memory-profiling-ray-tasks-and-actors).

### On master
<img width="1164" height="628" alt="image"
src="https://github.com/user-attachments/assets/50d22e10-3206-4aeb-9585-97245523a5cb"
/>


### With fix
<img width="1161" height="621" alt="image"
src="https://github.com/user-attachments/assets/19224538-cbd7-4be7-b830-29e1b468625f"
/>


When we reduce the garbage collection (GC) frequency to every 10k
allocations, proxy memory peaks at **1.3 GB** for my test workload. By
contrast, under the default GC frequency (700 allocations), peak RSS
memory is **700 MB**.

The higher memory footprint with less frequent GC occurs because this
workload involves large object transactions. With GC running only after
10k allocations, these large objects remain in RSS longer, inflating
memory usage until a collection cycle is triggered.

Importantly, I found no evidence of a memory leak under sustained load.
With the fix, memory stabilizes at around **700 MB**, and even without
the fix, usage plateaus at **1.3 GB** rather than growing unbounded.

This feature was added in ray-project#49720
as a performance optimization, so we are taking a slight hit in RPS in
exchange for stable memory usage with larger payloads.
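
For readers unfamiliar with the knob being discussed, the sketch below shows how a GC allocation threshold is raised and restored with Python's standard `gc` module. It assumes the "10k allocations" and "700 allocations" figures above refer to CPython's generation-0 threshold. How Serve actually configures this is not shown in this description.

```python
import gc

# Remember the interpreter's current thresholds so they can be restored.
original = gc.get_threshold()

# Less frequent collection: run a generation-0 pass only after roughly
# 10,000 net container allocations instead of the interpreter default.
gc.set_threshold(10_000, original[1], original[2])
raised = gc.get_threshold()

# Restore the previous thresholds (the direction this fix takes).
gc.set_threshold(*original)
restored = gc.get_threshold()
```

A higher threshold means fewer collection passes and less GC overhead per request, at the cost of unreachable cyclic objects staying in memory longer between passes.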

---------

Signed-off-by: abrar <abrar@anyscale.com>
using `--check` feature to verify llm lock files are unchanged

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
this allows using `base-extra` or `base-extra-testdeps` or other base
variations for building ray images.

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Non-GCS component files have been moved; no longer need the nesting.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…#56551)

they are only used within the class

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
make the check stricter

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…-project#56458)

## Why are these changes needed?
This PR addresses Problem 2 raised in issue
ray-project#44226.

The main aim is to enable KubeRay to check the status of only
DECLARATIVE Serve apps.
The solution builds on top of
ray-project#45522.

Based on my current understanding, KubeRay should only operate on
DECLARATIVE Serve apps, so my solution involves two key steps:

1. This PR: update the `/api/serve/applications/` endpoint to read the
`APIType` from the request body and pass it on to the controller via
`controller.get_serve_instance_details`.
2. Next: modify KubeRay to explicitly pass Declarative as the `APIType`
when calling `/api/serve/applications/`.
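
As a rough illustration of the filtering described in the first step, the sketch below selects applications by API type. Everything in it is an assumption made for illustration: the `APIType` enum and its members, the `source` field, and `filter_apps_by_api_type` are hypothetical stand-ins rather than Ray Serve's actual definitions.

```python
from enum import Enum
from typing import Dict, Optional


class APIType(str, Enum):
    """Hypothetical stand-in for how an app was deployed."""

    IMPERATIVE = "imperative"
    DECLARATIVE = "declarative"


def filter_apps_by_api_type(
    apps: Dict[str, dict], api_type: Optional[APIType] = None
) -> Dict[str, dict]:
    """Return only the apps whose (assumed) `source` field matches api_type.

    With api_type=None, every app is returned, so callers that do not pass
    a type keep seeing all applications.
    """
    if api_type is None:
        return dict(apps)
    return {
        name: details
        for name, details in apps.items()
        if details.get("source") == api_type.value
    }


if __name__ == "__main__":
    apps = {
        "from_config": {"source": "declarative", "status": "RUNNING"},
        "from_serve_run": {"source": "imperative", "status": "RUNNING"},
    }
    print(filter_apps_by_api_type(apps, APIType.DECLARATIVE))
```

Returning all apps when no type is given is a design choice in this sketch, made so that existing callers would not be affected.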

<!-- Please give a short summary of the change and the problem this
solves. -->

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
Co-authored-by: Cindy Zhang <cindyzyx9@gmail.com>
…tage

Signed-off-by: Guy Stone <guys@spotify.com>
GuyStone pushed a commit that referenced this pull request Sep 16, 2025
… condition (ray-project#55367)

## Why are these changes needed?

Workers crash with a fatal `RAY_CHECK` failure when the plasma store
connection is broken during shutdown, causing the following error:
```
RAY_CHECK failed: PutInLocalPlasmaStore(object, object_id, true) Status not OK: IOError: Broken pipe
```
Stacktrace:
```
core_worker.cc:720 C  Check failed: PutInLocalPlasmaStore(object, object_id, true) Status not OK: IOError: Broken pipe 
*** StackTrace Information ***
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x141789a) [0x7924dd2c689a] ray::operator<<()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray6RayLogD1Ev+0x479) [0x7924dd2c9319] ray::RayLog::~RayLog()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x95cc8a) [0x7924dc80bc8a] ray::core::CoreWorker::CoreWorker()::{lambda()#13}::operator()()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core11TaskManager27MarkTaskReturnObjectsFailedERKNS_17TaskSpecificationENS_3rpc9ErrorTypeEPKNS5_12RayErrorInfoERKN4absl12lts_2023080213flat_hash_setINS_8ObjectIDENSB_13hash_internal4HashISD_EESt8equal_toISD_ESaISD_EEE+0x679) [0x7924dc868f29] ray::core::TaskManager::MarkTaskReturnObjectsFailed()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core11TaskManager15FailPendingTaskERKNS_6TaskIDENS_3rpc9ErrorTypeEPKNS_6StatusEPKNS5_12RayErrorInfoE+0x416) [0x7924dc86f186] ray::core::TaskManager::FailPendingTask()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x9a90e6) [0x7924dc8580e6] ray::core::NormalTaskSubmitter::RequestNewWorkerIfNeeded()::{lambda()#1}::operator()()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray3rpc14ClientCallImplINS0_23RequestWorkerLeaseReplyEE15OnReplyReceivedEv+0x68) [0x7924dc94aa48] ray::rpc::ClientCallImpl<>::OnReplyReceived()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFvvEZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E9_M_invokeERKSt9_Any_data+0x15) [0x7924dc79e285] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0xd9b4c8) [0x7924dcc4a4c8] EventTracker::RecordExecution()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0xd4648e) [0x7924dcbf548e] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0xd46906) [0x7924dcbf5906] boost::asio::detail::completion_handler<>::do_complete()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x13f417b) [0x7924dd2a317b] boost::asio::detail::scheduler::do_run_one()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x13f5af9) [0x7924dd2a4af9] boost::asio::detail::scheduler::run()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x13f6202) [0x7924dd2a5202] boost::asio::io_context::run()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker12RunIOServiceEv+0x91) [0x7924dc793a61] ray::core::CoreWorker::RunIOService()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0xcba0b0) [0x7924dcb690b0] thread_proxy
/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7924dde71ac3]
/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7924ddf03850]
```

Stack trace flow:
1. Task lease request fails ->
`NormalTaskSubmitter::RequestNewWorkerIfNeeded()` callback.
2. Triggers `TaskManager::FailPendingTask()` ->
`MarkTaskReturnObjectsFailed()`.
3. System attempts to store error objects in plasma via
`put_in_local_plasma_callback_`.
4. Plasma connection is broken (raylet/plasma store already shut down).
5. `RAY_CHECK_OK()` in the callback causes fatal crash instead of
graceful handling.

Root Cause:

This is a shutdown ordering race condition:
1. Raylet shuts down first: The raylet stops its IO context
([main_service_.stop()](https://github.com/ray-project/ray/blob/77c5475195e56a26891d88460973198391d20edf/src/ray/object_manager/plasma/store_runner.cc#L146))
which closes plasma store connections.
2. Worker still processes callbacks: Core worker continues processing
pending callbacks on separate threads.
3. Broken connection: When the callback tries to store error objects in
plasma, the connection is already closed.
4. Fatal crash: The `RAY_CHECK_OK()` treats this as an unexpected error
and crashes the process.

Fix:

1. Shutdown-aware plasma operations
- Add `CoreWorker::IsShuttingDown()` method to check shutdown state.
- Skip plasma operations entirely when shutdown is in progress.
- Prevents attempting operations on already-closed connections.

2. Targeted error handling for connection failures
- Replace blanket `RAY_CHECK_OK()` with specific error type checking.
- Handle connection errors (Broken pipe, Connection reset, Bad file
descriptor) as warnings during shutdown scenarios.
- Maintain `RAY_CHECK_OK()` for other error types to catch real issues.
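
The two parts of the fix can be sketched as a small Python analogy. This is not the C++ change itself: `put_error_object`, `put_fn`, and `is_shutting_down` are hypothetical names, and mapping "Broken pipe", "Connection reset", and "Bad file descriptor" to `EPIPE`, `ECONNRESET`, and `EBADF` is an assumption made for illustration.

```python
import errno
import logging

logger = logging.getLogger(__name__)

# Connection failures that are expected once the store has gone away.
CONNECTION_ERRNOS = {errno.EPIPE, errno.ECONNRESET, errno.EBADF}


def put_error_object(put_fn, is_shutting_down):
    """Try to store an object; return True if stored, False if skipped.

    put_fn: zero-argument callable that performs the store (hypothetical).
    is_shutting_down: zero-argument callable returning the shutdown state.
    """
    # Part 1: skip the operation entirely when shutdown is in progress.
    if is_shutting_down():
        return False
    try:
        put_fn()
    except OSError as exc:
        # Part 2: tolerate known connection errors with a warning only.
        if exc.errno in CONNECTION_ERRNOS:
            logger.warning("Store connection lost, skipping put: %s", exc)
            return False
        # Any other failure is still treated as a real bug and propagates.
        raise
    return True
```

Checking the shutdown flag first avoids touching a connection that is already closed, while the narrow `errno` check covers the window where the connection breaks before the flag is observed.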

---------

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>