Codex/runtime raymodule mvp#7

Draft
mikecovlee wants to merge 7 commits into OpenDCAI:codex/runtime-raymodule-mvp from mikecovlee:codex/runtime-raymodule-mvp

Conversation

@mikecovlee

No description provided.

The control plane governs what should run, when it should run, and what the
outcome was.

- **PostgreSQL / MySQL** – persistent source of truth for DAG definitions, job

@SunnyHaze SunnyHaze Jun 12, 2026

Collaborator

For this part, I lean toward collecting the information with a native async Ray process. If needed, we could add an adapter to MySQL, or maybe even a coroutine object would do? The main goal is to reduce the complexity of configuring a DB. If it turns out async Ray can't handle the load, SQLite would probably be enough?

Author

I don't think it's a big problem. Most of the time, configuring a DB just means starting one more image. SQLite is still too weak.

Collaborator

For algorithm engineers working locally, the ideal is that it runs right after `pip install`. If we want compatibility, the industrial-grade DB should be an optional interface, with a native process or SQLite as the fallback.
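
The optional-interface idea in this thread could look roughly like the sketch below. It is only an illustration under stated assumptions: `StateStore`, `InMemoryStateStore`, `SqliteStateStore`, `make_store`, and the `sqlite://` URL convention are hypothetical names, not part of RayOrch. A PostgreSQL or MySQL backend would be one more class behind the same two methods, installed only by users who need it.

```python
import sqlite3
from typing import Optional, Protocol


class StateStore(Protocol):
    """Hypothetical control-plane storage interface (not a RayOrch API)."""

    def put(self, key: str, value: str) -> None: ...
    def get(self, key: str) -> Optional[str]: ...


class InMemoryStateStore:
    """Zero-config default: state lives in the current process."""

    def __init__(self) -> None:
        self._data = {}

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


class SqliteStateStore:
    """Fallback that persists state using only the standard library."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS state (key TEXT PRIMARY KEY, value TEXT)"
        )

    def put(self, key: str, value: str) -> None:
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                "INSERT OR REPLACE INTO state (key, value) VALUES (?, ?)",
                (key, value),
            )

    def get(self, key: str) -> Optional[str]:
        row = self._conn.execute(
            "SELECT value FROM state WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None


def make_store(url: Optional[str] = None) -> StateStore:
    """Pick a backend from a hypothetical URL-style setting."""
    if url is None:
        return InMemoryStateStore()
    if url.startswith("sqlite://"):
        return SqliteStateStore(url[len("sqlite://"):] or ":memory:")
    raise ValueError(f"unsupported store URL: {url}")
```

With this shape, `make_store()` needs no configuration at all, which matches the "runs right after `pip install`" goal, while heavier databases stay optional.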


### Observability (future)

- **Prometheus / Grafana** – metrics collection and dashboards via

Collaborator

These, along with Kafka and the like, aren't necessarily required, because this isn't a multi-user service-queue kind of workload. The typical load is one batch of data saturating one cluster, so a natively built queue is enough.

Author

This only adds an abstraction layer. We can add them later if needed; we won't do it for now.
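
For scale, the "native queue" mentioned in this thread could be as small as an in-process `asyncio.Queue`. The following is a minimal sketch, assuming each shard is a list of numbers and `sum` stands in for real operator work; `worker` and `run_batch` are hypothetical names, not RayOrch APIs.

```python
import asyncio


async def worker(queue: asyncio.Queue, results: list) -> None:
    """Pull shards until cancelled; sum() stands in for real operator work."""
    while True:
        shard = await queue.get()
        try:
            results.append(sum(shard))
        finally:
            queue.task_done()


async def run_batch(shards: list, num_workers: int = 4) -> list:
    """Push one batch through a fixed pool of workers and wait for it to drain."""
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    for shard in shards:
        queue.put_nowait(shard)

    workers = [
        asyncio.create_task(worker(queue, results)) for _ in range(num_workers)
    ]
    await queue.join()  # returns once every shard has been marked done

    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

Calling `asyncio.run(run_batch([[1, 2], [3, 4], [5]]))` processes all three shards; result order depends on scheduling, so callers should not rely on it.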

Comment thread docs/dsl_compiler.md
```
└─▶ (4b) TorchBackend ── torch.fx / torch.compile graph
```

## Python AST Version Adaptation

Collaborator

For this part, I think that once the "syntax" is settled we should write a few MVPs to see how Python versions affect it. I still feel the DSL features we need aren't that numerous or varied, so the AST may not change much across versions.
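
One way to run the kind of version check suggested here is to pin the DSL to a whitelist of AST node types and run the same probe under each target interpreter. This is a minimal sketch using only the standard `ast` module; `ALLOWED_NODES` and `unsupported_nodes` are hypothetical, and the whitelist is an assumed example, not the agreed DSL syntax.

```python
import ast

# Hypothetical whitelist: node types a small pipeline DSL might allow.
ALLOWED_NODES = {
    "Module", "Assign", "Expr", "Call", "Name", "Attribute",
    "Constant", "Load", "Store", "keyword",
}


def unsupported_nodes(source: str) -> set:
    """Return the AST node type names in `source` outside the whitelist."""
    tree = ast.parse(source)
    used = {type(node).__name__ for node in ast.walk(tree)}
    return used - ALLOWED_NODES
```

If the set returned for a representative pipeline is empty on every supported Python version, the DSL only touches nodes whose shape is stable across those versions. A non-empty result points at exactly the constructs that need an adapter.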

Comment thread docs/dsl_compiler.md
from rayorch.compiler import emit_ir

pipeline = MyPipeline()
ir = emit_ir(pipeline) # returns dict

Collaborator

Does this need to be so explicit? In the current version it seems to be integrated directly into the "pipeline compiler" list, so we could just write a dump function when needed. (Mainly, requiring the dict to hold only directly dumpable string objects is fairly restrictive.)

Author

This was written by AI for now, as a starting point for discussion. We need to find time to settle on a spec together.
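
As input to that discussion, here is one possible shape for a dump function that keeps the in-memory IR free to hold Python objects and only applies serialization rules at dump time. It is a sketch under assumptions: `dump_ir` and the `{"$ref": ...}` convention are hypothetical, and the IR layout shown in the usage note is invented for illustration. It relies only on the `default` hook of the standard `json.dumps`.

```python
import json
from typing import Any


def _to_jsonable(obj: Any) -> dict:
    """Encode classes and functions as an importable-looking reference."""
    qualname = getattr(obj, "__qualname__", None)
    module = getattr(obj, "__module__", None)
    if qualname and module:
        return {"$ref": f"{module}.{qualname}"}
    raise TypeError(f"cannot serialize {type(obj).__name__}")


def dump_ir(ir: dict) -> str:
    """Serialize an IR dict; non-JSON values go through _to_jsonable."""
    return json.dumps(ir, default=_to_jsonable, sort_keys=True, indent=2)
```

For example, `dump_ir({"nodes": [{"op": len, "args": [1]}]})` writes the `op` field as a `$ref` to `builtins.len`, while an object with no qualified name raises `TypeError` instead of being written lossily.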

Comment thread docs/dsl_compiler.md
exists so that a serialized IR can be loaded and executed without the original
Python source.

### TorchBackend (future)

Collaborator

Is this really necessary? Is it meant for pure torch operators? But multi-threaded pipeline parallelism probably isn't natively supported, right? Is it just to get the benefit of CUDA and GPU-native computation?

Author

It is necessary. Its main purposes are single-machine debugging (quick support for local compute such as Apple, Intel, and AMD) and support for simple single-machine multi-GPU environments.

Comment thread docs/dsl_compiler.md

Collaborator

It feels like there are two extra layers of abstraction, the IR and the backend. We can discuss whether they are necessary.

Author

Yep. I personally think some abstractions have to be added early in the design, even if we later find they really go unused and merge the abstraction layers.

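Collaborator

My original design was that where and how an operator reads, writes, and outputs is decided by the operator's own class. RayOrch can provide some official dataloaders.

Your version seems to lift this abstraction up to the executor. It doesn't discuss how the store target gets pushed down to operators and operator operations. In the end this may mean that user operators, or at least the operators that ingest input data, become special operators the executor can recognize. But isn't that coupling essentially just swapping operators? Wouldn't it be enough to put the API at the operator layer?

So I feel there's no need to expose this interface on the executor? We can discuss.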

Author

Yes, I roughly understand your design. But if we want to raise the whole system's reliability and throughput, object storage can't be avoided. Ideally, operators read and write files through the framework's unified API.
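
The two positions in this thread are not far apart in code. A minimal sketch, assuming a hypothetical `Store` protocol that the framework hands to operators: the operator still decides what it reads and writes, and the framework decides where the bytes live. `Store`, `LocalStore`, and `UppercaseOperator` are illustrative names, not RayOrch APIs; an object-storage backend would implement the same two methods.

```python
from pathlib import Path
from typing import Protocol


class Store(Protocol):
    """Hypothetical unified I/O interface handed to operators."""

    def read_bytes(self, key: str) -> bytes: ...
    def write_bytes(self, key: str, data: bytes) -> None: ...


class LocalStore:
    """Filesystem-backed store for local runs and debugging."""

    def __init__(self, root: str) -> None:
        self._root = Path(root)

    def read_bytes(self, key: str) -> bytes:
        return (self._root / key).read_bytes()

    def write_bytes(self, key: str, data: bytes) -> None:
        path = self._root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)


class UppercaseOperator:
    """Example operator: it owns its logic and keys, not the storage backend."""

    def run(self, store: Store, src: str, dst: str) -> None:
        text = store.read_bytes(src).decode("utf-8")
        store.write_bytes(dst, text.upper().encode("utf-8"))
```

Under this arrangement the executor only needs to construct a `Store` and pass it in; it does not have to recognize special input operators.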

3 participants