diff --git a/README.md b/README.md index 0c640fd..a67de33 100644 --- a/README.md +++ b/README.md @@ -32,7 +32,7 @@ Fluxon is designed around these problems. It separates data-plane resources, obj - **MQ (Elastic message queue)**: Decouples system dependencies and supports elastic message transport across heterogeneous resource pools - **FS (`S3`-compatible file, object, and cache acceleration system)**: Unifies multi-form storage so one system can cache key-value, file, and object data, while supporting remote access, `S3` forwarding, and large-scale cross-cluster migration for AI data and model files -![](./pics/fluxon架构图20260423.png) +![](./pics/fluxon_architecture.png) diff --git a/README_CN.md b/README_CN.md index 3f978e7..a138d86 100644 --- a/README_CN.md +++ b/README_CN.md @@ -21,8 +21,7 @@ Fluxon 的设计正是围绕这些问题展开。它将数据面资源、对象 - **MQ(弹性消息队列)**:解耦系统依赖,支撑异构资源池之间的弹性消息传输 - **FS(兼容 `S3` 的文件、对象与缓存加速系统)**:统一键值、文件、对象三类缓存能力,并支持 AI 数据与模型文件的远端访问、`S3` 转发和跨集群大规模迁移 -![](./pics/fluxon架构图20260423.png) - +![](./pics/fluxon_architecture.png)
diff --git "a/fluxon_doc_cn/blog/blog_2_\344\270\200\346\254\241 AI \345\244\247 Payload \346\266\210\346\201\257\351\230\237\345\210\227\347\232\204\346\216\247\345\210\266\351\235\242\351\207\215\346\236\204.md" "b/fluxon_doc_cn/blog/blog_2_\344\270\200\346\254\241 AI \345\244\247 Payload \346\266\210\346\201\257\351\230\237\345\210\227\347\232\204\346\216\247\345\210\266\351\235\242\351\207\215\346\236\204.md" new file mode 100644 index 0000000..f3e4f45 --- /dev/null +++ "b/fluxon_doc_cn/blog/blog_2_\344\270\200\346\254\241 AI \345\244\247 Payload \346\266\210\346\201\257\351\230\237\345\210\227\347\232\204\346\216\247\345\210\266\351\235\242\351\207\215\346\236\204.md" @@ -0,0 +1,164 @@ +# FluxonMQ:一次 AI 大 Payload 消息队列的控制面重构 + +AI 训练和推理系统里的消息队列,处理的已经不再是几 KB 的业务事件。在 VAE 解耦训练、数据处理流水线、多模态中间态传递和跨资源池任务交接里,producer 传出去的往往是几十 MB 甚至更大的张量 Payload。consumer 可能动态加入、退出、扩缩容,也可能分布在不同机器、资源池或子集群。FluxonMQ 服务的就是这类场景:让 producer 和 consumer 通过消息语义解耦,同时让大 Payload 继续利用 Fluxon KV owner 的共享内存和跨节点传输路径。 + +在这个设计里,MQ 层负责消息状态,KV 数据面负责 Payload。其中消息状态覆盖消息可见性、in-flight 归属、提交确认、失败重投和清理确认。Payload 保存在 KV owner 管理的内存和传输路径中,consumer 拿到消息后通过 Payload key 读取数据。这种分工让 MQ 可以承载大对象交接。 + +早期 FluxonMQ 使用 etcd 推进消息状态。producer 写入 Payload 后,把消息可见状态写到 etcd;consumer 从 etcd 扫描和抢占消息,读取 Payload 后再写回消费进度。这条路径结构清晰,也复用了 etcd 的一致性和租约能力。问题出现在高并发热路径上:每条消息周围的 ready、claim、inflight、offset、commit 都会形成控制面读写。Payload 传输还在 KV owner 中进行,但消息能否被及时发现、抢占和提交,开始受 etcd 状态推进速度限制。 + +这次 broker 优化针对的就是这段控制面热路径。etcd 仍然负责成员发现、租约、broker 发现和 channel 长期元数据;broker 接管每条消息的排队、抢占、提交、失败放回和清理确认;KV owner 继续负责 Payload 存储和传输。这个拆分把低频集群元数据和高频队列状态分开,让消息推进从外部 KV 存储操作转为 broker 内存状态更新。 + +![](../../pics/fluxon_mq.png) + +## 基础链路:Payload 在 KV,状态在队列 + +早期链路的关键是把 Payload 和消息状态分离。producer 先把大对象写进 KV owner,再把指向 Payload 的消息状态写入 etcd。consumer 从 etcd 扫描可消费消息,完成抢占后拿到 Payload key,再从 KV owner 读取实际数据,处理完成后把消费进度写回 etcd。etcd 只保存消息状态和进度,避免承担大对象存储压力。 + +随着 producer 和 consumer 数量增加,队列状态推进会成为更明显的成本。consumer 为了保持吞吐,会提高 batch size 和 prefetch 深度。prefetch 可以提前发起查找和抢占,但它并没有减少 etcd 
上的控制面操作,只是把这些操作前移。高并发下,本地 inflight 能否填深,取决于 etcd 能否持续快速完成可见消息查找、抢占和提交推进。 + +broker 链路把这些状态推进移到 broker 内部。producer 写入前先向 broker 申请 reservation。reservation 是一次写入尝试的占位,broker 返回 `reservation_id` 和 `msg_id`,并记录这条消息预计占用的 Payload bytes。Payload 写入 KV owner 成功后,producer 调用 `publish`,消息进入可消费队列。Payload 写失败时,producer 调用 `abort`,broker 释放占位和字节预算。这个顺序保证了 consumer 只能看到已经写入成功的 Payload。 + +consumer 通过 `fetch` 获取消息。broker 将消息从可消费队列移动到 in-flight,并返回 Payload key。in-flight 表示消息已经被某个 consumer 拿走,但还没有确认消费完成。consumer 读取 Payload 并完成处理后调用 `commit`,这一步成功后,broker 才认为这条消息已经完成消费。consumer 返回 Payload 后,Rust 后台任务异步删除 KV Payload;删除完成后再由内部 cleanup 路径释放 broker 的 Payload byte budget。consumer 失败、超时或被取消时,未 commit 的消息会重新放回可消费队列,等待后续投递。 + +这个流程把每条消息的状态推进留在 broker 内存中。`fetch`、`commit` 和 `requeue` 通过 P2P RPC 调用 broker,broker 更新本地状态后返回结果;cleanup 只作为 Rust 内部清理路径继续维护容量统计。etcd 从消息热路径中退出,只处理成员、租约和发现这类低频职责。 + +## broker 的进程边界 + +broker 作为独立进程运行,长期维护 MQ 队列状态。它的生命周期独立于 producer、consumer 和 KV owner。master 继续负责集群控制、租约和 owner 管理,broker 负责高频消息排队。把 broker 放在独立进程里,可以避免 MQ 热路径占用 master,并减少 master 故障和 MQ 队列状态之间的耦合。 + +当前实现中,broker 底层通信身份复用 external client,没有新增 closed runtime 角色。MQ 业务身份通过 member metadata 中的 `fluxon_mq_component=broker` 标记。broker 不注册 segment,不贡献共享内存,也不拥有 Payload。producer 和 consumer 通过 broker discovery 找到 broker,再用 P2P RPC 调用 broker。 + +这个边界保留了 Fluxon 现有通信层结构。broker 不会被 master 当作 KV owner 等待 segment 注册,P2P relay 和 external client 接入规则也可以继续复用。MQ 增加了一个控制面进程,但没有扩展一套新的底层角色体系。 + +## 实现结构 + +Rust 侧的 broker 状态位于 `fluxon_rs/fluxon_mq/src/broker.rs`。这部分实现沿用 KV 设计里的角色边界:`master` 维护集群控制面和路由,`owner` 承载共享内存、对象副本和跨节点传输,producer、consumer 和 broker 都以 `external_client` 身份接入,不贡献 owner 容量。这个边界在 [KV 设计 1 - 概览与分层](../design/kv_1_概览与分层.md) 里有完整说明。 + +broker 保存的是消息控制面状态和 Payload 引用。Payload bytes 仍然由 KV owner 管理,broker 只记录 `payload_key`、`payload_bytes`、消息信封和队列位置。 + +```rust +pub struct LocalBroker { + state: BrokerState, // broker 内存状态 +} + +struct BrokerState { + channels: HashMap, // 按 channel_id 保存队列状态 + payload_byte_capacity: 
u64, // broker 维度的 Payload byte budget 上限 + used_payload_bytes: u64, // 当前未释放消息占用的 Payload byte budget +} + +struct ChannelState { + config: BrokerChannelConfig, // channel_id 和 capacity + next_reservation_id: u64, // channel 内递增的 reservation 编号 + next_msg_by_producer: HashMap, // 每个 producer_id 的下一个 msg_id + pending: HashMap, // 已 reserve、尚未 publish 的消息 + visible: VecDeque, // 已写入 Payload、可被 consumer fetch 的消息 + inflight: HashMap, // 已被 consumer 取走、尚未 commit 的消息 + inflight_order: VecDeque, // inflight 消息顺序 + cleanup: VecDeque, // 已 commit、等待 Payload 清理的消息 + cleanup_inflight: HashMap, // 已分配给清理任务、等待内部清理确认的消息 + used_slots: i64, // channel 当前占用的消息槽位 + reserve_waiters: VecDeque, // 因容量不足等待 reserve 的请求 + fetch_waiters: VecDeque, // 因可见消息不足等待 fetch 的请求 +} +``` + +重复 `commit` 不再依赖单独的 committed 集合。broker 直接从 `cleanup` 和 `cleanup_inflight` 判断这条 reservation 是否已经完成首次提交但清理尚未结束;清理完成后,消息生命周期结束,再次提交会按不存在的 in-flight delivery 处理。 + +broker RPC 和内部状态机使用的主要消息结构如下: + +```rust +pub struct BrokerChannelConfig { + pub channel_id: i64, // channel 标识 + pub capacity: i64, // channel 消息槽位上限 +} + +pub struct BrokerReserveRequest { + pub channel_id: i64, // 目标 channel + pub producer_id: String, // producer 标识 + pub category: MqCategory, // MPSC 或 MPMC 子队列类型 + pub payload_bytes: u64, // 本条消息预计占用的 Payload bytes + pub now_ms: i64, // reserve 时间 +} + +pub struct BrokerFetchRequest { + pub channel_id: i64, // 目标 channel + pub consumer_id: String, // consumer 标识 + pub now_ms: i64, // fetch 时间 +} + +pub struct BrokerEnvelope { + pub channel_id: i64, // channel 标识 + pub producer_id: String, // producer 标识 + pub msg_id: i64, // producer 内递增消息编号 + pub reservation_id: u64, // 本次写入 reservation 编号 + pub payload_key: String, // KV owner 中的 Payload key + pub payload_bytes: u64, // Payload byte budget 计数 + pub reserved_at_ms: i64, // reserve 时间 + pub published_at_ms: Option, // publish 时间,未 publish 时为空 +} + +pub struct BrokerCommitOutcome { + pub first_commit: bool, // 本次 commit 是否首次生效 + pub cleanup: Option, 
// 首次 commit 后生成的清理任务 +} + +pub struct BrokerCommitBatchOutcome { + pub first_commit_count: usize, // batch 中首次 commit 成功的数量 + pub cleanup: Vec, // batch 生成的清理任务 +} +``` + +状态流转可以简化为下面这条链路: + +![](../../pics/blog2_mq_broker_state.png) + +producer 热路径位于 `fluxon_rs/fluxon_mq/src/producer.rs`。broker 路径的写入顺序是 `reserve -> KV put -> publish`。`reserve` 成功后,broker 已经生成 `payload_key` 并扣减 `payload_bytes`;producer 随后把实际 Payload 写入 KV owner。只有 KV 写入成功后,`publish` 才会把消息从 `pending` 推到 `visible`,因此 consumer 只能 fetch 到已经完成 Payload 写入的消息。如果 KV 写入失败,producer 会调用 `abort` 释放 reservation 和 byte budget。 + +当 channel 满或 `payload_byte_capacity` 不足时,producer 在 Rust 热路径里按 `BrokerError::ChannelFull` 或 `BrokerError::PayloadBytesFull` 做退避重试。这个重试发生在 broker reserve 阶段,等待条件直接来自 `used_slots` 和 `used_payload_bytes`,比 Python 外层固定 sleep 更贴近真实队列状态。 + +consumer 热路径位于 `fluxon_rs/fluxon_mq/src/consumer.rs` 和 `fluxon_rs/fluxon_pyo3/src/mpsc.rs`。consumer 先通过 broker `fetch` 取得 `BrokerEnvelope`,再用其中的 `payload_key` 从 KV owner 读取 Payload。`commit` 成功后,Payload 立即返回给上层;Rust 后台清理任务随后删除 KV Payload,并通过 broker 内部清理确认释放 byte budget。Python 层主要负责 API 包装、bench 编排和 teardown;消息推进、背压等待和 cleanup 状态已经收敛到 Rust broker 路径。 + +![](../../pics/blog2_mq_payload_flow.png) + +MPMC bench 的清理逻辑位于 `fluxon_py/tests/test_api_chan_mpmc/test_mpmc_simple_bench.py`。teardown 时会删除本轮 MPMC 子 MPSC channel,并继续删除 broker 返回的 Payload keys。这里需要同时处理两类资源:broker 侧的 `used_payload_bytes` 和 KV owner 侧的真实 Payload。前者靠 `cleanup_ack`、`abort` 或 `delete_channel` 释放;后者靠对 `payload_key` 执行 KV delete 释放。两边都释放后,连续 case 才不会被上一轮残留数据占住 owner pool 或 broker byte budget。 + +## 性能结果 + +测试环境为单机,owner pool 为 `100GB`,channel capacity 为 `4096`,低日志运行,Payload 为 DLPack 数据。对比对象是 etcd 队列推进和 broker 队列推进,两边使用相同的 producer、consumer、batch、prefetch 和 Payload 参数。 + +![](../../pics/mq_bench.svg) + +| case | P/C | batch/prefetch | Payload | etcd MB/s | broker MB/s | 变化 | +| --- | ---: | ---: | --- | ---: | ---: | ---: | +| 01 | 16/8 | 40/40 | 4.8MB | 7660.80 | 8010.24 | +4.6% | +| 02 | 16/12 | 
40/40 | 4.8MB | 7372.80 | 9496.80 | +28.8% | +| 03 | 24/8 | 40/40 | 4.8MB | 7046.40 | 7350.24 | +4.3% | +| 04 | 16/8 | 40/120 | 4.8MB | 6931.20 | 9791.52 | +41.3% | +| 05 | 16/4 | 40/40 | 4.8MB | 7756.80 | 8294.40 | +6.9% | +| 06 | 16/2 | 40/40 | 4.8MB | 6201.60 | 5875.20 | -5.3% | +| 07 | 16/4 | 48/48 | 4.8MB | 7925.76 | 8155.68 | +2.9% | +| 08 | 16/4 | 64/64 | 4.8MB | 7802.88 | 8382.24 | +7.4% | +| 09 | 16/4 | 48/48 | 8MB | 12441.60 | 14153.60 | +13.8% | +| 10 | 16/4 | 48/48 | 12MB | 17625.60 | 18356.40 | +4.1% | +| 11 | 16/4 | 48/48 | 16MB | 22041.60 | 26102.40 | +18.4% | +| 12 | 16/4 | 48/48 | 20MB | 26016.00 | 18222.00 | -30.0% | +| 13 | 16/4 | 48/48 | 24MB | 29030.40 | 46552.80 | +60.4% | +| 14 | 16/4 | 48/48 | 32MB | 34252.80 | 56624.00 | +65.3% | +| 15 | 24/4 | 48/48 | 32MB | 42393.60 | 44067.20 | +3.9% | +| 16 | 32/4 | 48/48 | 32MB | 35328.00 | 42198.40 | +19.4% | +| 17 | 24/2 | 48/48 | 32MB | 17817.60 | 36969.60 | +107.5% | +| 18 | 24/4 | 48/48 | 40MB | 51264.00 | 63656.00 | +24.2% | +| 19 | 24/4 | 48/48 | 48MB | 54835.20 | 54451.20 | -0.7% | +| 20 | 24/4 | 48/48 | 56MB | 57792.00 | 85254.40 | +47.5% | +| 21 | 24/4 | 48/48 | 64MB | 48844.80 | 89952.00 | +84.2% | + +小 Payload 下,broker 的收益取决于并发组织。`16p/12c b40/pf40 4.8MB` 从 `7372.80 MB/s` 提升到 `9496.80 MB/s`,提升 `28.8%`;`16p/8c b40/pf120 4.8MB` 从 `6931.20 MB/s` 提升到 `9791.52 MB/s`,提升 `41.3%`。这些点的共同特征是 consumer 或 prefetch 对控制面推进的需求更强,broker 能让本地 inflight 更稳定地填起来。 + +大 Payload 下,控制面阻塞减少后,数据面更容易持续跑满。`24MB` 从 `29030.40 MB/s` 提升到 `46552.80 MB/s`,`32MB` 从 `34252.80 MB/s` 提升到 `56624.00 MB/s`,`56MB` 从 `57792.00 MB/s` 提升到 `85254.40 MB/s`,`64MB` 从 `48844.80 MB/s` 提升到 `89952.00 MB/s`。纯 etcd 路径的最佳点是 `24p/4c b48/pf48 dlpack 56MB`,稳态吞吐 `57792.00 MB/s`;broker 路径的最佳点是 `24p/4c b48/pf48 dlpack 64MB`,稳态吞吐 `89952.00 MB/s`。 + +## 结尾 + +FluxonMQ broker 优化把每条消息的高频状态推进从 etcd 迁到 broker,etcd 保留成员、租约、发现和长期元数据职责,KV owner 继续承载大 Payload 数据面。这个调整让 MQ 控制面更贴近消息运行时状态,也让 Payload 传输继续复用 Fluxon 的共享内存和跨节点数据路径。 + +在单机 `100GB` owner pool 测试中,etcd 
路径最高 `57.79GB/s`,broker 路径最高 `89.95GB/s`。更重要的是,队列推进已经从外部 KV 存储读写变成内存状态机更新,为后续多 broker 分片、批量 RPC、跨节点 MQ 和更细粒度容量治理提供了更清晰的演进基础。 diff --git "a/fluxon_doc_cn/user_doc/\347\224\250\346\210\267 - 1 - \346\236\266\346\236\204\345\222\214\346\246\202\345\277\265.md" "b/fluxon_doc_cn/user_doc/\347\224\250\346\210\267 - 1 - \346\236\266\346\236\204\345\222\214\346\246\202\345\277\265.md" index fc1dd8b..47313f0 100644 --- "a/fluxon_doc_cn/user_doc/\347\224\250\346\210\267 - 1 - \346\236\266\346\236\204\345\222\214\346\246\202\345\277\265.md" +++ "b/fluxon_doc_cn/user_doc/\347\224\250\346\210\267 - 1 - \346\236\266\346\236\204\345\222\214\346\246\202\345\277\265.md" @@ -8,7 +8,7 @@ ### 系统全景架构 -![](../../pics/fluxon架构图20260423.png) +![](../../pics/fluxon_architecture.png) 组件视角的全景图,用来定位各组件的职责和依赖关系。 diff --git a/fluxon_doc_en/user_doc/User - 1 - Architecture and Concepts.md b/fluxon_doc_en/user_doc/User - 1 - Architecture and Concepts.md index f0a6417..2a8fd0a 100644 --- a/fluxon_doc_en/user_doc/User - 1 - Architecture and Concepts.md +++ b/fluxon_doc_en/user_doc/User - 1 - Architecture and Concepts.md @@ -8,7 +8,7 @@ This page explains the core concepts and config fields that appear throughout th ### System Overview -![](../../pics/架构全景图.png) +![](../../pics/fluxon_architecture_overview.png) - Control plane / metadata: `etcd + Master` for members, leases, routing, and connection-state metadata - Data plane: `shared memory + transfer engine` for same-host reuse and cross-node data transfer diff --git a/fluxon_py/_api_ext_chan/mpmc.py b/fluxon_py/_api_ext_chan/mpmc.py index 4ddbc1e..085e76c 100644 --- a/fluxon_py/_api_ext_chan/mpmc.py +++ b/fluxon_py/_api_ext_chan/mpmc.py @@ -96,18 +96,34 @@ LOCAL_MEMBER_ID_RANGE_SIZE = 32 MPMC_CREATE_LOCK_TTL_SECONDS = 10 MPMC_CREATE_LOCK_TIMEOUT_SECONDS = 10.0 +MPMC_CLEANUP_ETCD_TIMEOUT_SECONDS = 2.0 -def new_etcd_client(api: KvClient) -> Result[etcd3.Etcd3Client, ApiError]: +def _close_lease_handle(handle: Optional[object], label: str) -> None: 
+ if handle is None: + return + try: + handle.close() # type: ignore[attr-defined] + except Exception as e: # noqa: BLE001 + logging.warning("failed to close lease handle %s: %s", label, e) + + +def new_etcd_client( + api: KvClient, *, timeout_seconds: Optional[float] = None +) -> Result[etcd3.Etcd3Client, ApiError]: """Create etcd client""" etcd_config: List[str] = api.get_etcd_config() first_address: str = etcd_config[0] host: str port_str: str host, port_str = first_address.split(":") - print(f"new_etcd_client: {host}:{port_str}") try: - client: etcd3.Etcd3Client = etcd3.client(host=host, port=int(port_str)) + kwargs: Dict[str, Any] = {} + if timeout_seconds is not None: + kwargs["timeout"] = float(timeout_seconds) + client: etcd3.Etcd3Client = etcd3.client( + host=host, port=int(port_str), **kwargs + ) return Result.new_ok(client) except Exception as e: return Result.new_error( @@ -136,8 +152,10 @@ def stable_revoke_lease(api: KvClient, lease_id: int) -> Result[OkNone, ApiError endpoint = endpoints[0] if endpoints else None errors: List[str] = [] - for attempt in range(3): - client_res = new_etcd_client(api) + for attempt in range(2): + client_res = new_etcd_client( + api, timeout_seconds=MPMC_CLEANUP_ETCD_TIMEOUT_SECONDS + ) if not client_res.is_ok(): err = client_res.unwrap_error() errors.append(str(err)) @@ -183,8 +201,10 @@ def stable_delete_ready_keys_for_member( member_id_str = str(member_id) errors: List[str] = [] - for attempt in range(3): - client_res = new_etcd_client(api) + for attempt in range(2): + client_res = new_etcd_client( + api, timeout_seconds=MPMC_CLEANUP_ETCD_TIMEOUT_SECONDS + ) if not client_res.is_ok(): err = client_res.unwrap_error() errors.append(str(err)) @@ -203,22 +223,7 @@ def stable_delete_ready_keys_for_member( for key in keys_to_delete: client.delete(key) - # Verify: keys should be gone immediately after delete on the same prefix. 
- remaining: List[bytes] = [] - for value, meta in client.get_prefix(prefix): - if value is None: - continue - if value.decode() != member_id_str: - continue - remaining.append(meta.key) - - if len(remaining) == 0: - return Result.new_ok(OK_NONE) - - errors.append( - f"attempt={attempt}: remaining ready keys after delete: {remaining!r}" - ) - time.sleep(0.1) + return Result.new_ok(OK_NONE) except Exception as e: # noqa: BLE001 errors.append(f"attempt={attempt}: {e}") time.sleep(0.1) @@ -1802,19 +1807,16 @@ def close(self) -> Result[OkNone, ApiError]: except Exception as e: # noqa: BLE001 logging.warning(f"MPMC channel {self.mpmc_id} stop_watching failed: {e}") - # Drop PyLease handles to stop keepalive; etcd leases with - # revoke_on_drop=False are intentionally not revoked. - # Setting to None drops the PyO3 handle immediately in CPython, - # which releases the underlying Rust RAII and unregisters from - # the keepalive actor. - if hasattr(self, "_lm_mpmc_member"): - self._lm_mpmc_member = None # type: ignore[assignment] - if hasattr(self, "_lm_mpmc_global"): - self._lm_mpmc_global = None # type: ignore[assignment] - if hasattr(self, "_lm_cluster_long"): - self._lm_cluster_long = None # type: ignore[assignment] - if hasattr(self, "_lm_kv_payload"): - self._lm_kv_payload = None # type: ignore[assignment] + # Close lease handles explicitly so keepalive entries are unregistered + # before the owning KvClient starts shutting down. 
+ _close_lease_handle(self._lm_mpmc_member, "mpmc_member") + self._lm_mpmc_member = None + _close_lease_handle(self._lm_mpmc_global, "mpmc_global") + self._lm_mpmc_global = None + _close_lease_handle(self._lm_cluster_long, "mpmc_cluster_long") + self._lm_cluster_long = None + _close_lease_handle(self._lm_kv_payload, "mpmc_kv_payload") + self._lm_kv_payload = None # Return a minimal Ok result to satisfy the explicit Result API contract return Result.new_ok(OK_NONE) @@ -2025,6 +2027,12 @@ def _record_mpsc_producer(self, mpsc_producer: MPSCChanProducer): def put_data( self, value: Dict[str, Union[int, float, bool, str, bytes, DLPacked]] + ) -> Result[bool, ApiError]: + return self._put_data_impl(value) + + def _put_data_impl( + self, + value: Dict[str, Union[int, float, bool, str, bytes, DLPacked]], ) -> Result[bool, ApiError]: """Put data to the MPMC channel. @@ -2051,9 +2059,11 @@ def put_data( ) ) + capacity = int(self.chan_config["capacity"]) + assert capacity > 0, f"invalid MPMC channel capacity: {capacity}" + # Do not hold _op_lock while performing network-heavy operations (count_prefix/put_data). # Otherwise close() may block behind a long RPC and tests like MQ capacity+auto-clean can hang. 
- capacity = int(self.chan_config["capacity"]) # validated upfront while True: if self.shutdown_ctl.closed: return Result[bool, ApiError].new_error( @@ -2159,6 +2169,20 @@ def put_data( return Result[bool, ApiError].new_ok(True) err = put_result.unwrap_error() + if isinstance(err, MessageBufferFullError): + blocking_observed_unix_ms = int(time.time() * 1000) + try: + candidate.record_blocking_put_observed(blocking_observed_unix_ms) + except Exception as e: # noqa: BLE001 + logging.warning( + "MPMCChanProducer mpmc_id=%s failed to record broker backpressure on mpsc_id=%s producer_idx=%s: %s", + self.mpmc_id, + candidate.get_chan_id(), + candidate.get_producer_id(), + e, + ) + time.sleep(0.02) + continue logging.error( "MPMCChanProducer mpmc_id=%s failed to put data on mpsc_id=%s producer_idx=%s: %s", self.mpmc_id, @@ -2352,12 +2376,25 @@ def __init__( self.mpsc_consumer: Optional[MPSCChanConsumer] = None self.bound_mpsc_id: Optional[str] = None - # Get next available channel and bind to it - fails=[] - for i in range(10): + # Get next available channel and bind to it. Concurrent consumers may + # lose a claim/create race; retry those bounded authority-state races. 
+ fails: List[ApiError] = [] + max_bind_attempts = 10 + for i in range(max_bind_attempts): next_channel_result = self.mpmc_channel.get_next_available_channel(self.api, self.chan_config) if not next_channel_result.is_ok(): - raise ValueError(f"Failed to get next available channel: {next_channel_result.unwrap_error()}") + err = next_channel_result.unwrap_error() + if isinstance(err, (ChanCreateError, ChanBindError)): + logging.warning( + "MPMC consumer failed to get next channel on attempt %s/%s; retrying: %s", + i + 1, + max_bind_attempts, + err, + ) + fails.append(err) + time.sleep(0.1) + continue + raise ValueError(f"Failed to get next available channel: {err}") next_channel = next_channel_result.unwrap() if next_channel is None: @@ -2380,13 +2417,15 @@ def __init__( # claimed inside MPMCChannel return with _mpmc_ready_claimed=True. res=self.mark_channel_ready(next_channel.get_chan_id()) if not res.is_ok(): - logging.warning(f"Failed to mark channel ready: {res.unwrap_error()}") + err = res.unwrap_error() + logging.warning(f"Failed to mark channel ready: {err}") # Close the just-created/bound MPSC consumer to avoid dangling consumers try: next_channel.release_local_handle().unwrap() except Exception as e: logging.debug(f"close leaked MPSC consumer error: {e}") - fails.append(res.unwrap_error()) + fails.append(err) + time.sleep(0.1) continue if res.unwrap(): self.mpsc_consumer = next_channel @@ -2402,12 +2441,15 @@ def __init__( next_channel.release_local_handle().unwrap() except Exception as e: logging.debug(f"close leaked MPSC consumer error: {e}") - fails.append("transaction failed") + fails.append(ChanBindError("ready channel claim transaction failed")) + time.sleep(0.1) continue else: raise ValueError(f"Unexpected channel type: {type(next_channel)}") - raise ValueError(f"Failed to mark channel ready with {len(fails)} fails: {fails}") + raise ValueError( + f"Failed to bind MPMC consumer after {max_bind_attempts} attempts: {fails}" + ) def request_shutdown(self) 
-> None: if self.shutdown_ctl.closed: @@ -2415,7 +2457,7 @@ def request_shutdown(self) -> None: self.shutdown_ctl.closed = True if self.mpsc_consumer is not None and hasattr(self.mpsc_consumer, "request_shutdown"): self.mpsc_consumer.request_shutdown() - + def get_chan_id(self) -> str: """ Get the channel id. @@ -2431,6 +2473,18 @@ def get_consumer_id(self) -> str: def get_data( self, batch_size: int = 1, try_time: Optional[int] = None, prefetch_num: int = 0 + ) -> Result[List[Dict[str, Union[int, float, bool, str, bytes, DLPacked]]], ApiError]: + del prefetch_num + return self._get_data_impl( + batch_size=batch_size, + try_time=try_time, + ) + + def _get_data_impl( + self, + *, + batch_size: int, + try_time: Optional[int], ) -> Result[List[Dict[str, Union[int, float, bool, str, bytes, DLPacked]]], ApiError]: """Get data from the bound MPSC channel. @@ -2463,22 +2517,7 @@ def get_data( # Get data from MPSC consumer (will automatically return producer info when MPSC acts as submodule) from .mpsc import ConsumedMessage - # # Map MPMC-level prefetch to per-MPSC prefetch: divide by active MPMC consumers, ceil, min divisor=1 - # try: - # active_consumers = self.mpmc_channel._get_active_consumer_count() - # except Exception as e: # noqa: BLE001 - # logging.warning( - # f"[Unreachable] Failed to get active consumer count: {e}" - # ) - # active_consumers = 0 - - # # ceil division without importing math: (a + b - 1) // b - # mapped_prefetch = 0 - # if prefetch_num > 0 and active_consumers > 0: - # mapped_prefetch = (prefetch_num + active_consumers - 1) // active_consumers - result = self.mpsc_consumer.get_data( - batch_size, try_time, prefetch_num=prefetch_num - ) + result = self.mpsc_consumer.get_data(batch_size, try_time=try_time) if not result.is_ok(): err = result.unwrap_error() if self.shutdown_ctl.closed: @@ -2548,6 +2587,18 @@ def close(self) -> Result[OkNone, ApiError]: f"MPMCChanConsumer {self.get_consumer_id()} before_close on underlying MPSC consumer failed: {e}" 
) + # Close the underlying MPSC consumer first so local keepalive/prefetch + # tasks stop before lease revoke and ready-key cleanup. + try: + if self.mpsc_consumer is not None: + self.mpsc_consumer.release_local_handle().unwrap() + except Exception as e: # noqa: BLE001 + logging.warning( + f"MPMCChanConsumer {self.get_consumer_id()} failed to close underlying MPSC consumer: {e}" + ) + finally: + self.mpsc_consumer = None + # Delete ready keys for this consumer (best-effort). mpmc_id = self.mpmc_id assert mpmc_id is not None, "MPMC channel ID is None" @@ -2599,17 +2650,6 @@ def close(self) -> Result[OkNone, ApiError]: f"MPMCChanConsumer {self.get_consumer_id()} failed to revoke member lease: {e}" ) - # Close the underlying MPSC consumer and drop the handle. - try: - if self.mpsc_consumer is not None: - self.mpsc_consumer.release_local_handle().unwrap() - except Exception as e: # noqa: BLE001 - logging.warning( - f"MPMCChanConsumer {self.get_consumer_id()} failed to close underlying MPSC consumer: {e}" - ) - finally: - self.mpsc_consumer = None - # Optional sub-component cleanup. try: if hasattr(self, 'rate_limiter') and self.rate_limiter is not None: diff --git a/fluxon_py/_api_ext_chan/mpsc.py b/fluxon_py/_api_ext_chan/mpsc.py index 1eeac76..7905c4e 100644 --- a/fluxon_py/_api_ext_chan/mpsc.py +++ b/fluxon_py/_api_ext_chan/mpsc.py @@ -8,10 +8,9 @@ Old Python implementations (ChanManager, etcd watchers, prefetch queues) have been removed. -Currently this shim focuses on wiring up leases and identities. Data -path operations (`put_data`/`get_data`) are intentionally left as -placeholders and should be implemented in Rust and exposed via -`fluxon_pyo3` in follow-up work. +Broker-backed data-path operations are the default public contract. +The old direct MPSC data path is kept only behind private helpers for +short-lived internal checks. 
""" from __future__ import annotations @@ -55,6 +54,11 @@ logging = init_logger(__name__) +MPSC_PREFETCH_TARGET_MAX = 256 +MPSC_KVCLIENT_KEEPALIVE_RETRY_SLEEP_SECONDS = 0.05 +MPSC_KVCLIENT_KEEPALIVE_RETRIES = 3 +_LEASE_BACKEND_CALLBACK_LOCKS: Dict[str, threading.Lock] = {} +_LEASE_BACKEND_CALLBACK_LOCKS_GUARD = threading.Lock() # --------------------------------------------------------------------------- # Test-only GC close markers @@ -269,6 +273,11 @@ def _ensure_kvclient_lease_backend(api: KvClient, cluster: str) -> Any: message="KvClient must implement KvLeaseApi for MPSC payload lease", ) + with _LEASE_BACKEND_CALLBACK_LOCKS_GUARD: + callback_lock = _LEASE_BACKEND_CALLBACK_LOCKS.setdefault( + cluster, threading.Lock() + ) + def allocate_cb(ttl_seconds: int) -> int: """Bridge to KvLeaseApi.allocate_lease for the given TTL. @@ -279,7 +288,8 @@ def allocate_cb(ttl_seconds: int) -> int: Do NOT raise ApiError dataclasses here (they are not Exceptions) to avoid PyErr(TypeError: exceptions must derive from BaseException). """ - res = api.allocate_lease(int(ttl_seconds)) + with callback_lock: + res = api.allocate_lease(int(ttl_seconds)) if not res.is_ok(): # Raise a real Python Exception so PyO3 converts it to Err(...) raise RuntimeError( @@ -297,8 +307,21 @@ def keepalive_cb(lease_id: int) -> None: cause type conversion errors in PyO3. See logs: "exceptions must derive from BaseException" when raising non-Exception ApiError values. """ - # Keepalive must not alter TTL; do not pass custom_ttl - res = api.keepalive_lease(int(lease_id)) + # Keepalive must not alter TTL; do not pass custom_ttl. The PyO3 + # KvClient object uses mutable Rust borrows, so serialize callbacks + # from the lease actor to avoid re-entering the same client handle. 
+ for attempt in range(MPSC_KVCLIENT_KEEPALIVE_RETRIES): + with callback_lock: + res = api.keepalive_lease(int(lease_id)) + if res.is_ok(): + _ = res.unwrap() + return None + err = res.unwrap_error() + if "Already mutably borrowed" in str(err) and attempt + 1 < MPSC_KVCLIENT_KEEPALIVE_RETRIES: + time.sleep(MPSC_KVCLIENT_KEEPALIVE_RETRY_SLEEP_SECONDS) + continue + break + if not res.is_ok(): err = res.unwrap_error() # When the client is shutting down, background keepalive calls can race with the @@ -311,9 +334,6 @@ def keepalive_cb(lease_id: int) -> None: raise RuntimeError( f"kvclient keepalive_lease failed for cluster={cluster}: {err}" ) - # Success: consume Ok(None) to satisfy strict Result policy - _ = res.unwrap() - # Success path: return None explicitly to map to Rust () return None # Inject kvclient allocate/keepalive callbacks while constructing LeaseBackendUid. @@ -403,6 +423,11 @@ def new_consumer( parent_mpmc_member_id_opt, ) + def delete_broker_channel(self, chan_id: str) -> list[str]: + if not isinstance(chan_id, str) or not chan_id.isdigit(): + raise ValueError(f"invalid broker channel id: {chan_id!r}") + return list(self._inner.delete_broker_channel(int(chan_id))) + def close(self) -> None: self._inner.close() @@ -503,11 +528,13 @@ def __init__( # through the Rust MPSC layer. self._payload_lease_id = self._handle.payload_lease_id() # type: ignore[attr-defined] + self._handle.init_broker() # type: ignore[attr-defined] + # Expose chan_id for legacy call sites that accessed the attribute. 
self.chan_id = self._chan_id logging.info( - "%s initialized via Rust MPSC: chan_id=%s, producer_idx=%s", + "%s initialized via Rust MPSC broker path: chan_id=%s, producer_idx=%s", self.dbg_tag(), self.get_chan_id(), self.get_producer_id(), @@ -543,6 +570,25 @@ def record_blocking_put_observed(self, unix_ms: int) -> None: def put_data( self, value: Dict[str, Union[int, float, bool, str, bytes, DLPacked]] + ) -> Result[bool, ApiError]: + return self._put_data_with_writer( + value, + self._handle.put_flat_dict_ptrs, # type: ignore[attr-defined] + ) + + def _put_data_legacy_for_internal_check( + self, value: Dict[str, Union[int, float, bool, str, bytes, DLPacked]] + ) -> Result[bool, ApiError]: + """Use the old direct MPSC write path for temporary internal checks only.""" + return self._put_data_with_writer( + value, + self._handle.put_flat_dict_ptrs_legacy_for_internal_check, # type: ignore[attr-defined] + ) + + def _put_data_with_writer( + self, + value: Dict[str, Union[int, float, bool, str, bytes, DLPacked]], + writer: Any, ) -> Result[bool, ApiError]: """Put data into the channel via Rust backend. @@ -576,7 +622,7 @@ def put_data( dlpack_capsules: List[object] = [] try: ptrs = _fluxon_kv.build_flat_dict_ptrs(value, keepalive, dlpack_capsules) - self._handle.put_flat_dict_ptrs(ptrs) # type: ignore[attr-defined] + writer(ptrs) except Exception as e: # pragma: no cover - thin shim if _is_close_during_put_error(e): self.shutdown_ctl.closed = True @@ -608,6 +654,10 @@ def put_data( # If Rust changes LeaseMgrError variants or mappings, update: # 1) The LeaseMgrError mapping in py_error_from_kv_error; # 2) The check here and its corresponding tests. 
+ if e.__class__.__name__ == "MessageBufferFullError": + logging.debug("%s put_flat_dict_ptrs backpressured: %s", self.dbg_tag(), e) + return Result[bool, ApiError].new_error(e) # type: ignore[arg-type] + logging.error("%s put_flat_dict_ptrs failed: %s", self.dbg_tag(), e) if isinstance(e, PayloadLeaseNotFoundError): # Mark closed and best-effort notify Rust side to stop callbacks/holds. @@ -817,11 +867,12 @@ def __init__( else: self._handle.init_payload_callback(self._build_get_payload()) # type: ignore[attr-defined] self._handle.init_delete_callback(self._build_delete_callback()) # type: ignore[attr-defined] + self._handle.init_broker() # type: ignore[attr-defined] # Guard to make close idempotent without relying on None checks. self._closed_local: bool = False logging.info( - "%s initialized via Rust MPSC: chan_id=%s, consumer_idx=%s, payload_backend=%s", + "%s initialized via Rust MPSC broker path: chan_id=%s, consumer_idx=%s, payload_backend=%s", self._dbg_tag, self._chan_id, self._consumer_id, @@ -1080,38 +1131,144 @@ def get_data( List[Union[Dict[str, Union[int, float, bool, str, bytes, DLPacked]], ConsumedMessage]], ApiError, ]: - """Unified prefetch-first get API. 
+ return self._get_data_broker( + batch_size=batch_size, + try_time=try_time, + prefetch_num=prefetch_num, + ) + + def _get_data_legacy_for_internal_check( + self, + batch_size: int = 1, + try_time: Optional[int] = None, + prefetch_num: int = 0, + ) -> Result[ + List[Union[Dict[str, Union[int, float, bool, str, bytes, DLPacked]], ConsumedMessage]], + ApiError, + ]: + """Use the old prefetch MPSC read path for temporary internal checks only.""" + return self._get_data_legacy_prefetch( + batch_size=batch_size, + try_time=try_time, + prefetch_num=prefetch_num, + ) + + def _get_data_broker( + self, + *, + batch_size: int, + try_time: Optional[int], + prefetch_num: int, + ) -> Result[ + List[Union[Dict[str, Union[int, float, bool, str, bytes, DLPacked]], ConsumedMessage]], + ApiError, + ]: + """Get data via the broker-backed public path.""" + timeout_ms = self._get_timeout_ms(try_time) + prefetch_target = min( + batch_size + max(prefetch_num, 0), + MPSC_PREFETCH_TARGET_MAX, + ) + try: + batch = self._handle.get_batch( # type: ignore[attr-defined] + batch_size, + prefetch_target, + timeout_ms, + ) + except Exception as e: + if self.shutdown_ctl.closed: + api_err: ApiError = ChannelClosedError( + message="Consumer is closed.", + channel_id=self._chan_id, + ) + elif isinstance(e, ApiError): + api_err = e + else: + api_err = MqGetDataUnknownError.from_exception( + e, channel_id=self._chan_id, consumer_id=self._consumer_id + ) + if isinstance(api_err, (MessageConsumptionNoNewMessageError, ChannelClosedError)): + logging.debug("%s get_batch finished without payload: %s", self.dbg_tag(), api_err) + else: + logging.error("%s get_batch failed: %s", self.dbg_tag(), api_err) + return Result[ + List[ + Union[Dict[str, Union[int, float, bool, str, bytes, DLPacked]], ConsumedMessage] + ], + ApiError, + ].new_error(api_err) + + if not batch: + return Result[ + List[ + Union[Dict[str, Union[int, float, bool, str, bytes, DLPacked]], ConsumedMessage] + ], + ApiError, + ].new_error( + 
MessageConsumptionNoNewMessageError("No message available") + ) + + return Result(batch) + + def _get_data_legacy_prefetch( + self, + *, + batch_size: int, + try_time: Optional[int], + prefetch_num: int, + ) -> Result[ + List[Union[Dict[str, Union[int, float, bool, str, bytes, DLPacked]], ConsumedMessage]], + ApiError, + ]: + """Get data through the old direct MPSC prefetch path.""" + timeout_ms = self._get_timeout_ms(try_time) + + return self._get_data_with_fetcher( + batch_size=batch_size, + fetch_one=lambda prefetch_target, _timeout_ms: self._handle.get_one_legacy_for_internal_check( # type: ignore[attr-defined] + prefetch_target, + timeout_ms, + ), + prefetch_target=min( + batch_size + max(prefetch_num, 0), + MPSC_PREFETCH_TARGET_MAX, + ), + timeout_ms=timeout_ms, + ) + + def _get_timeout_ms(self, try_time: Optional[int]) -> Optional[int]: + if try_time is None: + return None + t_sec = try_time if try_time > 0 else 1 + timeout_ms = int(t_sec * 1000) + assert timeout_ms > 0 + return timeout_ms + + def _get_data_with_fetcher( + self, + *, + batch_size: int, + fetch_one: Any, + prefetch_target: int = 0, + timeout_ms: Optional[int] = None, + ) -> Result[ + List[Union[Dict[str, Union[int, float, bool, str, bytes, DLPacked]], ConsumedMessage]], + ApiError, + ]: + """Common get loop used by broker and internal legacy checks. Semantics: - If it returns Ok([...]), each element is from a successful get_one call. - - If any get_one in this batch raises an error, the entire batch fails and - returns Err(ApiError) immediately (no "partial success" Ok list). - - The window size is mapped to `batch_size + prefetch_num`, so the underlying - Rust actor maintains a local prefetch queue of that size. + - NoNewMessage/ChannelClosed only fail the call when the batch is still empty. + Already-consumed items must be returned to avoid losing partial progress. + - Payload/decode/unknown errors still fail immediately. 
""" - prefetch_target = batch_size + max(prefetch_num, 0) - - # Inline minimal fetch loop with explicit prefetch_target to keep - # ChannelConsumer.try_get_data signature aligned while still - # honoring the calculated window size here. results: List[Union[Dict[str, Union[int, float, bool, str, bytes, DLPacked]], ConsumedMessage]] = [] - # try_time is seconds in Python; Rust get_one expects milliseconds. - timeout_ms: Optional[int] - if try_time is None: - timeout_ms = None - else: - # Compatibility: try_time must not be 0; if callers pass 0, treat it as 1 second. - t_sec = try_time if try_time > 0 else 1 - timeout_ms = int(t_sec * 1000) - assert timeout_ms > 0 - + for _ in range(batch_size): try: - # Pass timeout_ms (converted from try_time seconds) to Rust. - obj = self._handle.get_one(prefetch_target, timeout_ms) # type: ignore[attr-defined] + obj = fetch_one(prefetch_target, timeout_ms) except Exception as e: - logging.error("%s get_one failed: %s", self.dbg_tag(), e) # Rust is expected to raise an extension-layer ApiError. To avoid carrying # arbitrary Exception types in Result, wrap non-ApiError into # MqGetDataUnknownError to keep the error taxonomy narrow. 
@@ -1126,6 +1283,12 @@ def get_data( api_err = MqGetDataUnknownError.from_exception( e, channel_id=self._chan_id, consumer_id=self._consumer_id ) + if isinstance(api_err, (MessageConsumptionNoNewMessageError, ChannelClosedError)): + logging.debug("%s get_one finished without payload: %s", self.dbg_tag(), api_err) + if results: + return Result(results) + else: + logging.error("%s get_one failed: %s", self.dbg_tag(), api_err) return Result[ List[ Union[Dict[str, Union[int, float, bool, str, bytes, DLPacked]], ConsumedMessage] diff --git a/fluxon_py/kvclient/fluxon.py b/fluxon_py/kvclient/fluxon.py index 1325e3d..6a4dacc 100644 --- a/fluxon_py/kvclient/fluxon.py +++ b/fluxon_py/kvclient/fluxon.py @@ -299,6 +299,9 @@ def __init__(self, config: FluxonKvClientConfig): self._client: Optional[fluxon_pyo3.KvClient] = None self._config = config self._init_error: Optional[ApiError] = None + self._client_op_lock = threading.RLock() + self._closing = False + self._closed = False cluster_name = config.fluxonkv_spec_cluster_name self._blocking_put_outer_total_log_window = _BlockingPutOuterTotalLogWindow( f"FluxonKVCacheStore[{cluster_name}]" @@ -776,20 +779,31 @@ def instance_key(self) -> Result[str, ApiError]: def close(self) -> Result[OkNone, ApiError]: """Close and tear down the store.""" try: - # Backend returns a Result; MUST be explicitly consumed to avoid - # leaking an unconsumed Result that triggers __del__ assertion. - res = self._client.close() - if not res.is_ok(): - # Propagate backend error (already an ApiError) - return Result.new_error(res.unwrap_error()) - # Consume Ok(None-like) to satisfy strict consumption policy - _ = res.unwrap() - unregister_store_from_cleanup(self) - # English note: - # After a successful close, clear the backend handle to prevent any further calls and - # allow deterministic resource release without relying on Python GC timing. 
- self._client = None + with self._client_op_lock: + if self._closed: + return Result.new_ok(OkNone()) + self._closing = True + if self._client is None: + self._closed = True + unregister_store_from_cleanup(self) + return Result.new_ok(OkNone()) + # Backend returns a Result; MUST be explicitly consumed to avoid + # leaking an unconsumed Result that triggers __del__ assertion. + res = self._client.close() + if not res.is_ok(): + # Propagate backend error (already an ApiError) + return Result.new_error(res.unwrap_error()) + # Consume Ok(None-like) to satisfy strict consumption policy + _ = res.unwrap() + unregister_store_from_cleanup(self) + # English note: + # After a successful close, clear the backend handle to prevent any further calls and + # allow deterministic resource release without relying on Python GC timing. + self._client = None + self._closed = True return Result.new_ok(OkNone()) + except KeyboardInterrupt as e: + return Result.new_error(GeneralError(f"Store close interrupted: {str(e)}")) except Exception as e: return Result.new_error(GeneralError(f"Failed to close client: {str(e)}")) @@ -892,7 +906,10 @@ def metrics_snapshot(self) -> MetricSnapshot: # --- Fluxon-kv lease helpers (synchronous) --- def allocate_lease(self, ttl_seconds: int) -> Result[int, ApiError]: try: - inner = self._client.allocate_lease(ttl_seconds) + with self._client_op_lock: + if self._closing or self._closed or self._client is None: + return Result.new_error(GeneralError("allocate_lease called after store close started")) + inner = self._client.allocate_lease(ttl_seconds) if not inner.is_ok(): return Result.new_error(inner.unwrap_error()) lease_id = inner.unwrap() @@ -903,7 +920,10 @@ def allocate_lease(self, ttl_seconds: int) -> Result[int, ApiError]: def keepalive_lease(self, lease_id: int) -> Result[OkNone, ApiError]: try: - inner = self._client.keepalive_lease(lease_id, "kvclient") + with self._client_op_lock: + if self._closing or self._closed or self._client is None: + 
return Result.new_ok(OkNone()) + inner = self._client.keepalive_lease(lease_id, "kvclient") if not inner.is_ok(): return Result.new_error(inner.unwrap_error()) # Success returns a None-like sentinel from PyO3; normalize to OkNone diff --git a/fluxon_py/runtime/__init__.py b/fluxon_py/runtime/__init__.py index 692b741..fda3b65 100644 --- a/fluxon_py/runtime/__init__.py +++ b/fluxon_py/runtime/__init__.py @@ -8,6 +8,10 @@ "run_kv_master_service_blocking", "start_kv_master_process", "start_kv_master_process_with_config_b64", + "run_broker_blocking", + "run_broker_service_blocking", + "start_broker_process", + "start_broker_process_with_config_b64", "run_owner_kvclient_blocking", "run_owner_kvclient_service_blocking", "start_owner_kvclient_process", @@ -37,6 +41,10 @@ "run_kv_master_service_blocking": ("start_master", "run_kv_master_service_blocking"), "start_kv_master_process": ("start_master", "start_kv_master_process"), "start_kv_master_process_with_config_b64": ("start_master", "start_kv_master_process_with_config_b64"), + "run_broker_blocking": ("start_broker", "run_kv_broker_blocking"), + "run_broker_service_blocking": ("start_broker", "run_kv_broker_service_blocking"), + "start_broker_process": ("start_broker", "start_kv_broker_process"), + "start_broker_process_with_config_b64": ("start_broker", "start_kv_broker_process_with_config_b64"), "run_owner_kvclient_blocking": ("start_owner_kvclient", "run_owner_kvclient_blocking"), "run_owner_kvclient_service_blocking": ("start_owner_kvclient", "run_owner_kvclient_service_blocking"), "start_owner_kvclient_process": ("start_owner_kvclient", "start_owner_kvclient_process"), diff --git a/fluxon_py/runtime/start_broker.py b/fluxon_py/runtime/start_broker.py new file mode 100644 index 0000000..dd7a70e --- /dev/null +++ b/fluxon_py/runtime/start_broker.py @@ -0,0 +1,139 @@ +#!/usr/bin/env python3 + +from __future__ import annotations + +import argparse +import subprocess +from pathlib import Path +import yaml + +from 
fluxon_py.tool import import_fluxon_pyo3_local + +from .process_runner import ( + bind_current_process_parent_death_sigterm, + build_runtime_singleton_spec, + RuntimeConfigInput, + decode_runtime_config_b64, + encode_runtime_config_b64, + resolve_runtime_config_path, + run_singleton_process, + start_python_module_process, + start_python_module_process_with_config_b64, +) + + +BROKER_MODULE_NAME = "fluxon_py.runtime.start_broker" +STOP_EXISTING_BROKER_TIMEOUT_SECONDS = 30 +BROKER_RUNTIME_CONFIG_FILENAME = "kv_broker.runtime.yaml" + + +def run_kv_broker_blocking( + *, + workdir: Path, + config: RuntimeConfigInput | None = None, + config_path: Path | None = None, +) -> None: + resolved_workdir = workdir.resolve() + resolved_config = resolve_runtime_config_path( + workdir=resolved_workdir, + runtime_config_filename=BROKER_RUNTIME_CONFIG_FILENAME, + config=config, + config_path=config_path, + ) + singleton_spec = build_runtime_singleton_spec( + module_name=BROKER_MODULE_NAME, + entrypoint_path=Path(__file__), + workdir=workdir, + ) + run_singleton_process( + config_path=resolved_config, + singleton_spec=singleton_spec, + stop_timeout_seconds=STOP_EXISTING_BROKER_TIMEOUT_SECONDS, + start_fn=lambda: run_kv_broker_service_blocking( + config_path=resolved_config, + workdir=resolved_workdir, + ), + ) + + +def run_kv_broker_service_blocking(*, config_path: Path, workdir: Path) -> None: + fluxon_pyo3 = import_fluxon_pyo3_local() + result = fluxon_pyo3.run_broker_blocking(str(config_path)) + if not result.is_ok(): + raise RuntimeError(f"run_broker_blocking failed: {result.unwrap_error()}") + + _ = result.unwrap() + + +def run_kv_broker_service_blocking_from_yaml_text(*, config_yaml: str) -> None: + config = yaml.safe_load(config_yaml) + if not isinstance(config, dict): + raise TypeError(f"broker config must decode to dict, got {type(config).__name__}") + fluxon_pyo3 = import_fluxon_pyo3_local() + result = fluxon_pyo3.run_broker_blocking(config) + if not result.is_ok(): + raise 
RuntimeError(f"run_broker_blocking failed: {result.unwrap_error()}") + + _ = result.unwrap() + + +def start_kv_broker_process( + *, + workdir: Path | None = None, + config: RuntimeConfigInput | None = None, + config_path: Path | None = None, + log_path: Path | None = None, +) -> subprocess.Popen[bytes]: + if config_path is None and isinstance(config, dict) and workdir is None: + return start_kv_broker_process_with_config_b64(config=config, log_path=log_path) + if workdir is None: + raise ValueError("workdir is required when config is not a dict and config_path is not provided") + resolved_workdir = workdir.resolve() + resolved_config = resolve_runtime_config_path( + workdir=resolved_workdir, + runtime_config_filename=BROKER_RUNTIME_CONFIG_FILENAME, + config=config, + config_path=config_path, + ) + return start_python_module_process( + module_name=BROKER_MODULE_NAME, + config_path=resolved_config, + workdir=resolved_workdir, + extra_cli_args=(), + log_path=log_path, + ) + + +def start_kv_broker_process_with_config_b64( + *, + config: dict, + log_path: Path | None = None, +) -> subprocess.Popen[bytes]: + return start_python_module_process_with_config_b64( + module_name=BROKER_MODULE_NAME, + config_b64=encode_runtime_config_b64(config), + extra_cli_args=(), + log_path=log_path, + ) + + +def main() -> None: + bind_current_process_parent_death_sigterm() + parser = argparse.ArgumentParser(description="Start Fluxon KV broker (blocking)") + parser.add_argument("-c", "--config", type=Path, required=False, help="Path to broker YAML config") + parser.add_argument("-w", "--workdir", type=Path, required=False, help="Working directory") + parser.add_argument("--config-b64", required=False, help="Base64-encoded YAML config") + args = parser.parse_args() + if args.config_b64 is not None: + # Keep the same config transport contract as other runtime entrypoints. 
+ run_kv_broker_service_blocking_from_yaml_text( + config_yaml=decode_runtime_config_b64(args.config_b64) + ) + return + if args.config is None or args.workdir is None: + raise ValueError("--config and --workdir are required when --config-b64 is not used") + run_kv_broker_blocking(config=args.config, workdir=args.workdir) + + +if __name__ == "__main__": + main() diff --git a/fluxon_py/tests/test_api_chan_mpmc/test_api_chan_mpmc_base.py b/fluxon_py/tests/test_api_chan_mpmc/test_api_chan_mpmc_base.py index f992c2d..8135242 100644 --- a/fluxon_py/tests/test_api_chan_mpmc/test_api_chan_mpmc_base.py +++ b/fluxon_py/tests/test_api_chan_mpmc/test_api_chan_mpmc_base.py @@ -45,6 +45,7 @@ def _find_project_root(start: Path) -> Path: sys.path.insert(0, str(PROJECT_ROOT)) from typing import Dict, List, Optional, Tuple +from types import SimpleNamespace import etcd3 @@ -649,6 +650,17 @@ def scenario_dynamic_producer_consumer( recovered_consumers: List[str] = [] test_mpmc_id: Optional[str] = None + def _print_process_log_tail(log_file: str, *, max_lines: int = 200) -> None: + print(f"=== subprocess log tail: {log_file} ===", flush=True) + try: + with open(log_file, "rb") as handle: + lines = handle.readlines()[-max_lines:] + for raw in lines: + print(raw.decode("utf-8", "replace").rstrip("\n"), flush=True) + except Exception as exc: # noqa: BLE001 + print(f"failed to read subprocess log {log_file}: {exc}", flush=True) + print(f"=== end subprocess log tail: {log_file} ===", flush=True) + def fail_fast_on_subprocess_error(*, process_type_filter: Optional[str] = None) -> None: for identifier, (process_type, proc, log_file) in process_handles_by_id.items(): if process_type_filter is not None and process_type != process_type_filter: @@ -657,6 +669,7 @@ def fail_fast_on_subprocess_error(*, process_type_filter: Optional[str] = None) if rc is None: continue if rc != 0: + _print_process_log_tail(log_file) raise RuntimeError( f"{process_type} {identifier} exited early with return code 
{rc}. " f"Check log file for details: {log_file}" @@ -680,6 +693,7 @@ def wait_all_of_type(process_type: str, *, timeout_s: int) -> None: print(f"{ptype} {identifier} completed successfully") print(f"Log file: {log_file}") continue + _print_process_log_tail(log_file) raise RuntimeError( f"{ptype} {identifier} failed with return code {proc.returncode}." f" Check log file for details: {log_file}" @@ -1399,6 +1413,33 @@ def test_mpmc_dynamic_suite() -> None: run_with_argmatrix(_test_mpmc_dynamic_suite_once) +def test_mpmc_get_data_prefetch_is_per_consumer_not_divided() -> None: + calls: List[Tuple[int, Optional[int], int]] = [] + + class _DummyInnerConsumer: + def get_data( + self, + batch_size: int, + try_time: Optional[int] = None, + prefetch_num: int = 0, + ) -> Result[List[Dict[str, object]], ApiError]: + calls.append((batch_size, try_time, prefetch_num)) + return Result.new_ok([]) + + consumer = object.__new__(MPMCChanConsumer) + consumer.shutdown_ctl = mpsc.MqShutdownCtl() + consumer.mpmc_id = "123" + consumer.mpmc_channel = SimpleNamespace( + _get_active_consumer_count=lambda: 8, + ) + consumer.mpsc_consumer = _DummyInnerConsumer() + + res = consumer.get_data(batch_size=40, try_time=2, prefetch_num=40) + + assert res.is_ok() + assert calls == [(40, 2, 40)] + + if __name__ == "__main__": diff --git a/fluxon_py/tests/test_api_chan_mpmc/test_mpmc_simple_bench.py b/fluxon_py/tests/test_api_chan_mpmc/test_mpmc_simple_bench.py index a29c46f..903ba7f 100644 --- a/fluxon_py/tests/test_api_chan_mpmc/test_mpmc_simple_bench.py +++ b/fluxon_py/tests/test_api_chan_mpmc/test_mpmc_simple_bench.py @@ -52,10 +52,12 @@ def _find_project_root(start: Path) -> Path: from fluxon_py import FluxonKvClientConfig, new_store # noqa: E402 from fluxon_py.api_error import ( # noqa: E402 ChannelClosedError, + KeyNotFoundError, MessageConsumptionNoNewMessageError, ProducerClosedError, ) from fluxon_py.api_ext_chan import ChanType # noqa: E402 +from fluxon_py._api_ext_chan.mpsc import 
MpscContext # noqa: E402 from fluxon_py.kvclient import KvClientType # noqa: E402 from fluxon_py.kvclient.nonzerocopy_encode import DLPackBytesView # noqa: E402 from fluxon_py.logging import init_logger # noqa: E402 @@ -382,17 +384,23 @@ def _run_one_case( ) _put_etcd_key(stop_key, b"1") time.sleep(SUMMARY_STOP_GRACE_SECONDS) - _signal_live_processes(worker_processes, signum=signal.SIGINT) try: _wait_for_processes_exit(worker_processes, timeout_seconds=WORKER_EXIT_TIMEOUT_SECONDS) except RuntimeError as err: + _signal_live_processes(worker_processes, signum=signal.SIGINT) logging.warning("[bench] worker shutdown timeout bench_id=%s error=%s", bench_id, err) + raise else: - _warn_if_worker_exited_nonzero(worker_processes, bench_id=bench_id) + _raise_if_worker_exited_nonzero(worker_processes, bench_id=bench_id) finally: _terminate_processes(worker_processes) _delete_etcd_key(stop_key) _clear_etcd_prefix(f"{SUMMARY_KEY_PREFIX}{bench_id}/") + if bootstrap_store is not None and bootstrap_producer is not None: + _best_effort_delete_case_broker_channels( + store=bootstrap_store, + mpmc_id=str(bootstrap_producer.get_chan_id()), + ) if bootstrap_producer is not None: _best_effort_close(bootstrap_producer, role="bootstrap_producer") _best_effort_close(bootstrap_store, role="bootstrap_store") @@ -985,18 +993,18 @@ def _index_summaries_by_consumer_id(summaries: list[dict[str, Any]]) -> dict[str return indexed -def _warn_if_worker_exited_nonzero(processes: list[subprocess.Popen[str]], *, bench_id: str) -> None: +def _raise_if_worker_exited_nonzero(processes: list[subprocess.Popen[str]], *, bench_id: str) -> None: + failures: list[str] = [] for proc in processes: return_code = proc.poll() if return_code is None: continue if return_code != 0: - logging.warning( - "[bench] worker exited non-zero during teardown bench_id=%s pid=%s code=%s", - bench_id, - proc.pid, - return_code, - ) + failures.append(f"pid={proc.pid} code={return_code}") + if failures: + raise RuntimeError( + 
f"worker exited non-zero during teardown bench_id={bench_id}: {', '.join(failures)}" + ) def _maybe_write_consumer_summary( @@ -1215,6 +1223,71 @@ def _clear_etcd_prefix(prefix: str) -> None: etcd_client.delete(meta.key) +def _best_effort_delete_case_broker_channels(*, store: Any, mpmc_id: str) -> None: + if not isinstance(mpmc_id, str) or not mpmc_id.isdigit(): + logging.warning("[bench] skip broker cleanup for invalid mpmc_id=%r", mpmc_id) + return + + channels_key = f"/mpmc_channels/{mpmc_id}/mpsc_channels" + try: + with etcd3.client(ETCD_HOST, ETCD_PORT) as etcd_client: + raw, _ = etcd_client.get(channels_key) + if raw is None: + return + loaded = json.loads(raw.decode("utf-8")) + if not isinstance(loaded, list): + raise TypeError(f"{channels_key} must contain a list, got {type(loaded).__name__}") + + ctx = MpscContext(store) + payload_key_count = 0 + payload_delete_ok = 0 + payload_delete_failed = 0 + try: + for chan_id in loaded: + if not isinstance(chan_id, str) or not chan_id.isdigit(): + raise ValueError(f"invalid sub-MPSC channel id in {channels_key}: {chan_id!r}") + payload_keys = ctx.delete_broker_channel(chan_id) + payload_key_count += len(payload_keys) + for payload_key in payload_keys: + res = store.remove(payload_key) + if res.is_ok(): + _ = res.unwrap() + payload_delete_ok += 1 + continue + err = res.unwrap_error() + if isinstance(err, KeyNotFoundError): + payload_delete_ok += 1 + continue + payload_delete_failed += 1 + logging.warning( + "[bench] broker payload cleanup failed key=%s err=%s", + payload_key, + err, + ) + finally: + ctx.close() + logging.info( + "[bench] deleted broker channels for mpmc_id=%s count=%s payload_keys=%s payload_delete_ok=%s payload_delete_failed=%s", + mpmc_id, + len(loaded), + payload_key_count, + payload_delete_ok, + payload_delete_failed, + ) + print( + "BENCH_BROKER_CLEANUP " + f"mpmc_id={mpmc_id} channels={len(loaded)} payload_keys={payload_key_count} " + f"payload_delete_ok={payload_delete_ok} 
payload_delete_failed={payload_delete_failed}", + flush=True, + ) + except Exception as err: # noqa: BLE001 + logging.warning( + "[bench] broker channel cleanup failed for mpmc_id=%s: %s", + mpmc_id, + err, + ) + + def _best_effort_close(obj: Any, *, role: str) -> None: close_res = obj.close() if close_res.is_ok(): diff --git a/fluxon_py/tests/test_api_chan_mpsc/test_api_chan_mpsc_base.py b/fluxon_py/tests/test_api_chan_mpsc/test_api_chan_mpsc_base.py index 884c748..f40a046 100644 --- a/fluxon_py/tests/test_api_chan_mpsc/test_api_chan_mpsc_base.py +++ b/fluxon_py/tests/test_api_chan_mpsc/test_api_chan_mpsc_base.py @@ -30,6 +30,8 @@ ChanKeyNotFoundError, ChanMessageConsumptionError, ChanMessageProduceError, + ChannelClosedError, + MessageConsumptionNoNewMessageError, ConsumerRegistrationError, ProducerRegistrationError, ) @@ -54,6 +56,7 @@ from fluxon_py._api_ext_chan.mpsc import ( # noqa: E402 _new_produce_offset_of_all_producer_key, ) +from fluxon_py._api_ext_chan import mpsc # noqa: E402 from fluxon_py.logging import init_logger # noqa: E402 from fluxon_py.tests.test_lib import ( # noqa: E402 KV_SVC_IP, @@ -1601,6 +1604,119 @@ def test_mpsc_channel_suite() -> None: run_with_argmatrix(_test_mpsc_channel_suite_once) +def test_mpsc_get_data_clamps_prefetch_target() -> None: + consumer = object.__new__(MPSCChanConsumer) + consumer.shutdown_ctl = mpsc.MqShutdownCtl() + consumer._chan_id = "1" + consumer._consumer_id = "2" + consumer._dbg_tag = "[MPSCChanConsumer chan_id=1 consumer_idx=2]" + consumer._closed_local = True + + observed_targets: List[int] = [] + + class _DummyHandle: + def get_one_legacy_for_internal_check( + self, + prefetch_target: int, + timeout_ms: Optional[int], + ) -> Dict[str, bytes]: + observed_targets.append(prefetch_target) + return {"payload": b"x"} + + consumer._handle = _DummyHandle() + + res = consumer._get_data_legacy_for_internal_check( + batch_size=40, + try_time=1, + prefetch_num=400, + ) + + assert res.is_ok() + assert observed_targets 
+ assert all(target == mpsc.MPSC_PREFETCH_TARGET_MAX for target in observed_targets) + + +def test_mpsc_get_data_returns_partial_batch_on_no_message() -> None: + consumer = object.__new__(MPSCChanConsumer) + consumer.shutdown_ctl = mpsc.MqShutdownCtl() + consumer._chan_id = "1" + consumer._consumer_id = "2" + consumer._dbg_tag = "[MPSCChanConsumer chan_id=1 consumer_idx=2]" + consumer._closed_local = True + + class _DummyHandle: + def get_batch( + self, + batch_size: int, + prefetch_target: int, + timeout_ms: Optional[int], + ) -> List[Dict[str, bytes]]: + del batch_size, prefetch_target, timeout_ms + return [{"payload": b"x"}] + + consumer._handle = _DummyHandle() + + res = consumer.get_data(batch_size=8, try_time=1, prefetch_num=0) + + assert res.is_ok() + assert res.unwrap() == [{"payload": b"x"}] + + +def test_mpsc_get_data_returns_partial_batch_on_channel_closed() -> None: + consumer = object.__new__(MPSCChanConsumer) + consumer.shutdown_ctl = mpsc.MqShutdownCtl() + consumer._chan_id = "1" + consumer._consumer_id = "2" + consumer._dbg_tag = "[MPSCChanConsumer chan_id=1 consumer_idx=2]" + consumer._closed_local = True + + class _DummyHandle: + def get_batch( + self, + batch_size: int, + prefetch_target: int, + timeout_ms: Optional[int], + ) -> List[Dict[str, bytes]]: + del batch_size, prefetch_target, timeout_ms + return [{"payload": b"x"}] + + consumer._handle = _DummyHandle() + + res = consumer.get_data(batch_size=8, try_time=1, prefetch_num=0) + + assert res.is_ok() + assert res.unwrap() == [{"payload": b"x"}] + + +def test_mpsc_get_data_broker_passes_prefetch_target_to_batch() -> None: + consumer = object.__new__(MPSCChanConsumer) + consumer.shutdown_ctl = mpsc.MqShutdownCtl() + consumer._chan_id = "1" + consumer._consumer_id = "2" + consumer._dbg_tag = "[MPSCChanConsumer chan_id=1 consumer_idx=2]" + consumer._closed_local = True + + observed: List[int] = [] + + class _DummyHandle: + def get_batch( + self, + batch_size: int, + prefetch_target: int, + 
timeout_ms: Optional[int], + ) -> List[Dict[str, bytes]]: + del batch_size, timeout_ms + observed.append(prefetch_target) + return [{"payload": b"x"}] + + consumer._handle = _DummyHandle() + + res = consumer.get_data(batch_size=40, try_time=1, prefetch_num=400) + + assert res.is_ok() + assert observed == [mpsc.MPSC_PREFETCH_TARGET_MAX] + + def test_new_or_bind_unique_key_namespace_collision() -> None: setup_test_environment(logging) env = create_channel_env() diff --git a/fluxon_py/tests/test_lib.py b/fluxon_py/tests/test_lib.py index 9be7003..41e4557 100644 --- a/fluxon_py/tests/test_lib.py +++ b/fluxon_py/tests/test_lib.py @@ -173,10 +173,11 @@ def setup_test_environment(logger: Logger, print_config: bool = True): # except RuntimeError as e: # print(f"Failed to set start method to spawn: {e}, current start method: {multiprocessing.get_start_method()}") - loglevel_str="DEBUG" + loglevel_str = os.environ.get("FLUXON_LOG") or os.environ.get("LOG_LEVEL") or "DEBUG" + loglevel_str = str(loglevel_str).upper() os.environ["LOG_LEVEL"] = loglevel_str os.environ["FLUXON_LOG"] = loglevel_str - LOGGING_LEVEL= logging.DEBUG + LOGGING_LEVEL = getattr(logging, loglevel_str, logging.DEBUG) update_log_level(loglevel_str) print("=================================================") @@ -190,7 +191,7 @@ def emit(self, record): self.flush() # Flush immediately for every log record handler = FlushStreamHandler(sys.stdout) - handler.setLevel(logging.DEBUG) + handler.setLevel(LOGGING_LEVEL) formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s') handler.setFormatter(formatter) diff --git a/fluxon_py/tests/test_mq/test_example_ctrl_c_exit.py b/fluxon_py/tests/test_mq/test_example_ctrl_c_exit.py index c1b3193..8242f77 100644 --- a/fluxon_py/tests/test_mq/test_example_ctrl_c_exit.py +++ b/fluxon_py/tests/test_mq/test_example_ctrl_c_exit.py @@ -51,6 +51,7 @@ def _find_project_root(start: Path) -> Path: CHAN_CONFIG_TEST = {"capacity": 10, "ttl_seconds": 90, "weight": 1} 
MASTER_SCRIPT = [sys.executable, "-m", "fluxon_py.runtime.start_master"] +BROKER_SCRIPT = [sys.executable, "-m", "fluxon_py.runtime.start_broker"] KVCLIENT_SCRIPT = [sys.executable, "-m", "fluxon_py.runtime.start_owner_kvclient"] ETCD_BIN = PROJECT_ROOT / "fluxon_release" / "ext_images" / "etcd" / "etcd" GREPTIME_BIN = PROJECT_ROOT / "fluxon_release" / "ext_images" / "greptime" / "greptime" @@ -191,7 +192,7 @@ def _on_ctrlc(reason: str) -> None: import yaml from fluxon_py.api_ext_chan import ChanRole, ChanType, MPMCChanConsumer, new_or_bind_with_unique_key -from fluxon_py.api_error import ChannelClosedError +from fluxon_py.api_error import ChannelClosedError, MessageConsumptionNoNewMessageError from fluxon_py.config import FluxonKvClientConfig from fluxon_py.kvclient import new_store from fluxon_py.logging import init_logger @@ -279,6 +280,10 @@ def _on_ctrlc(reason: str) -> None: if isinstance(err, ChannelClosedError): logger.info("[consumer] close observed, exit loop") break + if isinstance(err, MessageConsumptionNoNewMessageError): + if shutdown_requested.wait(0.2): + break + continue raise SystemExit(f"get_data failed: {err}") for raw in res.unwrap() or []: payload = raw.get("payload", b"") if isinstance(raw, dict) else raw @@ -463,6 +468,7 @@ def _build_example_config( share_mem_path: str, greptime_http_port: int, master_port: int, + broker_port: int, ) -> dict[str, Any]: capacity = max(128, int(CHAN_CONFIG_TEST["capacity"])) ttl_seconds = max(90, int(CHAN_CONFIG_TEST["ttl_seconds"])) @@ -475,6 +481,14 @@ def _build_example_config( "log_dir": str((Path(share_mem_path).parent / "log" / "master").resolve()), "monitoring": _monitoring_block(greptime_http_port=greptime_http_port), }, + "broker": { + "instance_key": f"example_ctrlc_broker_{unique_suffix}", + "fluxonkv_spec": { + "cluster_name": cluster_name, + "share_mem_path": share_mem_path, + "p2p_listen_port": broker_port, + }, + }, "kvclient": { "instance_key": f"example_ctrlc_owner_{unique_suffix}", 
"contribute_to_cluster_pool_size": {"dram": 1073741824, "vram": {}}, @@ -589,6 +603,7 @@ def _start_local_stack(*, temp_root: Path, config_path: Path) -> list[tuple[subp cluster_name = f"example_ctrlc_cluster_{unique_suffix}" share_mem_path = str((temp_root / "sharemem").resolve()) master_port = _pick_free_port() + broker_port = _pick_free_port() config = _build_example_config( unique_suffix=unique_suffix, cluster_name=cluster_name, @@ -596,14 +611,17 @@ def _start_local_stack(*, temp_root: Path, config_path: Path) -> list[tuple[subp share_mem_path=share_mem_path, greptime_http_port=greptime_http_port, master_port=master_port, + broker_port=broker_port, ) config_path.write_text( yaml.safe_dump(config, sort_keys=False), encoding="utf-8", ) master_config_path = temp_root / "master.yaml" + broker_config_path = temp_root / "broker.yaml" kvclient_config_path = temp_root / "kvclient.yaml" _write_runtime_subconfig(path=master_config_path, config=config, key="master") + _write_runtime_subconfig(path=broker_config_path, config=config, key="broker") _write_runtime_subconfig(path=kvclient_config_path, config=config, key="kvclient") master_proc = _spawn_logged( @@ -643,8 +661,25 @@ def _start_local_stack(*, temp_root: Path, config_path: Path) -> list[tuple[subp proc=kvclient_proc, log_path=kvclient_log, ) + + broker_log = temp_root / "log" / "broker.log" + broker_proc = _spawn_logged( + cmd=[ + *BROKER_SCRIPT, + "-c", + str(broker_config_path), + "-w", + str((temp_root / "broker_work").resolve()), + ], + workdir=PROJECT_ROOT, + log_path=broker_log, + env=env, + ) + time.sleep(2.0) + _require_process_running(broker_proc, label="broker", log_path=broker_log) return [ (kvclient_proc, kvclient_log), + (broker_proc, broker_log), (master_proc, master_log), (etcd_proc, etcd_log), (greptime_proc, greptime_log), diff --git a/fluxon_rs/Cargo.lock b/fluxon_rs/Cargo.lock index a4b0ecd..964cd8c 100644 --- a/fluxon_rs/Cargo.lock +++ b/fluxon_rs/Cargo.lock @@ -1230,6 +1230,7 @@ dependencies 
= [ "fluxon_commu", "fluxon_framework", "fluxon_framework_compiled", + "fluxon_mq", "fluxon_observability", "fluxon_util", "futures", @@ -1275,6 +1276,7 @@ version = "0.2.1" dependencies = [ "anyhow", "async-trait", + "bitcode", "downcast-rs", "etcd-client", "fluxon_commu", diff --git a/fluxon_rs/fluxon_commu/src/facade/p2p.rs b/fluxon_rs/fluxon_commu/src/facade/p2p.rs index 8bcc169..79114f1 100644 --- a/fluxon_rs/fluxon_commu/src/facade/p2p.rs +++ b/fluxon_rs/fluxon_commu/src/facade/p2p.rs @@ -93,6 +93,19 @@ pub mod __hidden { self.view.upgrade() } + pub fn try_with_cluster_manager<R>( + &self, + f: impl FnOnce(&crate::cluster_manager::ClusterManager) -> R, + ) -> Option<R> { + let arc_view = self.view.upgrade()?; + unsafe { + let ptr = + std::ptr::NonNull::new(Arc::as_ptr(&arc_view) as *const _ as *mut _).unwrap(); + let view_ref: &dyn P2pModuleViewTrait = ptr.as_ref(); + Some(f(view_ref.cluster_manager())) + } + } + pub fn resource_registry(&self) -> &ResourceRegistry { let arc_view = self.view.upgrade().expect( "view of module P2pModule has been dropped when accessing resource registry", @@ -489,11 +502,6 @@ impl P2pModule { return true; } let view = self.module_view(); - let cm = view.cluster_manager(); - let self_info = cm.get_self_info(); - if self_info.node_role() != crate::NodeRole::External { - return false; - } let snapshot = self.cached_tier_snapshot(); let Some(peer_gen) = snapshot.peer_gen(logical_peer) else { return false; @@ -501,24 +509,31 @@ impl P2pModule { if !snapshot.is_send_ready_intra_effective(&peer_gen) { return false; } - let Some(owner_id) = self_info - .metadata - .get(crate::META_KEY_SHARED_STORAGE_NODE_ID) - else { - return false; - }; - if logical_peer.as_ref() == owner_id.as_str() { - return false; - } - let Some(handle) = cm.ipc_bandwidth_attributor_handle() else { - return false; - }; - match direction { - "tx" => handle.record_rx_bytes(bytes), - "rx" => handle.record_tx_bytes(bytes), - _ => return false, - } - true +
view.try_with_cluster_manager(|cm| { + let self_info = cm.get_self_info(); + if self_info.node_role() != crate::NodeRole::External { + return false; + } + let Some(owner_id) = self_info + .metadata + .get(crate::META_KEY_SHARED_STORAGE_NODE_ID) + else { + return false; + }; + if logical_peer.as_ref() == owner_id.as_str() { + return false; + } + let Some(handle) = cm.ipc_bandwidth_attributor_handle() else { + return false; + }; + match direction { + "tx" => handle.record_rx_bytes(bytes), + "rx" => handle.record_tx_bytes(bytes), + _ => return false, + } + true + }) + .unwrap_or(false) } } diff --git a/fluxon_rs/fluxon_commu/src/facade/transfer_engine.rs b/fluxon_rs/fluxon_commu/src/facade/transfer_engine.rs index 878e5c6..e5353a5 100644 --- a/fluxon_rs/fluxon_commu/src/facade/transfer_engine.rs +++ b/fluxon_rs/fluxon_commu/src/facade/transfer_engine.rs @@ -74,12 +74,10 @@ impl ClosedLocalSegmentLeaseRegistry { where G: Send + Sync + 'static, { - let boxed = self - .guards - .lock() - .await - .remove(&handle) - .ok_or_else(|| format!("closed sdk local segment lease handle {handle} not found"))?; + let boxed = + self.guards.lock().await.remove(&handle).ok_or_else(|| { + format!("closed sdk local segment lease handle {handle} not found") + })?; boxed.downcast::().map(|guard| *guard).map_err(|_| { format!( "closed sdk local segment lease handle {handle} has unexpected runtime guard type" @@ -461,7 +459,7 @@ impl ClientTransferEngineCore { len, seg_guard, ) - .await + .await } } @@ -482,7 +480,11 @@ impl ClientTransferEngineCore { let initial_local_segment_guard = match seg_guard { Some(guard) => Some(guard), None if runtime.supports_local_segment_transfer() => { - let local_addr = if peer_src_or_target { target_addr } else { src_addr }; + let local_addr = if peer_src_or_target { + target_addr + } else { + src_addr + }; match runtime.ensure_local_segment_guard(local_addr, None).await { Ok(guard) => Some(guard), Err(_) => None, diff --git 
a/fluxon_rs/fluxon_commu_closed_sdk_consumer/src/lib.rs b/fluxon_rs/fluxon_commu_closed_sdk_consumer/src/lib.rs index 6fab54e..caad34b 100644 --- a/fluxon_rs/fluxon_commu_closed_sdk_consumer/src/lib.rs +++ b/fluxon_rs/fluxon_commu_closed_sdk_consumer/src/lib.rs @@ -11,9 +11,9 @@ use fluxon_commu_contract::{ ClosedRuntimeCallRawObservedOutputView, ClosedRuntimeClusterEventStreamItem, ClosedRuntimeClusterManagerCall, ClosedRuntimeClusterManagerResponse, ClosedRuntimeClusterRdmaResolvedConfigStreamItem, ClosedRuntimeDesiredTransferPeer, - ClosedRuntimeDispatchRequestView, - ClosedRuntimeDispatchResponse, ClosedRuntimeDispatchTransportPolicy, ClosedRuntimeError, - ClosedRuntimeHandle, ClosedRuntimeHostCallbackHandle, ClosedRuntimeP2pCall, + ClosedRuntimeDispatchRequestView, ClosedRuntimeDispatchResponse, + ClosedRuntimeDispatchTransportPolicy, ClosedRuntimeError, ClosedRuntimeHandle, + ClosedRuntimeHostCallbackHandle, ClosedRuntimeP2pCall, ClosedRuntimeP2pCallRawObservedRequestView, ClosedRuntimeP2pResponse, ClosedRuntimeP2pSendResponseRawRequestView, ClosedRuntimePeerGen, ClosedRuntimeRawSlice, ClosedRuntimeRequest, ClosedRuntimeResponse, ClosedRuntimeTransferEngineCall, @@ -491,11 +491,15 @@ impl WireBodyPartsOwner { let (raw_lengths, raw_payload) = match raw_bytes.len() { 0 => (WireBodyRawLengths::Empty, WireBodyRawPayload::Empty), 1 => { - let part = raw_bytes.into_iter().next().expect("single raw part missing"); - let len = - u32::try_from(part.len()).map_err(|_| ClosedSdkConsumerError::RuntimeDecode { + let part = raw_bytes + .into_iter() + .next() + .expect("single raw part missing"); + let len = u32::try_from(part.len()).map_err(|_| { + ClosedSdkConsumerError::RuntimeDecode { detail: format!("wire raw part too large for u32 length: {}", part.len()), - })?; + } + })?; ( WireBodyRawLengths::Single([len]), WireBodyRawPayload::Single(part), @@ -849,8 +853,7 @@ fn decode_call_raw_observed_output_view( return Err(ClosedSdkConsumerError::RuntimeDecode { detail: 
format!( "closed SDK call_raw_observed serialize_part overflow: serialize_len={} full_len={}", - message_view.body.serialize_part.len, - message_view.body.full_body.len, + message_view.body.serialize_part.len, message_view.body.full_body.len, ), }); } @@ -860,21 +863,19 @@ fn decode_call_raw_observed_output_view( .ok_or_else(|| ClosedSdkConsumerError::RuntimeDecode { detail: "closed SDK call_raw_observed raw_bytes length overflow".to_string(), })?; - let expected_full_len = - message_view - .body - .serialize_part - .len - .checked_add(raw_total) - .ok_or_else(|| ClosedSdkConsumerError::RuntimeDecode { - detail: "closed SDK call_raw_observed body length overflow".to_string(), - })?; + let expected_full_len = message_view + .body + .serialize_part + .len + .checked_add(raw_total) + .ok_or_else(|| ClosedSdkConsumerError::RuntimeDecode { + detail: "closed SDK call_raw_observed body length overflow".to_string(), + })?; if expected_full_len != message_view.body.full_body.len { return Err(ClosedSdkConsumerError::RuntimeDecode { detail: format!( "closed SDK call_raw_observed body length mismatch: expected={} full_len={}", - expected_full_len, - message_view.body.full_body.len, + expected_full_len, message_view.body.full_body.len, ), }); } @@ -923,9 +924,7 @@ fn decode_call_raw_observed_output_view( frame_recv_done_ts_us: message_view.local_observe.frame_recv_done_ts_us, dispatch_enqueued_ts_us: message_view.local_observe.dispatch_enqueued_ts_us, dispatch_started_ts_us: message_view.local_observe.dispatch_started_ts_us, - complete_pending_call_ts_us: message_view - .local_observe - .complete_pending_call_ts_us, + complete_pending_call_ts_us: message_view.local_observe.complete_pending_call_ts_us, }, }, observe: fluxon_commu_contract::ClosedRuntimeRpcCallTransportObserveTrace { @@ -1550,8 +1549,8 @@ async fn invoke_completion_async_with_keepalive( ) -> i32, ) -> Result<(i32, Bytes), ClosedSdkConsumerError> { let (sender, receiver) = tokio::sync::oneshot::channel::<(i32, 
Bytes)>(); - let user_data = Box::into_raw(Box::new(RuntimeCompletionState { sender, keepalive })) - .cast::(); + let user_data = + Box::into_raw(Box::new(RuntimeCompletionState { sender, keepalive })).cast::(); let submit_status = submit(user_data, Some(runtime_completion_callback)); if submit_status != 0 { unsafe { @@ -2082,7 +2081,9 @@ pub async fn p2p_call_raw_observed( ) .await?; match status_code { - FLUXON_COMMU_CLOSED_RUNTIME_RESULT_OK => decode_call_raw_observed_output_view(payload.as_ref()), + FLUXON_COMMU_CLOSED_RUNTIME_RESULT_OK => { + decode_call_raw_observed_output_view(payload.as_ref()) + } FLUXON_COMMU_CLOSED_RUNTIME_RESULT_ERR => { let error = bitcode::decode::(payload.as_ref()).map_err( |decode_error| ClosedSdkConsumerError::RuntimeDecode { diff --git a/fluxon_rs/fluxon_fs/src/agent_service/transfer_agent.rs b/fluxon_rs/fluxon_fs/src/agent_service/transfer_agent.rs index 1738ade..ca54a71 100644 --- a/fluxon_rs/fluxon_fs/src/agent_service/transfer_agent.rs +++ b/fluxon_rs/fluxon_fs/src/agent_service/transfer_agent.rs @@ -9,28 +9,23 @@ use std::time::{Duration, Instant}; use fluxon_fs_core::config::{ FS_AGENT_TRANSFER_STREAM_CLOSE_RPC_PATH, FS_AGENT_TRANSFER_STREAM_NEXT_RPC_PATH, - FS_AGENT_TRANSFER_STREAM_OPEN_RPC_PATH, - FS_MASTER_TRANSFER_SCHEDULER_HEARTBEAT_RPC_PATH, FS_MASTER_TRANSFER_SCHEDULER_RESULT_RPC_PATH, - FluxonFsTransferBatchCollectInfoWire, FluxonFsTransferBatchKind, - FluxonFsTransferCollectInfoKind, FluxonFsTransferDispositionWire, - FluxonFsTransferFailedFileReasonKindWire, - FluxonFsTransferReadStreamCloseWire, FluxonFsTransferReadStreamNextResultWire, - FluxonFsTransferReadStreamNextWire, FluxonFsTransferReadStreamOpenResultWire, - FluxonFsTransferReadStreamOpenWire, - FluxonFsTransferSkipEntryKind, FluxonFsTransferSkipEntryWire, - FluxonFsTransferManifestEntryWire, FluxonFsTransferManifestWire, - FluxonFsTransferScanMode, - FluxonFsTransferScanEventAckWire, FluxonFsTransferScanEventKindWire, - FluxonFsTransferScanEventWire, 
FluxonFsTransferScanLaunchResultWire, + FS_AGENT_TRANSFER_STREAM_OPEN_RPC_PATH, FS_MASTER_TRANSFER_SCHEDULER_HEARTBEAT_RPC_PATH, + FS_MASTER_TRANSFER_SCHEDULER_RESULT_RPC_PATH, FluxonFsTransferBatchCollectInfoWire, + FluxonFsTransferBatchKind, FluxonFsTransferCollectInfoKind, FluxonFsTransferDispositionWire, + FluxonFsTransferFailedFileReasonKindWire, FluxonFsTransferManifestEntryWire, + FluxonFsTransferManifestWire, FluxonFsTransferReadStreamCloseWire, + FluxonFsTransferReadStreamNextResultWire, FluxonFsTransferReadStreamNextWire, + FluxonFsTransferReadStreamOpenResultWire, FluxonFsTransferReadStreamOpenWire, FluxonFsTransferScanAssignmentWire, FluxonFsTransferScanBatchWire, - FluxonFsTransferScanChildUnitWire, FluxonFsTransferScanFrontier, + FluxonFsTransferScanChildUnitWire, FluxonFsTransferScanEventAckWire, + FluxonFsTransferScanEventKindWire, FluxonFsTransferScanEventWire, FluxonFsTransferScanFrontier, FluxonFsTransferScanFrontierDirEntry, FluxonFsTransferScanFrontierEntry, - FluxonFsTransferScanResultWire, - FluxonFsTransferSymlinkNoticeEntryWire, FluxonFsTransferWorkerCollectInfoResultWire, - FluxonFsTransferWorkerAssignmentWire, FluxonFsTransferWorkerFileResultWire, - FluxonFsTransferWorkerFailedFileResultWire, - FluxonFsTransferWorkerHeartbeatResultWire, FluxonFsTransferWorkerHeartbeatTelemetryWire, - FluxonFsTransferWorkerHeartbeatWire, + FluxonFsTransferScanLaunchResultWire, FluxonFsTransferScanMode, FluxonFsTransferScanResultWire, + FluxonFsTransferSkipEntryKind, FluxonFsTransferSkipEntryWire, + FluxonFsTransferSymlinkNoticeEntryWire, FluxonFsTransferWorkerAssignmentWire, + FluxonFsTransferWorkerCollectInfoResultWire, FluxonFsTransferWorkerFailedFileResultWire, + FluxonFsTransferWorkerFileResultWire, FluxonFsTransferWorkerHeartbeatResultWire, + FluxonFsTransferWorkerHeartbeatTelemetryWire, FluxonFsTransferWorkerHeartbeatWire, FluxonFsTransferWorkerLaunchResultWire, FluxonFsTransferWorkerResultAckWire, FluxonFsTransferWorkerResultWire, 
FluxonFsTransferWorkerStopReasonWire, transfer_collect_info_output_relpath, @@ -39,8 +34,8 @@ use fluxon_fs_core::retry::{ BackoffConfig, DEFAULT_WARN_INTERVAL_SECS, WarnConfig, next_backoff, should_warn, }; use fluxon_kv::rpcresp_kvresult_convert::msg_and_error::{ApiError, KvError}; -use fluxon_kv::user_api::flat_dict::{FlatDict, FlatValue}; use fluxon_kv::user_api::FluxonUserApi; +use fluxon_kv::user_api::flat_dict::{FlatDict, FlatValue}; use parking_lot::{Condvar, Mutex}; use super::{ @@ -202,16 +197,13 @@ fn transfer_scan_session_state() -> &'static Mutex { TRANSFER_SCAN_SESSION_STATE.get_or_init(|| Mutex::new(TransferScanSessionState::default())) } -fn cleanup_expired_transfer_scan_sessions( - state: &mut TransferScanSessionState, - now_unix_ms: i64, -) { - state - .root_dir_listing_sessions - .retain(|_, session| session.lease_expire_unix_ms <= 0 || session.lease_expire_unix_ms > now_unix_ms); - state - .subtree_streaming_sessions - .retain(|_, session| session.lease_expire_unix_ms <= 0 || session.lease_expire_unix_ms > now_unix_ms); +fn cleanup_expired_transfer_scan_sessions(state: &mut TransferScanSessionState, now_unix_ms: i64) { + state.root_dir_listing_sessions.retain(|_, session| { + session.lease_expire_unix_ms <= 0 || session.lease_expire_unix_ms > now_unix_ms + }); + state.subtree_streaming_sessions.retain(|_, session| { + session.lease_expire_unix_ms <= 0 || session.lease_expire_unix_ms > now_unix_ms + }); } fn same_root_continuation_scan_unit( @@ -301,10 +293,7 @@ fn flush_pending_root_direct_files_batch( return Ok(None); } let batch = build_direct_files_only_batch_from_entries_with_batch_id( - direct_files_only_batch_id_for_partition( - assignment, - session.next_direct_files_batch_index, - ), + direct_files_only_batch_id_for_partition(assignment, session.next_direct_files_batch_index), assignment, assignment.root_relpath.clone(), std::mem::take(&mut session.pending_direct_files), @@ -313,7 +302,8 @@ fn flush_pending_root_direct_files_batch( )?; 
session.pending_direct_bytes = 0; session.next_direct_files_batch_index = session.next_direct_files_batch_index.saturating_add(1); - session.emitted_direct_files_batch_count = session.emitted_direct_files_batch_count.saturating_add(1); + session.emitted_direct_files_batch_count = + session.emitted_direct_files_batch_count.saturating_add(1); Ok(Some(batch)) } @@ -414,7 +404,8 @@ fn open_transfer_root_dir_listing_session( root_dir_abs: &str, assignment: &FluxonFsTransferScanAssignmentWire, ) -> Result, FlatDict> { - let dir_abs = safe_join_root(root_dir_abs, assignment.root_relpath.as_str()).map_err(resp_err_kverr)?; + let dir_abs = + safe_join_root(root_dir_abs, assignment.root_relpath.as_str()).map_err(resp_err_kverr)?; let read_dir = match retry_after_target_path_chmod( dir_abs.as_path(), "root_read_dir", @@ -458,7 +449,10 @@ fn take_transfer_root_dir_listing_session( let now_unix_ms = chrono::Utc::now().timestamp_millis(); let mut state = transfer_scan_session_state().lock(); cleanup_expired_transfer_scan_sessions(&mut state, now_unix_ms); - if let Some(mut session) = state.root_dir_listing_sessions.remove(assignment.scan_unit_id.as_str()) { + if let Some(mut session) = state + .root_dir_listing_sessions + .remove(assignment.scan_unit_id.as_str()) + { if session.job_id == assignment.job_id && session.scan_epoch == assignment.scan_epoch && session.root_relpath == assignment.root_relpath @@ -507,7 +501,8 @@ fn open_transfer_subtree_streaming_session( if is_relpath_skipped(&assignment.skip_entries, assignment.root_relpath.as_str()) { return Ok(None); } - let dir_abs = safe_join_root(root_dir_abs, assignment.root_relpath.as_str()).map_err(resp_err_kverr)?; + let dir_abs = + safe_join_root(root_dir_abs, assignment.root_relpath.as_str()).map_err(resp_err_kverr)?; let root_md = retry_after_target_path_chmod( Path::new(root_dir_abs), "subtree_stream_root_symlink_metadata", @@ -790,7 +785,8 @@ fn collect_transfer_root_dir_listing_slice( assignment: 
&FluxonFsTransferScanAssignmentWire, deadline: Option, ) -> Result { - let Some(mut session) = take_transfer_root_dir_listing_session(root_dir_abs, assignment)? else { + let Some(mut session) = take_transfer_root_dir_listing_session(root_dir_abs, assignment)? + else { return Ok(TransferRootDirListingOutcome::Finished( build_finished_empty_transfer_scan_result(assignment), )); @@ -848,7 +844,8 @@ fn collect_transfer_root_dir_listing_slice( }; scanned_entries = scanned_entries.saturating_add(1); let name = ent.file_name().to_string_lossy().to_string(); - let child_relpath = normalize_child_relpath(assignment.root_relpath.as_str(), name.as_str()); + let child_relpath = + normalize_child_relpath(assignment.root_relpath.as_str(), name.as_str()); if is_relpath_skipped(&assignment.skip_entries, child_relpath.as_str()) { continue; } @@ -899,10 +896,12 @@ fn collect_transfer_root_dir_listing_slice( let size = md.len().min(i64::MAX as u64) as i64; session.root_visible_entries = true; session.root_total_bytes = session.root_total_bytes.saturating_add(size); - session.pending_direct_files.push(FluxonFsTransferScanFrontierEntry { - relpath: child_relpath, - size, - }); + session + .pending_direct_files + .push(FluxonFsTransferScanFrontierEntry { + relpath: child_relpath, + size, + }); session.pending_direct_bytes = session.pending_direct_bytes.saturating_add(size); if should_flush_direct_batch( assignment.batch_ready_bytes, @@ -910,7 +909,9 @@ fn collect_transfer_root_dir_listing_slice( session.pending_direct_files.len(), session.pending_direct_empty_dirs.len(), ) { - if let Some(batch) = flush_pending_root_direct_files_batch(assignment, &mut session)? { + if let Some(batch) = + flush_pending_root_direct_files_batch(assignment, &mut session)? 
+ { direct_files_only_batches.push(batch); } } @@ -933,14 +934,18 @@ fn collect_transfer_root_dir_listing_slice( session.pending_direct_files.len(), session.pending_direct_empty_dirs.len(), ) { - if let Some(batch) = flush_pending_root_direct_files_batch(assignment, &mut session)? { + if let Some(batch) = + flush_pending_root_direct_files_batch(assignment, &mut session)? + { direct_files_only_batches.push(batch); } } } else { - session.direct_dirs.push(FluxonFsTransferScanFrontierDirEntry { - relpath: child_relpath, - }); + session + .direct_dirs + .push(FluxonFsTransferScanFrontierDirEntry { + relpath: child_relpath, + }); } } } @@ -1244,12 +1249,14 @@ impl TransferWorkerProgressWindow { fn record_written_bytes_and_maybe_ramp(&self, bytes: i64, now_unix_ms: i64) { let normalized = bytes.max(0); self.window_bytes.fetch_add(normalized, Ordering::SeqCst); - self.total_written_bytes.fetch_add(normalized, Ordering::SeqCst); + self.total_written_bytes + .fetch_add(normalized, Ordering::SeqCst); self.maybe_ramp(now_unix_ms); } fn record_materialized_empty_dir(&self) { - self.total_materialized_empty_dirs.fetch_add(1, Ordering::SeqCst); + self.total_materialized_empty_dirs + .fetch_add(1, Ordering::SeqCst); } fn total_materialized_empty_dirs(&self) -> i64 { @@ -1298,8 +1305,9 @@ impl TransferWorkerProgressWindow { } if previous_goodput > 0 { let delta = current_goodput.saturating_sub(previous_goodput); - let improvement_percent = - delta.saturating_mul(100).saturating_div(previous_goodput.max(1)); + let improvement_percent = delta + .saturating_mul(100) + .saturating_div(previous_goodput.max(1)); if improvement_percent < self.policy.min_improvement_percent { return; } @@ -1335,10 +1343,8 @@ impl TransferWorkerProgressWindow { .saturating_mul(1000) .saturating_div(window_elapsed_ms.max(1)) }; - self.peak_sample_goodput_bytes_per_sec.fetch_max( - window_goodput_bytes_per_sec.max(0), - Ordering::SeqCst, - ); + self.peak_sample_goodput_bytes_per_sec + 
.fetch_max(window_goodput_bytes_per_sec.max(0), Ordering::SeqCst); Some(TransferWorkerThroughputSample { window_started_unix_ms, window_elapsed_ms, @@ -1448,7 +1454,8 @@ impl TransferReadStreamActorOwned { data: Vec::new(), }); } - self.fill_prefetch_queue().map_err(|err| self.cache_terminal_error(err))?; + self.fill_prefetch_queue() + .map_err(|err| self.cache_terminal_error(err))?; let to_take = std::cmp::min(length as usize, (self.file_size - next_offset) as usize); let buf = self .take_prefetched_bytes(to_take) @@ -1457,7 +1464,8 @@ impl TransferReadStreamActorOwned { self.replay_offset = next_offset; self.replay_data = buf.clone(); self.next_offset = next_offset.saturating_add(buf.len() as i64); - self.fill_prefetch_queue().map_err(|err| self.cache_terminal_error(err))?; + self.fill_prefetch_queue() + .map_err(|err| self.cache_terminal_error(err))?; Ok(FluxonFsTransferReadStreamNextResultWire { stream_missing: false, data: buf, @@ -1598,7 +1606,11 @@ impl TransferReadStreamActorHandle { struct TransferWorkerCoordinator where - ReadChunkFn: Fn(&FluxonFsTransferManifestEntryWire, i64, i64) -> Result, TransferWorkerExecutionError>, + ReadChunkFn: Fn( + &FluxonFsTransferManifestEntryWire, + i64, + i64, + ) -> Result, TransferWorkerExecutionError>, CheckpointFn: Fn() -> Result<(), TransferWorkerExecutionError>, { log_context: TransferWorkerLogContext, @@ -1611,7 +1623,11 @@ where impl TransferWorkerCoordinator where - ReadChunkFn: Fn(&FluxonFsTransferManifestEntryWire, i64, i64) -> Result, TransferWorkerExecutionError>, + ReadChunkFn: Fn( + &FluxonFsTransferManifestEntryWire, + i64, + i64, + ) -> Result, TransferWorkerExecutionError>, CheckpointFn: Fn() -> Result<(), TransferWorkerExecutionError>, { fn new( @@ -1693,7 +1709,8 @@ where } fn progress_snapshot(&self) -> TransferWorkerProgressSnapshot { - self.progress.snapshot(chrono::Utc::now().timestamp_millis()) + self.progress + .snapshot(chrono::Utc::now().timestamp_millis()) } fn stop(&self) { @@ -1737,7 +1754,8 
@@ impl TransferReadStreamRegistryHandle { }); } } - let full_path = safe_join_root(root_dir_abs, open.relpath.as_str()).map_err(resp_err_kverr)?; + let full_path = + safe_join_root(root_dir_abs, open.relpath.as_str()).map_err(resp_err_kverr)?; let file = open_file_with_target_path_chmod_retry(&full_path, "open_stream")?; let md = file.metadata().map_err(resp_err_io)?; let file_size = md.len().min(i64::MAX as u64) as i64; @@ -1764,12 +1782,15 @@ impl TransferReadStreamRegistryHandle { }); } state.streams.insert(stream_id.clone(), entry); - state.dedup_by_worker_file.insert(dedup_key, stream_id.clone()); + state + .dedup_by_worker_file + .insert(dedup_key, stream_id.clone()); drop(state); if let Err(resp) = TransferReadStreamActorHandle::start(stream_id.as_str(), actor) { let mut state = self.state.lock(); state.streams.remove(stream_id.as_str()); - state.dedup_by_worker_file + state + .dedup_by_worker_file .retain(|_, existing_stream_id| existing_stream_id != &stream_id); return Err(resp); } @@ -1811,7 +1832,9 @@ impl TransferReadStreamRegistryHandle { let Some(entry) = state.streams.remove(stream_id) else { return; }; - state.dedup_by_worker_file.retain(|_, existing_stream_id| existing_stream_id != stream_id); + state + .dedup_by_worker_file + .retain(|_, existing_stream_id| existing_stream_id != stream_id); entry.close(); } } @@ -1894,20 +1917,22 @@ fn decode_transfer_stream_open_result_payload( return Err(TransferWorkerRpcFailure::Fatal(resp.clone())); } Ok(FluxonFsTransferReadStreamOpenResultWire { - stream_id: require_str(resp, "stream_id").map_err(resp_err_kverr).map_err( - |err| { + stream_id: require_str(resp, "stream_id") + .map_err(resp_err_kverr) + .map_err(|err| { invalid_transfer_rpc_response(format!( "transfer read stream open response missing stream_id: err={}", transfer_rpc_response_err_text(&err) )) - }, - )?, - size: require_i64(resp, "size").map_err(resp_err_kverr).map_err(|err| { - invalid_transfer_rpc_response(format!( - "transfer read stream 
open response missing size: err={}", - transfer_rpc_response_err_text(&err) - )) - })?, + })?, + size: require_i64(resp, "size") + .map_err(resp_err_kverr) + .map_err(|err| { + invalid_transfer_rpc_response(format!( + "transfer read stream open response missing size: err={}", + transfer_rpc_response_err_text(&err) + )) + })?, }) } @@ -1980,11 +2005,15 @@ fn is_relpath_skipped(skip_entries: &[FluxonFsTransferSkipEntryWire], relpath: & } fn file_name_from_relpath(relpath: &str) -> Result<&str, FlatDict> { - relpath.rsplit('/').next().filter(|v| !v.is_empty()).ok_or_else(|| { - resp_err_kverr(KvError::Api(ApiError::InvalidArgument { - detail: format!("relpath must contain file name: {}", relpath), - })) - }) + relpath + .rsplit('/') + .next() + .filter(|v| !v.is_empty()) + .ok_or_else(|| { + resp_err_kverr(KvError::Api(ApiError::InvalidArgument { + detail: format!("relpath must contain file name: {}", relpath), + })) + }) } fn transfer_staging_dir_for_file(staging_prefix: &str, relpath: &str) -> String { @@ -2106,8 +2135,12 @@ where match attempt() { Ok(value) => Ok(value), Err(initial_err) if initial_err.kind() == ErrorKind::PermissionDenied => { - let repair_dir = - repair_permission_denied_dir_for_retry(repair_anchor, op, target_path, &initial_err)?; + let repair_dir = repair_permission_denied_dir_for_retry( + repair_anchor, + op, + target_path, + &initial_err, + )?; attempt().map_err(|retry_err| { resp_err_kverr(KvError::Api(ApiError::Unknown { detail: format!( @@ -2334,10 +2367,11 @@ fn collect_transfer_tree_with_deadline( continue; } }; - out.symlink_notices.push(FluxonFsTransferSymlinkNoticeEntryWire { - relpath: child_rel, - link_target: link_target.to_string_lossy().to_string(), - }); + out.symlink_notices + .push(FluxonFsTransferSymlinkNoticeEntryWire { + relpath: child_rel, + link_target: link_target.to_string_lossy().to_string(), + }); continue; } if md.is_dir() { @@ -2361,7 +2395,8 @@ fn collect_transfer_tree_with_deadline( } out.files.sort_by(|a, b| 
a.relpath.cmp(&b.relpath)); out.empty_dirs.sort(); - out.symlink_notices.sort_by(|a, b| a.relpath.cmp(&b.relpath)); + out.symlink_notices + .sort_by(|a, b| a.relpath.cmp(&b.relpath)); Ok(out) } @@ -2574,10 +2609,8 @@ fn build_transfer_scan_events_for_result( event_seq_no_start: i64, result: FluxonFsTransferScanResultWire, ) -> (Vec, bool, i64) { - let (child_scan_units, continue_locally) = split_same_root_continuation_from_child_scan_units( - assignment, - result.child_scan_units, - ); + let (child_scan_units, continue_locally) = + split_same_root_continuation_from_child_scan_units(assignment, result.child_scan_units); if continue_locally { let event = build_transfer_scan_event( assignment, @@ -2588,11 +2621,7 @@ fn build_transfer_scan_events_for_result( result.full_dir_batches, String::new(), ); - return ( - vec![event], - true, - event_seq_no_start.saturating_add(1), - ); + return (vec![event], true, event_seq_no_start.saturating_add(1)); } let mut next_event_seq_no = event_seq_no_start; let mut events = Vec::new(); @@ -2620,11 +2649,7 @@ fn build_transfer_scan_events_for_result( Vec::new(), String::new(), )); - ( - events, - false, - next_event_seq_no.saturating_add(1), - ) + (events, false, next_event_seq_no.saturating_add(1)) } fn send_transfer_scan_event_once( @@ -2637,10 +2662,7 @@ fn send_transfer_scan_event_once( } let event_json = serde_json::to_string(event) .map_err(|e| format!("serialize transfer scan event failed: {}", e))?; - let payload = FlatDict::from([( - "scan_event_json".to_string(), - FlatValue::String(event_json), - )]); + let payload = FlatDict::from([("scan_event_json".to_string(), FlatValue::String(event_json))]); let resp = api .rpc_client() .call( @@ -2793,30 +2815,28 @@ fn run_transfer_scan_background_task( } let mut next_event_seq_no = 1_i64; loop { - let result = match build_transfer_scan_result_for_root_dir_abs( - root_dir_abs.as_str(), - &assignment, - ) { - Ok(v) => v, - Err(resp) => { - let failed = build_transfer_scan_event( - 
&assignment, - next_event_seq_no, - FluxonFsTransferScanEventKindWire::Failed, - Vec::new(), - Vec::new(), - Vec::new(), - transfer_rpc_response_err_text(&resp), - ); - let _ = send_transfer_scan_event_with_retry( - api.as_ref(), - master_id.as_str(), - &mut assignment, - &failed, - ); - break; - } - }; + let result = + match build_transfer_scan_result_for_root_dir_abs(root_dir_abs.as_str(), &assignment) { + Ok(v) => v, + Err(resp) => { + let failed = build_transfer_scan_event( + &assignment, + next_event_seq_no, + FluxonFsTransferScanEventKindWire::Failed, + Vec::new(), + Vec::new(), + Vec::new(), + transfer_rpc_response_err_text(&resp), + ); + let _ = send_transfer_scan_event_with_retry( + api.as_ref(), + master_id.as_str(), + &mut assignment, + &failed, + ); + break; + } + }; let (events, continue_locally, next_seq_no_after_events) = build_transfer_scan_events_for_result(&assignment, next_event_seq_no, result); next_event_seq_no = next_seq_no_after_events; @@ -2922,17 +2942,14 @@ impl TransferScanRegistryHandle { let assignment2 = assignment.clone(); let thread_name = format!("fluxon_fs_transfer_scan_{}", assignment.scan_task_id); match thread::Builder::new().name(thread_name).spawn(move || { - run_transfer_scan_background_task( - registry, - api2, - master_id2, - exports2, - assignment2, - ); + run_transfer_scan_background_task(registry, api2, master_id2, exports2, assignment2); }) { Ok(_) => Ok(FluxonFsTransferScanLaunchResultWire::started()), Err(err) => { - self.state.lock().tasks.remove(assignment.scan_task_id.as_str()); + self.state + .lock() + .tasks + .remove(assignment.scan_task_id.as_str()); Err(resp_err_kverr(KvError::Api(ApiError::Unknown { detail: format!( "spawn transfer scan thread failed: scan_task_id={} err={}", @@ -3006,22 +3023,16 @@ impl TransferWorkerRegistryHandle { let master_id2 = master_id.to_string(); let exports2 = exports.clone(); let assignment2 = assignment.clone(); - let thread_name = format!( - "fluxon_fs_transfer_worker_{}", - 
assignment.worker_task_id - ); + let thread_name = format!("fluxon_fs_transfer_worker_{}", assignment.worker_task_id); match thread::Builder::new().name(thread_name).spawn(move || { - run_transfer_worker_background_task( - registry, - api2, - master_id2, - exports2, - assignment2, - ); + run_transfer_worker_background_task(registry, api2, master_id2, exports2, assignment2); }) { Ok(_) => Ok(FluxonFsTransferWorkerLaunchResultWire::started()), Err(err) => { - self.state.lock().tasks.remove(assignment.worker_task_id.as_str()); + self.state + .lock() + .tasks + .remove(assignment.worker_task_id.as_str()); Err(resp_err_kverr(KvError::Api(ApiError::Unknown { detail: format!( "spawn transfer worker thread failed: worker_task_id={} err={}", @@ -3298,10 +3309,7 @@ fn send_transfer_worker_result_once( detail: format!("serialize transfer worker result failed: {}", e), }))) })?; - let payload = FlatDict::from([( - "result_json".to_string(), - FlatValue::String(result_json), - )]); + let payload = FlatDict::from([("result_json".to_string(), FlatValue::String(result_json))]); let resp = api .rpc_client() .call( @@ -3396,7 +3404,10 @@ fn open_transfer_read_stream_via_rpc_once( "relpath".to_string(), FlatValue::String(file.relpath.clone()), ), - ("initial_offset".to_string(), FlatValue::Int64(initial_offset)), + ( + "initial_offset".to_string(), + FlatValue::Int64(initial_offset), + ), ]); let resp = api .rpc_client() @@ -3518,25 +3529,25 @@ impl TransferWorkerRemoteControl { loop { self.before_heartbeat_retry_attempt()?; let current_materialized_empty_dirs = self.progress.total_materialized_empty_dirs(); - match self - .heartbeat - .ensure_continue( - force, - current_materialized_empty_dirs, - |heartbeat_unix_ms, _heartbeat_detail| { - let progress_snapshot = - self.progress.snapshot(chrono::Utc::now().timestamp_millis()); - let telemetry = - Some(transfer_worker_telemetry_from_progress_snapshot(&progress_snapshot)); - send_transfer_worker_heartbeat_once( - self.api.as_ref(), - 
self.master_id.as_str(), - &self.assignment, - heartbeat_unix_ms, - telemetry, - ) - }, - ) { + match self.heartbeat.ensure_continue( + force, + current_materialized_empty_dirs, + |heartbeat_unix_ms, _heartbeat_detail| { + let progress_snapshot = self + .progress + .snapshot(chrono::Utc::now().timestamp_millis()); + let telemetry = Some(transfer_worker_telemetry_from_progress_snapshot( + &progress_snapshot, + )); + send_transfer_worker_heartbeat_once( + self.api.as_ref(), + self.master_id.as_str(), + &self.assignment, + heartbeat_unix_ms, + telemetry, + ) + }, + ) { Ok(()) => return Ok(()), Err(TransferWorkerHeartbeatGateError::Terminal(err)) => return Err(err), Err(TransferWorkerHeartbeatGateError::Retryable { @@ -3631,10 +3642,7 @@ impl TransferWorkerRemoteControl { ) } - fn close_stream_with_retry( - &self, - stream_id: &str, - ) -> Result<(), TransferWorkerExecutionError> { + fn close_stream_with_retry(&self, stream_id: &str) -> Result<(), TransferWorkerExecutionError> { let api = self.api.clone(); let assignment = self.assignment.clone(); let op_detail = format!( @@ -3744,9 +3752,9 @@ impl TransferWorkerRemoteControl { if ack.accepted { return Ok(()); } - Err(TransferWorkerExecutionError::Stop(stop_reason_or_superseded( - ack.stop_reason, - ))) + Err(TransferWorkerExecutionError::Stop( + stop_reason_or_superseded(ack.stop_reason), + )) } } @@ -3825,10 +3833,12 @@ impl TransferWorkerHeartbeatGate { mut heartbeat_op: HeartbeatOp, ) -> Result<(), TransferWorkerHeartbeatGateError> where - HeartbeatOp: FnMut( - i64, - &'static str, - ) -> Result, + HeartbeatOp: + FnMut( + i64, + &'static str, + ) + -> Result, { loop { let (heartbeat_unix_ms, heartbeat_detail) = { @@ -3875,15 +3885,13 @@ impl TransferWorkerHeartbeatGate { state.heartbeat_inflight = false; let result = match heartbeat_result { Ok(heartbeat_result) if heartbeat_result.continue_running => { - state.last_heartbeat_completed_unix_ms = - chrono::Utc::now().timestamp_millis(); + 
state.last_heartbeat_completed_unix_ms = chrono::Utc::now().timestamp_millis(); state.last_heartbeat_materialized_empty_dirs = current_materialized_empty_dirs; state.granted_lease_expire_unix_ms = heartbeat_result.lease_expire_unix_ms; Ok(()) } Ok(heartbeat_result) => { - state.last_heartbeat_completed_unix_ms = - chrono::Utc::now().timestamp_millis(); + state.last_heartbeat_completed_unix_ms = chrono::Utc::now().timestamp_millis(); state.last_heartbeat_materialized_empty_dirs = current_materialized_empty_dirs; let reason = stop_reason_or_superseded(heartbeat_result.stop_reason); state.terminal_state = @@ -3933,7 +3941,8 @@ fn run_transfer_worker_background_task( )); let dedup_expire_unix_ms = match control.ensure_continue(true) { Ok(()) => { - let dst_export_root = match exports.export_root_dir_abs(assignment.dst_export.as_str()) { + let dst_export_root = match exports.export_root_dir_abs(assignment.dst_export.as_str()) + { Ok(v) => v, Err(err) => { tracing::warn!( @@ -3996,7 +4005,9 @@ fn run_transfer_worker_background_task( }, { let control = control.clone(); - move |file, read_offset, length| control.read_chunk_with_retry(file, read_offset, length) + move |file, read_offset, length| { + control.read_chunk_with_retry(file, read_offset, length) + } }, ) { Ok(result) => { @@ -4004,34 +4015,38 @@ fn run_transfer_worker_background_task( if let Err(resp) = cleanup_transfer_worker_attempt_artifacts(&dst_root, &assignment) { - log_transfer_worker_cleanup_failure("before_result_submit", &assignment, &resp); - } - match control.submit_result_with_retry(&result) { - Ok(()) => control.dedup_expire_unix_ms(), - Err(TransferWorkerExecutionError::Stop(reason)) => { - tracing::info!( - "transfer worker result submission stopped: job_id={} batch_id={} worker_id={} worker_task_id={} reason={:?}", - assignment.job_id, - assignment.batch_id, - assignment.worker_id, - assignment.worker_task_id, - reason + log_transfer_worker_cleanup_failure( + "before_result_submit", + &assignment, 
+ &resp, ); - control.dedup_expire_unix_ms() } - Err(TransferWorkerExecutionError::Fatal(resp)) => { - tracing::warn!( - "transfer worker result submission failed: job_id={} batch_id={} worker_id={} worker_task_id={} resp={:?}", - assignment.job_id, - assignment.batch_id, - assignment.worker_id, - assignment.worker_task_id, - resp - ); - control.dedup_expire_unix_ms() + match control.submit_result_with_retry(&result) { + Ok(()) => control.dedup_expire_unix_ms(), + Err(TransferWorkerExecutionError::Stop(reason)) => { + tracing::info!( + "transfer worker result submission stopped: job_id={} batch_id={} worker_id={} worker_task_id={} reason={:?}", + assignment.job_id, + assignment.batch_id, + assignment.worker_id, + assignment.worker_task_id, + reason + ); + control.dedup_expire_unix_ms() + } + Err(TransferWorkerExecutionError::Fatal(resp)) => { + tracing::warn!( + "transfer worker result submission failed: job_id={} batch_id={} worker_id={} worker_task_id={} resp={:?}", + assignment.job_id, + assignment.batch_id, + assignment.worker_id, + assignment.worker_task_id, + resp + ); + control.dedup_expire_unix_ms() + } } } - } Err(TransferWorkerExecutionError::Stop(reason)) => { control.close_all_streams(); if let Err(resp) = @@ -4054,10 +4069,13 @@ fn run_transfer_worker_background_task( if let Err(cleanup_resp) = cleanup_transfer_worker_attempt_artifacts(&dst_root, &assignment) { - log_transfer_worker_cleanup_failure("after_fatal", &assignment, &cleanup_resp); + log_transfer_worker_cleanup_failure( + "after_fatal", + &assignment, + &cleanup_resp, + ); } - if let Some((fatal_kind, fatal_message)) = - classify_transfer_worker_fatal(&resp) + if let Some((fatal_kind, fatal_message)) = classify_transfer_worker_fatal(&resp) { match report_transfer_worker_fatal_once( control.api.as_ref(), @@ -4107,15 +4125,25 @@ fn run_transfer_worker_background_task( } } Err(TransferWorkerExecutionError::Stop(reason)) => { - let dst_export_root = 
exports.export_root_dir_abs(assignment.dst_export.as_str()).ok(); + let dst_export_root = exports + .export_root_dir_abs(assignment.dst_export.as_str()) + .ok(); let dst_root = dst_export_root.and_then(|dst_export_root| { - safe_join_root(dst_export_root.as_str(), assignment.dst_root_relpath.as_str()) - .ok() - .map(PathBuf::from) + safe_join_root( + dst_export_root.as_str(), + assignment.dst_root_relpath.as_str(), + ) + .ok() + .map(PathBuf::from) }); if let Some(dst_root) = dst_root { - if let Err(resp) = cleanup_transfer_worker_attempt_artifacts(&dst_root, &assignment) { - log_transfer_worker_cleanup_failure("before_execution_stop", &assignment, &resp); + if let Err(resp) = cleanup_transfer_worker_attempt_artifacts(&dst_root, &assignment) + { + log_transfer_worker_cleanup_failure( + "before_execution_stop", + &assignment, + &resp, + ); } } tracing::info!( @@ -4129,11 +4157,16 @@ fn run_transfer_worker_background_task( control.dedup_expire_unix_ms() } Err(TransferWorkerExecutionError::Fatal(resp)) => { - let dst_export_root = exports.export_root_dir_abs(assignment.dst_export.as_str()).ok(); + let dst_export_root = exports + .export_root_dir_abs(assignment.dst_export.as_str()) + .ok(); let dst_root = dst_export_root.and_then(|dst_export_root| { - safe_join_root(dst_export_root.as_str(), assignment.dst_root_relpath.as_str()) - .ok() - .map(PathBuf::from) + safe_join_root( + dst_export_root.as_str(), + assignment.dst_root_relpath.as_str(), + ) + .ok() + .map(PathBuf::from) }); if let Some(dst_root) = dst_root { if let Err(cleanup_resp) = @@ -4509,7 +4542,9 @@ fn plan_transfer_subtree_batches( total_bytes: 0, root_is_empty: true, mergeable_empty_dir_count: 1, - mergeable_empty_dir_estimated_bytes: estimate_empty_dir_manifest_entry_bytes(root_relpath), + mergeable_empty_dir_estimated_bytes: estimate_empty_dir_manifest_entry_bytes( + root_relpath, + ), direct_files_only_batches: Vec::new(), full_dir_batches: Vec::new(), child_scan_units: Vec::new(), @@ -4540,8 +4575,7 
@@ fn plan_transfer_subtree_batches( let child_empty_dir_count = child_plan.mergeable_empty_dir_count; let child_empty_dir_estimated_bytes = child_plan.mergeable_empty_dir_estimated_bytes; - if mergeable_empty_dir_count - .saturating_add(child_empty_dir_count) + if mergeable_empty_dir_count.saturating_add(child_empty_dir_count) > TRANSFER_MERGEABLE_EMPTY_DIR_BUDGET || mergeable_empty_dir_estimated_bytes .saturating_add(child_empty_dir_estimated_bytes) @@ -4716,7 +4750,11 @@ fn build_root_direct_files_only_batch_from_entries( } fn sort_transfer_scan_batches(batches: &mut [FluxonFsTransferScanBatchWire]) { - batches.sort_by(|a, b| a.root_relpath.cmp(&b.root_relpath).then(a.batch_id.cmp(&b.batch_id))); + batches.sort_by(|a, b| { + a.root_relpath + .cmp(&b.root_relpath) + .then(a.batch_id.cmp(&b.batch_id)) + }); } fn build_full_dir_batch_for_mergeable_subtree( @@ -4750,14 +4788,12 @@ fn build_transfer_scan_result_for_subtree_streaming_root_dir_abs( root_dir_abs: &str, assignment: &FluxonFsTransferScanAssignmentWire, ) -> Result { - let Some(mut session) = take_transfer_subtree_streaming_session(root_dir_abs, assignment)? else { + let Some(mut session) = take_transfer_subtree_streaming_session(root_dir_abs, assignment)? + else { return Ok(build_finished_empty_subtree_stream_result(assignment)); }; loop { - if session - .dir_stack - .is_empty() - { + if session.dir_stack.is_empty() { let mut full_dir_batches = Vec::new(); if let Some(batch) = flush_pending_subtree_stream_batch(assignment, &mut session)? { full_dir_batches.push(batch); @@ -4776,7 +4812,9 @@ fn build_transfer_scan_result_for_subtree_streaming_root_dir_abs( finished: true, }); } - if TransferScanDeadline::from_assignment(assignment).is_some_and(|deadline| deadline.reached()) { + if TransferScanDeadline::from_assignment(assignment) + .is_some_and(|deadline| deadline.reached()) + { let mut full_dir_batches = Vec::new(); if let Some(batch) = flush_pending_subtree_stream_batch(assignment, &mut session)? 
{ full_dir_batches.push(batch); @@ -4808,7 +4846,10 @@ fn build_transfer_scan_result_for_subtree_streaming_root_dir_abs( if should_flush_subtree_stream_batch( assignment.batch_ready_bytes, session.pending_bytes, - session.pending_files.len().saturating_add(session.pending_symlink_notices.len()), + session + .pending_files + .len() + .saturating_add(session.pending_symlink_notices.len()), session.pending_empty_dirs.len(), ) { let batch = flush_pending_subtree_stream_batch(assignment, &mut session)?.unwrap(); @@ -4865,22 +4906,30 @@ fn build_transfer_scan_result_for_subtree_streaming_root_dir_abs( }); } else if md.is_dir() { frame.saw_visible_child = true; - session.dir_stack.push(open_transfer_subtree_streaming_dir_frame( - child_path, - child_relpath, - )?); + session + .dir_stack + .push(open_transfer_subtree_streaming_dir_frame( + child_path, + child_relpath, + )?); } else if md.is_file() { frame.saw_visible_child = true; let size = md.len().min(i64::MAX as u64) as i64; session.pending_bytes = session.pending_bytes.saturating_add(size); session .pending_files - .push(FluxonFsTransferScanFrontierEntry { relpath: child_relpath, size }); + .push(FluxonFsTransferScanFrontierEntry { + relpath: child_relpath, + size, + }); } if should_flush_subtree_stream_batch( assignment.batch_ready_bytes, session.pending_bytes, - session.pending_files.len().saturating_add(session.pending_symlink_notices.len()), + session + .pending_files + .len() + .saturating_add(session.pending_symlink_notices.len()), session.pending_empty_dirs.len(), ) { let batch = flush_pending_subtree_stream_batch(assignment, &mut session)?.unwrap(); @@ -4928,11 +4977,12 @@ pub(crate) fn build_transfer_scan_result_for_root_dir_abs( ); } let deadline = TransferScanDeadline::from_assignment(assignment); - let root_listing = match collect_transfer_root_dir_listing_slice(root_dir_abs, assignment, deadline)? 
{ - TransferRootDirListingOutcome::Complete(v) => v, - TransferRootDirListingOutcome::Finished(result) => return Ok(result), - TransferRootDirListingOutcome::Partial(result) => return Ok(result), - }; + let root_listing = + match collect_transfer_root_dir_listing_slice(root_dir_abs, assignment, deadline)? { + TransferRootDirListingOutcome::Complete(v) => v, + TransferRootDirListingOutcome::Finished(result) => return Ok(result), + TransferRootDirListingOutcome::Partial(result) => return Ok(result), + }; let mut direct_files = root_listing.direct_files; let mut direct_symlink_notices = root_listing.direct_symlink_notices; let mut direct_empty_dirs = root_listing.direct_empty_dirs; @@ -4976,7 +5026,10 @@ pub(crate) fn build_transfer_scan_result_for_root_dir_abs( if (!direct_files.is_empty() || !direct_symlink_notices.is_empty() || !direct_empty_dirs.is_empty()) - && !direct_files_only_disposition_covers_root(assignment, assignment.root_relpath.as_str()) + && !direct_files_only_disposition_covers_root( + assignment, + assignment.root_relpath.as_str(), + ) { let mut next_partition_index = root_listing.emitted_direct_files_batch_count; direct_files_only_batches.extend(build_partitioned_root_direct_files_only_batches( @@ -4987,17 +5040,15 @@ pub(crate) fn build_transfer_scan_result_for_root_dir_abs( direct_empty_dirs.clone(), )?); } - child_scan_units.extend( - direct_dirs[delegated_child_scan_unit_count..] 
- .iter() - .map(|entry| { - new_child_scan_unit( - entry.relpath.clone(), - assignment.generation + 1, - delegated_child_scan_mode(), - ) - }), - ); + child_scan_units.extend(direct_dirs[delegated_child_scan_unit_count..].iter().map( + |entry| { + new_child_scan_unit( + entry.relpath.clone(), + assignment.generation + 1, + delegated_child_scan_mode(), + ) + }, + )); child_scan_units.sort_by(|a, b| a.root_relpath.cmp(&b.root_relpath)); sort_transfer_scan_batches(&mut direct_files_only_batches); return Ok(FluxonFsTransferScanResultWire { @@ -5027,7 +5078,8 @@ pub(crate) fn build_transfer_scan_result_for_root_dir_abs( let mut root_partitioned = root_listing.emitted_direct_files_batch_count > 0 || direct_files_only_disposition_covers_root(assignment, assignment.root_relpath.as_str()); let mut mergeable_empty_dir_count = direct_empty_dirs.len(); - let mut mergeable_empty_dir_estimated_bytes = estimate_empty_dir_manifest_bytes(&direct_empty_dirs); + let mut mergeable_empty_dir_estimated_bytes = + estimate_empty_dir_manifest_bytes(&direct_empty_dirs); for child_relpath in direct_dirs.iter().map(|entry| entry.relpath.clone()) { let child_plan = plan_transfer_subtree_batches( root_dir_abs, @@ -5043,8 +5095,7 @@ pub(crate) fn build_transfer_scan_result_for_root_dir_abs( let child_empty_dir_count = child_plan.mergeable_empty_dir_count; let child_empty_dir_estimated_bytes = child_plan.mergeable_empty_dir_estimated_bytes; - if mergeable_empty_dir_count - .saturating_add(child_empty_dir_count) + if mergeable_empty_dir_count.saturating_add(child_empty_dir_count) > TRANSFER_MERGEABLE_EMPTY_DIR_BUDGET || mergeable_empty_dir_estimated_bytes .saturating_add(child_empty_dir_estimated_bytes) @@ -5083,7 +5134,10 @@ pub(crate) fn build_transfer_scan_result_for_root_dir_abs( if (!direct_files.is_empty() || !direct_symlink_notices.is_empty() || !mergeable_empty_child_relpaths.is_empty()) - && !direct_files_only_disposition_covers_root(assignment, assignment.root_relpath.as_str()) + && 
!direct_files_only_disposition_covers_root( + assignment, + assignment.root_relpath.as_str(), + ) { let mut next_partition_index = root_listing.emitted_direct_files_batch_count; direct_empty_dirs.extend(mergeable_empty_child_relpaths); @@ -5212,10 +5266,11 @@ fn handle_transfer_scan_assignment( assignment.generation, assignment.known_dispositions.len(), ); - let result = match build_transfer_scan_result_for_root_dir_abs(root_dir_abs.as_str(), &assignment) { - Ok(v) => v, - Err(resp) => return resp, - }; + let result = + match build_transfer_scan_result_for_root_dir_abs(root_dir_abs.as_str(), &assignment) { + Ok(v) => v, + Err(resp) => return resp, + }; encode_transfer_scan_result(&result, "transfer scan result") } @@ -5228,8 +5283,11 @@ fn prepare_transfer_file_streaming( coordinator: &TransferWorkerCoordinator, ) -> Result where - ReadChunkFn: - Fn(&FluxonFsTransferManifestEntryWire, i64, i64) -> Result, TransferWorkerExecutionError>, + ReadChunkFn: Fn( + &FluxonFsTransferManifestEntryWire, + i64, + i64, + ) -> Result, TransferWorkerExecutionError>, CheckpointFn: Fn() -> Result<(), TransferWorkerExecutionError>, { let staging_relpath = transfer_staging_file_relpath(staging_prefix, file.relpath.as_str()) @@ -5239,9 +5297,12 @@ where .map_err(TransferWorkerExecutionError::fatal)?; ensure_transfer_parent_dirs(dst_root, final_relpath.as_str()) .map_err(TransferWorkerExecutionError::fatal)?; - let staging_abs = safe_join_root(dst_root.to_string_lossy().as_ref(), staging_relpath.as_str()) - .map_err(resp_err_kverr) - .map_err(TransferWorkerExecutionError::fatal)?; + let staging_abs = safe_join_root( + dst_root.to_string_lossy().as_ref(), + staging_relpath.as_str(), + ) + .map_err(resp_err_kverr) + .map_err(TransferWorkerExecutionError::fatal)?; let mut dst_file = open_create_file_with_parent_dir_chmod_retry(&staging_abs) .map_err(TransferWorkerExecutionError::fatal)?; dst_file @@ -5254,14 +5315,14 @@ where let remaining = file.size.saturating_sub(copied); let chunk = 
coordinator.read_chunk(file, copied, remaining.min(CHUNK_BYTES as i64))?; if chunk.is_empty() { - return Err(TransferWorkerExecutionError::fatal(resp_err_kverr(KvError::Api( - ApiError::InvalidArgument { + return Err(TransferWorkerExecutionError::fatal(resp_err_kverr( + KvError::Api(ApiError::InvalidArgument { detail: format!( "transfer worker source ended before expected size: relpath={} expected={} copied={}", file.relpath, file.size, copied ), - }, - )))); + }), + ))); } dst_file .write_all(&chunk) @@ -5271,13 +5332,14 @@ where coordinator.record_written_bytes(chunk.len() as i64); } if copied != file.size { - return Err(TransferWorkerExecutionError::fatal(resp_err_kverr(KvError::Api( - ApiError::InvalidArgument { - detail: format!( - "transfer worker size mismatch before staging completion: relpath={} expected={} actual={}", - file.relpath, file.size, copied - ), - })))); + return Err(TransferWorkerExecutionError::fatal(resp_err_kverr( + KvError::Api(ApiError::InvalidArgument { + detail: format!( + "transfer worker size mismatch before staging completion: relpath={} expected={} actual={}", + file.relpath, file.size, copied + ), + }), + ))); } // The staged file is still invisible at this point, so one more checkpoint // keeps supersession able to stop the worker before any later visible @@ -5302,48 +5364,57 @@ fn execute_transfer_single_file( coordinator: &TransferWorkerCoordinator, ) -> Result where - ReadChunkFn: - Fn(&FluxonFsTransferManifestEntryWire, i64, i64) -> Result, TransferWorkerExecutionError>, + ReadChunkFn: Fn( + &FluxonFsTransferManifestEntryWire, + i64, + i64, + ) -> Result, TransferWorkerExecutionError>, CheckpointFn: Fn() -> Result<(), TransferWorkerExecutionError>, { coordinator.checkpoint_continue()?; - let prepared = match prepare_transfer_file_streaming(dst_root, staging_prefix, file, coordinator) { - Ok(v) => v, - Err(TransferWorkerExecutionError::Fatal(resp)) => { - if let Some(failed) = classify_transfer_failed_file(file, &resp) { - let 
staging_relpath = - transfer_staging_file_relpath(staging_prefix, file.relpath.as_str()) - .map_err(TransferWorkerExecutionError::fatal)?; - let staging_abs = safe_join_root( - dst_root.to_string_lossy().as_ref(), - staging_relpath.as_str(), - ) - .map_err(resp_err_kverr) - .map_err(TransferWorkerExecutionError::fatal)?; - match fs::remove_file(&staging_abs) { - Ok(()) => {} - Err(err) if err.kind() == ErrorKind::NotFound => {} - Err(err) => return Err(TransferWorkerExecutionError::fatal(resp_err_io(err))), + let prepared = + match prepare_transfer_file_streaming(dst_root, staging_prefix, file, coordinator) { + Ok(v) => v, + Err(TransferWorkerExecutionError::Fatal(resp)) => { + if let Some(failed) = classify_transfer_failed_file(file, &resp) { + let staging_relpath = + transfer_staging_file_relpath(staging_prefix, file.relpath.as_str()) + .map_err(TransferWorkerExecutionError::fatal)?; + let staging_abs = safe_join_root( + dst_root.to_string_lossy().as_ref(), + staging_relpath.as_str(), + ) + .map_err(resp_err_kverr) + .map_err(TransferWorkerExecutionError::fatal)?; + match fs::remove_file(&staging_abs) { + Ok(()) => {} + Err(err) if err.kind() == ErrorKind::NotFound => {} + Err(err) => { + return Err(TransferWorkerExecutionError::fatal(resp_err_io(err))); + } + } + return Ok(TransferWorkerLaneOutcome::Failed( + TransferWorkerLaneFailedFileResult { result: failed }, + )); } - return Ok(TransferWorkerLaneOutcome::Failed( - TransferWorkerLaneFailedFileResult { result: failed }, - )); + return Err(TransferWorkerExecutionError::Fatal(resp)); } - return Err(TransferWorkerExecutionError::Fatal(resp)); - } - Err(err) => return Err(err), - }; + Err(err) => return Err(err), + }; coordinator.checkpoint_continue()?; - let result = promote_prepared_transfer_file(dst_root, PreparedTransferFile { - staging_relpath: prepared.staging_relpath.clone(), - final_relpath: prepared.final_relpath.clone(), - visible_size: prepared.visible_size, - }) + let result = 
promote_prepared_transfer_file( + dst_root, + PreparedTransferFile { + staging_relpath: prepared.staging_relpath.clone(), + final_relpath: prepared.final_relpath.clone(), + visible_size: prepared.visible_size, + }, + ) .map_err(TransferWorkerExecutionError::fatal); match result { - Ok(result) => Ok(TransferWorkerLaneOutcome::Visible(TransferWorkerLaneFileResult { - result, - })), + Ok(result) => Ok(TransferWorkerLaneOutcome::Visible( + TransferWorkerLaneFileResult { result }, + )), Err(TransferWorkerExecutionError::Fatal(resp)) => { if let Some(failed) = classify_transfer_failed_file(file, &resp) { let staging_abs = safe_join_root( @@ -5374,8 +5445,11 @@ fn execute_transfer_empty_dir( coordinator: &TransferWorkerCoordinator, ) -> Result where - ReadChunkFn: - Fn(&FluxonFsTransferManifestEntryWire, i64, i64) -> Result, TransferWorkerExecutionError>, + ReadChunkFn: Fn( + &FluxonFsTransferManifestEntryWire, + i64, + i64, + ) -> Result, TransferWorkerExecutionError>, CheckpointFn: Fn() -> Result<(), TransferWorkerExecutionError>, { coordinator.checkpoint_continue()?; @@ -5393,11 +5467,14 @@ fn execute_transfer_worker_assignment_with_policy( read_chunk: ReadChunkFn, ) -> Result where - ReadChunkFn: - Fn(&FluxonFsTransferManifestEntryWire, i64, i64) -> Result, TransferWorkerExecutionError> - + Send - + Sync - + 'static, + ReadChunkFn: Fn( + &FluxonFsTransferManifestEntryWire, + i64, + i64, + ) -> Result, TransferWorkerExecutionError> + + Send + + Sync + + 'static, CheckpointFn: Fn() -> Result<(), TransferWorkerExecutionError> + Send + Sync + 'static, { let policy = policy.normalized(); @@ -5424,11 +5501,14 @@ fn execute_transfer_worker_assignment_with_policy_and_progress Result where - ReadChunkFn: - Fn(&FluxonFsTransferManifestEntryWire, i64, i64) -> Result, TransferWorkerExecutionError> - + Send - + Sync - + 'static, + ReadChunkFn: Fn( + &FluxonFsTransferManifestEntryWire, + i64, + i64, + ) -> Result, TransferWorkerExecutionError> + + Send + + Sync + + 'static, 
CheckpointFn: Fn() -> Result<(), TransferWorkerExecutionError> + Send + Sync + 'static, { create_dir_all_with_parent_dir_chmod_retry(dst_root) @@ -5436,9 +5516,11 @@ where let manifest = FluxonFsTransferManifestWire::decode_from_blob(assignment.manifest_blob.as_slice()) .map_err(|e| { - TransferWorkerExecutionError::fatal(resp_err_kverr(KvError::Api(ApiError::InvalidArgument { - detail: format!("decode transfer worker manifest failed: {}", e), - }))) + TransferWorkerExecutionError::fatal(resp_err_kverr(KvError::Api( + ApiError::InvalidArgument { + detail: format!("decode transfer worker manifest failed: {}", e), + }, + ))) })?; if transfer_manifest_is_empty_dirs_only_batch(&manifest, assignment.collect_infos.as_slice()) { // Empty-dir-only batches never generate byte-based ramp-up signals, so @@ -5664,10 +5746,16 @@ fn promote_prepared_transfer_file( dst_root: &PathBuf, file: PreparedTransferFile, ) -> Result { - let staging_abs = safe_join_root(dst_root.to_string_lossy().as_ref(), file.staging_relpath.as_str()) - .map_err(resp_err_kverr)?; - let final_abs = safe_join_root(dst_root.to_string_lossy().as_ref(), file.final_relpath.as_str()) - .map_err(resp_err_kverr)?; + let staging_abs = safe_join_root( + dst_root.to_string_lossy().as_ref(), + file.staging_relpath.as_str(), + ) + .map_err(resp_err_kverr)?; + let final_abs = safe_join_root( + dst_root.to_string_lossy().as_ref(), + file.final_relpath.as_str(), + ) + .map_err(resp_err_kverr)?; rename_with_dst_parent_dir_chmod_retry(&staging_abs, &final_abs)?; Ok(FluxonFsTransferWorkerFileResultWire { relpath: file.final_relpath.clone(), @@ -5694,15 +5782,15 @@ fn prepare_transfer_collect_info_materialization( ), })) })?; - let staging_relpath = transfer_collect_info_staging_relpath( - batch_id, - worker_task_id, - collect_info.collect_kind, - )?; + let staging_relpath = + transfer_collect_info_staging_relpath(batch_id, worker_task_id, collect_info.collect_kind)?; ensure_transfer_parent_dirs(dst_root, 
staging_relpath.as_str())?; ensure_transfer_parent_dirs(dst_root, output_relpath.as_str())?; - let staging_abs = safe_join_root(dst_root.to_string_lossy().as_ref(), staging_relpath.as_str()) - .map_err(resp_err_kverr)?; + let staging_abs = safe_join_root( + dst_root.to_string_lossy().as_ref(), + staging_relpath.as_str(), + ) + .map_err(resp_err_kverr)?; let mut dst_file = open_create_file_with_parent_dir_chmod_retry(&staging_abs)?; dst_file.set_len(0).map_err(resp_err_io)?; dst_file @@ -5723,14 +5811,15 @@ fn transfer_collect_info_staging_relpath( worker_task_id: &str, collect_kind: FluxonFsTransferCollectInfoKind, ) -> Result { - let output_relpath = transfer_collect_info_output_relpath(batch_id, collect_kind).map_err(|detail| { - resp_err_kverr(KvError::Api(ApiError::InvalidArgument { - detail: format!( - "build transfer collect info output relpath failed: batch_id={} err={}", - batch_id, detail - ), - })) - })?; + let output_relpath = + transfer_collect_info_output_relpath(batch_id, collect_kind).map_err(|detail| { + resp_err_kverr(KvError::Api(ApiError::InvalidArgument { + detail: format!( + "build transfer collect info output relpath failed: batch_id={} err={}", + batch_id, detail + ), + })) + })?; Ok(format!("{}.{}.fluxon.part", output_relpath, worker_task_id)) } @@ -5750,9 +5839,12 @@ fn prune_empty_parent_dirs(mut current: PathBuf, root: &PathBuf) -> Result<(), F Ok(()) } -fn cleanup_attempt_staging_prefix(dst_root: &PathBuf, staging_prefix: &str) -> Result<(), FlatDict> { - let staging_abs = - safe_join_root(dst_root.to_string_lossy().as_ref(), staging_prefix).map_err(resp_err_kverr)?; +fn cleanup_attempt_staging_prefix( + dst_root: &PathBuf, + staging_prefix: &str, +) -> Result<(), FlatDict> { + let staging_abs = safe_join_root(dst_root.to_string_lossy().as_ref(), staging_prefix) + .map_err(resp_err_kverr)?; match fs::remove_dir_all(&staging_abs) { Ok(()) => {} Err(err) if err.kind() == ErrorKind::NotFound => return Ok(()), @@ -5865,7 +5957,8 @@ 
pub(crate) fn read_transfer_chunk_from_root_dir_abs( return Ok(Vec::new()); } let to_read = std::cmp::min(length, size - offset) as usize; - f.seek(SeekFrom::Start(offset as u64)).map_err(resp_err_io)?; + f.seek(SeekFrom::Start(offset as u64)) + .map_err(resp_err_io)?; let mut buf = vec![0u8; to_read]; f.read_exact(&mut buf).map_err(resp_err_io)?; Ok(buf) @@ -5881,11 +5974,14 @@ pub(crate) fn execute_transfer_worker_assignment( read_chunk: ReadChunkFn, ) -> Result where - ReadChunkFn: - Fn(&FluxonFsTransferManifestEntryWire, i64, i64) -> Result, TransferWorkerExecutionError> - + Send - + Sync - + 'static, + ReadChunkFn: Fn( + &FluxonFsTransferManifestEntryWire, + i64, + i64, + ) -> Result, TransferWorkerExecutionError> + + Send + + Sync + + 'static, CheckpointFn: Fn() -> Result<(), TransferWorkerExecutionError> + Send + Sync + 'static, { execute_transfer_worker_assignment_with_policy( @@ -5943,12 +6039,19 @@ pub(super) fn handle_transfer_read(exports: &AgentExportsHandle, payload: FlatDi Ok(v) => v, Err(e) => return resp_err_kverr(e), }; - let buf = - match read_transfer_chunk_from_root_dir_abs(root_dir_abs.as_str(), relpath.as_str(), offset, length) { - Ok(v) => v, - Err(resp) => return resp, - }; - resp_ok(BTreeMap::from([("data".to_string(), FlatValue::Bytes(buf))])) + let buf = match read_transfer_chunk_from_root_dir_abs( + root_dir_abs.as_str(), + relpath.as_str(), + offset, + length, + ) { + Ok(v) => v, + Err(resp) => return resp, + }; + resp_ok(BTreeMap::from([( + "data".to_string(), + FlatValue::Bytes(buf), + )])) } pub(super) fn handle_transfer_stream_open( @@ -6010,7 +6113,9 @@ pub(super) fn handle_transfer_worker( Err(resp) => return resp, }; match registry.launch_task(api, master_id, exports, assignment) { - Ok(result) => encode_transfer_worker_launch_result(&result, "transfer worker launch result"), + Ok(result) => { + encode_transfer_worker_launch_result(&result, "transfer worker launch result") + } Err(resp) => resp, } } @@ -6086,15 +6191,15 @@ mod 
tests { .collect() } - fn assert_all_child_scan_units_are_subtree_streaming( - result: &FluxonFsTransferScanResultWire, - ) { - assert!(result - .child_scan_units - .iter() - .all(|child| child.scan_mode == FluxonFsTransferScanMode::SubtreeStreaming)); - } - + fn assert_all_child_scan_units_are_subtree_streaming(result: &FluxonFsTransferScanResultWire) { + assert!( + result + .child_scan_units + .iter() + .all(|child| child.scan_mode == FluxonFsTransferScanMode::SubtreeStreaming) + ); + } + fn ok_bool(resp: &FlatDict) -> bool { matches!(resp.get("ok"), Some(FlatValue::Bool(true))) } @@ -6106,10 +6211,7 @@ mod tests { panic!("unexpected open result fatal decode error: {:?}", other) } Err(TransferWorkerRpcFailure::Retryable { detail }) => { - panic!( - "unexpected open result retryable decode error: {}", - detail - ) + panic!("unexpected open result retryable decode error: {}", detail) } } } @@ -6121,10 +6223,7 @@ mod tests { panic!("unexpected next result fatal decode error: {:?}", other) } Err(TransferWorkerRpcFailure::Retryable { detail }) => { - panic!( - "unexpected next result retryable decode error: {}", - detail - ) + panic!("unexpected next result retryable decode error: {}", detail) } } } @@ -6140,10 +6239,7 @@ mod tests { .collect() } - fn test_worker_assignment( - relpath: &str, - size: i64, - ) -> FluxonFsTransferWorkerAssignmentWire { + fn test_worker_assignment(relpath: &str, size: i64) -> FluxonFsTransferWorkerAssignmentWire { FluxonFsTransferWorkerAssignmentWire { job_id: "job".to_string(), batch_id: "batch".to_string(), @@ -6158,12 +6254,13 @@ mod tests { root_relpath: ".".to_string(), staging_prefix: ".fluxon.stage/job/batch".to_string(), lease_expire_unix_ms: 0, - manifest_blob: build_transfer_manifest_blob(vec![ - FluxonFsTransferScanFrontierEntry { + manifest_blob: build_transfer_manifest_blob( + vec![FluxonFsTransferScanFrontierEntry { relpath: relpath.to_string(), size, - }, - ], Vec::new()) + }], + Vec::new(), + ) .unwrap(), collect_infos: 
Vec::new(), } @@ -6183,7 +6280,11 @@ mod tests { read_chunk: ReadChunkFn, ) -> TransferWorkerCoordinator where - ReadChunkFn: Fn(&FluxonFsTransferManifestEntryWire, i64, i64) -> Result, TransferWorkerExecutionError>, + ReadChunkFn: Fn( + &FluxonFsTransferManifestEntryWire, + i64, + i64, + ) -> Result, TransferWorkerExecutionError>, CheckpointFn: Fn() -> Result<(), TransferWorkerExecutionError>, { let policy = Arc::new(TransferWorkerLanePolicy::production_default()); @@ -6217,16 +6318,19 @@ mod tests { #[test] fn build_transfer_manifest_blob_round_trips_entries() { - let blob = build_transfer_manifest_blob(vec![ - FluxonFsTransferScanFrontierEntry { - relpath: "a".to_string(), - size: 1, - }, - FluxonFsTransferScanFrontierEntry { - relpath: "b/c".to_string(), - size: 2, - }, - ], vec!["empty".to_string()]) + let blob = build_transfer_manifest_blob( + vec![ + FluxonFsTransferScanFrontierEntry { + relpath: "a".to_string(), + size: 1, + }, + FluxonFsTransferScanFrontierEntry { + relpath: "b/c".to_string(), + size: 2, + }, + ], + vec!["empty".to_string()], + ) .unwrap(); let manifest = FluxonFsTransferManifestWire::decode_from_blob(&blob).unwrap(); assert_eq!(manifest.entry_count, 2); @@ -6250,11 +6354,12 @@ mod tests { #[test] fn materialize_transfer_collect_info_writes_task_scoped_staging_then_output_file() { let root = TempDir::new().unwrap(); - let collect_infos = build_symlink_collect_infos(vec![FluxonFsTransferSymlinkNoticeEntryWire { - relpath: "root/link-file.bin".to_string(), - link_target: "target/file.bin".to_string(), - }]) - .unwrap(); + let collect_infos = + build_symlink_collect_infos(vec![FluxonFsTransferSymlinkNoticeEntryWire { + relpath: "root/link-file.bin".to_string(), + link_target: "target/file.bin".to_string(), + }]) + .unwrap(); let prepared = prepare_transfer_collect_info_materialization( &root.path().to_path_buf(), "batch-1", @@ -6383,7 +6488,10 @@ mod tests { &exports, FlatDict::from([ ("export".to_string(), 
FlatValue::String("src".to_string())), - ("relpath".to_string(), FlatValue::String("f.bin".to_string())), + ( + "relpath".to_string(), + FlatValue::String("f.bin".to_string()), + ), ("offset".to_string(), FlatValue::Int64(2)), ("length".to_string(), FlatValue::Int64(3)), ]), @@ -6398,7 +6506,10 @@ mod tests { &exports, FlatDict::from([ ("export".to_string(), FlatValue::String("src".to_string())), - ("relpath".to_string(), FlatValue::String("f.bin".to_string())), + ( + "relpath".to_string(), + FlatValue::String("f.bin".to_string()), + ), ("offset".to_string(), FlatValue::Int64(6)), ("length".to_string(), FlatValue::Int64(1)), ]), @@ -6423,7 +6534,10 @@ mod tests { &exports, FlatDict::from([ ("export".to_string(), FlatValue::String("src".to_string())), - ("relpath".to_string(), FlatValue::String("f.bin".to_string())), + ( + "relpath".to_string(), + FlatValue::String("f.bin".to_string()), + ), ("offset".to_string(), FlatValue::Int64(1)), ("length".to_string(), FlatValue::Int64(3)), ]), @@ -6446,9 +6560,15 @@ mod tests { &exports, FlatDict::from([ ("export".to_string(), FlatValue::String("src".to_string())), - ("relpath".to_string(), FlatValue::String("f.bin".to_string())), + ( + "relpath".to_string(), + FlatValue::String("f.bin".to_string()), + ), ("offset".to_string(), FlatValue::Int64(0)), - ("length".to_string(), FlatValue::Int64(CHUNK_BYTES as i64 + 1)), + ( + "length".to_string(), + FlatValue::Int64(CHUNK_BYTES as i64 + 1), + ), ]), ); assert!(matches!(resp.get("ok"), Some(FlatValue::Bool(false)))); @@ -6470,7 +6590,10 @@ mod tests { FlatValue::String("task-0".to_string()), ), ("export".to_string(), FlatValue::String("src".to_string())), - ("relpath".to_string(), FlatValue::String("f.bin".to_string())), + ( + "relpath".to_string(), + FlatValue::String("f.bin".to_string()), + ), ("initial_offset".to_string(), FlatValue::Int64(0)), ]), ); @@ -6551,7 +6674,10 @@ mod tests { FlatValue::String("task-1".to_string()), ), ("export".to_string(), 
FlatValue::String("src".to_string())), - ("relpath".to_string(), FlatValue::String("f.bin".to_string())), + ( + "relpath".to_string(), + FlatValue::String("f.bin".to_string()), + ), ("initial_offset".to_string(), FlatValue::Int64(0)), ]), ); @@ -6584,7 +6710,10 @@ mod tests { ("length".to_string(), FlatValue::Int64(2)), ]), ); - assert!(matches!(invalid_resp.get("ok"), Some(FlatValue::Bool(false)))); + assert!(matches!( + invalid_resp.get("ok"), + Some(FlatValue::Bool(false)) + )); } #[test] @@ -6603,7 +6732,10 @@ mod tests { FlatValue::String("task-2".to_string()), ), ("export".to_string(), FlatValue::String("src".to_string())), - ("relpath".to_string(), FlatValue::String("f.bin".to_string())), + ( + "relpath".to_string(), + FlatValue::String("f.bin".to_string()), + ), ("initial_offset".to_string(), FlatValue::Int64(3)), ]), ); @@ -6647,7 +6779,10 @@ mod tests { FlatValue::String("task-3".to_string()), ), ("export".to_string(), FlatValue::String("src".to_string())), - ("relpath".to_string(), FlatValue::String("f.bin".to_string())), + ( + "relpath".to_string(), + FlatValue::String("f.bin".to_string()), + ), ("initial_offset".to_string(), FlatValue::Int64(3)), ]), ); @@ -6718,7 +6853,10 @@ mod tests { let result = decode_result_json(&resp); assert!(result.finished); assert!(result.direct_files_only_batches.is_empty()); - assert_eq!(child_scan_unit_roots(&result), vec!["root/child".to_string()]); + assert_eq!( + child_scan_unit_roots(&result), + vec!["root/child".to_string()] + ); assert_all_child_scan_units_are_subtree_streaming(&result); assert!(result.full_dir_batches.is_empty()); } @@ -6781,7 +6919,8 @@ mod tests { } #[test] - fn handle_transfer_scan_assignment_groups_empty_children_into_direct_batch_without_direct_files() { + fn handle_transfer_scan_assignment_groups_empty_children_into_direct_batch_without_direct_files() + { let root = TempDir::new().unwrap(); write_file(&root, "root/big/data.bin", b"12345"); 
fs::create_dir_all(root.path().join("root/empty-a")).unwrap(); @@ -6814,9 +6953,10 @@ mod tests { assert_eq!(child_scan_unit_roots(&result), vec!["root/big".to_string()]); assert_all_child_scan_units_are_subtree_streaming(&result); assert!(result.full_dir_batches.is_empty()); - let direct_manifest = - FluxonFsTransferManifestWire::decode_from_blob(&result.direct_files_only_batches[0].manifest_blob) - .unwrap(); + let direct_manifest = FluxonFsTransferManifestWire::decode_from_blob( + &result.direct_files_only_batches[0].manifest_blob, + ) + .unwrap(); assert!(direct_manifest.entries.is_empty()); assert_eq!( direct_manifest.empty_dir_relpaths, @@ -6854,16 +6994,17 @@ mod tests { assert_eq!(result.direct_files_only_batches.len(), 1); assert_eq!(result.full_dir_batches.len(), 0); assert_eq!(result.child_scan_units.len(), 1); - assert_eq!(result.child_scan_units[0].root_relpath, "root/huge".to_string()); + assert_eq!( + result.child_scan_units[0].root_relpath, + "root/huge".to_string() + ); let manifest = FluxonFsTransferManifestWire::decode_from_blob( &result.direct_files_only_batches[0].manifest_blob, ) .unwrap(); assert!(manifest.entries.is_empty()); assert!(!manifest.empty_dir_relpaths.is_empty()); - assert!( - manifest.empty_dir_relpaths.len() <= TRANSFER_MERGEABLE_EMPTY_DIR_BUDGET - ); + assert!(manifest.empty_dir_relpaths.len() <= TRANSFER_MERGEABLE_EMPTY_DIR_BUDGET); assert!( estimate_empty_dir_manifest_bytes(&manifest.empty_dir_relpaths) <= TRANSFER_MERGEABLE_EMPTY_DIR_ESTIMATED_BYTES_BUDGET @@ -6879,10 +7020,10 @@ mod tests { )) + 1; for idx in 0..child_count { - fs::create_dir_all(root.path().join(format!( - "root/branch-{idx:05}/{}", - "x".repeat(200) - ))) + fs::create_dir_all( + root.path() + .join(format!("root/branch-{idx:05}/{}", "x".repeat(200))), + ) .unwrap(); } let result = build_transfer_scan_result_for_root_dir_abs( @@ -6908,10 +7049,12 @@ mod tests { assert!(!result.finished); assert!(result.direct_files_only_batches.is_empty()); 
assert!(!result.child_scan_units.is_empty()); - assert!(result - .child_scan_units - .iter() - .any(|child| child.scan_mode == FluxonFsTransferScanMode::FullTree)); + assert!( + result + .child_scan_units + .iter() + .any(|child| child.scan_mode == FluxonFsTransferScanMode::FullTree) + ); assert!(result.child_scan_units.iter().all(|child| { child.scan_mode == FluxonFsTransferScanMode::FullTree || child.scan_mode == FluxonFsTransferScanMode::SubtreeStreaming @@ -6963,10 +7106,16 @@ mod tests { assert!(!continue_locally); assert_eq!(next_event_seq_no, 9); assert_eq!(events.len(), 2); - assert_eq!(events[0].event_kind, FluxonFsTransferScanEventKindWire::Append); + assert_eq!( + events[0].event_kind, + FluxonFsTransferScanEventKindWire::Append + ); assert_eq!(events[0].event_seq_no, 7); assert_eq!(events[0].full_dir_batches.len(), 1); - assert_eq!(events[1].event_kind, FluxonFsTransferScanEventKindWire::Finished); + assert_eq!( + events[1].event_kind, + FluxonFsTransferScanEventKindWire::Finished + ); assert_eq!(events[1].event_seq_no, 8); assert!(events[1].direct_files_only_batches.is_empty()); assert!(events[1].child_scan_units.is_empty()); @@ -6997,17 +7146,21 @@ mod tests { skip_entries: Vec::new(), }; - let first = build_transfer_scan_result_for_root_dir_abs( - root.path().to_str().unwrap(), - &assignment, - ) - .unwrap(); + let first = + build_transfer_scan_result_for_root_dir_abs(root.path().to_str().unwrap(), &assignment) + .unwrap(); assert!(!first.finished); assert!(!first.direct_files_only_batches.is_empty()); assert!(first.full_dir_batches.is_empty()); assert_eq!(first.child_scan_units.len(), 1); - assert_eq!(first.child_scan_units[0].scan_unit_id, assignment.scan_unit_id); - assert_eq!(first.child_scan_units[0].root_relpath, assignment.root_relpath); + assert_eq!( + first.child_scan_units[0].scan_unit_id, + assignment.scan_unit_id + ); + assert_eq!( + first.child_scan_units[0].root_relpath, + assignment.root_relpath + ); 
assert_eq!(first.child_scan_units[0].generation, assignment.generation); let first_entry_count = first .direct_files_only_batches @@ -7019,7 +7172,10 @@ mod tests { .len() }) .sum::(); - assert_eq!(first_entry_count, TRANSFER_SCAN_ROOT_LISTING_SLICE_ENTRY_LIMIT); + assert_eq!( + first_entry_count, + TRANSFER_SCAN_ROOT_LISTING_SLICE_ENTRY_LIMIT + ); let second_assignment = FluxonFsTransferScanAssignmentWire { scan_task_id: "task-2".to_string(), @@ -7086,7 +7242,8 @@ mod tests { } #[test] - fn build_transfer_scan_result_root_direct_fanout_only_emits_child_scan_units_without_recursing() { + fn build_transfer_scan_result_root_direct_fanout_only_emits_child_scan_units_without_recursing() + { let root = TempDir::new().unwrap(); write_file(&root, "root/direct.bin", b"abc"); write_file(&root, "root/child/payload.bin", b"xyz"); @@ -7114,14 +7271,18 @@ mod tests { assert_eq!(result.direct_files_only_batches.len(), 1); assert_eq!(result.child_scan_units.len(), 1); assert!(result.full_dir_batches.is_empty()); - assert_eq!(result.child_scan_units[0].root_relpath, "root/child".to_string()); + assert_eq!( + result.child_scan_units[0].root_relpath, + "root/child".to_string() + ); assert_eq!( result.child_scan_units[0].scan_mode, FluxonFsTransferScanMode::FullTree ); - let manifest = - FluxonFsTransferManifestWire::decode_from_blob(&result.direct_files_only_batches[0].manifest_blob) - .unwrap(); + let manifest = FluxonFsTransferManifestWire::decode_from_blob( + &result.direct_files_only_batches[0].manifest_blob, + ) + .unwrap(); assert_eq!( manifest.entries, vec![FluxonFsTransferManifestEntryWire { @@ -7133,7 +7294,8 @@ mod tests { } #[test] - fn build_transfer_scan_result_directory_direct_fanout_only_emits_child_scan_units_without_recursing() { + fn build_transfer_scan_result_directory_direct_fanout_only_emits_child_scan_units_without_recursing() + { let root = TempDir::new().unwrap(); write_file(&root, "root/child/direct.bin", b"abc"); write_file(&root, 
"root/child/grand/payload.bin", b"xyz"); @@ -7161,14 +7323,18 @@ mod tests { assert_eq!(result.direct_files_only_batches.len(), 1); assert_eq!(result.child_scan_units.len(), 1); assert!(result.full_dir_batches.is_empty()); - assert_eq!(result.child_scan_units[0].root_relpath, "root/child/grand".to_string()); + assert_eq!( + result.child_scan_units[0].root_relpath, + "root/child/grand".to_string() + ); assert_eq!( result.child_scan_units[0].scan_mode, FluxonFsTransferScanMode::FullTree ); - let manifest = - FluxonFsTransferManifestWire::decode_from_blob(&result.direct_files_only_batches[0].manifest_blob) - .unwrap(); + let manifest = FluxonFsTransferManifestWire::decode_from_blob( + &result.direct_files_only_batches[0].manifest_blob, + ) + .unwrap(); assert_eq!( manifest.entries, vec![FluxonFsTransferManifestEntryWire { @@ -7206,10 +7372,14 @@ mod tests { .unwrap(); assert!(result.finished); assert_eq!(result.direct_files_only_batches.len(), 1); - assert_eq!(result.direct_files_only_batches[0].root_relpath, "root/child"); - let manifest = - FluxonFsTransferManifestWire::decode_from_blob(&result.direct_files_only_batches[0].manifest_blob) - .unwrap(); + assert_eq!( + result.direct_files_only_batches[0].root_relpath, + "root/child" + ); + let manifest = FluxonFsTransferManifestWire::decode_from_blob( + &result.direct_files_only_batches[0].manifest_blob, + ) + .unwrap(); assert_eq!( manifest.entries, vec![FluxonFsTransferManifestEntryWire { @@ -7218,7 +7388,10 @@ mod tests { }] ); assert!(manifest.empty_dir_relpaths.is_empty()); - assert_eq!(child_scan_unit_roots(&result), vec!["root/child/grand".to_string()]); + assert_eq!( + child_scan_unit_roots(&result), + vec!["root/child/grand".to_string()] + ); assert_all_child_scan_units_are_subtree_streaming(&result); assert!(result.full_dir_batches.is_empty()); } @@ -7253,10 +7426,14 @@ mod tests { assert!(result.finished); assert_eq!(result.direct_files_only_batches.len(), 1); assert_eq!(result.child_scan_units.len(), 1); - 
assert_eq!(result.child_scan_units[0].root_relpath, "root/child".to_string()); - let manifest = - FluxonFsTransferManifestWire::decode_from_blob(&result.direct_files_only_batches[0].manifest_blob) - .unwrap(); + assert_eq!( + result.child_scan_units[0].root_relpath, + "root/child".to_string() + ); + let manifest = FluxonFsTransferManifestWire::decode_from_blob( + &result.direct_files_only_batches[0].manifest_blob, + ) + .unwrap(); assert_eq!( manifest.entries, vec![FluxonFsTransferManifestEntryWire { @@ -7330,9 +7507,10 @@ mod tests { assert_eq!(result.direct_files_only_batches.len(), 1); assert!(result.child_scan_units.is_empty()); assert!(result.full_dir_batches.is_empty()); - let manifest = - FluxonFsTransferManifestWire::decode_from_blob(&result.direct_files_only_batches[0].manifest_blob) - .unwrap(); + let manifest = FluxonFsTransferManifestWire::decode_from_blob( + &result.direct_files_only_batches[0].manifest_blob, + ) + .unwrap(); assert_eq!( manifest.entries, vec![FluxonFsTransferManifestEntryWire { @@ -7407,7 +7585,10 @@ mod tests { assert!(result.finished); assert_eq!(result.direct_files_only_batches.len(), 1); assert_eq!(result.child_scan_units.len(), 1); - assert_eq!(result.child_scan_units[0].root_relpath, "root/child-b".to_string()); + assert_eq!( + result.child_scan_units[0].root_relpath, + "root/child-b".to_string() + ); assert_eq!( result.child_scan_units[0].scan_mode, FluxonFsTransferScanMode::FullTree @@ -7447,9 +7628,10 @@ mod tests { assert_eq!(result.direct_files_only_batches.len(), 1); assert!(result.child_scan_units.is_empty()); assert!(result.full_dir_batches.is_empty()); - let manifest = - FluxonFsTransferManifestWire::decode_from_blob(&result.direct_files_only_batches[0].manifest_blob) - .unwrap(); + let manifest = FluxonFsTransferManifestWire::decode_from_blob( + &result.direct_files_only_batches[0].manifest_blob, + ) + .unwrap(); assert_eq!( manifest.entries, vec![FluxonFsTransferManifestEntryWire { @@ -7490,9 +7672,10 @@ mod tests { 
assert!(result.direct_files_only_batches.is_empty()); assert!(result.child_scan_units.is_empty()); assert_eq!(result.full_dir_batches.len(), 1); - let manifest = - FluxonFsTransferManifestWire::decode_from_blob(&result.full_dir_batches[0].manifest_blob) - .unwrap(); + let manifest = FluxonFsTransferManifestWire::decode_from_blob( + &result.full_dir_batches[0].manifest_blob, + ) + .unwrap(); assert!(manifest.entries.is_empty()); assert_eq!(manifest.empty_dir_relpaths, vec!["root".to_string()]); } @@ -7533,7 +7716,8 @@ mod tests { } #[test] - fn handle_transfer_scan_assignment_does_not_reaggregate_root_when_descendant_batch_is_durable() { + fn handle_transfer_scan_assignment_does_not_reaggregate_root_when_descendant_batch_is_durable() + { let root = TempDir::new().unwrap(); write_file(&root, "root/direct.bin", b"abc"); write_file(&root, "root/big/data.bin", b"12345"); @@ -7579,21 +7763,29 @@ mod tests { size: 3, }] ); - assert!(result - .full_dir_batches - .iter() - .all(|batch| batch.root_relpath != "root")); - assert!(result - .full_dir_batches - .iter() - .all(|batch| batch.root_relpath != "root/big")); - assert_eq!(child_scan_unit_roots(&result), vec!["root/small".to_string()]); + assert!( + result + .full_dir_batches + .iter() + .all(|batch| batch.root_relpath != "root") + ); + assert!( + result + .full_dir_batches + .iter() + .all(|batch| batch.root_relpath != "root/big") + ); + assert_eq!( + child_scan_unit_roots(&result), + vec!["root/small".to_string()] + ); assert_all_child_scan_units_are_subtree_streaming(&result); assert!(result.full_dir_batches.is_empty()); } #[test] - fn handle_transfer_scan_assignment_honors_cross_generation_descendant_full_dir_during_restart() { + fn handle_transfer_scan_assignment_honors_cross_generation_descendant_full_dir_during_restart() + { let root = TempDir::new().unwrap(); write_file(&root, "root/direct.bin", b"abc"); write_file(&root, "root/big/data.bin", b"12345"); @@ -7639,21 +7831,29 @@ mod tests { size: 3, }] ); - 
assert!(result - .full_dir_batches - .iter() - .all(|batch| batch.root_relpath != "root")); - assert!(result - .full_dir_batches - .iter() - .all(|batch| batch.root_relpath != "root/big")); - assert_eq!(child_scan_unit_roots(&result), vec!["root/small".to_string()]); + assert!( + result + .full_dir_batches + .iter() + .all(|batch| batch.root_relpath != "root") + ); + assert!( + result + .full_dir_batches + .iter() + .all(|batch| batch.root_relpath != "root/big") + ); + assert_eq!( + child_scan_unit_roots(&result), + vec!["root/small".to_string()] + ); assert_all_child_scan_units_are_subtree_streaming(&result); assert!(result.full_dir_batches.is_empty()); } #[test] - fn handle_transfer_scan_assignment_replays_descendant_current_layer_when_only_partial_descendant_direct_files_batch_is_durable() { + fn handle_transfer_scan_assignment_replays_descendant_current_layer_when_only_partial_descendant_direct_files_batch_is_durable() + { let root = TempDir::new().unwrap(); write_file(&root, "root/child/a.bin", b"ab"); write_file(&root, "root/child/b.bin", b"cd"); @@ -7688,10 +7888,14 @@ mod tests { assert!(result.child_scan_units.is_empty()); assert!(result.full_dir_batches.is_empty()); assert_eq!(result.direct_files_only_batches.len(), 1); - assert_eq!(result.direct_files_only_batches[0].root_relpath, "root/child"); - let manifest = - FluxonFsTransferManifestWire::decode_from_blob(&result.direct_files_only_batches[0].manifest_blob) - .unwrap(); + assert_eq!( + result.direct_files_only_batches[0].root_relpath, + "root/child" + ); + let manifest = FluxonFsTransferManifestWire::decode_from_blob( + &result.direct_files_only_batches[0].manifest_blob, + ) + .unwrap(); assert_eq!( manifest.entries, vec![ @@ -7737,7 +7941,10 @@ mod tests { assert!(result.finished); assert!(result.direct_files_only_batches.is_empty()); - assert_eq!(child_scan_unit_roots(&result), vec!["root/parent".to_string()]); + assert_eq!( + child_scan_unit_roots(&result), + vec!["root/parent".to_string()] + ); 
assert_all_child_scan_units_are_subtree_streaming(&result); assert!(result.full_dir_batches.is_empty()); } @@ -7772,9 +7979,10 @@ mod tests { assert!(result.finished); assert_eq!(result.direct_files_only_batches.len(), 1); - let manifest = - FluxonFsTransferManifestWire::decode_from_blob(&result.direct_files_only_batches[0].manifest_blob) - .unwrap(); + let manifest = FluxonFsTransferManifestWire::decode_from_blob( + &result.direct_files_only_batches[0].manifest_blob, + ) + .unwrap(); assert_eq!( manifest.entries, vec![FluxonFsTransferManifestEntryWire { @@ -7782,7 +7990,10 @@ mod tests { size: 10, }] ); - assert_eq!(child_scan_unit_roots(&result), vec!["root/child".to_string()]); + assert_eq!( + child_scan_unit_roots(&result), + vec!["root/child".to_string()] + ); assert_all_child_scan_units_are_subtree_streaming(&result); assert!(result.full_dir_batches.is_empty()); } @@ -7820,7 +8031,10 @@ mod tests { assert!(ok_bool(&resp)); assert!(result.finished); - assert_eq!(child_scan_unit_roots(&result), vec!["root/blocked".to_string()]); + assert_eq!( + child_scan_unit_roots(&result), + vec!["root/blocked".to_string()] + ); assert_all_child_scan_units_are_subtree_streaming(&result); assert!(result.full_dir_batches.is_empty()); } @@ -7870,7 +8084,11 @@ mod tests { ); assert_eq!( child_scan_unit_roots(&result), - vec!["root/a".to_string(), "root/b".to_string(), "root/c".to_string()] + vec![ + "root/a".to_string(), + "root/b".to_string(), + "root/c".to_string() + ] ); assert_all_child_scan_units_are_subtree_streaming(&result); assert!(result.full_dir_batches.is_empty()); @@ -7905,7 +8123,10 @@ mod tests { ); let result = decode_result_json(&resp); assert_eq!(result.full_dir_batches.len(), 1); - assert_eq!(result.full_dir_batches[0].batch_kind, FluxonFsTransferBatchKind::SubtreeSlice); + assert_eq!( + result.full_dir_batches[0].batch_kind, + FluxonFsTransferBatchKind::SubtreeSlice + ); assert_eq!(result.full_dir_batches[0].collect_infos.len(), 1); assert_eq!( 
decode_symlink_notice_collect_blob( @@ -7957,11 +8178,15 @@ mod tests { assert!(result.direct_files_only_batches.is_empty()); assert!(result.child_scan_units.is_empty()); assert_eq!(result.full_dir_batches.len(), 1); - assert_eq!(result.full_dir_batches[0].batch_kind, FluxonFsTransferBatchKind::SubtreeSlice); + assert_eq!( + result.full_dir_batches[0].batch_kind, + FluxonFsTransferBatchKind::SubtreeSlice + ); assert_eq!(result.full_dir_batches[0].root_relpath, "root".to_string()); - let manifest = - FluxonFsTransferManifestWire::decode_from_blob(&result.full_dir_batches[0].manifest_blob) - .unwrap(); + let manifest = FluxonFsTransferManifestWire::decode_from_blob( + &result.full_dir_batches[0].manifest_blob, + ) + .unwrap(); assert_eq!( manifest.entries, vec![ @@ -7982,7 +8207,9 @@ mod tests { FluxonFsTransferCollectInfoKind::SymlinkNotice ); let mut notices = decode_symlink_notice_collect_blob( - direct_files_only_batch.collect_infos[0].collect_blob.as_slice() + direct_files_only_batch.collect_infos[0] + .collect_blob + .as_slice(), ); notices.sort_by(|a, b| a.relpath.cmp(&b.relpath)); assert_eq!( @@ -8004,16 +8231,17 @@ mod tests { fn prepare_transfer_file_from_chunks_promotes_staged_file_to_final_path() { let root = TempDir::new().unwrap(); let dst_root = root.path().to_path_buf(); - let coordinator = test_transfer_coordinator( - || Ok(()), - { - let chunks = Arc::new(Mutex::new(vec![b"ab".to_vec(), b"cde".to_vec(), Vec::new()])); - move |_file, _read_offset, _length| { - let mut chunks = chunks.lock(); - Ok(chunks.remove(0)) - } - }, - ); + let coordinator = test_transfer_coordinator(|| Ok(()), { + let chunks = Arc::new(Mutex::new(vec![ + b"ab".to_vec(), + b"cde".to_vec(), + Vec::new(), + ])); + move |_file, _read_offset, _length| { + let mut chunks = chunks.lock(); + Ok(chunks.remove(0)) + } + }); let prepared = prepare_transfer_file_streaming( &dst_root, ".fluxon.stage/job/batch", @@ -8038,32 +8266,31 @@ mod tests { 
fs::read(root.path().join("dir/file.bin")).unwrap(), b"abcde".to_vec() ); - assert!(!root - .path() - .join(".fluxon.stage/job/batch/dir/file.bin/file.bin.fluxon.part") - .exists()); + assert!( + !root + .path() + .join(".fluxon.stage/job/batch/dir/file.bin/file.bin.fluxon.part") + .exists() + ); } #[test] fn prepare_transfer_file_from_chunks_truncates_existing_staging_file() { let root = TempDir::new().unwrap(); let dst_root = root.path().to_path_buf(); - let stale_staging = - root.path() - .join(".fluxon.stage/job/batch/dir/file.bin/file.bin.fluxon.part"); + let stale_staging = root + .path() + .join(".fluxon.stage/job/batch/dir/file.bin/file.bin.fluxon.part"); fs::create_dir_all(stale_staging.parent().unwrap()).unwrap(); fs::write(&stale_staging, b"stale-data").unwrap(); - let coordinator = test_transfer_coordinator( - || Ok(()), - { - let chunks = Arc::new(Mutex::new(vec![b"xy".to_vec(), Vec::new()])); - move |_file, _read_offset, _length| { - let mut chunks = chunks.lock(); - Ok(chunks.remove(0)) - } - }, - ); + let coordinator = test_transfer_coordinator(|| Ok(()), { + let chunks = Arc::new(Mutex::new(vec![b"xy".to_vec(), Vec::new()])); + move |_file, _read_offset, _length| { + let mut chunks = chunks.lock(); + Ok(chunks.remove(0)) + } + }); let prepared = prepare_transfer_file_streaming( &dst_root, ".fluxon.stage/job/batch", @@ -8076,23 +8303,23 @@ mod tests { .unwrap(); promote_prepared_transfer_file(&dst_root, prepared).unwrap(); - assert_eq!(fs::read(root.path().join("dir/file.bin")).unwrap(), b"xy".to_vec()); + assert_eq!( + fs::read(root.path().join("dir/file.bin")).unwrap(), + b"xy".to_vec() + ); } #[test] fn prepare_transfer_file_from_chunks_rejects_size_mismatch_and_keeps_staging_file() { let root = TempDir::new().unwrap(); let dst_root = root.path().to_path_buf(); - let coordinator = test_transfer_coordinator( - || Ok(()), - { - let chunks = Arc::new(Mutex::new(vec![b"xy".to_vec(), Vec::new()])); - move |_file, _read_offset, _length| { - let mut 
chunks = chunks.lock(); - Ok(chunks.remove(0)) - } - }, - ); + let coordinator = test_transfer_coordinator(|| Ok(()), { + let chunks = Arc::new(Mutex::new(vec![b"xy".to_vec(), Vec::new()])); + move |_file, _read_offset, _length| { + let mut chunks = chunks.lock(); + Ok(chunks.remove(0)) + } + }); let err = prepare_transfer_file_streaming( &dst_root, ".fluxon.stage/job/batch", @@ -8275,15 +8502,10 @@ mod tests { let file_bytes = b"hello".to_vec(); let assignment = test_worker_assignment("dir/file.bin", file_bytes.len() as i64); - let result = execute_transfer_worker_assignment( - &assignment, - &dst_root, - || Ok(()), - { - let file_bytes = file_bytes.clone(); - move |_file, _read_offset, _length| Ok(file_bytes.clone()) - }, - ) + let result = execute_transfer_worker_assignment(&assignment, &dst_root, || Ok(()), { + let file_bytes = file_bytes.clone(); + move |_file, _read_offset, _length| Ok(file_bytes.clone()) + }) .unwrap(); assert_eq!(result.file_results.len(), 1); @@ -8303,7 +8525,10 @@ mod tests { create_dir_all_with_parent_dir_chmod_retry(&target).unwrap(); assert!(target.is_dir()); - assert_eq!(fs::metadata(&locked_parent).unwrap().permissions().mode() & 0o777, 0o777); + assert_eq!( + fs::metadata(&locked_parent).unwrap().permissions().mode() & 0o777, + 0o777 + ); } #[cfg(unix)] @@ -8318,21 +8543,19 @@ mod tests { let file_bytes = b"hello".to_vec(); let assignment = test_worker_assignment("dir/file.bin", file_bytes.len() as i64); - let result = execute_transfer_worker_assignment( - &assignment, - &dst_root, - || Ok(()), - { - let file_bytes = file_bytes.clone(); - move |_file, _read_offset, _length| Ok(file_bytes.clone()) - }, - ) + let result = execute_transfer_worker_assignment(&assignment, &dst_root, || Ok(()), { + let file_bytes = file_bytes.clone(); + move |_file, _read_offset, _length| Ok(file_bytes.clone()) + }) .unwrap(); assert_eq!(result.file_results.len(), 1); assert!(dst_root.is_dir()); assert_eq!(fs::read(dst_root.join("dir/file.bin")).unwrap(), 
file_bytes); - assert_eq!(fs::metadata(&locked_parent).unwrap().permissions().mode() & 0o777, 0o777); + assert_eq!( + fs::metadata(&locked_parent).unwrap().permissions().mode() & 0o777, + 0o777 + ); } #[test] @@ -8349,47 +8572,48 @@ mod tests { let assignment = assignment.clone(); let heartbeat_attempts = heartbeat_attempts.clone(); move || { - retry_transfer_worker_rpc_with_backoff( - &assignment, - "checkpoint", - "test-checkpoint", - BackoffConfig { - initial_secs: 0, - max_secs: 0, - }, - WarnConfig { - warn_interval_secs: 0, - }, - || { - let attempt = - heartbeat_attempts.fetch_add(1, Ordering::SeqCst) + 1; - if attempt < 3 { - return Err(TransferWorkerRpcFailure::Retryable { - detail: format!( - "transient heartbeat failure attempt={}", - attempt - ), - }); - } - Ok(()) - }, - ) - .map_err(TransferWorkerExecutionError::fatal) - } + retry_transfer_worker_rpc_with_backoff( + &assignment, + "checkpoint", + "test-checkpoint", + BackoffConfig { + initial_secs: 0, + max_secs: 0, + }, + WarnConfig { + warn_interval_secs: 0, + }, + || { + let attempt = heartbeat_attempts.fetch_add(1, Ordering::SeqCst) + 1; + if attempt < 3 { + return Err(TransferWorkerRpcFailure::Retryable { + detail: format!( + "transient heartbeat failure attempt={}", + attempt + ), + }); + } + Ok(()) + }, + ) + .map_err(TransferWorkerExecutionError::fatal) + } }, { let file_bytes = file_bytes.clone(); move |file, read_offset, _length| { - if file.relpath != "dir/file.bin" { - return Err(TransferWorkerExecutionError::fatal(resp_err_kverr(KvError::Api(ApiError::InvalidArgument { - detail: format!("unexpected file relpath: {}", file.relpath), - })))); - } - if read_offset == 0 { - return Ok(file_bytes.clone()); + if file.relpath != "dir/file.bin" { + return Err(TransferWorkerExecutionError::fatal(resp_err_kverr( + KvError::Api(ApiError::InvalidArgument { + detail: format!("unexpected file relpath: {}", file.relpath), + }), + ))); + } + if read_offset == 0 { + return Ok(file_bytes.clone()); + } + 
Ok(Vec::new()) } - Ok(Vec::new()) - } }, ) .unwrap(); @@ -8409,24 +8633,20 @@ mod tests { let file_bytes = b"payload".to_vec(); let assignment = test_worker_assignment("dir/file.bin", file_bytes.len() as i64); let read_attempts = Arc::new(AtomicUsize::new(0)); - let result = execute_transfer_worker_assignment( - &assignment, - &dst_root, - || Ok(()), - { - let assignment = assignment.clone(); - let file_bytes = file_bytes.clone(); - let read_attempts = read_attempts.clone(); - move |file, read_offset, _length| { + let result = execute_transfer_worker_assignment(&assignment, &dst_root, || Ok(()), { + let assignment = assignment.clone(); + let file_bytes = file_bytes.clone(); + let read_attempts = read_attempts.clone(); + move |file, read_offset, _length| { if file.relpath != "dir/file.bin" { - return Err(TransferWorkerExecutionError::fatal(resp_err_kverr(KvError::Api(ApiError::InvalidArgument { - detail: format!("unexpected file relpath: {}", file.relpath), - })))); + return Err(TransferWorkerExecutionError::fatal(resp_err_kverr( + KvError::Api(ApiError::InvalidArgument { + detail: format!("unexpected file relpath: {}", file.relpath), + }), + ))); } - let op_detail = format!( - "test-read relpath={} offset={}", - file.relpath, read_offset - ); + let op_detail = + format!("test-read relpath={} offset={}", file.relpath, read_offset); retry_transfer_worker_rpc_with_backoff( &assignment, "read_chunk", @@ -8443,10 +8663,7 @@ mod tests { let attempt = read_attempts.fetch_add(1, Ordering::SeqCst) + 1; if attempt < 3 { return Err(TransferWorkerRpcFailure::Retryable { - detail: format!( - "transient read failure attempt={}", - attempt - ), + detail: format!("transient read failure attempt={}", attempt), }); } return Ok(file_bytes.clone()); @@ -8456,8 +8673,7 @@ mod tests { ) .map_err(TransferWorkerExecutionError::fatal) } - }, - ) + }) .unwrap(); assert_eq!(read_attempts.load(Ordering::SeqCst), 3); @@ -8481,23 +8697,23 @@ mod tests { { let checkpoint_calls = 
checkpoint_calls.clone(); move || { - let calls = checkpoint_calls.fetch_add(1, Ordering::SeqCst) + 1; - if calls >= 4 { - return Err(TransferWorkerExecutionError::Stop( - FluxonFsTransferWorkerStopReasonWire::Superseded, - )); + let calls = checkpoint_calls.fetch_add(1, Ordering::SeqCst) + 1; + if calls >= 4 { + return Err(TransferWorkerExecutionError::Stop( + FluxonFsTransferWorkerStopReasonWire::Superseded, + )); + } + Ok(()) } - Ok(()) - } }, { let file_bytes = file_bytes.clone(); move |_file, read_offset, _length| { - if read_offset == 0 { - return Ok(file_bytes.clone()); + if read_offset == 0 { + return Ok(file_bytes.clone()); + } + Ok(Vec::new()) } - Ok(Vec::new()) - } }, ); assert!(matches!( @@ -8567,12 +8783,7 @@ mod tests { break; } if max_in_flight - .compare_exchange( - observed, - current, - Ordering::SeqCst, - Ordering::SeqCst, - ) + .compare_exchange(observed, current, Ordering::SeqCst, Ordering::SeqCst) .is_ok() { break; @@ -8588,8 +8799,14 @@ mod tests { assert_eq!(result.file_results.len(), 2); assert!(max_in_flight.load(Ordering::SeqCst) >= 2); - assert_eq!(fs::read(root.path().join("dir/a.bin")).unwrap(), b"xxx".to_vec()); - assert_eq!(fs::read(root.path().join("dir/b.bin")).unwrap(), b"xxx".to_vec()); + assert_eq!( + fs::read(root.path().join("dir/a.bin")).unwrap(), + b"xxx".to_vec() + ); + assert_eq!( + fs::read(root.path().join("dir/b.bin")).unwrap(), + b"xxx".to_vec() + ); } #[test] @@ -8597,29 +8814,25 @@ mod tests { let root = TempDir::new().unwrap(); let dst_root = root.path().to_path_buf(); let file_bytes = b"hello".to_vec(); - let collect_infos = build_symlink_collect_infos(vec![FluxonFsTransferSymlinkNoticeEntryWire { - relpath: "dir/link.bin".to_string(), - link_target: "dir/file.bin".to_string(), - }]) - .unwrap(); + let collect_infos = + build_symlink_collect_infos(vec![FluxonFsTransferSymlinkNoticeEntryWire { + relpath: "dir/link.bin".to_string(), + link_target: "dir/file.bin".to_string(), + }]) + .unwrap(); let assignment = 
FluxonFsTransferWorkerAssignmentWire { collect_infos: collect_infos.clone(), ..test_worker_assignment("dir/file.bin", file_bytes.len() as i64) }; - let result = execute_transfer_worker_assignment( - &assignment, - &dst_root, - || Ok(()), - { - let file_bytes = file_bytes.clone(); - move |_file, read_offset, _length| { - if read_offset == 0 { - return Ok(file_bytes.clone()); - } - Ok(Vec::new()) + let result = execute_transfer_worker_assignment(&assignment, &dst_root, || Ok(()), { + let file_bytes = file_bytes.clone(); + move |_file, read_offset, _length| { + if read_offset == 0 { + return Ok(file_bytes.clone()); } - }, - ) + Ok(Vec::new()) + } + }) .unwrap(); assert_eq!(result.file_results.len(), 1); @@ -8633,7 +8846,11 @@ mod tests { "fluxon_collect_info/batches/batch/symlinks.jsonl" ); assert_eq!( - fs::read(root.path().join("fluxon_collect_info/batches/batch/symlinks.jsonl")).unwrap(), + fs::read( + root.path() + .join("fluxon_collect_info/batches/batch/symlinks.jsonl") + ) + .unwrap(), collect_infos[0].collect_blob ); } @@ -8643,16 +8860,19 @@ mod tests { let root = TempDir::new().unwrap(); let dst_root = root.path().to_path_buf(); let assignment = FluxonFsTransferWorkerAssignmentWire { - manifest_blob: build_transfer_manifest_blob(vec![ - FluxonFsTransferScanFrontierEntry { - relpath: "dir/good.bin".to_string(), - size: 5, - }, - FluxonFsTransferScanFrontierEntry { - relpath: "dir/bad.bin".to_string(), - size: 5, - }, - ], Vec::new()) + manifest_blob: build_transfer_manifest_blob( + vec![ + FluxonFsTransferScanFrontierEntry { + relpath: "dir/good.bin".to_string(), + size: 5, + }, + FluxonFsTransferScanFrontierEntry { + relpath: "dir/bad.bin".to_string(), + size: 5, + }, + ], + Vec::new(), + ) .unwrap(), ..test_worker_assignment("dir/good.bin", 5) }; @@ -8816,20 +9036,16 @@ mod tests { .unwrap(); let progress_heartbeat_count = Arc::new(AtomicUsize::new(0)); - gate.ensure_continue( - false, - TRANSFER_WORKER_HEARTBEAT_EMPTY_DIR_PROGRESS_COUNT, - { - let 
progress_heartbeat_count = progress_heartbeat_count.clone(); - move |_heartbeat_unix_ms, heartbeat_detail| { - assert_eq!(heartbeat_detail, "empty_dir_progress"); - progress_heartbeat_count.fetch_add(1, Ordering::SeqCst); - Ok(FluxonFsTransferWorkerHeartbeatResultWire::continue_running( - chrono::Utc::now().timestamp_millis() + 60_000, - )) - } - }, - ) + gate.ensure_continue(false, TRANSFER_WORKER_HEARTBEAT_EMPTY_DIR_PROGRESS_COUNT, { + let progress_heartbeat_count = progress_heartbeat_count.clone(); + move |_heartbeat_unix_ms, heartbeat_detail| { + assert_eq!(heartbeat_detail, "empty_dir_progress"); + progress_heartbeat_count.fetch_add(1, Ordering::SeqCst); + Ok(FluxonFsTransferWorkerHeartbeatResultWire::continue_running( + chrono::Utc::now().timestamp_millis() + 60_000, + )) + } + }) .unwrap(); gate.ensure_continue( @@ -8927,20 +9143,15 @@ mod tests { let dst_root = root.path().to_path_buf(); let file_bytes = b"hello".to_vec(); let assignment = test_worker_assignment("dir/file.bin", file_bytes.len() as i64); - let result = execute_transfer_worker_assignment( - &assignment, - &dst_root, - || Ok(()), - { - let file_bytes = file_bytes.clone(); - move |_file, read_offset, _length| { - if read_offset == 0 { - return Ok(file_bytes.clone()); - } - Ok(Vec::new()) + let result = execute_transfer_worker_assignment(&assignment, &dst_root, || Ok(()), { + let file_bytes = file_bytes.clone(); + move |_file, read_offset, _length| { + if read_offset == 0 { + return Ok(file_bytes.clone()); } - }, - ) + Ok(Vec::new()) + } + }) .unwrap(); assert_eq!(result.file_results.len(), 1); @@ -8948,7 +9159,10 @@ mod tests { cleanup_transfer_worker_attempt_artifacts(&dst_root, &assignment).unwrap(); - assert_eq!(fs::read(root.path().join("dir/file.bin")).unwrap(), file_bytes); + assert_eq!( + fs::read(root.path().join("dir/file.bin")).unwrap(), + file_bytes + ); assert!(!root.path().join(".fluxon.stage").exists()); } @@ -8957,11 +9171,12 @@ mod tests { let root = TempDir::new().unwrap(); let 
dst_root = root.path().to_path_buf(); let file_bytes = b"hello".to_vec(); - let collect_infos = build_symlink_collect_infos(vec![FluxonFsTransferSymlinkNoticeEntryWire { - relpath: "root/link-file.bin".to_string(), - link_target: "target/file.bin".to_string(), - }]) - .unwrap(); + let collect_infos = + build_symlink_collect_infos(vec![FluxonFsTransferSymlinkNoticeEntryWire { + relpath: "root/link-file.bin".to_string(), + link_target: "target/file.bin".to_string(), + }]) + .unwrap(); let assignment = FluxonFsTransferWorkerAssignmentWire { collect_infos: collect_infos.clone(), ..test_worker_assignment("dir/file.bin", file_bytes.len() as i64) @@ -8976,9 +9191,11 @@ mod tests { let result = execute_transfer_worker_assignment( &assignment, &dst_root, - || Err(TransferWorkerExecutionError::Stop( - FluxonFsTransferWorkerStopReasonWire::Superseded, - )), + || { + Err(TransferWorkerExecutionError::Stop( + FluxonFsTransferWorkerStopReasonWire::Superseded, + )) + }, { let file_bytes = file_bytes.clone(); move |_file, read_offset, _length| { @@ -8996,11 +9213,20 @@ mod tests { FluxonFsTransferWorkerStopReasonWire::Superseded )) )); - assert!(root.path().join(prepared_collect.staging_relpath.as_str()).exists()); + assert!( + root.path() + .join(prepared_collect.staging_relpath.as_str()) + .exists() + ); cleanup_transfer_worker_attempt_artifacts(&dst_root, &assignment).unwrap(); assert!(!root.path().join(".fluxon.stage").exists()); - assert!(!root.path().join(prepared_collect.staging_relpath.as_str()).exists()); + assert!( + !root + .path() + .join(prepared_collect.staging_relpath.as_str()) + .exists() + ); } } diff --git a/fluxon_rs/fluxon_fs/src/cache_controller.rs b/fluxon_rs/fluxon_fs/src/cache_controller.rs index 8a0845c..13ce5a8 100644 --- a/fluxon_rs/fluxon_fs/src/cache_controller.rs +++ b/fluxon_rs/fluxon_fs/src/cache_controller.rs @@ -429,8 +429,8 @@ fn now_ms() -> i64 { #[cfg(test)] mod tests { use super::*; - use std::sync::mpsc; use std::sync::atomic::{AtomicUsize, 
Ordering as AtomicOrdering};
+    use std::sync::mpsc;
     use std::sync::{Condvar, Mutex};
     use tokio::time::{Duration, sleep};
 
diff --git a/fluxon_rs/fluxon_fs_s3_gateway/src/lib.rs b/fluxon_rs/fluxon_fs_s3_gateway/src/lib.rs
index 827bb23..0866432 100644
--- a/fluxon_rs/fluxon_fs_s3_gateway/src/lib.rs
+++ b/fluxon_rs/fluxon_fs_s3_gateway/src/lib.rs
@@ -5344,10 +5344,9 @@ mod tests {
     };
     use crate::transfer::encode_transfer_manifest_blob_with_empty_dirs;
     use fluxon_fs_core::config::{
-        FS_CACHE_DEFAULT_WRITE_SESSION_TARGET_INFLIGHT_BYTES_V1,
-        FS_EXPORT_DEFAULT_INLINE_BYTES_MAX_BYTES_V1,
-        FS_EXPORT_DEFAULT_METADATA_CACHE_TTL_MS_V1,
         FLUXON_FS_LOCAL_TRANSFER_CHECK_DST_EXPORT, FLUXON_FS_LOCAL_TRANSFER_CHECK_SRC_EXPORT,
+        FS_CACHE_DEFAULT_WRITE_SESSION_TARGET_INFLIGHT_BYTES_V1,
+        FS_EXPORT_DEFAULT_INLINE_BYTES_MAX_BYTES_V1, FS_EXPORT_DEFAULT_METADATA_CACHE_TTL_MS_V1,
         FluxonFsAccessModel, FluxonFsAccessUser, FluxonFsExport, FluxonFsExportRoutingMode,
         FluxonFsGlobalConfig, FluxonFsLocalTransferCheckJobSpecWire, FluxonFsRequestIdentity,
         FluxonFsS3GatewayConfig, FluxonFsS3KvMissPolicy, FluxonFsS3PermissionAccount,
diff --git a/fluxon_rs/fluxon_kv/Cargo.toml b/fluxon_rs/fluxon_kv/Cargo.toml
index 22ff136..fe7c669 100644
--- a/fluxon_rs/fluxon_kv/Cargo.toml
+++ b/fluxon_rs/fluxon_kv/Cargo.toml
@@ -95,6 +95,7 @@ limit_thirdparty = { path = "../limit_thirdparty" }
 fluxon_cli = { path = "../fluxon_cli" }
 fluxon_util = { path = "../fluxon_util" }
 fluxon_observability = { path = "../fluxon_observability" }
+fluxon_mq = { path = "../fluxon_mq" }
 [build-dependencies]
 tonic-build = { workspace = true }
 fluxon_util = { path = "../fluxon_util" }
diff --git a/fluxon_rs/fluxon_kv/framework_init_steps.yaml b/fluxon_rs/fluxon_kv/framework_init_steps.yaml
index 923ae30..c90cd28 100644
--- a/fluxon_rs/fluxon_kv/framework_init_steps.yaml
+++ b/fluxon_rs/fluxon_kv/framework_init_steps.yaml
@@ -4,6 +4,8 @@ title: fluxon_kv init
 variants:
   - id: master
     tags: [master]
+  - id: broker
+    tags: [broker,
external]
   - id: owner
     tags: [owner]
   - id: external
@@ -20,8 +22,8 @@ variants:
 # - A step depends on a resource by declaring `deps: ["res:"]`.
 resources:
   - id: cluster_member_watch_ready
-    tags: [master, owner, external]
-    publish_tags: [master, owner, external]
+    tags: [master, broker, owner, external]
+    publish_tags: [master, broker, owner, external]
     published_by: ClusterManager.step.1.init2
     doc: |
       - ClusterManager: member watch is established and continuous observation is available
@@ -56,8 +58,8 @@ resources:
 # `Framework.step.0.attach_views`.
 
 module_tags:
-  ClusterManager: [master, owner, external]
-  P2pModule: [master, owner, external]
+  ClusterManager: [master, broker, owner, external]
+  P2pModule: [master, broker, owner, external]
   MasterSegManager: [master]
   MasterKvRouter: [master]
   MetricReporter: [master, owner, external]
diff --git a/fluxon_rs/fluxon_kv/src/client_seg_pool/mod.rs b/fluxon_rs/fluxon_kv/src/client_seg_pool/mod.rs
index 1aa6954..8c7cc78 100644
--- a/fluxon_rs/fluxon_kv/src/client_seg_pool/mod.rs
+++ b/fluxon_rs/fluxon_kv/src/client_seg_pool/mod.rs
@@ -237,10 +237,7 @@ impl ClientSegPool {
         std::path::Path::new(share_mem_path).join(SIDE_TRANSFER_PEERS_DIRNAME)
     }
 
-    pub fn side_transfer_peer_file_path(
-        share_mem_path: &str,
-        side_id: &str,
-    ) -> std::path::PathBuf {
+    pub fn side_transfer_peer_file_path(share_mem_path: &str, side_id: &str) -> std::path::PathBuf {
         Self::side_transfer_peers_dir(share_mem_path).join(format!("{side_id}.json"))
     }
 
@@ -399,17 +396,13 @@ impl ClientSegPool {
                 crate::rpcresp_kvresult_convert::msg_and_error::SharedMemError::MappingFailed {
                     path: String::new(),
                     len: map_len as u64,
-                    detail: "share_mem_path is empty; explicit configuration required"
-                        .to_string(),
+                    detail: "share_mem_path is empty; explicit configuration required".to_string(),
                 },
             ));
         }
 
         let base_path = &share_mem_path;
-        tracing::info!(
-            "Using share_mem_path: {} for memory-mapped file",
-            base_path
-        );
+        tracing::info!("Using share_mem_path: {} for
memory-mapped file", base_path); std::fs::create_dir_all(base_path).map_err(|e| { KvError::SharedMem( crate::rpcresp_kvresult_convert::msg_and_error::SharedMemError::MappingFailed { diff --git a/fluxon_rs/fluxon_kv/src/config.rs b/fluxon_rs/fluxon_kv/src/config.rs index f9c7691..1577651 100644 --- a/fluxon_rs/fluxon_kv/src/config.rs +++ b/fluxon_rs/fluxon_kv/src/config.rs @@ -733,7 +733,7 @@ pub struct ClientConfig { pub pprof_duration_seconds: Option, pub redis_compat_listen_addr: Option, pub fluxonkv_spec: FluxonKvSpec, - pub share_mem_path: String, // Mandatory shared bundle path + pub share_mem_path: String, // Mandatory shared bundle path pub large_file_paths: LargeFilePaths, // Mandatory large-file roots for logs and caches pub test_spec_config: TestSpecConfig, } @@ -1170,13 +1170,15 @@ impl ClientConfigYaml { } else { let Some(large_file_paths_yaml) = self.fluxonkv_spec.large_file_paths.as_ref() else { return Err(ConfigError::InvalidClientConfig { - detail: "fluxonkv_spec.large_file_paths is required for owner mode" - .to_string(), + detail: "fluxonkv_spec.large_file_paths is required for owner mode".to_string(), } .into_kverror()); }; LargeFilePaths { - paths: verify_non_empty_root_path_list(&large_file_paths_yaml.0, "large_file_paths")?, + paths: verify_non_empty_root_path_list( + &large_file_paths_yaml.0, + "large_file_paths", + )?, } }; @@ -1647,7 +1649,9 @@ fluxonkv_spec: .unwrap(); let err = cfg.verify().unwrap_err(); let text = format!("{err}"); - assert!(text.contains("fluxonkv_spec.large_file_paths is forbidden in zero-contribution mode")); + assert!( + text.contains("fluxonkv_spec.large_file_paths is forbidden in zero-contribution mode") + ); } #[test] @@ -1667,7 +1671,9 @@ fluxonkv_spec: let logs_dir = large_file_paths.kv_logs_dir("test_cluster").unwrap(); assert_eq!( logs_dir, - first_root.join("child").join("test_cluster_cluster_kv_logs") + first_root + .join("child") + .join("test_cluster_cluster_kv_logs") ); assert!(logs_dir.exists()); diff 
--git a/fluxon_rs/fluxon_kv/src/external_client_api/mod.rs b/fluxon_rs/fluxon_kv/src/external_client_api/mod.rs index 9cb291f..b7715dd 100644 --- a/fluxon_rs/fluxon_kv/src/external_client_api/mod.rs +++ b/fluxon_rs/fluxon_kv/src/external_client_api/mod.rs @@ -865,8 +865,7 @@ impl ExternalInner { return Ok(false); } - self.finish_owner_recover(&share_mem_path, payload) - .await?; + self.finish_owner_recover(&share_mem_path, payload).await?; Ok(true) } diff --git a/fluxon_rs/fluxon_kv/src/kv_test.rs b/fluxon_rs/fluxon_kv/src/kv_test.rs index 5f0a9e2..910aac8 100644 --- a/fluxon_rs/fluxon_kv/src/kv_test.rs +++ b/fluxon_rs/fluxon_kv/src/kv_test.rs @@ -11,8 +11,9 @@ use crate::cluster_manager::ClusterManagerRdmaControlInit; use crate::config::{ - ClientConfig, ContributeToClusterPoolSize, FluxonKvSpec, LargeFilePaths, MasterConfig, MonitoringConfig, - ProtocolConfig, ProtocolType, TestSpecConfig, TestSpecTransportMode, TransferEngineType, + ClientConfig, ContributeToClusterPoolSize, FluxonKvSpec, LargeFilePaths, MasterConfig, + MonitoringConfig, ProtocolConfig, ProtocolType, TestSpecConfig, TestSpecTransportMode, + TransferEngineType, }; use crate::run_master_with_test_overrides; use crate::{ClientRunTestOverrides, MasterRunTestOverrides, run_client_with_test_overrides}; @@ -802,7 +803,6 @@ impl KvTestRoundOptions { kv_test_run_scope() ) } - } #[derive(Clone, Debug)] @@ -842,8 +842,7 @@ fn default_client_large_file_paths( instance_key: &str, contribute_to_cluster_pool_size: &ContributeToClusterPoolSize, ) -> LargeFilePaths { - if contribute_to_cluster_pool_size.dram == 0 - && contribute_to_cluster_pool_size.vram.is_empty() + if contribute_to_cluster_pool_size.dram == 0 && contribute_to_cluster_pool_size.vram.is_empty() { return LargeFilePaths { paths: Vec::new() }; } @@ -1381,7 +1380,10 @@ async fn key_meta_cache_check( } } - tracing::info!("🔍 Starting PUT and GET in parallel: {}", parallel_unique_key); + tracing::info!( + "🔍 Starting PUT and GET in parallel: {}", + 
parallel_unique_key + ); for i in 0..10 { let (put_client, other_client) = if i % 2 == 0 { (client, client2) @@ -1420,7 +1422,9 @@ async fn key_meta_cache_check( } assert!( - put_client.client_kv_api().has_cached_key(parallel_unique_key), + put_client + .client_kv_api() + .has_cached_key(parallel_unique_key), "put client should have immediate local cache metadata for key {} after put time {}", parallel_unique_key, i diff --git a/fluxon_rs/fluxon_kv/src/lib.rs b/fluxon_rs/fluxon_kv/src/lib.rs index edaa386..a7fd905 100644 --- a/fluxon_rs/fluxon_kv/src/lib.rs +++ b/fluxon_rs/fluxon_kv/src/lib.rs @@ -86,6 +86,10 @@ use external_client_api::{ExternalClientApi, ExternalClientApiNewArg}; use fluxon_commu::TransferBackendActivationMode; use fluxon_framework::LogicalModule; use fluxon_framework::{AnyResult, define_framework}; +use fluxon_mq::{ + FLUXON_MQ_COMPONENT_BROKER_METADATA_VALUE, FLUXON_MQ_COMPONENT_METADATA_KEY, + register_broker_service, +}; use master_kv_router::{MasterKvRouter, MasterKvRouterNewArg}; use master_seg_manager::MasterSegManager; use metric_reporter::{ @@ -194,6 +198,11 @@ pub(crate) struct MasterRunTestOverrides { pub transfer_backend_activation_mode: Option, } +#[derive(Clone, Debug)] +pub(crate) struct BrokerRunTestOverrides { + pub rdma_control_init: ClusterManagerRdmaControlInit, +} + /// Result of a unified `get` that carries the role-specific holder types. 
#[derive(Clone)] pub enum KvGetResult { @@ -460,6 +469,12 @@ enum Commands { #[arg(short = 'f', long = "config")] config: Option, }, + /// Run as broker node + Broker { + /// Configuration file path + #[arg(short = 'f', long = "config")] + config: Option, + }, /// Run as client node Client { /// Configuration file path @@ -1336,6 +1351,15 @@ pub async fn entry() -> Result<()> { .await .map_err(|e| anyhow::anyhow!("{}", e))?; } + Commands::Broker { config } => { + let config_arg = config.map_or(ConfigArg::None, ConfigArg::File); + let (framework, _) = run_broker(config_arg).await?; + framework.wait_shutdown_signal().await; + framework + .shutdown() + .await + .map_err(|e| anyhow::anyhow!("{}", e))?; + } Commands::Client { config } => { let config_arg = config.map_or(ConfigArg::None, ConfigArg::File); let (framework, _) = run_client(config_arg).await?; @@ -1548,6 +1572,205 @@ pub async fn run_master( run_master_impl(config_arg, None).await } +async fn run_broker_impl( + config_arg: ConfigArg, + test_overrides: Option, +) -> Result<(Arc, ClientConfig)> { + #[cfg(unix)] + segfault_handler::install_sigsegv_classifier(); + + println!("Starting cache backend in BROKER mode"); + + let build_version = fluxon_util::git_version_build_record::get_current_git_commitid().unwrap(); + let source_sha256 = fluxon_util::build_info::SOURCE_SHA256; + println!("Build version (git commit): {}", build_version); + println!("Build version (source-sha256): {}", source_sha256); + + let config = load_client_config(config_arg) + .await + .map_err(|e| anyhow::anyhow!("Failed to load broker config: {}", e))?; + + let dram = config.contribute_to_cluster_pool_size.dram; + let vram_is_zero = config + .contribute_to_cluster_pool_size + .vram + .values() + .all(|&v| v == 0); + if dram != 0 || !vram_is_zero { + anyhow::bail!( + "broker config must be a zero-contribution external-client config; instance_key={}", + config.instance_key + ); + } + if matches!( + config.test_spec_config.side_transfer_role, 
+ Some(SideTransferRole::Worker) + ) { + anyhow::bail!( + "broker config must not set test_spec_config.side_transfer_role=worker; instance_key={}", + config.instance_key + ); + } + + unsafe { + std::env::set_var( + "FLUXON_ENABLE_ICEORYX_LOGS", + if config.test_spec_config.enable_iceoryx_logs { + "1" + } else { + "0" + }, + ); + } + + let config = bootstrap_zero_contribution_client_config(config).await?; + + let kv_logs_dir = config + .large_file_paths + .kv_logs_dir(&config.cluster_name) + .map_err(|e| anyhow::anyhow!("invalid large_file_paths for broker kv logs: {}", e))?; + let observability_disabled = config.test_spec_config.disable_observability; + let greptime_tracing_rx = if observability_disabled { + fluxon_util::init_log(&kv_logs_dir, &config.instance_key); + None + } else { + let (greptime_tracing_layer, greptime_tracing_rx) = + fluxon_observability::greptime_otlp_tracing::new_tracing_layer( + crate::config::DEFAULT_OTLP_LOG_MAX_QUEUE_LINES, + ); + fluxon_util::init_log_with_extra_layer( + &kv_logs_dir, + &config.instance_key, + greptime_tracing_layer, + ); + Some(greptime_tracing_rx) + }; + info!("Broker config: {:?}", config); + info!("Build version (git commit): {}", build_version); + info!("Build version (source-sha256): {}", source_sha256); + + let mut metadata = HashMap::from([ + ("external_client".to_string(), "true".to_string()), + ( + FLUXON_MQ_COMPONENT_METADATA_KEY.to_string(), + FLUXON_MQ_COMPONENT_BROKER_METADATA_VALUE.to_string(), + ), + ("version".to_string(), build_version.clone()), + ]); + merge_startup_member_metadata(&mut metadata, HashMap::new())?; + + let rdma_control_init = test_overrides + .as_ref() + .map(|overrides| overrides.rdma_control_init.clone()) + .or_else(|| test_spec_config_rdma_control_init(Some(&config.test_spec_config))) + .unwrap_or_else(|| cluster_manager_rdma_control_init_from_config(&config)); + + let init_args = InitArgsBroker { + cluster_manager_arg: ClusterManagerNewArg { + etcd_endpoints: 
config.fluxonkv_spec.etcd_addresses.clone(), + cluster_name: config.cluster_name.clone(), + instance_name: Some(config.instance_key.clone()), + port: None, + metadata, + local_ipc_root: cluster_manager_local_ipc_root( + &config.share_mem_path, + &config.test_spec_config, + ), + rdma_control_init, + sub_cluster: config.fluxonkv_spec.sub_cluster.clone(), + network: None, + }, + p2p_arg: P2pModuleNewArg::new( + config.fluxonkv_spec.p2p_listen_port, + tcp_thread_transport_tuning_from_test_spec_config(&config.test_spec_config), + config.test_spec_config.disable_crossowner_ipc, + config.test_spec_config.iceoryx_external_busy_poll, + ) + .with_iceoryx_owner_client_busy_poll(config.test_spec_config.iceoryx_owner_client_busy_poll) + .with_user_rpc_sync_handler_thread_count( + config.test_spec_config.user_rpc_sync_handler_thread_count, + ), + metric_reporter_arg: MetricReporterNewArg { + test_spec_config: config.test_spec_config.clone(), + }, + external_client_api_arg: ExternalClientApiNewArg { + share_mem_path: config.share_mem_path.clone(), + large_file_paths: config.large_file_paths.clone(), + expected_cluster_name: config.cluster_name.clone(), + expected_protocol_version: build_version.clone(), + enable_side_transfer: config.test_spec_config.enable_side_transfer, + short_circuit_put_payload_path: config.test_spec_config.short_circuit_put_payload_path, + }, + }; + + let framework = Framework::new(format!( + "fluxon_kv.broker:{}:{}", + config.cluster_name, config.instance_key + )); + info!("Initializing broker framework..."); + + init_framework_broker(&framework, init_args) + .await + .map_err(|e| anyhow::anyhow!("Failed to initialize broker framework: {:#}", e))?; + register_broker_service(framework.p2p_view().clone(), 4096); + + let framework = Arc::new(framework); + + if !observability_disabled { + let otlp_cluster_name = config.cluster_name.clone(); + let otlp_member_id = config.instance_key.clone(); + let cm_view = framework.cluster_manager_view().clone(); + let 
p2p_view = framework.p2p_view().clone(); + let spawner = cm_view.clone(); + let _ = spawner.spawn("wait_master_otlp_log_api_broker", async move { + let outcome = wait_master_observe_broadcast( + &cm_view, + std::time::Duration::from_secs(60), + std::time::Duration::from_secs(10), + ) + .await; + let Some(cfg) = outcome.otlp_log_api() else { + warn!( + "Broker OTLP log exporter disabled: master metadata does not carry otlp_log_api" + ); + return; + }; + + start_greptime_otlp_tracing_exporter_kv( + cm_view, + p2p_view, + Some(cfg), + greptime_tracing_rx, + &otlp_cluster_name, + fluxon_observability::types::FluxonMemberRole::Broker, + &otlp_member_id, + ); + }); + } + + let shutdown_waiter = framework.cluster_manager_view().register_shutdown_waiter(); + let kv_profiles_dir = config + .large_file_paths + .kv_profiles_dir(&config.cluster_name) + .map_err(|e| anyhow::anyhow!("invalid large_file_paths for broker kv profiles: {}", e))?; + profile::spawn_pprof_flamegraph_on_timeout_or_shutdown( + config.pprof_duration_seconds, + kv_profiles_dir, + config.cluster_name.clone(), + profile::PprofRole::Broker, + config.instance_key.clone(), + shutdown_waiter, + ); + + Ok((framework, config)) +} + +pub async fn run_broker( + config_arg: ConfigArg, +) -> Result<(Arc, ClientConfig)> { + run_broker_impl(config_arg, None).await +} + #[cfg(feature = "test_bins")] pub(crate) async fn run_master_with_test_overrides( config_arg: ConfigArg, @@ -2736,8 +2959,8 @@ mod tests { large_file_paths: crate::config::LargeFilePaths { paths: vec![owner_large_root.to_string_lossy().into_owned()], }, - protocol_version: - fluxon_util::git_version_build_record::get_current_git_commitid().unwrap(), + protocol_version: fluxon_util::git_version_build_record::get_current_git_commitid() + .unwrap(), write_ts: Some(chrono::Utc::now().timestamp_micros()), }; let shared_meta_json = serde_json::to_string(&shared_meta).unwrap(); diff --git a/fluxon_rs/fluxon_kv/src/master_lease_manager/lease_manager_test.rs 
b/fluxon_rs/fluxon_kv/src/master_lease_manager/lease_manager_test.rs
index 5c20cc1..5d344c9 100755
--- a/fluxon_rs/fluxon_kv/src/master_lease_manager/lease_manager_test.rs
+++ b/fluxon_rs/fluxon_kv/src/master_lease_manager/lease_manager_test.rs
@@ -22,7 +22,8 @@ async fn test1_lease_expire_removes_keys() {
     unsafe {
         std::env::set_var("FLUXON_LOG", "debug");
     }
-    let (master_fw, client_fw) = start_master_and_client("lease_master_t1", "lease_client_t1").await;
+    let (master_fw, client_fw) =
+        start_master_and_client("lease_master_t1", "lease_client_t1").await;
     let client_view = client_fw.client_kv_api_view();
     wait_master_ready(&client_view).await;
 
@@ -82,7 +83,8 @@ async fn test2_rebind_to_new_lease_preserves_until_new_expire() {
     unsafe {
         std::env::set_var("FLUXON_LOG", "debug");
     }
-    let (master_fw, client_fw) = start_master_and_client("lease_master_t2", "lease_client_t2").await;
+    let (master_fw, client_fw) =
+        start_master_and_client("lease_master_t2", "lease_client_t2").await;
     let client_view = client_fw.client_kv_api_view();
     wait_master_ready(&client_view).await;
 
@@ -161,7 +163,8 @@ async fn test3_keepalive() {
     unsafe {
         std::env::set_var("FLUXON_LOG", "debug");
     }
-    let (master_fw, client_fw) = start_master_and_client("lease_master_t3", "lease_client_t3").await;
+    let (master_fw, client_fw) =
+        start_master_and_client("lease_master_t3", "lease_client_t3").await;
     let client_view = client_fw.client_kv_api_view();
     wait_master_ready(&client_view).await;
 
@@ -236,7 +239,8 @@ async fn test4_delete_under_lease_then_get_fails() {
     unsafe {
         std::env::set_var("FLUXON_LOG", "debug");
     }
-    let (master_fw, client_fw) = start_master_and_client("lease_master_t4", "lease_client_t4").await;
+    let (master_fw, client_fw) =
+        start_master_and_client("lease_master_t4", "lease_client_t4").await;
     let client_view = client_fw.client_kv_api_view();
     wait_master_ready(&client_view).await;
 
diff --git a/fluxon_rs/fluxon_kv/src/memholder/lifetime.rs b/fluxon_rs/fluxon_kv/src/memholder/lifetime.rs
index ad23b4d..1301a98 100755
--- a/fluxon_rs/fluxon_kv/src/memholder/lifetime.rs
+++ b/fluxon_rs/fluxon_kv/src/memholder/lifetime.rs
@@ -448,8 +448,8 @@ impl MemholderManagerTrait for MasterOwnerMemMgr {
     const DELETE_SUBMIT_QUEUE_CAPACITY: usize = 1000;
     const DELETE_TARGET_QUEUE_CAPACITY: usize = 1000;
 
-    const DELETE_MERGE_WINDOW_MILLIS: u64 = 1000;
-    const DELETE_RETRY_INTERVAL_MILLIS: u64 = 1000;
+    const DELETE_MERGE_WINDOW_MILLIS: u64 = 10;
+    const DELETE_RETRY_INTERVAL_MILLIS: u64 = 200;
 
     #[inline]
     fn inner_map(&self) -> &DashMap {
@@ -737,8 +737,8 @@ impl MemholderManagerTrait for OwnerExternalMemMgr {
     const DELETE_SUBMIT_QUEUE_CAPACITY: usize = 1000;
     const DELETE_TARGET_QUEUE_CAPACITY: usize = 1000;
 
-    const DELETE_MERGE_WINDOW_MILLIS: u64 = 1000;
-    const DELETE_RETRY_INTERVAL_MILLIS: u64 = 1000;
+    const DELETE_MERGE_WINDOW_MILLIS: u64 = 10;
+    const DELETE_RETRY_INTERVAL_MILLIS: u64 = 200;
 
     #[inline]
     fn inner_map(&self) -> &DashMap {
diff --git a/fluxon_rs/fluxon_kv/src/profile.rs b/fluxon_rs/fluxon_kv/src/profile.rs
index c2f40d7..2d04374 100755
--- a/fluxon_rs/fluxon_kv/src/profile.rs
+++ b/fluxon_rs/fluxon_kv/src/profile.rs
@@ -7,6 +7,7 @@ use tracing::{info, warn};
 #[derive(Debug, Clone, Copy)]
 pub(crate) enum PprofRole {
     Master,
+    Broker,
     Client,
 }
 
@@ -14,6 +15,7 @@ impl PprofRole {
     fn as_str(self) -> &'static str {
         match self {
             PprofRole::Master => "master",
+            PprofRole::Broker => "broker",
             PprofRole::Client => "client",
         }
     }
diff --git a/fluxon_rs/fluxon_mq/Cargo.toml b/fluxon_rs/fluxon_mq/Cargo.toml
index 4f10f44..15f6329 100644
--- a/fluxon_rs/fluxon_mq/Cargo.toml
+++ b/fluxon_rs/fluxon_mq/Cargo.toml
@@ -17,6 +17,7 @@ parking_lot = { workspace = true }
 paste = { workspace = true }
 serde = { workspace = true }
 serde_json = { workspace = true }
+bitcode = { workspace = true }
 etcd-client = { workspace = true }
 fluxon_util = { path = "../fluxon_util" }
 fluxon_observability = { path = "../fluxon_observability" }
diff --git
a/fluxon_rs/fluxon_mq/src/broker.rs b/fluxon_rs/fluxon_mq/src/broker.rs new file mode 100644 index 0000000..69827eb --- /dev/null +++ b/fluxon_rs/fluxon_mq/src/broker.rs @@ -0,0 +1,2878 @@ +use std::collections::{HashMap, HashSet, VecDeque}; +use std::env; +use std::sync::atomic::{AtomicU64, Ordering}; +use std::sync::{Arc, OnceLock}; +use std::time::{Duration, SystemTime, UNIX_EPOCH}; + +use bitcode::{Decode, Encode}; +use fluxon_commu::cluster_manager::ClusterManagerView; +use fluxon_commu::p2p::rpc::{MsgPack, MsgPackSerializePart, RPCCaller, RPCHandler, RPCReq}; +use fluxon_commu::p2p::P2pModuleView; +use serde::{Deserialize, Serialize}; +use thiserror::Error; +use tokio::sync::{mpsc, oneshot, Mutex}; + +use crate::keys::{self, MqCategory}; +use crate::manager::PRODUCE_OFFSET_BEGIN; + +const BROKER_RPC_REQ_MSG_ID: u32 = 8101; +const BROKER_RPC_RESP_MSG_ID: u32 = 8102; +pub const FLUXON_MQ_COMPONENT_METADATA_KEY: &str = "fluxon_mq_component"; +pub const FLUXON_MQ_COMPONENT_BROKER_METADATA_VALUE: &str = "broker"; +const BROKER_PAYLOAD_BYTES_CAP_ENV: &str = "FLUXON_MQ_BROKER_PAYLOAD_BYTES_CAP"; +const BROKER_PAYLOAD_BYTES_CAP_PERCENT_ENV: &str = "FLUXON_MQ_BROKER_PAYLOAD_BYTES_CAP_PERCENT"; +const BROKER_CLEANUP_RELEASE_DELAY_MS_ENV: &str = "FLUXON_MQ_BROKER_CLEANUP_RELEASE_DELAY_MS"; +const OWNER_POOL_DRAM_BYTES_ENV: &str = "FLUXON_OWNER_POOL_DRAM_BYTES"; +const DEFAULT_BROKER_PAYLOAD_BYTES_CAP: u64 = 64 * 1024 * 1024 * 1024; +const DEFAULT_BROKER_PAYLOAD_BYTES_CAP_PERCENT: u64 = 60; +const DEFAULT_BROKER_CLEANUP_RELEASE_DELAY_MS: u64 = 0; +const BROKER_DISCOVERY_TIMEOUT: Duration = Duration::from_secs(15); +const BROKER_RPC_RESPONSE_CACHE_LIMIT: usize = 65536; + +static BROKER_RPC_REQUEST_SEQ: AtomicU64 = AtomicU64::new(1); +static BROKER_RPC_REQUEST_PREFIX: OnceLock = OnceLock::new(); + +#[derive(Debug, Clone, Default, PartialEq, Eq, Serialize, Deserialize, Encode, Decode)] +pub struct BrokerChannelConfig { + pub channel_id: i64, + pub capacity: i64, +} + 
+#[derive(Debug, Clone, Default, Serialize, Deserialize, Encode, Decode)] +pub struct BrokerReserveRequest { + pub channel_id: i64, + pub producer_id: String, + pub category: MqCategory, + pub payload_bytes: u64, + pub now_ms: i64, +} + +#[derive(Debug, Clone, Default, Serialize, Deserialize, Encode, Decode)] +pub struct BrokerFetchRequest { + pub channel_id: i64, + pub consumer_id: String, + pub now_ms: i64, +} + +#[derive(Debug, Clone, Default, PartialEq, Eq, Serialize, Deserialize, Encode, Decode)] +pub struct BrokerEnvelope { + pub channel_id: i64, + pub producer_id: String, + pub msg_id: i64, + pub reservation_id: u64, + pub payload_key: String, + pub payload_bytes: u64, + pub reserved_at_ms: i64, + pub published_at_ms: Option, +} + +#[derive(Debug, Clone, Default, PartialEq, Eq, Serialize, Deserialize, Encode, Decode)] +pub struct BrokerReservation { + pub envelope: BrokerEnvelope, +} + +#[derive(Debug, Clone, Default, PartialEq, Eq, Serialize, Deserialize, Encode, Decode)] +pub struct BrokerFetchedMessage { + pub envelope: BrokerEnvelope, +} + +#[derive(Debug, Clone, Default, PartialEq, Eq, Serialize, Deserialize, Encode, Decode)] +pub struct BrokerFetchBatch { + pub messages: Vec, +} + +#[derive(Debug, Clone, Default, PartialEq, Eq, Serialize, Deserialize, Encode, Decode)] +pub struct BrokerCommitOutcome { + pub first_commit: bool, + pub cleanup: Option, +} + +#[derive(Debug, Clone, Default, PartialEq, Eq, Serialize, Deserialize, Encode, Decode)] +pub struct BrokerCommitBatchOutcome { + pub first_commit_count: usize, + pub cleanup: Vec, +} + +#[derive(Debug, Error, PartialEq, Eq, Clone, Serialize, Deserialize, Encode, Decode)] +pub enum BrokerError { + #[error("broker channel not found: channel_id={0}")] + ChannelNotFound(i64), + + #[error( + "broker channel capacity must be positive: channel_id={channel_id} capacity={capacity}" + )] + InvalidCapacity { channel_id: i64, capacity: i64 }, + + #[error( + "broker channel is full: channel_id={channel_id} 
capacity={capacity} used_slots={used_slots}" + )] + ChannelFull { + channel_id: i64, + capacity: i64, + used_slots: i64, + }, + + #[error( + "broker payload byte budget is full: requested_bytes={requested_bytes} capacity_bytes={capacity_bytes} used_bytes={used_bytes}" + )] + PayloadBytesFull { + requested_bytes: u64, + capacity_bytes: u64, + used_bytes: u64, + }, + + #[error( + "broker payload is larger than byte budget: requested_bytes={requested_bytes} capacity_bytes={capacity_bytes}" + )] + PayloadTooLarge { + requested_bytes: u64, + capacity_bytes: u64, + }, + + #[error( + "broker reservation not found: channel_id={channel_id} reservation_id={reservation_id}" + )] + ReservationNotFound { + channel_id: i64, + reservation_id: u64, + }, + + #[error( + "broker delivery not in-flight: channel_id={channel_id} reservation_id={reservation_id}" + )] + DeliveryNotFound { + channel_id: i64, + reservation_id: u64, + }, + + #[error("invalid broker state transition: {0}")] + InvalidRecord(String), + + #[error("broker master unavailable: {0}")] + BrokerUnavailable(String), + + #[error("broker rpc error: {0}")] + Rpc(String), + + #[error("broker actor closed")] + ActorClosed, +} + +#[derive(Debug, Default)] +pub struct LocalBroker { + state: BrokerState, +} + +#[derive(Debug)] +struct BrokerState { + channels: HashMap, + payload_byte_capacity: u64, + used_payload_bytes: u64, +} + +impl Default for BrokerState { + fn default() -> Self { + Self { + channels: HashMap::new(), + payload_byte_capacity: default_payload_byte_capacity(), + used_payload_bytes: 0, + } + } +} + +#[derive(Debug)] +struct ChannelState { + config: BrokerChannelConfig, + next_reservation_id: u64, + next_msg_by_producer: HashMap, + pending: HashMap, + visible: VecDeque, + inflight: HashMap, + inflight_order: VecDeque, + cleanup: VecDeque, + cleanup_inflight: HashMap, + used_slots: i64, + reserve_waiters: VecDeque, + fetch_waiters: VecDeque, +} + +impl ChannelState { + fn new(config: BrokerChannelConfig) -> 
Self { + Self { + config, + next_reservation_id: 1, + next_msg_by_producer: HashMap::new(), + pending: HashMap::new(), + visible: VecDeque::new(), + inflight: HashMap::new(), + inflight_order: VecDeque::new(), + cleanup: VecDeque::new(), + cleanup_inflight: HashMap::new(), + used_slots: 0, + reserve_waiters: VecDeque::new(), + fetch_waiters: VecDeque::new(), + } + } +} + +#[derive(Debug)] +struct ReserveWaiter { + req: BrokerReserveRequest, + reply: oneshot::Sender>, +} + +#[derive(Debug)] +struct FetchWaiter { + req: BrokerFetchRequest, + reply: oneshot::Sender, BrokerError>>, +} + +impl LocalBroker { + pub fn new() -> Self { + Self::default() + } + + #[cfg(test)] + fn with_payload_byte_capacity(payload_byte_capacity: u64) -> Self { + Self { + state: BrokerState { + channels: HashMap::new(), + payload_byte_capacity: payload_byte_capacity.max(1), + used_payload_bytes: 0, + }, + } + } + + pub fn upsert_channel(&mut self, config: BrokerChannelConfig) -> Result<(), BrokerError> { + validate_capacity(&config)?; + match self.state.channels.get_mut(&config.channel_id) { + Some(channel) => { + if config.capacity < channel.used_slots { + return Err(BrokerError::InvalidRecord(format!( + "channel_id={} capacity={} below used_slots={}", + config.channel_id, config.capacity, channel.used_slots + ))); + } + channel.config = config; + } + None => { + self.state + .channels + .insert(config.channel_id, ChannelState::new(config)); + } + } + Ok(()) + } + + pub fn delete_channel(&mut self, channel_id: i64) -> Result, BrokerError> { + let payload_keys = self.delete_channel_state(channel_id); + Ok(payload_keys) + } + + pub fn reserve(&mut self, req: BrokerReserveRequest) -> Result { + let channel = self.channel(req.channel_id)?; + if broker_category_enforces_capacity(req.category) + && channel.used_slots >= channel.config.capacity + { + return Err(BrokerError::ChannelFull { + channel_id: req.channel_id, + capacity: channel.config.capacity, + used_slots: channel.used_slots, + }); + } + 
+ let msg_id = channel + .next_msg_by_producer + .get(&req.producer_id) + .copied() + .unwrap_or(PRODUCE_OFFSET_BEGIN + 1); + let reservation_id = channel.next_reservation_id; + let payload_key = keys::backend_message_key_with_category( + req.channel_id, + &req.producer_id, + msg_id, + &req.category, + ); + let payload_bytes = req.payload_bytes.max(1); + if payload_bytes > self.state.payload_byte_capacity { + return Err(BrokerError::PayloadTooLarge { + requested_bytes: payload_bytes, + capacity_bytes: self.state.payload_byte_capacity, + }); + } + if self.state.used_payload_bytes.saturating_add(payload_bytes) + > self.state.payload_byte_capacity + { + return Err(BrokerError::PayloadBytesFull { + requested_bytes: payload_bytes, + capacity_bytes: self.state.payload_byte_capacity, + used_bytes: self.state.used_payload_bytes, + }); + } + + let envelope = BrokerEnvelope { + channel_id: req.channel_id, + producer_id: req.producer_id, + msg_id, + reservation_id, + payload_key, + payload_bytes, + reserved_at_ms: req.now_ms, + published_at_ms: None, + }; + let channel = self.channel_mut(req.channel_id)?; + channel.next_reservation_id = reservation_id + 1; + let next_msg = channel + .next_msg_by_producer + .entry(envelope.producer_id.clone()) + .or_insert(PRODUCE_OFFSET_BEGIN + 1); + *next_msg = (*next_msg).max(msg_id + 1); + channel.pending.insert(reservation_id, envelope.clone()); + channel.used_slots += 1; + self.state.used_payload_bytes += payload_bytes; + Ok(BrokerReservation { envelope }) + } + + pub fn publish( + &mut self, + channel_id: i64, + reservation_id: u64, + now_ms: i64, + ) -> Result { + let channel = self.channel_mut(channel_id)?; + let mut envelope = + channel + .pending + .remove(&reservation_id) + .ok_or(BrokerError::ReservationNotFound { + channel_id, + reservation_id, + })?; + envelope.published_at_ms = Some(now_ms); + channel.visible.push_back(envelope.clone()); + Ok(envelope) + } + + pub fn abort(&mut self, channel_id: i64, reservation_id: u64) -> 
Result<(), BrokerError> { + let channel = self.channel_mut(channel_id)?; + let envelope = + channel + .pending + .remove(&reservation_id) + .ok_or(BrokerError::ReservationNotFound { + channel_id, + reservation_id, + })?; + channel.used_slots -= 1; + self.release_payload_bytes(envelope.payload_bytes); + Ok(()) + } + + pub fn fetch_next( + &mut self, + req: BrokerFetchRequest, + ) -> Result, BrokerError> { + let channel = self.channel_mut(req.channel_id)?; + let Some(envelope) = channel.visible.pop_front() else { + return Ok(None); + }; + channel + .inflight + .insert(envelope.reservation_id, envelope.clone()); + channel.inflight_order.push_back(envelope.reservation_id); + Ok(Some(BrokerFetchedMessage { envelope })) + } + + pub fn fetch_batch_available( + &mut self, + req: BrokerFetchRequest, + max_items: usize, + ) -> Result { + let mut messages = Vec::new(); + for _ in 0..max_items { + let Some(message) = self.fetch_next(req.clone())? else { + break; + }; + messages.push(message); + } + Ok(BrokerFetchBatch { messages }) + } + + pub fn commit( + &mut self, + channel_id: i64, + reservation_id: u64, + now_ms: i64, + ) -> Result { + let _ = now_ms; + let channel = self.channel_mut(channel_id)?; + if cleanup_contains(channel, reservation_id) { + return Ok(BrokerCommitOutcome { + first_commit: false, + cleanup: None, + }); + } + let envelope = + channel + .inflight + .remove(&reservation_id) + .ok_or(BrokerError::DeliveryNotFound { + channel_id, + reservation_id, + })?; + remove_from_deque(&mut channel.inflight_order, reservation_id); + channel.cleanup.push_back(envelope.clone()); + channel.used_slots -= 1; + Ok(BrokerCommitOutcome { + first_commit: true, + cleanup: Some(envelope), + }) + } + + pub fn commit_batch( + &mut self, + channel_id: i64, + reservation_ids: Vec, + now_ms: i64, + ) -> Result { + let mut cleanup = Vec::new(); + let mut first_commit_count = 0usize; + for reservation_id in reservation_ids { + let outcome = self.commit(channel_id, reservation_id, 
now_ms)?; + if outcome.first_commit { + first_commit_count += 1; + if let Some(envelope) = outcome.cleanup { + cleanup.push(envelope); + } + } + } + Ok(BrokerCommitBatchOutcome { + first_commit_count, + cleanup, + }) + } + + pub fn requeue_inflight( + &mut self, + channel_id: i64, + reservation_id: u64, + ) -> Result<(), BrokerError> { + let channel = self.channel_mut(channel_id)?; + let envelope = + channel + .inflight + .remove(&reservation_id) + .ok_or(BrokerError::DeliveryNotFound { + channel_id, + reservation_id, + })?; + remove_from_deque(&mut channel.inflight_order, reservation_id); + channel.visible.push_front(envelope); + Ok(()) + } + + pub fn requeue_inflight_batch( + &mut self, + channel_id: i64, + reservation_ids: Vec<u64>, + ) -> Result<(), BrokerError> { + let channel = self.channel(channel_id)?; + let mut seen = HashSet::new(); + for reservation_id in &reservation_ids { + if !seen.insert(*reservation_id) { + return Err(BrokerError::InvalidRecord(format!( + "duplicate requeue reservation_id={} for channel_id={}", + reservation_id, channel_id + ))); + } + if !channel.inflight.contains_key(reservation_id) { + return Err(BrokerError::DeliveryNotFound { + channel_id, + reservation_id: *reservation_id, + }); + } + } + + for reservation_id in reservation_ids.into_iter().rev() { + self.requeue_inflight(channel_id, reservation_id)?; + } + Ok(()) + } + + pub fn requeue_all_inflight(&mut self, channel_id: i64) -> Result<(), BrokerError> { + let reservation_ids: Vec<u64> = self + .channel(channel_id)? 
+ .inflight_order + .iter() + .rev() + .copied() + .collect(); + for reservation_id in reservation_ids { + self.requeue_inflight(channel_id, reservation_id)?; + } + Ok(()) + } + + pub fn take_cleanup_batch( + &mut self, + channel_id: i64, + max_items: usize, + ) -> Result, BrokerError> { + let channel = self.channel_mut(channel_id)?; + let mut batch = Vec::new(); + for _ in 0..max_items { + let Some(envelope) = channel.cleanup.pop_front() else { + break; + }; + channel + .cleanup_inflight + .insert(envelope.reservation_id, envelope.clone()); + batch.push(envelope); + } + Ok(batch) + } + + pub fn cleanup_ack(&mut self, channel_id: i64, reservation_id: u64) -> Result<(), BrokerError> { + let _ = self.apply_cleanup_ack(channel_id, reservation_id, true)?; + Ok(()) + } + + pub fn cleanup_ack_for_delayed_release( + &mut self, + channel_id: i64, + reservation_id: u64, + ) -> Result { + self.apply_cleanup_ack(channel_id, reservation_id, false) + } + + pub fn cleanup_nack( + &mut self, + channel_id: i64, + reservation_id: u64, + ) -> Result<(), BrokerError> { + let channel = self.channel_mut(channel_id)?; + if let Some(envelope) = channel.cleanup_inflight.remove(&reservation_id) { + channel.cleanup.push_front(envelope); + } + Ok(()) + } + + fn release_payload_bytes(&mut self, payload_bytes: u64) { + self.state.used_payload_bytes = self.state.used_payload_bytes.saturating_sub(payload_bytes); + } + + fn delete_channel_state(&mut self, channel_id: i64) -> Vec { + let Some(mut channel) = self.state.channels.remove(&channel_id) else { + return Vec::new(); + }; + + let mut payload_bytes = 0u64; + let mut payload_keys = Vec::new(); + collect_deleted_payloads( + channel.pending.drain().map(|(_, envelope)| envelope), + &mut payload_keys, + &mut payload_bytes, + ); + collect_deleted_payloads( + channel.visible.drain(..), + &mut payload_keys, + &mut payload_bytes, + ); + collect_deleted_payloads( + channel.inflight.drain().map(|(_, envelope)| envelope), + &mut payload_keys, + &mut 
payload_bytes, + ); + collect_deleted_payloads( + channel.cleanup.drain(..), + &mut payload_keys, + &mut payload_bytes, + ); + collect_deleted_payloads( + channel + .cleanup_inflight + .drain() + .map(|(_, envelope)| envelope), + &mut payload_keys, + &mut payload_bytes, + ); + + while let Some(waiter) = channel.reserve_waiters.pop_front() { + let _ = waiter + .reply + .send(Err(BrokerError::ChannelNotFound(channel_id))); + } + while let Some(waiter) = channel.fetch_waiters.pop_front() { + let _ = waiter + .reply + .send(Err(BrokerError::ChannelNotFound(channel_id))); + } + + self.release_payload_bytes(payload_bytes); + payload_keys + } + + fn apply_cleanup_ack( + &mut self, + channel_id: i64, + reservation_id: u64, + release_payload_now: bool, + ) -> Result<u64, BrokerError> { + let channel = self.channel_mut(channel_id)?; + let envelope = if let Some(envelope) = channel.cleanup_inflight.remove(&reservation_id) { + envelope + } else if let Some(pos) = channel + .cleanup + .iter() + .position(|env| env.reservation_id == reservation_id) + { + channel + .cleanup + .remove(pos) + .expect("cleanup envelope position checked above") + } else { + return Err(BrokerError::ReservationNotFound { + channel_id, + reservation_id, + }); + }; + let payload_bytes = envelope.payload_bytes; + if release_payload_now { + self.release_payload_bytes(payload_bytes); + } + Ok(payload_bytes) + } + + fn channel(&self, channel_id: i64) -> Result<&ChannelState, BrokerError> { + self.state + .channels + .get(&channel_id) + .ok_or(BrokerError::ChannelNotFound(channel_id)) + } + + fn channel_mut(&mut self, channel_id: i64) -> Result<&mut ChannelState, BrokerError> { + self.state + .channels + .get_mut(&channel_id) + .ok_or(BrokerError::ChannelNotFound(channel_id)) + } +} + +fn drain_reserve_waiters(broker: &mut LocalBroker) { + loop { + let channel_ids: Vec<i64> = broker.state.channels.keys().copied().collect(); + let mut progressed = false; + for channel_id in channel_ids { + progressed |= 
drain_reserve_waiters_for_channel(broker, channel_id); + } + if !progressed { + return; + } + } +} + +fn drain_reserve_waiters_for_channel(broker: &mut LocalBroker, channel_id: i64) -> bool { + let mut progressed = false; + loop { + let waiter = match broker.channel_mut(channel_id) { + Ok(channel) => channel.reserve_waiters.pop_front(), + Err(_) => return progressed, + }; + let Some(waiter) = waiter else { + return progressed; + }; + + match broker.reserve(waiter.req.clone()) { + Ok(reservation) => { + if let Err(Ok(reservation)) = waiter.reply.send(Ok(reservation)) { + let _ = broker.abort(channel_id, reservation.envelope.reservation_id); + } + progressed = true; + } + Err(BrokerError::ChannelFull { .. }) | Err(BrokerError::PayloadBytesFull { .. }) => { + if let Ok(channel) = broker.channel_mut(channel_id) { + channel.reserve_waiters.push_front(waiter); + } + return progressed; + } + Err(err) => { + let _ = waiter.reply.send(Err(err)); + progressed = true; + } + } + } +} + +fn drain_fetch_waiters_for_channel(broker: &mut LocalBroker, channel_id: i64) { + loop { + let waiter = match broker.channel_mut(channel_id) { + Ok(channel) => channel.fetch_waiters.pop_front(), + Err(_) => return, + }; + let Some(waiter) = waiter else { + return; + }; + + match broker.fetch_next(waiter.req.clone()) { + Ok(Some(fetched)) => { + if let Err(Ok(Some(fetched))) = waiter.reply.send(Ok(Some(fetched))) { + let _ = broker.requeue_inflight( + fetched.envelope.channel_id, + fetched.envelope.reservation_id, + ); + } + } + Ok(None) => { + if let Ok(channel) = broker.channel_mut(channel_id) { + channel.fetch_waiters.push_front(waiter); + } + return; + } + Err(err) => { + let _ = waiter.reply.send(Err(err)); + } + } + } +} + +fn fail_all_waiters_with_actor_closed(broker: &mut LocalBroker) { + for channel in broker.state.channels.values_mut() { + while let Some(waiter) = channel.reserve_waiters.pop_front() { + let _ = waiter.reply.send(Err(BrokerError::ActorClosed)); + } + while let 
Some(waiter) = channel.fetch_waiters.pop_front() { + let _ = waiter.reply.send(Err(BrokerError::ActorClosed)); + } + } +} + +fn collect_deleted_payloads( + envelopes: impl Iterator, + payload_keys: &mut Vec, + payload_bytes: &mut u64, +) { + for envelope in envelopes { + *payload_bytes = payload_bytes.saturating_add(envelope.payload_bytes); + payload_keys.push(envelope.payload_key); + } +} + +fn cleanup_contains(channel: &ChannelState, reservation_id: u64) -> bool { + channel.cleanup_inflight.contains_key(&reservation_id) + || channel + .cleanup + .iter() + .any(|env| env.reservation_id == reservation_id) +} + +enum BrokerCommand { + UpsertChannel { + config: BrokerChannelConfig, + reply: oneshot::Sender>, + }, + DeleteChannel { + channel_id: i64, + reply: oneshot::Sender, BrokerError>>, + }, + Reserve { + req: BrokerReserveRequest, + reply: oneshot::Sender>, + }, + Publish { + channel_id: i64, + reservation_id: u64, + now_ms: i64, + reply: oneshot::Sender>, + }, + Abort { + channel_id: i64, + reservation_id: u64, + reply: oneshot::Sender>, + }, + FetchNext { + req: BrokerFetchRequest, + reply: oneshot::Sender, BrokerError>>, + }, + FetchBatchAvailable { + req: BrokerFetchRequest, + max_items: usize, + reply: oneshot::Sender>, + }, + Commit { + channel_id: i64, + reservation_id: u64, + now_ms: i64, + reply: oneshot::Sender>, + }, + CommitBatch { + channel_id: i64, + reservation_ids: Vec, + now_ms: i64, + reply: oneshot::Sender>, + }, + RequeueInflight { + channel_id: i64, + reservation_id: u64, + reply: oneshot::Sender>, + }, + RequeueInflightBatch { + channel_id: i64, + reservation_ids: Vec, + reply: oneshot::Sender>, + }, + RequeueAllInflight { + channel_id: i64, + reply: oneshot::Sender>, + }, + TakeCleanupBatch { + channel_id: i64, + max_items: usize, + reply: oneshot::Sender, BrokerError>>, + }, + CleanupAck { + channel_id: i64, + reservation_id: u64, + reply: oneshot::Sender>, + }, + CleanupNack { + channel_id: i64, + reservation_id: u64, + reply: 
oneshot::Sender<Result<(), BrokerError>>, + }, + ReleasePayloadBytes { + payload_bytes: u64, + }, + Shutdown { + reply: oneshot::Sender<Result<(), BrokerError>>, + }, +} + +#[derive(Clone, Debug)] +struct LocalBrokerHandle { + tx: mpsc::Sender<BrokerCommand>, +} + +impl LocalBrokerHandle { + fn spawn_actor(broker: LocalBroker, queue_capacity: usize) -> Self { + Self::spawn_actor_with_cleanup_release_delay( + broker, + queue_capacity, + default_cleanup_release_delay(), + ) + } + + fn spawn_actor_with_cleanup_release_delay( + broker: LocalBroker, + queue_capacity: usize, + cleanup_release_delay: Duration, + ) -> Self { + let (tx, mut rx) = mpsc::channel(queue_capacity.max(1)); + let tx_for_actor = tx.clone(); + tokio::spawn(async move { + let mut broker = broker; + while let Some(cmd) = rx.recv().await { + match cmd { + BrokerCommand::UpsertChannel { config, reply } => { + let channel_id = config.channel_id; + let result = broker.upsert_channel(config); + if result.is_ok() { + let _ = channel_id; + drain_reserve_waiters(&mut broker); + } + let _ = reply.send(result); + } + BrokerCommand::DeleteChannel { channel_id, reply } => { + let result = broker.delete_channel(channel_id); + if result.is_ok() { + drain_reserve_waiters(&mut broker); + } + let _ = reply.send(result); + } + BrokerCommand::Reserve { req, reply } => { + let req_clone = req.clone(); + match broker.reserve(req_clone) { + Ok(reservation) => { + let _ = reply.send(Ok(reservation)); + } + Err(err) => { + let _ = reply.send(Err(err)); + } + } + } + BrokerCommand::Publish { + channel_id, + reservation_id, + now_ms, + reply, + } => { + let result = broker.publish(channel_id, reservation_id, now_ms); + if result.is_ok() { + drain_fetch_waiters_for_channel(&mut broker, channel_id); + } + let _ = reply.send(result); + } + BrokerCommand::Abort { + channel_id, + reservation_id, + reply, + } => { + let result = broker.abort(channel_id, reservation_id); + if result.is_ok() { + drain_reserve_waiters(&mut broker); + } + let _ = reply.send(result); + } + BrokerCommand::FetchNext { 
req, reply } => { + let req_clone = req.clone(); + match broker.fetch_next(req_clone) { + Ok(Some(message)) => { + let _ = reply.send(Ok(Some(message))); + } + Ok(None) => match broker.channel_mut(req.channel_id) { + Ok(channel) => { + channel.fetch_waiters.push_back(FetchWaiter { req, reply }) + } + Err(err) => { + let _ = reply.send(Err(err)); + } + }, + Err(err) => { + let _ = reply.send(Err(err)); + } + } + } + BrokerCommand::FetchBatchAvailable { + req, + max_items, + reply, + } => { + let _ = reply.send(broker.fetch_batch_available(req, max_items)); + } + BrokerCommand::Commit { + channel_id, + reservation_id, + now_ms, + reply, + } => { + let result = broker.commit(channel_id, reservation_id, now_ms); + if result.is_ok() { + drain_reserve_waiters(&mut broker); + } + let _ = reply.send(result); + } + BrokerCommand::CommitBatch { + channel_id, + reservation_ids, + now_ms, + reply, + } => { + let result = broker.commit_batch(channel_id, reservation_ids, now_ms); + if result.is_ok() { + drain_reserve_waiters(&mut broker); + } + let _ = reply.send(result); + } + BrokerCommand::RequeueInflight { + channel_id, + reservation_id, + reply, + } => { + let result = broker.requeue_inflight(channel_id, reservation_id); + if result.is_ok() { + drain_fetch_waiters_for_channel(&mut broker, channel_id); + } + let _ = reply.send(result); + } + BrokerCommand::RequeueInflightBatch { + channel_id, + reservation_ids, + reply, + } => { + let result = broker.requeue_inflight_batch(channel_id, reservation_ids); + if result.is_ok() { + drain_fetch_waiters_for_channel(&mut broker, channel_id); + } + let _ = reply.send(result); + } + BrokerCommand::RequeueAllInflight { channel_id, reply } => { + let result = broker.requeue_all_inflight(channel_id); + if result.is_ok() { + drain_fetch_waiters_for_channel(&mut broker, channel_id); + } + let _ = reply.send(result); + } + BrokerCommand::TakeCleanupBatch { + channel_id, + max_items, + reply, + } => { + let _ = 
reply.send(broker.take_cleanup_batch(channel_id, max_items)); + } + BrokerCommand::CleanupAck { + channel_id, + reservation_id, + reply, + } => { + let result = + broker.cleanup_ack_for_delayed_release(channel_id, reservation_id); + match result { + Ok(payload_bytes) if cleanup_release_delay.is_zero() => { + broker.release_payload_bytes(payload_bytes); + drain_reserve_waiters(&mut broker); + let _ = reply.send(Ok(())); + } + Ok(payload_bytes) => { + let tx_release = tx_for_actor.clone(); + tokio::spawn(async move { + tokio::time::sleep(cleanup_release_delay).await; + let _ = tx_release + .send(BrokerCommand::ReleasePayloadBytes { payload_bytes }) + .await; + }); + let _ = reply.send(Ok(())); + } + Err(err) => { + let _ = reply.send(Err(err)); + } + } + } + BrokerCommand::ReleasePayloadBytes { payload_bytes } => { + broker.release_payload_bytes(payload_bytes); + if payload_bytes > 0 { + drain_reserve_waiters(&mut broker); + } + } + BrokerCommand::CleanupNack { + channel_id, + reservation_id, + reply, + } => { + let _ = reply.send(broker.cleanup_nack(channel_id, reservation_id)); + } + BrokerCommand::Shutdown { reply } => { + fail_all_waiters_with_actor_closed(&mut broker); + let _ = reply.send(Ok(())); + break; + } + } + } + }); + Self { tx } + } + + async fn upsert_channel(&self, config: BrokerChannelConfig) -> Result<(), BrokerError> { + self.request(|reply| BrokerCommand::UpsertChannel { config, reply }) + .await + } + + async fn delete_channel(&self, channel_id: i64) -> Result, BrokerError> { + self.request(|reply| BrokerCommand::DeleteChannel { channel_id, reply }) + .await + } + + async fn reserve(&self, req: BrokerReserveRequest) -> Result { + self.request(|reply| BrokerCommand::Reserve { req, reply }) + .await + } + + async fn publish( + &self, + channel_id: i64, + reservation_id: u64, + now_ms: i64, + ) -> Result { + self.request(|reply| BrokerCommand::Publish { + channel_id, + reservation_id, + now_ms, + reply, + }) + .await + } + + async fn abort(&self, 
channel_id: i64, reservation_id: u64) -> Result<(), BrokerError> { + self.request(|reply| BrokerCommand::Abort { + channel_id, + reservation_id, + reply, + }) + .await + } + + async fn fetch_next( + &self, + req: BrokerFetchRequest, + ) -> Result, BrokerError> { + self.request(|reply| BrokerCommand::FetchNext { req, reply }) + .await + } + + async fn fetch_batch_available( + &self, + req: BrokerFetchRequest, + max_items: usize, + ) -> Result { + self.request(|reply| BrokerCommand::FetchBatchAvailable { + req, + max_items, + reply, + }) + .await + } + + async fn commit( + &self, + channel_id: i64, + reservation_id: u64, + now_ms: i64, + ) -> Result { + self.request(|reply| BrokerCommand::Commit { + channel_id, + reservation_id, + now_ms, + reply, + }) + .await + } + + async fn commit_batch( + &self, + channel_id: i64, + reservation_ids: Vec, + now_ms: i64, + ) -> Result { + self.request(|reply| BrokerCommand::CommitBatch { + channel_id, + reservation_ids, + now_ms, + reply, + }) + .await + } + + async fn requeue_inflight( + &self, + channel_id: i64, + reservation_id: u64, + ) -> Result<(), BrokerError> { + self.request(|reply| BrokerCommand::RequeueInflight { + channel_id, + reservation_id, + reply, + }) + .await + } + + async fn requeue_inflight_batch( + &self, + channel_id: i64, + reservation_ids: Vec, + ) -> Result<(), BrokerError> { + self.request(|reply| BrokerCommand::RequeueInflightBatch { + channel_id, + reservation_ids, + reply, + }) + .await + } + + async fn requeue_all_inflight(&self, channel_id: i64) -> Result<(), BrokerError> { + self.request(|reply| BrokerCommand::RequeueAllInflight { channel_id, reply }) + .await + } + + async fn take_cleanup_batch( + &self, + channel_id: i64, + max_items: usize, + ) -> Result, BrokerError> { + self.request(|reply| BrokerCommand::TakeCleanupBatch { + channel_id, + max_items, + reply, + }) + .await + } + + async fn cleanup_ack(&self, channel_id: i64, reservation_id: u64) -> Result<(), BrokerError> { + 
self.request(|reply| BrokerCommand::CleanupAck { + channel_id, + reservation_id, + reply, + }) + .await + } + + async fn cleanup_nack(&self, channel_id: i64, reservation_id: u64) -> Result<(), BrokerError> { + self.request(|reply| BrokerCommand::CleanupNack { + channel_id, + reservation_id, + reply, + }) + .await + } + + async fn shutdown(&self) -> Result<(), BrokerError> { + self.request(|reply| BrokerCommand::Shutdown { reply }) + .await + } + + async fn request( + &self, + make_cmd: impl FnOnce(oneshot::Sender>) -> BrokerCommand, + ) -> Result { + let (reply_tx, reply_rx) = oneshot::channel(); + self.tx + .send(make_cmd(reply_tx)) + .await + .map_err(|_| BrokerError::ActorClosed)?; + reply_rx.await.map_err(|_| BrokerError::ActorClosed)? + } +} + +#[derive(Debug, Clone, Default, Encode, Decode)] +enum BrokerRpcOperation { + #[default] + Noop, + UpsertChannel { + config: BrokerChannelConfig, + }, + DeleteChannel { + channel_id: i64, + }, + Reserve { + req: BrokerReserveRequest, + }, + Publish { + channel_id: i64, + reservation_id: u64, + now_ms: i64, + }, + Abort { + channel_id: i64, + reservation_id: u64, + }, + FetchNext { + req: BrokerFetchRequest, + }, + FetchBatchAvailable { + req: BrokerFetchRequest, + max_items: usize, + }, + Commit { + channel_id: i64, + reservation_id: u64, + now_ms: i64, + }, + CommitBatch { + channel_id: i64, + reservation_ids: Vec, + now_ms: i64, + }, + RequeueInflight { + channel_id: i64, + reservation_id: u64, + }, + RequeueInflightBatch { + channel_id: i64, + reservation_ids: Vec, + }, + RequeueAllInflight { + channel_id: i64, + }, + TakeCleanupBatch { + channel_id: i64, + max_items: usize, + }, + CleanupAck { + channel_id: i64, + reservation_id: u64, + }, + CleanupNack { + channel_id: i64, + reservation_id: u64, + }, +} + +#[derive(Debug, Clone, Default, Encode, Decode)] +struct BrokerRpcRequest { + request_id: String, + op: BrokerRpcOperation, +} + +impl BrokerRpcRequest { + fn new(op: BrokerRpcOperation) -> Self { + Self { + 
request_id: String::new(), + op, + } + } +} + +impl MsgPackSerializePart for BrokerRpcRequest { + fn msg_id(&self) -> u32 { + BROKER_RPC_REQ_MSG_ID + } +} + +impl RPCReq for BrokerRpcRequest { + type Resp = BrokerRpcResponse; +} + +#[derive(Debug, Clone, Encode, Decode)] +enum BrokerRpcReply { + Unit(Result<(), BrokerError>), + PayloadKeys(Result, BrokerError>), + Reservation(Result), + Envelope(Result), + Fetch(Result, BrokerError>), + FetchBatch(Result), + Commit(Result), + CommitBatch(Result), + CleanupBatch(Result, BrokerError>), +} + +impl Default for BrokerRpcReply { + fn default() -> Self { + Self::Unit(Ok(())) + } +} + +#[derive(Debug, Clone, Default, Encode, Decode)] +struct BrokerRpcResponse { + reply: BrokerRpcReply, +} + +#[derive(Default)] +struct BrokerRpcResponseCache { + completed: HashMap, + completed_order: VecDeque, + in_flight: HashMap>>, +} + +impl MsgPackSerializePart for BrokerRpcResponse { + fn msg_id(&self) -> u32 { + BROKER_RPC_RESP_MSG_ID + } +} + +async fn execute_rpc_request( + broker: &LocalBrokerHandle, + request: BrokerRpcRequest, + allow_wait: bool, +) -> BrokerRpcResponse { + let reply = match request.op { + BrokerRpcOperation::Noop => BrokerRpcReply::Unit(Err(BrokerError::Rpc( + "broker noop request is invalid".to_string(), + ))), + BrokerRpcOperation::UpsertChannel { config } => { + BrokerRpcReply::Unit(broker.upsert_channel(config).await) + } + BrokerRpcOperation::DeleteChannel { channel_id } => { + BrokerRpcReply::PayloadKeys(broker.delete_channel(channel_id).await) + } + BrokerRpcOperation::Reserve { req } => { + BrokerRpcReply::Reservation(broker.reserve(req).await) + } + BrokerRpcOperation::Publish { + channel_id, + reservation_id, + now_ms, + } => BrokerRpcReply::Envelope(broker.publish(channel_id, reservation_id, now_ms).await), + BrokerRpcOperation::Abort { + channel_id, + reservation_id, + } => BrokerRpcReply::Unit(broker.abort(channel_id, reservation_id).await), + BrokerRpcOperation::FetchNext { req } if allow_wait => { 
+ BrokerRpcReply::Fetch(broker.fetch_next(req).await) + } + BrokerRpcOperation::FetchNext { req } => BrokerRpcReply::Fetch( + broker + .fetch_batch_available(req, 1) + .await + .map(|batch| batch.messages.into_iter().next()), + ), + BrokerRpcOperation::FetchBatchAvailable { req, max_items } => { + BrokerRpcReply::FetchBatch(broker.fetch_batch_available(req, max_items).await) + } + BrokerRpcOperation::Commit { + channel_id, + reservation_id, + now_ms, + } => BrokerRpcReply::Commit(broker.commit(channel_id, reservation_id, now_ms).await), + BrokerRpcOperation::CommitBatch { + channel_id, + reservation_ids, + now_ms, + } => BrokerRpcReply::CommitBatch( + broker + .commit_batch(channel_id, reservation_ids, now_ms) + .await, + ), + BrokerRpcOperation::RequeueInflight { + channel_id, + reservation_id, + } => BrokerRpcReply::Unit(broker.requeue_inflight(channel_id, reservation_id).await), + BrokerRpcOperation::RequeueInflightBatch { + channel_id, + reservation_ids, + } => BrokerRpcReply::Unit( + broker + .requeue_inflight_batch(channel_id, reservation_ids) + .await, + ), + BrokerRpcOperation::RequeueAllInflight { channel_id } => { + BrokerRpcReply::Unit(broker.requeue_all_inflight(channel_id).await) + } + BrokerRpcOperation::TakeCleanupBatch { + channel_id, + max_items, + } => BrokerRpcReply::CleanupBatch(broker.take_cleanup_batch(channel_id, max_items).await), + BrokerRpcOperation::CleanupAck { + channel_id, + reservation_id, + } => BrokerRpcReply::Unit(broker.cleanup_ack(channel_id, reservation_id).await), + BrokerRpcOperation::CleanupNack { + channel_id, + reservation_id, + } => BrokerRpcReply::Unit(broker.cleanup_nack(channel_id, reservation_id).await), + }; + BrokerRpcResponse { reply } +} + +async fn execute_rpc_request_with_cache( + broker: &LocalBrokerHandle, + response_cache: &Arc>, + request: BrokerRpcRequest, + allow_wait: bool, +) -> BrokerRpcResponse { + let request_id = request.request_id.clone(); + if request_id.is_empty() { + return 
execute_rpc_request(broker, request, allow_wait).await; + } + + let wait_for_existing = { + let mut cache = response_cache.lock().await; + if let Some(response) = cache.completed.get(&request_id) { + return response.clone(); + } + if let Some(waiters) = cache.in_flight.get_mut(&request_id) { + let (tx, rx) = oneshot::channel(); + waiters.push(tx); + Some(rx) + } else { + cache.in_flight.insert(request_id.clone(), Vec::new()); + None + } + }; + + if let Some(rx) = wait_for_existing { + return rx.await.unwrap_or(BrokerRpcResponse { + reply: BrokerRpcReply::Unit(Err(BrokerError::ActorClosed)), + }); + } + + let response = execute_rpc_request(broker, request, allow_wait).await; + let waiters = { + let mut cache = response_cache.lock().await; + let waiters = cache.in_flight.remove(&request_id).unwrap_or_default(); + cache.completed.insert(request_id.clone(), response.clone()); + cache.completed_order.push_back(request_id); + while cache.completed_order.len() > BROKER_RPC_RESPONSE_CACHE_LIMIT { + if let Some(old_request_id) = cache.completed_order.pop_front() { + cache.completed.remove(&old_request_id); + } + } + waiters + }; + + for waiter in waiters { + let _ = waiter.send(response.clone()); + } + response +} + +pub fn register_broker_service(p2p_view: P2pModuleView, queue_capacity: usize) { + let broker = LocalBrokerHandle::spawn_actor(LocalBroker::new(), queue_capacity); + let response_cache = Arc::new(Mutex::new(BrokerRpcResponseCache::default())); + let handler_view = p2p_view.clone(); + RPCHandler::<BrokerRpcRequest>::new().regist(p2p_view.p2p_module(), move |resp, msg| { + let broker = broker.clone(); + let response_cache = response_cache.clone(); + let handler_view = handler_view.clone(); + let _ = handler_view.spawn("fluxon_mq.broker.rpc", async move { + let response = + execute_rpc_request_with_cache(&broker, &response_cache, msg.serialize_part, false) + .await; + let _ = resp + .send_resp(MsgPack { + serialize_part: response, + raw_bytes: Vec::new(), + }) + .await; + }); + 
Ok(()) + }); +} + +#[derive(Clone)] +struct RemoteBrokerHandle { + cluster_manager_view: ClusterManagerView, + p2p_view: P2pModuleView, +} + +#[derive(Clone)] +enum BrokerHandleInner { + Local(LocalBrokerHandle), + Remote(RemoteBrokerHandle), +} + +pub struct BrokerHandle { + inner: BrokerHandleInner, +} + +impl Clone for BrokerHandle { + fn clone(&self) -> Self { + Self { + inner: self.inner.clone(), + } + } +} + +impl std::fmt::Debug for BrokerHandle { + fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { + match &self.inner { + BrokerHandleInner::Local(_) => f + .debug_struct("BrokerHandle") + .field("kind", &"local") + .finish(), + BrokerHandleInner::Remote(_) => f + .debug_struct("BrokerHandle") + .field("kind", &"remote") + .finish(), + } + } +} + +impl BrokerHandle { + pub fn new_distributed( + cluster_manager_view: ClusterManagerView, + p2p_view: P2pModuleView, + ) -> Self { + Self { + inner: BrokerHandleInner::Remote(RemoteBrokerHandle { + cluster_manager_view, + p2p_view, + }), + } + } + + #[cfg(test)] + pub fn new_local_for_test(queue_capacity: usize) -> Self { + Self { + inner: BrokerHandleInner::Local( + LocalBrokerHandle::spawn_actor_with_cleanup_release_delay( + LocalBroker::new(), + queue_capacity, + Duration::ZERO, + ), + ), + } + } + + #[cfg(test)] + pub fn new_local_with_payload_byte_capacity_for_test( + payload_byte_capacity: u64, + queue_capacity: usize, + ) -> Self { + Self { + inner: BrokerHandleInner::Local( + LocalBrokerHandle::spawn_actor_with_cleanup_release_delay( + LocalBroker::with_payload_byte_capacity(payload_byte_capacity), + queue_capacity, + Duration::ZERO, + ), + ), + } + } + + pub async fn upsert_channel(&self, config: BrokerChannelConfig) -> Result<(), BrokerError> { + match self + .request(BrokerRpcRequest::new(BrokerRpcOperation::UpsertChannel { + config, + })) + .await? 
+ .reply + { + BrokerRpcReply::Unit(result) => result, + other => Err(BrokerError::Rpc(format!( + "unexpected response for upsert_channel: {:?}", + other + ))), + } + } + + pub async fn delete_channel(&self, channel_id: i64) -> Result, BrokerError> { + match self + .request(BrokerRpcRequest::new(BrokerRpcOperation::DeleteChannel { + channel_id, + })) + .await? + .reply + { + BrokerRpcReply::PayloadKeys(result) => result, + other => Err(BrokerError::Rpc(format!( + "unexpected response for delete_channel: {:?}", + other + ))), + } + } + + pub async fn reserve( + &self, + req: BrokerReserveRequest, + ) -> Result { + match self + .request(BrokerRpcRequest::new(BrokerRpcOperation::Reserve { req })) + .await? + .reply + { + BrokerRpcReply::Reservation(result) => result, + other => Err(BrokerError::Rpc(format!( + "unexpected response for reserve: {:?}", + other + ))), + } + } + + pub async fn publish( + &self, + channel_id: i64, + reservation_id: u64, + now_ms: i64, + ) -> Result { + match self + .request(BrokerRpcRequest::new(BrokerRpcOperation::Publish { + channel_id, + reservation_id, + now_ms, + })) + .await? + .reply + { + BrokerRpcReply::Envelope(result) => result, + other => Err(BrokerError::Rpc(format!( + "unexpected response for publish: {:?}", + other + ))), + } + } + + pub async fn abort(&self, channel_id: i64, reservation_id: u64) -> Result<(), BrokerError> { + match self + .request(BrokerRpcRequest::new(BrokerRpcOperation::Abort { + channel_id, + reservation_id, + })) + .await? + .reply + { + BrokerRpcReply::Unit(result) => result, + other => Err(BrokerError::Rpc(format!( + "unexpected response for abort: {:?}", + other + ))), + } + } + + pub async fn fetch_next( + &self, + req: BrokerFetchRequest, + ) -> Result, BrokerError> { + match self + .request(BrokerRpcRequest::new(BrokerRpcOperation::FetchNext { req })) + .await? 
+ .reply + { + BrokerRpcReply::Fetch(result) => result, + other => Err(BrokerError::Rpc(format!( + "unexpected response for fetch_next: {:?}", + other + ))), + } + } + + pub async fn fetch_batch_available( + &self, + req: BrokerFetchRequest, + max_items: usize, + ) -> Result { + match self + .request(BrokerRpcRequest::new( + BrokerRpcOperation::FetchBatchAvailable { req, max_items }, + )) + .await? + .reply + { + BrokerRpcReply::FetchBatch(result) => result, + other => Err(BrokerError::Rpc(format!( + "unexpected response for fetch_batch_available: {:?}", + other + ))), + } + } + + pub async fn commit( + &self, + channel_id: i64, + reservation_id: u64, + now_ms: i64, + ) -> Result { + match self + .request(BrokerRpcRequest::new(BrokerRpcOperation::Commit { + channel_id, + reservation_id, + now_ms, + })) + .await? + .reply + { + BrokerRpcReply::Commit(result) => result, + other => Err(BrokerError::Rpc(format!( + "unexpected response for commit: {:?}", + other + ))), + } + } + + pub async fn commit_batch( + &self, + channel_id: i64, + reservation_ids: Vec, + now_ms: i64, + ) -> Result { + match self + .request(BrokerRpcRequest::new(BrokerRpcOperation::CommitBatch { + channel_id, + reservation_ids, + now_ms, + })) + .await? + .reply + { + BrokerRpcReply::CommitBatch(result) => result, + other => Err(BrokerError::Rpc(format!( + "unexpected response for commit_batch: {:?}", + other + ))), + } + } + + pub async fn requeue_inflight( + &self, + channel_id: i64, + reservation_id: u64, + ) -> Result<(), BrokerError> { + match self + .request(BrokerRpcRequest::new(BrokerRpcOperation::RequeueInflight { + channel_id, + reservation_id, + })) + .await? 
+ .reply + { + BrokerRpcReply::Unit(result) => result, + other => Err(BrokerError::Rpc(format!( + "unexpected response for requeue_inflight: {:?}", + other + ))), + } + } + + pub async fn requeue_inflight_batch( + &self, + channel_id: i64, + reservation_ids: Vec, + ) -> Result<(), BrokerError> { + match self + .request(BrokerRpcRequest::new( + BrokerRpcOperation::RequeueInflightBatch { + channel_id, + reservation_ids, + }, + )) + .await? + .reply + { + BrokerRpcReply::Unit(result) => result, + other => Err(BrokerError::Rpc(format!( + "unexpected response for requeue_inflight_batch: {:?}", + other + ))), + } + } + + pub async fn requeue_all_inflight(&self, channel_id: i64) -> Result<(), BrokerError> { + match self + .request(BrokerRpcRequest::new( + BrokerRpcOperation::RequeueAllInflight { channel_id }, + )) + .await? + .reply + { + BrokerRpcReply::Unit(result) => result, + other => Err(BrokerError::Rpc(format!( + "unexpected response for requeue_all_inflight: {:?}", + other + ))), + } + } + + pub async fn take_cleanup_batch( + &self, + channel_id: i64, + max_items: usize, + ) -> Result, BrokerError> { + match self + .request(BrokerRpcRequest::new( + BrokerRpcOperation::TakeCleanupBatch { + channel_id, + max_items, + }, + )) + .await? + .reply + { + BrokerRpcReply::CleanupBatch(result) => result, + other => Err(BrokerError::Rpc(format!( + "unexpected response for take_cleanup_batch: {:?}", + other + ))), + } + } + + pub async fn cleanup_ack( + &self, + channel_id: i64, + reservation_id: u64, + ) -> Result<(), BrokerError> { + match self + .request(BrokerRpcRequest::new(BrokerRpcOperation::CleanupAck { + channel_id, + reservation_id, + })) + .await? 
+ .reply + { + BrokerRpcReply::Unit(result) => result, + other => Err(BrokerError::Rpc(format!( + "unexpected response for cleanup_ack: {:?}", + other + ))), + } + } + + pub async fn cleanup_nack( + &self, + channel_id: i64, + reservation_id: u64, + ) -> Result<(), BrokerError> { + match self + .request(BrokerRpcRequest::new(BrokerRpcOperation::CleanupNack { + channel_id, + reservation_id, + })) + .await? + .reply + { + BrokerRpcReply::Unit(result) => result, + other => Err(BrokerError::Rpc(format!( + "unexpected response for cleanup_nack: {:?}", + other + ))), + } + } + + pub async fn shutdown(&self) -> Result<(), BrokerError> { + match &self.inner { + BrokerHandleInner::Local(local) => local.shutdown().await, + BrokerHandleInner::Remote(_) => Err(BrokerError::Rpc( + "shutdown is unsupported for distributed broker handles".to_string(), + )), + } + } + + async fn request(&self, request: BrokerRpcRequest) -> Result { + match &self.inner { + BrokerHandleInner::Local(local) => Ok(execute_rpc_request(local, request, true).await), + BrokerHandleInner::Remote(remote) => remote.request(request).await, + } + } +} + +impl RemoteBrokerHandle { + async fn request( + &self, + mut request: BrokerRpcRequest, + ) -> Result { + if request.request_id.is_empty() { + request.request_id = next_broker_rpc_request_id(); + } + let broker_node = + find_or_wait_broker_node(self.cluster_manager_view.cluster_manager()).await?; + let response = RPCCaller::::new() + .call( + self.p2p_view.p2p_module(), + broker_node.into(), + MsgPack { + serialize_part: request, + raw_bytes: Vec::new(), + }, + None, + 6, + ) + .await + .map_err(|e| BrokerError::Rpc(format!("broker rpc call failed: {}", e)))?; + Ok(response.serialize_part) + } +} + +async fn find_or_wait_broker_node( + cluster_manager: &fluxon_commu::ClusterManager, +) -> Result { + let mut rx = cluster_manager.listen(); + let members = cluster_manager.get_members(); + let broker_nodes: Vec<_> = members + .iter() + .filter(|member| 
is_broker_member(member)) + .collect(); + if broker_nodes.len() == 1 { + return Ok(broker_nodes[0].id.to_string()); + } + if broker_nodes.len() > 1 { + return Err(BrokerError::BrokerUnavailable(format!( + "multiple brokers found: {:?}", + broker_nodes + .into_iter() + .map(|member| member.id.to_string()) + .collect::>() + ))); + } + + tokio::time::timeout(BROKER_DISCOVERY_TIMEOUT, async move { + while let Ok(event) = rx.recv().await { + match event { + fluxon_commu::ClusterEvent::MemberJoined(member) + | fluxon_commu::ClusterEvent::MemberUpdated(member) + if is_broker_member(&member) => + { + return Ok(member.id.to_string()); + } + _ => {} + } + } + Err(BrokerError::BrokerUnavailable( + "broker node not found from cluster manager".to_string(), + )) + }) + .await + .unwrap_or_else(|_| { + Err(BrokerError::BrokerUnavailable(format!( + "timed out waiting {}s for broker node registration; start fluxon_py.runtime.start_broker first", + BROKER_DISCOVERY_TIMEOUT.as_secs() + ))) + }) +} + +fn next_broker_rpc_request_id() -> String { + let prefix = BROKER_RPC_REQUEST_PREFIX.get_or_init(|| { + let started_ns = SystemTime::now() + .duration_since(UNIX_EPOCH) + .expect("system clock is before UNIX_EPOCH") + .as_nanos(); + format!("{}-{}", std::process::id(), started_ns) + }); + let seq = BROKER_RPC_REQUEST_SEQ.fetch_add(1, Ordering::Relaxed); + format!("{}-{}", prefix, seq) +} + +fn is_broker_member(member: &fluxon_commu::ClusterMember) -> bool { + member + .metadata + .get(FLUXON_MQ_COMPONENT_METADATA_KEY) + .is_some_and(|value| value == FLUXON_MQ_COMPONENT_BROKER_METADATA_VALUE) +} + +fn broker_category_enforces_capacity(category: MqCategory) -> bool { + matches!(category, MqCategory::MpmcSub { .. 
}) +} + +pub fn now_unix_ms() -> i64 { + SystemTime::now() + .duration_since(UNIX_EPOCH) + .expect("system clock is before UNIX_EPOCH") + .as_millis() as i64 +} + +fn validate_capacity(config: &BrokerChannelConfig) -> Result<(), BrokerError> { + if config.capacity <= 0 { + return Err(BrokerError::InvalidCapacity { + channel_id: config.channel_id, + capacity: config.capacity, + }); + } + Ok(()) +} + +fn default_payload_byte_capacity() -> u64 { + if let Ok(raw) = env::var(BROKER_PAYLOAD_BYTES_CAP_ENV) { + if let Ok(value) = raw.trim().parse::() { + if value > 0 { + return value; + } + } + } + + if let Ok(raw) = env::var(OWNER_POOL_DRAM_BYTES_ENV) { + if let Ok(value) = raw.trim().parse::() { + if value > 0 { + let percent = payload_byte_capacity_percent(); + return ((value as u128) * (percent as u128) / 100).max(1) as u64; + } + } + } + + DEFAULT_BROKER_PAYLOAD_BYTES_CAP +} + +fn payload_byte_capacity_percent() -> u64 { + env::var(BROKER_PAYLOAD_BYTES_CAP_PERCENT_ENV) + .ok() + .and_then(|raw| raw.trim().parse::().ok()) + .filter(|value| (1..=100).contains(value)) + .unwrap_or(DEFAULT_BROKER_PAYLOAD_BYTES_CAP_PERCENT) +} + +fn default_cleanup_release_delay() -> Duration { + Duration::from_millis( + env::var(BROKER_CLEANUP_RELEASE_DELAY_MS_ENV) + .ok() + .and_then(|raw| raw.trim().parse::().ok()) + .unwrap_or(DEFAULT_BROKER_CLEANUP_RELEASE_DELAY_MS), + ) +} + +fn remove_from_deque(queue: &mut VecDeque, reservation_id: u64) { + if let Some(pos) = queue.iter().position(|id| *id == reservation_id) { + queue.remove(pos); + } +} + +#[cfg(test)] +mod tests { + use super::*; + + fn reserve_req(channel_id: i64, producer_id: &str, now_ms: i64) -> BrokerReserveRequest { + reserve_req_with_category(channel_id, producer_id, MqCategory::Mpsc, 1, now_ms) + } + + fn reserve_req_with_category( + channel_id: i64, + producer_id: &str, + category: MqCategory, + payload_bytes: u64, + now_ms: i64, + ) -> BrokerReserveRequest { + BrokerReserveRequest { + channel_id, + producer_id: 
producer_id.to_string(), + category, + payload_bytes, + now_ms, + } + } + + fn reserve_req_bytes( + channel_id: i64, + producer_id: &str, + payload_bytes: u64, + now_ms: i64, + ) -> BrokerReserveRequest { + BrokerReserveRequest { + channel_id, + producer_id: producer_id.to_string(), + category: MqCategory::Mpsc, + payload_bytes, + now_ms, + } + } + + fn fetch_req(channel_id: i64, consumer_id: &str, now_ms: i64) -> BrokerFetchRequest { + BrokerFetchRequest { + channel_id, + consumer_id: consumer_id.to_string(), + now_ms, + } + } + + #[tokio::test] + async fn rpc_request_cache_deduplicates_retried_reserve() { + let broker = LocalBrokerHandle::spawn_actor_with_cleanup_release_delay( + LocalBroker::new(), + 8, + Duration::ZERO, + ); + let cache = Arc::new(Mutex::new(BrokerRpcResponseCache::default())); + let upsert = BrokerRpcRequest::new(BrokerRpcOperation::UpsertChannel { + config: BrokerChannelConfig { + channel_id: 41, + capacity: 2, + }, + }); + let _ = execute_rpc_request_with_cache(&broker, &cache, upsert, false).await; + + let reserve = BrokerRpcRequest { + request_id: "reserve-retry-1".to_string(), + op: BrokerRpcOperation::Reserve { + req: reserve_req(41, "p0", 10), + }, + }; + let first = execute_rpc_request_with_cache(&broker, &cache, reserve.clone(), false).await; + let second = execute_rpc_request_with_cache(&broker, &cache, reserve, false).await; + let first_reservation = match first.reply { + BrokerRpcReply::Reservation(Ok(reservation)) => reservation, + other => panic!("unexpected first reserve response: {:?}", other), + }; + let second_reservation = match second.reply { + BrokerRpcReply::Reservation(Ok(reservation)) => reservation, + other => panic!("unexpected second reserve response: {:?}", other), + }; + assert_eq!( + first_reservation.envelope.reservation_id, + second_reservation.envelope.reservation_id + ); + + let next = broker.reserve(reserve_req(41, "p0", 11)).await.unwrap(); + assert_eq!(next.envelope.reservation_id, 2); + 
broker.shutdown().await.unwrap(); + } + + #[tokio::test] + async fn rpc_fetch_next_without_wait_returns_none() { + let broker = LocalBrokerHandle::spawn_actor_with_cleanup_release_delay( + LocalBroker::new(), + 8, + Duration::ZERO, + ); + broker + .upsert_channel(BrokerChannelConfig { + channel_id: 42, + capacity: 2, + }) + .await + .unwrap(); + let cache = Arc::new(Mutex::new(BrokerRpcResponseCache::default())); + let response = tokio::time::timeout( + Duration::from_millis(50), + execute_rpc_request_with_cache( + &broker, + &cache, + BrokerRpcRequest { + request_id: "fetch-empty-1".to_string(), + op: BrokerRpcOperation::FetchNext { + req: fetch_req(42, "c0", 10), + }, + }, + false, + ), + ) + .await + .expect("remote-style fetch must not wait"); + match response.reply { + BrokerRpcReply::Fetch(Ok(None)) => {} + other => panic!("unexpected fetch response: {:?}", other), + } + broker.shutdown().await.unwrap(); + } + + #[test] + fn reserve_publish_fetch_commit_frees_capacity_for_mpmc_sub() { + let mut broker = LocalBroker::new(); + broker + .upsert_channel(BrokerChannelConfig { + channel_id: 7, + capacity: 2, + }) + .unwrap(); + + let first = broker + .reserve(reserve_req_with_category( + 7, + "p0", + MqCategory::MpmcSub { parent_mpmc_id: 70 }, + 1, + 10, + )) + .unwrap(); + let second = broker + .reserve(reserve_req_with_category( + 7, + "p0", + MqCategory::MpmcSub { parent_mpmc_id: 70 }, + 1, + 11, + )) + .unwrap(); + assert_eq!(first.envelope.msg_id, 0); + assert_eq!(second.envelope.msg_id, 1); + assert_eq!( + broker + .reserve(reserve_req_with_category( + 7, + "p0", + MqCategory::MpmcSub { parent_mpmc_id: 70 }, + 1, + 12, + )) + .unwrap_err(), + BrokerError::ChannelFull { + channel_id: 7, + capacity: 2, + used_slots: 2, + } + ); + + broker + .publish(7, first.envelope.reservation_id, 20) + .unwrap(); + let fetched = broker.fetch_next(fetch_req(7, "c0", 30)).unwrap().unwrap(); + assert_eq!( + fetched.envelope.reservation_id, + first.envelope.reservation_id + ); + 
+ let committed = broker + .commit(7, fetched.envelope.reservation_id, 40) + .unwrap(); + assert!(committed.first_commit); + assert_eq!( + committed + .cleanup + .as_ref() + .map(|env| env.payload_key.as_str()), + Some( + keys::backend_message_key_with_category( + 7, + "p0", + 0, + &MqCategory::MpmcSub { parent_mpmc_id: 70 }, + ) + .as_str() + ) + ); + + let third = broker + .reserve(reserve_req_with_category( + 7, + "p0", + MqCategory::MpmcSub { parent_mpmc_id: 70 }, + 1, + 50, + )) + .unwrap(); + assert_eq!(third.envelope.msg_id, 2); + } + + #[test] + fn abort_releases_pending_slot_for_mpmc_sub() { + let mut broker = LocalBroker::new(); + broker + .upsert_channel(BrokerChannelConfig { + channel_id: 8, + capacity: 1, + }) + .unwrap(); + + let reservation = broker + .reserve(reserve_req_with_category( + 8, + "p0", + MqCategory::MpmcSub { parent_mpmc_id: 80 }, + 1, + 10, + )) + .unwrap(); + assert!(matches!( + broker.reserve(reserve_req_with_category( + 8, + "p0", + MqCategory::MpmcSub { parent_mpmc_id: 80 }, + 1, + 11, + )), + Err(BrokerError::ChannelFull { .. 
}) + )); + + broker + .abort(8, reservation.envelope.reservation_id) + .unwrap(); + let next = broker + .reserve(reserve_req_with_category( + 8, + "p0", + MqCategory::MpmcSub { parent_mpmc_id: 80 }, + 1, + 12, + )) + .unwrap(); + assert_eq!(next.envelope.msg_id, 1); + } + + #[test] + fn requeue_all_inflight_preserves_fetch_order() { + let mut broker = LocalBroker::new(); + broker + .upsert_channel(BrokerChannelConfig { + channel_id: 10, + capacity: 4, + }) + .unwrap(); + let first = broker.reserve(reserve_req(10, "p0", 10)).unwrap(); + let second = broker.reserve(reserve_req(10, "p0", 11)).unwrap(); + broker + .publish(10, first.envelope.reservation_id, 20) + .unwrap(); + broker + .publish(10, second.envelope.reservation_id, 21) + .unwrap(); + + let _ = broker.fetch_next(fetch_req(10, "c0", 30)).unwrap().unwrap(); + let _ = broker.fetch_next(fetch_req(10, "c0", 31)).unwrap().unwrap(); + broker.requeue_all_inflight(10).unwrap(); + + let redelivered_first = broker.fetch_next(fetch_req(10, "c0", 40)).unwrap().unwrap(); + let redelivered_second = broker.fetch_next(fetch_req(10, "c0", 41)).unwrap().unwrap(); + assert_eq!( + redelivered_first.envelope.reservation_id, + first.envelope.reservation_id + ); + assert_eq!( + redelivered_second.envelope.reservation_id, + second.envelope.reservation_id + ); + } + + #[test] + fn batch_fetch_and_commit_preserves_order_and_frees_capacity() { + let mut broker = LocalBroker::new(); + broker + .upsert_channel(BrokerChannelConfig { + channel_id: 11, + capacity: 3, + }) + .unwrap(); + + let first = broker.reserve(reserve_req(11, "p0", 10)).unwrap(); + let second = broker.reserve(reserve_req(11, "p0", 11)).unwrap(); + let third = broker.reserve(reserve_req(11, "p1", 12)).unwrap(); + for reservation in [&first, &second, &third] { + broker + .publish(11, reservation.envelope.reservation_id, 20) + .unwrap(); + } + + let batch = broker + .fetch_batch_available(fetch_req(11, "c0", 30), 2) + .unwrap(); + assert_eq!(batch.messages.len(), 2); + 
assert_eq!(batch.messages[0].envelope.msg_id, 0); + assert_eq!(batch.messages[1].envelope.msg_id, 1); + + let outcome = broker + .commit_batch( + 11, + batch + .messages + .iter() + .map(|message| message.envelope.reservation_id) + .collect(), + 40, + ) + .unwrap(); + assert_eq!(outcome.first_commit_count, 2); + assert_eq!(outcome.cleanup.len(), 2); + + let next = broker.reserve(reserve_req(11, "p0", 50)).unwrap(); + assert_eq!(next.envelope.msg_id, 2); + } + + #[test] + fn duplicate_commit_is_idempotent_until_cleanup_ack() { + let mut broker = LocalBroker::with_payload_byte_capacity(10); + broker + .upsert_channel(BrokerChannelConfig { + channel_id: 19, + capacity: 4, + }) + .unwrap(); + + let reserved = broker.reserve(reserve_req_bytes(19, "p0", 6, 10)).unwrap(); + broker + .publish(19, reserved.envelope.reservation_id, 20) + .unwrap(); + let fetched = broker.fetch_next(fetch_req(19, "c0", 30)).unwrap().unwrap(); + let reservation_id = fetched.envelope.reservation_id; + + let first = broker.commit(19, reservation_id, 40).unwrap(); + assert!(first.first_commit); + assert!(first.cleanup.is_some()); + let duplicate = broker.commit(19, reservation_id, 41).unwrap(); + assert!(!duplicate.first_commit); + assert!(duplicate.cleanup.is_none()); + + broker.cleanup_ack(19, reservation_id).unwrap(); + assert_eq!( + broker.commit(19, reservation_id, 42).unwrap_err(), + BrokerError::DeliveryNotFound { + channel_id: 19, + reservation_id, + } + ); + } + + #[test] + fn payload_byte_budget_is_global_and_released_on_cleanup_ack_or_abort() { + let mut broker = LocalBroker::with_payload_byte_capacity(10); + broker + .upsert_channel(BrokerChannelConfig { + channel_id: 21, + capacity: 8, + }) + .unwrap(); + broker + .upsert_channel(BrokerChannelConfig { + channel_id: 22, + capacity: 8, + }) + .unwrap(); + + let first = broker.reserve(reserve_req_bytes(21, "p0", 6, 10)).unwrap(); + assert_eq!(first.envelope.payload_bytes, 6); + assert!(matches!( + broker.reserve(reserve_req_bytes(22, 
"p1", 5, 11)), + Err(BrokerError::PayloadBytesFull { .. }) + )); + + broker + .publish(21, first.envelope.reservation_id, 20) + .unwrap(); + let fetched = broker.fetch_next(fetch_req(21, "c0", 30)).unwrap().unwrap(); + broker + .commit(21, fetched.envelope.reservation_id, 40) + .unwrap(); + assert!(matches!( + broker.reserve(reserve_req_bytes(22, "p1", 5, 41)), + Err(BrokerError::PayloadBytesFull { .. }) + )); + broker + .cleanup_ack(21, fetched.envelope.reservation_id) + .unwrap(); + let second = broker.reserve(reserve_req_bytes(22, "p1", 5, 50)).unwrap(); + broker.abort(22, second.envelope.reservation_id).unwrap(); + let third = broker.reserve(reserve_req_bytes(22, "p1", 10, 60)).unwrap(); + assert_eq!(third.envelope.payload_bytes, 10); + } + + #[test] + fn mpsc_reserve_does_not_gate_on_channel_capacity() { + let mut broker = LocalBroker::new(); + broker + .upsert_channel(BrokerChannelConfig { + channel_id: 201, + capacity: 1, + }) + .unwrap(); + + let first = broker.reserve(reserve_req(201, "p0", 10)).unwrap(); + let second = broker.reserve(reserve_req(201, "p0", 11)).unwrap(); + + assert_eq!(first.envelope.msg_id, 0); + assert_eq!(second.envelope.msg_id, 1); + } + + #[test] + fn mpmc_sub_reserve_still_gates_on_channel_capacity() { + let mut broker = LocalBroker::new(); + broker + .upsert_channel(BrokerChannelConfig { + channel_id: 202, + capacity: 1, + }) + .unwrap(); + + let _ = broker + .reserve(reserve_req_with_category( + 202, + "p0", + MqCategory::MpmcSub { parent_mpmc_id: 9 }, + 1, + 10, + )) + .unwrap(); + + assert!(matches!( + broker.reserve(reserve_req_with_category( + 202, + "p0", + MqCategory::MpmcSub { parent_mpmc_id: 9 }, + 1, + 11, + )), + Err(BrokerError::ChannelFull { .. 
}) + )); + } + + #[test] + fn cleanup_ack_releases_payload_after_cleanup_batch_take() { + let mut broker = LocalBroker::with_payload_byte_capacity(10); + broker + .upsert_channel(BrokerChannelConfig { + channel_id: 23, + capacity: 8, + }) + .unwrap(); + + let first = broker.reserve(reserve_req_bytes(23, "p0", 6, 10)).unwrap(); + broker + .publish(23, first.envelope.reservation_id, 20) + .unwrap(); + let fetched = broker.fetch_next(fetch_req(23, "c0", 30)).unwrap().unwrap(); + broker + .commit(23, fetched.envelope.reservation_id, 40) + .unwrap(); + assert_eq!(broker.take_cleanup_batch(23, 8).unwrap().len(), 1); + assert!(matches!( + broker.reserve(reserve_req_bytes(23, "p1", 5, 41)), + Err(BrokerError::PayloadBytesFull { .. }) + )); + + broker + .cleanup_ack(23, fetched.envelope.reservation_id) + .unwrap(); + let second = broker.reserve(reserve_req_bytes(23, "p1", 5, 50)).unwrap(); + assert_eq!(second.envelope.payload_bytes, 5); + } + + #[test] + fn delete_channel_releases_payload_budget_for_all_queues() { + let mut broker = LocalBroker::with_payload_byte_capacity(100); + broker + .upsert_channel(BrokerChannelConfig { + channel_id: 31, + capacity: 16, + }) + .unwrap(); + broker + .upsert_channel(BrokerChannelConfig { + channel_id: 32, + capacity: 16, + }) + .unwrap(); + + let pending = broker.reserve(reserve_req_bytes(31, "p0", 10, 10)).unwrap(); + + let inflight = broker.reserve(reserve_req_bytes(31, "p0", 12, 12)).unwrap(); + broker + .publish(31, inflight.envelope.reservation_id, 21) + .unwrap(); + let _ = broker.fetch_next(fetch_req(31, "c0", 30)).unwrap().unwrap(); + + let cleanup_inflight = broker.reserve(reserve_req_bytes(31, "p0", 13, 13)).unwrap(); + broker + .publish(31, cleanup_inflight.envelope.reservation_id, 22) + .unwrap(); + let fetched = broker.fetch_next(fetch_req(31, "c0", 31)).unwrap().unwrap(); + broker + .commit(31, fetched.envelope.reservation_id, 40) + .unwrap(); + assert_eq!(broker.take_cleanup_batch(31, 1).unwrap().len(), 1); + + let 
cleanup = broker.reserve(reserve_req_bytes(31, "p0", 14, 14)).unwrap(); + broker + .publish(31, cleanup.envelope.reservation_id, 23) + .unwrap(); + let fetched = broker.fetch_next(fetch_req(31, "c0", 32)).unwrap().unwrap(); + broker + .commit(31, fetched.envelope.reservation_id, 41) + .unwrap(); + + let visible = broker.reserve(reserve_req_bytes(31, "p0", 11, 15)).unwrap(); + broker + .publish(31, visible.envelope.reservation_id, 24) + .unwrap(); + + assert_eq!(broker.state.used_payload_bytes, 60); + assert!(matches!( + broker.reserve(reserve_req_bytes(32, "p1", 41, 50)), + Err(BrokerError::PayloadBytesFull { .. }) + )); + + let mut payload_keys = broker.delete_channel(31).unwrap(); + payload_keys.sort(); + let mut expected_payload_keys = vec![ + pending.envelope.payload_key, + inflight.envelope.payload_key, + cleanup_inflight.envelope.payload_key, + cleanup.envelope.payload_key, + visible.envelope.payload_key, + ]; + expected_payload_keys.sort(); + assert_eq!(payload_keys, expected_payload_keys); + assert_eq!(broker.state.used_payload_bytes, 0); + assert_eq!(broker.delete_channel(31), Ok(Vec::new())); + assert_eq!( + broker.fetch_next(fetch_req(31, "c0", 60)).unwrap_err(), + BrokerError::ChannelNotFound(31) + ); + + let next = broker + .reserve(reserve_req_bytes(32, "p1", 100, 70)) + .unwrap(); + assert_eq!(next.envelope.payload_bytes, 100); + } + + #[tokio::test] + async fn broker_handle_roundtrip_uses_local_actor() { + let handle = BrokerHandle::new_local_for_test(32); + handle + .upsert_channel(BrokerChannelConfig { + channel_id: 12, + capacity: 2, + }) + .await + .unwrap(); + let reserved = handle.reserve(reserve_req(12, "p0", 10)).await.unwrap(); + handle + .publish(12, reserved.envelope.reservation_id, 20) + .await + .unwrap(); + let fetched = handle + .fetch_next(fetch_req(12, "c0", 30)) + .await + .unwrap() + .unwrap(); + assert_eq!(fetched.envelope.msg_id, 0); + handle + .commit(12, fetched.envelope.reservation_id, 40) + .await + .unwrap(); + 
assert_eq!(handle.take_cleanup_batch(12, 8).await.unwrap().len(), 1); + handle + .cleanup_ack(12, fetched.envelope.reservation_id) + .await + .unwrap(); + handle.shutdown().await.unwrap(); + } + + #[tokio::test] + async fn broker_handle_delete_channel_releases_payload_budget() { + let handle = BrokerHandle::new_local_with_payload_byte_capacity_for_test(10, 8); + handle + .upsert_channel(BrokerChannelConfig { + channel_id: 24, + capacity: 4, + }) + .await + .unwrap(); + + let first = handle + .reserve(reserve_req_bytes(24, "p0", 6, 10)) + .await + .unwrap(); + assert!(matches!( + handle.reserve(reserve_req_bytes(24, "p1", 5, 11)).await, + Err(BrokerError::PayloadBytesFull { .. }) + )); + + assert_eq!( + handle.delete_channel(24).await.unwrap(), + vec![first.envelope.payload_key] + ); + assert_eq!( + handle.delete_channel(24).await.unwrap(), + Vec::::new() + ); + handle + .upsert_channel(BrokerChannelConfig { + channel_id: 25, + capacity: 4, + }) + .await + .unwrap(); + let next = handle + .reserve(reserve_req_bytes(25, "p1", 10, 20)) + .await + .unwrap(); + assert_eq!(next.envelope.payload_bytes, 10); + + handle.shutdown().await.unwrap(); + } + + #[tokio::test] + async fn broker_handle_returns_actor_closed_after_shutdown() { + let handle = BrokerHandle::new_local_for_test(8); + handle + .upsert_channel(BrokerChannelConfig { + channel_id: 13, + capacity: 1, + }) + .await + .unwrap(); + handle.shutdown().await.unwrap(); + assert_eq!( + handle.reserve(reserve_req(13, "p0", 10)).await.unwrap_err(), + BrokerError::ActorClosed + ); + } + + #[tokio::test] + async fn broker_handle_returns_channel_full_without_waiting_for_mpmc_sub() { + let handle = BrokerHandle::new_local_for_test(8); + handle + .upsert_channel(BrokerChannelConfig { + channel_id: 14, + capacity: 1, + }) + .await + .unwrap(); + + let first = handle + .reserve(reserve_req_with_category( + 14, + "p0", + MqCategory::MpmcSub { + parent_mpmc_id: 140, + }, + 1, + 10, + )) + .await + .unwrap(); + assert!(matches!( 
+ handle + .reserve(reserve_req_with_category( + 14, + "p0", + MqCategory::MpmcSub { + parent_mpmc_id: 140 + }, + 1, + 11, + )) + .await, + Err(BrokerError::ChannelFull { .. }) + )); + + handle + .abort(14, first.envelope.reservation_id) + .await + .unwrap(); + let second = handle + .reserve(reserve_req_with_category( + 14, + "p0", + MqCategory::MpmcSub { + parent_mpmc_id: 140, + }, + 1, + 12, + )) + .await + .unwrap(); + assert_eq!(second.envelope.msg_id, 1); + + handle.shutdown().await.unwrap(); + } + + #[tokio::test] + async fn broker_handle_returns_payload_bytes_full_without_waiting() { + let handle = BrokerHandle::new_local_with_payload_byte_capacity_for_test(10, 8); + handle + .upsert_channel(BrokerChannelConfig { + channel_id: 16, + capacity: 8, + }) + .await + .unwrap(); + + let first = handle + .reserve(reserve_req_bytes(16, "p0", 6, 10)) + .await + .unwrap(); + assert!(matches!( + handle.reserve(reserve_req_bytes(16, "p1", 5, 11)).await, + Err(BrokerError::PayloadBytesFull { .. 
}) + )); + + handle + .publish(16, first.envelope.reservation_id, 20) + .await + .unwrap(); + let fetched = handle + .fetch_next(fetch_req(16, "c0", 30)) + .await + .unwrap() + .unwrap(); + handle + .commit(16, fetched.envelope.reservation_id, 40) + .await + .unwrap(); + + handle + .cleanup_ack(16, fetched.envelope.reservation_id) + .await + .unwrap(); + + let second = handle + .reserve(reserve_req_bytes(16, "p1", 5, 50)) + .await + .unwrap(); + assert_eq!(second.envelope.producer_id, "p1"); + assert_eq!(second.envelope.payload_bytes, 5); + + handle.shutdown().await.unwrap(); + } + + #[tokio::test] + async fn broker_handle_waits_for_message_then_resumes() { + use std::time::Duration; + use tokio::time::sleep; + + let handle = BrokerHandle::new_local_for_test(8); + handle + .upsert_channel(BrokerChannelConfig { + channel_id: 15, + capacity: 2, + }) + .await + .unwrap(); + + let waiter_handle = handle.clone(); + let pending = + tokio::spawn(async move { waiter_handle.fetch_next(fetch_req(15, "c0", 10)).await }); + + sleep(Duration::from_millis(50)).await; + assert!(!pending.is_finished()); + + let reservation = handle.reserve(reserve_req(15, "p0", 11)).await.unwrap(); + handle + .publish(15, reservation.envelope.reservation_id, 12) + .await + .unwrap(); + + let fetched = pending.await.unwrap().unwrap().unwrap(); + assert_eq!(fetched.envelope.msg_id, 0); + + handle.shutdown().await.unwrap(); + } +} diff --git a/fluxon_rs/fluxon_mq/src/consumer.rs b/fluxon_rs/fluxon_mq/src/consumer.rs index c5e5fa4..6207a30 100644 --- a/fluxon_rs/fluxon_mq/src/consumer.rs +++ b/fluxon_rs/fluxon_mq/src/consumer.rs @@ -47,6 +47,7 @@ use crate::nonblocking_monitor::{ }; use crate::shutdown::ShutdownCtl; use crate::LifecycleView; +use crate::{BrokerEnvelope, BrokerFetchRequest, BrokerFetchedMessage, BrokerHandle}; use tracing::{debug, info, warn}; const NO_MESSAGE_WARN_INTERVAL: Duration = Duration::from_secs(30); @@ -54,6 +55,10 @@ const PREFETCH_LATENCY_LOG_INTERVAL: Duration = 
NO_MESSAGE_WARN_INTERVAL; const PREFETCH_LATENCY_WINDOW_SIZE: usize = 16; const NONBLOCKING_QUEUE_WAIT_THRESHOLD: Duration = Duration::from_millis(500); const DELETE_CALLBACK_WARN_INTERVAL: Duration = Duration::from_secs(1); +const BROKER_CLEANUP_DELETE_RETRY_INITIAL_SLEEP: Duration = Duration::from_millis(50); +const BROKER_CLEANUP_DELETE_RETRY_MAX_SLEEP: Duration = Duration::from_secs(5); +const BROKER_CLEANUP_ACK_RETRY_INITIAL_SLEEP: Duration = Duration::from_millis(50); +const BROKER_CLEANUP_ACK_RETRY_MAX_SLEEP: Duration = Duration::from_secs(5); const COMMIT_WAIT_WARN_INTERVAL: Duration = Duration::from_secs(10); const COMMIT_WAIT_BREAKDOWN_SUMMARY_THRESHOLD: Duration = Duration::from_millis(50); const COMMIT_OFFSET_PUT_TIMEOUT: Duration = Duration::from_secs(10); @@ -64,6 +69,9 @@ const PREFETCH_HANDLE_AWAIT_WARN_INTERVAL: Duration = Duration::from_secs(2); const COMMIT_PROGRESS_RETENTION: usize = 1024; const STALE_PRODUCER_PROBE_TOMB_TTL: Duration = Duration::from_secs(10); const READY_TRACE_HISTORY_PER_PRODUCER: usize = 64; +const PREFETCH_REFILL_BURST_MAX: usize = 128; +const PREFETCH_NO_MESSAGE_RETRY_EMPTY_SLEEP: Duration = Duration::from_millis(1); +const PREFETCH_NO_MESSAGE_RETRY_PARTIAL_SLEEP: Duration = Duration::from_millis(5); static NEXT_CONSUMER_INSTANCE_ID: AtomicUsize = AtomicUsize::new(1); fn map_prefix_scan_error(err: EtcdPrefixScanError) -> MpscError { @@ -96,6 +104,21 @@ fn merge_offset_cache_monotonic(current: &mut HashMap, fetched: Has } } +fn prefetch_refill_launch_budget(target: usize, current: usize) -> usize { + target + .saturating_sub(current) + .min(PREFETCH_REFILL_BURST_MAX) + .max(1) +} + +fn prefetch_no_message_retry_sleep(current: usize) -> Duration { + if current == 0 { + PREFETCH_NO_MESSAGE_RETRY_EMPTY_SLEEP + } else { + PREFETCH_NO_MESSAGE_RETRY_PARTIAL_SLEEP + } +} + fn prefetch_job_stage_name(stage: u8) -> &'static str { match stage { 0 => "init", @@ -296,9 +319,7 @@ impl CommitSequencer { let mut current_blocker_begin_at = 
wait_begin; loop { if shutdown.is_closed() { - return Err(MpscError::Internal( - "consumer closed during consume-offset commit wait".to_string(), - )); + return Err(MpscError::Closed); } let observed_next_seq = self.next_seq.load(Ordering::SeqCst); if observed_next_seq == seq { @@ -366,9 +387,7 @@ impl CommitSequencer { ); } _ = shutdown.wait_closed() => { - return Err(MpscError::Internal( - "consumer closed during consume-offset commit wait".to_string(), - )); + return Err(MpscError::Closed); } } } @@ -759,9 +778,16 @@ struct ReadyPathLatencySample { } /// Application-level payload (type-erased) to avoid coupling with upper layers. -pub trait MqPayload: Downcast + Send {} +pub trait MqPayload: Downcast + Send { + fn attach_cleanup(&mut self, cleanup: PayloadCleanup) -> Result<(), PayloadCleanup> { + Err(cleanup) + } +} impl_downcast!(MqPayload); +pub type PayloadCleanupFuture = Pin + Send + 'static>>; +pub type PayloadCleanup = Box PayloadCleanupFuture + Send + 'static>; + /// Callback result: deliver a payload or indicate retry/non-retry. pub enum PayloadResult { Ok(Box), @@ -813,10 +839,12 @@ pub struct MpscConsumer { /// /// Each queue element is the JoinHandle of one complete get operation; the consumer /// only needs to pop it and wait for it to finish, which keeps consumption in submission order. - inflight_rx: mpsc::Receiver, + inflight_queue: Arc>>, inflight_consume_notify: Arc, /// Control channel, used only to deliver control commands such as callback setup. cmd_tx: mpsc::Sender, + /// Local mirror of payload callback for non-prefetch direct paths. + payload_cb: Option, /// delete callback invoked after successful consume-offset commit. 
delete_cb: Option, /// Shared shutdown controller used by higher layers to signal @@ -1242,10 +1270,13 @@ impl MpscConsumer { } async fn recv_next_inflight_handle_with_idle_warn(&mut self) -> Option { - match self.inflight_rx.try_recv() { - Ok(handle) => return Some(handle), - Err(tokio::sync::mpsc::error::TryRecvError::Disconnected) => return None, - Err(tokio::sync::mpsc::error::TryRecvError::Empty) => {} + if let Some(handle) = self + .inflight_queue + .lock() + .expect("inflight queue mutex poisoned") + .pop_front() + { + return Some(handle); } let idle_warn_sleep = tokio::time::sleep(NO_MESSAGE_WARN_INTERVAL); @@ -1255,10 +1286,19 @@ impl MpscConsumer { if self.shutdown.is_closed() { return None; } + let queue_notify = self.inflight_consume_notify.notified(); + tokio::pin!(queue_notify); tokio::select! { biased; - handle_opt = self.inflight_rx.recv() => { - return handle_opt; + _ = &mut queue_notify => { + if let Some(handle) = self + .inflight_queue + .lock() + .expect("inflight queue mutex poisoned") + .pop_front() + { + return Some(handle); + } } _ = &mut idle_warn_sleep => { let parent_mpmc_id = match self.category { @@ -1399,7 +1439,7 @@ impl MpscConsumer { let global_lease_id = chan_mgr.global_lease.id() as i64; let ( cmd_tx, - inflight_rx, + inflight_queue, target_inflight, inflight_queue_size, inflight_consume_notify, @@ -1438,9 +1478,10 @@ impl MpscConsumer { chan_mgr, target_inflight, inflight_queue_size, - inflight_rx, + inflight_queue, cmd_tx, inflight_consume_notify, + payload_cb: None, delete_cb: None, shutdown, category, @@ -1480,6 +1521,10 @@ impl MpscConsumer { &self.consumer_idx } + pub fn channel_capacity(&self) -> i64 { + self.chan_mgr.capacity() + } + pub fn lease_manager(&self) -> &LeaseManager { &self.lease_manager } @@ -1570,6 +1615,7 @@ impl MpscConsumer { /// This method is synchronous and only pushes a control command to the /// internal actor via `try_send`. 
pub fn set_payload_callback(&mut self, cb: PayloadCallback) { + self.payload_cb = Some(cb.clone()); let _ = self.cmd_tx.try_send(ConsumerCmd::SetCallback(cb)); } @@ -1619,8 +1665,7 @@ impl MpscConsumer { } else { self.recv_next_inflight_handle_with_idle_warn().await }; - let inflight_item = - handle_opt.ok_or_else(|| MpscError::Internal("prefetch actor closed".to_string()))?; + let inflight_item = handle_opt.ok_or(MpscError::Closed)?; debug!( "[MpscConsumer get_with_payload] instance_id={} chan_id={} seq={} producer_id={} consume_offset={} inflight_queue_size_after_pop={}", self.instance_id, @@ -1893,6 +1938,48 @@ impl MpscConsumer { .await } + pub async fn get_with_payload_via_broker( + &mut self, + broker: &BrokerHandle, + ) -> Result { + let cb = self + .payload_cb + .as_ref() + .ok_or_else(|| MpscError::Internal("payload callback not set".to_string()))? + .clone(); + get_payload_via_broker( + broker, + self.chan_id, + self.consumer_idx.clone(), + cb, + self.delete_cb.clone(), + self.shutdown.clone(), + ) + .await + } + + pub async fn get_batch_with_payload_via_broker( + &mut self, + broker: &BrokerHandle, + batch_size: usize, + ) -> Result, MpscError> { + let cb = self + .payload_cb + .as_ref() + .ok_or_else(|| MpscError::Internal("payload callback not set".to_string()))? + .clone(); + get_payload_batch_via_broker( + broker, + self.chan_id, + self.consumer_idx.clone(), + batch_size, + cb, + self.delete_cb.clone(), + self.shutdown.clone(), + ) + .await + } + /// Runs the KV payload fetch stage with retry semantics. /// Consume-offset commit is handled by the prefetch job. 
async fn run_single_get( @@ -1909,9 +1996,7 @@ impl MpscConsumer { let mut payload_obj: Option> = None; loop { if shutdown.is_closed() { - return Err(MpscError::Internal( - "consumer closed during get_with_payload".to_string(), - )); + return Err(MpscError::Closed); } let msg_key = keys::backend_message_key_with_category( chan_id, @@ -1978,10 +2063,7 @@ impl MpscConsumer { loop { if shutdown.is_closed() { - return Err(MpscError::Internal(format!( - "consumer closed during consume-offset commit: seq={} producer_id={} consume_offset={}", - seq, producer_id, consume_offset - ))); + return Err(MpscError::Closed); } attempts += 1; @@ -1996,10 +2078,7 @@ impl MpscConsumer { let put_res = tokio::select! { biased; _ = shutdown.wait_closed() => { - return Err(MpscError::Internal(format!( - "consumer closed during consume-offset commit: seq={} producer_id={} consume_offset={}", - seq, producer_id, consume_offset - ))); + return Err(MpscError::Closed); } res = tokio::time::timeout( COMMIT_OFFSET_PUT_TIMEOUT, @@ -2073,10 +2152,7 @@ impl MpscConsumer { tokio::select! { biased; _ = shutdown.wait_closed() => { - return Err(MpscError::Internal(format!( - "consumer closed during consume-offset retry sleep: seq={} producer_id={} consume_offset={}", - seq, producer_id, consume_offset - ))); + return Err(MpscError::Closed); } _ = sleep(COMMIT_OFFSET_RETRY_SLEEP) => {} } @@ -2176,6 +2252,544 @@ impl MpscConsumer { } } +async fn get_payload_via_broker( + broker: &BrokerHandle, + chan_id: i64, + consumer_id: String, + cb: PayloadCallback, + delete_cb: Option, + shutdown: ShutdownCtl, +) -> Result { + let fetched = broker + .fetch_next(BrokerFetchRequest { + channel_id: chan_id, + consumer_id: consumer_id.clone(), + now_ms: now_ms(), + }) + .await + .map_err(|e| { + MpscError::Internal(format!( + "broker fetch failed: chan_id={} consumer_id={} err={}", + chan_id, consumer_id, e + )) + })? 
+ .ok_or(MpscError::NoMessage)?; + let envelope = fetched.envelope; + let reservation_id = envelope.reservation_id; + let producer_id = envelope.producer_id.clone(); + let payload_key = envelope.payload_key.clone(); + let mut requeue_guard = + BrokerInflightRequeueGuard::new(broker.clone(), chan_id, vec![reservation_id]); + let payload = match run_payload_callback( + chan_id, + cb, + producer_id.clone(), + payload_key, + shutdown.clone(), + ) + .await + { + Ok((payload, _kv_get_latency_ns)) => payload, + Err(err) => { + requeue_guard.requeue_now().await; + return Err(err); + } + }; + + let commit_outcome = match broker.commit(chan_id, reservation_id, now_ms()).await { + Ok(outcome) => outcome, + Err(err) => { + requeue_guard.requeue_now().await; + return Err(MpscError::Internal(format!( + "broker commit failed: chan_id={} consumer_id={} reservation_id={} err={}", + chan_id, consumer_id, reservation_id, err + ))); + } + }; + requeue_guard.mark_completed(reservation_id); + if !commit_outcome.first_commit { + return Err(MpscError::Internal(format!( + "broker commit returned duplicate first_commit=false: chan_id={} consumer_id={} reservation_id={}", + chan_id, consumer_id, reservation_id + ))); + } + + if let Some(envelope) = commit_outcome.cleanup { + spawn_broker_cleanup(broker.clone(), chan_id, delete_cb.clone(), envelope); + } + + Ok(ConsumedPayload { + producer_id, + payload, + nonblocking_hit: true, + }) +} + +struct BrokerBatchPayload { + producer_id: String, + payload: Box, +} + +struct BrokerInflightRequeueGuard { + broker: BrokerHandle, + chan_id: i64, + reservation_ids: Vec, +} + +impl BrokerInflightRequeueGuard { + fn new(broker: BrokerHandle, chan_id: i64, reservation_ids: Vec) -> Self { + Self { + broker, + chan_id, + reservation_ids, + } + } + + fn extend(&mut self, reservation_ids: I) + where + I: IntoIterator, + { + self.reservation_ids.extend(reservation_ids); + } + + fn mark_completed(&mut self, reservation_id: u64) { + if let Some(pos) = self + 
.reservation_ids + .iter() + .position(|current| *current == reservation_id) + { + self.reservation_ids.remove(pos); + } + } + + async fn requeue_now(&mut self) { + let reservation_ids = std::mem::take(&mut self.reservation_ids); + requeue_pending_broker_inflight(&self.broker, self.chan_id, reservation_ids).await; + } +} + +impl Drop for BrokerInflightRequeueGuard { + fn drop(&mut self) { + let reservation_ids = std::mem::take(&mut self.reservation_ids); + if reservation_ids.is_empty() { + return; + } + let broker = self.broker.clone(); + let chan_id = self.chan_id; + tokio::spawn(async move { + requeue_pending_broker_inflight(&broker, chan_id, reservation_ids).await; + }); + } +} + +async fn get_payload_batch_via_broker( + broker: &BrokerHandle, + chan_id: i64, + consumer_id: String, + batch_size: usize, + cb: PayloadCallback, + delete_cb: Option, + shutdown: ShutdownCtl, +) -> Result, MpscError> { + if batch_size == 0 { + return Ok(Vec::new()); + } + + let first = broker + .fetch_next(BrokerFetchRequest { + channel_id: chan_id, + consumer_id: consumer_id.clone(), + now_ms: now_ms(), + }) + .await + .map_err(|e| { + MpscError::Internal(format!( + "broker fetch failed: chan_id={} consumer_id={} err={}", + chan_id, consumer_id, e + )) + })? 
+ .ok_or(MpscError::NoMessage)?; + + let mut fetched = Vec::with_capacity(batch_size); + let mut requeue_guard = BrokerInflightRequeueGuard::new( + broker.clone(), + chan_id, + vec![first.envelope.reservation_id], + ); + fetched.push(first); + + let remaining = batch_size.saturating_sub(1); + if remaining > 0 { + let mut more = match broker + .fetch_batch_available( + BrokerFetchRequest { + channel_id: chan_id, + consumer_id: consumer_id.clone(), + now_ms: now_ms(), + }, + remaining, + ) + .await + { + Ok(batch) => { + requeue_guard.extend( + batch + .messages + .iter() + .map(|message| message.envelope.reservation_id), + ); + batch.messages + } + Err(err) => { + requeue_guard.requeue_now().await; + return Err(MpscError::Internal(format!( + "broker batch fetch failed: chan_id={} consumer_id={} err={}", + chan_id, consumer_id, err + ))); + } + }; + fetched.append(&mut more); + } + + match load_broker_payloads_commit_on_ready( + broker, + chan_id, + &consumer_id, + fetched, + cb, + delete_cb, + shutdown.clone(), + requeue_guard, + ) + .await + { + Ok(payloads) => Ok(payloads + .into_iter() + .map(|item| ConsumedPayload { + producer_id: item.producer_id, + payload: item.payload, + nonblocking_hit: true, + }) + .collect()), + Err(err) => Err(err), + } +} + +async fn load_broker_payloads_commit_on_ready( + broker: &BrokerHandle, + chan_id: i64, + consumer_id: &str, + fetched: Vec, + cb: PayloadCallback, + delete_cb: Option, + shutdown: ShutdownCtl, + mut requeue_guard: BrokerInflightRequeueGuard, +) -> Result, MpscError> { + let reservation_ids: Vec = fetched + .iter() + .map(|message| message.envelope.reservation_id) + .collect(); + let mut join_set = JoinSet::new(); + + for message in fetched { + let envelope = message.envelope; + let reservation_id = envelope.reservation_id; + let producer_id = envelope.producer_id.clone(); + let payload_key = envelope.payload_key.clone(); + let cb = cb.clone(); + let shutdown = shutdown.clone(); + join_set.spawn(async move { + let 
result = + run_payload_callback(chan_id, cb, producer_id.clone(), payload_key, shutdown) + .await + .map(|(payload, _kv_get_latency_ns)| BrokerBatchPayload { + producer_id, + payload, + }); + (reservation_id, result) + }); + } + + let mut payload_results: HashMap> = + HashMap::with_capacity(reservation_ids.len()); + let mut batch_load_failure: Option = None; + while let Some(join_res) = join_set.join_next().await { + match join_res { + Ok((reservation_id, Ok(payload))) => { + payload_results.insert(reservation_id, Ok(payload)); + } + Ok((reservation_id, Err(err))) => { + payload_results.insert(reservation_id, Err(err)); + join_set.abort_all(); + break; + } + Err(err) => { + join_set.abort_all(); + batch_load_failure = Some(MpscError::JoinError(err)); + break; + } + } + } + + let mut committed_payloads = Vec::with_capacity(reservation_ids.len()); + let mut remaining_reservation_ids = Vec::new(); + let mut stop_error = batch_load_failure; + let mut stop_after_current = stop_error.is_some(); + + for reservation_id in reservation_ids { + if stop_after_current { + remaining_reservation_ids.push(reservation_id); + continue; + } + + let Some(payload_result) = payload_results.remove(&reservation_id) else { + stop_error = Some(MpscError::Internal(format!( + "broker batch payload load canceled before ordered commit: chan_id={} consumer_id={} reservation_id={}", + chan_id, consumer_id, reservation_id + ))); + stop_after_current = true; + remaining_reservation_ids.push(reservation_id); + continue; + }; + + let payload = match payload_result { + Ok(payload) => payload, + Err(err) => { + stop_error = Some(err); + stop_after_current = true; + remaining_reservation_ids.push(reservation_id); + continue; + } + }; + + let commit_outcome = match broker.commit(chan_id, reservation_id, now_ms()).await { + Ok(outcome) => outcome, + Err(err) => { + stop_error = Some(MpscError::Internal(format!( + "broker commit failed during batch consume: chan_id={} consumer_id={} reservation_id={} 
err={}", + chan_id, consumer_id, reservation_id, err + ))); + stop_after_current = true; + remaining_reservation_ids.push(reservation_id); + continue; + } + }; + requeue_guard.mark_completed(reservation_id); + if !commit_outcome.first_commit { + stop_error = Some(MpscError::Internal(format!( + "broker commit returned duplicate during batch consume: chan_id={} consumer_id={} reservation_id={}", + chan_id, consumer_id, reservation_id + ))); + stop_after_current = true; + remaining_reservation_ids.push(reservation_id); + continue; + } + if let Some(envelope) = commit_outcome.cleanup { + spawn_broker_cleanup(broker.clone(), chan_id, delete_cb.clone(), envelope); + } + + committed_payloads.push(payload); + } + + if !remaining_reservation_ids.is_empty() { + requeue_guard.requeue_now().await; + } + + if !committed_payloads.is_empty() { + return Ok(committed_payloads); + } + + Err(stop_error.unwrap_or_else(|| { + MpscError::Internal(format!( + "broker batch consume stopped without committed payloads: chan_id={} consumer_id={}", + chan_id, consumer_id + )) + })) +} + +async fn run_payload_callback( + chan_id: i64, + cb: PayloadCallback, + producer_id: String, + payload_key: String, + shutdown: ShutdownCtl, +) -> Result<(Box, u128), MpscError> { + use tokio::time::sleep; + + let kv_get_begin = Instant::now(); + loop { + if shutdown.is_closed() { + return Err(MpscError::Closed); + } + let f = cb.clone(); + let producer_for_closure = producer_id.clone(); + let key_for_closure = payload_key.clone(); + let res = (f)(producer_for_closure, key_for_closure).await; + + match res { + PayloadResult::Ok(payload) => { + return Ok((payload, kv_get_begin.elapsed().as_nanos())); + } + PayloadResult::Retryable(msg) => { + warn!( + "[MpscConsumer chan_id={}] get payload retryable: {}", + chan_id, msg + ); + sleep(Duration::from_millis(50)).await; + } + PayloadResult::NonRetryable(msg) => { + return Err(MpscError::GetPayloadNonRetryable { message: msg }); + } + } + } +} + +async fn 
run_delete_callback_until_success( + chan_id: i64, + delete_cb: &DeleteCallback, + payload_key: String, +) { + use tokio::time::sleep; + + let mut retry_sleep = BROKER_CLEANUP_DELETE_RETRY_INITIAL_SLEEP; + loop { + let f = delete_cb.clone(); + let key_clone = payload_key.clone(); + let delete_begin = Instant::now(); + let delete_fut = (f)(key_clone.clone()); + tokio::pin!(delete_fut); + let res = loop { + tokio::select! { + res = &mut delete_fut => { + break res; + } + _ = sleep(DELETE_CALLBACK_WARN_INTERVAL) => { + warn!( + "[MpscConsumer chan_id={}] async broker delete callback still pending: key={} waited_ms={}", + chan_id, + key_clone, + delete_begin.elapsed().as_millis(), + ); + } + } + }; + match res { + DeleteResult::Ok => return, + DeleteResult::Retryable(msg) => { + warn!( + "[MpscConsumer chan_id={}] async broker delete payload retryable; retry_after_ms={}: {}", + chan_id, + retry_sleep.as_millis(), + msg + ); + } + DeleteResult::NonRetryable(msg) => { + warn!( + "[MpscConsumer chan_id={}] async broker delete payload non-retryable; keep retrying to preserve broker byte budget; retry_after_ms={}: {}", + chan_id, + retry_sleep.as_millis(), + msg + ); + } + } + sleep(retry_sleep).await; + retry_sleep = retry_sleep + .saturating_mul(2) + .min(BROKER_CLEANUP_DELETE_RETRY_MAX_SLEEP); + } +} + +async fn run_broker_cleanup_ack_until_success( + broker: BrokerHandle, + chan_id: i64, + reservation_id: u64, +) { + use tokio::time::sleep; + + let mut retry_sleep = BROKER_CLEANUP_ACK_RETRY_INITIAL_SLEEP; + loop { + match broker.cleanup_ack(chan_id, reservation_id).await { + Ok(()) => return, + Err(err) => { + if broker_cleanup_ack_error_is_terminal(&err) { + warn!( + "async broker cleanup ack stopped after terminal broker error: chan_id={} reservation_id={} err={}", + chan_id, + reservation_id, + err + ); + return; + } + warn!( + "async broker cleanup ack failed; retry_after_ms={}: chan_id={} reservation_id={} err={}", + retry_sleep.as_millis(), + chan_id, + 
reservation_id, + err + ); + } + } + sleep(retry_sleep).await; + retry_sleep = retry_sleep + .saturating_mul(2) + .min(BROKER_CLEANUP_ACK_RETRY_MAX_SLEEP); + } +} + +fn broker_cleanup_ack_error_is_terminal(err: &crate::BrokerError) -> bool { + match err { + crate::BrokerError::ActorClosed | crate::BrokerError::ChannelNotFound(_) => true, + crate::BrokerError::Rpc(message) => { + message.contains("System shutdown") + || message.contains("actor closed") + || message.contains("channel not found") + } + _ => false, + } +} + +fn spawn_broker_cleanup( + broker: BrokerHandle, + chan_id: i64, + delete_cb: Option, + envelope: BrokerEnvelope, +) { + tokio::spawn(async move { + let reservation_id = envelope.reservation_id; + if let Some(delete_cb) = delete_cb.as_ref() { + run_delete_callback_until_success(chan_id, delete_cb, envelope.payload_key.clone()) + .await; + } + run_broker_cleanup_ack_until_success(broker, chan_id, reservation_id).await; + }); +} + +async fn requeue_pending_broker_inflight( + broker: &BrokerHandle, + chan_id: i64, + reservation_ids: Vec, +) { + if reservation_ids.is_empty() { + return; + } + if let Err(err) = broker + .requeue_inflight_batch(chan_id, reservation_ids) + .await + { + warn!( + "best-effort broker batch requeue failed: chan_id={} err={}", + chan_id, err + ); + } +} + +fn now_ms() -> i64 { + SystemTime::now() + .duration_since(UNIX_EPOCH) + .expect("system clock is before UNIX_EPOCH") + .as_millis() as i64 +} + /// MPSC consumer actor,持有 selector、offset、lease 等完整状态。 /// 仅在 mpsc 模块内部可见,对上层 crate 透明。 pub struct ConsumedPayload { @@ -2454,9 +3068,10 @@ struct ConsumerActor { producer_selector: ProducerSelectorForConsumer, /// payload 回调,由上层通过 ConsumerCmd::SetCallback 设置. 
     payload_cb: Option<PayloadCallback>,
-    /// Per-producer hint for the "next offset" that has been prefetched but whose
-    /// consumption is not yet persisted; it avoids prefetching the same message
-    /// again before the etcd consume offset is updated.
+    /// Per-producer local reservation cursor (the next offset to prefetch).
+    ///
+    /// This cursor may run ahead of the etcd consume offset, because the actor
+    /// issues several prefetches in a row before the consume offset is persisted.
     prefetch_offset_map: HashMap<String, i64>,
     /// Locally cached produce offsets (from etcd), refreshed only when there is no
     /// message or during initialization; otherwise select_next_message only reads this cache.
@@ -2479,7 +3094,7 @@ struct ConsumerActor {
     /// Sender of the prefetch queue exposed to the consumer.
     ///
     /// Each queue element is the JoinHandle of one complete get operation.
-    inflight_tx: mpsc::Sender<InflightItem>,
+    inflight_queue: Arc<Mutex<VecDeque<InflightItem>>>,
     /// inflight consume notify
     inflight_consume_notify: Arc<Notify>,
     /// Shared prefetch window target.
@@ -2597,10 +3212,12 @@ impl ConsumerActor {
     }
 
     fn cached_next_hint(&self, producer_id: &str) -> i64 {
+        let committed_next = self.cached_consume_offset(producer_id);
         self.prefetch_offset_map
             .get(producer_id)
             .copied()
-            .unwrap_or_else(|| self.cached_consume_offset(producer_id))
+            .map(|hint| hint.max(committed_next))
+            .unwrap_or(committed_next)
     }
 
     fn cached_produce_offset(&self, producer_id: &str) -> i64 {
@@ -2616,6 +3233,12 @@ impl ConsumerActor {
             || self.prefetch_offset_map.contains_key(producer_id)
     }
 
+    fn producer_has_prefetch_room(&self, producer_id: &str) -> bool {
+        let visible_tail = self.cached_produce_offset(producer_id);
+        let next_hint = self.cached_next_hint(producer_id);
+        next_hint <= visible_tail
+    }
+
     fn refresh_ready_state_from_local(&mut self, producer_id: &str) -> bool {
         let ready_before = self.ready_producers.contains(producer_id);
         let stale_before = self.stale_no_room_producers.contains(producer_id);
@@ -2626,8 +3249,7 @@ impl ConsumerActor {
             return ready_before || stale_before;
         }
 
-        let has_room =
-            self.cached_produce_offset(producer_id) >= self.cached_next_hint(producer_id);
+        let has_room = self.producer_has_prefetch_room(producer_id);
         if has_room {
             self.ready_producers.insert(producer_id.to_string());
             self.stale_no_room_producers.remove(producer_id);
@@ -2878,7 +3500,7 @@ impl ConsumerActor {
global_lease_id: i64, ) -> ( mpsc::Sender, - mpsc::Receiver, + Arc>>, Arc, Arc, Arc, @@ -2889,7 +3511,7 @@ impl ConsumerActor { let (cmd_tx, cmd_rx) = mpsc::channel(8); let (meta_tx, meta_rx) = mpsc::channel(8); let (produce_offset_tx, produce_offset_rx) = mpsc::channel(128); - let (inflight_tx, inflight_rx) = mpsc::channel(32); + let inflight_queue = Arc::new(Mutex::new(VecDeque::new())); let target_inflight = Arc::new(AtomicUsize::new(0)); let inflight_queue_size = Arc::new(AtomicUsize::new(0)); let inflight_consume_notify = Arc::new(Notify::new()); @@ -2911,7 +3533,7 @@ impl ConsumerActor { ready_producers: HashSet::new(), ready_trace_history: HashMap::new(), stale_no_room_producers: HashSet::new(), - inflight_tx, + inflight_queue: inflight_queue.clone(), inflight_consume_notify: inflight_consume_notify.clone(), target_inflight: target_inflight.clone(), inflight_queue_size: inflight_queue_size.clone(), @@ -2960,7 +3582,7 @@ impl ConsumerActor { ( cmd_tx, - inflight_rx, + inflight_queue, target_inflight, inflight_queue_size, inflight_consume_notify, @@ -3118,7 +3740,7 @@ impl ConsumerActor { } // Do not poll `prefetch_tick()` as a `tokio::select!` branch. If the - // branch is canceled while `inflight_tx.send(...)` is pending, the + // branch is canceled while queueing a new inflight item is pending, the // oneshot receiver inside `InflightItem` is dropped after the // prefetch job has already started, which strands commit ordering. 
self.drain_pending_actor_inputs(&mut rx, &mut meta_rx, &mut produce_offset_rx); @@ -3163,22 +3785,35 @@ impl ConsumerActor { return; } - for _ in 0..1 { + let initial_queue_size = self.inflight_queue_size.load(Ordering::SeqCst); + let burst_limit = prefetch_refill_launch_budget(target, initial_queue_size); + let mut launched = 0usize; + loop { let current = self.inflight_queue_size.load(Ordering::SeqCst); if current >= target { - self.wait_actor_inputs_or_inflight_consume(rx, meta_rx, produce_offset_rx) - .await; + if launched == 0 { + self.wait_actor_inputs_or_inflight_consume(rx, meta_rx, produce_offset_rx) + .await; + } + return; + } + if launched >= burst_limit { return; } match self.try_prefetch_one().await { Ok(()) => { + launched += 1; self.prefetch_no_message_next_warn_at = tokio::time::Instant::now() + NO_MESSAGE_WARN_INTERVAL; self.maybe_log_select_next_message_stats(false); } Err(MpscError::NoMessage) => { self.select_next_message_stats.record_no_message_backoff(); + if launched > 0 { + self.maybe_log_select_next_message_stats(false); + return; + } let now = tokio::time::Instant::now(); if now >= self.prefetch_no_message_next_warn_at { let parent_mpmc_id = match self.category { @@ -3195,7 +3830,13 @@ impl ConsumerActor { self.prefetch_no_message_next_warn_at = now + NO_MESSAGE_WARN_INTERVAL; } self.maybe_log_select_next_message_stats(false); - self.wait_actor_inputs(rx, meta_rx, produce_offset_rx).await; + self.wait_actor_inputs_or_timeout( + rx, + meta_rx, + produce_offset_rx, + prefetch_no_message_retry_sleep(current), + ) + .await; return; } Err(other) => { @@ -3213,6 +3854,7 @@ impl ConsumerActor { Duration::from_millis(100), ) .await; + return; } } } @@ -3223,7 +3865,7 @@ impl ConsumerActor { /// 返回 `MpscError::NoMessage`。 async fn try_prefetch_one(&mut self) -> Result<(), MpscError> { if self.shutdown.is_closed() { - return Err(MpscError::Internal("consumer closed".to_string())); + return Err(MpscError::Closed); } let cb = self .payload_cb @@ 
-3305,16 +3947,17 @@ impl ConsumerActor { queue_size_after_inc, self.target_inflight.load(Ordering::SeqCst), ); - self.inflight_tx - .send(InflightItem { + self.inflight_queue + .lock() + .expect("inflight queue mutex poisoned") + .push_back(InflightItem { seq, producer_id: producer_id_for_queue, consume_offset, ready_path_trace, rx, - }) - .await - .map_err(|_| MpscError::Internal("prefetch queue closed".to_string()))?; + }); + self.inflight_consume_notify.notify_one(); debug!( "[MpscConsumer enqueue] instance_id={} chan_id={} seq={} queue_send_completed queue_size_now={}", self.instance_id, @@ -3445,31 +4088,66 @@ impl ConsumerActor { return Err(MpscError::NoMessage); } - self.producer_selector.moveon_round_robin(); - let producer_id = self - .producer_selector - .current_producer_idx() - .ok_or(MpscError::NoMessage)? - .to_string(); + let ready_count = self.ready_producers.len(); + for _ in 0..ready_count { + self.producer_selector.moveon_round_robin(); + let producer_id = self + .producer_selector + .current_producer_idx() + .ok_or(MpscError::NoMessage)? 
+ .to_string(); - let prod_off = self.cached_produce_offset(&producer_id); - let next_hint = self.cached_next_hint(&producer_id); + let next_hint = self.cached_next_hint(&producer_id); - if prod_off < next_hint { + if !self.producer_has_prefetch_room(&producer_id) { + if self.refresh_ready_state_from_local(&producer_id) { + self.rebuild_ready_selector(); + } + continue; + } + + let actual_offset = next_hint; + self.prefetch_offset_map + .insert(producer_id.clone(), actual_offset + 1); if self.refresh_ready_state_from_local(&producer_id) { self.rebuild_ready_selector(); } - return Err(MpscError::NoMessage); + + return Ok((producer_id, actual_offset)); } - let actual_offset = next_hint; - self.prefetch_offset_map - .insert(producer_id.clone(), actual_offset + 1); - if self.refresh_ready_state_from_local(&producer_id) { - self.rebuild_ready_selector(); + if !self.stale_no_room_producers.is_empty() { + self.probe_stale_no_room_producers_timed(trace).await?; + if !self.ready_producers.is_empty() { + let retry_ready_count = self.ready_producers.len(); + for _ in 0..retry_ready_count { + self.producer_selector.moveon_round_robin(); + let producer_id = self + .producer_selector + .current_producer_idx() + .ok_or(MpscError::NoMessage)? 
+ .to_string(); + + let next_hint = self.cached_next_hint(&producer_id); + if !self.producer_has_prefetch_room(&producer_id) { + if self.refresh_ready_state_from_local(&producer_id) { + self.rebuild_ready_selector(); + } + continue; + } + + let actual_offset = next_hint; + self.prefetch_offset_map + .insert(producer_id.clone(), actual_offset + 1); + if self.refresh_ready_state_from_local(&producer_id) { + self.rebuild_ready_selector(); + } + return Ok((producer_id, actual_offset)); + } + } } - Ok((producer_id, actual_offset)) + Err(MpscError::NoMessage) } async fn refresh_offsets_from_etcd_timed( @@ -3623,8 +4301,22 @@ impl ConsumerActor { #[cfg(test)] mod tests { - use super::{merge_monotonic_offset, merge_offset_cache_monotonic}; + use super::{ + get_payload_batch_via_broker, get_payload_via_broker, merge_monotonic_offset, + merge_offset_cache_monotonic, MqPayload, PayloadCallback, PayloadResult, + }; + use crate::{ + keys::MqCategory, BrokerChannelConfig, BrokerFetchRequest, BrokerHandle, + BrokerReserveRequest, + }; use std::collections::HashMap; + use std::sync::Arc; + use std::time::Duration; + use tokio::sync::Notify; + + struct TestPayload; + + impl MqPayload for TestPayload {} #[test] fn merge_monotonic_offset_keeps_cached_when_probe_missing() { @@ -3654,6 +4346,276 @@ mod tests { assert_eq!(current.get("producer_b"), Some(&41)); assert_eq!(current.get("producer_c"), Some(&7)); } + + #[test] + fn visible_tail_does_not_allow_prefetch_past_last_published_offset() { + let visible_tail = 0; + let next_visible = 0; + let next_not_yet_published = 1; + + assert!(next_visible <= visible_tail); + assert!(next_not_yet_published > visible_tail); + } + + async fn fetch_next_for_test( + broker: &BrokerHandle, + channel_id: i64, + consumer_id: &str, + now_ms: i64, + ) -> crate::BrokerFetchedMessage { + tokio::time::timeout( + Duration::from_secs(1), + broker.fetch_next(BrokerFetchRequest { + channel_id, + consumer_id: consumer_id.to_string(), + now_ms, + }), + ) + 
.await + .expect("timed out waiting for broker redelivery") + .unwrap() + .unwrap() + } + + #[tokio::test] + async fn broker_single_consume_timeout_requeues_reserved_message() { + let broker = BrokerHandle::new_local_for_test(32); + broker + .upsert_channel(BrokerChannelConfig { + channel_id: 72, + capacity: 2, + }) + .await + .unwrap(); + + let reserved = broker + .reserve(BrokerReserveRequest { + channel_id: 72, + producer_id: "p0".to_string(), + category: MqCategory::Mpsc, + payload_bytes: 1, + now_ms: 10, + }) + .await + .unwrap(); + broker + .publish(72, reserved.envelope.reservation_id, 20) + .await + .unwrap(); + + let callback_started = Arc::new(Notify::new()); + let cb_started_for_callback = callback_started.clone(); + let cb: PayloadCallback = Arc::new(move |_producer_id: String, _key: String| { + let cb_started_for_callback = cb_started_for_callback.clone(); + Box::pin(async move { + cb_started_for_callback.notify_one(); + tokio::time::sleep(Duration::from_millis(50)).await; + PayloadResult::Ok(Box::new(TestPayload)) + }) + }); + + let mut consume = Box::pin(get_payload_via_broker( + &broker, + 72, + "c0".to_string(), + cb, + None, + crate::ShutdownCtl::new(), + )); + tokio::select! 
{ + _ = callback_started.notified() => {} + result = &mut consume => panic!("consume completed before timeout setup: {:?}", result.err()), + } + assert!(tokio::time::timeout(Duration::from_millis(5), &mut consume) + .await + .is_err()); + drop(consume); + + let redelivered = fetch_next_for_test(&broker, 72, "c1", 30).await; + assert_eq!( + redelivered.envelope.reservation_id, + reserved.envelope.reservation_id + ); + } + + #[tokio::test] + async fn broker_batch_consume_timeout_requeues_reserved_messages_in_order() { + let broker = BrokerHandle::new_local_for_test(32); + broker + .upsert_channel(BrokerChannelConfig { + channel_id: 73, + capacity: 2, + }) + .await + .unwrap(); + + let first = broker + .reserve(BrokerReserveRequest { + channel_id: 73, + producer_id: "p0".to_string(), + category: MqCategory::Mpsc, + payload_bytes: 1, + now_ms: 10, + }) + .await + .unwrap(); + let second = broker + .reserve(BrokerReserveRequest { + channel_id: 73, + producer_id: "p0".to_string(), + category: MqCategory::Mpsc, + payload_bytes: 1, + now_ms: 11, + }) + .await + .unwrap(); + broker + .publish(73, first.envelope.reservation_id, 20) + .await + .unwrap(); + broker + .publish(73, second.envelope.reservation_id, 21) + .await + .unwrap(); + + let callback_started = Arc::new(Notify::new()); + let cb_started_for_callback = callback_started.clone(); + let cb: PayloadCallback = Arc::new(move |_producer_id: String, _key: String| { + let cb_started_for_callback = cb_started_for_callback.clone(); + Box::pin(async move { + cb_started_for_callback.notify_one(); + tokio::time::sleep(Duration::from_millis(50)).await; + PayloadResult::Ok(Box::new(TestPayload)) + }) + }); + + let mut consume = Box::pin(get_payload_batch_via_broker( + &broker, + 73, + "c0".to_string(), + 2, + cb, + None, + crate::ShutdownCtl::new(), + )); + tokio::select! 
{ + _ = callback_started.notified() => {} + result = &mut consume => panic!("batch consume completed before timeout setup: {:?}", result.err()), + } + assert!(tokio::time::timeout(Duration::from_millis(5), &mut consume) + .await + .is_err()); + drop(consume); + + let redelivered_first = fetch_next_for_test(&broker, 73, "c1", 30).await; + let redelivered_second = fetch_next_for_test(&broker, 73, "c1", 31).await; + assert_eq!( + redelivered_first.envelope.reservation_id, + first.envelope.reservation_id + ); + assert_eq!( + redelivered_second.envelope.reservation_id, + second.envelope.reservation_id + ); + } + + #[tokio::test] + async fn broker_batch_consume_requeues_without_out_of_order_commit() { + let broker = BrokerHandle::new_local_for_test(32); + broker + .upsert_channel(BrokerChannelConfig { + channel_id: 71, + capacity: 2, + }) + .await + .unwrap(); + + let first = broker + .reserve(BrokerReserveRequest { + channel_id: 71, + producer_id: "p0".to_string(), + category: MqCategory::Mpsc, + payload_bytes: 1, + now_ms: 10, + }) + .await + .unwrap(); + let second = broker + .reserve(BrokerReserveRequest { + channel_id: 71, + producer_id: "p0".to_string(), + category: MqCategory::Mpsc, + payload_bytes: 1, + now_ms: 11, + }) + .await + .unwrap(); + broker + .publish(71, first.envelope.reservation_id, 20) + .await + .unwrap(); + broker + .publish(71, second.envelope.reservation_id, 21) + .await + .unwrap(); + + let first_key = first.envelope.payload_key.clone(); + let cb: PayloadCallback = Arc::new(move |_producer_id: String, key: String| { + let first_key = first_key.clone(); + Box::pin(async move { + if key == first_key { + tokio::time::sleep(Duration::from_millis(50)).await; + PayloadResult::NonRetryable("first payload failed".to_string()) + } else { + PayloadResult::Ok(Box::new(TestPayload)) + } + }) + }); + + let err = get_payload_batch_via_broker( + &broker, + 71, + "c0".to_string(), + 2, + cb, + None, + crate::ShutdownCtl::new(), + ) + .await + .err() + 
.expect("batch consume should fail when the first payload callback fails"); + assert!(matches!( + err, + crate::MpscError::GetPayloadNonRetryable { .. } + )); + + let redelivered_first = broker + .fetch_next(crate::BrokerFetchRequest { + channel_id: 71, + consumer_id: "c1".to_string(), + now_ms: 30, + }) + .await + .unwrap() + .unwrap(); + let redelivered_second = broker + .fetch_next(crate::BrokerFetchRequest { + channel_id: 71, + consumer_id: "c1".to_string(), + now_ms: 31, + }) + .await + .unwrap() + .unwrap(); + assert_eq!( + redelivered_first.envelope.reservation_id, + first.envelope.reservation_id + ); + assert_eq!( + redelivered_second.envelope.reservation_id, + second.envelope.reservation_id + ); + } } /// Producer selector for consumer-side weighted round robin. diff --git a/fluxon_rs/fluxon_mq/src/create.rs b/fluxon_rs/fluxon_mq/src/create.rs index 4fbb753..79da7ce 100644 --- a/fluxon_rs/fluxon_mq/src/create.rs +++ b/fluxon_rs/fluxon_mq/src/create.rs @@ -311,6 +311,7 @@ pub async fn create_mpsc_channel( global_lease: global_lease_handle, global_long_lease: global_long_lease_handle, payload_lease: payload_lease_handle, + capacity: cfg.capacity, etcd_client, }) } @@ -534,6 +535,7 @@ impl ChanManager { global_lease, global_long_lease, payload_lease, + capacity: meta.capacity, etcd_client: client, }) } diff --git a/fluxon_rs/fluxon_mq/src/error.rs b/fluxon_rs/fluxon_mq/src/error.rs index b4f1171..9d25d39 100755 --- a/fluxon_rs/fluxon_mq/src/error.rs +++ b/fluxon_rs/fluxon_mq/src/error.rs @@ -12,12 +12,24 @@ pub enum MpscError { #[error("no new message available")] NoMessage, + #[error("consumer is closed")] + Closed, + #[error("etcd error: {0}")] Etcd(#[from] etcd_client::Error), #[error("spawn blocking task failed: {0}")] JoinError(#[from] tokio::task::JoinError), + #[error( + "message buffer full: channel_id={channel_id} capacity={capacity} used_slots={used_slots}" + )] + MessageBufferFull { + channel_id: i64, + capacity: i64, + used_slots: i64, + }, + 
     #[error("put payload returned non-retryable error (code=2)")]
     PutPayloadNonRetryable,
 
@@ -61,10 +73,12 @@ impl MpscError {
         match self {
             // Retryable
             MpscError::NoMessage => 1000,
+            MpscError::Closed => 1001,
 
             // etcd / system
             MpscError::Etcd(_) => 2000,
             MpscError::JoinError(_) => 2001,
+            MpscError::MessageBufferFull { .. } => 2002,
 
             // put payload
             MpscError::PutPayloadNonRetryable => 3000,
diff --git a/fluxon_rs/fluxon_mq/src/keys.rs b/fluxon_rs/fluxon_mq/src/keys.rs
index 1d55754..e2c8a4e 100644
--- a/fluxon_rs/fluxon_mq/src/keys.rs
+++ b/fluxon_rs/fluxon_mq/src/keys.rs
@@ -1,9 +1,13 @@
 use std::fmt::Write as _;
 
+use bitcode::{Decode, Encode};
+use serde::{Deserialize, Serialize};
+
 /// MQ category for key generation.
-#[derive(Clone, Copy, Debug)]
+#[derive(Clone, Copy, Debug, Default, Serialize, Deserialize, Encode, Decode)]
 pub enum MqCategory {
     /// Standalone MPSC usage
+    #[default]
     Mpsc,
     /// MPSC acts as a submodule under an MPMC producer; carries parent mpmc id only.
     /// The producer member id is the same as `producer_idx` passed alongside and
diff --git a/fluxon_rs/fluxon_mq/src/lib.rs b/fluxon_rs/fluxon_mq/src/lib.rs
index 3dded48..70b2024 100644
--- a/fluxon_rs/fluxon_mq/src/lib.rs
+++ b/fluxon_rs/fluxon_mq/src/lib.rs
@@ -1,3 +1,4 @@
+pub mod broker;
 pub mod consumer;
 pub mod create;
 pub mod error;
@@ -10,6 +11,7 @@ pub mod nonblocking_monitor;
 pub mod producer;
 pub mod shutdown;
 
+pub use crate::broker::*;
 pub use crate::consumer::DeleteResult;
 pub use crate::consumer::MpscConsumer;
 pub use crate::create::{create_mpsc_channel, ChanCreateConfig};
diff --git a/fluxon_rs/fluxon_mq/src/manager.rs b/fluxon_rs/fluxon_mq/src/manager.rs
index b6d581a..fb5ffdb 100644
--- a/fluxon_rs/fluxon_mq/src/manager.rs
+++ b/fluxon_rs/fluxon_mq/src/manager.rs
@@ -206,6 +206,7 @@ pub struct ChanManager {
     /// determines the payload lease id and registers the corresponding
     /// kvclient keepalive through LeaseManager; a valid handle is always held here.
     pub payload_lease: GeneralLease,
+    pub(crate) capacity: i64,
     pub(crate) etcd_client: etcd::Client,
 }
 
@@ -227,4 +228,8 @@ impl ChanManager {
     pub fn member_lease_id(&self) -> i64 {
         self.member_lease.id() as i64
     }
+
+    pub fn capacity(&self) -> i64 {
+        self.capacity
+    }
 }
diff --git a/fluxon_rs/fluxon_mq/src/producer.rs b/fluxon_rs/fluxon_mq/src/producer.rs
index fb4e4ea..72082c2 100644
--- a/fluxon_rs/fluxon_mq/src/producer.rs
+++ b/fluxon_rs/fluxon_mq/src/producer.rs
@@ -28,10 +28,15 @@ use crate::nonblocking_monitor::{
 };
 use crate::shutdown::ShutdownCtl;
 use crate::LifecycleView;
+use crate::{BrokerError, BrokerHandle, BrokerReserveRequest};
 use tokio::sync::watch;
 use tracing::warn;
 
 const PRODUCE_OFFSET_ETCD_SLOW_WARN_THRESHOLD: Duration = Duration::from_secs(1);
+const BROKER_BACKPRESSURE_INITIAL_SLEEP_MS: u64 = 2;
+const BROKER_BACKPRESSURE_MAX_SLEEP_MS: u64 = 50;
+const BROKER_BACKPRESSURE_JITTER_MS: u64 = 7;
+const BROKER_BACKPRESSURE_WARN_INTERVAL: Duration = Duration::from_secs(5);
 
 #[derive(Debug, Clone, Serialize, Deserialize)]
 struct ProducerMemberMeta {
@@ -266,6 +271,10 @@ impl MpscProducer {
         self.chan_mgr.payload_lease.id() as i64
     }
 
+    pub fn channel_capacity(&self) -> i64 {
+        self.chan_mgr.capacity()
+    }
+
     /// Shared shutdown controller for this producer instance.
     pub fn shutdown_ctl(&self) -> ShutdownCtl {
         self.shutdown.clone()
@@ -420,9 +429,7 @@ impl MpscProducer {
         let put_payload = Arc::new(put_payload);
         loop {
             if self.shutdown.is_closed() {
-                return Err(MpscError::Internal(
-                    "producer closed during put_with_payload".to_string(),
-                ));
+                return Err(MpscError::Closed);
             }
             let key_clone = msg_key.clone();
             let f = put_payload.clone();
@@ -479,6 +486,243 @@ impl MpscProducer {
         }
         Ok(())
     }
+
+    /// Broker-backed put path.
+    ///
+    /// This keeps the existing payload callback contract but moves
+    /// message id allocation and publish visibility into the broker.
+    /// The current etcd-backed `put_with_payload` remains untouched
+    /// until call sites are switched to this path.
+    pub async fn put_with_payload_via_broker<F>(
+        &mut self,
+        broker: &BrokerHandle,
+        payload_bytes: u64,
+        put_payload: F,
+    ) -> Result<(), MpscError>
+    where
+        F: Fn(String, i64, Option) -> i32 + Send + Sync + 'static,
+    {
+        let preferred_sub_cluster_for_call = self.preferred_sub_cluster_for_put()?;
+        let published_msg_id = put_payload_via_broker(
+            broker,
+            self.chan_id,
+            &self.producer_idx,
+            self.category,
+            payload_bytes,
+            self.shutdown.clone(),
+            preferred_sub_cluster_for_call,
+            put_payload,
+        )
+        .await?;
+        self.next_msg_id = self.next_msg_id.max(published_msg_id + 1);
+        Ok(())
+    }
+}
+
+async fn put_payload_via_broker<F>(
+    broker: &BrokerHandle,
+    chan_id: i64,
+    producer_idx: &str,
+    category: MqCategory,
+    payload_bytes: u64,
+    shutdown: ShutdownCtl,
+    preferred_sub_cluster_for_call: Option,
+    put_payload: F,
+) -> Result<i64, MpscError>
+where
+    F: Fn(String, i64, Option) -> i32 + Send + Sync + 'static,
+{
+    use limit_thirdparty::tokio::task;
+    use tokio::time::sleep;
+
+    let put_payload = Arc::new(put_payload);
+    let reserve_wait_begin = Instant::now();
+    let mut reserve_retry_attempt: u32 = 0;
+    let mut payload_retry_attempt: u32 = 0;
+    let mut next_reserve_warn_at = Instant::now() + BROKER_BACKPRESSURE_WARN_INTERVAL;
+    let mut next_payload_warn_at = Instant::now() + BROKER_BACKPRESSURE_WARN_INTERVAL;
+
+    loop {
+        if shutdown.is_closed() {
+            return Err(MpscError::Closed);
+        }
+
+        let reservation = match broker
+            .reserve(BrokerReserveRequest {
+                channel_id: chan_id,
+                producer_id: producer_idx.to_string(),
+                category,
+                payload_bytes,
+                now_ms: broker_now_ms(),
+            })
+            .await
+        {
+            Ok(reservation) => {
+                reserve_retry_attempt = 0;
+                reservation
+            }
+            Err(BrokerError::ChannelFull {
+                channel_id,
+                capacity,
+                used_slots,
+            }) => {
+                let now = Instant::now();
+                if now >= next_reserve_warn_at {
+                    warn!(
+                        "broker reserve backpressured: chan_id={} producer_idx={} capacity={} used_slots={} waited_ms={}",
+                        channel_id,
+                        producer_idx,
+                        capacity,
+                        used_slots,
+                        reserve_wait_begin.elapsed().as_millis(),
+                    );
+                    next_reserve_warn_at = now + BROKER_BACKPRESSURE_WARN_INTERVAL;
+                }
+                let sleep_for =
+                    broker_backpressure_sleep_duration(producer_idx, reserve_retry_attempt);
+                reserve_retry_attempt = reserve_retry_attempt.saturating_add(1);
+                sleep(sleep_for).await;
+                continue;
+            }
+            Err(BrokerError::PayloadBytesFull {
+                capacity_bytes,
+                used_bytes,
+                requested_bytes,
+            }) => {
+                let now = Instant::now();
+                if now >= next_reserve_warn_at {
+                    warn!(
+                        "broker payload budget backpressured: chan_id={} producer_idx={} requested_bytes={} capacity_bytes={} used_bytes={} waited_ms={}",
+                        chan_id,
+                        producer_idx,
+                        requested_bytes,
+                        capacity_bytes,
+                        used_bytes,
+                        reserve_wait_begin.elapsed().as_millis(),
+                    );
+                    next_reserve_warn_at = now + BROKER_BACKPRESSURE_WARN_INTERVAL;
+                }
+                let sleep_for =
+                    broker_backpressure_sleep_duration(producer_idx, reserve_retry_attempt);
+                reserve_retry_attempt = reserve_retry_attempt.saturating_add(1);
+                sleep(sleep_for).await;
+                continue;
+            }
+            Err(BrokerError::PayloadTooLarge {
+                requested_bytes,
+                capacity_bytes,
+            }) => {
+                return Err(MpscError::Internal(format!(
+                    "broker payload too large: chan_id={} producer_idx={} requested_bytes={} capacity_bytes={}",
+                    chan_id, producer_idx, requested_bytes, capacity_bytes
+                )));
+            }
+            Err(other) => {
+                return Err(MpscError::Internal(format!(
+                    "broker reserve failed: chan_id={} producer_idx={} err={}",
+                    chan_id, producer_idx, other
+                )));
+            }
+        };
+        let reservation_id = reservation.envelope.reservation_id;
+        let msg_id = reservation.envelope.msg_id;
+        let msg_key = reservation.envelope.payload_key.clone();
+
+        let key_clone = msg_key.clone();
+        let f = put_payload.clone();
+        let hint = preferred_sub_cluster_for_call.clone();
+        let code = task::spawn_blocking(move || (f)(key_clone, msg_id, hint))
+            .await
+            .map_err(|e| {
+                abort_on_payload_failure_async(broker.clone(), chan_id, reservation_id);
+                MpscError::JoinError(e)
+            })?;
+
+        match code {
+            0 => {
+                broker
+                    .publish(chan_id, reservation_id, broker_now_ms())
+                    .await
+                    .map_err(|e| {
+                        MpscError::Internal(format!(
+                            "broker publish failed after payload write: chan_id={} producer_idx={} reservation_id={} msg_id={} err={}",
+                            chan_id, producer_idx, reservation_id, msg_id, e
+                        ))
+                    })?;
+                return Ok(msg_id);
+            }
+            1 => {
+                abort_broker_reservation_best_effort(broker, chan_id, reservation_id).await;
+                let now = Instant::now();
+                if now >= next_payload_warn_at {
+                    warn!(
+                        "broker payload write backpressured by owner pool: chan_id={} producer_idx={} waited_ms={}",
+                        chan_id,
+                        producer_idx,
+                        reserve_wait_begin.elapsed().as_millis(),
+                    );
+                    next_payload_warn_at = now + BROKER_BACKPRESSURE_WARN_INTERVAL;
+                }
+                let sleep_for =
+                    broker_backpressure_sleep_duration(producer_idx, payload_retry_attempt);
+                payload_retry_attempt = payload_retry_attempt.saturating_add(1);
+                sleep(sleep_for).await;
+                continue;
+            }
+            2 => {
+                abort_broker_reservation_best_effort(broker, chan_id, reservation_id).await;
+                return Err(MpscError::PutPayloadNonRetryable);
+            }
+            other => {
+                abort_broker_reservation_best_effort(broker, chan_id, reservation_id).await;
+                return Err(MpscError::PutPayloadUnknownCode { code: other });
+            }
+        }
+    }
+}
+
+fn broker_backpressure_sleep_duration(producer_idx: &str, retry_attempt: u32) -> Duration {
+    let shift = retry_attempt.min(6);
+    let base_ms = BROKER_BACKPRESSURE_INITIAL_SLEEP_MS
+        .saturating_mul(1_u64 << shift)
+        .min(BROKER_BACKPRESSURE_MAX_SLEEP_MS);
+    let jitter_ms = if BROKER_BACKPRESSURE_JITTER_MS == 0 {
+        0
+    } else {
+        producer_idx
+            .bytes()
+            .fold(retry_attempt as u64, |acc, byte| {
+                acc.wrapping_mul(31).wrapping_add(byte as u64)
+            })
+            % (BROKER_BACKPRESSURE_JITTER_MS + 1)
+    };
+    Duration::from_millis((base_ms + jitter_ms).min(BROKER_BACKPRESSURE_MAX_SLEEP_MS))
+}
+
+async fn abort_broker_reservation_best_effort(
+    broker: &BrokerHandle,
+    chan_id: i64,
+    reservation_id: u64,
+) {
+    if let Err(err) = broker.abort(chan_id, reservation_id).await {
+        warn!(
+            "best-effort broker abort failed: chan_id={} reservation_id={} err={}",
+            chan_id, reservation_id, err
+        );
+    }
+}
+
+fn abort_on_payload_failure_async(broker: BrokerHandle, chan_id: i64, reservation_id: u64) {
+    tokio::spawn(async move {
+        abort_broker_reservation_best_effort(&broker, chan_id, reservation_id).await;
+    });
+}
+
+fn broker_now_ms() -> i64 {
+    SystemTime::now()
+        .duration_since(UNIX_EPOCH)
+        .expect("system clock is before UNIX_EPOCH")
+        .as_millis() as i64
 }
 
 fn spawn_consumer_meta_watch(
diff --git a/fluxon_rs/fluxon_observability/src/types.rs b/fluxon_rs/fluxon_observability/src/types.rs
index 446c43d..42db8aa 100644
--- a/fluxon_rs/fluxon_observability/src/types.rs
+++ b/fluxon_rs/fluxon_observability/src/types.rs
@@ -20,6 +20,7 @@ impl FluxonMemberKind {
 #[derive(Clone, Copy, Debug, PartialEq, Eq)]
 pub enum FluxonMemberRole {
     Master,
+    Broker,
     OwnerClient,
     ExternalClient,
     SideTransferWorker,
@@ -30,6 +31,7 @@ impl FluxonMemberRole {
     pub fn as_str(self) -> &'static str {
         match self {
             FluxonMemberRole::Master => "master",
+            FluxonMemberRole::Broker => "broker",
             FluxonMemberRole::OwnerClient => "owner_client",
             FluxonMemberRole::ExternalClient => "external_client",
             FluxonMemberRole::SideTransferWorker => "side_transfer_worker",
diff --git a/fluxon_rs/fluxon_ops/build.rs b/fluxon_rs/fluxon_ops/build.rs
index 585fbfc..51e95c4 100644
--- a/fluxon_rs/fluxon_ops/build.rs
+++ b/fluxon_rs/fluxon_ops/build.rs
@@ -59,9 +59,17 @@ print(
 }
 
 fn render_log_shard_helper(repo_root: &Path) -> String {
-    let helper_path = repo_root.join("deployment").join("utils").join("log_shard.py");
-    fs::read_to_string(&helper_path)
-        .unwrap_or_else(|e| panic!("read log shard helper failed: {} ({})", helper_path.display(), e))
+    let helper_path = repo_root
+        .join("deployment")
+        .join("utils")
+        .join("log_shard.py");
+    fs::read_to_string(&helper_path).unwrap_or_else(|e| {
+        panic!(
+            "read log shard helper failed: {} ({})",
+            helper_path.display(),
+            e
+        )
+    })
 }
 
 fn main() {
@@ -87,6 +95,10 @@ fn main() {
     );
     println!(
         "cargo:rerun-if-changed={}",
-        repo_root.join("deployment").join("utils").join("log_shard.py").display()
+        repo_root
+            .join("deployment")
+            .join("utils")
+            .join("log_shard.py")
+            .display()
     );
 }
diff --git a/fluxon_rs/fluxon_ops/src/lib.rs b/fluxon_rs/fluxon_ops/src/lib.rs
index 29d9434..3adb053 100644
--- a/fluxon_rs/fluxon_ops/src/lib.rs
+++ b/fluxon_rs/fluxon_ops/src/lib.rs
@@ -80,7 +80,8 @@ const DELETE_APPLY_NO_WAIT_DELAY_SECONDS: u64 = 30;
 
 const EMBEDDED_SELECTION_SUPERVISOR_SOURCE: &str =
     include_str!(concat!(env!("OUT_DIR"), "/selection_supervisor.py"));
-const EMBEDDED_LOG_SHARD_HELPER_SOURCE: &str = include_str!(concat!(env!("OUT_DIR"), "/log_shard.py"));
+const EMBEDDED_LOG_SHARD_HELPER_SOURCE: &str =
+    include_str!(concat!(env!("OUT_DIR"), "/log_shard.py"));
 
 // Ops controller uses Fluxon user-RPC to talk to ops agents.
 // Keep the timeout as a fixed constant to avoid config surface area.
@@ -351,7 +352,10 @@ fn workload_log_latest_shard_identity(logical_path: &Path) -> anyhow::Result anyhow::Result
     Ok(resolved)
 }
 
-fn ensure_embedded_selection_supervisor_runtime(workdir: &Path) -> anyhow::Result<(PathBuf, PathBuf)> {
+fn ensure_embedded_selection_supervisor_runtime(
+    workdir: &Path,
+) -> anyhow::Result<(PathBuf, PathBuf)> {
     let runtime_dir = workdir.join(OPS_SELECTION_SUPERVISOR_DIR_NAME);
     std::fs::create_dir_all(&runtime_dir).with_context(|| {
         format!(
@@ -1657,10 +1663,11 @@ fn selection_owner_supervisor(
     scope_key: Option<&str>,
     exclude_pid: Option,
 ) -> anyhow::Result> {
-    let owners: Vec = live_selection_supervisors(snapshot, Some(label), scope_key)?
-        .into_iter()
-        .filter(|supervisor| exclude_pid != Some(supervisor.pid()))
-        .collect();
+    let owners: Vec =
+        live_selection_supervisors(snapshot, Some(label), scope_key)?
+ .into_iter() + .filter(|supervisor| exclude_pid != Some(supervisor.pid())) + .collect(); if owners.is_empty() { return Ok(None); } @@ -2068,7 +2075,16 @@ fn wait_for_selection_attached( argv: &[String], cwd: Option<&str>, ) -> anyhow::Result { - wait_for_selection_attached_for_scope(kind, name, authority, None, apply_id, owner_ts_ms, argv, cwd) + wait_for_selection_attached_for_scope( + kind, + name, + authority, + None, + apply_id, + owner_ts_ms, + argv, + cwd, + ) } fn wait_for_selection_attached_without_present_for_scope( @@ -2803,10 +2819,9 @@ impl SupervisorBackedWorkloads { fn list_workloads(&self) -> anyhow::Result> { let mut out: Vec = Vec::new(); let snapshot = selection_supervisor_proc_snapshot()?; - for status in observe_all_selection_statuses_for_snapshot( - &snapshot, - Some(self.scope_key.as_str()), - )? { + for status in + observe_all_selection_statuses_for_snapshot(&snapshot, Some(self.scope_key.as_str()))? + { let kind = status.kind.with_context(|| { format!( "selection supervisor list item missing kind: label={}", @@ -3159,7 +3174,10 @@ impl UserRpcHandler for ReadWorkloadLogChunkHandler { None => { let Some(path) = resolve_readable_log_path(&logical_path) else { let resp = make_err_resp( - format!("log file is not available yet: logical_path={}", logical_path.display()), + format!( + "log file is not available yet: logical_path={}", + logical_path.display() + ), None, ); return Ok(serde_json::to_vec(&resp).unwrap()); @@ -3186,138 +3204,11 @@ impl UserRpcHandler for ReadWorkloadLogChunkHandler { }; let file_size = meta.len(); - let (start, end, start_cursor, end_cursor, effective_path, effective_file_size) = - match req.direction { - LogReadDirection::Forward => { - if let Some(cursor) = req.cursor.as_ref() { - if cursor.offset > file_size { - let resp = make_err_resp( - format!( - "cursor out of range: shard={} cursor={} file_size={}", - cursor.shard, cursor.offset, file_size - ), - Some(file_size), - ); - return 
Ok(serde_json::to_vec(&resp).unwrap()); - } - let mut effective_path = path.clone(); - let mut effective_shard = shard.clone(); - let mut effective_file_size = file_size; - let mut start = cursor.offset; - if cursor.offset == file_size { - if let Ok(Some(next_shard)) = - workload_log_next_shard(&logical_path, &cursor.shard) - { - let next_path = match workload_log_path_for_shard(&logical_path, &next_shard) { - Ok(v) => v, - Err(e) => { - let resp = make_err_resp(format!("{}", e), Some(file_size)); - return Ok(serde_json::to_vec(&resp).unwrap()); - } - }; - match std::fs::metadata(&next_path) { - Ok(next_meta) => { - effective_file_size = next_meta.len(); - effective_path = next_path; - effective_shard = next_shard; - start = 0; - } - Err(e) => { - let resp = make_err_resp( - format!( - "stat next log shard failed: path={} err={}", - next_path.display(), - e - ), - Some(file_size), - ); - return Ok(serde_json::to_vec(&resp).unwrap()); - } - } - } else if let Ok(Some(latest_shard)) = - workload_log_latest_shard_identity(&logical_path) - { - if latest_shard != cursor.shard { - let latest_path = - match workload_log_path_for_shard(&logical_path, &latest_shard) { - Ok(v) => v, - Err(e) => { - let resp = make_err_resp(format!("{}", e), Some(file_size)); - return Ok(serde_json::to_vec(&resp).unwrap()); - } - }; - match std::fs::metadata(&latest_path) { - Ok(latest_meta) => { - effective_file_size = latest_meta.len(); - effective_path = latest_path; - effective_shard = latest_shard; - start = 0; - } - Err(e) => { - let resp = make_err_resp( - format!( - "stat latest log shard failed: path={} err={}", - latest_path.display(), - e - ), - Some(file_size), - ); - return Ok(serde_json::to_vec(&resp).unwrap()); - } - } - } - } - } - let end = match max_bytes { - Some(max_bytes) => { - std::cmp::min(effective_file_size, start.saturating_add(max_bytes)) - } - None => effective_file_size, - }; - ( - start, - end, - Some(WorkloadLogCursor { - shard: effective_shard.clone(), - 
offset: start, - }), - Some(WorkloadLogCursor { - shard: effective_shard.clone(), - offset: end, - }), - effective_path, - effective_file_size, - ) - } else { - let end = file_size; - let start = match max_bytes { - Some(max_bytes) => end.saturating_sub(max_bytes), - None => 0, - }; - ( - start, - end, - Some(WorkloadLogCursor { - shard: shard.clone(), - offset: start, - }), - Some(WorkloadLogCursor { - shard: shard.clone(), - offset: end, - }), - path.clone(), - file_size, - ) - } - } - LogReadDirection::Backward => { - let Some(cursor) = req.cursor.as_ref() else { - let resp = make_err_resp( - "cursor is required for Backward reads".to_string(), - Some(file_size), - ); - return Ok(serde_json::to_vec(&resp).unwrap()); - }; + let (start, end, start_cursor, end_cursor, effective_path, effective_file_size) = match req + .direction + { + LogReadDirection::Forward => { + if let Some(cursor) = req.cursor.as_ref() { if cursor.offset > file_size { let resp = make_err_resp( format!( @@ -3331,30 +3222,31 @@ impl UserRpcHandler for ReadWorkloadLogChunkHandler { let mut effective_path = path.clone(); let mut effective_shard = shard.clone(); let mut effective_file_size = file_size; - let mut end = cursor.offset; - if cursor.offset == 0 { - if let Ok(Some(prev_shard)) = - workload_log_previous_shard(&logical_path, &cursor.shard) + let mut start = cursor.offset; + if cursor.offset == file_size { + if let Ok(Some(next_shard)) = + workload_log_next_shard(&logical_path, &cursor.shard) { - let prev_path = match workload_log_path_for_shard(&logical_path, &prev_shard) { - Ok(v) => v, - Err(e) => { - let resp = make_err_resp(format!("{}", e), Some(file_size)); - return Ok(serde_json::to_vec(&resp).unwrap()); - } - }; - match std::fs::metadata(&prev_path) { - Ok(prev_meta) => { - effective_file_size = prev_meta.len(); - effective_path = prev_path; - effective_shard = prev_shard; - end = effective_file_size; + let next_path = + match workload_log_path_for_shard(&logical_path, 
&next_shard) { + Ok(v) => v, + Err(e) => { + let resp = make_err_resp(format!("{}", e), Some(file_size)); + return Ok(serde_json::to_vec(&resp).unwrap()); + } + }; + match std::fs::metadata(&next_path) { + Ok(next_meta) => { + effective_file_size = next_meta.len(); + effective_path = next_path; + effective_shard = next_shard; + start = 0; } Err(e) => { let resp = make_err_resp( format!( - "stat previous log shard failed: path={} err={}", - prev_path.display(), + "stat next log shard failed: path={} err={}", + next_path.display(), e ), Some(file_size), @@ -3362,11 +3254,47 @@ impl UserRpcHandler for ReadWorkloadLogChunkHandler { return Ok(serde_json::to_vec(&resp).unwrap()); } } + } else if let Ok(Some(latest_shard)) = + workload_log_latest_shard_identity(&logical_path) + { + if latest_shard != cursor.shard { + let latest_path = + match workload_log_path_for_shard(&logical_path, &latest_shard) + { + Ok(v) => v, + Err(e) => { + let resp = + make_err_resp(format!("{}", e), Some(file_size)); + return Ok(serde_json::to_vec(&resp).unwrap()); + } + }; + match std::fs::metadata(&latest_path) { + Ok(latest_meta) => { + effective_file_size = latest_meta.len(); + effective_path = latest_path; + effective_shard = latest_shard; + start = 0; + } + Err(e) => { + let resp = make_err_resp( + format!( + "stat latest log shard failed: path={} err={}", + latest_path.display(), + e + ), + Some(file_size), + ); + return Ok(serde_json::to_vec(&resp).unwrap()); + } + } + } } } - let start = match max_bytes { - Some(max_bytes) => end.saturating_sub(max_bytes), - None => 0, + let end = match max_bytes { + Some(max_bytes) => { + std::cmp::min(effective_file_size, start.saturating_add(max_bytes)) + } + None => effective_file_size, }; ( start, @@ -3382,8 +3310,103 @@ impl UserRpcHandler for ReadWorkloadLogChunkHandler { effective_path, effective_file_size, ) + } else { + let end = file_size; + let start = match max_bytes { + Some(max_bytes) => end.saturating_sub(max_bytes), + None => 0, + }; + 
( + start, + end, + Some(WorkloadLogCursor { + shard: shard.clone(), + offset: start, + }), + Some(WorkloadLogCursor { + shard: shard.clone(), + offset: end, + }), + path.clone(), + file_size, + ) } - }; + } + LogReadDirection::Backward => { + let Some(cursor) = req.cursor.as_ref() else { + let resp = make_err_resp( + "cursor is required for Backward reads".to_string(), + Some(file_size), + ); + return Ok(serde_json::to_vec(&resp).unwrap()); + }; + if cursor.offset > file_size { + let resp = make_err_resp( + format!( + "cursor out of range: shard={} cursor={} file_size={}", + cursor.shard, cursor.offset, file_size + ), + Some(file_size), + ); + return Ok(serde_json::to_vec(&resp).unwrap()); + } + let mut effective_path = path.clone(); + let mut effective_shard = shard.clone(); + let mut effective_file_size = file_size; + let mut end = cursor.offset; + if cursor.offset == 0 { + if let Ok(Some(prev_shard)) = + workload_log_previous_shard(&logical_path, &cursor.shard) + { + let prev_path = + match workload_log_path_for_shard(&logical_path, &prev_shard) { + Ok(v) => v, + Err(e) => { + let resp = make_err_resp(format!("{}", e), Some(file_size)); + return Ok(serde_json::to_vec(&resp).unwrap()); + } + }; + match std::fs::metadata(&prev_path) { + Ok(prev_meta) => { + effective_file_size = prev_meta.len(); + effective_path = prev_path; + effective_shard = prev_shard; + end = effective_file_size; + } + Err(e) => { + let resp = make_err_resp( + format!( + "stat previous log shard failed: path={} err={}", + prev_path.display(), + e + ), + Some(file_size), + ); + return Ok(serde_json::to_vec(&resp).unwrap()); + } + } + } + } + let start = match max_bytes { + Some(max_bytes) => end.saturating_sub(max_bytes), + None => 0, + }; + ( + start, + end, + Some(WorkloadLogCursor { + shard: effective_shard.clone(), + offset: start, + }), + Some(WorkloadLogCursor { + shard: effective_shard.clone(), + offset: end, + }), + effective_path, + effective_file_size, + ) + } + }; if end < start { 
let resp = make_err_resp( @@ -3416,7 +3439,11 @@ impl UserRpcHandler for ReadWorkloadLogChunkHandler { Ok(v) => v, Err(e) => { let resp = make_err_resp( - format!("open log failed: path={} err={}", effective_path.display(), e), + format!( + "open log failed: path={} err={}", + effective_path.display(), + e + ), Some(effective_file_size), ); return Ok(serde_json::to_vec(&resp).unwrap()); @@ -3425,7 +3452,11 @@ impl UserRpcHandler for ReadWorkloadLogChunkHandler { if let Err(e) = std::io::Seek::seek(&mut f, std::io::SeekFrom::Start(start)) { let resp = make_err_resp( - format!("seek log failed: path={} err={}", effective_path.display(), e), + format!( + "seek log failed: path={} err={}", + effective_path.display(), + e + ), Some(effective_file_size), ); return Ok(serde_json::to_vec(&resp).unwrap()); @@ -3434,7 +3465,11 @@ impl UserRpcHandler for ReadWorkloadLogChunkHandler { let mut buf: Vec = vec![0; len]; if let Err(e) = std::io::Read::read_exact(&mut f, &mut buf) { let resp = make_err_resp( - format!("read log failed: path={} err={}", effective_path.display(), e), + format!( + "read log failed: path={} err={}", + effective_path.display(), + e + ), Some(effective_file_size), ); return Ok(serde_json::to_vec(&resp).unwrap()); @@ -4067,8 +4102,7 @@ fn desired_workload_matches_running( &desired.name, &desired.authority, Some(workloads.scope_key.as_str()), - ) - else { + ) else { return false; }; desired_workload_status_matches_goal(&status, desired) @@ -14337,14 +14371,14 @@ mod tests { assert_eq!(scoped_b.len(), 1); assert_eq!(scoped_b[0].pid(), 22); - let listed_a = observe_all_selection_statuses_for_snapshot(&snapshot, Some("/tmp/scope-a")) - .unwrap(); + let listed_a = + observe_all_selection_statuses_for_snapshot(&snapshot, Some("/tmp/scope-a")).unwrap(); assert_eq!(listed_a.len(), 1); assert_eq!(listed_a[0].label, "DaemonSet/target"); assert_eq!(listed_a[0].pid, Some(11)); - let listed_b = observe_all_selection_statuses_for_snapshot(&snapshot, 
Some("/tmp/scope-b")) - .unwrap(); + let listed_b = + observe_all_selection_statuses_for_snapshot(&snapshot, Some("/tmp/scope-b")).unwrap(); assert_eq!(listed_b.len(), 1); assert_eq!(listed_b[0].label, "DaemonSet/target"); assert_eq!(listed_b[0].pid, Some(22)); @@ -14548,8 +14582,8 @@ mod tests { zombie_infos: Vec::new(), }; - let listed = observe_apply_runtime_statuses_for_snapshot("apply-1", &snapshot, None) - .unwrap(); + let listed = + observe_apply_runtime_statuses_for_snapshot("apply-1", &snapshot, None).unwrap(); assert_eq!(listed.len(), 1); assert_eq!(listed[0].name.as_deref(), Some("target-present")); assert!(listed[0].present); @@ -14774,12 +14808,8 @@ mod tests { None )); - let delete_old = workloads.delete_generation( - WorkloadKind::Deployment, - &name, - &name, - Some("apply-1"), - ); + let delete_old = + workloads.delete_generation(WorkloadKind::Deployment, &name, &name, Some("apply-1")); if !delete_old.ok { let err = delete_old.err.as_deref().unwrap_or_default(); assert!( @@ -14813,8 +14843,7 @@ mod tests { delete_current.ok, "unguarded delete should bind and retire the current visible generation: {delete_current:?}" ); - wait_for_selection_absent(WorkloadKind::Deployment, &name, &name, Some("apply-2")) - .unwrap(); + wait_for_selection_absent(WorkloadKind::Deployment, &name, &name, Some("apply-2")).unwrap(); } #[test] @@ -14826,9 +14855,12 @@ mod tests { python_exe.display() ); let workdir = tempfile::tempdir().unwrap(); - let runtime = - SelectionSupervisorRuntime::materialize(workdir.path(), workdir.path(), python_exe.as_path()) - .unwrap(); + let runtime = SelectionSupervisorRuntime::materialize( + workdir.path(), + workdir.path(), + python_exe.as_path(), + ) + .unwrap(); assert!(runtime.script_path.exists()); assert!( runtime @@ -14849,9 +14881,12 @@ mod tests { python_exe.display() ); let workdir = tempfile::tempdir().unwrap(); - let runtime = - SelectionSupervisorRuntime::materialize(workdir.path(), workdir.path(), python_exe.as_path()) - 
.unwrap(); + let runtime = SelectionSupervisorRuntime::materialize( + workdir.path(), + workdir.path(), + python_exe.as_path(), + ) + .unwrap(); let log_path = workdir.path().join("startup.log"); let command = vec![ python_exe.display().to_string(), @@ -14876,7 +14911,9 @@ mod tests { "--".to_string(), "/bin/true".to_string(), ]; - let pid = runtime.spawn_detached_command(&log_path, command.as_slice()).unwrap(); + let pid = runtime + .spawn_detached_command(&log_path, command.as_slice()) + .unwrap(); let deadline = Instant::now() + Duration::from_secs(10); let expected = "owner-ts-ms must be positive"; let mut saw_expected = false; @@ -15164,7 +15201,9 @@ mod tests { }), max_bytes: Some(65536), }; - let raw = handler.handle("n1".into(), &serde_json::to_vec(&req).unwrap()).unwrap(); + let raw = handler + .handle("n1".into(), &serde_json::to_vec(&req).unwrap()) + .unwrap(); let resp: ReadWorkloadLogResp = serde_json::from_slice(&raw).unwrap(); assert!(resp.ok, "{resp:?}"); assert_eq!(resp.text.as_deref(), Some("new\n")); @@ -15209,7 +15248,9 @@ mod tests { }), max_bytes: Some(65536), }; - let raw = handler.handle("n1".into(), &serde_json::to_vec(&req).unwrap()).unwrap(); + let raw = handler + .handle("n1".into(), &serde_json::to_vec(&req).unwrap()) + .unwrap(); let resp: ReadWorkloadLogResp = serde_json::from_slice(&raw).unwrap(); assert!(resp.ok, "{resp:?}"); assert_eq!(resp.text.as_deref(), Some("old\n")); diff --git a/fluxon_rs/fluxon_pyo3/src/error.rs b/fluxon_rs/fluxon_pyo3/src/error.rs index 97ab680..e153ebc 100644 --- a/fluxon_rs/fluxon_pyo3/src/error.rs +++ b/fluxon_rs/fluxon_pyo3/src/error.rs @@ -51,6 +51,26 @@ pub(crate) fn pyerr_message_consumption_no_new_message( }) } +pub(crate) fn pyerr_channel_closed(py: Python<'_>, message: &str, channel_id: i64) -> PyErr { + build_ext_error(py, "ChannelClosedError", message, |kw| { + kw.set_item("channel_id", channel_id).unwrap(); + }) +} + +pub(crate) fn pyerr_producer_closed( + py: Python<'_>, + message: &str, + 
channel_id: i64, + producer_idx: Option<&str>, +) -> PyErr { + build_ext_error(py, "ProducerClosedError", message, |kw| { + kw.set_item("channel_id", channel_id).unwrap(); + if let Some(p) = producer_idx { + kw.set_item("producer_idx", p).unwrap(); + } + }) +} + pub(crate) fn pyerr_message_consumption( py: Python<'_>, message: &str, @@ -87,6 +107,18 @@ pub(crate) fn pyerr_chan_message_produce( }) } +pub(crate) fn pyerr_message_buffer_full( + py: Python<'_>, + message: &str, + channel_id: i64, + buffer_size: i64, +) -> PyErr { + build_ext_error(py, "MessageBufferFullError", message, |kw| { + kw.set_item("channel_id", channel_id).unwrap(); + kw.set_item("buffer_size", buffer_size).unwrap(); + }) +} + // System/bridge category constructors (distinct helpers for clarity) pub(crate) fn pyerr_etcd(py: Python<'_>, message: &str, component: &str) -> PyErr { build_ext_error(py, "EtcdError", message, |kw| { @@ -264,10 +296,13 @@ pub(crate) fn new_store_closed_error(py: Python<'_>, message: &str) -> PyObject pub(crate) fn new_result_success(py: Python<'_>, value: PyObject) -> PyObject { let api_error_module = py.import_bound("fluxon_py.api_error").unwrap(); let result_class = api_error_module.getattr("Result").unwrap(); - result_class - .call_method1("new_ok", (value,)) - .unwrap() - .into() + match result_class.call_method1("new_ok", (value,)) { + Ok(obj) => obj.into(), + Err(err) => { + let message = format!("Failed to build Result.new_ok: {}", err); + new_result_error(py, new_general_error(py, &message)) + } + } } pub(crate) fn new_result_error(py: Python<'_>, error: PyObject) -> PyObject { diff --git a/fluxon_rs/fluxon_pyo3/src/flatdict_zerocopy.rs b/fluxon_rs/fluxon_pyo3/src/flatdict_zerocopy.rs index 335f36e..c80e775 100644 --- a/fluxon_rs/fluxon_pyo3/src/flatdict_zerocopy.rs +++ b/fluxon_rs/fluxon_pyo3/src/flatdict_zerocopy.rs @@ -1,6 +1,7 @@ -use std::collections::{BTreeMap, BTreeSet}; +use std::collections::{BTreeMap, BTreeSet, HashSet}; use std::os::raw::c_void; 
-use std::sync::Arc; +use std::sync::atomic::{AtomicUsize, Ordering}; +use std::sync::{Arc, Mutex}; use fluxon_kv::memholder::kvclient_encode::{ BorrowedFlatKvValueRange, FLAT_KV_TYPE_BOOL, FLAT_KV_TYPE_BYTES, FLAT_KV_TYPE_FLOAT64, @@ -30,25 +31,90 @@ const DLPACK_USED_CAPSULE_NAME_CSTR: &[u8] = b"used_dltensor\0"; #[derive(Clone)] pub(crate) enum FlatDictDataOwner { - OwnedBytes(Arc<[u8]>), - UserMemHolder(Arc), - ExternalMemHolder(Arc), + OwnedBytes(Arc), + UserMemHolder(Arc), + ExternalMemHolder(Arc), } impl FlatDictDataOwner { pub(crate) fn from_owned_bytes(bytes: Vec) -> Self { - Self::OwnedBytes(Arc::<[u8]>::from(bytes)) + Self::OwnedBytes(Arc::new(FlatDictOwnedBytes { + bytes: Arc::<[u8]>::from(bytes), + })) + } + + pub(crate) fn from_user_memholder(holder: Arc) -> Self { + Self::UserMemHolder(Arc::new(FlatDictUserMemHolder { holder })) + } + + pub(crate) fn from_external_memholder(holder: Arc) -> Self { + Self::ExternalMemHolder(Arc::new(FlatDictExternalMemHolder { holder })) } fn bytes(&self) -> &[u8] { match self { - Self::OwnedBytes(bytes) => bytes.as_ref(), - Self::UserMemHolder(holder) => holder.bytes(), - Self::ExternalMemHolder(holder) => holder.bytes(), + Self::OwnedBytes(owner) => owner.bytes.as_ref(), + Self::UserMemHolder(owner) => owner.holder.bytes(), + Self::ExternalMemHolder(owner) => owner.holder.bytes(), } } } +pub(crate) type FlatDictCleanup = Box; + +#[derive(Clone)] +pub(crate) struct FlatDictSharedCleanup { + state: Arc, +} + +struct FlatDictSharedCleanupState { + remaining_views: AtomicUsize, + cleanup: Mutex>, +} + +impl FlatDictSharedCleanup { + fn new(view_count: usize, cleanup: FlatDictCleanup) -> Self { + assert!( + view_count > 0, + "flatdict cleanup requires at least one view" + ); + Self { + state: Arc::new(FlatDictSharedCleanupState { + remaining_views: AtomicUsize::new(view_count), + cleanup: Mutex::new(Some(cleanup)), + }), + } + } + + fn release(self) { + let previous = self.state.remaining_views.fetch_sub(1, 
Ordering::AcqRel); + assert!(previous > 0, "flatdict cleanup released too many times"); + if previous == 1 { + let cleanup = self.state.cleanup.lock().unwrap().take(); + run_flatdict_cleanup(cleanup); + } + } +} + +pub(crate) struct FlatDictOwnedBytes { + bytes: Arc<[u8]>, +} + +pub(crate) struct FlatDictUserMemHolder { + holder: Arc, +} + +pub(crate) struct FlatDictExternalMemHolder { + holder: Arc, +} + +fn run_flatdict_cleanup(cleanup: Option) { + let Some(cleanup) = cleanup else { + return; + }; + cleanup(); +} + pub(crate) struct FlatDictEncodePlan { ptrs: Vec<(u8, usize, u32, u64, u32, Option)>, key_storage: Vec>, @@ -254,6 +320,7 @@ pub(crate) struct FlatDictDLPackView { bits: u8, lanes: u16, shape: Arc<[i64]>, + cleanup: Mutex>, } #[pymethods] @@ -300,6 +367,32 @@ impl FlatDictDLPackView { bits, lanes, shape, + cleanup: Mutex::new(None), + } + } + + pub(crate) fn attach_cleanup( + &self, + cleanup: FlatDictSharedCleanup, + ) -> Result<(), FlatDictSharedCleanup> { + let mut guard = self.cleanup.lock().unwrap(); + if guard.is_some() { + return Err(cleanup); + } + *guard = Some(cleanup); + Ok(()) + } + + fn has_cleanup(&self) -> bool { + self.cleanup.lock().unwrap().is_some() + } +} + +impl Drop for FlatDictDLPackView { + fn drop(&mut self) { + let cleanup = self.cleanup.get_mut().unwrap().take(); + if let Some(cleanup) = cleanup { + cleanup.release(); } } } @@ -1002,3 +1095,94 @@ pub(crate) fn decode_flat_dict_to_wrapped_py_object( } Ok(dict.into()) } + +pub(crate) fn attach_cleanup_to_flatdict_pyobject( + py: Python<'_>, + obj: &PyObject, + make_cleanup: F, +) -> bool +where + F: FnOnce() -> FlatDictCleanup, +{ + let any = obj.bind(py); + let Ok(dict) = any.downcast::() else { + return false; + }; + let mut seen_views = HashSet::::new(); + let mut view_count = 0usize; + for value in dict.values() { + if let Ok(view) = value.extract::>() { + let view_ptr = (&*view as *const FlatDictDLPackView) as usize; + if !seen_views.insert(view_ptr) { + continue; + } + if 
view.has_cleanup() { + return false; + } + view_count += 1; + } + } + if view_count == 0 { + return false; + } + + let cleanup = FlatDictSharedCleanup::new(view_count, make_cleanup()); + let mut seen_views = HashSet::::new(); + for value in dict.values() { + if let Ok(view) = value.extract::>() { + let view_ptr = (&*view as *const FlatDictDLPackView) as usize; + if !seen_views.insert(view_ptr) { + continue; + } + assert!( + view.attach_cleanup(cleanup.clone()).is_ok(), + "flatdict cleanup attach must succeed after pre-check" + ); + } + } + true +} + +#[cfg(test)] +mod tests { + use super::*; + use pyo3::types::PyDict; + use std::sync::atomic::{AtomicUsize, Ordering}; + + #[test] + fn cleanup_waits_for_all_dlpack_views() { + Python::with_gil(|py| { + let owner = FlatDictDataOwner::from_owned_bytes(vec![0u8; 16]); + let shape = Arc::<[i64]>::from(vec![8_i64]); + let view1 = Py::new( + py, + FlatDictDLPackView::new(owner.clone(), 0, 8, 1, 8, 1, shape.clone()), + ) + .unwrap(); + let view2 = Py::new(py, FlatDictDLPackView::new(owner, 8, 8, 1, 8, 1, shape)).unwrap(); + + let dict = PyDict::new_bound(py); + dict.set_item("a", view1.bind(py)).unwrap(); + dict.set_item("b", view2.bind(py)).unwrap(); + + let hits = Arc::new(AtomicUsize::new(0)); + let hits_for_cleanup = hits.clone(); + let dict_obj = dict.into_any().into_py(py); + assert!(attach_cleanup_to_flatdict_pyobject(py, &dict_obj, || { + let hits_for_cleanup = hits_for_cleanup.clone(); + Box::new(move || { + hits_for_cleanup.fetch_add(1, Ordering::SeqCst); + }) + })); + + drop(dict_obj); + assert_eq!(hits.load(Ordering::SeqCst), 0); + + drop(view1); + assert_eq!(hits.load(Ordering::SeqCst), 0); + + drop(view2); + assert_eq!(hits.load(Ordering::SeqCst), 1); + }); + } +} diff --git a/fluxon_rs/fluxon_pyo3/src/lease_manager.rs b/fluxon_rs/fluxon_pyo3/src/lease_manager.rs index 99b8a79..52645e9 100755 --- a/fluxon_rs/fluxon_pyo3/src/lease_manager.rs +++ b/fluxon_rs/fluxon_pyo3/src/lease_manager.rs @@ -139,26 +139,39 @@ 
pub struct LeaseManagerHandle {
 // Only a wrapper around fluxon_mq::lease_manager::Lease, so the RAII logic is not reimplemented in fluxon_pyo3.
 #[pyclass]
 pub struct PyGeneralLease {
-    lease: fluxon_mq::lease_manager::GeneralLease,
+    lease: Option<fluxon_mq::lease_manager::GeneralLease>,
 }
 
 #[pymethods]
 impl PyGeneralLease {
     #[getter]
-    fn id(&self) -> u64 {
-        self.lease.id()
+    fn id(&self) -> PyResult<u64> {
+        Ok(self
+            .lease
+            .as_ref()
+            .ok_or_else(|| {
+                PyErr::new::("lease handle is closed")
+            })?
+            .id())
     }
 
     fn __repr__(&self) -> String {
-        match self.lease.kind() {
+        let Some(lease) = self.lease.as_ref() else {
+            return "".to_string();
+        };
+        match lease.kind() {
             fluxon_util::lease_manager::LeaseType::Etcd => {
-                format!("", self.id())
+                format!("", lease.id())
             }
             fluxon_util::lease_manager::LeaseType::KvClient => {
-                format!("", self.id())
+                format!("", lease.id())
             }
         }
     }
+
+    fn close(&mut self) {
+        self.lease = None;
+    }
 }
 
 #[pymethods]
@@ -220,7 +233,7 @@ impl LeaseManagerHandle {
             "end allocate_etcd_lease: id={}, elapsed_ms={}",
             lease.id(), t0.elapsed().as_millis()
         );
-        Ok(PyGeneralLease { lease })
+        Ok(PyGeneralLease { lease: Some(lease) })
     }
 
     /// Register existing etcd lease id for keepalive and wrap the core Lease.
@@ -272,7 +285,7 @@ impl LeaseManagerHandle {
             "end register_etcd_lease: id={}, elapsed_ms={}",
             lease.id(), t0.elapsed().as_millis()
         );
-        Ok(PyGeneralLease { lease })
+        Ok(PyGeneralLease { lease: Some(lease) })
     }
 
     /// Register a kvclient lease via constructed backend uid carrying callbacks.
@@ -330,7 +343,7 @@ impl LeaseManagerHandle {
             "end register_kvclient_lease_via_backend: id={}, elapsed_ms={}",
             lease.id(), t0.elapsed().as_millis()
         );
-        Ok(PyGeneralLease { lease })
+        Ok(PyGeneralLease { lease: Some(lease) })
     }
 
     /// Debug-only: dump current active lease entries from the keepalive actor.
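Reviewer note on the `PyGeneralLease` change above: wrapping the lease in `Option` lets `close()` drop the underlying RAII lease early, while later accessors report a closed handle instead of panicking. The sketch below illustrates that shape without PyO3. `Lease`, `LeaseHandle`, the `released` counter, and `close_demo` are hypothetical stand-ins, not types from `fluxon_mq`, and the error is a plain `String` because the exception type used in the diff is not visible here.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for the core lease: dropping it ends the lease.
struct Lease {
    id: u64,
    released: Arc<AtomicUsize>,
}

impl Drop for Lease {
    fn drop(&mut self) {
        // RAII: the release side effect happens once, when the value is dropped.
        self.released.fetch_add(1, Ordering::SeqCst);
    }
}

/// Wrapper with the same `Option`-based shape as the diff (assumed, simplified).
struct LeaseHandle {
    lease: Option<Lease>,
}

impl LeaseHandle {
    /// Returns the lease id, or an error once the handle has been closed.
    fn id(&self) -> Result<u64, String> {
        self.lease
            .as_ref()
            .map(|lease| lease.id)
            .ok_or_else(|| "lease handle is closed".to_string())
    }

    /// Idempotent: the first call drops the lease, later calls do nothing.
    fn close(&mut self) {
        self.lease = None;
    }
}

/// Returns (id before close, id after close, number of releases observed).
fn close_demo() -> (Result<u64, String>, Result<u64, String>, usize) {
    let released = Arc::new(AtomicUsize::new(0));
    let mut handle = LeaseHandle {
        lease: Some(Lease {
            id: 7,
            released: released.clone(),
        }),
    };
    let before = handle.id();
    handle.close();
    handle.close(); // second close is a no-op
    let after = handle.id();
    (before, after, released.load(Ordering::SeqCst))
}
```

Calling `close_demo()` exercises the three properties that matter for the Python side: the id is readable before close, reads fail cleanly afterwards, and the lease is released once even if `close()` is called twice.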
diff --git a/fluxon_rs/fluxon_pyo3/src/lib.rs b/fluxon_rs/fluxon_pyo3/src/lib.rs index a73591f..31d3c59 100644 --- a/fluxon_rs/fluxon_pyo3/src/lib.rs +++ b/fluxon_rs/fluxon_pyo3/src/lib.rs @@ -31,7 +31,7 @@ use fluxon_kv::user_api::FluxonUserApi; use fluxon_kv::{ ConfigArg, Framework, KvClientTrait, KvGetResult, config::{ClientConfig, MasterConfig}, - run_client, run_master, + run_broker, run_client, run_master, }; use fluxon_ops; use fluxon_proxy; @@ -2623,6 +2623,31 @@ fn python_config_to_master_config( } } +fn python_config_to_client_config( + py: Python, + py_config: &Bound<'_, PyDict>, +) -> ApiResult { + let config: serde_yaml::Value = match pyany_to_serde_value(py, &py_config.to_object(py)) { + Ok(val) => val, + Err(e) => return ApiResult::new_error(new_invalid_argument_error(py, &e.to_string())), + }; + + let yaml_str = match serde_yaml::to_string(&config) { + Ok(s) => s, + Err(e) => return ApiResult::new_error(new_invalid_argument_error(py, &e.to_string())), + }; + + let config: ClientConfigYaml = match ClientConfigYaml::from_str(&yaml_str) { + Ok(config) => config, + Err(e) => return ApiResult::new_error(new_invalid_argument_error(py, &e.to_string())), + }; + + match config.verify() { + Ok(config) => ApiResult::new_success(config.into()), + Err(e) => ApiResult::new_error(new_invalid_argument_error(py, &e.to_string())), + } +} + fn pyany_to_serde_value(py: Python, obj: &PyObject) -> PyResult { if obj.is_none(py) { Ok(Value::Null) @@ -4138,6 +4163,124 @@ fn run_master_blocking(config: Option<&Bound<'_, PyAny>>, py: Python) -> PyObjec run_master_inner(config, py).into_py_object(py) } +/// Run broker with automatic lifecycle management +/// This function creates a broker, runs it until Ctrl+C, then shuts down +#[pyfunction] +#[pyo3(signature = (config=None))] +fn run_broker_blocking(config: Option<&Bound<'_, PyAny>>, py: Python) -> PyObject { + fn run_broker_inner(config: Option<&Bound<'_, PyAny>>, py: Python) -> ApiResult { + println!("🛠️ Broker init 
configuration: {:?}", config); + + let runtime = match Runtime::new() { + Ok(rt) => rt, + Err(e) => { + return ApiResult::new_error(new_general_error( + py, + &format!("Failed to create runtime: {}", e), + )); + } + }; + + let config_arg = match config { + None => ConfigArg::None, + Some(py_obj) => { + if py_obj.is_instance_of::() { + let path_str: String = match py_obj.extract() { + Ok(path) => path, + Err(_) => { + return ApiResult::new_error(new_invalid_argument_error( + py, + "Invalid configuration file path", + )); + } + }; + ConfigArg::File(PathBuf::from(path_str)) + } else if py_obj.is_instance_of::() { + let py_dict = match py_obj.downcast::() { + Ok(dict) => dict, + Err(_) => { + return ApiResult::new_error(new_invalid_argument_error( + py, + "Invalid configuration dictionary", + )); + } + }; + match python_config_to_client_config(py, py_dict) { + ApiResult::Success(client_config) => ConfigArg::Config(client_config), + ApiResult::Error(error) => return ApiResult::new_error(error), + } + } else { + return ApiResult::new_error(new_invalid_argument_error( + py, + "Config parameter must be None, string (file path), or dict (config object)", + )); + } + } + }; + + println!("🚀 Starting KV Broker..."); + + let (framework, final_config) = match py.allow_threads(|| { + runtime.run_async_from_sync(async move { fluxon_kv::run_broker(config_arg).await }) + }) { + Ok(Ok((fw, cfg))) => (fw, cfg), + Ok(Err(e)) => { + return ApiResult::new_error(new_backend_init_failed_error( + py, + &format!("Failed to initialize KV broker: {}", e), + Some("unified"), + )); + } + Err(e) => { + return ApiResult::new_error(new_backend_init_failed_error( + py, + &format!("Runtime bridge failed: {}", e), + Some("unified"), + )); + } + }; + + println!("✅ KV Broker started successfully"); + println!("📊 Instance: {}", final_config.instance_key); + println!("🏷️ Cluster: {}", final_config.cluster_name); + match final_config.fluxonkv_spec.p2p_listen_port { + Some(port) => println!("🔌 Port: {}", 
port), + None => println!("🔌 Port: auto"), + } + println!("🚀 Broker is running... Press Ctrl+C to stop"); + + let shutdown_result = py.allow_threads(|| { + runtime.block_on(async move { + if let Err(e) = tokio::signal::ctrl_c().await { + eprintln!("Failed to listen for shutdown signal: {}", e); + } + match framework.shutdown().await { + Ok(_) => { + println!("✅ Broker shut down successfully"); + Ok(()) + } + Err(e) => { + eprintln!("⚠️ Warning during shutdown: {}", e); + Err(e) + } + } + }) + }); + + let out = match shutdown_result { + Ok(_) => ApiResult::new_success(new_none_success_instance(py)), + Err(e) => ApiResult::new_error(new_general_error( + py, + &format!("Error during shutdown: {}", e), + )), + }; + runtime.shutdown_background(); + out + } + + run_broker_inner(config, py).into_py_object(py) +} + /// Python module definition #[pymodule] #[pyo3(name = "fluxon_pyo3")] @@ -4158,6 +4301,7 @@ fn fluxon_pyo3(m: &Bound<'_, PyModule>) -> PyResult<()> { m.add_class::()?; m.add_class::()?; m.add_function(wrap_pyfunction!(run_master_blocking, m)?)?; + m.add_function(wrap_pyfunction!(run_broker_blocking, m)?)?; m.add_function(wrap_pyfunction!(monitor_render_cli, m)?)?; m.add_function(wrap_pyfunction!(monitor_render_web, m)?)?; m.add_function(wrap_pyfunction!(fluxon_ops_controller_blocking, m)?)?; diff --git a/fluxon_rs/fluxon_pyo3/src/memholder.rs b/fluxon_rs/fluxon_pyo3/src/memholder.rs index b2750e6..7b66aec 100755 --- a/fluxon_rs/fluxon_pyo3/src/memholder.rs +++ b/fluxon_rs/fluxon_pyo3/src/memholder.rs @@ -34,9 +34,11 @@ impl MemHolder { let data_owner = match &holder.holder { MemHolderInner::Seg(seg_holder) => { - FlatDictDataOwner::UserMemHolder(seg_holder.clone()) + FlatDictDataOwner::from_user_memholder(seg_holder.clone()) + } + MemHolderInner::Owned(bytes) => { + FlatDictDataOwner::from_owned_bytes(bytes.as_ref().to_vec()) } - MemHolderInner::Owned(bytes) => FlatDictDataOwner::OwnedBytes(bytes.clone()), }; match decode_flat_dict_to_wrapped_py_object(py, 
data_owner) { Ok(obj) => { @@ -126,7 +128,7 @@ impl ExternalMemHolder { return ApiResult::new_success(cached.clone_ref(py).into_py(py)); } - let data_owner = FlatDictDataOwner::ExternalMemHolder(holder.holder.clone()); + let data_owner = FlatDictDataOwner::from_external_memholder(holder.holder.clone()); match decode_flat_dict_to_wrapped_py_object(py, data_owner) { Ok(obj) => { *holder.access_cache.write() = Some(obj.clone_ref(py)); diff --git a/fluxon_rs/fluxon_pyo3/src/mpsc.rs b/fluxon_rs/fluxon_pyo3/src/mpsc.rs index 22d0d07..78bb552 100644 --- a/fluxon_rs/fluxon_pyo3/src/mpsc.rs +++ b/fluxon_rs/fluxon_pyo3/src/mpsc.rs @@ -1,5 +1,5 @@ use std::sync::atomic::{AtomicBool, AtomicU8, Ordering}; -use std::sync::{Arc, OnceLock}; +use std::sync::{Arc, Mutex, OnceLock}; use std::time::{Duration, Instant}; use crossbeam_channel as cbchan; @@ -9,8 +9,8 @@ use fluxon_mq::consumer::{ PayloadResult as CorePayloadResult, }; use fluxon_mq::{ - ChanManager, MpscConsumer as CoreMpscConsumer, MpscError as CoreMpscError, - MpscProducer as CoreMpscProducer, ShutdownCtl, + BrokerChannelConfig, BrokerHandle, ChanManager, MpscConsumer as CoreMpscConsumer, + MpscError as CoreMpscError, MpscProducer as CoreMpscProducer, ShutdownCtl, create::{ChanCreateConfig, create_mpsc_channel}, }; use pyo3::Py; @@ -21,7 +21,9 @@ use tokio::runtime::Handle; use tokio::runtime::Runtime; // (no local payload buffering) -use crate::flatdict_zerocopy::{FlatDictDataOwner, decode_flat_dict_to_wrapped_py_object}; +use crate::flatdict_zerocopy::{ + FlatDictDataOwner, attach_cleanup_to_flatdict_pyobject, decode_flat_dict_to_wrapped_py_object, +}; use crate::lease_manager::PyLeaseBackendUid; use fluxon_kv::{Framework as KvFramework, KvClientTrait}; use fluxon_mq::lease_manager::LeaseBackendUid; @@ -33,9 +35,35 @@ use tracing::{debug, warn}; // that implements the core MqPayload trait so we can downcast later. 
 struct PyPayload {
     inner: PyObject,
+    cleanup_runtime: Handle,
 }
 
-impl CoreMqPayload for PyPayload {}
+impl CoreMqPayload for PyPayload {
+    fn attach_cleanup(
+        &mut self,
+        cleanup: fluxon_mq::consumer::PayloadCleanup,
+    ) -> Result<(), fluxon_mq::consumer::PayloadCleanup> {
+        let runtime = self.cleanup_runtime.clone();
+        let mut cleanup = Some(cleanup);
+        let attached = Python::with_gil(|py| {
+            attach_cleanup_to_flatdict_pyobject(py, &self.inner, || {
+                let cleanup = cleanup
+                    .take()
+                    .expect("cleanup must be present when DLPack attach is selected");
+                Box::new(move || {
+                    runtime.spawn(async move {
+                        cleanup().await;
+                    });
+                })
+            })
+        });
+        if attached {
+            Ok(())
+        } else {
+            Err(cleanup.expect("cleanup must remain present when attach fails"))
+        }
+    }
+}
 
 // Shared runtime for PyO3 helpers that are not lifecycle-governed by a KV Framework.
 // MQ producer/consumer operations should prefer the KV client's runtime/framework to
@@ -50,6 +78,7 @@ const RUST_KV_DELETE_TIMEOUT: Duration = Duration::from_secs(10);
 const RUST_KV_DELETE_JOIN_WARN_INTERVAL: Duration = Duration::from_secs(1);
 const PAYLOAD_STAGE_WATCHDOG_INTERVAL: Duration = Duration::from_secs(2);
 const PAYLOAD_STAGE_SLOW_WARN_THRESHOLD: Duration = Duration::from_secs(1);
+const BROKER_EMPTY_POLL_INTERVAL: Duration = Duration::from_millis(20);
 const GET_ONE_PENDING_WARN_INTERVAL: Duration = Duration::from_secs(2);
 
 /// Global runtime for standalone PyO3 helpers (e.g., lease_manager.rs).
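Reviewer note on the cleanup hand-off in `PyPayload::attach_cleanup`: the cleanup is attached to every DLPack view of the payload and should fire once, after the last view is dropped; if attaching is not possible, ownership of the cleanup goes back to the caller through the `Err` variant. The following is a minimal sketch of that contract in plain Rust, assuming a synchronous callback. `SharedCleanup`, `View`, and `shared_cleanup_demo` are hypothetical names for illustration; the real code uses PyO3 objects and spawns the async cleanup on a Tokio runtime.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

type Cleanup = Box<dyn FnOnce() + Send + 'static>;

struct SharedState {
    remaining: AtomicUsize,
    cleanup: Mutex<Option<Cleanup>>,
}

/// Hypothetical shared cleanup: the callback runs when the last holder releases.
#[derive(Clone)]
struct SharedCleanup {
    state: Arc<SharedState>,
}

impl SharedCleanup {
    fn new(holders: usize, cleanup: Cleanup) -> Self {
        assert!(holders > 0, "at least one holder is required");
        Self {
            state: Arc::new(SharedState {
                remaining: AtomicUsize::new(holders),
                cleanup: Mutex::new(Some(cleanup)),
            }),
        }
    }

    fn release(&self) {
        let previous = self.state.remaining.fetch_sub(1, Ordering::AcqRel);
        assert!(previous > 0, "cleanup released too many times");
        if previous == 1 {
            // Take the callback first so it runs without the mutex held.
            let cleanup = self.state.cleanup.lock().unwrap().take();
            if let Some(cleanup) = cleanup {
                cleanup();
            }
        }
    }
}

/// Hypothetical view that releases its share of the cleanup on drop.
struct View {
    cleanup: Mutex<Option<SharedCleanup>>,
}

impl View {
    fn new() -> Self {
        Self { cleanup: Mutex::new(None) }
    }

    /// On failure the cleanup is handed back, so the caller can run it another way.
    fn attach_cleanup(&self, cleanup: SharedCleanup) -> Result<(), SharedCleanup> {
        let mut guard = self.cleanup.lock().unwrap();
        if guard.is_some() {
            return Err(cleanup);
        }
        *guard = Some(cleanup);
        Ok(())
    }
}

impl Drop for View {
    fn drop(&mut self) {
        if let Some(cleanup) = self.cleanup.get_mut().unwrap().take() {
            cleanup.release();
        }
    }
}

/// Returns (hits after first drop, hits after second drop, second attach rejected).
fn shared_cleanup_demo() -> (usize, usize, bool) {
    let hits = Arc::new(AtomicUsize::new(0));
    let hits_for_cleanup = hits.clone();
    let shared = SharedCleanup::new(
        2,
        Box::new(move || {
            hits_for_cleanup.fetch_add(1, Ordering::SeqCst);
        }),
    );
    let a = View::new();
    let b = View::new();
    assert!(a.attach_cleanup(shared.clone()).is_ok());
    assert!(b.attach_cleanup(shared.clone()).is_ok());
    // A view that already carries a cleanup refuses a second one.
    let rejected = a.attach_cleanup(shared).is_err();
    drop(a);
    let after_first = hits.load(Ordering::SeqCst);
    drop(b);
    let after_second = hits.load(Ordering::SeqCst);
    (after_first, after_second, rejected)
}
```

Returning the cleanup in `Err` rather than dropping it keeps the caller in control of the fallback path, which matters when the cleanup acknowledges a message or frees a payload key.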
@@ -60,6 +89,59 @@ pub(crate) fn get_global_runtime() -> Arc { .clone() } +fn connect_distributed_broker(kv_framework: &Arc) -> BrokerHandle { + BrokerHandle::new_distributed( + kv_framework.cluster_manager_view().clone(), + kv_framework.p2p_view().clone(), + ) +} + +fn init_broker_for_channel( + runtime: &Handle, + broker: &BrokerHandle, + channel_id: i64, + capacity: i64, +) -> PyResult<()> { + use pyo3::exceptions::PyRuntimeError; + + runtime + .run_async_from_sync(async move { + broker + .upsert_channel(BrokerChannelConfig { + channel_id, + capacity, + }) + .await + }) + .map_err(|e| PyRuntimeError::new_err(format!("broker runtime bridge failed: {}", e)))? + .map_err(|e| { + PyRuntimeError::new_err(format!( + "failed to upsert broker channel config: chan_id={} capacity={} err={}", + channel_id, capacity, e + )) + })?; + Ok(()) +} + +fn delete_broker_channel( + runtime: &Handle, + broker: &BrokerHandle, + channel_id: i64, +) -> PyResult> { + use pyo3::exceptions::PyRuntimeError; + + let payload_keys = runtime + .run_async_from_sync(async move { broker.delete_channel(channel_id).await }) + .map_err(|e| PyRuntimeError::new_err(format!("broker runtime bridge failed: {}", e)))? + .map_err(|e| { + PyRuntimeError::new_err(format!( + "failed to delete broker channel: chan_id={} err={}", + channel_id, e + )) + })?; + Ok(payload_keys) +} + fn get_consumed_message_class(py: Python<'_>) -> PyResult> { if let Some(c) = CONSUMED_MESSAGE_CLASS.get() { return Ok(c.clone_ref(py)); @@ -126,6 +208,120 @@ fn finalize_payload_result( result } +fn map_producer_result( + py: Python<'_>, + result: Result<(), CoreMpscError>, + producer: &CoreMpscProducer, +) -> PyResult<()> { + use crate::error::CoreMpscErrorReExport as CoreErr; + + match result { + Ok(()) => Ok(()), + Err(e) => match e { + CoreErr::MessageBufferFull { + channel_id, + capacity, + .. 
+ } => Err(crate::error::pyerr_message_buffer_full( + py, + &e.to_string(), + channel_id, + capacity, + )), + CoreErr::PutPayloadNonRetryable | CoreErr::PutPayloadUnknownCode { .. } => { + Err(crate::error::pyerr_chan_message_produce( + py, + &e.to_string(), + producer.chan_id(), + Some(&producer.producer_idx().to_string()), + None, + )) + } + CoreErr::Etcd(_) => Err(crate::error::pyerr_etcd(py, &e.to_string(), "mpsc_rust")), + CoreErr::JoinError(_) => Err(crate::error::pyerr_join_error( + py, + &e.to_string(), + "mpsc_rust", + )), + CoreErr::Closed => Err(crate::error::pyerr_producer_closed( + py, + &e.to_string(), + producer.chan_id(), + Some(producer.producer_idx()), + )), + CoreErr::Internal(_) | CoreErr::NoMessage => Err(crate::error::pyerr_internal( + py, + &e.to_string(), + "mpsc_rust", + )), + _ => Err(crate::error::pyerr_internal( + py, + &e.to_string(), + "mpsc_rust", + )), + }, + } +} + +fn map_consumer_error(py: Python<'_>, err: CoreMpscError, chan_id: i64) -> PyErr { + use crate::error::CoreMpscErrorReExport as CoreErr; + let message = err.to_string(); + + match err { + CoreErr::NoMessage => crate::error::pyerr_message_consumption_no_new_message( + py, &message, chan_id, None, None, + ), + CoreErr::MessageBufferFull { capacity, .. } => { + crate::error::pyerr_message_buffer_full(py, &message, chan_id, capacity) + } + CoreErr::GetPayloadNonRetryable { .. } + | CoreErr::GetPayloadUnknownCode { .. } + | CoreErr::ConsumeOffsetUpdate { .. } + | CoreErr::DeletePayloadNonRetryable { .. } + | CoreErr::DeletePayloadUnknownCode { .. } => { + crate::error::pyerr_message_consumption(py, &message, chan_id, None, None) + } + CoreErr::PutPayloadNonRetryable | CoreErr::PutPayloadUnknownCode { .. 
} => { + crate::error::pyerr_chan_message_produce(py, &message, chan_id, None, None) + } + CoreErr::Etcd(_) => crate::error::pyerr_etcd(py, &message, "mpsc_rust"), + CoreErr::JoinError(_) => crate::error::pyerr_join_error(py, &message, "mpsc_rust"), + CoreErr::Internal(_) | CoreErr::Closed => { + crate::error::pyerr_internal(py, &message, "mpsc_rust") + } + } +} + +fn consumed_payload_to_pyobject( + py: Python<'_>, + consumed: CoreConsumedPayload, +) -> PyResult<(PyObject, u64)> { + use pyo3::exceptions::PyRuntimeError; + + let CoreConsumedPayload { payload, .. } = consumed; + let pyobj = match payload.downcast::() { + Ok(v) => v.inner, + Err(_) => { + return Err(PyRuntimeError::new_err( + "payload type mismatch: expected PyPayload", + )); + } + }; + + let payload_len: u64 = { + let any = pyobj.bind(py); + if any.is_instance_of::() { + let b = any + .downcast::() + .expect("PyBytes downcast failed after is_instance_of"); + b.as_bytes().len() as u64 + } else { + 0 + } + }; + Ok((pyobj, payload_len)) +} + // (LeaseManagerHandle and PyLease moved to lease_manager.rs) /// Shared MPSC context bound to a specific etcd endpoint set. @@ -307,6 +503,7 @@ impl MpscContext { Ok(MpscProducerHandle { inner: Some(producer), + broker: None, shutdown, kv_framework: self.kv_framework.clone(), kv_runtime: self.kv_runtime.clone(), @@ -449,6 +646,7 @@ impl MpscContext { Ok(MpscConsumerHandle { inner: Some(consumer), + broker: None, shutdown, parent_mpmc_id_opt, kv_framework: self.kv_framework.clone(), @@ -493,6 +691,11 @@ impl MpscContext { ))), } } + + fn delete_broker_channel(&self, chan_id: i64) -> PyResult> { + let broker = connect_distributed_broker(&self.kv_framework); + delete_broker_channel(&self.kv_runtime, &broker, chan_id) + } } /// PyO3 handle for MPSC producer. 
Currently this focuses on @@ -501,6 +704,7 @@ impl MpscContext { #[pyclass] pub struct MpscProducerHandle { pub(crate) inner: Option, + broker: Option, shutdown: ShutdownCtl, kv_framework: Arc, kv_runtime: Handle, @@ -509,50 +713,18 @@ pub struct MpscProducerHandle { put_profile_window_bytes: u64, } -#[pymethods] -impl MpscProducerHandle { - fn chan_id(&self) -> i64 { - self.inner - .as_ref() - .expect("MpscProducerHandle inner not initialized or already taken by an in-flight put") - .chan_id() - } - - fn producer_idx(&self) -> String { - self.inner - .as_ref() - .expect("MpscProducerHandle inner not initialized or already taken by an in-flight put") - .producer_idx() - .to_string() - } - - fn payload_lease_id(&self) -> i64 { - self.inner - .as_ref() - .expect("MpscProducerHandle inner not initialized or already taken by an in-flight put") - .payload_lease_id() - } +enum ProducerPutMode { + Etcd, + Broker { broker: BrokerHandle }, +} - /// Put a message payload into the underlying KV backend by passing raw ptr tuples. - /// - /// This avoids calling back into Python for kvclient.put and lets the KV backend - /// encode/copy directly into segment memory. - /// - /// `ptrs` is a list of `(type_id, dict_key_ptr, dict_key_len, val_u64, val_len, extra)`: - /// - `dict_key_ptr/dict_key_len`: UTF-8 bytes of the dict field key. - /// - For scalar types (bool/int64/float64), `val_u64` stores raw bits and `val_len` is fixed. - /// - For bytes-like types (string/bytes), `val_u64` stores a pointer and `val_len` is the byte length. - /// - /// Safety/lifetime contract: - /// - This is async on the Rust side; the caller must keep the memory behind pointers - /// alive and immutable until this method returns. 
- #[pyo3(signature = (ptrs))] - fn put_flat_dict_ptrs( +impl MpscProducerHandle { + fn put_flat_dict_ptrs_impl( &mut self, ptrs: Vec<(u8, u64, u32, u64, u32, Option)>, + mode: ProducerPutMode, ) -> PyResult<()> { use pyo3::exceptions::PyRuntimeError; - use std::sync::{Arc, Mutex}; if self.shutdown.is_closed() { return Err(PyRuntimeError::new_err("MpscProducerHandle is closed")); @@ -598,56 +770,79 @@ impl MpscProducerHandle { let (tx, rx) = cbchan::bounded::<(Result<(), CoreMpscError>, CoreMpscProducer)>(1); runtime.spawn(async move { - let mut guard = ProducerGuard::new(inner, tx); + let mut guard = ProducerGuard::new(inner, tx, ShutdownCtl::new()); let payload_lease_id = guard.inner_mut().payload_lease_id() as u64; - - let res = guard - .inner_mut() - .put_with_payload(move |key: String, _msg_id: i64, preferred_sub_cluster| { - let mut o = fluxon_kv::client_kv_api::PutOptionalArgs::new(); - o.0.push(fluxon_kv::client_kv_api::PutOptionalArg::LeaseId( - payload_lease_id, + let put_payload: Arc< + dyn Fn(String, i64, Option) -> i32 + Send + Sync + 'static, + > = Arc::new(move |key: String, _msg_id: i64, preferred_sub_cluster| { + let mut o = fluxon_kv::client_kv_api::PutOptionalArgs::new(); + o.0.push(fluxon_kv::client_kv_api::PutOptionalArg::LeaseId( + payload_lease_id, + )); + if let Some(sc) = preferred_sub_cluster { + o.0.push(fluxon_kv::client_kv_api::PutOptionalArg::PreferredSubCluster( + sc, )); - if let Some(sc) = preferred_sub_cluster { - o.0.push(fluxon_kv::client_kv_api::PutOptionalArg::PreferredSubCluster( - sc, - )); - } + } - let ptrs_for_call: Vec<(u8, usize, u32, u64, u32, Option)> = - (*ptrs_arc).clone(); - let kv_framework_for_call = kv_framework.clone(); - let kv_runtime_for_call = kv_runtime.clone(); - let put_res = kv_runtime_for_call.run_async_from_sync(async move { - unsafe { kv_framework_for_call.kv_put_ptrs(&key, ptrs_for_call, o).await } - }); + let ptrs_for_call: Vec<(u8, usize, u32, u64, u32, Option)> = + (*ptrs_arc).clone(); + let 
kv_framework_for_call = kv_framework.clone(); + let kv_runtime_for_call = kv_runtime.clone(); + let put_res = kv_runtime_for_call.run_async_from_sync(async move { + unsafe { kv_framework_for_call.kv_put_ptrs(&key, ptrs_for_call, o).await } + }); - match put_res { - Ok(Ok(())) => 0, - Ok(Err(e)) => { - if matches!( - &e, - fluxon_kv::rpcresp_kvresult_convert::msg_and_error::KvError::Api( - fluxon_kv::rpcresp_kvresult_convert::msg_and_error::ApiError::NoSpace { .. } - ) - ) { - 1 - } else { - if let Ok(mut g) = err_for_closure.lock() { - *g = Some(e.to_string()); - } - 2 - } - } - Err(e) => { + match put_res { + Ok(Ok(())) => 0, + Ok(Err(e)) => { + if matches!( + &e, + fluxon_kv::rpcresp_kvresult_convert::msg_and_error::KvError::Api( + fluxon_kv::rpcresp_kvresult_convert::msg_and_error::ApiError::NoSpace { .. } + ) + ) { + 1 + } else { if let Ok(mut g) = err_for_closure.lock() { - *g = Some(format!("runtime bridge failed: {}", e)); + *g = Some(e.to_string()); } 2 } } - }) - .await; + Err(e) => { + if let Ok(mut g) = err_for_closure.lock() { + *g = Some(format!("runtime bridge failed: {}", e)); + } + 2 + } + } + }); + + let res = match mode { + ProducerPutMode::Etcd => { + let put_payload = put_payload.clone(); + guard + .inner_mut() + .put_with_payload(move |key, msg_id, preferred_sub_cluster| { + (put_payload)(key, msg_id, preferred_sub_cluster) + }) + .await + } + ProducerPutMode::Broker { broker } => { + let put_payload = put_payload.clone(); + guard + .inner_mut() + .put_with_payload_via_broker( + &broker, + payload_len, + move |key, msg_id, preferred_sub_cluster| { + (put_payload)(key, msg_id, preferred_sub_cluster) + }, + ) + .await + } + }; guard.finish(res); }); @@ -658,8 +853,15 @@ impl MpscProducerHandle { Ok(v) => break v, Err(cbchan::RecvTimeoutError::Timeout) => {} Err(cbchan::RecvTimeoutError::Disconnected) => { + self.shutdown.close(); return ( - Err(PyRuntimeError::new_err("put_flat_dict_ptrs task cancelled")), + 
Err(crate::error::pyerr_chan_message_produce( + py, + "producer is closed", + self.chan_id(), + Some(&self.producer_idx().to_string()), + None, + )), None, ); } @@ -685,42 +887,10 @@ impl MpscProducerHandle { } } - let mapped = match result { - Ok(()) => Ok(()), - Err(e) => { - use crate::error::CoreMpscErrorReExport as CoreErr; - match e { - CoreErr::PutPayloadNonRetryable | CoreErr::PutPayloadUnknownCode { .. } => { - Err(crate::error::pyerr_chan_message_produce( - py, - &e.to_string(), - producer_back.chan_id(), - Some(&producer_back.producer_idx().to_string()), - None, - )) - } - CoreErr::Etcd(_) => { - Err(crate::error::pyerr_etcd(py, &e.to_string(), "mpsc_rust")) - } - CoreErr::JoinError(_) => Err(crate::error::pyerr_join_error( - py, - &e.to_string(), - "mpsc_rust", - )), - CoreErr::Internal(_) => Err(crate::error::pyerr_internal( - py, - &e.to_string(), - "mpsc_rust", - )), - _ => Err(crate::error::pyerr_internal( - py, - &e.to_string(), - "mpsc_rust", - )), - } - } - }; - (mapped, Some(producer_back)) + ( + map_producer_result(py, result, &producer_back), + Some(producer_back), + ) }); if let Some(back) = maybe_back { @@ -743,6 +913,71 @@ impl MpscProducerHandle { mapped } +} + +#[pymethods] +impl MpscProducerHandle { + fn chan_id(&self) -> i64 { + self.inner + .as_ref() + .expect("MpscProducerHandle inner not initialized or already taken by an in-flight put") + .chan_id() + } + + fn producer_idx(&self) -> String { + self.inner + .as_ref() + .expect("MpscProducerHandle inner not initialized or already taken by an in-flight put") + .producer_idx() + .to_string() + } + + fn payload_lease_id(&self) -> i64 { + self.inner + .as_ref() + .expect("MpscProducerHandle inner not initialized or already taken by an in-flight put") + .payload_lease_id() + } + + #[pyo3(signature = (ptrs))] + fn put_flat_dict_ptrs( + &mut self, + ptrs: Vec<(u8, u64, u32, u64, u32, Option)>, + ) -> PyResult<()> { + use pyo3::exceptions::PyRuntimeError; + + let broker = self + .broker + 
.clone() + .ok_or_else(|| PyRuntimeError::new_err("broker is not initialized"))?; + self.put_flat_dict_ptrs_impl(ptrs, ProducerPutMode::Broker { broker }) + } + + #[pyo3(signature = (ptrs))] + fn put_flat_dict_ptrs_legacy_for_internal_check( + &mut self, + ptrs: Vec<(u8, u64, u32, u64, u32, Option)>, + ) -> PyResult<()> { + self.put_flat_dict_ptrs_impl(ptrs, ProducerPutMode::Etcd) + } + + fn init_broker(&mut self) -> PyResult<()> { + use pyo3::exceptions::PyRuntimeError; + + let inner = self + .inner + .as_ref() + .ok_or_else(|| PyRuntimeError::new_err("MpscProducerHandle inner not initialized"))?; + let broker = connect_distributed_broker(&self.kv_framework); + init_broker_for_channel( + &self.kv_runtime, + &broker, + inner.chan_id(), + inner.channel_capacity(), + )?; + self.broker = Some(broker); + Ok(()) + } // Removed: the legacy `put_with_payload(callback)` API was intentionally deleted to // force a single supported data path (put_flat_dict_ptrs) and avoid Python callbacks @@ -797,6 +1032,7 @@ impl PyShutdownCtl { #[pyclass] pub struct MpscConsumerHandle { pub(crate) inner: Option, + broker: Option, shutdown: ShutdownCtl, /// Optional parent MPMC id when this MPSC acts as a submodule of a MPMC channel. /// Only used for diagnostics (rate-limited retry logging) and not for behavior. @@ -829,66 +1065,477 @@ pub struct MpscConsumerHandle { get_one_profile_last_timeout_ms: Option, } -#[pymethods] -impl MpscConsumerHandle { - fn chan_id(&self) -> i64 { - self.inner - .as_ref() - .expect("MpscConsumerHandle inner not initialized") - .chan_id() - } - - fn consumer_idx(&self) -> String { - self.inner - .as_ref() - .expect("MpscConsumerHandle inner not initialized") - .consumer_idx() - .to_string() - } +enum ConsumerGetMode { + Prefetch { + prefetch_target: usize, + timeout_ms: Option, + maybe_sync_sub_cluster: Option>, + }, + Broker { + broker: BrokerHandle, + timeout_ms: Option, + }, +} - /// Initialize the global payload callback for this consumer. 
- /// - /// 回调在 consumer 生命周期内复用;后续 `get_one` / - /// `get_with_payload` 调用都不会再传入回调参数。 - #[pyo3(signature = (callback))] - fn init_payload_callback(&mut self, callback: PyObject) -> PyResult<()> { +impl MpscConsumerHandle { + fn get_one_impl( + &mut self, + py: Python<'_>, + mode: ConsumerGetMode, + profile_prefetch_target: usize, + profile_timeout_ms: Option, + ) -> PyResult { use pyo3::exceptions::PyRuntimeError; - use std::sync::Arc; - - let cb: Arc = Arc::new(callback); + use std::time::Duration; + let get_one_begin = std::time::Instant::now(); + let chan_id_for_profile = self.chan_id(); + let consumer_idx_for_profile = self.consumer_idx(); + self.get_one_profile_last_prefetch_target = profile_prefetch_target; + self.get_one_profile_last_timeout_ms = profile_timeout_ms; + if self.shutdown.is_closed() { + return Err(crate::error::pyerr_channel_closed( + py, + "consumer is closed", + self.chan_id(), + )); + } - // Capture identifiers for rate-limited retry logging (diagnostic only). - let mpsc_id_for_log = self.chan_id(); - let parent_mpmc_id_opt = self.parent_mpmc_id_opt; + let runtime = self.kv_runtime.clone(); + let shutdown_for_task = self.shutdown.clone(); - // Rate limit helper lives in fluxon_util::limitrate + let inner = self + .inner + .take() + .ok_or_else(|| PyRuntimeError::new_err("MpscConsumerHandle is already in use"))?; - let bridge_cb: fluxon_mq::consumer::PayloadCallback = Arc::new( - move |producer_id: String, key: String| { - let cb_for_call = cb.clone(); - Box::pin(async move { - let producer_id_for_call = producer_id.clone(); - let key_for_call = key.clone(); + let (tx, rx) = + cbchan::bounded::<(Result, CoreMpscConsumer)>(1); - let join = limit_thirdparty::tokio::task::spawn_blocking(move || { - // Run the Python callback via a global Python executor. - // This avoids blocking the Tokio scheduler thread. 
- let (pid_obj, key_obj) = Python::with_gil(|py| { - ( - PyString::new_bound(py, &producer_id_for_call) - .unbind() - .into(), - PyString::new_bound(py, &key_for_call).unbind().into(), + runtime.spawn(async move { + let mut guard = ConsumerGuard::new(inner, tx, ShutdownCtl::new()); + let (chan_id_for_log, consumer_idx_for_log) = { + let inner_ref = guard.inner_mut(); + (inner_ref.chan_id(), inner_ref.consumer_idx().to_string()) + }; + let res = match mode { + ConsumerGetMode::Prefetch { + prefetch_target, + timeout_ms, + maybe_sync_sub_cluster, + } => { + if let Some(sc) = maybe_sync_sub_cluster { + if let Err(e) = guard.inner_mut().sync_kvclient_sub_cluster(sc.clone()).await + { + warn!( + "[MpscConsumer chan_id={} consumer_idx={}] failed to sync kvclient_sub_cluster={:?}: {}; continuing consumption", + chan_id_for_log, consumer_idx_for_log, sc, e + ); + } + } + if let Some(ms) = timeout_ms { + guard + .inner_mut() + .get_with_payload_retry_wait_timeout( + prefetch_target, + Duration::from_millis(ms as u64), ) - }); - - match fluxon_util::pyo3::run_longtime_py_function( - cb_for_call.as_ref(), - vec![pid_obj, key_obj], - None, - ) { - Ok(obj) => { - // Normalize error reporting to (code:int, msg:str). Otherwise treat as payload object. 
+ .await + } else { + guard + .inner_mut() + .get_with_payload_retry(prefetch_target) + .await + } + } + ConsumerGetMode::Broker { broker, timeout_ms } => { + if let Some(ms) = timeout_ms { + let timeout = Duration::from_millis(ms.max(0) as u64); + let deadline = Instant::now() + timeout; + loop { + match guard.inner_mut().get_with_payload_via_broker(&broker).await { + Err(CoreMpscError::NoMessage) + if !shutdown_for_task.is_closed() + && Instant::now() < deadline => + { + let remaining = + deadline.saturating_duration_since(Instant::now()); + tokio::time::sleep(std::cmp::min( + BROKER_EMPTY_POLL_INTERVAL, + remaining, + )) + .await; + } + result => break result, + } + } + } else { + guard.inner_mut().get_with_payload_via_broker(&broker).await + } + } + }; + match &res { + Ok(payload) => { + debug!( + "[MpscConsumerHandle chan_id={} consumer_idx={}] async get finished: producer_id={} nonblocking_hit={}", + chan_id_for_log, + consumer_idx_for_log, + payload.producer_id, + payload.nonblocking_hit, + ); + } + Err(err) => { + debug!( + "[MpscConsumerHandle chan_id={} consumer_idx={}] async get finished with error: {:?}", + chan_id_for_log, + consumer_idx_for_log, + err, + ); + } + } + guard.finish(res); + }); + + let mut wait_rx_ns: u64 = 0; + let mut wait_rx_max_ns: u64 = 0; + let mut signal_ns: u64 = 0; + let mut signal_max_ns: u64 = 0; + let mut recv_timeouts: u64 = 0; + let mut recv_calls: u64 = 0; + let wait_begin = Instant::now(); + let mut next_pending_warn_at = wait_begin + GET_ONE_PENDING_WARN_INTERVAL; + + let (result, consumer_back) = loop { + recv_calls += 1; + let recv_begin = Instant::now(); + let recv_res = py.allow_threads(|| rx.recv_timeout(Duration::from_millis(50))); + let recv_elapsed_ns = recv_begin.elapsed().as_nanos() as u64; + wait_rx_ns += recv_elapsed_ns; + if recv_elapsed_ns > wait_rx_max_ns { + wait_rx_max_ns = recv_elapsed_ns; + } + + match recv_res { + Ok(v) => break v, + Err(cbchan::RecvTimeoutError::Timeout) => { + recv_timeouts += 1; + 
let now = Instant::now(); + if now >= next_pending_warn_at { + warn!( + "[MpscConsumerHandle chan_id={} consumer_idx={}] get_one still pending: elapsed_ms={} recv_calls={} recv_timeouts={} prefetch_target={} timeout_ms={:?}", + chan_id_for_profile, + consumer_idx_for_profile, + wait_begin.elapsed().as_millis(), + recv_calls, + recv_timeouts, + profile_prefetch_target, + profile_timeout_ms, + ); + next_pending_warn_at = now + GET_ONE_PENDING_WARN_INTERVAL; + } + } + Err(cbchan::RecvTimeoutError::Disconnected) => { + return Err(PyRuntimeError::new_err("get_one task cancelled")); + } + } + + let signal_begin = Instant::now(); + let signal_res = py.check_signals(); + let signal_elapsed_ns = signal_begin.elapsed().as_nanos() as u64; + signal_ns += signal_elapsed_ns; + if signal_elapsed_ns > signal_max_ns { + signal_max_ns = signal_elapsed_ns; + } + + if let Err(e) = signal_res { + self.shutdown.close(); + return Err(e); + } + }; + + let post_begin = Instant::now(); + self.inner = Some(consumer_back); + + let consumed = match result { + Ok(v) => v, + Err(e) => return Err(map_consumer_error(py, e, self.chan_id())), + }; + let (pyobj, payload_len) = consumed_payload_to_pyobject(py, consumed)?; + + let get_one_total = get_one_begin.elapsed(); + let total_ns = get_one_total.as_nanos() as u64; + let post_ns = post_begin.elapsed().as_nanos() as u64; + + self.get_one_profile_cnt += 1; + self.get_one_profile_window_bytes += payload_len; + self.get_one_profile_total_sum_ns += total_ns; + if total_ns > self.get_one_profile_total_max_ns { + self.get_one_profile_total_max_ns = total_ns; + } + self.get_one_profile_wait_rx_sum_ns += wait_rx_ns; + if wait_rx_max_ns > self.get_one_profile_wait_rx_max_ns { + self.get_one_profile_wait_rx_max_ns = wait_rx_max_ns; + } + self.get_one_profile_signal_sum_ns += signal_ns; + if signal_max_ns > self.get_one_profile_signal_max_ns { + self.get_one_profile_signal_max_ns = signal_max_ns; + } + self.get_one_profile_post_sum_ns += post_ns; + if post_ns 
> self.get_one_profile_post_max_ns { + self.get_one_profile_post_max_ns = post_ns; + } + self.get_one_profile_recv_timeouts += recv_timeouts; + self.get_one_profile_recv_calls += recv_calls; + + let now = Instant::now(); + if now >= self.get_one_profile_next_log_at && self.get_one_profile_cnt > 0 { + let cnt = self.get_one_profile_cnt; + let avg_total_ms = + (self.get_one_profile_total_sum_ns as f64) / (cnt as f64) / 1_000_000.0; + let avg_wait_rx_ms = + (self.get_one_profile_wait_rx_sum_ns as f64) / (cnt as f64) / 1_000_000.0; + let avg_signal_ms = + (self.get_one_profile_signal_sum_ns as f64) / (cnt as f64) / 1_000_000.0; + let avg_post_ms = + (self.get_one_profile_post_sum_ns as f64) / (cnt as f64) / 1_000_000.0; + let max_total_ms = (self.get_one_profile_total_max_ns as f64) / 1_000_000.0; + let max_wait_rx_ms = (self.get_one_profile_wait_rx_max_ns as f64) / 1_000_000.0; + let max_signal_ms = (self.get_one_profile_signal_max_ns as f64) / 1_000_000.0; + let max_post_ms = (self.get_one_profile_post_max_ns as f64) / 1_000_000.0; + + tracing::info!( + "[MpscConsumerHandle chan_id={} consumer_idx={}] get_one breakdown: avg_total_ms={:.3} max_total_ms={:.3} avg_wait_rx_ms={:.3} max_wait_rx_ms={:.3} avg_signal_ms={:.3} max_signal_ms={:.3} avg_post_ms={:.3} max_post_ms={:.3} cnt={} recv_calls={} recv_timeouts={} last_prefetch_target={} last_timeout_ms={:?}", + chan_id_for_profile, + consumer_idx_for_profile, + avg_total_ms, + max_total_ms, + avg_wait_rx_ms, + max_wait_rx_ms, + avg_signal_ms, + max_signal_ms, + avg_post_ms, + max_post_ms, + cnt, + self.get_one_profile_recv_calls, + self.get_one_profile_recv_timeouts, + self.get_one_profile_last_prefetch_target, + self.get_one_profile_last_timeout_ms, + ); + + self.inner + .as_ref() + .expect("MpscConsumerHandle inner not initialized") + .observe_get_one_breakdown_window_ms( + avg_total_ms, + max_total_ms, + avg_wait_rx_ms, + max_wait_rx_ms, + avg_signal_ms, + max_signal_ms, + avg_post_ms, + max_post_ms, + cnt, + 
self.get_one_profile_recv_timeouts, + self.get_one_profile_window_bytes, + ); + + self.get_one_profile_next_log_at = now + Duration::from_secs(30); + self.get_one_profile_cnt = 0; + self.get_one_profile_total_sum_ns = 0; + self.get_one_profile_total_max_ns = 0; + self.get_one_profile_wait_rx_sum_ns = 0; + self.get_one_profile_wait_rx_max_ns = 0; + self.get_one_profile_signal_sum_ns = 0; + self.get_one_profile_signal_max_ns = 0; + self.get_one_profile_post_sum_ns = 0; + self.get_one_profile_post_max_ns = 0; + self.get_one_profile_recv_timeouts = 0; + self.get_one_profile_recv_calls = 0; + self.get_one_profile_window_bytes = 0; + } + Ok(pyobj) + } + + fn get_batch_via_broker_impl( + &mut self, + py: Python<'_>, + broker: BrokerHandle, + batch_size: usize, + timeout_ms: Option, + ) -> PyResult> { + use pyo3::exceptions::PyRuntimeError; + + if self.shutdown.is_closed() { + return Err(crate::error::pyerr_channel_closed( + py, + "consumer is closed", + self.chan_id(), + )); + } + + let chan_id_for_log = self.chan_id(); + let consumer_idx_for_log = self.consumer_idx(); + let runtime = self.kv_runtime.clone(); + let shutdown_for_task = self.shutdown.clone(); + let inner = self + .inner + .take() + .ok_or_else(|| PyRuntimeError::new_err("MpscConsumerHandle is already in use"))?; + let (tx, rx) = cbchan::bounded::<( + Result, CoreMpscError>, + CoreMpscConsumer, + )>(1); + + runtime.spawn(async move { + let mut guard = BatchConsumerGuard::new(inner, tx, ShutdownCtl::new()); + let res = { + if let Some(ms) = timeout_ms { + let timeout = Duration::from_millis(ms.max(0) as u64); + let deadline = Instant::now() + timeout; + loop { + match guard + .inner_mut() + .get_batch_with_payload_via_broker(&broker, batch_size) + .await + { + Err(CoreMpscError::NoMessage) + if !shutdown_for_task.is_closed() && Instant::now() < deadline => + { + let remaining = deadline.saturating_duration_since(Instant::now()); + tokio::time::sleep(std::cmp::min( + BROKER_EMPTY_POLL_INTERVAL, + remaining, + 
)) + .await; + } + result => break result, + } + } + } else { + guard + .inner_mut() + .get_batch_with_payload_via_broker(&broker, batch_size) + .await + } + }; + guard.finish(res); + }); + + let wait_begin = Instant::now(); + let mut next_pending_warn_at = wait_begin + GET_ONE_PENDING_WARN_INTERVAL; + let mut recv_calls: u64 = 0; + let mut recv_timeouts: u64 = 0; + + let (result, consumer_back) = loop { + recv_calls += 1; + let recv_res = py.allow_threads(|| rx.recv_timeout(Duration::from_millis(50))); + match recv_res { + Ok(v) => break v, + Err(cbchan::RecvTimeoutError::Timeout) => { + recv_timeouts += 1; + let now = Instant::now(); + if now >= next_pending_warn_at { + warn!( + "[MpscConsumerHandle chan_id={} consumer_idx={}] get_batch still pending: elapsed_ms={} recv_calls={} recv_timeouts={} batch_size={} timeout_ms={:?}", + chan_id_for_log, + consumer_idx_for_log, + wait_begin.elapsed().as_millis(), + recv_calls, + recv_timeouts, + batch_size, + timeout_ms, + ); + next_pending_warn_at = now + GET_ONE_PENDING_WARN_INTERVAL; + } + } + Err(cbchan::RecvTimeoutError::Disconnected) => { + return Err(PyRuntimeError::new_err("get_batch task cancelled")); + } + } + + if let Err(e) = py.check_signals() { + self.shutdown.close(); + return Err(e); + } + }; + + self.inner = Some(consumer_back); + let payloads = match result { + Ok(v) => v, + Err(e) => return Err(map_consumer_error(py, e, chan_id_for_log)), + }; + + let mut objects = Vec::with_capacity(payloads.len()); + for consumed in payloads { + let (pyobj, payload_len) = consumed_payload_to_pyobject(py, consumed)?; + self.get_one_profile_window_bytes += payload_len; + objects.push(pyobj); + } + Ok(objects) + } +} + +#[pymethods] +impl MpscConsumerHandle { + fn chan_id(&self) -> i64 { + self.inner + .as_ref() + .expect("MpscConsumerHandle inner not initialized") + .chan_id() + } + + fn consumer_idx(&self) -> String { + self.inner + .as_ref() + .expect("MpscConsumerHandle inner not initialized") + .consumer_idx() + 
+ .to_string() + } + + /// Initialize the global payload callback for this consumer. + /// + /// The callback is reused for the lifetime of the consumer; subsequent `get_one` / + /// `get_with_payload` calls no longer take a callback argument. + #[pyo3(signature = (callback))] + fn init_payload_callback(&mut self, callback: PyObject) -> PyResult<()> { + use pyo3::exceptions::PyRuntimeError; + use std::sync::Arc; + + let cb: Arc = Arc::new(callback); + let kv_runtime = self.kv_runtime.clone(); + + // Capture identifiers for rate-limited retry logging (diagnostic only). + let mpsc_id_for_log = self.chan_id(); + let parent_mpmc_id_opt = self.parent_mpmc_id_opt; + + // Rate limit helper lives in fluxon_util::limitrate + + let bridge_cb: fluxon_mq::consumer::PayloadCallback = Arc::new( + move |producer_id: String, key: String| { + let cb_for_call = cb.clone(); + let kv_runtime_for_call = kv_runtime.clone(); + Box::pin(async move { + let producer_id_for_call = producer_id.clone(); + let key_for_call = key.clone(); + + let join = limit_thirdparty::tokio::task::spawn_blocking(move || { + // Run the Python callback via a global Python executor. + // This avoids blocking the Tokio scheduler thread. + let (pid_obj, key_obj) = Python::with_gil(|py| { + ( + PyString::new_bound(py, &producer_id_for_call) + .unbind() + .into(), + PyString::new_bound(py, &key_for_call).unbind().into(), + ) + }); + + match fluxon_util::pyo3::run_longtime_py_function( + cb_for_call.as_ref(), + vec![pid_obj, key_obj], + None, + ) { + Ok(obj) => { + // Normalize error reporting to (code:int, msg:str). Otherwise treat as payload object. 
Python::with_gil(|py| { if let Ok((code, msg)) = obj.extract::<(i32, String)>(py) { if code == 1 { @@ -921,6 +1568,7 @@ impl MpscConsumerHandle { } else { CorePayloadResult::Ok(Box::new(PyPayload { inner: obj.clone_ref(py), + cleanup_runtime: kv_runtime_for_call.clone(), })) } }) @@ -1129,8 +1777,10 @@ impl MpscConsumerHandle { let py_wrap_begin = Instant::now(); stage.store(3, Ordering::Relaxed); let payload_owner = match &holder { - KvHolder::Owner(h) => FlatDictDataOwner::UserMemHolder(h.clone()), - KvHolder::External(h) => FlatDictDataOwner::ExternalMemHolder(h.clone()), + KvHolder::Owner(h) => FlatDictDataOwner::from_user_memholder(h.clone()), + KvHolder::External(h) => { + FlatDictDataOwner::from_external_memholder(h.clone()) + } }; let pyobj_res: Result = Python::with_gil(|py| { stage_for_py.store(4, Ordering::Relaxed); @@ -1159,376 +1809,149 @@ impl MpscConsumerHandle { match pyobj_res { Ok(obj) => finalize_payload_result( - CorePayloadResult::Ok(Box::new(PyPayload { inner: obj })), + CorePayloadResult::Ok(Box::new(PyPayload { + inner: obj, + cleanup_runtime: kv_runtime_for_call.clone(), + })), &stage, &done, payload_begin, - &producer_id, - &key, - kv_get_ns, - decode_ns, - py_wrap_ns, - ), - Err(msg) => finalize_payload_result( - CorePayloadResult::NonRetryable(msg), - &stage, - &done, - payload_begin, - &producer_id, - &key, - kv_get_ns, - decode_ns, - py_wrap_ns, - ), - } - }) - }, - ); - - match self.inner.as_mut() { - Some(inner) => { - inner.set_payload_callback(bridge_cb); - Ok(()) - } - None => Err(PyRuntimeError::new_err( - "MpscConsumerHandle inner not initialized", - )), - } - } - - /// New get API that relies on the previously initialized - /// payload callback and returns the Python payload object. - /// - /// `prefetch_target` 用于驱动 Rust 侧预取窗口大小,通常 - /// 由 Python `get_data(batch_size, prefetch_num)` 计算得出。 - /// - /// `timeout_ms` is an optional timeout (milliseconds) for waiting on an - /// available inflight slot. 
If it fires, the call returns `NoMessage`. - /// - /// Important: once a message is reserved (i.e. an inflight JoinHandle is - /// popped), the call will await it to completion to avoid dropping in-flight - /// fetches and stranding offsets. - #[pyo3(signature = (prefetch_target, timeout_ms=None))] - fn get_one( - &mut self, - py: Python<'_>, - prefetch_target: usize, - timeout_ms: Option, - ) -> PyResult { - use pyo3::exceptions::PyRuntimeError; - use std::time::Duration; - let get_one_begin = std::time::Instant::now(); - let chan_id_for_profile = self.chan_id(); - let consumer_idx_for_profile = self.consumer_idx(); - self.get_one_profile_last_prefetch_target = prefetch_target; - self.get_one_profile_last_timeout_ms = timeout_ms; - if self.shutdown.is_closed() { - return Err(PyRuntimeError::new_err("MpscConsumerHandle is closed")); - } - - let maybe_sync_sub_cluster = { - let now = Instant::now(); - if now >= self.next_sub_cluster_sync_at { - self.next_sub_cluster_sync_at = now + SUB_CLUSTER_SYNC_INTERVAL; - Some( - self.kv_framework - .cluster_manager_view() - .cluster_manager() - .get_self_info() - .sub_cluster - .clone(), - ) - } else { - None - } - }; - - let runtime = self.kv_runtime.clone(); - - let inner = self - .inner - .take() - .ok_or_else(|| PyRuntimeError::new_err("MpscConsumerHandle is already in use"))?; - - let (tx, rx) = - cbchan::bounded::<(Result, CoreMpscConsumer)>(1); - - runtime.spawn(async move { - let mut guard = ConsumerGuard::new(inner, tx); - let (chan_id_for_log, consumer_idx_for_log) = { - let inner_ref = guard.inner_mut(); - (inner_ref.chan_id(), inner_ref.consumer_idx().to_string()) - }; - if let Some(sc) = maybe_sync_sub_cluster { - if let Err(e) = guard.inner_mut().sync_kvclient_sub_cluster(sc.clone()).await { - warn!( - "[MpscConsumer chan_id={} consumer_idx={}] failed to sync kvclient_sub_cluster={:?}: {}; continuing consumption", - chan_id_for_log, consumer_idx_for_log, sc, e - ); - } - } - let res = if let Some(ms) = timeout_ms 
{ - guard - .inner_mut() - .get_with_payload_retry_wait_timeout(prefetch_target, Duration::from_millis(ms as u64)) - .await - } else { - guard.inner_mut().get_with_payload_retry(prefetch_target).await - }; - match &res { - Ok(payload) => { - debug!( - "[MpscConsumerHandle chan_id={} consumer_idx={}] async get finished: producer_id={} nonblocking_hit={}", - chan_id_for_log, - consumer_idx_for_log, - payload.producer_id, - payload.nonblocking_hit, - ); - } - Err(err) => { - debug!( - "[MpscConsumerHandle chan_id={} consumer_idx={}] async get finished with error: {:?}", - chan_id_for_log, - consumer_idx_for_log, - err, - ); - } - } - guard.finish(res); - }); - - let mut wait_rx_ns: u64 = 0; - let mut wait_rx_max_ns: u64 = 0; - let mut signal_ns: u64 = 0; - let mut signal_max_ns: u64 = 0; - let mut recv_timeouts: u64 = 0; - let mut recv_calls: u64 = 0; - let wait_begin = Instant::now(); - let mut next_pending_warn_at = wait_begin + GET_ONE_PENDING_WARN_INTERVAL; - - let (result, consumer_back) = loop { - recv_calls += 1; - let recv_begin = Instant::now(); - let recv_res = py.allow_threads(|| rx.recv_timeout(Duration::from_millis(50))); - let recv_elapsed_ns = recv_begin.elapsed().as_nanos() as u64; - wait_rx_ns += recv_elapsed_ns; - if recv_elapsed_ns > wait_rx_max_ns { - wait_rx_max_ns = recv_elapsed_ns; - } - - match recv_res { - Ok(v) => break v, - Err(cbchan::RecvTimeoutError::Timeout) => { - recv_timeouts += 1; - let now = Instant::now(); - if now >= next_pending_warn_at { - warn!( - "[MpscConsumerHandle chan_id={} consumer_idx={}] get_one still pending: elapsed_ms={} recv_calls={} recv_timeouts={} prefetch_target={} timeout_ms={:?}", - chan_id_for_profile, - consumer_idx_for_profile, - wait_begin.elapsed().as_millis(), - recv_calls, - recv_timeouts, - prefetch_target, - timeout_ms, - ); - next_pending_warn_at = now + GET_ONE_PENDING_WARN_INTERVAL; + &producer_id, + &key, + kv_get_ns, + decode_ns, + py_wrap_ns, + ), + Err(msg) => finalize_payload_result( + 
+ CorePayloadResult::NonRetryable(msg), + &stage, + &done, + payload_begin, + &producer_id, + &key, + kv_get_ns, + decode_ns, + py_wrap_ns, + ), } - } - Err(cbchan::RecvTimeoutError::Disconnected) => { - return Err(PyRuntimeError::new_err("get_one task cancelled")); - } - } + }) + }, + ); - let signal_begin = Instant::now(); - let signal_res = py.check_signals(); - let signal_elapsed_ns = signal_begin.elapsed().as_nanos() as u64; - signal_ns += signal_elapsed_ns; - if signal_elapsed_ns > signal_max_ns { - signal_max_ns = signal_elapsed_ns; + match self.inner.as_mut() { + Some(inner) => { + inner.set_payload_callback(bridge_cb); + Ok(()) } + None => Err(PyRuntimeError::new_err( + "MpscConsumerHandle inner not initialized", + )), + } + } - if let Err(e) = signal_res { - self.shutdown.close(); - return Err(e); - } - }; + /// New get API that relies on the previously initialized + /// payload callback and returns the Python payload object. + /// + /// `prefetch_target` drives the size of the Rust-side prefetch window and is usually + /// computed by Python `get_data(batch_size, prefetch_num)`. + /// + /// `timeout_ms` is an optional timeout (milliseconds) for waiting on an + /// available inflight slot. If it fires, the call returns `NoMessage`. + /// + /// Important: once a message is reserved (i.e. an inflight JoinHandle is + /// popped), the call will await it to completion to avoid dropping in-flight + /// fetches and stranding offsets. 
+ #[pyo3(signature = (prefetch_target, timeout_ms=None))] + fn get_one( + &mut self, + py: Python<'_>, + prefetch_target: usize, + timeout_ms: Option, + ) -> PyResult { + use pyo3::exceptions::PyRuntimeError; - let post_begin = Instant::now(); - self.inner = Some(consumer_back); + let _ = prefetch_target; + let broker = self + .broker + .clone() + .ok_or_else(|| PyRuntimeError::new_err("broker is not initialized"))?; + self.get_one_impl( + py, + ConsumerGetMode::Broker { broker, timeout_ms }, + 0, + timeout_ms, + ) + } - let consumed = match result { - Ok(v) => v, - Err(e) => { - use crate::error::CoreMpscErrorReExport as CoreErr; - return Err(match e { - CoreErr::NoMessage => crate::error::pyerr_message_consumption_no_new_message( - py, - &e.to_string(), - self.chan_id(), - None, - None, - ), - CoreErr::GetPayloadNonRetryable { .. } - | CoreErr::GetPayloadUnknownCode { .. } - | CoreErr::ConsumeOffsetUpdate { .. } - | CoreErr::DeletePayloadNonRetryable { .. } - | CoreErr::DeletePayloadUnknownCode { .. } => { - crate::error::pyerr_message_consumption( - py, - &e.to_string(), - self.chan_id(), - None, - None, - ) - } - CoreErr::PutPayloadNonRetryable | CoreErr::PutPayloadUnknownCode { .. } => { - crate::error::pyerr_chan_message_produce( - py, - &e.to_string(), - self.chan_id(), - None, - None, - ) - } - CoreErr::Etcd(_) => crate::error::pyerr_etcd(py, &e.to_string(), "mpsc_rust"), - CoreErr::JoinError(_) => { - crate::error::pyerr_join_error(py, &e.to_string(), "mpsc_rust") - } - CoreErr::Internal(_) => { - crate::error::pyerr_internal(py, &e.to_string(), "mpsc_rust") - } - }); - } - }; - // Downcast to PyPayload and extract the PyObject - let CoreConsumedPayload { payload, .. 
} = consumed; - let pyobj = match payload.downcast::() { - Ok(v) => v.inner, - Err(_) => { - return Err(PyRuntimeError::new_err( - "payload type mismatch: expected PyPayload", - )); - } - }; + #[pyo3(signature = (batch_size, prefetch_target, timeout_ms=None))] + fn get_batch( + &mut self, + py: Python<'_>, + batch_size: usize, + prefetch_target: usize, + timeout_ms: Option, + ) -> PyResult> { + use pyo3::exceptions::PyRuntimeError; + + let _ = prefetch_target; + let broker = self + .broker + .clone() + .ok_or_else(|| PyRuntimeError::new_err("broker is not initialized"))?; + self.get_batch_via_broker_impl(py, broker, batch_size, timeout_ms) + } - // English note: - // - MQ payload is expected to be bytes in the common path. - // - If payload is not a `bytes` object, we skip size accounting to avoid guessing. - let payload_len: u64 = { - let any = pyobj.bind(py); - if any.is_instance_of::() { - let b = any - .downcast::() - .expect("PyBytes downcast failed after is_instance_of"); - b.as_bytes().len() as u64 + #[pyo3(signature = (prefetch_target, timeout_ms=None))] + fn get_one_legacy_for_internal_check( + &mut self, + py: Python<'_>, + prefetch_target: usize, + timeout_ms: Option, + ) -> PyResult { + let maybe_sync_sub_cluster = { + let now = Instant::now(); + if now >= self.next_sub_cluster_sync_at { + self.next_sub_cluster_sync_at = now + SUB_CLUSTER_SYNC_INTERVAL; + Some( + self.kv_framework + .cluster_manager_view() + .cluster_manager() + .get_self_info() + .sub_cluster + .clone(), + ) } else { - 0 + None } }; + self.get_one_impl( + py, + ConsumerGetMode::Prefetch { + prefetch_target, + timeout_ms, + maybe_sync_sub_cluster, + }, + prefetch_target, + timeout_ms, + ) + } - let get_one_total = get_one_begin.elapsed(); - let total_ns = get_one_total.as_nanos() as u64; - let post_ns = post_begin.elapsed().as_nanos() as u64; - - self.get_one_profile_cnt += 1; - self.get_one_profile_window_bytes += payload_len; - self.get_one_profile_total_sum_ns += total_ns; - if 
total_ns > self.get_one_profile_total_max_ns { - self.get_one_profile_total_max_ns = total_ns; - } - self.get_one_profile_wait_rx_sum_ns += wait_rx_ns; - if wait_rx_max_ns > self.get_one_profile_wait_rx_max_ns { - self.get_one_profile_wait_rx_max_ns = wait_rx_max_ns; - } - self.get_one_profile_signal_sum_ns += signal_ns; - if signal_max_ns > self.get_one_profile_signal_max_ns { - self.get_one_profile_signal_max_ns = signal_max_ns; - } - self.get_one_profile_post_sum_ns += post_ns; - if post_ns > self.get_one_profile_post_max_ns { - self.get_one_profile_post_max_ns = post_ns; - } - self.get_one_profile_recv_timeouts += recv_timeouts; - self.get_one_profile_recv_calls += recv_calls; - - let now = Instant::now(); - if now >= self.get_one_profile_next_log_at && self.get_one_profile_cnt > 0 { - let cnt = self.get_one_profile_cnt; - let avg_total_ms = - (self.get_one_profile_total_sum_ns as f64) / (cnt as f64) / 1_000_000.0; - let avg_wait_rx_ms = - (self.get_one_profile_wait_rx_sum_ns as f64) / (cnt as f64) / 1_000_000.0; - let avg_signal_ms = - (self.get_one_profile_signal_sum_ns as f64) / (cnt as f64) / 1_000_000.0; - let avg_post_ms = - (self.get_one_profile_post_sum_ns as f64) / (cnt as f64) / 1_000_000.0; - let max_total_ms = (self.get_one_profile_total_max_ns as f64) / 1_000_000.0; - let max_wait_rx_ms = (self.get_one_profile_wait_rx_max_ns as f64) / 1_000_000.0; - let max_signal_ms = (self.get_one_profile_signal_max_ns as f64) / 1_000_000.0; - let max_post_ms = (self.get_one_profile_post_max_ns as f64) / 1_000_000.0; - - tracing::info!( - "[MpscConsumerHandle chan_id={} consumer_idx={}] get_one breakdown: \ -avg_total_ms={:.3} max_total_ms={:.3} \ -avg_wait_rx_ms={:.3} max_wait_rx_ms={:.3} \ -avg_signal_ms={:.3} max_signal_ms={:.3} \ -avg_post_ms={:.3} max_post_ms={:.3} \ -cnt={} recv_calls={} recv_timeouts={} last_prefetch_target={} last_timeout_ms={:?}", - chan_id_for_profile, - consumer_idx_for_profile, - avg_total_ms, - max_total_ms, - avg_wait_rx_ms, - 
max_wait_rx_ms, - avg_signal_ms, - max_signal_ms, - avg_post_ms, - max_post_ms, - cnt, - self.get_one_profile_recv_calls, - self.get_one_profile_recv_timeouts, - self.get_one_profile_last_prefetch_target, - self.get_one_profile_last_timeout_ms, - ); - - self.inner - .as_ref() - .expect("MpscConsumerHandle inner not initialized") - .observe_get_one_breakdown_window_ms( - avg_total_ms, - max_total_ms, - avg_wait_rx_ms, - max_wait_rx_ms, - avg_signal_ms, - max_signal_ms, - avg_post_ms, - max_post_ms, - cnt, - self.get_one_profile_recv_timeouts, - self.get_one_profile_window_bytes, - ); + fn init_broker(&mut self) -> PyResult<()> { + use pyo3::exceptions::PyRuntimeError; - self.get_one_profile_next_log_at = now + Duration::from_secs(30); - self.get_one_profile_cnt = 0; - self.get_one_profile_total_sum_ns = 0; - self.get_one_profile_total_max_ns = 0; - self.get_one_profile_wait_rx_sum_ns = 0; - self.get_one_profile_wait_rx_max_ns = 0; - self.get_one_profile_signal_sum_ns = 0; - self.get_one_profile_signal_max_ns = 0; - self.get_one_profile_post_sum_ns = 0; - self.get_one_profile_post_max_ns = 0; - self.get_one_profile_recv_timeouts = 0; - self.get_one_profile_recv_calls = 0; - self.get_one_profile_window_bytes = 0; - } - // println!( - // "[MpscConsumer chan_id={}] get_one total duration: {:?}", - // self.chan_id(), - // get_one_total - // ); - Ok(pyobj) + let inner = self + .inner + .as_ref() + .ok_or_else(|| PyRuntimeError::new_err("MpscConsumerHandle inner not initialized"))?; + let broker = connect_distributed_broker(&self.kv_framework); + init_broker_for_channel( + &self.kv_runtime, + &broker, + inner.chan_id(), + inner.channel_capacity(), + )?; + self.broker = Some(broker); + Ok(()) } /// Initialize a delete callback which will be invoked by Rust after @@ -1793,16 +2216,19 @@ cnt={} recv_calls={} recv_timeouts={} last_prefetch_target={} last_timeout_ms={: struct ProducerGuard { inner: Option, tx: Option, CoreMpscProducer)>>, + shutdown: ShutdownCtl, } impl 
ProducerGuard { fn new( inner: CoreMpscProducer, tx: cbchan::Sender<(Result<(), CoreMpscError>, CoreMpscProducer)>, + shutdown: ShutdownCtl, ) -> Self { Self { inner: Some(inner), tx: Some(tx), + shutdown, } } @@ -1822,12 +2248,12 @@ impl ProducerGuard { impl Drop for ProducerGuard { fn drop(&mut self) { if let (Some(inner), Some(tx)) = (self.inner.take(), self.tx.take()) { - let _ = tx.send(( - Err(CoreMpscError::Internal( - "producer guard dropped unexpectedly".to_string(), - )), - inner, - )); + let err = if self.shutdown.is_closed() { + CoreMpscError::Closed + } else { + CoreMpscError::Internal("producer guard dropped unexpectedly".to_string()) + }; + let _ = tx.send((Err(err), inner)); } } } @@ -1837,16 +2263,19 @@ impl Drop for ProducerGuard { struct ConsumerGuard { inner: Option, tx: Option, CoreMpscConsumer)>>, + shutdown: ShutdownCtl, } impl ConsumerGuard { fn new( inner: CoreMpscConsumer, tx: cbchan::Sender<(Result, CoreMpscConsumer)>, + shutdown: ShutdownCtl, ) -> Self { Self { inner: Some(inner), tx: Some(tx), + shutdown, } } @@ -1885,12 +2314,65 @@ impl ConsumerGuard { impl Drop for ConsumerGuard { fn drop(&mut self) { if let (Some(inner), Some(tx)) = (self.inner.take(), self.tx.take()) { - let _ = tx.send(( - Err(CoreMpscError::Internal( - "consumer guard dropped unexpectedly".to_string(), - )), - inner, - )); + let err = if self.shutdown.is_closed() { + CoreMpscError::Closed + } else { + CoreMpscError::Internal("consumer guard dropped unexpectedly".to_string()) + }; + let _ = tx.send((Err(err), inner)); + } + } +} + +struct BatchConsumerGuard { + inner: Option, + tx: Option< + cbchan::Sender<( + Result, CoreMpscError>, + CoreMpscConsumer, + )>, + >, + shutdown: ShutdownCtl, +} + +impl BatchConsumerGuard { + fn new( + inner: CoreMpscConsumer, + tx: cbchan::Sender<( + Result, CoreMpscError>, + CoreMpscConsumer, + )>, + shutdown: ShutdownCtl, + ) -> Self { + Self { + inner: Some(inner), + tx: Some(tx), + shutdown, + } + } + + fn inner_mut(&mut self) -> 
&mut CoreMpscConsumer { + self.inner + .as_mut() + .expect("BatchConsumerGuard inner already taken") + } + + fn finish(mut self, res: Result, CoreMpscError>) { + if let (Some(inner), Some(tx)) = (self.inner.take(), self.tx.take()) { + let _ = tx.send((res, inner)); + } + } +} + +impl Drop for BatchConsumerGuard { + fn drop(&mut self) { + if let (Some(inner), Some(tx)) = (self.inner.take(), self.tx.take()) { + let err = if self.shutdown.is_closed() { + CoreMpscError::Closed + } else { + CoreMpscError::Internal("batch consumer guard dropped unexpectedly".to_string()) + }; + let _ = tx.send((Err(err), inner)); } } } diff --git a/fluxon_rs/fluxon_util/src/dev_config.rs b/fluxon_rs/fluxon_util/src/dev_config.rs index c860910..d3f5953 100644 --- a/fluxon_rs/fluxon_util/src/dev_config.rs +++ b/fluxon_rs/fluxon_util/src/dev_config.rs @@ -1,4 +1,4 @@ -use anyhow::{anyhow, Context, Result}; +use anyhow::{Context, Result, anyhow}; use serde_yaml::Value; use std::fs; use std::path::{Path, PathBuf}; diff --git a/fluxon_rs/fluxon_util/src/lease_manager/lease_handle.rs b/fluxon_rs/fluxon_util/src/lease_manager/lease_handle.rs index 1c24cf2..af70d23 100755 --- a/fluxon_rs/fluxon_util/src/lease_manager/lease_handle.rs +++ b/fluxon_rs/fluxon_util/src/lease_manager/lease_handle.rs @@ -70,20 +70,16 @@ impl GeneralLease { impl Drop for GeneralLease { fn drop(&mut self) { - // Instrument drop of the high-level lease handle so we can correlate - // who released the last user-visible handle. 
let lease_id = self.id(); let kind_str = match self.kind() { LeaseType::Etcd => "Etcd", LeaseType::KvClient => "KvClient", }; let label = super::lifecycle::get_register_by(lease_id); - let bt = std::backtrace::Backtrace::force_capture(); - tracing::info!( + tracing::debug!( lease_id, kind = kind_str, label = %label.clone().unwrap_or_else(|| "".to_string()), - backtrace = %format!("{:?}", bt), "GeneralLease drop: releasing user-visible lease handle", ); // AutoCleanMapEntry drop happens after this method returns; the map diff --git a/fluxon_rs/fluxon_util/src/lib.rs b/fluxon_rs/fluxon_util/src/lib.rs index a85aed0..cde6dc7 100644 --- a/fluxon_rs/fluxon_util/src/lib.rs +++ b/fluxon_rs/fluxon_util/src/lib.rs @@ -37,10 +37,9 @@ pub mod limitrate; pub mod pyo3; // Re-export for stable public API: existing call sites can keep using `fluxon_util::init_log`. pub use log::{ - current_daily_sharded_log_path, current_log_file_path, daily_sharded_log_path, - display_runtime_log_path, init_log, init_log_test, init_log_with_extra_layer, - latest_existing_daily_sharded_log_path, resolve_readable_log_path, - DEFAULT_DAILY_LOG_RETENTION_DAYS, + DEFAULT_DAILY_LOG_RETENTION_DAYS, current_daily_sharded_log_path, current_log_file_path, + daily_sharded_log_path, display_runtime_log_path, init_log, init_log_test, + init_log_with_extra_layer, latest_existing_daily_sharded_log_path, resolve_readable_log_path, }; #[cfg(test)] mod test_util_test; @@ -251,7 +250,12 @@ mod tests { ); assert_logged_text( &active_log_path, - &["debug message", "info message", "warning message", "error message"], + &[ + "debug message", + "info message", + "warning message", + "error message", + ], ); } } diff --git a/fluxon_rs/fluxon_util/src/log.rs b/fluxon_rs/fluxon_util/src/log.rs index fc6066f..4db4ae4 100644 --- a/fluxon_rs/fluxon_util/src/log.rs +++ b/fluxon_rs/fluxon_util/src/log.rs @@ -61,9 +61,7 @@ fn read_test_log_shard_window_config() -> anyhow::Result 0"); @@ -85,7 +83,9 @@ fn 
read_test_log_shard_window_config() -> anyhow::Result) -> anyhow::Result { +fn resolve_shard_date_from_datetime( + now: chrono::DateTime, +) -> anyhow::Result { let Some(config) = read_test_log_shard_window_config()? else { return Ok(now.date_naive()); }; @@ -99,8 +99,8 @@ fn resolve_shard_date_from_datetime(now: chrono::DateTime) -> anyho ); } let bucket_index = delta_seconds / config.window_seconds; - let base_date = chrono::NaiveDate::from_ymd_opt(2026, 1, 1) - .expect("valid hard-coded synthetic base date"); + let base_date = + chrono::NaiveDate::from_ymd_opt(2026, 1, 1).expect("valid hard-coded synthetic base date"); Ok(base_date + chrono::Days::new(bucket_index as u64)) } @@ -108,10 +108,7 @@ fn current_shard_date() -> anyhow::Result { resolve_shard_date_from_datetime(chrono::Utc::now()) } -fn cleanup_old_daily_sharded_logs( - base_path: &Path, - retention_days: usize, -) -> anyhow::Result<()> { +fn cleanup_old_daily_sharded_logs(base_path: &Path, retention_days: usize) -> anyhow::Result<()> { let parent = match base_path.parent() { Some(parent) => parent, None => return Ok(()), @@ -124,7 +121,8 @@ fn cleanup_old_daily_sharded_logs( return Ok(()); }; fs::create_dir_all(parent)?; - let keep_since = current_shard_date()? - chrono::Days::new(retention_days.saturating_sub(1) as u64); + let keep_since = + current_shard_date()? - chrono::Days::new(retention_days.saturating_sub(1) as u64); let prefix = format!("{stem}."); for entry in std::fs::read_dir(parent)? 
{ let entry = entry?; @@ -180,10 +178,7 @@ impl DailyShardedFileWriter { current_daily_sharded_log_path(&self.base_path) } - fn rotate_if_needed( - &self, - state: &mut DailyShardedFileWriterState, - ) -> io::Result<()> { + fn rotate_if_needed(&self, state: &mut DailyShardedFileWriterState) -> io::Result<()> { let next_path = self .current_path() .map_err(|err| io::Error::new(io::ErrorKind::Other, err.to_string()))?; @@ -314,20 +309,19 @@ pub fn daily_sharded_log_path( base_path: &Path, date: chrono::NaiveDate, ) -> anyhow::Result { - let file_name = base_path.file_name().and_then(|v| v.to_str()).ok_or_else(|| { - anyhow::anyhow!( - "log path must end with a valid utf-8 filename: {}", - base_path.display() - ) - })?; + let file_name = base_path + .file_name() + .and_then(|v| v.to_str()) + .ok_or_else(|| { + anyhow::anyhow!( + "log path must end with a valid utf-8 filename: {}", + base_path.display() + ) + })?; let stem = file_name .strip_suffix(".log") .ok_or_else(|| anyhow::anyhow!("log path must end with .log: {}", base_path.display()))?; - Ok(base_path.with_file_name(format!( - "{}.{}.log", - stem, - date.format("%Y-%m-%d") - ))) + Ok(base_path.with_file_name(format!("{}.{}.log", stem, date.format("%Y-%m-%d")))) } pub fn current_daily_sharded_log_path(base_path: &Path) -> anyhow::Result { diff --git a/fluxon_rs/fluxon_util/tests/log_mgmt.rs b/fluxon_rs/fluxon_util/tests/log_mgmt.rs index 431c5fc..f459337 100644 --- a/fluxon_rs/fluxon_util/tests/log_mgmt.rs +++ b/fluxon_rs/fluxon_util/tests/log_mgmt.rs @@ -59,7 +59,10 @@ fn kv_log_shards_roll_and_cleanup_with_test_window() { .expect("unix epoch") .as_secs() as i64; let _window_guard = EnvVarGuard::set(TEST_LOG_SHARD_WINDOW_SECONDS_ENV, "10"); - let _anchor_guard = EnvVarGuard::set(TEST_LOG_SHARD_ANCHOR_UNIX_SECONDS_ENV, (now - 2).to_string()); + let _anchor_guard = EnvVarGuard::set( + TEST_LOG_SHARD_ANCHOR_UNIX_SECONDS_ENV, + (now - 2).to_string(), + ); fluxon_util::init_log(log_path, instance_key); 
tracing::info!(target: "fluxon_util", "[kv-log-mgmt][phase=before] ts={}", now); @@ -115,7 +118,8 @@ fn resolve_readable_log_path_ignores_plain_base_log_when_daily_shards_exist() { let shard_path = temp_dir.path().join("startup.2026-06-21.log"); fs::write(&shard_path, "shard\n").expect("write shard log"); - let resolved = fluxon_util::resolve_readable_log_path(&base_path).expect("resolve readable log path"); + let resolved = + fluxon_util::resolve_readable_log_path(&base_path).expect("resolve readable log path"); assert_eq!(resolved, shard_path); } @@ -128,7 +132,7 @@ fn latest_existing_daily_sharded_log_path_skips_invalid_candidates() { fs::write(&invalid_shard_path, "invalid\n").expect("write invalid shard"); fs::write(&valid_shard_path, "valid\n").expect("write valid shard"); - let resolved = - fluxon_util::latest_existing_daily_sharded_log_path(&base_path).expect("resolve latest shard"); + let resolved = fluxon_util::latest_existing_daily_sharded_log_path(&base_path) + .expect("resolve latest shard"); assert_eq!(resolved, valid_shard_path); } diff --git a/fluxon_test_stack/test_runner.py b/fluxon_test_stack/test_runner.py index 1a5ca7f..de759cb 100644 --- a/fluxon_test_stack/test_runner.py +++ b/fluxon_test_stack/test_runner.py @@ -147,9 +147,9 @@ RUNTIME_LAYER_CASE, ) CI_BASE_RUNTIME_SERVICE_IDS = ("etcd", "greptime") -CI_CLUSTER_MEMBER_INSTANCE_IDS = ("master", "owner_0") -CI_CLUSTER_RUNTIME_INSTANCE_IDS = ("master", "owner_0") -CI_CASE_RUNTIME_INSTANCE_IDS = ("master", "owner_0", "ci_runner") +CI_CLUSTER_MEMBER_INSTANCE_IDS = ("master", "owner_0", "broker") +CI_CLUSTER_RUNTIME_INSTANCE_IDS = ("master", "owner_0", "broker") +CI_CASE_RUNTIME_INSTANCE_IDS = ("master", "owner_0", "broker", "ci_runner") CI_CLUSTER_RUNTIME_REMOTE_STAGE_INCLUDE_RELPATHS = ( "configs", "src/fluxon_py/runtime", @@ -158,6 +158,7 @@ ) CI_CLUSTER_RUNTIME_REMOTE_STAGE_VERIFY_RELPATHS = ( "src/fluxon_py/runtime/start_master.py", + "src/fluxon_py/runtime/start_broker.py", 
"src/fluxon_py/runtime/start_owner_kvclient.py", ) CI_RUNNER_REMOTE_STAGE_INCLUDE_RELPATHS = ( @@ -3030,6 +3031,8 @@ def _ci_cluster_runtime_stage(resolved_case: Dict[str, Any]) -> _RemoteRunDirSta verify_relpaths.append("configs/ci_owner_0.yaml") if _ci_has_instance(resolved_case, instance_id="master"): verify_relpaths.append("configs/ci_master.yaml") + if _ci_has_instance(resolved_case, instance_id="broker"): + verify_relpaths.append("configs/ci_broker.yaml") return _RemoteRunDirStage( archive_prefix="fluxon_ci_cluster_runtime_run_dir__", stage_prefix="fluxon_ci_cluster_runtime_stage_", @@ -3046,6 +3049,8 @@ def _ci_runner_runtime_stage(resolved_case: Dict[str, Any]) -> _RemoteRunDirStag verify_relpaths.append("configs/ci_owner_0.yaml") if _ci_has_instance(resolved_case, instance_id="master"): verify_relpaths.append("configs/ci_master.yaml") + if _ci_has_instance(resolved_case, instance_id="broker"): + verify_relpaths.append("configs/ci_broker.yaml") include_relpaths = list(CI_RUNNER_REMOTE_STAGE_INCLUDE_RELPATHS) if _ci_runtime_contract_id(resolved_case) == CI_RUNTIME_CONTRACT_CLUSTER_KV_OWNER: for relpath in ("fluxon_release", "test_rsc"): @@ -3140,15 +3145,34 @@ def _compile_case_plan(resolved_case: Dict[str, Any]) -> _CasePlan: prepare_instance_ids = _ci_cluster_runtime_instance_ids(resolved_case) prepare_phases: Tuple[_RuntimePhase, ...] 
= () if prepare_instance_ids: - prepare_phases = ( - _RuntimePhase( - phase_id="cluster_runtime", - layer=RUNTIME_LAYER_CASE, - instance_ids=prepare_instance_ids, - write_ctx="CI", - stage_run_dir=_ci_cluster_runtime_stage(resolved_case), - ), + prepare_phase_list: List[_RuntimePhase] = [] + broker_prepare_ids = tuple( + instance_id for instance_id in prepare_instance_ids if instance_id == "broker" + ) + cluster_prepare_ids = tuple( + instance_id for instance_id in prepare_instance_ids if instance_id != "broker" ) + if cluster_prepare_ids: + prepare_phase_list.append( + _RuntimePhase( + phase_id="cluster_runtime", + layer=RUNTIME_LAYER_CASE, + instance_ids=cluster_prepare_ids, + write_ctx="CI", + stage_run_dir=_ci_cluster_runtime_stage(resolved_case), + ) + ) + if broker_prepare_ids: + prepare_phase_list.append( + _RuntimePhase( + phase_id="broker_runtime", + layer=RUNTIME_LAYER_CASE, + instance_ids=broker_prepare_ids, + write_ctx="CI", + stage_run_dir=_ci_cluster_runtime_stage(resolved_case), + ) + ) + prepare_phases = tuple(prepare_phase_list) return _CasePlan( case_family=case_family, prepare_phases=prepare_phases, @@ -3693,6 +3717,9 @@ def _wait_ci_instance_ready(resolved_case: Dict[str, Any], *, instance_id: str) timeout_s=180, ) return + if instance_id == "broker": + _wait_instance_running(resolved_case, instance_id=instance_id, timeout_s=60) + return if instance_id == "ci_runner": _wait_instance_running(resolved_case, instance_id=instance_id, timeout_s=30) return @@ -5961,7 +5988,7 @@ def _validate_profile_ci_runtime_block(runtime: Dict[str, Any], ctx: str, target f"{tpl_ctx}.deployer", ) target = _require_str(deployer.get("target"), f"{tpl_ctx}.deployer.target") - if instance_id in ("owner_0", "ci_runner"): + if instance_id in ("owner_0", "broker", "ci_runner"): if target != "__TARGET__": raise ValueError(f"{tpl_ctx}.deployer.target must be '__TARGET__'") elif target not in target_ip_map: @@ -8921,7 +8948,7 @@ def _ci_materialized_target_for_instance(*, 
topology: Any, targets: Dict[str, An primary = _require_str(targets.get("primary"), f"{ctx}.targets.primary") if instance_id == "master": return primary - if instance_id in ("owner_0", "ci_runner"): + if instance_id in ("owner_0", "broker", "ci_runner"): if machine_count == 1: return primary if machine_count == 2: @@ -8929,6 +8956,27 @@ def _ci_materialized_target_for_instance(*, topology: Any, targets: Dict[str, An raise ValueError(f"{ctx} unsupported CI instance id for placement: {instance_id}") +def _default_ci_broker_runtime_template() -> Dict[str, Any]: + return { + "lifecycle": "service", + "k8s_ref": "deployment/broker", + "deployer": { + "target": "__TARGET__", + "command": ["/bin/bash", "-lc"], + "args": [ + """ +set -euo pipefail +cd __RUN_DIR__/src +mkdir -p __RUN_DIR__/services/broker +exec __RUN_DIR__/venv/bin/python3 -m fluxon_py.runtime.start_broker \\ + -c __RUN_DIR__/configs/ci_broker.yaml \\ + -w __RUN_DIR__/services/broker +""".strip() + ], + }, + } + + def _compile_ci_case(resolved_case: Dict[str, Any]) -> None: scale = _require_dict(resolved_case.get("scale"), "resolved_case.scale") topology = scale.get("topology") @@ -8939,13 +8987,24 @@ def _compile_ci_case(resolved_case: Dict[str, Any]) -> None: profile_ci = _require_dict(profile.get("ci"), "resolved_case.profile.ci") runtime_templates = _require_dict(profile_ci.get("runtime"), "resolved_case.profile.ci.runtime") deploy = _require_dict(resolved_case.get("deploy"), "resolved_case.deploy") - case_runtime_templates = _require_dict( - runtime_templates.get(RUNTIME_LAYER_CASE), - f"resolved_case.profile.ci.runtime.{RUNTIME_LAYER_CASE}", + case_runtime_templates = copy.deepcopy( + _require_dict( + runtime_templates.get(RUNTIME_LAYER_CASE), + f"resolved_case.profile.ci.runtime.{RUNTIME_LAYER_CASE}", + ) ) + if ( + "master" in case_runtime_templates + and "owner_0" in case_runtime_templates + and "ci_runner" in case_runtime_templates + and "broker" not in case_runtime_templates + ): + 
case_runtime_templates["broker"] = _default_ci_broker_runtime_template() ordered_instance_ids = [ - instance_id for instance_id in CI_CASE_RUNTIME_INSTANCE_IDS if instance_id in case_runtime_templates + instance_id + for instance_id in CI_CASE_RUNTIME_INSTANCE_IDS + if instance_id in case_runtime_templates ] if not ordered_instance_ids: raise ValueError("resolved_case.profile.ci.runtime.case_runtime must be non-empty") @@ -13969,6 +14028,7 @@ def _write_ci_master_owner_configs( owner_dram_bytes: int, ) -> tuple[Path, Path]: owner_work_root = run_dir / "services" / "owner_0" + broker_work_root = run_dir / "services" / "broker" master_cfg = { "etcd_endpoints": ["__ETCD__"], "cluster_name": cluster_name, @@ -14004,6 +14064,15 @@ def _write_ci_master_owner_configs( }, } + broker_cfg = { + "instance_key": "ci_broker", + "contribute_to_cluster_pool_size": {"dram": 0, "vram": {}}, + "fluxonkv_spec": { + "cluster_name": cluster_name, + "share_mem_path": share_mem_path, + }, + } + etcd_ip = _ci_base_runtime_service_target_ip(resolved_case, service_id="etcd") etcd_port = _ci_base_runtime_service_port(resolved_case, service_id="etcd") greptime_ip = _ci_base_runtime_service_target_ip(resolved_case, service_id="greptime") @@ -14027,8 +14096,12 @@ def _write_ci_master_owner_configs( cfg_dir.mkdir(parents=True, exist_ok=True) master_path = cfg_dir / "ci_master.yaml" owner_path = cfg_dir / "ci_owner_0.yaml" + broker_path = cfg_dir / "ci_broker.yaml" _write_yaml_file(master_path, master_cfg) _write_yaml_file(owner_path, owner_cfg) + if _ci_has_instance(resolved_case, instance_id="broker"): + broker_work_root.mkdir(parents=True, exist_ok=True) + _write_yaml_file(broker_path, broker_cfg) return master_path, owner_path @@ -16255,6 +16328,8 @@ def _ui_ops_logs_base_url(controller_url: str) -> str: def _ui_test_stack_member_role_for_instance_id(instance_id: str) -> str: if instance_id == "master": return "master" + if instance_id == "broker": + return "broker" return "owner_client" diff 
--git a/fluxon_test_stack/test_runner_runtime_backend.py b/fluxon_test_stack/test_runner_runtime_backend.py index 30d1191..a2f01d2 100644 --- a/fluxon_test_stack/test_runner_runtime_backend.py +++ b/fluxon_test_stack/test_runner_runtime_backend.py @@ -378,13 +378,6 @@ def _execute_ci_case( ), ) outcome = ctx.RUN_OUTCOME_SUCCESS if rc == 0 else ctx.RUN_OUTCOME_FAILED - if outcome == ctx.RUN_OUTCOME_SUCCESS and runtime_tracking.ci_apply_ids.get("ci_runner") is not None: - ctx._delete_apply_id( - resolved_case, - apply_id=ctx._require_str(runtime_tracking.ci_apply_ids.get("ci_runner"), "CI ci_runner apply_id"), - ctx="CI ci_runner apply", - ) - del runtime_tracking.ci_apply_ids["ci_runner"] summary = ctx._build_ci_summary_yaml( resolved_case, run_index=run_index, diff --git a/fluxon_test_stack/tests/test_test_runner_testbed_contract.py b/fluxon_test_stack/tests/test_test_runner_testbed_contract.py index 0066cc2..f14cea4 100644 --- a/fluxon_test_stack/tests/test_test_runner_testbed_contract.py +++ b/fluxon_test_stack/tests/test_test_runner_testbed_contract.py @@ -44,10 +44,16 @@ def test_write_ci_master_owner_configs_emits_owner_large_file_paths(self) -> Non with tempfile.TemporaryDirectory() as td: run_dir = Path(td) resolved_case = { + "runtime_model": { + _RUNNER.RUNTIME_LAYER_TEST_BED: {"kind": "ops"}, + _RUNNER.RUNTIME_LAYER_BASE: {}, + _RUNNER.RUNTIME_LAYER_CASE: {"instance_ids": ["master", "owner_0", "broker", "ci_runner"]}, + }, "deploy": { "instances": [ {"id": "master", "deployer": {"target": "local-node-a"}}, {"id": "owner_0", "deployer": {"target": "local-node-a"}}, + {"id": "broker", "deployer": {"target": "local-node-a"}}, ], "target_ip_map": {"local-node-a": "127.0.0.1"}, } @@ -221,6 +227,76 @@ def test_ci_runtime_tracked_apply_entries_groups_shared_apply_id(self) -> None: ], ) + def test_execute_ci_case_leaves_successful_runner_apply_for_finalize(self) -> None: + with tempfile.TemporaryDirectory() as td: + run_dir = Path(td) + tracking = 
_RUNNER._CaseRuntimeTracking() + resolved_case = { + "case": { + "case_id": "ci_top_attention_mq_core__n1_kvowner_dram_20gib__fluxon_tcp_thread", + "case_key": "ci_top_attention_mq_core__n1_kvowner_dram_20gib__fluxon_tcp_thread", + } + } + plan = _RUNNER._CasePlan( + case_family=_RUNNER.CASE_FAMILY_CI, + prepare_phases=(), + execute_phases=( + _RUNNER._RuntimePhase( + phase_id="ci_runner", + layer=_RUNNER.RUNTIME_LAYER_CASE, + instance_ids=("ci_runner",), + write_ctx="CI", + ), + ), + ) + prepared = _RUNNER._PreparedCase(plan=plan) + summary = { + "schema_version": _RUNNER.SCHEMA_VERSION, + "case_id": resolved_case["case"]["case_id"], + "case_key": resolved_case["case"]["case_key"], + "run_index": 1, + "outcome": _RUNNER.RUN_OUTCOME_SUCCESS, + "counted": False, + "timing": {"started_at_unix_s": 100, "finished_at_unix_s": 200}, + "ci": {"rc": 0}, + } + + with mock.patch.object(_RUNNER, "_ci_runner_exit_code_timeout_seconds", return_value=60): + with mock.patch.object(_RUNNER, "_deploy_runtime_phase", return_value={"history_id": "apply-runner"}): + with mock.patch.object(_RUNNER, "_wait_ci_instance_ready") as wait_ready: + with mock.patch.object(_RUNNER, "_wait_ci_runner_exit_code", return_value=0): + with mock.patch.object(_RUNNER, "_build_ci_summary_yaml", return_value=summary): + with mock.patch.object(_RUNNER, "_delete_apply_id") as delete_apply: + executed = _RUNNER._execute_ci_case( + _RUNNER._PlannedCase( + case=_RUNNER._ResolvedCase( + scene_id="ci_top_attention_mq_core", + scale_id="n1_kvowner_dram_20gib", + profile_id="fluxon_tcp_thread", + case_id=resolved_case["case"]["case_id"], + case_key=resolved_case["case"]["case_key"], + ), + ci_commands=[], + ci_prepare_steps=[], + label="ci case", + command_id=None, + test_id=None, + counted=False, + ), + resolved_case=resolved_case, + run_dir=run_dir, + run_index=1, + started_at=100, + prepared_case=prepared, + runtime_tracking=tracking, + ) + + self.assertEqual(executed.outcome, _RUNNER.RUN_OUTCOME_SUCCESS) + 
 self.assertEqual(tracking.ci_apply_ids, {"ci_runner": "apply-runner"}) + self.assertEqual(tracking.ci_attempted_instance_ids, ["ci_runner"]) + delete_apply.assert_not_called() + wait_ready.assert_called_once_with(resolved_case, instance_id="ci_runner") + def test_finalize_ci_case_runtime_deletes_each_apply_id_once(self) -> None: with tempfile.TemporaryDirectory() as td: run_dir = Path(td) diff --git a/pics/blog2_mq_broker_state.png b/pics/blog2_mq_broker_state.png new file mode 100644 index 0000000..0928819 Binary files /dev/null and b/pics/blog2_mq_broker_state.png differ diff --git a/pics/blog2_mq_payload_flow.png b/pics/blog2_mq_payload_flow.png new file mode 100644 index 0000000..5d8d37b Binary files /dev/null and b/pics/blog2_mq_payload_flow.png differ diff --git "a/pics/fluxon\346\236\266\346\236\204\345\233\27620260423.png" b/pics/fluxon_architecture.png similarity index 100% rename from "pics/fluxon\346\236\266\346\236\204\345\233\27620260423.png" rename to pics/fluxon_architecture.png diff --git "a/pics/\346\236\266\346\236\204\345\205\250\346\231\257\345\233\276.png" b/pics/fluxon_architecture_overview.png similarity index 100% rename from "pics/\346\236\266\346\236\204\345\205\250\346\231\257\345\233\276.png" rename to pics/fluxon_architecture_overview.png diff --git a/pics/mq_bench.svg b/pics/mq_bench.svg new file mode 100644 index 0000000..f938bf5 --- /dev/null +++ b/pics/mq_bench.svg @@ -0,0 +1,186 @@
+[Figure: "MQ Benchmark Comparison", two panels. Panel "Concurrency Sweep": Throughput (MB/s), axis 0 to 58,000, against Producer / Consumer Concurrency (16p/2c, 16p/4c, 16p/8c, 16p/12c, 24p/2c, 24p/4c, 24p/8c, 32p/4c). Panel "Payload Sweep": Throughput (MB/s), axis 0 to 90,000, against Payload Size (MB) (4.8, 8, 12, 16, 20, 24, 32, 40, 48, 56, 64). Legend entries: etcd 4.8MB b1/pf0, broker 4.8MB b1/pf0, etcd 32MB b48/pf48, broker 32MB b48/pf48, etcd 24p/4c, broker 24p/4c, etcd 16p/4c, broker 16p/4c.] \ No newline at end of file