## Motivation
PR #253 introduces Prefill/Decode disaggregation — a critical production feature that splits inference across separate GPU instances via MORI-IO RDMA. Current test coverage is minimal:
- 1 test file (`test_kv_aggregator.py`, 96 lines) covering only `KVOutputAggregator`
- 0 tests for the core transfer engine (1,624 lines), proxy (372 lines), scheduler integration (344 lines), and async worker plumbing (212 lines)
This issue tracks the plan to add layered test coverage following the same strategy as the plugin-mode CI (#255).
## Approach: Layered Testing by Module
### L1: CPU Unit Tests (P0 — gate for merge)
Pure logic tests with mocked GPU/RDMA/ZMQ dependencies. Run on `ubuntu-latest` in < 5 seconds.
```
tests/disaggregation/
├── test_kv_aggregator.py            # Enhance existing
├── test_connector_metadata.py       # New
├── test_kv_connector_scheduler.py   # New
├── test_proxy.py                    # New
├── test_transfer_utils.py           # New
└── test_scheduler_kv_integration.py # New
```
**Mock strategy:** Mock `aiter`, `mori.io`, `torch.distributed`, and `zmq` at the `sys.modules` level.
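A minimal sketch of this mock strategy, e.g. in a shared `conftest.py` (the file placement is an assumption; the module list comes from the strategy above):

```python
# conftest.py (sketch): stub out GPU/RDMA/ZMQ dependencies before any test
# imports the code under test. MagicMock tolerates arbitrary attribute access,
# which is enough for pure-logic tests; swap in real fakes where behavior matters.
import sys
from unittest.mock import MagicMock

for name in ("aiter", "mori", "mori.io", "torch.distributed", "zmq"):
    # setdefault avoids clobbering a real module that is already imported.
    sys.modules.setdefault(name, MagicMock())
```

With both `mori` and `mori.io` registered, `import mori.io` inside the code under test resolves to the stub without any RDMA stack installed.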
#### test_kv_aggregator.py (enhance)
- `reset()` clears pending state
- `world_size <= 0` raises `ValueError`

#### test_connector_metadata.py
- `add_new_req_to_recv` builds correct `ReqMeta` from `kv_transfer_params`
- `add_new_req_to_save` builds correct `ReqMeta`
- `request_id_to_transfer_id` mapping passthrough

#### test_kv_connector_scheduler.py
- `get_num_new_matched_tokens` returns `(prompt_len, True)` for `do_remote_prefill`
- `get_num_new_matched_tokens` idempotent once `kv_async_tagged` → `(0, False)`
- `update_state_after_alloc` consumer: queues req, sets transfer_id mapping
- `update_state_after_alloc` producer: does NOT queue
- `do_remote_prefill` flag cleared after processing
- `build_connector_meta` drains pending queue into metadata
- `build_connector_meta` on empty queue → no crash
- `request_finished` producer output contains `block_table`, `engine_id`, `host`, `port`
- `request_finished` consumer cleans up transfer_id mapping

#### test_proxy.py
- `_append_whole_dict_unique` deduplicates on the `index` field
- `_extract_ip_port` on valid URL
- `_extract_ip_port` on invalid URL raises `ValueError`
- `max_tokens=1` and `stream=False` handling

#### test_transfer_utils.py
- `convert_virtual_to_physical_pages` default 16→1 expansion
- `merge_contiguous_blocks` — all contiguous → 1 merged
- `_compute_block_transfer_offsets` MHA (5D) vs MLA (3D)
- `make_zmq_path` IPv4, IPv6, no-port
- `RoleManager` singleton + thread safety
- `set_role`/`get_role` round-trip
- `get_port_offset` formula: `dp_rank * tp_size + tp_rank`

#### test_scheduler_kv_integration.py
- `None` `kv_connector_output` → no crash
- `connector_meta_output` attached to `ScheduledBatch`
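As an illustration of how thin these L1 unit tests can be, here is a sketch of the port-offset check. The `dp_rank * tp_size + tp_rank` formula is from the plan; the function signature is an assumption, and the real helper lives in the transfer utils under test:

```python
# Sketch of a get_port_offset unit test. The local definition below is a
# stand-in for the real helper; the formula itself is the contract under test.
def get_port_offset(dp_rank: int, tp_rank: int, tp_size: int) -> int:
    # Each DP rank owns a contiguous band of tp_size port offsets.
    return dp_rank * tp_size + tp_rank

def test_get_port_offset_formula():
    assert get_port_offset(0, 0, 8) == 0
    # dp_rank 1, tp_rank 3 with tp_size 8 lands in the second band: 1*8 + 3.
    assert get_port_offset(1, 3, 8) == 11
    # Offsets must be unique across the whole (dp_rank, tp_rank) grid,
    # otherwise two workers would bind the same port.
    offsets = {get_port_offset(d, t, 4) for d in range(2) for t in range(4)}
    assert len(offsets) == 8
```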
### L2: CPU Integration Tests (P0)
| Test | Description |
|------|-------------|
| ZMQ handshake roundtrip | Listener + client threads in-process, verify metadata exchange |
| Service discovery registration | Simulate proxy ZMQ ROUTER, verify msgpack format and dedup |
| `AsyncIOProcManager` KV aggregation | Mock multiple worker KV outputs, verify `call_func_with_aggregation` |
| `_pop_done_transfers` all-status check | Bug: current code only checks `status_list[-1]`. Test with `[FAIL, SUCCESS]` → should NOT mark done |
| OpenAI server kv_params roundtrip | Request with `kv_transfer_params` → response contains output |
| Proxy prefill→decode read-mode flow | Simulate: prefill response → extract block metadata → decode request |
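The `_pop_done_transfers` row can be written as a table-driven regression test. Everything below is a stand-in (the status enum and the function body are illustrative, not the PR's real types); it encodes the intended all-status behavior:

```python
# Regression sketch: a transfer whose statuses are [FAIL, SUCCESS] must NOT be
# popped as done, even though the last status is SUCCESS (the current bug).
from enum import Enum, auto

class Status(Enum):
    SUCCESS = auto()
    FAIL = auto()

def pop_done_transfers(pending: dict) -> list:
    """Correct behavior: a request is done only if ALL of its statuses
    succeeded, not just status_list[-1] as in the buggy version."""
    done = [rid for rid, statuses in pending.items()
            if statuses and all(s is Status.SUCCESS for s in statuses)]
    for rid in done:
        del pending[rid]
    return done

pending = {
    "req-ok":   [Status.SUCCESS, Status.SUCCESS],
    "req-part": [Status.FAIL, Status.SUCCESS],  # last OK, earlier layer failed
}
done = pop_done_transfers(pending)
assert done == ["req-ok"]
assert "req-part" in pending  # a last-only check would wrongly pop this
```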
### L3: GPU Tests (P1 — design only)
| Test | Env | Description |
|------|-----|-------------|
| `register_kv_caches` RDMA metadata | 1 GPU | Real KV tensors → verify RDMA metadata non-null |
| MoRIIO wrapper tensor registration | 1 GPU | CUDA tensor → packed metadata valid |
| Single-node loopback transfer | 2+ GPU | Producer → consumer RDMA read, verify data match |
| E2E proxy+prefill+decode | 8 GPU | Full 3-process inference |
| Multi-request concurrent | 8 GPU | Concurrent P/D pipeline |
## Known Bugs to Cover
- `_pop_done_transfers` only checks `status_list[-1]` — should check ALL statuses
- `start_load_kv` busy-wait — `while need_handshake: continue` burns CPU
- Proxy ~60,000-hour timeout — `aiohttp.ClientTimeout(total=6*6000*6000)` should be configurable
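For the `start_load_kv` busy-wait, one conventional remedy is blocking on an event instead of spinning; a generic sketch under that assumption (`HandshakeGate` is a hypothetical stand-in, not the connector's actual class):

```python
# Sketch: replace `while need_handshake: continue` with a kernel-level wait.
# The waiter sleeps instead of burning a CPU core until the handshake lands.
import threading

class HandshakeGate:
    def __init__(self):
        self._done = threading.Event()

    def complete(self):
        # Called by the handshake listener thread once metadata is exchanged.
        self._done.set()

    def wait(self, timeout=None):
        # Returns True if the handshake completed within `timeout` seconds.
        return self._done.wait(timeout)

gate = HandshakeGate()
threading.Timer(0.01, gate.complete).start()  # simulate a late handshake
assert gate.wait(timeout=1.0)  # unblocks as soon as complete() fires
```

A timeout on the wait also gives the caller a natural place to surface a handshake failure instead of hanging forever.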
## Estimated Effort
| Layer | Files | Test Cases | Lines (est.) |
|-------|-------|------------|--------------|
| L1 | 6 | ~55 | ~800 |
| L2 | 1 | ~6 | ~300 |
| L3 | design only | ~5 | ~200 |
| Total | 7 | ~66 | ~1,300 |
## CI Integration
Add to the existing workflow or a new `atom-pd-test.yaml`:
```yaml
pd-unit-tests:
  name: PD Disaggregation Unit Tests (CPU)
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-python@v4
      with:
        python-version: "3.12"
    - run: pip install pytest msgpack msgspec numpy aiohttp quart
    - run: pip install torch --index-url https://download.pytorch.org/whl/cpu
    - run: pytest tests/disaggregation/ -v --tb=short
```
## Reference
- Design doc: `docs/plans/2026-03-04-pd-disaggregation-test-coverage-design.md`
- Related: PR #253, Issue #255