Add MetaInstance declarative layer by AlexCheema · Pull Request #1447 · exo-explore/exo

AlexCheema · 2026-02-11T00:15:52Z

Motivation

Users currently manage instances directly, which means if a node disconnects or connections break, the instance dies and nothing recreates it. MetaInstance is a declarative primitive: "ensure an instance matching these parameters always exists." The reconciler watches for unhealthy or missing backing instances and re-places them automatically.

Changes

MetaInstance type (meta_instance.py): declarative constraint with model_id, min_nodes, optional node_ids, and sharding
Reconciler (reconcile.py): find_unsatisfied_meta_instances checks which MetaInstances lack a healthy backing instance; try_place_for_meta_instance creates one
Master loop (main.py): periodically reconciles unsatisfied MetaInstances; immediate placement on CreateMetaInstance command
API (api.py): create_meta_instance / delete_meta_instance / GET /meta_instances endpoints; delete cascades to backing instances with task cancellation
Binding via meta_instance_id on Instance (instances.py): no separate binding event or backing map — the instance carries its parent MetaInstance ID directly, eliminating race conditions in the reconciler
Dashboard: sidebar shows MetaInstances with their backing instance status; orphan instances (created directly) still shown separately
Tests: constraint matching, connection health, unsatisfied detection, exclusive binding, cascade delete with task cancellation

Recent improvements

fix: cancel active tasks on cascade delete — DeleteMetaInstance now emits TaskStatusUpdated(Cancelled) for any Pending/Running tasks on backing instances before emitting InstanceDeleted. Previously, cascade-deleting backing instances left orphaned task references in state.
Lifecycle logging — added logger.info/logger.warning for: CreateMetaInstance (model, min_nodes, sharding), DeleteMetaInstance (with cascade count), reconciler placement success/failure, and retry decisions with attempt counts in InstanceHealthReconciler.
GET /meta_instances endpoint — lists all meta-instances without needing to fetch full state.
2 regression tests — test_cascade_delete_cancels_active_tasks and test_cascade_delete_skips_completed_tasks verify the cascade-delete event sequence.

Why It Works

Putting meta_instance_id on BaseInstance makes binding inherent to instance creation. When the reconciler creates an instance for a MetaInstance, it tags it via model_copy. When the instance is deleted, the binding disappears with it. This avoids the two bugs that a separate binding mechanism would introduce:

Stale exclusion sets — the reconciler loop can't accidentally bind two MetaInstances to the same instance
Delete ordering race — no window between deleting an instance and its binding where the reconciler could re-place
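
The binding-by-field idea can be sketched in a few lines. This is a minimal illustration using stdlib dataclasses rather than the PR's Pydantic models; the `healthy` flag, `place_for`, and `find_unsatisfied` are simplified stand-ins (assumptions), and `dataclasses.replace` plays the role that `model_copy` plays in the PR.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class MetaInstance:
    meta_instance_id: str
    model_id: str
    min_nodes: int = 1


@dataclass(frozen=True)
class Instance:
    instance_id: str
    model_id: str
    healthy: bool = True  # assumption: stands in for the real connection health check
    meta_instance_id: Optional[str] = None  # None marks an orphan instance


def find_unsatisfied(meta_instances, instances):
    """Return MetaInstances with no healthy instance carrying their ID."""
    satisfied = {
        inst.meta_instance_id
        for inst in instances.values()
        if inst.meta_instance_id is not None and inst.healthy
    }
    return [m for mid, m in meta_instances.items() if mid not in satisfied]


def place_for(meta, instance_id):
    """Create an instance and tag it with its parent ID in the same step."""
    untagged = Instance(instance_id=instance_id, model_id=meta.model_id)
    return replace(untagged, meta_instance_id=meta.meta_instance_id)


metas = {
    "meta-a": MetaInstance("meta-a", "model-x"),
    "meta-b": MetaInstance("meta-b", "model-x"),
}
instances = {}

first = place_for(metas["meta-a"], "inst-1")
instances[first.instance_id] = first
unsatisfied_after_first = [m.meta_instance_id for m in find_unsatisfied(metas, instances)]

# Deleting the instance removes the binding with it; there is no map to clean up.
del instances["inst-1"]
unsatisfied_after_delete = [m.meta_instance_id for m in find_unsatisfied(metas, instances)]
```

Because the only record of the binding lives on the instance, two MetaInstances with identical constraints cannot both claim `inst-1`, and deleting it leaves nothing stale behind.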

Test Plan

Manual Testing

Created MetaInstance via dashboard, verified instance placed
Verified delete cascades (deleting MetaInstance removes backing instance)
Verified orphan instances still work independently

Automated Testing

30 tests in test_meta_instance_edge_cases.py: lifecycle, retry logic, error handling, concurrent operations, cascade delete with task cancellation
24 tests in test_reconcile.py: constraint matching, connection health (single/multi-node, edge removal, IP changes), unsatisfied detection, exclusive binding, idempotency
All 261 tests pass
basedpyright 0 errors, ruff clean, dashboard builds

AlexCheema · 2026-02-11T00:21:07Z

Testing scenarios needed before merging

Disconnect Ethernet with a Ring instance running — verify MetaInstance reconciler detects unhealthy connections and re-places
Disconnect Thunderbolt 5 with an RDMA instance running — verify same recovery behavior
Kill a node that's part of an Instance — verify node timeout triggers instance deletion and MetaInstance re-places on remaining nodes
Delete a MetaInstance from the dashboard — verify backing instance is cascade-deleted
Create multiple MetaInstances for the same model — verify each gets its own backing instance (exclusive binding)
Create an orphan instance directly via API — verify it works independently and isn't affected by MetaInstance lifecycle

AlexCheema · 2026-02-11T00:23:46Z

Future work: placement preferences

MetaInstance currently places with no optimization preference. A natural next step is letting users specify a placement preference, e.g.:

Highest interactivity — maximize tokens/sec per request (fewer nodes, lower latency)
Highest throughput — maximize total tokens/sec across concurrent requests (more sharding, more parallelism)

These are different points on a throughput vs. interactivity Pareto curve. The placer would use the preference to score candidate placements differently rather than just picking the first valid one.
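
As a rough illustration of preference-based scoring, the sketch below ranks candidate placements by node count only. Everything here is hypothetical: `PlacementPreference`, `Candidate`, `score`, and `choose` do not exist in the PR, and node count is a crude proxy for the real latency and throughput trade-off.

```python
from dataclasses import dataclass
from enum import Enum


class PlacementPreference(Enum):
    INTERACTIVITY = "interactivity"
    THROUGHPUT = "throughput"


@dataclass(frozen=True)
class Candidate:
    node_ids: tuple[str, ...]


def score(candidate: Candidate, preference: PlacementPreference) -> int:
    """Higher is better. Node count is only a proxy (assumption)."""
    n = len(candidate.node_ids)
    if preference is PlacementPreference.INTERACTIVITY:
        return -n  # fewer nodes, fewer hops per token
    return n  # more nodes, more parallelism


def choose(candidates: list[Candidate], preference: PlacementPreference) -> Candidate:
    """Pick the best-scoring candidate instead of the first valid one."""
    if not candidates:
        raise ValueError("no valid placement candidates")
    return max(candidates, key=lambda c: score(c, preference))


candidates = [
    Candidate(("node-a", "node-b")),
    Candidate(("node-a",)),
    Candidate(("node-a", "node-b", "node-c", "node-d")),
]
interactive = choose(candidates, PlacementPreference.INTERACTIVITY)
throughput = choose(candidates, PlacementPreference.THROUGHPUT)
```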

AlexCheema · 2026-02-12T00:40:12Z

JACCL RDMA Error Warning Banner

Added a dashboard warning that detects [jaccl] errors in MetaInstance failure messages. These errors indicate a problem with the experimental RDMA driver in macOS — the only fix is restarting the affected machine.

What it does:

Scans metaInstances for lastFailureError containing [jaccl]
Shows a red dismissible alert banner at the top-left of the topology view
Hover tooltip explains the issue and tells the user to restart
Re-appears if a new jaccl error arrives after dismissal
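
The detection and dismissal behavior listed above can be sketched as follows. The dashboard itself is written in Svelte; this is a Python sketch of the logic only, for consistency with the rest of the thread. `JacclBanner` and `jaccl_errors` are hypothetical names, and treating an unseen (meta-instance, message) pair as a "new" error is an assumption.

```python
from typing import Optional

JACCL_MARKER = "[jaccl]"


def jaccl_errors(last_failure_errors: dict[str, Optional[str]]) -> set[tuple[str, str]]:
    """Collect (meta_instance_id, error) pairs whose error mentions [jaccl]."""
    return {
        (meta_id, err)
        for meta_id, err in last_failure_errors.items()
        if err and JACCL_MARKER in err
    }


class JacclBanner:
    """Tracks which jaccl errors the user has already dismissed."""

    def __init__(self) -> None:
        self._dismissed: set[tuple[str, str]] = set()

    def visible(self, last_failure_errors: dict[str, Optional[str]]) -> bool:
        # Show the banner only if some jaccl error has not been dismissed yet.
        return bool(jaccl_errors(last_failure_errors) - self._dismissed)

    def dismiss(self, last_failure_errors: dict[str, Optional[str]]) -> None:
        self._dismissed |= jaccl_errors(last_failure_errors)


banner = JacclBanner()
state = {"meta-1": "[jaccl] rdma send failed", "meta-2": None, "meta-3": "OOM"}
shown_initially = banner.visible(state)
banner.dismiss(state)
shown_after_dismiss = banner.visible(state)

# A new jaccl error on another meta-instance brings the banner back.
state["meta-2"] = "[jaccl] timeout"
shown_after_new_error = banner.visible(state)
```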

Banner: (screenshot not preserved)

Tooltip on hover: (screenshot not preserved)

AlexCheema · 2026-02-13T20:26:05Z

Deep Review of PR 1447: MetaInstance Layer

Reviewed: all 32 commits, ~2459 additions / ~228 deletions across 25+ files
Edge-case tests: 25 new tests pushed in src/exo/master/tests/test_meta_instance_edge_cases.py (all passing)
Full test suite: 256 passed, 1 skipped, 97 deselected (pre-existing slow tests)
Pre-commit checks: basedpyright 0 errors, ruff clean, nix fmt clean

Summary

The PR adds a MetaInstance declarative constraint layer that ensures a model instance matching given parameters always exists. When the backing instance fails, the system automatically retries placement up to MAX_INSTANCE_RETRIES (3). The implementation includes a clean ProcessManager protocol with three reconcilers (InstanceHealth, MetaInstance, NodeTimeout), new event types, worker-side retry coordination, and JACCL SideChannel FIFO relay.

Recommendation: Merge (with minor suggestions below)

Architecture is clean — pure reconciliation functions in reconcile.py, side-effectful orchestration in process managers, event application in apply.py. Follows existing patterns well. Test coverage is solid (43 existing + 25 new edge-case tests).

Issues Found

1. Race condition in `delete_meta_instance` (Medium-High)

In src/exo/master/api.py, the delete handler sends a DeleteMetaInstance command, then reads self.state() to find backing instances for cascade deletion. Since commands are processed asynchronously, the state it reads may be stale, which could leave orphaned instances.

Suggestion: Read state before sending the delete command, or make cascade deletion part of the event handler.

2. `apply_instance_retrying` silently drops events for missing instances (Low-Medium)

In src/exo/shared/apply.py, when InstanceRetrying references a non-existent instance, the handler returns early without incrementing the MetaInstance failure counter. This is likely intentional (InstanceDeleted handles counting instead), but is undocumented and was confusing during review.

Suggestion: Add a brief comment explaining this design choice.

3. Reconcile loop runs `ModelCard.load()` async I/O (Medium)

try_place_for_meta_instance() calls ModelCard.load() inside the 1-second reconcile loop. If the load is slow or fails, it blocks all reconciliation (health checks, node timeouts, other meta-instances).

Suggestion: Consider a timeout or running placement attempts outside the main reconcile cycle.
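
A minimal sketch of the timeout suggestion, using stdlib `asyncio.wait_for` so the example is self-contained. `load_model_card` and `place_all` are hypothetical stand-ins, not the PR's functions, and the per-item error string is illustrative only.

```python
import asyncio


async def load_model_card(model_id: str, delay: float) -> str:
    """Hypothetical stand-in for ModelCard.load(); delay simulates slow I/O."""
    await asyncio.sleep(delay)
    return f"card:{model_id}"


async def place_all(unsatisfied: dict[str, float], timeout: float) -> dict[str, str]:
    """Try each unsatisfied item; a slow load fails that item only."""
    results: dict[str, str] = {}
    for model_id, delay in unsatisfied.items():
        try:
            card = await asyncio.wait_for(load_model_card(model_id, delay), timeout)
        except asyncio.TimeoutError:
            # Record the failure and move on, so the remaining items
            # are still reconciled in this cycle.
            results[model_id] = "placement failed: model card load timed out"
            continue
        results[model_id] = f"placed with {card}"
    return results


results = asyncio.run(place_all({"slow-model": 1.0, "fast-model": 0.0}, timeout=0.05))
```

The slow load costs at most one timeout per cycle, and the fast item is still handled in the same pass.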

Minor Notes

MAX_INSTANCE_RETRIES is hardcoded to 3 — works for now, could be configurable later
Removed use_default validator from PlaceInstanceParams — intentional per commit message, minor breaking API change
RDMA placement now properly raises ValueError instead of silently falling through — good fix
No exponential backoff on retries (3 rapid attempts at 1s intervals for persistent failures)

Edge-Case Tests Added (25 tests)

| Category | Tests |
| --- | --- |
| Lifecycle | create/delete roundtrip, frozen model, deletion removes from state |
| Retry logic | counter increments through cycle, max retries → deletion, resets on success |
| Error handling | retrying for missing instance, placement failure records error, double-delete idempotent |
| Backward compat | instances without meta_instance_id, legacy placement, state serialization |
| Concurrent | multiple meta-instances for same model, deleting one doesn't affect others |
| Constraints | node_ids subset matching, min_nodes enforcement, binding vs constraint semantics |

🤖 Generated with Claude Code

AlexCheema · 2026-02-16T14:07:25Z

Full Summary of PR #1447 — Add MetaInstance Declarative Layer

Diff: 31 files changed, +3,241 / -238 lines

What is MetaInstance?

A declarative primitive: "ensure an instance matching these parameters always exists." If a node disconnects or connections break, the reconciler automatically re-creates the backing instance. Previously, users managed instances directly and dead instances stayed dead.

Core Changes (Python — 20 files)

New types and models:

src/exo/shared/types/meta_instance.py — MetaInstance frozen Pydantic model with model_id, sharding, instance_meta, min_nodes, optional node_ids, failure tracking (consecutive_failures, last_failure_error, placement_error)
src/exo/shared/types/common.py — Added MetaInstanceId type
src/exo/shared/types/worker/instances.py — Added optional meta_instance_id field on BaseInstance, binding instances to their parent MetaInstance

Events and commands:

src/exo/shared/types/events.py — New events: MetaInstanceCreated, MetaInstanceDeleted, MetaInstancePlacementFailed, InstanceRetrying
src/exo/shared/types/commands.py — New commands: CreateMetaInstance, DeleteMetaInstance
src/exo/shared/types/api.py — New API request/response types for meta-instance endpoints
src/exo/shared/types/state.py — Added meta_instances dict to State

Event sourcing:

src/exo/shared/apply.py — New apply functions for all meta-instance events; added explanatory comment on apply_instance_retrying documenting why it returns early for missing instances (avoids double-counting failures)

Reconciliation:

src/exo/master/reconcile.py — find_unsatisfied_meta_instances() checks which MetaInstances lack a healthy backing instance; try_place_for_meta_instance() creates one using existing placement logic
src/exo/master/process_managers/meta_instance.py — MetaInstanceReconciler runs in the master loop with 10s timeout and error handling for ModelCard.load(), emitting MetaInstancePlacementFailed events with dedup
src/exo/master/process_managers/instance_health.py — InstanceHealthReconciler (extracted from inline code); MAX_INSTANCE_RETRIES = 3 retry logic for failed instances
src/exo/master/process_managers/node_timeout.py — NodeTimeoutReconciler (extracted)
src/exo/master/process_managers/__init__.py — Package init

Master and worker:

src/exo/master/main.py — Reconcile loop runs 3 process managers every 1s; handles CreateMetaInstance/DeleteMetaInstance commands; cascade-delete removes backing instances
src/exo/master/api.py — /create_meta_instance and /delete_meta_instance API endpoints
src/exo/worker/plan.py — _create_runner now takes all_runners param to check for terminal peer runners before creating (prevents races during retry)

Other:

src/exo/master/placement.py, src/exo/master/placement_utils.py — Minor refactors for reuse by meta-instance placement
src/exo/main.py — Wires up new components
src/exo/download/coordinator.py — Minor adjustment
pyproject.toml — Test config update

Dashboard Changes (Svelte — 4 files)

dashboard/src/routes/+page.svelte —
- Shows MetaInstances in sidebar with backing instance status (placing, healthy, error states)
- Bug fix: $derived(fn) → $derived.by(fn) for unifiedDisplayItems (was returning the function itself, not its result)
- Bug fix: Removed tautological lastError ? ... : null check (always truthy when failures > 0)
- getMetaInstancePlacingStatus() helper for UI state derivation
dashboard/src/lib/stores/app.svelte.ts — MetaInstanceData interface, metaInstances reactive store
dashboard/src/lib/components/ChatSidebar.svelte — Removed MlxIbvInstance references (consolidated to MlxJacclInstance)
dashboard/src/lib/components/ModelCard.svelte — Same MlxIbv cleanup

Tests (3 files, ~28 new tests)

src/exo/master/tests/test_reconcile.py — 24 new tests: constraint matching, connection health (single/multi-node, edge removal, IP changes), unsatisfied detection, exclusive binding, idempotency
src/exo/master/tests/test_meta_instance_edge_cases.py — 28 tests: retry/failure flows, cascade delete, placement error dedup, ModelCard.load timeout/error handling
src/exo/master/tests/test_placement_utils.py — Placement utility tests

Design Decisions

Binding via meta_instance_id on Instance — No separate binding event or backing map. The instance carries its parent MetaInstance ID directly, eliminating two classes of race conditions (stale exclusion sets, delete ordering races).
Process manager extraction — Instance health, node timeout, and meta-instance reconciliation are separate @final classes with a reconcile(state) -> Sequence[Event] interface.
ModelCard.load timeout — 10s anyio.fail_after prevents a slow/failing model card load from blocking the entire reconcile loop. Errors emit MetaInstancePlacementFailed with dedup against current state.
Retry strategy — Failed instances retry up to 3 times (MAX_INSTANCE_RETRIES), then get deleted. consecutive_failures and last_failure_error are always set together in apply.
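
The process-manager shape described above can be sketched as a small protocol plus a sequential tick. This is a simplified illustration, not the PR's code: `State`, `Event`, `apply`, `HealthReconciler`, `PlacementReconciler`, and `tick` are hypothetical stand-ins, and events are plain tuples rather than typed models.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Protocol, Sequence, runtime_checkable

Event = tuple[str, str]  # (event name, instance id); simplified stand-in


@dataclass(frozen=True)
class State:
    instances: dict[str, bool] = field(default_factory=dict)  # id -> healthy
    desired: int = 1


def apply(state: State, event: Event) -> State:
    """Pure event application: returns a new State, never mutates."""
    kind, instance_id = event
    instances = dict(state.instances)
    if kind == "InstanceDeleted":
        instances.pop(instance_id, None)
    elif kind == "InstanceCreated":
        instances[instance_id] = True
    return State(instances=instances, desired=state.desired)


@runtime_checkable
class ProcessManager(Protocol):
    async def reconcile(self, state: State) -> Sequence[Event]: ...


class HealthReconciler:
    async def reconcile(self, state: State) -> Sequence[Event]:
        return [("InstanceDeleted", i) for i, ok in state.instances.items() if not ok]


class PlacementReconciler:
    async def reconcile(self, state: State) -> Sequence[Event]:
        missing = state.desired - len(state.instances)
        return [("InstanceCreated", f"new-{n}") for n in range(max(missing, 0))]


async def tick(state: State, managers: Sequence[ProcessManager]) -> State:
    """Run managers in order, applying each one's events before the next runs."""
    for manager in managers:
        for event in await manager.reconcile(state):
            state = apply(state, event)
    return state


managers = [HealthReconciler(), PlacementReconciler()]
after = asyncio.run(tick(State({"old": False}, desired=1), managers))
```

Running the health manager first means the broken instance is gone by the time placement looks at state, so it is replaced within a single tick. Each manager reads state and returns events without side effects, which is what makes a 1s tick safe to repeat.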

Merge Conflict Resolution

The merge with origin/main resolved conflicts between this feature and main's new task cancellation feature:

commands.py — Both CreateMetaInstance/DeleteMetaInstance and TaskCancelled
plan.py — Added _cancel_tasks from main, kept all_runners param from this branch
runner_supervisor.py — Merged _cancel_sender/cancelled fields with JACCL pipe relay fields
reconcile.py — Pass empty tasks dict to get_transition_events (new param from main)

AlexCheema · 2026-02-16T17:55:34Z

Latest improvements (commit `3df896d`)

Bug fix: cancel active tasks on meta-instance cascade delete

DeleteMetaInstance now emits TaskStatusUpdated(Cancelled) for any Pending/Running tasks on backing instances before emitting InstanceDeleted. Previously, cascade-deleting backing instances left orphaned task references in state. The fix follows the pattern already used by get_transition_events() in placement.py.

Files: src/exo/master/main.py

Lifecycle logging

Added structured logging for meta-instance operations to aid debugging:

main.py — logs CreateMetaInstance (model, min_nodes, sharding) and DeleteMetaInstance (with cascade instance count)
process_managers/meta_instance.py — logs successful placement and placement failures
process_managers/instance_health.py — logs retry attempts with (attempt N/3) and when retry limit is exceeded

GET `/meta_instances` API endpoint

Added GET /meta_instances to list all meta-instances directly, without needing to fetch full cluster state.

File: src/exo/master/api.py

Regression tests

2 new tests in test_meta_instance_edge_cases.py (30 total):

test_cascade_delete_cancels_active_tasks — verifies the full event sequence (MetaInstanceDeleted → TaskStatusUpdated(Cancelled) → InstanceDeleted) correctly updates state
test_cascade_delete_skips_completed_tasks — verifies only Pending/Running tasks are targeted, not completed ones
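
The event ordering these tests check can be sketched as a pure function. This is an illustration under simplifying assumptions: events are plain tuples, `cascade_delete_events` is a hypothetical helper, and the `"Complete"` status name is assumed for finished tasks.

```python
from dataclasses import dataclass
from typing import Optional

ACTIVE_STATUSES = {"Pending", "Running"}


@dataclass(frozen=True)
class Instance:
    instance_id: str
    meta_instance_id: Optional[str] = None


@dataclass(frozen=True)
class Task:
    task_id: str
    instance_id: str
    status: str


def cascade_delete_events(meta_instance_id, instances, tasks):
    """Build the event sequence for deleting a MetaInstance and its backing instances."""
    events = [("MetaInstanceDeleted", meta_instance_id)]
    for inst in instances:
        if inst.meta_instance_id != meta_instance_id:
            continue  # orphans and other MetaInstances' instances are untouched
        for task in tasks:
            if task.instance_id == inst.instance_id and task.status in ACTIVE_STATUSES:
                # Cancel active tasks before the instance disappears.
                events.append(("TaskStatusUpdated", task.task_id, "Cancelled"))
        events.append(("InstanceDeleted", inst.instance_id))
    return events


instances = [Instance("inst-1", "meta-a"), Instance("inst-2")]
tasks = [
    Task("t1", "inst-1", "Running"),
    Task("t2", "inst-1", "Complete"),
    Task("t3", "inst-2", "Running"),
]
events = cascade_delete_events("meta-a", instances, tasks)
```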

Pre-commit checks

All passing: basedpyright 0 errors, ruff clean, nix fmt clean, 261 tests passed (1 skipped, 97 deselected slow).

🤖 Generated with Claude Code

AlexCheema

Code Review: PR #1447 — MetaInstance Declarative Layer

Overall Assessment

Well-architected addition that introduces a declarative MetaInstance abstraction over the existing imperative instance management. The reconciliation pattern is clean, the retry logic is sound, and the test coverage is thorough (778 lines of edge-case tests). This is a significant architectural improvement.

Strengths

ProcessManager Protocol (process_managers/__init__.py): Clean, composable interface. @runtime_checkable is a nice touch for validation. The three reconcilers (InstanceHealth, NodeTimeout, MetaInstance) have clear single responsibilities.
_apply_and_broadcast eliminates the loopback processor hack. Centralizing event indexing, state mutation, persistence, and broadcast in one method is a major improvement. The docstring correctly notes Python's cooperative scheduling guarantees.
instance_runners_failed (reconcile.py): The logic is exactly right — requires ALL runners to be terminal AND at least one RunnerFailed. Correctly returns (False, None) when runners haven't reported yet (still starting) or when all are gracefully shut down. The node-identity-aware error messages are a nice UX touch.
find_unsatisfied_meta_instances correctly checks both meta_instance_id binding AND instance_connections_healthy, not just one or the other. A MetaInstance with a backing instance that has a broken connection will be detected.
Cascade delete (_command_processor, DeleteMetaInstance case): Correctly cancels active tasks before deleting the backing instance. The TaskStatus.Cancelled events are emitted before InstanceDeleted, ensuring proper cleanup ordering.
Comprehensive edge-case tests (test_meta_instance_edge_cases.py): Tests cover frozen model validation, create/delete roundtrips, nonexistent-ID safety, duplicate MetaInstances, retry counter resets, full retry cycles, and more.
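
The failure rule praised above (all runners terminal, at least one RunnerFailed, node-prefixed messages joined with "; ") can be sketched like this. The signature and the `"RunnerShutdown"` status name are assumptions for illustration; the real function in reconcile.py works on the PR's own types.

```python
from typing import Optional

TERMINAL_STATUSES = {"RunnerFailed", "RunnerShutdown"}  # second name is assumed


def instance_runners_failed(
    runner_nodes: dict[str, str],
    statuses: dict[str, tuple[str, Optional[str]]],
) -> tuple[bool, Optional[str]]:
    """runner_nodes maps runner id -> node friendly name;
    statuses maps runner id -> (status, error message)."""
    # Runners that have not reported yet are still starting: not a failure.
    if not runner_nodes or any(r not in statuses for r in runner_nodes):
        return (False, None)
    # Every runner must be terminal before the instance counts as failed.
    if any(statuses[r][0] not in TERMINAL_STATUSES for r in runner_nodes):
        return (False, None)
    errors = [
        f"{runner_nodes[r]}: {statuses[r][1] or 'unknown error'}"
        for r in runner_nodes
        if statuses[r][0] == "RunnerFailed"
    ]
    if not errors:
        return (False, None)  # all runners shut down gracefully
    return (True, "; ".join(errors))


nodes = {"r1": "MacBook Pro", "r2": "Mac Studio"}
both_failed = instance_runners_failed(
    nodes, {"r1": ("RunnerFailed", "OOM"), "r2": ("RunnerFailed", "connection reset")}
)
still_starting = instance_runners_failed(nodes, {"r1": ("RunnerFailed", "OOM")})
graceful = instance_runners_failed(
    nodes, {"r1": ("RunnerShutdown", None), "r2": ("RunnerShutdown", None)}
)
```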

Issues

try_place_for_meta_instance doesn't intersect node_ids with live topology (reconcile.py ~line 227): If a meta-instance pins node_ids=["node-a", "node-b"] and node-b goes down, placement will fail because node-b isn't in the topology. The placement could be made more resilient by intersecting meta_instance.node_ids with topology.list_nodes() and only requiring alive nodes. (Note: the other branch — #1484 — appears to have this fix with alive = set(meta_instance.node_ids) & live_nodes.)
Reconcile loop at 1-second interval (master/main.py): The previous _plan loop ran every 10 seconds. Now _reconcile runs every 1 second, and MetaInstanceReconciler calls ModelCard.load() for every unsatisfied meta-instance on each cycle. If model card loading hits the network (HuggingFace API), this could generate excessive requests when placement persistently fails. Consider:
- Caching ModelCard.load() results
- Adding a timeout to ModelCard.load() (prevent one slow model lookup from blocking the entire reconcile loop)
- Exponential backoff for meta-instances with placement_error set
Placement strategy change (placement.py, placement_utils.py): get_smallest_cycles → get_largest_cycles changes the default from "minimize nodes used" to "maximize nodes used". This is a significant behavior change — users who previously got 2-node placements will now get 4-node placements. Should be documented and justified (performance? memory pressure?).
_reconcile manager ordering matters (master/main.py): Managers run sequentially with state updated between each. InstanceHealthReconciler runs before MetaInstanceReconciler, which is correct (delete broken instances before trying to re-place). But this ordering dependency is implicit. Consider adding a comment.
Potential double-place race in CreateMetaInstance handler (master/main.py): The command handler does an immediate placement attempt after _apply_and_broadcast(MetaInstanceCreated). It re-checks find_unsatisfied_meta_instances to avoid racing with the reconciler, which is good. However, between the await ModelCard.load() and the re-check, the reconciler could also be placing — the re-check mitigates but doesn't fully eliminate the race since the reconciler could be mid-placement (after find_unsatisfied but before emitting InstanceCreated).
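
The topology intersection suggested in the first issue can be sketched as below. `placeable_nodes` is a hypothetical helper, not code from either branch; it only shows the `set(node_ids) & live_nodes` step and the min_nodes check that would follow it.

```python
from typing import Iterable, Optional


def placeable_nodes(
    pinned_node_ids: Optional[Iterable[str]],
    live_nodes: Iterable[str],
    min_nodes: int,
) -> set[str]:
    """Restrict pinned node_ids to nodes currently in the topology."""
    live = set(live_nodes)
    # No pin means any live node is a candidate.
    alive = live if pinned_node_ids is None else set(pinned_node_ids) & live
    if len(alive) < min_nodes:
        raise ValueError(
            f"only {len(alive)} eligible live node(s), need at least {min_nodes}"
        )
    return alive


alive = placeable_nodes(["node-a", "node-b"], {"node-a", "node-c"}, min_nodes=1)

try:
    placeable_nodes(["node-a", "node-b"], {"node-a", "node-c"}, min_nodes=2)
    raised = False
except ValueError:
    raised = True
```

Whether a pinned MetaInstance should fall back to a subset of its nodes at all is a policy question the review leaves open; the sketch keeps min_nodes as the hard limit.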

Minor

PlacementResult as NamedTuple (reconcile.py): Rest of the codebase uses frozen Pydantic models or dataclasses. Minor inconsistency.
Magic number: MAX_INSTANCE_RETRIES = 3 in instance_health.py — consider making this configurable or at least documenting why 3.

Verdict

Strong architectural improvement. The declarative layer, reconciliation pattern, and retry logic are well-designed. The main concerns are the placement strategy change (largest vs smallest), the 1-second reconcile interval with potentially slow ModelCard.load(), and the node_ids/topology intersection issue. Would approve after addressing the topology intersection issue and documenting the placement strategy change.

🤖 Generated with Claude Code

Introduces MetaInstance as a declarative constraint ensuring an instance matching given parameters (model, sharding, min_nodes) always exists. The master's reconciliation loop continuously checks for unsatisfied meta-instances and attempts placement. Connection health checking verifies that specific IPs (MlxRing) and RDMA interfaces (MlxJaccl) stored on instances still exist as topology edges, enabling automatic recovery when cables are swapped or interfaces change. Also eliminates the master's loopback event path, unifying all event emission through _apply_and_broadcast for simpler control flow. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add MetaInstanceBound event and meta_instance_backing State field for explicit MetaInstance → Instance binding (prevents ambiguous linking when two MetaInstances have identical constraints)
- Replace model_card: ModelCard with model_id: ModelId on MetaInstance (load ModelCard on-demand at placement time)
- Add MetaInstance API endpoints (POST /meta_instance, DELETE)
- Update dashboard to use MetaInstances as primary primitive with unified display items merging MetaInstances and orphan instances
- Dashboard launches via MetaInstance instead of direct Instance creation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The separate MetaInstanceBound event + meta_instance_backing map introduced two bugs: stale exclusion sets in the reconciler loop and a delete ordering race. Embedding meta_instance_id directly on BaseInstance eliminates the binding mechanism entirely — when an instance is created for a MetaInstance it carries the ID, when deleted the binding is gone. No separate map, no cleanup, no races. Also fixes delete_meta_instance to cascade-delete backing instances. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replace inline _plan() steps with a list of ProcessManagers, each implementing async reconcile(State) -> Sequence[Event]. Tick every 1s instead of 10s — safe because all PMs are idempotent against state. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Replace inline _plan() with ProcessManager loop (_reconcile), tick every 1s instead of 10s (safe because all PMs are idempotent)
- Fix dashboard sending "MlxIbv" instead of "MlxJaccl" for RDMA instance type, which silently fell back to MlxRing default
- Remove all stale MlxIbv references from dashboard

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

When a MetaInstance has no backing instance yet, derive the strategy display from the MetaInstance's own sharding and instanceMeta fields rather than showing "Unknown (Unknown)". Also clean up all stale MlxIbv references across the dashboard — the backend enum is MlxJaccl. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Show why MetaInstance placement fails instead of stuck "PLACING", and show per-node runner status during loading for multi-node instances. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The mode="plain" validator bypassed Pydantic's string-to-enum coercion, so JSON strings like "Tensor" and "MlxJaccl" from the dashboard failed the isinstance check and silently fell back to Pipeline/MlxRing defaults. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Dashboard was not including the user's node filter in the POST to /meta_instance, so placement ignored which nodes the user selected. Also, placement silently fell back to Ring when RDMA was requested but no RDMA-connected cycles were available — now raises an error that surfaces via MetaInstancePlacementFailed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

When user selects specific nodes via the filter, min_nodes should be at least the number of filtered nodes to prevent placement from picking a smaller cycle. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

RDMA requires at least 2 nodes — a single-node RDMA instance is nonsensical. Enforce this in both the dashboard (when building the launch request) and the backend placement (when filtering cycles). Previously, selecting RDMA would still place on 1 node because min_nodes defaulted to 1 and the placement silently switched to Ring. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The dashboard now extracts node IDs from the selected preview's memory_delta_by_node, ensuring the backend places on exactly the nodes the user was shown. Also reverts incorrect RDMA min_nodes >= 2 enforcement since single-node RDMA is valid. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

frozenset serializes to a JSON array but cannot be deserialized back in strict mode through the TaggedModel wrap validator (list → frozenset coercion is rejected). Changed to list[NodeId] since the model is already frozen/immutable. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Extend InstanceDeleted with failure_error field for runner crash info
- Add InstanceFailureInfo model tracking consecutive failures per MetaInstance
- InstanceHealthReconciler now detects runner failures (all terminal with at least one RunnerFailed) in addition to connection failures
- apply_instance_deleted increments failure counter for meta-bound instances
- Dashboard shows RETRYING (N/3) status with error messages, and "Instance re-created due to failure" after 3 consecutive failures
- Extract and display RunnerFailed error messages in instance status

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ashboard MetaInstanceReconciler now checks failure count before placement — after 3 consecutive failures it emits MetaInstancePlacementFailed instead of retrying forever. Dashboard shows "Retrying after error: <msg>" in orange throughout the retry cycle, not just during the brief window with no backing instance. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

When multiple runners fail, concatenate all error messages with "; " so the real error isn't hidden by generic side-effect failures from other runners. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The dashboard % 3 logic already handles displaying retry progress in batches (RETRYING 1/3, 2/3, 3/3, then PLACING with error, repeat). No need to permanently block placement after 3 failures. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Each error in the combined message is now prefixed with the node's friendly name (e.g. "MacBook Pro: OOM; Mac Studio: connection reset") so the root cause node is easily identifiable. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…er wording

- apply_instance_created no longer clears last_failure_error so the error context persists while the new instance starts up
- Dashboard retryError shows the error without (N/3) prefix when consecutiveFailures is 0 (instance was recreated)
- Jaccl warning tooltip now says "experimental RDMA driver in macOS"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replace fragile TCP SideChannel with anonymous pipes relayed through exo's event-sourced control plane. RunnerSupervisor creates pipe pairs for MlxJaccl instances, relays all_gather rounds via JacclSideChannelData/ JacclSideChannelGathered events through the master, eliminating errno=57 crashes from Thunderbolt RDMA driver instability. Also includes dashboard RDMA warning improvements and instance retry fixes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Anonymous pipes from os.pipe() don't survive a multiprocessing.Process started with the spawn start method (the macOS default since Python 3.8). The FD numbers are passed but the actual file descriptors don't exist in the child process, causing EBADF errors. Switch to named pipes (FIFOs) which the child opens by path in the spawned process, getting valid FDs for the C++ SideChannel. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

TaggedModel's wrap validator converts JSON→Python validation context, which breaks strict-mode bytes deserialization from JSON strings. Use Base64Bytes type to encode/decode bytes as base64 strings in JSON. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Cover retry logic, error handling, backward compatibility, concurrent scenarios, placement error tracking, and serialization. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Two race conditions existed in the meta-instance lifecycle:

1. CreateMetaInstance buffered MetaInstanceCreated but didn't apply it before awaiting ModelCard.load(). The reconciler could interleave during the await, leading to duplicate placements. Fix: apply MetaInstanceCreated eagerly via _apply_and_broadcast, then re-check satisfaction after the await so placement uses fresh state and skips if the reconciler already handled it.
2. delete_meta_instance (API handler) sent DeleteMetaInstance then read self.state.instances for cascade deletion. State was stale, so backing instances created between the send and the read were missed, permanently orphaning them. Fix: move cascade delete into the command processor's DeleteMetaInstance handler, where InstanceDeleted events are generated atomically with MetaInstanceDeleted.

Reproduced on 4-node Mac Mini cluster: 28K anomalies in stress test including 21 permanently orphaned instances. After fix, the cascade delete and placement are race-free.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Prevents RuntimeError when the context has already been set, e.g. when Terminal.app reuses a tab or the process restarts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…dels On startup, _emit_existing_download_progress() used downloaded_bytes_this_session to decide between DownloadPending and DownloadOngoing. Since downloaded_bytes_this_session is always 0 on startup (it tracks the current session only), fully-downloaded models were incorrectly reported as DownloadPending. Now checks actual disk state: if downloaded_bytes >= total_bytes, emit DownloadCompleted regardless of session bytes. This fixes the UI showing models as pending when they're already available. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add TaskCancelled command and Cancelled task status
- Detect API client disconnects in master/api.py
- Handle TaskCancelled in master state machine
- Add _cancel_tasks to worker for graceful task cleanup
- Add cancel_receiver to runner for inference abort
- Add mx_any helper in MLX utils for cancellable operations
- Guard instance lookup in worker to prevent KeyError
- Update tests for cancellation flow

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…UG-001)" This reverts commit 2a75672.

…001d) The placement algorithm previously selected the smallest viable cycle, causing large models to be distributed across too few nodes and running out of memory. Changed get_smallest_cycles to get_largest_cycles so that all healthy nodes are utilized, spreading layers more evenly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ures (BUG-001c)

The place_instance API endpoint used fire-and-forget: it sent the command and returned HTTP 200 immediately. On a fresh cluster start, the master's state often lacks topology/memory data, so placement raises ValueError which was silently caught and logged. The caller never learned it failed. Two fixes:

- API: validate placement locally before sending, return HTTP 400 on failure instead of silently accepting an unprocessable command
- Master: emit MetaInstancePlacementFailed on immediate placement error in CreateMetaInstance handler so the error surfaces in state right away

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…eReconciler

ModelCard.load() does async I/O inside the 1-second reconcile loop. A slow or failing load blocked all reconciliation (health checks, node timeouts, other meta-instances). Adds a 10-second timeout, per-meta-instance error handling with MetaInstancePlacementFailed events, and documents the intentional early return in apply_instance_retrying.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
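The timeout and per-item error handling described here might look something like the sketch below. It uses the standard `asyncio.wait_for`; everything else (`reconcile_once`, the `loaders` mapping, the failure strings) is hypothetical and only stands in for ModelCard.load() and the MetaInstancePlacementFailed events.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Tuple

Loader = Callable[[], Awaitable[str]]  # hypothetical stand-in for ModelCard.load


async def reconcile_once(
    loaders: Dict[str, Loader], timeout_s: float = 10.0
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Load each meta-instance's model card with a bounded wait.

    Returns (loaded, failed). A slow or failing load is recorded for
    that meta-instance only, so the remaining ones are still processed.
    """
    loaded: Dict[str, str] = {}
    failed: Dict[str, str] = {}
    for meta_instance_id, load in loaders.items():
        try:
            loaded[meta_instance_id] = await asyncio.wait_for(load(), timeout=timeout_s)
        except asyncio.TimeoutError:
            # Stand-in for emitting a placement-failed event for this item.
            failed[meta_instance_id] = "model card load timed out"
        except Exception as exc:
            failed[meta_instance_id] = f"{type(exc).__name__}: {exc}"
    return loaded, failed
```

The point of the sketch is the isolation: one stalled load costs at most `timeout_s` and never prevents the loop from reaching the other meta-instances.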

…instance UI

1. DERIVED REACTIVITY BUG: `unifiedDisplayItems` used `$derived(fn)` which made the derived value the function itself instead of its result. Svelte never tracked reactive dependencies in the function body, so the instance list didn't update when metaInstances or instances changed. Fixed by using `$derived.by(fn)` and removing the `()` call-sites in the template.

2. TAUTOLOGICAL CHECK: In `getMetaInstancePlacingStatus`, the `lastError ? ... : null` guard inside the `failures > 0` branch was always true because `lastFailureError` and `consecutiveFailures` are always set together in `apply_instance_retrying` and `apply_instance_deleted`. Removed the dead `: null` branch.

Also fixes pyright errors in test file by using proper pytest.MonkeyPatch type.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Previously, DeleteMetaInstance cascade-deleted backing instances without cancelling their active tasks, leaving orphaned task references. Now emits TaskStatusUpdated(Cancelled) for Pending/Running tasks before InstanceDeleted. Also adds lifecycle logging for meta-instance operations, a GET /meta_instances endpoint, and 2 regression tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
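The event ordering in this fix can be sketched as below. The event names mirror the commit message, but the data model (`Task`, the `instances` mapping from instance ID to parent meta-instance ID, events as tuples) is a simplified assumption for illustration rather than the PR's actual types.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Task:
    """Hypothetical, simplified task record."""

    task_id: str
    instance_id: str
    status: str  # e.g. "Pending", "Running", "Complete"


def cascade_delete_events(
    meta_instance_id: str,
    instances: Dict[str, str],  # instance_id -> parent meta_instance_id
    tasks: List[Task],
) -> List[Tuple[str, str]]:
    """Build the events for deleting a meta-instance and its backing instances."""
    events: List[Tuple[str, str]] = []
    for instance_id, parent_id in instances.items():
        if parent_id != meta_instance_id:
            continue
        # Cancel active tasks first so no task is left pointing at a
        # deleted instance.
        for task in tasks:
            if task.instance_id == instance_id and task.status in ("Pending", "Running"):
                events.append(("TaskStatusUpdated(Cancelled)", task.task_id))
        events.append(("InstanceDeleted", instance_id))
    return events
```

Tasks that already finished are left alone; only Pending and Running tasks on the affected instances receive a cancellation before the instance itself is deleted.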

- Add TaskStatusUpdated to imports in master/main.py (used in cascade delete but missing after rebase conflict resolution)
- Mock mx.distributed.all_gather in test_event_ordering.py so MockGroup works with the cancel-checking code path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…mbda

- Add `tasks` parameter to `try_place_for_meta_instance()` in reconcile.py and thread `state.tasks` through from both call sites (main.py, meta_instance.py)
- Replace untyped lambda with typed helper function in test_event_ordering.py to satisfy basedpyright strict mode

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add missing case branch for TaskCancelled to satisfy basedpyright exhaustive match check. Deletes the associated task and cleans up the command_task_mapping, mirroring TaskFinished behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The merge resolution kept pre-#1447 code that assigned to `instanceData` (a Svelte 5 $derived constant) and used the old /instance endpoint. Switch both launchInstance and onboardingLaunchModel to POST /meta_instance. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

This reverts commit a962a28.

The merge resolution kept pre-#1447 code that assigned to `instanceData` (a Svelte 5 $derived constant) and used the old /instance endpoint. Switch both launchInstance and onboardingLaunchModel to POST /meta_instance. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

AlexCheema marked this pull request as ready for review February 16, 2026 13:36

AlexCheema enabled auto-merge (squash) February 16, 2026 13:39

AlexCheema commented Feb 16, 2026

AlexCheema force-pushed the alexcheema/meta-instance branch from 3df896d to 34e846b February 17, 2026 17:06

AlexCheema and others added 20 commits February 17, 2026 09:32

Add placement error feedback and per-node loading status

7ce020c

Show why MetaInstance placement fails instead of stuck "PLACING", and show per-node runner status during loading for multi-node instances. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Ensure min_nodes >= node filter size when launching

a3367bb

When user selects specific nodes via the filter, min_nodes should be at least the number of filtered nodes to prevent placement from picking a smaller cycle. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
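As a rough sketch of the rule in this commit message, assuming the node filter is an optional list of node IDs (the helper name `effective_min_nodes` is hypothetical):

```python
from typing import Optional, Sequence


def effective_min_nodes(
    requested_min_nodes: int, node_filter: Optional[Sequence[str]]
) -> int:
    """Raise min_nodes to at least the number of explicitly selected nodes."""
    if not node_filter:
        # No filter selected: the requested value is used unchanged.
        return requested_min_nodes
    return max(requested_min_nodes, len(node_filter))
```

Selecting three nodes with a requested minimum of 1 yields 3, so placement cannot settle on a cycle smaller than the selection.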

Collect all runner error messages instead of just the last one

394c687

When multiple runners fail, concatenate all error messages with "; " so the real error isn't hidden by generic side-effect failures from other runners. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
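A minimal sketch of the aggregation, assuming each runner reports an optional error string (the helper name and the skipping of empty entries are assumptions, not confirmed by the PR):

```python
from typing import Iterable, Optional


def combine_runner_errors(errors: Iterable[Optional[str]]) -> str:
    """Join every reported runner error with "; " instead of keeping only the last."""
    # Empty or missing messages are skipped (an assumption of this sketch).
    return "; ".join(message for message in errors if message)
```

Keeping every message means the root cause stays visible even when it is followed by generic failures from the other runners.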

Show retry count in exceeded retry limit message (3/3)

52b6ca8

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Show retry attempt count with error message, e.g. (2/3)

d44bfbf

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

AlexCheema and others added 20 commits February 17, 2026 09:32

temp: add jaccl warning screenshots for PR comment

32bbb64

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chore: remove temporary screenshot files

8748d0e

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

test: add 25 edge-case tests for MetaInstance lifecycle

e85be8b

Cover retry logic, error handling, backward compatibility, concurrent scenarios, placement error tracking, and serialization. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: use force=True for multiprocessing set_start_method

3417cef

Prevents RuntimeError when the context has already been set, e.g. when Terminal.app reuses a tab or the process restarts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
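For reference, the standard-library behaviour this fix relies on can be shown in a few lines. The wrapper name and the choice of "spawn" are assumptions for the example; the commit message does not say which start method the project uses.

```python
import multiprocessing as mp


def configure_start_method(method: str = "spawn") -> str:
    """Set the multiprocessing start method, tolerating repeated calls.

    Without force=True, a second call to set_start_method raises
    RuntimeError because the context has already been set.
    """
    mp.set_start_method(method, force=True)
    return mp.get_start_method()
```

Calling `configure_start_method()` more than once in the same process is safe, which is the situation the commit describes.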

Revert "feat: add task cancellation for client disconnect handling (B…

63cfe13

…UG-001)" This reverts commit 2a75672.

AlexCheema force-pushed the alexcheema/meta-instance branch from fb6a084 to 11424f6 February 17, 2026 17:32

AlexCheema merged commit a962a28 into main Feb 17, 2026
6 checks passed

AlexCheema deleted the alexcheema/meta-instance branch February 17, 2026 17:48

Evanev7 added a commit that referenced this pull request Feb 17, 2026

Revert "Add MetaInstance declarative layer (#1447)"

e4feb05

This reverts commit a962a28.

Evanev7 mentioned this pull request Feb 17, 2026

Revert "Add MetaInstance declarative layer" #1507

Merged

Evanev7 added a commit that referenced this pull request Feb 17, 2026

Revert "Add MetaInstance declarative layer (#1447)"

eccc629

This reverts commit a962a28.

AlexCheema mentioned this pull request Feb 17, 2026

Add MetaInstance declarative layer (rebased) #1519

Open

4 tasks

Add MetaInstance declarative layer #1447
AlexCheema merged 44 commits into main from alexcheema/meta-instance

1 participant

AlexCheema commented Feb 11, 2026 • edited

Motivation

Changes

Recent improvements

Why It Works

Test Plan

Manual Testing

Automated Testing

AlexCheema commented Feb 11, 2026

Testing scenarios needed before merging

AlexCheema commented Feb 11, 2026

Future work: placement preferences

AlexCheema commented Feb 12, 2026

JACCL RDMA Error Warning Banner

AlexCheema commented Feb 13, 2026

Deep Review of PR 1447: MetaInstance Layer

Summary

Recommendation: Merge (with minor suggestions below)

Issues Found

1. Race condition in delete_meta_instance (Medium-High)

2. apply_instance_retrying silently drops events for missing instances (Low-Medium)

3. Reconcile loop runs ModelCard.load() async I/O (Medium)

Minor Notes

Edge-Case Tests Added (25 tests)

AlexCheema commented Feb 16, 2026

Full Summary of PR #1447 — Add MetaInstance Declarative Layer

What is MetaInstance?

Core Changes (Python — 20 files)

Dashboard Changes (Svelte — 4 files)

Tests (3 files, ~28 new tests)

Design Decisions

Merge Conflict Resolution

AlexCheema commented Feb 16, 2026

Latest improvements (commit 3df896d)

Bug fix: cancel active tasks on meta-instance cascade delete

Lifecycle logging

GET /meta_instances API endpoint

Regression tests

Pre-commit checks

AlexCheema left a comment

Code Review: PR #1447 — MetaInstance Declarative Layer

Overall Assessment

Strengths

Issues

Minor

Verdict
