
Add MetaInstance declarative layer#1447

Merged
AlexCheema merged 44 commits into main from alexcheema/meta-instance
Feb 17, 2026

Conversation

@AlexCheema
Contributor

@AlexCheema AlexCheema commented Feb 11, 2026

Motivation

Users currently manage instances directly, which means if a node disconnects or connections break, the instance dies and nothing recreates it. MetaInstance is a declarative primitive: "ensure an instance matching these parameters always exists." The reconciler watches for unhealthy or missing backing instances and re-places them automatically.

Changes

  • MetaInstance type (meta_instance.py): declarative constraint with model_id, min_nodes, optional node_ids, and sharding
  • Reconciler (reconcile.py): find_unsatisfied_meta_instances checks which MetaInstances lack a healthy backing instance, try_place_for_meta_instance creates one
  • Master loop (main.py): periodically reconciles unsatisfied MetaInstances; immediate placement on CreateMetaInstance command
  • API (api.py): create_meta_instance / delete_meta_instance / GET /meta_instances endpoints; delete cascades to backing instances with task cancellation
  • Binding via meta_instance_id on Instance (instances.py): no separate binding event or backing map — the instance carries its parent MetaInstance ID directly, eliminating race conditions in the reconciler
  • Dashboard: sidebar shows MetaInstances with their backing instance status; orphan instances (created directly) still shown separately
  • Tests: constraint matching, connection health, unsatisfied detection, exclusive binding, cascade delete with task cancellation

Recent improvements

  • fix: cancel active tasks on cascade delete. DeleteMetaInstance now emits TaskStatusUpdated(Cancelled) for any Pending/Running tasks on backing instances before emitting InstanceDeleted. Previously, cascade-deleting backing instances left orphaned task references in state.
  • Lifecycle logging — added logger.info/logger.warning for: CreateMetaInstance (model, min_nodes, sharding), DeleteMetaInstance (with cascade count), reconciler placement success/failure, and retry decisions with attempt counts in InstanceHealthReconciler.
  • GET /meta_instances endpoint — lists all meta-instances without needing to fetch full state.
  • 2 regression tests: test_cascade_delete_cancels_active_tasks and test_cascade_delete_skips_completed_tasks verify the cascade-delete event sequence.

Why It Works

Putting meta_instance_id on BaseInstance makes binding inherent to instance creation. When the reconciler creates an instance for a MetaInstance, it tags it via model_copy. When the instance is deleted, the binding disappears with it. This avoids the two bugs that a separate binding mechanism would introduce:

  1. Stale exclusion sets — the reconciler loop can't accidentally bind two MetaInstances to the same instance
  2. Delete ordering race — no window between deleting an instance and its binding where the reconciler could re-place

Test Plan

Manual Testing

  • Created MetaInstance via dashboard, verified instance placed
  • Verified delete cascades (deleting MetaInstance removes backing instance)
  • Verified orphan instances still work independently

Automated Testing

  • 30 tests in test_meta_instance_edge_cases.py: lifecycle, retry logic, error handling, concurrent operations, cascade delete with task cancellation
  • 24 tests in test_reconcile.py: constraint matching, connection health (single/multi-node, edge removal, IP changes), unsatisfied detection, exclusive binding, idempotency
  • All 261 tests pass
  • basedpyright 0 errors, ruff clean, dashboard builds

@AlexCheema
Contributor Author

Testing scenarios needed before merging

  • Disconnect Ethernet with a Ring instance running — verify MetaInstance reconciler detects unhealthy connections and re-places
  • Disconnect Thunderbolt 5 with an RDMA instance running — verify same recovery behavior
  • Kill a node that's part of an Instance — verify node timeout triggers instance deletion and MetaInstance re-places on remaining nodes
  • Delete a MetaInstance from the dashboard — verify backing instance is cascade-deleted
  • Create multiple MetaInstances for the same model — verify each gets its own backing instance (exclusive binding)
  • Create an orphan instance directly via API — verify it works independently and isn't affected by MetaInstance lifecycle

@AlexCheema
Contributor Author

Future work: placement preferences

MetaInstance currently places with no optimization preference. A natural next step is letting users specify a placement preference, e.g.:

  • Highest interactivity — maximize tokens/sec per request (fewer nodes, lower latency)
  • Highest throughput — maximize total tokens/sec across concurrent requests (more sharding, more parallelism)

These are different points on a throughput vs. interactivity Pareto curve. The placer would use the preference to score candidate placements differently rather than just picking the first valid one.
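A toy illustration of preference-based scoring; the Preference values and the node-count scoring are hypothetical, since a real placer would score predicted tokens/sec over concrete candidate placements rather than just node counts.

```python
from typing import Literal

Preference = Literal["interactivity", "throughput"]

def score(candidate_nodes: int, preference: Preference) -> float:
    """Toy scoring: fewer nodes for interactivity (lower per-request latency),
    more nodes for throughput (more sharding and parallelism)."""
    return -candidate_nodes if preference == "interactivity" else candidate_nodes

def pick(candidates: list[int], preference: Preference) -> int:
    # Instead of taking the first valid placement, rank all candidates.
    return max(candidates, key=lambda n: score(n, preference))
```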

@AlexCheema
Contributor Author

JACCL RDMA Error Warning Banner

Added a dashboard warning that detects [jaccl] errors in MetaInstance failure messages. These errors indicate a problem with the experimental RDMA driver in macOS — the only fix is restarting the affected machine.

What it does:

  • Scans metaInstances for lastFailureError containing [jaccl]
  • Shows a red dismissible alert banner at the top-left of the topology view
  • Hover tooltip explains the issue and tells the user to restart
  • Re-appears if a new jaccl error arrives after dismissal

Banner:
JACCL warning banner

Tooltip on hover:
JACCL warning tooltip

@AlexCheema
Contributor Author

Deep Review of PR 1447: MetaInstance Layer

Reviewed: all 32 commits, ~2459 additions / ~228 deletions across 25+ files
Edge-case tests: 25 new tests pushed in src/exo/master/tests/test_meta_instance_edge_cases.py (all passing)
Full test suite: 256 passed, 1 skipped, 97 deselected (pre-existing slow tests)
Pre-commit checks: basedpyright 0 errors, ruff clean, nix fmt clean


Summary

The PR adds a MetaInstance declarative constraint layer that ensures a model instance matching given parameters always exists. When the backing instance fails, the system automatically retries placement up to MAX_INSTANCE_RETRIES (3). The implementation includes a clean ProcessManager protocol with three reconcilers (InstanceHealth, MetaInstance, NodeTimeout), new event types, worker-side retry coordination, and JACCL SideChannel FIFO relay.

Recommendation: Merge (with minor suggestions below)

Architecture is clean — pure reconciliation functions in reconcile.py, side-effectful orchestration in process managers, event application in apply.py. Follows existing patterns well. Test coverage is solid (43 existing + 25 new edge-case tests).


Issues Found

1. Race condition in delete_meta_instance (Medium-High)

In src/exo/master/api.py, the delete handler sends a DeleteMetaInstance command, then reads self.state() to find backing instances for cascade deletion. Since commands are processed asynchronously, state may be stale — could leave orphaned instances.

Suggestion: Read state before sending the delete command, or make cascade deletion part of the event handler.

2. apply_instance_retrying silently drops events for missing instances (Low-Medium)

In src/exo/shared/apply.py, when InstanceRetrying references a non-existent instance, the handler returns early without incrementing the MetaInstance failure counter. This is likely intentional (InstanceDeleted handles counting instead), but is undocumented and was confusing during review.

Suggestion: Add a brief comment explaining this design choice.

3. Reconcile loop runs ModelCard.load() async I/O (Medium)

try_place_for_meta_instance() calls ModelCard.load() inside the 1-second reconcile loop. If slow or failing, this blocks all reconciliation (health checks, node timeouts, other meta-instances).

Suggestion: Consider a timeout or running placement attempts outside the main reconcile cycle.

Minor Notes

  • MAX_INSTANCE_RETRIES is hardcoded to 3 — works for now, could be configurable later
  • Removed use_default validator from PlaceInstanceParams — intentional per commit message, minor breaking API change
  • RDMA placement now properly raises ValueError instead of silently falling through — good fix
  • No exponential backoff on retries (3 rapid attempts at 1s intervals for persistent failures)

Edge-Case Tests Added (25 tests)

  • Lifecycle: create/delete roundtrip, frozen model, deletion removes from state
  • Retry logic: counter increments through cycle, max retries → deletion, resets on success
  • Error handling: retrying for missing instance, placement failure records error, double-delete idempotent
  • Backward compat: instances without meta_instance_id, legacy placement, state serialization
  • Concurrent: multiple meta-instances for same model, deleting one doesn't affect others
  • Constraints: node_ids subset matching, min_nodes enforcement, binding vs constraint semantics

🤖 Generated with Claude Code

@AlexCheema AlexCheema marked this pull request as ready for review February 16, 2026 13:36
@AlexCheema AlexCheema enabled auto-merge (squash) February 16, 2026 13:39
@AlexCheema
Contributor Author

Full Summary of PR #1447 — Add MetaInstance Declarative Layer

Diff: 31 files changed, +3,241 / -238 lines


What is MetaInstance?

A declarative primitive: "ensure an instance matching these parameters always exists." If a node disconnects or connections break, the reconciler automatically re-creates the backing instance. Previously, users managed instances directly and dead instances stayed dead.


Core Changes (Python — 20 files)

New types and models:

  • src/exo/shared/types/meta_instance.py: MetaInstance frozen Pydantic model with model_id, sharding, instance_meta, min_nodes, optional node_ids, failure tracking (consecutive_failures, last_failure_error, placement_error)
  • src/exo/shared/types/common.py — Added MetaInstanceId type
  • src/exo/shared/types/worker/instances.py — Added optional meta_instance_id field on BaseInstance, binding instances to their parent MetaInstance

Events and commands:

  • src/exo/shared/types/events.py — New events: MetaInstanceCreated, MetaInstanceDeleted, MetaInstancePlacementFailed, InstanceRetrying
  • src/exo/shared/types/commands.py — New commands: CreateMetaInstance, DeleteMetaInstance
  • src/exo/shared/types/api.py — New API request/response types for meta-instance endpoints
  • src/exo/shared/types/state.py — Added meta_instances dict to State

Event sourcing:

  • src/exo/shared/apply.py — New apply functions for all meta-instance events; added explanatory comment on apply_instance_retrying documenting why it returns early for missing instances (avoids double-counting failures)
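A minimal sketch of that early-return pattern with the documenting comment; the dict-based state and event shapes are assumptions, not exo's real State or event types.

```python
def apply_instance_retrying(state: dict, event: dict) -> dict:
    """Event-application sketch for InstanceRetrying.

    Intentionally a no-op when the instance is gone: InstanceDeleted is the
    single place the MetaInstance failure counter is incremented, so counting
    here as well would double-count one failure.
    """
    instance = state["instances"].get(event["instance_id"])
    if instance is None:
        return state  # see note above: deliberate, not a dropped event
    instance["retry_count"] = event["attempt"]
    return state
```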

Reconciliation:

  • src/exo/master/reconcile.py: find_unsatisfied_meta_instances() checks which MetaInstances lack a healthy backing instance; try_place_for_meta_instance() creates one using existing placement logic
  • src/exo/master/process_managers/meta_instance.py: MetaInstanceReconciler runs in the master loop with 10s timeout and error handling for ModelCard.load(), emitting MetaInstancePlacementFailed events with dedup
  • src/exo/master/process_managers/instance_health.py: InstanceHealthReconciler (extracted from inline code); MAX_INSTANCE_RETRIES = 3 retry logic for failed instances
  • src/exo/master/process_managers/node_timeout.py: NodeTimeoutReconciler (extracted)
  • src/exo/master/process_managers/__init__.py — Package init

Master and worker:

  • src/exo/master/main.py — Reconcile loop runs 3 process managers every 1s; handles CreateMetaInstance/DeleteMetaInstance commands; cascade-delete removes backing instances
  • src/exo/master/api.py: /create_meta_instance and /delete_meta_instance API endpoints
  • src/exo/worker/plan.py: _create_runner now takes an all_runners param to check for terminal peer runners before creating (prevents races during retry)

Other:

  • src/exo/master/placement.py, src/exo/master/placement_utils.py — Minor refactors for reuse by meta-instance placement
  • src/exo/main.py — Wires up new components
  • src/exo/download/coordinator.py — Minor adjustment
  • pyproject.toml — Test config update

Dashboard Changes (Svelte — 4 files)

  • dashboard/src/routes/+page.svelte
    • Shows MetaInstances in sidebar with backing instance status (placing, healthy, error states)
    • Bug fix: $derived(fn) → $derived.by(fn) for unifiedDisplayItems (was returning the function itself, not its result)
    • Bug fix: Removed tautological lastError ? ... : null check (always truthy when failures > 0)
    • getMetaInstancePlacingStatus() helper for UI state derivation
  • dashboard/src/lib/stores/app.svelte.ts: MetaInstanceData interface, metaInstances reactive store
  • dashboard/src/lib/components/ChatSidebar.svelte — Removed MlxIbvInstance references (consolidated to MlxJacclInstance)
  • dashboard/src/lib/components/ModelCard.svelte — Same MlxIbv cleanup

Tests (3 files, ~28 new tests)

  • src/exo/master/tests/test_reconcile.py — 24 new tests: constraint matching, connection health (single/multi-node, edge removal, IP changes), unsatisfied detection, exclusive binding, idempotency
  • src/exo/master/tests/test_meta_instance_edge_cases.py — 28 tests: retry/failure flows, cascade delete, placement error dedup, ModelCard.load timeout/error handling
  • src/exo/master/tests/test_placement_utils.py — Placement utility tests

Design Decisions

  1. Binding via meta_instance_id on Instance — No separate binding event or backing map. The instance carries its parent MetaInstance ID directly, eliminating two classes of race conditions (stale exclusion sets, delete ordering races).
  2. Process manager extraction — Instance health, node timeout, and meta-instance reconciliation are separate @final classes with a reconcile(state) -> Sequence[Event] interface.
  3. ModelCard.load timeout — 10s anyio.fail_after prevents a slow/failing model card load from blocking the entire reconcile loop. Errors emit MetaInstancePlacementFailed with dedup against current state.
  4. Retry strategy — Failed instances retry up to 3 times (MAX_INSTANCE_RETRIES), then get deleted. consecutive_failures and last_failure_error are always set together in apply.

Merge Conflict Resolution

The merge with origin/main resolved conflicts between this feature and main's new task cancellation feature:

  • commands.py — Both CreateMetaInstance/DeleteMetaInstance and TaskCancelled
  • plan.py — Added _cancel_tasks from main, kept all_runners param from this branch
  • runner_supervisor.py — Merged _cancel_sender/cancelled fields with JACCL pipe relay fields
  • reconcile.py — Pass empty tasks dict to get_transition_events (new param from main)

@AlexCheema
Contributor Author

Latest improvements (commit 3df896d)

Bug fix: cancel active tasks on meta-instance cascade delete

DeleteMetaInstance now emits TaskStatusUpdated(Cancelled) for any Pending/Running tasks on backing instances before emitting InstanceDeleted. Previously, cascade-deleting backing instances left orphaned task references in state — matching the pattern already used by get_transition_events() in placement.py.

Files: src/exo/master/main.py

Lifecycle logging

Added structured logging for meta-instance operations to aid debugging:

  • main.py — logs CreateMetaInstance (model, min_nodes, sharding) and DeleteMetaInstance (with cascade instance count)
  • process_managers/meta_instance.py — logs successful placement and placement failures
  • process_managers/instance_health.py — logs retry attempts with (attempt N/3) and when retry limit is exceeded

GET /meta_instances API endpoint

Added GET /meta_instances to list all meta-instances directly, without needing to fetch full cluster state.

File: src/exo/master/api.py

Regression tests

2 new tests in test_meta_instance_edge_cases.py (30 total):

  • test_cascade_delete_cancels_active_tasks — verifies the full event sequence (MetaInstanceDeleted → TaskStatusUpdated(Cancelled) → InstanceDeleted) correctly updates state
  • test_cascade_delete_skips_completed_tasks — verifies only Pending/Running tasks are targeted, not completed ones
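The cascade-delete ordering those tests assert can be sketched as a pure event generator; the event dict shapes here are illustrative, not exo's real event types.

```python
def cascade_delete_events(meta_id: str,
                          backing: dict[str, list[tuple[str, str]]]) -> list[dict]:
    """backing maps instance_id -> [(task_id, status)] for that instance's tasks.

    Ordering: MetaInstanceDeleted, then TaskStatusUpdated(Cancelled) for each
    active task, then InstanceDeleted, so no task reference is ever orphaned.
    """
    events: list[dict] = [{"type": "MetaInstanceDeleted",
                           "meta_instance_id": meta_id}]
    for instance_id, tasks in backing.items():
        for task_id, status in tasks:
            if status in ("Pending", "Running"):  # completed tasks are skipped
                events.append({"type": "TaskStatusUpdated",
                               "task_id": task_id, "status": "Cancelled"})
        events.append({"type": "InstanceDeleted", "instance_id": instance_id})
    return events
```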

Pre-commit checks

All passing: basedpyright 0 errors, ruff clean, nix fmt clean, 261 tests passed (1 skipped, 97 deselected slow).

🤖 Generated with Claude Code

Contributor Author

@AlexCheema AlexCheema left a comment


Code Review: PR #1447 — MetaInstance Declarative Layer

Overall Assessment

Well-architected addition that introduces a declarative MetaInstance abstraction over the existing imperative instance management. The reconciliation pattern is clean, the retry logic is sound, and the test coverage is thorough (778 lines of edge-case tests). This is a significant architectural improvement.

Strengths

  1. ProcessManager Protocol (process_managers/__init__.py): Clean, composable interface. @runtime_checkable is a nice touch for validation. The three reconcilers (InstanceHealth, NodeTimeout, MetaInstance) have clear single responsibilities.

  2. _apply_and_broadcast eliminates the loopback processor hack. Centralizing event indexing, state mutation, persistence, and broadcast in one method is a major improvement. The docstring correctly notes Python's cooperative scheduling guarantees.

  3. instance_runners_failed (reconcile.py): The logic is exactly right — requires ALL runners to be terminal AND at least one RunnerFailed. Correctly returns (False, None) when runners haven't reported yet (still starting) or when all are gracefully shut down. The node-identity-aware error messages are a nice UX touch.

  4. find_unsatisfied_meta_instances correctly checks both meta_instance_id binding AND instance_connections_healthy, not just one or the other. A MetaInstance with a backing instance that has a broken connection will be detected.

  5. Cascade delete (_command_processor, DeleteMetaInstance case): Correctly cancels active tasks before deleting the backing instance. The TaskStatus.Cancelled events are emitted before InstanceDeleted, ensuring proper cleanup ordering.

  6. Comprehensive edge-case tests (test_meta_instance_edge_cases.py): Tests cover frozen model validation, create/delete roundtrips, nonexistent-ID safety, duplicate MetaInstances, retry counter resets, full retry cycles, and more.

Issues

  1. try_place_for_meta_instance doesn't intersect node_ids with live topology (reconcile.py ~line 227): If a meta-instance pins node_ids=["node-a", "node-b"] and node-b goes down, placement will fail because node-b isn't in the topology. The placement could be made more resilient by intersecting meta_instance.node_ids with topology.list_nodes() and only requiring alive nodes. (Note: the other branch — #1484 — appears to have this fix with alive = set(meta_instance.node_ids) & live_nodes.)

  2. Reconcile loop at 1-second interval (master/main.py): The previous _plan loop ran every 10 seconds. Now _reconcile runs every 1 second, and MetaInstanceReconciler calls ModelCard.load() for every unsatisfied meta-instance on each cycle. If model card loading hits the network (HuggingFace API), this could generate excessive requests when placement persistently fails. Consider:

    • Caching ModelCard.load() results
    • Adding a timeout to ModelCard.load() (prevent one slow model lookup from blocking the entire reconcile loop)
    • Exponential backoff for meta-instances with placement_error set
  3. Placement strategy change (placement.py, placement_utils.py): get_smallest_cycles → get_largest_cycles changes the default from "minimize nodes used" to "maximize nodes used". This is a significant behavior change — users who previously got 2-node placements will now get 4-node placements. Should be documented and justified (performance? memory pressure?).

  4. _reconcile manager ordering matters (master/main.py): Managers run sequentially with state updated between each. InstanceHealthReconciler runs before MetaInstanceReconciler, which is correct (delete broken instances before trying to re-place). But this ordering dependency is implicit. Consider adding a comment.

  5. Potential double-place race in CreateMetaInstance handler (master/main.py): The command handler does an immediate placement attempt after _apply_and_broadcast(MetaInstanceCreated). It re-checks find_unsatisfied_meta_instances to avoid racing with the reconciler, which is good. However, between the await ModelCard.load() and the re-check, the reconciler could also be placing — the re-check mitigates but doesn't fully eliminate the race since the reconciler could be mid-placement (after find_unsatisfied but before emitting InstanceCreated).
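A minimal sketch of the exponential backoff suggested in issue 2, keyed off the consecutive_failures counter the MetaInstance already tracks; the base and cap values are arbitrary choices, not anything from the PR.

```python
def backoff_delay(consecutive_failures: int,
                  base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff: 1s, 2s, 4s, ... capped at 60s."""
    return min(cap, base * (2 ** consecutive_failures))

def should_attempt(now: float, last_attempt: float,
                   consecutive_failures: int) -> bool:
    # The reconciler would skip placement for this meta-instance until the
    # backoff window has elapsed, instead of retrying on every 1s tick.
    return now - last_attempt >= backoff_delay(consecutive_failures)
```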

Minor

  1. PlacementResult as NamedTuple (reconcile.py): Rest of the codebase uses frozen Pydantic models or dataclasses. Minor inconsistency.

  2. Magic number: MAX_INSTANCE_RETRIES = 3 in instance_health.py — consider making this configurable or at least documenting why 3.

Verdict

Strong architectural improvement. The declarative layer, reconciliation pattern, and retry logic are well-designed. The main concerns are the placement strategy change (largest vs smallest), the 1-second reconcile interval with potentially slow ModelCard.load(), and the node_ids/topology intersection issue. Would approve after addressing issue 1 (topology intersection) and issue 3 (placement strategy documentation).

🤖 Generated with Claude Code

@AlexCheema AlexCheema force-pushed the alexcheema/meta-instance branch from 3df896d to 34e846b Compare February 17, 2026 17:06
AlexCheema and others added 20 commits February 17, 2026 09:32
Introduces MetaInstance as a declarative constraint ensuring an instance
matching given parameters (model, sharding, min_nodes) always exists.
The master's reconciliation loop continuously checks for unsatisfied
meta-instances and attempts placement. Connection health checking
verifies that specific IPs (MlxRing) and RDMA interfaces (MlxJaccl)
stored on instances still exist as topology edges, enabling automatic
recovery when cables are swapped or interfaces change.

Also eliminates the master's loopback event path, unifying all event
emission through _apply_and_broadcast for simpler control flow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add MetaInstanceBound event and meta_instance_backing State field
  for explicit MetaInstance → Instance binding (prevents ambiguous
  linking when two MetaInstances have identical constraints)
- Replace model_card: ModelCard with model_id: ModelId on MetaInstance
  (load ModelCard on-demand at placement time)
- Add MetaInstance API endpoints (POST /meta_instance, DELETE)
- Update dashboard to use MetaInstances as primary primitive with
  unified display items merging MetaInstances and orphan instances
- Dashboard launches via MetaInstance instead of direct Instance creation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The separate MetaInstanceBound event + meta_instance_backing map
introduced two bugs: stale exclusion sets in the reconciler loop and
a delete ordering race. Embedding meta_instance_id directly on
BaseInstance eliminates the binding mechanism entirely — when an
instance is created for a MetaInstance it carries the ID, when
deleted the binding is gone. No separate map, no cleanup, no races.

Also fixes delete_meta_instance to cascade-delete backing instances.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace inline _plan() steps with a list of ProcessManagers, each
implementing async reconcile(State) -> Sequence[Event]. Tick every
1s instead of 10s — safe because all PMs are idempotent against state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace inline _plan() with ProcessManager loop (_reconcile), tick
  every 1s instead of 10s — safe because all PMs are idempotent
- Fix dashboard sending "MlxIbv" instead of "MlxJaccl" for RDMA
  instance type, which silently fell back to MlxRing default
- Remove all stale MlxIbv references from dashboard

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a MetaInstance has no backing instance yet, derive the strategy
display from the MetaInstance's own sharding and instanceMeta fields
rather than showing "Unknown (Unknown)".

Also clean up all stale MlxIbv references across the dashboard —
the backend enum is MlxJaccl.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Show why MetaInstance placement fails instead of stuck "PLACING", and
show per-node runner status during loading for multi-node instances.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The mode="plain" validator bypassed Pydantic's string-to-enum coercion,
so JSON strings like "Tensor" and "MlxJaccl" from the dashboard failed
the isinstance check and silently fell back to Pipeline/MlxRing defaults.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Dashboard was not including the user's node filter in the POST to
/meta_instance, so placement ignored which nodes the user selected.
Also, placement silently fell back to Ring when RDMA was requested but
no RDMA-connected cycles were available — now raises an error that
surfaces via MetaInstancePlacementFailed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When user selects specific nodes via the filter, min_nodes should be at
least the number of filtered nodes to prevent placement from picking a
smaller cycle.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
RDMA requires at least 2 nodes — a single-node RDMA instance is
nonsensical. Enforce this in both the dashboard (when building the
launch request) and the backend placement (when filtering cycles).
Previously, selecting RDMA would still place on 1 node because
min_nodes defaulted to 1 and the placement silently switched to Ring.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The dashboard now extracts node IDs from the selected preview's
memory_delta_by_node, ensuring the backend places on exactly the
nodes the user was shown. Also reverts incorrect RDMA min_nodes >= 2
enforcement since single-node RDMA is valid.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
frozenset serializes to a JSON array but cannot be deserialized back
in strict mode through the TaggedModel wrap validator (list → frozenset
coercion is rejected). Changed to list[NodeId] since the model is
already frozen/immutable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Extend InstanceDeleted with failure_error field for runner crash info
- Add InstanceFailureInfo model tracking consecutive failures per MetaInstance
- InstanceHealthReconciler now detects runner failures (all terminal with
  at least one RunnerFailed) in addition to connection failures
- apply_instance_deleted increments failure counter for meta-bound instances
- Dashboard shows RETRYING (N/3) status with error messages, and
  "Instance re-created due to failure" after 3 consecutive failures
- Extract and display RunnerFailed error messages in instance status

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ashboard

MetaInstanceReconciler now checks failure count before placement — after 3
consecutive failures it emits MetaInstancePlacementFailed instead of retrying
forever. Dashboard shows "Retrying after error: <msg>" in orange throughout
the retry cycle, not just during the brief window with no backing instance.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When multiple runners fail, concatenate all error messages with "; " so the
real error isn't hidden by generic side-effect failures from other runners.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The dashboard % 3 logic already handles displaying retry progress in batches
(RETRYING 1/3, 2/3, 3/3, then PLACING with error, repeat). No need to
permanently block placement after 3 failures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Each error in the combined message is now prefixed with the node's friendly
name (e.g. "MacBook Pro: OOM; Mac Studio: connection reset") so the root
cause node is easily identifiable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
AlexCheema and others added 20 commits February 17, 2026 09:32
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…er wording

- apply_instance_created no longer clears last_failure_error so the
  error context persists while the new instance starts up
- Dashboard retryError shows the error without (N/3) prefix when
  consecutiveFailures is 0 (instance was recreated)
- Jaccl warning tooltip now says "experimental RDMA driver in macOS"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace fragile TCP SideChannel with anonymous pipes relayed through
exo's event-sourced control plane. RunnerSupervisor creates pipe pairs
for MlxJaccl instances, relays all_gather rounds via JacclSideChannelData/
JacclSideChannelGathered events through the master, eliminating errno=57
crashes from Thunderbolt RDMA driver instability.

Also includes dashboard RDMA warning improvements and instance retry fixes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Anonymous pipes from os.pipe() don't survive multiprocessing.Process
spawn on macOS (default since Python 3.8). The FD numbers are passed
but the actual file descriptors don't exist in the child process,
causing EBADF errors.

Switch to named pipes (FIFOs) which the child opens by path in the
spawned process, getting valid FDs for the C++ SideChannel.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TaggedModel's wrap validator converts JSON→Python validation context,
which breaks strict-mode bytes deserialization from JSON strings.
Use Base64Bytes type to encode/decode bytes as base64 strings in JSON.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cover retry logic, error handling, backward compatibility,
concurrent scenarios, placement error tracking, and serialization.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two race conditions existed in the meta-instance lifecycle:

1. CreateMetaInstance buffered MetaInstanceCreated but didn't apply it
   before awaiting ModelCard.load(). The reconciler could interleave
   during the await, leading to duplicate placements.

   Fix: apply MetaInstanceCreated eagerly via _apply_and_broadcast,
   then re-check satisfaction after the await so placement uses fresh
   state and skips if the reconciler already handled it.

2. delete_meta_instance (API handler) sent DeleteMetaInstance then
   read self.state.instances for cascade deletion. State was stale,
   so backing instances created between the send and the read were
   missed — permanently orphaning them.

   Fix: move cascade delete into the command processor's
   DeleteMetaInstance handler, where InstanceDeleted events are
   generated atomically with MetaInstanceDeleted.

Reproduced on 4-node Mac Mini cluster: 28K anomalies in stress test
including 21 permanently orphaned instances. After fix, the cascade
delete and placement are race-free.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Prevents RuntimeError when the context has already been set,
e.g. when Terminal.app reuses a tab or the process restarts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…dels

On startup, _emit_existing_download_progress() used
downloaded_bytes_this_session to decide between DownloadPending and
DownloadOngoing. Since downloaded_bytes_this_session is always 0 on
startup (it tracks the current session only), fully-downloaded models
were incorrectly reported as DownloadPending.

Now checks actual disk state: if downloaded_bytes >= total_bytes, emit
DownloadCompleted regardless of session bytes. This fixes the UI showing
models as pending when they're already available.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
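The decision rule reduces to a disk-state-first check. A sketch with hypothetical field names (`downloaded_bytes`, `total_bytes`, `session_bytes`), not the actual `_emit_existing_download_progress()` signature:

```python
# Illustrative only: disk state wins over session counters, so a fully
# present model is Completed even if this session downloaded nothing.
def classify(downloaded_bytes: int, total_bytes: int, session_bytes: int) -> str:
    if total_bytes > 0 and downloaded_bytes >= total_bytes:
        return "DownloadCompleted"
    return "DownloadOngoing" if session_bytes > 0 else "DownloadPending"

assert classify(100, 100, 0) == "DownloadCompleted"  # was wrongly Pending before
assert classify(50, 100, 0) == "DownloadPending"
assert classify(50, 100, 10) == "DownloadOngoing"
```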
- Add TaskCancelled command and Cancelled task status
- Detect API client disconnects in master/api.py
- Handle TaskCancelled in master state machine
- Add _cancel_tasks to worker for graceful task cleanup
- Add cancel_receiver to runner for inference abort
- Add mx_any helper in MLX utils for cancellable operations
- Guard instance lookup in worker to prevent KeyError
- Update tests for cancellation flow

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…001d)

The placement algorithm previously selected the smallest viable cycle,
causing large models to be distributed across too few nodes and running
out of memory. Changed get_smallest_cycles to get_largest_cycles so that
all healthy nodes are utilized, spreading layers more evenly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ures (BUG-001c)

The place_instance API endpoint used fire-and-forget: it sent the command
and returned HTTP 200 immediately. On a fresh cluster start, the master's
state often lacks topology/memory data, so placement raises ValueError
which was silently caught and logged. The caller never learned it failed.

Two fixes:
- API: validate placement locally before sending, return HTTP 400 on
  failure instead of silently accepting an unprocessable command
- Master: emit MetaInstancePlacementFailed on immediate placement error
  in CreateMetaInstance handler so the error surfaces in state right away

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
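The API-side fix boils down to running the same validation the master would, before the command is ever enqueued. A hedged sketch with illustrative names (`validate_placement`, the dict-shaped state) rather than the real endpoint:

```python
# Hypothetical sketch of "validate locally before sending": only commands
# that can succeed are enqueued; everything else returns HTTP 400.
def validate_placement(cmd: dict, state: dict) -> None:
    if not state.get("topology"):
        raise ValueError("no topology data yet")

def place_instance(cmd: dict, state: dict, send) -> tuple[int, str]:
    try:
        validate_placement(cmd, state)   # same checks the master would run
    except ValueError as e:
        return 400, str(e)               # surface the failure to the caller
    send(cmd)
    return 200, "ok"

sent: list[dict] = []
# Fresh cluster: no topology yet, so the caller sees the failure.
assert place_instance({"model": "m"}, {}, sent.append)[0] == 400
assert place_instance({"model": "m"}, {"topology": ["a"]}, sent.append) == (200, "ok")
assert sent == [{"model": "m"}]
```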
…eReconciler

ModelCard.load() does async I/O inside the 1-second reconcile loop. A slow
or failing load blocked all reconciliation (health checks, node timeouts,
other meta-instances). Adds a 10-second timeout, per-meta-instance error
handling with MetaInstancePlacementFailed events, and documents the
intentional early return in apply_instance_retrying.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
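The timeout pattern is standard `asyncio.wait_for`. A sketch under assumed names (`load_model_card`, `reconcile_one`); the commit uses a 10-second budget, shortened here so the example runs instantly:

```python
# Illustrative only: bound a slow async load inside a periodic loop so one
# hung ModelCard.load() cannot stall all other reconciliation work.
import asyncio

async def load_model_card(model_id: str):
    await asyncio.sleep(100)   # simulate a hung load
    return model_id

async def reconcile_one(model_id: str):
    try:
        # Real code would use timeout=10; 0.01 keeps this sketch fast.
        return await asyncio.wait_for(load_model_card(model_id), timeout=0.01)
    except asyncio.TimeoutError:
        # Here the real loop records a MetaInstancePlacementFailed event
        # and moves on to the next meta-instance.
        return None

assert asyncio.run(reconcile_one("m")) is None
```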
…instance UI

1. DERIVED REACTIVITY BUG: `unifiedDisplayItems` used `$derived(fn)` which
   made the derived value the function itself instead of its result. Svelte
   never tracked reactive dependencies in the function body, so the instance
   list didn't update when metaInstances or instances changed. Fixed by using
   `$derived.by(fn)` and removing the `()` call-sites in the template.

2. TAUTOLOGICAL CHECK: In `getMetaInstancePlacingStatus`, the `lastError ? ...
   : null` guard inside the `failures > 0` branch was always true because
   `lastFailureError` and `consecutiveFailures` are always set together in
   `apply_instance_retrying` and `apply_instance_deleted`. Removed the dead
   `: null` branch.

Also fixes pyright errors in the test file by using the proper pytest.MonkeyPatch type.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previously, DeleteMetaInstance cascade-deleted backing instances without
cancelling their active tasks, leaving orphaned task references. Now emits
TaskStatusUpdated(Cancelled) for Pending/Running tasks before InstanceDeleted.

Also adds lifecycle logging for meta-instance operations, a GET /meta_instances
endpoint, and 2 regression tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
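The cascade-delete ordering can be sketched as pure event emission. Event and status names follow the commit text; the dict-shaped data model and `cascade_delete` helper are hypothetical:

```python
# Illustrative only: cancel Pending/Running tasks on backing instances
# first, then delete the instances, then the meta-instance itself, so no
# orphaned task references survive in state.
def cascade_delete(meta_id: str, instances: dict, tasks: dict) -> list[tuple]:
    events: list[tuple] = []
    backing = [iid for iid, i in instances.items()
               if i["meta_instance_id"] == meta_id]
    for tid, t in tasks.items():
        if t["instance_id"] in backing and t["status"] in ("Pending", "Running"):
            events.append(("TaskStatusUpdated", tid, "Cancelled"))
    for iid in backing:
        events.append(("InstanceDeleted", iid))
    events.append(("MetaInstanceDeleted", meta_id))
    return events

evs = cascade_delete(
    "mi-1",
    {"i1": {"meta_instance_id": "mi-1"}},
    {"t1": {"instance_id": "i1", "status": "Running"}},
)
assert evs == [
    ("TaskStatusUpdated", "t1", "Cancelled"),
    ("InstanceDeleted", "i1"),
    ("MetaInstanceDeleted", "mi-1"),
]
```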
- Add TaskStatusUpdated to imports in master/main.py (used in cascade
  delete but missing after rebase conflict resolution)
- Mock mx.distributed.all_gather in test_event_ordering.py so MockGroup
  works with the cancel-checking code path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…mbda

- Add `tasks` parameter to `try_place_for_meta_instance()` in reconcile.py
  and thread `state.tasks` through from both call sites (main.py, meta_instance.py)
- Replace untyped lambda with typed helper function in test_event_ordering.py
  to satisfy basedpyright strict mode

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add missing case branch for TaskCancelled to satisfy basedpyright
exhaustive match check. Deletes the associated task and cleans up
the command_task_mapping, mirroring TaskFinished behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@AlexCheema AlexCheema force-pushed the alexcheema/meta-instance branch from fb6a084 to 11424f6 on February 17, 2026 17:32
@AlexCheema AlexCheema merged commit a962a28 into main Feb 17, 2026
6 checks passed
@AlexCheema AlexCheema deleted the alexcheema/meta-instance branch February 17, 2026 17:48
AlexCheema added a commit that referenced this pull request Feb 17, 2026
The merge resolution kept pre-#1447 code that assigned to `instanceData`
(a Svelte 5 $derived constant) and used the old /instance endpoint.
Switch both launchInstance and onboardingLaunchModel to POST /meta_instance.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Evanev7 added a commit that referenced this pull request Feb 17, 2026
Evanev7 added a commit that referenced this pull request Feb 17, 2026
AlexCheema added a commit that referenced this pull request Feb 20, 2026
AlexCheema added a commit that referenced this pull request Feb 21, 2026