Status: Experimental proof-of-concept implementation
This document describes both the vision for PODMS and the current experimental implementation. Many described features are proofs-of-concept or partially implemented. This is research-grade software, not production-ready infrastructure.
Maturity Level:
- Core types/telemetry: 🟡 Alpha
- Policy compiler: 🟡 Alpha
- Metro-sync replication: 🟠 Experimental (TCP POC, RDMA mocked)
- Gossip protocol: 🟠 Experimental
- Full mesh federation: 🔴 Planned (minimal implementation)
For actual feature status, see the main README Feature Status Table.
Documentation Status: Mix of implemented features and aspirational architecture - 2025-11-08
PODMS (Policy-Orchestrated Disaggregated Mesh Scaling) is SPACE's experimental distributed scaling architecture exploring autonomous, policy-driven replication and migration across disaggregated storage nodes. Unlike traditional cluster architectures or monolithic scale-out systems, PODMS aims to treat each capsule as an independent, swarm-ready unit with embedded policy intelligence.
Current Reality: Basic infrastructure exists (types, telemetry, policy compiler), but distributed features are early-stage proofs-of-concept requiring extensive development and testing before production use.
Monolithic Clustering:
- Tight coupling between nodes
- Forklift upgrades required
- Blast radius on failures
- Manual rebalancing
Modular Scale-Out:
- Independent services, but...
- Still requires centralized orchestration
- Policy enforcement at API gateway
- Human-in-loop for placement
Each capsule is:
- Self-describing via embedded Policy
- Swarm-aware via telemetry signals
- Autonomously placeable by agent swarms
- Zero-trust secured end-to-end
Traditional: API → Controller → Scheduler → Worker Nodes
PODMS: Capsule → Telemetry → Agent Swarm → Autonomous Action
Every capsule carries its placement/replication contract:
pub struct Policy {
// Traditional fields...
compression: CompressionPolicy,
encryption: EncryptionPolicy,
// PODMS fields (feature-gated)
rpo: Duration, // Recovery Point Objective
latency_target: Duration, // Max acceptable latency
sovereignty: SovereigntyLevel, // Data placement scope
}
RPO Examples:
- `Duration::ZERO` → Synchronous metro-sync
- `Duration::from_secs(60)` → 1-minute async
- `Duration::from_secs(3600)` → Hourly snapshots
Latency Targets:
- `2ms` → Metro zone (same AZ)
- `10ms` → Regional (same geo)
- `100ms` → Global (cross-continent)
Sovereignty Levels:
- `Local` → Never leaves node (air-gapped, edge)
- `Zone` → Within defined zones (metro-sync)
- `Global` → Full federation (geo-replicated)
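To make the three knobs concrete, here is a minimal sketch of how they could combine into a preset like the `Policy::metro_sync()` used later in this document. The struct and enum here are simplified stand-ins, not the real `common::Policy` API.

```rust
use std::time::Duration;

// Simplified stand-ins for the real Policy fields (illustrative only).
#[derive(Debug, PartialEq)]
enum SovereigntyLevel { Local, Zone, Global }

struct Policy {
    rpo: Duration,            // Recovery Point Objective
    latency_target: Duration, // Max acceptable latency
    sovereignty: SovereigntyLevel,
}

impl Policy {
    // Metro-sync preset: RPO=0, 2ms latency target, zone-scoped placement.
    fn metro_sync() -> Self {
        Policy {
            rpo: Duration::ZERO,
            latency_target: Duration::from_millis(2),
            sovereignty: SovereigntyLevel::Zone,
        }
    }
}

fn main() {
    let p = Policy::metro_sync();
    assert_eq!(p.rpo, Duration::ZERO);                 // synchronous metro-sync
    assert_eq!(p.latency_target.as_millis(), 2);       // metro-zone target
    assert_eq!(p.sovereignty, SovereigntyLevel::Zone); // zone-bound placement
    println!("metro_sync preset ok");
}
```

A geo-replicated preset would differ only in the three field values (non-zero RPO, looser latency, `Global` sovereignty).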
Agents subscribe to telemetry channels for real-time signals:
pub enum Telemetry {
NewCapsule { id, policy, node_id }, // Triggers replication
HeatSpike { id, accesses_per_min }, // Triggers migration
CapacityThreshold { node_id, used_pct }, // Triggers rebalancing
NodeDegraded { node_id, reason }, // Triggers evacuation
ForcePolicyExecution { // Forces async RPO to run now
capsule_id: CapsuleId,
forced_rpo: Option<Duration>, // Override per-call RPO
},
}
Event Flow:
Write Pipeline → Emit Telemetry → Bounded Channel → Agent Swarm → Autonomous Action
Nodes are loosely coupled, zone-aware:
Metro Zone (us-west-1a):
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Node A │────▶│ Node B │────▶│ Node C │
└─────────┘ └─────────┘ └─────────┘
│ │ │
└───────────────┴───────────────┘
Telemetry Mesh
Geo Zone (eu-central):
┌─────────┐ ┌─────────┐
│ Node D │────▶│ Node E │
└─────────┘ └─────────┘
│ │
└───────────────┘ Async Replication
│
▼
Node A (cross-geo)
PODMS is opt-in via feature flags:
- Single-node mode: No overhead, no dependencies
- PODMS mode: Telemetry enabled, agents subscribe
- Mixed environments: Some nodes single, some distributed
// Mesh Identity
pub struct NodeId(Uuid);
// Zone Classification
pub enum ZoneId {
Metro { name: String }, // "us-west-1a"
Geo { name: String }, // "eu-central"
Edge { name: String }, // "air-gapped-site-42"
}
// Sovereignty Control
pub enum SovereigntyLevel {
Local, // No external replication
Zone, // Within zone only
Global, // Full federation
}
WritePipeline gains an optional telemetry channel:
pub struct WritePipeline {
// Existing fields...
registry: CapsuleRegistry,
nvram: NvramLog,
// PODMS addition (feature-gated)
#[cfg(all(feature = "podms", feature = "pipeline_async"))]
telemetry_tx: Option<UnboundedSender<Telemetry>>,
}
Usage:
let (tx, rx) = mpsc::unbounded_channel();
let pipeline = WritePipeline::new(registry, nvram)
.with_telemetry_channel(tx);
// Agent subscribes to rx
tokio::spawn(async move {
while let Some(event) = rx.recv().await {
match event {
Telemetry::NewCapsule { id, policy, .. } => {
// Trigger replication based on policy.rpo
}
_ => {}
}
}
});
Goal: Enable distributed awareness without disrupting single-node operations.
Deliverables:
- ✅ PODMS types (NodeId, ZoneId, SovereigntyLevel, Telemetry)
- ✅ Policy extensions (rpo, latency_target, sovereignty)
- ✅ Telemetry channel infrastructure
- ✅ Async event emission in write pipeline
- ✅ Feature flags (`podms` requires `pipeline_async`)
- ✅ Unit + integration tests
- ✅ Documentation
Zero Regression:
- Single-node builds: No changes, no overhead
- PODMS builds: Telemetry hooks present but dormant until channel set
- Test coverage: 90%+ for new code
Status: ✅ Complete - 2025-11-09
Goal: Implement core metro-sync replication with mesh networking and autonomous agents.
Deliverables:
- ✅ `scaling` crate with mesh networking (gossip discovery via memberlist)
- ✅ RDMA mock transport for zero-copy segment mirroring (the TCP path now runs through the unified DataMotion engine for the full replication flow)
- ✅ `MeshNode` with peer discovery and segment mirroring
- ✅ `ScalingAgent` consuming telemetry and triggering autonomous actions
- ✅ `WritePipeline` extension for metro-sync replication on RPO=0 policies
- ✅ Hash-based dedup preservation during replication
- ✅ Unit tests for mesh discovery and mirroring
- ✅ Integration tests for multi-node replication scenarios
- ✅ Documentation updates (README, podms.md)
Timeline: Completed in 1 day (single developer with comprehensive spec)
1. Basic Metro-Sync Setup
use capsule_registry::pipeline::WritePipeline;
use capsule_registry::runtime::RuntimeHandles;
use common::Policy;
use scaling::MeshNode;
use common::podms::ZoneId;
use std::sync::Arc;
use tokio::sync::mpsc;
#[tokio::main]
async fn main() -> anyhow::Result<()> {
// Create mesh node in a zone
let zone = ZoneId::Metro { name: "us-west-1a".into() };
let listen_addr = "127.0.0.1:8000".parse().unwrap();
let mesh_node = Arc::new(MeshNode::new(zone, listen_addr).await?);
// Start mesh with seed nodes
let seeds = vec!["127.0.0.1:8001".parse().unwrap()];
mesh_node.start(seeds).await?;
// Create pipeline with mesh and telemetry
let runtime = RuntimeHandles::from_env()?;
let registry = (*runtime.registry).clone();
let nvram = runtime.nvram.read().await.clone();
let (tx, rx) = mpsc::unbounded_channel();
let pipeline = WritePipeline::new(registry, nvram)
.with_mesh_node(mesh_node.clone())
.with_telemetry_channel(tx);
// Spawn scaling agent
let agent = runtime.build_scaling_agent(mesh_node.clone(), Policy::metro_sync());
tokio::spawn(async move { agent.run(rx).await });
// Write with metro-sync policy (RPO=0)
let data = b"Important data requiring zero-RPO";
let capsule_id = pipeline
.write_capsule_with_policy_async(data, &Policy::metro_sync())
.await?;
// Segments automatically mirrored to peers!
println!("Capsule {} replicated", capsule_id.as_uuid());
Ok(())
}
2. Manual Peer Registration (For Testing)
// In production, peers discovered via gossip
// For testing, manually register peers:
let peer_id = NodeId::new();
let peer_addr = "127.0.0.1:8002".parse().unwrap();
mesh_node.register_peer(peer_id, peer_addr).await;
3. Testing Best Practices
When writing integration tests, ensure each test uses isolated state:
#[tokio::test]
async fn test_metro_sync_example() {
// Create unique temp directory per test to avoid state conflicts
let test_id = uuid::Uuid::new_v4();
let temp_dir = std::env::temp_dir().join(format!("podms_test_{}", test_id));
std::fs::create_dir_all(&temp_dir).unwrap();
// Use unique paths for registry and nvram
let registry_path = temp_dir.join("registry.metadata");
let registry = CapsuleRegistry::open(&registry_path).unwrap();
let nvram = NvramLog::open(&temp_dir.join("nvram.log")).unwrap();
// Use async API in tokio tests
let capsule_id = pipeline
.write_capsule_with_policy_async(data, &policy)
.await
.unwrap();
}
Important:
- Always use `CapsuleRegistry::open(&unique_path)` in tests, not `CapsuleRegistry::new()` (which uses a shared "space.db" file)
- Always use `NvramLog::open(&path)` with a unique path per test
- Use `write_capsule_with_policy_async().await` in async contexts (e.g., `#[tokio::test]`)
4. Telemetry Events
The scaling agent reacts to these events:
pub enum Telemetry {
// Triggers metro-sync if RPO=0
NewCapsule { id, policy, node_id },
// Triggers migration to cooler nodes
HeatSpike { id, accesses_per_min, node_id },
// Triggers rebalancing
CapacityThreshold { node_id, used_bytes, total_bytes, threshold_pct },
// Triggers evacuation
NodeDegraded { node_id, reason },
}
5. Testing Metro-Sync
# Run integration tests
cargo test --features podms podms_metro_sync
# Run with logs
RUST_LOG=info cargo test --features podms -- --nocapture
# Specific test
cargo test --features podms test_metro_sync_replication_with_mesh_node
Data Flow:
Write with RPO=0 Policy
↓
WritePipeline::write_capsule_with_policy_async()
↓
Local segments committed to NVRAM
↓
perform_metro_sync_replication()
↓
mesh_node.discover_peers() → Select 1-2 targets
↓
For each segment:
- Read from NVRAM
- Check content hash (dedup preservation)
- mesh_node.mirror_segment() via RDMA mock
↓
Telemetry event emitted → ScalingAgent
Mesh Node Components:
MeshNode {
id: NodeId, // Unique node identifier
zone: ZoneId, // Zone placement
capabilities: NodeCapabilities, // NVRAM, GPU, network tier
memberlist: Memberlist, // Gossip discovery
peers: HashMap<NodeId, Addr>, // Peer registry
listen_addr: SocketAddr, // TCP listener for mirrors
}
Transport Layer:
- POC (Step 2): TCP streams for segment mirroring
- Production (Future): RDMA verbs via `rdma-sys` for zero-copy
- Fallback: Always TCP-compatible for edge nodes
Measured Overhead (Step 2):
- Metro-sync latency: ~5-20ms per capsule (1-5 segments, local network)
- Throughput impact: <10% when replicating to 2 peers
- Memory: ~24 bytes per MeshNode, ~200 bytes per telemetry event
- CPU: Minimal (async I/O, no polling)
Optimization Targets (Future Steps):
- RDMA transport: <50µs added latency
- Batched replication: Amortize discovery overhead
- Parallel mirroring: Concurrent segment transfers
Common Issues:
- "Peer not found in registry"
  - Ensure `mesh_node.register_peer()` is called before mirroring
  - Or wait for gossip discovery to complete
- "Failed to connect to target"
  - Check that the peer's `listen_addr` is reachable
  - Verify firewall rules allow TCP on the mirror port
- "Metro-sync skipped: mesh node not configured"
  - Call `pipeline.with_mesh_node()` before writing
  - Or build without the `podms` feature for single-node mode
- "Segment not found: SegmentId(X)" in tests
  - Tests are sharing CapsuleRegistry state (using the default "space.db" file)
  - Solution: Use unique paths per test (see "Testing Best Practices" above); use `CapsuleRegistry::open(&unique_path)` instead of `CapsuleRegistry::new()`
- "Cannot start a runtime from within a runtime" in tests
  - Caused by calling `write_capsule_with_policy()` (sync wrapper) from `#[tokio::test]`
  - Solution: Use `write_capsule_with_policy_async().await` in async test contexts
Logging:
# Full PODMS trace
RUST_LOG=scaling=trace,capsule_registry::pipeline=trace cargo run --features podms
# Metro-sync only
RUST_LOG=scaling::mesh=debug cargo run --features podms
Goal: Autonomous orchestration via compiled policy rules, the "brain" of PODMS swarm intelligence.
Status: ✅ Complete - Policy compiler integrated with autonomous agents, enabling declarative-to-executable scaling.
The Policy Compiler (scaling/src/compiler.rs) translates telemetry events + policies into executable ScalingActions:
// PolicyCompiler processes telemetry → actions
let compiler = PolicyCompiler::with_defaults();
let mesh_state = build_mesh_state().await?;
let actions = compiler.compile_scaling_actions(&event, &policy, &mesh_state);
// Returns: Vec<ScalingAction> (Replicate, Migrate, Evacuate, Rebalance)
1. Replication Strategy (from policy.rpo):
- `RPO = 0` → `MetroSync { replica_count: policy.replica_count }` (synchronous; total copies incl. local)
- `RPO < 60s` → `AsyncWithBatching { rpo }` (batched async)
- `RPO >= 60s` → `None` (no immediate replication)
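The threshold logic above can be sketched as a simple match over `policy.rpo`. The enum below is an illustrative stand-in; the real `ReplicationStrategy` lives in scaling/src/compiler.rs and may differ in shape.

```rust
use std::time::Duration;

// Illustrative stand-in for the compiler's replication strategy.
#[derive(Debug, PartialEq)]
enum ReplicationStrategy {
    MetroSync { replica_count: u8 },
    AsyncWithBatching { rpo: Duration },
    None,
}

// Maps policy.rpo to a strategy using the thresholds described above.
fn choose_strategy(rpo: Duration, replica_count: u8) -> ReplicationStrategy {
    if rpo == Duration::ZERO {
        ReplicationStrategy::MetroSync { replica_count }
    } else if rpo < Duration::from_secs(60) {
        ReplicationStrategy::AsyncWithBatching { rpo }
    } else {
        ReplicationStrategy::None
    }
}

fn main() {
    assert_eq!(
        choose_strategy(Duration::ZERO, 2),
        ReplicationStrategy::MetroSync { replica_count: 2 }
    );
    assert_eq!(
        choose_strategy(Duration::from_secs(30), 2),
        ReplicationStrategy::AsyncWithBatching { rpo: Duration::from_secs(30) }
    );
    assert_eq!(choose_strategy(Duration::from_secs(3600), 2), ReplicationStrategy::None);
}
```

Note the boundary: exactly 60s falls into the "no immediate replication" branch, matching the `RPO >= 60s` rule.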
2. Migration Triggers (from policy.latency_target):
- Heat spike (>100 accesses/min) + latency_target <2ms → Migrate to low-latency zone
- Capacity threshold >80% → Rebalance to underutilized nodes
- Checks sovereignty before migration (Local/Zone/Global)
3. Evacuation Urgency (from reason string):
- "disk_failure" or "power" →
Immediate(parallel evacuation) - "degraded_health" →
Gradual(cold capsules first)
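A minimal sketch of that reason-string mapping; the real compiler may match additional reasons, so treat this as illustrative only.

```rust
// Illustrative urgency mapping from a degradation reason string.
#[derive(Debug, PartialEq)]
enum EvacuationUrgency { Immediate, Gradual }

fn evacuation_urgency(reason: &str) -> EvacuationUrgency {
    if reason.contains("disk_failure") || reason.contains("power") {
        EvacuationUrgency::Immediate // evacuate all capsules in parallel
    } else {
        EvacuationUrgency::Gradual   // drain cold capsules first
    }
}

fn main() {
    assert_eq!(evacuation_urgency("disk_failure"), EvacuationUrgency::Immediate);
    assert_eq!(evacuation_urgency("degraded_health"), EvacuationUrgency::Gradual);
}
```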
Capsules self-transform during migrations via the SwarmBehavior trait (common/src/lib.rs). The circular dependency with crypto/compression is resolved through injected TransformOps (implemented by the runtime). See the deep-dive at docs/specs/PODMS_SWARM_BEHAVIOR.md.
TransformOps now carries capsule_id into encrypt/decrypt so runtimes can derive per-capsule keys (see docs/specs/PODMS_TRANSFORM_OPS.md for the SwarmOps adapter).
ScalingAgent::migrate_capsule_task uses SwarmOps to execute decrypt -> decompress -> recompress -> re-encrypt before streaming replication frames, rotating keys to the current version when unset. Segment keys are convergent (content-derived) and wrapped per capsule; frames carry the capsule id and wrapped key so receivers unwrap with Zero Trust isolation while preserving dedup.
pub trait TransformOps {
fn decrypt(
&self,
capsule_id: CapsuleId,
data: &[u8],
policy: &EncryptionPolicy,
ctx: SegmentId,
) -> Result<Vec<u8>>;
fn encrypt(
&self,
capsule_id: CapsuleId,
data: &[u8],
policy: &EncryptionPolicy,
ctx: SegmentId,
) -> Result<Vec<u8>>;
fn decompress(&self, data: &[u8], policy: &CompressionPolicy) -> Result<Vec<u8>>;
fn compress(&self, data: &[u8], policy: &CompressionPolicy) -> Result<Vec<u8>>;
}
pub trait SwarmBehavior {
fn apply_transform<T: TransformOps>(
&self,
segment_id: SegmentId,
data: &[u8],
target_policy: &Policy,
ops: &T,
) -> Result<Vec<u8>>;
fn on_migrate(&self, destination: NodeId, dest_zone: &ZoneId) -> Result<()>;
fn requires_transform(&self, source_zone: &ZoneId, dest_zone: &ZoneId) -> bool;
}
Transformation Logic (Unwrap -> Transcode -> Rewrap):
- Decrypt when source policy enabled encryption.
- Decompress -> re-compress only if compression policies differ (short-circuit when they match).
- Encrypt with target policy (re-key on zone crossing even if policies match).
- Sovereignty guard: `Local` capsules error before leaving the node; `Zone` capsules log validation; `Global` is unrestricted.
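The ordering and short-circuit rules above can be shown with a toy end-to-end: XOR stands in for encryption and a single tag byte for the compression algorithm. This is not the real `TransformOps` flow, only a sketch of the unwrap -> transcode -> rewrap order.

```rust
// Toy "compression": prepend an algorithm tag byte.
fn toy_compress(data: &[u8], tag: u8) -> Vec<u8> {
    let mut out = vec![tag];
    out.extend_from_slice(data);
    out
}
fn toy_decompress(data: &[u8]) -> Vec<u8> { data[1..].to_vec() }

// Toy "encryption": XOR with a key byte (symmetric, self-inverse).
fn xor(data: &[u8], key: u8) -> Vec<u8> {
    data.iter().map(|b| b ^ key).collect()
}

// Unwrap -> transcode -> rewrap, mirroring the logic described above.
fn apply_transform(sealed: &[u8], src_key: u8, dst_key: u8, src_tag: u8, dst_tag: u8) -> Vec<u8> {
    // 1. Decrypt with the source policy's key.
    let compressed = xor(sealed, src_key);
    // 2. Transcode only if compression policies differ (short-circuit otherwise).
    let transcoded = if src_tag == dst_tag {
        compressed
    } else {
        toy_compress(&toy_decompress(&compressed), dst_tag)
    };
    // 3. Re-encrypt with the target policy's key (re-key on zone crossing).
    xor(&transcoded, dst_key)
}

fn main() {
    // Seal a segment under (algo tag 1, key 0x21); migrate to (tag 2, key 0x42).
    let sealed = xor(&toy_compress(b"segment", 1), 0x21);
    let resealed = apply_transform(&sealed, 0x21, 0x42, 1, 2);
    // Unwrapping with the target policy recovers the original payload...
    assert_eq!(toy_decompress(&xor(&resealed, 0x42)), b"segment".to_vec());
    // ...and the segment was recompressed under the new algorithm tag.
    assert_eq!(xor(&resealed, 0x42)[0], 2);
}
```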
Runtime Integration (Scaling Agent example):
/// In crates/scaling (or pipeline), wrap the crypto/compress crates.
struct PipelineOps<'a> {
crypto: &'a CryptoEngine,
comp: &'a CompressionEngine,
}
impl TransformOps for PipelineOps<'_> {
fn decrypt(
&self,
capsule_id: CapsuleId,
data: &[u8],
policy: &EncryptionPolicy,
ctx: SegmentId,
) -> Result<Vec<u8>> {
self.crypto.decrypt_segment(capsule_id, data, policy, ctx)
}
fn encrypt(
&self,
capsule_id: CapsuleId,
data: &[u8],
policy: &EncryptionPolicy,
ctx: SegmentId,
) -> Result<Vec<u8>> {
self.crypto.encrypt_segment(capsule_id, data, policy, ctx)
}
fn decompress(&self, data: &[u8], policy: &CompressionPolicy) -> Result<Vec<u8>> {
self.comp.decompress(data, policy)
}
fn compress(&self, data: &[u8], policy: &CompressionPolicy) -> Result<Vec<u8>> {
self.comp.compress(data, policy)
}
}
// During migration:
let ops = PipelineOps { crypto: &crypto_engine, comp: &compression_engine };
let transformed = capsule.apply_transform(segment_id, &bytes, &target_policy, &ops)?;
This keeps common free of crypto/compression dependencies while letting the scaling agent orchestrate the full unwrap/transcode/rewrap flow during migration or replication.
See docs/example-policy.yaml for declarative policy configurations:
metro_sync:
rpo: 0s
latency_target: 2ms
sovereignty: zone
# Triggers: MetroSync replication + placement in <2ms zone
The ScalingAgent (scaling/src/agent.rs) uses the compiler in its event loop:
async fn handle_telemetry_event(&self, event: Telemetry) -> Result<()> {
let policy = extract_policy(&event);
let mesh_state = self.build_mesh_state().await?;
let actions = self.compiler.compile_scaling_actions(&event, &policy, &mesh_state);
for action in actions {
self.execute_action(action).await?; // Execute migration, replication, etc.
}
}
Unit Tests (90%+ coverage on compiler logic):
- `test_replication_strategy_zero_rpo` - Verifies RPO=0 → MetroSync
- `test_heat_spike_migration` - Heat + low latency → Migration
- `test_evacuation_urgency` - Failure reason → Immediate/Gradual
- `test_sovereignty_validation` - Policies block zone violations
Integration Tests (in capsule-registry/tests/podms_*.rs):
- Multi-node simulations with policy-triggered failovers
- Telemetry → Action → Mesh operation end-to-end flows
Run tests:
cargo test --package scaling  # Runs all compiler + agent tests
Goal: Global-scale, zone-aware federation with intelligent routing.
Features:
- Cross-zone routing optimization
- Traffic shaping based on latency targets
- Cost-aware placement (e.g., S3 tier storage)
- Federated identity (SPIFFE integration)
| Aspect | Traditional Cluster | PODMS |
|---|---|---|
| Coupling | Tight (shared state) | Loose (telemetry events) |
| Placement | Manual/centralized | Autonomous/policy-driven |
| Failure Blast Radius | Cluster-wide | Per-capsule isolation |
| Upgrade Path | Forklift (downtime) | Rolling (zero-downtime) |
| Policy Enforcement | API gateway | Embedded in capsule |
Microservices decompose by service function. PODMS decomposes by data primitive (capsule). Each capsule is independently scalable, reducing orchestration complexity.
Alternatives Considered:
- Polling: Higher latency, wasted cycles
- Shared memory: Tight coupling, single-node only
- Message queue: External dependency, ops overhead
Telemetry Channels:
- Async channels (Tokio): unbounded in Step 1, bounded in Step 2
- Zero-copy event passing
- Backpressure-safe once bounded channels land (Step 2)
- Local-first (no network until Step 2)
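The backpressure behavior planned for the bounded channels can be illustrated with std's `sync_channel` (the real implementation would use Tokio's bounded mpsc; this is only a sketch of the semantics):

```rust
use std::sync::mpsc::sync_channel;

fn main() {
    // Capacity-2 bounded channel: sends beyond capacity fail fast
    // instead of growing memory without limit.
    let (tx, rx) = sync_channel::<&str>(2);
    tx.try_send("NewCapsule").unwrap();
    tx.try_send("HeatSpike").unwrap();
    // Channel full: a non-blocking send is rejected, signalling backpressure.
    assert!(tx.try_send("CapacityThreshold").is_err());
    // Draining one event frees a slot, so sending succeeds again.
    assert_eq!(rx.recv().unwrap(), "NewCapsule");
    assert!(tx.try_send("CapacityThreshold").is_ok());
}
```

On rejection, a producer can drop the event, sample it, or block, which is the trade-off the "telemetry sampling" mitigation below addresses.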
Telemetry events include:
- Capsule IDs (UUIDs, not sensitive)
- Policy (may reveal business logic)
- Access patterns (heatmap data)
Mitigations:
- PODMS telemetry stays in-process (Step 1)
- Cross-node telemetry encrypted (Step 2, via SPIFFE/mTLS)
- Audit log integration (advanced-security feature)
Step 2 agents will:
- Run with least privilege (no registry write access)
- Validate telemetry signatures (BLAKE3-MAC)
- Enforce sovereignty boundaries (e.g., Local policies block replication)
Without PODMS feature:
- Zero overhead (types not compiled in)
With PODMS feature, no telemetry channel:
- <1% overhead (one `if let` check per write)
With PODMS feature + telemetry channel:
- ~2-3% overhead (channel send + tracing)
- Measured: 2.1 GB/s → 2.05 GB/s write throughput
Memory:
- UnboundedSender: ~24 bytes per pipeline
- Events: ~200 bytes each (before send)
Target overhead:
- Metro-sync (RPO=0): <10% latency increase
- Async geo-replication: <1% (background buffered)
Bottleneck mitigation:
- Bounded channels with backpressure
- Rate limiting per zone
- Telemetry sampling for high-throughput workloads
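The sampling mitigation could be as simple as a 1-in-N gate in front of the telemetry sender. This is a hypothetical sketch, not an existing PODMS component:

```rust
// Hypothetical 1-in-N telemetry sampler: high-throughput writers emit
// only every Nth event to keep channel pressure bounded.
struct Sampler {
    every: u64,
    count: u64,
}

impl Sampler {
    fn new(every: u64) -> Self {
        Sampler { every, count: 0 }
    }

    // Returns true when this event should be emitted.
    fn should_emit(&mut self) -> bool {
        self.count += 1;
        self.count % self.every == 0
    }
}

fn main() {
    let mut s = Sampler::new(10);
    let emitted = (0..1000).filter(|_| s.should_emit()).count();
    assert_eq!(emitted, 100); // exactly 1 in 10 events pass through
}
```

Rate-based or event-type-aware sampling (e.g., never dropping `NodeDegraded`) would be a natural refinement.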
Policy Tests (common/src/policy.rs):
- Default values for RPO/latency/sovereignty
- Serialization round-trip
- Policy presets (metro_sync, geo_replicated)
Type Tests (common/src/lib.rs):
- NodeId uniqueness
- ZoneId display formatting
- Telemetry event serialization
Pipeline Tests (capsule-registry/tests/podms_test.rs):
- Telemetry emission on write
- Channel closed gracefully
- Multiple writes → multiple events
- No telemetry without channel
Coverage Target:
- 90%+ for PODMS code paths
- 100% for critical paths (telemetry emission)
Step 2 will add:
- Throughput regression tests (<5% degradation)
- Latency percentiles (p50, p99, p99.9)
- Replication lag measurements
No action required:
- PODMS feature not enabled → zero changes
- Binary size unchanged
- Performance unchanged
Step-by-step:
- Rebuild with the feature:
cargo build --release --features podms
- Initialize telemetry (optional):
let (tx, rx) = mpsc::unbounded_channel();
let pipeline = pipeline.with_telemetry_channel(tx);
// Spawn agent (Step 2)
tokio::spawn(async move { /* agent logic */ });
- Update policies (optional):
let policy = Policy::metro_sync(); // or geo_replicated()
Disable PODMS:
cargo build --release --no-default-features
The pipeline falls back to single-node mode.
Similarities:
- Object-level granularity
- Placement rules (CRUSH-like maps vs. Policy)
Differences:
- Traditional: Centralized monitor cluster
- PODMS: Autonomous agent swarms
Similarities:
- Gossip-based node discovery (planned Step 2)
- Range-level replication (capsule-level here)
Differences:
- CockroachDB: SQL-centric, synchronous Raft
- PODMS: Policy-centric, async + sync hybrid
Similarities:
- Strong consistency option (metro-sync)
Differences:
- etcd: Single Raft group (centralized)
- PODMS: Per-capsule autonomy (decentralized)
Agents learn optimal RPO from workload patterns:
if access_pattern.is_write_heavy() {
policy.rpo = min(policy.rpo, Duration::from_secs(5));
}
Integrate cloud pricing APIs:
if policy.sovereignty == Global && estimated_cost > budget {
place_in_cheaper_zone();
}
Train models to predict HeatSpike events:
Historical access patterns → LSTM → Predicted spike → Proactive migration
- PODMS: Policy-Orchestrated Disaggregated Mesh Scaling
- RPO: Recovery Point Objective (max acceptable data loss window)
- RTO: Recovery Time Objective (max acceptable downtime) - future
- Metro-sync: Synchronous replication within a metro zone (RPO=0)
- Geo-replication: Asynchronous replication across geographic regions
- Sovereignty: Policy-enforced data residency constraints
- Telemetry: Lightweight event stream for autonomous agents
- Agent Swarm: Distributed processes subscribing to telemetry
- architecture.md - Overall SPACE design
- future_state_architecture.md - Long-term vision
- ENCRYPTION_IMPLEMENTATION.md - Security model
- Cargo.toml features - Feature flag configuration
2025-11-08 - Step 1 Complete:
- Added PODMS types (NodeId, ZoneId, SovereigntyLevel, Telemetry)
- Extended Policy with RPO, latency_target, sovereignty
- Integrated telemetry channel in WritePipeline
- Added 90%+ test coverage
- Updated README.md and docs/
2025-11-09 - Step 2 Complete:
- Added `scaling` crate with `MeshNode` and `ScalingAgent`
- Implemented gossip-based peer discovery (memberlist)
- Added RDMA mock transport (TCP for POC)
- Extended `WritePipeline` with `perform_metro_sync_replication()`
- Metro-sync triggered automatically for RPO=0 policies
- Hash-based dedup preserved during replication
- Comprehensive test coverage (unit + integration)
- Updated documentation
Next: Step 3 - Policy Compiler (ETA: 3-5 days)