This document describes the high-level architecture, design decisions, and data flows of the py_rt Algorithm Trading System.
The py_rt system implements a hybrid Python-Rust architecture that separates offline research/backtesting tasks (Python) from online low-latency trading execution (Rust). This design maximizes Python's productivity for research while leveraging Rust's performance for production trading.
- Architecture Overview
- Python-Rust Separation
- Component Descriptions
- Data Flow
- Communication Patterns
- Design Decisions
- Scalability
- Limitations
- Future Improvements
For comprehensive architecture documentation, see:
- /docs/architecture/python-rust-separation.md - Complete system design with Python offline and Rust online components
- /docs/architecture/component-diagram.md - C4 model diagrams showing component interactions
- /docs/architecture/integration-layer.md - PyO3 bindings, ZeroMQ messaging, and shared memory IPC
The system is designed with a clear separation of concerns:
- Backtesting: Historical data replay, strategy validation
- Optimization: Parameter tuning with Optuna, genetic algorithms
- Machine Learning: Model training with XGBoost, PyTorch
- Analysis: Statistical analysis, visualization, performance attribution
- Research: Jupyter notebooks, hypothesis testing
- Market Data: Low-latency WebSocket ingestion, order book management
- Execution: Order routing, retry logic, slippage protection
- Risk Management: Pre-trade checks, position tracking, P&L monitoring
- Signal Processing: Real-time indicators, ML inference (ONNX)
- PyO3: Python-Rust bindings for performance-critical functions
- ZeroMQ: Pub/sub messaging for event-driven communication
- Shared Memory: Lock-free ring buffers for ultra-low-latency data transfer
- Protocol Buffers: Binary serialization for efficient data exchange
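The shared-memory leg of the integration layer is built on single-producer/single-consumer ring buffers. As a rough illustration of the head/tail index discipline such a buffer uses (a pure-Python sketch, not the Rust implementation; the class and method names are invented):

```python
# Toy single-producer/single-consumer ring buffer illustrating the
# head/tail index discipline; the real system uses lock-free Rust
# ring buffers over shared memory. All names here are invented.

class RingBuffer:
    def __init__(self, capacity: int):
        self.buf = [None] * capacity
        self.capacity = capacity
        self.head = 0  # next write slot (producer-owned, only increases)
        self.tail = 0  # next read slot (consumer-owned, only increases)

    def try_push(self, item) -> bool:
        if self.head - self.tail == self.capacity:
            return False  # full: producer must drop or back off
        self.buf[self.head % self.capacity] = item
        self.head += 1  # "publish" the slot after writing it
        return True

    def try_pop(self):
        if self.tail == self.head:
            return None  # empty
        item = self.buf[self.tail % self.capacity]
        self.tail += 1
        return item

rb = RingBuffer(capacity=4)
rb.try_push({"symbol": "AAPL", "price": 189.25})
tick = rb.try_pop()
```

In a lock-free implementation the head and tail indices are atomics and the publish step is a release store, which is what makes the transfer safe across processes without locks.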
┌─────────────────────────────────────────────────────────────────────┐
│ PYTHON OFFLINE │
│ Research → Backtesting → Optimization → ML Training → ONNX Export │
└─────────────────────────────┬───────────────────────────────────────┘
│
┌─────────▼─────────┐
│ PyO3 / ZeroMQ │
│ Protocol Buffers │
└─────────┬─────────┘
│
┌─────────────────────────────▼───────────────────────────────────────┐
│ RUST ONLINE │
│ Market Data → Signal Processing → Risk Checks → Order Execution │
└─────────────────────────────────────────────────────────────────────┘
The original system follows a microservices architecture with four independent components communicating via ZeroMQ (ZMQ) messaging. Each component is a separate Rust binary that can be deployed and scaled independently.
graph TD
A[Alpaca Markets API] -->|WebSocket| B[Market Data Service]
B -->|ZMQ PUB| C[Signal Bridge Python ML]
B -->|ZMQ PUB| D[Risk Manager]
C -->|ZMQ PUB| D
D -->|ZMQ PUB| E[Execution Engine]
E -->|REST API| A
B -.->|Prometheus| F[Monitoring]
C -.->|Prometheus| F
D -.->|Prometheus| F
E -.->|Prometheus| F
┌─────────────────────────────────────────────────────────────┐
│ Alpaca Markets Cloud │
│ (WebSocket + REST API v2) │
└───────────────┬─────────────────────────────┬───────────────┘
│ │
│ WebSocket │ REST API
│ (Market Data) │ (Orders)
▼ ▲
┌─────────────────────┐ ┌─────────────────────┐
│ Market Data │ │ Execution Engine │
│ Service │ │ - Order Router │
│ - WebSocket Client │ │ - Retry Logic │
│ - Order Book │ │ - Slippage Check │
│ - Aggregation │ │ - Rate Limiting │
│ - ZMQ Publisher │ │ │
└──────────┬──────────┘ └──────────▲──────────┘
│ │
│ ZMQ PUB (port 5555) │ ZMQ SUB
│ Topics: market.* │ Topic: order.*
│ │
▼ │
┌─────────────────────┐ ┌─────────────────────┐
│ Signal Bridge │ │ Risk Manager │
│ - PyO3 Bindings │────────▶│ - Position Track │
│ - Python ML │ ZMQ PUB │ - Risk Limits │
│ - Feature Eng. │ (5556) │ - Pre-trade Check │
│ │ │ - P&L Tracking │
└─────────────────────┘ └─────────────────────┘
│
│ ZMQ PUB (5557)
│ Topic: risk.*
└─────────────────┘
Responsibility: Real-time market data ingestion and distribution
Key Features:
- WebSocket connection to Alpaca Markets data stream (v2)
- Maintains live order books for subscribed symbols
- Aggregates trades into OHLCV bars (multiple timeframes)
- Publishes market data via ZMQ PUB socket
Implementation Details:
- websocket.rs: Tokio-tungstenite WebSocket client with automatic reconnection
- orderbook.rs: Binary heap-based order book with O(log n) updates
- aggregation.rs: Time-based windowing for bar generation
- publisher.rs: ZMQ PUB socket wrapper with topic-based routing
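The time-based windowing used for bar generation can be sketched in Python: each trade is floored to its bar's start time and folded into an OHLCV record (the function name and trade tuple layout are illustrative, not aggregation.rs's actual API):

```python
# Sketch of time-based windowing: bucket (timestamp, price, size) trades
# into OHLCV bars keyed by window start time. Names are illustrative.
from dataclasses import dataclass

@dataclass
class Bar:
    open: float
    high: float
    low: float
    close: float
    volume: float

def aggregate_bars(trades, window_secs: int = 60) -> dict:
    bars = {}
    for ts, price, size in trades:          # trades assumed time-ordered
        start = ts - (ts % window_secs)     # floor to the bar boundary
        bar = bars.get(start)
        if bar is None:
            bars[start] = Bar(price, price, price, price, size)
        else:
            bar.high = max(bar.high, price)
            bar.low = min(bar.low, price)
            bar.close = price               # last trade in window wins
            bar.volume += size
    return bars

bars = aggregate_bars([(0, 100.0, 10), (30, 101.0, 5), (61, 99.5, 7)])
# bars[0] covers [0, 60): open=100.0, close=101.0, volume=15
```

Running the same fold per timeframe (1s, 1m, 5m, ...) yields the multiple-timeframe bars the service publishes.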
Performance:
- Processes 50,000+ messages/second
- Order book updates in <50μs (p99)
- Memory: ~50MB for 100 symbols
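The binary-heap order book gives O(1) best-bid/ask peeks and O(log n) inserts. A toy Python sketch of the structure (a production book like orderbook.rs also needs cancels and per-level sizes, usually via lazy deletion; this shows only the heaps):

```python
# Toy heap-backed top-of-book: O(log n) inserts, O(1) best-price peeks.
import heapq

class TopOfBook:
    def __init__(self):
        self._bids = []  # max-heap via negated prices
        self._asks = []  # min-heap

    def add_bid(self, price: float):
        heapq.heappush(self._bids, -price)

    def add_ask(self, price: float):
        heapq.heappush(self._asks, price)

    def best_bid(self):
        return -self._bids[0] if self._bids else None

    def best_ask(self):
        return self._asks[0] if self._asks else None

book = TopOfBook()
book.add_bid(99.9)
book.add_bid(100.1)
book.add_ask(100.3)
book.add_ask(100.2)
# book.best_bid() == 100.1, book.best_ask() == 100.2
```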
Configuration:
{
"alpaca_api_key": "string",
"alpaca_secret_key": "string",
"zmq_pub_address": "tcp://*:5555",
"symbols": ["AAPL", "MSFT"],
"reconnect_delay_secs": 5
}

Responsibility: Python integration for ML-based signal generation
Key Features:
- PyO3 bindings expose Rust types to Python
- Subscribes to market data via ZMQ SUB
- Processes signals through Python ML models
- Publishes trading signals via ZMQ PUB
Implementation Details:
- lib.rs: PyO3 module with the #[pymodule] attribute
- Exposes Signal, Bar, Trade types to Python
- Async runtime integration with Python's asyncio
- Technical indicators via the ta crate
Use Cases:
- Feature Engineering: Calculate technical indicators in Python
- ML Models: scikit-learn, TensorFlow, PyTorch integration
- Custom Strategies: Implement complex logic in Python
Example Python Integration:
import signal_bridge
# Receive market data
bar = signal_bridge.receive_bar()
# Generate signal
if bar.close > bar.open * 1.01:
signal = signal_bridge.Signal(
symbol="AAPL",
action="Buy",
confidence=0.85
)
    signal_bridge.publish_signal(signal)

Performance:
- Python GIL can limit throughput to ~1,000 signals/second
- Consider multi-process architecture for higher throughput
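The multi-process workaround can be sketched with the standard library: each worker process holds its own interpreter and GIL, so inference throughput scales with cores (the infer function is a placeholder for a real model call, e.g. an ONNX session):

```python
# Sketch: sidestep the GIL by running inference in worker processes.
# infer() is a placeholder for a real model call.
from multiprocessing import Pool

def infer(bar):
    close, open_ = bar
    return "Buy" if close > open_ * 1.01 else "Hold"

if __name__ == "__main__":
    bars = [(102.0, 100.0), (100.1, 100.0)]
    with Pool(processes=2) as pool:
        # each worker process has its own interpreter and GIL
        signals = pool.map(infer, bars)
    # signals == ["Buy", "Hold"]
```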
Responsibility: Pre-trade risk checks and position tracking
Key Features:
- Validates orders against configurable risk limits
- Tracks open positions and realized/unrealized P&L
- Enforces maximum position size, order size, daily loss limits
- Publishes approved/rejected orders
Implementation Details:
- lib.rs: Core risk engine with position tracking
- In-memory position store (HashMap)
- Atomic risk checks with async/await
Risk Checks:
- Position Size Limit: Maximum position value per symbol
- Order Size Limit: Maximum single order value
- Daily Loss Limit: Maximum unrealized loss in trading day
- Account Balance: Sufficient buying power
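The four checks compose into a single pre-trade gate; a Python sketch of the same logic, with limit values mirroring the sample configuration (function and parameter names are illustrative):

```python
# Sketch of the pre-trade gate; limits mirror the sample configuration.
def pre_trade_check(order_value: float, position_value: float,
                    daily_pnl: float, buying_power: float,
                    max_order: float = 1_000.0,
                    max_position: float = 10_000.0,
                    max_daily_loss: float = 5_000.0):
    """Run the four checks in order; return (approved, reason)."""
    if position_value + order_value > max_position:
        return False, "position size limit"
    if order_value > max_order:
        return False, "order size limit"
    if daily_pnl < -max_daily_loss:
        return False, "daily loss limit"
    if order_value > buying_power:
        return False, "insufficient buying power"
    return True, "approved"

ok, reason = pre_trade_check(order_value=500.0, position_value=9_000.0,
                             daily_pnl=-200.0, buying_power=20_000.0)
# ok == True: a 9,500 position, 500 order, and -200 P&L are all within limits
```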
Configuration:
{
"max_position_size": 10000.0,
"max_order_size": 1000.0,
"max_daily_loss": 5000.0,
"zmq_sub_address": "tcp://localhost:5555",
"zmq_pub_address": "tcp://*:5557"
}

Example Risk Check Flow:
Signal → Risk Manager → Check Limits → Approve/Reject → Publish
│
├─ Position size OK?
├─ Order size OK?
├─ Daily loss OK?
└─ Account balance OK?
Responsibility: Order execution through Alpaca Markets API
Key Features:
- Smart order routing with retry logic
- Slippage protection (rejects orders exceeding threshold)
- Rate limiting (200 requests/minute via the governor crate)
- Order status tracking and updates
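The 200 requests/minute budget is token-bucket style rate limiting (the governor crate uses a related, GCRA-based algorithm); a simplified single-threaded Python sketch:

```python
# Simplified token bucket for a requests-per-minute budget; the clock is
# injectable so the refill logic can be exercised deterministically.
import time

class TokenBucket:
    def __init__(self, rate_per_min: int = 200, clock=time.monotonic):
        self.capacity = float(rate_per_min)
        self.rate = rate_per_min / 60.0      # tokens refilled per second
        self.tokens = float(rate_per_min)    # start full
        self.clock = clock
        self.last = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller should queue the order or back off

bucket = TokenBucket(rate_per_min=200)
allowed = bucket.try_acquire()  # True until the budget is exhausted
```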
Implementation Details:
- router.rs: Order routing logic with async HTTP client (reqwest)
- retry.rs: Exponential backoff retry strategy (max 3 retries)
- slippage.rs: Compares limit order price to current market price
Retry Strategy:
// Exponential backoff: 1s, 2s, 4s
async fn retry_order(order: Order) -> Result<OrderResponse> {
let mut delay = Duration::from_secs(1);
for attempt in 0..3 {
match execute_order(&order).await {
Ok(response) => return Ok(response),
Err(e) if is_retryable(&e) => {
tokio::time::sleep(delay).await;
delay *= 2;
}
Err(e) => return Err(e),
}
}
Err(Error::MaxRetriesExceeded)
}

Slippage Protection:
// Returns true when slippage is acceptable; limit orders deviating more
// than max_slippage_bps (default 50) from the market price are rejected.
fn check_slippage(order: &Order, market_price: Price, max_slippage_bps: f64) -> bool {
    let slippage_bps = ((order.price - market_price) / market_price * 10000.0).abs();
    slippage_bps <= max_slippage_bps
}

Configuration:
{
"alpaca_api_key": "string",
"alpaca_secret_key": "string",
"max_retries": 3,
"max_slippage_bps": 50,
"rate_limit_per_minute": 200
}

sequenceDiagram
participant Alpaca as Alpaca Markets
participant MD as Market Data
participant SB as Signal Bridge
participant RM as Risk Manager
participant EE as Execution Engine
Alpaca->>MD: WebSocket: Trade Update
MD->>MD: Update Order Book
MD->>SB: ZMQ PUB: market.trade
MD->>RM: ZMQ PUB: market.trade
SB->>SB: ML Model Inference
SB->>RM: ZMQ PUB: signal.generated
RM->>RM: Risk Check
alt Risk Check Passed
RM->>EE: ZMQ PUB: risk.approved
EE->>Alpaca: REST: POST /v2/orders
Alpaca->>EE: Response: Order Accepted
EE->>RM: ZMQ PUB: order.filled
else Risk Check Failed
RM->>SB: ZMQ PUB: risk.rejected
end
Time: t0
Alpaca Markets ─────────────▶ Market Data Service
(WebSocket: Trade Update)
Time: t0 + 50μs
Market Data Service ─────────▶ Signal Bridge
Market Data Service ─────────▶ Risk Manager
(ZMQ PUB: market.trade)
Time: t0 + 5ms
Signal Bridge ───────────────▶ Risk Manager
(ZMQ PUB: signal.generated)
Time: t0 + 6ms
Risk Manager ────────────────▶ Execution Engine
(ZMQ PUB: risk.approved)
Time: t0 + 10ms
Execution Engine ────────────▶ Alpaca Markets
(REST: POST /v2/orders)
Total Latency: ~10ms (signal to execution)
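The stage deltas above should sum to the quoted end-to-end budget; a quick arithmetic check:

```python
# Per-stage deltas from the timeline above (seconds)
stages = {
    "order_book_update": 50e-6,          # t0        -> t0 + 50us
    "ml_inference":      5e-3 - 50e-6,   # t0 + 50us -> t0 + 5ms
    "risk_check":        1e-3,           # t0 + 5ms  -> t0 + 6ms
    "order_submission":  4e-3,           # t0 + 6ms  -> t0 + 10ms
}
total_ms = sum(stages.values()) * 1_000
# total_ms ≈ 10.0, matching the ~10ms signal-to-execution figure
```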
Advantages:
- Decoupling: Publishers don't know about subscribers
- Scalability: Add subscribers without modifying publishers
- Performance: Zero-copy messaging, <10μs latency
- Reliability: Automatic reconnection and buffering
Topic-Based Routing:
// Publisher (Market Data)
pub async fn publish_trade(&self, trade: Trade) -> Result<()> {
    let topic = "market.trade";
    let message = Message::TradeUpdate(trade);
    let data = serde_json::to_vec(&message)?;
    self.socket.send_multipart([topic.as_bytes(), &data], 0)?;
    Ok(())
}
// Subscriber (Risk Manager)
pub async fn subscribe(&mut self) -> Result<()> {
    self.socket.set_subscribe(b"market.")?;
    self.socket.set_subscribe(b"signal.")?;
    loop {
        let msg = self.socket.recv_multipart(0)?;
        let topic = String::from_utf8(msg[0].clone())?;
        let data = &msg[1];
        self.handle_message(&topic, data).await?;
    }
}

Message Topics:
- market.trade - Trade updates
- market.quote - Quote updates
- market.bar - OHLCV bar updates
- signal.generated - Trading signals
- risk.approved - Risk-approved orders
- risk.rejected - Risk-rejected orders
- order.submitted - Orders sent to exchange
- order.filled - Order fill updates
- system.heartbeat - Component health checks
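On the wire, a SUB socket filters by byte-prefix match on the topic frame; the subscriber-side dispatch reduces to the following (a pure-Python sketch of the matching rule, no ZeroMQ required):

```python
# ZMQ SUB sockets filter by byte-prefix on the topic frame; this sketch
# reproduces that matching rule in plain Python.
SUBSCRIPTIONS = (b"market.", b"signal.")   # as passed to set_subscribe

def matches(topic: bytes) -> bool:
    """A message is delivered if any subscription is a prefix of its topic."""
    return any(topic.startswith(p) for p in SUBSCRIPTIONS)

def dispatch(topic: bytes, payload: bytes, handlers: dict):
    """Route a [topic, payload] multipart message to the first matching handler."""
    for prefix, handler in handlers.items():
        if topic.startswith(prefix):
            return handler(payload)
    return None  # would have been filtered out by the SUB socket anyway

assert matches(b"market.trade")
assert not matches(b"order.filled")  # Risk Manager does not subscribe to order.*
```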
For synchronous operations (e.g., account balance queries):
// Client (Risk Manager)
let request = Message::BalanceRequest;
socket.send(serde_json::to_vec(&request)?)?;
let response: BalanceResponse = socket.recv_json()?;
// Server (Execution Engine)
let request: Message = socket.recv_json()?;
let balance = alpaca.get_account_balance().await?;
let response = Message::BalanceResponse(balance);
socket.send(serde_json::to_vec(&response)?)?;

Performance:
- Zero-cost abstractions enable C++ performance without manual memory management
- No garbage collection pauses (critical for low-latency trading)
- LLVM-optimized machine code
Safety:
- Ownership system prevents data races at compile time
- No null pointer dereferences or use-after-free bugs
- Strong type system catches errors before production
Concurrency:
- Tokio async runtime provides efficient task scheduling
- Zero-cost async/await syntax
- Fearless concurrency via Send/Sync traits
Trade-offs:
- Steeper learning curve compared to Python/Go
- Longer compile times (mitigated with incremental compilation)
- Smaller ecosystem compared to Java/C++
Performance:
- Zero-copy messaging via shared memory
- <10μs message latency
- Millions of messages/second throughput
Flexibility:
- Multiple transport protocols (TCP, IPC, inproc)
- Various messaging patterns (PUB/SUB, REQ/REP, PUSH/PULL)
- Language-agnostic (Rust, Python, C++, Go, etc.)
Reliability:
- Automatic reconnection
- Configurable buffering and backpressure
- Fair queuing for subscribers
Alternatives Considered:
- gRPC: Higher latency (~100μs), requires code generation
- NATS: Requires external broker, added operational complexity
- Kafka: Too heavy for low-latency use case (ms latency)
- Redis PUB/SUB: Requires external server, no persistence
Benefits:
- Shared Dependencies: Single Cargo.lock for reproducible builds
- Common Types: common crate shared across all components
- Consistent Versioning: Workspace-level version management
- Efficient Builds: Cargo caches shared dependencies
Structure:
rust/
├── Cargo.toml # Workspace root
├── common/ # Shared types and utilities
├── market-data/ # Independent binary
├── signal-bridge/ # Independent binary
├── risk-manager/ # Independent binary
└── execution-engine/ # Independent binary
Performance:
- Near-native speed (10-100x faster than pure Python)
- Zero-copy data transfer between Rust and Python
- GIL management for concurrent access
Flexibility:
- Gradual migration path (start with Python, optimize critical paths in Rust)
- Leverage Python's ML ecosystem (scikit-learn, TensorFlow, PyTorch)
- Maintain type safety with pyo3 type annotations
Alternative Considered:
- Native Python: Too slow for high-frequency signal generation
- Cython: Requires compilation, less type safety
- RESTful API: Higher latency, serialization overhead
Current Design:
- Positions stored in-memory (HashMap)
- No persistence across restarts
Rationale:
- Simplicity: No database setup/maintenance
- Performance: In-memory access is 100-1000x faster
- MVP Focus: Sufficient for initial development
Future Improvement:
- Add PostgreSQL for position persistence
- Use TimescaleDB for time-series metrics
- Redis for shared state across instances
Each component can scale independently:
┌─────────────────────┐
│ Market Data 1 │────┐
└─────────────────────┘ │
┌─────────────────────┐ │ ┌─────────────────────┐
│ Market Data 2 │────┼───▶│ Load Balancer │
└─────────────────────┘ │ │ (HAProxy/NGINX) │
┌─────────────────────┐ │ └─────────────────────┘
│ Market Data N │────┘ │
└─────────────────────┘ ▼
ZMQ PUB socket
Scaling Strategies:
- Market Data: Shard by symbol (e.g., instance 1 handles AAPL-MSFT, instance 2 handles GOOGL-AMZN)
- Signal Bridge: Run multiple Python processes (avoid GIL contention)
- Risk Manager: Shard by symbol or use distributed locking (Redis)
- Execution Engine: Round-robin order distribution
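Sharding by symbol only needs a stable symbol-to-instance mapping; a sketch (crc32 is used because Python's built-in hash() is salted per process; the symbols and instance count are examples):

```python
# Stable symbol -> instance mapping for sharding; crc32 gives the same
# answer across restarts and machines, unlike Python's salted hash().
import zlib

def shard_for(symbol: str, num_instances: int) -> int:
    return zlib.crc32(symbol.encode()) % num_instances

symbols = ["AAPL", "MSFT", "GOOGL", "AMZN"]
assignment = {s: shard_for(s, 2) for s in symbols}
# every symbol always lands on the same instance
```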
CPU:
- Multi-threaded Tokio runtime scales to available cores
- CPU affinity can reduce context switching
Memory:
- Order books grow linearly with number of symbols (~500KB per symbol)
- Estimated memory: 100 symbols × 500KB = 50MB
Network:
- ZMQ supports 10Gbps+ throughput with proper tuning
- TCP_NODELAY disables Nagle's algorithm for lower latency
Rust Optimizations:
[profile.release]
opt-level = 3 # Maximum optimization
lto = true # Link-time optimization
codegen-units = 1 # Single codegen unit (slower compile, faster runtime)
strip = true          # Remove debug symbols

Tokio Tuning:
// Increase worker threads
tokio::runtime::Builder::new_multi_thread()
.worker_threads(8)
.enable_all()
    .build()?;

ZMQ Tuning:
// Increase socket buffers
socket.set_sndhwm(10000)?; // Send high-water mark
socket.set_rcvhwm(10000)?; // Receive high-water mark
socket.set_tcp_keepalive(1)?;

- Single Exchange Support
  - Only Alpaca Markets integration
  - Cannot aggregate liquidity across exchanges
- No Position Persistence
  - Positions lost on restart
  - Risk exposure during recovery
- Simple Risk Model
  - Position-based limits only
  - No portfolio-level VaR or Greeks
- No Backtesting
  - Live trading only
  - Cannot validate strategies on historical data
- Limited Observability
  - Basic Prometheus metrics
  - No distributed tracing (Jaeger/Zipkin)
- No High Availability
  - Single point of failure for each component
  - No leader election or failover
- WebSocket Reconnection: 5-second delay on disconnect (Alpaca limitation)
- Python GIL: Signal generation limited to ~1,000/second
- Rate Limiting: Alpaca API limits to 200 requests/minute
- Position Persistence
  - PostgreSQL integration
  - Automatic recovery on restart
- Enhanced Risk Management
  - Portfolio-level risk (VaR, Expected Shortfall)
  - Greeks calculation for options
  - Correlation-based position limits
- Multi-Exchange Support
  - Binance integration (crypto)
  - Interactive Brokers (stocks, futures, options)
- Backtesting Framework
  - Historical data replay
  - Strategy performance metrics
  - Parameter optimization
- Web Dashboard
  - React-based monitoring UI
  - Real-time position tracking
  - P&L visualization
- Distributed Tracing
  - Jaeger integration
  - Request ID propagation
  - Performance profiling
- High Availability
  - Raft consensus for leader election
  - Automatic failover
  - State replication
- Machine Learning Pipeline
  - Feature store (Feast)
  - Model serving (Seldon Core)
  - A/B testing framework
- Options Trading
  - Greeks calculation
  - Volatility surface modeling
  - Multi-leg strategies
Last Updated: 2024-10-14 | Version: 0.1.0