Parallel streaming: Support concurrent streams from Amp Server, partitioned by block_range #10

fordN · 2025-10-14T15:21:32Z

This PR adds parallel streaming support, enabling fast historical backfills with an automatic transition to live streaming.

What's New

Parallel Execution for Historical Data

Partition large block ranges across multiple workers using ThreadPoolExecutor
4-8x speedup for historical loads (scales with worker count)
Block-based partitioning with automatic or manual partition sizing
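
To illustrate the block-based partitioning described above, here is a minimal sketch that splits a block range into contiguous chunks and hands them to a ThreadPoolExecutor. The names `partition_block_range` and `load_partition` are hypothetical and not taken from this PR, and treating the range as half-open (`max_block` exclusive) is an assumption; the actual implementation may size or bound partitions differently.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_block_range(min_block, max_block, num_partitions):
    """Split [min_block, max_block) into contiguous, non-overlapping ranges.

    Hypothetical helper: returns at most num_partitions (start, end) tuples.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if max_block <= min_block:
        return []
    total = max_block - min_block
    # Ceiling division; the final partition may be shorter than the others.
    size = -(-total // num_partitions)
    return [
        (start, min(start + size, max_block))
        for start in range(min_block, max_block, size)
    ]


def load_partition(block_range):
    """Stand-in for a worker; a real one would stream and write the rows."""
    start, end = block_range
    return end - start  # pretend there is exactly one row per block


partitions = partition_block_range(0, 1_000_000, 4)

# Each partition is processed by its own worker thread.
with ThreadPoolExecutor(max_workers=4) as pool:
    rows_per_partition = list(pool.map(load_partition, partitions))
```

Because the ranges are half-open and contiguous in this sketch, no block is assigned to two workers and none is skipped.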

Hybrid Mode (Parallel Catchup → Live Streaming)

Auto-detect current max block and load historical data in parallel
Seamlessly transition to single-stream continuous mode for live blocks
Configurable reorg buffer (default: 200 blocks) for safe transition overlap
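
One possible reading of the transition overlap is sketched below: the parallel backfill runs up to the detected max block, and the live stream then starts `reorg_buffer` blocks earlier so that recently reorganized blocks are fetched again. This is an assumption about the semantics, not a description of the PR's code; `HybridPlan` and `plan_hybrid_load` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HybridPlan:
    backfill_start: int  # first block of the parallel backfill
    backfill_end: int    # last block of the parallel backfill (detected max)
    stream_start: int    # first block requested from the live stream


def plan_hybrid_load(min_block, detected_max_block, reorg_buffer=200):
    """Hypothetical helper: choose where the live stream takes over."""
    if reorg_buffer < 0:
        raise ValueError("reorg_buffer must be non-negative")
    if detected_max_block < min_block:
        raise ValueError("detected_max_block must be >= min_block")
    # Start the stream reorg_buffer blocks before the backfill ended,
    # but never before the start of the requested range.
    stream_start = max(min_block, detected_max_block - reorg_buffer)
    return HybridPlan(min_block, detected_max_block, stream_start)


plan = plan_hybrid_load(min_block=0, detected_max_block=1_000_000)
```

Under this reading, the blocks in the overlap window are delivered twice, so the destination would need to tolerate or deduplicate them.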

Usage

Basic parallel historical load:

parallel_config = ParallelConfig(
    num_workers=4,
    table_name='eth_firehose.blocks',
    min_block=0,
    max_block=1_000_000,
    block_column='block_num'
)

results = client.sql(query).load(
    connection='postgres',
    destination='blocks',
    stream=True,
    parallel_config=parallel_config
)
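
For intuition, each worker presumably runs the user's query restricted to its own block range. The sketch below shows one way such a restriction could be applied, by wrapping the query in a subquery and filtering on `block_column`. The helper `add_block_filter`, the alias `partition_source`, and the subquery approach are all assumptions for illustration; the PR's actual query transformation may work differently.

```python
def add_block_filter(query, block_column, start_block, end_block):
    """Hypothetical helper: restrict a query to [start_block, end_block)."""
    # Only plain identifiers are accepted, since the column name is
    # interpolated directly into the SQL text.
    if not block_column.isidentifier():
        raise ValueError(f"invalid column name: {block_column!r}")
    inner = query.strip().rstrip(";")
    return (
        f"SELECT * FROM ({inner}) AS partition_source "
        f"WHERE {block_column} >= {int(start_block)} "
        f"AND {block_column} < {int(end_block)}"
    )


partition_query = add_block_filter(
    "SELECT * FROM eth_firehose.blocks;", "block_num", 0, 250_000
)
```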

Hybrid mode (parallel → continuous):

parallel_config = ParallelConfig(
    num_workers=4,
    table_name='eth_firehose.blocks',
    min_block=0,
    max_block=None,  # Auto-detect and transition to streaming
    reorg_buffer=200  # Configurable overlap for reorg safety
)

Key Features

Configurable partition sizes and reorg buffer
Comprehensive logging and statistics tracking
Rich metadata in LoadResult for monitoring
Thread-safe statistics aggregation
Graceful error handling (workers fail independently)
Full documentation with usage patterns and troubleshooting
The destination table is created earlier in the flow, so it already exists before any parallel worker starts
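
The thread-safe aggregation and independent worker failures listed above can be sketched as follows. `LoadStats` and `run_partitions` are hypothetical names, not the PR's classes: a lock guards the shared counters, and each worker's exception is captured so that one failed partition does not stop the others.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class LoadStats:
    """Counters shared by all workers; every update takes the lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.rows_loaded = 0
        self.partitions_completed = 0
        self.partitions_failed = 0

    def record_success(self, rows):
        with self._lock:
            self.rows_loaded += rows
            self.partitions_completed += 1

    def record_failure(self):
        with self._lock:
            self.partitions_failed += 1


def run_partitions(partitions, worker, num_workers=4):
    """Run worker(partition) for each partition; collect stats and errors."""
    stats = LoadStats()

    def run_one(partition):
        # Runs inside a pool thread. The exception is returned rather than
        # raised, so the remaining partitions keep running.
        try:
            rows = worker(partition)
        except Exception as exc:
            stats.record_failure()
            return partition, exc
        stats.record_success(rows)
        return partition, None

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        outcomes = list(pool.map(run_one, partitions))
    errors = {p: exc for p, exc in outcomes if exc is not None}
    return stats, errors


def flaky_worker(block_range):
    """Demo worker that fails for one partition."""
    start, end = block_range
    if start == 10:
        raise RuntimeError("simulated failure")
    return end - start


stats, errors = run_partitions([(0, 10), (10, 20), (20, 30)], flaky_worker)
```

In this demo, two partitions succeed and one fails; the failure is reported in `errors` instead of aborting the whole load.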

Testing

28 unit tests (partitioning logic, query transformation, stats tracking)
4 integration tests (parallel load, hybrid mode, block detection)

Documentation

User guide: docs/parallel_streaming_usage.md - comprehensive usage patterns, configuration options, performance characteristics, and troubleshooting
Examples: Quick start, hybrid mode, checkpointing, custom partitioning

fordN self-assigned this Oct 14, 2025

fordN changed the base branch from main to ford/streaming October 14, 2025 15:26

fordN force-pushed the ford/parallel-streams branch from 79d2b77 to 7b549f6 Compare October 14, 2025 15:35

fordN force-pushed the ford/parallel-streams branch from 7b549f6 to 7f7d0df Compare October 22, 2025 15:09

fordN force-pushed the ford/streaming branch from 3f2b905 to f93a7e6 Compare October 22, 2025 15:21

Base automatically changed from ford/streaming to main October 22, 2025 17:30

fordN force-pushed the ford/parallel-streams branch from 7f7d0df to a2c4142 Compare October 22, 2025 21:39

fordN requested review from JohnSwan1503 and craigtutterow October 22, 2025 21:40

fordN force-pushed the ford/parallel-streams branch from a2c4142 to 028d16f Compare October 22, 2025 21:46

fordN added 8 commits October 23, 2025 18:02

streaming: Setup primitives for parallel streaming

5bfb743

loader: Wire up parallel streaming

bb2f3c0

- Also move LoadResult and LoadConfig to /loaders/types to avoid circular dependencies

docs: Parallel streaming implementation plan

75a8335

tests: Unit and integration tests for parallel streaming

c14af50

- Integration tests require an Amp server

docs, README: Document parallel streams usage

e0e31d2

CLAUDE.md: Don't use mocks

d406962

postgresql_loader: Use Numeric type for Arrow UINT64 columns

ae83e03

parallel streaming: various improvements

5878a19

- Configurable reorg buffer
- Create table ahead of spinning up parallel workers to ensure it's ready for all of them and avoid complexity of thread locking
- SQL variables for string replacement
- Better docs, including limitations

fordN force-pushed the ford/parallel-streams branch from 028d16f to 5878a19 Compare October 24, 2025 01:03

fordN requested a review from incrypto32 October 24, 2025 01:05

incrypto32 approved these changes Oct 27, 2025

JohnSwan1503 approved these changes Oct 27, 2025

fordN merged commit 68e1d60 into main Nov 4, 2025
7 checks passed

fordN deleted the ford/parallel-streams branch November 4, 2025 18:11

4 participants
