[python] Add streaming reads to paimon CLI (table stream command)#7456

Open
tub wants to merge 5 commits into apache:master from tub:python-streaming-2e-cli-stream

Conversation

@tub tub commented Mar 17, 2026

Contributor

Summary

  • Adds paimon table stream <db.table> CLI subcommand that continuously polls a Paimon table and prints new rows as they arrive until Ctrl+C
  • Adds StreamReadBuilder.with_scan_from() to the library so programmatic users can also control starting position ("latest", "earliest", or a snapshot ID integer)
  • Timestamps passed to --from are resolved to a snapshot ID at the CLI layer (the library API itself takes no timestamps)
    • I can push this down into the library if we think it'll be useful
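As an illustration of the timestamp-to-snapshot step, here is a minimal sketch. It is not the code in this PR: the function name, the `(snapshot_id, commit_time_ms)` input shape, and the choice of "first snapshot committed at or after the timestamp" are all assumptions made for the example.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple


def resolve_timestamp_to_snapshot_id(
    snapshots: List[Tuple[int, int]], timestamp_ms: int
) -> Optional[int]:
    """Hypothetical helper: map a timestamp to a snapshot ID.

    `snapshots` holds (snapshot_id, commit_time_ms) pairs sorted by
    commit time ascending. Assumption: the chosen snapshot is the first
    one committed at or after `timestamp_ms`; None means no such snapshot.
    """
    commit_times = [commit_time for _, commit_time in snapshots]
    # Binary search for the leftmost commit time >= timestamp_ms.
    idx = bisect_left(commit_times, timestamp_ms)
    if idx == len(snapshots):
        return None
    return snapshots[idx][0]
```

Keeping this at the CLI layer means the library only ever sees "latest", "earliest", or an integer.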

Flags

| Flag | Default | Description |
| --- | --- | --- |
| --from | latest | Starting position: latest, earliest, snapshot ID, or timestamp (YYYY-MM-DD, ISO 8601) |
| --select | | Column projection (comma-separated) |
| --where | | SQL-subset filter predicate |
| --format | table | Output format: table or json |
| --poll-interval-ms | 1000 | Milliseconds between snapshot polls |
| --include-row-kind | off | Prepend _row_kind column (+I, -U, +U, -D) |
| --consumer-id | | Persist scan progress for at-least-once resume |
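The four kinds of --from value could be told apart roughly as follows. This is a sketch only and not the PR's parse_from_position(): the name `classify_from_value`, the returned tuple shape, and treating timezone-less timestamps as UTC are assumptions.

```python
from datetime import datetime, timezone
from typing import Tuple, Union


def classify_from_value(raw: str) -> Tuple[str, Union[str, int]]:
    """Hypothetical classifier for a --from argument.

    Returns ("keyword", "latest" | "earliest"), ("snapshot_id", int),
    or ("timestamp_ms", int).
    """
    value = raw.strip()
    if value.lower() in ("latest", "earliest"):
        return ("keyword", value.lower())
    if value.isascii() and value.isdigit():
        return ("snapshot_id", int(value))
    try:
        # Accepts YYYY-MM-DD and ISO 8601 strings with or without an offset.
        parsed = datetime.fromisoformat(value)
    except ValueError:
        raise ValueError(f"unrecognized --from value: {raw!r}") from None
    if parsed.tzinfo is None:
        # Assumption for this sketch: naive timestamps are read as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return ("timestamp_ms", int(parsed.timestamp() * 1000))
```

A timestamp result would still need to be resolved to a snapshot ID before it reaches the library.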

Changes

  • paimon-python/pypaimon/cli/cli_table_stream.py: new command handler + parse_from_position() timestamp resolver
  • paimon-python/pypaimon/cli/cli_table.py: register stream subparser
  • paimon-python/pypaimon/read/stream_read_builder.py: with_scan_from()
  • paimon-python/pypaimon/read/streaming_table_scan.py: scan_from startup resolution (consumer restore always wins)

Test plan

  • pypaimon/tests/stream_read_builder_test.py — 7 new unit tests for with_scan_from()
  • pypaimon/tests/streaming_table_scan_test.py — 4 new integration tests (earliest, numeric ID, latest, consumer-overrides-scan_from)
  • pypaimon/tests/cli_table_stream_test.py — 20 new tests (CLI integration + parse_from_position unit tests)

🤖 Generated with Claude Code

@JingsongLi JingsongLi left a comment

Contributor

Nice CLI addition for pypaimon. The table stream command provides a user-friendly way to tail table changes.

Review:

  1. Feature set: --from, --select, --where, --format, --poll-interval-ms, --include-row-kind, --consumer-id — comprehensive flag set covering the common streaming read scenarios.

  2. Library API addition: StreamReadBuilder.with_scan_from() is useful beyond CLI. Good separation between library and CLI layers.

  3. +859 additions is substantial. The split between CLI handler, library API, and tests is clean.

  4. Consumer-id for at-least-once resume: Persisting scan progress enables resumable tailing. Important for operational use.

  5. Timestamp resolution at CLI layer: Making the library API accept snapshot IDs while the CLI resolves timestamps is the right layering. Pushing timestamp resolution down to the library is optional.

  6. Test coverage: cli_table_stream_test.py, stream_read_builder_test.py, streaming_table_scan_test.py — three test files covering different layers. Good.

  7. Poll interval: Default 1000ms is reasonable. Consider documenting that very low intervals (< 100ms) may cause unnecessary load on the metadata path.

LGTM. Well-designed feature.

@JingsongLi

Contributor

Please make this not a draft.

@tub tub marked this pull request as ready for review May 24, 2026 05:41
@JingsongLi JingsongLi closed this May 31, 2026
@JingsongLi JingsongLi reopened this May 31, 2026

@leaves12138 leaves12138 left a comment

Contributor

Thanks for the update. I reviewed the latest version again.

The CLI layering looks reasonable now: table stream keeps the human-friendly parsing in the command layer, resolves timestamp-based --from values to snapshot IDs, and passes a normalized scan position down to StreamReadBuilder. The streaming scan behavior also looks consistent to me: consumer restore takes precedence over scan_from, earliest emits an initial full-scan plan from the earliest snapshot, and numeric snapshot IDs rely on the follow-up scan path.
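The startup precedence described above can be sketched as a small pure function. The names and the returned `(mode, snapshot_id)` shape are hypothetical, and the handling of "latest" (start following after the current latest snapshot) is an assumption rather than something confirmed by the PR.

```python
from typing import Optional, Tuple, Union


def resolve_start(
    consumer_next_snapshot: Optional[int],
    scan_from: Union[str, int],
    earliest_id: int,
    latest_id: int,
) -> Tuple[str, int]:
    """Hypothetical startup resolution for a streaming scan."""
    # 1. A restored consumer position always wins over scan_from.
    if consumer_next_snapshot is not None:
        return ("follow_up", consumer_next_snapshot)
    # 2. "earliest" starts with a full-scan plan from the earliest snapshot.
    if scan_from == "earliest":
        return ("full_scan", earliest_id)
    # 3. A numeric snapshot ID goes straight to the follow-up scan path.
    if isinstance(scan_from, int):
        return ("follow_up", scan_from)
    # 4. Assumption: "latest" follows changes after the latest snapshot.
    return ("follow_up", latest_id + 1)
```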

The new tests cover the main CLI paths, output formats, --from parsing, builder propagation, and the streaming scan startup semantics. I do not see a blocking code issue from this pass.

One small follow-up you may consider (not blocking from my side): since StreamReadBuilder.with_scan_from() is a public programmatic API and its docstring restricts values to "latest", "earliest", or a positive integer snapshot ID, it would be a bit nicer to fail fast there for invalid strings or non-positive integers instead of letting the scan loop fail later. The CLI already normalizes its inputs, so this is mostly API hardening.
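A fail-fast check along these lines might look like the sketch below. The function name and error messages are hypothetical; only the accepted values ("latest", "earliest", or a positive integer snapshot ID) come from the docstring contract mentioned above.

```python
from typing import Union

_SCAN_FROM_KEYWORDS = ("latest", "earliest")


def validate_scan_from(value: Union[str, int]) -> Union[str, int]:
    """Hypothetical validator: reject bad scan_from values up front."""
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(value, bool):
        raise TypeError("scan_from must be a str or int, not bool")
    if isinstance(value, int):
        if value <= 0:
            raise ValueError(f"snapshot ID must be positive, got {value}")
        return value
    if isinstance(value, str):
        if value in _SCAN_FROM_KEYWORDS:
            return value
        raise ValueError(
            f"scan_from must be 'latest', 'earliest', or a snapshot ID, got {value!r}"
        )
    raise TypeError(f"scan_from must be a str or int, not {type(value).__name__}")
```

Calling something like this inside the builder method would surface a bad argument at configuration time instead of inside the scan loop.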

I noticed the latest Python lint jobs are still in progress, so I am leaving this as a review comment rather than approval for now. If those checks turn green, this looks good to merge from my side.

tub and others added 5 commits June 15, 2026 14:13
…reamReadBuilder

- Add `paimon table stream <db.table>` CLI subcommand that continuously
  polls a table and prints new rows as they arrive until Ctrl+C
- Flags: --select, --where, --format (table|json), --from, --poll-interval-ms,
  --include-row-kind, --consumer-id
- --from accepts 'latest' (default), 'earliest', a numeric snapshot ID,
  or a timestamp string (YYYY-MM-DD, ISO 8601 with/without timezone)
- Add StreamReadBuilder.with_scan_from() accepting 'latest', 'earliest',
  or an integer snapshot ID; passed through to AsyncStreamingTableScan
  with consumer restore taking highest priority over scan_from
- Tests: 7 new unit tests in stream_read_builder_test.py, 4 integration
  tests in streaming_table_scan_test.py, 20 tests in cli_table_stream_test.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…check in table stream CLI

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…pport named timezones

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… to fix flake8

Co-Authored-By: Claude <noreply@anthropic.com>
…patching module-level SnapshotManager

Upstream 4aec277 removed the SnapshotManager import from streaming_table_scan.py
in favor of table.snapshot_manager(). Update the ScanFromTest tests to mock via
table.snapshot_manager.return_value instead of @patch at the module level.

Co-Authored-By: Claude <noreply@anthropic.com>
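
The mocking change in the last commit (configuring table.snapshot_manager.return_value instead of patching a module-level name) follows a standard unittest.mock pattern. A minimal sketch, where `get_latest_snapshot`, the `id` attribute, and `latest_snapshot_id` are hypothetical stand-ins rather than the names used in the actual tests:

```python
from unittest.mock import MagicMock


def latest_snapshot_id(table):
    """Hypothetical code under test: obtains the manager from the table."""
    return table.snapshot_manager().get_latest_snapshot().id


# Configure the object returned by table.snapshot_manager() directly,
# so no module-level @patch is needed.
table = MagicMock()
manager = table.snapshot_manager.return_value
manager.get_latest_snapshot.return_value = MagicMock(id=7)

result = latest_snapshot_id(table)
table.snapshot_manager.assert_called_once_with()
```

Because the mock hangs off the table instance, the test keeps working even if the module under test stops importing the manager class.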
@tub tub force-pushed the python-streaming-2e-cli-stream branch from d4b57c1 to a37049f June 15, 2026 13:15

tub commented Jun 15, 2026

Contributor Author

Rebased & fixed up the tests, sorry for the delay, was out of the office for a while.

@tub tub requested review from JingsongLi and leaves12138 June 15, 2026 13:43