Add parquet for scatter plots in CLI #79

Open
mud2monarch wants to merge 16 commits into Psy-Fer:dev from mud2monarch:m2m/feat-parquet
@mud2monarch

mud2monarch commented May 22, 2026

Description

What?

Adds support for generating scatter plots in the CLI from parquet source data, at feature parity with CSV/TSV/{D}SV. Gated behind --features parquet.
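For readers unfamiliar with Cargo feature gating, here is a minimal sketch of what an optional parquet code path can look like. The names InputFormat, format_for_path, and parquet_or_error are hypothetical and are not taken from the kuva source; the sketch only assumes a Cargo feature named parquet, as stated above.

```rust
// Hypothetical sketch: none of these names come from the kuva source.
#[derive(Debug, PartialEq)]
enum InputFormat {
    Delimited,
    Parquet,
}

/// Pick an input format from the file extension.
fn format_for_path(path: &str) -> Result<InputFormat, String> {
    if path.to_ascii_lowercase().ends_with(".parquet") {
        parquet_or_error(path)
    } else {
        Ok(InputFormat::Delimited)
    }
}

// Compiled only when the crate is built with the `parquet` feature enabled.
#[cfg(feature = "parquet")]
fn parquet_or_error(_path: &str) -> Result<InputFormat, String> {
    Ok(InputFormat::Parquet)
}

// Fallback for builds without the feature: fail with an actionable message.
#[cfg(not(feature = "parquet"))]
fn parquet_or_error(path: &str) -> Result<InputFormat, String> {
    Err(format!(
        "{}: parquet support is not compiled in; rebuild with --features parquet",
        path
    ))
}
```

With this shape, a default build carries no parquet dependency, and a user who passes a .parquet file gets a clear error instead of a parse failure.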

Why?

Parquet is a widely used modern standard for data storage, especially for large datasets. It is typed, stored in columns, and supports predicate pushdown and column selection, which makes reads much more efficient than with string-delimited text formats.

CLI users may expect, or at least appreciate, the ability to generate charts directly from parquet, without having to downcast to a weaker format like CSV first.

Type of change

New feature / API addition (CLI only)


Checklist

Library (new plot type)

N/A

Tests

  • New test file in tests/ with ≥ basic render + SVG content + legend tests (added to tests/cli_basic.rs)
  • cargo test --features cli,full — all existing tests still pass

CLI (if applicable)

  • src/bin/kuva/<name>.rs — Args struct (with /// doc comment) + run() = N/A no new command
  • src/bin/kuva/main.rs — module, Commands variant, match arm = N/A no new command
  • scripts/smoke_tests.sh — at least one invocation
  • tests/cli_basic.rs — SVG output test + content verification test
  • docs/src/cli/index.md — subcommand entry = N/A no new command
  • man/kuva.1 — regenerated (./target/debug/kuva man > man/kuva.1) = N/A no new command

Documentation

  • examples/<name>.rs — Rust example for doc asset generation = N/A no new API
  • scripts/gen_docs.sh — invocations added; bash scripts/gen_docs.sh runs clean
  • docs/src/plots/<name>.md — documentation page with embedded SVGs
  • docs/src/SUMMARY.md — link added = N/A. Given the 'it just works' philosophy in the "Add Polars support" (#78) discussion, I chose not to add any indication that parquet works for scatter.
  • docs/src/gallery.md — gallery card added = N/A no new plot
  • README.md — plot types table updated = N/A, no new plot

Visual inspection

  • Opened test_outputs/ — new plot SVGs look correct
  • Scanned neighbouring plots in test_outputs/ for layout regressions
  • bash scripts/smoke_tests.sh — all existing smoke test outputs still look correct
  • No text clipped, no legend overlap, no spurious axes on pixel-space plots

Housekeeping

  • CHANGELOG.md — entry added under ## [Unreleased]
  • README.md — item marked done in TODO section if applicable = N/A

Adding a new feature (non-plot-type)

  • Implement in the relevant src/ file(s).
  • Add tests covering the new behaviour — both a positive case and at least one edge case.
  • Update the relevant docs/src/ page(s) if the feature is user-visible.
  • If the feature affects rendered output, run the visual inspection steps above.
  • CHANGELOG.md — add an entry under ## [Unreleased].

@mud2monarch mud2monarch marked this pull request as draft May 22, 2026 16:10
@mud2monarch

Author

Hi @Psy-Fer, this is ready for review!

Per our discussion in #78, this initial implementation is:

  • Feature gated
  • Specific to scatter plots
  • Minimal dependencies. Bytes is required for stdin support.
  • Feature parity with csv/tsv, and data model parity as much as possible, too.
  • Minimally surfaced to the CLI user. Just works.

I think parquet support is very nice for any modern data library. As a user, I try to stay in parquet as much as possible because it's 1) typed, 2) faster/more efficient, and 3) more compact. I think it's much more important than Rust Polars support, and is a more effective bridge to other tools, such as DuckDB. It can also serve as a future jumping-off-point to support in-memory Arrow data.

Please review the code and share any comments. If you're happy with this implementation and desire this feature, it'd be great to merge this then discuss how to add parquet support for the rest of the CLI.

@mud2monarch mud2monarch marked this pull request as ready for review May 22, 2026 17:53
@Psy-Fer

Psy-Fer commented May 23, 2026

Owner

This is great!
You absolutely cooked.

I'll go through this in the coming week.
Most likely I'll merge it, tweak it, test it, and then extend it to the rest of the library. Then we can go from there.

Cheers mate,
James

@Psy-Fer

Psy-Fer commented May 23, 2026

Owner

I'll fix up the clippy noise after the merge. Nothing major there.

@mud2monarch

Author

Great, I'm glad this works! I'm curious to see how you decide to extend it to the rest of the library.

@mud2monarch

Author

Hey @Psy-Fer, I imagine you're busy with the rest of your life right now -- anything I can do to help out with Kuva more? Would you appreciate a design/suggestion for how to implement parquet support across all plots? Feel free to drop any loose threads of thinking in here if so.

@Psy-Fer

Psy-Fer commented Jun 8, 2026

Owner

Hey,

Yeah, getting back to kuva this week. Things have been a bit crazy lately. This is high on my list to do.

Cheers,
James

Psy-Fer and others added 4 commits June 11, 2026 19:00
- Drop dep:arrow (removes ~60 transitive crates)
- Move parquet reading into DataTable::parse() so all subcommands
  benefit automatically; delete parquet.rs and InputType enum
- Simplify scatter.rs back to single DataTable code path
- Fix docs typo in scatter.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Psy-Fer

Psy-Fer commented Jun 11, 2026

Owner

Okay, I have some strong opinions around including arrow.

I have come up with a way to just bring in parquet, and use its row API to process the data. Even on 2 rows with 10M datapoints, it's still crazy fast. If this is an issue in the future, we can bring in some of arrow for that specifically, but I'd rather it be a complaint-driven request when we can get around most of it with some simple type handling.

I also moved the parsing to data.rs, so it applies globally to all plots. Your scatter-scoped example really did help in thinking about how to handle this properly.
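As an illustration of how a single parse entry point can serve every subcommand, here is a minimal sketch that classifies buffered input by the Parquet magic bytes before choosing a reader. It is an assumption about one possible approach, not a description of what data.rs actually does; Detected and detect_format are hypothetical names. The only external fact relied on is that standard Parquet files begin and end with the 4-byte marker PAR1.

```rust
// Hypothetical sketch; the names are illustrative, not kuva's actual API.
const PARQUET_MAGIC: &[u8; 4] = b"PAR1";

#[derive(Debug, PartialEq)]
enum Detected {
    Parquet,
    Delimited,
}

/// Classify a fully buffered input (for example, everything read from stdin).
fn detect_format(bytes: &[u8]) -> Detected {
    // A Parquet file holds at least: leading magic (4 bytes),
    // footer length (4 bytes), and trailing magic (4 bytes).
    let long_enough = bytes.len() >= 12;
    if long_enough && bytes.starts_with(PARQUET_MAGIC) && bytes.ends_with(PARQUET_MAGIC) {
        Detected::Parquet
    } else {
        // Anything else falls through to the existing delimited-text path.
        Detected::Delimited
    }
}
```

Sniffing content rather than trusting a file extension also covers piped input, where no file name is available.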

Plus a few other cleanups: conflicts, clippy, etc.

Anyway, it might be worth you testing this PR on your parquet files before I merge it, in case I made a dumb mistake along the way.

Cheers,
James

@mud2monarch

Author

Thanks, James! If the priority is to avoid dependencies then I think your implementation is definitely the right way to go.

However, I do want to highlight the resource cost of forgoing parquet's benefits by operating row-wise and materializing everything to a String. I'm sure you're aware of these costs, but I want to provide a numerical benchmark for additional discussion (chart generated with kuva scatter *.parquet of course 😄 ).

I expanded the sample data to a parquet with 1M rows, in two test cases: 10 columns and 50 columns. This is a realistic benchmark that, imo, is newly enabled by Kuva CLI. For example, consider a nanopore sequencing workflow that has one Parquet file with per-read or per-window metrics: read length, mean quality, signal-level summaries, derived QC features, etc. One could then use Kuva CLI from a small shell script to quickly generate scatter plots for pairs of metrics, looking for QC issues or relationships between features. A job could even run nightly or in CI to quickly catch any quality regressions.

This is especially likely in scripted or agent-assisted workflows where the same wide Parquet can now be reused for many quick plots.

Here's the wall time (x axis, seconds) vs. memory use (y axis, bytes) for the 10 column and 50 column tests.

[Figure: benchmark_comparison]

Numerically:

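| Cols | Implementation  | Memory use | Time (s) | Memory diff | Time diff |
|------|-----------------|------------|----------|-------------|-----------|
| 10   | projected_arrow | 121 MB     | 0.05     |             |           |
| 10   | row_string      | 915 MB     | 1.71     | 7.6x        | 35.9x     |
| 50   | projected_arrow | 120 MB     | 0.05     |             |           |
| 50   | row_string      | 3,521 MB   | 7.31     | 29.3x       | 153.9x    |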

As expected, the cost of the columnar projection + Arrow types implementation stays flat as the parquet gets wider, while the row-wise String implementation requires much more memory and time as columns are added.
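To make the scaling argument concrete, here is a back-of-envelope sketch in Rust. It is an estimate under stated assumptions, not a model of either implementation: projected_bytes assumes the two plotted columns are held as f64 values, and row_string_bytes assumes every cell of every column becomes an owned String, counting only the String header plus the text and ignoring allocator overhead and per-row containers. Both function names are hypothetical.

```rust
use std::mem::size_of;

/// Bytes needed to hold only the projected columns as f64 values.
/// Note that the total column count of the file does not appear here.
fn projected_bytes(rows: usize, projected_cols: usize) -> usize {
    rows * projected_cols * size_of::<f64>()
}

/// Lower bound on the bytes needed to hold every cell as an owned String:
/// the String header (pointer, capacity, length) plus the text itself.
fn row_string_bytes(rows: usize, total_cols: usize, avg_text_len: usize) -> usize {
    rows * total_cols * (size_of::<String>() + avg_text_len)
}
```

On a 64-bit target a String header is 24 bytes, so 1M rows by 50 columns costs at least 1.2 GB in headers alone before any text is stored, while two f64 columns of 1M rows need 16 MB. That points in the same direction as the measurements in the table, which presumably also include decoding buffers and allocator overhead that this estimate leaves out.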

I think this is on the edge of making the imagined use case unusable. Generating 20 quick-flippable scatters from a 50-column parquet would take about 1 second vs. about 2.5 minutes. Or, you might even be limited by memory if you have a small box for running certain jobs.
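The estimate for 20 plots follows directly from the per-plot timings in the 50-column rows of the table. A small sketch of that arithmetic, assuming the plots run sequentially and each run takes the measured wall time (total_seconds is a hypothetical helper):

```rust
/// Total wall time in seconds for `plots` sequential runs.
fn total_seconds(plots: u32, seconds_per_plot: f64) -> f64 {
    f64::from(plots) * seconds_per_plot
}

/// Convert seconds to minutes.
fn to_minutes(seconds: f64) -> f64 {
    seconds / 60.0
}
```

With the rounded table values, total_seconds(20, 0.05) is about 1 second for projected_arrow, and total_seconds(20, 7.31) is about 146 seconds for row_string, or roughly 2.4 minutes.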

Just want to throw up some concrete numbers to inform discussion. Can you share more detail on the dependency concern? Perhaps there's another way to get at the same goal. IMO, Arrow is a high-quality dependency (developed by the Apache Software Foundation) and is what enables the distinguishing features of parquet. It's already feature-gated behind parquet.

@Psy-Fer

Psy-Fer commented Jun 11, 2026

Owner

I love raw numbers. thanks for doing this benchmark.

I'll have a think about it and have another crack at it tomorrow if I get some time, otherwise, next week :)

Cheers,
James
