Add parquet for scatter plots in CLI by mud2monarch · Pull Request #79 · Psy-Fer/kuva

mud2monarch · 2026-05-22T16:08:10Z

Description

What?

Adds support for generating scatter plots from the CLI using Parquet source data, at feature parity with CSV/TSV/{D}SV. Gated behind --features parquet.

Why?

Parquet is the modern standard for data storage, especially for large datasets. It is typed, stored in columns, and supports predicate pushdown and column selection, which makes reading it much more efficient than reading string-based, delimiter-separated file formats.
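
As a rough illustration of why typed column selection is cheaper than parsing delimited text, here is a minimal std-only sketch. It is not the parquet crate's API: `ColumnarTable`, `Column`, `project_f64`, and `parse_csv_column` are hypothetical names invented for this example, and the sample column names are made up.

```rust
// Hypothetical in-memory model of a typed, columnar table (not the parquet crate).
#[derive(Debug, Clone, PartialEq)]
enum Column {
    Float(Vec<f64>),
    Text(Vec<String>),
}

struct ColumnarTable {
    names: Vec<String>,
    columns: Vec<Column>,
}

impl ColumnarTable {
    /// Column selection: borrow one numeric column without touching the others.
    fn project_f64(&self, name: &str) -> Option<&[f64]> {
        let idx = self.names.iter().position(|n| n == name)?;
        match &self.columns[idx] {
            Column::Float(values) => Some(values.as_slice()),
            Column::Text(_) => None,
        }
    }
}

/// Delimited-text equivalent: every row is split and the wanted field is parsed from a string.
fn parse_csv_column(csv: &str, name: &str) -> Option<Vec<f64>> {
    let mut lines = csv.lines();
    let header = lines.next()?;
    let idx = header.split(',').position(|h| h == name)?;
    lines
        .map(|line| line.split(',').nth(idx)?.parse::<f64>().ok())
        .collect()
}

fn main() {
    let table = ColumnarTable {
        names: vec!["read_length".into(), "mean_quality".into(), "sample".into()],
        columns: vec![
            Column::Float(vec![1200.0, 850.0]),
            Column::Float(vec![12.5, 9.8]),
            Column::Text(vec!["a".into(), "b".into()]),
        ],
    };
    let csv = "read_length,mean_quality,sample\n1200,12.5,a\n850,9.8,b";

    // Both paths yield the same values; only the work done to get them differs.
    let from_columns = table.project_f64("mean_quality").unwrap();
    let from_csv = parse_csv_column(csv, "mean_quality").unwrap();
    assert_eq!(from_columns, from_csv.as_slice());
    println!("{:?}", from_columns);
}
```

The columnar path returns a borrowed slice of already-typed values, while the text path has to scan every line and parse one field per row.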

CLI users may expect, or at least appreciate, the ability to generate charts directly from parquet, without having to downcast to a weaker format like CSV first.

Type of change

New feature / API addition (CLI only)

Checklist

Library (new plot type)

N/A

Tests

New test file in tests/ with ≥ basic render + SVG content + legend tests (added to tests/cli_basic.rs)
cargo test --features cli,full — all existing tests still pass

CLI (if applicable)

src/bin/kuva/<name>.rs — Args struct (with /// doc comment) + run() = N/A no new command
src/bin/kuva/main.rs — module, Commands variant, match arm = N/A no new command
scripts/smoke_tests.sh — at least one invocation
tests/cli_basic.rs — SVG output test + content verification test
docs/src/cli/index.md — subcommand entry = N/A no new command
man/kuva.1 — regenerated (./target/debug/kuva man > man/kuva.1) = N/A no new command

Documentation

examples/<name>.rs — Rust example for doc asset generation = N/A no new API
scripts/gen_docs.sh — invocations added; bash scripts/gen_docs.sh runs clean
docs/src/plots/<name>.md — documentation page with embedded SVGs
docs/src/SUMMARY.md — link added = N/A. Given the 'it just works' philosophy in Add Polars support #78 discussion, I chose not to add any indication that parquet works for scatter.
docs/src/gallery.md — gallery card added = N/A no new plot
README.md — plot types table updated = N/A, no new plot

Visual inspection

Opened test_outputs/ — new plot SVGs look correct
Scanned neighbouring plots in test_outputs/ for layout regressions
bash scripts/smoke_tests.sh — all existing smoke test outputs still look correct
No text clipped, no legend overlap, no spurious axes on pixel-space plots

Housekeeping

CHANGELOG.md — entry added under ## [Unreleased]
README.md — item marked done in TODO section if applicable = N/A

Adding a new feature (non-plot-type)

Implement in the relevant src/ file(s).
Add tests covering the new behaviour — both a positive case and at least one edge case.
Update the relevant docs/src/ page(s) if the feature is user-visible.
If the feature affects rendered output, run the visual inspection steps above.
CHANGELOG.md — add an entry under ## [Unreleased].

mud2monarch · 2026-05-22T17:53:47Z

Hi @Psy-Fer, this is ready for review!

Per our discussion in #78, this initial implementation is:

Feature gated
Specific to scatter plots
Minimal dependencies. Bytes is required for stdin support.
Feature parity with csv/tsv, and data model parity as much as possible, too.
Minimally surfaced to the CLI user. Just works.

I think parquet support is very nice for any modern data library. As a user, I try to stay in parquet as much as possible because it's 1) typed, 2) faster/more efficient, and 3) more compact. I think it's much more important than Rust Polars support, and is a more effective bridge to other tools, such as DuckDB. It can also serve as a future jumping-off-point to support in-memory Arrow data.

Please review the code and share any comments. If you're happy with this implementation and desire this feature, it'd be great to merge this then discuss how to add parquet support for the rest of the CLI.

Psy-Fer · 2026-05-23T04:35:02Z

This is great!
You absolutely cooked.

I'll go through this in the coming week.
Most likely I'll merge it, tweak it, test it, then extend it to the rest of the library after that tweaking. Then we can go from there.

Cheers mate,
James

Psy-Fer · 2026-05-23T04:39:04Z

I'll fix up the clippy noise after the merge. Nothing major there.

mud2monarch · 2026-05-23T13:22:52Z

Great, I'm glad this works! I'm curious to see how you decide to extend it to the rest of the library.

mud2monarch · 2026-06-08T14:38:53Z

Hey @Psy-Fer, I imagine you're busy with the rest of your life right now -- anything I can do to help out with Kuva more? Would you appreciate a design/suggestion for how to implement parquet support across all plots? Feel free to drop any loose threads of thinking in here if so.

Psy-Fer · 2026-06-08T14:50:18Z

Hey,

Yeah, getting back to kuva this week. Things have been a bit crazy lately. This is high on my list to do.

Cheers,
James

- Drop dep:arrow (removes ~60 transitive crates)
- Move parquet reading into DataTable::parse() so all subcommands benefit automatically; delete parquet.rs and InputType enum
- Simplify scatter.rs back to single DataTable code path
- Fix docs typo in scatter.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Psy-Fer · 2026-06-11T09:19:20Z

Okay, I have some strong opinions around including arrow.

I have come up with a way to just bring in parquet and use its row API to process the data. Even on 2 rows with 10M datapoints, it's still crazy fast. If this is an issue in the future, we can bring in some of arrow for that specifically, but I'd rather it be a complaint-driven request when we can get around most of it with some simple type handling.

I also moved the parsing to data.rs, so it applies globally to all plots. Your scatter-scoped example really did help in thinking about how to handle this properly.

Plus a few other cleanups: conflicts, clippy, etc.

Anyway, it might be worth you testing this PR on your parquet files before I merge it, in case I made a dumb mistake along the way.

Cheers,
James

mud2monarch · 2026-06-11T12:15:51Z

Thanks, James! If the priority is to avoid dependencies then I think your implementation is definitely the right way to go.

However, I do want to highlight the resource cost of forgoing the benefits of parquet by operating row-wise and materializing everything to a String. I'm sure you're aware of these costs, but I want to provide a numerical benchmark for additional discussion (chart generated with kuva scatter *.parquet of course 😄 ).

I expanded the sample data to parquet files of 1M rows by 10 columns and by 50 columns (two test cases). This is a realistic benchmark that, imo, is newly enabled by Kuva CLI. For example, consider a nanopore sequencing workflow that has one Parquet file with per-read or per-window metrics: read length, mean quality, signal-level summaries, derived QC features, etc. One could then use Kuva CLI from a small shell script to quickly generate scatter plots for pairs of metrics, looking for QC issues or relationships between features. A job could even run nightly or in CI to quickly catch any quality regressions.

This is especially likely in scripted or agent-assisted workflows where the same wide Parquet can now be reused for many quick plots.

Here's the wall time (x axis, seconds) vs. memory use (y axis, bytes) for the 10 column and 50 column tests.

Numerically:

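| cols | implementation | Memory use | Time (s) | Memory diff | Time diff |
|------|-----------------|------------|----------|-------------|-----------|
| 10 | projected_arrow | 121 MB | 0.05 | | |
| 10 | row_string | 915 MB | 1.71 | 7.6x | 35.9x |
| 50 | projected_arrow | 120 MB | 0.05 | | |
| 50 | row_string | 3,521 MB | 7.31 | 29.3x | 153.9x |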

As expected, the cost of the columnar projection + Arrow types implementation does not grow with the width of the parquet file, while the row-wise + String implementation requires much more memory and time as columns are added.
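
A back-of-the-envelope sketch of where that memory goes, assuming (as described above) that the row-wise path holds every cell as an owned `String` and that a scatter plot only needs two projected `f64` columns. The function names are hypothetical, and the numbers are a lower bound for illustration, not a measurement of either implementation.

```rust
use std::mem::size_of;

/// Lower bound on bytes needed to hold every cell as an owned `String`.
/// Counts only the inline `String` header; heap contents are extra.
fn string_table_min_bytes(rows: usize, cols: usize) -> usize {
    rows * cols * size_of::<String>()
}

/// Bytes needed to hold only the projected columns as `f64` values.
fn projected_f64_bytes(rows: usize, projected_cols: usize) -> usize {
    rows * projected_cols * size_of::<f64>()
}

fn main() {
    let rows = 1_000_000;
    let as_strings = string_table_min_bytes(rows, 50);
    let projected = projected_f64_bytes(rows, 2);
    println!("all cells as String headers: {} MB", as_strings / 1_000_000);
    println!("two projected f64 columns:   {} MB", projected / 1_000_000);
}
```

On a typical 64-bit target, where a `String` header is 24 bytes, this comes to roughly 1,200 MB of headers for the 50-column case against 16 MB for two `f64` columns, before counting the heap contents of each string.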

I think this is on the edge of making the imagined use case unusable. Generating 20 quick-flippable scatters from a 50-column parquet would take about 1 second vs. about 2.5 minutes. Or you might even be limited by memory if you have a small box for running certain jobs.
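
The estimate can be checked by multiplying the per-plot wall times from the 50-column rows of the table by 20. This is a quick sketch that assumes each plot is an independent invocation with the same cost; `total_seconds` is a hypothetical helper.

```rust
/// Total wall time for `plots` independent renders at `secs_per_plot` each.
fn total_seconds(plots: u32, secs_per_plot: f64) -> f64 {
    f64::from(plots) * secs_per_plot
}

fn main() {
    let plots = 20;
    let arrow = total_seconds(plots, 0.05); // projected_arrow, 50 columns
    let row_string = total_seconds(plots, 7.31); // row_string, 50 columns

    // prints "projected_arrow: 1.0 s"
    println!("projected_arrow: {:.1} s", arrow);
    // prints "row_string: 146.2 s (2.4 min)"
    println!("row_string: {:.1} s ({:.1} min)", row_string, row_string / 60.0);
}
```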

Just want to throw up some concrete numbers to inform discussion. Can you share more detail on the dependency concern? Perhaps there's another way to get at the same goal. IMO, Arrow is a high-quality dependency (developed by the Apache Software Foundation) and is what enables the distinguishing features of parquet. It's already feature-gated behind parquet.

Psy-Fer · 2026-06-11T12:27:40Z

I love raw numbers. Thanks for doing this benchmark.

I'll have a think about it and have another crack at it tomorrow if I get some time, otherwise, next week :)

Cheers,
James

Psy-Fer and others added 7 commits May 13, 2026 13:52

update gitcount badge

62c1c47

parquet builders and colspec matching

9d3aa3a

some progress

5d273f3

complete parsing

0a5a488

started on grouping

21df9ea

draft impl

5513ff8

fully working e2e now need to go through pr tasks

de7ce2a

mud2monarch marked this pull request as draft May 22, 2026 16:10

mud2monarch added 5 commits May 22, 2026 12:13

remove unnecessary png

ad3e654

bash script led to new chord svg but no visual diff

7c39f90

added integration tests for parquet

013aa05

cargo fmt

cc97326

updated docs and comments

1dcfc91

mud2monarch marked this pull request as ready for review May 22, 2026 17:53

Psy-Fer and others added 4 commits June 11, 2026 19:00

Merge branch 'dev' into pr-79-parquet

ad4b4a3

fix group_by logic to be O(n) instead of O(n^2)

2217a17

fmt fix

8edac5c
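
The code behind the "fix group_by logic to be O(n) instead of O(n^2)" commit is not shown here. As a generic illustration of single-pass grouping (not the PR's implementation), the sketch below uses a hash map from label to output slot so each row is handled once instead of scanning the existing groups per row. `group_points` and its tuple layout are hypothetical.

```rust
use std::collections::HashMap;

/// Group (label, x, y) points by label in one pass, keeping first-seen label order.
fn group_points(rows: &[(&str, f64, f64)]) -> Vec<(String, Vec<(f64, f64)>)> {
    // Maps each label to its position in `groups`, so lookup is O(1) on average.
    let mut index: HashMap<&str, usize> = HashMap::new();
    let mut groups: Vec<(String, Vec<(f64, f64)>)> = Vec::new();
    for &(label, x, y) in rows {
        let slot = *index.entry(label).or_insert_with(|| {
            // First time this label is seen: open a new group at the end.
            groups.push((label.to_string(), Vec::new()));
            groups.len() - 1
        });
        groups[slot].1.push((x, y));
    }
    groups
}

fn main() {
    let rows = [("a", 1.0, 2.0), ("b", 3.0, 4.0), ("a", 5.0, 6.0)];
    for (label, points) in group_points(&rows) {
        println!("{label}: {points:?}");
    }
}
```

A quadratic version would typically search the list of existing groups for every row; the map lookup removes that inner scan.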

mud2monarch wants to merge 16 commits into Psy-Fer:dev from mud2monarch:m2m/feat-parquet

2 participants
