New sub-module: plugin-transform-arrow (jq queries over Arrow/Parquet/CSV/NDJSON) #70

@fdelbrayelle

Summary

Add a new plugin-transform-arrow sub-module implementing aq, a tool that applies jq-style filter expressions to columnar and row-oriented data files (Parquet, Arrow IPC, CSV, NDJSON).


What aq does

aq is "jq for Apache Arrow". Its pipeline:

  1. Read — auto-detect and read Parquet, Arrow IPC, CSV, or NDJSON into Arrow RecordBatch objects
  2. Serialize — convert each row to a JSON object (via NDJSON intermediate)
  3. Filter/transform — apply a jq expression row-by-row (or in slurp mode over all rows)
  4. Output — write results as NDJSON, JSON, CSV, TSV, or Arrow IPC
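As a rough illustration of steps 1 and 3, here is a minimal JDK-only sketch. Everything in it is hypothetical: the class and method names are invented, detection by leading bytes is an assumption about how auto-detection could work (only the Parquet `PAR1` and Arrow IPC file `ARROW1` magic strings are real), and a plain `Function` stands in for a compiled jq expression. It is not aq's or the plugin's actual code.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical names throughout; a sketch, not the plugin implementation.
final class AqPipelineSketch {

    enum Format { PARQUET, ARROW_IPC_FILE, NDJSON, CSV }

    // Step 1 (assumed strategy): sniff the leading bytes of the input.
    static Format detect(byte[] head) {
        if (startsWith(head, "PAR1")) return Format.PARQUET;          // Parquet magic
        if (startsWith(head, "ARROW1")) return Format.ARROW_IPC_FILE; // Arrow IPC file magic
        for (byte b : head) {
            if (b == ' ' || b == '\t' || b == '\r' || b == '\n') continue;
            // First non-blank byte: '{' suggests one JSON object per line.
            return b == '{' ? Format.NDJSON : Format.CSV;
        }
        return Format.CSV; // empty input: arbitrary fallback for this sketch
    }

    private static boolean startsWith(byte[] data, String magic) {
        byte[] m = magic.getBytes(StandardCharsets.US_ASCII);
        return data.length >= m.length
            && Arrays.equals(Arrays.copyOf(data, m.length), m);
    }

    // Step 3: the filter maps one input value to zero or more outputs,
    // the way a jq expression can emit several results per input.
    static List<Object> run(List<Map<String, Object>> rows,
                            Function<Object, List<Object>> filter,
                            boolean slurp) {
        if (slurp) {
            // Slurp mode: a single call that sees all rows as one array.
            return filter.apply(rows);
        }
        List<Object> out = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            // Row mode: one call per row, with the results concatenated.
            out.addAll(filter.apply(row));
        }
        return out;
    }
}
```

In row mode a filter behaves like jq's `select(...)` applied to each record. In slurp mode the same call receives the whole list, which is what makes aggregate-style expressions such as `length` possible.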

Why plugin-transform, not plugin-serdes

plugin-serdes is purely a format converter (file ↔ Ion records) with no query or filter logic. aq's core value is the jq query engine applied to columnar data, which belongs in plugin-transform alongside the existing plugin-transform-json (JSONata) and plugin-transform-records (SQL-like filter/map/aggregate).


Why plugin-transform-arrow, not plugin-transform-jq

The name should reflect the data format, not the query language — consistent with how plugin-transform-json is named after JSON, not after JSONata. The differentiator here is Arrow/Parquet input support; jq is just the query mechanism. It also leaves room to add non-jq tasks later (schema inspection, format conversion within the Arrow ecosystem, etc.).


Proposed tasks

| Task | Description |
|------|-------------|
| Query | Apply a jq expression to an Arrow/Parquet/CSV/NDJSON file, output as Ion records or file |

Mirrors the Transform / TransformItems pattern from plugin-transform-json.


Java implementation

The Rust-to-Java mapping is straightforward:

| aq dependency (Rust) | Java equivalent |
|----------------------|-----------------|
| parquet crate | org.apache.parquet:parquet-arrow |
| arrow (IPC, CSV, JSON reader/writer) | org.apache.arrow:arrow-vector, arrow-dataset, arrow-ipc |
| Row → NDJSON serialization | Arrow Java JSON writer (ArrowToJson) |
| jaq-core (pure-Rust jq engine) | jackson-jq (best pure-Java jq implementation) |
| serde_json | Jackson Databind |

Note on jackson-jq: it doesn't implement the full jq spec (missing $ENV, input/inputs, some path builtins). This is the same tradeoff aq itself makes (it uses jaq, not the reference jq), so it's acceptable for the target use cases.

