
[Track] Alternate writers for canonical formats (parquet-mr, parquet-go, ...) #6

@mprammer

Add optional alternate writers for canonical formats, starting with Parquet, alongside the default pyarrow output. File-format research benefits from corpora produced by multiple writer implementations: encoding choices, page sizing, dictionary thresholds, and statistics policies differ per library, and those differences shape downstream compression and pushdown evaluation.

Likely shares machinery with #5; both produce additional sibling artifacts under the slug's output directory.

Per writer

  • New convert stage variant (or generalised stage that dispatches on writer + format).
  • Extend sources.json: per-writer flag and skip-reason, e.g. convert.parquet_java.
  • Update validate_manifest invariants.
  • Outputs at outputs/v{n}/<slug>/<fmt>-<writer>/<slug>.<ext> (e.g. parquet-java/).
  • Regen docs/datasets.md + docs/snapshot.json.
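A minimal sketch of the output layout and one possible manifest invariant, to make the checklist above concrete. The path shape comes from this issue. The entry schema (`enabled`, `skip_reason`) and the function names are assumptions for illustration, not the current `sources.json` or `validate_manifest` contract.

```python
from pathlib import PurePosixPath
from typing import List


def sibling_output_path(version: int, slug: str, fmt: str, writer: str, ext: str) -> PurePosixPath:
    """Build outputs/v{n}/<slug>/<fmt>-<writer>/<slug>.<ext>."""
    return PurePosixPath("outputs", f"v{version}", slug, f"{fmt}-{writer}", f"{slug}.{ext}")


def check_writer_entry(key: str, entry: dict) -> List[str]:
    """Return invariant violations for one per-writer entry.

    Assumed (hypothetical) entry shape: {"enabled": bool, "skip_reason": str}.
    Assumed invariant: a writer is either enabled, or disabled with a reason.
    """
    errors = []
    enabled = entry.get("enabled")
    skip_reason = entry.get("skip_reason")
    if not isinstance(enabled, bool):
        errors.append(f"{key}: 'enabled' must be a boolean")
    elif enabled and skip_reason:
        errors.append(f"{key}: enabled writers must not carry a skip_reason")
    elif not enabled and not skip_reason:
        errors.append(f"{key}: disabled writers need a skip_reason")
    return errors


# "taxi" is a made-up slug used only for this example.
print(sibling_output_path(1, "taxi", "parquet", "java", "parquet"))
print(check_writer_entry("convert.parquet_java", {"enabled": False}))
```

Returning a list of violations rather than raising on the first one would let `validate_manifest` report every broken entry in a single run, if that matches how the existing checks behave.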

Writers in scope

  • parquet-mr (Java) — reference writer; subprocess via java -jar.
  • parquet-go — Go-native writer; subprocess.
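One way the generalised stage could dispatch on writer + format is a small registry of command builders, sketched below. Only the `java -jar` invocation style and the use of subprocesses come from this issue. The jar name, the Go binary name, the argument order, and the idea of turning a failed run into a skip-reason string are all assumptions.

```python
import subprocess
from typing import Callable, Dict, List, Optional, Tuple


# Hypothetical artifact names and CLI arguments.
def parquet_mr_command(src: str, dst: str) -> List[str]:
    return ["java", "-jar", "parquet-mr-writer.jar", src, dst]


def parquet_go_command(src: str, dst: str) -> List[str]:
    return ["parquet-go-writer", src, dst]


# Registry keyed on (format, writer); adding a writer means adding one entry.
COMMANDS: Dict[Tuple[str, str], Callable[[str, str], List[str]]] = {
    ("parquet", "parquet-mr"): parquet_mr_command,
    ("parquet", "parquet-go"): parquet_go_command,
}


def build_command(fmt: str, writer: str, src: str, dst: str) -> List[str]:
    """Look up the command builder for a (format, writer) pair."""
    try:
        builder = COMMANDS[(fmt, writer)]
    except KeyError:
        raise ValueError(f"no writer registered for {fmt}/{writer}") from None
    return builder(src, dst)


def run_writer(fmt: str, writer: str, src: str, dst: str) -> Optional[str]:
    """Run the external writer; return None on success, else a skip reason."""
    cmd = build_command(fmt, writer, src, dst)
    try:
        subprocess.run(cmd, check=True, capture_output=True, text=True)
    except FileNotFoundError:
        return f"{cmd[0]} not found on PATH"
    except subprocess.CalledProcessError as exc:
        return f"{cmd[0]} exited with status {exc.returncode}"
    return None
```

A missing toolchain (no JVM, no Go binary) would then surface as a recorded skip reason instead of failing the whole convert stage, which seems to fit the optional nature of these writers.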

Labels

  • enhancement: New feature or request
  • tracking-issue: Shared implementation context for work likely to span multiple PRs.
