Add optional alternate writers for canonical formats — Parquet first — alongside the default pyarrow output. File-format research benefits from corpora produced by multiple writer implementations: encoding choices, page sizing, dictionary thresholds, and stats policies differ per library and shape downstream compression / pushdown evaluation.
Likely shares machinery with #5; both produce additional sibling artifacts under the slug's output directory.
Per writer
- New convert stage variant (or generalised stage that dispatches on writer + format).
- Extend `sources.json`: per-writer flag and skip-reason, e.g. `convert.parquet_java`.
- Update `validate_manifest` invariants.
- Outputs at `outputs/v{n}/<slug>/<fmt>-<writer>/<slug>.<ext>` (e.g. `parquet-java/`).
- Regen `docs/datasets.md` + `docs/snapshot.json`.
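
A minimal sketch of the naming conventions in the list above, to make the layout concrete. The function names, the `EXTENSIONS` table, and the example slug are hypothetical and not taken from the repo; only the path template and the `convert.parquet_java` flag pattern come from this issue.

```python
from pathlib import PurePosixPath

# Hypothetical format -> file extension table (assumption, not from the repo).
EXTENSIONS = {"parquet": "parquet"}


def writer_output_path(version: int, slug: str, fmt: str, writer: str) -> PurePosixPath:
    """Build outputs/v{n}/<slug>/<fmt>-<writer>/<slug>.<ext> for one alternate writer."""
    ext = EXTENSIONS[fmt]
    return (
        PurePosixPath("outputs")
        / f"v{version}"
        / slug
        / f"{fmt}-{writer}"
        / f"{slug}.{ext}"
    )


def convert_flag(fmt: str, writer: str) -> str:
    """Name of the per-writer flag in sources.json, following convert.parquet_java."""
    return f"convert.{fmt}_{writer}"
```

For an assumed slug `taxi` at version 3, `writer_output_path(3, "taxi", "parquet", "java")` gives `outputs/v3/taxi/parquet-java/taxi.parquet`, and `convert_flag("parquet", "java")` gives `convert.parquet_java`.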
Writers in scope
- parquet-mr (Java): reference writer; subprocess via `java -jar`.
- parquet-go: Go-native writer; subprocess.
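
Both writers run as subprocesses, so a shared helper could turn a missing toolchain or a failed run into a skip-reason. The sketch below is one possible shape, not the project's implementation. The jar path and the positional `<src> <dst>` arguments assume a hypothetical wrapper jar, and the skip-reason strings are illustrative.

```python
import shutil
import subprocess
from typing import List, Optional


def build_java_command(jar: str, src: str, dst: str) -> List[str]:
    """Assemble a `java -jar` invocation.

    The trailing <src> <dst> arguments are an assumed CLI for a
    hypothetical wrapper jar around the Java Parquet writer.
    """
    return ["java", "-jar", jar, src, dst]


def run_writer(cmd: List[str]) -> Optional[str]:
    """Run an external writer; return None on success, else a skip-reason string."""
    executable = cmd[0]
    # Skip cleanly when the toolchain (e.g. a JVM) is not installed.
    if shutil.which(executable) is None:
        return f"{executable} not found on PATH"
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        return f"{executable} exited with status {result.returncode}"
    return None
```

Returning a reason string instead of raising keeps the writer optional: the convert stage can record the reason next to the per-writer flag and carry on with the default pyarrow output.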