Machine-readable companion:
`sources.schema.json` is the JSON Schema (Draft 2020-12) version of this document. Run `python -m scripts.pipeline.validate_manifest` to validate `sources.json` against it and to run the cross-checks the schema can't express (handler registry, slug uniqueness, etc.). Keep both files in lockstep when adding fields.
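The shape of those cross-checks can be sketched in a few lines. This is illustrative only: it assumes the `jsonschema` package, and the in-line `KNOWN_HANDLERS` set stands in for the real registry import.

```python
import json
from pathlib import Path

from jsonschema import Draft202012Validator  # pip install jsonschema

# Hypothetical stand-in; the real registry lives in scripts/pipeline/handlers/__init__.py.
KNOWN_HANDLERS = {"identity", "tighten_types", "tlc_merge_months", "glove_split"}

def validate_manifest(manifest_path="sources.json", schema_path="sources.schema.json"):
    manifest = json.loads(Path(manifest_path).read_text())
    schema = json.loads(Path(schema_path).read_text())

    # 1. Structural pass: JSON Schema Draft 2020-12.
    Draft202012Validator(schema).validate(manifest)

    datasets = manifest["datasets"]

    # 2. Cross-checks the schema can't express.
    slugs = [d["slug"] for d in datasets]
    dupes = {s for s in slugs if slugs.count(s) > 1}
    if dupes:
        raise ValueError(f"duplicate slugs: {sorted(dupes)}")

    unknown = {d["transform"]["handler"] for d in datasets} - KNOWN_HANDLERS
    if unknown:
        raise ValueError(f"unknown transform handlers: {sorted(unknown)}")

if __name__ == "__main__":
    validate_manifest()
```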
This document defines the shape of `sources.json`, the manifest that drives the Raincloud pipeline. The file declares, per dataset, how to fetch, how to extract, how to parse, how to transform, and how to validate the resulting parquet in `outputs/v{schema_version}/<slug>/parquet/<slug>.parquet`. Stages are processed by separate scripts under `scripts/pipeline/`; each stage reads only the fields it needs, so scripts can be developed, tested, and re-run independently.
```jsonc
{
/* Identity (used by every stage) */
"slug": "clickbench-hits", // kebab-case; matches outputs/v{n}/<slug>/parquet/<slug>.parquet
"short_name": "ClickBench Hits", // table-friendly label
"full_name": "ClickBench Hits (Yandex Metrica log)",
"description": "100M-row web-analytics event log used by the ClickBench OLAP benchmark.",
"family": "direct", // "direct" | "nyc-tlc" | "public-bi" | "uci" | "kaggle-upstream"
/* License (driven by the license-audit pass; machine-readable) */
"license": {
"spdx": "Apache-2.0", // SPDX id OR free-form token if no SPDX. Describes the aggregator's declared license; see `scrape_advisory` for the gap between that and any uncleared underlying content.
"source_url": "https://github.com/ClickHouse/ClickBench/blob/main/LICENSE",
"redistribution_permitted": true, // what the audit confirmed
"attribution_required": true,
"notes": null,
"scrape_advisory": null // null for cleared datasets. When non-null, holds a heavy-asterisk warning rendered prominently in datasets.md / list_datasets --long / the TUI for datasets that aggregate or reference content whose underlying licenses have not been individually cleared (public-web scrapes, Common Crawl derivatives, image/code corpora). Free-form per-source string β write a one-line summary of the gap and what a downstream user should do (e.g. "contact original authors before redistributing").
},
/* Stage 1: fetch (scripts/pipeline/fetch.py) */
"fetch": {
"type": "http", // "http" | "kaggle" | "uci" | "huggingface" | "custom"
"urls": [ // list; multiple URLs are fetched in order and concatenated/merged per `extract`
"https://datasets.clickhouse.com/hits_compatible/hits.parquet"
],
"auth": null, // null | "kaggle" | "huggingface" | "<custom-token-key>"
"requires_interactive_accept": false, // kaggle-only: marks datasets that require a one-time ToS click-through on the Kaggle web UI before API access. fetch_kaggle surfaces a clear "visit URL, click Download, re-run" error on 403 regardless of this flag, but setting it lets the orchestrator announce the requirement up front.
"hf_allow_patterns": null, // huggingface-only: glob patterns forwarded to snapshot_download(allow_patterns=...). Use to fetch a subset of a giant repo (e.g. ["data/sample-10BT/*.parquet"] for fineweb).
"hf_revision": null, // huggingface-only: git revision (branch/tag/commit SHA) forwarded to snapshot_download(revision=...).
"expected_bytes": 14779976446, // optional; used only to warn on drift
"expected_sha256": null // optional; prefer when upstream publishes it
},
/* Stage 2: extract (scripts/pipeline/extract.py) */
"extract": {
"type": "passthrough", // "passthrough" | "zip" | "tar" | "bz2" | "gzip" | "7z" | "custom"
"include": ["hits.parquet"], // glob list; applied after decompression
"exclude": [], // optional; wins over include
"post": null // optional custom post-extract step name
},
/* Stage 3: parse (scripts/pipeline/parse.py) */
"parse": {
"reader": "parquet", // "csv" | "parquet" | "jsonl" | "xml" | "pbf" | "custom"
"options": { // reader-specific
/* for csv: { "delimiter": ",", "has_header": true, "encoding": "utf-8", "quoting": "minimal" } */
/* for parquet: {} */
/* for jsonl: { "record_path": null } */
}
},
/* Stage 4: transform (scripts/pipeline/transform.py) */
"transform": {
"handler": "identity", // named Python callable in scripts/pipeline/handlers/; "identity" means no-op
"params": {} // handler-specific kwargs
},
/* Stage 5: write (scripts/pipeline/write.py) */
"write": {
"output": "clickbench-hits.parquet",
"compression": "zstd",
"row_group_size_rows": 1048576,
"statistics": true,
"page_index": false
},
/* Stage 6: validate (scripts/pipeline/validate.py) */
"expect": {
"rows": 99997497, // exact; mismatch emits [WARN], does not fail unless --strict
"schema_hash": null, // optional; SHA-256 of canonicalised Arrow schema.
// May be the full 64-char hex or a leading prefix
// (manifest convention is 12 chars, matching the
// schema_hash= line printed by the validate stage).
// Mismatch emits [WARN] only; pass --strict to fail.
"notes": null
},
/* Stage 7: convert (scripts/pipeline/convert.py, optional) */
"convert": {
"vortex": true, // opt-in; when true, emit a sibling <slug>.vortex alongside the parquet.
"vortex_skip_reason": null // null when vortex=true. When vortex=false, holds a non-null free-form string explaining why (typically a known type-support gap in the current Vortex release). The pair is rendered into docs/v{n}/vortex_skip.md so the catalog tracks *why* slugs lack a `.vortex` rather than leaving it ambiguous.
},
/* Hydration mark (optional, opt-in): for slugs with a URL column whose
contents could be dereferenced into a sibling parquet under the
`parquet-hydrated/` format dir. Today this is a manifest-level mark only;
the hydrate stage isn't implemented yet. Surface in TUI / list_datasets
--hydrate / --json. */
"hydrate": {
"url_column": "url", // existing string column to dereference
"output_column": "content", // new column on the hydrated copy
"output_type": "binary", // "binary" (image/pdf bytes) | "string" (HTML/text)
"advisory": "Many LAION URLs return 404 (10-30% takedown rate); ..." // per-slug pitfalls
},
/* Optional curatorial labels: list_datasets --tag filters by membership;
TUI renders as chips. Kebab-case. Free-form (no enum) but kept short. */
"tags": ["nlp", "rlhf", "preference-data"],
/* Optional canonical references beyond license.source_url. kind ∈
{paper, blog, homepage, github, dataset_card}. */
"references": [
{"kind": "paper", "url": "https://arxiv.org/abs/2310.01377"},
{"kind": "github", "url": "https://github.com/foo/bar"}
]
}
```

`transform.handler` names a Python callable registered in `scripts/pipeline/handlers/__init__.py`. Each handler takes
`(spec: dict, parsed: list[(Path, pa.Table | None)], **params) -> list[(output_slug, pa.Table)]`, so a single source can produce multiple parquets (GloVe → 3 files, OSM Germany → 3 files, Stack Exchange dump → 5 files).
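A hypothetical handler with that shape, for illustration only: the name, the slug-suffixing logic, and the file-per-output split are assumptions; only the signature contract comes from this document.

```python
from pathlib import Path

import pyarrow as pa

def example_split(spec: dict,
                  parsed: list[tuple[Path, pa.Table | None]],
                  **params) -> list[tuple[str, pa.Table]]:
    """Hypothetical handler: emit one output parquet per parsed input file."""
    outputs = []
    for path, table in parsed:
        if table is None:          # streaming readers may not materialise a table
            continue
        outputs.append((f"{spec['slug']}-{path.stem}", table))
    return outputs
```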
The full registry lives in `scripts/pipeline/handlers/__init__.py`; highlights:

- `identity`: passthrough.
- `tighten_types`: standard retype / list-element tightening / UUID/JSON annotation pass.
- `tlc_merge_months`: concatenate 12 TLC monthly parquets into an annual file.
- `public_bi_merge`: concatenate `.csv.bz2` partitions using the companion `.sql` schema.
- `glove_split`: read GloVe `.txt`, split into 3 per-dimension `fixed_size_list<float, N>` parquets.
- `osm_pbf_split`: read `.osm.pbf` and emit 3 GeoParquet files (nodes/ways/relations) with WKB geometry.
- `stack_exchange_split`: read Stack Exchange XML dump and emit one parquet per table.
- `openlibrary_parse`: read `ol_dump_*.txt.gz` and split by record type.
- `uci_default`: UCI `data.csv` with standard type-tightening + column-name normalisation.
- `factbook_variant_parse` / `jsonbench_variant_parse`: stream JSON-per-row into a Parquet `VARIANT` column via DuckDB's `CAST(... AS VARIANT)` + `COPY TO PARQUET`.
- `lichess_pgn_parse`: stream a Lichess `.pgn.zst` monthly dump.
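How registration and dispatch fit together can be sketched as follows; the `HANDLERS` dict name, the `run_transform` helper, and the identity body are assumptions for illustration, not the real module.

```python
import pyarrow as pa

def identity(spec, parsed, **params):
    # Passthrough: keep each parsed table, keyed by the dataset's own slug.
    return [(spec["slug"], table) for _path, table in parsed if table is not None]

# Name -> callable; the name is what sources.json puts in transform.handler.
HANDLERS = {"identity": identity}

def run_transform(spec: dict, parsed) -> list[tuple[str, pa.Table]]:
    # Transform-stage dispatch: look up the named handler, pass params as kwargs.
    handler = HANDLERS[spec["transform"]["handler"]]
    return handler(spec, parsed, **spec["transform"].get("params", {}))
```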
- fetch reads `fetch.*` and writes bytes to `outputs/raw_downloads/<slug>/<basename>`. Idempotent (skip if `expected_bytes`/`expected_sha256` already matches). Raw downloads are not version-scoped: the same upstream bytes are reused across schema_versions.
- extract reads `extract.*` and expands downloaded files into `_workdir/<slug>/`. Outputs a list of `(relative_path, type)` tuples.
- parse reads `parse.*` and each extracted file, producing one `pyarrow.Table` per source file. Reader options mirror the underlying library.
- transform dispatches to `handler` with the parsed tables as input. Output is `(output_slug, arrow_table)` tuples. Streaming handlers may write parquet directly and return `[]`.
- write writes parquet to `outputs/v{schema_version}/<output_slug>/parquet/<output_slug>.parquet` per `write.*` settings.
- validate reads the written parquet and compares row count + schema hash to `expect.*`. Errors are loud by default; `--loose` downgrades to warnings. (A sketch of the comparison follows this list.)
- convert (optional): when `convert.vortex = true`, emit `outputs/v{n}/<slug>/vortex/<slug>.vortex` (a sibling format directory next to `parquet/`). No-op otherwise. See `scripts/pipeline/convert.py` for type-support caveats; current known gaps in vortex 0.69 are listed in `SKILLS.md`.
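A minimal sketch of the validate comparison, assuming pyarrow and a naive schema canonicalisation; the real stage defines the exact canonical form and the `schema_hash=` output line.

```python
import hashlib

import pyarrow.parquet as pq

def check_expectations(parquet_path: str, expect: dict) -> list[str]:
    """Compare the written parquet against the manifest's `expect` block."""
    warnings = []
    meta = pq.read_metadata(parquet_path)

    if expect.get("rows") is not None and meta.num_rows != expect["rows"]:
        warnings.append(f"[WARN] rows: got {meta.num_rows}, expected {expect['rows']}")

    if expect.get("schema_hash"):
        # Illustrative canonicalisation: hash the schema's string form.
        schema = meta.schema.to_arrow_schema()
        digest = hashlib.sha256(str(schema).encode()).hexdigest()
        # The manifest may hold the full hex or a leading prefix (convention: 12 chars).
        if not digest.startswith(expect["schema_hash"]):
            warnings.append(f"[WARN] schema_hash: got {digest[:12]}, "
                            f"expected {expect['schema_hash']}")
    return warnings
```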
When `transform.handler` returns multiple tables, the `write.output` field is ignored and each handler-emitted `output_slug` becomes the output filename. Example: a single source (`glove.6B.zip`) produces 3 parquets (`glove-6b-50d.parquet`, etc.).
In `sources.json`, the 3 outputs appear as 3 distinct DatasetSpec entries, each referencing the same fetch config but with `transform.handler = "glove_split"` and a `params.dimension` discriminator. The pipeline dedupes the actual download.
{ "schema_version": 1, "datasets": [ /* DatasetSpec, one per dataset */ ] }