diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index a37e110..22c4542 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -53,7 +53,7 @@ jobs:
           enable-cache: true

       - name: Install dependencies
-        run: uv sync
+        run: uv sync --extra native

       - name: Cache core wasm fixture
         uses: actions/cache@v4
diff --git a/README.md b/README.md
index c3323fa..4747b30 100644
--- a/README.md
+++ b/README.md
@@ -1,122 +1,100 @@
 # chonkle

-Chonkle is a demonstrator, built to explore WebAssembly (Wasm) as a codec delivery mechanism for chunked array formats like Zarr and COG. In these formats, each chunk is encoded by a sequence (i.e., a pipeline) of codecs applied in order — bytes, compression, prediction filters, etc. Chonkle pipelines can mix standard Python codecs (via [numcodecs](https://numcodecs.readthedocs.io/)) with custom Wasm codecs that run at near-native speed inside a sandbox — portable, safe, and free from platform-specific build tooling. See [WASM.md](docs/WASM.md) for details on how Wasm codecs work. The library is functional but should be treated as a learning artifact; our understanding of Wasm is still evolving.
+A Python host for Wasm codec pipelines. Pipelines are directed acyclic graphs (DAGs) of codec steps defined in JSON. The orchestrator parses the DAG, validates wiring against codec signatures, and executes the pipeline via [Wasmtime](https://wasmtime.dev/).

-## Install
+Status: proof of concept.

-Requires [uv](https://docs.astral.sh/uv/).
+## Codec backends

-```bash
-uv sync
-```
-
-## How codec pipelines work
-
-Each chunk has a sidecar JSON file that describes the codec pipeline used to encode it:
-
-```json
-{
-  "codecs": [
-    {
-      "name": "bytes",
-      "type": "numcodecs",
-      "configuration": {
-        "endian": "little",
-        "data_type": "uint16",
-        "shape": [1024, 1024]
-      }
-    },
-    {
-      "name": "tiff_predictor_2",
-      "type": "wasm",
-      "uri": "file://path/to/tiff-predictor-2-c.wasm",
-      "configuration": {
-        "bytes_per_sample": 2,
-        "width": 1024
-      }
-    },
-    {
-      "name": "zlib",
-      "type": "numcodecs",
-      "configuration": {
-        "level": 9
-      }
-    }
-  ]
-}
-```
+chonkle supports three codec backends. Each implements the same `Codec` ABC (`call(direction, port_map)` and `signature()`), so backends can be mixed freely within a single pipeline.

-**Encoding** applies codecs in forward order (top to bottom): array → bytes → tiff_predictor_2 (Wasm) → zlib → compressed bytes.
+**Component Model Wasm** — `.wasm` components implementing the `chonkle:codec/transform@0.1.0` WIT interface. Any language with a Component Model toolchain (Rust, C, Python via componentize-py) can produce a conforming component. Data transfer uses the canonical ABI. The Wasmtime sandbox isolates each component from the host.

-**Decoding** applies codecs in reverse order, unwinding the encoding.
+**Core Wasm** — wasm32-wasi reactor modules using a binary port-map wire format via `Memory.read`/`Memory.write`. When consecutive pipeline steps are both core wasm, data transfers between their linear memories use `ctypes.memmove` (single-copy, no serialization round-trip).

-Each codec entry has:
+**Native (numcodecs)** — Python codecs from the [numcodecs](https://numcodecs.readthedocs.io/) library. No Wasm overhead. `numcodecs` and `numpy` are optional dependencies, imported lazily. Adding a new numcodecs codec requires only adding a signature file.

-- `"name"` — codec identifier (for numcodecs lookup and human readability)
-- `"type"` — `"numcodecs"` or `"wasm"`
-- `"configuration"` — codec-specific parameters
-- `"uri"` — (Wasm only) URI of the `.wasm` module: `file://`, `https://`, or `oci://`
+The `Resolver` selects among available implementations using a configurable backend preference list. The default preference order is `["native", "core", "component"]`.

-Python and Wasm codec steps can be freely mixed in any order. For information on how Wasm codecs work, see [WASM.md](docs/WASM.md).
+## Usage

-## Python API
+### CLI

-```python
-from chonkle import decode, encode, get_codecs
-```
+```bash
+# Run a pipeline
+chonkle run pipeline.json --input bytes=chunk.bin --output bytes=out.bin

-### Load codec specs
+# With resolver options
+chonkle run pipeline.json --input bytes=chunk.bin \
+    --direction decode \
+    --codec-store ./codec/ \
+    --preference core,component,native \
+    --override zlib=zlib-rs \
+    --source zlib=https://example.com/zlib.wasm

-`get_codecs` extracts the codec spec list from a pipeline JSON file or a dict:
+# List installed codecs
+chonkle codecs

-```python
-from pathlib import Path
-from chonkle import get_codecs
+# Show details for a specific codec
+chonkle codecs zlib

-codecs = get_codecs(Path("pipeline.json")) # from a file
-codecs = get_codecs({"codecs": [...]}) # from a dict
+# Embed a signature into a .wasm binary (build-time tool)
+chonkle embed-signature codec.wasm signature.json
 ```

-### Decode
-
-`decode` applies a codec pipeline in reverse order to raw bytes, returning a numpy array:
+### Python API

 ```python
-from chonkle import decode, get_codecs
+from chonkle.pipeline import prepare
+from chonkle.executor import run

-codecs = get_codecs(Path("chunks/0.json"))
-arr = decode(Path("chunks/0").read_bytes(), codecs)
+prepared = prepare("pipeline.json", direction="decode")
+outputs = run(prepared, {"bytes": chunk_bytes})
 ```

-### Encode
+## Format drivers

-`encode` applies a codec pipeline in forward order, returning encoded bytes:
+The executor is format-agnostic. It accepts a pipeline DAG and chunk data, runs the codecs, and returns the result. It has no knowledge of Zarr, Parquet, COG, ORC, or any other file format.

-```python
-from chonkle import encode
+A **format driver** is the layer above the executor that bridges a specific file format and the pipeline executor. It reads format-specific metadata, translates it into a pipeline DAG, supplies metadata-derived inputs, and manages chunk I/O. Format drivers are outside the scope of this repository.

-encoded = encode(arr, codecs)
-```
+## Documentation

-## Demo
+- [docs/OVERVIEW.md](docs/OVERVIEW.md) — Architecture, design rationale, and execution model
+- [docs/reference/PIPELINE_SCHEMA.md](docs/reference/PIPELINE_SCHEMA.md) — Pipeline JSON schema
+- [docs/reference/codec-contract/](docs/reference/codec-contract/) — Codec interface specs (Component Model, Core Wasm, Native)
+- [docs/reference/CODEC_RESOLUTION.md](docs/reference/CODEC_RESOLUTION.md) — Codec resolution chain and backend preference

-See [demo/](demo/) for a Jupyter notebook demonstrating the full pipeline with a real Sentinel-2 COG tile.
+See [docs/README.md](docs/README.md) for the full index.

-## CLI
+## Development

-A `chonkle` CLI is available for interactive use; run `chonkle --help` for usage.
+- **Package manager**: uv
+- **Build backend**: hatchling
+- **Python**: >= 3.13
+- **Linting/formatting**: ruff
+- **Type checking**: mypy
+- **Testing**: pytest
+- **Pre-commit**: ruff check, ruff format, mypy, yaml/toml validation
+- **CI**: GitHub Actions (lint on 3.14, test on 3.13 and 3.14)

-## Configuration
+```bash
+# Install dependencies
+uv sync

-Wasm codecs downloaded from HTTPS or OCI sources are cached locally to avoid redundant network requests.
+# Include native (numcodecs) backend
+uv sync --extra native

-| Variable | Description | Default |
-| --- | --- | --- |
-| `CHONKLE_CACHE_DIR` | Override the Wasm module cache directory | `$TMPDIR/chonkle/wasm/` |
-| `CHONKLE_FORCE_DOWNLOAD` | Set to `1` to re-download cached Wasm modules, bypassing the local cache. Primarily useful for testing and development | unset |
+# Run tests
+uv run pytest

-`$TMPDIR` is the OS temporary directory (e.g. `/tmp` on Linux, `/var/folders/...` on macOS). Run `echo $TMPDIR` to see the value on your system.
+# Run linter
+uv run ruff check
+
+# Network tests (downloads codecs from OCI registries)
+uv run pytest --run-network
+```

 ## Acknowledgements

-Partially supported by NASA-IMPACT VEDA project
+Partially supported by NASA-IMPACT VEDA project.
diff --git a/bench/README.md b/bench/README.md
new file mode 100644
index 0000000..4c3f578
--- /dev/null
+++ b/bench/README.md
@@ -0,0 +1,59 @@
+# bench/
+
+Three benchmarking tools for different aspects of the chonkle stack.
+
+## rust-host/
+
+Standalone Rust crate. Uses `wasmtime-rs 41` typed bindings generated by
+`component::bindgen!`. `Vec` arguments are lowered via the compiled
+canonical ABI path — no per-element Python interpreter overhead.
+
+```
+cd bench/rust-host
+cargo build --release
+cargo run --release
+```
+
+## python-host/
+
+Minimal Python script. Uses `wasmtime-py 41` raw `Func` call, bypassing the
+chonkle executor entirely. Structurally identical to the Rust binary so the
+comparison isolates the host language, not chonkle overhead.
+
+Uses PEP 723 inline script metadata to declare `wasmtime==41.*` as a
+dependency. Does not depend on the root chonkle project.
+
+```
+uv run bench/python-host/time_abi_raw.py
+```
+
+## chonkle-host/
+
+Drives the chonkle executor directly to investigate per-step `fn()` timing
+across codec types (zlib, predictor2, identity) and data sizes. This is the
+script used to generate the data in `docs/decisions/CANONICAL_ABI_PERF.md`.
+
+Uses PEP 723 inline script metadata with `chonkle` as a local path dependency
+(`path = "../.."`). Requires built codec `.wasm` files.
+
+```
+uv run bench/chonkle-host/time_codec.py
+```
+
+---
+
+## Interpreting the output
+
+`rust-host` and `python-host` print `[TIMING]` lines in the same format:
+
+```
+[TIMING] identity.wasm decode: fn=s in=B out=B abi_total=B throughput=MB/s (