Merged
52 commits
b6553f8
feat: replace linear pipeline with DAG executor using Wasm Component …
pjhartzell Mar 10, 2026
27e31b5
docs: fix step outputs format and remove empty encode_only_inputs
pjhartzell Mar 10, 2026
baa4edb
rename: manifest → signature throughout code and docs
pjhartzell Mar 11, 2026
e9227e7
docs: add PROTOSPEC_NOTES with pipeline model analysis
pjhartzell Mar 11, 2026
630a322
feat: add direction parameter to run() for DAG inversion
pjhartzell Mar 11, 2026
0e12934
feat: add cross-step type checking in signature validation
pjhartzell Mar 11, 2026
a207a24
refactor: split _execute_steps into _execute_forward and _execute_inv…
pjhartzell Mar 11, 2026
c994454
refactor: inline executor helpers and move Pipeline helpers to module…
pjhartzell Mar 11, 2026
497c694
docs: update ai/ context for executor and pipeline refactoring
pjhartzell Mar 11, 2026
a0c9fc4
feat: add COG chunk pipeline tests and fix inverted port-map constants
pjhartzell Mar 12, 2026
bf637ef
docs: restructure docs/ and rewrite README; simplify executor interfa…
pjhartzell Mar 16, 2026
f54894c
refactor: change `steps` from array to named map, aligning with proto…
pjhartzell Mar 16, 2026
f59e97a
chore: untrack docs/ai/ (local-only context files)
pjhartzell Mar 16, 2026
7ea8130
docs: add codec pipeline tradeoff analysis and F3 comparison
pjhartzell Mar 16, 2026
3acddaa
chore: add bench/ and codec/ with gitignores and codec README
pjhartzell Mar 17, 2026
ceaae42
feat: embed codec signatures in .wasm binaries instead of sidecar JSO…
pjhartzell Mar 26, 2026
5500eff
refactor: extract Codec abstraction and split executor into prepare/r…
pjhartzell Mar 26, 2026
bc7da0b
feat: add Resolver for codec resolution and remove step-level src/out…
pjhartzell Mar 26, 2026
b1f7e9c
feat: add CoreWasmCodec and core ABI for direct-memory codec execution
pjhartzell Mar 26, 2026
bc4fd81
feat: add single-copy transfer between sequential core wasm codec steps
pjhartzell Mar 26, 2026
9da4ae5
feat: add NativeCodec for numcodecs integration as third codec backend
pjhartzell Mar 26, 2026
aabbaee
refactor: split codecs.py into codecs/ package and fix type safety is…
pjhartzell Mar 26, 2026
f22ea44
refactor: remove re-exports from __init__.py modules
pjhartzell Mar 26, 2026
75ad48d
refactor: move embed-signature from tools/ package to CLI subcommand
pjhartzell Mar 26, 2026
c9f91e2
refactor: merge parse and prepare into single pipeline entry point
pjhartzell Mar 27, 2026
36ba6d6
refactor: compute topological sort before Pipeline construction
pjhartzell Mar 27, 2026
b92b7e9
refactor: parse WiringRef once at construction time
pjhartzell Mar 27, 2026
e886a8b
refactor: split _validate_signature and consolidate encode_only filte…
pjhartzell Mar 27, 2026
29e684f
refactor: introduce Signature dataclass for typed codec signatures
pjhartzell Mar 27, 2026
26c98df
refactor: merge wiring validation into signature validation pass
pjhartzell Mar 27, 2026
36e455a
refactor: type pipeline input and constant descriptors
pjhartzell Mar 27, 2026
41dabe4
refactor: replace step list with ordered dict and precompute codec me…
pjhartzell Mar 27, 2026
d228387
refactor: rename validation functions and introduce _ValidationContext
pjhartzell Mar 27, 2026
b1484e5
refactor: extract CodecStore, add Backend enum, and consolidate signa…
pjhartzell Mar 28, 2026
639423c
docs: rewrite documentation for three-backend codec architecture
pjhartzell Mar 28, 2026
9ab8357
docs: rewrite DISTRIBUTION.md as an ADR
pjhartzell Mar 28, 2026
8f89e28
docs: fix Component Model assumptions and soften PoC tone in CODEC_PI…
pjhartzell Mar 28, 2026
d6f7ec0
docs: correct and update F3 comparison doc
pjhartzell Mar 30, 2026
68835cb
docs: rewrite PROTOSPEC_NOTES.md for clarity and accuracy
pjhartzell Mar 30, 2026
c42147c
docs: split CODEC_CONTRACT.md and CORE_ABI.md into codec-contract/ di…
pjhartzell Mar 30, 2026
cd2013a
docs: clean up PIPELINE_SCHEMA.md for accuracy against protospec
pjhartzell Mar 30, 2026
59acc5a
docs: reorganize directory structure and eliminate CODEC_RUNTIME.md
pjhartzell Mar 30, 2026
5a4080e
docs: merge design/MULTI_MEMORY.md into DATA_COPIES.md; rename prepar…
pjhartzell Mar 30, 2026
4d61d82
docs: rewrite README for accuracy and brevity
pjhartzell Mar 30, 2026
b919968
docs: remove chonkle references from WIT_RESOURCES.md; move project-s…
pjhartzell Mar 30, 2026
efd4f64
docs: add step-by-step codec build guides
pjhartzell Mar 30, 2026
2858642
feat: add native block, decode_only ports, and asymmetric-dtype codecs
pjhartzell Mar 31, 2026
023a653
feat: add json2 native codec signature and document unsupported vlen-…
pjhartzell Mar 31, 2026
828a942
docs: fix inaccurate docstrings and move wit/ to codec/wit/
pjhartzell Mar 31, 2026
fb1bc4a
docs: simplify demo notebook and fix missing HTTPS pipeline run
pjhartzell Apr 2, 2026
968b7a5
fix(ci): install native codec dependency
pjhartzell Apr 2, 2026
b2f8bea
fix(ci): add missing package
pjhartzell Apr 2, 2026
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -53,7 +53,7 @@ jobs:
enable-cache: true

- name: Install dependencies
-        run: uv sync
+        run: uv sync --extra native

- name: Cache core wasm fixture
uses: actions/cache@v4
148 changes: 63 additions & 85 deletions README.md
@@ -1,122 +1,100 @@
# chonkle

Chonkle is a demonstrator, built to explore WebAssembly (Wasm) as a codec delivery mechanism for chunked array formats like Zarr and COG. In these formats, each chunk is encoded by a sequence (i.e., a pipeline) of codecs applied in order — bytes, compression, prediction filters, etc. Chonkle pipelines can mix standard Python codecs (via [numcodecs](https://numcodecs.readthedocs.io/)) with custom Wasm codecs that run at near-native speed inside a sandbox — portable, safe, and free from platform-specific build tooling. See [WASM.md](docs/WASM.md) for details on how Wasm codecs work. The library is functional but should be treated as a learning artifact; our understanding of Wasm is still evolving.
A Python host for Wasm codec pipelines. Pipelines are directed acyclic graphs (DAGs) of codec steps defined in JSON. The orchestrator parses the DAG, validates wiring against codec signatures, and executes the pipeline via [Wasmtime](https://wasmtime.dev/).
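
As a rough illustration of the "parse the DAG, then execute in dependency order" idea, the sketch below orders a small graph of steps with Python's standard-library `graphlib`. The step names and the `deps` mapping are hypothetical and are not chonkle's pipeline schema or executor code.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical wiring: each step maps to the set of steps whose outputs it consumes.
deps = {
    "bytes": set(),
    "predictor": {"bytes"},
    "zlib": {"predictor"},
}


def execution_order(deps: dict[str, set[str]]) -> list[str]:
    """Return step names so that every step comes after its dependencies."""
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError as exc:
        # exc.args[1] holds the list of nodes that form the cycle.
        raise ValueError(f"pipeline is not a DAG: {exc.args[1]}") from exc


order = execution_order(deps)
print(order)
```

Rejecting cycles at parse time means a malformed pipeline fails before any codec runs.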

## Install
Status: proof of concept.

Requires [uv](https://docs.astral.sh/uv/).
## Codec backends

```bash
uv sync
```

## How codec pipelines work

Each chunk has a sidecar JSON file that describes the codec pipeline used to encode it:

```json
{
"codecs": [
{
"name": "bytes",
"type": "numcodecs",
"configuration": {
"endian": "little",
"data_type": "uint16",
"shape": [1024, 1024]
}
},
{
"name": "tiff_predictor_2",
"type": "wasm",
"uri": "file://path/to/tiff-predictor-2-c.wasm",
"configuration": {
"bytes_per_sample": 2,
"width": 1024
}
},
{
"name": "zlib",
"type": "numcodecs",
"configuration": {
"level": 9
}
}
]
}
```
chonkle supports three codec backends. Each implements the same `Codec` ABC (`call(direction, port_map)` and `signature()`), so backends can be mixed freely within a single pipeline.
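
A minimal sketch of what a shared interface of this shape could look like is given below. Only the method names `call(direction, port_map)` and `signature()` come from the text above; the type annotations, the `IdentityCodec` class, and the signature fields are assumptions made for illustration and are not chonkle's actual classes.

```python
from abc import ABC, abstractmethod
from typing import Any


class Codec(ABC):
    """Illustrative stand-in for a backend-agnostic codec interface."""

    @abstractmethod
    def call(self, direction: str, port_map: dict[str, bytes]) -> dict[str, bytes]:
        """Run the codec in the given direction on named input ports."""

    @abstractmethod
    def signature(self) -> dict[str, Any]:
        """Describe the codec's ports (field names here are hypothetical)."""


class IdentityCodec(Codec):
    """Toy backend that returns its inputs unchanged."""

    def call(self, direction: str, port_map: dict[str, bytes]) -> dict[str, bytes]:
        return dict(port_map)

    def signature(self) -> dict[str, Any]:
        return {"codec_id": "identity", "inputs": ["bytes"], "outputs": ["bytes"]}


# Any mix of backends can be driven through the same two methods.
steps: list[Codec] = [IdentityCodec(), IdentityCodec()]
ports = {"bytes": b"\x01\x02"}
for codec in steps:
    ports = codec.call("encode", ports)
print(ports)
```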

**Encoding** applies codecs in forward order (top to bottom): array → bytes → tiff_predictor_2 (Wasm) → zlib → compressed bytes.
**Component Model Wasm** — `.wasm` components implementing the `chonkle:codec/transform@0.1.0` WIT interface. Any language with a Component Model toolchain (Rust, C, Python via componentize-py) can produce a conforming component. Data transfer uses the canonical ABI. The Wasmtime sandbox isolates each component from the host.

**Decoding** applies codecs in reverse order, unwinding the encoding.
**Core Wasm** — wasm32-wasi reactor modules using a binary port-map wire format via `Memory.read`/`Memory.write`. When consecutive pipeline steps are both core wasm, data transfers between their linear memories use `ctypes.memmove` (single-copy, no serialization round-trip).
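
To make the single-copy idea concrete, here is a small sketch that copies a payload between two buffers with `ctypes.memmove`. The two `bytearray` objects merely stand in for the linear memories of consecutive core Wasm steps; the offsets and payload are made up, and this is not chonkle's transfer code.

```python
import ctypes

# Stand-ins for the linear memories of two consecutive core Wasm steps.
src_memory = bytearray(64)
dst_memory = bytearray(64)

payload = b"encoded-chunk"
src_offset, dst_offset = 8, 16  # hypothetical positions of the data in each memory
src_memory[src_offset:src_offset + len(payload)] = payload

# Expose both buffers to ctypes without copying them.
src_buf = (ctypes.c_char * len(src_memory)).from_buffer(src_memory)
dst_buf = (ctypes.c_char * len(dst_memory)).from_buffer(dst_memory)

# One copy, memory to memory, with no intermediate Python bytes object.
ctypes.memmove(
    ctypes.addressof(dst_buf) + dst_offset,
    ctypes.addressof(src_buf) + src_offset,
    len(payload),
)

result = bytes(dst_memory[dst_offset:dst_offset + len(payload)])
print(result)
```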

Each codec entry has:
**Native (numcodecs)** — Python codecs from the [numcodecs](https://numcodecs.readthedocs.io/) library. No Wasm overhead. `numcodecs` and `numpy` are optional dependencies, imported lazily. Adding a new numcodecs codec requires only adding a signature file.

- `"name"` — codec identifier (for numcodecs lookup and human readability)
- `"type"` — `"numcodecs"` or `"wasm"`
- `"configuration"` — codec-specific parameters
- `"uri"` — (Wasm only) URI of the `.wasm` module: `file://`, `https://`, or `oci://`
The `Resolver` selects among available implementations using a configurable backend preference list. The default preference order is `["native", "core", "component"]`.
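
The preference-list idea can be sketched as follows. The default order is taken from the text above; the `pick_backend` function and its arguments are hypothetical, and the real resolution chain (overrides, sources, codec store) is more involved than this.

```python
DEFAULT_PREFERENCE = ("native", "core", "component")


def pick_backend(available: set[str], preference: tuple[str, ...] = DEFAULT_PREFERENCE) -> str:
    """Return the first backend in `preference` that has an implementation available."""
    for backend in preference:
        if backend in available:
            return backend
    raise LookupError(f"no implementation among {sorted(available)} matches {preference}")


# A codec with only Wasm implementations falls through to "core" by default.
print(pick_backend({"core", "component"}))
# A caller-supplied order changes the choice.
print(pick_backend({"core", "component"}, ("component", "core", "native")))
```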

Python and Wasm codec steps can be freely mixed in any order. For information on how Wasm codecs work, see [WASM.md](docs/WASM.md).
## Usage

## Python API
### CLI

```python
from chonkle import decode, encode, get_codecs
```
```bash
# Run a pipeline
chonkle run pipeline.json --input bytes=chunk.bin --output bytes=out.bin

### Load codec specs
# With resolver options
chonkle run pipeline.json --input bytes=chunk.bin \
--direction decode \
--codec-store ./codec/ \
--preference core,component,native \
--override zlib=zlib-rs \
--source zlib=https://example.com/zlib.wasm

`get_codecs` extracts the codec spec list from a pipeline JSON file or a dict:
# List installed codecs
chonkle codecs

```python
from pathlib import Path
from chonkle import get_codecs
# Show details for a specific codec
chonkle codecs zlib

codecs = get_codecs(Path("pipeline.json")) # from a file
codecs = get_codecs({"codecs": [...]}) # from a dict
# Embed a signature into a .wasm binary (build-time tool)
chonkle embed-signature codec.wasm signature.json
```

### Decode

`decode` applies a codec pipeline in reverse order to raw bytes, returning a numpy array:
### Python API

```python
from chonkle import decode, get_codecs
from chonkle.pipeline import prepare
from chonkle.executor import run

codecs = get_codecs(Path("chunks/0.json"))
arr = decode(Path("chunks/0").read_bytes(), codecs)
prepared = prepare("pipeline.json", direction="decode")
outputs = run(prepared, {"bytes": chunk_bytes})
```

### Encode
## Format drivers

`encode` applies a codec pipeline in forward order, returning encoded bytes:
The executor is format-agnostic. It accepts a pipeline DAG and chunk data, runs the codecs, and returns the result. It has no knowledge of Zarr, Parquet, COG, ORC, or any other file format.

```python
from chonkle import encode
A **format driver** is the layer above the executor that bridges a specific file format and the pipeline executor. It reads format-specific metadata, translates it into a pipeline DAG, supplies metadata-derived inputs, and manages chunk I/O. Format drivers are outside the scope of this repository.

encoded = encode(arr, codecs)
```
## Documentation

## Demo
- [docs/OVERVIEW.md](docs/OVERVIEW.md) — Architecture, design rationale, and execution model
- [docs/reference/PIPELINE_SCHEMA.md](docs/reference/PIPELINE_SCHEMA.md) — Pipeline JSON schema
- [docs/reference/codec-contract/](docs/reference/codec-contract/) — Codec interface specs (Component Model, Core Wasm, Native)
- [docs/reference/CODEC_RESOLUTION.md](docs/reference/CODEC_RESOLUTION.md) — Codec resolution chain and backend preference

See [demo/](demo/) for a Jupyter notebook demonstrating the full pipeline with a real Sentinel-2 COG tile.
See [docs/README.md](docs/README.md) for the full index.

## CLI
## Development

A `chonkle` CLI is available for interactive use; run `chonkle --help` for usage.
- **Package manager**: uv
- **Build backend**: hatchling
- **Python**: >= 3.13
- **Linting/formatting**: ruff
- **Type checking**: mypy
- **Testing**: pytest
- **Pre-commit**: ruff check, ruff format, mypy, yaml/toml validation
- **CI**: GitHub Actions (lint on 3.14, test on 3.13 and 3.14)

## Configuration
```bash
# Install dependencies
uv sync

Wasm codecs downloaded from HTTPS or OCI sources are cached locally to avoid redundant network requests.
# Include native (numcodecs) backend
uv sync --extra native

| Variable | Description | Default |
| --- | --- | --- |
| `CHONKLE_CACHE_DIR` | Override the Wasm module cache directory | `$TMPDIR/chonkle/wasm/` |
| `CHONKLE_FORCE_DOWNLOAD` | Set to `1` to re-download cached Wasm modules, bypassing the local cache. Primarily useful for testing and development | unset |
# Run tests
uv run pytest

`$TMPDIR` is the OS temporary directory (e.g. `/tmp` on Linux, `/var/folders/...` on macOS). Run `echo $TMPDIR` to see the value on your system.
# Run linter
uv run ruff check

# Network tests (downloads codecs from OCI registries)
uv run pytest --run-network
```

## Acknowledgements

Partially supported by NASA-IMPACT VEDA project
Partially supported by NASA-IMPACT VEDA project.
59 changes: 59 additions & 0 deletions bench/README.md
@@ -0,0 +1,59 @@
# bench/

Three benchmarking tools for different aspects of the chonkle stack.

## rust-host/

Standalone Rust crate. Uses `wasmtime-rs 41` typed bindings generated by
`component::bindgen!`. `Vec<u8>` arguments are lowered via the compiled
canonical ABI path — no per-element Python interpreter overhead.

```
cd bench/rust-host
cargo build --release
cargo run --release
```

## python-host/

Minimal Python script. Uses a raw `wasmtime-py 41` `Func` call, bypassing the
chonkle executor entirely. Structurally identical to the Rust binary so the
comparison isolates the host language, not chonkle overhead.

Uses PEP 723 inline script metadata to declare `wasmtime==41.*` as a
dependency. Does not depend on the root chonkle project.

```
uv run bench/python-host/time_abi_raw.py
```

## chonkle-host/

Drives the chonkle executor directly to investigate per-step `fn()` timing
across codec types (zlib, predictor2, identity) and data sizes. This is the
script used to generate the data in `docs/decisions/CANONICAL_ABI_PERF.md`.

Uses PEP 723 inline script metadata with `chonkle` as a local path dependency
(`path = "../.."`). Requires built codec `.wasm` files.

```
uv run bench/chonkle-host/time_codec.py
```

---

## Interpreting the output

`rust-host` and `python-host` print `[TIMING]` lines in the same format:

```
[TIMING] identity.wasm decode: fn=<s>s in=<B>B out=<B>B abi_total=<B>B throughput=<MB/s>MB/s (<label> run <n>/3, <host>)
```

`throughput` counts both input and output bytes (total ABI traffic). Run-to-run
variance in the Rust numbers is likely due to CPU frequency scaling or allocator
effects, not JIT warm-up: wasmtime compiles the Wasm once at load time, before
the timed loop starts.
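
A short worked example of how the `throughput` figure relates to the other fields is sketched below. It assumes 1 MB = 10^6 bytes; the benchmark scripts may use a different convention (for example 2^20), and the function name is hypothetical.

```python
def throughput_mb_per_s(in_bytes: int, out_bytes: int, fn_seconds: float) -> float:
    """Throughput over total ABI traffic (input + output), assuming 1 MB = 10**6 bytes."""
    abi_total = in_bytes + out_bytes
    return abi_total / fn_seconds / 1e6


# 1 MB in and 1 MB out in half a second is 2 MB of ABI traffic, i.e. 4 MB/s.
print(throughput_mb_per_s(1_000_000, 1_000_000, 0.5))
```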

See [docs/decisions/CANONICAL_ABI_PERF.md](../docs/decisions/CANONICAL_ABI_PERF.md)
for the full investigation and measured results.