From 0496dc0bb850cb13306237ccac731a7724f88b44 Mon Sep 17 00:00:00 2001 From: Daniel Norman Date: Fri, 6 Mar 2026 14:11:43 +0100 Subject: [PATCH 01/19] Add new guide to content addressing directories --- docs/.vuepress/config.js | 1 + docs/how-to/content-addressed-folders.md | 194 +++++++++++++++++++++++ 2 files changed, 195 insertions(+) create mode 100644 docs/how-to/content-addressed-folders.md diff --git a/docs/.vuepress/config.js b/docs/.vuepress/config.js index 104c9e5a7..28285349d 100644 --- a/docs/.vuepress/config.js +++ b/docs/.vuepress/config.js @@ -252,6 +252,7 @@ module.exports = { '/how-to/store-play-videos', '/how-to/host-git-repo', '/how-to/move-ipfs-installation/move-ipfs-installation', + '/how-to/content-addressed-folders', ] }, { diff --git a/docs/how-to/content-addressed-folders.md b/docs/how-to/content-addressed-folders.md new file mode 100644 index 000000000..2adec3915 --- /dev/null +++ b/docs/how-to/content-addressed-folders.md @@ -0,0 +1,194 @@ +--- +title: Content addressing directories of files +description: A comparison of UnixFS, iroh collections, and DASL/MASL for content addressing directories of files, covering overhead, determinism, subsetting, and ecosystem support. +--- + +# Content addressing directories of files + +This guide compares three approaches to content addressing directories of files: + +- [UnixFS](https://specs.ipfs.tech/unixfs/) +- [iroh collections](https://docs.iroh.computer/protocols/blobs#collections) +- [DASL](https://dasl.ing) along with its metadata system [MASL](https://dasl.ing/masl.html) + +The goal is to have a single content hash that represents a directory of files, such that verifying that hash verifies the entire contents. + +This matters for build outputs, software distributions, large datasets, website archives — any case where you need to verify that a collection of files hasn't changed. 
A naive approach like hashing a tarball is fragile: tar archives encode metadata (timestamps, permissions, ordering) that vary between machines, producing different hashes for identical file contents. Content addressing solves this, but the choice of format has real consequences — particularly for overhead, determinism, language support and existing tooling, and whether you can fetch subsets without downloading the whole thing. These differences compound as dataset size grows: what's negligible at megabyte scale — a few extra bytes of framing, an extra round of parsing per block — becomes a meaningful cost at terabyte scale across millions of files. + +A content-addressed directory representation should be: + +- **Deterministic**: the same files always produce the same hash, regardless of where or when they were produced +- **Lightweight**: minimal overhead beyond the file data itself +- **Canonical**: one correct representation, no ambiguity +- **Subsetable**: consumers should be able to fetch individual files or subdirectories without downloading everything + +## Subsetting: files vs folders + +For large directories, you rarely want the whole thing. You might need one component's assets from a monorepo build, a single day's logs from an archive, or one region's data from a dataset. The ability to fetch a verified subset — and know it's authentic without downloading everything else — is important. + +**UnixFS** has a structural advantage here: because directories are first-class DAG nodes, you can address and fetch an entire subdirectory by its CID. The directory node links to all its children, and the directory's CID verifies the whole subtree. This makes UnixFS well suited to collections where the folder hierarchy carries meaning. + +**iroh collections and MASL** are flat: they map paths to content hashes with no intermediate directory nodes. 
You can fetch individual files by their hash, but there is no native concept of "give me everything under this folder" — you would need to filter the path list client-side and fetch each matching file individually. For use cases where subsetting means "pick specific files," this works fine. For use cases where subsetting means "give me this folder and everything in it," UnixFS is more natural. + +## UnixFS (IPFS) + +[UnixFS](https://specs.ipfs.tech/unixfs/) is the original IPFS approach to representing files and directories as content-addressed DAGs. It uses a two-layer encoding: an outer `dag-pb` block format (the general-purpose IPFS block codec) wrapping an inner UnixFS protobuf message that carries file/directory-specific semantics. + +### How files are represented + +Small files (under the chunk size, typically 256 KiB or 1 MiB) can be stored as a single **raw leaf block** — just the bytes, no protobuf wrapping, identified by a CID. This is the most efficient case: zero structural overhead. + +Larger files are split into chunks, each stored as a raw leaf block. These chunks are then linked together by a **file root node** — a `dag-pb` block containing a UnixFS message of type `File`. The root node's `Links` array holds one entry per chunk: + +```text +File root (dag-pb + UnixFS) +├── Link: CID(chunk₁), Size: 262144 +├── Link: CID(chunk₂), Size: 262144 +├── Link: CID(chunk₃), Size: 262144 +└── Link: CID(chunk₄), Size: 131072 +``` + +Each link carries the chunk's CID and its byte size (`Tsize`). The UnixFS `Data` field in the root node also stores a `blocksizes` array — the unencoded size of each chunk — used for byte-range calculations. When chunks are **raw leaf blocks** (the modern default, indicated by a `raw` codec CID), seeking is straightforward: the `blocksizes` array lets you calculate which chunk contains a given byte offset, fetch just that chunk, and read directly from the raw bytes. 
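
The byte-offset calculation described above can be sketched as follows. This is a minimal illustration in Python, not code taken from any IPFS implementation: the helper name `locate_chunk` is hypothetical, and `blocksizes` is assumed to be a plain list of chunk sizes in bytes, in file order, matching the four-chunk example.

```python
from bisect import bisect_right
from itertools import accumulate


def locate_chunk(blocksizes, offset):
    """Return (chunk_index, offset_within_chunk) for a byte offset.

    Hypothetical helper: `blocksizes` is a list of chunk sizes in bytes,
    in file order, mirroring the UnixFS `blocksizes` array.
    """
    total = sum(blocksizes)
    if not 0 <= offset < total:
        raise ValueError(f"offset {offset} outside file of {total} bytes")
    # ends[i] is the byte offset one past the end of chunk i.
    ends = list(accumulate(blocksizes))
    # The first chunk whose end lies beyond `offset` contains it.
    index = bisect_right(ends, offset)
    start = ends[index - 1] if index > 0 else 0
    return index, offset - start


# Sizes from the four-chunk example (3 x 256 KiB + 128 KiB).
sizes = [262144, 262144, 262144, 131072]
print(locate_chunk(sizes, 600000))
```

A reader would then fetch only the chunk at the returned index and start reading at the returned position inside it. A real implementation would likely cache the cumulative sizes rather than rebuild them on every call.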
+ +For very large files, the DAG can be multiple levels deep: intermediate nodes link to groups of chunks, and the root links to those intermediate nodes. This is typically a balanced tree (the default in Kubo), though the balancing strategy is not mandated by the spec. + +### How directories are represented + +A **directory node** is a `dag-pb` block with a UnixFS message of type `Directory`. Each child (file or subdirectory) is a named link: + +```text +Directory (dag-pb + UnixFS) +├── Link: "components/" → CID(subdirectory node) +├── Link: "assets/" → CID(subdirectory node) +└── Link: "index.html" → CID(file root or raw leaf) +``` + +Each link carries a name (the filename or folder name), the child's CID, and a cumulative `Tsize` (the total bytes reachable through that link, including all descendants). This means a directory node is self-contained: it tells you everything you need to list the directory and navigate into children. + +When a directory contains too many entries to fit in a single block (typically above 256 KiB–1 MiB), UnixFS switches to a **HAMT-sharded directory** — a hash-array-mapped trie spread across multiple blocks. The root is a UnixFS node of type `HAMTShard`, and child nodes are distributed across buckets by hashing the filename. This keeps individual blocks small but adds traversal depth: looking up a file by name requires walking the trie. + +### Subsetting files and folders + +UnixFS models directories as DAG nodes with their own CIDs. This means every folder is independently addressable and verifiable. For a directory structured as: + +```text +project/ +├── components/ +│ ├── Header.tsx +│ └── Footer.tsx +├── assets/ +│ ├── style.css +│ └── logo.png +└── index.html +``` + +You can fetch just `components/` by its CID and get the entire subtree — both files and the directory structure — verified by a single hash. This is useful whenever the folder hierarchy carries meaning (by module, by date, by region, etc.). 
+ +Large individual files also benefit: because UnixFS splits files into a DAG of chunks, you can fetch byte ranges within a file without downloading the whole thing. For multi-gigabyte files, this enables partial reads — e.g. reading a specific section of a large log or data file. + +### Tradeoffs + +- **dag-pb envelope overhead.** Each block is wrapped in a `dag-pb` outer protobuf (`PBNode`) containing an inner UnixFS protobuf message. With 1 MiB chunks and raw leaves, the overhead comes from two sources: multi-block file roots (~44 bytes per chunk for link CIDs and sizes) and directory entries (~60–80 bytes per file depending on filename length). For a typical 1 TiB dataset of 100K files averaging 10 MiB each, total overhead is roughly 50 MiB (~0.005%). Only pathological cases — millions of tiny files — push overhead toward 0.5–1%, because each file still needs a directory entry even if it fits in a single raw block. HAMT sharding adds roughly 10–20% on top of directory overhead when directories grow large. +- **Double-protobuf parsing cost.** Beyond the storage overhead, the nested encoding means every block requires two rounds of protobuf decoding — first the outer `PBNode`, then the inner UnixFS `Data` message. When traversing a large DAG (e.g. resolving a deeply nested path or iterating a sharded directory), this double-decode cost is paid at every node, adding up to meaningful CPU overhead for large-scale reads. +- **HAMT sharding.** Large directories automatically switch to hash-array-mapped-trie sharding, which adds traversal complexity and means the same logical directory can have different structures depending on size. +- **Optionality over determinism.** UnixFS embraces optionality, meaning the same data can produce different CIDs depending on how the underlying Merkle DAG is constructed. 
Parameters like chunk size, chunking algorithm (fixed-size vs Rabin vs Buzhash), DAG balancing strategy (balanced vs trickle), max link count per node, and whether to use raw leaves or dag-pb leaves all affect the resulting CID. Two different tools ingesting the same directory with different defaults will produce different CIDs. [IPIP-499: UnixFS CID Profiles](https://github.com/ipfs/specs/pull/499) aims to solve this by standardising a single set of parameters — defining one canonical way to construct the DAG so that any conforming implementation produces the same CID for the same input. Until that lands, determinism requires all parties to agree on identical parameters out of band. +- **Deep DAG traversal.** Resolving a file means walking the directory DAG node by node — `a/b/c.csv` requires resolving `a`, then `b`, then the file, each a separate block fetch and decode. Nested directories and HAMT shards make the depth unpredictable. +- **Mature ecosystem.** UnixFS has the broadest tooling support and is the de facto standard for IPFS content addressing, with implementations in Go ([Kubo](https://github.com/ipfs/kubo)), JavaScript ([Helia](https://github.com/ipfs/helia)), and Rust ([beetle](https://github.com/n0-computer/beetle)). + +## DASL: MASL and DRISL + +[DASL](https://dasl.ing) (Data Addressed Structures and Links) is a family of specs emerging from the Bluesky/AT Protocol ecosystem that provide content-addressed data structures built on CBOR rather than protobuf. 
+ +**[DRISL](https://dasl.ing/drisl.html)** (Deterministic Representation for Interoperable Structures & Links) is a constrained CBOR application profile designed for deterministic serialization: + +- Identical data always produces identical bytes (and therefore identical CIDs) +- Native CID support via CBOR Tag 42 +- Strict constraints: string-only map keys, no indefinite-length arrays, restricted float representations +- Each CID refers to one complete, discrete CBOR object + +**[MASL](https://dasl.ing/masl.html)** is a CBOR-based metadata system built on DRISL, designed for content-addressed and decentralized systems. It operates in two modes: + +- **Single mode** (`src`): wraps one resource with metadata (content type, etc.) +- **Bundle mode** (`resources`): maps file paths to resource CIDs with per-file metadata — essentially a directory representation + +MASL bundles are conceptually similar to iroh collections: a flat map of paths to content hashes, no directory hierarchy nodes. The key difference is MASL also carries per-resource metadata (like content types) and uses CIDs (self-describing, multi-codec identifiers) rather than raw BLAKE3 hashes. Like iroh collections, subsetting operates at the individual file level — there is no native subdirectory addressing. + +Because DRISL and MASL build on CBOR — a widely supported serialization format with libraries in virtually every language — they likely have the widest potential for cross-language implementation. A [cross-implementation test suite](https://hyphacoop.github.io/dasl-testing/) tracks conformance across languages. + +## iroh collections + +An iroh `Collection` is a flat list of `(String, blake3::Hash)` pairs. Filenames are mapped to 32-byte BLAKE3 content hashes. Directory structure is encoded in the name strings as relative paths (e.g. `"assets/style.css"`, `"js/app.js"`), keeping the format flat while representing arbitrary directory trees. 
+ +On the wire, a collection splits into two blobs: + +### The metadata blob (`CollectionMeta`) + +Serialized with [postcard](https://docs.rs/postcard) (a compact, no_std-friendly binary format): + +``` +┌──────────────────────────────┐ +│ header: "CollectionV0." │ 13 bytes, magic/version tag +├──────────────────────────────┤ +│ names: Vec │ varint-prefixed length, then +│ "assets/style.css" │ each string is varint-length +│ "js/app.js" │ prefixed + raw UTF-8 bytes +│ "index.html" │ +└──────────────────────────────┘ +``` + +No delimiters between strings — postcard uses length-prefixed encoding throughout (similar to protobuf, but without field tags, making it more compact). + +### The root blob (`HashSeq`) + +A sequence of 32-byte BLAKE3 hashes: + +``` +┌─────────────────────────────────┐ +│ hash(metadata blob) │ 32 bytes +│ hash("assets/style.css") │ 32 bytes +│ hash("js/app.js") │ 32 bytes +│ hash("index.html") │ 32 bytes +└─────────────────────────────────┘ +``` + +The first entry is the hash of the metadata blob. The remaining entries correspond 1:1 with the names in the metadata. + +### The collection hash and CIDs + +The **BLAKE3 hash of the root blob** is the single hash that identifies the entire collection. Verifying this one hash verifies every **file name** and every **file's contents**. + +``` +Collection Hash = blake3(root blob) + = blake3(hash(meta) ‖ hash(file₁) ‖ hash(file₂) ‖ …) +``` + +These are standard BLAKE3 hashes, but they can be encoded as CIDs for interoperability with the broader content-addressed ecosystem. The [multicodec table](https://github.com/multiformats/multicodec/blob/master/table.csv) defines the necessary codes: hash function `blake3` (`0x1e`) and codec `blake3_hashseq` (`0x80`). This lets iroh collection hashes be referenced anywhere CIDs are used, without changing the underlying data. + +### Characteristics + +- **No metadata pollution.** Unlike tar/zip, there are no timestamps, permissions, or ownership fields. 
Two directories with identical file names and contents always produce the same hash, regardless of when or where they were produced. +- **Flat representation of trees.** Directory structure lives in the name strings as relative paths, not as separate directory entries with their own metadata. One entry per file, no ambiguity about empty directories or nested paths. +- **Positional, tag-free encoding.** Postcard serializes fields in declaration order with no field numbers or type tags. The `"CollectionV0."` magic header handles versioning. +- **Compact.** The overhead per file is a varint-prefixed filename in the metadata blob and a 32-byte hash in the root blob. +- **O(1) file lookup.** The root blob is a flat array of fixed-size 32-byte hashes, so finding the Nth file is a constant-time offset (`N * 32` bytes) with no parsing required. +- **Streaming verification.** The root blob is a hash sequence, so a verifier can check individual files incrementally as they arrive. +- **Ready-made distribution.** Collections can be distributed directly over iroh's peer-to-peer network without conversion. +- **BLAKE3.** Fast (parallelizable, SIMD-accelerated), 256-bit digests, and adopted by the [BDASL](https://dasl.ing/bdasl.html) spec. +- **File-level subsetting only.** Individual files can be fetched and verified by their BLAKE3 hash, but there is no way to address a subdirectory as a unit. Fetching a subset means filtering the path list and requesting files one by one. +- **Rust only (for now).** The reference implementation is in Rust. The format is simple enough to implement in other languages — it's just postcard-encoded strings and a flat array of BLAKE3 hashes — but no other implementations exist yet. 
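
The fixed-size layout behind the O(1) lookup can be sketched as follows. This is a minimal illustration in Python, assuming the layout described above (entry 0 is the metadata blob hash, followed by one 32-byte entry per name). The helper `file_hash_at` is hypothetical and is not part of the iroh API, and the digests are stand-in bytes rather than real BLAKE3 hashes.

```python
HASH_LEN = 32  # BLAKE3 digests are 32 bytes


def file_hash_at(root_blob, names, path):
    """Return the 32-byte hash for `path` from a HashSeq-style root blob.

    Assumes the layout described above: entry 0 is the hash of the
    metadata blob, and entry i + 1 belongs to names[i].
    """
    if len(root_blob) != (len(names) + 1) * HASH_LEN:
        raise ValueError("root blob length does not match name count")
    position = names.index(path)  # linear scan of the names, no hashing
    start = (position + 1) * HASH_LEN  # skip the metadata hash
    return root_blob[start:start + HASH_LEN]


# Stand-in digests: real entries would be BLAKE3 hashes, not repeated bytes.
names = ["assets/style.css", "js/app.js", "index.html"]
root_blob = b"".join(bytes([i]) * HASH_LEN for i in range(len(names) + 1))
print(file_hash_at(root_blob, names, "js/app.js").hex())
```

Only the offset computation is constant-time. Mapping a path to its position still requires reading the name list from the metadata blob, which this sketch does with a linear scan.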
+ +## Comparison + +| Criteria | iroh collections | UnixFS | MASL/DRISL | +| --------------- | -------------------------- | ---------------------------------------- | ------------------------------------------------------------------------------------- | +| Encoding | Postcard (tag-free binary) | Protobuf (dag-pb) | [CBOR (deterministic subset with support for CIDs)](https://dasl.ing/drisl.html) | +| Hash | BLAKE3 | Configurable (SHA-256 default) | Configurable via CID multihash | +| Directory model | Flat path→hash list | DAG with directory nodes + HAMT sharding | Flat path→CID map | +| Overhead | ~0% (names + hashes only) | 0.005–1% (depends on file count/size) | Minimal (CBOR framing) | +| Identifiers | BLAKE3 hash (CID-encodable via `blake3` + `blake3_hashseq` multicodecs) | CID (self-describing) | CID (self-describing) | +| File lookup | O(1) offset from root | DAG traversal, depth varies | O(1) key lookup from root | +| Subsetting | Individual files only | Files and folders (subtree by CID) | Individual files only | +| Byte ranges | No (whole-file hashes) | Yes (chunked DAG allows partial reads) | No (whole-file CIDs) | +| Determinism | By construction | Depends on DAG construction choices | By construction (DRISL) | +| Implementations | Rust only | Go, JavaScript, Rust | Wide See [cross-implementation test suite](https://hyphacoop.github.io/dasl-testing/) | +| Ecosystem | iroh/n0 | IPFS (broad) | AT Protocol/Bluesky (emerging) | From 38618646cc1137a7bc04d93af9a8199e06a3ec5b Mon Sep 17 00:00:00 2001 From: Daniel Norman Date: Fri, 6 Mar 2026 14:13:09 +0100 Subject: [PATCH 02/19] Add point about gateway support --- docs/how-to/content-addressed-folders.md | 27 ++++++++++++------------ 1 file changed, 14 insertions(+), 13 deletions(-) diff --git a/docs/how-to/content-addressed-folders.md b/docs/how-to/content-addressed-folders.md index 2adec3915..6938e98cb 100644 --- a/docs/how-to/content-addressed-folders.md +++ b/docs/how-to/content-addressed-folders.md @@ 
-179,16 +179,17 @@ These are standard BLAKE3 hashes, but they can be encoded as CIDs for interopera ## Comparison -| Criteria | iroh collections | UnixFS | MASL/DRISL | -| --------------- | -------------------------- | ---------------------------------------- | ------------------------------------------------------------------------------------- | -| Encoding | Postcard (tag-free binary) | Protobuf (dag-pb) | [CBOR (deterministic subset with support for CIDs)](https://dasl.ing/drisl.html) | -| Hash | BLAKE3 | Configurable (SHA-256 default) | Configurable via CID multihash | -| Directory model | Flat path→hash list | DAG with directory nodes + HAMT sharding | Flat path→CID map | -| Overhead | ~0% (names + hashes only) | 0.005–1% (depends on file count/size) | Minimal (CBOR framing) | -| Identifiers | BLAKE3 hash (CID-encodable via `blake3` + `blake3_hashseq` multicodecs) | CID (self-describing) | CID (self-describing) | -| File lookup | O(1) offset from root | DAG traversal, depth varies | O(1) key lookup from root | -| Subsetting | Individual files only | Files and folders (subtree by CID) | Individual files only | -| Byte ranges | No (whole-file hashes) | Yes (chunked DAG allows partial reads) | No (whole-file CIDs) | -| Determinism | By construction | Depends on DAG construction choices | By construction (DRISL) | -| Implementations | Rust only | Go, JavaScript, Rust | Wide See [cross-implementation test suite](https://hyphacoop.github.io/dasl-testing/) | -| Ecosystem | iroh/n0 | IPFS (broad) | AT Protocol/Bluesky (emerging) | +| Criteria | iroh collections | UnixFS | MASL/DRISL | +| -------------------- | ----------------------------------------------------------------------- | ---------------------------------------- | ------------------------------------------------------------------------------------- | +| Encoding | Postcard (tag-free binary) | Protobuf (dag-pb) | [CBOR (deterministic subset with support for CIDs)](https://dasl.ing/drisl.html) | +| Hash | 
BLAKE3 | Configurable (SHA-256 default) | Configurable via CID multihash | +| Directory model | Flat path→hash list | DAG with directory nodes + HAMT sharding | Flat path→CID map | +| Overhead | ~0% (names + hashes only) | 0.005–1% (depends on file count/size) | Minimal (CBOR framing) | +| Identifiers | BLAKE3 hash (CID-encodable via `blake3` + `blake3_hashseq` multicodecs) | CID (self-describing) | CID (self-describing) | +| File lookup | O(1) offset from root | DAG traversal, depth varies | O(1) key lookup from root | +| Subsetting | Individual files only | Files and folders (subtree by CID) | Individual files only | +| Byte ranges | No (whole-file hashes) | Yes (chunked DAG allows partial reads) | No (whole-file CIDs) | +| Determinism | By construction | Depends on DAG construction choices | By construction (DRISL) | +| Implementations | Rust only | Go, JavaScript, Rust | Wide See [cross-implementation test suite](https://hyphacoop.github.io/dasl-testing/) | +| IPFS Gateway support | No | Yes | Yes | +| Ecosystem | iroh/n0 | IPFS (broad) | AT Protocol/Bluesky (emerging) | From 7f31b301325ed55c8ec765d584f56e553aee5880 Mon Sep 17 00:00:00 2001 From: Daniel Norman Date: Fri, 6 Mar 2026 14:19:32 +0100 Subject: [PATCH 03/19] Add words to vocab list --- .github/styles/pln-ignore.txt | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/.github/styles/pln-ignore.txt b/.github/styles/pln-ignore.txt index 08866894c..35fa71cca 100644 --- a/.github/styles/pln-ignore.txt +++ b/.github/styles/pln-ignore.txt @@ -1,3 +1,4 @@ + _redirects aave accessor @@ -99,6 +100,7 @@ ethereum exfiltrate explainers fabien +facto failovers filebase Filebase's @@ -188,6 +190,7 @@ metamask minimalistic minty('s) mojitos +monorepo multiaddr multiaddr(ess) multiaddress @@ -224,9 +227,10 @@ npm octodns onboarding orcestra -orcestras ORCESTRA's +orcestras packfile +parallelizable passthrough peergos performant @@ -244,7 +248,6 @@ preload prenegotiated prepended 
processannounce - proto protobuf protocol labs @@ -277,6 +280,7 @@ satoshi nakamoto SDKs se serverless +sharded sharding snapshotted sneakernet @@ -289,7 +293,9 @@ storacha Storacha's storj subcommand +subsetting substring +subtree sys systemd sztandera @@ -298,10 +304,12 @@ testground testnet toolkits toolset +trie trustlessly trustlessness uncensorable undialable +unencoded uniswap unixfs unreachability From 8f931958a860f92fa757b3c39fbab7666d7cf8fd Mon Sep 17 00:00:00 2001 From: Daniel Norman Date: Fri, 6 Mar 2026 14:19:45 +0100 Subject: [PATCH 04/19] Refine comparison --- docs/how-to/content-addressed-folders.md | 9 +-------- 1 file changed, 1 insertion(+), 8 deletions(-) diff --git a/docs/how-to/content-addressed-folders.md b/docs/how-to/content-addressed-folders.md index 6938e98cb..d73f1908f 100644 --- a/docs/how-to/content-addressed-folders.md +++ b/docs/how-to/content-addressed-folders.md @@ -15,13 +15,6 @@ The goal is to have a single content hash that represents a directory of files, This matters for build outputs, software distributions, large datasets, website archives — any case where you need to verify that a collection of files hasn't changed. A naive approach like hashing a tarball is fragile: tar archives encode metadata (timestamps, permissions, ordering) that vary between machines, producing different hashes for identical file contents. Content addressing solves this, but the choice of format has real consequences — particularly for overhead, determinism, language support and existing tooling, and whether you can fetch subsets without downloading the whole thing. These differences compound as dataset size grows: what's negligible at megabyte scale — a few extra bytes of framing, an extra round of parsing per block — becomes a meaningful cost at terabyte scale across millions of files. 
-A content-addressed directory representation should be: - -- **Deterministic**: the same files always produce the same hash, regardless of where or when they were produced -- **Lightweight**: minimal overhead beyond the file data itself -- **Canonical**: one correct representation, no ambiguity -- **Subsetable**: consumers should be able to fetch individual files or subdirectories without downloading everything - ## Subsetting: files vs folders For large directories, you rarely want the whole thing. You might need one component's assets from a monorepo build, a single day's logs from an archive, or one region's data from a dataset. The ability to fetch a verified subset — and know it's authentic without downloading everything else — is important. @@ -93,7 +86,7 @@ Large individual files also benefit: because UnixFS splits files into a DAG of c - **HAMT sharding.** Large directories automatically switch to hash-array-mapped-trie sharding, which adds traversal complexity and means the same logical directory can have different structures depending on size. - **Optionality over determinism.** UnixFS embraces optionality, meaning the same data can produce different CIDs depending on how the underlying Merkle DAG is constructed. Parameters like chunk size, chunking algorithm (fixed-size vs Rabin vs Buzhash), DAG balancing strategy (balanced vs trickle), max link count per node, and whether to use raw leaves or dag-pb leaves all affect the resulting CID. Two different tools ingesting the same directory with different defaults will produce different CIDs. [IPIP-499: UnixFS CID Profiles](https://github.com/ipfs/specs/pull/499) aims to solve this by standardising a single set of parameters — defining one canonical way to construct the DAG so that any conforming implementation produces the same CID for the same input. Until that lands, determinism requires all parties to agree on identical parameters out of band. 
- **Deep DAG traversal.** Resolving a file means walking the directory DAG node by node — `a/b/c.csv` requires resolving `a`, then `b`, then the file, each a separate block fetch and decode. Nested directories and HAMT shards make the depth unpredictable. -- **Mature ecosystem.** UnixFS has the broadest tooling support and is the de facto standard for IPFS content addressing, with implementations in Go ([Kubo](https://github.com/ipfs/kubo)), JavaScript ([Helia](https://github.com/ipfs/helia)), and Rust ([beetle](https://github.com/n0-computer/beetle)). +- **Mature ecosystem.** UnixFS has the broadest tooling support and is the de facto standard for IPFS content addressing, with implementations in Go ([Kubo](https://github.com/ipfs/kubo)) and TypeScript ([Helia](https://github.com/ipfs/helia)). ## DASL: MASL and DRISL From 38077b05da90bf914376aab11c249f23249ce32f Mon Sep 17 00:00:00 2001 From: Daniel Norman Date: Fri, 6 Mar 2026 14:34:57 +0100 Subject: [PATCH 05/19] Refine iroh collections --- docs/how-to/content-addressed-folders.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/how-to/content-addressed-folders.md b/docs/how-to/content-addressed-folders.md index d73f1908f..cf685e99f 100644 --- a/docs/how-to/content-addressed-folders.md +++ b/docs/how-to/content-addressed-folders.md @@ -110,13 +110,13 @@ Because DRISL and MASL build on CBOR — a widely supported serialization format ## iroh collections -An iroh `Collection` is a flat list of `(String, blake3::Hash)` pairs. Filenames are mapped to 32-byte BLAKE3 content hashes. Directory structure is encoded in the name strings as relative paths (e.g. `"assets/style.css"`, `"js/app.js"`), keeping the format flat while representing arbitrary directory trees. +An iroh `Collection` is a way to represent a directory of files as a single content hash, designed for efficient verification and distribution with [iroh-blobs](https://docs.iroh.computer/protocols/blobs). 
The format is simple by design: a flat list of `(String, blake3::Hash)` pairs. Filenames are mapped to 32-byte BLAKE3 content hashes. Directory structure is encoded in the name strings as relative paths (e.g. `"assets/style.css"`, `"js/app.js"`), keeping the format flat while representing arbitrary directory trees. On the wire, a collection splits into two blobs: ### The metadata blob (`CollectionMeta`) -Serialized with [postcard](https://docs.rs/postcard) (a compact, no_std-friendly binary format): +Serialized with [postcard](https://docs.rs/postcard): ``` ┌──────────────────────────────┐ @@ -165,7 +165,7 @@ These are standard BLAKE3 hashes, but they can be encoded as CIDs for interopera - **Compact.** The overhead per file is a varint-prefixed filename in the metadata blob and a 32-byte hash in the root blob. - **O(1) file lookup.** The root blob is a flat array of fixed-size 32-byte hashes, so finding the Nth file is a constant-time offset (`N * 32` bytes) with no parsing required. - **Streaming verification.** The root blob is a hash sequence, so a verifier can check individual files incrementally as they arrive. -- **Ready-made distribution.** Collections can be distributed directly over iroh's peer-to-peer network without conversion. +- **Ready-made distribution.** Collections can be distributed in a peer-to-peer fashion with iroh-blobs. - **BLAKE3.** Fast (parallelizable, SIMD-accelerated), 256-bit digests, and adopted by the [BDASL](https://dasl.ing/bdasl.html) spec. - **File-level subsetting only.** Individual files can be fetched and verified by their BLAKE3 hash, but there is no way to address a subdirectory as a unit. Fetching a subset means filtering the path list and requesting files one by one. - **Rust only (for now).** The reference implementation is in Rust. The format is simple enough to implement in other languages — it's just postcard-encoded strings and a flat array of BLAKE3 hashes — but no other implementations exist yet. 
From 4d42b4d56ef06f79438c842591adc4958d1932da Mon Sep 17 00:00:00 2001 From: Daniel Norman Date: Fri, 13 Mar 2026 17:43:44 +0100 Subject: [PATCH 06/19] Address feedback and refine --- docs/how-to/content-addressed-folders.md | 47 ++++++++++++++---------- 1 file changed, 27 insertions(+), 20 deletions(-) diff --git a/docs/how-to/content-addressed-folders.md b/docs/how-to/content-addressed-folders.md index cf685e99f..9ef3f7303 100644 --- a/docs/how-to/content-addressed-folders.md +++ b/docs/how-to/content-addressed-folders.md @@ -1,9 +1,9 @@ --- -title: Content addressing directories of files -description: A comparison of UnixFS, iroh collections, and DASL/MASL for content addressing directories of files, covering overhead, determinism, subsetting, and ecosystem support. +title: Content addressing data sets +description: A comparison of UnixFS, iroh collections, and DASL/MASL for content addressing data sets like directories of files, with a focus on overhead, determinism, subsetting, and ecosystem support. --- -# Content addressing directories of files +# Content addressing data sets This guide compares three approaches to content addressing directories of files: @@ -11,9 +11,19 @@ This guide compares three approaches to content addressing directories of files: - [iroh collections](https://docs.iroh.computer/protocols/blobs#collections) - [DASL](https://dasl.ing) along with its metadata system [MASL](https://dasl.ing/masl.html) -The goal is to have a single content hash that represents a directory of files, such that verifying that hash verifies the entire contents. +## Merkle DAGs: the common foundation -This matters for build outputs, software distributions, large datasets, website archives — any case where you need to verify that a collection of files hasn't changed. 
A naive approach like hashing a tarball is fragile: tar archives encode metadata (timestamps, permissions, ordering) that vary between machines, producing different hashes for identical file contents. Content addressing solves this, but the choice of format has real consequences — particularly for overhead, determinism, language support and existing tooling, and whether you can fetch subsets without downloading the whole thing. These differences compound as dataset size grows: what's negligible at megabyte scale — a few extra bytes of framing, an extra round of parsing per block — becomes a meaningful cost at terabyte scale across millions of files. +Before comparing formats, it helps to understand what they all share: every approach effectively constructs a [_Merkle DAG_](../concepts/merkle-dag.md), a data structure which allows you to derive a small verification identifier like a CID to represents a collection of data. + +The formats below vary in how they construct the Merkle DAG and the trade-offs they make, but in essence they all allow you to produce a CID that represents a collection of files, such that verifying that hash verifies the entire contents. + +This matters for build outputs, software distributions, large datasets, website archives — any case where you need to verify that a collection of files hasn't changed. + +A naive approach like hashing a tarball is fragile: tar archives encode metadata (timestamps, permissions, ordering) that vary between machines, producing different hashes for identical file contents. It's also impractical for large datasets where you cannot afford to store two copies of the data. + +Content addressing solves this, but the choice of format has real consequences, particularly for overhead, determinism, language support and interoprability within an ecosystem. 
+ +These differences compound as dataset size grows: what's negligible at megabyte scale —a few extra bytes of framing, an extra round of parsing per block— becomes a meaningful cost at terabyte scale across millions of files. ## Subsetting: files vs folders @@ -118,14 +128,14 @@ On the wire, a collection splits into two blobs: Serialized with [postcard](https://docs.rs/postcard): -``` +```ascii ┌──────────────────────────────┐ │ header: "CollectionV0." │ 13 bytes, magic/version tag ├──────────────────────────────┤ │ names: Vec │ varint-prefixed length, then -│ "assets/style.css" │ each string is varint-length -│ "js/app.js" │ prefixed + raw UTF-8 bytes -│ "index.html" │ +│ "assets/style.css" │ each string is varint-length +│ "js/app.js" │ prefixed + raw UTF-8 bytes +│ "index.html" │ └──────────────────────────────┘ ``` @@ -135,12 +145,12 @@ No delimiters between strings — postcard uses length-prefixed encoding through A sequence of 32-byte BLAKE3 hashes: -``` +```ascii ┌─────────────────────────────────┐ │ hash(metadata blob) │ 32 bytes -│ hash("assets/style.css") │ 32 bytes -│ hash("js/app.js") │ 32 bytes -│ hash("index.html") │ 32 bytes +│ hash("assets/style.css") │ 32 bytes +│ hash("js/app.js") │ 32 bytes +│ hash("index.html") │ 32 bytes └─────────────────────────────────┘ ``` @@ -150,7 +160,7 @@ The first entry is the hash of the metadata blob. The remaining entries correspo The **BLAKE3 hash of the root blob** is the single hash that identifies the entire collection. Verifying this one hash verifies every **file name** and every **file's contents**. -``` +```code Collection Hash = blake3(root blob) = blake3(hash(meta) ‖ hash(file₁) ‖ hash(file₂) ‖ …) ``` @@ -160,14 +170,12 @@ These are standard BLAKE3 hashes, but they can be encoded as CIDs for interopera ### Characteristics - **No metadata pollution.** Unlike tar/zip, there are no timestamps, permissions, or ownership fields. 
Two directories with identical file names and contents always produce the same hash, regardless of when or where they were produced. -- **Flat representation of trees.** Directory structure lives in the name strings as relative paths, not as separate directory entries with their own metadata. One entry per file, no ambiguity about empty directories or nested paths. - **Positional, tag-free encoding.** Postcard serializes fields in declaration order with no field numbers or type tags. The `"CollectionV0."` magic header handles versioning. - **Compact.** The overhead per file is a varint-prefixed filename in the metadata blob and a 32-byte hash in the root blob. -- **O(1) file lookup.** The root blob is a flat array of fixed-size 32-byte hashes, so finding the Nth file is a constant-time offset (`N * 32` bytes) with no parsing required. +- **O(1) hash retrieval by index.** Once you know a file's index N, its hash is at a constant-time offset (`N * 32` bytes) in the root blob — no parsing required. Finding N by filename requires a linear scan of the metadata blob to match the path string, but the hash fetch itself is O(1). - **Streaming verification.** The root blob is a hash sequence, so a verifier can check individual files incrementally as they arrive. - **Ready-made distribution.** Collections can be distributed in a peer-to-peer fashion with iroh-blobs. - **BLAKE3.** Fast (parallelizable, SIMD-accelerated), 256-bit digests, and adopted by the [BDASL](https://dasl.ing/bdasl.html) spec. -- **File-level subsetting only.** Individual files can be fetched and verified by their BLAKE3 hash, but there is no way to address a subdirectory as a unit. Fetching a subset means filtering the path list and requesting files one by one. - **Rust only (for now).** The reference implementation is in Rust. The format is simple enough to implement in other languages — it's just postcard-encoded strings and a flat array of BLAKE3 hashes — but no other implementations exist yet. 
## Comparison @@ -175,14 +183,13 @@ These are standard BLAKE3 hashes, but they can be encoded as CIDs for interopera | Criteria | iroh collections | UnixFS | MASL/DRISL | | -------------------- | ----------------------------------------------------------------------- | ---------------------------------------- | ------------------------------------------------------------------------------------- | | Encoding | Postcard (tag-free binary) | Protobuf (dag-pb) | [CBOR (deterministic subset with support for CIDs)](https://dasl.ing/drisl.html) | -| Hash | BLAKE3 | Configurable (SHA-256 default) | Configurable via CID multihash | +| Hash | BLAKE3 | Configurable (SHA-256 default) | SHA-256 (DASL-CIDs only) | | Directory model | Flat path→hash list | DAG with directory nodes + HAMT sharding | Flat path→CID map | | Overhead | ~0% (names + hashes only) | 0.005–1% (depends on file count/size) | Minimal (CBOR framing) | | Identifiers | BLAKE3 hash (CID-encodable via `blake3` + `blake3_hashseq` multicodecs) | CID (self-describing) | CID (self-describing) | | File lookup | O(1) offset from root | DAG traversal, depth varies | O(1) key lookup from root | | Subsetting | Individual files only | Files and folders (subtree by CID) | Individual files only | -| Byte ranges | No (whole-file hashes) | Yes (chunked DAG allows partial reads) | No (whole-file CIDs) | | Determinism | By construction | Depends on DAG construction choices | By construction (DRISL) | | Implementations | Rust only | Go, JavaScript, Rust | Wide See [cross-implementation test suite](https://hyphacoop.github.io/dasl-testing/) | | IPFS Gateway support | No | Yes | Yes | -| Ecosystem | iroh/n0 | IPFS (broad) | AT Protocol/Bluesky (emerging) | +| Ecosystem | iroh/n0 | IPFS (broad) | AT Protocol/Bluesky | From 0ddbd91ebaaa4047649a9e8a612e6c5acbb7017a Mon Sep 17 00:00:00 2001 From: Daniel Norman Date: Wed, 18 Mar 2026 18:06:23 +0100 Subject: [PATCH 07/19] Expand on the two key properties of merkle dags --- 
docs/how-to/content-addressed-folders.md | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/docs/how-to/content-addressed-folders.md b/docs/how-to/content-addressed-folders.md index 9ef3f7303..6feb30a89 100644 --- a/docs/how-to/content-addressed-folders.md +++ b/docs/how-to/content-addressed-folders.md @@ -15,7 +15,10 @@ This guide compares three approaches to content addressing directories of files: Before comparing formats, it helps to understand what they all share: every approach effectively constructs a [_Merkle DAG_](../concepts/merkle-dag.md), a data structure which allows you to derive a small verification identifier like a CID to represents a collection of data. -The formats below vary in how they construct the Merkle DAG and the trade-offs they make, but in essence they all allow you to produce a CID that represents a collection of files, such that verifying that hash verifies the entire contents. +The formats below vary in how they construct the Merkle DAG and the trade-offs they make, but in essence they all allow you to produce a CID that represents a collection of files, such that you can easily verify two properties: + +- **Inclusion** in the collection: a file (`cat.jpg`) is in the collection addressed by the CID (`bafy..`). +- **Integrity** of the collection as a whole: none of the contents of the collection have been modified since the CID was generated. This matters for build outputs, software distributions, large datasets, website archives — any case where you need to verify that a collection of files hasn't changed. @@ -98,7 +101,7 @@ Large individual files also benefit: because UnixFS splits files into a DAG of c - **Deep DAG traversal.** Resolving a file means walking the directory DAG node by node — `a/b/c.csv` requires resolving `a`, then `b`, then the file, each a separate block fetch and decode. Nested directories and HAMT shards make the depth unpredictable. 
- **Mature ecosystem.** UnixFS has the broadest tooling support and is the de facto standard for IPFS content addressing, with implementations in Go ([Kubo](https://github.com/ipfs/kubo)) and TypeScript ([Helia](https://github.com/ipfs/helia)). -## DASL: MASL and DRISL +## DASL, MASL, and DRISL [DASL](https://dasl.ing) (Data Addressed Structures and Links) is a family of specs emerging from the Bluesky/AT Protocol ecosystem that provide content-addressed data structures built on CBOR rather than protobuf. From 404764dcbee714777a2c1beffdf3bb4fd39164f4 Mon Sep 17 00:00:00 2001 From: Daniel Norman Date: Wed, 18 Mar 2026 18:17:36 +0100 Subject: [PATCH 08/19] Remove confusing point --- docs/how-to/content-addressed-folders.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/docs/how-to/content-addressed-folders.md b/docs/how-to/content-addressed-folders.md index 6feb30a89..3566ebc49 100644 --- a/docs/how-to/content-addressed-folders.md +++ b/docs/how-to/content-addressed-folders.md @@ -17,7 +17,7 @@ Before comparing formats, it helps to understand what they all share: every appr The formats below vary in how they construct the Merkle DAG and the trade-offs they make, but in essence they all allow you to produce a CID that represents a collection of files, such that you can easily verify two properties: -- **Inclusion** in the collection: a file (`cat.jpg`) is in the collection addressed by the CID (`bafy..`). +- **Inclusion** in the collection: a file (`cat.jpg`) is in the collection addressed by the CID (`bafy...`). - **Integrity** of the collection as a whole: none of the contents of the collection have been modified since the CID was generated. This matters for build outputs, software distributions, large datasets, website archives — any case where you need to verify that a collection of files hasn't changed. 
@@ -175,7 +175,6 @@ These are standard BLAKE3 hashes, but they can be encoded as CIDs for interopera - **No metadata pollution.** Unlike tar/zip, there are no timestamps, permissions, or ownership fields. Two directories with identical file names and contents always produce the same hash, regardless of when or where they were produced. - **Positional, tag-free encoding.** Postcard serializes fields in declaration order with no field numbers or type tags. The `"CollectionV0."` magic header handles versioning. - **Compact.** The overhead per file is a varint-prefixed filename in the metadata blob and a 32-byte hash in the root blob. -- **O(1) hash retrieval by index.** Once you know a file's index N, its hash is at a constant-time offset (`N * 32` bytes) in the root blob — no parsing required. Finding N by filename requires a linear scan of the metadata blob to match the path string, but the hash fetch itself is O(1). - **Streaming verification.** The root blob is a hash sequence, so a verifier can check individual files incrementally as they arrive. - **Ready-made distribution.** Collections can be distributed in a peer-to-peer fashion with iroh-blobs. - **BLAKE3.** Fast (parallelizable, SIMD-accelerated), 256-bit digests, and adopted by the [BDASL](https://dasl.ing/bdasl.html) spec. 
From 3aa2bb2069e515cd2c22a5e1d61583e8bfc161b2 Mon Sep 17 00:00:00 2001 From: Daniel Norman Date: Wed, 18 Mar 2026 18:20:16 +0100 Subject: [PATCH 09/19] Apply suggestions from code review Co-authored-by: Bumblefudge Co-authored-by: Mosh <1306020+mishmosh@users.noreply.github.com> --- docs/how-to/content-addressed-folders.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/how-to/content-addressed-folders.md b/docs/how-to/content-addressed-folders.md index 3566ebc49..6d02c3164 100644 --- a/docs/how-to/content-addressed-folders.md +++ b/docs/how-to/content-addressed-folders.md @@ -24,7 +24,7 @@ This matters for build outputs, software distributions, large datasets, website A naive approach like hashing a tarball is fragile: tar archives encode metadata (timestamps, permissions, ordering) that vary between machines, producing different hashes for identical file contents. It's also impractical for large datasets where you cannot afford to store two copies of the data. -Content addressing solves this, but the choice of format has real consequences, particularly for overhead, determinism, language support and interoprability within an ecosystem. +Content addressing solves this, but the choice of format has real consequences, particularly for overhead, determinism, language support and interoperability within an ecosystem. These differences compound as dataset size grows: what's negligible at megabyte scale —a few extra bytes of framing, an extra round of parsing per block— becomes a meaningful cost at terabyte scale across millions of files. @@ -103,7 +103,7 @@ Large individual files also benefit: because UnixFS splits files into a DAG of c ## DASL, MASL, and DRISL -[DASL](https://dasl.ing) (Data Addressed Structures and Links) is a family of specs emerging from the Bluesky/AT Protocol ecosystem that provide content-addressed data structures built on CBOR rather than protobuf. 
+[DASL](https://dasl.ing) (Data Addressed Structures and Links) is a set of simple, standard primitives for working with content-addressed, linked data. Designed as a web-friendly, interoperable subset of IPFS and IPLD primitives, DASL is used in production by the AT Protocol ecosystem, including Bluesky. **[DRISL](https://dasl.ing/drisl.html)** (Deterministic Representation for Interoperable Structures & Links) is a constrained CBOR application profile designed for deterministic serialization: @@ -117,7 +117,7 @@ Large individual files also benefit: because UnixFS splits files into a DAG of c - **Single mode** (`src`): wraps one resource with metadata (content type, etc.) - **Bundle mode** (`resources`): maps file paths to resource CIDs with per-file metadata — essentially a directory representation -MASL bundles are conceptually similar to iroh collections: a flat map of paths to content hashes, no directory hierarchy nodes. The key difference is MASL also carries per-resource metadata (like content types) and uses CIDs (self-describing, multi-codec identifiers) rather than raw BLAKE3 hashes. Like iroh collections, subsetting operates at the individual file level — there is no native subdirectory addressing. +MASL bundles are conceptually similar to iroh collections: a flat map of paths to content hashes, no directory hierarchy nodes. The key difference is MASL also carries per-resource metadata (like content types) and uses CIDs (self-describing, multi-codec identifiers) rather than raw BLAKE3 hashes. Like iroh collections, subsetting operates at the individual resource level — there is no native subdirectory addressing. Because DRISL and MASL build on CBOR — a widely supported serialization format with libraries in virtually every language — they likely have the widest potential for cross-language implementation. A [cross-implementation test suite](https://hyphacoop.github.io/dasl-testing/) tracks conformance across languages. 
From e0b06b1ab5a4f48c54913e3c44dd2895b7e687e5 Mon Sep 17 00:00:00 2001 From: Daniel Norman Date: Wed, 18 Mar 2026 18:44:40 +0100 Subject: [PATCH 10/19] Address feedback on masl --- docs/how-to/content-addressed-folders.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/how-to/content-addressed-folders.md b/docs/how-to/content-addressed-folders.md index 6d02c3164..fc67df46b 100644 --- a/docs/how-to/content-addressed-folders.md +++ b/docs/how-to/content-addressed-folders.md @@ -5,7 +5,7 @@ description: A comparison of UnixFS, iroh collections, and DASL/MASL for content # Content addressing data sets -This guide compares three approaches to content addressing directories of files: +This guide compares three binary formats for content addressing collections of files organised in a directory tree structure: - [UnixFS](https://specs.ipfs.tech/unixfs/) - [iroh collections](https://docs.iroh.computer/protocols/blobs#collections) @@ -103,7 +103,7 @@ Large individual files also benefit: because UnixFS splits files into a DAG of c ## DASL, MASL, and DRISL -[DASL](https://dasl.ing) (Data Addressed Structures and Links) is a set of simple, standard primitives for working with content-addressed, linked data. Designed as a web-friendly, interoperable subset of IPFS and IPLD primitives, DASL is used in production by the AT Protocol ecosystem, including Bluesky. +[DASL](https://dasl.ing) (Data Addressed Structures and Links) is a set of simple, standard primitives for working with content-addressed, linked data. Designed as a web-friendly, interoperable subset of IPFS and IPLD primitives, DASL is used in production by the AT Protocol ecosystem, including Bluesky. 
**[DRISL](https://dasl.ing/drisl.html)** (Deterministic Representation for Interoperable Structures & Links) is a constrained CBOR application profile designed for deterministic serialization: @@ -115,9 +115,9 @@ Large individual files also benefit: because UnixFS splits files into a DAG of c **[MASL](https://dasl.ing/masl.html)** is a CBOR-based metadata system built on DRISL, designed for content-addressed and decentralized systems. It operates in two modes: - **Single mode** (`src`): wraps one resource with metadata (content type, etc.) -- **Bundle mode** (`resources`): maps file paths to resource CIDs with per-file metadata — essentially a directory representation +- **Bundle mode** (`resources`): maps file paths to resource CIDs with per-file metadata forming a directory tree representation -MASL bundles are conceptually similar to iroh collections: a flat map of paths to content hashes, no directory hierarchy nodes. The key difference is MASL also carries per-resource metadata (like content types) and uses CIDs (self-describing, multi-codec identifiers) rather than raw BLAKE3 hashes. Like iroh collections, subsetting operates at the individual resource level — there is no native subdirectory addressing. +MASL bundles are conceptually similar to iroh collections: a flat map of paths to content hashes, no directory hierarchy nodes. The key difference is MASL also carries per-resource metadata (modelled after HTTP headers) and uses CIDs rather than raw BLAKE3 hashes. Like iroh collections, subsetting operates at the individual file level — there is no native subdirectory addressing. Because DRISL and MASL build on CBOR — a widely supported serialization format with libraries in virtually every language — they likely have the widest potential for cross-language implementation. A [cross-implementation test suite](https://hyphacoop.github.io/dasl-testing/) tracks conformance across languages. 
From 89353e567c33441a527df936c5a6052d86ab1261 Mon Sep 17 00:00:00 2001 From: Daniel Norman Date: Wed, 18 Mar 2026 18:46:05 +0100 Subject: [PATCH 11/19] Add note about relative paths --- docs/how-to/content-addressed-folders.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/how-to/content-addressed-folders.md b/docs/how-to/content-addressed-folders.md index fc67df46b..bb875674f 100644 --- a/docs/how-to/content-addressed-folders.md +++ b/docs/how-to/content-addressed-folders.md @@ -117,7 +117,7 @@ Large individual files also benefit: because UnixFS splits files into a DAG of c - **Single mode** (`src`): wraps one resource with metadata (content type, etc.) - **Bundle mode** (`resources`): maps file paths to resource CIDs with per-file metadata forming a directory tree representation -MASL bundles are conceptually similar to iroh collections: a flat map of paths to content hashes, no directory hierarchy nodes. The key difference is MASL also carries per-resource metadata (modelled after HTTP headers) and uses CIDs rather than raw BLAKE3 hashes. Like iroh collections, subsetting operates at the individual file level — there is no native subdirectory addressing. +MASL bundles are conceptually similar to iroh collections: a flat map of relative paths to content hashes, no directory hierarchy nodes. The key difference is MASL also carries per-resource metadata (modelled after HTTP headers) and uses CIDs rather than raw BLAKE3 hashes. Like iroh collections, subsetting operates at the individual file level — there is no native subdirectory addressing. Because DRISL and MASL build on CBOR — a widely supported serialization format with libraries in virtually every language — they likely have the widest potential for cross-language implementation. A [cross-implementation test suite](https://hyphacoop.github.io/dasl-testing/) tracks conformance across languages. 
From a122090ed51d268020ed511a6dc2f304067378a1 Mon Sep 17 00:00:00 2001 From: Daniel Norman Date: Wed, 18 Mar 2026 18:55:32 +0100 Subject: [PATCH 12/19] Address feedback about postcard --- docs/how-to/content-addressed-folders.md | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/docs/how-to/content-addressed-folders.md b/docs/how-to/content-addressed-folders.md index bb875674f..5cce9ffc4 100644 --- a/docs/how-to/content-addressed-folders.md +++ b/docs/how-to/content-addressed-folders.md @@ -129,11 +129,11 @@ On the wire, a collection splits into two blobs: ### The metadata blob (`CollectionMeta`) -Serialized with [postcard](https://docs.rs/postcard): +Serialized with [postcard]: ```ascii ┌──────────────────────────────┐ -│ header: "CollectionV0." │ 13 bytes, magic/version tag +│ header: "CollectionV0." │ 13 bytes, version tag ├──────────────────────────────┤ │ names: Vec │ varint-prefixed length, then │ "assets/style.css" │ each string is varint-length @@ -173,7 +173,7 @@ These are standard BLAKE3 hashes, but they can be encoded as CIDs for interopera ### Characteristics - **No metadata pollution.** Unlike tar/zip, there are no timestamps, permissions, or ownership fields. Two directories with identical file names and contents always produce the same hash, regardless of when or where they were produced. -- **Positional, tag-free encoding.** Postcard serializes fields in declaration order with no field numbers or type tags. The `"CollectionV0."` magic header handles versioning. +- **Positional, tag-free encoding.** [Postcard] serializes fields in declaration order with no field numbers or type tags. The `"CollectionV0."` header serves like a [file signature](https://en.wikipedia.org/wiki/List_of_file_signatures) and allows evolution via versioning. - **Compact.** The overhead per file is a varint-prefixed filename in the metadata blob and a 32-byte hash in the root blob. 
- **Streaming verification.** The root blob is a hash sequence, so a verifier can check individual files incrementally as they arrive. - **Ready-made distribution.** Collections can be distributed in a peer-to-peer fashion with iroh-blobs. @@ -195,3 +195,6 @@ These are standard BLAKE3 hashes, but they can be encoded as CIDs for interopera | Implementations | Rust only | Go, JavaScript, Rust | Wide See [cross-implementation test suite](https://hyphacoop.github.io/dasl-testing/) | | IPFS Gateway support | No | Yes | Yes | | Ecosystem | iroh/n0 | IPFS (broad) | AT Protocol/Bluesky | + + +[postcard]: https://github.com/jamesmunns/postcard \ No newline at end of file From c29ccc544100f9833b5c58efe0462fa169d97c04 Mon Sep 17 00:00:00 2001 From: Daniel Norman Date: Wed, 18 Mar 2026 19:00:00 +0100 Subject: [PATCH 13/19] Refine ecosystems and iroh collections --- docs/how-to/content-addressed-folders.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/docs/how-to/content-addressed-folders.md b/docs/how-to/content-addressed-folders.md index 5cce9ffc4..c094f05d1 100644 --- a/docs/how-to/content-addressed-folders.md +++ b/docs/how-to/content-addressed-folders.md @@ -178,7 +178,7 @@ These are standard BLAKE3 hashes, but they can be encoded as CIDs for interopera - **Streaming verification.** The root blob is a hash sequence, so a verifier can check individual files incrementally as they arrive. - **Ready-made distribution.** Collections can be distributed in a peer-to-peer fashion with iroh-blobs. - **BLAKE3.** Fast (parallelizable, SIMD-accelerated), 256-bit digests, and adopted by the [BDASL](https://dasl.ing/bdasl.html) spec. -- **Rust only (for now).** The reference implementation is in Rust. The format is simple enough to implement in other languages — it's just postcard-encoded strings and a flat array of BLAKE3 hashes — but no other implementations exist yet. 
+- **Rust only** The reference implementation is in Rust and there's an [open issue to add WebAssembly support](https://github.com/n0-computer/iroh-blobs/issues/90). The format is simple enough to implement in other languages — it's just postcard-encoded strings and a flat array of BLAKE3 hashes — but no other implementations exist yet. ## Comparison @@ -194,7 +194,6 @@ These are standard BLAKE3 hashes, but they can be encoded as CIDs for interopera | Determinism | By construction | Depends on DAG construction choices | By construction (DRISL) | | Implementations | Rust only | Go, JavaScript, Rust | Wide See [cross-implementation test suite](https://hyphacoop.github.io/dasl-testing/) | | IPFS Gateway support | No | Yes | Yes | -| Ecosystem | iroh/n0 | IPFS (broad) | AT Protocol/Bluesky | +| Ecosystem | iroh/n0 | IPFS (broad) | Multiple (AT Protocol, Bluesky, IPFS, and others) | - -[postcard]: https://github.com/jamesmunns/postcard \ No newline at end of file +[postcard]: https://github.com/jamesmunns/postcard From cf7e8f70feb8ed29ccd24305abde2da1e9f4a6c4 Mon Sep 17 00:00:00 2001 From: Daniel Norman Date: Wed, 18 Mar 2026 19:02:44 +0100 Subject: [PATCH 14/19] Add bluesky to dictionary --- .github/styles/pln-ignore.txt | 1 + 1 file changed, 1 insertion(+) diff --git a/.github/styles/pln-ignore.txt b/.github/styles/pln-ignore.txt index 35fa71cca..cb9c6d955 100644 --- a/.github/styles/pln-ignore.txt +++ b/.github/styles/pln-ignore.txt @@ -26,6 +26,7 @@ bitswap blockchain blockchains blockstore +Bluesky bool bool(ean) boolean From 12a8cf4543e491dfa7b97e9183006144e10257d6 Mon Sep 17 00:00:00 2001 From: Daniel Norman Date: Wed, 18 Mar 2026 19:06:22 +0100 Subject: [PATCH 15/19] Add note about web compatibility --- docs/how-to/content-addressed-folders.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/how-to/content-addressed-folders.md b/docs/how-to/content-addressed-folders.md index c094f05d1..bfd90e157 100644 --- 
a/docs/how-to/content-addressed-folders.md +++ b/docs/how-to/content-addressed-folders.md @@ -117,7 +117,7 @@ Large individual files also benefit: because UnixFS splits files into a DAG of c - **Single mode** (`src`): wraps one resource with metadata (content type, etc.) - **Bundle mode** (`resources`): maps file paths to resource CIDs with per-file metadata forming a directory tree representation -MASL bundles are conceptually similar to iroh collections: a flat map of relative paths to content hashes, no directory hierarchy nodes. The key difference is MASL also carries per-resource metadata (modelled after HTTP headers) and uses CIDs rather than raw BLAKE3 hashes. Like iroh collections, subsetting operates at the individual file level — there is no native subdirectory addressing. +MASL bundles are conceptually similar to iroh collections: a flat map of relative paths to content hashes, no directory hierarchy nodes. The key difference is MASL also carries per-resource metadata and uses CIDs rather than raw BLAKE3 hashes. The metadata is deliberately a set of HTTP headers, e.g. `content-type`, `content-encoding` for maximal compatibility with the web. Like iroh collections, subsetting operates at the individual file level — there is no native subdirectory addressing. Because DRISL and MASL build on CBOR — a widely supported serialization format with libraries in virtually every language — they likely have the widest potential for cross-language implementation. A [cross-implementation test suite](https://hyphacoop.github.io/dasl-testing/) tracks conformance across languages. 
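
Reviewer note: since MASL bundles and iroh collections are both flat path maps, "fetching a subdirectory" reduces to filtering paths by prefix and then requesting each file individually. A minimal sketch of that filter follows; the function name `subset_by_prefix` is hypothetical and the identifier strings are placeholders, not real CIDs or hashes.

```python
def subset_by_prefix(resources: dict[str, str], directory: str) -> dict[str, str]:
    """Return the entries of a flat path -> identifier map under one directory.

    The trailing slash is enforced so that "assets" does not match "assets-old/".
    """
    prefix = directory.rstrip("/") + "/"
    return {
        path: ident
        for path, ident in resources.items()
        if path.startswith(prefix)
    }


# Placeholder identifiers, for illustration only.
resources = {
    "index.html": "id-index",
    "assets/style.css": "id-style",
    "assets/logo.svg": "id-logo",
    "assets-old/style.css": "id-old-style",
    "js/app.js": "id-app",
}

wanted = subset_by_prefix(resources, "assets")
```

The result is still a list of individual files: nothing in the flat map yields a single identifier that verifies the `assets/` subtree as a unit, which is the trade-off against UnixFS directory nodes noted in the guide.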
From 77fb9798ae0964eb9f700404de0ca3a27e049e69 Mon Sep 17 00:00:00 2001 From: Daniel Norman Date: Thu, 19 Mar 2026 11:29:50 +0100 Subject: [PATCH 16/19] Remove empty line from dictionary --- .github/styles/pln-ignore.txt | 1 - 1 file changed, 1 deletion(-) diff --git a/.github/styles/pln-ignore.txt b/.github/styles/pln-ignore.txt index cb9c6d955..690bdb970 100644 --- a/.github/styles/pln-ignore.txt +++ b/.github/styles/pln-ignore.txt @@ -1,4 +1,3 @@ - _redirects aave accessor From bc9a9fb7a3a5800a9791c9e715131971646a9871 Mon Sep 17 00:00:00 2001 From: Daniel Norman Date: Thu, 19 Mar 2026 11:45:02 +0100 Subject: [PATCH 17/19] Rename file for consistency --- docs/.vuepress/config.js | 2 +- ...ent-addressed-folders.md => content-addressing-data-sets.md} | 0 2 files changed, 1 insertion(+), 1 deletion(-) rename docs/how-to/{content-addressed-folders.md => content-addressing-data-sets.md} (100%) diff --git a/docs/.vuepress/config.js b/docs/.vuepress/config.js index 28285349d..775624d62 100644 --- a/docs/.vuepress/config.js +++ b/docs/.vuepress/config.js @@ -252,7 +252,7 @@ module.exports = { '/how-to/store-play-videos', '/how-to/host-git-repo', '/how-to/move-ipfs-installation/move-ipfs-installation', - '/how-to/content-addressed-folders', + '/how-to/content-addressing-data-sets', ] }, { diff --git a/docs/how-to/content-addressed-folders.md b/docs/how-to/content-addressing-data-sets.md similarity index 100% rename from docs/how-to/content-addressed-folders.md rename to docs/how-to/content-addressing-data-sets.md From d6aa70e9f1847cc61b8330bb796b50166cb22e23 Mon Sep 17 00:00:00 2001 From: Daniel Norman Date: Thu, 19 Mar 2026 12:22:45 +0100 Subject: [PATCH 18/19] Add link from lifecycle guide --- docs/concepts/lifecycle.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/docs/concepts/lifecycle.md b/docs/concepts/lifecycle.md index 2abbcf05c..524311cb8 100644 --- a/docs/concepts/lifecycle.md +++ b/docs/concepts/lifecycle.md @@ -20,6 +20,10 @@ For example, merkleizing a 
static web application into a UnixFS DAG looks like t ![UnixFS Dag](./images/unixfs-dag-diagram.png) +::: tip +See the [content-addressing data sets guide](../how-to/content-addressing-data-sets.md) for more on the different approaches to content-addressing data sets with IPFS. +::: + ## 2. Providing Once the input data has been merkleized and addressed by a CID, the node announces itself as a provider of the CID(s) to the IPFS network, thereby creating a public mapping between the CID and the node. This is typically known as **providing**, other names for this step are **publishing** **advertising**. On routing systems with built-in expiration/TTL like the Amino DHT, this is also known as **reproviding** to emphasize the continuous nature of the process in which a node advertises provider records. From dd5096806533d8f88d81f626166a6034a66e98d4 Mon Sep 17 00:00:00 2001 From: Daniel Norman Date: Thu, 19 Mar 2026 12:39:44 +0100 Subject: [PATCH 19/19] Refine ipfs guides landing page --- docs/how-to/README.md | 13 +++---------- 1 file changed, 3 insertions(+), 10 deletions(-) diff --git a/docs/how-to/README.md b/docs/how-to/README.md index 60088bbdc..061eaa225 100644 --- a/docs/how-to/README.md +++ b/docs/how-to/README.md @@ -5,20 +5,13 @@ description: Hands-on guides to using and developing with IPFS to build decentra # IPFS Guides and Tutorials -::: callout -Check out the new [Ecosystem guides](#ecosystem-guides) section to learn more about the amazing tools, software and implementations created by IPFS ecosystem partners. -::: - -No matter what you're looking to do with IPFS, you can find how-tos and tutorials here. These items are a work in progress, so please check back periodically to check what we've added! 
- See the site navigation menu for all our how-tos, organized by topic area, including favorites like these: - **Customize your install** by [configuring a node](configure-node.md), modifying the [bootstrap list](modify-bootstrap-list.md), and more -- **Learn how to manage files** in IPFS with tutorials on concepts like [pinning](pin-files.md), how to [work with blocks](work-with-blocks.md), learning how to [troubleshoot file transfers](https://github.com/ipfs/kubo/blob/master/docs/file-transfer.md), and understanding [working with large datasets](https://github.com/ipfs/archives/tree/master/tutorials/replicating-large-datasets) -- **See how to work with peers** using methods like [customizing libp2p bundles](https://github.com/ipfs-examples/js-ipfs-examples/tree/master/examples/custom-libp2p) and using circuit relay +- **Learn how to manage files** in IPFS with tutorials on concepts like [pinning](pin-files.md), how to [work with blocks](work-with-blocks.md), and learning how to [content address data sets](content-addressing-data-sets.md)
+- **Publish scientific data** by exploring the [scientific data and IPFS landscape guide](scientific-data/landscape-guide.md) or learning how to [publish geospatial Zarr data with IPFS](scientific-data/publish-geospatial-zarr-data.md) - **Understand website hosting** by starting with how to [host a simple single-page site](websites-on-ipfs/single-page-website.md) -- **Learn how to build apps** on IPFS, starting with [exploring the IPFS API](https://github.com/ipfs/camp/tree/master/CORE_AND_ELECTIVE_COURSES/CORE_COURSE_C) and [making a basic libp2p app](https://github.com/ipfs/camp/tree/master/CORE_AND_ELECTIVE_COURSES/CORE_COURSE_B) -- **Understand how IPFS works in the browser** by learning how to [address IPFS on the Web](address-ipfs-on-web.md) and [how IPFS can be used in your favorite browser tools and frameworks](browser-tools-frameworks.md) +- **Understand how to use IPFS in the browser** by learning how to [address IPFS on the Web](address-ipfs-on-web.md) and how to [use IPFS in web applications](ipfs-in-web-apps.md) ## Don't see what you're looking for?