From 88421763386acdd660a064ea728a75d088030655 Mon Sep 17 00:00:00 2001 From: Mosh <1306020+mishmosh@users.noreply.github.com> Date: Thu, 3 Apr 2025 10:02:35 -0400 Subject: [PATCH 01/35] Create ipip-0000.md --- src/ipips/ipip-0000.md | 102 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 102 insertions(+) create mode 100644 src/ipips/ipip-0000.md diff --git a/src/ipips/ipip-0000.md b/src/ipips/ipip-0000.md new file mode 100644 index 000000000..68fb29b8e --- /dev/null +++ b/src/ipips/ipip-0000.md @@ -0,0 +1,102 @@ +--- +# IPIP number should match its pull request number. After you open a PR, +# please update title and update the filename to `ipip0000`. +title: "IPIP-0000: CID Profiles" +date: 2025-04-03 +ipip: proposal +editors: + - name: Michelle Lee +relatedIssues: + - n/a +order: 0000 +tags: ['ipips'] +--- + +## Summary + + +This proposal introduces profiles for IPFS CIDs. Profiles explicitly define CID version, hash algorithm, chunk size, DAG width, layout, and other parameters. + +## Motivation + +Currently, CIDs can be generated with a variety of settings and optimizations for chunking, DAG width, and more. This means the same file can yield multiple, different CIDs depending on which tools and settings are used, and it is not possible to reliably reproduce or verify the CID. Profiles offer With profiles, following the same profile will produce identical CIDs for identical content, whic makes verification regardless of implementation. + +## Detailed design + +We introduce a profile naming system, + +Each profile must specify the following characteristics: + +1. CID version (CIDv0 or CIDv1) +2. Hash algorithm +3. Chunk size +4. DAG width +5. DAG layout +6. Required + +Additional profiles can be added at a future date. Profile names may be chosen from the names of any botanical tree with compound leaves. 
+ +| | Helia default | Kubo default | Storacha default | "test-cid-v1" profile | DASL | +|-------------|---------------|-----------------------------|------------------|-----------------------|---------------| +| CID version | CIDv1 | CIDv1 | CIDv1 | CIDv1 | CIDv1 | +| Hash Algo | sha-256 | sha-256 | sha-256 | sha-256 | sha-256 | +| Chunk size | 1MiB | 256KiB | 1MiB | 1MiB | not specified | +| DAG width | 1024 | 174 (but it's complicated*) | 1024 | 174 | not specified | +| DAG layout | balanced | balanced | balanced | balanced | not specified | + + + +This would be specified as a table in (forthcoming UnixFS spec). + + + +## Design rationale + +The profile names are chosen to be easy to pronounce. + +Here is a summary table of current defaults, thanks to input & clarifications from @2color @achingbrain @lidel: + +| | Helia default | Kubo default | Storacha default | "test-cid-v1" profile | DASL | +|-------------|---------------|-----------------------------|------------------|-----------------------|---------------| +| CID version | CIDv1 | CIDv1 | CIDv1 | CIDv1 | CIDv1 | +| Hash Algo | sha-256 | sha-256 | sha-256 | sha-256 | sha-256 | +| Chunk size | 1MiB | 256KiB | 1MiB | 1MiB | not specified | +| DAG width | 1024 | 174 (but it's complicated*) | 1024 | 174 | not specified | +| DAG layout | balanced | balanced | balanced | balanced | not specified | + +* Kubo has 2 different default DAG widths: + * For HAMT-sharded directories, the `DefaultShardWidth` [here](https://github.com/ipfs/boxo/blob/f1d5312e3be45d151bb9c8f11c9283820687bea3/ipld/unixfs/io/directory.go#L30) is 256. + * For files, `DefaultLinksPerBlock` [here](https://github.com/ipfs/boxo/blob/v0.29.0/ipld/unixfs/importer/helpers/helpers.go#L30) is ~174 + +See related discussion at https://discuss.ipfs.tech/t/should-we-profile-cids/18507/ + +### User benefit + +Reliable, deterministic CIDs allow independent verification of content across tools and ipmlementations. 
+ +### Compatibility + +Implementations will need to (1) make CID generation settings configurable and (2) support user setting of profiles. + +Kubo currently has no CLI / RPC / Config option to control DAG width in Kubo. https://github.com/ipfs/kubo/issues/10751 is the starting point to add that ability. + +### Security + +TODO + +### Alternatives + +Another approach could be to name profiles based on the key UnixFS/CID parameters, e.g. v1-sha256-balanced-1mib-1024w-raw. This is longer and more convoluted. + +## Test fixtures + +TODO + +List relevant CIDs. Describe how implementations can use them to determine +specification compliance. This section can be skipped if IPIP does not deal +with the way IPFS handles content-addressed data, or the modified specification +file already includes this information. + +### Copyright + +Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/). From 4ba68f030e067ea3acaba5514e5d97ba87d535f5 Mon Sep 17 00:00:00 2001 From: Mosh <1306020+mishmosh@users.noreply.github.com> Date: Thu, 3 Apr 2025 10:03:29 -0400 Subject: [PATCH 02/35] Update and rename ipip-0000.md to ipip-0499.md --- src/ipips/{ipip-0000.md => ipip-0499.md} | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) rename src/ipips/{ipip-0000.md => ipip-0499.md} (99%) diff --git a/src/ipips/ipip-0000.md b/src/ipips/ipip-0499.md similarity index 99% rename from src/ipips/ipip-0000.md rename to src/ipips/ipip-0499.md index 68fb29b8e..d1947e2d7 100644 --- a/src/ipips/ipip-0000.md +++ b/src/ipips/ipip-0499.md @@ -1,7 +1,7 @@ --- # IPIP number should match its pull request number. After you open a PR, # please update title and update the filename to `ipip0000`. 
-title: "IPIP-0000: CID Profiles" +title: "IPIP-0499: CID Profiles" date: 2025-04-03 ipip: proposal editors: From 6cc64cb765aaab872793b2bd3b49c7f02c8f14b2 Mon Sep 17 00:00:00 2001 From: Marcin Rataj Date: Tue, 15 Apr 2025 23:41:17 +0200 Subject: [PATCH 03/35] add extra attributes proposed in review Co-authored-by: Bumblefudge --- src/ipips/ipip-0499.md | 17 ++++++++++++----- 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/src/ipips/ipip-0499.md b/src/ipips/ipip-0499.md index d1947e2d7..7f75d7288 100644 --- a/src/ipips/ipip-0499.md +++ b/src/ipips/ipip-0499.md @@ -27,11 +27,15 @@ We introduce a profile naming system, Each profile must specify the following characteristics: -1. CID version (CIDv0 or CIDv1) +1. CID version (currently only CIDv0 or CIDv1) 2. Hash algorithm -3. Chunk size -4. DAG width -5. DAG layout +3. UnixFS Chunk size (explicitly set, not contextual/reactive to input) +4. UnixFS directory DAG width +5. UnixFS directory DAG layout +6. HAMT directory DAG threshold +7. HAMT directory DAG width +8. Leaf Envelope (historically dag-pb, now none/raw) +9. Allow empty directories 6. Required Additional profiles can be added at a future date. Profile names may be chosen from the names of any botanical tree with compound leaves. @@ -43,7 +47,10 @@ Additional profiles can be added at a future date. Profile names may be chosen f | Chunk size | 1MiB | 256KiB | 1MiB | 1MiB | not specified | | DAG width | 1024 | 174 (but it's complicated*) | 1024 | 174 | not specified | | DAG layout | balanced | balanced | balanced | balanced | not specified | - +| HAMT threshold | 256KiB (est) | 256KiB (est) | 1000 **links** | 256KiB | not specified | +| HAMT width | 256 blocks | 256 blocks | 256 blocks | 256 blocks | not specified | +| Leaves | raw | raw | raw | raw | not specified | +| EmptyDirs | allowed | allowed | disallowed | allowed | not specified | This would be specified as a table in (forthcoming UnixFS spec). 
From d8b83891fdef2e104278a05c085faf8c568b258f Mon Sep 17 00:00:00 2001 From: Marcin Rataj Date: Wed, 16 Apr 2025 00:29:09 +0200 Subject: [PATCH 04/35] incorporate kubo#10774 Import.* config params for controlling DAG width were added in: https://github.com/ipfs/kubo/pull/10774 --- src/ipips/ipip-0499.md | 82 +++++++++++++++++++++--------------------- 1 file changed, 40 insertions(+), 42 deletions(-) diff --git a/src/ipips/ipip-0499.md b/src/ipips/ipip-0499.md index 7f75d7288..648a48ed7 100644 --- a/src/ipips/ipip-0499.md +++ b/src/ipips/ipip-0499.md @@ -6,16 +6,19 @@ date: 2025-04-03 ipip: proposal editors: - name: Michelle Lee + github: mishmosh + affiliation: + name: IPFS Foundation relatedIssues: - - n/a -order: 0000 + - https://discuss.ipfs.tech/t/should-we-profile-cids/18507 +order: 0499 tags: ['ipips'] --- ## Summary -This proposal introduces profiles for IPFS CIDs. Profiles explicitly define CID version, hash algorithm, chunk size, DAG width, layout, and other parameters. +This proposal introduces profiles for IPFS CIDs. Profiles explicitly define CID version, hash algorithm, chunk size, DAG width, layout, and other parameters. ## Motivation @@ -23,57 +26,43 @@ Currently, CIDs can be generated with a variety of settings and optimizations fo ## Detailed design -We introduce a profile naming system, +We introduce a profile naming system, Each profile must specify the following characteristics: 1. CID version (currently only CIDv0 or CIDv1) -2. Hash algorithm -3. UnixFS Chunk size (explicitly set, not contextual/reactive to input) -4. UnixFS directory DAG width -5. UnixFS directory DAG layout -6. HAMT directory DAG threshold -7. HAMT directory DAG width -8. Leaf Envelope (historically dag-pb, now none/raw) -9. Allow empty directories -6. Required +1. Hash algorithm +1. UnixFS Chunk algorithm (e.g. size-based or content-based) +1. UnixFS directory DAG layout (e.g. balanced, trickle) +1. UnixFS file DAG width (max number of links per `File` node) +1. 
UnixFS directory DAG width (max number of links per basic `Directory` node) +1. UnixFS HAMT directory DAG threshold (max `Directory` size before switching to `HAMTDirectory`) +1. HAMT directory DAG width (max number of fanout links per internal HAMTDirectory node) +1. Leaf Envelope (historically `dag-pb`, CIDv1 introduced `raw` leaves) +1. Empty directories (informative suggestion) Additional profiles can be added at a future date. Profile names may be chosen from the names of any botanical tree with compound leaves. -| | Helia default | Kubo default | Storacha default | "test-cid-v1" profile | DASL | -|-------------|---------------|-----------------------------|------------------|-----------------------|---------------| -| CID version | CIDv1 | CIDv1 | CIDv1 | CIDv1 | CIDv1 | -| Hash Algo | sha-256 | sha-256 | sha-256 | sha-256 | sha-256 | -| Chunk size | 1MiB | 256KiB | 1MiB | 1MiB | not specified | -| DAG width | 1024 | 174 (but it's complicated*) | 1024 | 174 | not specified | -| DAG layout | balanced | balanced | balanced | balanced | not specified | -| HAMT threshold | 256KiB (est) | 256KiB (est) | 1000 **links** | 256KiB | not specified | -| HAMT width | 256 blocks | 256 blocks | 256 blocks | 256 blocks | not specified | -| Leaves | raw | raw | raw | raw | not specified | -| EmptyDirs | allowed | allowed | disallowed | allowed | not specified | - - This would be specified as a table in (forthcoming UnixFS spec). - - ## Design rationale -The profile names are chosen to be easy to pronounce. - -Here is a summary table of current defaults, thanks to input & clarifications from @2color @achingbrain @lidel: +The profile names are chosen to be easy to pronounce. 
-| | Helia default | Kubo default | Storacha default | "test-cid-v1" profile | DASL | -|-------------|---------------|-----------------------------|------------------|-----------------------|---------------| -| CID version | CIDv1 | CIDv1 | CIDv1 | CIDv1 | CIDv1 | -| Hash Algo | sha-256 | sha-256 | sha-256 | sha-256 | sha-256 | -| Chunk size | 1MiB | 256KiB | 1MiB | 1MiB | not specified | -| DAG width | 1024 | 174 (but it's complicated*) | 1024 | 174 | not specified | -| DAG layout | balanced | balanced | balanced | balanced | not specified | +Here is a summary table of current (2025-Q2) defaults, thanks to input & clarifications from @2color @achingbrain @lidel: -* Kubo has 2 different default DAG widths: - * For HAMT-sharded directories, the `DefaultShardWidth` [here](https://github.com/ipfs/boxo/blob/f1d5312e3be45d151bb9c8f11c9283820687bea3/ipld/unixfs/io/directory.go#L30) is 256. - * For files, `DefaultLinksPerBlock` [here](https://github.com/ipfs/boxo/blob/v0.29.0/ipld/unixfs/importer/helpers/helpers.go#L30) is ~174 +| | Helia default | Kubo `legacy-cid-v0` (default) | Storacha default | Kubo `test-cid-v1` | Kubo `test-cid-v1-wide` | DASL | +|---------------------------------|---------------|-----------------------------------|------------------|--------------------|---------------------------|---------------| +| CID version | CIDv1 | CIDv0 | CIDv1 | CIDv1 | CIDv1 | CIDv1 | +| Hash Algo | sha-256 | sha-256 | sha-256 | sha-256 | sha-256 | sha-256 | +| Chunk size | 1MiB | 256KiB | 1MiB | 1MiB | 1MiB | not specified | +| Max links `File` node | 1024 | 174 | 1024 | 174 | **1024** | not specified | +| Max links `Directory` node | ? | 0 | ? | 0 | 0 | ? 
| +| Max fanout `HAMTDirectory` node | 256 blocks | 256 blocks | 256 blocks | 256 blocks | **1024** | not specified | +| `HAMTDirectory` threshold | 256KiB (est) | 256KiB (est:links[name+cid]) | 1000 **links** | 256KiB | **1MiB** | not specified | +| DAG layout | balanced | balanced | balanced | balanced | balanced | not specified | +| Leaves | raw | raw | raw | raw | raw | not specified | +| Empty directories | allowed | allowed | disallowed | allowed | allowed | not specified | See related discussion at https://discuss.ipfs.tech/t/should-we-profile-cids/18507/ @@ -85,7 +74,7 @@ Reliable, deterministic CIDs allow independent verification of content across to Implementations will need to (1) make CID generation settings configurable and (2) support user setting of profiles. -Kubo currently has no CLI / RPC / Config option to control DAG width in Kubo. https://github.com/ipfs/kubo/issues/10751 is the starting point to add that ability. +Kubo 0.35 will have [`Import.*` configuration](https://github.com/ipfs/kubo/blob/master/docs/config.md#import) option to control DAG width. ### Security @@ -95,6 +84,15 @@ TODO Another approach could be to name profiles based on the key UnixFS/CID parameters, e.g. v1-sha256-balanced-1mib-1024w-raw. This is longer and more convoluted. + +#### Empty directories + +Decision if empty directories should be included is left out of scope. + +Tools can apply arbitrary filtering before passing filesystem entries +to be converted into a DAG, thus for 1:1 CID reproducibility one should +run without any prefilters, or ensure the same prefilters are applied. 
+ ## Test fixtures TODO From 595588c8d4dd47bba835950a212d32769a3ec28e Mon Sep 17 00:00:00 2001 From: Daniel Norman <1992255+2color@users.noreply.github.com> Date: Tue, 12 Aug 2025 09:21:54 +0200 Subject: [PATCH 05/35] Update src/ipips/ipip-0499.md Co-authored-by: Christian Paul --- src/ipips/ipip-0499.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/ipips/ipip-0499.md b/src/ipips/ipip-0499.md index 648a48ed7..362ca3b77 100644 --- a/src/ipips/ipip-0499.md +++ b/src/ipips/ipip-0499.md @@ -68,7 +68,7 @@ See related discussion at https://discuss.ipfs.tech/t/should-we-profile-cids/185 ### User benefit -Reliable, deterministic CIDs allow independent verification of content across tools and ipmlementations. +Reliable, deterministic CIDs allow independent verification of content across tools and implementations. ### Compatibility From 41f9b86982d10abd32da9cf7e5fc820054011d3f Mon Sep 17 00:00:00 2001 From: Daniel N <2color@users.noreply.github.com> Date: Tue, 12 Aug 2025 10:06:47 +0200 Subject: [PATCH 06/35] add daniel as editor --- src/ipips/ipip-0499.md | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/src/ipips/ipip-0499.md b/src/ipips/ipip-0499.md index 362ca3b77..09146efed 100644 --- a/src/ipips/ipip-0499.md +++ b/src/ipips/ipip-0499.md @@ -1,7 +1,5 @@ --- -# IPIP number should match its pull request number. After you open a PR, -# please update title and update the filename to `ipip0000`. 
-title: "IPIP-0499: CID Profiles" +title: 'IPIP-0499: CID Profiles' date: 2025-04-03 ipip: proposal editors: @@ -9,6 +7,11 @@ editors: github: mishmosh affiliation: name: IPFS Foundation + - name: Daniel Norman + github: 2color + affiliation: + name: Shipyard + url: https://ipshipyard.com relatedIssues: - https://discuss.ipfs.tech/t/should-we-profile-cids/18507 order: 0499 From 229988f67d088a03850fd229c438efd8c6bb1044 Mon Sep 17 00:00:00 2001 From: Daniel N <2color@users.noreply.github.com> Date: Tue, 12 Aug 2025 12:16:41 +0200 Subject: [PATCH 07/35] edit summary and motivation --- src/ipips/ipip-0499.md | 21 ++++++++++++++++++--- 1 file changed, 18 insertions(+), 3 deletions(-) diff --git a/src/ipips/ipip-0499.md b/src/ipips/ipip-0499.md index 09146efed..a203b6e2e 100644 --- a/src/ipips/ipip-0499.md +++ b/src/ipips/ipip-0499.md @@ -20,12 +20,27 @@ tags: ['ipips'] ## Summary - -This proposal introduces profiles for IPFS CIDs. Profiles explicitly define CID version, hash algorithm, chunk size, DAG width, layout, and other parameters. +This proposal introduces configuration profiles for CIDs used to represent files and directories with UnixFS. These ensure that the same content will yield the same CID across different implementations. + +Profiles explicitly define the following UnixFS parameters: CID version, hash algorithm, chunk size, DAG width, layout, and other parameters that affect the resulting CID. + +This allows for deterministic UnixFS CIDs. ## Motivation -Currently, CIDs can be generated with a variety of settings and optimizations for chunking, DAG width, and more. This means the same file can yield multiple, different CIDs depending on which tools and settings are used, and it is not possible to reliably reproduce or verify the CID. Profiles offer With profiles, following the same profile will produce identical CIDs for identical content, whic makes verification regardless of implementation. +UnixFS CIDs are not deterministic. 
This means that the same file tree can yield different CIDs depending on the parameters used by the implementation to generate it, which in some cases, aren't even configurable by the user. For example, the chunk size, DAG width, and layout can vary between implementations or even between different versions of the same implementation. + +This lack of determinism makes has a number of drawbacks: + +- It is difficult to verify content across different tools and implementations, as the same content may yield different CIDs. +- Users need to include the UnixFS merkle proofs in order to verify the CID, adding storage overhead and complexity to the verification process. +- In terms of developer experience, it deviates from the mental model of a hash function, where the same input should always yield the same output. This leads to confusion and frustration when working with UnixFS CIDs + +By introducing profiles, we can benefit from both the optionality offered by UnixFS, where users are free to chose their own parameters, and the determinism of CIDs, where the same content will yield the same CID across different implementations. + + same content will yield the same CID across different implementations, making it easier to verify content and improving the developer experience. + +UnixFS CIDs can be generated with a variety of settings and optimizations for chunking, DAG width, and more. This means the same file tree can yield multiple, different CIDs depending on which tools and settings are used, and it is not possible to reliably reproduce or verify the CID. Profiles offer With profiles, following the same profile will produce identical CIDs for identical content, whic makes verification regardless of implementation. 
## Detailed design From f37e6107f672e2f427598a29f6764e39316425bd Mon Sep 17 00:00:00 2001 From: Daniel N <2color@users.noreply.github.com> Date: Tue, 12 Aug 2025 12:17:00 +0200 Subject: [PATCH 08/35] edit summary --- src/ipips/ipip-0499.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/src/ipips/ipip-0499.md b/src/ipips/ipip-0499.md index a203b6e2e..3fcb54fde 100644 --- a/src/ipips/ipip-0499.md +++ b/src/ipips/ipip-0499.md @@ -20,11 +20,9 @@ tags: ['ipips'] ## Summary -This proposal introduces configuration profiles for CIDs used to represent files and directories with UnixFS. These ensure that the same content will yield the same CID across different implementations. +This proposal introduces configuration profiles for CIDs used to represent files and directories with UnixFS. These ensure that the deterministic CID generation for the same data, regardless of the implementation. -Profiles explicitly define the following UnixFS parameters: CID version, hash algorithm, chunk size, DAG width, layout, and other parameters that affect the resulting CID. - -This allows for deterministic UnixFS CIDs. +Profiles explicitly define the UnixFS parameters, e.g. dag width, hash algorithem, and chunk size, that affect the resulting CID, such that given the profile and input data different implementations will generate identical CIDs. ## Motivation From 7a12f0a936054e68a52ebc18d4d16f9308197e60 Mon Sep 17 00:00:00 2001 From: Daniel N <2color@users.noreply.github.com> Date: Tue, 12 Aug 2025 12:17:28 +0200 Subject: [PATCH 09/35] edit parameters and design --- src/ipips/ipip-0499.md | 58 ++++++++++++++++++++---------------------- 1 file changed, 27 insertions(+), 31 deletions(-) diff --git a/src/ipips/ipip-0499.md b/src/ipips/ipip-0499.md index 3fcb54fde..a97758777 100644 --- a/src/ipips/ipip-0499.md +++ b/src/ipips/ipip-0499.md @@ -31,33 +31,30 @@ UnixFS CIDs are not deterministic. 
This means that the same file tree can yield This lack of determinism makes has a number of drawbacks: - It is difficult to verify content across different tools and implementations, as the same content may yield different CIDs. -- Users need to include the UnixFS merkle proofs in order to verify the CID, adding storage overhead and complexity to the verification process. +- Users are requires to store and transfer UnixFS merkle proofs in order to verify CIDs, adding storage overhead, network bandwidth, and complexity to the verification process. - In terms of developer experience, it deviates from the mental model of a hash function, where the same input should always yield the same output. This leads to confusion and frustration when working with UnixFS CIDs -By introducing profiles, we can benefit from both the optionality offered by UnixFS, where users are free to chose their own parameters, and the determinism of CIDs, where the same content will yield the same CID across different implementations. - - same content will yield the same CID across different implementations, making it easier to verify content and improving the developer experience. +By introducing profiles, we can benefit from both the optionality offered by UnixFS, where users are free to chose their own parameters, and determinism through profiles. UnixFS CIDs can be generated with a variety of settings and optimizations for chunking, DAG width, and more. This means the same file tree can yield multiple, different CIDs depending on which tools and settings are used, and it is not possible to reliably reproduce or verify the CID. Profiles offer With profiles, following the same profile will produce identical CIDs for identical content, whic makes verification regardless of implementation. ## Detailed design -We introduce a profile naming system, +We introduce a set of named profiles that define a set of parameters for generating UnixFS CIDs. 
These profiles can be used by implementations to ensure that the same content will yield the same CID across different tools and implementations. + +### UnixFS parameters -Each profile must specify the following characteristics: +The profiles define a set of parameters that affect the resulting CID. These parameters are based on the UnixFS specification and are used to generate the CID for a given file tree. The parameters include: 1. CID version (currently only CIDv0 or CIDv1) -1. Hash algorithm -1. UnixFS Chunk algorithm (e.g. size-based or content-based) -1. UnixFS directory DAG layout (e.g. balanced, trickle) -1. UnixFS file DAG width (max number of links per `File` node) -1. UnixFS directory DAG width (max number of links per basic `Directory` node) -1. UnixFS HAMT directory DAG threshold (max `Directory` size before switching to `HAMTDirectory`) -1. HAMT directory DAG width (max number of fanout links per internal HAMTDirectory node) -1. Leaf Envelope (historically `dag-pb`, CIDv1 introduced `raw` leaves) -1. Empty directories (informative suggestion) - -Additional profiles can be added at a future date. Profile names may be chosen from the names of any botanical tree with compound leaves. +1. Hash function +1. UnixFS chunk size +1. UnixFS DAG layout (e.g. balanced, trickle) +1. UnixFS DAG width (max number of links per `File` node) +1. `HAMTDirectory` fanout (must be a power of 2) +2. `HAMTDirectory` threshold (max `Directory` size before switching to `HAMTDirectory`): based on an estimate of the block size by counting the size of PNNode.Links +3. Leaf Envelope: either `dag-pb` or `raw` +4. Whether empty directories are included in the DAG This would be specified as a table in (forthcoming UnixFS spec). @@ -65,20 +62,19 @@ This would be specified as a table in (forthcoming UnixFS spec). The profile names are chosen to be easy to pronounce. 
-Here is a summary table of current (2025-Q2) defaults, thanks to input & clarifications from @2color @achingbrain @lidel: - -| | Helia default | Kubo `legacy-cid-v0` (default) | Storacha default | Kubo `test-cid-v1` | Kubo `test-cid-v1-wide` | DASL | -|---------------------------------|---------------|-----------------------------------|------------------|--------------------|---------------------------|---------------| -| CID version | CIDv1 | CIDv0 | CIDv1 | CIDv1 | CIDv1 | CIDv1 | -| Hash Algo | sha-256 | sha-256 | sha-256 | sha-256 | sha-256 | sha-256 | -| Chunk size | 1MiB | 256KiB | 1MiB | 1MiB | 1MiB | not specified | -| Max links `File` node | 1024 | 174 | 1024 | 174 | **1024** | not specified | -| Max links `Directory` node | ? | 0 | ? | 0 | 0 | ? | -| Max fanout `HAMTDirectory` node | 256 blocks | 256 blocks | 256 blocks | 256 blocks | **1024** | not specified | -| `HAMTDirectory` threshold | 256KiB (est) | 256KiB (est:links[name+cid]) | 1000 **links** | 256KiB | **1MiB** | not specified | -| DAG layout | balanced | balanced | balanced | balanced | balanced | not specified | -| Leaves | raw | raw | raw | raw | raw | not specified | -| Empty directories | allowed | allowed | disallowed | allowed | allowed | not specified | +Here is a summary table of current (2025-Q2) defaults: + +| | Helia default | Kubo `legacy-cid-v0` (default) | Storacha default | Kubo `test-cid-v1` | Kubo `test-cid-v1-wide` | DASL | +| ----------------------------- | ------------- | ------------------------------ | ---------------- | ------------------ | ----------------------- | ------------- | +| CID version | CIDv1 | CIDv0 | CIDv1 | CIDv1 | CIDv1 | CIDv1 | +| Hash function | sha2-256 | sha2-256 | sha2-256 | sha2-256 | sha2-256 | sha2-256 | +| Max chunk size | 1MiB | 256KiB | 1MiB | 1MiB | 1MiB | not specified | +| DAG layout | balanced | balanced | balanced | balanced | balanced | not specified | +| DAG width (children per node) | 1024 | 174 | 1024 | 174 | **1024** | not specified 
| +| `HAMTDirectory` fanout | 256 blocks | 256 blocks | 256 blocks | 256 blocks | **1024** | not specified | +| `HAMTDirectory` threshold | 256KiB (est) | 256KiB (est:links[name+cid]) | 1000 **links** | 256KiB | **1MiB** | not specified | +| Leaves | raw | raw | raw | raw | raw | not specified | +| Empty directories | Included | Included | Ignored | Included | Included | not specified | See related discussion at https://discuss.ipfs.tech/t/should-we-profile-cids/18507/ From ff69e563c1a9e6bb2b781d08d7a3b09168318aae Mon Sep 17 00:00:00 2001 From: Daniel N <2color@users.noreply.github.com> Date: Tue, 12 Aug 2025 12:18:08 +0200 Subject: [PATCH 10/35] edit user benefit and compatibility --- src/ipips/ipip-0499.md | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/src/ipips/ipip-0499.md b/src/ipips/ipip-0499.md index a97758777..4335ea638 100644 --- a/src/ipips/ipip-0499.md +++ b/src/ipips/ipip-0499.md @@ -80,23 +80,25 @@ See related discussion at https://discuss.ipfs.tech/t/should-we-profile-cids/185 ### User benefit -Reliable, deterministic CIDs allow independent verification of content across tools and implementations. +Profiles reduce the burden of verifying UnixFS content, as users can simply choose a profile and know that the resulting CIDs will be deterministic across implementations. This eliminates the need for users to understand the underlying parameters that affect CID generation, and allows them to focus on the content itself. + +Moreover, profiles allow users to verify content without needing to rely on additional merkle proofs and CAR files, which can be cumbersome and inefficient. + +Finally, profiles improve the developer experience by aligning with the mental model of a hash function. ### Compatibility -Implementations will need to (1) make CID generation settings configurable and (2) support user setting of profiles. 
+UnixFS Data encoded with the profiles defined in this IPIP is fully compatible with existing implementations, as it is fully compliant with the UnixFS specification. -Kubo 0.35 will have [`Import.*` configuration](https://github.com/ipfs/kubo/blob/master/docs/config.md#import) option to control DAG width. +To produce CIDs that are compliant with this IPIP, implementations will need to support the parameters defined in the profiles. This may require changes to existing implementations to expose configuration options for the parameters, or to implement new functionality to support the profiles. -### Security +Kubo 0.35 will have [`Import.*` configuration](https://github.com/ipfs/kubo/blob/master/docs/config.md#import) option to control DAG width. -TODO ### Alternatives Another approach could be to name profiles based on the key UnixFS/CID parameters, e.g. v1-sha256-balanced-1mib-1024w-raw. This is longer and more convoluted. - #### Empty directories Decision if empty directories should be included is left out of scope. From 09baf68c7a5bc76f2a69a3f326c7f1d54ec578dd Mon Sep 17 00:00:00 2001 From: Daniel N <2color@users.noreply.github.com> Date: Tue, 12 Aug 2025 12:26:21 +0200 Subject: [PATCH 11/35] refine parameters and introduce a named profile --- src/ipips/ipip-0499.md | 42 ++++++++++++++++++++++++++---------------- 1 file changed, 26 insertions(+), 16 deletions(-) diff --git a/src/ipips/ipip-0499.md b/src/ipips/ipip-0499.md index 4335ea638..3fc669594 100644 --- a/src/ipips/ipip-0499.md +++ b/src/ipips/ipip-0499.md @@ -52,15 +52,34 @@ The profiles define a set of parameters that affect the resulting CID. These par 1. UnixFS DAG layout (e.g. balanced, trickle) 1. UnixFS DAG width (max number of links per `File` node) 1. `HAMTDirectory` fanout (must be a power of 2) -2. `HAMTDirectory` threshold (max `Directory` size before switching to `HAMTDirectory`): based on an estimate of the block size by counting the size of PNNode.Links -3. 
Leaf Envelope: either `dag-pb` or `raw` -4. Whether empty directories are included in the DAG +1. `HAMTDirectory` threshold (max `Directory` size before switching to `HAMTDirectory`): based on an estimate of the block size by counting the size of PNNode.Links +1. Leaf Envelope: either `dag-pb` or `raw` +1. Whether empty directories are included in the DAG + - Some implementations apply filtering before merkleizing filesystem entries in the DAG. -This would be specified as a table in (forthcoming UnixFS spec). +This would be specified as a table in (forthcoming [UnixFS spec](https://github.com/ipfs/specs/pull/331/files)). -## Design rationale +## Named profiles -The profile names are chosen to be easy to pronounce. +To make it easier for users and implementations to choose a set of parameters, we define a named profile `unixfs-2025` to encapsulate the parameters established as the baseline default by multiple implementations as of 2025. + +The **`unixfs-2025`** profile name is designed to be referenced by implementations and users to ensure that the same content will yield the same CID across different tools and implementations. + +The profile is defined as follows: + +| Parameter | Value | +| ----------------------------- | ------------------------------------------------------- | +| CID version | CIDv1 | +| Hash function | sha2-256 | +| Max chunk size | 1MiB | +| DAG layout | balanced | +| DAG width (children per node) | 1024 | +| `HAMTDirectory` fanout | 256 blocks | +| `HAMTDirectory` threshold | 256KiB (estimated by counting the size of PBNode.links) | +| Leaves | raw | +| Empty directories | TODO | + +## Current defaults Here is a summary table of current (2025-Q2) defaults: @@ -94,18 +113,9 @@ To produce CIDs that are compliant with this IPIP, implementations will need to Kubo 0.35 will have [`Import.*` configuration](https://github.com/ipfs/kubo/blob/master/docs/config.md#import) option to control DAG width. 
- ### Alternatives -Another approach could be to name profiles based on the key UnixFS/CID parameters, e.g. v1-sha256-balanced-1mib-1024w-raw. This is longer and more convoluted. - -#### Empty directories - -Decision if empty directories should be included is left out of scope. - -Tools can apply arbitrary filtering before passing filesystem entries -to be converted into a DAG, thus for 1:1 CID reproducibility one should -run without any prefilters, or ensure the same prefilters are applied. +As an alternative to profiles, users can store and transfer CAR files of UnixFS content, which include the merkle proofs needed to verify the CID. ## Test fixtures From cffade84d0945c2fbd06be95aa9d5f2c1d5cd8d3 Mon Sep 17 00:00:00 2001 From: Daniel Norman <1992255+2color@users.noreply.github.com> Date: Wed, 20 Aug 2025 10:31:52 +0200 Subject: [PATCH 12/35] Apply suggestions from code review Co-authored-by: Hector Sanjuan --- src/ipips/ipip-0499.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/src/ipips/ipip-0499.md b/src/ipips/ipip-0499.md index 3fc669594..53d498839 100644 --- a/src/ipips/ipip-0499.md +++ b/src/ipips/ipip-0499.md @@ -22,7 +22,7 @@ tags: ['ipips'] This proposal introduces configuration profiles for CIDs used to represent files and directories with UnixFS. These ensure that the deterministic CID generation for the same data, regardless of the implementation. -Profiles explicitly define the UnixFS parameters, e.g. dag width, hash algorithem, and chunk size, that affect the resulting CID, such that given the profile and input data different implementations will generate identical CIDs. +Profiles explicitly define the UnixFS parameters, e.g. dag width, hash algorithm, and chunk size, that affect the resulting CID, such that given the profile and input data different implementations will generate identical CIDs. ## Motivation @@ -31,7 +31,7 @@ UnixFS CIDs are not deterministic. 
This means that the same file tree can yield This lack of determinism makes has a number of drawbacks: - It is difficult to verify content across different tools and implementations, as the same content may yield different CIDs. -- Users are requires to store and transfer UnixFS merkle proofs in order to verify CIDs, adding storage overhead, network bandwidth, and complexity to the verification process. +- Users are required to store and transfer UnixFS merkle proofs in order to verify CIDs, adding storage overhead, network bandwidth, and complexity to the verification process. - In terms of developer experience, it deviates from the mental model of a hash function, where the same input should always yield the same output. This leads to confusion and frustration when working with UnixFS CIDs By introducing profiles, we can benefit from both the optionality offered by UnixFS, where users are free to chose their own parameters, and determinism through profiles. @@ -49,7 +49,7 @@ The profiles define a set of parameters that affect the resulting CID. These par 1. CID version (currently only CIDv0 or CIDv1) 1. Hash function 1. UnixFS chunk size -1. UnixFS DAG layout (e.g. balanced, trickle) +1. UnixFS DAG layout (e.g. balanced, trickle etc...) 1. UnixFS DAG width (max number of links per `File` node) 1. `HAMTDirectory` fanout (must be a power of 2) 1. `HAMTDirectory` threshold (max `Directory` size before switching to `HAMTDirectory`): based on an estimate of the block size by counting the size of PNNode.Links @@ -115,7 +115,7 @@ Kubo 0.35 will have [`Import.*` configuration](https://github.com/ipfs/kubo/blob ### Alternatives -As an alternative to profiles, users can store and transfer CAR files of UnixFS content, which include the merkle proofs needed to verify the CID. +As an alternative to profiles, users can store and transfer CAR files of UnixFS content, which include the merkle DAG nodes needed to verify the CID. 
## Test fixtures From 0402c840d713f95c2565fef6cb1074e96fd2487b Mon Sep 17 00:00:00 2001 From: Daniel N <2color@users.noreply.github.com> Date: Wed, 20 Aug 2025 10:55:56 +0200 Subject: [PATCH 13/35] edit based on hector's feedback --- src/ipips/ipip-0499.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/src/ipips/ipip-0499.md b/src/ipips/ipip-0499.md index 53d498839..238e4796e 100644 --- a/src/ipips/ipip-0499.md +++ b/src/ipips/ipip-0499.md @@ -34,9 +34,7 @@ This lack of determinism makes has a number of drawbacks: - Users are required to store and transfer UnixFS merkle proofs in order to verify CIDs, adding storage overhead, network bandwidth, and complexity to the verification process. - In terms of developer experience, it deviates from the mental model of a hash function, where the same input should always yield the same output. This leads to confusion and frustration when working with UnixFS CIDs -By introducing profiles, we can benefit from both the optionality offered by UnixFS, where users are free to chose their own parameters, and determinism through profiles. - -UnixFS CIDs can be generated with a variety of settings and optimizations for chunking, DAG width, and more. This means the same file tree can yield multiple, different CIDs depending on which tools and settings are used, and it is not possible to reliably reproduce or verify the CID. Profiles offer With profiles, following the same profile will produce identical CIDs for identical content, whic makes verification regardless of implementation. +By introducing profiles which define the parameters that affect the root CID of the DAG, we can benefit from both the optionality offered by UnixFS, where users are free to chose their own parameters, and determinism through profiles. 
## Detailed design From ec07e30d5bef63a654feac04292d450eaa1a4fef Mon Sep 17 00:00:00 2001 From: Daniel Norman <1992255+2color@users.noreply.github.com> Date: Wed, 20 Aug 2025 11:11:19 +0200 Subject: [PATCH 14/35] Apply suggestions from code review Co-authored-by: Rod Vagg --- src/ipips/ipip-0499.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/src/ipips/ipip-0499.md b/src/ipips/ipip-0499.md index 238e4796e..c83ed6fbc 100644 --- a/src/ipips/ipip-0499.md +++ b/src/ipips/ipip-0499.md @@ -46,7 +46,8 @@ The profiles define a set of parameters that affect the resulting CID. These par 1. CID version (currently only CIDv0 or CIDv1) 1. Hash function -1. UnixFS chunk size +1. UnixFS file chunking algorithm +1. UnixFS file chunk size or target (if required by the chunking algorithm) 1. UnixFS DAG layout (e.g. balanced, trickle etc...) 1. UnixFS DAG width (max number of links per `File` node) 1. `HAMTDirectory` fanout (must be a power of 2) From f454912150e6d478ca8144d8ebb495e414da0851 Mon Sep 17 00:00:00 2001 From: Daniel N <2color@users.noreply.github.com> Date: Wed, 20 Aug 2025 11:31:15 +0200 Subject: [PATCH 15/35] add multibase encoding --- src/ipips/ipip-0499.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/src/ipips/ipip-0499.md b/src/ipips/ipip-0499.md index c83ed6fbc..2058dc75f 100644 --- a/src/ipips/ipip-0499.md +++ b/src/ipips/ipip-0499.md @@ -42,10 +42,11 @@ We introduce a set of named profiles that define a set of parameters for generat ### UnixFS parameters -The profiles define a set of parameters that affect the resulting CID. These parameters are based on the UnixFS specification and are used to generate the CID for a given file tree. The parameters include: +The profiles define a set of parameters that affect the resulting string encoding of the CID. These parameters are based on the UnixFS specification and are used to generate the CID for a given file tree. The parameters include: -1. 
CID version (currently only CIDv0 or CIDv1) -1. Hash function +1. CID version, e.g. CIDv0 or CIDv1 +1. Multibase encoding for the CID, e.g. base32 +1. Hash function used for all nodes in the DAG, e.g. sha2-256 1. UnixFS file chunking algorithm 1. UnixFS file chunk size or target (if required by the chunking algorithm) 1. UnixFS DAG layout (e.g. balanced, trickle etc...) @@ -53,8 +54,7 @@ The profiles define a set of parameters that affect the resulting CID. These par 1. `HAMTDirectory` fanout (must be a power of 2) 1. `HAMTDirectory` threshold (max `Directory` size before switching to `HAMTDirectory`): based on an estimate of the block size by counting the size of PNNode.Links 1. Leaf Envelope: either `dag-pb` or `raw` -1. Whether empty directories are included in the DAG - - Some implementations apply filtering before merkleizing filesystem entries in the DAG. +1. Whether empty directories are included in the DAG. Some implementations apply filtering before merkleizing filesystem entries in the DAG. This would be specified as a table in (forthcoming [UnixFS spec](https://github.com/ipfs/specs/pull/331/files)). From 9c621ba7d7f5f80bad090656074bc6a430f28901 Mon Sep 17 00:00:00 2001 From: Daniel N <2color@users.noreply.github.com> Date: Wed, 20 Aug 2025 12:29:05 +0200 Subject: [PATCH 16/35] address feedback from rvagg --- src/ipips/ipip-0499.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/src/ipips/ipip-0499.md b/src/ipips/ipip-0499.md index 2058dc75f..d3cc2df14 100644 --- a/src/ipips/ipip-0499.md +++ b/src/ipips/ipip-0499.md @@ -51,10 +51,12 @@ The profiles define a set of parameters that affect the resulting string encodin 1. UnixFS file chunk size or target (if required by the chunking algorithm) 1. UnixFS DAG layout (e.g. balanced, trickle etc...) 1. UnixFS DAG width (max number of links per `File` node) -1. `HAMTDirectory` fanout (must be a power of 2) +1. `HAMTDirectory` bitwidth, i.e. 
the number of bits determines the fanout of the `HAMTDirectory` (default bitwidth is 8 == 256 leaves). 1. `HAMTDirectory` threshold (max `Directory` size before switching to `HAMTDirectory`): based on an estimate of the block size by counting the size of PNNode.Links 1. Leaf Envelope: either `dag-pb` or `raw` 1. Whether empty directories are included in the DAG. Some implementations apply filtering before merkleizing filesystem entries in the DAG. +1. Directory wrapping for single files: in order to retain the name of a single file, some implementations have the option to wrap the file in a `Directory` with link to the file. +2. Presence and accurate setting of `Tsize`. This would be specified as a table in (forthcoming [UnixFS spec](https://github.com/ipfs/specs/pull/331/files)). From c109c1ac6a270506f04dc624714495f4d4c5b638 Mon Sep 17 00:00:00 2001 From: Mosh <1306020+mishmosh@users.noreply.github.com> Date: Fri, 14 Nov 2025 20:22:22 -0500 Subject: [PATCH 17/35] Update ipip-0499.md Fixed outdated references, consistent profile names, streamlined Summary and Motivation sections. --- src/ipips/ipip-0499.md | 84 +++++++++++++++++++++--------------------- 1 file changed, 42 insertions(+), 42 deletions(-) diff --git a/src/ipips/ipip-0499.md b/src/ipips/ipip-0499.md index d3cc2df14..c86cf9efd 100644 --- a/src/ipips/ipip-0499.md +++ b/src/ipips/ipip-0499.md @@ -1,17 +1,18 @@ --- title: 'IPIP-0499: CID Profiles' -date: 2025-04-03 +date: 2025-11-14 ipip: proposal editors: - name: Michelle Lee github: mishmosh affiliation: name: IPFS Foundation + url: https://ipfsfoundation.org - name: Daniel Norman github: 2color affiliation: - name: Shipyard - url: https://ipshipyard.com + name: Independent + url: https://norman.life relatedIssues: - https://discuss.ipfs.tech/t/should-we-profile-cids/18507 order: 0499 @@ -20,29 +21,27 @@ tags: ['ipips'] ## Summary -This proposal introduces configuration profiles for CIDs used to represent files and directories with UnixFS. 
These ensure that the deterministic CID generation for the same data, regardless of the implementation. - -Profiles explicitly define the UnixFS parameters, e.g. dag width, hash algorithm, and chunk size, that affect the resulting CID, such that given the profile and input data different implementations will generate identical CIDs. +This proposal introduces **configuration profiles** for CIDs that represent files and directories using [UnixFS](https://specs.ipfs.tech/unixfs/). ## Motivation -UnixFS CIDs are not deterministic. This means that the same file tree can yield different CIDs depending on the parameters used by the implementation to generate it, which in some cases, aren't even configurable by the user. For example, the chunk size, DAG width, and layout can vary between implementations or even between different versions of the same implementation. +UnixFS CIDs are currently non-deterministic. The same file or directory can produce different CIDs across implementations, because parameters like chunk size, DAG width, and layout vary between implementations. Often, these parameters are not even configurable by users. -This lack of determinism makes has a number of drawbacks: +This creates three problems: -- It is difficult to verify content across different tools and implementations, as the same content may yield different CIDs. -- Users are required to store and transfer UnixFS merkle proofs in order to verify CIDs, adding storage overhead, network bandwidth, and complexity to the verification process. -- In terms of developer experience, it deviates from the mental model of a hash function, where the same input should always yield the same output. This leads to confusion and frustration when working with UnixFS CIDs +- **Verification difficulty:** The same content produces different CIDs across tools, making content verification unreliable. 
+- **Additional overhead:** Users must store and transfer UnixFS merkle proofs to verify CIDs, adding storage overhead, network bandwidth, and complexity. +- **Broken expectations:** Unlike standard hash functions where identical input produces identical output, UnixFS CIDs behave unpredictably. -By introducing profiles which define the parameters that affect the root CID of the DAG, we can benefit from both the optionality offered by UnixFS, where users are free to chose their own parameters, and determinism through profiles. +Configuration profiles solve this by explicitly defining all parameters that affect CID generation. This preserves UnixFS flexibility (users can still choose parameters) while enabling deterministic results. ## Detailed design -We introduce a set of named profiles that define a set of parameters for generating UnixFS CIDs. These profiles can be used by implementations to ensure that the same content will yield the same CID across different tools and implementations. +We introduce a set of **named configuration profiles**, each specifying the complete set of parameters for generating UnixFS CIDs. When implementations use these profiles, they guarantee that the same input, processed with the same profile, will yield the same CID across different tools and implementations. ### UnixFS parameters -The profiles define a set of parameters that affect the resulting string encoding of the CID. These parameters are based on the UnixFS specification and are used to generate the CID for a given file tree. The parameters include: +Here is the complete set of UnixFS parameters that affect the resulting string encoding of the CID: 1. CID version, e.g. CIDv0 or CIDv1 1. Multibase encoding for the CID, e.g. base32 @@ -51,24 +50,22 @@ The profiles define a set of parameters that affect the resulting string encodin 1. UnixFS file chunk size or target (if required by the chunking algorithm) 1. UnixFS DAG layout (e.g. balanced, trickle etc...) 1. 
UnixFS DAG width (max number of links per `File` node) -1. `HAMTDirectory` bitwidth, i.e. the number of bits determines the fanout of the `HAMTDirectory` (default bitwidth is 8 == 256 leaves). +1. `HAMTDirectory` fanout, i.e. the number of bits determines the fanout of the `HAMTDirectory` (default bitwidth is 8 == 256 leaves). 1. `HAMTDirectory` threshold (max `Directory` size before switching to `HAMTDirectory`): based on an estimate of the block size by counting the size of PNNode.Links 1. Leaf Envelope: either `dag-pb` or `raw` 1. Whether empty directories are included in the DAG. Some implementations apply filtering before merkleizing filesystem entries in the DAG. 1. Directory wrapping for single files: in order to retain the name of a single file, some implementations have the option to wrap the file in a `Directory` with link to the file. -2. Presence and accurate setting of `Tsize`. - -This would be specified as a table in (forthcoming [UnixFS spec](https://github.com/ipfs/specs/pull/331/files)). +1. Presence and accurate setting of `Tsize`. -## Named profiles +The handling of empty files, hidden files, unreadable files, symlinks, and symlink follows is defined by the [UnixFS](https://specs.ipfs.tech/unixfs/) spec. -To make it easier for users and implementations to choose a set of parameters, we define a named profile `unixfs-2025` to encapsulate the parameters established as the baseline default by multiple implementations as of 2025. +## CID profiles -The **`unixfs-2025`** profile name is designed to be referenced by implementations and users to ensure that the same content will yield the same CID across different tools and implementations. +To enable consistent CID generation, we define a series of named profiles that specify complete UnixFS parameter sets. Profile names may have any prefix, but must end in `YYYY-MM`. 
-The profile is defined as follows: +The initial profile in the series, **`unixfs-2025`**, captures the baseline default parameters used by multiple implementations as of November 2025. -| Parameter | Value | +| Parameter | `unixfs-2025` | | ----------------------------- | ------------------------------------------------------- | | CID version | CIDv1 | | Hash function | sha2-256 | @@ -80,39 +77,42 @@ The profile is defined as follows: | Leaves | raw | | Empty directories | TODO | -## Current defaults +## Legacy profiles -Here is a summary table of current (2025-Q2) defaults: +We also define a series of **legacy profiles**, used by various implementations as of November 2025: -| | Helia default | Kubo `legacy-cid-v0` (default) | Storacha default | Kubo `test-cid-v1` | Kubo `test-cid-v1-wide` | DASL | -| ----------------------------- | ------------- | ------------------------------ | ---------------- | ------------------ | ----------------------- | ------------- | -| CID version | CIDv1 | CIDv0 | CIDv1 | CIDv1 | CIDv1 | CIDv1 | -| Hash function | sha2-256 | sha2-256 | sha2-256 | sha2-256 | sha2-256 | sha2-256 | -| Max chunk size | 1MiB | 256KiB | 1MiB | 1MiB | 1MiB | not specified | -| DAG layout | balanced | balanced | balanced | balanced | balanced | not specified | -| DAG width (children per node) | 1024 | 174 | 1024 | 174 | **1024** | not specified | -| `HAMTDirectory` fanout | 256 blocks | 256 blocks | 256 blocks | 256 blocks | **1024** | not specified | -| `HAMTDirectory` threshold | 256KiB (est) | 256KiB (est:links[name+cid]) | 1000 **links** | 256KiB | **1MiB** | not specified | -| Leaves | raw | raw | raw | raw | raw | not specified | -| Empty directories | Included | Included | Ignored | Included | Included | not specified | +| | `kubo-legacy-2015` (kubo default) | `helia-2025` | `storacha-2025` | `kubo-2025` | `kubo-wide-2025` | `dasl-2025` | +| ----------------------------- | ------------------------------ | --------------- | ------------------ | 
------------------ | ----------------------- | ------------- | +| CID version | CIDv0 | CIDv1 | CIDv1 | CIDv1 | CIDv1 | CIDv1 | +| Hash function | sha2-256 | sha2-256 | sha2-256 | sha2-256 | sha2-256 | sha2-256 | +| Max chunk size | 256KiB | 1MiB | 1MiB | 1MiB | 1MiB | not specified | +| DAG layout | balanced | balanced | balanced | balanced | balanced | not specified | +| DAG width (children per node) | 174 | 1024 | 1024 | 174 | **1024** | not specified | +| `HAMTDirectory` fanout | 256 blocks | 256 blocks | 256 blocks | 256 blocks | **1024** | not specified | +| `HAMTDirectory` threshold | 256KiB (est:links[name+cid]) | 256KiB (est) | 1000 **links** | 256KiB | **1MiB** | not specified | +| Leaves | raw | raw | raw | raw | raw | not specified | +| Empty directories | Included | Included | Ignored | Included | Included | not specified | See related discussion at https://discuss.ipfs.tech/t/should-we-profile-cids/18507/ ### User benefit -Profiles reduce the burden of verifying UnixFS content, as users can simply choose a profile and know that the resulting CIDs will be deterministic across implementations. This eliminates the need for users to understand the underlying parameters that affect CID generation, and allows them to focus on the content itself. +Profiles provide 3 key advantages for working with content-addressed data: + +1. **Predictable, deterministic behavior:** Profiles restore the expected property of content addressing: identical input data always produces identical CIDs, regardless of which implementation generates them. -Moreover, profiles allow users to verify content without needing to rely on additional merkle proofs and CAR files, which can be cumbersome and inefficient. +2. **Lightweight verification:** Users can verify content without needing to rely on additional merkle proofs or CAR files. -Finally, profiles improve the developer experience by aligning with the mental model of a hash function. +3. 
**Simplified workflow:** Users can select a profile and automatically get consistent CIDs across all implementations, without needing to configure or understand the underlying parameters. ### Compatibility -UnixFS Data encoded with the profiles defined in this IPIP is fully compatible with existing implementations, as it is fully compliant with the UnixFS specification. +UnixFS data encoded with the CID profiles defined in this IPIP remains fully compatible with existing implementations, since it conforms to the [https://specs.ipfs.tech/unixfs/](specification). -To produce CIDs that are compliant with this IPIP, implementations will need to support the parameters defined in the profiles. This may require changes to existing implementations to expose configuration options for the parameters, or to implement new functionality to support the profiles. +To generate CIDs in compliance with this IPIP, implementations must support the parameters defined in the profiles and support the the set of named profiles. They MAY also support legacy profiles. -Kubo 0.35 will have [`Import.*` configuration](https://github.com/ipfs/kubo/blob/master/docs/config.md#import) option to control DAG width. +* Adding new functionality to support parameters and/or profiles +* Exposing configuration options for profiles ### Alternatives From 383f9e34af2e0761cc0675e2bac371d00f61ecc1 Mon Sep 17 00:00:00 2001 From: Mosh <1306020+mishmosh@users.noreply.github.com> Date: Thu, 20 Nov 2025 10:10:26 -0500 Subject: [PATCH 18/35] Update src/ipips/ipip-0499.md Co-authored-by: Rod Vagg --- src/ipips/ipip-0499.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/src/ipips/ipip-0499.md b/src/ipips/ipip-0499.md index c86cf9efd..4f8ae9da6 100644 --- a/src/ipips/ipip-0499.md +++ b/src/ipips/ipip-0499.md @@ -53,7 +53,8 @@ Here is the complete set of UnixFS parameters that affect the resulting string e 1. `HAMTDirectory` fanout, i.e. 
the number of bits determines the fanout of the `HAMTDirectory` (default bitwidth is 8 == 256 leaves). 1. `HAMTDirectory` threshold (max `Directory` size before switching to `HAMTDirectory`): based on an estimate of the block size by counting the size of PNNode.Links 1. Leaf Envelope: either `dag-pb` or `raw` -1. Whether empty directories are included in the DAG. Some implementations apply filtering before merkleizing filesystem entries in the DAG. +1. Whether empty directories are included in the DAG. Some implementations may apply filtering. +1. Whether hidden entities (including dot files) are included in the DAG. Some implementations may apply filtering. 1. Directory wrapping for single files: in order to retain the name of a single file, some implementations have the option to wrap the file in a `Directory` with link to the file. 1. Presence and accurate setting of `Tsize`. From e564968ce274fc8d209554ba24501ff447d69976 Mon Sep 17 00:00:00 2001 From: Mosh <1306020+mishmosh@users.noreply.github.com> Date: Thu, 20 Nov 2025 10:29:48 -0500 Subject: [PATCH 19/35] Update src/ipips/ipip-0499.md Co-authored-by: Rod Vagg --- src/ipips/ipip-0499.md | 1 + 1 file changed, 1 insertion(+) diff --git a/src/ipips/ipip-0499.md b/src/ipips/ipip-0499.md index 4f8ae9da6..7e82d69fd 100644 --- a/src/ipips/ipip-0499.md +++ b/src/ipips/ipip-0499.md @@ -77,6 +77,7 @@ The initial profile in the series, **`unixfs-2025`**, captures the baseline defa | `HAMTDirectory` threshold | 256KiB (estimated by counting the size of PBNode.links) | | Leaves | raw | | Empty directories | TODO | +| Hidden entities | TODO | ## Legacy profiles From bbd547f8021ef9327a3c6359b854c2a3f0c81b5c Mon Sep 17 00:00:00 2001 From: Marcin Rataj Date: Thu, 20 Nov 2025 17:07:18 +0100 Subject: [PATCH 20/35] Update src/ipips/ipip-0499.md Co-authored-by: Rod Vagg --- src/ipips/ipip-0499.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/ipips/ipip-0499.md b/src/ipips/ipip-0499.md index 
7e82d69fd..0f088d776 100644 --- a/src/ipips/ipip-0499.md +++ b/src/ipips/ipip-0499.md @@ -58,7 +58,7 @@ Here is the complete set of UnixFS parameters that affect the resulting string e 1. Directory wrapping for single files: in order to retain the name of a single file, some implementations have the option to wrap the file in a `Directory` with link to the file. 1. Presence and accurate setting of `Tsize`. -The handling of empty files, hidden files, unreadable files, symlinks, and symlink follows is defined by the [UnixFS](https://specs.ipfs.tech/unixfs/) spec. +The handling of symlinks and symlink follows is defined by the [UnixFS](https://specs.ipfs.tech/unixfs/) spec. ## CID profiles From 70514b9f4f16914c8d0b4a99d80883f902a3fe63 Mon Sep 17 00:00:00 2001 From: Mosh <1306020+mishmosh@users.noreply.github.com> Date: Fri, 21 Nov 2025 10:55:50 -0500 Subject: [PATCH 21/35] fix typo (the the) --- src/ipips/ipip-0499.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/ipips/ipip-0499.md b/src/ipips/ipip-0499.md index 0f088d776..2992ba9d9 100644 --- a/src/ipips/ipip-0499.md +++ b/src/ipips/ipip-0499.md @@ -111,7 +111,7 @@ Profiles provide 3 key advantages for working with content-addressed data: UnixFS data encoded with the CID profiles defined in this IPIP remains fully compatible with existing implementations, since it conforms to the [https://specs.ipfs.tech/unixfs/](specification). -To generate CIDs in compliance with this IPIP, implementations must support the parameters defined in the profiles and support the the set of named profiles. They MAY also support legacy profiles. +To generate CIDs in compliance with this IPIP, implementations must support the parameters defined in the profiles and support the set of named profiles. They MAY also support legacy profiles. 
* Adding new functionality to support parameters and/or profiles * Exposing configuration options for profiles From 92352d7f38a3689b8c3272cc0213f620dbf697db Mon Sep 17 00:00:00 2001 From: Marcin Rataj Date: Sat, 13 Dec 2025 00:01:06 +0100 Subject: [PATCH 22/35] feat(ipip-0499): add chunking algorithm and align profile tables - add chunking algorithm parameter to both tables (fixed-size) - add hidden entities row to legacy profiles table - ensures both unixfs-2025 and legacy tables cover same parameters --- src/ipips/ipip-0499.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/src/ipips/ipip-0499.md b/src/ipips/ipip-0499.md index 2992ba9d9..1ab90fcbe 100644 --- a/src/ipips/ipip-0499.md +++ b/src/ipips/ipip-0499.md @@ -70,6 +70,7 @@ The initial profile in the series, **`unixfs-2025`**, captures the baseline defa | ----------------------------- | ------------------------------------------------------- | | CID version | CIDv1 | | Hash function | sha2-256 | +| Chunking algorithm | fixed-size | | Max chunk size | 1MiB | | DAG layout | balanced | | DAG width (children per node) | 1024 | @@ -87,6 +88,7 @@ We also define a series of **legacy profiles**, used by various implementations | ----------------------------- | ------------------------------ | --------------- | ------------------ | ------------------ | ----------------------- | ------------- | | CID version | CIDv0 | CIDv1 | CIDv1 | CIDv1 | CIDv1 | CIDv1 | | Hash function | sha2-256 | sha2-256 | sha2-256 | sha2-256 | sha2-256 | sha2-256 | +| Chunking algorithm | fixed-size | fixed-size | fixed-size | fixed-size | fixed-size | not specified | | Max chunk size | 256KiB | 1MiB | 1MiB | 1MiB | 1MiB | not specified | | DAG layout | balanced | balanced | balanced | balanced | balanced | not specified | | DAG width (children per node) | 174 | 1024 | 1024 | 174 | **1024** | not specified | @@ -94,6 +96,7 @@ We also define a series of **legacy profiles**, used by various implementations | `HAMTDirectory` threshold | 
256KiB (est:links[name+cid]) | 256KiB (est) | 1000 **links** | 256KiB | **1MiB** | not specified | | Leaves | raw | raw | raw | raw | raw | not specified | | Empty directories | Included | Included | Ignored | Included | Included | not specified | +| Hidden entities | Included | Ignored | Ignored | Included | Included | not specified | See related discussion at https://discuss.ipfs.tech/t/should-we-profile-cids/18507/ From 9d0d415a0dea993765404b4da541adea39f4179b Mon Sep 17 00:00:00 2001 From: Marcin Rataj Date: Sat, 13 Dec 2025 00:13:43 +0100 Subject: [PATCH 23/35] fix(ipip-0499): correct kubo legacy profile - rename kubo-legacy-2015 to kubo-legacy-2025 - clarify (v0.39 default) instead of (kubo default) - fix leaves value: dag-pb (UnixFSRawLeaves=False in legacy-cid-v0) --- src/ipips/ipip-0499.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/src/ipips/ipip-0499.md b/src/ipips/ipip-0499.md index 1ab90fcbe..88db684a5 100644 --- a/src/ipips/ipip-0499.md +++ b/src/ipips/ipip-0499.md @@ -84,7 +84,7 @@ The initial profile in the series, **`unixfs-2025`**, captures the baseline defa We also define a series of **legacy profiles**, used by various implementations as of November 2025: -| | `kubo-legacy-2015` (kubo default) | `helia-2025` | `storacha-2025` | `kubo-2025` | `kubo-wide-2025` | `dasl-2025` | +| | `kubo-legacy-2025` (v0.39) | `helia-2025` | `storacha-2025` | `kubo-2025` | `kubo-wide-2025` | `dasl-2025` | | ----------------------------- | ------------------------------ | --------------- | ------------------ | ------------------ | ----------------------- | ------------- | | CID version | CIDv0 | CIDv1 | CIDv1 | CIDv1 | CIDv1 | CIDv1 | | Hash function | sha2-256 | sha2-256 | sha2-256 | sha2-256 | sha2-256 | sha2-256 | @@ -94,7 +94,7 @@ We also define a series of **legacy profiles**, used by various implementations | DAG width (children per node) | 174 | 1024 | 1024 | 174 | **1024** | not specified | | `HAMTDirectory` fanout | 256 blocks | 
256 blocks | 256 blocks | 256 blocks | **1024** | not specified | | `HAMTDirectory` threshold | 256KiB (est:links[name+cid]) | 256KiB (est) | 1000 **links** | 256KiB | **1MiB** | not specified | -| Leaves | raw | raw | raw | raw | raw | not specified | +| Leaves | dag-pb | raw | raw | raw | raw | not specified | | Empty directories | Included | Included | Ignored | Included | Included | not specified | | Hidden entities | Included | Ignored | Ignored | Included | Included | not specified | From a3dc7e2a50bb42be112a627a0800942fa5259957 Mon Sep 17 00:00:00 2001 From: Marcin Rataj Date: Sat, 13 Dec 2025 01:02:19 +0100 Subject: [PATCH 24/35] fix(ipip-0499): document legacy profile filtering behavior clarify empty directories and hidden entities handling with precise terminology based on kubo v0.39, helia, and storacha implementations: - `included`: always in DAG, no option to exclude (kubo/helia empty dirs) - `excluded`: never in DAG, no option to include (storacha empty dirs) - `opt-in`: excluded by default, flag to include (all hidden entities) - `opt-out`: included by default, flag to exclude add terminology note to explain these terms --- src/ipips/ipip-0499.md | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/src/ipips/ipip-0499.md b/src/ipips/ipip-0499.md index 88db684a5..e0b670ed5 100644 --- a/src/ipips/ipip-0499.md +++ b/src/ipips/ipip-0499.md @@ -95,8 +95,14 @@ We also define a series of **legacy profiles**, used by various implementations | `HAMTDirectory` fanout | 256 blocks | 256 blocks | 256 blocks | 256 blocks | **1024** | not specified | | `HAMTDirectory` threshold | 256KiB (est:links[name+cid]) | 256KiB (est) | 1000 **links** | 256KiB | **1MiB** | not specified | | Leaves | dag-pb | raw | raw | raw | raw | not specified | -| Empty directories | Included | Included | Ignored | Included | Included | not specified | -| Hidden entities | Included | Ignored | Ignored | Included | Included | not specified | +| Empty directories | 
included | included | excluded | included | included | not specified | +| Hidden entities | opt-in | opt-in | opt-in | opt-in | opt-in | not specified | + +**Terminology:** +- `included`: Always included in the DAG (no option to exclude) +- `excluded`: Always excluded from the DAG (no option to include) +- `opt-in`: Excluded by default; implementations provide a flag to include (e.g., `--hidden` in Kubo/Storacha, `hidden: true` in Helia) +- `opt-out`: Included by default; implementations provide a flag to exclude See related discussion at https://discuss.ipfs.tech/t/should-we-profile-cids/18507/ From 94a1b79ad6989f33376dc42c2b8970a0658d62af Mon Sep 17 00:00:00 2001 From: Marcin Rataj Date: Sat, 13 Dec 2025 01:05:03 +0100 Subject: [PATCH 25/35] fix(ipip-0499): note that legacy table includes non-UnixFS implementations --- src/ipips/ipip-0499.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/ipips/ipip-0499.md b/src/ipips/ipip-0499.md index e0b670ed5..8d7eaaf6d 100644 --- a/src/ipips/ipip-0499.md +++ b/src/ipips/ipip-0499.md @@ -21,7 +21,7 @@ tags: ['ipips'] ## Summary -This proposal introduces **configuration profiles** for CIDs that represent files and directories using [UnixFS](https://specs.ipfs.tech/unixfs/). +This proposal introduces **configuration profiles** for CIDs that represent files and directories using [UnixFS](https://specs.ipfs.tech/unixfs/). The legacy profiles table also documents non-UnixFS implementations for reference. 
## Motivation From 7a8d6ab33474d1ed5ef7d9c163484aea153d68c5 Mon Sep 17 00:00:00 2001 From: Marcin Rataj Date: Sat, 13 Dec 2025 01:18:46 +0100 Subject: [PATCH 26/35] feat(ipip-0499): add implementation versions to legacy profiles table add "Based on" row with package/tool versions and kubo profile names --- src/ipips/ipip-0499.md | 1 + 1 file changed, 1 insertion(+) diff --git a/src/ipips/ipip-0499.md b/src/ipips/ipip-0499.md index 8d7eaaf6d..e3c042c9c 100644 --- a/src/ipips/ipip-0499.md +++ b/src/ipips/ipip-0499.md @@ -86,6 +86,7 @@ We also define a series of **legacy profiles**, used by various implementations | | `kubo-legacy-2025` (v0.39) | `helia-2025` | `storacha-2025` | `kubo-2025` | `kubo-wide-2025` | `dasl-2025` | | ----------------------------- | ------------------------------ | --------------- | ------------------ | ------------------ | ----------------------- | ------------- | +| Based on | kubo v0.39 (`legacy-cid-v0`) | @helia/unixfs 6.0.4 | w3cli 7.12.0 | kubo v0.39 (`test-cid-v1`) | kubo v0.39 (`test-cid-v1-wide`) | 2025-12 | | CID version | CIDv0 | CIDv1 | CIDv1 | CIDv1 | CIDv1 | CIDv1 | | Hash function | sha2-256 | sha2-256 | sha2-256 | sha2-256 | sha2-256 | sha2-256 | | Chunking algorithm | fixed-size | fixed-size | fixed-size | fixed-size | fixed-size | not specified | From a3044d61c88eecb8f2a45daa3861c0435cbd30f0 Mon Sep 17 00:00:00 2001 From: Marcin Rataj Date: Sat, 13 Dec 2025 01:24:03 +0100 Subject: [PATCH 27/35] fix(ipip-0499): update HAMTDirectory threshold and clean up parameters - unixfs-2025: mark threshold as TODO, prefer Helia's block size approach - unixfs-2025: note kubo needs opt-out flag for empty directories - legacy profiles: add estimation method to kubo profiles - parameters section: add backticks, clarify threshold estimation methods --- src/ipips/ipip-0499.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/src/ipips/ipip-0499.md b/src/ipips/ipip-0499.md index e3c042c9c..0b2840430 100644 --- 
a/src/ipips/ipip-0499.md +++ b/src/ipips/ipip-0499.md @@ -44,14 +44,14 @@ We introduce a set of **named configuration profiles**, each specifying the comp Here is the complete set of UnixFS parameters that affect the resulting string encoding of the CID: 1. CID version, e.g. CIDv0 or CIDv1 -1. Multibase encoding for the CID, e.g. base32 -1. Hash function used for all nodes in the DAG, e.g. sha2-256 +1. Multibase encoding for the CID, e.g. `base32` +1. Hash function used for all nodes in the DAG, e.g. `sha2-256` 1. UnixFS file chunking algorithm 1. UnixFS file chunk size or target (if required by the chunking algorithm) -1. UnixFS DAG layout (e.g. balanced, trickle etc...) +1. UnixFS DAG layout, e.g. `balanced`, `trickle` 1. UnixFS DAG width (max number of links per `File` node) 1. `HAMTDirectory` fanout, i.e. the number of bits determines the fanout of the `HAMTDirectory` (default bitwidth is 8 == 256 leaves). -1. `HAMTDirectory` threshold (max `Directory` size before switching to `HAMTDirectory`): based on an estimate of the block size by counting the size of PNNode.Links +1. `HAMTDirectory` threshold: max `Directory` size before switching to `HAMTDirectory`. Size can be estimated by link count (naive), `PBNode.Links` size (name + CID), or full dag-pb block size (most accurate). 1. Leaf Envelope: either `dag-pb` or `raw` 1. Whether empty directories are included in the DAG. Some implementations may apply filtering. 1. Whether hidden entities (including dot files) are included in the DAG. Some implementations may apply filtering. 
@@ -75,9 +75,9 @@ The initial profile in the series, **`unixfs-2025`**, captures the baseline defa | DAG layout | balanced | | DAG width (children per node) | 1024 | | `HAMTDirectory` fanout | 256 blocks | -| `HAMTDirectory` threshold | 256KiB (estimated by counting the size of PBNode.links) | +| `HAMTDirectory` threshold | TODO (likely entire block size, as in Helia) | | Leaves | raw | -| Empty directories | TODO | +| Empty directories | TODO (kubo needs opt-out flag) | | Hidden entities | TODO | ## Legacy profiles @@ -94,7 +94,7 @@ We also define a series of **legacy profiles**, used by various implementations | DAG layout | balanced | balanced | balanced | balanced | balanced | not specified | | DAG width (children per node) | 174 | 1024 | 1024 | 174 | **1024** | not specified | | `HAMTDirectory` fanout | 256 blocks | 256 blocks | 256 blocks | 256 blocks | **1024** | not specified | -| `HAMTDirectory` threshold | 256KiB (est:links[name+cid]) | 256KiB (est) | 1000 **links** | 256KiB | **1MiB** | not specified | +| `HAMTDirectory` threshold | 256KiB (est:links[name+cid]) | 256KiB (est) | 1000 **links** | 256KiB (est:links[name+cid]) | **1MiB** (est:links[name+cid]) | not specified | | Leaves | dag-pb | raw | raw | raw | raw | not specified | | Empty directories | included | included | excluded | included | included | not specified | | Hidden entities | opt-in | opt-in | opt-in | opt-in | opt-in | not specified | From 5b19f2b6438f61fc98abaf179454c4714e02ca61 Mon Sep 17 00:00:00 2001 From: Marcin Rataj Date: Sat, 13 Dec 2025 02:12:22 +0100 Subject: [PATCH 28/35] feat(ipip-0499): document symlink handling in profiles - add Symlinks parameter to UnixFS parameters list - add Symlinks row to unixfs-2025 (TODO) and legacy profiles tables - kubo: preserved, helia/storacha: followed, dasl: not specified - add terminology for preserved/followed with UnixFS spec reference - clarify kubo --dereference-args behavior --- src/ipips/ipip-0499.md | 7 ++++++- 1 file changed, 6 
insertions(+), 1 deletion(-) diff --git a/src/ipips/ipip-0499.md b/src/ipips/ipip-0499.md index 0b2840430..d0d7d7ac5 100644 --- a/src/ipips/ipip-0499.md +++ b/src/ipips/ipip-0499.md @@ -57,8 +57,9 @@ Here is the complete set of UnixFS parameters that affect the resulting string e 1. Whether hidden entities (including dot files) are included in the DAG. Some implementations may apply filtering. 1. Directory wrapping for single files: in order to retain the name of a single file, some implementations have the option to wrap the file in a `Directory` with link to the file. 1. Presence and accurate setting of `Tsize`. +1. Symlink handling: preserved as UnixFS Type=4 nodes, or followed (dereferenced to target). -The handling of symlinks and symlink follows is defined by the [UnixFS](https://specs.ipfs.tech/unixfs/) spec. +The [UnixFS spec](https://specs.ipfs.tech/unixfs/) defines Type=4 for symlinks with target path stored in the Data field. ## CID profiles @@ -79,6 +80,7 @@ The initial profile in the series, **`unixfs-2025`**, captures the baseline defa | Leaves | raw | | Empty directories | TODO (kubo needs opt-out flag) | | Hidden entities | TODO | +| Symlinks | TODO (preserved?) 
| ## Legacy profiles @@ -98,12 +100,15 @@ We also define a series of **legacy profiles**, used by various implementations | Leaves | dag-pb | raw | raw | raw | raw | not specified | | Empty directories | included | included | excluded | included | included | not specified | | Hidden entities | opt-in | opt-in | opt-in | opt-in | opt-in | not specified | +| Symlinks | preserved | followed | followed | preserved | preserved | not specified | **Terminology:** - `included`: Always included in the DAG (no option to exclude) - `excluded`: Always excluded from the DAG (no option to include) - `opt-in`: Excluded by default; implementations provide a flag to include (e.g., `--hidden` in Kubo/Storacha, `hidden: true` in Helia) - `opt-out`: Included by default; implementations provide a flag to exclude +- `preserved`: Symlinks stored as UnixFS Type=4 nodes with target path (per [UnixFS spec](https://specs.ipfs.tech/unixfs/)). Note: Kubo (v0.39) `--dereference-args` only follows symlinks passed as CLI arguments; symlinks found during recursive traversal are always preserved. +- `followed`: Symlinks dereferenced and treated as target files/directories See related discussion at https://discuss.ipfs.tech/t/should-we-profile-cids/18507/ From 3a092a4a0fce6b56b59778d8aec0b95aa7545e9a Mon Sep 17 00:00:00 2001 From: Marcin Rataj Date: Sat, 13 Dec 2025 02:32:12 +0100 Subject: [PATCH 29/35] fix(ipip-0499): clarify HAMTDirectory threshold calculation methods recommend full serialized PBNode size, link to dag-pb spec ref: https://github.com/ipfs/specs/pull/499#discussion_r2523340313 --- src/ipips/ipip-0499.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/ipips/ipip-0499.md b/src/ipips/ipip-0499.md index d0d7d7ac5..a2fe79fb6 100644 --- a/src/ipips/ipip-0499.md +++ b/src/ipips/ipip-0499.md @@ -51,7 +51,7 @@ Here is the complete set of UnixFS parameters that affect the resulting string e 1. UnixFS DAG layout, e.g. `balanced`, `trickle` 1. 
UnixFS DAG width (max number of links per `File` node) 1. `HAMTDirectory` fanout, i.e. the number of bits determines the fanout of the `HAMTDirectory` (default bitwidth is 8 == 256 leaves). -1. `HAMTDirectory` threshold: max `Directory` size before switching to `HAMTDirectory`. Size can be estimated by link count (naive), `PBNode.Links` size (name + CID), or full dag-pb block size (most accurate). +1. `HAMTDirectory` threshold: max `Directory` size before switching to `HAMTDirectory`. Size can be calculated using full serialized [PBNode](https://specs.ipfs.tech/unixfs/#dag-pb-node) size (recommended), or estimated by `PBNode.Links` size (name + CID), or link count (naive). 1. Leaf Envelope: either `dag-pb` or `raw` 1. Whether empty directories are included in the DAG. Some implementations may apply filtering. 1. Whether hidden entities (including dot files) are included in the DAG. Some implementations may apply filtering. From 123be3d78ff18430c7d83ff1729755778257d781 Mon Sep 17 00:00:00 2001 From: Marcin Rataj Date: Sat, 13 Dec 2025 02:45:13 +0100 Subject: [PATCH 30/35] fix(ipip-0499): update metadata and add contributors - rename to UnixFS CID Profiles - add lidel as editor - add thanks section with PR reviewers --- src/ipips/ipip-0499.md | 33 +++++++++++++++++++++++++++++++-- 1 file changed, 31 insertions(+), 2 deletions(-) diff --git a/src/ipips/ipip-0499.md b/src/ipips/ipip-0499.md index a2fe79fb6..d53404780 100644 --- a/src/ipips/ipip-0499.md +++ b/src/ipips/ipip-0499.md @@ -1,6 +1,6 @@ --- -title: 'IPIP-0499: CID Profiles' -date: 2025-11-14 +title: 'IPIP-0499: UnixFS CID Profiles' +date: 2025-12-13 ipip: proposal editors: - name: Michelle Lee @@ -13,8 +13,37 @@ editors: affiliation: name: Independent url: https://norman.life + - name: Marcin Rataj + github: lidel + affiliation: + name: Shipyard + url: https://ipshipyard.com/ relatedIssues: - https://discuss.ipfs.tech/t/should-we-profile-cids/18507 +thanks: + - name: Alex Potsides + github: achingbrain + 
affiliation: + name: Shipyard + url: https://ipshipyard.com/ + - name: Juan Caballero + github: bumblefudge + affiliation: + name: IPFS Foundation + url: https://ipfsfoundation.org + - name: Hector Sanjuan + github: hsanjuan + affiliation: + name: Shipyard + url: https://ipshipyard.com/ + - name: Steven Vandevelde + github: icidasset + - name: Christian Paul + github: jaller94 + - name: Rod Vagg + github: rvagg + - name: Seth Docherty + github: SethDocherty order: 0499 tags: ['ipips'] --- From 263892a2b318aab06e0827bc07f3ea0f57cda752 Mon Sep 17 00:00:00 2001 From: Marcin Rataj Date: Tue, 13 Jan 2026 23:18:00 +0100 Subject: [PATCH 31/35] feat(ipip-0499): document HAMTDirectory threshold estimation methods - add `links-count`, `links-bytes`, `block-bytes` estimation methods - fill unixfs-2025 profile: 256KiB (block-bytes), empty dirs included (opt-out), hidden opt-in, symlinks preserved - update legacy profiles table with cleaner method names - clarify profile naming allows YYYY or YYYY-MM suffix - add reference from unixfs.md to IPIP-499 for threshold methods - fix broken markdown links and formatting - clarify compliance requirements: MUST support unixfs-2025, MAY support legacy profiles --- src/ipips/ipip-0499.md | 37 ++++++++++++++++++++++++------------- src/unixfs.md | 2 ++ 2 files changed, 26 insertions(+), 13 deletions(-) diff --git a/src/ipips/ipip-0499.md b/src/ipips/ipip-0499.md index d53404780..142c4d608 100644 --- a/src/ipips/ipip-0499.md +++ b/src/ipips/ipip-0499.md @@ -70,7 +70,7 @@ We introduce a set of **named configuration profiles**, each specifying the comp ### UnixFS parameters -Here is the complete set of UnixFS parameters that affect the resulting string encoding of the CID: +The following UnixFS parameters were identified as factors that affect the resulting CID: 1. CID version, e.g. CIDv0 or CIDv1 1. Multibase encoding for the CID, e.g. 
`base32` @@ -80,7 +80,7 @@ Here is the complete set of UnixFS parameters that affect the resulting string e 1. UnixFS DAG layout, e.g. `balanced`, `trickle` 1. UnixFS DAG width (max number of links per `File` node) 1. `HAMTDirectory` fanout, i.e. the number of bits determines the fanout of the `HAMTDirectory` (default bitwidth is 8 == 256 leaves). -1. `HAMTDirectory` threshold: max `Directory` size before switching to `HAMTDirectory`. Size can be calculated using full serialized [PBNode](https://specs.ipfs.tech/unixfs/#dag-pb-node) size (recommended), or estimated by `PBNode.Links` size (name + CID), or link count (naive). +1. `HAMTDirectory` threshold: max `Directory` size before converting to `HAMTDirectory`, based on `PBNode.Links` count or estimated serialized [PBNode](https://ipld.io/specs/codecs/dag-pb/spec/) size. 1. Leaf Envelope: either `dag-pb` or `raw` 1. Whether empty directories are included in the DAG. Some implementations may apply filtering. 1. Whether hidden entities (including dot files) are included in the DAG. Some implementations may apply filtering. @@ -92,7 +92,7 @@ The [UnixFS spec](https://specs.ipfs.tech/unixfs/) defines Type=4 for symlinks w ## CID profiles -To enable consistent CID generation, we define a series of named profiles that specify complete UnixFS parameter sets. Profile names may have any prefix, but must end in `YYYY-MM`. +To enable consistent CID generation, we define a series of named profiles that specify complete UnixFS parameter sets. Profile names may have any prefix, but must end in `YYYY` or `YYYY-MM`. The initial profile in the series, **`unixfs-2025`**, captures the baseline default parameters used by multiple implementations as of November 2025. 
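
The naming rule above (any prefix, ending in `YYYY` or `YYYY-MM`) can be checked mechanically. The following is a minimal sketch rather than anything this IPIP defines: the function name, the requirement of a non-empty prefix joined by a hyphen, and the `01`-`12` month range are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: a non-empty prefix, a hyphen, then YYYY or YYYY-MM.
# The hyphen separator and the 01-12 month range are assumptions of this
# sketch; the text only says names "must end in YYYY or YYYY-MM".
PROFILE_NAME = re.compile(r"^.+-\d{4}(-(0[1-9]|1[0-2]))?$")


def is_valid_profile_name(name: str) -> bool:
    """Return True if the name ends in -YYYY or -YYYY-MM after some prefix."""
    return PROFILE_NAME.fullmatch(name) is not None
```

Under these assumptions, names such as `unixfs-2025` and `kubo-wide-2025` pass, while a name with no date suffix does not.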
@@ -105,11 +105,23 @@ The initial profile in the series, **`unixfs-2025`**, captures the baseline defa | DAG layout | balanced | | DAG width (children per node) | 1024 | | `HAMTDirectory` fanout | 256 blocks | -| `HAMTDirectory` threshold | TODO (likely entire block size, as in Helia) | +| `HAMTDirectory` threshold | 256KiB (block-bytes) | | Leaves | raw | -| Empty directories | TODO (kubo needs opt-out flag) | -| Hidden entities | TODO | -| Symlinks | TODO (preserved?) | +| Empty directories | included (opt-out) | +| Hidden entities | opt-in | +| Symlinks | preserved | + +### HAMTDirectory threshold + +This IPIP recognizes and documents the divergence across ecosystem: the decision when to switch to HAMT sharded directory can be based on child link count (naive), or one of several serialized size estimation methods to keep directory blocks under a byte limit. + +Methods: + +- **`links-count`**: `PBNode.Links` length (child count). Simple but ignores varying entry sizes. + +- **`links-bytes`**: sum of `PBNode.Links[].Name` and `PBNode.Links[].Hash` byte lengths. Underestimates actual size by ignoring UnixFS Data, Tsize, and protobuf overhead. + +- **`block-bytes`**: full serialized [dag-pb](https://ipld.io/specs/codecs/dag-pb/spec/) node size. Recommended: most accurate, accounts for varint `Tsize` and optional metadata such as `mode` or `mtime`. 
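
The three estimation methods above can be compared side by side on the same directory. The sketch below is illustrative only: the `Link` type and all function names are hypothetical, and `block_bytes` assumes that every link carries `Hash`, `Name`, and `Tsize`, that the node's `Data` field is always present, and that each protobuf field tag fits in one byte (true for the dag-pb field numbers 1 to 3). It computes sizes arithmetically instead of calling a real dag-pb encoder.

```python
from dataclasses import dataclass


@dataclass
class Link:
    name: str   # PBLink.Name
    cid: bytes  # PBLink.Hash (binary CID)
    tsize: int  # PBLink.Tsize


def varint_len(n: int) -> int:
    """Bytes needed to encode n as an unsigned protobuf varint."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length


def _len_delimited(payload_len: int) -> int:
    # One tag byte + length varint + payload (assumes field numbers < 16).
    return 1 + varint_len(payload_len) + payload_len


def links_count(links: list[Link]) -> int:
    """`links-count`: number of child links."""
    return len(links)


def links_bytes(links: list[Link]) -> int:
    """`links-bytes`: sum of Name and Hash byte lengths."""
    return sum(len(l.name.encode("utf-8")) + len(l.cid) for l in links)


def block_bytes(links: list[Link], data: bytes) -> int:
    """`block-bytes`: size of the serialized PBNode (assumptions in lead-in)."""
    total = 0
    for l in links:
        link_len = (
            _len_delimited(len(l.cid))                     # Hash, field 1
            + _len_delimited(len(l.name.encode("utf-8")))  # Name, field 2
            + 1 + varint_len(l.tsize)                      # Tsize, field 3
        )
        total += _len_delimited(link_len)                  # PBNode.Links entry
    total += _len_delimited(len(data))                     # PBNode.Data
    return total
```

For a single entry with a 5-byte name, a 36-byte CID, and a small `Tsize`, `links_bytes` comes out lower than `block_bytes`, which matches the note above that `links-bytes` underestimates the actual block size. How the result is compared against a profile's threshold (for example `256 * 1024`) is left out here, since the text does not state whether the boundary is inclusive.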
## Legacy profiles @@ -125,7 +137,7 @@ We also define a series of **legacy profiles**, used by various implementations | DAG layout | balanced | balanced | balanced | balanced | balanced | not specified | | DAG width (children per node) | 174 | 1024 | 1024 | 174 | **1024** | not specified | | `HAMTDirectory` fanout | 256 blocks | 256 blocks | 256 blocks | 256 blocks | **1024** | not specified | -| `HAMTDirectory` threshold | 256KiB (est:links[name+cid]) | 256KiB (est) | 1000 **links** | 256KiB (est:links[name+cid]) | **1MiB** (est:links[name+cid]) | not specified | +| `HAMTDirectory` threshold | 256KiB (links-bytes) | 256KiB (links-bytes) | 1000 (links-count) | 256KiB (links-bytes) | **1MiB** (links-bytes) | not specified | | Leaves | dag-pb | raw | raw | raw | raw | not specified | | Empty directories | included | included | excluded | included | included | not specified | | Hidden entities | opt-in | opt-in | opt-in | opt-in | opt-in | not specified | @@ -139,7 +151,7 @@ We also define a series of **legacy profiles**, used by various implementations - `preserved`: Symlinks stored as UnixFS Type=4 nodes with target path (per [UnixFS spec](https://specs.ipfs.tech/unixfs/)). Note: Kubo (v0.39) `--dereference-args` only follows symlinks passed as CLI arguments; symlinks found during recursive traversal are always preserved. - `followed`: Symlinks dereferenced and treated as target files/directories -See related discussion at https://discuss.ipfs.tech/t/should-we-profile-cids/18507/ +See related discussion at ### User benefit @@ -153,12 +165,11 @@ Profiles provide 3 key advantages for working with content-addressed data: ### Compatibility -UnixFS data encoded with the CID profiles defined in this IPIP remains fully compatible with existing implementations, since it conforms to the [https://specs.ipfs.tech/unixfs/](specification). 
+UnixFS data encoded with the CID profiles defined in this IPIP remains fully compatible with existing implementations, since it conforms to the [UnixFS specification](https://specs.ipfs.tech/unixfs/). -To generate CIDs in compliance with this IPIP, implementations must support the parameters defined in the profiles and support the set of named profiles. They MAY also support legacy profiles. +To generate CIDs in compliance with this IPIP, implementations MUST support the `unixfs-2025` profile. Legacy profiles are provided for historical context and MAY be supported for backward compatibility. -* Adding new functionality to support parameters and/or profiles -* Exposing configuration options for profiles +Implementations SHOULD allow users to inspect default values and adjust configuration options related to CID generation. ### Alternatives diff --git a/src/unixfs.md b/src/unixfs.md index 934414b3a..f036ead0e 100644 --- a/src/unixfs.md +++ b/src/unixfs.md @@ -501,6 +501,8 @@ node exceeds a size threshold between 256 KiB and 1 MiB. This threshold: See [Block Size Considerations](#block-size-considerations) for details on block size limits and conventions. +For standardized threshold estimation methods that enable deterministic CID generation, see [IPIP-499: UnixFS CID Profiles](../ipips/ipip-0499.md). + ### `dag-pb` `Symlink` A :dfn[Symlink] represents a POSIX [symbolic link](https://pubs.opengroup.org/onlinepubs/9699919799/functions/symlink.html). 
From e2f95dd086e7491fe2b0f8b311a8040de60aa888 Mon Sep 17 00:00:00 2001 From: Marcin Rataj Date: Wed, 14 Jan 2026 00:04:21 +0100 Subject: [PATCH 32/35] chore: bump spec-generator to 1.7.0 adds GFM table support for markdown --- package-lock.json | 259 +++++++++++++++++++++++++++++++++++++++++++++- package.json | 2 +- 2 files changed, 256 insertions(+), 5 deletions(-) diff --git a/package-lock.json b/package-lock.json index bdd27102b..86333d1e9 100644 --- a/package-lock.json +++ b/package-lock.json @@ -8,7 +8,7 @@ "name": "ipfs-specs-website", "version": "1.0.0", "dependencies": { - "spec-generator": "^1.6.1" + "spec-generator": "^1.7.0" } }, "node_modules/@11ty/dependency-tree": { @@ -5201,6 +5201,16 @@ "markdown-it": "bin/markdown-it.js" } }, + "node_modules/markdown-table": { + "version": "3.0.4", + "resolved": "https://registry.npmjs.org/markdown-table/-/markdown-table-3.0.4.tgz", + "integrity": "sha512-wiYz4+JrLyb/DqW2hkFJxP7Vd7JuTDm77fvbM8VfEQdmSMqcImWeeRbHwZjBjIFki/VaMK2BhFi7oUUZeM5bqw==", + "license": "MIT", + "funding": { + "type": "github", + "url": "https://github.com/sponsors/wooorm" + } + }, "node_modules/marked": { "version": "12.0.2", "resolved": "https://registry.npmjs.org/marked/-/marked-12.0.2.tgz", @@ -5354,6 +5364,107 @@ "url": "https://opencollective.com/unified" } }, + "node_modules/mdast-util-gfm": { + "version": "3.1.0", + "resolved": "https://registry.npmjs.org/mdast-util-gfm/-/mdast-util-gfm-3.1.0.tgz", + "integrity": "sha512-0ulfdQOM3ysHhCJ1p06l0b0VKlhU0wuQs3thxZQagjcjPrlFRqY215uZGHHJan9GEAXd9MbfPjFJz+qMkVR6zQ==", + "license": "MIT", + "dependencies": { + "mdast-util-from-markdown": "^2.0.0", + "mdast-util-gfm-autolink-literal": "^2.0.0", + "mdast-util-gfm-footnote": "^2.0.0", + "mdast-util-gfm-strikethrough": "^2.0.0", + "mdast-util-gfm-table": "^2.0.0", + "mdast-util-gfm-task-list-item": "^2.0.0", + "mdast-util-to-markdown": "^2.0.0" + }, + "funding": { + "type": "opencollective", + "url": "https://opencollective.com/unified" + } + }, + 
"node_modules/mdast-util-gfm-autolink-literal": { + "version": "2.0.1", + "resolved": "https://registry.npmjs.org/mdast-util-gfm-autolink-literal/-/mdast-util-gfm-autolink-literal-2.0.1.tgz", + "integrity": "sha512-5HVP2MKaP6L+G6YaxPNjuL0BPrq9orG3TsrZ9YXbA3vDw/ACI4MEsnoDpn6ZNm7GnZgtAcONJyPhOP8tNJQavQ==", + "license": "MIT", + "dependencies": { + "@types/mdast": "^4.0.0", + "ccount": "^2.0.0", + "devlop": "^1.0.0", + "mdast-util-find-and-replace": "^3.0.0", + "micromark-util-character": "^2.0.0" + }, + "funding": { + "type": "opencollective", + "url": "https://opencollective.com/unified" + } + }, + "node_modules/mdast-util-gfm-footnote": { + "version": "2.1.0", + "resolved": "https://registry.npmjs.org/mdast-util-gfm-footnote/-/mdast-util-gfm-footnote-2.1.0.tgz", + "integrity": "sha512-sqpDWlsHn7Ac9GNZQMeUzPQSMzR6Wv0WKRNvQRg0KqHh02fpTz69Qc1QSseNX29bhz1ROIyNyxExfawVKTm1GQ==", + "license": "MIT", + "dependencies": { + "@types/mdast": "^4.0.0", + "devlop": "^1.1.0", + "mdast-util-from-markdown": "^2.0.0", + "mdast-util-to-markdown": "^2.0.0", + "micromark-util-normalize-identifier": "^2.0.0" + }, + "funding": { + "type": "opencollective", + "url": "https://opencollective.com/unified" + } + }, + "node_modules/mdast-util-gfm-strikethrough": { + "version": "2.0.0", + "resolved": "https://registry.npmjs.org/mdast-util-gfm-strikethrough/-/mdast-util-gfm-strikethrough-2.0.0.tgz", + "integrity": "sha512-mKKb915TF+OC5ptj5bJ7WFRPdYtuHv0yTRxK2tJvi+BDqbkiG7h7u/9SI89nRAYcmap2xHQL9D+QG/6wSrTtXg==", + "license": "MIT", + "dependencies": { + "@types/mdast": "^4.0.0", + "mdast-util-from-markdown": "^2.0.0", + "mdast-util-to-markdown": "^2.0.0" + }, + "funding": { + "type": "opencollective", + "url": "https://opencollective.com/unified" + } + }, + "node_modules/mdast-util-gfm-table": { + "version": "2.0.0", + "resolved": "https://registry.npmjs.org/mdast-util-gfm-table/-/mdast-util-gfm-table-2.0.0.tgz", + "integrity": 
"sha512-78UEvebzz/rJIxLvE7ZtDd/vIQ0RHv+3Mh5DR96p7cS7HsBhYIICDBCu8csTNWNO6tBWfqXPWekRuj2FNOGOZg==", + "license": "MIT", + "dependencies": { + "@types/mdast": "^4.0.0", + "devlop": "^1.0.0", + "markdown-table": "^3.0.0", + "mdast-util-from-markdown": "^2.0.0", + "mdast-util-to-markdown": "^2.0.0" + }, + "funding": { + "type": "opencollective", + "url": "https://opencollective.com/unified" + } + }, + "node_modules/mdast-util-gfm-task-list-item": { + "version": "2.0.0", + "resolved": "https://registry.npmjs.org/mdast-util-gfm-task-list-item/-/mdast-util-gfm-task-list-item-2.0.0.tgz", + "integrity": "sha512-IrtvNvjxC1o06taBAVJznEnkiHxLFTzgonUdy8hzFVeDun0uTjxxrRGVaNFqkU1wJR3RBPEfsxmU6jDWPofrTQ==", + "license": "MIT", + "dependencies": { + "@types/mdast": "^4.0.0", + "devlop": "^1.0.0", + "mdast-util-from-markdown": "^2.0.0", + "mdast-util-to-markdown": "^2.0.0" + }, + "funding": { + "type": "opencollective", + "url": "https://opencollective.com/unified" + } + }, "node_modules/mdast-util-phrasing": { "version": "4.1.0", "resolved": "https://registry.npmjs.org/mdast-util-phrasing/-/mdast-util-phrasing-4.1.0.tgz", @@ -5565,6 +5676,127 @@ "url": "https://opencollective.com/unified" } }, + "node_modules/micromark-extension-gfm": { + "version": "3.0.0", + "resolved": "https://registry.npmjs.org/micromark-extension-gfm/-/micromark-extension-gfm-3.0.0.tgz", + "integrity": "sha512-vsKArQsicm7t0z2GugkCKtZehqUm31oeGBV/KVSorWSy8ZlNAv7ytjFhvaryUiCUJYqs+NoE6AFhpQvBTM6Q4w==", + "license": "MIT", + "dependencies": { + "micromark-extension-gfm-autolink-literal": "^2.0.0", + "micromark-extension-gfm-footnote": "^2.0.0", + "micromark-extension-gfm-strikethrough": "^2.0.0", + "micromark-extension-gfm-table": "^2.0.0", + "micromark-extension-gfm-tagfilter": "^2.0.0", + "micromark-extension-gfm-task-list-item": "^2.0.0", + "micromark-util-combine-extensions": "^2.0.0", + "micromark-util-types": "^2.0.0" + }, + "funding": { + "type": "opencollective", + "url": 
"https://opencollective.com/unified" + } + }, + "node_modules/micromark-extension-gfm-autolink-literal": { + "version": "2.1.0", + "resolved": "https://registry.npmjs.org/micromark-extension-gfm-autolink-literal/-/micromark-extension-gfm-autolink-literal-2.1.0.tgz", + "integrity": "sha512-oOg7knzhicgQ3t4QCjCWgTmfNhvQbDDnJeVu9v81r7NltNCVmhPy1fJRX27pISafdjL+SVc4d3l48Gb6pbRypw==", + "license": "MIT", + "dependencies": { + "micromark-util-character": "^2.0.0", + "micromark-util-sanitize-uri": "^2.0.0", + "micromark-util-symbol": "^2.0.0", + "micromark-util-types": "^2.0.0" + }, + "funding": { + "type": "opencollective", + "url": "https://opencollective.com/unified" + } + }, + "node_modules/micromark-extension-gfm-footnote": { + "version": "2.1.0", + "resolved": "https://registry.npmjs.org/micromark-extension-gfm-footnote/-/micromark-extension-gfm-footnote-2.1.0.tgz", + "integrity": "sha512-/yPhxI1ntnDNsiHtzLKYnE3vf9JZ6cAisqVDauhp4CEHxlb4uoOTxOCJ+9s51bIB8U1N1FJ1RXOKTIlD5B/gqw==", + "license": "MIT", + "dependencies": { + "devlop": "^1.0.0", + "micromark-core-commonmark": "^2.0.0", + "micromark-factory-space": "^2.0.0", + "micromark-util-character": "^2.0.0", + "micromark-util-normalize-identifier": "^2.0.0", + "micromark-util-sanitize-uri": "^2.0.0", + "micromark-util-symbol": "^2.0.0", + "micromark-util-types": "^2.0.0" + }, + "funding": { + "type": "opencollective", + "url": "https://opencollective.com/unified" + } + }, + "node_modules/micromark-extension-gfm-strikethrough": { + "version": "2.1.0", + "resolved": "https://registry.npmjs.org/micromark-extension-gfm-strikethrough/-/micromark-extension-gfm-strikethrough-2.1.0.tgz", + "integrity": "sha512-ADVjpOOkjz1hhkZLlBiYA9cR2Anf8F4HqZUO6e5eDcPQd0Txw5fxLzzxnEkSkfnD0wziSGiv7sYhk/ktvbf1uw==", + "license": "MIT", + "dependencies": { + "devlop": "^1.0.0", + "micromark-util-chunked": "^2.0.0", + "micromark-util-classify-character": "^2.0.0", + "micromark-util-resolve-all": "^2.0.0", + "micromark-util-symbol": "^2.0.0", + 
"micromark-util-types": "^2.0.0" + }, + "funding": { + "type": "opencollective", + "url": "https://opencollective.com/unified" + } + }, + "node_modules/micromark-extension-gfm-table": { + "version": "2.1.1", + "resolved": "https://registry.npmjs.org/micromark-extension-gfm-table/-/micromark-extension-gfm-table-2.1.1.tgz", + "integrity": "sha512-t2OU/dXXioARrC6yWfJ4hqB7rct14e8f7m0cbI5hUmDyyIlwv5vEtooptH8INkbLzOatzKuVbQmAYcbWoyz6Dg==", + "license": "MIT", + "dependencies": { + "devlop": "^1.0.0", + "micromark-factory-space": "^2.0.0", + "micromark-util-character": "^2.0.0", + "micromark-util-symbol": "^2.0.0", + "micromark-util-types": "^2.0.0" + }, + "funding": { + "type": "opencollective", + "url": "https://opencollective.com/unified" + } + }, + "node_modules/micromark-extension-gfm-tagfilter": { + "version": "2.0.0", + "resolved": "https://registry.npmjs.org/micromark-extension-gfm-tagfilter/-/micromark-extension-gfm-tagfilter-2.0.0.tgz", + "integrity": "sha512-xHlTOmuCSotIA8TW1mDIM6X2O1SiX5P9IuDtqGonFhEK0qgRI4yeC6vMxEV2dgyr2TiD+2PQ10o+cOhdVAcwfg==", + "license": "MIT", + "dependencies": { + "micromark-util-types": "^2.0.0" + }, + "funding": { + "type": "opencollective", + "url": "https://opencollective.com/unified" + } + }, + "node_modules/micromark-extension-gfm-task-list-item": { + "version": "2.1.0", + "resolved": "https://registry.npmjs.org/micromark-extension-gfm-task-list-item/-/micromark-extension-gfm-task-list-item-2.1.0.tgz", + "integrity": "sha512-qIBZhqxqI6fjLDYFTBIa4eivDMnP+OZqsNwmQ3xNLE4Cxwc+zfQEfbs6tzAo2Hjq+bh6q5F+Z8/cksrLFYWQQw==", + "license": "MIT", + "dependencies": { + "devlop": "^1.0.0", + "micromark-factory-space": "^2.0.0", + "micromark-util-character": "^2.0.0", + "micromark-util-symbol": "^2.0.0", + "micromark-util-types": "^2.0.0" + }, + "funding": { + "type": "opencollective", + "url": "https://opencollective.com/unified" + } + }, "node_modules/micromark-factory-destination": { "version": "2.0.1", "resolved": 
"https://registry.npmjs.org/micromark-factory-destination/-/micromark-factory-destination-2.0.1.tgz", @@ -8658,6 +8890,24 @@ "url": "https://opencollective.com/unified" } }, + "node_modules/remark-gfm": { + "version": "4.0.1", + "resolved": "https://registry.npmjs.org/remark-gfm/-/remark-gfm-4.0.1.tgz", + "integrity": "sha512-1quofZ2RQ9EWdeN34S79+KExV1764+wCUGop5CPL1WGdD0ocPpu91lzPGbwWMECpEpd42kJGQwzRfyov9j4yNg==", + "license": "MIT", + "dependencies": { + "@types/mdast": "^4.0.0", + "mdast-util-gfm": "^3.0.0", + "micromark-extension-gfm": "^3.0.0", + "remark-parse": "^11.0.0", + "remark-stringify": "^11.0.0", + "unified": "^11.0.0" + }, + "funding": { + "type": "opencollective", + "url": "https://opencollective.com/unified" + } + }, "node_modules/remark-heading-id": { "version": "1.0.1", "resolved": "https://registry.npmjs.org/remark-heading-id/-/remark-heading-id-1.0.1.tgz", @@ -9320,9 +9570,9 @@ "integrity": "sha512-zC8zGoGkmc8J9ndvml8Xksr1Amk9qBujgbF0JAIWO7kXr43w0h/0GJNM/Vustixu+YE8N/MTrQ7N31FvHUACxQ==" }, "node_modules/spec-generator": { - "version": "1.6.1", - "resolved": "https://registry.npmjs.org/spec-generator/-/spec-generator-1.6.1.tgz", - "integrity": "sha512-yDzubb+cWKPlg82SQSaFeHjHVbKu58tlcvbnAy8yFtxnikUL2c06GViBw7yAOZPYjTS/meZ7vQp61IJ0myG0XQ==", + "version": "1.7.0", + "resolved": "https://registry.npmjs.org/spec-generator/-/spec-generator-1.7.0.tgz", + "integrity": "sha512-U5itp3X8mU84chN0xmwgEtAaq/VL4gbC4AK5EdqJxGydhUhz+OnNUcRTKP9v3k9wYVaXbNIeoOzVBnzrNDV1XQ==", "license": "MIT", "dependencies": { "@11ty/eleventy": "^2.0.1", @@ -9342,6 +9592,7 @@ "pluralize": "^8.0.0", "remark": "^15.0.1", "remark-directive": "^3.0.0", + "remark-gfm": "^4.0.1", "remark-heading-id": "^1.0.1", "remark-html": "^16.0.1", "remark-squeeze-paragraphs": "^6.0.0", diff --git a/package.json b/package.json index 53ed95956..2a9f851c8 100644 --- a/package.json +++ b/package.json @@ -10,6 +10,6 @@ "license": "", "private": true, "dependencies": { - "spec-generator": "^1.6.1" + 
"spec-generator": "^1.7.0" } } From b832bcc86d9ccc9f5b06ede5d8eb6952a4798eee Mon Sep 17 00:00:00 2001 From: Marcin Rataj Date: Wed, 14 Jan 2026 01:42:43 +0100 Subject: [PATCH 33/35] feat(ipip-0499): restructure document and rename profiles - rename `unixfs-2025` to `unixfs-v1-2025` for clarity on CID version - add `unixfs-v0-2015` legacy profile for backward compatibility with kubo - move divergence analysis from separate section into Motivation - clarify motivation: CIDs are verifiable, problem is DAG construction variance - consolidate three problems into two: broken hash semantics, verification overhead - add Mode and Mtime parameters to all profile tables - improve UnixFS parameters section with inline links and explanations - add balanced vs trickle DAG layout descriptions - update unixfs.md to reference IPIP-499 in a note block with full URL --- src/ipips/ipip-0499.md | 169 ++++++++++++++++++++++------------------- src/unixfs.md | 4 +- 2 files changed, 95 insertions(+), 78 deletions(-) diff --git a/src/ipips/ipip-0499.md b/src/ipips/ipip-0499.md index 142c4d608..e28efe5b7 100644 --- a/src/ipips/ipip-0499.md +++ b/src/ipips/ipip-0499.md @@ -1,6 +1,6 @@ --- title: 'IPIP-0499: UnixFS CID Profiles' -date: 2025-12-13 +date: 2026-01-14 ipip: proposal editors: - name: Michelle Lee @@ -54,96 +54,67 @@ This proposal introduces **configuration profiles** for CIDs that represent file ## Motivation -UnixFS CIDs are currently non-deterministic. The same file or directory can produce different CIDs across implementations, because parameters like chunk size, DAG width, and layout vary between implementations. Often, these parameters are not even configurable by users. +While CIDs and UnixFS DAGs are cryptographically verifiable, the same file or directory can produce different CIDs across UnixFS implementations, because DAG construction parameters like chunk size, DAG width, and layout vary between tools. Often, these parameters are not even configurable by users. 
-This creates three problems: +This creates two problems: -- **Verification difficulty:** The same content produces different CIDs across tools, making content verification unreliable. -- **Additional overhead:** Users must store and transfer UnixFS merkle proofs to verify CIDs, adding storage overhead, network bandwidth, and complexity. -- **Broken expectations:** Unlike standard hash functions where identical input produces identical output, UnixFS CIDs behave unpredictably. +- **Broken hash semantics:** Unlike standard hash functions where identical input produces identical output, UnixFS CIDs depend on DAG construction parameters. Simple CID comparison leads to false-negatives. +- **Verification overhead:** Without knowing the original parameters, users must retrieve and compare entire DAGs to verify content, adding storage, bandwidth, and complexity. -Configuration profiles solve this by explicitly defining all parameters that affect CID generation. This preserves UnixFS flexibility (users can still choose parameters) while enabling deterministic results. +A potential solution is to define configuration profiles: well-known parameter presets that implementations can adopt when common conventions for DAG creation are desired. -## Detailed design - -We introduce a set of **named configuration profiles**, each specifying the complete set of parameters for generating UnixFS CIDs. When implementations use these profiles, they guarantee that the same input, processed with the same profile, will yield the same CID across different tools and implementations. +See related discussion at ### UnixFS parameters -The following UnixFS parameters were identified as factors that affect the resulting CID: +The following [UnixFS](https://specs.ipfs.tech/unixfs/) parameters were identified as factors that affect the resulting CID: 1. CID version, e.g. CIDv0 or CIDv1 1. Multibase encoding for the CID, e.g. `base32` 1. Hash function used for all nodes in the DAG, e.g. 
`sha2-256` -1. UnixFS file chunking algorithm -1. UnixFS file chunk size or target (if required by the chunking algorithm) -1. UnixFS DAG layout, e.g. `balanced`, `trickle` +1. UnixFS file chunking algorithm and chunk size (e.g., fixed-size chunks of 256KiB) +1. UnixFS DAG layout: + - `balanced`: builds a balanced tree where all leaf nodes are at the same depth. Optimized for random access, seeking, and range requests within files (e.g., video). + - `trickle`: builds a tree optimized for streaming, where data can be consumed before the entire file is available. Useful for logs and other append-only data structures where random access is not important. 1. UnixFS DAG width (max number of links per `File` node) -1. `HAMTDirectory` fanout, i.e. the number of bits determines the fanout of the `HAMTDirectory` (default bitwidth is 8 == 256 leaves). -1. `HAMTDirectory` threshold: max `Directory` size before converting to `HAMTDirectory`, based on `PBNode.Links` count or estimated serialized [PBNode](https://ipld.io/specs/codecs/dag-pb/spec/) size. -1. Leaf Envelope: either `dag-pb` or `raw` +1. [HAMTDirectory](https://specs.ipfs.tech/unixfs/#dag-pb-hamtdirectory) fanout: the branching factor at each level of the HAMT tree (e.g., 256 leaves). +1. [HAMTDirectory threshold](https://specs.ipfs.tech/unixfs/#when-to-use-hamt-sharding): max `Directory` size before converting to `HAMTDirectory`, based on `PBNode.Links` count or estimated serialized [dag-pb](https://ipld.io/specs/codecs/dag-pb/spec/) size: + - `links-count`: `PBNode.Links` length (child count). Simple but ignores varying entry sizes. + - `links-bytes`: sum of `PBNode.Links[].Name` and `PBNode.Links[].Hash` byte lengths. Underestimates actual size by ignoring UnixFS Data, Tsize, and protobuf overhead. + - `block-bytes`: full serialized dag-pb node size. Most accurate, accounts for varint `Tsize` and optional metadata such as `mode` or `mtime`. +1. 
Leaves: either [dag-pb wrapped](https://specs.ipfs.tech/unixfs/#dag-pb-node) or [raw](https://specs.ipfs.tech/unixfs/#raw-node) 1. Whether empty directories are included in the DAG. Some implementations may apply filtering. 1. Whether hidden entities (including dot files) are included in the DAG. Some implementations may apply filtering. 1. Directory wrapping for single files: in order to retain the name of a single file, some implementations have the option to wrap the file in a `Directory` with link to the file. -1. Presence and accurate setting of `Tsize`. -1. Symlink handling: preserved as UnixFS Type=4 nodes, or followed (dereferenced to target). - -The [UnixFS spec](https://specs.ipfs.tech/unixfs/) defines Type=4 for symlinks with target path stored in the Data field. - -## CID profiles - -To enable consistent CID generation, we define a series of named profiles that specify complete UnixFS parameter sets. Profile names may have any prefix, but must end in `YYYY` or `YYYY-MM`. - -The initial profile in the series, **`unixfs-2025`**, captures the baseline default parameters used by multiple implementations as of November 2025. - -| Parameter | `unixfs-2025` | -| ----------------------------- | ------------------------------------------------------- | -| CID version | CIDv1 | -| Hash function | sha2-256 | -| Chunking algorithm | fixed-size | -| Max chunk size | 1MiB | -| DAG layout | balanced | -| DAG width (children per node) | 1024 | -| `HAMTDirectory` fanout | 256 blocks | -| `HAMTDirectory` threshold | 256KiB (block-bytes) | -| Leaves | raw | -| Empty directories | included (opt-out) | -| Hidden entities | opt-in | -| Symlinks | preserved | - -### HAMTDirectory threshold - -This IPIP recognizes and documents the divergence across ecosystem: the decision when to switch to HAMT sharded directory can be based on child link count (naive), or one of several serialized size estimation methods to keep directory blocks under a byte limit. 
- -Methods: - -- **`links-count`**: `PBNode.Links` length (child count). Simple but ignores varying entry sizes. - -- **`links-bytes`**: sum of `PBNode.Links[].Name` and `PBNode.Links[].Hash` byte lengths. Underestimates actual size by ignoring UnixFS Data, Tsize, and protobuf overhead. - -- **`block-bytes`**: full serialized [dag-pb](https://ipld.io/specs/codecs/dag-pb/spec/) node size. Recommended: most accurate, accounts for varint `Tsize` and optional metadata such as `mode` or `mtime`. - -## Legacy profiles - -We also define a series of **legacy profiles**, used by various implementations as of November 2025: - -| | `kubo-legacy-2025` (v0.39) | `helia-2025` | `storacha-2025` | `kubo-2025` | `kubo-wide-2025` | `dasl-2025` | -| ----------------------------- | ------------------------------ | --------------- | ------------------ | ------------------ | ----------------------- | ------------- | -| Based on | kubo v0.39 (`legacy-cid-v0`) | @helia/unixfs 6.0.4 | w3cli 7.12.0 | kubo v0.39 (`test-cid-v1`) | kubo v0.39 (`test-cid-v1-wide`) | 2025-12 | -| CID version | CIDv0 | CIDv1 | CIDv1 | CIDv1 | CIDv1 | CIDv1 | -| Hash function | sha2-256 | sha2-256 | sha2-256 | sha2-256 | sha2-256 | sha2-256 | -| Chunking algorithm | fixed-size | fixed-size | fixed-size | fixed-size | fixed-size | not specified | -| Max chunk size | 256KiB | 1MiB | 1MiB | 1MiB | 1MiB | not specified | -| DAG layout | balanced | balanced | balanced | balanced | balanced | not specified | -| DAG width (children per node) | 174 | 1024 | 1024 | 174 | **1024** | not specified | -| `HAMTDirectory` fanout | 256 blocks | 256 blocks | 256 blocks | 256 blocks | **1024** | not specified | -| `HAMTDirectory` threshold | 256KiB (links-bytes) | 256KiB (links-bytes) | 1000 (links-count) | 256KiB (links-bytes) | **1MiB** (links-bytes) | not specified | -| Leaves | dag-pb | raw | raw | raw | raw | not specified | -| Empty directories | included | included | excluded | included | included | not specified | -| Hidden 
entities | opt-in | opt-in | opt-in | opt-in | opt-in | not specified | -| Symlinks | preserved | followed | followed | preserved | preserved | not specified | +1. Presence and accurate setting of `Tsize` (correct UnixFS has `Tsize` of child sub-DAGs). +1. [Symlink](https://specs.ipfs.tech/unixfs/#dag-pb-symlink) handling: preserved as UnixFS Type=4 nodes, or followed (dereferenced to target). +1. [Mode](https://specs.ipfs.tech/unixfs/#mode-field): optional POSIX file permissions. +1. [Mtime](https://specs.ipfs.tech/unixfs/#mtime-field): optional modification timestamp. + +### Divergence in current implementations + +We analyzed the default settings across the most popular UnixFS implementations in the ecosystem. The table below documents the divergence that prevents deterministic CID generation today: + +| Parameter | kubo (CIDv0) | helia | storacha | kubo (CIDv1) | dasl | +| ----------------------------- | ------------------------ | -------------------- | ------------------ | ----------------------------- | ------------ | +| Based on | v0.39 (`unixfs-v0-2015`) | @helia/unixfs 6.0.4 | w3cli 7.12.0 | v0.39 (`test-cid-v1` profile) | spec 2025-12 | +| CID version | CIDv0 | CIDv1 | CIDv1 | CIDv1 | CIDv1 | +| Hash function | sha2-256 | sha2-256 | sha2-256 | sha2-256 | sha2-256 | +| Chunking algorithm | fixed-size | fixed-size | fixed-size | fixed-size | N/A | +| Max chunk size | 256KiB | 1MiB | 1MiB | 1MiB | N/A | +| DAG layout | balanced | balanced | balanced | balanced | N/A | +| DAG width (children per node) | 174 | 1024 | 1024 | 174 | N/A | +| HAMTDirectory fanout | 256 blocks | 256 blocks | 256 blocks | 256 blocks | N/A | +| HAMTDirectory threshold | 256KiB (links-bytes) | 256KiB (links-bytes) | 1000 (links-count) | 256KiB (links-bytes) | N/A | +| Leaves | dag-pb | raw | raw | raw | N/A | +| Empty directories | included | included | excluded | included | N/A | +| Hidden entities | excluded (opt-in) | excluded (opt-in) | excluded (opt-in) | excluded (opt-in) | N/A 
| +| Symlinks | preserved | followed | followed | preserved | N/A | +| Mode (permissions) | excluded (opt-in) | excluded (opt-in) | not supported | excluded (opt-in) | N/A | +| Mtime (modification time) | excluded (opt-in) | excluded (opt-in) | not supported | excluded (opt-in) | N/A | **Terminology:** + - `included`: Always included in the DAG (no option to exclude) - `excluded`: Always excluded from the DAG (no option to include) - `opt-in`: Excluded by default; implementations provide a flag to include (e.g., `--hidden` in Kubo/Storacha, `hidden: true` in Helia) @@ -151,13 +122,57 @@ We also define a series of **legacy profiles**, used by various implementations - `preserved`: Symlinks stored as UnixFS Type=4 nodes with target path (per [UnixFS spec](https://specs.ipfs.tech/unixfs/)). Note: Kubo (v0.39) `--dereference-args` only follows symlinks passed as CLI arguments; symlinks found during recursive traversal are always preserved. - `followed`: Symlinks dereferenced and treated as target files/directories -See related discussion at +## Detailed design + +We introduce a set of **named configuration profiles**, each specifying the complete set of parameters for generating UnixFS CIDs. When implementations use these profiles, they guarantee that the same input, processed with the same profile, will yield the same CID across different tools and implementations. + +### The `unixfs-v1-2025` modern profile + +Based on the research above, we define **`unixfs-v1-2025`** as an opinionated profile for implementations that want to adopt deterministic CID generation for UnixFS DAGs with CIDv1. 
+ +| Parameter | `unixfs-v1-2025` | +| ----------------------------- | -------------------- | +| CID version | CIDv1 | +| Hash function | sha2-256 | +| Chunking algorithm | fixed-size | +| Max chunk size | 1MiB | +| DAG layout | balanced | +| DAG width (children per node) | 1024 | +| HAMTDirectory fanout | 256 blocks | +| HAMTDirectory threshold | 256KiB (block-bytes) | +| Leaves | raw | +| Empty directories | included (opt-out) | +| Hidden entities | excluded (opt-in) | +| Symlinks | preserved | +| Mode (permissions) | excluded (opt-in) | +| Mtime (modification time) | excluded (opt-in) | + +### The `unixfs-v0-2015` legacy profile + +This profile documents the default UnixFS DAG construction parameters used by Kubo through version 0.39 when producing CIDv0. It is provided for users who depend on CIDv0 identifiers generated by Kubo and need to reproduce them with other implementations, or verify content against existing CIDv0 references. The year 2015 in the name indicates that the majority of these parameters were picked a decade ago, when the initial go-ipfs alpha software was implemented, and these defaults were never contested since then. + +| Parameter | `unixfs-v0-2015` | +| ----------------------------- | -------------------- | +| CID version | CIDv0 | +| Hash function | sha2-256 | +| Chunking algorithm | fixed-size | +| Max chunk size | 256KiB | +| DAG layout | balanced | +| DAG width (children per node) | 174 | +| HAMTDirectory fanout | 256 blocks | +| HAMTDirectory threshold | 256KiB (links-bytes) | +| Leaves | dag-pb | +| Empty directories | included | +| Hidden entities | excluded (opt-in) | +| Symlinks | preserved | +| Mode (permissions) | excluded (opt-in) | +| Mtime (modification time) | excluded (opt-in) | ### User benefit Profiles provide 3 key advantages for working with content-addressed data: -1. 
**Predictable, deterministic behavior:** Profiles restore the expected property of content addressing: identical input data always produces identical CIDs, regardless of which implementation generates them. +1. **Predictable, deterministic behavior:** Profiles restore intuitive hash-like behavior: identical input data always produces identical CIDs, regardless of which implementation generates them. 2. **Lightweight verification:** Users can verify content without needing to rely on additional merkle proofs or CAR files. @@ -167,7 +182,7 @@ Profiles provide 3 key advantages for working with content-addressed data: UnixFS data encoded with the CID profiles defined in this IPIP remains fully compatible with existing implementations, since it conforms to the [UnixFS specification](https://specs.ipfs.tech/unixfs/). -To generate CIDs in compliance with this IPIP, implementations MUST support the `unixfs-2025` profile. Legacy profiles are provided for historical context and MAY be supported for backward compatibility. +To generate CIDs in compliance with this IPIP, implementations MUST support the `unixfs-v1-2025` profile. The `unixfs-v0-2015` profile is provided for backward compatibility and MAY be supported by implementations that need to produce CIDs matching historical Kubo output. Implementations SHOULD allow users to inspect default values and adjust configuration options related to CID generation. diff --git a/src/unixfs.md b/src/unixfs.md index f036ead0e..856640671 100644 --- a/src/unixfs.md +++ b/src/unixfs.md @@ -501,7 +501,9 @@ node exceeds a size threshold between 256 KiB and 1 MiB. This threshold: See [Block Size Considerations](#block-size-considerations) for details on block size limits and conventions. -For standardized threshold estimation methods that enable deterministic CID generation, see [IPIP-499: UnixFS CID Profiles](../ipips/ipip-0499.md). 
+:::note +For standardized threshold estimation methods that enable deterministic CID generation, see [IPIP-499: UnixFS CID Profiles](https://specs.ipfs.tech/ipips/ipip-0499/). +::: ### `dag-pb` `Symlink` From d7e81d7fb25e3721da2c145042d572c9c1d08d28 Mon Sep 17 00:00:00 2001 From: Marcin Rataj Date: Wed, 14 Jan 2026 02:08:17 +0100 Subject: [PATCH 34/35] feat(ipip-0499): document efficiency benefits of modern profile parameters add user benefit explaining why 1 MiB chunks with 1024 links per node results in shallower DAG trees, fewer total nodes, faster seeking, and reduced DHT announcement overhead compared to legacy 256 KiB/174 params --- src/ipips/ipip-0499.md | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/src/ipips/ipip-0499.md b/src/ipips/ipip-0499.md index e28efe5b7..277ccdefc 100644 --- a/src/ipips/ipip-0499.md +++ b/src/ipips/ipip-0499.md @@ -170,7 +170,7 @@ This profile documents the default UnixFS DAG construction parameters used by Ku ### User benefit -Profiles provide 3 key advantages for working with content-addressed data: +Profiles provide key advantages for working with content-addressed data: 1. **Predictable, deterministic behavior:** Profiles restore intuitive hash-like behavior: identical input data always produces identical CIDs, regardless of which implementation generates them. @@ -178,6 +178,12 @@ Profiles provide 3 key advantages for working with content-addressed data: 3. **Simplified workflow:** Users can select a profile and automatically get consistent CIDs across all implementations, without needing to configure or understand the underlying parameters. +4. **Improved efficiency:** The `unixfs-v1-2025` profile uses 1 MiB chunks with 1024 links per node, compared to the legacy 256 KiB chunks with 174 links. 
This results in: + - Shallower DAG trees (3 levels for a 1 TiB file vs 4 levels with legacy parameters) + - Approximately 4x fewer total nodes for the same content + - Faster random access and seeking in large files (fewer round-trips to traverse the tree) + - Fewer CIDs to announce, reducing stress on public good routing infrastructure such as the Amino DHT + ### Compatibility UnixFS data encoded with the CID profiles defined in this IPIP remains fully compatible with existing implementations, since it conforms to the [UnixFS specification](https://specs.ipfs.tech/unixfs/). From 37132f1f7027debf88e67d227d3d6417745b6288 Mon Sep 17 00:00:00 2001 From: Marcin Rataj Date: Fri, 16 Jan 2026 22:01:31 +0100 Subject: [PATCH 35/35] feat(ipip-0499): add singularity to divergence table include singularity as example showing balanced layout has implementation variants that affect CID determinism for large files: - document balanced-packed DAG layout variant (https://github.com/data-preservation-programs/singularity/issues/525) - note boxo defaults for HAMT parameters - note rclone defaults for hidden files and symlinks --- src/ipips/ipip-0499.md | 77 ++++++++++++++++++++++++++++++++---------- 1 file changed, 59 insertions(+), 18 deletions(-) diff --git a/src/ipips/ipip-0499.md b/src/ipips/ipip-0499.md index 277ccdefc..77a0c6215 100644 --- a/src/ipips/ipip-0499.md +++ b/src/ipips/ipip-0499.md @@ -75,7 +75,8 @@ The following [UnixFS](https://specs.ipfs.tech/unixfs/) parameters were identifi 1. UnixFS file chunking algorithm and chunk size (e.g., fixed-size chunks of 256KiB) 1. UnixFS DAG layout: - `balanced`: builds a balanced tree where all leaf nodes are at the same depth. Optimized for random access, seeking, and range requests within files (e.g., video). - - `trickle`: builds a tree optimized for streaming, where data can be consumed before the entire file is available. Useful for logs and other append-only data structures where random access is not important. 
+ - `balanced-packed`: variant of `balanced` that may produce different tree structure for large files. See [Balanced DAG layout variants](#balanced-dag-layout-variants) below. + - `trickle`: builds a tree optimized for on-the-fly one-time streaming, where data can be consumed before the entire file is available. Useful for logs and other append-only data structures where random access is not important. 1. UnixFS DAG width (max number of links per `File` node) 1. [HAMTDirectory](https://specs.ipfs.tech/unixfs/#dag-pb-hamtdirectory) fanout: the branching factor at each level of the HAMT tree (e.g., 256 leaves). 1. [HAMTDirectory threshold](https://specs.ipfs.tech/unixfs/#when-to-use-hamt-sharding): max `Directory` size before converting to `HAMTDirectory`, based on `PBNode.Links` count or estimated serialized [dag-pb](https://ipld.io/specs/codecs/dag-pb/spec/) size: @@ -91,27 +92,57 @@ The following [UnixFS](https://specs.ipfs.tech/unixfs/) parameters were identifi 1. [Mode](https://specs.ipfs.tech/unixfs/#mode-field): optional POSIX file permissions. 1. [Mtime](https://specs.ipfs.tech/unixfs/#mtime-field): optional modification timestamp. +#### Balanced DAG layout variants + +The `balanced` DAG layout has implementation variants that affect CID determinism for large files. CID mismatches have been [observed](https://discuss.ipfs.tech/t/should-we-profile-cids/18507/41) and [investigated](https://discuss.ipfs.tech/t/should-we-profile-cids/18507/44) when comparing [kubo][] and [Singularity][singularity] outputs for files exceeding 1 GiB. This IPIP introduces the name `balanced-packed` to distinguish Singularity's variant from the original `balanced` layout. + +Implementations adopting a profile SHOULD specify which balanced variant they use. The `unixfs-v1-2025` profile uses `balanced` for maximum compatibility with existing implementations. + +##### `balanced` + +The original balanced layout used by [kubo][]/[boxo][], [helia][], and others in the ecosystem. 
Builds the tree incrementally as chunks stream in: +- Starts with first chunk as root, grows tree upward as needed +- Uses explicit depth tracking to fill nodes recursively +- All leaf nodes end up at the **same depth** from the root +- Reference: [`boxo/ipld/unixfs/importer/balanced/builder.go`](https://github.com/ipfs/boxo/blob/v0.35.2/ipld/unixfs/importer/balanced/builder.go) + +##### `balanced-packed` + +Name introduced by this IPIP for [Singularity][singularity]'s variant. Groups pre-computed links in batch: +- Takes all chunk links as input, then packs them into parent nodes (up to max width) +- Repeats packing level-by-level until single root remains +- Trailing nodes may have fewer children, causing leaf depth to vary +- Optimized for batch processing of pre-chunked data in CAR files +- Reference: [`singularity/pack/packutil/util.go`](https://github.com/data-preservation-programs/singularity/blob/v0.6.0-RC4/pack/packutil/util.go) `AssembleFileFromLinks()` + +##### Observed differences + +According to [Singularity issue #525](https://github.com/data-preservation-programs/singularity/issues/525): +> "In Singularity's DAG, the last leaf node is not at the same distance from the root as the others." + +This structural difference causes CID mismatches for files larger than `chunk_size * dag_width` (e.g., >1 GiB with 1 MiB chunks and 1024 links per node), even when all other parameters match. + ### Divergence in current implementations We analyzed the default settings across the most popular UnixFS implementations in the ecosystem. 
The table below documents the divergence that prevents deterministic CID generation today: -| Parameter | kubo (CIDv0) | helia | storacha | kubo (CIDv1) | dasl | -| ----------------------------- | ------------------------ | -------------------- | ------------------ | ----------------------------- | ------------ | -| Based on | v0.39 (`unixfs-v0-2015`) | @helia/unixfs 6.0.4 | w3cli 7.12.0 | v0.39 (`test-cid-v1` profile) | spec 2025-12 | -| CID version | CIDv0 | CIDv1 | CIDv1 | CIDv1 | CIDv1 | -| Hash function | sha2-256 | sha2-256 | sha2-256 | sha2-256 | sha2-256 | -| Chunking algorithm | fixed-size | fixed-size | fixed-size | fixed-size | N/A | -| Max chunk size | 256KiB | 1MiB | 1MiB | 1MiB | N/A | -| DAG layout | balanced | balanced | balanced | balanced | N/A | -| DAG width (children per node) | 174 | 1024 | 1024 | 174 | N/A | -| HAMTDirectory fanout | 256 blocks | 256 blocks | 256 blocks | 256 blocks | N/A | -| HAMTDirectory threshold | 256KiB (links-bytes) | 256KiB (links-bytes) | 1000 (links-count) | 256KiB (links-bytes) | N/A | -| Leaves | dag-pb | raw | raw | raw | N/A | -| Empty directories | included | included | excluded | included | N/A | -| Hidden entities | excluded (opt-in) | excluded (opt-in) | excluded (opt-in) | excluded (opt-in) | N/A | -| Symlinks | preserved | followed | followed | preserved | N/A | -| Mode (permissions) | excluded (opt-in) | excluded (opt-in) | not supported | excluded (opt-in) | N/A | -| Mtime (modification time) | excluded (opt-in) | excluded (opt-in) | not supported | excluded (opt-in) | N/A | +| Parameter | [kubo][] (CIDv0) | [helia][] | [storacha][] | [kubo][] (CIDv1) | [singularity][] | [dasl][] | +| ----------------------------- | ------------------------ | -------------------- | ------------------ | ----------------------------- | ----------------------------------- | ------------ | +| Based on | v0.39 (`unixfs-v0-2015`) | @helia/unixfs 6.0.4 | w3cli 7.12.0 | v0.39 (`test-cid-v1` profile) | v0.6.0-RC4 (454b630) | spec 
2025-12 | +| CID version | CIDv0 | CIDv1 | CIDv1 | CIDv1 | CIDv1 | CIDv1 | +| Hash function | sha2-256 | sha2-256 | sha2-256 | sha2-256 | sha2-256 | sha2-256 | +| Chunking algorithm | fixed-size | fixed-size | fixed-size | fixed-size | fixed-size | N/A | +| Max chunk size | 256KiB | 1MiB | 1MiB | 1MiB | 1MiB | N/A | +| DAG layout | balanced | balanced | balanced | balanced | [balanced-packed](#balanced-packed) | N/A | +| DAG width (children per node) | 174 | 1024 | 1024 | 174 | 1024 | N/A | +| HAMTDirectory fanout | 256 blocks | 256 blocks | 256 blocks | 256 blocks | 256 blocks (boxo) | N/A | +| HAMTDirectory threshold | 256KiB (links-bytes) | 256KiB (links-bytes) | 1000 (links-count) | 256KiB (links-bytes) | 256KiB (links-bytes) (boxo) | N/A | +| Leaves | dag-pb | raw | raw | raw | raw | N/A | +| Empty directories | included | included | excluded | included | included | N/A | +| Hidden entities | excluded (opt-in) | excluded (opt-in) | excluded (opt-in) | excluded (opt-in) | included (rclone) | N/A | +| Symlinks | preserved | followed | followed | preserved | skipped (rclone) | N/A | +| Mode (permissions) | excluded (opt-in) | excluded (opt-in) | not supported | excluded (opt-in) | not supported | N/A | +| Mtime (modification time) | excluded (opt-in) | excluded (opt-in) | not supported | excluded (opt-in) | not supported | N/A | **Terminology:** @@ -121,6 +152,9 @@ We analyzed the default settings across the most popular UnixFS implementations - `opt-out`: Included by default; implementations provide a flag to exclude - `preserved`: Symlinks stored as UnixFS Type=4 nodes with target path (per [UnixFS spec](https://specs.ipfs.tech/unixfs/)). Note: Kubo (v0.39) `--dereference-args` only follows symlinks passed as CLI arguments; symlinks found during recursive traversal are always preserved. 
- `followed`: Symlinks dereferenced and treated as target files/directories +- `skipped`: Symlinks ignored during traversal (not included in DAG) +- `(rclone)`: Singularity delegates file traversal to [rclone](https://rclone.org/); values shown reflect rclone defaults +- `(boxo)`: Singularity overrides some [boxo][] defaults but relies on implicit boxo defaults for these values ## Detailed design @@ -205,6 +239,13 @@ specification compliance. This section can be skipped if IPIP does not deal with the way IPFS handles content-addressed data, or the modified specification file already includes this information. +[kubo]: https://github.com/ipfs/kubo +[boxo]: https://github.com/ipfs/boxo +[helia]: https://github.com/ipfs/helia +[storacha]: https://github.com/storacha/w3cli +[singularity]: https://github.com/data-preservation-programs/singularity +[dasl]: https://dasl.ing + ### Copyright Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/).
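
A rough check of the efficiency figures quoted in the user benefit list (3 levels vs 4 levels for a 1 TiB file, and roughly 4x fewer nodes) can be sketched as follows. This is only an estimate under a simplified model, assuming fixed-size chunks and that every parent node is filled to the maximum DAG width before a new one is started; the function name `dag_shape` and the constants are illustrative and are not taken from kubo, boxo, or helia.

```python
def dag_shape(file_size: int, chunk_size: int, max_links: int) -> tuple[int, int]:
    """Return (levels, total_nodes) for a file DAG under a simple packing model.

    Assumption: fixed-size chunking, and parents are filled up to max_links
    before a new parent is created. Real importers may differ in details.
    """
    # Number of leaf chunks, rounded up (at least one leaf, even for an empty file).
    nodes_at_level = max(1, -(-file_size // chunk_size))
    levels = 1
    total = nodes_at_level
    # Group nodes under parents, level by level, until a single root remains.
    while nodes_at_level > 1:
        nodes_at_level = -(-nodes_at_level // max_links)
        levels += 1
        total += nodes_at_level
    return levels, total


KIB, MIB, TIB = 1 << 10, 1 << 20, 1 << 40

# Parameters from the `unixfs-v1-2025` and `unixfs-v0-2015` profile tables.
modern = dag_shape(TIB, 1 * MIB, 1024)
legacy = dag_shape(TIB, 256 * KIB, 174)

print("unixfs-v1-2025:", modern)  # (3, 1049601)
print("unixfs-v0-2015:", legacy)  # (4, 4218550)
print("node ratio:", round(legacy[1] / modern[1], 2))
```

Under this model the 1 MiB / 1024 parameters give 2^20 leaves, 1024 intermediate nodes, and one root, while the 256 KiB / 174 parameters give 2^22 leaves and three levels of parents above them, which is consistent with the figures stated in the proposal.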