doc/plans: Update OCI sealing spec (kernel sigs, flattened layers) #224
cgwalters wants to merge 1 commit into composefs:main
Conversation
Force-pushed from a404b59 to 2a23a13
allisonkarlitskaya left a comment
I started reviewing this before I noticed how large it was. A few comments from the first part...
doc/plans/oci-sealing-impl.md (Outdated)

> ### 1. Algorithm string format
>
> The sealing workflow in composefs-rs begins with `create_filesystem()` building the filesystem from OCI layers. Layer tar streams are imported via `import_layer()`, converting them to composefs split streams. Files 64 bytes or smaller are stored inline in the split stream, while larger files are stored in the object store with fsverity digests. Layers are processed in order, applying overlayfs semantics including whiteout handling (`.wh.` files). Hardlinks are tracked properly across layers to maintain filesystem semantics.
>
> The spec defines `${DIGEST}-${BLOCKSIZEBITS}` identifiers (e.g. `sha512-12`). Need to implement parsing and mapping to kernel constants (`FS_VERITY_HASH_ALG_SHA512`, 4096-byte blocks, no salt). This is a prerequisite for everything else.
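As an illustration, parsing and validating that identifier could look roughly like the following Rust sketch; the `VerityAlgorithm` type, the function name, and the accepted block-size range are hypothetical, not the actual composefs-rs API:

```rust
// Hypothetical parser for the proposed `${DIGEST}-${BLOCKSIZEBITS}` identifier
// format, e.g. "sha512-12" -> (SHA-512, 4096-byte blocks).

#[derive(Debug, PartialEq)]
enum VerityAlgorithm {
    Sha256, // kernel FS_VERITY_HASH_ALG_SHA256 == 1
    Sha512, // kernel FS_VERITY_HASH_ALG_SHA512 == 2
}

fn parse_verity_id(id: &str) -> Result<(VerityAlgorithm, usize), String> {
    let (alg, bits) = id
        .rsplit_once('-')
        .ok_or_else(|| format!("malformed identifier: {id}"))?;
    let alg = match alg {
        "sha256" => VerityAlgorithm::Sha256,
        "sha512" => VerityAlgorithm::Sha512,
        other => return Err(format!("unknown algorithm: {other}")),
    };
    let bits: u32 = bits.parse().map_err(|e| format!("bad block size bits: {e}"))?;
    // fsverity block sizes are powers of two; 12 bits == 4096 bytes is the
    // common case. The accepted range here is an assumption for illustration.
    if !(9..=16).contains(&bits) {
        return Err(format!("unsupported block size bits: {bits}"));
    }
    Ok((alg, 1usize << bits))
}

fn main() {
    let (alg, block_size) = parse_verity_id("sha512-12").unwrap();
    println!("{alg:?} with {block_size}-byte blocks");
}
```

If the `fsverity-`-prefixed variant using algorithm numbers were chosen instead, only the match arms would change.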
I'm a proponent of the alg-blkbits approach, but maybe we also want a prefix like fsverity- to make sure people understand that it's not straight-up sha512, or even a straight-up Merkle tree. If we did do an fsverity- prefix we could use the algorithm number instead, like fsverity-2-12. Just a bit of bikeshedding.
doc/plans/oci-sealing-impl.md (Outdated)

> ### 3. Persist manifest and config as regular files
>
> Two-level naming allows access by fsverity digest (verified) or by ref name (unverified). The `ensure_stream()` method provides idempotent stream creation with SHA256-based deduplication. Streams can reference other streams via digest maps stored in split stream headers, enabling the layer→config relationship tracking.
>
> The manifest is currently not persisted at all (fetched, parsed, discarded in `skopeo.rs`). The config is stored as splitstream via `write_config()`. Both need to be stored as regular files so `FS_IOC_ENABLE_VERITY` can be called on them.
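As a toy illustration of the idempotent, digest-deduplicated creation that `ensure_stream()` is described as providing: the types and names below are hypothetical stand-ins, not the real composefs-rs implementation.

```rust
use std::collections::HashMap;

// Toy stand-in for the repository: streams are addressed two ways, by digest
// (verified) and by ref name (unverified), mirroring the two-level naming above.
struct StreamStore {
    by_digest: HashMap<String, Vec<u8>>,
    by_ref: HashMap<String, String>, // ref name -> digest
}

impl StreamStore {
    fn new() -> Self {
        StreamStore { by_digest: HashMap::new(), by_ref: HashMap::new() }
    }

    // Idempotent: the builder closure only runs if the digest is not yet stored.
    fn ensure_stream(&mut self, digest: &str, name: &str, build: impl FnOnce() -> Vec<u8>) -> &[u8] {
        self.by_digest.entry(digest.to_string()).or_insert_with(build);
        self.by_ref.insert(name.to_string(), digest.to_string());
        &self.by_digest[digest]
    }
}

fn main() {
    let mut store = StreamStore::new();
    store.ensure_stream("sha256:aaaa", "latest", || b"stream-bytes".to_vec());
    // Second call with the same digest does not rebuild the stream.
    let data = store.ensure_stream("sha256:aaaa", "stable", || unreachable!());
    println!("{} bytes, {} refs", data.len(), store.by_ref.len());
}
```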
We definitely don't need to write a file to the disk in order to calculate its fs-verity digest. What's the advantage of doing this? Will you continue to also store the splitstream? If not, GC is going to get a lot more complicated and also slower...
> We definitely don't need to write a file to the disk in order to calculate its fs-verity digest. What's the advantage of doing this?
What I want to do is support "read + mount a container image with strict IPE enabled" and to do that we ideally have all of the metadata covered by fsverity as well.
That said, I think most policies are mainly concerned with denying execute-unsigned, not read-unsigned (as that gets obviously hard).
But it felt appealing to me to be able to say that the manifest and config are also just signed-fsverity files.
Or to say it differently: in theory one could omit e.g. cosign covering an image; just a composefs-signature artifact is enough as well to verify a complete image.
> Will you continue to also store the splitstream? If not, GC is going to get a lot more complicated and also slower...

That's all we need to do, right?
Yes, I mean that should work just fine. I'm not sure what the value of IPE for "just data" here is though, since the kernel won't block that, and since we already have (and check) the canonical identifier (ie: sha256 content hash). fsverity (and dm-verity) are merkle trees because they're about protecting sparse access to large files without having to hash the entire thing at the start, which doesn't really apply to a JSON document...
> Yes, I mean that should work just fine. I'm not sure what the value of IPE for "just data"
The way I was thinking about this more is from the "image integrity" angle and less about IPE specifically. It relates to the topic you brought up below.
If we have a kernel-fsverity signature that directly covers the manifest then "for free" we get verification when we read the manifest that it is valid.
And the manifest + config define things like the layer ordering. So unless a runtime validated some other signature (such as cosign) on startup it would still be possible to swap image layers around.
I guess, backing up a bit: even with this, someone could e.g. replace the logical tag for a floating quay.io/someorg/somecontainer with some other image on disk.
This relates to something pwithnall did for ostree, including bindings in the (signed) commit. In theory we could add an OCI extension that included the image name (or multiple names) in the manifest as an annotation, and a runtime could then validate, when it goes to run an image with that name, that the app matches. I think DDIs basically have some of this because the os-release field can contain a name.
(And of course arguably...we could add container image tags into /usr/lib/os-release too...)
Anyways though, yes, kernel-fsverity signature on the manifest/config is not required under all threat models, but since it seems easy to do and (AFAIK) there's no downside, I'd like to do it.
I guess though, backing up...there are really two cases:

1. The system configuration pins an image by explicit (manifest|config) digest
2. There's a floating tag

For sure with 1) I don't think we need the kernel-fsverity signature for the manifest/config; it is indeed just "find image named by manifest|config digest", verify the sha digest of the object, then mount the merged erofs with the kernel-fsverity signature in use.
Of course, this whole thing is predicated a bit on "how do I verify the config which specifies quay.io/foo/bar@sha256:...?" For bootc LBIs that's very clear: it's covered by the UKI -> composefs-for-root. Other cases might use something like signed confexts (which I'd like to support being OCI+composefs too).
For floating tags, yeah we can't really defend against image swapping "offline" without strengthening what gets signed.
That said, for use cases like Kubernetes, there is an option for kubelet to re-ping the registry for images on startup for a reason related to this a bit - validating that a user can pull an image even it happens to be on disk.
In the end I just come back to this:
> since it seems easy to do and (AFAIK) there's no downside, I'd like to do it.
> ### Signatures
>
> #### Linux kernel fsverity signatures (recommended)
This is a very significant departure from the way the trust model works now and how the object store works in general. From what I understand, each file can only have a single fs-verity signature on it, but we store objects by their content hash, which means if we had two objects enter the object store from differently-signed containers, we'd be in trouble, no?
Is this for every object or just the erofs image?
I'm also not sure that kernel-level fsverity signatures provide very strong protection (at the level that would be provided by signatures on disk images, for example) because they are on a file-by-file basis, assuming that is the intent here. If you ignore the userspace stuff, you could probably still use your ability to freely mix-and-match various individually-signed files into a system configuration that lets you do "bad things"...
> From what I understand, each file can only have a single fs-verity signature on it, but we store objects by their content hash, which means if we had two objects enter the object store from differently-signed containers, we'd be in trouble, no?
This is a good point - however it's interesting as I think it's more of an implementation concern and not a spec concern.
> Is this for every object or just the erofs image?
Yes exactly: we don't require fsverity signatures on individual objects (that form part of a split layer tarball). See this issue; in a nutshell, the goal is that having the fsverity signature on the EROFS blob + detecting overlay require-verity should be sufficient for chain-of-trust.
But that said, what would happen if e.g. two distinct images shared a layer? I think "first one wins" is sufficient for the rootful case. The Linux kernel fsverity signature mechanism only has one keyring which applies to everything, and we can't do anything different. In a future world where there's e.g. per-user keyrings or so...it would just preclude sharing the EROFS metadata blob between trust domains (root and rootless e.g.) right? We could still share the underlying layer objects via hardlinks.
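The "first one wins" behavior for shared layers could be sketched like this; the names are hypothetical, and in reality the signature lives in fsverity metadata attached to the blob via the kernel, not in a map:

```rust
use std::collections::HashMap;

// Toy model: one fsverity signature slot per blob, keyed by fsverity digest.
// The kernel allows exactly one signature per file, so if two images share a
// layer, the first signature to be attached wins and later ones are ignored.
fn attach_signature(
    sigs: &mut HashMap<String, Vec<u8>>,
    fsverity_digest: &str,
    signature: Vec<u8>,
) -> bool {
    if sigs.contains_key(fsverity_digest) {
        return false; // already signed: first one wins
    }
    sigs.insert(fsverity_digest.to_string(), signature);
    true
}

fn main() {
    let mut sigs = HashMap::new();
    assert!(attach_signature(&mut sigs, "deadbeef", b"sig-from-image-a".to_vec()));
    // A second image sharing the same layer keeps the existing signature.
    assert!(!attach_signature(&mut sigs, "deadbeef", b"sig-from-image-b".to_vec()));
    println!("{} signed blobs", sigs.len());
}
```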
> If you ignore the userspace stuff, you could probably still use your ability to freely mix-and-match various individually-signed files into a system configuration that lets you do "bad things"...
Again, not individually signed files, but only complete layers. I don't think this is any different from e.g. dm-verity + IPE. I wouldn't dismiss this concern entirely, but offhand it seems like it'd be quite difficult in practice to craft such a chain.
Now that said one obvious thing even with layers is that this doesn't protect against e.g. rollback attacks. I think that type of thing needs to be out of scope of this spec.
I was thinking about this a bit more, and I think there is still an argument that we should support inline digests in the manifest (or config). It would naturally mean that we can reliably chain from "trust in manifest" to "trust in mounted root", which was again the original goal.
For cases where "trust in manifest" is implicitly handled (e.g. kubernetes API server tells us to run an image with this particular digest) it would Just Work as long as fsverity is supported by the underlying filesystem.
This support would give us a generic "out" for rootless/unprivileged use. (Though honestly, in the medium term it'd clearly be nice to enhance Linux kernel fsverity to at least do something like "allow per-user keyrings for files owned by that user" or so.)
> But that said, what would happen if e.g. two distinct images shared a layer? I think "first one wins" is sufficient for the rootful case. The Linux kernel fsverity signature mechanism only has one keyring which applies to everything, and we can't do anything different. In a future world where there's e.g. per-user keyrings or so...it would just preclude sharing the EROFS metadata blob between trust domains (root and rootless e.g.) right? We could still share the underlying layer objects via hardlinks.
could you please elaborate on this point? For the runtime side, why would it be a problem as long as there is at least one accepted signature?
I agree, "one accepted signature" is fine - we can't do anything different today because the Linux kernel's fsverity mechanism only allows the same.
Force-pushed from 2a23a13 to fa8852b
The biggest goal here is support for Linux kernel-native fsverity signatures to be attached to layers, which enables integration with IPE.

Add support for a fully separate OCI "composefs signature" artifact which can be attached to an image.

Drop the -impl.md doc...it's not useful to try to write this stuff in markdown. The spec has some implementation considerations, but it's easier to look at the implementation side from a code draft.

Add standardized-erofs-meta.md as a placeholder document outlining the goal of standardizing composefs EROFS serialization across implementations (canonical model: tar -> dumpfile -> EROFS).

Assisted-by: OpenCode (Claude Opus 4.5)
Signed-off-by: Colin Walters <walters@verbum.org>
Force-pushed from fa8852b to b470f73
@@ -0,0 +1,74 @@

> # Standardized EROFS Metadata Serialization
When talking about creating a specification for the sealing, of course this heavily depends on a spec for the EROFS layout, which pulls back in all the debate in composefs/composefs#198
Now... #225 is starting to look at what it'd take to have us support being bit-for-bit compatible with the previous composefs-c (1.0) format.
In an ideal world perhaps we teach mkfs.erofs how to generate this too? This also relates a bit to uapi-group/specifications#207
hi @cgwalters, do you mean the erofs metadata arrangement or the sealing format?
For the sealing format, I'm fine with any help; as long as anyone has interest in porting this to erofs-utils, it can be used to improve the interaction between composefs tools and erofs-utils.
As for the erofs metadata itself, I don't think erofs-utils should strictly align with mkcomposefs (just because erofs-utils itself already has different arrangements for different cases, though erofs-utils is always designed to be reproducible), and I also think it sounds unnecessary, since the erofs metadata layout is flexible enough (yet composefs can definitely define a strict on-disk layout for all related stuff).
But my own TODO list is already overloaded, so I can't help with practical development on this.
> but erofs-utils is always designed to be reproducible
Only within a specific binary version, right? You don't guarantee that a future mkfs.erofs wouldn't generate a different metadata layout, correct?
> but erofs-utils is always designed to be reproducible

> Only within a specific binary version, right? You don't guarantee that a future mkfs.erofs wouldn't generate a different metadata layout, correct?

Yes, of course. But the erofs-utils layout won't be frequently changed in the foreseeable future, I think.
Right, so a very important thing that we're trying to do with composefs+OCI is to not change the wire format for OCI: we're not shipping EROFS on the wire, only generating it reproducibly on the client and server.
Hence we must:
- Lock in the bit-for-bit file format basically forever
- Have solid tooling to generate it (and ideally that tooling is easily accessible from multiple programming languages)
just one note: currently the tar format also doesn't strictly define the header/file order, for example, and various tools generate various tars. So I guess reproducibility within a specific binary version is also fine, as long as users can reproduce it with the tools/command line and the exact version.
I respect your choices, but I don't see why shipping erofs metadata on the wire would be inappropriate, from whatever point of view (of course you would have to lock in the bit-for-bit file format forever.)
> just one note: currently the tar format also doesn't strictly define the header/file order, for example, and various tools generate various tars. So I guess reproducibility within a specific binary version is also fine, as long as users can reproduce it with the tools/command line and the exact version.
Yes, this is a reason why the doc text here proposes only a canonical mapping from composefs dumpfile ➡️ EROFS.
A composefs dumpfile has less representational flexibility (e.g. paths always start with /, xattrs are always sorted and serialized in band). I will actually try to ensure we have a "canonical dumpfile" format, which should help.
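Those two invariants (absolute paths, sorted xattrs) could be checked mechanically; a toy sketch with a hypothetical `DumpEntry` type, not the real dumpfile parser:

```rust
// Toy model of one dumpfile entry, checking the two canonicalization
// invariants mentioned above: absolute paths, and xattrs sorted by name
// (strict ordering also rejects duplicate names).
struct DumpEntry {
    path: String,
    xattrs: Vec<(String, Vec<u8>)>, // (name, value), serialized in band
}

fn is_canonical(entry: &DumpEntry) -> bool {
    entry.path.starts_with('/') && entry.xattrs.windows(2).all(|w| w[0].0 < w[1].0)
}

fn main() {
    let ok = DumpEntry {
        path: "/usr/bin/true".into(),
        xattrs: vec![
            ("security.selinux".into(), b"system_u:object_r:bin_t:s0".to_vec()),
            ("user.demo".into(), b"x".to_vec()),
        ],
    };
    assert!(is_canonical(&ok));
    println!("entry is canonical");
}
```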
> I respect your choices, but I don't see why shipping erofs metadata on the wire would be inappropriate, from whatever point of view (of course you would have to lock in the bit-for-bit file format forever.)
Hmmm. It is an interesting design choice; now that we've moved the signatures to a separate OCI artifact, we could indeed try to create a design where the EROFS metadata is actually stored on the registry too.
However...it introduces the same "representational ambiguity": we'd be shipping both tar and metadata-EROFS, and would have to answer the question of what happens when they disagree. This problem is one argument for why zstd:chunked still always validates against the diffid for security reasons.
But OTOH with composefs here we aren't actually trying to optimize incremental fetches, so we're always parsing the whole tarball anyways and could actually do something like:
- fetch erofs-meta from the detached composefs OCI signature
- fetch layers
- for each layer, parse the tarball, generate a canonical in-memory metadata representation (like a composefs dumpfile), and then compare it with the erofs-meta: if they differ, it's a fatal error
Hummm.....yes, I think such an approach would allow us to entirely punt on the problem of standardizing an EROFS layout, but at the cost of duplicating all of the metadata.
Is that the right call? I am...unsure.
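The per-layer comparison step in that flow amounts to a strict equality check between the regenerated canonical metadata and the fetched erofs-meta. A minimal sketch, with hypothetical names and the canonical representation modeled as opaque bytes:

```rust
// Compare the canonical metadata regenerated from the layer tarball against
// the erofs-meta fetched from the detached composefs OCI artifact. Any
// disagreement is a fatal error, per the flow described above.
fn verify_layer_metadata(regenerated: &[u8], fetched_erofs_meta: &[u8]) -> Result<(), String> {
    if regenerated != fetched_erofs_meta {
        return Err("erofs-meta does not match metadata regenerated from layer tar".into());
    }
    Ok(())
}

fn main() {
    assert!(verify_layer_metadata(b"canonical-bytes", b"canonical-bytes").is_ok());
    assert!(verify_layer_metadata(b"canonical-bytes", b"tampered").is_err());
    println!("verification sketch ok");
}
```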
I don't have the answer: I could find some technical ways, yet when it comes to OCI world, it really su*ks...
Right now the "sealed UKI" mode is relying on the bit-for-bit EROFS reproducibility, which argues for standardizing it.
But if we went all in on this external fsverity signature mechanism, then I think we could (per discussion) also convert basically all use cases for sealed UKIs to use it as well. (After some design work)
Two big goals: