
Lockbox Format Notes

This document records the intended pre-1.0 production format. All on-disk numeric fields are little-endian unless stated otherwise.

Implementations must not serialize on-disk structures with native-endian conversions, memory transmutation, or raw struct layout. Every numeric field must be encoded and decoded with an explicit byte order. Language bindings should expose parsed APIs rather than asking callers to reinterpret raw lockbox bytes.
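As a minimal sketch of the explicit-byte-order rule, the Python `struct` module can encode every numeric field with a forced little-endian format (`<`), so the bytes are identical on any host regardless of its native endianness:

```python
import struct

def encode_u64_le(value: int) -> bytes:
    """Encode a u64 with an explicit little-endian byte order ('<Q'),
    never the native-order formats or raw memory layout."""
    return struct.pack("<Q", value)

def decode_u64_le(raw: bytes) -> int:
    """Decode the 8 bytes back with the same explicit byte order."""
    (value,) = struct.unpack("<Q", raw)
    return value

# The least-significant byte comes first on disk, on every host.
encoded = encode_u64_le(0x0102030405060708)
assert encoded == bytes([0x08, 0x07, 0x06, 0x05, 0x04, 0x03, 0x02, 0x01])
assert decode_u64_le(encoded) == 0x0102030405060708
```

The same discipline applies to every field in the structures below; none of them may be written via native struct layout.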

The CI workflow .github/workflows/endian-interop.yml verifies this contract by transferring lockbox fixtures between Linux x64, Linux arm64, macOS arm64, and an emulated big-endian s390x environment. The big-endian job reads a little-endian-created fixture and emits a fixture that Linux x64 reads back.

The physical unit of the lockbox is a fixed-size encrypted page. Higher level structures such as TOC nodes, file chunks, environment variables, key directories, free-space indexes, and commit roots are encoded as objects inside pages. Public APIs must not expose page management.

Fixed Header

The fixed header is 96 bytes and is the only mutable fixed-location structure in the file.

offset  size  field
0       8     magic: "LBX2HDR\0"
8       2     version: 3
10      2     header flags
12      4     header length: 96
16      8     latest commit-root page offset, or 0
24      8     latest commit sequence
32      8     latest public key-directory offset, or 0
40      16    public lockbox UUID
56      8     reserved
64      32    SHA-256 checksum over bytes 0..64

The lockbox UUID is public metadata. It exists so tools can identify a lockbox even if the file is renamed or moved. It must not be derived from file names, content, recipients, or passwords.

The header checksum is a domain-separated SHA-256 digest. It detects torn or malformed header updates. It is not a security boundary. Security decisions must be based on authenticated pages and authenticated commit roots.

The key-directory pointer remains in the fixed header because users need the key directory before the content key is unlocked. The key directory stores only unlock metadata and must not contain private file metadata.

Pages

Every page has the same physical size. The default page size is 8 MiB. Implementations backed by native storage may read and write whole pages directly. Browser/WASM clients may fetch page ranges over HTTP using the TOC offsets, but decryption still authenticates a whole page.

offset  size  field
0       8     magic: "LBX2PAG\0"
8       2     page header version: 2
10      2     public flags
12      4     header length
16      8     page id
24      8     commit sequence that wrote this page
32      12    AEAD nonce
44      4     encrypted body length
48      16    reserved header extension
64      32    SHA-256 checksum over bytes 0..64
H       m     encrypted body
H+m     p     zero padding to the fixed page size
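A reader for the public page header, per the table above, might look like the following sketch (field names and the `NamedTuple` wrapper are illustrative):

```python
import struct
from typing import NamedTuple

class PagePublicHeader(NamedTuple):
    version: int
    public_flags: int
    header_len: int
    page_id: int
    commit_seq: int
    nonce: bytes
    body_len: int
    checksum: bytes

def parse_page_header(raw: bytes) -> PagePublicHeader:
    """Parse the public page header; every numeric field is little-endian."""
    if raw[:8] != b"LBX2PAG\x00":
        raise ValueError("bad page magic")
    version, flags, hlen, page_id, commit_seq = struct.unpack_from("<HHIQQ", raw, 8)
    nonce = raw[32:44]
    (body_len,) = struct.unpack_from("<I", raw, 44)
    checksum = raw[64:96]
    return PagePublicHeader(version, flags, hlen, page_id,
                            commit_seq, nonce, body_len, checksum)
```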

The nonce is generated per page write and must be unique for the content key. It must not be derived only from record kind, object kind, or commit sequence.

Only the page header is public. Object kinds, object lengths, logical paths, symlink targets, environment variable names, permissions, compression selection, and file contents are inside the encrypted body.

Page public-header checksums are generated and verified at the page cache/page-codec boundary. Higher-level TOC, file, recovery, and extraction code must not bypass the page cache for normal page reads or writes.

Page AEAD associated data includes:

  • lockbox format domain string
  • fixed header version
  • lockbox UUID
  • page id
  • commit sequence
  • public flags
  • encrypted body length
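The associated-data list above can be serialized as a simple concatenation of explicitly encoded public fields. The domain string below is a placeholder; the document names the inputs but not their exact wire layout, so this encoding is an assumption:

```python
import struct

AD_DOMAIN = b"example-page-ad-domain"  # hypothetical domain string

def page_associated_data(header_version: int, lockbox_uuid: bytes,
                         page_id: int, commit_seq: int,
                         public_flags: int, body_len: int) -> bytes:
    """Concatenate the listed public fields, each with an explicit
    little-endian encoding, into the page's AEAD associated data."""
    return (AD_DOMAIN
            + struct.pack("<H", header_version)
            + lockbox_uuid
            + struct.pack("<QQHI", page_id, commit_seq, public_flags, body_len))
```

Binding the page id and commit sequence into the AEAD means a valid ciphertext cannot be silently relocated to a different page slot or replayed under a different commit.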

Page Body

The decrypted page body is an object container.

offset  size  field
0       1     page body version: 1
1       1     compression algorithm
2       1     compression profile
3       1     reserved
4       8     uncompressed object-stream length
12      4     reserved
16      n     compressed or uncompressed object stream

Compression is chosen by the core per page body:

  • default writes try zstd with the normal profile
  • if compression is larger or not useful, the body is stored uncompressed
  • compaction may use a higher-ratio internal zstd profile
  • the chosen algorithm and profile are stored inside encrypted metadata

The public API should not expose many compression modes. Normal callers get the default policy. Maintenance commands such as compaction may choose the archival profile internally.
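The fallback policy above is straightforward to sketch. Since Python's standard library lacks zstd compression, `zlib` stands in for it here; the algorithm ids are hypothetical:

```python
import zlib

UNCOMPRESSED, ZSTD_NORMAL = 0, 1  # hypothetical algorithm ids

def choose_body_compression(object_stream: bytes) -> tuple[int, bytes]:
    """Try compression first; keep the raw stream when compression
    does not shrink it. zlib stands in for zstd in this sketch."""
    compressed = zlib.compress(object_stream, level=6)
    if len(compressed) >= len(object_stream):
        return UNCOMPRESSED, object_stream
    return ZSTD_NORMAL, compressed
```

The chosen algorithm and profile land in the page body header (bytes 1 and 2), which is itself inside the encrypted body, so the selection leaks nothing publicly.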

Compressed File Extents

The production file-data layout is page-packed compressed extents, not one compressed object per physical page. A fixed 8 MiB physical page is a container. Its encrypted body may contain:

  • many complete compressed small files
  • many compressed chunks from one or more files
  • one fragment of a large compressed chunk
  • a mix of complete chunks and fragments, as long as the page body fits

Large files are compressed as independent bounded frames rather than one whole-file solid stream. A frame may span multiple physical pages, but it remains independently decompressible once its page fragments have been fetched and decrypted. TOC chunk entries identify the logical file offset, logical length, compressed length, compression algorithm, frame id, and ordered physical page fragments needed to reassemble that frame. Each physical fragment reference contains the page offset, fixed page length, encrypted object id, compressed frame offset, and fragment length.

This gives browser and web-service clients fast random access at frame granularity:

  1. fetch the TOC pages
  2. decrypt the TOC
  3. locate the compressed frame or frames for the requested file/range
  4. request only the physical pages containing those frame fragments
  5. decrypt the pages, reassemble the compressed frame bytes, and decompress
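Step 5's reassembly is a strict ordered concatenation of the fragment bytes recovered from each decrypted page. A minimal sketch (the `(frame_offset, bytes)` pair shape is illustrative):

```python
def reassemble_frame(fragments):
    """fragments: ordered (frame_offset, fragment_bytes) pairs taken from
    decrypted pages, in the order given by the TOC fragment list.
    Rejects gaps and overlaps rather than guessing."""
    out = bytearray()
    for frame_offset, data in fragments:
        if frame_offset != len(out):
            raise ValueError("fragment gap or overlap in compressed frame")
        out += data
    return bytes(out)
```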

Recovery does not depend solely on the TOC. File fragment metadata is inside encrypted page bodies and includes path, permissions, optional final file length, logical frame offset, frame length, compression algorithm, frame id, compressed frame length, compressed fragment offset, and fragment length. Streaming writes may store 0 for the final file length because the final length is not known when early frames are written; the TOC is authoritative when available, and recovery can still infer a best-effort length from intact frames.

Paths remain private because both TOC entries and fragment metadata are inside encrypted page bodies. They are exposed only after the caller has unlocked the content key.

Page Cache Boundary

Encryption and decryption are owned by the page cache. Higher layers construct or consume decoded page objects and are otherwise oblivious to encryption. On read, the cache loads fixed encrypted bytes from storage, authenticates and decrypts the page once, then caches the decoded page. On write, callers submit decoded page objects; the cache encodes, compresses, encrypts, writes one fixed page to storage, and stores the decoded page in cache.

Raw page encode/decode helpers are format primitives. Production read/write paths should route through the cache boundary. Direct raw decoding is reserved for recovery scans and low-level format tests, where the caller starts from untrusted bytes rather than from an opened lockbox.

Objects

The object stream contains typed objects. Object headers are encrypted because they are part of the page body.

offset  size  field
0       1     object kind
1       1     object header version
2       2     object flags
4       8     object id
12      8     object payload length
20      n     object payload
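Walking the decrypted object stream follows directly from the header layout above; a sketch:

```python
import struct

def iter_objects(stream: bytes):
    """Yield (kind, object id, payload) for each object in a decrypted
    object stream. Header: kind u8, header version u8, flags u16,
    id u64, payload length u64 -- 20 bytes total, little-endian."""
    off = 0
    while off < len(stream):
        kind, ver, flags, obj_id, length = struct.unpack_from("<BBHQQ", stream, off)
        payload = stream[off + 20 : off + 20 + length]
        if len(payload) != length:
            raise ValueError("truncated object payload")
        yield kind, obj_id, payload
        off += 20 + length
```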

Object ids are stable references used by TOC entries and indexes. A logical file may reference one or more file-data objects. Multiple small logical files may be packed into one file-pack object.

Initial object kinds:

1       commit root
2       TOC leaf node
3       TOC internal node
4       file data
5       packed file data
6       symlink
7       env set
8       env delete
9       key directory
10      free-space index leaf
11      free-space index internal

Commit Root

The fixed header points to the latest commit-root page offset. The commit root is an encrypted object inside that page.

The commit root payload contains:

field
commit sequence
lockbox UUID
format parameter set id
TOC root object reference
free-space index root object reference
primary key-directory offset, or zero
key-directory mirror offset A, or zero
key-directory mirror offset B, or zero
key-directory generation
previous commit-root reference, or zero
commit creation timestamp, optional and coarse
commit flags

Opening a lockbox reads the header, decrypts the commit-root page, validates the commit root, then opens the referenced TOC and free-space indexes. If the header is corrupt or stale, recovery may scan pages for valid commit roots and choose the highest valid sequence.

Rollback attacks on a standalone copied file cannot be fully prevented without an external freshness anchor. Lockbox detects internal corruption; it cannot prove that an attacker has not replaced the entire file with an older valid copy.

An external freshness anchor is state outside the lockbox that records the latest known version, generation, hash, or signed timestamp. Examples include a server-side object generation number, transparency log entry, signed manifest, append-only audit log, or application database row. Lockbox can reject stale internal metadata within one file by choosing the highest authenticated generation; only an external anchor can detect replacement of the whole file with an older but internally valid lockbox.

Table Of Contents

The TOC is a live-only copy-on-write BTree. Tombstones are not stored in the current TOC. Deletes remove entries from the live TOC and return old object/page references to the free-space index once they are no longer referenced.

Leaf payloads contain sorted manifest entries. Internal payloads contain sorted child separators and child object references.

Decode rules are intentionally strict:

  • leaf entries must be strictly sorted by logical path
  • duplicate leaf paths are corrupt
  • internal children must be strictly sorted by separator path
  • duplicate separators are corrupt
  • child references must resolve to valid TOC objects
  • every stored path must pass logical path validation
  • missing or corrupt child objects make the TOC corrupt
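The strict-sort rules for leaf paths reduce to a single pairwise comparison; a sketch:

```python
def validate_leaf_paths(paths):
    """Enforce the strict-sort decode rule for a TOC leaf: any
    out-of-order pair or duplicate path makes the leaf corrupt."""
    for prev, cur in zip(paths, paths[1:]):
        if cur <= prev:  # equality catches the duplicate-path case
            raise ValueError(f"corrupt TOC leaf near {cur!r}")
```

The same check, applied to separator paths, covers the internal-node rules.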

Updating a file rewrites the touched TOC leaf and changed ancestors. Unchanged TOC pages remain referenced by the previous and current commit roots until compaction reclaims unreachable history.

Free-Space Index

Reusable physical pages and reusable free regions are tracked by a transactional free-space index committed with the same commit root as the TOC.

The index is maintained in two logical orders:

  • by offset/page id, for coalescing adjacent free ranges
  • by size, for best-fit allocation

The free-space index is a performance and space-reuse structure, not the source of user-visible truth. If it is corrupt, tools may rebuild it by scanning valid pages and comparing reachable objects from the latest valid TOC and commit root.

The root object may be either a free index leaf or a free index internal object. Leaf payloads contain sorted non-overlapping (offset, length) free ranges. Internal payloads contain sorted child references:

offset  size  field
0       1     free-index version: 1
1       1     node kind: 0 = leaf, 1 = internal
2       2     reserved
4       4     entry count
8       n     leaf ranges or internal children

Leaf entries are (free_offset, free_length). Internal entries are (first_free_offset, child_page_offset). Children must be strictly sorted by first_free_offset. Free-index pages are append-only during commit so the published index never lists the page that stores the index itself.
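Coalescing in the offset-ordered view merges adjacent or overlapping (offset, length) ranges; a sketch:

```python
def coalesce_free_ranges(ranges):
    """Merge adjacent or overlapping (offset, length) free ranges.
    Sorting defensively; leaf entries are already offset-ordered."""
    merged = []
    for off, length in sorted(ranges):
        if merged and off <= merged[-1][0] + merged[-1][1]:
            prev_off, prev_len = merged[-1]
            merged[-1] = (prev_off, max(prev_len, off + length - prev_off))
        else:
            merged.append((off, length))
    return merged
```

The size-ordered view can then be derived from the coalesced ranges for best-fit allocation.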

Key Directory

The key directory is public unlock metadata referenced by the fixed header and mirrored in the commit root. It stores only slot ids, slot kinds, salts/ciphertexts, public recipient wrapping data, and encrypted content-key bytes. It must not store paths, file names, environment variable names, or file contents.

The key directory is intentionally readable before the content key is available. Its wrapped content-key values are authenticated by their wrapping algorithms; its outer structure is length-limited and protected by SHA-256 checksums so tools can reject malformed metadata early.

Every key-directory block has its own public recovery header:

offset  size  field
0       8     magic: "LBX2KEY\0"
8       2     key-directory version: 3
10      2     flags
12      4     header length: 128
16      8     total key-directory length
24      8     key-directory generation
32      16    lockbox UUID
48      4     copy index
52      4     reserved
56      32    SHA-256 checksum over key-slot payload
88      8     reserved
96      32    SHA-256 checksum over bytes 0..96
128     n     key-slot payload

The lockbox writes three copies of the key directory for every key-directory generation: a primary copy referenced by the fixed header, plus two mirror copies referenced by the commit root. Recovery can also scan the raw lockbox for LBX2KEY\0 blocks, validate their checksums, group them by lockbox UUID, and use the highest generation that successfully unwraps the content key.
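The raw scan described above starts by locating candidate LBX2KEY\0 blocks; a best-effort sketch:

```python
KEY_MAGIC = b"LBX2KEY\x00"

def scan_key_directory_offsets(raw: bytes):
    """Yield every offset where a key-directory block may start.
    Callers must still validate the header and payload checksums,
    group candidates by lockbox UUID, and prefer the highest
    generation that successfully unwraps the content key."""
    off = raw.find(KEY_MAGIC)
    while off != -1:
        yield off
        off = raw.find(KEY_MAGIC, off + 1)
```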

The fixed header is therefore a fast path, not the only path. If the header is corrupt, password/public-key unlock can recover the lockbox UUID and content key from a scanned key-directory mirror, then use those values to authenticate and decrypt pages while scanning for the latest valid commit root.

Removing a password or recipient is not just a metadata delete. Because old COW history may contain old key directories or data pages, the CLI must treat key removal as a conservative maintenance operation:

  1. remove the key slot from the live key directory
  2. rewrite reachable encrypted pages as needed under the retained content key or a new content key
  3. compact unreachable old pages
  4. commit the new key directory and free-space index

Page Cache

The core uses a unified page cache for reads and dirty writes. Clean decoded pages are held in a weighted LRU cache. Dirty pages stay in the cache and are visible to reads from the same opened lockbox, but they are not written to the backing store until commit() flushes them and publishes a new commit root. There is no background writer.

Copy-on-write happens at commit time. This allows the same dirty page to absorb multiple logical mutations before the library allocates and writes replacement pages.

When a file, symlink, or environment variable is deleted or replaced, the commit path must redact the physical page that held the old encrypted object. If the old page also contains live objects, those live objects are relocated to a new page first; then the old physical page is overwritten with zeros and removed from the decoded-page cache. This is required because old COW pages may still contain decryptable ciphertext even after the live TOC no longer references them.

CacheLimit::Auto is page-aware:

  • minimum native cache: max of eight pages or 64 MiB
  • native target: about 15% of currently available/reclaimable memory
  • native cap: 4 GiB by default
  • WASM default: 64 MiB unless the embedder supplies an explicit limit
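The native sizing rules above compose as a clamp; a sketch (exact rounding is unspecified, so the 15% target here is an approximation):

```python
PAGE = 8 * 1024 * 1024  # default page size
MIB, GIB = 1024 * 1024, 1024 * 1024 * 1024

def auto_cache_limit(available_bytes: int, page_size: int = PAGE) -> int:
    """Native CacheLimit::Auto sketch: about 15% of currently
    available memory, floored at max(8 pages, 64 MiB) and
    capped at 4 GiB."""
    floor = max(8 * page_size, 64 * MIB)
    target = int(available_bytes * 0.15)
    return min(max(target, floor), 4 * GIB)
```

With the default 8 MiB page size the two floor terms coincide at 64 MiB; a larger page size raises the floor.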

The cache is a performance mechanism, not a correctness requirement. TOC traversal, recovery, and extraction must all work when the cache is disabled.

Recovery

Recovery scans pages starting immediately after the fixed header. It does not require a valid fixed header, TOC, or free-space index.

Recovery can:

  • authenticate and decrypt intact pages independently
  • locate valid commit roots
  • rebuild a best-effort live view from the highest valid commit root
  • salvage file objects whose metadata can still be associated with paths
  • report intact, corrupt, and lost counts

Recovery is not an undelete guarantee. Once freed pages or free regions have been overwritten, the old objects are no longer recoverable.