Rethink or discuss chunk content addressing

I _believe_ that the code doesn't enforce that chunks are named by a hash of their content. But the docs talk about it, so I wanted to brain-dump a little.

For something like Git, using content addressing makes sense:

- Data is stored locally as blobs identified by content address, so everyone participating in the protocol already has the value.
- Most of the data is opaque (file contents), so it's a small extension to store trees and such in the same manner.
- Everything is cleartext, so there's a straightforward mapping between data and identifier: `sha256(data)`. (Git itself defaults to SHA-1 over a short header plus the content, but the idea is the same.)
- We _want_ file contents to collide with files already stored in the object store; that's how deduping works.
- The object store is used directly to find commits and trees by hash.
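
For reference, the Git-style model in the list above can be sketched in a few lines. This is only an illustration: `ObjectStore`, `put`, and `get` are hypothetical names, and the header that Git prepends before hashing is left out.

```python
import hashlib


class ObjectStore:
    """A tiny in-memory, content-addressed blob store (illustrative only)."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # The identifier is derived from the bytes themselves.
        oid = hashlib.sha256(data).hexdigest()
        # Identical content maps to the same key, so storing it twice is a no-op.
        self._objects.setdefault(oid, data)
        return oid

    def get(self, oid: str) -> bytes:
        data = self._objects[oid]
        # Anyone holding the blob can check it against its name.
        if hashlib.sha256(data).hexdigest() != oid:
            raise ValueError("stored blob does not match its identifier")
        return data

    def __len__(self) -> int:
        return len(self._objects)


store = ObjectStore()
first = store.put(b"hello world\n")
second = store.put(b"hello world\n")  # same content, same identifier
```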

These things aren't true for our purposes.

- Data is not stored locally as blobs: it's structured, so the client needs to assemble the data and serialize it in order to compute the hash.
- None of our data is opaque, so there's no data for which computing a hash is a single step: the hash of the cleartext content is `sha256(serialize(assemble(db, inputs)))`.
- It's encrypted on the server, which means that either the hash has no relation to the stored blob (no validation possible), or the hash is computed as `sha256(encrypt(key, serialize(assemble(db, inputs))))`, which is a long pipeline that involves the key in the production of the identifier.
- We don't ever expect a collision: two identical transactions should never occur, and so a hash collision is a problem, not an opportunity.
- We don't expect to find data directly by hash: we'll learn an identifier from an external source, or (more commonly) from a transaction itself.
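
To make the two pipelines above concrete, here is a rough sketch. Everything in it is an assumption for illustration: `assemble` and `serialize` are hypothetical stand-ins for whatever the client really does, the record layout is made up, and `toy_encrypt` is a deterministic XOR keystream that is **not** real encryption. It exists only to show the key feeding into the identifier.

```python
import hashlib
import json


def assemble(db, inputs):
    # Hypothetical: gather the structured rows that make up one chunk.
    tx = inputs["tx"]
    return {"tx": tx, "rows": sorted(db[tx])}


def serialize(record) -> bytes:
    # Hypothetical: any byte-stable encoding would do; sorted-key JSON is used here.
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def toy_encrypt(key: bytes, plaintext: bytes) -> bytes:
    # NOT real encryption: XOR against a SHA-256-derived keystream, illustration only.
    stream = bytearray()
    counter = 0
    while len(stream) < len(plaintext):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(p ^ s for p, s in zip(plaintext, stream))


def cleartext_id(db, inputs) -> str:
    # sha256(serialize(assemble(db, inputs))): unrelated to the blob the server holds.
    return hashlib.sha256(serialize(assemble(db, inputs))).hexdigest()


def ciphertext_id(key: bytes, db, inputs) -> str:
    # sha256(encrypt(key, serialize(assemble(db, inputs)))): checkable, but key-dependent.
    blob = toy_encrypt(key, serialize(assemble(db, inputs)))
    return hashlib.sha256(blob).hexdigest()


db = {"tx-1": [["e1", "name", "alice"], ["e1", "age", "30"]]}
inputs = {"tx": "tx-1"}
id_key_a = ciphertext_id(b"key-a", db, inputs)
id_key_b = ciphertext_id(b"key-b", db, inputs)
stored_blob = toy_encrypt(b"key-a", serialize(assemble(db, inputs)))
```

With the second form the server can check a blob against its name without the key, but the same transaction gets a different identifier under a different key. Real encryption would normally also use a fresh nonce per message, in which case even the same key would not reproduce the identifier.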

I can see two reasons for using content addressing:

- To avoid potential collisions when rewriting bits of history — if my client tries to delete old data from `abcdef12345-1` and does so by uploading `abcdef12345-2`, it can collide with another client that's rewriting the same blob.
- To avoid the need to correctly map and maintain identifiers from a transaction record.
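
The first reason can be sketched as follows. This is a toy model, not the project's code: `ChunkStore`, `NameCollision`, `next_suffix_name`, and `content_name` are all hypothetical, and the suffix scheme is just my reading of the `abcdef12345-1` / `abcdef12345-2` example.

```python
import hashlib


class NameCollision(Exception):
    """Raised when a name is already taken by a different blob."""


class ChunkStore:
    def __init__(self):
        self.chunks = {}

    def put(self, name: str, blob: bytes) -> None:
        existing = self.chunks.get(name)
        if existing is not None and existing != blob:
            raise NameCollision(name)
        self.chunks[name] = blob


def next_suffix_name(name: str) -> str:
    # "abcdef12345-1" -> "abcdef12345-2": every rewriter picks the same next name.
    base, _, n = name.rpartition("-")
    return f"{base}-{int(n) + 1}"


def content_name(blob: bytes) -> str:
    # Derived from the uploaded bytes, so different rewrites get different names.
    return hashlib.sha256(blob).hexdigest()


rewrite_a = b"client A's rewritten chunk"
rewrite_b = b"client B's rewritten chunk"

# Suffix naming: both clients target "abcdef12345-2" and the second upload fails.
suffix_store = ChunkStore()
suffix_store.put(next_suffix_name("abcdef12345-1"), rewrite_a)
try:
    suffix_store.put(next_suffix_name("abcdef12345-1"), rewrite_b)
    collided = False
except NameCollision:
    collided = True

# Content-derived naming: the two rewrites land side by side.
hashed_store = ChunkStore()
hashed_store.put(content_name(rewrite_a), rewrite_a)
hashed_store.put(content_name(rewrite_b), rewrite_b)
```

Note that this only removes the naming conflict. Something still has to decide which of the two rewrites wins.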

I think it's worth spending a few brain cycles on this at some point.

We can certainly do hashed identifiers, but we might want to document the limitations that we impose: for example, that we won't do consistent encoding (e.g., JOSE), that the hash depends on the encryption key in use, and that it's really just an opaque identifier that happens to be derived from the content, rather than being useful for content-based lookup…
