This repository was archived by the owner on Mar 31, 2021. It is now read-only.
Rethink or discuss chunk content addressing #3
Open
Description
I believe that the code doesn't enforce that chunks are named by a hash of their content. But the docs talk about it, so I wanted to brain-dump a little.
For something like Git, using content addressing makes sense:
- Data is stored locally as blobs identified by content address, so everyone participating in the protocol already has the value.
- Most of the data is opaque (file contents), so it's a small extension to store trees and such in the same manner.
- Everything is cleartext, so there's a straightforward mapping between data and identifier: `sha256(data)`.
- We want file contents to collide with files already stored in the object store; that's how deduping works.
- The object store is used directly to find commits and trees by hash.
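To make the Git-style case concrete, here is a minimal sketch of a content-addressed store. It follows the simplified `sha256(data)` mapping used in the list above; real Git hashes a typed header plus the content and has historically used SHA-1. `BlobStore` and its methods are hypothetical names for illustration only.

```python
import hashlib


class BlobStore:
    """Toy content-addressed store: a blob's name is sha256 of its bytes."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        oid = hashlib.sha256(data).hexdigest()
        # Identical content maps to the same name, so a repeated put
        # stores nothing new. This is the "collision" that gives dedup.
        self._blobs.setdefault(oid, data)
        return oid

    def get(self, oid: str) -> bytes:
        data = self._blobs[oid]
        # Anyone holding the blob can re-check that it matches its name.
        if hashlib.sha256(data).hexdigest() != oid:
            raise ValueError("blob does not match its identifier")
        return data

    def __len__(self) -> int:
        return len(self._blobs)


store = BlobStore()
first = store.put(b"hello\n")
second = store.put(b"hello\n")   # same content, same identifier
other = store.put(b"world\n")
```

Here `first` and `second` are the same identifier and the store holds only two blobs, which is the deduping behaviour described above.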
These things aren't true for our purposes.
- Data is not stored locally as blobs: it's structured, so the client needs to assemble the data and serialize it in order to compute the hash.
- None of our data is opaque, so there's no data for which computing a hash is a single step: the hash of the cleartext content is `sha256(serialize(assemble(db, inputs)))`.
- It's encrypted on the server, which means that either the hash has no relation to the stored blob (no validation possible), or the hash is computed as `sha256(encrypt(key, serialize(assemble(db, inputs))))`, which is a long pipeline that involves the key in the production of the identifier.
- We don't ever expect a collision: two identical transactions should never occur, and so a hash collision is a problem, not an opportunity.
- We don't expect to find data directly by hash: we'll find out an identifier from an external source, or (more commonly) from a transaction itself.
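A rough sketch of the two identifier pipelines from the list above may help show why the key ends up in the identifier. Everything here is an assumption made for illustration: `assemble`, `serialize`, and `chunk_id` are hypothetical helpers, JSON stands in for whatever encoding is actually used, and `toy_encrypt` is a deterministic XOR keystream that is NOT real encryption. With a real cipher that uses a random nonce, the ciphertext (and so the identifier) would also differ between runs.

```python
import hashlib
import hmac
import json


def assemble(db: dict, inputs: list) -> dict:
    # Hypothetical: gather the requested records from structured local state.
    return {"rows": [db[name] for name in inputs]}


def serialize(record: dict) -> bytes:
    # Hypothetical encoding: compact JSON with sorted keys.
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def toy_encrypt(key: bytes, plaintext: bytes) -> bytes:
    # Stand-in only, NOT secure: XOR with an HMAC-SHA256 counter keystream.
    stream = bytearray()
    counter = 0
    while len(stream) < len(plaintext):
        block = hmac.new(key, counter.to_bytes(8, "big"), hashlib.sha256).digest()
        stream.extend(block)
        counter += 1
    return bytes(p ^ k for p, k in zip(plaintext, stream))


def chunk_id(key: bytes, db: dict, inputs: list) -> str:
    # sha256(encrypt(key, serialize(assemble(db, inputs))))
    return hashlib.sha256(toy_encrypt(key, serialize(assemble(db, inputs)))).hexdigest()


db = {"t1": {"op": "add", "v": 1}, "t2": {"op": "retract", "v": 1}}
inputs = ["t1", "t2"]

cleartext_id = hashlib.sha256(serialize(assemble(db, inputs))).hexdigest()
id_key_a = chunk_id(b"key-a", db, inputs)
id_key_b = chunk_id(b"key-b", db, inputs)
```

The same logical content yields three different identifiers here: one for the cleartext and one per key. A server that only sees the stored blob can validate `id_key_a` against it, but cannot relate it to `cleartext_id`.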
I can see two reasons for using content addressing:
- To avoid potential collisions when rewriting bits of history: if my client tries to delete old data from `abcdef12345-1` and does so by uploading `abcdef12345-2`, it can collide with another client that's rewriting the same blob.
- To avoid the need to correctly map and maintain identifiers from a transaction record.
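The first reason can be sketched as follows. This assumes, based only on the `abcdef12345-1` example above, that the trailing number is a revision counter bumped on each rewrite; `next_sequential_name` and `content_name` are hypothetical helpers, and the blob contents are made up.

```python
import hashlib


def next_sequential_name(current: str) -> str:
    # Assumed scheme: "<base>-<revision>", where a rewrite bumps the revision.
    base, _, revision = current.rpartition("-")
    return f"{base}-{int(revision) + 1}"


def content_name(blob: bytes) -> str:
    # Alternative scheme: the name is derived from the uploaded bytes.
    return hashlib.sha256(blob).hexdigest()


current = "abcdef12345-1"

# Two clients independently rewrite the same chunk with different results.
rewrite_a = b"chunk with old rows 1-3 removed"
rewrite_b = b"chunk with old rows 1-5 removed"

seq_a = next_sequential_name(current)
seq_b = next_sequential_name(current)

hash_a = content_name(rewrite_a)
hash_b = content_name(rewrite_b)
```

Both clients pick `abcdef12345-2` under the counter scheme, so one upload would clobber the other, while the hash-derived names differ whenever the rewritten bytes differ.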
I think it's worth spending a few brain cycles on this at some point.
We can certainly do hashed identifiers, but we might want to document the limitations that we impose: for example, that we won't do consistent encoding (e.g., JOSE), that the hash depends on the encryption key in use, and that it's really just an opaque identifier that happens to be derived from the content, rather than being useful for content-based lookup…
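As a small illustration of the encoding limitation, the sketch below hashes two valid JSON encodings of the same record. JSON is only an assumed stand-in for the real serialization, and the record is made up.

```python
import hashlib
import json

record = {"tx": 42, "op": "add"}

# Two equally valid encodings of the same logical record.
compact = json.dumps(record, separators=(",", ":")).encode("utf-8")
pretty = json.dumps(record, sort_keys=True, indent=2).encode("utf-8")

id_compact = hashlib.sha256(compact).hexdigest()
id_pretty = hashlib.sha256(pretty).hexdigest()

# Both decode to the same value, yet the identifiers differ.
same_content = json.loads(compact) == json.loads(pretty)
```

Without a canonical encoding, the identifier names one particular byte string rather than the logical content, so a second client cannot generally recompute it from its own copy of the data.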