From 7f26b34cc35bd7795198cffb2f04f8a2189cf8cd Mon Sep 17 00:00:00 2001
From: Martin Hutchinson
Date: Wed, 6 May 2026 11:29:19 +0000
Subject: [PATCH 1/3] [Docs] Added section on costs for VIndex

The costs and benefits of running a VIndex are subtle. Added this section
to introduce a framing for how to reason about it.
---
 vindex/docs/v1/README.md | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/vindex/docs/v1/README.md b/vindex/docs/v1/README.md
index b3a0663..8a7cb6c 100644
--- a/vindex/docs/v1/README.md
+++ b/vindex/docs/v1/README.md
@@ -65,6 +65,26 @@ The parsed outputs from the `MapFn` are physically separated into two purpose-bu
 - **Reading**: The VIndex is exclusively queried at its *latest* version; there are no historic queries. The system returns the latest list of matching indices in the Input Log, alongside inclusion proofs tying those results to the map's current root hash. Because the list of indices is structured as an append-only Merkle tree, a client who previously fetched the index at size N can simply request the delta up to size M and use a [compact range](https://github.com/transparency-dev/merkle/blob/main/docs/compact_ranges.md) to locally reconstruct and verify their historical state. No consistency proof is computed or returned by the server.
 - **Verifying**: The brand new append-only **Output Log** exists entirely for auditing. Anyone with compute resources can act as a verifier by running the universally specified MapFn against the Input Log to construct an identical local index. By comparing their computed root hash against the sequence of roots permanently published in the Output Log, they can verify every past state commitment. This guarantees that all past map revisions clients relied upon were constructed correctly, proving the operator never served an invalid map root.
 
+## Map Operation Costs
+
+Adding a Verifiable Index to a transparency deployment introduces a distinct profile of resource consumption alongside the primary append-only log. While the log handles low-compute sequencing and high-bandwidth serving, the index overlay requires additional compute and local storage to provide highly efficient, low-bandwidth serving.
+
+### Storage
+
+- **Content vs. Pointers**: Unlike the primary log which stores the full payload of every entry, the Verifiable Index does not store the original data. It stores search keys mapped to a list of pointers (8-byte indices) indicating where the full data resides in the Input Log.
+- **Overhead**: Storage requirements scale with both event volume and **key cardinality** (the number of unique search terms). The system maintains a Key-Value store mapping keys to their log of occurrences and a Merkle Prefix Trie (MPT) that stores a 32-byte root hash for each unique key ever observed. Consequently, the choice of `MapFn` and the distribution of subjects in the log drastically affect costs: a log with millions of entries for the *same* subject keeps the MPT extremely small, whereas a log with millions of *distinct* subjects forces the MPT to scale linearly with the number of keys.
+- **Optimization**: The bulk data storage is optimized to keep only the latest state for a key.
+
+### Compute
+
+- **Parsing Overhead**: The indexing process is relatively compute-intensive per leaf. Each entry must be fetched, cryptographically verified against the Input Log checkpoint, and parsed via the WebAssembly sandbox to extract mapping keys.
+- **Isolation**: This compute load is decoupled from the primary log's write path. The polling ingestion loop can be batched and pipelined asynchronously, ensuring that index processing does not add latency to the core log's sequencing or checkpointing.
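+
+The parsing step above can be sketched as a pure function from a raw entry to the keys it should be filed under (an illustrative Go signature only; the real `MapFn` is supplied as a WebAssembly module, and these names are hypothetical):
+
+```go
+// mapFn extracts the search keys for one verified Input Log entry.
+type mapFn func(entry []byte) (keys []string)
+
+// indexEntry files the entry's 8-byte log position under each extracted key;
+// only the pointer is stored, never the entry payload itself.
+func indexEntry(fn mapFn, pointers map[string][]uint64, pos uint64, entry []byte) {
+	for _, key := range fn(entry) {
+		pointers[key] = append(pointers[key], pos)
+	}
+}
+```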
+
+### Network Egress
+
+- **Targeted Queries**: Traditional logs require monitors to download and process all leaves of the log to find entries of interest. The Verifiable Index allows clients to query for specific keys and receive targeted lists of pointers, with small inclusion proofs.
+- **Egress Reduction**: This transforms a bulk data distribution problem into a low-bandwidth query service. While operation requires additional storage and compute, it significantly reduces network egress pressure associated with log scraping.
+
 ## Design Rationale for Transparency Experts
 
 For those familiar with Key Transparency (KT), Merkle Tree Certs (MTC), or other verifiable maps, the "Map Sandwich" architecture makes an intentional departure from standard map designs:

From 4c016558f436aba4460894e12f86820a22b1c0ae Mon Sep 17 00:00:00 2001
From: Martin Hutchinson
Date: Wed, 6 May 2026 13:36:18 +0000
Subject: [PATCH 2/3] Reviewer comments

---
 vindex/docs/v1/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/vindex/docs/v1/README.md b/vindex/docs/v1/README.md
index 8a7cb6c..b224fe7 100644
--- a/vindex/docs/v1/README.md
+++ b/vindex/docs/v1/README.md
@@ -72,7 +72,7 @@ Adding a Verifiable Index to a transparency deployment introduces a distinct pro
 ### Storage
 
 - **Content vs. Pointers**: Unlike the primary log which stores the full payload of every entry, the Verifiable Index does not store the original data. It stores search keys mapped to a list of pointers (8-byte indices) indicating where the full data resides in the Input Log.
-- **Overhead**: Storage requirements scale with both event volume and **key cardinality** (the number of unique search terms). The system maintains a Key-Value store mapping keys to their log of occurrences and a Merkle Prefix Trie (MPT) that stores a 32-byte root hash for each unique key ever observed. Consequently, the choice of `MapFn` and the distribution of subjects in the log drastically affect costs: a log with millions of entries for the *same* subject keeps the MPT extremely small, whereas a log with millions of *distinct* subjects forces the MPT to scale linearly with the number of keys.
+- **Overhead**: Storage requirements scale with the number of log entries, but most importantly, with **key cardinality** (the number of unique search terms). See [The Index Data Structure](#2-the-index-data-structure) for details on how state is maintained. Consequently, the choice of `MapFn` and the distribution of subjects in the log drastically affect costs: a log with millions of entries for the *same* subject keeps the MPT extremely small, whereas a log with millions of *distinct* subjects forces the MPT to scale linearly with the number of keys.
 - **Optimization**: The bulk data storage is optimized to keep only the latest state for a key.
 
 ### Compute

From 075616b229fca9c7d67ac22cf4c4c83a646ce730 Mon Sep 17 00:00:00 2001
From: Martin Hutchinson
Date: Wed, 6 May 2026 13:52:29 +0000
Subject: [PATCH 3/3] More improvements from Al suggestions

---
 vindex/docs/v1/README.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/vindex/docs/v1/README.md b/vindex/docs/v1/README.md
index b224fe7..d0258c0 100644
--- a/vindex/docs/v1/README.md
+++ b/vindex/docs/v1/README.md
@@ -78,12 +78,12 @@ Adding a Verifiable Index to a transparency deployment introduces a distinct pro
 ### Compute
 
 - **Parsing Overhead**: The indexing process is relatively compute-intensive per leaf. Each entry must be fetched, cryptographically verified against the Input Log checkpoint, and parsed via the WebAssembly sandbox to extract mapping keys.
-- **Isolation**: This compute load is decoupled from the primary log's write path. The polling ingestion loop can be batched and pipelined asynchronously, ensuring that index processing does not add latency to the core log's sequencing or checkpointing.
+- **Isolation**: This compute load is decoupled from the primary log's write path. The polling ingestion loop can be batched and pipelined asynchronously, ensuring that index processing does not add latency to the Input Log's sequencing or checkpointing.
 
 ### Network Egress
 
-- **Targeted Queries**: Traditional logs require monitors to download and process all leaves of the log to find entries of interest. The Verifiable Index allows clients to query for specific keys and receive targeted lists of pointers, with small inclusion proofs.
-- **Egress Reduction**: This transforms a bulk data distribution problem into a low-bandwidth query service. While operation requires additional storage and compute, it significantly reduces network egress pressure associated with log scraping.
+- **Targeted Queries**: Traditional logs require monitors to download and process all leaves of the log to find entries of interest. The Verifiable Index allows verifiers to query for specific keys and receive targeted lists of pointers, with small inclusion proofs.
+- **Egress Reduction**: Serving data is often the most costly component of log operations. This transforms a bulk data distribution problem into a low-bandwidth query service, significantly reducing network egress costs associated with log scraping.
 
 ## Design Rationale for Transparency Experts