Support lazy reading of CDB64 root TX indexes from remote sources #569

@djwhitt

Summary

Add support for reading CDB64 root TX index files lazily from remote sources, enabling gateways to use distributed index files without requiring local storage. This extends the existing Cdb64RootTxIndex to support local files, Arweave TX IDs (with optional offset addressing), and arbitrary HTTP endpoints as index sources.

Background

The current Cdb64RootTxIndex implementation (src/discovery/cdb64-root-tx-index.ts) provides O(1) lookups of data item ID → root TX ID mappings from pre-built CDB64 files stored locally. This works well but requires:

  • Downloading entire index files before use
  • Local storage for potentially large index files
  • Manual distribution/syncing of index files

Since the existing ContiguousDataSource interface already supports range-based fetching via the region?: Region parameter, and HTTP servers commonly support Range headers, we can fetch only the bytes needed for each lookup from CDB64 files stored remotely.

Requirements

Must Have

  • ByteRangeSource abstraction: Interface for random-access byte reads that can be backed by local files, Arweave, or HTTP endpoints
  • FileByteRangeSource: Implementation using fs.FileHandle for local files (minimal overhead wrapper)
  • ContiguousDataByteRangeSource: Implementation using ContiguousDataSource.getData() with region support for Arweave
  • HttpByteRangeSource: Implementation using HTTP Range requests for arbitrary URLs (S3, CDN, dedicated servers)
  • Refactored Cdb64Reader: Use ByteRangeSource instead of direct file handle access
  • Mixed source support: Allow configuring local files, Arweave TX IDs, and HTTP URLs as index sources
  • Caching for remote sources: Cache the 4096-byte header for the lifetime of the source and keep hash table regions in an LRU cache to minimize network round trips
  • Offset-based index addressing: Support specifying index sources as ID:offset:size tuples where:
    • ID is the L1 transaction ID (the bundle containing the index)
    • offset is the byte offset where the data item's data starts within the bundle
    • size is the size of the data item data
    • This allows addressing CDB64 indexes stored as data items within bundles without requiring them to be indexed first (bootstrapping scenario)

Should Have

  • Configurable source order: Local files first (faster), then remote sources
  • Graceful degradation: If a remote source is unavailable, continue with other sources
  • Metrics: Track cache hit rates and fetch latencies for remote sources
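
A rough sketch of how source ordering and graceful degradation could fit together. The `RootTxLookup` interface, `lookupWithFallback`, and the `onError` hook are hypothetical names used only for illustration; they are not part of the existing code.

```typescript
import { Buffer } from 'node:buffer';

// Hypothetical minimal view of one index source. `get` resolves to
// undefined when the key is absent and rejects when the source is unavailable.
interface RootTxLookup {
  name: string;
  get(key: Buffer): Promise<Buffer | undefined>;
}

// Try each source in the configured order (local first, then remote).
// A failing source is reported and skipped rather than aborting the lookup.
async function lookupWithFallback(
  sources: RootTxLookup[],
  key: Buffer,
  onError: (name: string, error: unknown) => void = () => {},
): Promise<Buffer | undefined> {
  for (const source of sources) {
    try {
      const value = await source.get(key);
      if (value !== undefined) {
        return value;
      }
    } catch (error) {
      // Graceful degradation: record the failure and move to the next source
      onError(source.name, error);
    }
  }
  return undefined;
}
```

The `onError` callback is also a natural place to increment a failure metric per source.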

Won't Have (for now)

  • Automatic discovery of index TX IDs (requires separate manifest/registry)
  • Write support for remote indexes (read-only)
  • Chunk-based fetching for Arweave (HTTP Range requests via gateways are used instead)

Technical Design

ByteRangeSource Interface

interface ByteRangeSource {
  /** Read bytes at offset */
  read(offset: number, length: number): Promise<Buffer>;
  /** Total size in bytes, or undefined if unknown (used for validation) */
  getSize?(): Promise<number | undefined>;
  /** Cleanup resources */
  close?(): Promise<void>;
}
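
For the unit tests with a mock ByteRangeSource, an in-memory implementation may be enough. The sketch below is an assumption about what such a test double could look like; `BufferByteRangeSource` and `readCount` are hypothetical names, and the interface is repeated with only `read` so the block stands alone.

```typescript
import { Buffer } from 'node:buffer';

// Only the required method of the proposed interface is repeated here.
interface ByteRangeSource {
  read(offset: number, length: number): Promise<Buffer>;
}

// Hypothetical in-memory source for tests; counts reads so a test can
// assert how many fetches a lookup would have issued against a remote source.
class BufferByteRangeSource implements ByteRangeSource {
  public readCount = 0;

  constructor(private readonly data: Buffer) {}

  async read(offset: number, length: number): Promise<Buffer> {
    if (offset < 0 || length < 0 || offset + length > this.data.length) {
      throw new RangeError(
        `Read out of bounds: ${offset}+${length} > ${this.data.length}`,
      );
    }
    this.readCount++;
    // Copy so callers cannot mutate the backing buffer
    return Buffer.from(this.data.subarray(offset, offset + length));
  }

  async getSize(): Promise<number> {
    return this.data.length;
  }
}
```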

IndexSourceRef Type

/** Reference to a CDB64 index source */
type IndexSourceRef =
  | { type: 'file'; path: string }
  | { type: 'tx'; id: string }  // Simple TX ID (indexed data item or L1 TX)
  | { type: 'tx'; id: string; offset: number; size: number }  // Bundle data item with offset
  | { type: 'http'; url: string };
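
Because both 'tx' variants share the same `type` discriminant, code consuming an IndexSourceRef has to check for the extra fields to tell them apart. A small sketch, assuming a hypothetical `describeIndexSource` helper for log output:

```typescript
// Repeated from the proposal so the sketch stands alone.
type IndexSourceRef =
  | { type: 'file'; path: string }
  | { type: 'tx'; id: string }
  | { type: 'tx'; id: string; offset: number; size: number }
  | { type: 'http'; url: string };

// Hypothetical helper: produce a short label for logs and metrics.
function describeIndexSource(ref: IndexSourceRef): string {
  switch (ref.type) {
    case 'file':
      return `file ${ref.path}`;
    case 'http':
      return `http ${ref.url}`;
    case 'tx':
      // The `in` check narrows to the variant carrying offset and size
      if ('offset' in ref) {
        return `tx ${ref.id} (offset ${ref.offset}, size ${ref.size})`;
      }
      return `tx ${ref.id}`;
  }
}
```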

Implementations

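// Local file - wraps fs.FileHandle
class FileByteRangeSource implements ByteRangeSource {
  constructor(private fileHandle: fs.FileHandle) {}

  async read(offset: number, length: number): Promise<Buffer> {
    const buffer = Buffer.alloc(length);
    const { bytesRead } = await this.fileHandle.read(buffer, 0, length, offset);
    // A short read typically means the requested range extends past end of file
    if (bytesRead !== length) {
      throw new Error(`Short read: expected ${length} bytes, got ${bytesRead}`);
    }
    return buffer;
  }
}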

// Arweave - uses existing ContiguousDataSource with region support
// Supports optional base offset for addressing data items within bundles
class ContiguousDataByteRangeSource implements ByteRangeSource {
  private dataSource: ContiguousDataSource;
  private txId: string;
  private baseOffset: number;  // Offset within the TX/bundle where CDB data starts
  private totalSize?: number;  // Total size of the CDB data (for bounds checking)

  constructor(
    dataSource: ContiguousDataSource,
    txId: string,
    baseOffset: number = 0,
    totalSize?: number,
  ) {
    this.dataSource = dataSource;
    this.txId = txId;
    this.baseOffset = baseOffset;
    this.totalSize = totalSize;
  }

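  async read(offset: number, length: number): Promise<Buffer> {
    // Reject negative ranges, which would otherwise reach bytes before baseOffset
    if (offset < 0 || length < 0) {
      throw new Error(`Invalid read range: ${offset}+${length}`);
    }
    // Bounds checking if size is known
    if (this.totalSize !== undefined && offset + length > this.totalSize) {
      throw new Error(`Read beyond CDB bounds: ${offset}+${length} > ${this.totalSize}`);
    }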

    const result = await this.dataSource.getData({
      id: this.txId,
      region: {
        offset: this.baseOffset + offset,  // Translate to absolute offset
        size: length,
      },
    });
    return streamToBuffer(result.stream);
  }

  async getSize(): Promise<number | undefined> {
    return this.totalSize;
  }
}

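// HTTP - uses Range headers for arbitrary URLs (S3, CDN, etc.)
class HttpByteRangeSource implements ByteRangeSource {
  async read(offset: number, length: number): Promise<Buffer> {
    const response = await this.httpClient.get(this.url, {
      headers: {
        Range: `bytes=${offset}-${offset + length - 1}`,
      },
      responseType: 'arraybuffer',
    });
    // A 200 response means the server ignored the Range header and sent the whole file
    if (response.status !== 206) {
      throw new Error(`Expected 206 Partial Content, got ${response.status}`);
    }
    return Buffer.from(response.data);
  }
}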

// Caching wrapper - critical for remote source performance
class CachingByteRangeSource implements ByteRangeSource {
  // Cache header permanently, LRU for hash table regions
}
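
One possible shape for the caching wrapper, as a sketch rather than a final design. It assumes the total size is known up front, fetches fixed-size aligned blocks beyond the header, and uses a `Map` as a simple LRU; the `totalSize`, `blockSize`, and `maxBlocks` parameters and their defaults are placeholders.

```typescript
import { Buffer } from 'node:buffer';

interface ByteRangeSource {
  read(offset: number, length: number): Promise<Buffer>;
}

const HEADER_SIZE = 4096;

class CachingByteRangeSource implements ByteRangeSource {
  private header?: Buffer;
  // Map iteration order is insertion order, so the first key is least recently used
  private blocks = new Map<number, Buffer>();

  constructor(
    private readonly inner: ByteRangeSource,
    private readonly totalSize: number,
    private readonly blockSize = 4096,
    private readonly maxBlocks = 256,
  ) {}

  async read(offset: number, length: number): Promise<Buffer> {
    // Reads entirely inside the header are served from the permanent cache
    if (offset + length <= HEADER_SIZE) {
      if (this.header === undefined) {
        this.header = await this.inner.read(0, Math.min(HEADER_SIZE, this.totalSize));
      }
      return this.header.subarray(offset, offset + length);
    }
    // Everything else is assembled from aligned, LRU-cached blocks
    const first = Math.floor(offset / this.blockSize);
    const last = Math.floor((offset + length - 1) / this.blockSize);
    const parts: Buffer[] = [];
    for (let i = first; i <= last; i++) {
      parts.push(await this.getBlock(i));
    }
    const start = offset - first * this.blockSize;
    return Buffer.concat(parts).subarray(start, start + length);
  }

  private async getBlock(index: number): Promise<Buffer> {
    const cached = this.blocks.get(index);
    if (cached !== undefined) {
      // Re-insert to mark the block as most recently used
      this.blocks.delete(index);
      this.blocks.set(index, cached);
      return cached;
    }
    const blockStart = index * this.blockSize;
    // Clamp the final block so the inner source is never read past its end
    const blockLength = Math.min(this.blockSize, this.totalSize - blockStart);
    const block = await this.inner.read(blockStart, blockLength);
    this.blocks.set(index, block);
    if (this.blocks.size > this.maxBlocks) {
      const oldest = this.blocks.keys().next().value;
      if (oldest !== undefined) {
        this.blocks.delete(oldest);
      }
    }
    return block;
  }
}
```

Fetching whole blocks trades extra bytes per request for fewer round trips: adjacent hash table slots and the two parts of a record often land in the same block. Whether that trade-off holds for real index files would need to be measured.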

Parsing Logic for Index Source Specs

function parseIndexSourceId(spec: string): IndexSourceRef {
  const parts = spec.split(':');

  if (parts.length === 1) {
    // Simple TX ID: "TxId123"
    return { type: 'tx', id: parts[0] };
  }

  if (parts.length === 3) {
    // TX with offset: "TxId123:1048576:52428800"
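    const [id, offsetStr, sizeStr] = parts;
    // Require plain decimal digits; parseInt alone would accept "12abc" as 12
    const isDigits = (s: string) => /^\d+$/.test(s);
    const offset = parseInt(offsetStr, 10);
    const size = parseInt(sizeStr, 10);

    if (!isDigits(offsetStr) || !isDigits(sizeStr) || size <= 0) {
      throw new Error(`Invalid offset/size in index spec: ${spec}`);
    }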

    return { type: 'tx', id, offset, size };
  }

  throw new Error(`Invalid index source spec: ${spec}`);
}

CDB64 Lookup Access Pattern

Each lookup requires reading:

  1. Header (4096 bytes) - table pointers, cached permanently
  2. Hash table slots (16 bytes each) - linear probing, 1-N reads
  3. Record (16 byte header + 32 byte key + ~50 byte value) - verification + data

With caching, typical lookups would be:

  • Local file: Same as today (negligible abstraction overhead)
  • Remote (warm cache): 1-2 network requests for hash table + record
  • Remote (cold): 2-3 network requests (header + hash table + record)
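
The access pattern above can be sketched against a ByteRangeSource as follows. This assumes the classic cdb layout widened to 64-bit little-endian fields (256 header entries of table position and slot count, slots of hash and record position, the cdb hash starting at 5381); the actual format in src/lib/cdb64.ts may differ, and `cdb64Hash` and `cdb64Get` are hypothetical names.

```typescript
import { Buffer } from 'node:buffer';

interface ByteRangeSource {
  read(offset: number, length: number): Promise<Buffer>;
}

const MASK64 = (1n << 64n) - 1n;

// Assumed hash: the classic cdb hash carried out in 64 bits
function cdb64Hash(key: Buffer): bigint {
  let h = 5381n;
  for (let i = 0; i < key.length; i++) {
    h = (((h << 5n) + h) ^ BigInt(key.readUInt8(i))) & MASK64;
  }
  return h;
}

async function cdb64Get(
  source: ByteRangeSource,
  key: Buffer,
): Promise<Buffer | undefined> {
  const hash = cdb64Hash(key);

  // 1. Header entry for this key's table (lies inside the first 4096 bytes)
  const entry = await source.read(Number(hash & 0xffn) * 16, 16);
  const tablePos = Number(entry.readBigUInt64LE(0));
  const slotCount = entry.readBigUInt64LE(8);
  if (slotCount === 0n) return undefined;

  // 2. Linear probing over 16-byte slots, starting at the hashed position
  let slot = (hash >> 8n) % slotCount;
  for (let probes = 0n; probes < slotCount; probes++) {
    const slotBuf = await source.read(tablePos + Number(slot) * 16, 16);
    const slotHash = slotBuf.readBigUInt64LE(0);
    const recordPos = Number(slotBuf.readBigUInt64LE(8));
    if (recordPos === 0) return undefined; // empty slot ends the probe chain

    if (slotHash === hash) {
      // 3. Record: 16-byte length header, then key and value
      const head = await source.read(recordPos, 16);
      const keyLen = Number(head.readBigUInt64LE(0));
      const valueLen = Number(head.readBigUInt64LE(8));
      const body = await source.read(recordPos + 16, keyLen + valueLen);
      // Verify the key, since different keys can share a hash
      if (body.subarray(0, keyLen).equals(key)) {
        return body.subarray(keyLen);
      }
    }
    slot = (slot + 1n) % slotCount;
  }
  return undefined;
}
```

Note that this sketch reads the record in two steps (length header, then body), so without a block cache it issues one more request per hit than the counts listed above.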

Configuration

# Existing - local files
CDB64_ROOT_TX_INDEX_PATH=/path/to/indexes/

# Arweave TX IDs - simple format (indexed data items or L1 TXs)
CDB64_ROOT_TX_INDEX_IDS=TxId123,TxId456

# Arweave TX IDs - with offset for unindexed bundle data items
# Format: txId:offset:size (colon-separated to avoid ambiguity with base64url)
CDB64_ROOT_TX_INDEX_IDS=BundleTxId1:1048576:52428800,BundleTxId2:0:10485760

# Mixed formats supported in same config
CDB64_ROOT_TX_INDEX_IDS=IndexedDataItem1,BundleTxId:1048576:52428800,IndexedDataItem2

# HTTP URLs (comma-separated, supports S3, CDN, dedicated servers)
CDB64_ROOT_TX_INDEX_URLS=https://indexes.example.com/root.cdb,https://s3.amazonaws.com/bucket/index.cdb
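
A sketch of how these variables might be turned into an ordered source list. `buildIndexSources`, `splitList`, and `parseTxSpec` are hypothetical helpers; the ordering of TX sources before HTTP sources is an assumption (the requirement only states local first), and the local path is passed through as a single entry without expanding the directory.

```typescript
type IndexSourceRef =
  | { type: 'file'; path: string }
  | { type: 'tx'; id: string }
  | { type: 'tx'; id: string; offset: number; size: number }
  | { type: 'http'; url: string };

// Condensed version of the spec parsing: "id" or "id:offset:size"
function parseTxSpec(spec: string): IndexSourceRef {
  const parts = spec.split(':');
  const [id = '', offsetStr = '', sizeStr = ''] = parts;
  if (id !== '' && parts.length === 1) {
    return { type: 'tx', id };
  }
  if (
    id !== '' &&
    parts.length === 3 &&
    /^\d+$/.test(offsetStr) &&
    /^\d+$/.test(sizeStr) &&
    Number(sizeStr) > 0
  ) {
    return { type: 'tx', id, offset: Number(offsetStr), size: Number(sizeStr) };
  }
  throw new Error(`Invalid index source spec: ${spec}`);
}

// Split a comma-separated value, tolerating whitespace and empty entries
function splitList(value: string | undefined): string[] {
  return (value ?? '')
    .split(',')
    .map((item) => item.trim())
    .filter((item) => item.length > 0);
}

function buildIndexSources(
  env: Record<string, string | undefined>,
): IndexSourceRef[] {
  const sources: IndexSourceRef[] = [];

  // Local first, since it is the fastest source
  const localPath = env['CDB64_ROOT_TX_INDEX_PATH'];
  if (localPath) {
    sources.push({ type: 'file', path: localPath });
  }
  for (const spec of splitList(env['CDB64_ROOT_TX_INDEX_IDS'])) {
    sources.push(parseTxSpec(spec));
  }
  for (const url of splitList(env['CDB64_ROOT_TX_INDEX_URLS'])) {
    new URL(url); // throws on a malformed URL so bad config fails at startup
    sources.push({ type: 'http', url });
  }
  return sources;
}
```

Commas are safe as the list separator for TX specs because base64url IDs contain neither commas nor colons; a URL containing a literal comma would need a different delimiter or encoding.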

Use Cases

  1. Indexed data item: CDB64_ROOT_TX_INDEX_IDS=DataItemTxId - Gateway fetches via normal ContiguousDataSource which handles resolution
  2. L1 transaction: CDB64_ROOT_TX_INDEX_IDS=L1TxId - Direct fetch of L1 TX data
  3. Unindexed bundle data item: CDB64_ROOT_TX_INDEX_IDS=BundleTxId:1048576:52428800 - Fetch from specific offset within bundle, useful when the CDB index itself isn't indexed yet (bootstrapping scenario)

Files to Modify

  • src/lib/cdb64.ts - Refactor Cdb64Reader to use ByteRangeSource
  • src/lib/byte-range-source.ts - New file with interface and implementations
  • src/discovery/cdb64-root-tx-index.ts - Support mixed local/Arweave/HTTP sources with offset addressing
  • src/config.ts - Add CDB64_ROOT_TX_INDEX_IDS and CDB64_ROOT_TX_INDEX_URLS configs
  • src/system.ts - Wire up ContiguousDataSource for Arweave-backed indexes

Testing

  • Unit tests for ByteRangeSource implementations
  • Unit tests for refactored Cdb64Reader with mock ByteRangeSource
  • Unit tests for index source spec parsing (simple ID vs ID:offset:size)
  • Integration tests with actual CDB64 files via all source types
  • Integration tests for offset-based addressing within bundles
  • Performance comparison: local vs remote (with/without cache)

Performance Considerations

  • Local files: Negligible overhead from abstraction (one extra function call)
  • Remote sources: Network latency dominates; caching is critical
    • Header cache: Eliminates 1 round trip per lookup
    • Hash table region cache: Reduces probing costs
    • Consider prefetching common hash table regions on initialization

Future Enhancements

  • Index manifest TX that lists all index TX IDs for automatic discovery
  • Composite indexes spanning multiple TXs with routing hints
  • Background warming of remote index caches
