Summary
Add support for reading CDB64 root TX index files lazily from remote sources, enabling gateways to use distributed index files without requiring local storage. This extends the existing Cdb64RootTxIndex to support local files, Arweave TX IDs (with optional offset addressing), and arbitrary HTTP endpoints as index sources.
Background
The current Cdb64RootTxIndex implementation (src/discovery/cdb64-root-tx-index.ts) provides O(1) lookups of data item ID → root TX ID mappings from pre-built CDB64 files stored locally. This works well but requires:
- Downloading entire index files before use
- Local storage for potentially large index files
- Manual distribution/syncing of index files
Since the existing ContiguousDataSource interface already supports range-based fetching via the region?: Region parameter, and HTTP servers commonly support Range headers, we can fetch only the bytes needed for each lookup from CDB64 files stored remotely.
Requirements
Must Have
- ByteRangeSource abstraction: Interface for random-access byte reads that can be backed by local files, Arweave, or HTTP endpoints
- FileByteRangeSource: Implementation using fs.FileHandle for local files (minimal overhead wrapper)
- ContiguousDataByteRangeSource: Implementation using ContiguousDataSource.getData() with region support for Arweave
- HttpByteRangeSource: Implementation using HTTP Range requests for arbitrary URLs (S3, CDN, dedicated servers)
- Refactored Cdb64Reader: Use ByteRangeSource instead of direct file handle access
- Mixed source support: Allow configuring local files, Arweave TX IDs, and HTTP URLs as index sources
- Caching for remote sources: Cache header (4KB) permanently and use LRU for hash table regions to minimize network round trips
- Offset-based index addressing: Support specifying index sources as ID:offset:size tuples where:
  - ID is the L1 transaction ID (the bundle containing the index)
  - offset is the byte offset where the data item's data starts within the bundle
  - size is the size of the data item data
- This allows addressing CDB64 indexes stored as data items within bundles without requiring them to be indexed first (bootstrapping scenario)
Should Have
- Configurable source order: Local files first (faster), then remote sources
- Graceful degradation: If a remote source is unavailable, continue with other sources
- Metrics: Track cache hit rates and fetch latencies for remote sources
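The source-order and graceful-degradation items could be combined roughly as follows. This is a minimal sketch, not the proposed implementation: the `RootTxLookup` interface, its `getRootTxId` method, and the `onError` callback are hypothetical names introduced here for illustration.

```typescript
// Hypothetical minimal shape of one index source; the real class may differ.
interface RootTxLookup {
  name: string;
  getRootTxId(dataItemId: string): Promise<string | undefined>;
}

// Query sources in their configured order (local first, then remote).
// A source that throws is reported and skipped rather than failing the lookup.
async function lookupWithFallback(
  sources: RootTxLookup[],
  dataItemId: string,
  onError: (sourceName: string, err: unknown) => void = () => {},
): Promise<string | undefined> {
  for (const source of sources) {
    try {
      const rootTxId = await source.getRootTxId(dataItemId);
      if (rootTxId !== undefined) return rootTxId;
    } catch (err) {
      onError(source.name, err); // e.g. log and increment a failure metric
    }
  }
  return undefined; // no source had the key
}
```

Returning on the first hit keeps remote sources idle whenever a local file already holds the key.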
Won't Have (for now)
- Automatic discovery of index TX IDs (requires separate manifest/registry)
- Write support for remote indexes (read-only)
- Chunk-based fetching for Arweave (use HTTP Range requests via gateways)
Technical Design
ByteRangeSource Interface
interface ByteRangeSource {
  /** Read bytes at offset */
  read(offset: number, length: number): Promise<Buffer>;
  /** Total size if known (for validation); undefined when the size is unknown */
  getSize?(): Promise<number | undefined>;
  /** Cleanup resources */
  close?(): Promise<void>;
}
IndexSourceRef Type
/** Reference to a CDB64 index source */
type IndexSourceRef =
  | { type: 'file'; path: string }
  | { type: 'tx'; id: string } // Simple TX ID (indexed data item or L1 TX)
  | { type: 'tx'; id: string; offset: number; size: number } // Bundle data item with offset
  | { type: 'http'; url: string };
Implementations
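import * as fs from 'node:fs/promises';
// Local file - wraps fs.FileHandle
class FileByteRangeSource implements ByteRangeSource {
  constructor(private fileHandle: fs.FileHandle) {}
  async read(offset: number, length: number): Promise<Buffer> {
    const buffer = Buffer.alloc(length);
    const { bytesRead } = await this.fileHandle.read(buffer, 0, length, offset);
    // A read near EOF can return fewer bytes than requested; do not return zero padding
    return bytesRead < length ? buffer.subarray(0, bytesRead) : buffer;
  }
  async close(): Promise<void> {
    await this.fileHandle.close();
  }
}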
// Arweave - uses existing ContiguousDataSource with region support
// Supports optional base offset for addressing data items within bundles
class ContiguousDataByteRangeSource implements ByteRangeSource {
  private dataSource: ContiguousDataSource;
  private txId: string;
  private baseOffset: number; // Offset within the TX/bundle where CDB data starts
  private totalSize?: number; // Total size of the CDB data (for bounds checking)

  constructor(
    dataSource: ContiguousDataSource,
    txId: string,
    baseOffset: number = 0,
    totalSize?: number,
  ) {
    this.dataSource = dataSource;
    this.txId = txId;
    this.baseOffset = baseOffset;
    this.totalSize = totalSize;
  }

  async read(offset: number, length: number): Promise<Buffer> {
    // Bounds checking if size is known
    if (this.totalSize !== undefined && offset + length > this.totalSize) {
      throw new Error(`Read beyond CDB bounds: ${offset}+${length} > ${this.totalSize}`);
    }
    const result = await this.dataSource.getData({
      id: this.txId,
      region: {
        offset: this.baseOffset + offset, // Translate to absolute offset
        size: length,
      },
    });
    return streamToBuffer(result.stream);
  }

  async getSize(): Promise<number | undefined> {
    return this.totalSize;
  }
}
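// HTTP - uses Range headers for arbitrary URLs (S3, CDN, etc.)
class HttpByteRangeSource implements ByteRangeSource {
  constructor(
    private httpClient: AxiosInstance, // Assumed axios-style client, inferred from the get() call shape
    private url: string,
  ) {}

  async read(offset: number, length: number): Promise<Buffer> {
    const response = await this.httpClient.get(this.url, {
      headers: {
        Range: `bytes=${offset}-${offset + length - 1}`, // Range end is inclusive
      },
      responseType: 'arraybuffer',
    });
    // A server that ignores Range answers 200 with the whole file; require 206
    if (response.status !== 206) {
      throw new Error(`Expected 206 Partial Content, got ${response.status}`);
    }
    return Buffer.from(response.data);
  }
}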
// Caching wrapper - critical for remote source performance
class CachingByteRangeSource implements ByteRangeSource {
  // Cache header permanently, LRU for hash table regions
}
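One way the caching wrapper could work is sketched below, under stated assumptions: the block size and LRU capacity are illustrative defaults, the LRU is a plain `Map` relying on insertion order, and the wrapped source is assumed to return a short result (rather than throw) when a block extends past the end of the file. The sketch is typed against `Uint8Array`, which `Buffer` extends, so it runs without Node typings.

```typescript
interface ByteRangeSource {
  read(offset: number, length: number): Promise<Uint8Array>;
}

const HEADER_SIZE = 4096; // CDB64 header size stated in this proposal

class CachingByteRangeSource implements ByteRangeSource {
  private header?: Uint8Array; // cached permanently after the first read
  private blocks = new Map<number, Uint8Array>(); // insertion order doubles as LRU order

  constructor(
    private inner: ByteRangeSource,
    private blockSize = 4096, // assumed default
    private maxBlocks = 256, // assumed default
  ) {}

  async read(offset: number, length: number): Promise<Uint8Array> {
    // Reads that fall entirely inside the header never touch the LRU.
    if (offset + length <= HEADER_SIZE) {
      this.header ??= await this.inner.read(0, HEADER_SIZE);
      return this.header.subarray(offset, offset + length);
    }
    // Everything else is assembled from fixed-size, block-aligned pages.
    const out = new Uint8Array(length);
    let copied = 0;
    while (copied < length) {
      const pos = offset + copied;
      const index = Math.floor(pos / this.blockSize);
      const block = await this.getBlock(index);
      const start = pos - index * this.blockSize;
      const chunk = block.subarray(start, start + (length - copied));
      if (chunk.length === 0) throw new Error(`Read beyond end of source at ${pos}`);
      out.set(chunk, copied);
      copied += chunk.length;
    }
    return out;
  }

  private async getBlock(index: number): Promise<Uint8Array> {
    const hit = this.blocks.get(index);
    if (hit) {
      this.blocks.delete(index); // re-insert to mark as most recently used
      this.blocks.set(index, hit);
      return hit;
    }
    const block = await this.inner.read(index * this.blockSize, this.blockSize);
    this.blocks.set(index, block);
    if (this.blocks.size > this.maxBlocks) {
      const oldest = this.blocks.keys().next().value as number;
      this.blocks.delete(oldest); // evict the least recently used block
    }
    return block;
  }
}
```

Block-aligned caching means adjacent hash table slots probed during one lookup usually share a single fetch, at the cost of reading more bytes than strictly requested.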
Parsing Logic for Index Source Specs
function parseIndexSourceId(spec: string): IndexSourceRef {
  const parts = spec.split(':');
  if (parts.length === 1) {
    // Simple TX ID: "TxId123"
    return { type: 'tx', id: parts[0] };
  }
  if (parts.length === 3) {
    // TX with offset: "TxId123:1048576:52428800"
    const [id, offsetStr, sizeStr] = parts;
    const offset = parseInt(offsetStr, 10);
    const size = parseInt(sizeStr, 10);
    // Require plain decimal digits: parseInt alone would accept "12abc" as 12
    const digits = /^\d+$/;
    if (!digits.test(offsetStr) || !digits.test(sizeStr) || size <= 0) {
      throw new Error(`Invalid offset/size in index spec: ${spec}`);
    }
    return { type: 'tx', id, offset, size };
  }
  throw new Error(`Invalid index source spec: ${spec}`);
}
CDB64 Lookup Access Pattern
Each lookup requires reading:
- Header (4096 bytes) - table pointers, cached permanently
- Hash table slots (16 bytes each) - linear probing, 1-N reads
- Record (16 byte header + 32 byte key + ~50 byte value) - verification + data
With caching, typical lookups would be:
- Local file: Same as today (negligible abstraction overhead)
- Remote (warm cache): 1-2 network requests for hash table + record
- Remote (cold): 2-3 network requests (header + hash table + record)
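The read sequence above can be sketched against a `ByteRangeSource`. The on-disk layout here is an assumption: it follows the classic cdb format widened to 64-bit little-endian fields (256 header entries of position plus slot count, slots of hash plus record position, records prefixed by key and value lengths) with the djb-style cdb hash computed in 64 bits. The project's actual CDB64 format and hash may differ, so treat this as an illustration of the access pattern only. It uses `Uint8Array` (which `Buffer` extends) and `bigint`.

```typescript
interface ByteRangeSource {
  read(offset: number, length: number): Promise<Uint8Array>;
}

// Read a little-endian unsigned 64-bit integer from a byte view.
const u64 = (b: Uint8Array, at: number): bigint =>
  new DataView(b.buffer, b.byteOffset, b.byteLength).getBigUint64(at, true);

// Assumed hash: classic cdb hash (h = h * 33 XOR byte) truncated to 64 bits.
function cdbHash(key: Uint8Array): bigint {
  let h = 5381n;
  for (let i = 0; i < key.length; i++) {
    h = (((h << 5n) + h) ^ BigInt(key[i])) & 0xffffffffffffffffn;
  }
  return h;
}

async function lookup(src: ByteRangeSource, key: Uint8Array): Promise<Uint8Array | undefined> {
  const h = cdbHash(key);
  // Header read: the 16-byte entry for table (h mod 256).
  const entry = await src.read(Number(h & 255n) * 16, 16);
  const tablePos = u64(entry, 0);
  const slotCount = u64(entry, 8);
  if (slotCount === 0n) return undefined;

  let slot = (h >> 8n) % slotCount;
  for (let probes = 0n; probes < slotCount; probes++) {
    // Hash table read: one 16-byte slot per probe (linear probing).
    const s = await src.read(Number(tablePos + slot * 16n), 16);
    const recordPos = u64(s, 8);
    if (recordPos === 0n) return undefined; // an empty slot ends the probe chain
    if (u64(s, 0) === h) {
      // Record read: 16-byte length header, then key and value.
      const head = await src.read(Number(recordPos), 16);
      const keyLen = Number(u64(head, 0));
      const valLen = Number(u64(head, 8));
      const body = await src.read(Number(recordPos) + 16, keyLen + valLen);
      const k = body.subarray(0, keyLen);
      if (k.length === key.length && k.every((byte, i) => byte === key[i])) {
        return body.subarray(keyLen);
      }
    }
    slot = (slot + 1n) % slotCount;
  }
  return undefined;
}
```

The sketch issues separate reads for the record header and body; behind a block cache these typically resolve from the same fetched block, which is how the request counts listed above stay low.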
Configuration
# Existing - local files
CDB64_ROOT_TX_INDEX_PATH=/path/to/indexes/
# Arweave TX IDs - simple format (indexed data items or L1 TXs)
CDB64_ROOT_TX_INDEX_IDS=TxId123,TxId456
# Arweave TX IDs - with offset for unindexed bundle data items
# Format: txId:offset:size (colon-separated to avoid ambiguity with base64url)
CDB64_ROOT_TX_INDEX_IDS=BundleTxId1:1048576:52428800,BundleTxId2:0:10485760
# Mixed formats supported in same config
CDB64_ROOT_TX_INDEX_IDS=IndexedDataItem1,BundleTxId:1048576:52428800,IndexedDataItem2
# HTTP URLs (comma-separated, supports S3, CDN, dedicated servers)
CDB64_ROOT_TX_INDEX_URLS=https://indexes.example.com/root.cdb,https://s3.amazonaws.com/bucket/index.cdb
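As an illustration of how these variables might be turned into source references, here is a minimal sketch. The function name `indexSourcesFromEnv` and the `splitList` helper are hypothetical, the `tx` variants are collapsed into one shape with optional fields for brevity, the local path is passed through as a single `file` reference without expanding the directory, and the tx-before-http ordering is an assumption beyond the "local first" rule stated earlier.

```typescript
type IndexSourceRef =
  | { type: 'file'; path: string }
  | { type: 'tx'; id: string; offset?: number; size?: number }
  | { type: 'http'; url: string };

// Split a comma-separated value, trimming whitespace and dropping empty entries.
const splitList = (value: string | undefined): string[] =>
  (value ?? '')
    .split(',')
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

function indexSourcesFromEnv(env: Record<string, string | undefined>): IndexSourceRef[] {
  const refs: IndexSourceRef[] = [];

  // Local path first: it is the cheapest source to query.
  if (env.CDB64_ROOT_TX_INDEX_PATH) {
    refs.push({ type: 'file', path: env.CDB64_ROOT_TX_INDEX_PATH });
  }

  // Each entry is either "txId" or "txId:offset:size".
  for (const spec of splitList(env.CDB64_ROOT_TX_INDEX_IDS)) {
    const m = /^([^:]+)(?::(\d+):(\d+))?$/.exec(spec);
    if (!m || (m[3] !== undefined && Number(m[3]) <= 0)) {
      throw new Error(`Invalid index source spec: ${spec}`);
    }
    refs.push(
      m[2] === undefined
        ? { type: 'tx', id: m[1] }
        : { type: 'tx', id: m[1], offset: Number(m[2]), size: Number(m[3]) },
    );
  }

  for (const url of splitList(env.CDB64_ROOT_TX_INDEX_URLS)) {
    if (!/^https?:\/\//.test(url)) throw new Error(`Invalid index URL: ${url}`);
    refs.push({ type: 'http', url });
  }
  return refs;
}
```

Failing fast on a malformed entry at startup surfaces configuration mistakes before any lookup is attempted.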
Use Cases
- Indexed data item: CDB64_ROOT_TX_INDEX_IDS=DataItemTxId - Gateway fetches via normal ContiguousDataSource which handles resolution
- L1 transaction: CDB64_ROOT_TX_INDEX_IDS=L1TxId - Direct fetch of L1 TX data
- Unindexed bundle data item: CDB64_ROOT_TX_INDEX_IDS=BundleTxId:1048576:52428800 - Fetch from specific offset within bundle, useful when the CDB index itself isn't indexed yet (bootstrapping scenario)
Files to Modify
- src/lib/cdb64.ts - Refactor Cdb64Reader to use ByteRangeSource
- src/lib/byte-range-source.ts - New file with interface and implementations
- src/discovery/cdb64-root-tx-index.ts - Support mixed local/Arweave/HTTP sources with offset addressing
- src/config.ts - Add CDB64_ROOT_TX_INDEX_IDS and CDB64_ROOT_TX_INDEX_URLS configs
- src/system.ts - Wire up ContiguousDataSource for Arweave-backed indexes
Testing
- Unit tests for ByteRangeSource implementations
- Unit tests for refactored Cdb64Reader with mock ByteRangeSource
- Unit tests for index source spec parsing (simple ID vs ID:offset:size)
- Integration tests with actual CDB64 files via all source types
- Integration tests for offset-based addressing within bundles
- Performance comparison: local vs remote (with/without cache)
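For the mock-based unit tests, an in-memory source that records every read may be enough. The class name `InMemoryByteRangeSource` and its `reads` log are hypothetical, and the sketch is typed against `Uint8Array` (which `Buffer` extends) so it has no Node dependency.

```typescript
interface ByteRangeSource {
  read(offset: number, length: number): Promise<Uint8Array>;
  getSize?(): Promise<number | undefined>;
}

// Test double backed by a byte array; records each read for later assertions.
class InMemoryByteRangeSource implements ByteRangeSource {
  readonly reads: Array<{ offset: number; length: number }> = [];

  constructor(private data: Uint8Array) {}

  async read(offset: number, length: number): Promise<Uint8Array> {
    if (offset < 0 || length < 0) {
      throw new RangeError(`Invalid range: offset=${offset} length=${length}`);
    }
    this.reads.push({ offset, length });
    // slice() copies, so callers cannot mutate the backing data.
    return this.data.slice(offset, offset + length);
  }

  async getSize(): Promise<number> {
    return this.data.length;
  }
}
```

Asserting on `reads` lets a test check both the result of a lookup and the number and size of the range requests it issued, which is the quantity the remote-source design is trying to minimize.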
Performance Considerations
- Local files: Negligible overhead from abstraction (one extra function call)
- Remote sources: Network latency dominates; caching is critical
- Header cache: Eliminates 1 round trip per lookup
- Hash table region cache: Reduces probing costs
- Consider prefetching common hash table regions on initialization
Future Enhancements
- Index manifest TX that lists all index TX IDs for automatic discovery
- Composite indexes spanning multiple TXs with routing hints
- Background warming of remote index caches