Merged
5 changes: 5 additions & 0 deletions .gitignore
@@ -2,6 +2,11 @@
*.swp

# Local index/serialisation artifacts produced by tests/benches.
# Current `.ov*` magics plus the legacy `.tv*` ones (still loadable, files persist).
*.ovr
*.ovrq
*.ovbm
*.ovsb
*.tvr
*.tvrq
*.tvbm
12 changes: 8 additions & 4 deletions CONTRIBUTING.md
@@ -19,10 +19,14 @@ Contributions to the code, the docs, and the paper are all welcome.
caveat. The Lean bitmap theorem proves a constant-weight overlap admission
model under explicit assumptions; it is not a blanket retrieval guarantee.
- **MSRV is Rust 1.89.** Don't use newer standard-library or language APIs.
- **Stable surface.** The persistence file magics (`.tvr` / `.tvrq` /
`.tvbm` / `.tvsb`) and the public method names
(`new` / `add` / `search` / `search_asymmetric*` / `top_m_candidates*` /
`write` / `load`) are stable — please don't rename them.
- **Stable surface.** The on-disk formats remain loadable forever: writers emit
the current `.ov*` magics (`.ovr` / `.ovrq` / `.ovbm` / `.ovsb`, renamed from
the turbovec-era `.tv*`), and the loaders accept **both** the current `.ov*`
and the legacy `.tv*` magics — so every file the crate has ever written still
loads. Only the write path changed; the read contract is never broken. The
public method names (`new` / `add` / `search` / `search_asymmetric*` /
`top_m_candidates*` / `write` / `load`) are likewise stable — please don't
rename them.
- **Tests are required for new functionality.** As major new functionality
is added, tests covering it MUST be added to the automated test suite
(`cargo test`, plus `pytest` for the Python bindings). Changes that add
3 changes: 2 additions & 1 deletion README.md
@@ -426,7 +426,8 @@ clean-checkout kernel sanity check.

## Security: index-file trust

The on-disk formats (`.tvr` / `.tvrq` / `.tvbm` / `.tvsb`) carry **no built-in
The on-disk formats (`.ovr` / `.ovrq` / `.ovbm` / `.ovsb`; legacy `.tvr` /
`.tvrq` / `.tvbm` / `.tvsb` files still load) carry **no built-in
checksum, MAC, or signature — by design.** The loaders validate *structure*
(magic, version, bounds, exact-length payload) but not *origin*: a
structurally valid file can still be untrusted. If an index file crosses a
5 changes: 3 additions & 2 deletions SECURITY.md
@@ -15,8 +15,9 @@ Use GitHub's private vulnerability reporting:

We aim to acknowledge reports within a few business days.

`ordvec` parses serialized index files (`.tvr` / `.tvrq` / `.tvbm` /
`.tvsb`); the loaders are fuzzed (`cargo +nightly fuzz`), so
`ordvec` parses serialized index files (`.ovr` / `.ovrq` / `.ovbm` /
`.ovsb`; the loaders also accept the legacy `.tvr` / `.tvrq` / `.tvbm` /
`.tvsb` magics); the loaders are fuzzed (`cargo +nightly fuzz`), so
parsing-robustness reports against the deserialization paths are especially
welcome. Reports are also welcome against the `unsafe` SIMD kernels (shape /
bounds invariants), the Python FFI contract (buffer handling, GIL discipline),
6 changes: 3 additions & 3 deletions THREAT_MODEL.md
@@ -66,7 +66,7 @@ absence of a second maintainer is itself a tracked supply-chain residual

| Layer | Components | Trust boundary |
|---|---|---|
| **Deserialization** | `rank_io.rs` — `.tvr` / `.tvrq` / `.tvbm` / `.tvsb` loaders | Untrusted filesystem / network byte stream |
| **Deserialization** | `rank_io.rs` — `.ovr` / `.ovrq` / `.ovbm` / `.ovsb` loaders (also accept the legacy `.tvr` / `.tvrq` / `.tvbm` / `.tvsb` magics) | Untrusted filesystem / network byte stream |
| **Manifest verification** | `ordvec-manifest` — JSON sidecar verifier | Manifest + index + optional row-map files before load |
| **Compute kernels** | `fastscan.rs`, `quant_kernels.rs`, `bitmap.rs`, `sign_bitmap.rs` | Trust established after format validation |
| **Index API** | `rank.rs`, `quant.rs`, `bitmap.rs`, `sign_bitmap.rs` | Caller-controlled query embeddings |
@@ -221,8 +221,8 @@ those kernels, and layering ASAN onto the existing SDE leg remains a follow-up.

### 4.1 C ABI defenses (code-verified)

`ordvec-ffi` exposes only loaded `.tvrq` `RankQuant` and `.tvbm` `Bitmap`
indexes through one opaque handle. The ABI checks raw pointer nullness and
`ordvec-ffi` exposes only loaded `.ovrq` `RankQuant` and `.ovbm` `Bitmap`
indexes (legacy `.tvrq` / `.tvbm` files also load) through one opaque handle. The ABI checks raw pointer nullness and
caller-supplied lengths before use, requires exact v1 `struct_size` values for
input structs, rejects unknown flags and nonzero reserved input fields,
validates query dimension and finiteness before entering core search,
2 changes: 1 addition & 1 deletion docs/INDEX_PROVENANCE.md
@@ -1,6 +1,6 @@
# Index file provenance

`ordvec` persists indexes as `.tvr` / `.tvrq` / `.tvbm` / `.tvsb` files and
`ordvec` persists indexes as `.ovr` / `.ovrq` / `.ovbm` / `.ovsb` files and
reloads them through `Rank::load`, `RankQuant::load`, `Bitmap::load`, and
`SignBitmap::load`. This note states exactly **what the loaders guarantee and
what they do not**, so you can decide whether an index file needs out-of-band
34 changes: 23 additions & 11 deletions docs/PERSISTED_FORMAT.md
@@ -1,8 +1,8 @@
# Persisted Index Format

This document is the compatibility contract for ordvec persisted index files.
It covers the primitive index artifacts only: `.tvr`, `.tvrq`, `.tvbm`, and
`.tvsb`. It does not define a database, transaction log, replication protocol,
It covers the primitive index artifacts only: `.ovr`, `.ovrq`, `.ovbm`, and
`.ovsb`. It does not define a database, transaction log, replication protocol,
provenance system, checksum manifest, signature, or trust policy.

All integer fields are little-endian. Each format has one fixed header followed
@@ -58,7 +58,7 @@ Example external segment entry:

```json
{
"path": "segments/shard-0007/index.tvrq",
"path": "segments/shard-0007/index.ovrq",
"sha256": "0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef",
"metadata": {
"kind": "RankQuant",
@@ -92,13 +92,16 @@ persisted row.

## Format Layouts

### Rank (`.tvr`, magic `TVR1`)
### Rank (`.ovr`, magic `OVR1`)

Current writers emit magic `OVR1`. Loaders also accept the legacy magic `TVR1`
(written by versions before the format rename).

Header:

| Offset | Bytes | Field |
| ---: | ---: | --- |
| 0 | 4 | magic `TVR1` |
| 0 | 4 | magic `OVR1` (or legacy `TVR1`) |
| 4 | 1 | format version `1` |
| 5 | 4 | `dim` as `u32` little-endian |
| 9 | 4 | `n_vectors` as `u32` little-endian |
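
As a minimal sketch of how a reader might decode the four header fields listed above, the Rust below parses the first 13 bytes of a Rank file and accepts either magic. The names `RankHeaderPrefix`, `HeaderError`, and `parse_rank_header_prefix` are hypothetical and are not part of the `ordvec` API. Any header fields beyond offset 13, and all payload validation, are outside the scope of this sketch.

```rust
/// Hypothetical view of the first 13 header bytes of a Rank file.
#[derive(Debug, PartialEq, Eq)]
pub struct RankHeaderPrefix {
    /// True when the file carries the legacy `TVR1` magic.
    pub legacy_magic: bool,
    pub version: u8,
    pub dim: u32,
    pub n_vectors: u32,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum HeaderError {
    TooShort,
    BadMagic,
    BadVersion(u8),
}

/// Decode magic, version, `dim`, and `n_vectors` from the documented offsets.
pub fn parse_rank_header_prefix(bytes: &[u8]) -> Result<RankHeaderPrefix, HeaderError> {
    if bytes.len() < 13 {
        return Err(HeaderError::TooShort);
    }
    // Offset 0, 4 bytes: current `OVR1` or legacy `TVR1`.
    let magic = [bytes[0], bytes[1], bytes[2], bytes[3]];
    let legacy_magic = match &magic {
        b"OVR1" => false,
        b"TVR1" => true,
        _ => return Err(HeaderError::BadMagic),
    };
    // Offset 4, 1 byte: format version, which must be 1.
    let version = bytes[4];
    if version != 1 {
        return Err(HeaderError::BadVersion(version));
    }
    // Offsets 5 and 9: little-endian u32 fields.
    let dim = u32::from_le_bytes([bytes[5], bytes[6], bytes[7], bytes[8]]);
    let n_vectors = u32::from_le_bytes([bytes[9], bytes[10], bytes[11], bytes[12]]);
    Ok(RankHeaderPrefix { legacy_magic, version, dim, n_vectors })
}
```

Reporting `legacy_magic` rather than normalising it away lets a caller log or count legacy files without changing how they load.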
@@ -112,13 +115,16 @@ Probe metadata:
- `params = Rank`
- `bytes_per_vec = dim * 2`

### RankQuant (`.tvrq`, magic `TVRQ`)
### RankQuant (`.ovrq`, magic `OVRQ`)

Current writers emit magic `OVRQ`. Loaders also accept the legacy magic `TVRQ`
(written by versions before the format rename).

Header:

| Offset | Bytes | Field |
| ---: | ---: | --- |
| 0 | 4 | magic `TVRQ` |
| 0 | 4 | magic `OVRQ` (or legacy `TVRQ`) |
| 4 | 1 | format version `1` |
| 5 | 1 | `bits` as `u8`, one of `1`, `2`, or `4` |
| 6 | 4 | `dim` as `u32` little-endian |
@@ -139,13 +145,16 @@ Probe metadata:
- `params = RankQuant { bits }`
- `bytes_per_vec = dim * bits / 8`
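
To make the `bytes_per_vec = dim * bits / 8` arithmetic concrete, here is a small Rust sketch. The function name `rankquant_bytes_per_vec` is hypothetical. Returning `None` when `dim * bits` is not a multiple of 8 is an assumption of this sketch, not a documented loader rule.

```rust
/// Bytes per stored vector for a RankQuant index: `dim * bits / 8`.
///
/// Returns `None` if `bits` is not 1, 2, or 4 (the values the header allows),
/// or if the row would not fill whole bytes (an assumption of this sketch).
pub fn rankquant_bytes_per_vec(dim: u32, bits: u8) -> Option<u64> {
    if !matches!(bits, 1 | 2 | 4) {
        return None;
    }
    // Widen to u64 so `dim * bits` cannot overflow for any u32 `dim`.
    let total_bits = u64::from(dim) * u64::from(bits);
    if total_bits % 8 != 0 {
        return None;
    }
    Some(total_bits / 8)
}
```

For example, `dim = 128` at `bits = 2` gives 256 bits, so 32 bytes per vector; `dim = 768` at `bits = 4` gives 384 bytes per vector.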

### Bitmap (`.tvbm`, magic `TVBM`)
### Bitmap (`.ovbm`, magic `OVBM`)

Current writers emit magic `OVBM`. Loaders also accept the legacy magic `TVBM`
(written by versions before the format rename).

Header:

| Offset | Bytes | Field |
| ---: | ---: | --- |
| 0 | 4 | magic `TVBM` |
| 0 | 4 | magic `OVBM` (or legacy `TVBM`) |
| 4 | 1 | format version `1` |
| 5 | 4 | `dim` as `u32` little-endian |
| 9 | 4 | `n_top` as `u32` little-endian |
@@ -161,13 +170,16 @@ Probe metadata:
- `params = Bitmap { n_top }`
- `bytes_per_vec = dim / 8`

### SignBitmap (`.tvsb`, magic `TVSB`)
### SignBitmap (`.ovsb`, magic `OVSB`)

Current writers emit magic `OVSB`. Loaders also accept the legacy magic `TVSB`
(written by versions before the format rename).

Header:

| Offset | Bytes | Field |
| ---: | ---: | --- |
| 0 | 4 | magic `TVSB` |
| 0 | 4 | magic `OVSB` (or legacy `TVSB`) |
| 4 | 1 | format version `1` |
| 5 | 4 | `dim` as `u32` little-endian |
| 9 | 4 | `n_vectors` as `u32` little-endian |
6 changes: 3 additions & 3 deletions docs/c-api.md
@@ -1,7 +1,7 @@
# C API

`ordvec-ffi` exposes a small ABI v1 for loading persisted `.tvrq`
`RankQuant` and `.tvbm` `Bitmap` indexes and running synchronous single-query
`ordvec-ffi` exposes a small ABI v1 for loading persisted `.ovrq`
`RankQuant` and `.ovbm` `Bitmap` indexes and running synchronous single-query
searches. The public header is [`../ordvec-ffi/include/ordvec.h`](../ordvec-ffi/include/ordvec.h).

## Build and Link
@@ -33,7 +33,7 @@ When linking dynamically, make sure your platform's loader can find

int main(void) {
ordvec_index_t *index = NULL;
ordvec_status_t st = ordvec_index_load("index.tvrq", 0, &index);
ordvec_status_t st = ordvec_index_load("index.ovrq", 0, &index);
if (st != ORDVEC_STATUS_OK) {
fprintf(stderr, "load failed: %s\n", ordvec_last_error());
return 1;
12 changes: 8 additions & 4 deletions docs/compatibility-policy.md
@@ -121,10 +121,14 @@ with documented migration steps.
The primitive index formats are the files written and loaded by the core index
types:

- `.tvr` / `TVR1` for `Rank`;
- `.tvrq` / `TVRQ` for `RankQuant`;
- `.tvbm` / `TVBM` for `Bitmap`;
- `.tvsb` / `TVSB` for `SignBitmap`.
- `.ovr` / `OVR1` for `Rank`;
- `.ovrq` / `OVRQ` for `RankQuant`;
- `.ovbm` / `OVBM` for `Bitmap`;
- `.ovsb` / `OVSB` for `SignBitmap`.

Legacy files using the old turbovec-era magics (`TVR1`, `TVRQ`, `TVBM`, `TVSB`
and extensions `.tvr`, `.tvrq`, `.tvbm`, `.tvsb`) are still accepted by current
loaders. Writers no longer emit those magics.
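
The asymmetry described above (loaders accept eight magics, writers emit four) can be sketched in Rust as follows. `IndexKind`, `classify_magic`, and `writer_magic` are hypothetical names for illustration; the actual `rank_io` internals may be organised differently.

```rust
/// Hypothetical tag for the four primitive index types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IndexKind {
    Rank,
    RankQuant,
    Bitmap,
    SignBitmap,
}

/// Read side: map any accepted magic to its kind and whether it is legacy.
pub fn classify_magic(magic: &[u8; 4]) -> Option<(IndexKind, bool)> {
    match magic {
        b"OVR1" => Some((IndexKind::Rank, false)),
        b"OVRQ" => Some((IndexKind::RankQuant, false)),
        b"OVBM" => Some((IndexKind::Bitmap, false)),
        b"OVSB" => Some((IndexKind::SignBitmap, false)),
        b"TVR1" => Some((IndexKind::Rank, true)),
        b"TVRQ" => Some((IndexKind::RankQuant, true)),
        b"TVBM" => Some((IndexKind::Bitmap, true)),
        b"TVSB" => Some((IndexKind::SignBitmap, true)),
        _ => None,
    }
}

/// Write side: only the current magics are ever produced.
pub fn writer_magic(kind: IndexKind) -> &'static [u8; 4] {
    match kind {
        IndexKind::Rank => b"OVR1",
        IndexKind::RankQuant => b"OVRQ",
        IndexKind::Bitmap => b"OVBM",
        IndexKind::SignBitmap => b"OVSB",
    }
}
```

Keeping the read mapping and the write mapping as two separate functions makes it hard to emit a legacy magic by accident, since `writer_magic` has no legacy arm.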

Patch releases should keep valid files from the same minor series loadable.
Loader hardening may reject malformed files, forged sizes, trailing bytes, bad
5 changes: 3 additions & 2 deletions fuzz/fuzz_targets/load_bitmap.rs
@@ -1,5 +1,6 @@
//! libFuzzer target for the `.tvbm` / `TVBM` loader, driven through the
//! public `ordvec::Bitmap::load` entry point.
//! libFuzzer target for the `.ovbm` / `OVBM` loader (which also accepts the
//! legacy `.tvbm` / `TVBM` magic), driven through the public
//! `ordvec::Bitmap::load` entry point.
//!
//! The low-level `rank_io::load_bitmap` parser is crate-internal
//! (`pub(crate)`), so the fuzzer exercises it through `Bitmap::load` — which
5 changes: 3 additions & 2 deletions fuzz/fuzz_targets/load_rank.rs
@@ -1,5 +1,6 @@
//! libFuzzer target for the `.tvr` / `TVR1` loader, driven through the
//! public `ordvec::Rank::load` entry point.
//! libFuzzer target for the `.ovr` / `OVR1` loader (which also accepts the
//! legacy `.tvr` / `TVR1` magic), driven through the public `ordvec::Rank::load`
//! entry point.
//!
//! The low-level `rank_io::load_rank` parser is crate-internal (`pub(crate)`),
//! so the fuzzer exercises it through `Rank::load` — which runs that exact
5 changes: 3 additions & 2 deletions fuzz/fuzz_targets/load_rankquant.rs
@@ -1,5 +1,6 @@
//! libFuzzer target for the `.tvrq` / `TVRQ` loader, driven through the
//! public `ordvec::RankQuant::load` entry point.
//! libFuzzer target for the `.ovrq` / `OVRQ` loader (which also accepts the
//! legacy `.tvrq` / `TVRQ` magic), driven through the public
//! `ordvec::RankQuant::load` entry point.
//!
//! The low-level `rank_io::load_rankquant` parser is crate-internal
//! (`pub(crate)`), so the fuzzer exercises it through `RankQuant::load` —
7 changes: 4 additions & 3 deletions fuzz/fuzz_targets/load_sign_bitmap.rs
@@ -1,5 +1,6 @@
//! libFuzzer target for the `.tvsb` / `TVSB` loader, driven through the
//! public `ordvec::SignBitmap::load` entry point.
//! libFuzzer target for the `.ovsb` / `OVSB` loader (which also accepts the
//! legacy `.tvsb` / `TVSB` magic), driven through the public
//! `ordvec::SignBitmap::load` entry point.
//!
//! The low-level `rank_io::load_sign_bitmap` parser is crate-internal
//! (`pub(crate)`), so the fuzzer exercises it through `SignBitmap::load` —
@@ -13,7 +14,7 @@
//! Contract: on arbitrary bytes the loader must return `Ok(..)` or
//! `Err(..)` — never panic, abort, or read out of bounds. libFuzzer
//! treats any panic/abort as a crash, so simply letting the result drop
//! is the assertion. The `.tvsb` dim validation path differs from the
//! is the assertion. The `.ovsb` dim validation path differs from the
//! other three (`MAX_SIGN_BITMAP_DIM`, multiple-of-64), so it gets its
//! own target rather than riding on `load_bitmap`.

2 changes: 1 addition & 1 deletion fuzz/fuzz_targets/roundtrip_rankquant.rs
@@ -46,7 +46,7 @@ fuzz_target!(|data: &[u8]| {
Ok(d) => d,
Err(_) => return,
};
let path = dir.path().join("roundtrip.tvrq");
let path = dir.path().join("roundtrip.ovrq");
idx.write(&path).expect("write of a validly-built index must succeed");
let reloaded = RankQuant::load(&path).expect("write output must reload (round-trip)");
assert_eq!(reloaded.dim(), idx.dim());
5 changes: 3 additions & 2 deletions fuzz/fuzz_targets/scratch.rs
@@ -1,5 +1,6 @@
//! Shared per-worker scratch temp file for the `.tvr` / `.tvrq` / `.tvbm` /
//! `.tvsb` loader fuzz targets.
//! Shared per-worker scratch temp file for the `.ovr` / `.ovrq` / `.ovbm` /
//! `.ovsb` loader fuzz targets (the loaders also accept the legacy `.tv*`
//! magics).
//!
//! # Why this exists (issue #6)
//!
8 changes: 5 additions & 3 deletions ordvec-ffi/include/ordvec.h
@@ -180,7 +180,8 @@ void ordvec_search_params_init(ordvec_search_params_t *params);
void ordvec_search_stats_init(ordvec_search_stats_t *stats);

/**
* Load a `.tvrq` RankQuant or `.tvbm` Bitmap index.
* Load a `.ovrq` RankQuant or `.ovbm` Bitmap index (legacy `.tvrq` / `.tvbm`
* files are also accepted).
*
* # Safety
*
@@ -190,8 +191,9 @@ void ordvec_search_stats_init(ordvec_search_stats_t *stats);
ordvec_status_t ordvec_index_load(const char *path, uint64_t flags, ordvec_index_t **out);

/**
* Probe on-disk metadata for a `.tvrq` RankQuant or `.tvbm` Bitmap index
* without loading payload rows into an index handle.
* Probe on-disk metadata for a `.ovrq` RankQuant or `.ovbm` Bitmap index
* (legacy `.tv*` also accepted) without loading payload rows into an index
* handle.
*
* This validates the fixed header, declared dimensions, payload byte count,
* and exact file length. Full row-invariant validation remains the job of