Skip to content

Vector subsystem — correctness, memory-safety, quantization (7 sub-items) #50

@hollanf

Description

@hollanf

Seven independent bugs in nodedb-vector beyond what's already filed (#37 mmap overflow, #40 SQ8 brute-force scan, #41 ef_search unbounded). Three are correctness bugs that silently return wrong results. One is memory-safety UB on the search hot path. One is a feature that's silently a no-op. Two are performance pathologies that turn the index from log-scale into linear.


1. SIMD distance kernels read past buffer end on length mismatch (memory-safety UB)

File: nodedb-vector/src/distance/simd.rs:147-210 (AVX2 L2 + cosine, NEON twins)

#[target_feature(enable = "avx2,fma")]
unsafe fn l2_squared_impl(a: &[f32], b: &[f32]) -> f32 {
    unsafe {
        let n = a.len();                              // ← uses a.len() only
        let chunks = n / 8;
        for i in 0..chunks {
            let off = i * 8;
            let va = _mm256_loadu_ps(a.as_ptr().add(off));
            let vb = _mm256_loadu_ps(b.as_ptr().add(off));   // UB if b.len() < a.len()
            ...
        }
        for i in (chunks * 8)..n {
            let d = a[i] - b[i];                      // panic if b shorter
            ...
        }
    }
}

The kernel iterates with a.len(), dereferences b.as_ptr().add(off) without verifying b.len() >= a.len(). The assert_eq!(query.len(), self.dim) guard in hnsw/search.rs:19 validates the query against the index dim, but does not validate the query against each individual stored vector's length. A node whose Vec<f32> is shorter than the index dim (possible after a truncated checkpoint restore — from_checkpoint has no per-node length check) lands the search hot path in undefined behavior via _mm256_loadu_ps past the buffer end.


2. Tenant Roaring bitmap pre-filter feeds GLOBAL ids to per-segment LOCAL-id lookup → near-zero recall after first segment seal

File: nodedb-vector/src/collection/search.rs:115-135 + hnsw/search.rs:140 (f.contains(id))

for seg in &self.sealed {
    let results = seg.index.search_with_bitmap_bytes(query, top_k, ef, bitmap);
    for mut r in results {
        r.id += seg.base_id;        // ← offset added AFTER bitmap filtering
        all.push(r);
    }
}

The Roaring bitmap is built by the query planner from global vector IDs. Each sealed segment stores vectors at local IDs 0..seg.index.len(). HnswIndex::search_with_bitmap_bytes tests f.contains(id) with the local ID. For the first segment (base_id == 0) local == global so filtering works by accident; for every segment with base_id > 0 (i.e. once the collection has > ~65k vectors and seals once), the bitmap lookup is against the wrong ID space and almost every candidate is silently dropped → recall collapses. The growing segment has the same shape via growing_base_id.


3. index_type='hnsw_pq' is silently a no-op — PQ is never instantiated

File: nodedb-vector/src/collection/lifecycle.rs:209-261

pub fn complete_build(&mut self, segment_id: u32, index: HnswIndex) {
    ...
    let sq8 = Self::build_sq8_for_index(&index);    // ← always SQ8, never PQ
    ...
}

grep -rn PqCodec nodedb-vector/src/collection/ returns zero matches. The DSL accepts index_type='hnsw_pq', pq_m=8 and stores it, but complete_build unconditionally calls build_sq8_for_index. Users who configured PQ for the advertised 8–16× memory reduction silently get 4× SQ8 instead. At 1024-dim × 100k+ rows per collection that's the difference between fitting in RAM and spilling to mmap on the hot path. Stats API (lifecycle.rs:385) reports quantization as Sq8 or None, never Pq — operators have no signal that the configuration was ignored.

Repro:

CREATE INDEX ON t USING hnsw (v) WITH (index_type='hnsw_pq', pq_m=8, m=16);
INSERT INTO t SELECT array_fill(random()::real, ARRAY[1024])::vector(1024) FROM generate_series(1, 100000);
SELECT * FROM vector_index_stats WHERE relation = 't'; -- quantization = 'Sq8'

4. Soft-deletes in growing & building segments lost on checkpoint restore

File: nodedb-vector/src/collection/checkpoint.rs:55-76, 120-154

growing_vectors: (0..self.growing.len() as u32)
    .filter_map(|i| self.growing.get_vector(i).map(|v| v.to_vec()))
    .collect(),
growing_deleted: (0..self.growing.len() as u32)
    .map(|i| self.growing.get_vector(i).is_none())
    .collect(),
...
let mut growing = FlatIndex::new(snap.dim, metric);
for v in &snap.growing_vectors {
    growing.insert(v.clone());
}

FlatIndex::get_vector (flat.rs:161-169) returns Some(...) even for tombstoned vectors — it gates only on bounds, not on deleted[idx]. So growing_deleted is always all-false and every live vector is serialized. Then from_checkpoint (line 120-123) never reads growing_deleted anyway — every restored vector comes back as live. Building-segment snapshots (line 74-76) have the same bug. Soft-delete-via-valid_until workflows depend on deletes being durable across restarts; after a crash, every deleted row in growing/building segments resurrects.


5. HNSW compaction renumbers local IDs but doc_id_map / multi_doc_map are not remapped

Files: nodedb-vector/src/hnsw/graph.rs:282-352 (HnswIndex::compact) + nodedb-vector/src/collection/lifecycle.rs:38-40, 119-122, 279-285

// lifecycle.rs
pub fn compact(&mut self) -> usize {
    let mut total_removed = 0;
    for seg in &mut self.sealed {
        total_removed += seg.index.compact();   // renumbers LOCAL node ids
    }
    total_removed
}

HnswIndex::compact builds an id_map: Vec<u32> and rewrites every node's neighbor edges so the in-graph references stay consistent. But the collection stores doc_id_map: HashMap<u32 /* GLOBAL */, String> and multi_doc_map: HashMap<String, Vec<u32 /* GLOBAL */>>, and these are not remapped. After compact, collection.get_doc_id(vid) either returns a stale string (the global id seg.base_id + old_local no longer maps to the right live node) or the multi-vector delete-by-doc dispatches to IDs that point at a different live chunk. With bulk-insert workloads (many vectors per logical doc) this corrupts the doc → vector mapping every compaction cycle.


6. random_layer has no max_layer cap — one outlier insert promotes index to ~60+ layers, every subsequent search pays per-layer cost

File: nodedb-vector/src/hnsw/graph.rs:257-261

pub(crate) fn random_layer(&mut self) -> usize {
    let ml = 1.0 / (self.params.m as f64).ln();
    let r = self.rng.next_f64().max(f64::MIN_POSITIVE);
    (-r.ln() * ml).floor() as usize
}

next_f64 is xorshift.0 as f64 / u64::MAX as f64. The .max(f64::MIN_POSITIVE) clamp floors r at 2.2e-308 so -ln(r) reaches ~708; with m=2 the layer can be ~1022, with m=16 it can be ~177. Realistic xorshift states a few thousand inserts in produce r ≈ 1e-17..1e-19 → 50-65 layers. Every subsequent search's Phase 1 greedy descent (hnsw/search.rs:31) iterates (1..=self.max_layer).rev(), so one outlier insert turns the constant-time descent into per-search O(max_layer). Standard HNSW caps max_layer at ~16.


7. PQ k-means in pq.rs and ivf.rs is deterministic farthest-point with broken min-distance update — codebooks collapse on duplicated data

Files: nodedb-vector/src/quantize/pq.rs:181-201, nodedb-vector/src/ivf.rs:223-238 (identical pattern)

// Update min_dists against centroids[c - 1] only
for (i, dists) in min_dists.iter_mut().enumerate() {
    let d = squared_distance(data[i], &centroids[c - 1]);
    if d < *dists { *dists = d; }
}
// Then deterministic argmax (NOT k-means++ sampling proportional to d²)
let best_idx = min_dists.iter().enumerate()
    .max_by(|a, b| a.1.partial_cmp(b.1).unwrap_or(Ordering::Equal))
    .map(|(i, _)| i).unwrap_or(0);
centroids.push(data[best_idx].to_vec());

Two compounding bugs: (a) min_dists is updated only against the last centroid, not against the set, so once two centroids coincide every subsequent pick lands on the same best_idx. (b) Comment claims "k-means++" but logic is deterministic farthest-point — outliers dominate. With pq_m=8 on 1024-dim vectors (128 dims per subspace), a chunk-ingest batch where embeddings share a common prefix or suffix (typical for templated chat / repeated headers / footers) trains a codebook where most of 256 centroids alias to one or two points. Encode then maps essentially every vector to the same code → PQ recall destroyed.


Checklist

  • 1. Validate length parity in SIMD distance kernels (or in dispatcher) — no loadu past min(a.len(), b.len()); reject on mismatch
  • 2. Either pass per-segment-localized bitmaps to search_with_bitmap_bytes, or apply the global-id offset before the f.contains(id) check
  • 3. Wire index_type='hnsw_pq' end-to-end through complete_buildPqCodec::train instead of falling through to SQ8; surface Pq in stats
  • 4. FlatIndex::get_vector should return None for tombstoned indices (or expose a live_iter); from_checkpoint must read & apply growing_deleted/building-segment delete bitmaps
  • 5. After HnswIndex::compact, walk doc_id_map/multi_doc_map and rewrite global IDs through (seg.base_id + new_local)
  • 6. Cap random_layer at a configured max (typical: 16); also consider switching xorshift seed handling so first-call output isn't pathological
  • 7. Real k-means++ (sample proportional to d²) and update min_dists against the full centroid set on every pick; treat as warning when centroid duplicates after training

All items independently reproducible with the SQL/Rust shown above.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions