diff --git a/.gitignore b/.gitignore
index f80e5c682..46b23fe11 100644
--- a/.gitignore
+++ b/.gitignore
@@ -378,3 +378,6 @@ gperftools
 
 # Rust
 rust/target
+
+# DiskANN unit-test scratch artifacts (generated by tests/unified_index_tests.cpp)
+unified_index_test_*
diff --git a/docs/unified_index_format.md b/docs/unified_index_format.md
new file mode 100644
index 000000000..428267561
--- /dev/null
+++ b/docs/unified_index_format.md
@@ -0,0 +1,348 @@
+# Unified Index Format
+
+**Status:** Draft v1
+**Scope:** static (non-streaming) indices; no tags; no disk-side secondary PQ
+**Audience:** DiskANN maintainers, third-party loader implementers (e.g. `rust/` crate, Python tools)
+
+---
+
+## 1. Motivation
+
+Today, building an index for SSD-served (`PQFlashIndex`) versus in-memory (`Index<T,TagT,LabelT>`) serving requires two distinct build pipelines that produce different on-disk artifacts:
+
+- **In-memory build** (`Index::save` in `src/index.cpp`) writes a variable-width graph file, a `.data` file (full-precision vectors), `.tags`, `_labels.txt`, `_labels_to_medoids.txt`, `_labels_map.txt`, `_bitmask_labels.bin`, `_integer_labels.bin`, etc.
+- **Disk build** (`build_disk_index` in `src/disk_utils.cpp`) writes `_disk.index` (4 KiB sector-packed graph + full-precision coords), `_pq_pivots.bin`, `_pq_compressed.bin`, `_medoids.bin`, `_centroids.bin`, `_max_base_norm.bin`, plus the same family of label files.
+
+Both pipelines build the same underlying Vamana graph (shared `build_merged_vamana_index` call), but the serialized artifacts diverge. As a result, an index built for one serving mode cannot be loaded in the other, and users must commit to a serving mode at build time.
+
+**Goal:** define a single self-describing container file that can be:
+
+- produced once by a unified build pipeline, and
+- loaded as either an in-memory index (`Index::load_unified`) or an SSD-served index (`PQFlashIndex::load_unified`),
+
+with PQ data simply ignored on the in-memory path.
+
+### 1.1 Non-goals (v1)
+
+- **No tags.** `Index<T,TagT,LabelT>` template instantiations stay (existing code untouched), but the unified writer/reader does not emit or consume tags.
+- **No frozen points / streaming.** No dynamic-index support in this version. No `num_frozen_pts` field.
+- **No disk-side secondary PQ.** The optional `_disk.index_pq_pivots.bin` / `_use_disk_index_pq` path at `src/pq_flash_index.cpp:828-835, 1534/1542` is not supported. Users needing very-high-dim disk PQ keep using the legacy format.
+- **No centroids region.** `_centroids.bin` today is a load-time optimization that pre-populates `_centroid_data` (used as query-expansion seeds at `src/pq_flash_index.cpp:1327`). It is always derivable from the medoid node records via `use_medoids_data_as_centroids()` (`src/pq_flash_index.cpp:401-438`). The unified loader calls that fallback at startup, paying `_num_medoids` extra disk reads.
+- **No legacy→unified conversion tool.** New format lives alongside legacy. Existing indices keep loading via their legacy code paths.
+
+### 1.2 Design principles
+
+1. **One file, self-describing.** All sidecar files merge into one container with a fixed-layout header that declares which optional sections are present.
+2. **Disk and memory share one graph encoding.** The graph + embeddings region is byte-for-byte identical regardless of which loader consumes it.
+3. **Region-level 4 KiB alignment, intra-region packing.** Major regions begin at 4 KiB-aligned file offsets to preserve the `AlignedFileReader` invariants (`include/aligned_file_reader.h:73-90`, `src/windows_aligned_file_reader.cpp:125-127`). Inside a region, payload is packed without per-record padding.
+4. **No redundant per-node metadata.** Embedding size is constant (`dim * sizeof(T)`, from header) and neighbor IDs are fixed-width `uint32_t`. Per-node degree is *derived* from the offset table, not stored.
+
+### 1.3 Supporting facts from existing code
+
+- In-memory graph load is sequential per-node `[degree:u32][nbrs:u32*degree]` (`src/in_mem_graph_store.cpp:138-202`). The in-memory search loop is degree-oblivious, so a fixed-stride or offset-indexed layout works equally well.
+- Disk search keeps full-precision coords *inside each sector record* (`src/pq_flash_index.cpp:1651`) alongside the adjacency list. PQ codes live in memory (`_pq_compressed.bin`) and approximate distances during traversal.
+- In-memory `Index<T>::search` never references PQ (`grep -n _pq_table src/index.cpp` returns zero hits).
+- `load_bin_impl` already accepts a `file_offset` parameter (`include/utils.h:412-426`), so embedded sub-files can be read from a byte range with no API change.
+
+---
+
+## 2. Format Specification (normative)
+
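+All multi-byte integers are little-endian. Header `*_off` fields are byte offsets from the start of the file and `*_len` fields are byte counts; entries of the node offset table (§2.3) and the per-point label offsets (§2.4.3) are instead relative to the start of the region they index.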
+
+### 2.1 File layout
+
+```
++--------------------------------------------------+
+| Header (4 KiB)                                   |  offset 0
++--------------------------------------------------+
+| Node Offset Table: uint64[npts + 1]              |  offset = header.offset_table_off
+| (padded to next 4 KiB boundary)                  |
++--------------------------------------------------+
+| Graph + Embeddings Region                        |  offset = header.graph_region_off
+|   Per node N: [coords:T*dim][nbrs:u32*degree]    |
+|   No per-node degree field.                      |
+|   Variable-width packing, no sector padding.     |
+|   (region padded to next 4 KiB boundary)         |
++--------------------------------------------------+
+| Medoids Region (always present)                  |
+|   uint32[num_medoids] of node IDs.               |
+|   num_medoids = medoids_len / sizeof(uint32_t).  |
+|   (padded to 4 KiB)                              |
++--------------------------------------------------+
+| PQ Pivots Region                  [optional]     |  present iff HAS_PQ
+|   Mirrors current _pq_pivots.bin payload byte    |
+|   for byte. (padded to 4 KiB)                    |
++--------------------------------------------------+
+| PQ Compressed Codes Region        [optional]     |  present iff HAS_PQ
+|   Mirrors current _pq_compressed.bin payload     |
+|   byte for byte. (padded to 4 KiB)               |
++--------------------------------------------------+
+| Max Base Norm Region              [optional]     |  present iff HAS_MAX_BASE_NORM
+|   float[1]. MIPS preprocessing only.             |
+|   (padded to 4 KiB)                              |
++--------------------------------------------------+
+| Labels Region                     [optional]     |  present iff HAS_LABELS
+|   Three sub-sections — see §2.4.                 |
++--------------------------------------------------+
+```
+
+Optional regions whose flag is unset have both their `off` and `len` header fields set to `0`. Regions appear in the order above; readers MUST locate each region via its header offset, not by position.
+
+### 2.2 Header (fixed 4 KiB)
+
+```cpp
+// include/unified_index_format.h
+namespace diskann {
+
+constexpr uint32_t UNIFIED_FORMAT_MAGIC   = 0x444E4E55; // "UNND" in little-endian ASCII
+constexpr uint32_t UNIFIED_FORMAT_VERSION = 1;
+
+enum class DataTypeTag   : uint32_t { Float = 1, Uint8 = 2, Int8 = 3 };
+enum class MetricTag     : uint32_t { L2 = 1, InnerProduct = 2, Cosine = 3 };
+enum class LabelEncoding : uint32_t { None = 0, Bitmask = 1, Integer = 2 };
+
+enum UnifiedFormatFlags : uint32_t {
+    HAS_PQ              = 1u << 0,
+    HAS_LABELS          = 1u << 1,
+    HAS_MAX_BASE_NORM   = 1u << 2,
+};
+
+struct UnifiedIndexHeader {            // total reserved 4096 bytes (one sector)
+    uint32_t      magic;
+    uint32_t      version;
+    DataTypeTag   data_type;
+    MetricTag     metric;
+    uint64_t      npts;
+    uint64_t      dim;
+    uint64_t      aligned_dim;
+    uint32_t      max_degree;
+    uint32_t      flags;
+    uint64_t      start_node;
+
+    // Section pointers. (off = 0, len = 0) means the optional region is absent.
+    uint64_t      offset_table_off,        offset_table_len;
+    uint64_t      graph_region_off,        graph_region_len;
+    uint64_t      medoids_off,             medoids_len;      // always present
+    uint64_t      pq_pivots_off,           pq_pivots_len;    // optional
+    uint64_t      pq_codes_off,            pq_codes_len;     // optional
+    uint64_t      max_base_norm_off,       max_base_norm_len;// MIPS only
+
+    // Labels (when HAS_LABELS)
+    LabelEncoding label_encoding;          // Bitmask or Integer
+    uint64_t      universal_label;         // 0 if none; else the integer label value
+    uint64_t      total_labels;            // distinct label count; derives bitmask row width
+    uint64_t      label_dictionary_off,        label_dictionary_len;
+    uint64_t      per_point_labels_off,        per_point_labels_len;
+    uint64_t      per_point_label_offsets_off, per_point_label_offsets_len; // Integer encoding only
+
+    uint64_t      file_size_bytes;       // total file size in bytes, set by writer in finalize(); 0 in v1 files
+
+    // Implementation must pad with reserved zero bytes to reach exactly 4096 bytes.
+};
+static_assert(sizeof(UnifiedIndexHeader) <= 4096, "header must fit in one sector");
+
+} // namespace diskann
+```
+
+Readers MUST:
+
+- Reject files whose `magic != UNIFIED_FORMAT_MAGIC`.
+- Reject files whose `version` is greater than the highest `UNIFIED_FORMAT_VERSION` they understand (no silent partial parsing).
+- Treat reserved trailing bytes within the header as opaque (do not assume zero).
+- When `file_size_bytes != 0`, reject files whose on-disk size does not match the recorded value (truncation / partial write / corruption check). The `!= 0` guard allows v1 files (which did not carry this field) to load through a v2 reader without spurious rejection.
+
+### 2.3 Node Offset Table and Graph Region
+
+The offset table is `uint64[npts + 1]` values, packed contiguously. For node `N` (0 ≤ N < npts):
+
+- record start (in file): `header.graph_region_off + offset_table[N]`
+- record end (in file): `header.graph_region_off + offset_table[N + 1]`
+- record size: `offset_table[N + 1] - offset_table[N]`
+
+The trailing sentinel `offset_table[npts]` equals `header.graph_region_len` (the size of the graph region payload, not counting trailing 4 KiB padding).
+
+Each node record contains, in order:
+
+1. `coords`: exactly `dim * sizeof(T)` bytes of vector data, where `T` corresponds to `header.data_type`.
+2. `neighbors`: zero or more `uint32_t` neighbor node IDs.
+
+There is no per-node degree field. The degree is derived:
+
+```
+degree = (record_size - dim * sizeof(T)) / sizeof(uint32_t)
+```
+
+The graph region is otherwise unstructured. Implementations MUST pad with zero bytes from `header.graph_region_off + header.graph_region_len` to the next 4 KiB-aligned file offset, so that subsequent regions begin sector-aligned. Padding bytes are not part of `graph_region_len`.
+
+### 2.4 Labels Region
+
+Present iff `flags & HAS_LABELS`. Three sub-sections:
+
+#### 2.4.1 Label dictionary
+
+Replaces today's `_labels_map.txt` + `_labels_to_medoids.txt`. One row per distinct label, packed contiguously:
+
+```
+[label_string_len:u32][label_string bytes (label_string_len bytes, no nul terminator)]
+[label_integer:u32][medoid_node_id:u32]
+```
+
+`label_integer` is always written as a 4-byte little-endian unsigned integer, independent of the build-time `LabelT` template parameter (`uint16_t` values are zero-extended). This makes the on-disk dictionary self-describing and uniform across writer instantiations. Row count is implicit: read rows until `label_dictionary_len` bytes are consumed.
+
+If `header.universal_label != 0`, the dictionary MAY contain a row whose `label_integer` matches it; otherwise the universal label has no explicit dictionary entry.
+
+#### 2.4.2 Per-point labels
+
+The payload format depends on `header.label_encoding`:
+
+- **`Bitmask`**: row width is fixed at `simple_bitmask::get_bitmask_size(total_labels) * sizeof(uint64_t)` bytes (see `include/label_bitmask.h:57`). Random access: point N starts at offset `N * row_width` within the region. Each row's payload is the equivalent of today's `_bitmask_labels.bin` row.
+- **`Integer`**: payload bytes are raw `uint32_t` label integers packed in point order, equivalent to `integer_label_vector::_data` (`include/integer_label_vector.h:38`). To locate point N's labels, use the per-point label offsets sub-section (§2.4.3).
+
+#### 2.4.3 Per-point label offsets
+
+Present iff `header.label_encoding == Integer`. Format: `uint64[npts + 1]` offsets into the per-point labels region, expressed as `uint32_t` element indices (not bytes), as the `_data[...]` indexing implies. Point N's labels span the range `_data[offsets[N] : offsets[N+1]]` (each element a `uint32_t`). Mirrors `integer_label_vector::_offset` (`include/integer_label_vector.h:37`).
+
+For `header.label_encoding == Bitmask`, `per_point_label_offsets_off` and `per_point_label_offsets_len` MUST both be `0`.
+
+**On-disk ordering (Integer encoding):** for symmetry with the graph region's `[offset_table, graph_data]` layout, the writer emits the per-point-label *offsets* first, then the per-point-label *payload*. Since both regions are addressed by absolute file offsets from the header, readers are unaffected by the ordering.
+
+### 2.5 Medoids Region (always present)
+
+A packed `uint32_t` array of node IDs. Length: `medoids_len / sizeof(uint32_t)`. Unfiltered indices write exactly one entry; filtered indices write one entry per label-bound medoid (semantics identical to today's `_medoids.bin`).
+
+### 2.6 PQ Regions (optional)
+
+When `HAS_PQ` is set, both `pq_pivots_off` and `pq_codes_off` MUST be non-zero. Each region's payload is byte-identical to today's `_pq_pivots.bin` / `_pq_compressed.bin`, including the in-bin metadata header that `load_bin_impl` expects (`include/utils.h:412-426`). Loaders read these via `load_bin_impl(path, pq_pivots_off)` and `load_bin_impl(path, pq_codes_off)`.
+
+When `HAS_PQ` is unset, both fields MUST be zero, and an SSD loader MUST reject the file with a clear error (SSD serving requires PQ).
+
+### 2.7 Max Base Norm Region (optional)
+
+Present iff `HAS_MAX_BASE_NORM` (MIPS preprocessing only). Payload: byte-identical to today's `_max_base_norm.bin`.
+
+---
+
+## 3. Load Paths (informative)
+
+### 3.1 In-memory load — `Index::load_unified(path)`
+
+1. Open file, read first 4 KiB → parse `UnifiedIndexHeader`. Validate magic and version.
+2. Read the offset table (`npts + 1` `uint64`s starting at `header.offset_table_off`).
+3. Read the graph region into a buffer (or stream it in chunks).
+4. For each node N in `[0, npts)`:
+   - `record = region_buf[offset_table[N] : offset_table[N+1]]`
+   - `coords = record[0 : dim * sizeof(T)]` → copy into `_data_store`
+   - `degree = (len(record) - dim * sizeof(T)) / sizeof(uint32_t)`
+   - `neighbors = record[dim * sizeof(T) :]` interpreted as `uint32_t[degree]` → copy into `InMemGraphStore::_graph[N]`
+5. If `flags & HAS_LABELS`:
+   - Read the dictionary; reconstruct in-memory `label_map` and `labels_to_medoids`.
+   - Read `per_point_labels`; dispatch on `header.label_encoding`:
+     - `Bitmask`: feed bytes into `simple_bitmask_buf` with row width derived from `total_labels`.
+     - `Integer`: also read `per_point_label_offsets`; feed both into `integer_label_vector`.
+   - If `header.universal_label != 0`, apply it to the label holder.
+6. Read the medoids region (always present) into the in-memory medoid list (used by filtered search).
+7. **PQ regions are skipped entirely.**
+
+### 3.2 SSD load — `PQFlashIndex::load_unified(num_threads, path)`
+
+1. Open the file via `AlignedFileReader` plus a sync `ifstream` for the small bits.
+2. Read header and offset table synchronously. Keep the offset table in memory as `_node_offsets` (`8 * (npts + 1)` bytes, the same order of magnitude as the existing `_medoids` / cache overhead).
+3. Set `_disk_index_file = path` and `_graph_region_base = header.graph_region_off`.
+4. Load PQ pivots and PQ codes via `load_bin_impl(path, header.pq_pivots_off)` and `load_bin_impl(path, header.pq_codes_off)`. SSD load fails fast if `HAS_PQ` is unset.
+5. Load medoids (always present) and `max_base_norm` (if `HAS_MAX_BASE_NORM`) from their `(off, len)`. Centroids are populated by calling `use_medoids_data_as_centroids()` (`src/pq_flash_index.cpp:401`) after the medoid list is known — this reads each medoid's full-precision vector from the graph region.
+6. Load labels (when `HAS_LABELS`) by the same dispatch as §3.1 step 5.
+7. At search time, replace the implicit per-node sector arithmetic (`get_node_sector(N) * SECTOR_LEN`, currently at `src/pq_flash_index.cpp:1430-1431`) with an offset-table lookup:
+   ```
+   start_byte    = graph_region_base + node_offsets[N]
+   end_byte      = graph_region_base + node_offsets[N + 1]
+   aligned_start = start_byte & ~(SECTOR_LEN - 1)
+   aligned_end   = (end_byte + SECTOR_LEN - 1) & ~(SECTOR_LEN - 1)
+   ```
+   Issue the aligned read; advance the in-buffer pointer by `(start_byte - aligned_start)` to land on the node record. Degree is `(end_byte - start_byte - dim * sizeof(T)) / 4`.
+
+This change is encapsulated in a single helper (`node_read_window(N)`) so the bulk of `cached_beam_search` is unchanged.
+
+---
+
+## 4. Build Path (informative)
+
+`build_unified_index` reuses the existing pipeline (preprocess → optional PQ training → `build_merged_vamana_index`) up to the point where the legacy code would write separate files or call `create_disk_layout`. From there:
+
+1. Train PQ if requested (same as today; skip entirely for in-memory-only builds).
+2. Stream each node from the in-memory Vamana graph + base vector file into `UnifiedIndexWriter`. The writer:
+   - Reserves the 4 KiB header.
+   - Reserves space for the offset table (`8 * (npts + 1)` bytes, rounded up to 4 KiB).
+   - Streams node records into the graph region, recording each record's offset in the offset-table buffer.
+   - Pads to 4 KiB, writes the medoids region.
+   - If PQ trained, pads and writes pivots + codes.
+   - If MIPS, pads and writes `max_base_norm`.
+   - If labels present, pads and writes the dictionary, per-point label offsets (Integer encoding only), and per-point labels, in that order.
+   - Seeks back to the start of the offset table and writes it.
+   - Seeks back to byte 0 and writes the final populated `UnifiedIndexHeader`.
+
+PQ-less builds simply leave `HAS_PQ = 0` and omit the PQ regions.
+
+---
+
+## 5. Implementation Roadmap
+
+### 5.1 New files
+
+| Path | Purpose |
+|------|---------|
+| `include/unified_index_format.h` | `UnifiedIndexHeader`, magic/version/flag constants, `DataTypeTag`/`MetricTag`/`LabelEncoding` enums, alignment helpers (`align_up_4k`). |
+| `include/unified_index_io.h` + `src/unified_index_io.cpp` | `UnifiedIndexWriter` (assembles container with correct alignment, accumulates offset table as it streams nodes) and `UnifiedIndexReader` (parses header, exposes region `(off, len)` pairs, plus a `read_node(N)` helper for in-memory loaders). |
+
+### 5.2 Modified files (additive only)
+
+| File | Change |
+|------|--------|
+| `src/disk_utils.cpp` | Add `build_unified_index(...)` next to `build_disk_index`. Same pipeline, but the post-Vamana repack step calls `UnifiedIndexWriter` instead of `create_disk_layout`, and label/medoid emission writes into the container instead of sidecar files. `build_disk_index` is untouched. |
+| `include/index.h`, `src/index.cpp` | Add `Index::save_unified(path)` and `Index::load_unified(path)`. `save_unified` walks `_data_store` + `InMemGraphStore::_graph` + label holders into `UnifiedIndexWriter`. `load_unified` parses the header and populates `_data_store` + `InMemGraphStore::_graph` from the graph region. Existing `save`/`load` paths are untouched. |
+| `include/pq_flash_index.h`, `src/pq_flash_index.cpp` | Add `PQFlashIndex::load_unified(num_threads, path)` as an alternative load path; the search path gains a `node_read_window(N)` helper and routes the existing async read through it. Existing `load` / `load_from_separate_paths` are untouched. |
+| `src/in_mem_graph_store.cpp` | Add `set_graph_from_unified(npts, max_degree, start, per_node_adjacency_view)` so `Index::load_unified` can populate the graph without going through the file-based `load_impl`. No change to `load`/`save`/`get_neighbours`. |
+| `src/abstract_index.cpp` | (Optional, follow-up.) Expose `save_unified` / `load_unified` through the virtual dispatch (`_save_unified`, `_load_unified`), mirroring the recently added `_debug_search` pattern. |
+
+### 5.3 Phasing
+
+The implementation is broken into phases so that each lands as a reviewable unit and can be reverted without affecting legacy paths.
+
+1. **Phase 1 — Format primitives.** Add `include/unified_index_format.h` and the `UnifiedIndexWriter`/`UnifiedIndexReader` library. Unit tests: round-trip header, round-trip a few graph regions, round-trip both label encodings.
+2. **Phase 2 — In-memory save/load.** Add `Index::save_unified` and `Index::load_unified`. Test: build a small in-memory index the legacy way, `save_unified`, `load_unified` into a fresh `Index`, run search, compare top-K against the original.
+3. **Phase 3 — Disk build (unified).** Add `build_unified_index` reusing the existing PQ training and Vamana code. Test: build dataset twice (legacy vs unified) with the same parameters; compare PQ pivots/codes/medoids/labels byte-for-byte where the legacy bins are payload-identical to the corresponding unified regions.
+4. **Phase 4 — SSD load (unified).** Add `PQFlashIndex::load_unified` and the `node_read_window` helper. Test: cross-load — `build_unified_index` → `PQFlashIndex::load_unified` → search → compare recall and latency against legacy disk-build + legacy disk-load.
+5. **Phase 5 — Optional virtual dispatch.** Expose `save_unified` / `load_unified` on `AbstractIndex`.
+
+Each phase keeps legacy paths fully working and adds no caller-side migration burden.
+
+---
+
+## 6. Verification
+
+1. **Build symmetry.** Build a small dataset (~10 K vectors) the legacy way and the unified way with identical parameters. The unified file's PQ pivots, PQ codes, medoids, max-norm, and label payload bytes should match the corresponding legacy bin payloads byte-for-byte (modulo any in-bin headers that `load_bin_impl` handles).
+2. **Cross-load (memory).** Build unified → load with `Index::load_unified` → run search; compare recall@10 against legacy in-memory build + legacy load over the same dataset. The graph is identical so recall should match within a tight margin.
+3. **Cross-load (disk).** Build unified → load with `PQFlashIndex::load_unified` → run search; compare recall@10 *and* latency against legacy disk build + legacy disk load. Flag if the unaligned-slice read amplification regresses by more than ~10 % (this is a known "test later" item).
+4. **PQ-less unified.** Build unified without PQ (in-memory-only). Confirm: file is smaller; `PQFlashIndex::load_unified` rejects it with a clear "missing PQ" error; `Index::load_unified` succeeds.
+5. **Legacy regression.** Run the existing test suite (`tests/`, `tests/utils/`). All legacy load/build paths must continue to pass unchanged.
+6. **Forward-compat.** Hand-craft a unified file with `version = UNIFIED_FORMAT_VERSION + 1` and confirm both loaders fail fast with an "unsupported version" error rather than silently misinterpreting.
+
+---
+
+## 7. Open Questions and Follow-ups
+
+- **Read amplification.** Dropping per-node sector padding means SSD reads slice the node record out of a 4 KiB-aligned window of at most `node_record_size + 2 * (SECTOR_LEN - 1)` bytes, so a record smaller than one sector can still cost two sector reads. This is the regression the user has flagged for measurement. If unacceptable, a follow-up can add an opt-in `pad_nodes_to_sector` build flag whose payload format is a strict subset of v1 (same header, same offset table, just larger `offset_table[N+1] - offset_table[N]` deltas).
+- **`AbstractIndex` virtual dispatch.** Whether `save_unified`/`load_unified` need to be exposed through the type-erased base depends on caller demand; deferred to Phase 5.
+- **Conversion tool.** Not in v1. If needed later, a small `legacy_to_unified` utility can be added that calls `UnifiedIndexReader`/`UnifiedIndexWriter` and reads legacy bins via existing helpers; no format change required.
+
+---
+
+## 8. Glossary
+
+| Term | Definition |
+|------|------------|
+| `SECTOR_LEN` | 4096 bytes. Alignment unit for `AlignedFileReader` I/O: Windows unbuffered reads (`FILE_FLAG_NO_BUFFERING`) require sector-aligned offsets and sizes, while libaio on Linux requires at least 512-byte alignment. The unified format uses 4096 throughout for cross-platform compatibility. |
+| `T` | The vector element type, one of `float`, `uint8_t`, `int8_t`, encoded as `DataTypeTag`. |
+| `LabelT` | Label integer type, `uint16_t` or `uint32_t`, fixed at build time by the template instantiation. |
+| `medoid` | Graph entry node for search. Unfiltered indices have one; filtered indices have one per label. |
+| `universal_label` | A label value that matches every point unconditionally. Sentinel `0` means none. |
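The offset-table arithmetic from §2.3 and §3.2 step 7 of `docs/unified_index_format.md` can be sketched as a small helper. This is only an illustration under stated assumptions: `NodeReadWindow`, `kSectorLen`, and the function signatures below are hypothetical and not part of the patch; only the formulas are taken from the spec.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical names throughout; only the arithmetic comes from the spec.
constexpr uint64_t kSectorLen = 4096;

struct NodeReadWindow {
    uint64_t aligned_start; // file offset of the first sector to read
    uint64_t aligned_len;   // bytes to read, a multiple of kSectorLen
    uint64_t skip;          // bytes from the buffer start to the node record
    uint64_t degree;        // neighbor count derived from the record size
};

// Round x up to the next multiple of kSectorLen (a power of two).
inline uint64_t align_up_4k(uint64_t x) {
    return (x + kSectorLen - 1) & ~(kSectorLen - 1);
}

// offsets holds npts + 1 entries relative to graph_region_off;
// coord_bytes is dim * sizeof(T).
inline NodeReadWindow node_read_window(const std::vector<uint64_t> &offsets,
                                       uint64_t graph_region_off,
                                       uint64_t coord_bytes, uint64_t n) {
    assert(n + 1 < offsets.size());
    const uint64_t start = graph_region_off + offsets[n];
    const uint64_t end = graph_region_off + offsets[n + 1];
    const uint64_t record = end - start;
    // A record is the coordinates followed by a whole number of uint32_t IDs.
    assert(record >= coord_bytes);
    assert((record - coord_bytes) % sizeof(uint32_t) == 0);
    const uint64_t aligned_start = start & ~(kSectorLen - 1);
    return {aligned_start, align_up_4k(end) - aligned_start,
            start - aligned_start,
            (record - coord_bytes) / sizeof(uint32_t)};
}
```

A 24-byte record that begins 4090 bytes into a sector yields a two-sector window, which is the read amplification that §7 of the doc asks to measure.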
diff --git a/include/filter_match_proxy.h b/include/filter_match_proxy.h
index 51ec52e9e..1224dedcc 100644
--- a/include/filter_match_proxy.h
+++ b/include/filter_match_proxy.h
@@ -20,11 +20,18 @@ namespace diskann
             const std::vector<LabelT>& filter_labels,
             LabelT unv_label);
 
+        // Ctor variant that owns its per-query scratch buffer internally.
+        // Used by the unified-index path (see unified_label_data_bitmask::make_match_proxy).
+        bitmask_filter_match(simple_bitmask_buf& bitmask_filters,
+            const std::vector<LabelT>& filter_labels,
+            LabelT unv_label);
+
         virtual bool contain_filtered_label(uint32_t id) override;
 
     private:
         simple_bitmask_buf& _bitmask_filters;
-        std::vector<std::uint64_t>& _query_bitmask_buf;
+        std::vector<std::uint64_t> _owned_query_bitmask_buf;  // populated only by the 3-arg ctor
+        std::vector<std::uint64_t>& _query_bitmask_buf;       // refs either external or _owned
         simple_bitmask_full_val _bitmask_full_val;
     };
 
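The member layout in this hunk (an owned buffer declared ahead of a reference that may bind to it) relies on C++ initializing members in declaration order. A minimal sketch of the pattern follows; `scratch_user` and its members are hypothetical stand-ins, not the actual `bitmask_filter_match` implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in; only the member ordering mirrors the diff.
class scratch_user {
  public:
    // Borrowing ctor: the caller keeps ownership of the scratch buffer.
    explicit scratch_user(std::vector<std::uint64_t> &external)
        : _owned(), _buf(external) {}

    // Owning ctor: the reference binds to the member declared before it.
    explicit scratch_user(std::size_t words)
        : _owned(words, 0), _buf(_owned) {}

    // A copied object would keep referring to the source's buffer,
    // so copying is disabled (this also suppresses implicit moves).
    scratch_user(const scratch_user &) = delete;
    scratch_user &operator=(const scratch_user &) = delete;

    std::vector<std::uint64_t> &buf() { return _buf; }
    bool owns_buffer() const { return &_buf == &_owned; }

  private:
    // Declared first so it is already constructed if a later member
    // initializer reads through _buf.
    std::vector<std::uint64_t> _owned;
    std::vector<std::uint64_t> &_buf; // external buffer or _owned
};
```

Whether the real class should also delete its copy operations is a design choice for the patch author; the sketch only shows why a defaulted copy would be risky with this layout.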
diff --git a/include/index.h b/include/index.h
index 79f0cef1c..97bd2420a 100644
--- a/include/index.h
+++ b/include/index.h
@@ -85,9 +85,17 @@ template <typename T, typename TagT = uint32_t, typename LabelT = uint32_t> clas
     DISKANN_DLLEXPORT void load(const IndexLoadParams& load_params);
 
     DISKANN_DLLEXPORT void load(const char *index_file, uint32_t num_threads, uint32_t search_l, LabelFormatType label_format_type = LabelFormatType::String);
-    
+
 #endif
 
+    // Unified single-file format. See docs/unified_index_format.md.
+    DISKANN_DLLEXPORT void save_unified(const char *filename);
+    // Variant of save_unified that also emits a PQ region. Pass empty
+    // buffers to skip PQ (equivalent to the single-argument overload). Used by
+    // unified_index_builder.
+    DISKANN_DLLEXPORT void save_unified(const char *filename, const std::vector<uint8_t> &pq_pivots_bytes,
+                                         const std::vector<uint8_t> &pq_codes_bytes);
+
     // get some private variables
     DISKANN_DLLEXPORT size_t get_num_points();
     DISKANN_DLLEXPORT size_t get_max_points();
diff --git a/include/integer_label_vector.h b/include/integer_label_vector.h
index 68688419f..4351c76c0 100644
--- a/include/integer_label_vector.h
+++ b/include/integer_label_vector.h
@@ -12,6 +12,17 @@ class integer_label_vector
 
     bool initialize_from_file(const std::string &label_file, size_t &numpoints);
 
+    bool initialize_from_buffers(const size_t *offsets, size_t num_points,
+                                 const uint32_t *labels, size_t total_labels);
+
+    // Zero-copy load path: the caller calls resize_for_load to pre-size both
+    // buffers, then writes through the raw pointers below, after which the
+    // integer_label_vector is ready to use. This two-step form skips the
+    // intermediate vector<uint8_t> + assign() copies that initialize_from_buffers incurs.
+    void resize_for_load(size_t num_points, size_t total_labels);
+    size_t *mutable_offset_data();   // size: num_points + 1 entries (size_t each)
+    uint32_t *mutable_label_data();  // size: total_labels entries (uint32_t each)
+
     bool write_to_file(const std::string &label_file) const;
 
     template <typename LabelT>
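The two-step load declared in this hunk can be illustrated with a miniature offsets-plus-payload container. This is a sketch under assumptions: `label_csr`, `labels_of`, and the private member names are hypothetical, and it assumes a 64-bit platform where `size_t` matches the on-disk `uint64` offsets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical miniature of integer_label_vector; names are illustrative.
class label_csr {
  public:
    // Size both arrays up front so a loader can read file bytes
    // straight into them without an intermediate copy.
    void resize_for_load(std::size_t num_points, std::size_t total_labels) {
        _offset.assign(num_points + 1, 0);
        _data.assign(total_labels, 0);
    }
    std::size_t *mutable_offset_data() { return _offset.data(); }
    std::uint32_t *mutable_label_data() { return _data.data(); }

    // Labels of point n occupy _data[_offset[n] .. _offset[n + 1]).
    std::vector<std::uint32_t> labels_of(std::size_t n) const {
        assert(n + 1 < _offset.size());
        const auto first = _data.begin() + static_cast<std::ptrdiff_t>(_offset[n]);
        const auto last = _data.begin() + static_cast<std::ptrdiff_t>(_offset[n + 1]);
        return std::vector<std::uint32_t>(first, last);
    }

  private:
    std::vector<std::size_t> _offset;  // num_points + 1 entries
    std::vector<std::uint32_t> _data;  // total_labels entries
};
```

A loader would call `resize_for_load`, read the offsets and label bytes into the two pointers, and only then query the container; a point with equal consecutive offsets simply has no labels.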
diff --git a/include/label_bitmask.h b/include/label_bitmask.h
index e0917bec0..ac0e669dd 100644
--- a/include/label_bitmask.h
+++ b/include/label_bitmask.h
@@ -2,6 +2,8 @@
 #include <cstdint>
 #include <vector>
 
+#include "windows_customizations.h"
+
 namespace diskann
 {
 
@@ -45,7 +47,15 @@ struct simple_bitmask_buf
 
 };
 
-class simple_bitmask
+// NOTE: simple_bitmask stays DISKANN_DLLEXPORT even though the unit tests now
+// link the static diskann_s lib (where DISKANN_DLLEXPORT is a no-op) and no
+// longer need it exported. It is kept because ColorInfoVector's inline
+// constructor (include/color_info.h, pulled in widely via neighbor.h) odr-uses
+// simple_bitmask's out-of-line methods (ctor, get_bitmask_size), so any DLL
+// consumer that instantiates it must import them. TODO: once that inline
+// dependency is removed or proven unused by every DLL consumer, drop this
+// export too -- simple_bitmask is otherwise an internal helper.
+class DISKANN_DLLEXPORT simple_bitmask
 {
 public:
     simple_bitmask(std::uint64_t* bitsets, std::uint64_t bitmask_size);
diff --git a/include/pq.h b/include/pq.h
index 3e6119f22..1055467ed 100644
--- a/include/pq.h
+++ b/include/pq.h
@@ -30,6 +30,14 @@ class FixedChunkPQTable
     void load_pq_centroid_bin(const char *pq_table_file, size_t num_chunks);
 #endif
 
+    // In-memory variant of load_pq_centroid_bin. Parses the same on-disk
+    // pq_pivots blob format (outer bin -> 4 or 5 sub-bins for offsets,
+    // pivot table, centroid, [old per-chunk dim], chunk offsets), but reads
+    // straight from a caller-supplied buffer -- no temp file, no disk IO.
+    // Does NOT support OPQ rotation matrix (unified-format PQ is always
+    // standard PQ).
+    void load_pq_centroid_bin_from_memory(const uint8_t *blob, size_t blob_len, size_t num_chunks);
+
     uint32_t get_num_chunks();
 
     void preprocess_query(float *query_vec);
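Reading a bin from a caller-supplied buffer, as `load_pq_centroid_bin_from_memory` is described to do, mostly comes down to bounds-checked header parsing. The sketch below assumes the conventional bin layout of two 32-bit integers (row count, column count) followed by the packed payload; `bin_view` and `parse_bin` are hypothetical names, and the real function may structure this differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>

// Hypothetical view type. Assumes a header of two 32-bit ints (rows, cols)
// in host byte order, followed by rows * cols elements.
struct bin_view {
    uint32_t rows;
    uint32_t cols;
    const uint8_t *payload; // points into the caller's buffer, not owned
};

// Parse one bin of elem_size-byte elements starting at blob + off.
inline bin_view parse_bin(const uint8_t *blob, size_t blob_len, size_t off,
                          size_t elem_size) {
    const size_t header = 2 * sizeof(int32_t);
    if (off > blob_len || blob_len - off < header)
        throw std::runtime_error("bin header out of range");
    int32_t rows = 0, cols = 0;
    std::memcpy(&rows, blob + off, sizeof(rows)); // memcpy tolerates unaligned input
    std::memcpy(&cols, blob + off + sizeof(rows), sizeof(cols));
    if (rows < 0 || cols < 0)
        throw std::runtime_error("negative bin dimensions");
    const size_t avail = blob_len - off - header;
    const uint64_t cells = static_cast<uint64_t>(rows) * static_cast<uint64_t>(cols);
    // Divide instead of multiplying so the size check cannot overflow.
    if (elem_size == 0 || cells > avail / elem_size)
        throw std::runtime_error("bin payload truncated");
    return {static_cast<uint32_t>(rows), static_cast<uint32_t>(cols),
            blob + off + header};
}
```

Nested sub-bins could be walked by calling `parse_bin` again at offsets taken from the outer bin's payload, with the same buffer length passed through so every read stays bounds-checked.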
diff --git a/include/pq_flash_index.h b/include/pq_flash_index.h
index ec024c4a2..45f90025e 100644
--- a/include/pq_flash_index.h
+++ b/include/pq_flash_index.h
@@ -52,6 +52,8 @@ template <typename T, typename LabelT = uint32_t> class PQFlashIndex
                                                     LabelFormatType label_format_type = LabelFormatType::String);
 #endif
 
+    // (load_unified removed; use diskann::make_unified_index_ssd(reader, ctx) — see include/unified_index.h.)
+
     DISKANN_DLLEXPORT void load_cache_list(std::vector<uint32_t> &node_list);
 
     DISKANN_DLLEXPORT void cache_bfs_levels(uint64_t num_nodes_to_cache, std::vector<uint32_t> &node_list,
diff --git a/include/unified_index.h b/include/unified_index.h
new file mode 100644
index 000000000..5e2d80334
--- /dev/null
+++ b/include/unified_index.h
@@ -0,0 +1,96 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT license.
+
+#pragma once
+
+#include <cstdint>
+#include <functional>
+#include <memory>
+#include <optional>
+#include <string>
+#include <vector>
+
+#include "aligned_file_reader.h"
+#include "distance.h"
+#include "percentile_stats.h"
+#include "unified_index_format.h"
+#include "windows_customizations.h"
+
+namespace diskann
+{
+
+struct QueryStats;
+struct DebugTraversalInfo;
+
+// Knobs passed to unified_index::load. `path` identifies the unified container
+// file. `num_threads` and `search_l` size per-thread scratch on the memory
+// implementation. `num_nodes_to_cache` triggers SSD static-cache priming
+// (no-op for the memory implementation).
+struct UnifiedLoadContext
+{
+    std::string path;
+    uint32_t num_threads = 1;
+    uint32_t search_l = 100;
+    uint64_t num_nodes_to_cache = 0;
+};
+
+// Single in/out container for a search call. The caller fills inputs and
+// allocates the output buffers; search() writes outputs (and optional
+// telemetry) directly. No allocation happens inside search().
+struct UnifiedSearchContext
+{
+    // ---- Inputs ----
+    const void *query = nullptr;          // typed by caller as const T*
+    size_t K = 10;
+    uint32_t L = 100;
+    // Filter labels as user-facing strings. Required non-empty if the loaded
+    // index has labels; required empty otherwise. The index converts strings
+    // to internal label ints per its encoding.
+    std::vector<std::string> filter_labels;
+    std::optional<uint32_t> beam_width;                            // SSD-only
+    std::optional<uint32_t> io_limit;                              // SSD-only
+    std::function<float(const std::uint8_t *, size_t)> rerank_fn;  // SSD-only
+
+    // ---- Outputs (caller-allocated, length >= K) ----
+    uint64_t *indices = nullptr;
+    float *distances = nullptr;
+
+    // ---- Optional telemetry sinks (nullptr = no telemetry) ----
+    QueryStats *stats = nullptr;
+    DebugTraversalInfo *debug_info = nullptr;
+};
+
+// Non-templated public interface returned by the factory. Users program
+// against this; the templated `unified_index_base<T>` implements it.
+class unified_index
+{
+  public:
+    virtual ~unified_index() = default;
+
+    virtual void load(const UnifiedLoadContext &ctx) = 0;
+    virtual void search(UnifiedSearchContext &ctx) = 0;
+
+    virtual const UnifiedIndexHeader &header() const = 0;
+    virtual uint64_t num_points() const = 0;
+    virtual uint64_t dim() const = 0;
+    virtual uint64_t aligned_dim() const = 0;
+    virtual diskann::Metric metric() const = 0;
+    virtual DataTypeTag data_type() const = 0;
+    virtual bool has_labels() const = 0;
+
+    // Resident memory / cardinality accounting for the loaded index, mirroring
+    // Index::get_table_stats() and PQFlashIndex::get_table_stats().
+    virtual TableStats get_table_stats() const = 0;
+};
+
+// Factory: open a unified file fully in memory. Peeks the 4 KiB header,
+// dispatches on `data_type`, instantiates the right templated implementation,
+// calls load(ctx), returns the owning pointer as the non-templated interface.
+std::unique_ptr<unified_index> make_unified_index_memory(const UnifiedLoadContext &ctx);
+
+// Factory: open a unified file in disk-resident (SSD) mode. The supplied
+// AlignedFileReader is handed to the constructed unified_index_ssd<T>.
+std::unique_ptr<unified_index> make_unified_index_ssd(
+    std::shared_ptr<AlignedFileReader> reader, const UnifiedLoadContext &ctx);
+
+} // namespace diskann
diff --git a/include/unified_index_base.h b/include/unified_index_base.h
new file mode 100644
index 000000000..708e15589
--- /dev/null
+++ b/include/unified_index_base.h
@@ -0,0 +1,109 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT license.
+
+#pragma once
+
+#include <cstdint>
+#include <memory>
+#include <string>
+
+#include "distance.h"
+#include "unified_index.h"
+#include "unified_index_format.h"
+#include "unified_label_data.h"
+#include "unified_node_store.h"
+#include "windows_customizations.h"
+
+namespace diskann
+{
+
+class UnifiedIndexReader;
+
+// Templated implementation of the non-templated `unified_index` interface.
+// Holds the parsed header, the metric, the label data (built by
+// make_unified_label_data), and the node store (a unified_node_store_memory<T>
+// or unified_node_store_ssd<T>, plugged in by the derived class's
+// `load_storage`).
+template <typename T>
+class unified_index_base : public unified_index
+{
+  public:
+    explicit unified_index_base(diskann::Metric metric);
+    ~unified_index_base() override;
+
+    void load(const UnifiedLoadContext &ctx) override;
+    void search(UnifiedSearchContext &ctx) override;
+
+    const UnifiedIndexHeader &header() const override
+    {
+        return _header;
+    }
+    uint64_t num_points() const override
+    {
+        return _header.npts;
+    }
+    uint64_t dim() const override
+    {
+        return _header.dim;
+    }
+    uint64_t aligned_dim() const override
+    {
+        return _header.aligned_dim;
+    }
+    diskann::Metric metric() const override
+    {
+        return _metric;
+    }
+    DataTypeTag data_type() const override
+    {
+        return data_type_tag_of<T>();
+    }
+    bool has_labels() const override
+    {
+        return _labels && _labels->has_labels();
+    }
+    TableStats get_table_stats() const override
+    {
+        return _table_stats;
+    }
+
+    // Templated read-only accessors for in-process callers that *do* know T
+    // (unit tests, the index's own search loop). Not on the public interface.
+    const unified_label_data_base *labels() const
+    {
+        return _labels.get();
+    }
+    const unified_node_store_base<T> *nodes() const
+    {
+        return _store.get();
+    }
+    unified_node_store_base<T> *nodes()
+    {
+        return _store.get();
+    }
+
+  protected:
+    // Derived class is responsible for instantiating the right _store subclass
+    // and calling its load(). It may inspect ctx for SSD-only knobs like
+    // ctx.num_nodes_to_cache.
+    virtual void load_storage(UnifiedIndexReader &r, const UnifiedLoadContext &ctx) = 0;
+    virtual void search_impl(UnifiedSearchContext &ctx) = 0;
+
+    // Fill the storage-specific resident-memory fields (node_mem_usage,
+    // graph_mem_usage) of `stats`. Memory reports resident coords/graph; SSD
+    // reports the resident PQ codes (graph lives on disk). Called by load()
+    // after load_storage() so the store is populated.
+    virtual void fill_storage_stats(TableStats &stats) const = 0;
+
+    void validate_header(const UnifiedIndexHeader &h) const;
+    void validate_search_context(const UnifiedSearchContext &ctx) const;
+
+    UnifiedIndexHeader _header{};
+    diskann::Metric _metric;
+    std::unique_ptr<unified_label_data_base> _labels;       // nullptr when header has no labels
+    std::unique_ptr<unified_node_store_base<T>> _store;     // built by derived load_storage()
+    std::string _index_path;
+    TableStats _table_stats;
+};
+
+} // namespace diskann
diff --git a/include/unified_index_builder.h b/include/unified_index_builder.h
new file mode 100644
index 000000000..f7e3267b7
--- /dev/null
+++ b/include/unified_index_builder.h
@@ -0,0 +1,74 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT license.
+
+#pragma once
+
+#include <cstdint>
+#include <memory>
+#include <string>
+
+#include "distance.h"
+#include "unified_index_format.h"
+#include "windows_customizations.h"
+
+namespace diskann
+{
+
+// All parameters required to build a unified-format index file.
+//
+// One struct, runtime-typed (no template). The data_type field selects which
+// concrete `Index<T>` is instantiated internally; coords are read from
+// `data_file_path` in `.bin` format (the legacy DiskANN file layout).
+struct UnifiedBuildContext
+{
+    // --- Input data ---
+    std::string data_file_path; // .bin file holding N points x dim coords of `data_type`
+    DataTypeTag data_type = DataTypeTag::Float;
+    diskann::Metric metric = diskann::Metric::L2;
+
+    // --- Graph build parameters (Vamana) ---
+    uint32_t R = 64;            // max degree
+    uint32_t L = 100;           // search list size during build
+    float alpha = 1.2f;         // pruning alpha
+    uint32_t num_threads = 0;   // 0 = use omp_get_num_procs()
+
+    // --- PQ parameters ---
+    // pq_dim == 0          => no PQ (memory-only unified file; SSD load will reject).
+    // 0 < pq_dim < dim     => train PQ with `pq_dim` chunks on a sampled subset and
+    //                         emit pivots + codes into the unified file.
+    // pq_dim >= dim        => train PQ with `dim` chunks (chunk size 1, i.e. one chunk
+    //                         per dimension). Clamped so the SSD load path -- which
+    //                         requires HAS_PQ -- can always load the produced file.
+    uint32_t pq_dim = 0;
+    double pq_sampling_rate = 0.1; // fraction of points to sample for pivot training (clamped by the builder)
+
+    // --- Optional filtered-index inputs ---
+    std::string label_file;       // per-point labels (.txt), empty = unfiltered
+    std::string universal_label;  // string to treat as "any label"
+    bool use_integer_labels = false;
+
+    // --- Output ---
+    std::string output_path; // destination unified container file
+};
+
+// Builds a unified-format index file end-to-end: trains the Vamana graph from
+// the input data file, optionally trains PQ on a sampled subset, then writes
+// graph + medoids + (optional) PQ + (optional) labels into the unified
+// container at `ctx.output_path`.
+//
+// Class shape (instead of free function) leaves room for future stateful build
+// modes (incremental build, multi-pass, etc.). For now `build()` is the only
+// method.
+class unified_index_builder
+{
+  public:
+    unified_index_builder();
+    ~unified_index_builder();
+
+    // Throws ANNException on failure (file open, mismatched dims, build crash,
+    // PQ training error, etc.). Returns successfully when the unified file is
+    // fully written and closed.
+    void build(const UnifiedBuildContext &ctx);
+};
+
+} // namespace diskann
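
Review note on `UnifiedBuildContext::pq_dim`: the three cases documented in the struct comment reduce to a single clamp. The sketch below is a standalone illustration under that reading; `effective_pq_chunks` is a hypothetical helper name, not part of the builder's API, and the actual builder may compute this differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not in the PR): number of PQ chunks that would be
// trained for a given pq_dim, following the three documented cases.
// Returns 0 when no PQ section would be written (HAS_PQ unset).
inline uint32_t effective_pq_chunks(uint32_t pq_dim, uint64_t dim)
{
    if (pq_dim == 0)
        return 0; // memory-only file; the SSD load path rejects it
    // 0 < pq_dim < dim keeps pq_dim; pq_dim >= dim is clamped to dim.
    return static_cast<uint32_t>(std::min<uint64_t>(pq_dim, dim));
}
```

With `dim = 128`, a request for 32 chunks stays at 32, while a request for 500 chunks is clamped to 128.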
diff --git a/include/unified_index_format.h b/include/unified_index_format.h
new file mode 100644
index 000000000..31b802cbb
--- /dev/null
+++ b/include/unified_index_format.h
@@ -0,0 +1,100 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT license.
+
+#pragma once
+
+#include <cstdint>
+#include <type_traits>
+
+namespace diskann
+{
+
+constexpr uint32_t UNIFIED_FORMAT_MAGIC = 0x444E4E55; // "UNND" little-endian
+constexpr uint32_t UNIFIED_FORMAT_VERSION = 2;
+constexpr uint64_t UNIFIED_FORMAT_ALIGN = 4096;
+
+enum class DataTypeTag : uint32_t
+{
+    Float = 1,
+    Uint8 = 2,
+    Int8 = 3,
+};
+
+enum class MetricTag : uint32_t
+{
+    L2 = 1,
+    InnerProduct = 2,
+    Cosine = 3,
+};
+
+enum class LabelEncoding : uint32_t
+{
+    None = 0,
+    Bitmask = 1,
+    Integer = 2,
+};
+
+enum UnifiedFormatFlags : uint32_t
+{
+    HAS_PQ = 1u << 0,
+    HAS_LABELS = 1u << 1,
+    HAS_MAX_BASE_NORM = 1u << 2,
+};
+
+#pragma pack(push, 1)
+struct UnifiedIndexHeader
+{
+    uint32_t magic;
+    uint32_t version;
+    DataTypeTag data_type;
+    MetricTag metric;
+    uint64_t npts;
+    uint64_t dim;
+    uint64_t aligned_dim;
+    uint32_t max_degree;
+    uint32_t flags;
+    uint64_t start_node;
+
+    uint64_t offset_table_off, offset_table_len;
+    uint64_t graph_region_off, graph_region_len;
+    uint64_t medoids_off, medoids_len;
+    uint64_t pq_pivots_off, pq_pivots_len;
+    uint64_t pq_codes_off, pq_codes_len;
+    uint64_t max_base_norm_off, max_base_norm_len;
+
+    LabelEncoding label_encoding;
+    uint64_t universal_label;
+    uint64_t total_labels;
+    uint64_t label_dictionary_off, label_dictionary_len;
+    uint64_t per_point_labels_off, per_point_labels_len;
+    uint64_t per_point_label_offsets_off, per_point_label_offsets_len;
+
+    // Total size of the file in bytes. Populated by finalize() and validated
+    // by readers on load (truncated / over-sized files are rejected).
+    // Also useful for disk-quota / capacity-planning logs.
+    uint64_t file_size_bytes;
+
+    uint8_t _reserved[4096 - (sizeof(uint32_t) * 7 + sizeof(uint64_t) * 25)];
+};
+#pragma pack(pop)
+
+static_assert(sizeof(UnifiedIndexHeader) == 4096, "header must occupy exactly one sector");
+
+inline uint64_t align_up_4k(uint64_t v)
+{
+    return (v + UNIFIED_FORMAT_ALIGN - 1) & ~(UNIFIED_FORMAT_ALIGN - 1);
+}
+
+template <typename T> constexpr DataTypeTag data_type_tag_of()
+{
+    if constexpr (std::is_same_v<T, float>)
+        return DataTypeTag::Float;
+    else if constexpr (std::is_same_v<T, uint8_t>)
+        return DataTypeTag::Uint8;
+    else if constexpr (std::is_same_v<T, int8_t>)
+        return DataTypeTag::Int8;
+    else
+        static_assert(!sizeof(T), "unsupported data type");
+}
+
+} // namespace diskann
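
Review note on the header arithmetic in `unified_index_format.h`: the sketch below re-declares the constants locally (copies for illustration, not the library header) to check the magic bytes, the 4 KiB rounding rule, and the size of `_reserved`. The names `kMagic`, `kAlign`, `align_up_4k_sketch`, and `magic_bytes_le` are hypothetical. The field count assumes the three `enum class ... : uint32_t` members are each 4 bytes under `#pragma pack(1)`.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Local copies of constants from unified_index_format.h (illustration only).
constexpr uint32_t kMagic = 0x444E4E55;
constexpr uint64_t kAlign = 4096;

// Same rounding rule as align_up_4k(): round v up to a multiple of 4096.
constexpr uint64_t align_up_4k_sketch(uint64_t v)
{
    return (v + kAlign - 1) & ~(kAlign - 1);
}

// Hypothetical helper: the four magic bytes in little-endian on-disk order.
inline std::array<char, 4> magic_bytes_le()
{
    std::array<char, 4> out{};
    for (int i = 0; i < 4; ++i)
        out[i] = static_cast<char>((kMagic >> (8 * i)) & 0xFF);
    return out;
}

// Named header fields: 7 x uint32 (4 plain + 3 enums) and 25 x uint64, packed.
constexpr uint64_t kNamedFieldBytes = sizeof(uint32_t) * 7 + sizeof(uint64_t) * 25;
constexpr uint64_t kReservedBytes = 4096 - kNamedFieldBytes;
```

Under these assumptions the named fields occupy 228 bytes, leaving 3868 bytes of `_reserved`, and the magic reads as `U`, `N`, `N`, `D` on disk.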
diff --git a/include/unified_index_io.h b/include/unified_index_io.h
new file mode 100644
index 000000000..3eb88286e
--- /dev/null
+++ b/include/unified_index_io.h
@@ -0,0 +1,116 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT license.
+
+#pragma once
+
+#include <cstdint>
+#include <fstream>
+#include <memory>
+#include <string>
+#include <vector>
+
+#include "unified_index_format.h"
+#include "windows_customizations.h"
+
+namespace diskann
+{
+
+// Streaming writer for the unified index container.
+//
+// Caller drives the writer in this order:
+//   1) begin(npts, dim, aligned_dim, max_degree, data_type, metric, start_node)
+//   2) begin_graph_region()
+//      for each node N in [0, npts): write_node(coords_ptr, neighbors_ptr, degree)
+//      end_graph_region()
+//   3) write_medoids(medoid_ids, num_medoids)              // always called
+//   4) (optional) write_pq(pivots_bytes, ..., codes_bytes, ...)
+//   5) (optional) write_max_base_norm(value)
+//   6) (optional) write_labels_bitmask(...) or write_labels_integer(...)
+//   7) finalize() -- seeks back, writes offset table and header
+//
+// Writer assumes nodes are appended in strict id order 0..npts-1.
+class UnifiedIndexWriter
+{
+  public:
+    explicit UnifiedIndexWriter(const std::string &path);
+    ~UnifiedIndexWriter();
+
+    void begin(uint64_t npts, uint64_t dim, uint64_t aligned_dim, uint32_t max_degree,
+                                 DataTypeTag data_type, MetricTag metric, uint64_t start_node);
+
+    void begin_graph_region();
+    void write_node(const void *coords, const uint32_t *neighbors, uint32_t degree);
+    void end_graph_region();
+
+    void write_medoids(const uint32_t *medoid_ids, uint64_t num_medoids);
+    void write_pq(const void *pivots_bytes, uint64_t pivots_len, const void *codes_bytes,
+                                    uint64_t codes_len);
+    void write_max_base_norm(float value);
+
+    // Bitmask encoding: bitmask_bytes = packed rows of `bitmask_size_words * 8` bytes each, npts rows.
+    void write_labels_bitmask(uint64_t total_labels, uint64_t universal_label,
+                                                const void *dictionary_bytes, uint64_t dictionary_len,
+                                                const void *bitmask_bytes, uint64_t bitmask_bytes_len);
+
+    // Integer encoding: per_point_offsets is uint64[npts+1] into per_point_data.
+    void write_labels_integer(uint64_t total_labels, uint64_t universal_label,
+                                                const void *dictionary_bytes, uint64_t dictionary_len,
+                                                const void *per_point_data, uint64_t per_point_data_len,
+                                                const uint64_t *per_point_offsets);
+
+    void finalize();
+
+  private:
+    void pad_to_4k();
+    void write_raw(const void *bytes, uint64_t len);
+    uint64_t cur_offset();
+
+    std::string _path;
+    std::ofstream _out;
+    UnifiedIndexHeader _header{};
+    std::vector<uint64_t> _node_offsets; // size npts+1, byte offsets within graph region
+    uint64_t _graph_region_start = 0;
+    uint64_t _written_nodes = 0;
+    bool _graph_open = false;
+    bool _finalized = false;
+};
+
+// Read-only view over a unified container file.
+//
+// Holds the parsed header and provides byte ranges for each region. Does not
+// own the file -- callers re-open as needed (e.g. AlignedFileReader for SSD path).
+class UnifiedIndexReader
+{
+  public:
+    explicit UnifiedIndexReader(const std::string &path);
+
+    const UnifiedIndexHeader &header() const
+    {
+        return _header;
+    }
+    const std::string &path() const
+    {
+        return _path;
+    }
+
+    // Load and return the uint64[npts+1] offset table.
+    std::vector<uint64_t> load_offset_table();
+
+    // Load a region's bytes into a freshly-allocated buffer.
+    std::vector<uint8_t> load_region(uint64_t off, uint64_t len);
+
+    // Load a region's bytes directly into a caller-owned buffer. Caller is
+    // responsible for sizing the buffer to at least `len` bytes. Avoids the
+    // intermediate allocation+copy that the vector-returning overload incurs;
+    // intended for hot load paths that already own (or can size) the final
+    // destination storage.
+    void load_region(uint64_t off, uint64_t len, uint8_t *dst);
+
+  private:
+    void parse_header();
+
+    std::string _path;
+    UnifiedIndexHeader _header{};
+};
+
+} // namespace diskann
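
Review note on the `uint64[npts+1]` offset table returned by `load_offset_table()`: the node-store comments in this PR state that each node is stored as coords followed by neighbor ids, with no explicit degree field. A minimal standalone sketch of recovering the degree from the offset delta follows. `degree_from_offsets` is a hypothetical helper, and the error handling is an assumption, not the PR's behavior.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not in the PR): recover the degree of node `id`,
// assuming the per-node layout is
//   [aligned_dim * elem_size bytes of coords][degree * 4 bytes of neighbor ids].
inline uint32_t degree_from_offsets(const std::vector<uint64_t> &offsets, uint64_t id,
                                    uint64_t aligned_dim, uint64_t elem_size)
{
    if (id + 1 >= offsets.size())
        throw std::out_of_range("node id outside offset table");
    const uint64_t node_len = offsets[id + 1] - offsets[id];
    const uint64_t coord_len = aligned_dim * elem_size;
    // A valid entry holds the coords plus a whole number of uint32 neighbor ids.
    if (node_len < coord_len || (node_len - coord_len) % sizeof(uint32_t) != 0)
        throw std::runtime_error("corrupt offset table entry");
    return static_cast<uint32_t>((node_len - coord_len) / sizeof(uint32_t));
}
```

For `float` data with `aligned_dim = 8`, a 44-byte node carries 32 bytes of coords and three neighbor ids.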
diff --git a/include/unified_index_memory.h b/include/unified_index_memory.h
new file mode 100644
index 000000000..25e8876a7
--- /dev/null
+++ b/include/unified_index_memory.h
@@ -0,0 +1,49 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT license.
+
+#pragma once
+
+#include <memory>
+
+#include "concurrent_queue.h"
+#include "distance.h"
+#include "filter_match_proxy.h"
+#include "scratch.h"
+#include "unified_index_base.h"
+
+namespace diskann
+{
+
+// Fully in-memory implementation of the unified-format index.
+//
+// load_storage() constructs a unified_node_store_memory<T> and calls its
+// load(), then sizes the per-thread InMemQueryScratch pool. search_impl()
+// runs a Vamana-style greedy traversal, reading coords/neighbors via the
+// inherited _store (downcast to unified_node_store_memory<T>* in the hot path
+// for non-virtual access).
+template <typename T>
+class unified_index_memory final : public unified_index_base<T>
+{
+  public:
+    explicit unified_index_memory(diskann::Metric metric);
+    ~unified_index_memory() override;
+
+  protected:
+    void load_storage(UnifiedIndexReader &r, const UnifiedLoadContext &ctx) override;
+    void search_impl(UnifiedSearchContext &ctx) override;
+    void fill_storage_stats(TableStats &stats) const override;
+
+  private:
+    void init_scratch_pool(uint32_t num_threads, uint32_t search_l);
+    std::pair<uint32_t, uint32_t> iterate_to_fixed_point(InMemQueryScratch<T> *scratch, uint32_t L, const T *query,
+                                                         const std::vector<uint32_t> &init_ids,
+                                                         filter_match_proxy *match_proxy);
+
+    ConcurrentQueue<InMemQueryScratch<T> *> _query_scratch;
+    std::shared_ptr<Distance<T>> _dist_cmp;
+    uint32_t _start = 0;
+    uint32_t _max_observed_degree = 0;
+    std::vector<uint32_t> _medoids; // mirrors unified_index_ssd::_medoids
+};
+
+} // namespace diskann
diff --git a/include/unified_index_ssd.h b/include/unified_index_ssd.h
new file mode 100644
index 000000000..bf41639e1
--- /dev/null
+++ b/include/unified_index_ssd.h
@@ -0,0 +1,66 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT license.
+
+#pragma once
+
+#include <memory>
+#include <string>
+#include <vector>
+
+#include "aligned_file_reader.h"
+#include "concurrent_queue.h"
+#include "distance.h"
+#include "filter_match_proxy.h"
+#include "pq.h"
+#include "scratch.h"
+#include "unified_index_base.h"
+
+namespace diskann
+{
+
+// Disk-resident (SSD) implementation of the unified-format index.
+//
+// load_storage() constructs a unified_node_store_ssd<T> wrapping the supplied
+// AlignedFileReader, calls its load(), and -- when ctx.num_nodes_to_cache > 0
+// -- primes the static cache via _store->cache_bfs_levels(). Then loads PQ
+// pivots/codes (currently via temp-file extraction; direct-from-region read
+// is a follow-up). search_impl() runs the beam-search loop, pulling beam-wide
+// neighborhoods via _store->get_nodes() once per hop.
+template <typename T>
+class unified_index_ssd final : public unified_index_base<T>
+{
+  public:
+    unified_index_ssd(std::shared_ptr<AlignedFileReader> reader, diskann::Metric metric);
+    ~unified_index_ssd() override;
+
+  protected:
+    void load_storage(UnifiedIndexReader &r, const UnifiedLoadContext &ctx) override;
+    void search_impl(UnifiedSearchContext &ctx) override;
+    void fill_storage_stats(TableStats &stats) const override;
+
+  private:
+    void load_pq_from_unified(UnifiedIndexReader &r);
+    void load_medoids_from_unified(UnifiedIndexReader &r);
+    void setup_thread_data(uint64_t nthreads, uint64_t visited_reserve = 4096);
+    void use_medoids_data_as_centroids();
+
+    void cached_beam_search(const T *query, uint64_t K, uint64_t L, uint64_t *indices, float *distances,
+                            uint32_t beam_width, const std::vector<std::string> &filter_label_strings,
+                            uint32_t io_limit, QueryStats *stats, DebugTraversalInfo *debug_info);
+
+    std::shared_ptr<AlignedFileReader> _reader;
+    ConcurrentQueue<SSDThreadData<T> *> _thread_data;
+    uint64_t _max_nthreads = 0;
+    float _max_base_norm = 0.0f;
+
+    FixedChunkPQTable _pq_table;
+    std::vector<uint8_t> _pq_codes;
+    uint64_t _n_chunks = 0;
+
+    std::vector<uint32_t> _medoids;
+    float *_centroid_data = nullptr;
+    std::shared_ptr<Distance<T>> _dist_cmp;
+    std::shared_ptr<Distance<float>> _dist_cmp_float;
+};
+
+} // namespace diskann
diff --git a/include/unified_label_data.h b/include/unified_label_data.h
new file mode 100644
index 000000000..99fd141b2
--- /dev/null
+++ b/include/unified_label_data.h
@@ -0,0 +1,198 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT license.
+
+#pragma once
+
+#include <cstdint>
+#include <memory>
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+#include "filter_match_proxy.h"
+#include "integer_label_vector.h"
+#include "label_bitmask.h"
+#include "unified_index_format.h"
+#include "windows_customizations.h"
+
+namespace diskann
+{
+
+class UnifiedIndexReader;
+
+// ---------------------------------------------------------------------------
+// Abstract base for the label-data trio.
+// Owns the shared, encoding-independent state (label dictionary, universal
+// label, per-label medoids) and exposes the read-only query API.
+// Derived classes own encoding-specific storage and produce encoding-specific
+// match proxies via `make_match_proxy`.
+//
+// All label ints are uint32 on the API surface; the on-disk dictionary entry
+// stores them as uint32 unconditionally (see docs/unified_index_format.md).
+// ---------------------------------------------------------------------------
+class unified_label_data_base
+{
+  public:
+    virtual ~unified_label_data_base() = default;
+
+    // Template method: parse shared dictionary, then dispatch to derived
+    // load_encoding(). Caller has the reader open and validated.
+    void load(UnifiedIndexReader &r, const UnifiedIndexHeader &h, uint64_t npts);
+
+    // --- Shared query API ---
+    bool has_labels() const
+    {
+        return _has_labels;
+    }
+    bool has_universal_label() const
+    {
+        return _use_universal_label;
+    }
+    uint32_t universal_label() const
+    {
+        return _universal_label;
+    }
+    size_t num_labels() const
+    {
+        return _label_map.size();
+    }
+    virtual LabelEncoding encoding() const = 0;
+
+    // Resident bytes of the encoding-specific per-point label storage.
+    virtual uint64_t memory_usage() const
+    {
+        return 0;
+    }
+
+    bool is_valid_label(const std::string &s) const;
+    bool get_converted_label(const std::string &s, uint32_t &out) const;
+
+    // Resolve filter label strings to their internal label ints AND per-label
+    // medoids in a single dictionary probe per string. out_label_ints[i] and
+    // out_medoids[i] both correspond to filter_label_strings[i]; both vectors
+    // are caller-owned, cleared, then filled in lockstep. Throws ANNException
+    // on an unknown label string. The unified format stores exactly one medoid
+    // per label, packed in the same dictionary row as the label int, so the
+    // search path gets the proxy input (label int) and the init-id seed
+    // (medoid) from one map lookup instead of two.
+    void resolve_filters(const std::vector<std::string> &filter_label_strings,
+                                           std::vector<uint32_t> &out_label_ints,
+                                           std::vector<uint32_t> &out_medoids) const;
+
+    // Append every per-label entry-point medoid (the unified format stores
+    // exactly one per label) to `out`. Used to seed SSD cache priming so that
+    // filtered-search entry points -- and their BFS neighborhoods -- get
+    // cached, mirroring the legacy PQFlashIndex::cache_bfs_levels seeding from
+    // _filter_to_medoid_ids. `out` is appended to (not cleared); the caller
+    // typically pre-fills it with the global medoids first.
+    void collect_label_medoids(std::vector<uint32_t> &out) const;
+
+    // Build a search-loop-ready matcher from pre-resolved internal label ints
+    // (see resolve_filters -- the string -> int conversion happens once there
+    // and is shared with init-id seeding). The returned proxy borrows internal
+    // storage of `this` -- lifetime must not exceed `this`. No external scratch
+    // is needed; the concrete proxy owns any per-query scratch it requires.
+    virtual std::unique_ptr<filter_match_proxy> make_match_proxy(
+        const std::vector<uint32_t> &filter_label_ints) = 0;
+
+  protected:
+    // Derived classes load their encoding-specific region(s) after the base
+    // has parsed the shared dictionary.
+    virtual void load_encoding(UnifiedIndexReader &r, const UnifiedIndexHeader &h, uint64_t npts) = 0;
+
+    // Helper: parse the on-disk dictionary bytes into _label_map (string -> {label int, medoid}).
+    void parse_dictionary(const std::vector<uint8_t> &dict_bytes);
+
+    bool _has_labels = false;
+    bool _use_universal_label = false;
+    uint32_t _universal_label = 0;
+
+    // Dictionary row: label string -> {internal label int, per-label medoid}.
+    // Both fields come from the same on-disk dictionary entry (see
+    // parse_dictionary / docs/unified_index_format.md), so packing them lets a
+    // single lookup serve both the match proxy (label int) and init-id seeding
+    // (medoid) at search time -- avoiding a second probe of a separate map.
+    struct label_dict_entry
+    {
+        uint32_t label_int = 0;
+        uint32_t medoid = 0;
+    };
+    std::unordered_map<std::string, label_dict_entry> _label_map;
+};
+
+// Bitmask-encoded label storage. One bitmask row of
+// `_bitmask_buf._bitmask_size` uint64 words per point.
+class unified_label_data_bitmask final : public unified_label_data_base
+{
+  public:
+    LabelEncoding encoding() const override
+    {
+        return LabelEncoding::Bitmask;
+    }
+
+    uint64_t memory_usage() const override
+    {
+        return _bitmask_buf._buf.size() * sizeof(std::uint64_t);
+    }
+
+    std::unique_ptr<filter_match_proxy> make_match_proxy(
+        const std::vector<uint32_t> &filter_label_ints) override;
+
+    simple_bitmask_buf &bitmask_buf()
+    {
+        return _bitmask_buf;
+    }
+    const simple_bitmask_buf &bitmask_buf() const
+    {
+        return _bitmask_buf;
+    }
+
+  protected:
+    void load_encoding(UnifiedIndexReader &r, const UnifiedIndexHeader &h, uint64_t npts) override;
+
+  private:
+    simple_bitmask_buf _bitmask_buf;
+};
+
+// Integer-encoded label storage. Variable-length label list per point with an
+// offset table of size npts+1 into a flat uint32 label array.
+class unified_label_data_integer final : public unified_label_data_base
+{
+  public:
+    LabelEncoding encoding() const override
+    {
+        return LabelEncoding::Integer;
+    }
+
+    uint64_t memory_usage() const override
+    {
+        return _label_vector.get_memory_usage();
+    }
+
+    std::unique_ptr<filter_match_proxy> make_match_proxy(
+        const std::vector<uint32_t> &filter_label_ints) override;
+
+    integer_label_vector &label_vector()
+    {
+        return _label_vector;
+    }
+    const integer_label_vector &label_vector() const
+    {
+        return _label_vector;
+    }
+
+  protected:
+    void load_encoding(UnifiedIndexReader &r, const UnifiedIndexHeader &h, uint64_t npts) override;
+
+  private:
+    integer_label_vector _label_vector;
+};
+
+// Factory: peeks at `h.label_encoding`, constructs the correct derived class,
+// runs load(), and returns it. Returns nullptr when the header carries no
+// labels (HAS_LABELS flag unset or encoding == None).
+std::unique_ptr<unified_label_data_base> make_unified_label_data(UnifiedIndexReader &r,
+                                                                                    const UnifiedIndexHeader &h,
+                                                                                    uint64_t npts);
+
+} // namespace diskann
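
Review note on `resolve_filters`: the comment describes one dictionary probe per filter string that yields both the label int and the per-label medoid. The sketch below illustrates that contract in isolation. All `*_sketch` names are hypothetical stand-ins, and `std::invalid_argument` replaces `ANNException` so the example compiles without the library.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Mirrors the shape of label_dict_entry: label int and medoid share one row.
struct dict_entry_sketch
{
    uint32_t label_int = 0;
    uint32_t medoid = 0;
};

using label_map_sketch = std::unordered_map<std::string, dict_entry_sketch>;

// One map probe per filter string fills both outputs in lockstep.
// Outputs are cleared first, as the resolve_filters comment specifies.
inline void resolve_filters_sketch(const label_map_sketch &label_map,
                                   const std::vector<std::string> &filters,
                                   std::vector<uint32_t> &out_label_ints,
                                   std::vector<uint32_t> &out_medoids)
{
    out_label_ints.clear();
    out_medoids.clear();
    for (const auto &s : filters)
    {
        const auto it = label_map.find(s);
        if (it == label_map.end())
            throw std::invalid_argument("unknown filter label: " + s);
        out_label_ints.push_back(it->second.label_int);
        out_medoids.push_back(it->second.medoid);
    }
}
```

Keeping both fields in one row means the search path pays for a single hash lookup per filter string, which is the trade-off the `label_dict_entry` comment argues for.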
diff --git a/include/unified_node_store.h b/include/unified_node_store.h
new file mode 100644
index 000000000..1c08ae026
--- /dev/null
+++ b/include/unified_node_store.h
@@ -0,0 +1,259 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT license.
+
+#pragma once
+
+#include <cstdint>
+#include <memory>
+#include <vector>
+
+#include "tsl/robin_map.h"
+
+#include "aligned_file_reader.h"
+#include "defaults.h"
+#include "unified_index_format.h"
+#include "windows_customizations.h"
+
+namespace diskann
+{
+
+class UnifiedIndexReader;
+
+// Per-thread scratch passed into get_nodes(). Holds:
+//  - `ctx`: the AlignedFileReader's per-thread IOContext (registered once at
+//    load time -- get_nodes does NOT touch the reader's thread-registration
+//    map, so no mutex on the search hot path).
+//  - A sector slab: either *owned* (allocated via reserve(), used by tests
+//    and load-time helpers) or *borrowed* (set via attach_borrowed(), used
+//    by the index's beam-search to reuse SSDQueryScratch::sector_scratch).
+//
+// Memory store ignores the entire scratch.
+struct NodeFetchScratch
+{
+    NodeFetchScratch();
+    NodeFetchScratch(const NodeFetchScratch &) = delete;
+    NodeFetchScratch &operator=(const NodeFetchScratch &) = delete;
+    NodeFetchScratch(NodeFetchScratch &&other) noexcept;
+    NodeFetchScratch &operator=(NodeFetchScratch &&other) noexcept;
+    ~NodeFetchScratch();
+
+    // Self-owned slab: (re)allocate to hold `max_batch * sectors_per_node`
+    // sectors. No-op if the existing slab is already large enough. Allocates
+    // -- use only at load time or in tests, not in the search hot path.
+    void reserve(uint64_t max_batch, uint32_t sectors_per_node);
+
+    // Borrowed slab: set pointers without allocating. The slab buffer must
+    // outlive the scratch and must be at least `slab_capacity_bytes` big.
+    // This is what the index uses on the search hot path: ctx and slab both
+    // come from SSDThreadData allocated at load time.
+    void attach_borrowed(IOContext &ctx, char *external_slab, uint64_t slab_capacity_bytes);
+
+    // Attach an IOContext to a scratch whose slab is already owned via
+    // reserve(). Lets the same scratch flip from "no ctx" to "ready" without
+    // disturbing the slab. Use in load-time helpers that allocate their own
+    // slab but borrow the ctx from a registered thread.
+    void set_ctx(IOContext &ctx);
+
+    char *slab() const
+    {
+        return _sector_slab;
+    }
+    uint64_t slab_capacity() const
+    {
+        return _capacity_bytes;
+    }
+    IOContext *io_ctx() const
+    {
+        return _ctx;
+    }
+
+    std::vector<AlignedRead> requests;
+
+  private:
+    char *_sector_slab = nullptr;
+    uint64_t _capacity_bytes = 0;
+    bool _owns_slab = false;       // true => destructor aligned_free's _sector_slab
+    IOContext *_ctx = nullptr;     // not owned; lifetime tied to the reader's per-thread map
+};
+
+// View into one node. Lifetime depends on the store:
+//  - memory store returns pointers into its resident `_packed` blob;
+//  - SSD store returns pointers into the supplied scratch's sector slab
+//    (or into the static cache buffers on a cache hit).
+template <typename T>
+struct NodeView
+{
+    const T *coords = nullptr;
+    const uint32_t *neighbors = nullptr;
+    uint32_t degree = 0;
+};
+
+// ---------------------------------------------------------------------------
+// unified_node_store_base<T>
+// Abstract base. Owns header copy, offset table, cached max_node_len.
+// Per-node wire layout is [coords (aligned_dim*sizeof(T) bytes),
+//                          neighbors (degree*sizeof(uint32_t) bytes)].
+// Degree is recovered from the offset delta -- there is no per-node degree
+// field in the wire format.
+// ---------------------------------------------------------------------------
+template <typename T>
+class unified_node_store_base
+{
+  public:
+    virtual ~unified_node_store_base() = default;
+
+    // --- Geometry ---
+    uint64_t num_points() const
+    {
+        return _header.npts;
+    }
+    uint64_t dim() const
+    {
+        return _header.dim;
+    }
+    uint64_t aligned_dim() const
+    {
+        return _header.aligned_dim;
+    }
+    uint32_t max_degree() const
+    {
+        return _header.max_degree;
+    }
+    uint64_t graph_region_base() const
+    {
+        return _header.graph_region_off;
+    }
+
+    // --- Offset math (valid after init_geometry) ---
+    uint64_t node_byte_offset(uint64_t id) const
+    {
+        return _offsets[id];
+    }
+    uint64_t node_byte_length(uint64_t id) const
+    {
+        return _offsets[id + 1] - _offsets[id];
+    }
+    // Absolute byte offset of node `id`'s payload in the unified file.
+    // Convenience: same as `graph_region_base() + node_byte_offset(id)`.
+    uint64_t node_disk_offset(uint64_t id) const
+    {
+        return graph_region_base() + _offsets[id];
+    }
+    uint32_t degree(uint64_t id) const;
+    uint32_t num_sectors_per_node() const;
+    uint64_t max_node_len() const
+    {
+        return _max_node_len;
+    }
+    // aligned_dim * sizeof(T) -- cached in init_geometry().
+    uint64_t coord_bytes() const
+    {
+        return _coord_bytes;
+    }
+
+    // --- Single virtual API for node access ---
+    // Resolve `ids` into `out` (one NodeView per id, same order).
+    virtual void get_nodes(const std::vector<uint64_t> &ids, NodeFetchScratch &scratch,
+                           std::vector<NodeView<T>> &out) = 0;
+
+  protected:
+    // Subclasses call this from their `load` after parsing header + offset table.
+    void init_geometry(const UnifiedIndexHeader &h, std::vector<uint64_t> offset_table);
+
+    UnifiedIndexHeader _header{}; // own copy
+    std::vector<uint64_t> _offsets;
+    uint64_t _max_node_len = 0;
+    uint64_t _coord_bytes = 0;
+};
+
+// ---------------------------------------------------------------------------
+// unified_node_store_memory<T>
+// Fully-resident. Loads the graph region into _packed during load().
+// ---------------------------------------------------------------------------
+template <typename T>
+class unified_node_store_memory final : public unified_node_store_base<T>
+{
+  public:
+    void load(UnifiedIndexReader &r, const UnifiedIndexHeader &h);
+
+    void get_nodes(const std::vector<uint64_t> &ids, NodeFetchScratch &scratch,
+                                      std::vector<NodeView<T>> &out) override;
+
+    // Non-virtual fast path for unified_index_memory<T>::iterate_to_fixed_point.
+    const T *get_coords(uint64_t id) const;
+    const uint32_t *get_neighbors(uint64_t id, uint32_t &out_degree) const;
+
+    // Total resident bytes of the graph region ([coords, neighbors] for all
+    // nodes), pulled fully into memory by load().
+    uint64_t resident_bytes() const
+    {
+        return _packed.size();
+    }
+
+  private:
+    std::vector<uint8_t> _packed;
+};
+
+// ---------------------------------------------------------------------------
+// unified_node_store_ssd<T>
+// AlignedFileReader-backed. Owns the static _nhood_cache / _coord_cache.
+// ---------------------------------------------------------------------------
+template <typename T>
+class unified_node_store_ssd final : public unified_node_store_base<T>
+{
+  public:
+    explicit unified_node_store_ssd(std::shared_ptr<AlignedFileReader> reader) : _reader(std::move(reader))
+    {
+    }
+    ~unified_node_store_ssd() override;
+
+    void load(UnifiedIndexReader &r, const UnifiedIndexHeader &h);
+
+    void get_nodes(const std::vector<uint64_t> &ids, NodeFetchScratch &scratch,
+                                      std::vector<NodeView<T>> &out) override;
+
+    // Internal helpers (used by unified_index_ssd::load_storage when the user
+    // requests cache priming via UnifiedLoadContext::num_nodes_to_cache).
+    // Pin `node_list` (read once, kept resident). Caller supplies a
+    // pre-attached NodeFetchScratch (slab + IOContext) -- typically borrowed
+    // from an SSDThreadData via attach_borrowed(), or from a self-owned
+    // build via make_fetch_scratch().
+    void load_cache_list(const std::vector<uint32_t> &node_list, NodeFetchScratch &scratch);
+
+    // BFS-based cache primer. Caller supplies the seed nodes (typically the
+    // unified file's medoids; the store doesn't own medoid data). Walks the
+    // graph from each seed in breadth-first order, collects up to
+    // num_nodes_to_cache unique ids into `out_node_list`, then calls
+    // load_cache_list(out_node_list, scratch).
+    void cache_bfs_levels(const std::vector<uint32_t> &seed_nodes, uint64_t num_nodes_to_cache,
+                                             std::vector<uint32_t> &out_node_list, NodeFetchScratch &scratch);
+
+    // Convenience: build a NodeFetchScratch sized for `max_batch` nodes,
+    // register the calling thread with the AlignedFileReader (idempotent;
+    // safe to call from already-registered threads), and attach the resulting
+    // IOContext. Used by tests and any standalone caller. Allocates an owned
+    // slab -- not for the hot path. The hot path attaches an existing
+    // SSDThreadData via NodeFetchScratch::attach_borrowed().
+    NodeFetchScratch make_fetch_scratch(uint64_t max_batch);
+
+    // Test/observability counter: number of AlignedRead requests this store
+    // has issued. Cheap (uint64 increment per get_nodes call), always compiled.
+    uint64_t io_count() const
+    {
+        return _io_count;
+    }
+
+  private:
+    std::shared_ptr<AlignedFileReader> _reader;
+
+    // Static caches.
+    tsl::robin_map<uint32_t, std::pair<uint32_t, uint32_t *>> _nhood_cache;
+    uint32_t *_nhood_cache_buf = nullptr;
+    tsl::robin_map<uint32_t, T *> _coord_cache;
+    T *_coord_cache_buf = nullptr;
+
+    // Always-compiled IO counter (cheap; one uint64 per get_nodes batch).
+    uint64_t _io_count = 0;
+};
+
+} // namespace diskann
diff --git a/include/windows_customizations.h b/include/windows_customizations.h
index e6c58466a..c2dacc497 100644
--- a/include/windows_customizations.h
+++ b/include/windows_customizations.h
@@ -5,7 +5,12 @@
 
 #ifdef _WINDOWS
 
-#ifdef _WINDLL
+#if defined(DISKANN_STATIC_LIB)
+// Static-library build/consumer (e.g. the unit tests): the internal symbols
+// are compiled straight into the linking target, so no dllimport/dllexport
+// decoration is needed. Checked first so it wins over _WINDLL.
+#define DISKANN_DLLEXPORT
+#elif defined(_WINDLL)
 #define DISKANN_DLLEXPORT __declspec(dllexport)
 #else
 #define DISKANN_DLLEXPORT __declspec(dllimport)
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index 2f70194d4..3d09fbeaa 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -6,6 +6,37 @@ set(CMAKE_COMPILE_WARNING_AS_ERROR ON)
 
 if(MSVC)
     add_subdirectory(dll)
+
+    # Static-library variant of the DiskANN core, built for the unit tests so
+    # they can link the internal symbols directly instead of going through the
+    # DLL's export table. Compiled with DISKANN_STATIC_LIB (PUBLIC, so consumers
+    # see it too) which makes DISKANN_DLLEXPORT a no-op -- see
+    # include/windows_customizations.h. This lets internal-only types drop their
+    # DISKANN_DLLEXPORT annotations without breaking the tests.
+    set(DISKANN_STATIC_SOURCES
+        abstract_data_store.cpp abstract_index.cpp ann_exception.cpp
+        color_helper.cpp color_info.cpp disk_utils.cpp distance.cpp
+        filter_match_proxy.cpp filter_utils.cpp
+        in_mem_data_store.cpp in_mem_graph_reformat_store.cpp in_mem_graph_store.cpp
+        in_mem_reorder_data_store.cpp in_mem_static_graph_reformat.cpp in_mem_static_graph_store.cpp
+        index.cpp index_factory.cpp integer_label_vector.cpp
+        label_bitmask.cpp label_helper.cpp logger.cpp
+        math_utils.cpp memory_mapper.cpp natural_number_map.cpp natural_number_set.cpp
+        neighbor_list.cpp partition.cpp pq.cpp pq_data_store.cpp pq_flash_index.cpp
+        pq_l2_distance.cpp scratch.cpp unified_index.cpp unified_index_builder.cpp unified_index_io.cpp
+        unified_label_data.cpp unified_node_store.cpp
+        unified_index_base.cpp unified_index_memory.cpp unified_index_ssd.cpp
+        utils.cpp windows_aligned_file_reader.cpp)
+
+    add_library(${PROJECT_NAME}_s STATIC ${DISKANN_STATIC_SOURCES})
+    target_compile_definitions(${PROJECT_NAME}_s PUBLIC DISKANN_STATIC_LIB)
+    # index.cpp exceeds the COFF section limit without /GL (which the DLL uses);
+    # /bigobj lifts it for the static build.
+    target_compile_options(${PROJECT_NAME}_s PRIVATE /bigobj)
+    target_include_directories(${PROJECT_NAME}_s PRIVATE ${DISKANN_MKL_INCLUDE_DIRECTORIES})
+    # MKL + synchronization.lib are PRIVATE to the DLL; the static lib exposes
+    # them PUBLICly so anything linking diskann_s (the tests) inherits them.
+    target_link_libraries(${PROJECT_NAME}_s PUBLIC ${DISKANN_MKL_LINK_LIBRARIES} synchronization.lib)
 else()
     #file(GLOB CPP_SOURCES *.cpp)
     set(CPP_SOURCES abstract_data_store.cpp abstract_index.cpp ann_exception.cpp
@@ -17,7 +48,9 @@ else()
         label_bitmask.cpp label_helper.cpp linux_aligned_file_reader.cpp logger.cpp
         math_utils.cpp memory_mapper.cpp natural_number_map.cpp natural_number_set.cpp
         neighbor_list.cpp partition.cpp pq.cpp pq_data_store.cpp pq_flash_index.cpp
-        pq_l2_distance.cpp scratch.cpp utils.cpp)
+        pq_l2_distance.cpp scratch.cpp unified_index.cpp unified_index_builder.cpp unified_index_io.cpp
+        unified_label_data.cpp unified_node_store.cpp
+        unified_index_base.cpp unified_index_memory.cpp unified_index_ssd.cpp utils.cpp)
     if (RESTAPI)
         list(APPEND CPP_SOURCES restapi/search_wrapper.cpp restapi/server.cpp)
     endif()
diff --git a/src/disk_utils.cpp b/src/disk_utils.cpp
index fa7d90568..40b0a7c13 100644
--- a/src/disk_utils.cpp
+++ b/src/disk_utils.cpp
@@ -18,6 +18,10 @@
 #include "pq_flash_index.h"
 #include "timer.h"
 #include "tsl/robin_set.h"
+#include "unified_index_io.h"
+#include "label_helper.h"
+#include "label_bitmask.h"
+#include "integer_label_vector.h"
 
 namespace diskann
 {
@@ -1447,6 +1451,7 @@ int build_disk_index(const char *dataFilePath, const char *indexFilePath, const
     return 0;
 }
 
+
 template DISKANN_DLLEXPORT void create_disk_layout<int8_t>(const std::string base_file,
                                                            const std::string mem_index_file,
                                                            const std::string output_file,
@@ -1455,7 +1460,8 @@ template DISKANN_DLLEXPORT void create_disk_layout<uint8_t>(const std::string ba
                                                             const std::string mem_index_file,
                                                             const std::string output_file,
                                                             const std::string reorder_data_file);
-template DISKANN_DLLEXPORT void create_disk_layout<float>(const std::string base_file, const std::string mem_index_file,
+template DISKANN_DLLEXPORT void create_disk_layout<float>(const std::string base_file,
+                                                          const std::string mem_index_file,
                                                           const std::string output_file,
                                                           const std::string reorder_data_file);
 
diff --git a/src/dll/CMakeLists.txt b/src/dll/CMakeLists.txt
index 4b23b41d4..92d643fae 100644
--- a/src/dll/CMakeLists.txt
+++ b/src/dll/CMakeLists.txt
@@ -11,7 +11,10 @@ add_library(${PROJECT_NAME} SHARED dllmain.cpp
     ../label_bitmask.cpp ../label_helper.cpp ../logger.cpp
     ../math_utils.cpp ../memory_mapper.cpp ../natural_number_map.cpp ../natural_number_set.cpp
     ../neighbor_list.cpp ../partition.cpp ../pq.cpp ../pq_data_store.cpp ../pq_flash_index.cpp
-    ../pq_l2_distance.cpp ../scratch.cpp ../utils.cpp ../windows_aligned_file_reader.cpp)
+    ../pq_l2_distance.cpp ../scratch.cpp ../unified_index.cpp ../unified_index_builder.cpp ../unified_index_io.cpp
+    ../unified_label_data.cpp ../unified_node_store.cpp
+    ../unified_index_base.cpp ../unified_index_memory.cpp ../unified_index_ssd.cpp
+    ../utils.cpp ../windows_aligned_file_reader.cpp)
 
 set(TARGET_DIR "$<$<CONFIG:Debug>:${CMAKE_LIBRARY_OUTPUT_DIRECTORY_DEBUG}>$<$<CONFIG:Release>:${CMAKE_LIBRARY_OUTPUT_DIRECTORY_RELEASE}>")
 
diff --git a/src/filter_match_proxy.cpp b/src/filter_match_proxy.cpp
index 4ba606a50..2b613e260 100644
--- a/src/filter_match_proxy.cpp
+++ b/src/filter_match_proxy.cpp
@@ -15,8 +15,8 @@ bitmask_filter_match<LabelT>::bitmask_filter_match(
     // _bitmask_size == 0 means no filter is set
     if (_bitmask_filters._bitmask_size > 0)
     {
-        query_bitmask_buf.resize(_bitmask_filters._bitmask_size, 0);
-        _bitmask_full_val._mask = query_bitmask_buf.data();
+        _query_bitmask_buf.resize(_bitmask_filters._bitmask_size, 0);
+        _bitmask_full_val._mask = _query_bitmask_buf.data();
 
         for (const auto& filter_label : filter_labels)
         {
@@ -30,6 +30,30 @@ bitmask_filter_match<LabelT>::bitmask_filter_match(
     }
 }
 
+template <typename LabelT>
+bitmask_filter_match<LabelT>::bitmask_filter_match(
+    simple_bitmask_buf& bitmask_filters,
+    const std::vector<LabelT>& filter_labels,
+    LabelT unv_label)
+    : _bitmask_filters(bitmask_filters),
+      _query_bitmask_buf(_owned_query_bitmask_buf)
+{
+    if (_bitmask_filters._bitmask_size > 0)
+    {
+        _query_bitmask_buf.resize(_bitmask_filters._bitmask_size, 0);
+        _bitmask_full_val._mask = _query_bitmask_buf.data();
+
+        for (const auto& filter_label : filter_labels)
+        {
+            auto bitmask_val = simple_bitmask::get_bitmask_val(filter_label);
+            _bitmask_full_val.merge_bitmask_val(bitmask_val);
+        }
+
+        auto bitmask_val = simple_bitmask::get_bitmask_val(unv_label);
+        _bitmask_full_val.merge_bitmask_val(bitmask_val);
+    }
+}
+
 template <typename LabelT>
 bool bitmask_filter_match<LabelT>::contain_filtered_label(uint32_t id)
 {
diff --git a/src/index.cpp b/src/index.cpp
index 5261adb5e..6e6811aa3 100644
--- a/src/index.cpp
+++ b/src/index.cpp
@@ -18,6 +18,7 @@
 #include "color_helper.h"
 #include "filter_match_proxy.h"
 #include "in_mem_reorder_data_store.h"
+#include "unified_index_io.h"
 
 #if defined(DISKANN_RELEASE_UNUSED_TCMALLOC_MEMORY_AT_CHECKPOINTS) && defined(DISKANN_BUILD)
 #include "gperftools/malloc_extension.h"
@@ -438,6 +439,134 @@ void Index<T, TagT, LabelT>::save(const char *filename, bool compact_before_save
     diskann::cout << "Time taken for save: " << timer.elapsed() / 1000000.0 << "s." << std::endl;
 }
 
+template <typename T, typename TagT, typename LabelT>
+void Index<T, TagT, LabelT>::save_unified(const char *filename)
+{
+    static const std::vector<uint8_t> kEmpty;
+    save_unified(filename, kEmpty, kEmpty);
+}
+
+template <typename T, typename TagT, typename LabelT>
+void Index<T, TagT, LabelT>::save_unified(const char *filename, const std::vector<uint8_t> &pq_pivots_bytes,
+                                          const std::vector<uint8_t> &pq_codes_bytes)
+{
+    diskann::Timer timer;
+
+    std::unique_lock<std::shared_timed_mutex> ul(_update_lock);
+    std::unique_lock<std::shared_timed_mutex> cl(_consolidate_lock);
+    std::unique_lock<std::shared_timed_mutex> tl(_tag_lock);
+    std::unique_lock<std::shared_timed_mutex> dl(_delete_lock);
+
+    if (_dynamic_index || _delete_set->size() > 0 || _enable_tags)
+    {
+        throw ANNException("save_unified does not support dynamic/tagged/deletion indices in v1", -1,
+                           __FUNCSIG__, __FILE__, __LINE__);
+    }
+
+    if (!_data_compacted)
+    {
+        compact_data();
+    }
+
+    const uint64_t npts = static_cast<uint64_t>(_nd);
+    const uint64_t dim = static_cast<uint64_t>(_dim);
+    const uint64_t aligned_dim = static_cast<uint64_t>(_data_store->get_aligned_dim());
+    const uint32_t max_degree = _graph_store->get_max_observed_degree();
+
+    UnifiedIndexWriter writer(filename);
+    MetricTag metric_tag = MetricTag::L2;
+    if (_dist_metric == diskann::Metric::INNER_PRODUCT)
+        metric_tag = MetricTag::InnerProduct;
+    else if (_dist_metric == diskann::Metric::COSINE)
+        metric_tag = MetricTag::Cosine;
+
+    writer.begin(npts, dim, aligned_dim, max_degree, data_type_tag_of<T>(), metric_tag,
+                 static_cast<uint64_t>(_start));
+
+    writer.begin_graph_region();
+    std::vector<T> vec(dim);
+    for (uint32_t i = 0; i < npts; ++i)
+    {
+        _data_store->get_vector(i, vec.data());
+        const NeighborList nbrs = _graph_store->get_neighbours(i);
+        writer.write_node(vec.data(), nbrs.data(), static_cast<uint32_t>(nbrs.size()));
+    }
+    writer.end_graph_region();
+
+    if (_filtered_index && !_label_to_start_id.empty())
+    {
+        std::vector<uint32_t> medoid_list;
+        medoid_list.reserve(_label_to_start_id.size());
+        for (const auto &kv : _label_to_start_id)
+            medoid_list.push_back(kv.second);
+        writer.write_medoids(medoid_list.data(), medoid_list.size());
+    }
+    else
+    {
+        const uint32_t single_medoid = _start;
+        writer.write_medoids(&single_medoid, 1);
+    }
+
+    // Optional PQ region. The caller must supply both pivots and codes; if either
+    // buffer is empty the PQ write is skipped entirely (no HAS_PQ flag).
+    if (!pq_pivots_bytes.empty() && !pq_codes_bytes.empty())
+    {
+        writer.write_pq(pq_pivots_bytes.data(), pq_pivots_bytes.size(), pq_codes_bytes.data(),
+                        pq_codes_bytes.size());
+    }
+
+    if (_filtered_index)
+    {
+        std::vector<uint8_t> dict_bytes;
+        {
+            std::vector<std::pair<LabelT, uint32_t>> label_to_medoid_list(_label_to_start_id.begin(),
+                                                                          _label_to_start_id.end());
+            std::unordered_map<LabelT, std::string> int_to_str;
+            for (const auto &kv : _label_map)
+                int_to_str.emplace(kv.second, kv.first);
+            for (const auto &lm : label_to_medoid_list)
+            {
+                const auto it = int_to_str.find(lm.first);
+                const std::string &s = (it != int_to_str.end()) ? it->second : std::string();
+                const uint32_t label_int = static_cast<uint32_t>(lm.first);
+                const uint32_t slen = static_cast<uint32_t>(s.size());
+                const size_t old = dict_bytes.size();
+                dict_bytes.resize(old + sizeof(uint32_t) + slen + sizeof(uint32_t) + sizeof(uint32_t));
+                uint8_t *p = dict_bytes.data() + old;
+                std::memcpy(p, &slen, sizeof(uint32_t));
+                p += sizeof(uint32_t);
+                std::memcpy(p, s.data(), slen);
+                p += slen;
+                std::memcpy(p, &label_int, sizeof(uint32_t));
+                p += sizeof(uint32_t);
+                std::memcpy(p, &lm.second, sizeof(uint32_t));
+            }
+        }
+
+        const uint64_t universal = _use_universal_label ? static_cast<uint64_t>(_universal_label) : 0;
+        const uint64_t total_labels = static_cast<uint64_t>(_label_map.size());
+
+        if (_use_integer_labels)
+        {
+            const auto &offsets = _label_vector.get_offset_vector();
+            const auto &data = _label_vector.get_data_vector();
+            std::vector<uint64_t> off_u64(offsets.begin(), offsets.end());
+            writer.write_labels_integer(total_labels, universal, dict_bytes.data(), dict_bytes.size(),
+                                        data.data(), data.size() * sizeof(uint32_t), off_u64.data());
+        }
+        else if (_bitmask_buf._buf.size() > 0)
+        {
+            const uint64_t bitmap_bytes = _bitmask_buf._buf.size() * sizeof(uint64_t);
+            writer.write_labels_bitmask(total_labels, universal, dict_bytes.data(), dict_bytes.size(),
+                                        _bitmask_buf._buf.data(), bitmap_bytes);
+        }
+    }
+
+    writer.finalize();
+
+    diskann::cout << "Time taken for save_unified: " << timer.elapsed() / 1000000.0 << "s." << std::endl;
+}
+
 #ifdef EXEC_ENV_OLS
 template <typename T, typename TagT, typename LabelT>
 size_t Index<T, TagT, LabelT>::load_tags(AlignedFileReader &reader)
@@ -2133,6 +2262,12 @@ void Index<T, TagT, LabelT>::build(const std::string &data_file, const size_t nu
         std::string mem_labels_int_map_file = filter_params.save_path_prefix + "_labels_map.txt";
         convert_labels_string_to_int(filter_params.label_file, labels_file_to_use, mem_labels_int_map_file,
                                      filter_params.universal_label, unv_label_as_num);
+        // Populate the in-memory string->int label map. convert_labels_string_to_int
+        // only writes it to disk; without this _label_map stays empty until a
+        // load(), so a save_unified() called after build() (e.g. from
+        // unified_index_builder) would emit an empty label dictionary /
+        // total_labels == 0 and produce an unloadable filtered unified file.
+        _label_map = load_label_map(mem_labels_int_map_file);
         if (filter_params.universal_label != "")
         {
             if (unv_label_as_num != 0)
diff --git a/src/integer_label_vector.cpp b/src/integer_label_vector.cpp
index c467050f1..ad8b5f23a 100644
--- a/src/integer_label_vector.cpp
+++ b/src/integer_label_vector.cpp
@@ -45,6 +45,30 @@ bool integer_label_vector::initialize_from_file(const std::string& label_file, s
     return true;
 }
 
+bool integer_label_vector::initialize_from_buffers(const size_t *offsets, size_t num_points,
+                                                   const uint32_t *labels, size_t total_labels)
+{
+    _offset.assign(offsets, offsets + num_points + 1);
+    _data.assign(labels, labels + total_labels);
+    return true;
+}
+
+void integer_label_vector::resize_for_load(size_t num_points, size_t total_labels)
+{
+    _offset.resize(num_points + 1);
+    _data.resize(total_labels);
+}
+
+size_t *integer_label_vector::mutable_offset_data()
+{
+    return _offset.data();
+}
+
+uint32_t *integer_label_vector::mutable_label_data()
+{
+    return _data.data();
+}
+
 template <typename LabelT>
 bool integer_label_vector::add_labels(uint32_t point_id, std::vector<LabelT> &labels) {
     if (point_id >= _offset.size() - 1)
diff --git a/src/pq.cpp b/src/pq.cpp
index d2b545c79..ffe0c6d84 100644
--- a/src/pq.cpp
+++ b/src/pq.cpp
@@ -168,6 +168,115 @@ uint32_t FixedChunkPQTable::get_num_chunks()
     return static_cast<uint32_t>(n_chunks);
 }
 
+void FixedChunkPQTable::load_pq_centroid_bin_from_memory(const uint8_t *blob, size_t blob_len, size_t num_chunks)
+{
+    // The pq_pivots.bin format is a "bin-with-offsets" container:
+    //   Outer bin at offset 0:        [int32 nr][int32 nc][size_t offsets[nr]]   (nr = 4 or 5)
+    //   Sub-bin at offsets[0] (pivots):     [int32 256][int32 dim][float[256*dim]]
+    //   Sub-bin at offsets[1] (centroid):   [int32 dim][int32 1  ][float[dim]]
+    //   (nr==5 only) offsets[2] is an old-format per-chunk dims sub-bin (ignored).
+    //   Sub-bin at offsets[chunk_offsets_index]: [int32 n_chunks+1][int32 1][uint32_t[n_chunks+1]]
+    //   chunk_offsets_index = 2 (new) or 3 (old, when nr==5).
+    //
+    // This mirrors the disk loader's parsing (FixedChunkPQTable::load_pq_centroid_bin
+    // above) but reads straight from `blob` with no IO. OPQ rotation matrix is
+    // NOT supported -- unified-format PQ is always standard PQ.
+
+    auto read_sub_bin_header = [&](size_t off, size_t &out_nr, size_t &out_nc, size_t &out_payload_off) {
+        if (off > blob_len || blob_len - off < 2 * sizeof(int32_t))
+            throw diskann::ANNException("PQ blob: truncated sub-bin header", -1, __FUNCSIG__, __FILE__, __LINE__);
+        int32_t nr_i32 = 0, nc_i32 = 0;
+        std::memcpy(&nr_i32, blob + off, sizeof(int32_t));
+        std::memcpy(&nc_i32, blob + off + sizeof(int32_t), sizeof(int32_t));
+        out_nr = static_cast<size_t>(nr_i32);
+        out_nc = static_cast<size_t>(nc_i32);
+        out_payload_off = off + 2 * sizeof(int32_t);
+    };
+
+    // --- Outer bin: size_t offset table. ---
+    size_t nr = 0, nc = 0, payload_off = 0;
+    read_sub_bin_header(/*off=*/0, nr, nc, payload_off);
+
+    if ((nr != 4 && nr != 5) || nc != 1)
+    {
+        throw diskann::ANNException("PQ blob: outer offsets have unexpected shape " + std::to_string(nr) + "x" +
+                                        std::to_string(nc) + " (expecting 4x1 or 5x1)",
+                                    -1, __FUNCSIG__, __FILE__, __LINE__);
+    }
+    const size_t outer_bytes = nr * nc * sizeof(size_t);
+    if (payload_off + outer_bytes > blob_len)
+        throw diskann::ANNException("PQ blob: truncated outer offsets", -1, __FUNCSIG__, __FILE__, __LINE__);
+
+    std::vector<size_t> file_offset_data(nr);
+    std::memcpy(file_offset_data.data(), blob + payload_off, outer_bytes);
+
+    const bool use_old_filetype = (nr == 5);
+
+    // --- Pivot table at offsets[0]. ---
+    read_sub_bin_header(file_offset_data[0], nr, nc, payload_off);
+    if (nr != NUM_PQ_CENTROIDS)
+    {
+        throw diskann::ANNException("PQ blob: pivots row count = " + std::to_string(nr) + " (expecting " +
+                                        std::to_string(NUM_PQ_CENTROIDS) + ")",
+                                    -1, __FUNCSIG__, __FILE__, __LINE__);
+    }
+    this->ndims = nc;
+    const size_t pivots_bytes = nr * nc * sizeof(float);
+    if (payload_off + pivots_bytes > blob_len)
+        throw diskann::ANNException("PQ blob: truncated pivot table", -1, __FUNCSIG__, __FILE__, __LINE__);
+    if (tables != nullptr)
+        delete[] tables;
+    tables = new float[nr * nc];
+    std::memcpy(tables, blob + payload_off, pivots_bytes);
+
+    // --- Centroid at offsets[1]. ---
+    read_sub_bin_header(file_offset_data[1], nr, nc, payload_off);
+    if (nr != this->ndims || nc != 1)
+    {
+        throw diskann::ANNException("PQ blob: centroid shape mismatch", -1, __FUNCSIG__, __FILE__, __LINE__);
+    }
+    const size_t centroid_bytes = nr * nc * sizeof(float);
+    if (payload_off + centroid_bytes > blob_len)
+        throw diskann::ANNException("PQ blob: truncated centroid", -1, __FUNCSIG__, __FILE__, __LINE__);
+    if (centroid != nullptr)
+        delete[] centroid;
+    centroid = new float[nr * nc];
+    std::memcpy(centroid, blob + payload_off, centroid_bytes);
+
+    // --- Chunk offsets at offsets[2] (new) or [3] (old-filetype). ---
+    const int chunk_offsets_index = use_old_filetype ? 3 : 2;
+    read_sub_bin_header(file_offset_data[chunk_offsets_index], nr, nc, payload_off);
+    if (nc != 1 || (nr != num_chunks + 1 && num_chunks != 0))
+    {
+        throw diskann::ANNException("PQ blob: chunk-offsets shape mismatch (nr=" + std::to_string(nr) +
+                                        ", nc=" + std::to_string(nc) + ")",
+                                    -1, __FUNCSIG__, __FILE__, __LINE__);
+    }
+    const size_t chunk_bytes = nr * nc * sizeof(uint32_t);
+    if (payload_off + chunk_bytes > blob_len)
+        throw diskann::ANNException("PQ blob: truncated chunk offsets", -1, __FUNCSIG__, __FILE__, __LINE__);
+    if (chunk_offsets != nullptr)
+        delete[] chunk_offsets;
+    chunk_offsets = new uint32_t[nr * nc];
+    std::memcpy(chunk_offsets, blob + payload_off, chunk_bytes);
+
+    this->n_chunks = nr - 1;
+    diskann::cout << "Loaded PQ Pivots from memory: #ctrs: " << NUM_PQ_CENTROIDS << ", #dims: " << this->ndims
+                  << ", #chunks: " << this->n_chunks << std::endl;
+
+    // Compute the transpose used by the distance hot path.
+    if (tables_tr != nullptr)
+        delete[] tables_tr;
+    tables_tr = new float[256 * this->ndims];
+    for (size_t i = 0; i < 256; i++)
+    {
+        for (size_t j = 0; j < this->ndims; j++)
+        {
+            tables_tr[j * 256 + i] = tables[i * this->ndims + j];
+        }
+    }
+}
+
 void FixedChunkPQTable::preprocess_query(float *query_vec)
 {
     for (uint32_t d = 0; d < ndims; d++)
diff --git a/src/pq_flash_index.cpp b/src/pq_flash_index.cpp
index 8d8870e57..848f47368 100644
--- a/src/pq_flash_index.cpp
+++ b/src/pq_flash_index.cpp
@@ -11,6 +11,7 @@
 #include "cosine_similarity.h"
 #include "color_helper.h"
 #include "filter_match_proxy.h"
+#include "unified_index_io.h"
 #include <limits>
 #include <filesystem>
 
@@ -142,7 +143,9 @@ std::vector<bool> PQFlashIndex<T, LabelT>::read_nodes(const std::vector<uint32_t
     std::vector<bool> retval(node_ids.size(), true);
 
     char *buf = nullptr;
-    auto num_sectors = _nnodes_per_sector > 0 ? 1 : DIV_ROUND_UP(_max_node_len, defaults::SECTOR_LEN);
+    auto num_sectors = _nnodes_per_sector > 0
+                           ? 1
+                           : DIV_ROUND_UP(_max_node_len, defaults::SECTOR_LEN);
     alloc_aligned((void **)&buf, node_ids.size() * num_sectors * defaults::SECTOR_LEN, defaults::SECTOR_LEN);
 
     // create read requests
@@ -186,9 +189,12 @@ std::vector<bool> PQFlashIndex<T, LabelT>::read_nodes(const std::vector<uint32_t
         if (nbr_buffers[i].second != nullptr)
         {
             uint32_t *node_nhood = offset_to_node_nhood(node_buf);
-            auto num_nbrs = *node_nhood;
+            uint32_t num_nbrs;
+            uint32_t *nbrs_src;
+            num_nbrs = *node_nhood;
+            nbrs_src = node_nhood + 1;
             nbr_buffers[i].first = num_nbrs;
-            memcpy(nbr_buffers[i].second, node_nhood + 1, num_nbrs * sizeof(uint32_t));
+            memcpy(nbr_buffers[i].second, nbrs_src, num_nbrs * sizeof(uint32_t));
         }
     }
 
@@ -1279,8 +1285,11 @@ void PQFlashIndex<T, LabelT>::cached_beam_search(const T *query1, const uint64_t
     // sector scratch
     char *sector_scratch = query_scratch->sector_scratch;
     uint64_t &sector_scratch_idx = query_scratch->sector_idx;
+    // Sector-packed count. Unified-format nodes are not sector-padded and can straddle
+    // one sector more than DIV_ROUND_UP(node_len, SECTOR_LEN); this count does not cover that.
     const uint64_t num_sectors_per_node =
-        _nnodes_per_sector > 0 ? 1 : DIV_ROUND_UP(_max_node_len, defaults::SECTOR_LEN);
+        _nnodes_per_sector > 0 ? 1
+                               : DIV_ROUND_UP(_max_node_len, defaults::SECTOR_LEN);
 
     // query <-> PQ chunk centers distances
     _pq_table.preprocess_query(query_rotated); // center the query and rotate if
@@ -1427,8 +1436,9 @@ void PQFlashIndex<T, LabelT>::cached_beam_search(const T *query1, const uint64_t
                 fnhood.second = sector_scratch + num_sectors_per_node * sector_scratch_idx * defaults::SECTOR_LEN;
                 sector_scratch_idx++;
                 frontier_nhoods.push_back(fnhood);
-                frontier_read_reqs.emplace_back(get_node_sector((size_t)id) * defaults::SECTOR_LEN,
-                                                num_sectors_per_node * defaults::SECTOR_LEN, fnhood.second);
+                uint64_t read_offset = get_node_sector((size_t)id) * defaults::SECTOR_LEN;
+                uint64_t read_length = num_sectors_per_node * defaults::SECTOR_LEN;
+                frontier_read_reqs.emplace_back(read_offset, read_length, fnhood.second);
                 if (stats != nullptr)
                 {
                     stats->n_4k++;
@@ -1526,7 +1536,10 @@ void PQFlashIndex<T, LabelT>::cached_beam_search(const T *query1, const uint64_t
 #endif
             char *node_disk_buf = offset_to_node(frontier_nhood.second, frontier_nhood.first);
             uint32_t *node_buf = offset_to_node_nhood(node_disk_buf);
-            uint64_t nnbrs = (uint64_t)(*node_buf);
+            uint64_t nnbrs;
+            uint32_t *node_nbrs;
+            nnbrs = (uint64_t)(*node_buf);
+            node_nbrs = (node_buf + 1);
             T *node_fp_coords = offset_to_node_coords(node_disk_buf);
             memcpy(data_buf, node_fp_coords, _disk_bytes_per_point);
             float cur_expanded_dist;
@@ -1542,7 +1555,6 @@ void PQFlashIndex<T, LabelT>::cached_beam_search(const T *query1, const uint64_t
                     cur_expanded_dist = _disk_pq_table.l2_distance(query_float, (uint8_t *)data_buf);
             }
             full_retset.push_back(Neighbor(frontier_nhood.first, cur_expanded_dist));
-            uint32_t *node_nbrs = (node_buf + 1);
             // compute node_nbrs <-> query dist in PQ space
             cpu_timer.reset();
             compute_dists(node_nbrs, nnbrs, dist_scratch);
diff --git a/src/unified_index.cpp b/src/unified_index.cpp
new file mode 100644
index 000000000..d39eca9ae
--- /dev/null
+++ b/src/unified_index.cpp
@@ -0,0 +1,79 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT license.
+
+#include "unified_index.h"
+
+#include "ann_exception.h"
+#include "unified_index_io.h"
+#include "unified_index_memory.h"
+#include "unified_index_ssd.h"
+
+namespace diskann
+{
+
+namespace
+{
+
+// Map MetricTag from the header to the runtime Metric enum.
+diskann::Metric metric_from_tag(MetricTag tag)
+{
+    switch (tag)
+    {
+    case MetricTag::L2:
+        return diskann::Metric::L2;
+    case MetricTag::InnerProduct:
+        return diskann::Metric::INNER_PRODUCT;
+    case MetricTag::Cosine:
+        return diskann::Metric::COSINE;
+    default:
+        throw ANNException("unified_index factory: unknown metric tag in header", -1, __FUNCSIG__, __FILE__,
+                           __LINE__);
+    }
+}
+
+// Peek the 4 KiB header to decide which T to instantiate. Reader is closed
+// before the index's own load(ctx) reopens the file.
+UnifiedIndexHeader peek_header(const std::string &path)
+{
+    UnifiedIndexReader peek(path);
+    return peek.header();
+}
+
+template <template <typename> class IndexT, typename... CtorArgs>
+std::unique_ptr<unified_index> make_for_data_type(DataTypeTag dt, MetricTag metric_tag, CtorArgs &&...args)
+{
+    const diskann::Metric metric = metric_from_tag(metric_tag);
+    switch (dt)
+    {
+    case DataTypeTag::Float:
+        return std::make_unique<IndexT<float>>(std::forward<CtorArgs>(args)..., metric);
+    case DataTypeTag::Uint8:
+        return std::make_unique<IndexT<uint8_t>>(std::forward<CtorArgs>(args)..., metric);
+    case DataTypeTag::Int8:
+        return std::make_unique<IndexT<int8_t>>(std::forward<CtorArgs>(args)..., metric);
+    default:
+        throw ANNException("unified_index factory: unknown data_type in header", -1, __FUNCSIG__, __FILE__,
+                           __LINE__);
+    }
+}
+
+} // namespace
+
+std::unique_ptr<unified_index> make_unified_index_memory(const UnifiedLoadContext &ctx)
+{
+    const UnifiedIndexHeader h = peek_header(ctx.path);
+    auto idx = make_for_data_type<unified_index_memory>(h.data_type, h.metric);
+    idx->load(ctx);
+    return idx;
+}
+
+std::unique_ptr<unified_index> make_unified_index_ssd(std::shared_ptr<AlignedFileReader> reader,
+                                                      const UnifiedLoadContext &ctx)
+{
+    const UnifiedIndexHeader h = peek_header(ctx.path);
+    auto idx = make_for_data_type<unified_index_ssd>(h.data_type, h.metric, std::move(reader));
+    idx->load(ctx);
+    return idx;
+}
+
+} // namespace diskann
diff --git a/src/unified_index_base.cpp b/src/unified_index_base.cpp
new file mode 100644
index 000000000..980692942
--- /dev/null
+++ b/src/unified_index_base.cpp
@@ -0,0 +1,88 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT license.
+
+#include "unified_index_base.h"
+
+#include "ann_exception.h"
+#include "unified_index_io.h"
+
+namespace diskann
+{
+
+template <typename T>
+unified_index_base<T>::unified_index_base(diskann::Metric metric) : _metric(metric)
+{
+}
+
+template <typename T> unified_index_base<T>::~unified_index_base() = default;
+
+template <typename T> void unified_index_base<T>::validate_header(const UnifiedIndexHeader &h) const
+{
+    if (h.magic != UNIFIED_FORMAT_MAGIC)
+        throw ANNException("unified_index_base: bad magic", -1, __FUNCSIG__, __FILE__, __LINE__);
+    if (h.version != UNIFIED_FORMAT_VERSION)
+        throw ANNException("unified_index_base: unsupported version", -1, __FUNCSIG__, __FILE__, __LINE__);
+    if (h.data_type != data_type_tag_of<T>())
+        throw ANNException("unified_index_base: data_type mismatch with T", -1, __FUNCSIG__, __FILE__, __LINE__);
+}
+
+template <typename T>
+void unified_index_base<T>::validate_search_context(const UnifiedSearchContext &ctx) const
+{
+    if (ctx.query == nullptr)
+        throw ANNException("UnifiedSearchContext: query == nullptr", -1, __FUNCSIG__, __FILE__, __LINE__);
+    if (ctx.indices == nullptr || ctx.distances == nullptr)
+        throw ANNException("UnifiedSearchContext: indices/distances buffers are required", -1, __FUNCSIG__, __FILE__,
+                           __LINE__);
+    if (ctx.K == 0)
+        throw ANNException("UnifiedSearchContext: K must be > 0", -1, __FUNCSIG__, __FILE__, __LINE__);
+    if (ctx.L < ctx.K)
+        throw ANNException("UnifiedSearchContext: L must be >= K", -1, __FUNCSIG__, __FILE__, __LINE__);
+
+    const bool filtered = has_labels();
+    const bool has_filters = !ctx.filter_labels.empty();
+    if (filtered && !has_filters)
+        throw ANNException("UnifiedSearchContext: filter_labels must be non-empty for a filtered index", -1,
+                           __FUNCSIG__, __FILE__, __LINE__);
+    if (!filtered && has_filters)
+        throw ANNException("UnifiedSearchContext: filter_labels must be empty for a non-filtered index", -1,
+                           __FUNCSIG__, __FILE__, __LINE__);
+}
+
+template <typename T> void unified_index_base<T>::load(const UnifiedLoadContext &ctx)
+{
+    _index_path = ctx.path;
+
+    UnifiedIndexReader reader(ctx.path);
+    _header = reader.header();
+    validate_header(_header);
+
+    _labels = make_unified_label_data(reader, _header, _header.npts);
+    load_storage(reader, ctx);
+
+    // Populate resident-memory / cardinality accounting (mirrors
+    // Index::load / PQFlashIndex::load). Common fields here; the
+    // storage-specific node/graph bytes come from the derived class.
+    _table_stats = TableStats{};
+    _table_stats.node_count = _header.npts;
+    if (_labels && _labels->has_labels())
+    {
+        _table_stats.label_count = _labels->num_labels();
+        _table_stats.label_mem_usage = _labels->memory_usage();
+    }
+    fill_storage_stats(_table_stats);
+    _table_stats.total_mem_usage = _table_stats.node_mem_usage + _table_stats.graph_mem_usage +
+                                   _table_stats.label_mem_usage + _table_stats.tag_memory_usage;
+}
+
+template <typename T> void unified_index_base<T>::search(UnifiedSearchContext &ctx)
+{
+    validate_search_context(ctx);
+    search_impl(ctx);
+}
+
+template class unified_index_base<float>;
+template class unified_index_base<uint8_t>;
+template class unified_index_base<int8_t>;
+
+} // namespace diskann
diff --git a/src/unified_index_builder.cpp b/src/unified_index_builder.cpp
new file mode 100644
index 000000000..ac2777466
--- /dev/null
+++ b/src/unified_index_builder.cpp
@@ -0,0 +1,176 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT license.
+
+#include "boost/dynamic_bitset.hpp"
+
+#include "unified_index_builder.h"
+
+#include <algorithm>
+#include <cstdio>
+#include <fstream>
+#include <memory>
+#include <string>
+#include <vector>
+
+#include "ann_exception.h"
+#include "index.h"
+#include "parameters.h"
+#include "pq.h"
+#include "utils.h"
+
+namespace diskann
+{
+
+namespace
+{
+
+// Read a temp file (produced by generate_quantized_data) fully into a byte
+// buffer. Used to pull PQ pivots / codes back in so they can be embedded into
+// the unified container via UnifiedIndexWriter::write_pq.
+std::vector<uint8_t> slurp_file(const std::string &path)
+{
+    std::ifstream in;
+    in.exceptions(std::ios::badbit | std::ios::failbit);
+    in.open(path, std::ios::binary | std::ios::ate);
+    const std::streamoff sz = in.tellg();
+    in.seekg(0, std::ios::beg);
+    std::vector<uint8_t> out(static_cast<size_t>(sz));
+    in.read(reinterpret_cast<char *>(out.data()), sz);
+    return out;
+}
+
+// Discover (npts, dim) from a DiskANN .bin file: a 4-byte int32 npts followed
+// by a 4-byte int32 dim, then npts*dim*sizeof(T) bytes of coords.
+void read_bin_metadata(const std::string &path, size_t &npts_out, size_t &dim_out)
+{
+    diskann::get_bin_metadata(path, npts_out, dim_out, 0);
+}
+
+// Build a Vamana Index<T> over the data file and write the unified container,
+// optionally embedding PQ pivots + codes. The temporary PQ files are removed
+// once their contents have been read back (or the read has failed).
+template <typename T> void build_impl(const UnifiedBuildContext &ctx)
+{
+    size_t npts = 0, dim = 0;
+    read_bin_metadata(ctx.data_file_path, npts, dim);
+    if (npts == 0 || dim == 0)
+    {
+        throw ANNException("unified_index_builder: empty or unreadable data file", -1, __FUNCSIG__, __FILE__,
+                           __LINE__);
+    }
+
+    // -----------------------------------------------------------------------
+    // 1) Build Vamana graph in memory via Index<T>.
+    // -----------------------------------------------------------------------
+    auto write_params = std::make_shared<IndexWriteParameters>(
+        IndexWriteParametersBuilder(ctx.L, ctx.R)
+            .with_alpha(ctx.alpha)
+            .with_num_threads(ctx.num_threads)
+            .build());
+
+    Index<T, uint32_t, uint32_t> idx(ctx.metric, dim, npts, write_params, /*search_params=*/nullptr,
+                                      /*num_frozen_pts=*/0, /*dynamic=*/false, /*enable_tags=*/false,
+                                      /*concurrent_consolidate=*/false, /*pq_dist_build=*/false,
+                                      /*num_pq_chunks=*/0, /*use_opq=*/false,
+                                      /*filtered_index=*/!ctx.label_file.empty());
+
+    if (!ctx.label_file.empty())
+    {
+        IndexFilterParams filter_params = IndexFilterParamsBuilder()
+                                              .with_label_file(ctx.label_file)
+                                              .with_universal_label(ctx.universal_label)
+                                              .with_save_path_prefix(ctx.output_path + ".legacy_tmp")
+                                              .build();
+        idx.build(ctx.data_file_path, npts, filter_params);
+    }
+    else
+    {
+        idx.build(ctx.data_file_path.c_str(), npts, std::vector<uint32_t>());
+    }
+
+    // -----------------------------------------------------------------------
+    // 2) PQ training (sampled subset).
+    // -----------------------------------------------------------------------
+    // The unified SSD load path (unified_index_ssd::load_storage) requires the
+    // HAS_PQ regions to be present, so we emit PQ whenever PQ is requested
+    // (pq_dim > 0) -- including the pq_dim == dim case, which yields chunk
+    // size 1 (one PQ chunk per dimension). Previously pq_dim >= dim
+    // skipped PQ generation, producing a file that could not be loaded as an
+    // SSD index. Clamp the chunk count to dim so an over-large pq_dim can't
+    // silently skip PQ either (generate_pq_pivots rejects num_pq_chunks > dim).
+    // TODO: revisit -- a cleaner design would make PQ truly optional on the
+    // SSD path (serve full-precision coords when HAS_PQ is unset).
+    std::vector<uint8_t> pq_pivots_bytes;
+    std::vector<uint8_t> pq_codes_bytes;
+    const bool train_pq = (ctx.pq_dim > 0);
+    if (train_pq)
+    {
+        const size_t pq_chunks = std::min(static_cast<size_t>(ctx.pq_dim), dim);
+        const std::string temp_prefix = ctx.output_path + ".pq_tmp";
+        const std::string temp_pivots = temp_prefix + ".pq_pivots.bin";
+        const std::string temp_codes = temp_prefix + ".pq_codes.bin";
+
+        double p_val = ctx.pq_sampling_rate;
+        if (p_val <= 0.0 || p_val > 1.0)
+            p_val = 0.1; // safety fallback
+        // For tiny datasets, raise the rate so roughly 256 points are sampled for training.
+        if (npts > 0)
+        {
+            const double min_p = std::min(1.0, 256.0 / static_cast<double>(npts));
+            if (p_val < min_p)
+                p_val = min_p;
+        }
+
+        diskann::generate_quantized_data<T>(ctx.data_file_path, temp_pivots, temp_codes, ctx.metric, p_val,
+                                            pq_chunks, /*use_opq=*/false, /*codebook_prefix=*/"");
+
+        try
+        {
+            pq_pivots_bytes = slurp_file(temp_pivots);
+            pq_codes_bytes = slurp_file(temp_codes);
+        }
+        catch (...)
+        {
+            std::remove(temp_pivots.c_str());
+            std::remove(temp_codes.c_str());
+            throw;
+        }
+        std::remove(temp_pivots.c_str());
+        std::remove(temp_codes.c_str());
+    }
+
+    // -----------------------------------------------------------------------
+    // 3) Emit the unified container.
+    // -----------------------------------------------------------------------
+    idx.save_unified(ctx.output_path.c_str(), pq_pivots_bytes, pq_codes_bytes);
+}
+
+} // namespace
+
+unified_index_builder::unified_index_builder() = default;
+unified_index_builder::~unified_index_builder() = default;
+
+void unified_index_builder::build(const UnifiedBuildContext &ctx)
+{
+    if (ctx.data_file_path.empty())
+        throw ANNException("UnifiedBuildContext: data_file_path is empty", -1, __FUNCSIG__, __FILE__, __LINE__);
+    if (ctx.output_path.empty())
+        throw ANNException("UnifiedBuildContext: output_path is empty", -1, __FUNCSIG__, __FILE__, __LINE__);
+
+    switch (ctx.data_type)
+    {
+    case DataTypeTag::Float:
+        build_impl<float>(ctx);
+        break;
+    case DataTypeTag::Uint8:
+        build_impl<uint8_t>(ctx);
+        break;
+    case DataTypeTag::Int8:
+        build_impl<int8_t>(ctx);
+        break;
+    default:
+        throw ANNException("unified_index_builder: unsupported data_type", -1, __FUNCSIG__, __FILE__, __LINE__);
+    }
+}
+
+} // namespace diskann
diff --git a/src/unified_index_io.cpp b/src/unified_index_io.cpp
new file mode 100644
index 000000000..4e5b5972b
--- /dev/null
+++ b/src/unified_index_io.cpp
@@ -0,0 +1,303 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT license.
+
+#include "unified_index_io.h"
+
+#include <algorithm>
+#include <cstring>
+#include <stdexcept>
+
+#include "logger.h"
+
+namespace diskann
+{
+
+namespace
+{
+constexpr uint8_t ZERO_SECTOR[UNIFIED_FORMAT_ALIGN] = {};
+}
+
+UnifiedIndexWriter::UnifiedIndexWriter(const std::string &path) : _path(path)
+{
+    _out.exceptions(std::ios::badbit | std::ios::failbit);
+    _out.open(_path, std::ios::binary | std::ios::out | std::ios::trunc);
+}
+
+UnifiedIndexWriter::~UnifiedIndexWriter()
+{
+    if (_out.is_open() && !_finalized)
+    {
+        // Not finalized: file is unusable. Mask stream exceptions so close() cannot throw here.
+        _out.exceptions(std::ios::goodbit);
+        _out.close();
+    }
+}
+
+uint64_t UnifiedIndexWriter::cur_offset()
+{
+    return static_cast<uint64_t>(_out.tellp());
+}
+
+void UnifiedIndexWriter::write_raw(const void *bytes, uint64_t len)
+{
+    if (len == 0)
+        return;
+    _out.write(static_cast<const char *>(bytes), static_cast<std::streamsize>(len));
+}
+
+void UnifiedIndexWriter::pad_to_4k()
+{
+    uint64_t cur = cur_offset();
+    uint64_t aligned = align_up_4k(cur);
+    if (aligned == cur)
+        return;
+    write_raw(ZERO_SECTOR, aligned - cur);
+}
+
+void UnifiedIndexWriter::begin(uint64_t npts, uint64_t dim, uint64_t aligned_dim, uint32_t max_degree,
+                               DataTypeTag data_type, MetricTag metric, uint64_t start_node)
+{
+    if (npts == 0)
+        throw std::invalid_argument("UnifiedIndexWriter::begin: npts must be > 0");
+
+    _header.magic = UNIFIED_FORMAT_MAGIC;
+    _header.version = UNIFIED_FORMAT_VERSION;
+    _header.data_type = data_type;
+    _header.metric = metric;
+    _header.npts = npts;
+    _header.dim = dim;
+    _header.aligned_dim = aligned_dim;
+    _header.max_degree = max_degree;
+    _header.flags = 0;
+    _header.start_node = start_node;
+    _header.label_encoding = LabelEncoding::None;
+
+    _node_offsets.assign(npts + 1, 0);
+
+    // Reserve header sector -- written for real in finalize().
+    write_raw(ZERO_SECTOR, UNIFIED_FORMAT_ALIGN);
+
+    // Reserve offset table region -- filled in finalize().
+    _header.offset_table_off = cur_offset();
+    _header.offset_table_len = (npts + 1) * sizeof(uint64_t);
+    const uint64_t table_padded = align_up_4k(_header.offset_table_len);
+    for (uint64_t written = 0; written < table_padded; written += UNIFIED_FORMAT_ALIGN)
+    {
+        const uint64_t chunk = std::min<uint64_t>(UNIFIED_FORMAT_ALIGN, table_padded - written);
+        write_raw(ZERO_SECTOR, chunk);
+    }
+}
+
+void UnifiedIndexWriter::begin_graph_region()
+{
+    if (_graph_open)
+        throw std::logic_error("UnifiedIndexWriter: graph region already open");
+    _graph_open = true;
+    _header.graph_region_off = cur_offset();
+    _graph_region_start = _header.graph_region_off;
+}
+
+void UnifiedIndexWriter::write_node(const void *coords, const uint32_t *neighbors, uint32_t degree)
+{
+    if (!_graph_open)
+        throw std::logic_error("UnifiedIndexWriter: write_node before begin_graph_region");
+    if (_written_nodes >= _header.npts)
+        throw std::logic_error("UnifiedIndexWriter: too many nodes written");
+
+    const uint64_t coords_bytes = _header.dim * [&]() -> uint64_t {
+        switch (_header.data_type)
+        {
+        case DataTypeTag::Float:
+            return sizeof(float);
+        case DataTypeTag::Uint8:
+            return sizeof(uint8_t);
+        case DataTypeTag::Int8:
+            return sizeof(int8_t);
+        }
+        throw std::logic_error("UnifiedIndexWriter: unknown data type");
+    }();
+
+    const uint64_t id = _written_nodes;
+    _node_offsets[id] = cur_offset() - _graph_region_start;
+    write_raw(coords, coords_bytes);
+    write_raw(neighbors, static_cast<uint64_t>(degree) * sizeof(uint32_t));
+    _node_offsets[id + 1] = cur_offset() - _graph_region_start;
+    ++_written_nodes;
+}
+
+void UnifiedIndexWriter::end_graph_region()
+{
+    if (!_graph_open)
+        throw std::logic_error("UnifiedIndexWriter: end_graph_region without begin_graph_region");
+    _header.graph_region_len = cur_offset() - _header.graph_region_off;
+    pad_to_4k();
+    _graph_open = false;
+}
+
+void UnifiedIndexWriter::write_medoids(const uint32_t *medoid_ids, uint64_t num_medoids)
+{
+    _header.medoids_off = cur_offset();
+    _header.medoids_len = num_medoids * sizeof(uint32_t);
+    write_raw(medoid_ids, _header.medoids_len);
+    pad_to_4k();
+}
+
+void UnifiedIndexWriter::write_pq(const void *pivots_bytes, uint64_t pivots_len, const void *codes_bytes,
+                                  uint64_t codes_len)
+{
+    _header.flags |= HAS_PQ;
+    _header.pq_pivots_off = cur_offset();
+    _header.pq_pivots_len = pivots_len;
+    write_raw(pivots_bytes, pivots_len);
+    pad_to_4k();
+
+    _header.pq_codes_off = cur_offset();
+    _header.pq_codes_len = codes_len;
+    write_raw(codes_bytes, codes_len);
+    pad_to_4k();
+}
+
+void UnifiedIndexWriter::write_max_base_norm(float value)
+{
+    _header.flags |= HAS_MAX_BASE_NORM;
+    _header.max_base_norm_off = cur_offset();
+    _header.max_base_norm_len = sizeof(float);
+    write_raw(&value, sizeof(float));
+    pad_to_4k();
+}
+
+void UnifiedIndexWriter::write_labels_bitmask(uint64_t total_labels, uint64_t universal_label,
+                                              const void *dictionary_bytes, uint64_t dictionary_len,
+                                              const void *bitmask_bytes, uint64_t bitmask_bytes_len)
+{
+    _header.flags |= HAS_LABELS;
+    _header.label_encoding = LabelEncoding::Bitmask;
+    _header.total_labels = total_labels;
+    _header.universal_label = universal_label;
+
+    _header.label_dictionary_off = cur_offset();
+    _header.label_dictionary_len = dictionary_len;
+    write_raw(dictionary_bytes, dictionary_len);
+    pad_to_4k();
+
+    _header.per_point_labels_off = cur_offset();
+    _header.per_point_labels_len = bitmask_bytes_len;
+    write_raw(bitmask_bytes, bitmask_bytes_len);
+    pad_to_4k();
+}
+
+void UnifiedIndexWriter::write_labels_integer(uint64_t total_labels, uint64_t universal_label,
+                                              const void *dictionary_bytes, uint64_t dictionary_len,
+                                              const void *per_point_data, uint64_t per_point_data_len,
+                                              const uint64_t *per_point_offsets)
+{
+    _header.flags |= HAS_LABELS;
+    _header.label_encoding = LabelEncoding::Integer;
+    _header.total_labels = total_labels;
+    _header.universal_label = universal_label;
+
+    _header.label_dictionary_off = cur_offset();
+    _header.label_dictionary_len = dictionary_len;
+    write_raw(dictionary_bytes, dictionary_len);
+    pad_to_4k();
+
+    // Write order mirrors the graph: offset table first, then per-point payload.
+    _header.per_point_label_offsets_off = cur_offset();
+    _header.per_point_label_offsets_len = (_header.npts + 1) * sizeof(uint64_t);
+    write_raw(per_point_offsets, _header.per_point_label_offsets_len);
+    pad_to_4k();
+
+    _header.per_point_labels_off = cur_offset();
+    _header.per_point_labels_len = per_point_data_len;
+    write_raw(per_point_data, per_point_data_len);
+    pad_to_4k();
+}
+
+void UnifiedIndexWriter::finalize()
+{
+    if (_finalized)
+        throw std::logic_error("UnifiedIndexWriter: already finalized");
+
+    // Capture total bytes written so readers can verify on load.
+    _header.file_size_bytes = cur_offset();
+
+    // Write offset table back at its reserved spot.
+    _out.seekp(static_cast<std::streamoff>(_header.offset_table_off), std::ios::beg);
+    write_raw(_node_offsets.data(), _header.offset_table_len);
+
+    // Write final header at byte 0.
+    _out.seekp(0, std::ios::beg);
+    write_raw(&_header, sizeof(UnifiedIndexHeader));
+
+    _out.flush();
+    _out.close();
+    _finalized = true;
+}
+
+// ---------- Reader ----------
+
+UnifiedIndexReader::UnifiedIndexReader(const std::string &path) : _path(path)
+{
+    parse_header();
+}
+
+void UnifiedIndexReader::parse_header()
+{
+    std::ifstream in;
+    in.exceptions(std::ios::badbit | std::ios::failbit);
+    in.open(_path, std::ios::binary);
+    in.read(reinterpret_cast<char *>(&_header), sizeof(UnifiedIndexHeader));
+
+    if (_header.magic != UNIFIED_FORMAT_MAGIC)
+        throw std::runtime_error("UnifiedIndexReader: bad magic in " + _path);
+    if (_header.version > UNIFIED_FORMAT_VERSION)
+        throw std::runtime_error("UnifiedIndexReader: unsupported version " + std::to_string(_header.version) +
+                                 " in " + _path);
+
+    // Validate file size against the value recorded by the writer. A mismatch
+    // typically means truncation, partial-write, or external tampering; the
+    // exact size is also useful for disk-quota / resource-planning telemetry.
+    in.seekg(0, std::ios::end);
+    const uint64_t actual_size = static_cast<uint64_t>(in.tellg());
+    if (_header.file_size_bytes != 0 && _header.file_size_bytes != actual_size)
+    {
+        throw std::runtime_error("UnifiedIndexReader: file_size_bytes mismatch in " + _path + " (header=" +
+                                 std::to_string(_header.file_size_bytes) +
+                                 ", actual=" + std::to_string(actual_size) + ")");
+    }
+    diskann::cout << "UnifiedIndexReader: opened " << _path << " (" << actual_size << " bytes)" << std::endl;
+}
+
+std::vector<uint64_t> UnifiedIndexReader::load_offset_table()
+{
+    std::vector<uint64_t> table(_header.npts + 1);
+    std::ifstream in;
+    in.exceptions(std::ios::badbit | std::ios::failbit);
+    in.open(_path, std::ios::binary);
+    in.seekg(static_cast<std::streamoff>(_header.offset_table_off), std::ios::beg);
+    in.read(reinterpret_cast<char *>(table.data()),
+            static_cast<std::streamsize>((_header.npts + 1) * sizeof(uint64_t)));
+    return table;
+}
+
+std::vector<uint8_t> UnifiedIndexReader::load_region(uint64_t off, uint64_t len)
+{
+    std::vector<uint8_t> buf(len);
+    if (len == 0)
+        return buf;
+    load_region(off, len, buf.data());
+    return buf;
+}
+
+void UnifiedIndexReader::load_region(uint64_t off, uint64_t len, uint8_t *dst)
+{
+    if (len == 0)
+        return;
+    std::ifstream in;
+    in.exceptions(std::ios::badbit | std::ios::failbit);
+    in.open(_path, std::ios::binary);
+    in.seekg(static_cast<std::streamoff>(off), std::ios::beg);
+    in.read(reinterpret_cast<char *>(dst), static_cast<std::streamsize>(len));
+}
+
+} // namespace diskann
diff --git a/src/unified_index_memory.cpp b/src/unified_index_memory.cpp
new file mode 100644
index 000000000..ecef07ae4
--- /dev/null
+++ b/src/unified_index_memory.cpp
@@ -0,0 +1,325 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT license.
+
+#include "boost/dynamic_bitset.hpp"
+
+#include "unified_index_memory.h"
+
+#include <algorithm>
+#include <cstring>
+#include <limits>
+#include <memory>
+
+#include "ann_exception.h"
+#include "distance.h"
+#include "filter_match_proxy.h"
+#include "neighbor.h"
+#include "percentile_stats.h"
+#include "unified_index_io.h"
+#include "unified_node_store.h"
+#include "utils.h"
+
+#ifndef MAX_POINTS_FOR_USING_BITSET
+#define MAX_POINTS_FOR_USING_BITSET 10000000
+#endif
+
+namespace diskann
+{
+
+template <typename T>
+unified_index_memory<T>::unified_index_memory(diskann::Metric metric)
+    : unified_index_base<T>(metric), _query_scratch(nullptr)
+{
+}
+
+template <typename T> unified_index_memory<T>::~unified_index_memory()
+{
+    if (!_query_scratch.empty())
+    {
+        ScratchStoreManager<InMemQueryScratch<T>> manager(_query_scratch);
+        manager.destroy();
+    }
+}
+
+template <typename T>
+void unified_index_memory<T>::load_storage(UnifiedIndexReader &r, const UnifiedLoadContext &ctx)
+{
+    // Build the resident node store.
+    auto store = std::make_unique<unified_node_store_memory<T>>();
+    store->load(r, this->_header);
+    this->_store = std::move(store);
+
+    _start = static_cast<uint32_t>(this->_header.start_node);
+    _max_observed_degree = this->_header.max_degree;
+
+    // Mirror unified_index_ssd: load the medoids region. For unfiltered
+    // builds the writer emits a single-entry list (== _start); for filtered
+    // builds it emits one per label. Search-time seeding uses these
+    // medoids the same way the SSD path does -- pick the closest to query.
+    if (this->_header.medoids_len > 0)
+    {
+        const size_t num = this->_header.medoids_len / sizeof(uint32_t);
+        _medoids.resize(num);
+        r.load_region(this->_header.medoids_off, this->_header.medoids_len,
+                      reinterpret_cast<uint8_t *>(_medoids.data()));
+    }
+    if (_medoids.empty())
+        _medoids.push_back(_start);
+
+    _dist_cmp.reset(get_distance_function<T>(this->_metric));
+
+    init_scratch_pool(ctx.num_threads, ctx.search_l);
+}
+
+template <typename T> void unified_index_memory<T>::init_scratch_pool(uint32_t num_threads, uint32_t search_l)
+{
+    if (num_threads == 0)
+        num_threads = 1;
+
+    const size_t dim = static_cast<size_t>(this->_header.dim);
+    const size_t aligned_dim = static_cast<size_t>(this->_header.aligned_dim);
+    const uint32_t R = _max_observed_degree;
+    const uint32_t maxc = 750; // legacy default
+    const size_t alignment_factor = _dist_cmp ? _dist_cmp->get_required_alignment() : 8;
+
+    // The unified path doesn't use InMemQueryScratch::_query_label_bitmask --
+    // the bitmask match proxy owns its own per-query scratch internally
+    // (bitmask_filter_match's 3-arg ctor). So we can skip the per-thread
+    // bitmask buffer allocation entirely.
+    const size_t bitmask_size = 0;
+
+    std::vector<uint32_t> empty_sellers;
+    for (uint32_t i = 0; i < num_threads; ++i)
+    {
+        auto *s = new InMemQueryScratch<T>(search_l, search_l, R, maxc, dim, aligned_dim, alignment_factor,
+                                           empty_sellers,
+                                           /*init_pq_scratch=*/false, bitmask_size);
+        _query_scratch.push(s);
+    }
+}
+
+template <typename T>
+std::pair<uint32_t, uint32_t> unified_index_memory<T>::iterate_to_fixed_point(
+    InMemQueryScratch<T> *scratch, uint32_t L, const T *query, const std::vector<uint32_t> &init_ids,
+    filter_match_proxy *match_proxy)
+{
+    auto *store = static_cast<unified_node_store_memory<T> *>(this->_store.get());
+    const uint64_t aligned_dim = this->_header.aligned_dim;
+
+    NeighborPriorityQueue &best_L_nodes = scratch->best_l_nodes();
+    best_L_nodes.reserve(L);
+    tsl::robin_set<uint32_t> &inserted_into_pool_rs = scratch->inserted_into_pool_rs();
+    boost::dynamic_bitset<> &inserted_into_pool_bs = scratch->inserted_into_pool_bs();
+    std::vector<uint32_t> &id_scratch = scratch->id_scratch();
+    std::vector<float> &dist_scratch = scratch->dist_scratch();
+    id_scratch.clear();
+    dist_scratch.clear();
+
+    const T *aligned_query = scratch->aligned_query();
+
+    const uint64_t total_num_points = this->_header.npts;
+    const bool fast_iterate = total_num_points <= MAX_POINTS_FOR_USING_BITSET;
+
+    if (fast_iterate)
+    {
+        if (inserted_into_pool_bs.size() < total_num_points)
+        {
+            auto resize_size = 2 * total_num_points > MAX_POINTS_FOR_USING_BITSET
+                                   ? MAX_POINTS_FOR_USING_BITSET
+                                   : 2 * total_num_points;
+            inserted_into_pool_bs.resize(resize_size);
+        }
+    }
+
+    auto is_not_visited = [fast_iterate, &inserted_into_pool_bs, &inserted_into_pool_rs](uint32_t id) {
+        return fast_iterate ? !inserted_into_pool_bs.test(id)
+                            : inserted_into_pool_rs.find(id) == inserted_into_pool_rs.end();
+    };
+    auto mark_visited = [fast_iterate, &inserted_into_pool_bs, &inserted_into_pool_rs](uint32_t id) {
+        if (fast_iterate)
+            inserted_into_pool_bs.set(id);
+        else
+            inserted_into_pool_rs.insert(id);
+    };
+
+    uint32_t hops = 0;
+    uint32_t cmps = 0;
+
+    for (uint32_t id : init_ids)
+    {
+        if (id >= total_num_points)
+            continue;
+        if (match_proxy != nullptr && !match_proxy->contain_filtered_label(id))
+            continue;
+        if (is_not_visited(id))
+        {
+            mark_visited(id);
+            const T *coords = store->get_coords(id);
+            float d = _dist_cmp->compare(aligned_query, coords, static_cast<uint32_t>(aligned_dim));
+            best_L_nodes.insert(Neighbor(id, d));
+            ++cmps;
+        }
+    }
+
+    while (best_L_nodes.has_unexpanded_node())
+    {
+        auto nbr = best_L_nodes.closest_unexpanded();
+        const uint32_t n = nbr.id;
+        ++hops;
+
+        id_scratch.clear();
+        dist_scratch.clear();
+
+        uint32_t deg = 0;
+        const uint32_t *nbrs = store->get_neighbors(n, deg);
+        for (uint32_t j = 0; j < deg; ++j)
+        {
+            const uint32_t id = nbrs[j];
+            if (id >= total_num_points)
+                continue;
+            if (!is_not_visited(id))
+                continue;
+            if (match_proxy != nullptr && !match_proxy->contain_filtered_label(id))
+                continue;
+            id_scratch.push_back(id);
+        }
+
+        for (uint32_t id : id_scratch)
+            mark_visited(id);
+
+        dist_scratch.resize(id_scratch.size());
+        for (size_t k = 0; k < id_scratch.size(); ++k)
+        {
+            const T *coords = store->get_coords(id_scratch[k]);
+            dist_scratch[k] = _dist_cmp->compare(aligned_query, coords, static_cast<uint32_t>(aligned_dim));
+        }
+        cmps += static_cast<uint32_t>(id_scratch.size());
+
+        for (size_t k = 0; k < id_scratch.size(); ++k)
+        {
+            best_L_nodes.insert(Neighbor(id_scratch[k], dist_scratch[k]));
+        }
+    }
+
+    return {hops, cmps};
+}
+
+template <typename T> void unified_index_memory<T>::search_impl(UnifiedSearchContext &ctx)
+{
+    if (_query_scratch.size() == 0)
+    {
+        throw ANNException("unified_index_memory::search_impl: scratch pool empty (was load() called?)", -1,
+                           __FUNCSIG__, __FILE__, __LINE__);
+    }
+
+    ScratchStoreManager<InMemQueryScratch<T>> manager(_query_scratch);
+    InMemQueryScratch<T> *scratch = manager.scratch_space();
+    scratch->resize_for_new_L(std::max<uint32_t>(ctx.L, static_cast<uint32_t>(ctx.K)));
+    scratch->clear();
+
+    const uint64_t dim = this->_header.dim;
+    const uint64_t aligned_dim = this->_header.aligned_dim;
+    T *aligned_query = scratch->aligned_query();
+    std::memset(aligned_query, 0, aligned_dim * sizeof(T));
+    if (_dist_cmp && _dist_cmp->preprocessing_required())
+    {
+        _dist_cmp->preprocess_query(static_cast<const T *>(ctx.query), dim, aligned_query);
+    }
+    else
+    {
+        std::memcpy(aligned_query, ctx.query, dim * sizeof(T));
+    }
+
+    // Build the label match proxy if the index is filtered, resolving the
+    // filter label strings once: resolve_filters yields both the internal
+    // label ints (for the proxy) and the per-label medoid seed ids (init_ids
+    // below) from a single dictionary probe per label.
+    //
+    // init_ids / filter_label_ints are thread_local: search_impl runs once per
+    // query on a pooled thread, so reusing these buffers across queries avoids a
+    // per-call heap allocation (mirrors unified_index_ssd::cached_beam_search).
+    // init_ids must be cleared up front because the unfiltered branch below
+    // relies on it being empty when no filter is applied.
+    std::unique_ptr<filter_match_proxy> proxy;
+    thread_local std::vector<uint32_t> init_ids;
+    init_ids.clear();
+    if (this->_labels && this->_labels->has_labels() && !ctx.filter_labels.empty())
+    {
+        thread_local std::vector<uint32_t> filter_label_ints;
+        this->_labels->resolve_filters(ctx.filter_labels, filter_label_ints, init_ids);
+        proxy = this->_labels->make_match_proxy(filter_label_ints);
+    }
+
+    // Seed init_ids. Aligned with unified_index_ssd::cached_beam_search:
+    // - Unfiltered: pick the single closest medoid from _medoids by
+    //   full-vector distance under the index metric (all coords are resident
+    //   in memory, so the SSD path's pre-computed centroid array is unneeded).
+    // - Filtered: one medoid per filter label (the unified format stores
+    //   exactly one per label), already resolved above. Per-label medoid
+    //   seeding dramatically improves recall on filtered search because the
+    //   global start node may not lie within any filter-label cluster.
+    auto *store = static_cast<unified_node_store_memory<T> *>(this->_store.get());
+    if (init_ids.empty())
+    {
+        // Unfiltered path -- pick closest of the global medoid set.
+        uint32_t best_id = _medoids.empty() ? _start : _medoids[0];
+        float best_dist = std::numeric_limits<float>::max();
+        for (uint32_t mid : _medoids)
+        {
+            const T *coords = store->get_coords(mid);
+            const float d = _dist_cmp->compare(aligned_query, coords, static_cast<uint32_t>(aligned_dim));
+            if (d < best_dist)
+            {
+                best_dist = d;
+                best_id = mid;
+            }
+        }
+        init_ids.push_back(best_id);
+    }
+
+    auto [hops, cmps] = iterate_to_fixed_point(scratch, ctx.L, aligned_query, init_ids, proxy.get());
+
+    NeighborPriorityQueue &best_L_nodes = scratch->best_l_nodes();
+    size_t pos = 0;
+    for (size_t i = 0; i < best_L_nodes.size() && pos < ctx.K; ++i)
+    {
+        const Neighbor &n = best_L_nodes[i];
+        ctx.indices[pos] = static_cast<uint64_t>(n.id);
+        if (this->_metric == diskann::Metric::INNER_PRODUCT)
+            ctx.distances[pos] = -n.distance;
+        else
+            ctx.distances[pos] = n.distance;
+        ++pos;
+    }
+    for (; pos < ctx.K; ++pos)
+    {
+        ctx.indices[pos] = std::numeric_limits<uint64_t>::max();
+        ctx.distances[pos] = std::numeric_limits<float>::max();
+    }
+
+    if (ctx.stats != nullptr)
+    {
+        ctx.stats->n_hops = hops;
+        ctx.stats->n_cmps = cmps;
+    }
+}
+
+template <typename T> void unified_index_memory<T>::fill_storage_stats(TableStats &stats) const
+{
+    // Memory keeps the whole graph region resident: [coords, neighbors] per
+    // node. Split it into vector bytes (node_mem_usage) and adjacency bytes
+    // (graph_mem_usage), mirroring Index::get_data_size / get_graph_size.
+    const auto *store = static_cast<const unified_node_store_memory<T> *>(this->_store.get());
+    if (store == nullptr)
+        return;
+    const uint64_t resident = store->resident_bytes();
+    const uint64_t node_bytes = store->num_points() * store->coord_bytes();
+    stats.node_mem_usage = node_bytes;
+    stats.graph_mem_usage = resident > node_bytes ? resident - node_bytes : 0;
+}
+
+template class unified_index_memory<float>;
+template class unified_index_memory<uint8_t>;
+template class unified_index_memory<int8_t>;
+
+} // namespace diskann
diff --git a/src/unified_index_ssd.cpp b/src/unified_index_ssd.cpp
new file mode 100644
index 000000000..f5aff7ed2
--- /dev/null
+++ b/src/unified_index_ssd.cpp
@@ -0,0 +1,440 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT license.
+
+#include "boost/dynamic_bitset.hpp"
+
+#include "unified_index_ssd.h"
+
+#include <algorithm>
+#include <cstdio>
+#include <cstring>
+#include <fstream>
+#include <limits>
+#include <memory>
+#include <vector>
+
+#include "ann_exception.h"
+#include "neighbor.h"
+#include "percentile_stats.h"
+#include "pq.h"
+#include "pq_scratch.h"
+#include "unified_index_io.h"
+#include "unified_node_store.h"
+#include "utils.h"
+
+namespace diskann
+{
+
+
+template <typename T>
+unified_index_ssd<T>::unified_index_ssd(std::shared_ptr<AlignedFileReader> reader, diskann::Metric metric)
+    : unified_index_base<T>(metric), _reader(std::move(reader)), _thread_data(nullptr)
+{
+}
+
+template <typename T> unified_index_ssd<T>::~unified_index_ssd()
+{
+    if (_centroid_data != nullptr)
+    {
+        aligned_free(_centroid_data);
+        _centroid_data = nullptr;
+    }
+    if (!_thread_data.empty())
+    {
+        ScratchStoreManager<SSDThreadData<T>> manager(_thread_data);
+        manager.destroy();
+    }
+}
+
+template <typename T>
+void unified_index_ssd<T>::load_storage(UnifiedIndexReader &r, const UnifiedLoadContext &ctx)
+{
+    const UnifiedIndexHeader &h = r.header();
+    if (!(h.flags & HAS_PQ))
+    {
+        throw ANNException("unified_index_ssd::load_storage: SSD load requires HAS_PQ; file lacks PQ regions", -1,
+                           __FUNCSIG__, __FILE__, __LINE__);
+    }
+
+    // 1) Construct + load the disk-resident node store.
+    auto store = std::make_unique<unified_node_store_ssd<T>>(_reader);
+    store->load(r, h);
+    this->_store = std::move(store);
+
+    _dist_cmp.reset(get_distance_function<T>(this->_metric));
+    _dist_cmp_float.reset(get_distance_function<float>(this->_metric));
+
+    // 2) Load PQ pivots + codes directly from the unified file -- no temp
+    //    files: pivots are parsed from an in-memory blob, codes land in _pq_codes.
+    std::vector<uint8_t> pq_pivots_blob(h.pq_pivots_len);
+    r.load_region(h.pq_pivots_off, h.pq_pivots_len, pq_pivots_blob.data());
+    _pq_table.load_pq_centroid_bin_from_memory(pq_pivots_blob.data(), pq_pivots_blob.size(), /*num_chunks=*/0);
+
+    // PQ codes blob format (matches diskann::load_bin<uint8_t>):
+    //   [int32 npts][int32 nchunks][uint8 payload[npts * nchunks]]
+    // Read the 8-byte header first so we can size _pq_codes exactly, then
+    // stream the payload directly into _pq_codes -- zero intermediate copy.
+    if (h.pq_codes_len < 2 * sizeof(int32_t))
+    {
+        throw ANNException("unified_index_ssd::load_storage: PQ codes region truncated (header)", -1, __FUNCSIG__,
+                           __FILE__, __LINE__);
+    }
+    int32_t codes_header[2] = {0, 0};
+    r.load_region(h.pq_codes_off, 2 * sizeof(int32_t), reinterpret_cast<uint8_t *>(codes_header));
+    const size_t codes_npts = static_cast<size_t>(codes_header[0]);
+    const size_t codes_nchunks = static_cast<size_t>(codes_header[1]);
+
+    if (codes_header[0] < 0 || codes_header[1] < 0 || codes_npts != h.npts)
+    {
+        throw ANNException("unified_index_ssd::load_storage: PQ codes header invalid or npts mismatch", -1,
+                           __FUNCSIG__, __FILE__, __LINE__);
+    }
+    const size_t payload_bytes = codes_npts * codes_nchunks * sizeof(uint8_t);
+    if (2 * sizeof(int32_t) + payload_bytes > h.pq_codes_len)
+    {
+        throw ANNException("unified_index_ssd::load_storage: PQ codes region truncated (payload)", -1, __FUNCSIG__,
+                           __FILE__, __LINE__);
+    }
+    _pq_codes.resize(payload_bytes);
+    r.load_region(h.pq_codes_off + 2 * sizeof(int32_t), payload_bytes, _pq_codes.data());
+    _n_chunks = codes_nchunks;
+
+    // 3) Load medoids.
+    load_medoids_from_unified(r);
+
+    // 4) Optional: HAS_MAX_BASE_NORM (MIPS rescaling).
+    if (h.flags & HAS_MAX_BASE_NORM)
+    {
+        if (h.max_base_norm_len >= sizeof(float))
+            r.load_region(h.max_base_norm_off, sizeof(float), reinterpret_cast<uint8_t *>(&_max_base_norm));
+    }
+
+    // 5) Per-thread scratch.
+    setup_thread_data(ctx.num_threads);
+    _max_nthreads = ctx.num_threads == 0 ? 1 : ctx.num_threads;
+
+    // 6) Centroid data (copy medoid vectors into the centroid array).
+    use_medoids_data_as_centroids();
+
+    // 7) Optional cache priming. Reuses an SSDThreadData (slab + IOContext)
+    //    from the pool we just built in step 5, so no extra slab allocation.
+    //    Seed the BFS from the global medoids AND each label's entry-point
+    //    medoid, so filtered-search seeds (and their BFS neighborhoods) get
+    //    cached too -- mirroring PQFlashIndex::cache_bfs_levels, which seeds
+    //    from both _medoids and _filter_to_medoid_ids. Global medoids go first
+    //    so they keep priority under the num_nodes_to_cache cap;
+    //    cache_bfs_levels dedups via its visited set, so any overlap between
+    //    the two sources is harmless.
+    if (ctx.num_nodes_to_cache > 0)
+    {
+        std::vector<uint32_t> seed_ids = _medoids;
+        if (this->_labels && this->_labels->has_labels())
+            this->_labels->collect_label_medoids(seed_ids);
+
+        if (!seed_ids.empty())
+        {
+            auto *ssd_store = static_cast<unified_node_store_ssd<T> *>(this->_store.get());
+            ScratchStoreManager<SSDThreadData<T>> manager(_thread_data);
+            SSDThreadData<T> *tdata = manager.scratch_space();
+            NodeFetchScratch fetch_scratch;
+            fetch_scratch.attach_borrowed(tdata->ctx, tdata->scratch.sector_scratch,
+                                           defaults::MAX_N_SECTOR_READS * defaults::SECTOR_LEN);
+            std::vector<uint32_t> cached_list;
+            ssd_store->cache_bfs_levels(seed_ids, ctx.num_nodes_to_cache, cached_list, fetch_scratch);
+        }
+    }
+}
+
+template <typename T> void unified_index_ssd<T>::load_pq_from_unified(UnifiedIndexReader & /*r*/)
+{
+    // PQ is loaded inline in load_storage via direct in-memory parsing
+    // (FixedChunkPQTable::load_pq_centroid_bin_from_memory + manual codes
+    // parsing). This stub is retained as a placeholder in case a future
+    // refactor moves PQ loading out of load_storage; today it has no body.
+}
+
+template <typename T> void unified_index_ssd<T>::load_medoids_from_unified(UnifiedIndexReader &r)
+{
+    const UnifiedIndexHeader &h = r.header();
+    if (h.medoids_len < sizeof(uint32_t))
+        return;
+    const size_t num = h.medoids_len / sizeof(uint32_t);
+    _medoids.resize(num);
+    r.load_region(h.medoids_off, num * sizeof(uint32_t), reinterpret_cast<uint8_t *>(_medoids.data()));
+}
+
+template <typename T> void unified_index_ssd<T>::setup_thread_data(uint64_t nthreads, uint64_t visited_reserve)
+{
+    if (nthreads == 0)
+        nthreads = 1;
+    const size_t aligned_dim = static_cast<size_t>(this->_header.aligned_dim);
+    std::vector<uint32_t> empty_sellers;
+
+    // OMP-parallel loop so each worker thread registers itself with the
+    // AlignedFileReader and we cache the resulting IOContext on the
+    // SSDThreadData. Search-time get_nodes() then uses the cached ctx
+    // directly -- no mutex, no per-call get_ctx lookup.
+#pragma omp parallel for num_threads(static_cast<int>(nthreads))
+    for (int64_t t = 0; t < static_cast<int64_t>(nthreads); ++t)
+    {
+#pragma omp critical
+        {
+            auto *td = new SSDThreadData<T>(aligned_dim, visited_reserve, empty_sellers);
+            _reader->register_thread();
+            td->ctx = _reader->get_ctx();
+            _thread_data.push(td);
+        }
+    }
+}
+
+template <typename T> void unified_index_ssd<T>::use_medoids_data_as_centroids()
+{
+    if (_medoids.empty())
+        return;
+    const size_t aligned_dim = static_cast<size_t>(this->_header.aligned_dim);
+    if (_centroid_data != nullptr)
+    {
+        aligned_free(_centroid_data);
+        _centroid_data = nullptr;
+    }
+    const size_t bytes = _medoids.size() * aligned_dim * sizeof(float);
+    alloc_aligned(reinterpret_cast<void **>(&_centroid_data), bytes, 32);
+    std::memset(_centroid_data, 0, bytes);
+
+    auto *ssd_store = static_cast<unified_node_store_ssd<T> *>(this->_store.get());
+
+    // Load-time path: borrow an SSDThreadData (slab + registered IOContext)
+    // from the pool that setup_thread_data just built. Avoids allocating a
+    // fresh slab just to drop it immediately.
+    ScratchStoreManager<SSDThreadData<T>> manager(_thread_data);
+    SSDThreadData<T> *tdata = manager.scratch_space();
+    NodeFetchScratch scratch;
+    scratch.attach_borrowed(tdata->ctx, tdata->scratch.sector_scratch,
+                             defaults::MAX_N_SECTOR_READS * defaults::SECTOR_LEN);
+
+    std::vector<uint64_t> ids(1);
+    std::vector<NodeView<T>> views;
+    for (size_t i = 0; i < _medoids.size(); ++i)
+    {
+        ids[0] = _medoids[i];
+        ssd_store->get_nodes(ids, scratch, views);
+        // Convert T -> float (lossless for float; promote for int8/uint8).
+        const T *src = views[0].coords;
+        float *dst = _centroid_data + i * aligned_dim;
+        for (size_t j = 0; j < aligned_dim; ++j)
+            dst[j] = static_cast<float>(src[j]);
+    }
+}
+
+template <typename T> void unified_index_ssd<T>::search_impl(UnifiedSearchContext &ctx)
+{
+    const uint32_t beam_width = ctx.beam_width.value_or(4);
+    const uint32_t io_limit = ctx.io_limit.value_or(std::numeric_limits<uint32_t>::max());
+    cached_beam_search(static_cast<const T *>(ctx.query), ctx.K, ctx.L, ctx.indices, ctx.distances, beam_width,
+                       /*filter_label_strings=*/ctx.filter_labels, io_limit, ctx.stats, ctx.debug_info);
+}
+
+template <typename T>
+void unified_index_ssd<T>::cached_beam_search(const T *query, uint64_t K, uint64_t L, uint64_t *indices,
+                                              float *distances, uint32_t beam_width,
+                                              const std::vector<std::string> &filter_label_strings,
+                                              uint32_t io_limit, QueryStats *stats,
+                                              DebugTraversalInfo * /*debug_info*/)
+{
+    const uint64_t aligned_dim = this->_header.aligned_dim;
+    const uint64_t dim = this->_header.dim;
+    auto *ssd_store = static_cast<unified_node_store_ssd<T> *>(this->_store.get());
+
+    // Borrow per-thread scratch.
+    ScratchStoreManager<SSDThreadData<T>> manager(_thread_data);
+    SSDThreadData<T> *tdata = manager.scratch_space();
+    tdata->scratch.reset();
+
+    // Prepare aligned query (typed T) and aligned float query (for PQ table).
+    T *aligned_query_T = tdata->scratch.aligned_query_T();
+    std::memset(aligned_query_T, 0, aligned_dim * sizeof(T));
+    if (_dist_cmp && _dist_cmp->preprocessing_required())
+        _dist_cmp->preprocess_query(query, dim, aligned_query_T);
+    else
+        std::memcpy(aligned_query_T, query, dim * sizeof(T));
+
+    PQScratch<T> *pq_scratch = tdata->scratch.pq_scratch();
+    float *query_float = pq_scratch->aligned_query_float;
+    float *query_rotated = pq_scratch->rotated_query;
+    for (size_t i = 0; i < dim; ++i)
+    {
+        query_float[i] = static_cast<float>(query[i]);
+        query_rotated[i] = static_cast<float>(query[i]);
+    }
+    _pq_table.preprocess_query(query_rotated);
+    float *pq_dists = pq_scratch->aligned_pqtable_dist_scratch;
+    _pq_table.populate_chunk_distances(query_rotated, pq_dists);
+    float *dist_scratch = pq_scratch->aligned_dist_scratch;
+    uint8_t *pq_coord_scratch = pq_scratch->aligned_pq_coord_scratch;
+
+    auto compute_pq_dists = [this, pq_coord_scratch, pq_dists](const uint32_t *ids, uint64_t n_ids, float *out) {
+        diskann::aggregate_coords(ids, n_ids, _pq_codes.data(), _n_chunks, pq_coord_scratch);
+        diskann::pq_dist_lookup(pq_coord_scratch, n_ids, _n_chunks, pq_dists, out);
+    };
+
+    NeighborPriorityQueue &retset = tdata->scratch.retset;
+    retset.reserve(static_cast<uint32_t>(L));
+    std::vector<Neighbor> &full_retset = tdata->scratch.full_retset;
+    tsl::robin_set<uint64_t> &visited = tdata->scratch.visited;
+
+    // Build filter proxy if applicable. Resolve the filter label strings a
+    // single time here: resolve_filters returns both the internal label ints
+    // (consumed by make_match_proxy) and the per-label medoid seed ids
+    // (consumed by the filtered-seeding branch below), so the label dictionary
+    // is probed once per label instead of once for the proxy and again for the
+    // init ids.
+    std::unique_ptr<filter_match_proxy> proxy;
+    thread_local std::vector<uint32_t> filter_label_ints;
+    thread_local std::vector<uint32_t> filter_init_ids;
+    if (this->_labels && this->_labels->has_labels() && !filter_label_strings.empty())
+    {
+        this->_labels->resolve_filters(filter_label_strings, filter_label_ints, filter_init_ids);
+        proxy = this->_labels->make_match_proxy(filter_label_ints);
+    }
+
+    // Seed retset. Branches on filtered vs. unfiltered, mirroring the legacy
+    // PQFlashIndex::cached_beam_search (src/pq_flash_index.cpp:1329-1377).
+    //
+    // Unfiltered: pick the single closest medoid by float-centroid distance
+    // and seed with it.
+    //
+    // Filtered: per-label medoid seeding. The unified format stores exactly
+    // one medoid per label, so each filter label contributes that medoid,
+    // scored by PQ distance (no float centroid data for filtered medoids)
+    // and inserted into retset+visited. Per-label medoids come from
+    // resolve_filters above (one dictionary probe shared with the match proxy).
+    if (proxy == nullptr)
+    {
+        uint32_t best_medoid = _medoids.empty() ? 0 : _medoids[0];
+        float best_medoid_dist = std::numeric_limits<float>::max();
+        for (size_t i = 0; i < _medoids.size(); ++i)
+        {
+            float d = _dist_cmp_float->compare(query_float, _centroid_data + i * aligned_dim,
+                                                static_cast<uint32_t>(aligned_dim));
+            if (d < best_medoid_dist)
+            {
+                best_medoid_dist = d;
+                best_medoid = _medoids[i];
+            }
+        }
+        compute_pq_dists(&best_medoid, 1, dist_scratch);
+        retset.insert(Neighbor(best_medoid, dist_scratch[0]));
+        visited.insert(best_medoid);
+    }
+    else
+    {
+        // filter_init_ids was populated by resolve_filters above: one medoid
+        // per filter label.
+        for (uint32_t mid : filter_init_ids)
+        {
+            // visited dedup: a medoid id may repeat across filter labels.
+            if (visited.insert(mid).second)
+            {
+                compute_pq_dists(&mid, 1, dist_scratch);
+                retset.insert(Neighbor(mid, dist_scratch[0]));
+            }
+        }
+    }
+
+    uint32_t num_ios = 0;
+    uint32_t cmps = 0;
+    uint32_t hops = 0;
+    // Zero-allocation fetch scratch: borrow the per-thread sector buffer +
+    // pre-registered IOContext from SSDThreadData. The slab holds
+    // MAX_N_SECTOR_READS * SECTOR_LEN bytes and is allocated once at load time.
+    NodeFetchScratch fetch_scratch;
+    fetch_scratch.attach_borrowed(tdata->ctx, tdata->scratch.sector_scratch,
+                                   defaults::MAX_N_SECTOR_READS * defaults::SECTOR_LEN);
+    // Per-thread scratch reused across hops and across calls. Capacity grows
+    // once to the worst-case beam_width that this thread ever sees.
+    thread_local std::vector<uint64_t> beam_ids;
+    thread_local std::vector<NodeView<T>> beam_views;
+    beam_ids.clear();
+    beam_views.clear();
+
+    while (retset.has_unexpanded_node() && num_ios < io_limit)
+    {
+        beam_ids.clear();
+        uint32_t num_seen = 0;
+        while (retset.has_unexpanded_node() && beam_ids.size() < beam_width && num_seen < beam_width)
+        {
+            auto nbr = retset.closest_unexpanded();
+            beam_ids.push_back(nbr.id);
+            ++num_seen;
+        }
+        if (beam_ids.empty())
+            break;
+
+        // One batched fetch per beam; the store serves cache hits internally, so num_ios counts requested nodes.
+        ssd_store->get_nodes(beam_ids, fetch_scratch, beam_views);
+        ++hops;
+        num_ios += static_cast<uint32_t>(beam_ids.size());
+
+        for (size_t bi = 0; bi < beam_ids.size(); ++bi)
+        {
+            const uint32_t id = static_cast<uint32_t>(beam_ids[bi]);
+            const NodeView<T> &view = beam_views[bi];
+            const float exact_d = _dist_cmp->compare(aligned_query_T, view.coords,
+                                                       static_cast<uint32_t>(aligned_dim));
+            full_retset.push_back(Neighbor(id, exact_d));
+
+            // PQ-rank neighbors and admit unvisited ones to the search frontier.
+            const uint32_t deg = view.degree;
+            if (deg == 0)
+                continue;
+            compute_pq_dists(view.neighbors, deg, dist_scratch);
+            for (uint32_t m = 0; m < deg; ++m)
+            {
+                const uint32_t nb = view.neighbors[m];
+                if (!visited.insert(nb).second)
+                    continue;
+                if (proxy && !proxy->contain_filtered_label(nb))
+                    continue;
+                ++cmps;
+                retset.insert(Neighbor(nb, dist_scratch[m]));
+            }
+        }
+    }
+
+    // Sort full_retset (exact distances) and write top-K.
+    std::sort(full_retset.begin(), full_retset.end());
+    const size_t out_count = std::min<size_t>(K, full_retset.size());
+    for (size_t i = 0; i < out_count; ++i)
+    {
+        indices[i] = static_cast<uint64_t>(full_retset[i].id);
+        distances[i] = (this->_metric == diskann::Metric::INNER_PRODUCT) ? -full_retset[i].distance
+                                                                          : full_retset[i].distance;
+    }
+    for (size_t i = out_count; i < K; ++i)
+    {
+        indices[i] = std::numeric_limits<uint64_t>::max();
+        distances[i] = std::numeric_limits<float>::max();
+    }
+
+    if (stats != nullptr)
+    {
+        stats->n_hops = hops;
+        stats->n_cmps = cmps;
+        stats->n_ios = num_ios;
+    }
+}
+
+template <typename T> void unified_index_ssd<T>::fill_storage_stats(TableStats &stats) const
+{
+    // SSD keeps the PQ codes resident (npts * n_chunks bytes); the graph lives
+    // on disk, so graph_mem_usage stays 0 -- mirrors PQFlashIndex, which sets
+    // node_mem_usage = npts * nchunks and never sets graph_mem_usage.
+    stats.node_mem_usage = _pq_codes.size();
+    stats.graph_mem_usage = 0;
+}
+
+template class unified_index_ssd<float>;
+template class unified_index_ssd<uint8_t>;
+template class unified_index_ssd<int8_t>;
+
+} // namespace diskann
diff --git a/src/unified_label_data.cpp b/src/unified_label_data.cpp
new file mode 100644
index 000000000..07f7ab776
--- /dev/null
+++ b/src/unified_label_data.cpp
@@ -0,0 +1,251 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT license.
+
+#include "unified_label_data.h"
+
+#include <algorithm>
+#include <cstring>
+#include <utility>
+
+#include "ann_exception.h"
+#include "unified_index_io.h"
+
+namespace diskann
+{
+
+namespace
+{
+} // namespace
+
+// ---------------------------------------------------------------------------
+// unified_label_data_base
+// ---------------------------------------------------------------------------
+
+void unified_label_data_base::load(UnifiedIndexReader &r, const UnifiedIndexHeader &h, uint64_t npts)
+{
+    _has_labels = false;
+    _use_universal_label = false;
+    _universal_label = 0;
+    _label_map.clear();
+
+    if ((h.flags & HAS_LABELS) == 0 || h.label_encoding == LabelEncoding::None)
+    {
+        // Factory shouldn't construct a derived label-data object in this
+        // case; throw if it happens through some other path.
+        throw ANNException("unified_label_data_base::load called on a header with no labels", -1, __FUNCSIG__,
+                           __FILE__, __LINE__);
+    }
+
+    _has_labels = true;
+
+    if (h.label_dictionary_len > 0)
+    {
+        const auto dict_bytes = r.load_region(h.label_dictionary_off, h.label_dictionary_len);
+        parse_dictionary(dict_bytes);
+    }
+
+    load_encoding(r, h, npts);
+
+    if (h.universal_label != 0)
+    {
+        _use_universal_label = true;
+        _universal_label = static_cast<uint32_t>(h.universal_label);
+    }
+}
+
+void unified_label_data_base::parse_dictionary(const std::vector<uint8_t> &dict_bytes)
+{
+    size_t cursor = 0;
+    while (cursor < dict_bytes.size())
+    {
+        if (cursor + sizeof(uint32_t) > dict_bytes.size())
+            throw ANNException("unified_label_data: truncated dictionary entry (slen)", -1, __FUNCSIG__, __FILE__,
+                               __LINE__);
+        uint32_t slen = 0;
+        std::memcpy(&slen, dict_bytes.data() + cursor, sizeof(uint32_t));
+        cursor += sizeof(uint32_t);
+        if (cursor + slen + 2 * sizeof(uint32_t) > dict_bytes.size())
+            throw ANNException("unified_label_data: truncated dictionary entry (body)", -1, __FUNCSIG__, __FILE__,
+                               __LINE__);
+        std::string s(reinterpret_cast<const char *>(dict_bytes.data() + cursor), slen);
+        cursor += slen;
+        uint32_t label_int = 0;
+        std::memcpy(&label_int, dict_bytes.data() + cursor, sizeof(uint32_t));
+        cursor += sizeof(uint32_t);
+        uint32_t medoid = 0;
+        std::memcpy(&medoid, dict_bytes.data() + cursor, sizeof(uint32_t));
+        cursor += sizeof(uint32_t);
+        // Wire format stores the label int and its single medoid in the same
+        // dictionary row; pack them together so search resolves both in one
+        // probe. Last-write-wins on duplicate dict entries (shouldn't happen in
+        // valid files).
+        _label_map[std::move(s)] = label_dict_entry{label_int, medoid};
+    }
+}
+
+bool unified_label_data_base::is_valid_label(const std::string &s) const
+{
+    return _label_map.find(s) != _label_map.end();
+}
+
+bool unified_label_data_base::get_converted_label(const std::string &s, uint32_t &out) const
+{
+    auto it = _label_map.find(s);
+    if (it == _label_map.end())
+        return false;
+    out = it->second.label_int;
+    return true;
+}
+
+void unified_label_data_base::resolve_filters(const std::vector<std::string> &filter_label_strings,
+                                              std::vector<uint32_t> &out_label_ints,
+                                              std::vector<uint32_t> &out_medoids) const
+{
+    out_label_ints.clear();
+    out_medoids.clear();
+    out_label_ints.reserve(filter_label_strings.size());
+    out_medoids.reserve(filter_label_strings.size());
+    for (const auto &s : filter_label_strings)
+    {
+        auto it = _label_map.find(s);
+        if (it == _label_map.end())
+        {
+            throw ANNException(std::string("unified_label_data: unknown filter label string: ") + s, -1, __FUNCSIG__,
+                               __FILE__, __LINE__);
+        }
+        // Single probe yields both the label int (for the match proxy) and the
+        // per-label medoid (for init-id seeding).
+        out_label_ints.push_back(it->second.label_int);
+        out_medoids.push_back(it->second.medoid);
+    }
+}
+
+void unified_label_data_base::collect_label_medoids(std::vector<uint32_t> &out) const
+{
+    out.reserve(out.size() + _label_map.size());
+    for (const auto &kv : _label_map)
+        out.push_back(kv.second.medoid);
+}
+
+// ---------------------------------------------------------------------------
+// unified_label_data_bitmask
+// ---------------------------------------------------------------------------
+
+void unified_label_data_bitmask::load_encoding(UnifiedIndexReader &r, const UnifiedIndexHeader &h, uint64_t npts)
+{
+    const uint64_t bitmask_words = simple_bitmask::get_bitmask_size(h.total_labels);
+    _bitmask_buf = simple_bitmask_buf(npts * bitmask_words, bitmask_words);
+    if (h.per_point_labels_len > 0)
+    {
+        const uint64_t expected_bytes = _bitmask_buf._buf.size() * sizeof(std::uint64_t);
+        if (h.per_point_labels_len != expected_bytes)
+        {
+            throw ANNException("unified_label_data_bitmask: bitmask region size mismatch", -1, __FUNCSIG__, __FILE__,
+                               __LINE__);
+        }
+        // Zero-copy: read straight into the bitmask buf's storage.
+        r.load_region(h.per_point_labels_off, h.per_point_labels_len,
+                      reinterpret_cast<uint8_t *>(_bitmask_buf._buf.data()));
+    }
+}
+
+std::unique_ptr<filter_match_proxy> unified_label_data_bitmask::make_match_proxy(
+    const std::vector<uint32_t> &filter_label_ints)
+{
+    // Uses the 3-arg ctor that owns its own per-query scratch buffer. Label
+    // ints are already resolved by the caller (resolve_filters), so no
+    // dictionary lookup happens here.
+    return std::make_unique<bitmask_filter_match<uint32_t>>(_bitmask_buf, filter_label_ints, _universal_label);
+}
+
+// ---------------------------------------------------------------------------
+// unified_label_data_integer
+// ---------------------------------------------------------------------------
+
+void unified_label_data_integer::load_encoding(UnifiedIndexReader &r, const UnifiedIndexHeader &h, uint64_t npts)
+{
+    // Wire format: offsets region is uint64[npts+1]; labels region is raw uint32[total_labels].
+    // On every platform DiskANN currently targets, sizeof(size_t) == sizeof(uint64_t),
+    // so we can read the offsets directly into _label_vector._offset's storage with
+    // zero intermediate copies. Same for _data.
+    static_assert(sizeof(size_t) == sizeof(uint64_t),
+                  "unified_label_data_integer: zero-copy load assumes size_t == uint64_t");
+
+    const uint64_t expected_off_bytes = (npts + 1) * sizeof(uint64_t);
+    if (h.per_point_label_offsets_len != expected_off_bytes)
+    {
+        throw ANNException("unified_label_data_integer: offset region size mismatch", -1, __FUNCSIG__, __FILE__,
+                           __LINE__);
+    }
+    if (h.per_point_labels_len % sizeof(uint32_t) != 0)
+    {
+        throw ANNException("unified_label_data_integer: labels region size is not a uint32 multiple", -1,
+                           __FUNCSIG__, __FILE__, __LINE__);
+    }
+    const size_t total_labels = h.per_point_labels_len / sizeof(uint32_t);
+
+    _label_vector.resize_for_load(npts, total_labels);
+    r.load_region(h.per_point_label_offsets_off, h.per_point_label_offsets_len,
+                  reinterpret_cast<uint8_t *>(_label_vector.mutable_offset_data()));
+    r.load_region(h.per_point_labels_off, h.per_point_labels_len,
+                  reinterpret_cast<uint8_t *>(_label_vector.mutable_label_data()));
+}
+
+std::unique_ptr<filter_match_proxy> unified_label_data_integer::make_match_proxy(
+    const std::vector<uint32_t> &filter_label_ints)
+{
+    // integer_label_filter_match holds the filter_labels vector by const reference.
+    // We need it to outlive the proxy. Allocate on the heap and bind via a
+    // small wrapper that owns the vector. Label ints are already resolved by the
+    // caller (resolve_filters), so the wrapper just copies them.
+    struct integer_proxy_owner final : public filter_match_proxy
+    {
+        std::vector<uint32_t> labels;
+        integer_label_filter_match<uint32_t> inner;
+        integer_proxy_owner(integer_label_vector &lv, std::vector<uint32_t> ls, uint32_t unv)
+            : labels(std::move(ls)), inner(lv, labels, unv)
+        {
+            // integer_label_vector::check_label_exists advances its binary-search
+            // window (start = last_check) between successive query labels, which
+            // is only correct when the query labels are in ascending order. The
+            // resolved filter ints come back in the caller's arbitrary filter
+            // order, so sort them here -- matching the legacy filtered-search
+            // contract (src/pq_flash_index.cpp:1221, src/index.cpp:2923,3136).
+            std::sort(labels.begin(), labels.end());
+        }
+        bool contain_filtered_label(uint32_t id) override
+        {
+            return inner.contain_filtered_label(id);
+        }
+    };
+    return std::make_unique<integer_proxy_owner>(_label_vector, filter_label_ints, _universal_label);
+}
+
+// ---------------------------------------------------------------------------
+// Factory
+// ---------------------------------------------------------------------------
+
+std::unique_ptr<unified_label_data_base> make_unified_label_data(UnifiedIndexReader &r, const UnifiedIndexHeader &h,
+                                                                 uint64_t npts)
+{
+    if ((h.flags & HAS_LABELS) == 0 || h.label_encoding == LabelEncoding::None)
+        return nullptr;
+
+    std::unique_ptr<unified_label_data_base> out;
+    switch (h.label_encoding)
+    {
+    case LabelEncoding::Bitmask:
+        out = std::make_unique<unified_label_data_bitmask>();
+        break;
+    case LabelEncoding::Integer:
+        out = std::make_unique<unified_label_data_integer>();
+        break;
+    default:
+        throw ANNException("make_unified_label_data: unknown label_encoding value", -1, __FUNCSIG__, __FILE__,
+                           __LINE__);
+    }
+    out->load(r, h, npts);
+    return out;
+}
+
+} // namespace diskann
diff --git a/src/unified_node_store.cpp b/src/unified_node_store.cpp
new file mode 100644
index 000000000..d8c249ea6
--- /dev/null
+++ b/src/unified_node_store.cpp
@@ -0,0 +1,501 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT license.
+
+#include "unified_node_store.h"
+
+#include <algorithm>
+#include <cstring>
+#include <unordered_set>
+#include <utility>
+
+#include "ann_exception.h"
+#include "unified_index_io.h"
+#include "utils.h"
+
+namespace diskann
+{
+
+// ---------------------------------------------------------------------------
+// NodeFetchScratch
+// ---------------------------------------------------------------------------
+
+NodeFetchScratch::NodeFetchScratch() = default;
+
+NodeFetchScratch::NodeFetchScratch(NodeFetchScratch &&other) noexcept
+    : requests(std::move(other.requests)), _sector_slab(other._sector_slab), _capacity_bytes(other._capacity_bytes),
+      _owns_slab(other._owns_slab), _ctx(other._ctx)
+{
+    other._sector_slab = nullptr;
+    other._capacity_bytes = 0;
+    other._owns_slab = false;
+    other._ctx = nullptr;
+}
+
+NodeFetchScratch &NodeFetchScratch::operator=(NodeFetchScratch &&other) noexcept
+{
+    if (this != &other)
+    {
+        if (_owns_slab && _sector_slab != nullptr)
+            aligned_free(_sector_slab);
+        _sector_slab = other._sector_slab;
+        _capacity_bytes = other._capacity_bytes;
+        _owns_slab = other._owns_slab;
+        _ctx = other._ctx;
+        requests = std::move(other.requests);
+        other._sector_slab = nullptr;
+        other._capacity_bytes = 0;
+        other._owns_slab = false;
+        other._ctx = nullptr;
+    }
+    return *this;
+}
+
+NodeFetchScratch::~NodeFetchScratch()
+{
+    if (_owns_slab && _sector_slab != nullptr)
+    {
+        aligned_free(_sector_slab);
+        _sector_slab = nullptr;
+    }
+}
+
+void NodeFetchScratch::reserve(uint64_t max_batch, uint32_t sectors_per_node)
+{
+    const uint64_t need = max_batch * static_cast<uint64_t>(sectors_per_node) * defaults::SECTOR_LEN;
+    if (_owns_slab && need <= _capacity_bytes)
+        return;
+    if (_owns_slab && _sector_slab != nullptr)
+    {
+        aligned_free(_sector_slab);
+        _sector_slab = nullptr;
+    }
+    void *p = nullptr;
+    alloc_aligned(&p, need, defaults::SECTOR_LEN);
+    _sector_slab = static_cast<char *>(p);
+    _capacity_bytes = need;
+    _owns_slab = true;
+}
+
+void NodeFetchScratch::attach_borrowed(IOContext &ctx, char *external_slab, uint64_t slab_capacity_bytes)
+{
+    // If we previously owned a slab, free it -- attach is meant to replace.
+    if (_owns_slab && _sector_slab != nullptr)
+    {
+        aligned_free(_sector_slab);
+        _sector_slab = nullptr;
+    }
+    _sector_slab = external_slab;
+    _capacity_bytes = slab_capacity_bytes;
+    _owns_slab = false;
+    _ctx = &ctx;
+}
+
+void NodeFetchScratch::set_ctx(IOContext &ctx)
+{
+    _ctx = &ctx;
+}
+
+// ---------------------------------------------------------------------------
+// unified_node_store_base<T>
+// ---------------------------------------------------------------------------
+
+template <typename T>
+void unified_node_store_base<T>::init_geometry(const UnifiedIndexHeader &h, std::vector<uint64_t> offset_table)
+{
+    _header = h;
+    _offsets = std::move(offset_table);
+    if (_offsets.size() != _header.npts + 1)
+    {
+        throw ANNException("unified_node_store_base: offset table size mismatch", -1, __FUNCSIG__, __FILE__,
+                           __LINE__);
+    }
+    _coord_bytes = _header.aligned_dim * sizeof(T);
+    // max_node_len upper bound: (max_degree + 1) * uint32 + aligned_dim * T
+    // (the +1 mirrors the legacy padding hack the unified format inherits;
+    //  see plan notes on the +1 sector for unaligned-straddle safety).
+    _max_node_len = (static_cast<uint64_t>(_header.max_degree) + 1u) * sizeof(uint32_t) + _coord_bytes;
+}
+
+template <typename T> uint32_t unified_node_store_base<T>::degree(uint64_t id) const
+{
+    const uint64_t node_bytes = node_byte_length(id);
+    if (node_bytes < _coord_bytes)
+    {
+        throw ANNException("unified_node_store_base: node payload shorter than coords", -1, __FUNCSIG__, __FILE__,
+                           __LINE__);
+    }
+    return static_cast<uint32_t>((node_bytes - _coord_bytes) / sizeof(uint32_t));
+}
+
+template <typename T> uint32_t unified_node_store_base<T>::num_sectors_per_node() const
+{
+    // +1 to absorb worst-case unaligned straddle across sector boundary.
+    return static_cast<uint32_t>(DIV_ROUND_UP(_max_node_len, defaults::SECTOR_LEN) + 1u);
+}
+
+// ---------------------------------------------------------------------------
+// unified_node_store_memory<T>
+// ---------------------------------------------------------------------------
+
+template <typename T>
+void unified_node_store_memory<T>::load(UnifiedIndexReader &r, const UnifiedIndexHeader &h)
+{
+    // Initialise base geometry from header + offset table.
+    auto offsets = r.load_offset_table();
+    this->init_geometry(h, std::move(offsets));
+
+    // Pull the entire graph region resident -- zero-copy via the direct
+    // load_region overload (size is known from the offset table).
+    const uint64_t expected = this->_offsets.back();
+    if (h.graph_region_len != expected)
+    {
+        throw ANNException("unified_node_store_memory::load: graph region size != offset table total", -1,
+                           __FUNCSIG__, __FILE__, __LINE__);
+    }
+    _packed.resize(static_cast<size_t>(expected));
+    r.load_region(h.graph_region_off, h.graph_region_len, _packed.data());
+}
+
+template <typename T>
+void unified_node_store_memory<T>::get_nodes(const std::vector<uint64_t> &ids, NodeFetchScratch & /*scratch*/,
+                                              std::vector<NodeView<T>> &out)
+{
+    out.clear();
+    out.resize(ids.size());
+    for (size_t i = 0; i < ids.size(); ++i)
+    {
+        const uint64_t id = ids[i];
+        uint32_t deg = 0;
+        out[i].coords = get_coords(id);
+        out[i].neighbors = get_neighbors(id, deg);
+        out[i].degree = deg;
+    }
+}
+
+template <typename T> const T *unified_node_store_memory<T>::get_coords(uint64_t id) const
+{
+    return reinterpret_cast<const T *>(_packed.data() + this->_offsets[id]);
+}
+
+template <typename T>
+const uint32_t *unified_node_store_memory<T>::get_neighbors(uint64_t id, uint32_t &out_degree) const
+{
+    const uint64_t coord_bytes = this->_coord_bytes;
+    const uint64_t node_bytes = this->_offsets[id + 1] - this->_offsets[id];
+    out_degree = static_cast<uint32_t>((node_bytes - coord_bytes) / sizeof(uint32_t));
+    return reinterpret_cast<const uint32_t *>(_packed.data() + this->_offsets[id] + coord_bytes);
+}
+
+// ---------------------------------------------------------------------------
+// unified_node_store_ssd<T> -- SSD-backed store: batched aligned reads plus a node cache.
+// (Ctor is defined inline in the header to avoid DLL-export gymnastics when
+//  the SSD index .cpp constructs the store via std::make_unique.)
+// ---------------------------------------------------------------------------
+
+template <typename T> unified_node_store_ssd<T>::~unified_node_store_ssd()
+{
+    if (_nhood_cache_buf != nullptr)
+    {
+        delete[] _nhood_cache_buf;
+        _nhood_cache_buf = nullptr;
+    }
+    if (_coord_cache_buf != nullptr)
+    {
+        delete[] _coord_cache_buf;
+        _coord_cache_buf = nullptr;
+    }
+}
+
+template <typename T>
+void unified_node_store_ssd<T>::load(UnifiedIndexReader &r, const UnifiedIndexHeader &h)
+{
+    // Initialise base geometry from header + offset table.
+    auto offsets = r.load_offset_table();
+    this->init_geometry(h, std::move(offsets));
+
+    // Open the aligned reader on the unified file. Subsequent get_nodes()
+    // calls issue async reads through this handle.
+    if (_reader == nullptr)
+    {
+        throw ANNException("unified_node_store_ssd::load: AlignedFileReader is null", -1, __FUNCSIG__, __FILE__,
+                           __LINE__);
+    }
+    _reader->open(r.path());
+}
+
+template <typename T>
+void unified_node_store_ssd<T>::get_nodes(const std::vector<uint64_t> &ids, NodeFetchScratch &scratch,
+                                          std::vector<NodeView<T>> &out)
+{
+    out.assign(ids.size(), NodeView<T>{});
+
+    // Per-thread scratch vectors (one capacity grow per thread, reused across
+    // calls). clear() drops the elements but keeps the capacity, so the
+    // first call's reserve covers all subsequent calls of similar batch size.
+    thread_local std::vector<size_t> miss_indices;
+    thread_local std::vector<uint64_t> aligned_starts;
+    miss_indices.clear();
+    aligned_starts.clear();
+
+    // Pass 1: cache lookups.
+    miss_indices.reserve(ids.size());
+    for (size_t i = 0; i < ids.size(); ++i)
+    {
+        const uint32_t id32 = static_cast<uint32_t>(ids[i]);
+        auto cit = _coord_cache.find(id32);
+        auto nit = _nhood_cache.find(id32);
+        if (cit != _coord_cache.end() && nit != _nhood_cache.end())
+        {
+            out[i].coords = cit->second;
+            out[i].degree = nit->second.first;
+            out[i].neighbors = nit->second.second;
+        }
+        else
+        {
+            miss_indices.push_back(i);
+        }
+    }
+
+    if (miss_indices.empty())
+        return;
+
+    // Pass 2: plan + issue batched IO for misses.
+    // Scratch contract on the search hot path: caller must have called
+    // attach_borrowed() (with the slab + per-thread IOContext that the
+    // index registered at load time) or reserve() followed by set_ctx(). We do NOT
+    // touch the reader's thread-registration map here -- that path uses a
+    // mutex which would serialise concurrent searches.
+    const uint32_t sectors_per_node = this->num_sectors_per_node();
+    const uint64_t bytes_per_node = static_cast<uint64_t>(sectors_per_node) * defaults::SECTOR_LEN;
+    const uint64_t need_bytes = static_cast<uint64_t>(miss_indices.size()) * bytes_per_node;
+    if (scratch.slab() == nullptr || scratch.slab_capacity() < need_bytes)
+    {
+        throw ANNException(
+            "unified_node_store_ssd::get_nodes: scratch slab too small or unset "
+            "(need " +
+                std::to_string(need_bytes) + " bytes, have " + std::to_string(scratch.slab_capacity()) +
+                "). Call attach_borrowed() or reserve() before search.",
+            -1, __FUNCSIG__, __FILE__, __LINE__);
+    }
+    if (scratch.io_ctx() == nullptr)
+    {
+        throw ANNException(
+            "unified_node_store_ssd::get_nodes: scratch has no IOContext attached. "
+            "Call attach_borrowed() with a registered IOContext before search.",
+            -1, __FUNCSIG__, __FILE__, __LINE__);
+    }
+    scratch.requests.clear();
+    scratch.requests.reserve(miss_indices.size());
+
+    // Track per-miss sector-aligned start so the decode step can compute the
+    // pointer back into the slab.
+    aligned_starts.resize(miss_indices.size());
+
+    for (size_t k = 0; k < miss_indices.size(); ++k)
+    {
+        const size_t i = miss_indices[k];
+        const uint64_t id = ids[i];
+        const uint64_t raw_start = this->node_disk_offset(id);
+        const uint64_t aligned_start = (raw_start / defaults::SECTOR_LEN) * defaults::SECTOR_LEN;
+        aligned_starts[k] = aligned_start;
+
+        // Read only the sectors this specific node actually spans, not the
+        // worst-case `bytes_per_node`. The resident offset table gives each
+        // node's exact byte length, so the minimal sector-aligned window that
+        // covers [raw_start, raw_start + node_bytes) is all we need to fetch.
+        // For a typical low-degree node this is a single SECTOR_LEN read
+        // instead of num_sectors_per_node() sectors, cutting bytes transferred.
+        // This exact window is always <= bytes_per_node, so it still fits in
+        // this miss's fixed slab slot (slab stride stays bytes_per_node).
+        const uint64_t raw_end = raw_start + this->node_byte_length(id);
+        const uint64_t aligned_end = DIV_ROUND_UP(raw_end, defaults::SECTOR_LEN) * defaults::SECTOR_LEN;
+        const uint64_t read_len = aligned_end - aligned_start;
+
+        AlignedRead req;
+        req.offset = aligned_start;
+        req.len = read_len;
+        req.buf = scratch.slab() + k * bytes_per_node;
+        scratch.requests.push_back(req);
+    }
+
+    _reader->read(scratch.requests, *scratch.io_ctx());
+    ++_io_count;
+
+    // Pass 3: decode each miss from its sector slice.
+    const uint64_t coord_bytes = this->coord_bytes();
+    for (size_t k = 0; k < miss_indices.size(); ++k)
+    {
+        const size_t i = miss_indices[k];
+        const uint64_t id = ids[i];
+        const uint64_t raw_start = this->node_disk_offset(id);
+        const uint64_t node_bytes = this->node_byte_length(id);
+        const uint64_t aligned_start = aligned_starts[k];
+        const uint64_t intra = raw_start - aligned_start;
+        const char *node_start = scratch.slab() + k * bytes_per_node + intra;
+
+        out[i].coords = reinterpret_cast<const T *>(node_start);
+        out[i].degree = static_cast<uint32_t>((node_bytes - coord_bytes) / sizeof(uint32_t));
+        out[i].neighbors = reinterpret_cast<const uint32_t *>(node_start + coord_bytes);
+    }
+}
+
+// Per-thread registration tracking. Each thread should register with the
+// AlignedFileReader at most once, so make_fetch_scratch() remembers (via a
+// thread_local flag) whether the calling thread has already registered.
+template <typename T> NodeFetchScratch unified_node_store_ssd<T>::make_fetch_scratch(uint64_t max_batch)
+{
+    // Idempotent thread registration: AlignedFileReader::register_thread()
+    // warns + no-ops on duplicate calls (Windows), so we guard with a
+    // thread_local flag to skip both the warning and the lock-acquire that
+    // the reader does internally. Cheap; thread_local access is one TLS load.
+    thread_local bool s_registered_with_reader = false;
+    if (!s_registered_with_reader)
+    {
+        _reader->register_thread();
+        s_registered_with_reader = true;
+    }
+    IOContext &ctx = _reader->get_ctx();
+    NodeFetchScratch s;
+    s.reserve(max_batch, this->num_sectors_per_node());
+    s.set_ctx(ctx); // keep the self-owned slab; just point at the ctx
+    return s;
+}
+
+template <typename T>
+void unified_node_store_ssd<T>::load_cache_list(const std::vector<uint32_t> &node_list, NodeFetchScratch &scratch)
+{
+    if (node_list.empty())
+        return;
+    if (_nhood_cache_buf != nullptr || _coord_cache_buf != nullptr)
+    {
+        throw ANNException("unified_node_store_ssd::load_cache_list: cache already populated", -1, __FUNCSIG__,
+                           __FILE__, __LINE__);
+    }
+
+    const uint64_t aligned_dim = this->aligned_dim();
+    const uint32_t max_degree = this->max_degree();
+
+    // Allocate cache backing buffers. _nhood_cache_buf stores `max_degree`
+    // uint32 slots per cached node; _coord_cache_buf stores aligned_dim T
+    // values per cached node.
+    const size_t n = node_list.size();
+    _nhood_cache_buf = new uint32_t[n * static_cast<size_t>(max_degree)];
+    std::memset(_nhood_cache_buf, 0, n * static_cast<size_t>(max_degree) * sizeof(uint32_t));
+    _coord_cache_buf = new T[n * aligned_dim];
+    std::memset(_coord_cache_buf, 0, n * aligned_dim * sizeof(T));
+
+    // Batch the reads instead of one get_nodes() call per node. get_nodes()
+    // resolves a whole batch of misses in a single batched IO, so the number
+    // of IOs drops from n to ceil(n / batch). The batch size is bounded by how
+    // many worst-case node slots fit in the scratch slab (the same bound
+    // get_nodes() enforces internally). node_list ids are unique (callers build
+    // it via a visited set / pass distinct ids), so every node in a batch is a
+    // cache miss and contributes to the batched read.
+    const uint64_t bytes_per_node = static_cast<uint64_t>(this->num_sectors_per_node()) * defaults::SECTOR_LEN;
+    size_t max_per_batch = 1;
+    if (bytes_per_node > 0 && scratch.slab_capacity() >= bytes_per_node)
+        max_per_batch = static_cast<size_t>(scratch.slab_capacity() / bytes_per_node);
+
+    std::vector<uint64_t> id_batch;
+    std::vector<NodeView<T>> view_batch;
+    for (size_t base = 0; base < n; base += max_per_batch)
+    {
+        const size_t batch_n = std::min(max_per_batch, n - base);
+        id_batch.assign(node_list.begin() + base, node_list.begin() + base + batch_n);
+
+        // All views in the batch point into distinct slab slots and stay valid
+        // until the next get_nodes() call, so we can copy them all out here.
+        get_nodes(id_batch, scratch, view_batch);
+
+        for (size_t b = 0; b < batch_n; ++b)
+        {
+            const size_t i = base + b;
+            const uint32_t id = node_list[i];
+
+            T *coord_slot = _coord_cache_buf + i * aligned_dim;
+            std::memcpy(coord_slot, view_batch[b].coords, aligned_dim * sizeof(T));
+            _coord_cache.emplace(id, coord_slot);
+
+            uint32_t *nhood_slot = _nhood_cache_buf + i * static_cast<size_t>(max_degree);
+            const uint32_t deg = view_batch[b].degree;
+            const uint32_t deg_to_copy = std::min<uint32_t>(deg, max_degree);
+            std::memcpy(nhood_slot, view_batch[b].neighbors, deg_to_copy * sizeof(uint32_t));
+            _nhood_cache.emplace(id, std::make_pair(deg_to_copy, nhood_slot));
+        }
+    }
+}
+
+template <typename T>
+void unified_node_store_ssd<T>::cache_bfs_levels(const std::vector<uint32_t> &seed_nodes, uint64_t num_nodes_to_cache,
+                                                  std::vector<uint32_t> &out_node_list, NodeFetchScratch &scratch)
+{
+    out_node_list.clear();
+    if (num_nodes_to_cache == 0 || seed_nodes.empty())
+        return;
+
+    std::unordered_set<uint32_t> visited;
+    std::vector<uint32_t> frontier = seed_nodes;
+    for (uint32_t s : seed_nodes)
+    {
+        if (s >= this->num_points())
+            continue;
+        if (visited.insert(s).second)
+            out_node_list.push_back(s);
+        if (out_node_list.size() >= num_nodes_to_cache)
+            break;
+    }
+
+    std::vector<uint64_t> id_batch;
+    std::vector<NodeView<T>> view_batch;
+
+    while (out_node_list.size() < num_nodes_to_cache && !frontier.empty())
+    {
+        // Fetch the entire frontier in one batch.
+        id_batch.assign(frontier.begin(), frontier.end());
+        get_nodes(id_batch, scratch, view_batch);
+
+        std::vector<uint32_t> next_frontier;
+        for (size_t i = 0; i < id_batch.size(); ++i)
+        {
+            const uint32_t deg = view_batch[i].degree;
+            const uint32_t *nbrs = view_batch[i].neighbors;
+            for (uint32_t j = 0; j < deg; ++j)
+            {
+                const uint32_t nb = nbrs[j];
+                if (nb >= this->num_points())
+                    continue;
+                if (visited.insert(nb).second)
+                {
+                    out_node_list.push_back(nb);
+                    next_frontier.push_back(nb);
+                    if (out_node_list.size() >= num_nodes_to_cache)
+                        break;
+                }
+            }
+            if (out_node_list.size() >= num_nodes_to_cache)
+                break;
+        }
+        frontier = std::move(next_frontier);
+    }
+    // Delegate to the scratch-aware variant of load_cache_list so the same
+    // borrowed scratch is reused across the cache-load passes.
+    load_cache_list(out_node_list, scratch);
+}
+
+// ---------------------------------------------------------------------------
+// Explicit template instantiations (3 per class, 9 total).
+// ---------------------------------------------------------------------------
+
+template class unified_node_store_base<float>;
+template class unified_node_store_base<uint8_t>;
+template class unified_node_store_base<int8_t>;
+
+template class unified_node_store_memory<float>;
+template class unified_node_store_memory<uint8_t>;
+template class unified_node_store_memory<int8_t>;
+
+template class unified_node_store_ssd<float>;
+template class unified_node_store_ssd<uint8_t>;
+template class unified_node_store_ssd<int8_t>;
+
+} // namespace diskann
diff --git a/tests/CMakeLists.txt b/tests/CMakeLists.txt
index 6af8405cc..7e6e11ffb 100644
--- a/tests/CMakeLists.txt
+++ b/tests/CMakeLists.txt
@@ -32,10 +32,19 @@ if (NOT Boost_FOUND)
 endif()
 
 
-set(DISKANN_UNIT_TEST_SOURCES main.cpp index_write_parameters_builder_tests.cpp)
-
-add_executable(${PROJECT_NAME}_unit_tests ${DISKANN_SOURCES} ${DISKANN_UNIT_TEST_SOURCES})
-target_link_libraries(${PROJECT_NAME}_unit_tests ${PROJECT_NAME} ${DISKANN_TOOLS_TCMALLOC_LINK_OPTIONS} Boost::unit_test_framework)
+set(DISKANN_UNIT_TEST_SOURCES main.cpp index_write_parameters_builder_tests.cpp unified_index_tests.cpp)
+
+# Link the tests against the static DiskANN core (diskann_s) rather than the
+# DLL, so internal-only symbols don't need to be exported from the DLL just to
+# satisfy the test linker. diskann_s carries DISKANN_STATIC_LIB (making
+# DISKANN_DLLEXPORT a no-op) plus MKL + synchronization.lib transitively.
+add_executable(${PROJECT_NAME}_unit_tests ${DISKANN_UNIT_TEST_SOURCES})
+target_link_libraries(${PROJECT_NAME}_unit_tests ${PROJECT_NAME}_s ${DISKANN_TOOLS_TCMALLOC_LINK_OPTIONS} Boost::unit_test_framework)
+
+# The static test exe still needs libiomp5md.dll / libtcmalloc_minimal.dll at
+# runtime; those are copied to the output dir by the DLL target's POST_BUILD,
+# so make sure the DLL is built first.
+add_dependencies(${PROJECT_NAME}_unit_tests ${PROJECT_NAME})
 
 add_test(NAME ${PROJECT_NAME}_unit_tests COMMAND ${PROJECT_NAME}_unit_tests)
 
diff --git a/tests/unified_index_tests.cpp b/tests/unified_index_tests.cpp
new file mode 100644
index 000000000..93c070baf
--- /dev/null
+++ b/tests/unified_index_tests.cpp
@@ -0,0 +1,1894 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT license.
+
+#include <boost/test/unit_test.hpp>
+
+#include <algorithm>
+#include <cstdint>
+#include <cstdio>
+#include <cstring>
+#include <fstream>
+#include <memory>
+#include <set>
+#include <string>
+#include <unordered_set>
+#include <vector>
+
+#include "ann_exception.h"
+#include "defaults.h"
+#include "disk_utils.h"
+#include "filter_match_proxy.h"
+#include "index.h"
+#include "label_bitmask.h"
+#include "parameters.h"
+#include "pq.h"
+#include "pq_flash_index.h"
+#include "unified_index.h"
+#include "unified_index_builder.h"
+#include "unified_index_format.h"
+#include "unified_index_io.h"
+#include "unified_label_data.h"
+#include "unified_node_store.h"
+#include "utils.h"
+#include "windows_aligned_file_reader.h"
+
+using namespace diskann;
+
+namespace
+{
+
+// Path helper: builds test-fixture paths under the current working dir.
+std::string tmp_path(const char *suffix)
+{
+    return std::string("unified_index_test_") + suffix + ".bin";
+}
+
+// Tiny RAII deleter for temp test files.
+struct ScopedFile
+{
+    std::string path;
+    explicit ScopedFile(std::string p) : path(std::move(p))
+    {
+    }
+    ~ScopedFile()
+    {
+        std::remove(path.c_str());
+    }
+};
+
+// Test-only subclass exposing init_geometry so the sector-math suite can
+// inject a synthetic header + offset table without going through the writer.
+template <typename T> class node_store_test_fixture final : public unified_node_store_base<T>
+{
+  public:
+    using unified_node_store_base<T>::init_geometry;
+    void get_nodes(const std::vector<uint64_t> & /*ids*/, NodeFetchScratch & /*scratch*/,
+                   std::vector<NodeView<T>> & /*out*/) override
+    {
+        // Not exercised in these tests.
+    }
+};
+
+// Build a synthetic offset table for `npts` nodes where node i has degree `degrees[i]`,
+// laid out as [coords (aligned_dim * sizeof(T)), neighbors (degree * 4)] back-to-back.
+std::vector<uint64_t> make_offset_table(uint64_t npts, uint64_t aligned_dim, size_t sizeof_T,
+                                        const std::vector<uint32_t> &degrees)
+{
+    std::vector<uint64_t> off(npts + 1, 0);
+    const uint64_t coord_bytes = aligned_dim * sizeof_T;
+    for (uint64_t i = 0; i < npts; ++i)
+    {
+        off[i + 1] = off[i] + coord_bytes + degrees[i] * sizeof(uint32_t);
+    }
+    return off;
+}
+
+// Build a synthetic header for the test fixture.
+UnifiedIndexHeader make_header(uint64_t npts, uint64_t dim, uint64_t aligned_dim, uint32_t max_degree,
+                               DataTypeTag dt = DataTypeTag::Float)
+{
+    UnifiedIndexHeader h{};
+    h.magic = UNIFIED_FORMAT_MAGIC;
+    h.version = UNIFIED_FORMAT_VERSION;
+    h.data_type = dt;
+    h.metric = MetricTag::L2;
+    h.npts = npts;
+    h.dim = dim;
+    h.aligned_dim = aligned_dim;
+    h.max_degree = max_degree;
+    h.flags = 0;
+    h.start_node = 0;
+    h.label_encoding = LabelEncoding::None;
+    h.universal_label = 0;
+    h.total_labels = 0;
+    return h;
+}
+
+// Write a minimal unified file with `npts` nodes (all-zero coords, a single
+// self-neighbor each) so the reader has a valid offset table + graph region. Labels
+// regions are filled in by the caller (or skipped) via the encoding-specific
+// writer methods.
+void write_minimal_unified_file(const std::string &path, uint64_t npts, uint64_t dim, uint64_t aligned_dim,
+                                uint32_t max_degree, DataTypeTag dt)
+{
+    UnifiedIndexWriter w(path);
+    MetricTag metric = MetricTag::L2;
+    w.begin(npts, dim, aligned_dim, max_degree, dt, metric, /*start_node=*/0);
+
+    w.begin_graph_region();
+    std::vector<uint8_t> coord_zero(aligned_dim * /*sizeof float upper bound*/ 4, 0);
+    for (uint64_t i = 0; i < npts; ++i)
+    {
+        // Single neighbor pointing to self to keep neighbor section non-empty
+        // (writer doesn't care, but easier to reason about offsets).
+        const uint32_t nb = static_cast<uint32_t>(i);
+        w.write_node(coord_zero.data(), &nb, /*degree=*/1);
+    }
+    w.end_graph_region();
+
+    const uint32_t single_medoid = 0;
+    w.write_medoids(&single_medoid, 1);
+}
+
+// Build a labels-dict byte blob in the canonical wire format:
+//   [u32 slen][bytes label_str][u32 label_int][u32 medoid]
+std::vector<uint8_t> build_dict_bytes(const std::vector<std::tuple<std::string, uint32_t, uint32_t>> &entries)
+{
+    std::vector<uint8_t> bytes;
+    for (auto &e : entries)
+    {
+        const std::string &s = std::get<0>(e);
+        const uint32_t lid = std::get<1>(e);
+        const uint32_t medoid = std::get<2>(e);
+        const uint32_t slen = static_cast<uint32_t>(s.size());
+        const size_t old = bytes.size();
+        bytes.resize(old + sizeof(uint32_t) + slen + 2 * sizeof(uint32_t));
+        uint8_t *p = bytes.data() + old;
+        std::memcpy(p, &slen, sizeof(uint32_t));
+        p += sizeof(uint32_t);
+        std::memcpy(p, s.data(), slen);
+        p += slen;
+        std::memcpy(p, &lid, sizeof(uint32_t));
+        p += sizeof(uint32_t);
+        std::memcpy(p, &medoid, sizeof(uint32_t));
+    }
+    return bytes;
+}
+
+void write_bitmask_labels_file(const std::string &path, uint64_t npts,
+                               const std::vector<std::tuple<std::string, uint32_t, uint32_t>> &dict_entries,
+                               const std::vector<std::vector<uint32_t>> &per_point_label_ints,
+                               uint64_t total_labels, uint32_t universal_label)
+{
+    UnifiedIndexWriter w(path);
+    const uint64_t dim = 4;
+    const uint64_t aligned_dim = 4;
+    const uint32_t max_degree = 4;
+    w.begin(npts, dim, aligned_dim, max_degree, DataTypeTag::Float, MetricTag::L2, /*start_node=*/0);
+
+    w.begin_graph_region();
+    std::vector<float> coord_zero(aligned_dim, 0.0f);
+    const uint32_t self_nb = 0;
+    for (uint64_t i = 0; i < npts; ++i)
+    {
+        w.write_node(coord_zero.data(), &self_nb, /*degree=*/1);
+    }
+    w.end_graph_region();
+
+    const uint32_t medoid = 0;
+    w.write_medoids(&medoid, 1);
+
+    auto dict_bytes = build_dict_bytes(dict_entries);
+
+    // Build bitmask: one row of `bitmask_words` uint64 words per point.
+    const uint64_t bitmask_words = simple_bitmask::get_bitmask_size(total_labels);
+    std::vector<uint64_t> bitmask(npts * bitmask_words, 0);
+    for (uint64_t i = 0; i < npts; ++i)
+    {
+        simple_bitmask bm(bitmask.data() + i * bitmask_words, bitmask_words);
+        for (uint32_t lid : per_point_label_ints[i])
+        {
+            bm.set(lid);
+        }
+    }
+    const uint64_t bitmap_bytes_len = bitmask.size() * sizeof(uint64_t);
+    w.write_labels_bitmask(total_labels, universal_label, dict_bytes.data(), dict_bytes.size(), bitmask.data(),
+                            bitmap_bytes_len);
+
+    w.finalize();
+}
+
+void write_integer_labels_file(const std::string &path, uint64_t npts,
+                               const std::vector<std::tuple<std::string, uint32_t, uint32_t>> &dict_entries,
+                               const std::vector<std::vector<uint32_t>> &per_point_label_ints,
+                               uint64_t total_labels, uint32_t universal_label)
+{
+    UnifiedIndexWriter w(path);
+    const uint64_t dim = 4;
+    const uint64_t aligned_dim = 4;
+    const uint32_t max_degree = 4;
+    w.begin(npts, dim, aligned_dim, max_degree, DataTypeTag::Float, MetricTag::L2, /*start_node=*/0);
+
+    w.begin_graph_region();
+    std::vector<float> coord_zero(aligned_dim, 0.0f);
+    const uint32_t self_nb = 0;
+    for (uint64_t i = 0; i < npts; ++i)
+    {
+        w.write_node(coord_zero.data(), &self_nb, /*degree=*/1);
+    }
+    w.end_graph_region();
+
+    const uint32_t medoid = 0;
+    w.write_medoids(&medoid, 1);
+
+    auto dict_bytes = build_dict_bytes(dict_entries);
+
+    // Flatten per-point lists + offset table.
+    std::vector<uint32_t> flat;
+    std::vector<uint64_t> offsets(npts + 1, 0);
+    for (uint64_t i = 0; i < npts; ++i)
+    {
+        offsets[i + 1] = offsets[i] + per_point_label_ints[i].size();
+        flat.insert(flat.end(), per_point_label_ints[i].begin(), per_point_label_ints[i].end());
+    }
+    w.write_labels_integer(total_labels, universal_label, dict_bytes.data(), dict_bytes.size(), flat.data(),
+                            flat.size() * sizeof(uint32_t), offsets.data());
+
+    w.finalize();
+}
+
+// Deterministic float in [-1, 1] from (seed, i, j) -- no clock or RNG dependence.
+inline float det_float(uint64_t seed, uint64_t i, uint64_t j)
+{
+    uint64_t x = seed * 1103515245ull + i * 12345ull + j * 7919ull + 12345ull;
+    x ^= x >> 21;
+    x *= 2685821657736338717ull;
+    x ^= x >> 31;
+    const uint32_t lo = static_cast<uint32_t>(x & 0xFFFFFFFFu);
+    return (static_cast<float>(lo) / 2147483648.0f) - 1.0f;
+}
+
+// Build a unified float file: npts deterministic pseudo-random points (see
+// det_float), aligned_dim columns each, with dim == aligned_dim.
+// Graph is a star centered on node 0: node 0 -> [1..npts-1], every other node -> [0].
+inline void write_star_graph_unified(const std::string &path, uint64_t npts, uint64_t aligned_dim,
+                                     std::vector<std::vector<float>> &points_out)
+{
+    points_out.assign(npts, std::vector<float>(aligned_dim, 0.0f));
+    for (uint64_t i = 0; i < npts; ++i)
+        for (uint64_t j = 0; j < aligned_dim; ++j)
+            points_out[i][j] = det_float(/*seed=*/42, i, j);
+
+    const uint32_t max_degree = static_cast<uint32_t>(npts);
+    UnifiedIndexWriter w(path);
+    w.begin(npts, aligned_dim, aligned_dim, max_degree, DataTypeTag::Float, MetricTag::L2, /*start_node=*/0);
+
+    w.begin_graph_region();
+    std::vector<uint32_t> hub_neighbors;
+    for (uint32_t i = 1; i < npts; ++i)
+        hub_neighbors.push_back(i);
+    w.write_node(points_out[0].data(), hub_neighbors.data(), static_cast<uint32_t>(hub_neighbors.size()));
+    const uint32_t back_to_hub = 0;
+    for (uint64_t i = 1; i < npts; ++i)
+    {
+        w.write_node(points_out[i].data(), &back_to_hub, /*degree=*/1);
+    }
+    w.end_graph_region();
+
+    const uint32_t medoid = 0;
+    w.write_medoids(&medoid, 1);
+    w.finalize();
+}
+
+inline float l2_sq(const std::vector<float> &a, const float *b)
+{
+    float s = 0.0f;
+    for (size_t i = 0; i < a.size(); ++i)
+    {
+        const float d = a[i] - b[i];
+        s += d * d;
+    }
+    return s;
+}
+
+inline std::vector<uint64_t> brute_force_topk(const std::vector<std::vector<float>> &points, const float *query,
+                                              size_t K)
+{
+    std::vector<std::pair<float, uint64_t>> pairs;
+    pairs.reserve(points.size());
+    for (uint64_t i = 0; i < points.size(); ++i)
+        pairs.emplace_back(l2_sq(points[i], query), i);
+    std::sort(pairs.begin(), pairs.end(),
+              [](const auto &a, const auto &b) { return a.first < b.first; });
+    if (K > pairs.size())
+        K = pairs.size();
+    std::vector<uint64_t> out(K);
+    for (size_t i = 0; i < K; ++i)
+        out[i] = pairs[i].second;
+    return out;
+}
+
+inline float recall_at_k(const std::vector<uint64_t> &gt, const uint64_t *result, size_t K)
+{
+    std::unordered_set<uint64_t> gt_set(gt.begin(), gt.begin() + std::min<size_t>(K, gt.size()));
+    size_t hits = 0;
+    for (size_t i = 0; i < K; ++i)
+        if (gt_set.count(result[i]))
+            ++hits;
+    return static_cast<float>(hits) / static_cast<float>(K);
+}
+
+inline std::shared_ptr<::AlignedFileReader> make_reader()
+{
+    return std::make_shared<::WindowsAlignedFileReader>();
+}
+
+} // namespace
+
+// ===========================================================================
+// Suite 1: unified_node_store_base sector math (pure CPU)
+// ===========================================================================
+
+BOOST_AUTO_TEST_SUITE(unified_node_store_tests)
+
+BOOST_AUTO_TEST_CASE(sector_math_basic)
+{
+    const uint64_t npts = 8;
+    const uint64_t dim = 13;
+    const uint64_t aligned_dim = 16;
+    const uint32_t max_degree = 64;
+    const std::vector<uint32_t> degrees = {3, 5, 7, 0, 64, 12, 1, 2};
+
+    auto h = make_header(npts, dim, aligned_dim, max_degree);
+    auto offsets = make_offset_table(npts, aligned_dim, sizeof(float), degrees);
+
+    node_store_test_fixture<float> store;
+    store.init_geometry(h, offsets);
+
+    BOOST_CHECK_EQUAL(store.num_points(), npts);
+    BOOST_CHECK_EQUAL(store.dim(), dim);
+    BOOST_CHECK_EQUAL(store.aligned_dim(), aligned_dim);
+    BOOST_CHECK_EQUAL(store.max_degree(), max_degree);
+
+    // Offsets should match what make_offset_table computed.
+    for (uint64_t i = 0; i < npts; ++i)
+    {
+        BOOST_CHECK_EQUAL(store.node_byte_offset(i), offsets[i]);
+        const uint64_t expected_len = aligned_dim * sizeof(float) + degrees[i] * sizeof(uint32_t);
+        BOOST_CHECK_EQUAL(store.node_byte_length(i), expected_len);
+        BOOST_CHECK_EQUAL(store.degree(i), degrees[i]);
+    }
+
+    // max_node_len = (max_degree+1)*4 + aligned_dim*sizeof(T)
+    const uint64_t expected_max = static_cast<uint64_t>(max_degree + 1) * 4 + aligned_dim * sizeof(float);
+    BOOST_CHECK_EQUAL(store.max_node_len(), expected_max);
+
+    // num_sectors_per_node = ceil(max_node_len / SECTOR_LEN) + 1
+    const uint32_t expected_sectors =
+        static_cast<uint32_t>((expected_max + defaults::SECTOR_LEN - 1) / defaults::SECTOR_LEN + 1u);
+    BOOST_CHECK_EQUAL(store.num_sectors_per_node(), expected_sectors);
+}
+
+BOOST_AUTO_TEST_CASE(sector_math_throws_on_short_node)
+{
+    // Construct an offset table where node 0's payload is shorter than its
+    // coords -- degree() must throw.
+    auto h = make_header(/*npts=*/2, /*dim=*/4, /*aligned_dim=*/4, /*max_degree=*/4);
+    std::vector<uint64_t> bad_offsets = {0, 4, 4 * sizeof(float) + 16}; // node 0 too short
+    node_store_test_fixture<float> store;
+    store.init_geometry(h, bad_offsets);
+    BOOST_CHECK_THROW(store.degree(0), ANNException);
+}
+
+BOOST_AUTO_TEST_SUITE_END()
+
+// ===========================================================================
+// Suite 2: unified_label_data (bitmask)
+// ===========================================================================
+
+BOOST_AUTO_TEST_SUITE(unified_label_data_bitmask_tests)
+
+BOOST_AUTO_TEST_CASE(bitmask_load_and_match_proxy)
+{
+    ScopedFile sf(tmp_path("bitmask"));
+    const uint64_t npts = 4;
+    // Use label ints 10/11/12 (and universal=99) so the bitmask_filter_match
+    // ctor's unconditional universal-bit merge doesn't collide with real labels.
+    // total_labels MUST be large enough that every label int -- including the
+    // universal label -- fits inside the bitmask. The bitmask has
+    // get_bitmask_size(total_labels) 64-bit words, so a universal label of 99
+    // requires total_labels > 99; otherwise the ctor's universal-bit merge
+    // (simple_bitmask_full_val::merge_bitmask_val) indexes word 99/64 == 1 of a
+    // single-word query buffer and corrupts the heap.
+    const std::vector<std::tuple<std::string, uint32_t, uint32_t>> dict = {
+        {"red", 10u, 0u}, {"green", 11u, 1u}, {"blue", 12u, 2u}};
+    const std::vector<std::vector<uint32_t>> per_point = {
+        {10u},       // point 0 -> red
+        {11u, 12u},  // point 1 -> green, blue
+        {12u},       // point 2 -> blue
+        {10u, 11u}}; // point 3 -> red, green
+    write_bitmask_labels_file(sf.path, npts, dict, per_point, /*total_labels=*/100, /*universal_label=*/99);
+
+    UnifiedIndexReader r(sf.path);
+    auto labels = make_unified_label_data(r, r.header(), r.header().npts);
+    BOOST_REQUIRE(labels != nullptr);
+    BOOST_CHECK_EQUAL(static_cast<int>(labels->encoding()), static_cast<int>(LabelEncoding::Bitmask));
+    BOOST_CHECK(labels->has_labels());
+    BOOST_CHECK_EQUAL(labels->num_labels(), 3u);
+    BOOST_CHECK(labels->is_valid_label("red"));
+    BOOST_CHECK(labels->is_valid_label("green"));
+    BOOST_CHECK(!labels->is_valid_label("yellow"));
+    uint32_t out = 99;
+    BOOST_CHECK(labels->get_converted_label("blue", out));
+    BOOST_CHECK_EQUAL(out, 12u);
+
+    // resolve_filters returns the label int and its medoid in a single lookup.
+    std::vector<uint32_t> blue_ints, blue_medoids;
+    labels->resolve_filters({"blue"}, blue_ints, blue_medoids);
+    BOOST_REQUIRE_EQUAL(blue_ints.size(), 1u);
+    BOOST_CHECK_EQUAL(blue_ints[0], 12u);
+    BOOST_REQUIRE_EQUAL(blue_medoids.size(), 1u);
+    BOOST_CHECK_EQUAL(blue_medoids[0], 2u);
+
+    // collect_label_medoids appends every label's entry-point medoid (one per
+    // label). Dict above maps red/green/blue -> medoids 0/1/2. It appends (does
+    // not clear), so a pre-existing element must be preserved. Order is
+    // unspecified (hash-map iteration), so sort before comparing.
+    std::vector<uint32_t> all_medoids = {42u};
+    labels->collect_label_medoids(all_medoids);
+    BOOST_REQUIRE_EQUAL(all_medoids.size(), 4u);
+    std::sort(all_medoids.begin(), all_medoids.end());
+    const std::vector<uint32_t> expected_medoids = {0u, 1u, 2u, 42u};
+    BOOST_CHECK_EQUAL_COLLECTIONS(all_medoids.begin(), all_medoids.end(), expected_medoids.begin(),
+                                  expected_medoids.end());
+
+    // Match-proxy: filter "blue" -> matches points 1, 2, but NOT 0, 3.
+    auto proxy = labels->make_match_proxy(blue_ints);
+    BOOST_REQUIRE(proxy != nullptr);
+    BOOST_CHECK(!proxy->contain_filtered_label(0));
+    BOOST_CHECK(proxy->contain_filtered_label(1));
+    BOOST_CHECK(proxy->contain_filtered_label(2));
+    BOOST_CHECK(!proxy->contain_filtered_label(3));
+}
+
+BOOST_AUTO_TEST_CASE(bitmask_unknown_label_string_throws)
+{
+    ScopedFile sf(tmp_path("bitmask_unknown"));
+    write_bitmask_labels_file(sf.path, /*npts=*/2,
+                              /*dict=*/{{"a", 0u, 0u}}, /*per_point=*/{{0u}, {}}, /*total_labels=*/1,
+                              /*universal_label=*/0);
+    UnifiedIndexReader r(sf.path);
+    auto labels = make_unified_label_data(r, r.header(), r.header().npts);
+    BOOST_REQUIRE(labels != nullptr);
+    std::vector<uint32_t> ints, medoids;
+    BOOST_CHECK_THROW(labels->resolve_filters({"nonexistent"}, ints, medoids), ANNException);
+}
+
+BOOST_AUTO_TEST_SUITE_END()
+
+// ===========================================================================
+// Suite 3: unified_label_data (integer)
+// ===========================================================================
+
+BOOST_AUTO_TEST_SUITE(unified_label_data_integer_tests)
+
+BOOST_AUTO_TEST_CASE(integer_load_and_match_proxy)
+{
+    ScopedFile sf(tmp_path("integer"));
+    const uint64_t npts = 4;
+    const std::vector<std::tuple<std::string, uint32_t, uint32_t>> dict = {
+        {"red", 10u, 0u}, {"green", 11u, 1u}, {"blue", 12u, 2u}};
+    const std::vector<std::vector<uint32_t>> per_point = {{10u}, {11u, 12u}, {12u}, {10u, 11u}};
+    write_integer_labels_file(sf.path, npts, dict, per_point, /*total_labels=*/13, /*universal_label=*/99);
+
+    UnifiedIndexReader r(sf.path);
+    auto labels = make_unified_label_data(r, r.header(), r.header().npts);
+    BOOST_REQUIRE(labels != nullptr);
+    BOOST_CHECK_EQUAL(static_cast<int>(labels->encoding()), static_cast<int>(LabelEncoding::Integer));
+    BOOST_CHECK(labels->has_labels());
+    BOOST_CHECK_EQUAL(labels->num_labels(), 3u);
+
+    std::vector<uint32_t> blue_ints, blue_medoids;
+    labels->resolve_filters({"blue"}, blue_ints, blue_medoids);
+    auto proxy = labels->make_match_proxy(blue_ints);
+    BOOST_REQUIRE(proxy != nullptr);
+    BOOST_CHECK(!proxy->contain_filtered_label(0));
+    BOOST_CHECK(proxy->contain_filtered_label(1));
+    BOOST_CHECK(proxy->contain_filtered_label(2));
+    BOOST_CHECK(!proxy->contain_filtered_label(3));
+}
+
+BOOST_AUTO_TEST_CASE(integer_match_proxy_sorts_filter_labels)
+{
+    // Regression: integer_label_vector::check_label_exists advances its
+    // binary-search window between query labels, so make_match_proxy must sort
+    // the resolved filter ints. Pass the filter strings so that their ints come
+    // out in DESCENDING order ("green", "red" -> ints 11, 10): an unsorted proxy
+    // would search 11 first, advance past index 0, then miss 10 -> false
+    // negative for point 0 (labelled red).
+    ScopedFile sf(tmp_path("integer_unsorted"));
+    const uint64_t npts = 4;
+    const std::vector<std::tuple<std::string, uint32_t, uint32_t>> dict = {
+        {"red", 10u, 0u}, {"green", 11u, 1u}, {"blue", 12u, 2u}};
+    // Per-point label lists are stored ascending (binary-search precondition).
+    const std::vector<std::vector<uint32_t>> per_point = {{10u}, {11u, 12u}, {12u}, {10u, 11u}};
+    write_integer_labels_file(sf.path, npts, dict, per_point, /*total_labels=*/13, /*universal_label=*/0);
+
+    UnifiedIndexReader r(sf.path);
+    auto labels = make_unified_label_data(r, r.header(), r.header().npts);
+    BOOST_REQUIRE(labels != nullptr);
+
+    // Filter "green" OR "red" -> matches points 0 (red), 1 (green), 3 (red+green),
+    // but NOT 2 (blue only). Strings are intentionally out of int order.
+    std::vector<uint32_t> ints, medoids;
+    labels->resolve_filters({"green", "red"}, ints, medoids);
+    auto proxy = labels->make_match_proxy(ints);
+    BOOST_REQUIRE(proxy != nullptr);
+    BOOST_CHECK(proxy->contain_filtered_label(0)); // red -- the case the sort fixes
+    BOOST_CHECK(proxy->contain_filtered_label(1)); // green
+    BOOST_CHECK(!proxy->contain_filtered_label(2)); // blue only
+    BOOST_CHECK(proxy->contain_filtered_label(3)); // red + green
+}
+
+BOOST_AUTO_TEST_CASE(integer_unknown_label_string_throws)
+{
+    ScopedFile sf(tmp_path("integer_unknown"));
+    write_integer_labels_file(sf.path, /*npts=*/2,
+                              /*dict=*/{{"a", 0u, 0u}}, /*per_point=*/{{0u}, {}}, /*total_labels=*/1,
+                              /*universal_label=*/0);
+    UnifiedIndexReader r(sf.path);
+    auto labels = make_unified_label_data(r, r.header(), r.header().npts);
+    BOOST_REQUIRE(labels != nullptr);
+    std::vector<uint32_t> ints, medoids;
+    BOOST_CHECK_THROW(labels->resolve_filters({"nonexistent"}, ints, medoids), ANNException);
+}
+
+BOOST_AUTO_TEST_SUITE_END()
+
+// ===========================================================================
+// Suite 4: make_unified_label_data factory dispatch
+// ===========================================================================
+
+BOOST_AUTO_TEST_SUITE(unified_label_data_factory_tests)
+
+BOOST_AUTO_TEST_CASE(factory_no_labels_returns_null)
+{
+    ScopedFile sf(tmp_path("no_labels"));
+    write_minimal_unified_file(sf.path, /*npts=*/2, /*dim=*/4, /*aligned_dim=*/4, /*max_degree=*/4,
+                                DataTypeTag::Float);
+    // write_minimal_unified_file does not call finalize(), so rewrite the file
+    // inline with a writer that does:
+    {
+        UnifiedIndexWriter w(sf.path);
+        w.begin(/*npts=*/2, /*dim=*/4, /*aligned_dim=*/4, /*max_degree=*/4, DataTypeTag::Float, MetricTag::L2, 0);
+        w.begin_graph_region();
+        std::vector<float> coord(4, 0.0f);
+        const uint32_t nb = 0;
+        w.write_node(coord.data(), &nb, 1);
+        w.write_node(coord.data(), &nb, 1);
+        w.end_graph_region();
+        const uint32_t medoid = 0;
+        w.write_medoids(&medoid, 1);
+        w.finalize();
+    }
+
+    UnifiedIndexReader r(sf.path);
+    BOOST_CHECK_EQUAL(r.header().flags & HAS_LABELS, 0u);
+    auto labels = make_unified_label_data(r, r.header(), r.header().npts);
+    BOOST_CHECK(labels == nullptr);
+}
+
+BOOST_AUTO_TEST_CASE(factory_picks_bitmask_class)
+{
+    ScopedFile sf(tmp_path("factory_bm"));
+    write_bitmask_labels_file(sf.path, /*npts=*/2,
+                              /*dict=*/{{"a", 0u, 0u}}, /*per_point=*/{{0u}, {}}, /*total_labels=*/1,
+                              /*universal_label=*/0);
+    UnifiedIndexReader r(sf.path);
+    auto labels = make_unified_label_data(r, r.header(), r.header().npts);
+    BOOST_REQUIRE(labels != nullptr);
+    BOOST_CHECK_EQUAL(static_cast<int>(labels->encoding()), static_cast<int>(LabelEncoding::Bitmask));
+}
+
+BOOST_AUTO_TEST_CASE(factory_picks_integer_class)
+{
+    ScopedFile sf(tmp_path("factory_int"));
+    write_integer_labels_file(sf.path, /*npts=*/2,
+                              /*dict=*/{{"a", 0u, 0u}}, /*per_point=*/{{0u}, {}}, /*total_labels=*/1,
+                              /*universal_label=*/0);
+    UnifiedIndexReader r(sf.path);
+    auto labels = make_unified_label_data(r, r.header(), r.header().npts);
+    BOOST_REQUIRE(labels != nullptr);
+    BOOST_CHECK_EQUAL(static_cast<int>(labels->encoding()), static_cast<int>(LabelEncoding::Integer));
+}
+
+BOOST_AUTO_TEST_SUITE_END()
+
+// ===========================================================================
+// Suite 5: unified_index factory -- non-templated user-facing surface
+// ===========================================================================
+//
+// The factory reads data_type from the file header and instantiates the
+// matching unified_index_memory<T> / unified_index_ssd<T>. We verify two
+// things: (1) UnifiedIndexReader independently confirms the file's data_type
+// matches the tag the writer was given, and (2) the index returned by the
+// factory reports that same data_type (the factory's expected dispatch).
+
+BOOST_AUTO_TEST_SUITE(unified_index_factory_tests)
+
+BOOST_AUTO_TEST_CASE(factory_dispatches_on_data_type)
+{
+    struct case_t
+    {
+        DataTypeTag tag;
+        const char *suffix;
+    };
+    case_t cases[] = {{DataTypeTag::Float, "float"}, {DataTypeTag::Uint8, "u8"}, {DataTypeTag::Int8, "i8"}};
+
+    const uint64_t aligned_dim = 32; // multiple of every type's alignment factor
+
+    for (auto &c : cases)
+    {
+        ScopedFile sf(tmp_path(c.suffix));
+        {
+            UnifiedIndexWriter w(sf.path);
+            w.begin(/*npts=*/2, /*dim=*/aligned_dim, aligned_dim, /*max_degree=*/4, c.tag, MetricTag::L2, 0);
+            w.begin_graph_region();
+            // Worst-case sized coords buffer (sizeof(float)) -- writer reads
+            // aligned_dim*sizeof(T) bytes depending on the data_type tag.
+            std::vector<uint8_t> coord(aligned_dim * sizeof(float), 0);
+            const uint32_t nb = 0;
+            w.write_node(coord.data(), &nb, 1);
+            w.write_node(coord.data(), &nb, 1);
+            w.end_graph_region();
+            const uint32_t medoid = 0;
+            w.write_medoids(&medoid, 1);
+            w.finalize();
+        }
+
+        // Verify the file says the right data type.
+        {
+            UnifiedIndexReader r(sf.path);
+            BOOST_CHECK_EQUAL(static_cast<int>(r.header().data_type), static_cast<int>(c.tag));
+        }
+
+        // Factory should construct the right templated index and load successfully.
+        UnifiedLoadContext ctx;
+        ctx.path = sf.path;
+        auto idx = make_unified_index_memory(ctx);
+        BOOST_REQUIRE(idx != nullptr);
+        BOOST_CHECK_EQUAL(static_cast<int>(idx->data_type()), static_cast<int>(c.tag));
+        BOOST_CHECK_EQUAL(idx->num_points(), 2u);
+        BOOST_CHECK_EQUAL(idx->dim(), aligned_dim);
+    }
+}
+
+BOOST_AUTO_TEST_CASE(factory_ssd_dispatches_on_data_type)
+{
+    // SSD path requires HAS_PQ. A no-PQ unified file should be rejected
+    // cleanly. We verify two things: (1) the factory dispatches to the
+    // right templated unified_index_ssd<T> based on data_type, and (2) the
+    // SSD store class rejects no-PQ files with a clear error message.
+    const uint64_t aligned_dim = 32;
+    ScopedFile sf(tmp_path("ssd_disp"));
+    {
+        UnifiedIndexWriter w(sf.path);
+        w.begin(/*npts=*/2, /*dim=*/aligned_dim, aligned_dim, /*max_degree=*/4, DataTypeTag::Float, MetricTag::L2,
+                 0);
+        w.begin_graph_region();
+        std::vector<float> coord(aligned_dim, 0.0f);
+        const uint32_t nb = 0;
+        w.write_node(coord.data(), &nb, 1);
+        w.write_node(coord.data(), &nb, 1);
+        w.end_graph_region();
+        const uint32_t medoid = 0;
+        w.write_medoids(&medoid, 1);
+        w.finalize();
+    }
+
+    UnifiedLoadContext ctx;
+    ctx.path = sf.path;
+    auto reader = make_reader();
+    bool threw = false;
+    try
+    {
+        auto idx = make_unified_index_ssd(reader, ctx);
+    }
+    catch (const ANNException &e)
+    {
+        threw = true;
+        std::string msg = e.what();
+        // Either the SSD index class flags the missing PQ ("requires HAS_PQ"),
+        // or the file fails to open for some other reason. Either way the
+        // throw should originate from our SSD code path.
+        BOOST_CHECK(msg.find("HAS_PQ") != std::string::npos ||
+                    msg.find("unified_index_ssd") != std::string::npos ||
+                    msg.find("unified_node_store_ssd") != std::string::npos);
+    }
+    BOOST_CHECK(threw);
+}
+
+BOOST_AUTO_TEST_SUITE_END()
+
+// ===========================================================================
+// Suite 6: Phase B -- unified_index_memory<T> end-to-end search
+// ===========================================================================
+//
+// Build a small synthetic unified file with random float points and a star
+// graph (every node connected to a hub), load it via make_unified_index_memory,
+// then validate that search returns sane top-K against a brute-force baseline.
+// Recall@10 should be ≥ 95% on a star graph because every node is reachable
+// from the hub in one hop.
+
+BOOST_AUTO_TEST_SUITE(unified_index_memory_tests)
+
+BOOST_AUTO_TEST_CASE(memory_end_to_end_search_recall)
+{
+    ScopedFile sf(tmp_path("memory_e2e"));
+    const uint64_t npts = 64;
+    const uint64_t aligned_dim = 16;
+
+    std::vector<std::vector<float>> points;
+    write_star_graph_unified(sf.path, npts, aligned_dim, points);
+
+    UnifiedLoadContext ctx;
+    ctx.path = sf.path;
+    ctx.num_threads = 1;
+    ctx.search_l = 64;
+
+    auto idx = make_unified_index_memory(ctx);
+    BOOST_REQUIRE(idx != nullptr);
+    BOOST_CHECK_EQUAL(idx->num_points(), npts);
+    BOOST_CHECK_EQUAL(idx->dim(), aligned_dim);
+    BOOST_CHECK_EQUAL(static_cast<int>(idx->data_type()), static_cast<int>(DataTypeTag::Float));
+
+    const size_t K = 10;
+    const size_t Q = 50;
+    float total_recall = 0.0f;
+
+    for (size_t q = 0; q < Q; ++q)
+    {
+        std::vector<float> query(aligned_dim, 0.0f);
+        for (size_t j = 0; j < aligned_dim; ++j)
+            query[j] = det_float(/*seed=*/777, q, j);
+
+        std::vector<uint64_t> out_ids(K, 0);
+        std::vector<float> out_dists(K, 0.0f);
+
+        UnifiedSearchContext sctx;
+        sctx.query = query.data();
+        sctx.K = K;
+        sctx.L = 64;
+        sctx.indices = out_ids.data();
+        sctx.distances = out_dists.data();
+        idx->search(sctx);
+
+        auto gt = brute_force_topk(points, query.data(), K);
+        total_recall += recall_at_k(gt, out_ids.data(), K);
+    }
+
+    const float avg_recall = total_recall / static_cast<float>(Q);
+    BOOST_TEST_MESSAGE("memory_end_to_end_search_recall: avg recall@10 = " << avg_recall);
+    BOOST_CHECK_GE(avg_recall, 0.95f);
+}
+
+BOOST_AUTO_TEST_CASE(memory_search_validates_K_and_L)
+{
+    ScopedFile sf(tmp_path("memory_validate"));
+    std::vector<std::vector<float>> points;
+    write_star_graph_unified(sf.path, /*npts=*/8, /*aligned_dim=*/8, points);
+
+    UnifiedLoadContext ctx;
+    ctx.path = sf.path;
+    auto idx = make_unified_index_memory(ctx);
+
+    std::vector<float> query(8, 0.1f);
+    std::vector<uint64_t> out_ids(4, 0);
+    std::vector<float> out_dists(4, 0.0f);
+
+    UnifiedSearchContext sctx;
+    sctx.query = query.data();
+    sctx.K = 4;
+    sctx.L = 2; // L < K -> must throw
+    sctx.indices = out_ids.data();
+    sctx.distances = out_dists.data();
+    BOOST_CHECK_THROW(idx->search(sctx), ANNException);
+
+    // filter_labels must be empty on a non-filtered index.
+    sctx.L = 8;
+    sctx.filter_labels = {"red"};
+    BOOST_CHECK_THROW(idx->search(sctx), ANNException);
+}
+
+BOOST_AUTO_TEST_SUITE_END()
+
+// ===========================================================================
+// Suite 7: Phase C -- unified_node_store_ssd<T> with AlignedFileReader
+// ===========================================================================
+//
+// Build the same star-graph unified file the memory suite uses, but load it
+// via the SSD store. Verify (1) get_nodes returns the same coords/neighbors
+// the writer wrote, (2) cache priming via cache_bfs_levels populates the
+// caches so subsequent get_nodes calls skip IO, (3) recall via a simple
+// brute-force-over-cached-nodes baseline matches.
+
+BOOST_AUTO_TEST_SUITE(unified_node_store_ssd_tests)
+
+BOOST_AUTO_TEST_CASE(ssd_get_nodes_uncached_round_trip)
+{
+    ScopedFile sf(tmp_path("ssd_round_trip"));
+    const uint64_t npts = 16;
+    const uint64_t aligned_dim = 16;
+    std::vector<std::vector<float>> points;
+    write_star_graph_unified(sf.path, npts, aligned_dim, points);
+
+    UnifiedIndexReader reader_meta(sf.path);
+    auto reader = make_reader();
+    unified_node_store_ssd<float> store(reader);
+    store.load(reader_meta, reader_meta.header());
+
+    BOOST_CHECK_EQUAL(store.num_points(), npts);
+    BOOST_CHECK_EQUAL(store.aligned_dim(), aligned_dim);
+
+    NodeFetchScratch scratch = store.make_fetch_scratch(/*max_batch=*/16);
+    std::vector<uint64_t> ids = {0, 1, 5, 10, 15};
+    std::vector<NodeView<float>> views;
+    store.get_nodes(ids, scratch, views);
+    BOOST_REQUIRE_EQUAL(views.size(), ids.size());
+
+    // One batched read should have been issued for all 5 ids.
+    BOOST_CHECK_EQUAL(store.io_count(), 1u);
+
+    // Verify coords match what the writer wrote.
+    for (size_t i = 0; i < ids.size(); ++i)
+    {
+        const uint64_t id = ids[i];
+        BOOST_REQUIRE(views[i].coords != nullptr);
+        for (uint64_t j = 0; j < aligned_dim; ++j)
+        {
+            BOOST_CHECK_CLOSE(views[i].coords[j], points[id][j], 1e-6f);
+        }
+    }
+
+    // Star graph: node 0 has neighbors [1..npts-1], every other node has neighbor [0].
+    BOOST_CHECK_EQUAL(views[0].degree, npts - 1);
+    for (size_t i = 1; i < ids.size(); ++i)
+    {
+        BOOST_CHECK_EQUAL(views[i].degree, 1u);
+        BOOST_CHECK_EQUAL(views[i].neighbors[0], 0u);
+    }
+}
+
+BOOST_AUTO_TEST_CASE(ssd_cache_hits_skip_io)
+{
+    ScopedFile sf(tmp_path("ssd_cache"));
+    const uint64_t npts = 16;
+    const uint64_t aligned_dim = 16;
+    std::vector<std::vector<float>> points;
+    write_star_graph_unified(sf.path, npts, aligned_dim, points);
+
+    UnifiedIndexReader reader_meta(sf.path);
+    auto reader = make_reader();
+    unified_node_store_ssd<float> store(reader);
+    store.load(reader_meta, reader_meta.header());
+
+    // Prime cache with ids {0, 5, 10}. The prime scratch is sized for a single
+    // node (max_batch=1), so load_cache_list batches one node per get_nodes
+    // call here -> at least 3 IOs. (With a larger scratch it would batch all
+    // three into a single IO; the GE assertion below is a lower bound for the
+    // per-node case, not an exact count.)
+    {
+        NodeFetchScratch prime_scratch = store.make_fetch_scratch(/*max_batch=*/1);
+        store.load_cache_list({0u, 5u, 10u}, prime_scratch);
+    }
+    const uint64_t io_after_prime = store.io_count();
+    BOOST_CHECK_GE(io_after_prime, 3u);
+
+    NodeFetchScratch scratch = store.make_fetch_scratch(/*max_batch=*/16);
+    std::vector<NodeView<float>> views;
+
+    // All-cached batch: io_count must stay flat.
+    store.get_nodes({0u, 5u, 10u}, scratch, views);
+    BOOST_CHECK_EQUAL(store.io_count(), io_after_prime);
+    for (size_t i = 0; i < 3; ++i)
+    {
+        BOOST_REQUIRE(views[i].coords != nullptr);
+        BOOST_REQUIRE(views[i].neighbors != nullptr);
+    }
+
+    // Mixed batch with one miss (id 7): one IO incurred.
+    store.get_nodes({0u, 5u, 7u, 10u}, scratch, views);
+    BOOST_CHECK_EQUAL(store.io_count(), io_after_prime + 1u);
+    // Verify the miss (id 7) decoded correctly against the writer's data.
+    for (uint64_t j = 0; j < aligned_dim; ++j)
+    {
+        BOOST_CHECK_CLOSE(views[2].coords[j], points[7][j], 1e-6f);
+    }
+}
+
+BOOST_AUTO_TEST_CASE(ssd_load_cache_list_batches_reads)
+{
+    // load_cache_list batches its reads: with a scratch large enough to hold
+    // the whole node list, all requested nodes are fetched in a single batched
+    // IO instead of one IO per node.
+    ScopedFile sf(tmp_path("ssd_cache_batch"));
+    const uint64_t npts = 16;
+    const uint64_t aligned_dim = 16;
+    std::vector<std::vector<float>> points;
+    write_star_graph_unified(sf.path, npts, aligned_dim, points);
+
+    UnifiedIndexReader reader_meta(sf.path);
+    auto reader = make_reader();
+    unified_node_store_ssd<float> store(reader);
+    store.load(reader_meta, reader_meta.header());
+
+    const std::vector<uint32_t> ids = {0u, 3u, 6u, 9u, 12u, 15u};
+    const uint64_t io_before = store.io_count();
+    {
+        // Scratch capacity 8 >= 6 ids -> exactly one batched read.
+        NodeFetchScratch prime_scratch = store.make_fetch_scratch(/*max_batch=*/8);
+        store.load_cache_list(ids, prime_scratch);
+    }
+    BOOST_CHECK_EQUAL(store.io_count() - io_before, 1u);
+
+    // Every primed id now resolves with zero additional IO, and the cached
+    // coords match what the writer stored.
+    const uint64_t io_after = store.io_count();
+    NodeFetchScratch scratch = store.make_fetch_scratch(/*max_batch=*/8);
+    std::vector<NodeView<float>> views;
+    store.get_nodes({0u, 3u, 6u, 9u, 12u, 15u}, scratch, views);
+    BOOST_CHECK_EQUAL(store.io_count(), io_after);
+    for (size_t i = 0; i < ids.size(); ++i)
+    {
+        BOOST_REQUIRE(views[i].coords != nullptr);
+        for (uint64_t j = 0; j < aligned_dim; ++j)
+            BOOST_CHECK_CLOSE(views[i].coords[j], points[ids[i]][j], 1e-6f);
+    }
+}
+
+BOOST_AUTO_TEST_CASE(ssd_cache_bfs_levels_seeds_from_medoid)
+{
+    ScopedFile sf(tmp_path("ssd_bfs"));
+    const uint64_t npts = 16;
+    const uint64_t aligned_dim = 16;
+    std::vector<std::vector<float>> points;
+    write_star_graph_unified(sf.path, npts, aligned_dim, points);
+
+    UnifiedIndexReader reader_meta(sf.path);
+    auto reader = make_reader();
+    unified_node_store_ssd<float> store(reader);
+    store.load(reader_meta, reader_meta.header());
+
+    // Star graph from medoid 0 -> all npts nodes reachable within 1 hop.
+    std::vector<uint32_t> cached_list;
+    {
+        NodeFetchScratch prime_scratch = store.make_fetch_scratch(/*max_batch=*/npts);
+        store.cache_bfs_levels(/*seeds=*/{0u}, /*num_nodes_to_cache=*/npts, cached_list, prime_scratch);
+    }
+    BOOST_CHECK_EQUAL(cached_list.size(), npts);
+
+    // After priming, asking for ANY id must not issue IO.
+    const uint64_t io_after_prime = store.io_count();
+    NodeFetchScratch scratch = store.make_fetch_scratch(/*max_batch=*/16);
+    std::vector<NodeView<float>> views;
+    store.get_nodes({0u, 7u, 15u}, scratch, views);
+    BOOST_CHECK_EQUAL(store.io_count(), io_after_prime);
+}
+
+BOOST_AUTO_TEST_SUITE_END()
+
+// ===========================================================================
+// Suite 8: Phase E -- unified_index_builder end-to-end
+// ===========================================================================
+//
+// Build a tiny .bin data file, run unified_index_builder, then load the
+// produced unified container via the read-side factories and confirm metadata.
+
+BOOST_AUTO_TEST_SUITE(unified_index_builder_tests)
+
+namespace
+{
+// Write a DiskANN-format .bin file: [int32 npts][int32 dim][float data...].
+void write_float_bin(const std::string &path, uint32_t npts, uint32_t dim, uint64_t seed)
+{
+    std::ofstream out;
+    out.exceptions(std::ios::badbit | std::ios::failbit);
+    out.open(path, std::ios::binary | std::ios::trunc);
+    const int32_t n = static_cast<int32_t>(npts);
+    const int32_t d = static_cast<int32_t>(dim);
+    out.write(reinterpret_cast<const char *>(&n), sizeof(int32_t));
+    out.write(reinterpret_cast<const char *>(&d), sizeof(int32_t));
+    for (uint32_t i = 0; i < npts; ++i)
+    {
+        for (uint32_t j = 0; j < dim; ++j)
+        {
+            const float v = det_float(seed, i, j);
+            out.write(reinterpret_cast<const char *>(&v), sizeof(float));
+        }
+    }
+}
+} // namespace
+
+BOOST_AUTO_TEST_CASE(builder_no_pq_emits_memory_loadable_file)
+{
+    ScopedFile data_sf(tmp_path("builder_data_nopq.bin"));
+    ScopedFile out_sf(tmp_path("builder_out_nopq.bin"));
+
+    write_float_bin(data_sf.path, /*npts=*/64, /*dim=*/16, /*seed=*/42);
+
+    UnifiedBuildContext ctx;
+    ctx.data_file_path = data_sf.path;
+    ctx.output_path = out_sf.path;
+    ctx.data_type = DataTypeTag::Float;
+    ctx.metric = diskann::Metric::L2;
+    ctx.R = 16;
+    ctx.L = 32;
+    ctx.alpha = 1.2f;
+    ctx.num_threads = 1;
+    ctx.pq_dim = 0; // no PQ
+
+    unified_index_builder builder;
+    builder.build(ctx);
+
+    // Verify the produced file: no HAS_PQ flag, metadata matches.
+    {
+        UnifiedIndexReader r(out_sf.path);
+        const auto &h = r.header();
+        BOOST_CHECK_EQUAL(static_cast<int>(h.data_type), static_cast<int>(DataTypeTag::Float));
+        BOOST_CHECK_EQUAL(h.npts, 64u);
+        BOOST_CHECK_EQUAL(h.dim, 16u);
+        BOOST_CHECK_EQUAL(h.flags & HAS_PQ, 0u);
+    }
+
+    // Confirm it loads via the memory factory and metadata is intact.
+    UnifiedLoadContext load_ctx;
+    load_ctx.path = out_sf.path;
+    auto idx = make_unified_index_memory(load_ctx);
+    BOOST_REQUIRE(idx != nullptr);
+    BOOST_CHECK_EQUAL(idx->num_points(), 64u);
+    BOOST_CHECK_EQUAL(idx->dim(), 16u);
+}
+
+BOOST_AUTO_TEST_CASE(builder_with_pq_emits_ssd_loadable_file)
+{
+    ScopedFile data_sf(tmp_path("builder_data_pq.bin"));
+    ScopedFile out_sf(tmp_path("builder_out_pq.bin"));
+
+    // Bigger dataset so PQ training has enough sample points.
+    write_float_bin(data_sf.path, /*npts=*/512, /*dim=*/16, /*seed=*/7);
+
+    UnifiedBuildContext ctx;
+    ctx.data_file_path = data_sf.path;
+    ctx.output_path = out_sf.path;
+    ctx.data_type = DataTypeTag::Float;
+    ctx.metric = diskann::Metric::L2;
+    ctx.R = 16;
+    ctx.L = 32;
+    ctx.alpha = 1.2f;
+    ctx.num_threads = 1;
+    ctx.pq_dim = 4;             // PQ-compress 16 dims into 4 chunks
+    ctx.pq_sampling_rate = 1.0; // train on full data (tiny set)
+
+    unified_index_builder builder;
+    builder.build(ctx);
+
+    // Verify metadata, the HAS_PQ flag, and non-empty PQ sections.
+    {
+        UnifiedIndexReader r(out_sf.path);
+        const auto &h = r.header();
+        BOOST_CHECK_EQUAL(static_cast<int>(h.data_type), static_cast<int>(DataTypeTag::Float));
+        BOOST_CHECK_EQUAL(h.npts, 512u);
+        BOOST_CHECK_NE(h.flags & HAS_PQ, 0u);
+        BOOST_CHECK_GT(h.pq_pivots_len, 0u);
+        BOOST_CHECK_GT(h.pq_codes_len, 0u);
+    }
+
+    // Confirm it loads via the SSD factory end-to-end.
+    UnifiedLoadContext load_ctx;
+    load_ctx.path = out_sf.path;
+    load_ctx.num_threads = 1;
+    auto reader = make_reader();
+    auto idx = make_unified_index_ssd(reader, load_ctx);
+    BOOST_REQUIRE(idx != nullptr);
+    BOOST_CHECK_EQUAL(idx->num_points(), 512u);
+    BOOST_CHECK_EQUAL(idx->dim(), 16u);
+}
+
+BOOST_AUTO_TEST_CASE(builder_pq_dim_equals_dim_still_emits_ssd_loadable_file)
+{
+    // Regression: pq_dim == dim must still emit PQ so the file is SSD-loadable.
+    // Previously the builder skipped PQ when pq_dim >= dim, producing a file the
+    // SSD load path (which requires HAS_PQ) would reject.
+    ScopedFile data_sf(tmp_path("builder_data_pqfull.bin"));
+    ScopedFile out_sf(tmp_path("builder_out_pqfull.bin"));
+
+    const uint32_t dim = 16;
+    write_float_bin(data_sf.path, /*npts=*/512, dim, /*seed=*/11);
+
+    UnifiedBuildContext ctx;
+    ctx.data_file_path = data_sf.path;
+    ctx.output_path = out_sf.path;
+    ctx.data_type = DataTypeTag::Float;
+    ctx.metric = diskann::Metric::L2;
+    ctx.R = 16;
+    ctx.L = 32;
+    ctx.alpha = 1.2f;
+    ctx.num_threads = 1;
+    ctx.pq_dim = dim;           // pq_dim == dim -> chunk size 1 (one PQ chunk per dimension)
+    ctx.pq_sampling_rate = 1.0; // train on full data (tiny set)
+
+    unified_index_builder builder;
+    builder.build(ctx);
+
+    // HAS_PQ must be set even though pq_dim == dim.
+    {
+        UnifiedIndexReader r(out_sf.path);
+        const auto &h = r.header();
+        BOOST_CHECK_EQUAL(h.npts, 512u);
+        BOOST_CHECK_EQUAL(h.dim, static_cast<uint64_t>(dim));
+        BOOST_CHECK_NE(h.flags & HAS_PQ, 0u);
+        BOOST_CHECK_GT(h.pq_pivots_len, 0u);
+        BOOST_CHECK_GT(h.pq_codes_len, 0u);
+    }
+
+    // And it loads via the SSD factory end-to-end.
+    UnifiedLoadContext load_ctx;
+    load_ctx.path = out_sf.path;
+    load_ctx.num_threads = 1;
+    auto reader = make_reader();
+    auto idx = make_unified_index_ssd(reader, load_ctx);
+    BOOST_REQUIRE(idx != nullptr);
+    BOOST_CHECK_EQUAL(idx->num_points(), 512u);
+    BOOST_CHECK_EQUAL(idx->dim(), static_cast<uint64_t>(dim));
+}
+
+BOOST_AUTO_TEST_SUITE_END()
+
+// ===========================================================================
+// Suite 9: legacy <-> unified parity
+// ===========================================================================
+//
+// Verify the unified format produces the SAME search results as the legacy
+// DiskANN format when both are derived from the SAME Vamana graph:
+//   - Memory: one Index<float> -> save() (legacy) + save_unified(); load both
+//     from disk and search; top-K must match.
+//   - SSD: one Index<float> + one shared PQ codebook -> legacy disk index (via
+//     a test-local helper mirroring build_disk_index's internals on an existing
+//     Index) + unified SSD file; load PQFlashIndex and unified_index_ssd; top-K
+//     must match.
+
+namespace
+{
+
+// Write a DiskANN .bin file: [int32 npts][int32 dim][float data...]. (Local to
+// the parity suite; the builder suite has its own copy in a nested namespace.)
+void write_float_bin_parity(const std::string &path, uint32_t npts, uint32_t dim, uint64_t seed)
+{
+    std::ofstream out;
+    out.exceptions(std::ios::badbit | std::ios::failbit);
+    out.open(path, std::ios::binary | std::ios::trunc);
+    const int32_t n = static_cast<int32_t>(npts);
+    const int32_t d = static_cast<int32_t>(dim);
+    out.write(reinterpret_cast<const char *>(&n), sizeof(int32_t));
+    out.write(reinterpret_cast<const char *>(&d), sizeof(int32_t));
+    for (uint32_t i = 0; i < npts; ++i)
+        for (uint32_t j = 0; j < dim; ++j)
+        {
+            const float v = det_float(seed, i, j);
+            out.write(reinterpret_cast<const char *>(&v), sizeof(float));
+        }
+}
+
+// Cleanup helper for the multi-file legacy memory index (save() writes the
+// graph file plus a ".data" sidecar, and ".tags" when tags are enabled).
+struct ScopedLegacyMemFiles
+{
+    std::string prefix;
+    explicit ScopedLegacyMemFiles(std::string p) : prefix(std::move(p))
+    {
+    }
+    ~ScopedLegacyMemFiles()
+    {
+        for (const char *suffix : {"", ".data", ".tags"})
+            std::remove((prefix + suffix).c_str());
+    }
+};
+
+// Fraction of `a`'s top-K ids that also appear in `b`'s top-K (set overlap).
+template <typename IdA, typename IdB>
+double topk_overlap(const std::vector<IdA> &a, const std::vector<IdB> &b)
+{
+    std::unordered_set<uint64_t> bs;
+    for (auto id : b)
+        bs.insert(static_cast<uint64_t>(id));
+    size_t inter = 0;
+    for (auto id : a)
+        if (bs.count(static_cast<uint64_t>(id)))
+            ++inter;
+    return a.empty() ? 1.0 : static_cast<double>(inter) / static_cast<double>(a.size());
+}
+
+// Read an entire file into a byte buffer.
+std::vector<uint8_t> slurp_all(const std::string &path)
+{
+    std::ifstream in;
+    in.exceptions(std::ios::badbit | std::ios::failbit);
+    in.open(path, std::ios::binary | std::ios::ate);
+    const std::streamoff sz = in.tellg();
+    in.seekg(0, std::ios::beg);
+    std::vector<uint8_t> out(static_cast<size_t>(sz));
+    if (sz > 0)
+        in.read(reinterpret_cast<char *>(out.data()), sz);
+    return out;
+}
+
+// Cleanup helper for the legacy SSD index artifacts (mem graph, PQ files +
+// sidecars, and the sector-packed disk index).
+struct ScopedLegacySsdFiles
+{
+    std::string prefix;
+    explicit ScopedLegacySsdFiles(std::string p) : prefix(std::move(p))
+    {
+    }
+    ~ScopedLegacySsdFiles()
+    {
+        for (const char *suffix :
+             {"_mem.index", "_mem.index.data", "_mem.index.tags", "_disk.index", "_disk.index_medoids.bin",
+              "_disk.index_centroids.bin", "_pq_pivots.bin", "_pq_pivots.bin_centroid.bin",
+              "_pq_pivots.bin_chunk_offsets.bin", "_pq_pivots.bin_rearrangement_perm.bin", "_pq_compressed.bin"})
+            std::remove((prefix + suffix).c_str());
+    }
+};
+
+// Build the legacy SSD index files from an EXISTING Index, mirroring the
+// relevant internals of diskann::build_disk_index (which we can't reuse
+// directly because it constructs its own Index instance). Emits
+// <prefix>_mem.index, _disk.index, _pq_pivots.bin, and _pq_compressed.bin.
+// Returns the PQ pivot + code bytes so the caller can embed the SAME PQ
+// codebook into the unified file -- guaranteeing both indices share graph AND
+// PQ, so any search-result difference is purely a format/decoder difference.
+void make_legacy_ssd_from_index(Index<float, uint32_t, uint32_t> &idx, const std::string &data_file,
+                                const std::string &prefix, uint32_t num_pq_chunks, diskann::Metric metric,
+                                std::vector<uint8_t> &pq_pivots_bytes, std::vector<uint8_t> &pq_codes_bytes)
+{
+    const std::string mem_index = prefix + "_mem.index";
+    const std::string pq_pivots = prefix + "_pq_pivots.bin";
+    const std::string pq_codes = prefix + "_pq_compressed.bin";
+    const std::string disk_index = prefix + "_disk.index";
+
+    // 1) Save the Vamana graph (the same graph that backs the unified file).
+    idx.save(mem_index.c_str());
+
+    // 2) Train PQ once. These files feed BOTH the legacy PQFlashIndex and (via
+    //    slurp_all) the unified file, so both sides use byte-identical codes.
+    diskann::generate_quantized_data<float>(data_file, pq_pivots, pq_codes, metric, /*p_val=*/1.0,
+                                            num_pq_chunks, /*use_opq=*/false, /*codebook_prefix=*/"");
+
+    // 3) Pack coords + adjacency into the sector-aligned legacy disk index.
+    diskann::create_disk_layout<float>(data_file, mem_index, disk_index);
+
+    pq_pivots_bytes = slurp_all(pq_pivots);
+    pq_codes_bytes = slurp_all(pq_codes);
+}
+
+// --- Filtered-index helpers -------------------------------------------------
+
+inline uint64_t splitmix64(uint64_t x)
+{
+    x += 0x9E3779B97F4A7C15ull;
+    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ull;
+    x = (x ^ (x >> 27)) * 0x94D049BB133111EBull;
+    return x ^ (x >> 31);
+}
+
+// Assign each of `npts` points a small random subset (1..3) of string labels
+// drawn from the vocabulary {"0".."vocab-1"}, deterministically from `seed`.
+// Every point gets >= 1 label (the filtered build rejects label-less points).
+std::vector<std::vector<std::string>> gen_label_sets(uint32_t npts, uint32_t vocab, uint64_t seed)
+{
+    std::vector<std::vector<std::string>> sets(npts);
+    for (uint32_t i = 0; i < npts; ++i)
+    {
+        uint64_t h = splitmix64(seed ^ (static_cast<uint64_t>(i) * 0x100000001B3ull));
+        const uint32_t count = 1u + static_cast<uint32_t>(h % 3u); // 1..3 labels
+        std::set<uint32_t> chosen;
+        for (uint32_t c = 0; c < count; ++c)
+        {
+            h = splitmix64(h);
+            chosen.insert(static_cast<uint32_t>(h % vocab));
+        }
+        for (uint32_t lb : chosen)
+            sets[i].push_back(std::to_string(lb));
+    }
+    return sets;
+}
+
+// Write a DiskANN string-label file: one comma-separated line of labels per point.
+void write_label_file(const std::string &path, const std::vector<std::vector<std::string>> &sets)
+{
+    std::ofstream out;
+    out.exceptions(std::ios::badbit | std::ios::failbit);
+    out.open(path, std::ios::trunc);
+    for (const auto &s : sets)
+    {
+        for (size_t j = 0; j < s.size(); ++j)
+        {
+            out << s[j];
+            if (j + 1 < s.size())
+                out << ",";
+        }
+        out << "\n";
+    }
+}
+
+bool point_has_label(const std::vector<std::vector<std::string>> &sets, uint32_t id, const std::string &lbl)
+{
+    if (id >= sets.size())
+        return false;
+    const auto &s = sets[id];
+    return std::find(s.begin(), s.end(), lbl) != s.end();
+}
+
+// Cleanup helper for the many sidecar files a filtered legacy index emits.
+struct ScopedFilteredLegacyFiles
+{
+    std::string prefix;
+    explicit ScopedFilteredLegacyFiles(std::string p) : prefix(std::move(p))
+    {
+    }
+    ~ScopedFilteredLegacyFiles()
+    {
+        for (const char *suffix :
+             {"", ".data", ".tags", ".del", "_labels.txt", "_labels_map.txt", "_labels_to_medoids.txt",
+              "_universal_label.txt", "_bitmask_labels.bin", "_integer_labels.bin", "_label_formatted.txt"})
+            std::remove((prefix + suffix).c_str());
+    }
+};
+
+// Build the legacy FILTERED SSD index from an existing filtered Index. The mem
+// graph is saved to <prefix> (not <prefix>_mem.index) so its label sidecars
+// land at <prefix>_labels.txt / _labels_to_medoids.txt / _bitmask_labels.bin,
+// which is exactly where PQFlashIndex::load(prefix) looks. The filtered build
+// must have used save_path_prefix == prefix so <prefix>_labels_map.txt exists.
+void make_legacy_ssd_filtered_from_index(Index<float, uint32_t, uint32_t> &idx, const std::string &data_file,
+                                         const std::string &prefix, uint32_t num_pq_chunks, diskann::Metric metric,
+                                         std::vector<uint8_t> &pq_pivots_bytes, std::vector<uint8_t> &pq_codes_bytes)
+{
+    const std::string mem_index = prefix; // graph file; label sidecars co-locate here
+    const std::string pq_pivots = prefix + "_pq_pivots.bin";
+    const std::string pq_codes = prefix + "_pq_compressed.bin";
+    const std::string disk_index = prefix + "_disk.index";
+
+    idx.save(mem_index.c_str());
+    diskann::generate_quantized_data<float>(data_file, pq_pivots, pq_codes, metric, /*p_val=*/1.0, num_pq_chunks,
+                                            /*use_opq=*/false, /*codebook_prefix=*/"");
+    diskann::create_disk_layout<float>(data_file, mem_index, disk_index);
+
+    pq_pivots_bytes = slurp_all(pq_pivots);
+    pq_codes_bytes = slurp_all(pq_codes);
+}
+
+// Cleanup for the legacy filtered SSD artifacts.
+struct ScopedFilteredLegacySsdFiles
+{
+    std::string prefix;
+    explicit ScopedFilteredLegacySsdFiles(std::string p) : prefix(std::move(p))
+    {
+    }
+    ~ScopedFilteredLegacySsdFiles()
+    {
+        for (const char *suffix :
+             {"", ".data", ".tags", ".del", "_labels.txt", "_labels_map.txt", "_labels_to_medoids.txt",
+              "_universal_label.txt", "_bitmask_labels.bin", "_integer_labels.bin", "_label_formatted.txt",
+              "_disk.index", "_disk.index_medoids.bin", "_disk.index_centroids.bin", "_pq_pivots.bin",
+              "_pq_pivots.bin_centroid.bin", "_pq_pivots.bin_chunk_offsets.bin",
+              "_pq_pivots.bin_rearrangement_perm.bin", "_pq_compressed.bin"})
+            std::remove((prefix + suffix).c_str());
+    }
+};
+
+} // namespace
+
+BOOST_AUTO_TEST_SUITE(unified_parity_tests)
+
+BOOST_AUTO_TEST_CASE(memory_parity_legacy_vs_unified)
+{
+    const uint32_t npts = 10000;
+    const uint32_t dim = 32;
+    const uint32_t nq = 100;
+    const uint32_t R = 32, L = 100, K = 10, search_L = 100;
+
+    ScopedFile data_sf(tmp_path("parity_mem_data"));
+    write_float_bin_parity(data_sf.path, npts, dim, /*seed=*/123);
+
+    // 1) Build one Vamana Index<float> (single-threaded for determinism).
+    auto write_params = std::make_shared<IndexWriteParameters>(
+        IndexWriteParametersBuilder(L, R).with_alpha(1.2f).with_num_threads(1).build());
+    Index<float, uint32_t, uint32_t> idx(diskann::Metric::L2, dim, npts, write_params, nullptr,
+                                         /*num_frozen_pts=*/0, /*dynamic=*/false, /*enable_tags=*/false,
+                                         /*concurrent_consolidate=*/false, /*pq_dist_build=*/false,
+                                         /*num_pq_chunks=*/0, /*use_opq=*/false, /*filtered_index=*/false);
+    idx.build(data_sf.path.c_str(), npts, std::vector<uint32_t>());
+
+    // 2) Emit BOTH formats from the same in-memory graph.
+    ScopedLegacyMemFiles legacy(tmp_path("parity_mem_legacy"));
+    ScopedFile unified_sf(tmp_path("parity_mem_unified"));
+    idx.save(legacy.prefix.c_str());
+    idx.save_unified(unified_sf.path.c_str());
+
+    // 3) Load the legacy memory index from disk.
+    Index<float, uint32_t, uint32_t> legacy_idx(diskann::Metric::L2, dim, npts, write_params, nullptr,
+                                                /*num_frozen_pts=*/0, /*dynamic=*/false, /*enable_tags=*/false,
+                                                /*concurrent_consolidate=*/false, /*pq_dist_build=*/false,
+                                                /*num_pq_chunks=*/0, /*use_opq=*/false, /*filtered_index=*/false);
+    legacy_idx.load(legacy.prefix.c_str(), /*num_threads=*/1, /*search_l=*/search_L);
+
+    // 4) Load the unified memory index from disk.
+    UnifiedLoadContext uctx;
+    uctx.path = unified_sf.path;
+    uctx.num_threads = 1;
+    uctx.search_l = search_L;
+    auto uidx = make_unified_index_memory(uctx);
+    BOOST_REQUIRE(uidx != nullptr);
+    BOOST_REQUIRE_EQUAL(uidx->num_points(), npts);
+
+    // 5) Search the same queries on both; top-K must be identical.
+    double total_overlap = 0.0;
+    size_t exact = 0;
+    for (uint32_t q = 0; q < nq; ++q)
+    {
+        std::vector<float> query(dim);
+        for (uint32_t j = 0; j < dim; ++j)
+            query[j] = det_float(/*seed=*/999, q, j);
+
+        std::vector<uint32_t> legacy_ids(K, 0);
+        std::vector<float> legacy_dists(K, 0.0f);
+        legacy_idx.search<uint32_t>(query.data(), K, search_L, legacy_ids.data(), legacy_dists.data());
+
+        std::vector<uint64_t> uni_ids(K, 0);
+        std::vector<float> uni_dists(K, 0.0f);
+        UnifiedSearchContext sctx;
+        sctx.query = query.data();
+        sctx.K = K;
+        sctx.L = search_L;
+        sctx.indices = uni_ids.data();
+        sctx.distances = uni_dists.data();
+        uidx->search(sctx);
+
+        const double ov = topk_overlap(legacy_ids, uni_ids);
+        total_overlap += ov;
+        if (ov >= 1.0)
+            ++exact;
+    }
+    const double avg_overlap = total_overlap / nq;
+    BOOST_TEST_MESSAGE("memory parity: avg top-" << K << " overlap = " << avg_overlap << ", exact = " << exact << "/"
+                                                 << nq);
+    // EXACT parity is expected: both indices are built from the SAME in-memory
+    // graph, the memory search has no RNG, runs single-threaded, seeds from the
+    // same single medoid, and NeighborPriorityQueue breaks distance ties
+    // deterministically by id (see Neighbor::operator< in include/neighbor.h).
+    // So every query's top-K must be identical.
+    BOOST_CHECK_EQUAL(exact, static_cast<size_t>(nq));
+}
+
+BOOST_AUTO_TEST_CASE(ssd_parity_legacy_vs_unified)
+{
+    const uint32_t npts = 10000;
+    const uint32_t dim = 32;
+    const uint32_t nq = 100;
+    const uint32_t R = 32, L = 100, K = 10, search_L = 100, beam = 4, pq_chunks = 16;
+
+    ScopedFile data_sf(tmp_path("parity_ssd_data"));
+    write_float_bin_parity(data_sf.path, npts, dim, /*seed=*/234);
+
+    // 1) Build one Vamana Index<float>.
+    auto write_params = std::make_shared<IndexWriteParameters>(
+        IndexWriteParametersBuilder(L, R).with_alpha(1.2f).with_num_threads(1).build());
+    Index<float, uint32_t, uint32_t> idx(diskann::Metric::L2, dim, npts, write_params, nullptr,
+                                         /*num_frozen_pts=*/0, /*dynamic=*/false, /*enable_tags=*/false,
+                                         /*concurrent_consolidate=*/false, /*pq_dist_build=*/false,
+                                         /*num_pq_chunks=*/0, /*use_opq=*/false, /*filtered_index=*/false);
+    idx.build(data_sf.path.c_str(), npts, std::vector<uint32_t>());
+
+    // 2) Legacy SSD index from the Index (also returns the shared PQ bytes).
+    ScopedLegacySsdFiles legacy(tmp_path("parity_ssd_legacy"));
+    std::vector<uint8_t> pq_pivots_bytes, pq_codes_bytes;
+    make_legacy_ssd_from_index(idx, data_sf.path, legacy.prefix, pq_chunks, diskann::Metric::L2, pq_pivots_bytes,
+                               pq_codes_bytes);
+
+    // 3) Unified SSD file from the SAME Index + SAME PQ codebook.
+    ScopedFile unified_sf(tmp_path("parity_ssd_unified"));
+    idx.save_unified(unified_sf.path.c_str(), pq_pivots_bytes, pq_codes_bytes);
+
+    // 4) Load the legacy PQFlashIndex.
+    auto legacy_reader = make_reader();
+    PQFlashIndex<float, uint32_t> pfi(legacy_reader, diskann::Metric::L2);
+    const int rc = pfi.load(/*num_threads=*/1, legacy.prefix.c_str());
+    BOOST_REQUIRE_EQUAL(rc, 0);
+
+    // 5) Load the unified SSD index.
+    UnifiedLoadContext uctx;
+    uctx.path = unified_sf.path;
+    uctx.num_threads = 1;
+    uctx.search_l = search_L;
+    auto ureader = make_reader();
+    auto uidx = make_unified_index_ssd(ureader, uctx);
+    BOOST_REQUIRE(uidx != nullptr);
+    BOOST_REQUIRE_EQUAL(uidx->num_points(), npts);
+
+    // 6) Search the same queries on both; top-K must match exactly.
+    double total_overlap = 0.0;
+    size_t exact = 0;
+    for (uint32_t q = 0; q < nq; ++q)
+    {
+        std::vector<float> query(dim);
+        for (uint32_t j = 0; j < dim; ++j)
+            query[j] = det_float(/*seed=*/888, q, j);
+
+        std::vector<uint64_t> legacy_ids(K, 0);
+        std::vector<float> legacy_dists(K, 0.0f);
+        pfi.cached_beam_search(query.data(), K, search_L, legacy_ids.data(), legacy_dists.data(),
+                               static_cast<uint64_t>(beam));
+
+        std::vector<uint64_t> uni_ids(K, 0);
+        std::vector<float> uni_dists(K, 0.0f);
+        UnifiedSearchContext sctx;
+        sctx.query = query.data();
+        sctx.K = K;
+        sctx.L = search_L;
+        sctx.indices = uni_ids.data();
+        sctx.distances = uni_dists.data();
+        sctx.beam_width = beam;
+        uidx->search(sctx);
+
+        const double ov = topk_overlap(legacy_ids, uni_ids);
+        total_overlap += ov;
+        if (ov >= 1.0)
+            ++exact;
+    }
+    const double avg_overlap = total_overlap / nq;
+    BOOST_TEST_MESSAGE("ssd parity: avg top-" << K << " overlap = " << avg_overlap << ", exact = " << exact << "/"
+                                              << nq);
+    // EXACT parity is expected. Both indices share the SAME graph AND the SAME
+    // PQ codebook (the pivot/code bytes are generated once and fed to both), so
+    // the beam search is fully deterministic: no RNG on the search path (the
+    // cache_bfs_levels shuffle is not triggered -- no cache priming), single
+    // thread, one deterministic medoid seed, and NeighborPriorityQueue breaks
+    // distance ties by id. Every query's top-K must be identical.
+    BOOST_CHECK_EQUAL(exact, static_cast<size_t>(nq));
+
+    // WindowsAlignedFileReader keeps the unified file open (its destructor does
+    // not close, unlike PQFlashIndex which calls reader->close()). Release the
+    // index and close the reader so ScopedFile can delete the backing file.
+    uidx.reset();
+    ureader->close();
+}
+
+BOOST_AUTO_TEST_CASE(memory_filtered_parity_legacy_vs_unified)
+{
+    const uint32_t npts = 10000;
+    const uint32_t dim = 32;
+    const uint32_t nq = 100;
+    const uint32_t R = 32, L = 100, K = 10, search_L = 100, vocab = 8;
+
+    ScopedFile data_sf(tmp_path("parity_fmem_data"));
+    write_float_bin_parity(data_sf.path, npts, dim, /*seed=*/321);
+
+    // Simulate random per-point labels and write the label file.
+    const auto label_sets = gen_label_sets(npts, vocab, /*seed=*/55);
+    ScopedFile rawlabels_sf(tmp_path("parity_fmem_rawlabels"));
+    write_label_file(rawlabels_sf.path, label_sets);
+
+    // 1) Build ONE filtered Vamana Index<float>. filter_list_size (Lf) MUST be
+    // set for a filtered build -- it defaults to 0, which makes the filtered
+    // link phase run with an empty search list and crash.
+    auto write_params = std::make_shared<IndexWriteParameters>(
+        IndexWriteParametersBuilder(L, R).with_alpha(1.2f).with_num_threads(1).with_filter_list_size(L).build());
+    Index<float, uint32_t, uint32_t> idx(diskann::Metric::L2, dim, npts, write_params, nullptr,
+                                         /*num_frozen_pts=*/0, /*dynamic=*/false, /*enable_tags=*/false,
+                                         /*concurrent_consolidate=*/false, /*pq_dist_build=*/false,
+                                         /*num_pq_chunks=*/0, /*use_opq=*/false, /*filtered_index=*/true);
+    ScopedFilteredLegacyFiles legacy(tmp_path("parity_fmem_legacy"));
+    {
+        // save_path_prefix == legacy prefix so the labels_map + label files that
+        // build/save emit all co-locate for the subsequent legacy load().
+        IndexFilterParams fp = IndexFilterParamsBuilder()
+                                   .with_label_file(rawlabels_sf.path)
+                                   .with_save_path_prefix(legacy.prefix)
+                                   .build();
+        idx.build(data_sf.path, npts, fp);
+    }
+
+    // 2) Emit both formats from the same filtered graph.
+    ScopedFile unified_sf(tmp_path("parity_fmem_unified"));
+    idx.save(legacy.prefix.c_str());
+    idx.save_unified(unified_sf.path.c_str());
+
+    // 3) Load the legacy filtered index.
+    Index<float, uint32_t, uint32_t> legacy_idx(diskann::Metric::L2, dim, npts, write_params, nullptr,
+                                                /*num_frozen_pts=*/0, /*dynamic=*/false, /*enable_tags=*/false,
+                                                /*concurrent_consolidate=*/false, /*pq_dist_build=*/false,
+                                                /*num_pq_chunks=*/0, /*use_opq=*/false, /*filtered_index=*/true);
+    legacy_idx.load(legacy.prefix.c_str(), /*num_threads=*/1, /*search_l=*/search_L);
+
+    // 4) Load the unified index.
+    UnifiedLoadContext uctx;
+    uctx.path = unified_sf.path;
+    uctx.num_threads = 1;
+    uctx.search_l = search_L;
+    auto uidx = make_unified_index_memory(uctx);
+    BOOST_REQUIRE(uidx != nullptr);
+
+    // get_table_stats() on a real filtered index: node/label cardinality, non-zero label (bitmask) and node memory.
+    const TableStats ust = uidx->get_table_stats();
+    BOOST_CHECK_EQUAL(ust.node_count, npts);
+    BOOST_CHECK_EQUAL(ust.label_count, static_cast<size_t>(vocab));
+    BOOST_CHECK_GT(ust.label_mem_usage, 0u);
+    BOOST_CHECK_GT(ust.node_mem_usage, 0u);
+    // 5) Search the same filtered queries on both and compare the results.
+    double total_overlap = 0.0;
+    size_t exact = 0, nonempty = 0, legacy_bad = 0, uni_bad = 0;
+    for (uint32_t q = 0; q < nq; ++q)
+    {
+        std::vector<float> query(dim);
+        for (uint32_t j = 0; j < dim; ++j)
+            query[j] = det_float(/*seed=*/444, q, j);
+        const std::string flabel = std::to_string(q % vocab);
+
+        std::vector<uint32_t> legacy_ids(K, std::numeric_limits<uint32_t>::max());
+        std::vector<float> legacy_dists(K, 0.0f);
+        std::vector<uint32_t> filter_ints = {legacy_idx.get_converted_label(flabel)};
+        legacy_idx.search_with_filters<uint32_t>(query.data(), filter_ints, K, search_L, /*maxLperSeller=*/0,
+                                                 legacy_ids.data(), legacy_dists.data());
+
+        std::vector<uint64_t> uni_ids(K, std::numeric_limits<uint64_t>::max());
+        std::vector<float> uni_dists(K, 0.0f);
+        UnifiedSearchContext sctx;
+        sctx.query = query.data();
+        sctx.K = K;
+        sctx.L = search_L;
+        sctx.indices = uni_ids.data();
+        sctx.distances = uni_dists.data();
+        sctx.filter_labels = {flabel};
+        uidx->search(sctx);
+
+        // Correctness: every returned point MUST carry the filter label.
+        std::vector<uint32_t> lvalid;
+        std::vector<uint64_t> uvalid;
+        for (uint32_t id : legacy_ids)
+            if (id != std::numeric_limits<uint32_t>::max())
+            {
+                lvalid.push_back(id);
+                if (!point_has_label(label_sets, id, flabel))
+                    ++legacy_bad;
+            }
+        for (uint64_t id : uni_ids)
+            if (id != std::numeric_limits<uint64_t>::max())
+            {
+                uvalid.push_back(id);
+                if (!point_has_label(label_sets, static_cast<uint32_t>(id), flabel))
+                    ++uni_bad;
+            }
+
+        if (!lvalid.empty())
+        {
+            const double ov = topk_overlap(lvalid, uvalid);
+            total_overlap += ov;
+            ++nonempty;
+            if (ov >= 1.0)
+                ++exact;
+        }
+    }
+    const double avg_overlap = nonempty ? total_overlap / nonempty : 1.0;
+    BOOST_TEST_MESSAGE("memory filtered parity: avg overlap = " << avg_overlap << ", exact = " << exact << "/"
+                                                                << nonempty << ", legacy_bad = " << legacy_bad
+                                                                << ", uni_bad = " << uni_bad);
+    // Correctness is exact: no result from either index may violate the filter.
+    BOOST_CHECK_EQUAL(legacy_bad, 0u);
+    BOOST_CHECK_EQUAL(uni_bad, 0u);
+    // Legacy seeds init_ids from the global _start medoid PLUS the per-label
+    // medoids (Index::search_with_filters), while the unified index seeds ONLY
+    // from per-label medoids (a deliberate recall-oriented choice). Results are
+    // therefore expected to be highly similar but not bit-identical, hence the
+    // 0.90 average-overlap threshold rather than an exact match.
+    BOOST_CHECK_GE(avg_overlap, 0.90);
+}
+
+BOOST_AUTO_TEST_CASE(ssd_filtered_parity_legacy_vs_unified)
+{
+    const uint32_t npts = 10000;
+    const uint32_t dim = 32;
+    const uint32_t nq = 100;
+    const uint32_t R = 32, L = 100, K = 10, search_L = 100, beam = 4, pq_chunks = 16, vocab = 8;
+
+    ScopedFile data_sf(tmp_path("parity_fssd_data"));
+    write_float_bin_parity(data_sf.path, npts, dim, /*seed=*/876);
+
+    const auto label_sets = gen_label_sets(npts, vocab, /*seed=*/77);
+    ScopedFile rawlabels_sf(tmp_path("parity_fssd_rawlabels"));
+    write_label_file(rawlabels_sf.path, label_sets);
+
+    // 1) Build ONE filtered Index. save_path_prefix == the SSD prefix so the
+    //    labels_map lands where PQFlashIndex::load expects it.
+    ScopedFilteredLegacySsdFiles legacy(tmp_path("parity_fssd_legacy"));
+    auto write_params = std::make_shared<IndexWriteParameters>(
+        IndexWriteParametersBuilder(L, R).with_alpha(1.2f).with_num_threads(1).with_filter_list_size(L).build());
+    Index<float, uint32_t, uint32_t> idx(diskann::Metric::L2, dim, npts, write_params, nullptr,
+                                         /*num_frozen_pts=*/0, /*dynamic=*/false, /*enable_tags=*/false,
+                                         /*concurrent_consolidate=*/false, /*pq_dist_build=*/false,
+                                         /*num_pq_chunks=*/0, /*use_opq=*/false, /*filtered_index=*/true);
+    {
+        IndexFilterParams fp = IndexFilterParamsBuilder()
+                                   .with_label_file(rawlabels_sf.path)
+                                   .with_save_path_prefix(legacy.prefix)
+                                   .build();
+        idx.build(data_sf.path, npts, fp);
+    }
+
+    // 2) Write the legacy filtered SSD artifacts (+ shared PQ) and the unified
+    //    filtered SSD file from the same Index.
+    std::vector<uint8_t> pq_pivots_bytes, pq_codes_bytes;
+    make_legacy_ssd_filtered_from_index(idx, data_sf.path, legacy.prefix, pq_chunks, diskann::Metric::L2,
+                                        pq_pivots_bytes, pq_codes_bytes);
+    ScopedFile unified_sf(tmp_path("parity_fssd_unified"));
+    idx.save_unified(unified_sf.path.c_str(), pq_pivots_bytes, pq_codes_bytes);
+
+    // 3) Load the legacy filtered PQFlashIndex.
+    auto legacy_reader = make_reader();
+    PQFlashIndex<float, uint32_t> pfi(legacy_reader, diskann::Metric::L2);
+    const int rc = pfi.load(/*num_threads=*/1, legacy.prefix.c_str());
+    BOOST_REQUIRE_EQUAL(rc, 0);
+
+    // 4) Load the unified filtered SSD index.
+    UnifiedLoadContext uctx;
+    uctx.path = unified_sf.path;
+    uctx.num_threads = 1;
+    uctx.search_l = search_L;
+    auto ureader = make_reader();
+    auto uidx = make_unified_index_ssd(ureader, uctx);
+    BOOST_REQUIRE(uidx != nullptr);
+
+    // 5) Search each query under a rotating filter label; correctness + parity.
+    double total_overlap = 0.0;
+    size_t exact = 0, nonempty = 0, legacy_bad = 0, uni_bad = 0;
+    for (uint32_t q = 0; q < nq; ++q)
+    {
+        std::vector<float> query(dim);
+        for (uint32_t j = 0; j < dim; ++j)
+            query[j] = det_float(/*seed=*/222, q, j);
+        const std::string flabel = std::to_string(q % vocab);
+
+        std::vector<uint64_t> legacy_ids(K, std::numeric_limits<uint64_t>::max());
+        std::vector<float> legacy_dists(K, 0.0f);
+        std::vector<uint32_t> filter_ints = {pfi.get_converted_label(flabel)};
+        pfi.cached_beam_search(query.data(), K, search_L, legacy_ids.data(), legacy_dists.data(),
+                               static_cast<uint64_t>(beam), /*use_filter=*/true, filter_ints);
+
+        std::vector<uint64_t> uni_ids(K, std::numeric_limits<uint64_t>::max());
+        std::vector<float> uni_dists(K, 0.0f);
+        UnifiedSearchContext sctx;
+        sctx.query = query.data();
+        sctx.K = K;
+        sctx.L = search_L;
+        sctx.indices = uni_ids.data();
+        sctx.distances = uni_dists.data();
+        sctx.beam_width = beam;
+        sctx.filter_labels = {flabel};
+        uidx->search(sctx);
+
+        std::vector<uint64_t> lvalid, uvalid;
+        for (uint64_t id : legacy_ids)
+            if (id != std::numeric_limits<uint64_t>::max())
+            {
+                lvalid.push_back(id);
+                if (!point_has_label(label_sets, static_cast<uint32_t>(id), flabel))
+                    ++legacy_bad;
+            }
+        for (uint64_t id : uni_ids)
+            if (id != std::numeric_limits<uint64_t>::max())
+            {
+                uvalid.push_back(id);
+                if (!point_has_label(label_sets, static_cast<uint32_t>(id), flabel))
+                    ++uni_bad;
+            }
+
+        if (!lvalid.empty())
+        {
+            const double ov = topk_overlap(lvalid, uvalid);
+            total_overlap += ov;
+            ++nonempty;
+            if (ov >= 1.0)
+                ++exact;
+        }
+    }
+    const double avg_overlap = nonempty ? total_overlap / nonempty : 1.0;
+    BOOST_TEST_MESSAGE("ssd filtered parity: avg overlap = " << avg_overlap << ", exact = " << exact << "/" << nonempty
+                                                             << ", legacy_bad = " << legacy_bad
+                                                             << ", uni_bad = " << uni_bad);
+    // Correctness: no result may violate the filter.
+    BOOST_CHECK_EQUAL(legacy_bad, 0u);
+    BOOST_CHECK_EQUAL(uni_bad, 0u);
+    // For a single filter label per query, the legacy filtered SSD search seeds
+    // ONLY from that label's medoid (no global seed), the same as the unified
+    // index, so with the shared graph + PQ the results should be highly similar.
+    BOOST_CHECK_GE(avg_overlap, 0.90);
+
+    uidx.reset();
+    ureader->close();
+}
+
+BOOST_AUTO_TEST_SUITE_END()
+
+// ===========================================================================
+// Suite 10: get_table_stats()
+// ===========================================================================
+
+BOOST_AUTO_TEST_SUITE(unified_table_stats_tests)
+
+BOOST_AUTO_TEST_CASE(memory_stats_unfiltered)
+{
+    ScopedFile sf(tmp_path("stats_mem"));
+    const uint64_t npts = 64, aligned_dim = 16;
+    std::vector<std::vector<float>> points;
+    write_star_graph_unified(sf.path, npts, aligned_dim, points);
+
+    UnifiedLoadContext ctx;
+    ctx.path = sf.path;
+    auto idx = make_unified_index_memory(ctx);
+    BOOST_REQUIRE(idx != nullptr);
+
+    const TableStats st = idx->get_table_stats();
+    BOOST_CHECK_EQUAL(st.node_count, npts);
+    BOOST_CHECK_EQUAL(st.label_count, 0u); // unfiltered
+    BOOST_CHECK_EQUAL(st.label_mem_usage, 0u);
+    BOOST_CHECK_GT(st.node_mem_usage, 0u);
+    // The memory backend keeps the full graph region resident, so the total is
+    // non-zero and must equal the sum of its parts.
+    BOOST_CHECK_GT(st.total_mem_usage, 0u);
+    BOOST_CHECK_EQUAL(st.total_mem_usage, st.node_mem_usage + st.graph_mem_usage + st.label_mem_usage +
+                                              st.tag_memory_usage);
+}
+
+BOOST_AUTO_TEST_CASE(ssd_stats_pq_codes)
+{
+    ScopedFile data_sf(tmp_path("stats_ssd_data"));
+    const uint32_t npts = 512, dim = 16, pq_dim = 4;
+    write_float_bin_parity(data_sf.path, npts, dim, /*seed=*/5);
+
+    ScopedFile out_sf(tmp_path("stats_ssd_out"));
+    UnifiedBuildContext bctx;
+    bctx.data_file_path = data_sf.path;
+    bctx.output_path = out_sf.path;
+    bctx.data_type = DataTypeTag::Float;
+    bctx.metric = diskann::Metric::L2;
+    bctx.R = 16;
+    bctx.L = 32;
+    bctx.alpha = 1.2f;
+    bctx.num_threads = 1;
+    bctx.pq_dim = pq_dim;
+    bctx.pq_sampling_rate = 1.0;
+    unified_index_builder().build(bctx);
+
+    UnifiedLoadContext ctx;
+    ctx.path = out_sf.path;
+    ctx.num_threads = 1;
+    auto reader = make_reader();
+    auto idx = make_unified_index_ssd(reader, ctx);
+    BOOST_REQUIRE(idx != nullptr);
+
+    const TableStats st = idx->get_table_stats();
+    BOOST_CHECK_EQUAL(st.node_count, npts);
+    // SSD node_mem_usage == resident PQ codes == npts * pq_dim (n_chunks); the
+    // graph stays on disk, so graph_mem_usage is 0.
+    BOOST_CHECK_EQUAL(st.node_mem_usage, static_cast<size_t>(npts) * pq_dim);
+    BOOST_CHECK_EQUAL(st.graph_mem_usage, 0u);
+    BOOST_CHECK_EQUAL(st.total_mem_usage, st.node_mem_usage + st.graph_mem_usage + st.label_mem_usage +
+                                              st.tag_memory_usage);
+
+    idx.reset();
+    reader->close();
+}
+
+BOOST_AUTO_TEST_SUITE_END()