Skip to content

fix(engine): scalar index coverage + filter literal coercion (query latency)#216

Merged
ragnorc merged 10 commits into
mainfrom
ragnorc/query-latency-investigation
Jun 14, 2026
Merged

fix(engine): scalar index coverage + filter literal coercion (query latency)#216
ragnorc merged 10 commits into
mainfrom
ragnorc/query-latency-investigation

Conversation

@ragnorc

@ragnorc ragnorc commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

What & why

A query-latency investigation found that reads on indexed columns were full-scanning. This closes the gaps end to end: scalar/enum/DateTime @index columns now get a usable BTREE, optimize keeps index coverage current as data is appended, and filter literals are coerced to the column's exact Arrow type so the persisted scalar index is actually used instead of being defeated by a column-side cast.

Backing issue / RFC

  • Fixes an accepted issue: Closes #
  • Implements / is an accepted RFC: <link to docs/rfcs/NNNN-*.md>
  • Trivial fast-lane — no issue/RFC required

Maintainer change from the internal query-latency (Pulse) investigation; no public issue/RFC. Reviewed under the maintainer internal process.

Checklist

  • Change is focused (one logical theme: make indexed reads actually use their indexes)
  • Tests added/updated for behavior changes
  • Public docs updated (docs/user/indexes.md, docs/user/maintenance.md, plus dev docs)
  • Reviewed against docs/dev/invariants.md — §1 (staged writes), §5 (recovery sidecar), §7 (indexes derived state) all preserved

Notes for reviewers

Three related changes: (1) build_indices_on_dataset_for_catalog dispatches index kind by type via a shared node_prop_index_kind (enum + orderable scalar → BTREE, free-text String → FTS), fixing enums being mis-routed to FTS and DateTime/numeric building nothing; (2) optimize runs Lance optimize_indices after compact_files to fold appended/rewritten fragments back into existing indexes — an inline-commit residual under the existing SidecarKind::Optimize recovery sidecar (Lance exposes no uncommitted variant); (3) pushdown filter literals coerce to the column type, which supersedes the earlier date-literal commit's rationale (a physical-plan probe showed dates were never index-defeated — temporal coercion casts the literal, not the column — so only narrow-numeric columns were affected). Deferred follow-ups: blob-table reindex (blocked on the upstream Lance blob-compaction bug) and bitmap (vs BTREE) for low-cardinality enums.


Note

Medium Risk
Changes query filter lowering and maintenance’s inline optimize_indices path under recovery sidecars; behavior is heavily tested but a reindex-only optimize with no HEAD advance can leave a sidecar until recovery, which reviewers should sanity-check.

Overview
Fixes indexed reads that were effectively full-scanning by aligning index creation, maintenance, and pushdown on the same rules.

Index dispatch: @index/@key node properties now share node_prop_index_kind for both ensure_indices and sidecar pinning—enums and orderable scalars (DateTime, numeric, Bool) get BTREE; free-text String stays FTS; vectors unchanged. needs_index_work_node and build_indices_on_dataset_for_catalog use the same mapping so coverage checks match what gets built.

optimize: After optional compact_files, each table runs Lance optimize_indices (inline-commit, still under SidecarKind::Optimize). Tables with stale fragment coverage but no compaction plan still enter the sidecar/publish path via TableStore::has_unindexed_fragments, so merge-appended rows get folded into existing indexes.

Query pushdown: Node scans pass the catalog arrow_schema into build_lance_filter_expr / ir_filter_to_expr, coercing literals to each column’s Arrow type (literal_to_typed_expr + shared literal_to_array) so narrow BTREEs are not bypassed by column-side casts; integer coercion uses a lossless round-trip guard for fractional/out-of-range values.

User/dev docs, capability matrix, and integration/surface guards (optimize_indices, scalar-index literal typing, new scalar_indexes / maintenance / literal filter tests) are updated accordingly.

Reviewed by Cursor Bugbot for commit a3148f6. Bugbot is set up for automated code reviews on this repo. Configure here.

Greptile Summary

This PR closes three gaps that caused indexed node reads to full-scan: @index/@key columns now get the right Lance index (BTREE for enums/orderable scalars, FTS for free-text Strings) via a shared node_prop_index_kind helper; optimize runs optimize_indices after compaction to fold appended fragments back into existing indexes; and filter literals are coerced to each column's exact Arrow type so narrow-numeric BTREEs are consulted rather than defeated by a DataFusion column-side cast.

  • Index dispatch (node_prop_index_kind): build_indices_on_dataset_for_catalog and needs_index_work_node share a single enum dispatch, fixing enums being mis-routed to FTS and DateTime/numeric columns building nothing.
  • optimize reindex pass: optimize_one_table checks has_unindexed_fragments and calls ds.optimize_indices after compaction (or alone when only coverage is stale); the existing SidecarKind::Optimize sidecar spans both operations for crash recovery.
  • Filter literal coercion (literal_to_typed_expr): node scans and hydrate-nodes pass the Arrow schema into build_lance_filter_expr; literals are cast to the column type with a lossless round-trip guard for integer targets, preventing silent truncation and preserving BTREE eligibility.

Confidence Score: 5/5

Safe to merge — no new correctness or safety issues found beyond what prior rounds already flagged; the logic, tests, and docs are well-aligned.

The three changed paths (index dispatch, optimize reindex, filter coercion) are each covered by new integration and unit tests and by Lance surface guards that will turn red on a bad upstream bump. The integer round-trip lossless guard is correct and exercised. No new defects were found in this round beyond the recovery-sidecar concerns already captured in prior review threads.

optimize.rs warrants the most scrutiny — the reindex-only entry point (needs_reindex=true, will_compact=false) relaxes the original guarantee that compaction always advances Lance HEAD before the sidecar is written; the prior thread on this file covers the implications in detail.

Important Files Changed

Filename Overview
crates/omnigraph/src/db/omnigraph/table_ops.rs Adds node_prop_index_kind helper to dispatch @index/@key columns to BTREE (enums, orderable scalars), FTS (free-text Strings), or vector index; updates needs_index_work_node and build_indices_on_dataset_for_catalog to use it, fixing enum to FTS misrouting and DateTime/numeric building nothing.
crates/omnigraph/src/db/omnigraph/optimize.rs Extends optimize_one_table to check has_unindexed_fragments and call optimize_indices after compaction; the reindex-only path (no compaction) now enters the sidecar window without the original guarantee that at least one HEAD advance will occur.
crates/omnigraph/src/exec/query.rs Adds schema-aware literal coercion to build_lance_filter_expr and ir_filter_to_expr via literal_to_typed_expr with a lossless round-trip guard for integer targets; node scans and hydrate_nodes now pass the Arrow schema so BTREEs are consulted instead of being defeated by column-side casts.
crates/omnigraph/src/table_store.rs Adds has_unindexed_fragments that inspects fragment_bitmap on non-system indices to detect rows appended after index build; correctly mirrors the None-bitmap treatment from key_column_index_coverage.
crates/omnigraph/tests/lance_surface_guards.rs Adds three new Lance surface guards: Guard 14 (compile-only signature pin for optimize_indices), Guard 15 (runtime: optimize_indices extends fragment coverage), Guard 16 (runtime: BTREE is consulted only when literal type matches column type).

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[optimize_one_table] --> B{has blob columns?}
    B -- yes --> C[return Skipped::BlobColumnsUnsupportedByLance]
    B -- no --> D{HEAD > manifest uncovered drift?}
    D -- yes --> E[return Skipped::DriftNeedsRepair]
    D -- no --> F[plan_compaction]
    F --> G{will_compact OR needs_reindex?}
    G -- neither --> H[return no-op compacted]
    G -- at least one --> I[Phase A: write recovery sidecar post_commit_pin = expected_version+1]
    I --> J{will_compact?}
    J -- yes --> K[compact_files reserve+rewrite commits]
    J -- no --> L[CompactionMetrics::default]
    K --> M[ds.optimize_indices fold uncovered fragments into existing indexes]
    L --> M
    M --> N{committed = version_after != version_before?}
    N -- yes --> O[Phase C: publish new version to manifest]
    N -- no --> P[skip manifest publish]
    O --> Q[Phase D: delete sidecar]
    P --> Q
Loading

Comments Outside Diff (2)

  1. docs/dev/lance.md, line 1189 (link)

    P2 Guard 16 omitted from the surface-guard count update

    The PR adds three new guards (Guard 14 _compile_optimize_indices_signature, Guard 15 optimize_indices_extends_fragment_coverage, Guard 16 scalar_index_use_requires_matched_literal_type) but the updated summary sentence only names two. Guard 16 — which pins the core substrate assumption that the literal-coercion fix relies on (BTREE is consulted only when the literal type matches the column type) — is missing from the count and description. Future Lance bumpers relying on this doc to find the right test to re-run first will miss it.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

    Fix in Claude Code

  2. crates/omnigraph/src/db/omnigraph/table_ops.rs, line 148-163 (link)

    P2 ScalarType::String in two match arms — guard ordering is load-bearing

    String if !is_enum catches free-text strings; plain String in the BTREE OR-arm catches enum strings via fall-through. The code is correct, but adding an arm between them or removing the guard would silently route enums back to FTS. Consolidating into one ScalarType::String arm with an inner if is_enum branch would make the logic self-documenting and robust against future arm reordering.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

    Fix in Claude Code

Reviews (7): Last reviewed commit: "Merge remote-tracking branch 'origin/mai..." | Re-trigger Greptile

ragnorc added 5 commits June 13, 2026 18:42
`literal_to_expr` lowered `Date`/`DateTime` query literals as Utf8 strings,
relying on DataFusion implicit casts. Against a physical `Date32`/`Date64`
column that can coerce the column side (`CAST(col AS Utf8)`), which defeats a
scalar BTREE and degrades the scan to a full filtered read. Lower to typed
`Date32`/`Date64` scalars instead (reusing the loader's
`parse_date32_literal`/`parse_date64_literal`, already used by the in-memory
comparison arm), so the predicate stays a direct column comparison and the
index is used. Malformed literals fall back to the Utf8 string so pushdown
behavior never regresses.

Tests: unit goldens asserting the lowered literal is typed (red before, green
after) + inline-binding pushdown equality in literal_filters confirming the
epoch conversion selects the right rows.
…columns

`build_indices_on_dataset_for_catalog` only handled `String` (-> FTS) and
`Vector` (-> vector). Enums are physically `String`, so an enum `@index`
column (e.g. `status`) got an FTS inverted index, which Lance never consults
for `=`; and `DateTime`/`Date`/numeric/`Bool` `@index` columns fell through
and built nothing. Both meant equality/range filters degraded to full scans
with `indices_loaded=0`.

Dispatch index kind by property type via a shared `node_prop_index_kind`:
enum + orderable scalar -> BTREE, free-text String -> FTS, Vector -> vector,
list/Blob -> none. The helper is shared by the builder and
`needs_index_work_node` so they cannot drift — the latter decides recovery-
sidecar pinning, and under-reporting would leave a HEAD-advancing index build
uncovered (invariant 5).

Tests: scalar_indexes.rs asserts enum/DateTime/numeric @index columns report
`IndexCoverage::Indexed` while free-text String/un-annotated columns stay
`Degraded` (negative control). Docs: docs/user/indexes.md.
A scalar/FTS/vector index only covers the fragments it was built over. Rows
appended after the build (e.g. `ingest --mode merge`, whose commit does not
rebuild an existing index) are scanned unindexed, and `compact_files` rewrites
fragments out of coverage. Nothing folded them back in, so coverage decayed as
the graph grew — even the id/src/dst BTREEs that power traversal.

`optimize_one_table` now runs Lance `optimize_indices` after `compact_files`
(incremental merge, not retrain — the same compact->optimize_indices sequence
LanceDB's `optimize()` uses) and enters the publish path on compaction work OR
stale index coverage (new `TableStore::has_unindexed_fragments`, reusing the
fragment_bitmap logic). `optimize_indices` is a committing call with no
uncommitted variant in lance-6.0.1, so it is an inline-commit residual covered
by the existing `SidecarKind::Optimize` recovery sidecar spanning both ops.
Blob-bearing tables are still skipped (the Lance blob-compaction bug is
compaction-specific; reindex-for-blob deferred as a noted follow-up).

Tests: maintenance.rs asserts an appended fragment is uncovered before and
covered after optimize, and idempotency holds (second pass is a no-op).
lance_surface_guards pins the `optimize_indices` signature and its incremental-
coverage behavior. The existing optimize Phase-B recovery failpoint now also
exercises a crash after reindex. Docs: maintenance.md, writes.md, invariants.md,
lance.md, AGENTS.md.
Filter literals were pushed to Lance in their natural Arrow type (every integer
Int64, every float Float64). Against a narrower indexed column DataFusion widens
to the literal's type and casts the COLUMN (`CAST(n32 AS Int64)`), which defeats
the scalar BTREE and degrades to a full filtered read. A physical-plan probe
confirms it: an Int32 column filtered by an i32 literal uses `ScalarIndexQuery`;
by an i64 literal it does not.

Thread the scan's `arrow_schema` through `build_lance_filter_expr` ->
`ir_filter_to_expr` and coerce each literal operand to the opposite column's
exact Arrow type, reusing `projection::literal_to_array` + `arrow_cast` (the same
path the in-memory arm uses, so the two arms agree). Coercion never demotes a
filter to None: on failure it falls back to the natural literal, because a node
scan has no in-memory fallback for inline filters.

Supersedes the date-specific change in e4ef67b (PR1): the probe shows dates were
never index-defeated — temporal coercion casts the LITERAL, not the column — so
PR1's index-use rationale was wrong though harmless. The generic coercion
subsumes it; `literal_to_expr`'s date arms revert to the natural Utf8 fallback,
and its unit tests now assert the live coerced path.

Tests: surface guard `scalar_index_use_requires_matched_literal_type` pins the
substrate behavior (matched -> index, widened -> column-cast full scan); unit
tests cover Int32/UInt32/Float32 coercion, range op, reversed operand order, and
the natural fallback; `literal_filters` adds an I32 column with equality + range
and an F32 pushdown case.
@ragnorc ragnorc requested a review from aaltshuler as a code owner June 13, 2026 19:05

@greptile-apps greptile-apps Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ragnorc has reached the 50-review limit for trial accounts. To continue receiving code reviews, upgrade your plan.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f0641210a3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread crates/omnigraph/src/exec/query.rs Outdated
Comment on lines +2123 to +2127
let casted = if arr.data_type() == target {
arr
} else {
arrow_cast::cast::cast(&arr, target).ok()?
};

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid truncating float filters for integer columns

When a valid numeric comparison uses a Float literal against an integer column, e.g. match { $m: Metric { count: 2.7 } } with count: I32, this column-typed coercion casts the Float64 literal to Int32 before Lance sees it. Arrow numeric casts truncate fractional floats, so the pushed predicate becomes count = 2 and can return rows that should not match; because NodeScan filters are only pushed down, there is no later in-memory predicate to restore the original comparison. Please only apply exact/widening coercions here or fall back to the natural literal when the cast would be lossy.

Useful? React with 👍 / 👎.

Comment on lines +427 to 431
ds.optimize_indices(&OptimizeOptions::default())
.await
.map_err(|e| OmniError::Lance(e.to_string()))?;
.map_err(|e| OmniError::Lance(format!("optimize_indices on {}: {}", table_key, e)))?;
let version_after = ds.version().version;
let committed = version_after != version_before;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 optimize_indices called unconditionally even when needs_reindex = false

When will_compact = true but needs_reindex = false, the function enters the sidecar path, compaction runs and rewrites fragments (dropping them from existing index coverage), and then optimize_indices folds them back in — this sequencing is correct. However, if the table has no non-system indexes at all, optimize_indices still makes a Lance API call (a no-op) on every compaction cycle. A guard checking !ds.load_indices().await?.iter().any(|i| !is_system_index(i)) before calling optimize_indices would avoid the spurious round-trip, though this is a latency concern only.

Fix in Claude Code

Comment on lines 380 to 431
@@ -398,11 +407,26 @@ async fn optimize_one_table(
let handle =
crate::db::manifest::write_sidecar(db.root_uri(), db.storage_adapter(), &sidecar).await?;

// Phase B: compaction (reserve-fragments + rewrite commits advance HEAD).
// Phase B: compaction (if any) then incremental index optimize — both
// advance Lance HEAD inside the sidecar window. `compact_files` rewrites
// fragments and drops them from existing index segments' coverage;
// `optimize_indices` folds the rewritten and any previously-unindexed
// fragments back in (Lance's incremental merge, not a full retrain). This
// is the same compact -> optimize_indices sequencing LanceDB's `optimize()`
// uses. `optimize_indices` is an inline-commit residual: lance-6.0.1
// exposes no uncommitted variant, so like `compact_files` it commits
// directly and relies on the sidecar for recovery.
let version_before = ds.version().version;
let metrics: CompactionMetrics = compact_files(&mut ds, options, None)
let metrics: CompactionMetrics = if will_compact {
compact_files(&mut ds, options, None)
.await
.map_err(|e| OmniError::Lance(e.to_string()))?
} else {
CompactionMetrics::default()
};
ds.optimize_indices(&OptimizeOptions::default())
.await
.map_err(|e| OmniError::Lance(e.to_string()))?;
.map_err(|e| OmniError::Lance(format!("optimize_indices on {}: {}", table_key, e)))?;
let version_after = ds.version().version;
let committed = version_after != version_before;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Sidecar written without guaranteed HEAD advance on the reindex-only path

The original code gated sidecar creation on plan.num_tasks() > 0, guaranteeing compact_files advances Lance HEAD at least once — so a stranded sidecar (Phase D cleanup failure) always recovers forward, not as NoMovement. The new needs_reindex entry point doesn't carry that guarantee: if a concurrent optimize already ran optimize_indices between the has_unindexed_fragments check and the call here, optimize_indices commits nothing, committed = false, Phase C is skipped, and the sidecar is left on disk if Phase D cleanup fails.

Recovery classifies this as "not yet committed" (dataset at V < post_commit_pin V+1) and takes the rollback path (Dataset::restore(V) + manifest publish V). Since V is the current version the restore is a data no-op, but the manifest CAS can conflict if concurrent writes advanced the table in the window — requiring manual omnigraph repair.

Fix in Claude Code

Comment thread crates/omnigraph/src/exec/query.rs
The literal coercion in f064121 narrowed unconditionally. typecheck permits
numeric cross-type comparisons (`types_compatible`), so an out-of-domain literal
reaches `literal_to_typed_expr` and casts lossily: a fractional float vs an
integer column truncates (`{ count: 2.7 }` -> `count = 2`, wrongly matching the
count=2 row) and an out-of-range integer overflows to null (`count < 3e9` on I32
-> `count < NULL` -> empty). Both silently change results, and a node scan has no
in-memory fallback for inline filters.

Add a lossless guard for integer targets: round-trip the cast back to the natural
type and, on mismatch, return None so the caller keeps the natural literal
(correct via DataFusion coercion; the index is just unused for that out-of-domain
predicate). Float targets stay coerced -- narrowing F64 -> F32 is the column's own
precision domain, not a value error.

Resolves the two valid review findings on PR #216 (Codex float truncation, Greptile
out-of-range). Tests: unit cases for fractional/out-of-range fallback vs
whole-float/in-range coerce vs F32 exemption; e2e `{ count: 2.7 }` returns no rows.
@ragnorc

ragnorc commented Jun 13, 2026

Copy link
Copy Markdown
Contributor Author

Addressed the automated review on cf14bf8:

  • Codex P2 (float truncation) — fixed. A fractional float against an integer column was truncating in the pushdown coercion ({ count: 2.7 }count = 2, wrongly matching). literal_to_typed_expr now round-trips the cast for integer targets and falls back to the natural literal on mismatch, so count = 2.7 matches no integer (added an e2e test for exactly this, plus unit cases).
  • Greptile P2 (out-of-range integer) — fixed by the same guard. An out-of-range integer no longer overflows to null (which made count < 3e9 on I32 return empty); it falls back to the natural Int64 literal (correct via DataFusion, index unused for that out-of-domain predicate). Float targets stay coerced — narrowing F64F32 is the column's own precision domain, not a value error.
  • Greptile P1 (reindex-only sidecar NoMovement) — valid edge, not blocking. The per-(table, main) write queue serializes the check-and-act intra-process, so the race is cross-process only (already documented one-winner-CAS territory) and additionally needs a Phase-D delete failure plus a concurrent manifest advance during recovery; the outcome is a manual omnigraph repair, not corruption. Tracking as a known-gap follow-up (optional recovery hardening: skip the manifest publish when HEAD == manifest).
  • Greptile P2 (spurious optimize_indices call) — declined. A table only enters the work block when will_compact || needs_reindex; any such table has rows and thus an id BTREE, so the "no non-system indexes" no-op is unreachable. The suggested load_indices guard would cost about as much as the no-op it avoids.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 3627b4e. Configure here.

if original != round_tripped {
return None;
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Integer literals lose float precision

Medium Severity

In literal_to_typed_expr, the lossless round-trip guard runs only when the comparison column is an integer type. Integer query literals are built as Int64 and, when the column is Float32 or Float64, are cast to that float without checking that the value survives round-trip. Integers with magnitude above float mantissa limits (e.g. beyond 2^24 on Float32) can change during coercion, so equality filters on indexed float columns can match the wrong rows instead of falling back to the natural literal.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 3627b4e. Configure here.

ragnorc and others added 2 commits June 14, 2026 10:36
@ragnorc ragnorc merged commit 1bed998 into main Jun 14, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant