fix(engine): scalar index coverage + filter literal coercion (query latency)#216
Conversation
`literal_to_expr` lowered `Date`/`DateTime` query literals as Utf8 strings, relying on DataFusion implicit casts. Against a physical `Date32`/`Date64` column that can coerce the column side (`CAST(col AS Utf8)`), which defeats a scalar BTREE and degrades the scan to a full filtered read. Lower to typed `Date32`/`Date64` scalars instead (reusing the loader's `parse_date32_literal`/`parse_date64_literal`, already used by the in-memory comparison arm), so the predicate stays a direct column comparison and the index is used. Malformed literals fall back to the Utf8 string so pushdown behavior never regresses. Tests: unit goldens asserting the lowered literal is typed (red before, green after) + inline-binding pushdown equality in literal_filters confirming the epoch conversion selects the right rows.
…columns `build_indices_on_dataset_for_catalog` only handled `String` (-> FTS) and `Vector` (-> vector). Enums are physically `String`, so an enum `@index` column (e.g. `status`) got an FTS inverted index, which Lance never consults for `=`; and `DateTime`/`Date`/numeric/`Bool` `@index` columns fell through and built nothing. Both meant equality/range filters degraded to full scans with `indices_loaded=0`. Dispatch index kind by property type via a shared `node_prop_index_kind`: enum + orderable scalar -> BTREE, free-text String -> FTS, Vector -> vector, list/Blob -> none. The helper is shared by the builder and `needs_index_work_node` so they cannot drift — the latter decides recovery- sidecar pinning, and under-reporting would leave a HEAD-advancing index build uncovered (invariant 5). Tests: scalar_indexes.rs asserts enum/DateTime/numeric @index columns report `IndexCoverage::Indexed` while free-text String/un-annotated columns stay `Degraded` (negative control). Docs: docs/user/indexes.md.
A scalar/FTS/vector index only covers the fragments it was built over. Rows appended after the build (e.g. `ingest --mode merge`, whose commit does not rebuild an existing index) are scanned unindexed, and `compact_files` rewrites fragments out of coverage. Nothing folded them back in, so coverage decayed as the graph grew — even the id/src/dst BTREEs that power traversal. `optimize_one_table` now runs Lance `optimize_indices` after `compact_files` (incremental merge, not retrain — the same compact->optimize_indices sequence LanceDB's `optimize()` uses) and enters the publish path on compaction work OR stale index coverage (new `TableStore::has_unindexed_fragments`, reusing the fragment_bitmap logic). `optimize_indices` is a committing call with no uncommitted variant in lance-6.0.1, so it is an inline-commit residual covered by the existing `SidecarKind::Optimize` recovery sidecar spanning both ops. Blob-bearing tables are still skipped (the Lance blob-compaction bug is compaction-specific; reindex-for-blob deferred as a noted follow-up). Tests: maintenance.rs asserts an appended fragment is uncovered before and covered after optimize, and idempotency holds (second pass is a no-op). lance_surface_guards pins the `optimize_indices` signature and its incremental- coverage behavior. The existing optimize Phase-B recovery failpoint now also exercises a crash after reindex. Docs: maintenance.md, writes.md, invariants.md, lance.md, AGENTS.md.
Filter literals were pushed to Lance in their natural Arrow type (every integer Int64, every float Float64). Against a narrower indexed column DataFusion widens to the literal's type and casts the COLUMN (`CAST(n32 AS Int64)`), which defeats the scalar BTREE and degrades to a full filtered read. A physical-plan probe confirms it: an Int32 column filtered by an i32 literal uses `ScalarIndexQuery`; by an i64 literal it does not. Thread the scan's `arrow_schema` through `build_lance_filter_expr` -> `ir_filter_to_expr` and coerce each literal operand to the opposite column's exact Arrow type, reusing `projection::literal_to_array` + `arrow_cast` (the same path the in-memory arm uses, so the two arms agree). Coercion never demotes a filter to None: on failure it falls back to the natural literal, because a node scan has no in-memory fallback for inline filters. Supersedes the date-specific change in e4ef67b (PR1): the probe shows dates were never index-defeated — temporal coercion casts the LITERAL, not the column — so PR1's index-use rationale was wrong though harmless. The generic coercion subsumes it; `literal_to_expr`'s date arms revert to the natural Utf8 fallback, and its unit tests now assert the live coerced path. Tests: surface guard `scalar_index_use_requires_matched_literal_type` pins the substrate behavior (matched -> index, widened -> column-cast full scan); unit tests cover Int32/UInt32/Float32 coercion, range op, reversed operand order, and the natural fallback; `literal_filters` adds an I32 column with equality + range and an F32 pushdown case.
There was a problem hiding this comment.
ragnorc has reached the 50-review limit for trial accounts. To continue receiving code reviews, upgrade your plan.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f0641210a3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| let casted = if arr.data_type() == target { | ||
| arr | ||
| } else { | ||
| arrow_cast::cast::cast(&arr, target).ok()? | ||
| }; |
There was a problem hiding this comment.
Avoid truncating float filters for integer columns
When a valid numeric comparison uses a Float literal against an integer column, e.g. match { $m: Metric { count: 2.7 } } with count: I32, this column-typed coercion casts the Float64 literal to Int32 before Lance sees it. Arrow numeric casts truncate fractional floats, so the pushed predicate becomes count = 2 and can return rows that should not match; because NodeScan filters are only pushed down, there is no later in-memory predicate to restore the original comparison. Please only apply exact/widening coercions here or fall back to the natural literal when the cast would be lossy.
Useful? React with 👍 / 👎.
| ds.optimize_indices(&OptimizeOptions::default()) | ||
| .await | ||
| .map_err(|e| OmniError::Lance(e.to_string()))?; | ||
| .map_err(|e| OmniError::Lance(format!("optimize_indices on {}: {}", table_key, e)))?; | ||
| let version_after = ds.version().version; | ||
| let committed = version_after != version_before; |
There was a problem hiding this comment.
optimize_indices called unconditionally even when needs_reindex = false
When will_compact = true but needs_reindex = false, the function enters the sidecar path, compaction runs and rewrites fragments (dropping them from existing index coverage), and then optimize_indices folds them back in — this sequencing is correct. However, if the table has no non-system indexes at all, optimize_indices still makes a Lance API call (a no-op) on every compaction cycle. A guard checking !ds.load_indices().await?.iter().any(|i| !is_system_index(i)) before calling optimize_indices would avoid the spurious round-trip, though this is a latency concern only.
| @@ -398,11 +407,26 @@ async fn optimize_one_table( | |||
| let handle = | |||
| crate::db::manifest::write_sidecar(db.root_uri(), db.storage_adapter(), &sidecar).await?; | |||
|
|
|||
| // Phase B: compaction (reserve-fragments + rewrite commits advance HEAD). | |||
| // Phase B: compaction (if any) then incremental index optimize — both | |||
| // advance Lance HEAD inside the sidecar window. `compact_files` rewrites | |||
| // fragments and drops them from existing index segments' coverage; | |||
| // `optimize_indices` folds the rewritten and any previously-unindexed | |||
| // fragments back in (Lance's incremental merge, not a full retrain). This | |||
| // is the same compact -> optimize_indices sequencing LanceDB's `optimize()` | |||
| // uses. `optimize_indices` is an inline-commit residual: lance-6.0.1 | |||
| // exposes no uncommitted variant, so like `compact_files` it commits | |||
| // directly and relies on the sidecar for recovery. | |||
| let version_before = ds.version().version; | |||
| let metrics: CompactionMetrics = compact_files(&mut ds, options, None) | |||
| let metrics: CompactionMetrics = if will_compact { | |||
| compact_files(&mut ds, options, None) | |||
| .await | |||
| .map_err(|e| OmniError::Lance(e.to_string()))? | |||
| } else { | |||
| CompactionMetrics::default() | |||
| }; | |||
| ds.optimize_indices(&OptimizeOptions::default()) | |||
| .await | |||
| .map_err(|e| OmniError::Lance(e.to_string()))?; | |||
| .map_err(|e| OmniError::Lance(format!("optimize_indices on {}: {}", table_key, e)))?; | |||
| let version_after = ds.version().version; | |||
| let committed = version_after != version_before; | |||
There was a problem hiding this comment.
Sidecar written without guaranteed HEAD advance on the reindex-only path
The original code gated sidecar creation on plan.num_tasks() > 0, guaranteeing compact_files advances Lance HEAD at least once — so a stranded sidecar (Phase D cleanup failure) always recovers forward, not as NoMovement. The new needs_reindex entry point doesn't carry that guarantee: if a concurrent optimize already ran optimize_indices between the has_unindexed_fragments check and the call here, optimize_indices commits nothing, committed = false, Phase C is skipped, and the sidecar is left on disk if Phase D cleanup fails.
Recovery classifies this as "not yet committed" (dataset at V < post_commit_pin V+1) and takes the rollback path (Dataset::restore(V) + manifest publish V). Since V is the current version the restore is a data no-op, but the manifest CAS can conflict if concurrent writes advanced the table in the window — requiring manual omnigraph repair.
The literal coercion in f064121 narrowed unconditionally. typecheck permits numeric cross-type comparisons (`types_compatible`), so an out-of-domain literal reaches `literal_to_typed_expr` and casts lossily: a fractional float vs an integer column truncates (`{ count: 2.7 }` -> `count = 2`, wrongly matching the count=2 row) and an out-of-range integer overflows to null (`count < 3e9` on I32 -> `count < NULL` -> empty). Both silently change results, and a node scan has no in-memory fallback for inline filters. Add a lossless guard for integer targets: round-trip the cast back to the natural type and, on mismatch, return None so the caller keeps the natural literal (correct via DataFusion coercion; the index is just unused for that out-of-domain predicate). Float targets stay coerced -- narrowing F64 -> F32 is the column's own precision domain, not a value error. Resolves the two valid review findings on PR #216 (Codex float truncation, Greptile out-of-range). Tests: unit cases for fractional/out-of-range fallback vs whole-float/in-range coerce vs F32 exemption; e2e `{ count: 2.7 }` returns no rows.
|
Addressed the automated review on
|
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 3627b4e. Configure here.
| if original != round_tripped { | ||
| return None; | ||
| } | ||
| } |
There was a problem hiding this comment.
Integer literals lose float precision
Medium Severity
In literal_to_typed_expr, the lossless round-trip guard runs only when the comparison column is an integer type. Integer query literals are built as Int64 and, when the column is Float32 or Float64, are cast to that float without checking that the value survives round-trip. Integers with magnitude above float mantissa limits (e.g. beyond 2^24 on Float32) can change during coercion, so equality filters on indexed float columns can match the wrong rows instead of falling back to the natural literal.
Reviewed by Cursor Bugbot for commit 3627b4e. Configure here.
…-investigation # Conflicts: # docs/dev/invariants.md # docs/user/maintenance.md


What & why
A query-latency investigation found that reads on indexed columns were full-scanning. This closes the gaps end to end: scalar/enum/
DateTime@indexcolumns now get a usable BTREE,optimizekeeps index coverage current as data is appended, and filter literals are coerced to the column's exact Arrow type so the persisted scalar index is actually used instead of being defeated by a column-side cast.Backing issue / RFC
Maintainer change from the internal query-latency (Pulse) investigation; no public issue/RFC. Reviewed under the maintainer internal process.
Checklist
docs/user/indexes.md,docs/user/maintenance.md, plus dev docs)Notes for reviewers
Three related changes: (1)
build_indices_on_dataset_for_catalogdispatches index kind by type via a sharednode_prop_index_kind(enum + orderable scalar → BTREE, free-text String → FTS), fixing enums being mis-routed to FTS andDateTime/numeric building nothing; (2)optimizeruns Lanceoptimize_indicesaftercompact_filesto fold appended/rewritten fragments back into existing indexes — an inline-commit residual under the existingSidecarKind::Optimizerecovery sidecar (Lance exposes no uncommitted variant); (3) pushdown filter literals coerce to the column type, which supersedes the earlier date-literal commit's rationale (a physical-plan probe showed dates were never index-defeated — temporal coercion casts the literal, not the column — so only narrow-numeric columns were affected). Deferred follow-ups: blob-table reindex (blocked on the upstream Lance blob-compaction bug) and bitmap (vs BTREE) for low-cardinality enums.Note
Medium Risk
Changes query filter lowering and maintenance’s inline
optimize_indicespath under recovery sidecars; behavior is heavily tested but a reindex-only optimize with no HEAD advance can leave a sidecar until recovery, which reviewers should sanity-check.Overview
Fixes indexed reads that were effectively full-scanning by aligning index creation, maintenance, and pushdown on the same rules.
Index dispatch:
@index/@keynode properties now sharenode_prop_index_kindfor bothensure_indicesand sidecar pinning—enums and orderable scalars (DateTime, numeric,Bool) get BTREE; free-textStringstays FTS; vectors unchanged.needs_index_work_nodeandbuild_indices_on_dataset_for_cataloguse the same mapping so coverage checks match what gets built.optimize: After optionalcompact_files, each table runs Lanceoptimize_indices(inline-commit, still underSidecarKind::Optimize). Tables with stale fragment coverage but no compaction plan still enter the sidecar/publish path viaTableStore::has_unindexed_fragments, so merge-appended rows get folded into existing indexes.Query pushdown: Node scans pass the catalog
arrow_schemaintobuild_lance_filter_expr/ir_filter_to_expr, coercing literals to each column’s Arrow type (literal_to_typed_expr+ sharedliteral_to_array) so narrow BTREEs are not bypassed by column-side casts; integer coercion uses a lossless round-trip guard for fractional/out-of-range values.User/dev docs, capability matrix, and integration/surface guards (
optimize_indices, scalar-index literal typing, newscalar_indexes/ maintenance / literal filter tests) are updated accordingly.Reviewed by Cursor Bugbot for commit a3148f6. Bugbot is set up for automated code reviews on this repo. Configure here.
Greptile Summary
This PR closes three gaps that caused indexed node reads to full-scan:
@index/@keycolumns now get the right Lance index (BTREE for enums/orderable scalars, FTS for free-text Strings) via a sharednode_prop_index_kindhelper;optimizerunsoptimize_indicesafter compaction to fold appended fragments back into existing indexes; and filter literals are coerced to each column's exact Arrow type so narrow-numeric BTREEs are consulted rather than defeated by a DataFusion column-side cast.node_prop_index_kind):build_indices_on_dataset_for_catalogandneeds_index_work_nodeshare a single enum dispatch, fixing enums being mis-routed to FTS and DateTime/numeric columns building nothing.optimizereindex pass:optimize_one_tablecheckshas_unindexed_fragmentsand callsds.optimize_indicesafter compaction (or alone when only coverage is stale); the existingSidecarKind::Optimizesidecar spans both operations for crash recovery.literal_to_typed_expr): node scans and hydrate-nodes pass the Arrow schema intobuild_lance_filter_expr; literals are cast to the column type with a lossless round-trip guard for integer targets, preventing silent truncation and preserving BTREE eligibility.Confidence Score: 5/5
Safe to merge — no new correctness or safety issues found beyond what prior rounds already flagged; the logic, tests, and docs are well-aligned.
The three changed paths (index dispatch, optimize reindex, filter coercion) are each covered by new integration and unit tests and by Lance surface guards that will turn red on a bad upstream bump. The integer round-trip lossless guard is correct and exercised. No new defects were found in this round beyond the recovery-sidecar concerns already captured in prior review threads.
optimize.rs warrants the most scrutiny — the reindex-only entry point (needs_reindex=true, will_compact=false) relaxes the original guarantee that compaction always advances Lance HEAD before the sidecar is written; the prior thread on this file covers the implications in detail.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A[optimize_one_table] --> B{has blob columns?} B -- yes --> C[return Skipped::BlobColumnsUnsupportedByLance] B -- no --> D{HEAD > manifest uncovered drift?} D -- yes --> E[return Skipped::DriftNeedsRepair] D -- no --> F[plan_compaction] F --> G{will_compact OR needs_reindex?} G -- neither --> H[return no-op compacted] G -- at least one --> I[Phase A: write recovery sidecar post_commit_pin = expected_version+1] I --> J{will_compact?} J -- yes --> K[compact_files reserve+rewrite commits] J -- no --> L[CompactionMetrics::default] K --> M[ds.optimize_indices fold uncovered fragments into existing indexes] L --> M M --> N{committed = version_after != version_before?} N -- yes --> O[Phase C: publish new version to manifest] N -- no --> P[skip manifest publish] O --> Q[Phase D: delete sidecar] P --> QComments Outside Diff (2)
docs/dev/lance.md, line 1189 (link)The PR adds three new guards (Guard 14
_compile_optimize_indices_signature, Guard 15optimize_indices_extends_fragment_coverage, Guard 16scalar_index_use_requires_matched_literal_type) but the updated summary sentence only names two. Guard 16 — which pins the core substrate assumption that the literal-coercion fix relies on (BTREE is consulted only when the literal type matches the column type) — is missing from the count and description. Future Lance bumpers relying on this doc to find the right test to re-run first will miss it.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
crates/omnigraph/src/db/omnigraph/table_ops.rs, line 148-163 (link)ScalarType::Stringin two match arms — guard ordering is load-bearingString if !is_enumcatches free-text strings; plainStringin the BTREE OR-arm catches enum strings via fall-through. The code is correct, but adding an arm between them or removing the guard would silently route enums back to FTS. Consolidating into oneScalarType::Stringarm with an innerif is_enumbranch would make the logic self-documenting and robust against future arm reordering.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Reviews (7): Last reviewed commit: "Merge remote-tracking branch 'origin/mai..." | Re-trigger Greptile