Empty-string ("") handling for ordered encrypted text is undefined / inconsistent between eql_v2 and eql_v3 #262

@tobyhede

Summary

Encrypting the empty string "" as ordered encrypted text produces an empty ORE term (ob: []) and an empty bloom filter (bf: []). The comparison behaviour of an empty ORE term is undefined and inconsistent across EQL versions. This needs a deliberate decision: is "" a supported plaintext for ordered encrypted text, or is it out of scope?

Surfaced while adding eql_v3.text (#260), where "" was used as the SQLx matrix "zero" pivot and broke ordering, aggregates, and comparison counts.

Findings

EQL v2 (main) — no coverage, but defensive handling exists

  • v2 has zero test coverage of "" for encrypted text. The ORE-text fixture (tests/sqlx/migrations/006_install_ore_text_data.sql) is 100 real words; the smallest is 'aardvark'. No fixture ever has an empty ob.
  • v2 does defensively handle empty term arrays: eql_v2.compare_ore_block_u64_8_256_terms documents "empty arrays sort before non-empty arrays" and returns -1 for empty-vs-non-empty. This path is never exercised by any test.
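
The documented v2 guard can be modelled as a check that runs before any term-by-term comparison. The following is a minimal Rust sketch of that rule, not the actual SQL implementation: `Term` and `compare_terms` are hypothetical names, plain byte order stands in for the real ORE comparison, and the both-empty and non-empty-vs-empty results are assumptions (only "empty-vs-non-empty returns -1" is stated above).

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for one ORE block term. Real terms are ciphertext
/// compared by the ORE scheme, not by plain byte order.
type Term = Vec<u8>;

/// Compare two term arrays, applying the "empty sorts first" guard
/// before falling through to the term-by-term comparison.
fn compare_terms(a: &[Term], b: &[Term]) -> Ordering {
    match (a.is_empty(), b.is_empty()) {
        (true, true) => Ordering::Equal,    // assumption: two empty arrays are equal
        (true, false) => Ordering::Less,    // the documented -1 case
        (false, true) => Ordering::Greater, // assumption: the symmetric case
        (false, false) => a.cmp(b),         // placeholder for the real ORE comparison
    }
}
```

A fixture row with an empty ob would be the only way to exercise the first three arms, which is why the current 100-word fixture never reaches them.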

EQL v3 (eql_v3.text, #260) — diverges from v2

  • The v3 SEM ORE fork does not reproduce v2's empty-array guard. With "" in the fixtures, empty ob orders as the maximum, not the minimum:
    • eql_v3.max(eql_v3.text_ord) returns the "" payload instead of the real max ("zzzz").
    • payload::eql_v3.text_ord > '' returns 0 rows (expected: all non-empty values).
    • 'zzzz' > payload counts are off by one (the "" row is silently dropped).
    • count_distinct over ord_term fails with "function … returned NULL" on the empty term.

So v2 says "empty sorts first" (untested), while v3 effectively sorts it last, and does so inconsistently. Neither behaviour is validated end-to-end.
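
To make the difference concrete, here is a plaintext model in Rust of how the placement of the empty value changes max and "greater than" results. It is a sketch under assumptions: `EmptyPolicy`, `compare_with`, `max_with`, and `count_greater_than` are hypothetical names, ordinary string order stands in for ORE comparison, and "Last" is only one reading of the v3 symptoms listed above.

```rust
use std::cmp::Ordering;

/// Where the empty value is placed relative to non-empty values.
#[derive(Clone, Copy)]
enum EmptyPolicy {
    First, // the behaviour v2 documents
    Last,  // an ordering consistent with the reported v3 symptoms
}

/// Plaintext model of the comparison: only the empty case is policy-driven.
fn compare_with(policy: EmptyPolicy, a: &str, b: &str) -> Ordering {
    match (a.is_empty(), b.is_empty(), policy) {
        (true, true, _) => Ordering::Equal,
        (true, false, EmptyPolicy::First) | (false, true, EmptyPolicy::Last) => Ordering::Less,
        (true, false, EmptyPolicy::Last) | (false, true, EmptyPolicy::First) => Ordering::Greater,
        (false, false, _) => a.cmp(b),
    }
}

/// Maximum row under the given policy.
fn max_with<'a>(policy: EmptyPolicy, rows: &[&'a str]) -> Option<&'a str> {
    rows.iter().copied().max_by(|a, b| compare_with(policy, a, b))
}

/// Number of rows strictly greater than `pivot` under the given policy.
fn count_greater_than(policy: EmptyPolicy, rows: &[&str], pivot: &str) -> usize {
    rows.iter()
        .filter(|row| compare_with(policy, row, pivot) == Ordering::Greater)
        .count()
}
```

With rows "", "apple", and "zzzz", the First policy gives a max of "zzzz" and two rows greater than "", while the Last policy gives a max of "" and zero rows greater than "", matching the first two symptoms in the list.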

Decision needed

  1. Is "" (and other degenerate/too-short-to-tokenize plaintext) a supported value for ordered encrypted text?

    • If yes: the v3 SEM ORE comparison must define and implement empty-term ordering (mirror v2's "empty sorts first"), with explicit fixtures/tests covering it across _eq / _ord / _ord_ore and min/max. The match (bf: []) empty-set semantics should also be pinned (everything contains the empty filter; the empty filter contains nothing).
    • If no: document the constraint (minimum/at-least-one-ngram plaintext), and decide where it's enforced (proxy / client / EQL).
  2. Reconcile the v2↔v3 ORE empty-array divergence regardless of (1), so the two schemas don't disagree on a payload either might receive.
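
The empty-set semantics proposed for match terms in (1) follow from treating containment as a subset test. A minimal Rust sketch, assuming a filter can be modelled as the set of its set bit positions (`BloomBits`, `bits`, and `bloom_contains` are hypothetical names, not EQL functions):

```rust
use std::collections::BTreeSet;

/// Hypothetical model of a match term: the set of bit positions set in the filter.
type BloomBits = BTreeSet<u16>;

/// Build a filter model from a list of bit positions.
fn bits(positions: &[u16]) -> BloomBits {
    positions.iter().copied().collect()
}

/// `haystack` contains `needle` when every bit of `needle` is also set in `haystack`.
fn bloom_contains(haystack: &BloomBits, needle: &BloomBits) -> bool {
    needle.is_subset(haystack)
}
```

Under this model every filter contains the empty filter, and the empty filter contains no non-empty filter. It also makes the empty filter contain itself, which is an edge case the decision would need to confirm or rule out.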

Immediate mitigation (in #260)

PR #260 will drop "" from the eql_v3.text fixtures and use real non-empty values (mirroring v2's "real word" convention, with a short real token as the smallest value). It will also replace the matrix's Default::default() zero-pivot with an overridable ScalarType::zero_pivot(), so that text can supply a real mid value. That unblocks the PR; this issue tracks the underlying behavioural decision and the v2/v3 divergence, which outlive #260.
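
One way the overridable pivot could look is a trait method with a default body. This is a sketch only: the name ScalarType::zero_pivot comes from the text above, but the trait bounds, the `Text` wrapper, and the "mango" pivot value are assumptions made for illustration, not the code in #260.

```rust
/// Hypothetical shape of the test-matrix trait. Only the name
/// `ScalarType::zero_pivot` is taken from the issue text.
trait ScalarType: Default {
    /// Pivot value used by the comparison matrix.
    /// Falls back to `Default::default()` unless a type overrides it.
    fn zero_pivot() -> Self {
        Self::default()
    }
}

/// Numeric types can keep the default: 0 is a valid, orderable pivot.
impl ScalarType for i64 {}

/// Illustrative wrapper for encrypted-text plaintext in the matrix.
#[derive(Default, Debug, PartialEq)]
struct Text(String);

/// Text overrides the pivot so the matrix never uses "" as its zero value.
impl ScalarType for Text {
    fn zero_pivot() -> Self {
        Text("mango".to_string()) // illustrative non-empty mid value
    }
}
```

The default method keeps existing scalar types unchanged, while the override keeps the empty string out of the text matrix until its ordering is decided.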
