Full Unicode case-mapping parity for case-insensitive field matching #808

Description

@goel-skd

Context

Issue #613 asks for case-insensitive field matching consistent with iceberg-java and iceberg-python (both Unicode-aware), with İ (U+0130) as the example. PR #760 (Part of #613) made StringUtils::ToLower Unicode-aware using utf8proc simple (1:1) case mapping and added allocation-free ASCII fast paths.

This issue captures the design and remaining plan to reach full parity and tracks the follow-up PRs.

Remaining gap

utf8proc's simple mapping still diverges from Java and Python for the few code points where the simple and full case mappings differ, chiefly İ (U+0130):

input | iceberg-cpp (simple) | iceberg-java toLowerCase(Locale.ROOT) / Python str.lower()
İ (U+0130) | i (U+0069) | i̇ (U+0069 U+0307)

So EqualsIgnoreCase("İD", "id") is true in iceberg-cpp, while the equivalent lowercase comparison is false in Java and Python. This is the inconsistency #613 is about.
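
The divergence can be reproduced on the Python side with a short sketch. `full_lower` is just `str.lower()`. `simple_lower` is a hypothetical stand-in for a 1:1 mapping (it is not utf8proc) and assumes that keeping the first code point of each per-character result mimics the simple mapping, which holds for U+0130.

```python
def full_lower(text):
    """Full lowercase mapping, as implemented by Python's str.lower()."""
    return text.lower()


def simple_lower(text):
    """Hypothetical approximation of a simple (1:1) lowercase mapping.

    Assumption: truncating each per-character full mapping to its first
    code point matches the simple mapping. This holds for U+0130.
    """
    return "".join(ch.lower()[:1] for ch in text)


def code_points(text):
    """Format each code point of text as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in text]


dotted_id = "\u0130D"  # "İD"

print(code_points(full_lower("\u0130")))    # ['U+0069', 'U+0307']
print(code_points(simple_lower("\u0130")))  # ['U+0069']

# Simple mapping treats "İD" and "id" as equal; full mapping does not.
print(simple_lower(dotted_id) == simple_lower("id"))  # True
print(full_lower(dotted_id) == full_lower("id"))      # False
```

The last two lines mirror the two outcomes in the table: the simple mapping collapses İ to a plain i, while the full mapping keeps the combining dot (U+0307) and so the strings stay distinct.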

Design questions

  • Match iceberg-java toLowerCase(Locale.ROOT) exactly; confirm the operation PyIceberg uses for matching and that it agrees.
  • Full lowercase mapping vs. Unicode case folding: utf8proc offers full case folding (utf8proc_map + UTF8PROC_CASEFOLD); verify that it reproduces the Java/Python lowercase result for every affected code point, or add a small explicit mapping.
  • Keep the ASCII fast path; stream the non-ASCII path rather than materialize.
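
On the lowercase-vs-case-folding question, the following sketch shows how the two operations relate in Python. It says nothing about utf8proc's own output, which still needs to be checked separately; `compare` is a hypothetical helper name. The point is that case folding can agree with lowercasing for İ and still disagree elsewhere, for example on ß (U+00DF).

```python
def code_points(text):
    """Format each code point of text as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in text]


def compare(text):
    """Hypothetical helper: code points under str.lower() and str.casefold()."""
    return {
        "lower": code_points(text.lower()),
        "casefold": code_points(text.casefold()),
    }


# U+0130 (dotted capital I), U+00DF (sharp s), and a plain ASCII name.
for sample in ["\u0130", "\u00DF", "ID"]:
    result = compare(sample)
    agrees = result["lower"] == result["casefold"]
    print(code_points(sample), result, "agree" if agrees else "differ")
```

In Python, ß is left unchanged by `str.lower()` but folds to "ss" under `str.casefold()`, so a matcher built on case folding would treat some names as equal that a lowercase-based matcher keeps distinct. If utf8proc's folding behaves similarly, a small explicit mapping on top of the simple lowercase path may be the closer fit for toLowerCase(Locale.ROOT) parity.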

Work Items

References
