feat(string_util): make ToLower Unicode-aware via utf8proc (2/2)#760

Open
goel-skd wants to merge 6 commits into
apache:mainfrom
goel-skd:feat-613-unicode-lowercase

@goel-skd

@goel-skd goel-skd commented Jun 19, 2026

Contributor

Replaces the ASCII-only StringUtils::ToLower with a Unicode-aware implementation backed by utf8proc, so case-insensitive name handling matches Iceberg Java's toLowerCase(Locale.ROOT).

  • ToLower lower-cases UTF-8 using utf8proc simple (1:1) case mapping (e.g. CAFÉ → café, GROẞE → große). Pure-ASCII input takes a byte-wise fast path and never touches utf8proc. The function is total: a byte that does not begin a valid UTF-8 sequence is passed through unchanged and decoding resumes at the next byte, so the valid code points around it are still lower-cased.
  • EqualsIgnoreCase and StartsWithIgnoreCase fold through ToLower, so they are case-insensitive for non-ASCII letters too. Both take an allocation-free ASCII fast path and fall back to ToLower only when a non-ASCII byte is present.
  • StartsWithIgnoreCase no longer assumes case mapping preserves byte length — the old byte-slice guard mishandled length-changing maps (e.g. İ U+0130 → i). Now "İx" starts with "i", and "i" starts with "İ".
  • ToUpper is intentionally left ASCII-only — it only normalizes ASCII enum/codec strings, and simple case mapping would be wrong for some letters (e.g. ß would stay unchanged instead of becoming SS).
  • utf8proc is pinned to 2.10.0 and wired into both the CMake (vendored via FetchContent / system package) and Meson (subprojects/utf8proc.wrap) builds, with matching source hashes and correct static-vs-shared linkage on Windows.
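
To make the ToLower behavior described above concrete, here is a minimal, self-contained sketch. It is not the PR's implementation: `SimpleLower` is a hypothetical stand-in for utf8proc's simple lower-case lookup that covers only the code points mentioned in this description, and `DecodeOne`, `AppendUtf8`, and `ToLowerSketch` are illustrative names.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical stand-in for utf8proc's simple (1:1) lower-case mapping.
// Covers ASCII, Latin-1 capitals, and the code points used as examples here.
inline char32_t SimpleLower(char32_t cp) {
  if (cp >= U'A' && cp <= U'Z') return cp + 32;
  if (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7) return cp + 32;  // e.g. É -> é
  if (cp == 0x0130) return U'i';    // İ -> i (length-changing in UTF-8)
  if (cp == 0x1E9E) return 0x00DF;  // ẞ -> ß
  return cp;
}

// Decodes one code point at s[i]. Returns its byte length, or 0 if s[i] does
// not begin a valid sequence (stray continuation, truncated, overlong,
// surrogate, or out of range).
inline std::size_t DecodeOne(std::string_view s, std::size_t i, char32_t* out) {
  const unsigned char b0 = static_cast<unsigned char>(s[i]);
  std::size_t len = 0;
  char32_t cp = 0, min = 0;
  if (b0 < 0x80) { *out = b0; return 1; }
  if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
  else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
  else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
  else return 0;
  if (s.size() - i < len) return 0;
  for (std::size_t k = 1; k < len; ++k) {
    const unsigned char b = static_cast<unsigned char>(s[i + k]);
    if ((b & 0xC0) != 0x80) return 0;
    cp = (cp << 6) | (b & 0x3F);
  }
  if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
  *out = cp;
  return len;
}

// Appends the UTF-8 encoding of a valid code point.
inline void AppendUtf8(char32_t cp, std::string* out) {
  if (cp < 0x80) {
    out->push_back(static_cast<char>(cp));
  } else if (cp < 0x800) {
    out->push_back(static_cast<char>(0xC0 | (cp >> 6)));
    out->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else if (cp < 0x10000) {
    out->push_back(static_cast<char>(0xE0 | (cp >> 12)));
    out->push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else {
    out->push_back(static_cast<char>(0xF0 | (cp >> 18)));
    out->push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
    out->push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  }
}

inline std::string ToLowerSketch(std::string_view s) {
  // ASCII fast path: no decoding when every byte is below 0x80.
  bool ascii = true;
  for (char c : s) {
    if (static_cast<unsigned char>(c) >= 0x80) { ascii = false; break; }
  }
  if (ascii) {
    std::string out(s);
    for (char& c : out) {
      if (c >= 'A' && c <= 'Z') c = static_cast<char>(c - 'A' + 'a');
    }
    return out;
  }
  // Slow path: decode, map, re-encode. Invalid bytes are copied through one
  // at a time and decoding resumes at the next byte.
  std::string out;
  out.reserve(s.size());
  for (std::size_t i = 0; i < s.size();) {
    char32_t cp = 0;
    const std::size_t len = DecodeOne(s, i, &cp);
    if (len == 0) { out.push_back(s[i]); ++i; continue; }
    AppendUtf8(SimpleLower(cp), &out);
    i += len;
  }
  return out;
}
```

With this sketch, the bytes for "İx" (`"\xC4\xB0x"`) come out as `"ix"`, and an invalid byte such as `0xFF` between two ASCII letters is preserved while the letters around it are still lower-cased.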

Testing

  • string_util_test.cc: ToLowerUnicode (incl. invalid / truncated / stray-continuation bytes), ToUpperAsciiOnly, and Unicode + invalid-UTF-8 cases for EqualsIgnoreCase / StartsWithIgnoreCase.
  • IgnoreCaseAgreesWithToLowerOracle exhaustively checks both ASCII fast paths against the ToLower oracle over all short strings drawn from an alphabet with length-changing maps (İ → i, K → k) and invalid UTF-8.
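
The exhaustive oracle check can be pictured roughly as follows. This is only a sketch of the idea, not the test in string_util_test.cc: `ForEachShortString` and `CountDisagreements` are hypothetical helpers, and the two predicates stand in for a fast-path comparison and its ToLower-based oracle.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Calls visit(s) for every concatenation of at most max_len alphabet entries,
// including the empty string. Entries are byte strings, so an alphabet can mix
// ASCII letters, multi-byte code points, and deliberately invalid bytes.
inline void ForEachShortString(
    const std::vector<std::string>& alphabet, std::size_t max_len,
    const std::function<void(const std::string&)>& visit,
    const std::string& prefix = "", std::size_t depth = 0) {
  visit(prefix);
  if (depth == max_len) return;
  for (const std::string& piece : alphabet) {
    ForEachShortString(alphabet, max_len, visit, prefix + piece, depth + 1);
  }
}

using Predicate = std::function<bool(const std::string&, const std::string&)>;

// Counts the (a, b) pairs on which the fast path and the oracle disagree.
// A correct fast path yields zero.
inline std::size_t CountDisagreements(const std::vector<std::string>& alphabet,
                                      std::size_t max_len,
                                      const Predicate& fast,
                                      const Predicate& oracle) {
  std::size_t bad = 0;
  ForEachShortString(alphabet, max_len, [&](const std::string& a) {
    ForEachShortString(alphabet, max_len, [&](const std::string& b) {
      if (fast(a, b) != oracle(a, b)) ++bad;
    });
  });
  return bad;
}
```

Because the alphabet is small and the strings are short, the pair space stays tiny (an alphabet of 3 entries with `max_len` 2 gives 13 strings and 169 pairs), which is what makes an exhaustive comparison practical.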

Follow-up

  • The non-ASCII fallback of the comparisons still materializes lowercased strings; a streaming (allocation-free) version is deferred to a follow-up PR.

Part of #613.
Follow-up to #748.
Precursor to #808.

@goel-skd goel-skd force-pushed the feat-613-unicode-lowercase branch from bec8884 to 69cc006 Compare June 19, 2026 01:40
Replace the ASCII-only ToLower with utf8proc simple case mapping so
case-insensitive name handling matches Iceberg Java's
toLowerCase(Locale.ROOT). ToUpper stays ASCII-only since it is not used
for name matching. EqualsIgnoreCase now compares lowercased forms.

Wire utf8proc into both the CMake (vendored/system) and Meson builds.

See apache#613.
@goel-skd goel-skd force-pushed the feat-613-unicode-lowercase branch from 69cc006 to f42e2da Compare June 19, 2026 02:13
Comment thread cmake_modules/IcebergThirdpartyToolchain.cmake

@wgtmac wgtmac left a comment

Member

I haven't checked the implementation and test yet. Just post some thoughts around APIs.

Comment thread src/iceberg/util/string_util.h Outdated
@goel-skd

Contributor Author

> I haven't checked the implementation and test yet. Just post some thoughts around APIs.

Thanks much @wgtmac. I responded to your comments.

goel-skd added a commit to goel-skd/iceberg-cpp that referenced this pull request Jun 26, 2026
ToLower: note it uses Unicode simple (1:1) case mapping and document where
it diverges from Java's full toLowerCase(Locale.ROOT) (e.g. U+0130). ToUpper:
spell out the ASCII-only behavior and why no Unicode variant is provided.
Also document EqualsIgnoreCase inheriting ToLower's mapping.

Addresses API review comments on apache#760.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@goel-skd goel-skd force-pushed the feat-613-unicode-lowercase branch from c692240 to 9ffebc3 Compare June 26, 2026 00:56
@goel-skd goel-skd requested a review from wgtmac June 26, 2026 01:11
Comment thread src/iceberg/util/string_util.h
goel-skd added 2 commits June 27, 2026 11:27
The byte-slice in StartsWithIgnoreCase (str.substr(0, prefix.size()) before
lowercasing) is wrong when ToLower changes byte length: "İ" (U+0130) is two
bytes but lower-cases to "i", so "İx" should match prefix "i" but does not.
This test pins that behavior; it fails against the current implementation and
is fixed by the following commit.

Relates to apache#760.

Compare the ToLower forms of both inputs instead of byte-slicing str to
prefix.size() before lowercasing. ToLower can change a string's byte length
(e.g. "İ" U+0130 lower-cases to "i"), so the old slice could split a code
point or wrongly reject a valid match. Makes the regression test from the
previous commit pass.

Relates to apache#760.
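
The difference between the two commits above can be sketched as follows. This is a reconstruction under assumptions, not the code from the PR: `ToyLower` is a hypothetical stand-in for ToLower that handles only ASCII and "İ" (U+0130, bytes C4 B0), and `StartsWithSliced` / `StartsWithLowered` are illustrative names for the old and new shapes.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Toy stand-in for ToLower: handles ASCII plus "İ" (U+0130, bytes C4 B0),
// which lower-cases to the single byte "i". Everything else is copied as-is.
inline std::string ToyLower(std::string_view s) {
  std::string out;
  for (std::size_t i = 0; i < s.size(); ++i) {
    if (s[i] == '\xC4' && i + 1 < s.size() && s[i + 1] == '\xB0') {
      out.push_back('i');
      ++i;  // consume the second byte of the code point
    } else if (s[i] >= 'A' && s[i] <= 'Z') {
      out.push_back(static_cast<char>(s[i] - 'A' + 'a'));
    } else {
      out.push_back(s[i]);
    }
  }
  return out;
}

// Old shape: slice str to prefix.size() bytes, then lower-case and compare.
// The slice can cut a code point in half or reject a shorter-but-matching str.
inline bool StartsWithSliced(std::string_view str, std::string_view prefix) {
  if (str.size() < prefix.size()) return false;
  return ToyLower(str.substr(0, prefix.size())) == ToyLower(prefix);
}

// New shape: lower-case both inputs first, then test the prefix.
inline bool StartsWithLowered(std::string_view str, std::string_view prefix) {
  const std::string s = ToyLower(str);
  const std::string p = ToyLower(prefix);
  return s.size() >= p.size() && s.compare(0, p.size(), p) == 0;
}
```

For `str = "\xC4\xB0x"` ("İx") and `prefix = "i"`, the sliced version lower-cases the lone byte `0xC4` and fails to match, while the lowered version compares "ix" against "i" and succeeds.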
@goel-skd goel-skd requested a review from wgtmac June 27, 2026 15:33
Comment thread src/iceberg/util/string_util.h
/// Intended for case-insensitive name matching, similar to Iceberg Java's
/// toLowerCase(Locale.ROOT). The mapping is locale-independent, matching the intent
/// of Locale.ROOT. It uses simple (1:1) case mapping rather than Java's full case
/// mapping, so results differ for a few code points; e.g. U+0130 (capital I with dot

Member

If they don't match, how can iceberg-cpp read an identifier path containing such characters that was written by Java?

Comment thread src/iceberg/util/string_util.h
@goel-skd

Contributor Author

Thanks for all the review here. Since this is a shared util it's picked up a bunch of good ideas, but some of them expand the scope, so I want to draw a line on what lands here vs. what we track separately.

What I'll do in this PR:

  • Add ASCII fast-paths to ToLower, EqualsIgnoreCase, and StartsWithIgnoreCase. If the inputs are pure ASCII (no byte >= 0x80) we just do the cheap byte-wise thing with an early-out and no allocation, and only fall back to utf8proc when there's actually a non-ASCII byte. That keeps the common case (modes, UUIDs, header/property names, enum-ish strings) as fast as it was before, which is the regression @wgtmac and @zhjwpku registered concerns about.
  • Reword the ToLower doc comment so the behavior and the small divergence from Java are actually clear (@manuzhang's point).
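
One way the allocation-free ASCII fast path for EqualsIgnoreCase could look is sketched below. The function name and the `std::optional<bool>` return convention are assumptions made for illustration; `std::nullopt` means "a non-ASCII byte was seen, use the ToLower-based path instead".

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

inline bool IsAsciiByte(char c) { return static_cast<unsigned char>(c) < 0x80; }

inline char AsciiLower(char c) {
  return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Returns true/false when the answer can be decided from ASCII bytes alone,
// or std::nullopt when a non-ASCII byte is reached and the caller must fall
// back to the Unicode-aware comparison. Never allocates.
inline std::optional<bool> EqualsIgnoreCaseAsciiFastPath(std::string_view a,
                                                         std::string_view b) {
  const std::size_t n = a.size() < b.size() ? a.size() : b.size();
  for (std::size_t i = 0; i < n; ++i) {
    if (!IsAsciiByte(a[i]) || !IsAsciiByte(b[i])) return std::nullopt;
    // Everything before i was ASCII and equal, so a mismatch here is final.
    if (AsciiLower(a[i]) != AsciiLower(b[i])) return false;
  }
  if (a.size() == b.size()) return true;
  // Lengths differ: stay conservative and defer to the Unicode path if the
  // leftover tail contains any non-ASCII byte.
  const std::string_view tail = a.size() > b.size() ? a.substr(n) : b.substr(n);
  for (char c : tail) {
    if (!IsAsciiByte(c)) return std::nullopt;
  }
  return false;
}
```

A caller would use the result when it is present and otherwise compare the ToLower forms, so pure-ASCII inputs such as modes, UUIDs, and property names never reach the slower path.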

On the simple vs. full case mapping question (@manuzhang): reading Java-written tables isn't affected, since field names are stored verbatim in metadata and not lower-cased. The simple/full difference only shows up in case-insensitive name resolution, and only for a few code points like U+0130. For ASCII and the vast majority of letters they're identical. Matching Java bit-for-bit on those edge cases is a bigger change and I'd rather not block this PR on it.

Two things I'd punt to follow-up issues:

  • Having ToLower return Result<std::string> so it can reject invalid UTF-8 instead of passing it through. That changes the contract for every current caller, so it deserves its own PR.
  • Bit-for-bit parity with Java's toLowerCase(Locale.ROOT) on the code points where Java does full case mapping (e.g. U+0130 İ -> i followed by U+0307, the combining dot above). We're using simple 1:1 lowercasing here, which was a deliberate choice in the design doc (full-mapping parity was a non-goal): it's what makes ß/GROẞE match Java, whereas casefold would diverge there. So the remaining full-mapping cases are worth tracking on their own.
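
For the first follow-up, rejecting invalid UTF-8 needs a validation step before lower-casing. The sketch below shows one possible shape; `FirstInvalidUtf8` is a hypothetical helper, and `std::optional` is used here only as a stand-in, since how a `Result<std::string>` error would actually be produced is not shown in this thread.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Returns the offset of the first byte that does not begin a valid UTF-8
// sequence, or std::nullopt if the whole input is well-formed. Rejects stray
// continuation bytes, truncated sequences, overlong forms, surrogates, and
// code points above U+10FFFF.
inline std::optional<std::size_t> FirstInvalidUtf8(std::string_view s) {
  std::size_t i = 0;
  while (i < s.size()) {
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    if (b0 < 0x80) { ++i; continue; }
    std::size_t len = 0;
    char32_t cp = 0, min = 0;
    if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
    else return i;  // stray continuation byte or invalid lead byte
    if (s.size() - i < len) return i;  // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
      const unsigned char b = static_cast<unsigned char>(s[i + k]);
      if ((b & 0xC0) != 0x80) return i;
      cp = (cp << 6) | (b & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return i;
    i += len;
  }
  return std::nullopt;
}
```

A checked ToLower could call a helper like this first and return an error carrying the offset, which is exactly the contract change for existing callers that the bullet above describes.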

Does that split sound reasonable? If so I'll push the fast-path + doc changes and open the two follow-ups.

@manuzhang

manuzhang commented Jun 30, 2026

Member

@goel-skd Please update the PR description (supposing this no longer fully fixes #613 now) and open a new issue with your design and plan, and link all your follow-up PRs there.

@wgtmac

wgtmac commented Jun 30, 2026

Member

Thanks @goel-skd for your effort! I think 1:1 Java mapping is difficult to achieve, so it's fine for it to be out of this PR's scope. However, invalid UTF-8 is a correctness issue, so it would be better to address it in this PR as well. WDYT?

@goel-skd

goel-skd commented Jul 2, 2026

Contributor Author

> Thanks @goel-skd for your effort! I think 1:1 Java mapping is difficult to achieve, so it's fine for it to be out of this PR's scope. However, invalid UTF-8 is a correctness issue, so it would be better to address it in this PR as well. WDYT?

@wgtmac - Thanks, I agree that is fair. Will push the updated code.

@goel-skd goel-skd requested a review from wgtmac July 3, 2026 15:48

4 participants