feat(string_util): make ToLower Unicode-aware via utf8proc (2/2) by goel-skd · Pull Request #760 · apache/iceberg-cpp

goel-skd · 2026-06-19T01:30:41Z

Replaces the ASCII-only StringUtils::ToLower with a Unicode-aware implementation backed by utf8proc, so case-insensitive name handling matches Iceberg Java's toLowerCase(Locale.ROOT).

ToLower lower-cases UTF-8 using utf8proc simple (1:1) case mapping (e.g. CAFÉ → café, GROẞE → große). Pure-ASCII input takes a byte-wise fast path and never touches utf8proc. The function is total: a byte that does not begin a valid UTF-8 sequence is passed through unchanged and decoding resumes at the next byte, so the valid code points around it are still lower-cased.
EqualsIgnoreCase and StartsWithIgnoreCase fold through ToLower, so they are case-insensitive for non-ASCII letters too. Both take an allocation-free ASCII fast path and fall back to ToLower only when a non-ASCII byte is present.
StartsWithIgnoreCase no longer assumes case mapping preserves byte length — the old byte-slice guard mishandled length-changing maps (e.g. İ U+0130 → i). Now "İx" starts with "i", and "i" starts with "İ".
ToUpper is intentionally left ASCII-only — it only normalizes ASCII enum/codec strings, and simple case mapping would be wrong for some letters (e.g. ß would stay unchanged instead of becoming SS).
utf8proc is pinned to 2.10.0 and wired into both the CMake (vendored via FetchContent / system package) and Meson (subprojects/utf8proc.wrap) builds, with matching source hashes and correct static-vs-shared linkage on Windows.
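
As a rough illustration of the ToLower behavior described above (ASCII fast path, simple 1:1 mapping, pass-through of invalid bytes), here is a minimal self-contained sketch. It is not the PR's implementation: `SimpleLowerSketch` is a hypothetical stand-in for the utf8proc lookup and covers only ASCII plus the three example code points, and `DecodeOne`, `AppendUtf8`, and `ToLowerSketch` are illustrative names, not functions from the codebase.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical stand-in for utf8proc's simple lowercase mapping. It covers only
// ASCII plus the code points used as examples in this description.
inline char32_t SimpleLowerSketch(char32_t cp) {
  if (cp >= U'A' && cp <= U'Z') return cp + 0x20;
  switch (cp) {
    case 0x00C9: return 0x00E9;  // É -> é
    case 0x1E9E: return 0x00DF;  // ẞ -> ß (3 bytes -> 2 bytes)
    case 0x0130: return 0x0069;  // İ -> i (2 bytes -> 1 byte)
    default: return cp;
  }
}

// Decodes one code point starting at s[i]. Returns its byte length, or 0 when
// s[i] does not begin a valid UTF-8 sequence (stray continuation byte,
// truncated sequence, overlong form, surrogate, or value above U+10FFFF).
inline size_t DecodeOne(std::string_view s, size_t i, char32_t* out) {
  const unsigned char b0 = static_cast<unsigned char>(s[i]);
  if (b0 < 0x80) { *out = b0; return 1; }
  size_t len = 0;
  char32_t cp = 0;
  char32_t min = 0;
  if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
  else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
  else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
  else return 0;
  if (i + len > s.size()) return 0;
  for (size_t k = 1; k < len; ++k) {
    const unsigned char b = static_cast<unsigned char>(s[i + k]);
    if ((b & 0xC0) != 0x80) return 0;
    cp = (cp << 6) | (b & 0x3F);
  }
  if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
  *out = cp;
  return len;
}

// Appends the UTF-8 encoding of cp to *out.
inline void AppendUtf8(char32_t cp, std::string* out) {
  if (cp < 0x80) {
    out->push_back(static_cast<char>(cp));
  } else if (cp < 0x800) {
    out->push_back(static_cast<char>(0xC0 | (cp >> 6)));
    out->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else if (cp < 0x10000) {
    out->push_back(static_cast<char>(0xE0 | (cp >> 12)));
    out->push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else {
    out->push_back(static_cast<char>(0xF0 | (cp >> 18)));
    out->push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
    out->push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  }
}

inline std::string ToLowerSketch(std::string_view s) {
  std::string out;
  out.reserve(s.size());
  bool ascii = true;
  for (unsigned char c : s) {
    if (c >= 0x80) { ascii = false; break; }
  }
  if (ascii) {  // Fast path: byte-wise, no decoding.
    for (char c : s) {
      out.push_back(c >= 'A' && c <= 'Z' ? static_cast<char>(c + 0x20) : c);
    }
    return out;
  }
  for (size_t i = 0; i < s.size();) {
    char32_t cp = 0;
    const size_t len = DecodeOne(s, i, &cp);
    if (len == 0) {  // Invalid byte: copy it unchanged, resume at the next byte.
      out.push_back(s[i]);
      ++i;
      continue;
    }
    AppendUtf8(SimpleLowerSketch(cp), &out);
    i += len;
  }
  return out;
}
```

Under these assumptions, a stray 0xFF between two letters is copied through while the letters on either side are still lower-cased, which is the "total function" property the description refers to.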

Testing

string_util_test.cc: ToLowerUnicode (incl. invalid / truncated / stray-continuation bytes), ToUpperAsciiOnly, and Unicode + invalid-UTF-8 cases for EqualsIgnoreCase / StartsWithIgnoreCase.
IgnoreCaseAgreesWithToLowerOracle exhaustively checks both ASCII fast paths against the ToLower oracle over all short strings drawn from an alphabet with length-changing maps (İ→i, K→k) and invalid UTF-8.

Follow-up

The non-ASCII fallback of the comparisons still materializes lowercased strings; a streaming (allocation-free) version is deferred to a follow-up PR.

Part of #613.
Follow-up to #748.
Precursor to #808.

Replace the ASCII-only ToLower with utf8proc simple case mapping so case-insensitive name handling matches Iceberg Java's toLowerCase(Locale.ROOT). ToUpper stays ASCII-only since it is not used for name matching. EqualsIgnoreCase now compares lowercased forms. Wire utf8proc into both the CMake (vendored/system) and Meson builds. See apache#613.

wgtmac

I haven't checked the implementation and tests yet; I'm just posting some thoughts on the APIs.

goel-skd · 2026-06-24T04:11:00Z

> I haven't checked the implementation and tests yet; I'm just posting some thoughts on the APIs.

Thanks very much, @wgtmac. I've responded to your comments.

ToLower: note it uses Unicode simple (1:1) case mapping and document where it diverges from Java's full toLowerCase(Locale.ROOT) (e.g. U+0130). ToUpper: spell out the ASCII-only behavior and why no Unicode variant is provided. Also document EqualsIgnoreCase inheriting ToLower's mapping. Addresses API review comments on apache#760. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The byte-slice in StartsWithIgnoreCase (str.substr(0, prefix.size()) before lowercasing) is wrong when ToLower changes byte length: "İ" (U+0130) is two bytes but lower-cases to "i", so "İx" should match prefix "i" but does not. This test pins that behavior; it fails against the current implementation and is fixed by the following commit. Relates to apache#760.

Compare the ToLower forms of both inputs instead of byte-slicing str to prefix.size() before lowercasing. ToLower can change a string's byte length (e.g. "İ" U+0130 lower-cases to "i"), so the old slice could split a code point or wrongly reject a valid match. Makes the regression test from the previous commit pass. Relates to apache#760.
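
To make the byte-slice problem concrete, the sketch below contrasts the two strategies described in these commit messages. It is an illustration under stated assumptions, not the code from the PR: `LowerSketch` is a hypothetical stand-in for ToLower that handles only ASCII and İ (U+0130, bytes C4 B0), and `StartsWithSliceFirst` and `StartsWithLowerFirst` are made-up names for the old and new approaches.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical stand-in for ToLower: lower-cases ASCII letters and maps
// İ (U+0130, UTF-8 bytes C4 B0) to "i". Every other byte is copied as-is.
inline std::string LowerSketch(std::string_view s) {
  std::string out;
  for (size_t i = 0; i < s.size(); ++i) {
    if (i + 1 < s.size() && s[i] == '\xC4' && s[i + 1] == '\xB0') {
      out.push_back('i');
      ++i;  // Skip the second byte of İ.
    } else if (s[i] >= 'A' && s[i] <= 'Z') {
      out.push_back(static_cast<char>(s[i] + 0x20));
    } else {
      out.push_back(s[i]);
    }
  }
  return out;
}

// Old approach: slice str to prefix.size() bytes, then lower-case both sides.
// The slice can cut a multi-byte code point in half.
inline bool StartsWithSliceFirst(std::string_view str, std::string_view prefix) {
  if (str.size() < prefix.size()) return false;
  return LowerSketch(str.substr(0, prefix.size())) == LowerSketch(prefix);
}

// New approach: lower-case both inputs in full, then test the prefix relation
// on the lower-cased forms.
inline bool StartsWithLowerFirst(std::string_view str, std::string_view prefix) {
  const std::string ls = LowerSketch(str);
  const std::string lp = LowerSketch(prefix);
  return ls.size() >= lp.size() && ls.compare(0, lp.size(), lp) == 0;
}
```

With "İx" and prefix "i", the slice-first version keeps only the byte C4, which does not lower-case to "i", so it reports no match. The lower-first version compares "ix" against "i" and matches. The reverse case ("i" against prefix "İ") fails the slice-first length guard outright.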

manuzhang · 2026-06-29T11:30:05Z

+  /// Intended for case-insensitive name matching, similar to Iceberg Java's
+  /// toLowerCase(Locale.ROOT). The mapping is locale-independent, matching the intent
+  /// of Locale.ROOT. It uses simple (1:1) case mapping rather than Java's full case
+  /// mapping, so results differ for a few code points; e.g. U+0130 (capital I with dot


If they don't match, how can iceberg-cpp read identifier paths that contain such characters and were written by Java?

goel-skd · 2026-06-29T23:16:49Z

Thanks for all the review here. Since this is a shared util it's picked up a bunch of good ideas, but some of them expand the scope, so I want to draw a line on what lands here vs. what we track separately.

What I'll do in this PR:

Add ASCII fast-paths to ToLower, EqualsIgnoreCase, and StartsWithIgnoreCase. If the inputs are pure ASCII (no byte >= 0x80) we just do the cheap byte-wise thing with an early-out and no allocation, and only fall back to utf8proc when there's actually a non-ASCII byte. That keeps the common case (modes, UUIDs, header/property names, enum-ish strings) as fast as it was before, which is the regression @wgtmac and @zhjwpku registered concerns about.
Reword the ToLower doc comment so the behavior and the small divergence from Java are actually clear (@manuzhang's point).
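
For illustration, an allocation-free ASCII fast path with an early-out could look roughly like the sketch below. The function name and the `std::optional<bool>` return convention are assumptions made for this example. An empty result means "a non-ASCII byte was seen, fall back to the Unicode path".

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <string_view>

inline char AsciiLower(char c) {
  return (c >= 'A' && c <= 'Z') ? static_cast<char>(c + ('a' - 'A')) : c;
}

// Hypothetical fast path. Returns true/false when the answer can be decided
// from ASCII bytes alone, or std::nullopt when a non-ASCII byte (>= 0x80) is
// encountered and the caller has to fall back to the ToLower-based comparison.
inline std::optional<bool> EqualsIgnoreCaseAsciiSketch(std::string_view a,
                                                       std::string_view b) {
  const size_t n = std::min(a.size(), b.size());
  for (size_t i = 0; i < n; ++i) {
    const unsigned char ca = static_cast<unsigned char>(a[i]);
    const unsigned char cb = static_cast<unsigned char>(b[i]);
    if ((ca | cb) & 0x80) return std::nullopt;  // Needs the Unicode path.
    // Both inputs were pure ASCII up to here, so byte positions line up in the
    // lower-cased forms and a mismatch is definitive.
    if (AsciiLower(a[i]) != AsciiLower(b[i])) return false;
  }
  if (a.size() == b.size()) return true;
  // One input is a pure-ASCII strict prefix of the other. With a 1:1 mapping
  // and byte pass-through, the remaining tail lower-cases to at least one
  // byte, so the lower-cased forms cannot be equal.
  return false;
}
```

A caller would use the result when it is present and otherwise compare the ToLower forms of both inputs, so no string is allocated for the common pure-ASCII case.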

On the simple vs. full case mapping question (@manuzhang): reading Java-written tables isn't affected, since field names are stored verbatim in metadata and not lower-cased. The simple/full difference only shows up in case-insensitive name resolution, and only for a few code points like U+0130. For ASCII and the vast majority of letters they're identical. Matching Java bit-for-bit on those edge cases is a bigger change and I'd rather not block this PR on it.

Two things I'd punt to follow-up issues:

Having ToLower return Result<std::string> so it can reject invalid UTF-8 instead of passing it through. That changes the contract for every current caller, so it deserves its own PR.
Bit-for-bit parity with Java's toLowerCase(Locale.ROOT) on the code points where Java does full case mapping (e.g. U+0130 İ -> i̇). We're using simple 1:1 lowercasing here, which was a deliberate non-goal in the design doc: it's what makes ß/GROẞE match Java, whereas casefold would diverge there. So the remaining full-mapping cases are worth tracking on their own.
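
To give a feel for the first follow-up, the check that a rejecting ToLower would need is sketched below. This is only an assumption about how such a contract might be built. `FirstInvalidUtf8Offset` is a hypothetical helper, and it reports the offending offset through `std::optional<size_t>` instead of the project's `Result` type so that the example stays self-contained.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper: returns the offset of the first byte that does not
// begin a valid UTF-8 sequence, or std::nullopt when the input is valid.
// Overlong forms, surrogates, and values above U+10FFFF count as invalid.
inline std::optional<size_t> FirstInvalidUtf8Offset(std::string_view s) {
  size_t i = 0;
  while (i < s.size()) {
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    if (b0 < 0x80) { ++i; continue; }
    size_t len = 0;
    unsigned long cp = 0;
    unsigned long min = 0;
    if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
    else return i;                      // Stray continuation or invalid lead byte.
    if (i + len > s.size()) return i;   // Truncated sequence.
    for (size_t k = 1; k < len; ++k) {
      const unsigned char b = static_cast<unsigned char>(s[i + k]);
      if ((b & 0xC0) != 0x80) return i;
      cp = (cp << 6) | (b & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return i;
    i += len;
  }
  return std::nullopt;
}
```

A Result-returning ToLower could run a check like this first and return an error carrying the offset, while the pass-through contract simply copies the byte at that offset and continues.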

Does that split sound reasonable? If so I'll push the fast-path + doc changes and open the two follow-ups.

manuzhang · 2026-06-30T02:04:35Z

@goel-skd Please update the PR description (since this presumably no longer fully fixes #613), open a new issue with your design and plan, and link all your follow-up PRs there.

wgtmac · 2026-06-30T03:10:31Z

Thanks @goel-skd for your effort! I think an exact 1:1 match with Java's mapping is difficult to achieve, so it's fine for that to be out of this PR's scope. However, the handling of invalid UTF-8 is a correctness issue, so it would be better to address it in this PR as well. WDYT?

goel-skd · 2026-07-02T23:06:46Z

> Thanks @goel-skd for your effort! I think an exact 1:1 match with Java's mapping is difficult to achieve, so it's fine for that to be out of this PR's scope. However, the handling of invalid UTF-8 is a correctness issue, so it would be better to address it in this PR as well. WDYT?

@wgtmac - Thanks, I agree that is fair. Will push the updated code.

goel-skd force-pushed the feat-613-unicode-lowercase branch from bec8884 to 69cc006 Compare June 19, 2026 01:40

goel-skd force-pushed the feat-613-unicode-lowercase branch from 69cc006 to f42e2da Compare June 19, 2026 02:13

wgtmac reviewed Jun 19, 2026

Comment thread cmake_modules/IcebergThirdpartyToolchain.cmake

Add license info to LICENSE

b8639d6

wgtmac requested changes Jun 21, 2026

Comment thread src/iceberg/util/string_util.h Outdated

Comment thread src/iceberg/util/string_util.h Outdated

Comment thread src/iceberg/util/string_util.h Outdated

goel-skd force-pushed the feat-613-unicode-lowercase branch from c692240 to 9ffebc3 Compare June 26, 2026 00:56

goel-skd requested a review from wgtmac June 26, 2026 01:11

wgtmac reviewed Jun 26, 2026

Comment thread src/iceberg/util/string_util.h

goel-skd added 2 commits June 27, 2026 11:27

goel-skd requested a review from wgtmac June 27, 2026 15:33

zhjwpku reviewed Jun 28, 2026

Comment thread src/iceberg/util/string_util.h

wgtmac reviewed Jun 29, 2026

Comment thread src/iceberg/util/string_util.h Outdated

Comment thread src/iceberg/util/string_util.h

Comment thread src/iceberg/util/string_util.h Outdated

Comment thread src/iceberg/util/string_util.h

manuzhang reviewed Jun 29, 2026

Comment thread src/iceberg/util/string_util.h

goel-skd requested a review from wgtmac July 3, 2026 15:48

refactor: ASCII fast path + build polish

3e063ae

goel-skd force-pushed the feat-613-unicode-lowercase branch from db1f5a8 to 3e063ae Compare July 3, 2026 15:57

goel-skd mentioned this pull request Jul 3, 2026

Full Unicode case-mapping parity for case-insensitive field matching #808

Open

2 tasks
