First, `category` rather than `code` was used in constructing the
`control_boundary` property as related to the characters U+200C and
U+200D. This seemed incorrect and should be fixed. This could be an
observable bugfix for any C code which inspects the `control_boundary`
property.
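
To make the first fix concrete, here is a minimal sketch contrasting the two predicates. It is written in Ruby to match the old generator, but the method names, constants and the list of boundary categories are assumptions for illustration, not the generator's actual code.

```ruby
# Hypothetical sketch: names and the category list are assumptions.
BOUNDARY_CATEGORIES = %w[Zl Zp Cc Cf].freeze
JOINERS = [0x200C, 0x200D].freeze # ZERO WIDTH NON-JOINER, ZERO WIDTH JOINER

# Buggy form: the exclusion test compares the category string against
# integer code points, so it can never match.
def control_boundary_buggy(code, category)
  BOUNDARY_CATEGORIES.include?(category) && !JOINERS.include?(category)
end

# Fixed form: the exclusion test looks at the code point itself.
def control_boundary_fixed(code, category)
  BOUNDARY_CATEGORIES.include?(category) && !JOINERS.include?(code)
end

# Both joiners have general category Cf.
p control_boundary_buggy(0x200D, "Cf") # true
p control_boundary_fixed(0x200D, "Cf") # false
```

Under these assumptions the two forms differ only for U+200C and U+200D, which is consistent with the fix being observable only through the properties of those two characters.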
Second, when reading composition exclusions, Ruby's `String#hex` method
returns zero rather than nil if no number is found. For example:
    $ ruby -e 'puts "# blah".hex'
    0
This led to the character `'\0'` being included in the `exclusions`
and `excl_versions` sets which is incorrect. However this seems
asymptomatic because `'\0'` is never part of a composition. (In terms of
the C code, the use of `comp_exclusion` is guarded by the `comb_index`
property which is `UINT16_MAX` for `'\0'`.)
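
The effect of the second bug can be reproduced with a short sketch. The sample text below is a made-up fragment in the style of a composition exclusions file, and the guarded parser is only one possible way to skip lines without a leading number, not necessarily what the generator does.

```ruby
require "set"

# Hypothetical sample in the style of a composition exclusions file.
SAMPLE = <<~TEXT
  # blah
  0958    #  DEVANAGARI LETTER QA

  0959    #  DEVANAGARI LETTER KHHA
TEXT

# Naive form: String#hex returns 0 when the string does not start with
# hex digits, so comment and blank lines all contribute code point 0.
naive = Set.new(SAMPLE.each_line.map { |line| line.hex })

# Guarded form: only accept lines that start with hex digits.
guarded = Set.new
SAMPLE.each_line do |line|
  m = line.match(/\A\h+/)
  guarded << m[0].hex if m
end

p naive.include?(0)   # true
p guarded.include?(0) # false
p guarded.to_a        # [2392, 2393]
```

The naive set picks up `0` from the comment and blank lines, which is how `'\0'` ended up in the exclusion sets.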
This hack changed the ordering of sequences encoded in the sequences table and was added so we could easily prove equivalence to the Ruby data generator code. However, it's no longer needed and removing it shouldn't result in any functional change.
Also fixes #226?
Is the sequence ordering still deterministic independent of the Julia version? I wouldn't want the output to change just because Julia changed its
Yes, I've kept iteration order separate from The only ordering-related thing that changes here is that we insert the decomposition mapping and case folding sequences into the cache interleaved with the uppercase, lowercase and titlecase mappings, rather than doing them first. But that's independent of the Julia version. The ordering in the Ruby code was less obvious because the mutable cache of sequences is a global variable.
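
As a rough illustration of why insertion order changes the table layout without changing behaviour, here is a toy sequence cache. The class, its methods and the sample sequences are all hypothetical; the real sequences table is encoded differently, so treat this as a sketch of the idea only.

```ruby
# Hypothetical sequence cache: each distinct sequence is appended to one
# flat table and remembered by its starting index.
class SequenceCache
  attr_reader :table

  def initialize
    @table = []
    @index = {}
  end

  # Return the start index for seq, appending it on first sight.
  def intern(seq)
    @index[seq] ||= begin
      start = @table.length
      @table.concat(seq)
      start
    end
  end

  # Read a sequence back from the flat table.
  def fetch(start, length)
    @table[start, length]
  end
end

seqs = [[0x0041, 0x030A], [0x0061], [0x0073, 0x0073]]

a = SequenceCache.new
b = SequenceCache.new
ia = seqs.map { |s| a.intern(s) }                 # insert in forward order
ib = seqs.reverse.map { |s| b.intern(s) }.reverse # insert in reverse order

same_layout = (a.table == b.table)
same_lookups = seqs.each_with_index.all? do |s, i|
  a.fetch(ia[i], s.length) == s && b.fetch(ib[i], s.length) == s
end

p same_layout  # false: the table contents are laid out differently
p same_lookups # true: every lookup still yields the same sequence
```

In this sketch the insertion order is fixed by the caller's own list, so the layout is deterministic as long as that list is, regardless of how any underlying hash iterates.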
Follow on from #258
There are three minor cleanups here, arranged into two patches so you can see the effect on utf8proc_data.c. (We might want to squash the patches together when merging, because they churn utf8proc_data.c a bit: two of them affect the encoding tables in a non-meaningful but global way, which is why I've put them all together in this PR.)
Only one of these fixes is observable via the API as a minor bug fix for properties of U+200C and U+200D.
The patch notes:
[PATCH 1/2] Fix two minor bugs from the Ruby code
First, `category` rather than `code` was used in constructing the
`control_boundary` property as related to the characters U+200C and
U+200D. This seemed incorrect and should be fixed. This could be an
observable bugfix for any C code which inspects the `control_boundary`
property.
Second, when reading composition exclusions, Ruby's `String#hex` method
returns zero rather than nil if no number is found. For example:
    $ ruby -e 'puts "# blah".hex'
    0
This led to the character `'\0'` being included in the `exclusions`
and `excl_versions` sets which is incorrect. However this seems
asymptomatic because `'\0'` is never part of a composition. (In terms of
the C code, the use of `comp_exclusion` is guarded by the `comb_index`
property which is `UINT16_MAX` for `'\0'`.)
[PATCH 2/2] Cleanup: Remove sequence ordering hack
This hack changed the ordering of sequences encoded in the sequences
table and was added so we could easily prove equivalence to the Ruby
data generator code.
However, it's no longer needed and removing it shouldn't result in any
functional change.