Skip to content

stdlib: Add binary:encode_hex/2#6297

Merged
bjorng merged 1 commit intoerlang:masterfrom
gilbertwong96:stdlib/binary-encode-case-sensitive
Dec 15, 2022
Merged

stdlib: Add binary:encode_hex/2#6297
bjorng merged 1 commit intoerlang:masterfrom
gilbertwong96:stdlib/binary-encode-case-sensitive

Conversation

@gilbertwong96
Copy link
Copy Markdown
Contributor

@gilbertwong96 gilbertwong96 commented Sep 14, 2022

The binary module only exposes encode_hex/1, which only supports uppercase output.

This change adds binary:encode_hex/2 to support uppercase and lowercase hex-encoding. After this change, encode_hex/1 supports uppercase by default.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Sep 14, 2022

CT Test Results

       2 files       86 suites   35m 10s ⏱️
1 698 tests 1 649 ✔️ 47 💤 2
1 971 runs  1 920 ✔️ 49 💤 2

For more details on these failures, see this check.

Results for commit b12294d.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally.

Artifacts

// Erlang/OTP Github Action Bot

@gilbertwong96 gilbertwong96 changed the title Stdlib/binary encode case sensitive sodlib: add binary:encode_hex/2 Sep 14, 2022
@gilbertwong96 gilbertwong96 changed the title sodlib: add binary:encode_hex/2 stdlib: Add binary:encode_hex/2 Sep 14, 2022
@IngelaAndin IngelaAndin added the team:VM Assigned to OTP team VM label Sep 14, 2022
@Maria-12648430
Copy link
Copy Markdown
Contributor

Maria-12648430 commented Sep 15, 2022

Something similar has recently been done for the base64 module, see #6280.

The solution initially proposed there was much like this one here, and it slowed down the encoding and decoding process considerably. So, you might want to benchmark the performance before and after your change here 😉

After changes suggested by @bjorng were applied, the revised solution was about on par with the then current implementation. The same approach should also be applicable for this PR.

Comment thread lib/stdlib/src/binary.erl
Comment on lines +377 to +379
-spec encode_hex(Bin, Case) -> Bin2 when
Bin :: binary(),
Case :: lower | upper,
Bin2 :: <<_:_*16>>.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would lowercase / uppercase be better namings than upper / lower? Not sure what the precedents are.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, lowercase/uppercase is better.

@gilbertwong96
Copy link
Copy Markdown
Contributor Author

gilbertwong96 commented Sep 19, 2022

Something similar has recently been done for the base64 module, see #6280.

The solution initially proposed there was much like this one here, and it slowed down the encoding and decoding process considerably. So, you might want to benchmark the performance before and after your change here 😉

After changes suggested by @bjorng were applied, the revised solution was about on par with the then current implementation. The same approach should also be applicable for this PR.

The performance must be the same as the original version. The algorithm hasn't been changed. It just provides new APIs.

@bjorng
Copy link
Copy Markdown
Contributor

bjorng commented Sep 19, 2022

The performance must be the same as the original version.

Running a modified version of base64_bench on the maint branch, I get the following numbers on my computer:

== Testing with 1 MB ==
fun binary:encode_hex/1: 1000 iterations in 9622 ms: 103 it/sec
fun binary:decode_hex/1: 1000 iterations in 41326 ms: 24 it/sec

With this PR, there is a slight loss of performance:

== Testing with 1 MB ==
fun binary:encode_hex/1: 1000 iterations in 10541 ms: 94 it/sec
fun binary:decode_hex/1: 1000 iterations in 41008 ms: 24 it/sec

On the master branch, the results are:

== Testing with 1 MB ==
fun binary:encode_hex/1: 1000 iterations in 2607 ms: 383 it/sec
fun binary:decode_hex/1: 1000 iterations in 37013 ms: 27 it/sec

With this PR rebased on top of the master branch, the results are:

== Testing with 1 MB ==
fun binary:encode_hex/1: 1000 iterations in 3574 ms: 279 it/sec
fun binary:decode_hex/1: 1000 iterations in 36800 ms: 27 it/sec

@gilbertwong96 gilbertwong96 force-pushed the stdlib/binary-encode-case-sensitive branch 2 times, most recently from e2b966c to b12294d Compare September 19, 2022 14:04
@gilbertwong96
Copy link
Copy Markdown
Contributor Author

The performance must be the same as the original version.

Running a modified version of base64_bench on the maint branch, I get the following numbers on my computer:

== Testing with 1 MB ==
fun binary:encode_hex/1: 1000 iterations in 9622 ms: 103 it/sec
fun binary:decode_hex/1: 1000 iterations in 41326 ms: 24 it/sec

With this PR, there is a slight loss of performance:

== Testing with 1 MB ==
fun binary:encode_hex/1: 1000 iterations in 10541 ms: 94 it/sec
fun binary:decode_hex/1: 1000 iterations in 41008 ms: 24 it/sec

On the master branch, the results are:

== Testing with 1 MB ==
fun binary:encode_hex/1: 1000 iterations in 2607 ms: 383 it/sec
fun binary:decode_hex/1: 1000 iterations in 37013 ms: 27 it/sec

With this PR rebased on top of the master branch, the results are:

== Testing with 1 MB ==
fun binary:encode_hex/1: 1000 iterations in 3574 ms: 279 it/sec
fun binary:decode_hex/1: 1000 iterations in 36800 ms: 27 it/sec

Thank you. I will find a way to improve it.

@Maria-12648430
Copy link
Copy Markdown
Contributor

@bjorng out of interest, if you run the benchmarks with lowercase (with these changes applied), does that make it worse? I mean, is the difference in performance more pronounced? Going by gut feeling, it should, as there is now an extra check? (And sorry, can't try myself right now, I'm only on my flimsy mobile 😅)

@bjorng
Copy link
Copy Markdown
Contributor

bjorng commented Sep 20, 2022

@Maria-12648430 There were no noticeable differences in benchmark results between the two modes.

Going by gut feeling, it should, as there is now an extra check?

In general, matching is not done one clause at a time in order. In this case, the selection between the two clauses is done by the following BEAM instruction:

    {select_val,{x,2},
                {f,490},
                {list,[{atom,lower},{f,488},{atom,upper},{f,487}]}}.

If there are many values in a select_val a binary search will be used. When there are only a few, as in this case, the JIT will emit code that does a linear search:

# i_select_val_lins_sfI
    mov rsi, qword ptr [rbx+16]
    cmp rsi, 441611
    je label_487
    cmp rsi, 441931
    je label_488
    jmp label_490

I expect the most expensive part of this sequence to be the first instruction, which loads a BEAM register into CPU register rsi. Comparing rsi to a value and not branching is very cheap, relatively speaking. Therefore, it doesn't really matter which of the atoms that is compared first.

@Maria-12648430
Copy link
Copy Markdown
Contributor

@bjorng thanks for the detailed explanation, very much appreciated 🥰

Comment thread lib/stdlib/src/binary.erl Outdated
encode_hex(Data) when is_binary(Data) ->
<< <<?HEX(N)>> || <<N>> <= Data >>;
encode_hex(Bin) ->
encode_hex(Bin, upper).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be uppercase.

@gilbertwong96 gilbertwong96 force-pushed the stdlib/binary-encode-case-sensitive branch from b12294d to 61263c1 Compare November 20, 2022 12:07
@gilbertwong96
Copy link
Copy Markdown
Contributor Author

gilbertwong96 commented Nov 20, 2022

I improve the performance in some way ugly 😅. @bjorng

Here is the bench on my computer.

On the master branch, the results are:

== Testing with 1 MB ==
fun binary:encode_hex/1: 1000 iterations in 9091 ms: 109 it/sec
fun binary:decode_hex/1: 1000 iterations in 32389 ms: 30 it/sec

With this PR rebased on top of the master branch, the results are:

== Testing with 1 MB ==
fun binary:encode_hex/1: 1000 iterations in 9060 ms: 110 it/sec
fun binary:decode_hex/1: 1000 iterations in 31815 ms: 31 it/sec

@gilbertwong96 gilbertwong96 requested review from Maria-12648430 and juhlig and removed request for Maria-12648430 and juhlig November 21, 2022 02:36
@juhlig
Copy link
Copy Markdown
Contributor

juhlig commented Nov 21, 2022

I'll take a look at it tomorrow 🙂

@juhlig
Copy link
Copy Markdown
Contributor

juhlig commented Nov 22, 2022

Ok. I want to say that my review is inspired (or maybe biased) by similar work recently done on the base64 module. @bjorng may have a different opinion on this. He is waaay more of an expert than I am. Anyway, here goes :)


I would suggest to conflate the functions hex_uppercase/1 and hex_lowercase/1 into one function hex/2, with the second parameter being an offset. Put both the uppercase and lowercase integers into one big tuple, such that the uppercase integers are in positions 1-256, lowercase in 257-512 (don't mind the duplicates):

hex(X, Off) ->
    element(
        X+Off,
        {
            % integers for uppercase
            16#3030, 16#3031, 16#3032, 16#3033, 16#3034, ...
            ..., 16#4642, 16#4643, 16#4644, 16#4645, 16#4646,
            % integers for lowercase
            16#3030, 16#3031, 16#3032, 16#3033, 16#3034, ...
            ..., , 16#6662, 16#6663, 16#6664, 16#6665, 16#6666
        }
    ).

So if you want to query for an uppercase integer, you pass in the offset 1, for a lowercase integer you pass in 257:

1> hex(16#4a, 1).
13371
2> hex(16#4a, 257).
13409

Having only one function for the translation, you can go with only one macro instead of two:

-define(HEX(X, Off), (hex(X, Off)):16).

And having only one function and macro, you don't need two separate functions encode_hex_upper/1 and encode_hex_lower/1 but only one, encode_hex1/2 (I suffixed the name with a 1 since a function encode_hex/2 already exists) which takes the data to encode and an offset:

encode_hex1(Data, Off) when byte_size(Data) rem 8 =:= 0 ->
    << <<?HEX(A, Off), ?HEX(B, Off), ?HEX(C, Off), ?HEX(D, Off),
         ?HEX(E, Off), ?HEX(F, Off), ?HEX(G, Off), ?HEX(H, Off)>>
       || <<A, B, C, D, E, F, G, H>> <= Data >>;
encode_hex1(Data, Off) when byte_size(Data) rem 7 =:= 0 ->
    << <<?HEX(A, Off), ?HEX(B, Off), ?HEX(C, Off), ?HEX(D, Off),
         ?HEX(E, Off), ?HEX(F, Off), ?HEX(G, Off)>>
       || <<A, B, C, D, E, F, G>> <= Data >>;
...
encode_hex1(Data, Off) when is_binary(Data) ->
    << <<?HEX(N, Off)>> || <<N>> <= Data >>;
encode_hex1(Bin, Off) ->
    badarg_with_info([Bin, Off]).

(Actually, if you define your macro like -define(HEX(X), (hex(X, Off)):16)., you can even go without the Off in all the ?HEX(...) usages. However, I'm pretty sure this is bad style 😅)

The proper value for the offset can be determined and passed in from encode_hex/2:

encode_hex(Bin, uppercase) ->
    encode_hex1(Bin, 1);
encode_hex(Bin, lowercase) ->
    encode_hex1(Bin, 257).

I think by going like this, the code would be smaller and cleaner (but make sure to add comments explaining what that Offset is about ;)), without any performance penalties. Anyway, I would advise to wait for @bjorng's comment before doing anything 😉

Copy link
Copy Markdown
Contributor

@bjorng bjorng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @juhlig's suggestions.

We don't want to have only property tests; there should always be some tests for if proper is not installed. Please update binary_module_SUITE.

Make sure that the documentation have the correct name for the options.

@gilbertwong96 gilbertwong96 force-pushed the stdlib/binary-encode-case-sensitive branch from 2593435 to c04af5d Compare November 29, 2022 10:15
Comment on lines +12 to +14
UpperHex = binary:encode_hex(Data, uppercase),
LowerHex = binary:encode_hex(Data, lowercase),
binary:decode_hex(LowerHex) =:= Data andalso binary:decode_hex(UpperHex) =:= Data
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice as far as it goes, and you should keep it, but this only tests that what was hex-encoded can be hex-decoded. In other words, it is not tested if encode_hex produces the correct output, only that decode_hex can decode whatever encode_hex produced back into the initial binary.

A way you could test correct output (without repeating the actual implementation) could be with a helper function like this:

check_hex_encoded(<<I1:4, I2:4, Ins/binary>>, <<U1:8, U2:8, UCs/binary>>, <<L1:8, L2:8, LCs/binary>>) ->
    check_hex_chars_match(I1, U1, L1) andalso
    check_hex_chars_match(I2, U2, L2) andalso
    check_hex_encoded(Ins, UCs, LCs);
check_hex_encoded(<<>>, <<>>, <<>>) ->
    true;
check_hex_encoded(_, _, _) ->
    false.

check_hex_chars_match(X, U, L) when X < 10 ->
    (U =:= $0 + X) andalso (L =:= $0 + X);
check_hex_chars_match(X, U, L) ->
    (U =:= $A + X -10) andalso (L =:= $a + X -10).

... and call it from the property like check_hex_encoded(Data, UpperHex, LowerHex).
(Note that I just typed this off the top of my head, didn't test it, so there may be bugs and/or typos 😅)

I admit this won't be hideously fast (for which reason I would just use the normally-growing binary() generator instead of resizing it), but with tests it is more important to be thorough than to be fast 😉

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@gilbertwong96 gilbertwong96 force-pushed the stdlib/binary-encode-case-sensitive branch from c04af5d to 86bc409 Compare December 4, 2022 05:15
Copy link
Copy Markdown
Contributor

@juhlig juhlig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, this is fine now. Only some minor indendation issues ;)

Comment on lines +11 to +17
begin
UpperHex = binary:encode_hex(Data, uppercase),
LowerHex = binary:encode_hex(Data, lowercase),
binary:decode_hex(LowerHex) =:= Data andalso
binary:decode_hex(UpperHex) =:= Data andalso
check_hex_encoded(Data, UpperHex, LowerHex)
end).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
begin
UpperHex = binary:encode_hex(Data, uppercase),
LowerHex = binary:encode_hex(Data, lowercase),
binary:decode_hex(LowerHex) =:= Data andalso
binary:decode_hex(UpperHex) =:= Data andalso
check_hex_encoded(Data, UpperHex, LowerHex)
end).
begin
UpperHex = binary:encode_hex(Data, uppercase),
LowerHex = binary:encode_hex(Data, lowercase),
binary:decode_hex(LowerHex) =:= Data andalso
binary:decode_hex(UpperHex) =:= Data andalso
check_hex_encoded(Data, UpperHex, LowerHex)
end).

Indendation.

Comment on lines +27 to +28
check_hex_chars_match(I2, U2, L2) andalso
check_hex_encoded(Ins, UCs, LCs);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
check_hex_chars_match(I2, U2, L2) andalso
check_hex_encoded(Ins, UCs, LCs);
check_hex_chars_match(I2, U2, L2) andalso
check_hex_encoded(Ins, UCs, LCs);

Indendation.

Comment thread lib/stdlib/doc/src/binary.xml Outdated
Comment on lines 250 to 254
<p>Encodes a binary into a hex encoded binary using the specified case for the Hex digits "a" to "f".</p>
<p>The default case is <c>uppercase</c>.</p>
<p><em>Example:</em></p>

<code>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<p>Encodes a binary into a hex encoded binary using the specified case for the Hex digits "a" to "f".</p>
<p>The default case is <c>uppercase</c>.</p>
<p><em>Example:</em></p>
<code>
<p>Encodes a binary into a hex encoded binary using the specified case for the Hex digits "a" to "f".</p>
<p>The default case is <c>uppercase</c>.</p>
<p><em>Example:</em></p>
<code>

Indendation.

Comment thread lib/stdlib/doc/src/binary.xml Outdated
&lt;&lt;"2f"&gt;&gt;
4> binary:encode_hex(&lt;&lt;"/"&gt;&gt;, uppercase).
&lt;&lt;"2F"&gt;&gt;
</code>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
</code>
</code>

Indendation.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@Maria-12648430
Copy link
Copy Markdown
Contributor

Nice, I think this is almost good to go now, only the indendation 😉

@gilbertwong96 gilbertwong96 force-pushed the stdlib/binary-encode-case-sensitive branch from 86bc409 to e011fc8 Compare December 6, 2022 03:15
@gilbertwong96
Copy link
Copy Markdown
Contributor Author

Nice, I think this is almost good to go now, only the indendation 😉

I have already fixed the indentations you mentioned. What about introducing format tools like erlfmt in otp?

@gilbertwong96 gilbertwong96 requested review from Maria-12648430 and juhlig and removed request for Maria-12648430 and bjorng December 7, 2022 06:59
@bjorng
Copy link
Copy Markdown
Contributor

bjorng commented Dec 7, 2022

I've been busy the last week or so and I haven't had time to fully review the latest changes. Thanks @juhlig and @Maria-12648430 for the reviewing you've done.

However, when I took a quick look yesterday and ran the test suite and attempted to build the documentation, I noticed that the test case binary_module_SUITE:error_info/1 fails, and that the documentation fails to build.

@bjorng
Copy link
Copy Markdown
Contributor

bjorng commented Dec 7, 2022

By the way, I've created a ticket number to use in the since attribute: OTP-18354

(@juhlig No, I don't think it's mentioned in our contribution guidelines.)

@juhlig
Copy link
Copy Markdown
Contributor

juhlig commented Dec 8, 2022

@gilbertwong96

However, when I took a quick look yesterday and ran the test suite and attempted to build the documentation, I noticed that the test case binary_module_SUITE:error_info/1 fails, and that the documentation fails to build.

The documentation build failure is probably because you removed the since tag from your new function altogether. I believe it can be left empty, but its presence is required. Since you now have a ticket number, you should put it in the since tag like this since="OTP @OTP-18354@", and that should also solve the build issue.

About the error_info test failure, I have no idea though =^^=

@gilbertwong96 gilbertwong96 force-pushed the stdlib/binary-encode-case-sensitive branch 2 times, most recently from 6716d85 to 090e267 Compare December 9, 2022 07:15
<<"not a binary">>;
expand_error(bad_hex_case) ->
<<"not uppercase or lowercase">>;
expand_error(not_compiled_regexp) ->
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gilbertwong96

However, when I took a quick look yesterday and ran the test suite and attempted to build the documentation, I noticed that the test case binary_module_SUITE:error_info/1 fails, and that the documentation fails to build.

The documentation build failure is probably because you removed the since tag from your new function altogether. I believe it can be left empty, but its presence is required. Since you now have a ticket number, you should put it in the since tag like this since="OTP @OTP-18354@", and that should also solve the build issue.

About the error_info test failure, I have no idea though =^^=

I have fixed the error_info test failure now. @juhlig

@gilbertwong96
Copy link
Copy Markdown
Contributor Author

gilbertwong96 commented Dec 9, 2022

image

The stdlib tests ran successfully on my machine, but it failed in CI

image

I don't know why.

Do you have any ideas?

@bjorng
Copy link
Copy Markdown
Contributor

bjorng commented Dec 9, 2022

We know that there are some test cases that tend to fail in github actions. It is not your fault.

Copy link
Copy Markdown
Contributor

@bjorng bjorng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have only one more nit pick in the documentation. Otherwise looks good. When you have fixed that, please squash your commits, rebase on the master branch, and retarget the pull request to master.

Comment thread lib/stdlib/doc/src/binary.xml Outdated
<fsummary>Encodes a binary into a hex encoded binary with specified case</fsummary>
<desc>
<p>Encodes a binary into a hex encoded binary.</p>
<p>Encodes a binary into a hex encoded binary using the specified case for the Hex digits "a" to "f".</p>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<p>Encodes a binary into a hex encoded binary using the specified case for the Hex digits "a" to "f".</p>
<p>Encodes a binary into a hex encoded binary using the specified case for the hexadecimal digits "a" to "f".</p>

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@gilbertwong96 gilbertwong96 force-pushed the stdlib/binary-encode-case-sensitive branch from 090e267 to 21e3016 Compare December 13, 2022 09:06
@gilbertwong96 gilbertwong96 changed the base branch from maint to master December 13, 2022 09:06
The `binary` module only exposes `encode_hex/1` which encodes binary to a
 hex-encoded binary. The output as the number in hexadecimal, with a
 through f is in the upper case by default. This change allows the user to
 decide to use upper case or lower case.
@gilbertwong96 gilbertwong96 force-pushed the stdlib/binary-encode-case-sensitive branch from 21e3016 to a30a54d Compare December 13, 2022 09:10
@bjorng bjorng added the testing currently being tested, tag is used by OTP internal CI label Dec 13, 2022
@bjorng
Copy link
Copy Markdown
Contributor

bjorng commented Dec 13, 2022

Thanks! Added to our daily builds.

@bjorng bjorng merged commit d69e7e9 into erlang:master Dec 15, 2022
@bjorng
Copy link
Copy Markdown
Contributor

bjorng commented Dec 15, 2022

Thanks for your pull request!

@gilbertwong96 gilbertwong96 deleted the stdlib/binary-encode-case-sensitive branch December 16, 2022 02:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

team:VM Assigned to OTP team VM testing currently being tested, tag is used by OTP internal CI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants