stdlib: Add `binary:encode_hex/2` by gilbertwong96 · Pull Request #6297 · erlang/otp

gilbertwong96 · 2022-09-14T04:20:08Z

The binary module only exposes encode_hex/1, which only supports uppercase output.

This change adds binary:encode_hex/2 to support uppercase and lowercase hex-encoding. After this change, encode_hex/1 supports uppercase by default.

github-actions · 2022-09-14T04:21:06Z

CT Test Results

      2 files     86 suites 35m 10s ⏱️
1 698 tests 1 649 ✔️ 47 💤 2 ❌
1 971 runs 1 920 ✔️ 49 💤 2 ❌

For more details on these failures, see this check.

Results for commit b12294d.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally.

Artifacts

// Erlang/OTP Github Action Bot

Maria-12648430 · 2022-09-15T13:06:31Z

Something similar has recently been done for the base64 module, see #6280.

The solution initially proposed there was much like this one here, and it slowed down the encoding and decoding process considerably. So, you might want to benchmark the performance before and after your change here 😉

After changes suggested by @bjorng were applied, the revised solution was about on par with the then current implementation. The same approach should also be applicable for this PR.

Maria-12648430 · 2022-09-15T13:23:09Z

+-spec encode_hex(Bin, Case) -> Bin2 when
+      Bin :: binary(),
+      Case :: lower | upper,
+      Bin2 :: <<_:_*16>>.


Would lowercase / uppercase be better namings than upper / lower? Not sure what the precedents are.

Yes, lowercase/uppercase is better.

gilbertwong96 · 2022-09-19T09:54:55Z

Something similar has recently been done for the base64 module, see #6280.

The solution initially proposed there was much like this one here, and it slowed down the encoding and decoding process considerably. So, you might want to benchmark the performance before and after your change here 😉

After changes suggested by @bjorng were applied, the revised solution was about on par with the then current implementation. The same approach should also be applicable for this PR.

The performance must be the same as the original version. The algorithm hasn't been changed. It just provides new APIs.

bjorng · 2022-09-19T12:20:16Z

The performance must be the same as the original version.

Running a modified version of base64_bench on the maint branch, I get the following numbers on my computer:

== Testing with 1 MB ==
fun binary:encode_hex/1: 1000 iterations in 9622 ms: 103 it/sec
fun binary:decode_hex/1: 1000 iterations in 41326 ms: 24 it/sec

With this PR, there is a slight loss of performance:

== Testing with 1 MB ==
fun binary:encode_hex/1: 1000 iterations in 10541 ms: 94 it/sec
fun binary:decode_hex/1: 1000 iterations in 41008 ms: 24 it/sec

On the master branch, the results are:

== Testing with 1 MB ==
fun binary:encode_hex/1: 1000 iterations in 2607 ms: 383 it/sec
fun binary:decode_hex/1: 1000 iterations in 37013 ms: 27 it/sec

With this PR rebased on top of the master branch, the results are:

== Testing with 1 MB ==
fun binary:encode_hex/1: 1000 iterations in 3574 ms: 279 it/sec
fun binary:decode_hex/1: 1000 iterations in 36800 ms: 27 it/sec

gilbertwong96 · 2022-09-19T14:16:10Z

The performance must be the same as the original version.

Running a modified version of base64_bench on the maint branch, I get the following numbers on my computer:
== Testing with 1 MB ==
fun binary:encode_hex/1: 1000 iterations in 9622 ms: 103 it/sec
fun binary:decode_hex/1: 1000 iterations in 41326 ms: 24 it/sec
With this PR, there is a slight loss of performance:
== Testing with 1 MB ==
fun binary:encode_hex/1: 1000 iterations in 10541 ms: 94 it/sec
fun binary:decode_hex/1: 1000 iterations in 41008 ms: 24 it/sec
On the master branch, the results are:
== Testing with 1 MB ==
fun binary:encode_hex/1: 1000 iterations in 2607 ms: 383 it/sec
fun binary:decode_hex/1: 1000 iterations in 37013 ms: 27 it/sec
With this PR rebased on top of the master branch, the results are:
== Testing with 1 MB ==
fun binary:encode_hex/1: 1000 iterations in 3574 ms: 279 it/sec
fun binary:decode_hex/1: 1000 iterations in 36800 ms: 27 it/sec

Thank you. I will find a way to improve it.

Maria-12648430 · 2022-09-19T18:11:15Z

@bjorng out of interest, if you run the benchmarks with lowercase (with these changes applied), does that make it worse? I mean, is the difference in performance more pronounced? Going by gut feeling, it should, as there is now an extra check? (And sorry, can't try myself right now, I'm only on my flimsy mobile 😅)

bjorng · 2022-09-20T06:09:17Z

@Maria-12648430 There were no noticeable differences in benchmark results between the two modes.

Going by gut feeling, it should, as there is now an extra check?

In general, matching is not done one clause at a time in order. In this case, the selection between the two clauses is done by the following BEAM instruction:

    {select_val,{x,2},
                {f,490},
                {list,[{atom,lower},{f,488},{atom,upper},{f,487}]}}.

If there are many values in a select_val a binary search will be used. When there are only a few, as in this case, the JIT will emit code that does a linear search:

# i_select_val_lins_sfI
    mov rsi, qword ptr [rbx+16]
    cmp rsi, 441611
    je label_487
    cmp rsi, 441931
    je label_488
    jmp label_490

I expect the most expensive part of this sequence to be the first instruction, which loads a BEAM register into CPU register rsi. Comparing rsi to a value and not branching is very cheap, relatively speaking. Therefore, it doesn't really matter which of the atoms that is compared first.

Maria-12648430 · 2022-09-20T14:22:10Z

@bjorng thanks for the detailed explanation, very much appreciated 🥰

juhlig · 2022-09-27T17:50:00Z

-encode_hex(Data) when is_binary(Data) ->
-    << <<?HEX(N)>> || <<N>> <= Data >>;
 encode_hex(Bin) ->
+    encode_hex(Bin, upper).


Should be uppercase.

gilbertwong96 · 2022-11-20T12:13:06Z

I improve the performance in some way ugly 😅. @bjorng

Here is the bench on my computer.

On the master branch, the results are:

== Testing with 1 MB ==
fun binary:encode_hex/1: 1000 iterations in 9091 ms: 109 it/sec
fun binary:decode_hex/1: 1000 iterations in 32389 ms: 30 it/sec

With this PR rebased on top of the master branch, the results are:

== Testing with 1 MB ==
fun binary:encode_hex/1: 1000 iterations in 9060 ms: 110 it/sec
fun binary:decode_hex/1: 1000 iterations in 31815 ms: 31 it/sec

juhlig · 2022-11-21T20:38:06Z

I'll take a look at it tomorrow 🙂

juhlig · 2022-11-22T14:36:06Z

Ok. I want to say that my review is inspired (or maybe biased) by similar work recently done on the base64 module. @bjorng may have a different opinion on this. He is waaay more of an expert than I am. Anyway, here goes :)

I would suggest to conflate the functions hex_uppercase/1 and hex_lowercase/1 into one function hex/2, with the second parameter being an offset. Put both the uppercase and lowercase integers into one big tuple, such that the uppercase integers are in positions 1-256, lowercase in 257-512 (don't mind the duplicates):

hex(X, Off) ->
    element(
        X+Off,
        {
            % integers for uppercase
            16#3030, 16#3031, 16#3032, 16#3033, 16#3034, ...
            ..., 16#4642, 16#4643, 16#4644, 16#4645, 16#4646,
            % integers for lowercase
            16#3030, 16#3031, 16#3032, 16#3033, 16#3034, ...
            ..., , 16#6662, 16#6663, 16#6664, 16#6665, 16#6666
        }
    ).

So if you want to query for an uppercase integer, you pass in the offset 1, for a lowercase integer you pass in 257:

1> hex(16#4a, 1).
13371
2> hex(16#4a, 257).
13409

Having only one function for the translation, you can go with only one macro instead of two:

-define(HEX(X, Off), (hex(X, Off)):16).

And having only one function and macro, you don't need two separate functions encode_hex_upper/1 and encode_hex_lower/1 but only one, encode_hex1/2 (I suffixed the name with a 1 since a function encode_hex/2 already exists) which takes the data to encode and an offset:

encode_hex1(Data, Off) when byte_size(Data) rem 8 =:= 0 ->
    << <<?HEX(A, Off), ?HEX(B, Off), ?HEX(C, Off), ?HEX(D, Off),
         ?HEX(E, Off), ?HEX(F, Off), ?HEX(G, Off), ?HEX(H, Off)>>
       || <<A, B, C, D, E, F, G, H>> <= Data >>;
encode_hex1(Data, Off) when byte_size(Data) rem 7 =:= 0 ->
    << <<?HEX(A, Off), ?HEX(B, Off), ?HEX(C, Off), ?HEX(D, Off),
         ?HEX(E, Off), ?HEX(F, Off), ?HEX(G, Off)>>
       || <<A, B, C, D, E, F, G>> <= Data >>;
...
encode_hex1(Data, Off) when is_binary(Data) ->
    << <<?HEX(N, Off)>> || <<N>> <= Data >>;
encode_hex1(Bin, Off) ->
    badarg_with_info([Bin, Off]).

(Actually, if you define your macro like -define(HEX(X), (hex(X, Off)):16)., you can even go without the Off in all the ?HEX(...) usages. However, I'm pretty sure this is bad style 😅)

The proper value for the offset can be determined and passed in from encode_hex/2:

encode_hex(Bin, uppercase) ->
    encode_hex1(Bin, 1);
encode_hex(Bin, lowercase) ->
    encode_hex1(Bin, 257).

I think by going like this, the code would be smaller and cleaner (but make sure to add comments explaining what that Offset is about ;)), without any performance penalties. Anyway, I would advise to wait for @bjorng's comment before doing anything 😉

bjorng

I agree with @juhlig's suggestions.

We don't want to have only property tests; there should always be some tests for if proper is not installed. Please update binary_module_SUITE.

Make sure that the documentation have the correct name for the options.

Maria-12648430 · 2022-11-29T13:44:49Z

+                UpperHex = binary:encode_hex(Data, uppercase),
+                LowerHex = binary:encode_hex(Data, lowercase),
+                binary:decode_hex(LowerHex) =:= Data andalso binary:decode_hex(UpperHex) =:= Data


This is nice as far as it goes, and you should keep it, but this only tests that what was hex-encoded can be hex-decoded. In other words, it is not tested if encode_hex produces the correct output, only that decode_hex can decode whatever encode_hex produced back into the initial binary.

A way you could test correct output (without repeating the actual implementation) could be with a helper function like this:

check_hex_encoded(<<I1:4, I2:4, Ins/binary>>, <<U1:8, U2:8, UCs/binary>>, <<L1:8, L2:8, LCs/binary>>) -> check_hex_chars_match(I1, U1, L1) andalso check_hex_chars_match(I2, U2, L2) andalso check_hex_encoded(Ins, UCs, LCs); check_hex_encoded(<<>>, <<>>, <<>>) -> true; check_hex_encoded(_, _, _) -> false. check_hex_chars_match(X, U, L) when X < 10 -> (U =:= $0 + X) andalso (L =:= $0 + X); check_hex_chars_match(X, U, L) -> (U =:= $A + X -10) andalso (L =:= $a + X -10).

... and call it from the property like check_hex_encoded(Data, UpperHex, LowerHex).
(Note that I just typed this off the top of my head, didn't test it, so there may be bugs and/or typos 😅)

I admit this won't be hideously fast (for which reason I would just use the normally-growing binary() generator instead of resizing it), but with tests it is more important to be thorough than to be fast 😉

juhlig

IMO, this is fine now. Only some minor indendation issues ;)

juhlig · 2022-12-05T08:26:39Z

+            begin
+                UpperHex = binary:encode_hex(Data, uppercase),
+                LowerHex = binary:encode_hex(Data, lowercase),
+                binary:decode_hex(LowerHex) =:= Data andalso
+                    binary:decode_hex(UpperHex) =:= Data andalso
+                        check_hex_encoded(Data, UpperHex, LowerHex)
+            end).


Suggested change

begin

UpperHex = binary:encode_hex(Data, uppercase),

LowerHex = binary:encode_hex(Data, lowercase),

binary:decode_hex(LowerHex) =:= Data andalso

binary:decode_hex(UpperHex) =:= Data andalso

check_hex_encoded(Data, UpperHex, LowerHex)

end).

begin

UpperHex = binary:encode_hex(Data, uppercase),

LowerHex = binary:encode_hex(Data, lowercase),

binary:decode_hex(LowerHex) =:= Data andalso

binary:decode_hex(UpperHex) =:= Data andalso

check_hex_encoded(Data, UpperHex, LowerHex)

end).

Indendation.

juhlig · 2022-12-05T08:27:44Z

+        check_hex_chars_match(I2, U2, L2) andalso
+            check_hex_encoded(Ins, UCs, LCs);


Suggested change

check_hex_chars_match(I2, U2, L2) andalso

check_hex_encoded(Ins, UCs, LCs);

check_hex_chars_match(I2, U2, L2) andalso

check_hex_encoded(Ins, UCs, LCs);

Indendation.

juhlig · 2022-12-05T08:45:46Z

+          <p>Encodes a binary into a hex encoded binary using the specified case for the Hex digits "a" to "f".</p>
+          <p>The default case is <c>uppercase</c>.</p>
      <p><em>Example:</em></p>

      <code>


Suggested change

Encodes a binary into a hex encoded binary using the specified case for the Hex digits "a" to "f".

The default case is <c>uppercase</c>.

Example:

<code>

Encodes a binary into a hex encoded binary using the specified case for the Hex digits "a" to "f".

The default case is <c>uppercase</c>.

Example:

<code>

Indendation.

juhlig · 2022-12-05T08:46:18Z

+&lt;&lt;"2f"&gt;&gt;
+4> binary:encode_hex(&lt;&lt;"/"&gt;&gt;, uppercase).
+&lt;&lt;"2F"&gt;&gt;
+      </code>


Suggested change

</code>

</code>

Indendation.

Maria-12648430 · 2022-12-05T12:28:05Z

Nice, I think this is almost good to go now, only the indendation 😉

gilbertwong96 · 2022-12-06T03:18:54Z

Nice, I think this is almost good to go now, only the indendation 😉

I have already fixed the indentations you mentioned. What about introducing format tools like erlfmt in otp?

bjorng · 2022-12-07T13:42:08Z

I've been busy the last week or so and I haven't had time to fully review the latest changes. Thanks @juhlig and @Maria-12648430 for the reviewing you've done.

However, when I took a quick look yesterday and ran the test suite and attempted to build the documentation, I noticed that the test case binary_module_SUITE:error_info/1 fails, and that the documentation fails to build.

bjorng · 2022-12-07T13:55:11Z

By the way, I've created a ticket number to use in the since attribute: OTP-18354

(@juhlig No, I don't think it's mentioned in our contribution guidelines.)

juhlig · 2022-12-08T13:02:03Z

@gilbertwong96

However, when I took a quick look yesterday and ran the test suite and attempted to build the documentation, I noticed that the test case binary_module_SUITE:error_info/1 fails, and that the documentation fails to build.

The documentation build failure is probably because you removed the since tag from your new function altogether. I believe it can be left empty, but its presence is required. Since you now have a ticket number, you should put it in the since tag like this since="OTP @OTP-18354@", and that should also solve the build issue.

About the error_info test failure, I have no idea though =^^=

gilbertwong96 · 2022-12-09T07:18:11Z

    <<"not a binary">>;
+expand_error(bad_hex_case) ->
+    <<"not uppercase or lowercase">>;
 expand_error(not_compiled_regexp) ->


@gilbertwong96

However, when I took a quick look yesterday and ran the test suite and attempted to build the documentation, I noticed that the test case binary_module_SUITE:error_info/1 fails, and that the documentation fails to build.

The documentation build failure is probably because you removed the since tag from your new function altogether. I believe it can be left empty, but its presence is required. Since you now have a ticket number, you should put it in the since tag like this since="OTP @OTP-18354@", and that should also solve the build issue.

About the error_info test failure, I have no idea though =^^=

I have fixed the error_info test failure now. @juhlig

gilbertwong96 · 2022-12-09T09:39:49Z

The stdlib tests ran successfully on my machine, but it failed in CI

I don't know why.

Do you have any ideas?

bjorng · 2022-12-09T13:31:53Z

We know that there are some test cases that tend to fail in github actions. It is not your fault.

bjorng

I have only one more nit pick in the documentation. Otherwise looks good. When you have fixed that, please squash your commits, rebase on the master branch, and retarget the pull request to master.

bjorng · 2022-12-13T08:12:59Z

+      <fsummary>Encodes a binary into a hex encoded binary with specified case</fsummary>
      <desc>
-      <p>Encodes a binary into a hex encoded binary.</p>
+        <p>Encodes a binary into a hex encoded binary using the specified case for the Hex digits "a" to "f".</p>


Suggested change

Encodes a binary into a hex encoded binary using the specified case for the Hex digits "a" to "f".

Encodes a binary into a hex encoded binary using the specified case for the hexadecimal digits "a" to "f".

The `binary` module only exposes `encode_hex/1` which encodes binary to a hex-encoded binary. The output as the number in hexadecimal, with a through f is in the upper case by default. This change allows the user to decide to use upper case or lower case.

bjorng · 2022-12-13T11:19:30Z

Thanks! Added to our daily builds.

bjorng · 2022-12-15T12:22:56Z

Thanks for your pull request!

gilbertwong96 changed the title ~~Stdlib/binary encode case sensitive~~ sodlib: add binary:encode_hex/2 Sep 14, 2022

gilbertwong96 changed the title ~~sodlib: add binary:encode_hex/2~~ stdlib: Add binary:encode_hex/2 Sep 14, 2022

IngelaAndin added the team:VM Assigned to OTP team VM label Sep 14, 2022

Maria-12648430 reviewed Sep 15, 2022

View reviewed changes

rickard-green assigned bjorng Sep 19, 2022

gilbertwong96 force-pushed the stdlib/binary-encode-case-sensitive branch 2 times, most recently from e2b966c to b12294d Compare September 19, 2022 14:04

juhlig suggested changes Sep 28, 2022

View reviewed changes

gilbertwong96 force-pushed the stdlib/binary-encode-case-sensitive branch from b12294d to 61263c1 Compare November 20, 2022 12:07

gilbertwong96 requested review from Maria-12648430 and juhlig and removed request for Maria-12648430 and juhlig November 21, 2022 02:36

bjorng requested changes Nov 23, 2022

View reviewed changes

gilbertwong96 force-pushed the stdlib/binary-encode-case-sensitive branch from 2593435 to c04af5d Compare November 29, 2022 10:15

Maria-12648430 reviewed Nov 29, 2022

View reviewed changes

gilbertwong96 force-pushed the stdlib/binary-encode-case-sensitive branch from c04af5d to 86bc409 Compare December 4, 2022 05:15

juhlig suggested changes Dec 5, 2022

View reviewed changes

gilbertwong96 force-pushed the stdlib/binary-encode-case-sensitive branch from 86bc409 to e011fc8 Compare December 6, 2022 03:15

gilbertwong96 requested review from Maria-12648430 and juhlig and removed request for Maria-12648430 and bjorng December 7, 2022 06:59

juhlig approved these changes Dec 7, 2022

View reviewed changes

gilbertwong96 force-pushed the stdlib/binary-encode-case-sensitive branch 2 times, most recently from 6716d85 to 090e267 Compare December 9, 2022 07:15

gilbertwong96 commented Dec 9, 2022

View reviewed changes

bjorng approved these changes Dec 13, 2022

View reviewed changes

gilbertwong96 force-pushed the stdlib/binary-encode-case-sensitive branch from 090e267 to 21e3016 Compare December 13, 2022 09:06

gilbertwong96 changed the base branch from maint to master December 13, 2022 09:06

gilbertwong96 force-pushed the stdlib/binary-encode-case-sensitive branch from 21e3016 to a30a54d Compare December 13, 2022 09:10

bjorng added the testing currently being tested, tag is used by OTP internal CI label Dec 13, 2022

bjorng merged commit d69e7e9 into erlang:master Dec 15, 2022

gilbertwong96 deleted the stdlib/binary-encode-case-sensitive branch December 16, 2022 02:50

		check_hex_chars_match(I2, U2, L2) andalso
		check_hex_encoded(Ins, UCs, LCs);

	<p>Encodes a binary into a hex encoded binary using the specified case for the Hex digits "a" to "f".</p>
	<p>Encodes a binary into a hex encoded binary using the specified case for the hexadecimal digits "a" to "f".</p>

Conversation

gilbertwong96 commented Sep 14, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Sep 14, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

CT Test Results

Artifacts

Uh oh!

Maria-12648430 commented Sep 15, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gilbertwong96 commented Sep 19, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bjorng commented Sep 19, 2022

Uh oh!

gilbertwong96 commented Sep 19, 2022

Uh oh!

Maria-12648430 commented Sep 19, 2022

Uh oh!

bjorng commented Sep 20, 2022

Uh oh!

Maria-12648430 commented Sep 20, 2022

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gilbertwong96 commented Nov 20, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

juhlig commented Nov 21, 2022

Uh oh!

juhlig commented Nov 22, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bjorng left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

juhlig left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Maria-12648430 commented Dec 5, 2022

Uh oh!

gilbertwong96 commented Dec 6, 2022

Uh oh!

bjorng commented Dec 7, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bjorng commented Dec 7, 2022

Uh oh!

juhlig commented Dec 8, 2022

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gilbertwong96 commented Dec 9, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bjorng commented Dec 9, 2022

Uh oh!

bjorng left a comment

Choose a reason for hiding this comment

Uh oh!

gilbertwong96 commented Sep 14, 2022 •

edited

Loading

github-actions Bot commented Sep 14, 2022 •

edited

Loading

Maria-12648430 commented Sep 15, 2022 •

edited

Loading

gilbertwong96 commented Sep 19, 2022 •

edited

Loading

gilbertwong96 commented Nov 20, 2022 •

edited

Loading

juhlig commented Nov 22, 2022 •

edited

Loading

bjorng commented Dec 7, 2022 •

edited

Loading

gilbertwong96 commented Dec 9, 2022 •

edited

Loading