Comparing slides #122 and #125 from the CppCon PDF, it seems the BH decoder (hoehrmann in the output below) is almost twice as fast at converting UTF-8 to UTF-16 as it is at converting UTF-8 to UTF-32 for the english_wiki.txt test case. I can reproduce that locally; the numbers I get are as follows:
****** UTF-8 to UTF-32 Conversion ******
for file: 'english_wiki.txt'
UTF-8 to UTF-32 took 720 msec (360275/358131 units/points) (745 reps) (iconv)
UTF-8 to UTF-32 took 1641 msec (360275/358131 units/points) (745 reps) (llvm)
UTF-8 to UTF-32 took 1808 msec (360275/358131 units/points) (745 reps) (av)
UTF-8 to UTF-32 took 1484 msec (360275/358131 units/points) (745 reps) (std::codecvt)
UTF-8 to UTF-32 took 279 msec (360275/358131 units/points) (745 reps) (Boost.Text)
UTF-8 to UTF-32 took 748 msec (360275/358131 units/points) (745 reps) (hoehrmann)
UTF-8 to UTF-32 took 343 msec (360275/358131 units/points) (745 reps) (kewb-basic)
UTF-8 to UTF-32 took 180 msec (360275/358131 units/points) (745 reps) (kewb-fast)
UTF-8 to UTF-32 took 112 msec (360275/358131 units/points) (745 reps) (kewb-sse)
...
****** UTF-8 to UTF-16 Conversion ******
for file: 'english_wiki.txt'
UTF-8 to UTF-16 took 850 msec (360275/358137 units/units) (745 reps) (iconv)
UTF-8 to UTF-16 took 1397 msec (360275/358137 units/units) (745 reps) (llvm)
UTF-8 to UTF-16 took 1592 msec (360275/358137 units/units) (745 reps) (std::codecvt)
UTF-8 to UTF-16 took 836 msec (360275/358137 units/units) (745 reps) (Boost.Text)
UTF-8 to UTF-16 took 443 msec (360275/358137 units/units) (745 reps) (hoehrmann)
UTF-8 to UTF-16 took 360 msec (360275/358137 units/units) (745 reps) (kewb-basic)
UTF-8 to UTF-16 took 178 msec (360275/358137 units/units) (745 reps) (kewb-fast)
UTF-8 to UTF-16 took 62 msec (360275/358137 units/units) (745 reps) (kewb-sse)
Since conversion to UTF-16 is a lot more work than conversion to UTF-32, that is a rather odd result. It is apparently not explained by memory throughput differences (the UTF-32 output probably touches more bytes than the UTF-16 output), as other decoders seem largely unaffected.
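To make the "more work" point concrete, here is a minimal sketch of the output step only, assuming the code point has already been decoded and validated. The names emit_utf32 and emit_utf16 are hypothetical and are not taken from any of the decoders benchmarked above. Going by the unit counts in the output (358137 UTF-16 units for 358131 code points), only 6 code points in this file would take the surrogate-pair path, but the UTF-16 writer still has to test every code point.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: store one decoded code point as UTF-32.
// One unconditional store; the value is not inspected.
inline std::size_t emit_utf32(char32_t cp, char32_t* out) {
    out[0] = cp;
    return 1;  // code units written
}

// Hypothetical helper: store one decoded code point as UTF-16.
// Assumes cp is a valid scalar value (not a surrogate, <= U+10FFFF).
inline std::size_t emit_utf16(char32_t cp, char16_t* out) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single 16-bit unit.
        out[0] = static_cast<char16_t>(cp);
        return 1;
    }
    // Supplementary planes: split the 20-bit offset into a surrogate pair.
    cp -= 0x10000;
    out[0] = static_cast<char16_t>(0xD800 + (cp >> 10));    // high surrogate
    out[1] = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));  // low surrogate
    return 2;
}
```

The UTF-16 path adds a compare and a branch per code point plus a variable output stride, so a slower UTF-32 result would have to come from somewhere other than this step.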
(My numbers are with GCC 5.4.0 on bare-metal Linux on an old i5 in power-saving mode. It looks like the code now uses a much smaller repetition count than when you generated the data for the slides?)