chacha20 - further size and perf improvements #583
Open
daverodgman wants to merge 25 commits into Mbed-TLS:development
Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>
9b3a332 to 2b31722
This is consistent with what is done for other ciphers.
If autovec is enabled, the compiler may vectorise the scalar code, resulting in worse performance for mixed Neon/scalar multiblock.
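As an illustration of the point in this commit message, here is a minimal sketch of how auto-vectorisation can be switched off for one scalar routine. It is not taken from this PR: the `SKETCH_*` macro names and `sketch_xor_words` are hypothetical, and the PR may use a different mechanism. It relies on clang's `#pragma clang loop vectorize(disable)` and gcc's `optimize` function attribute.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper macros (not from the PR): pick a per-compiler way to
 * stop the auto-vectoriser from touching a scalar loop. */
#if defined(__clang__)
#define SKETCH_NO_AUTOVEC_FN
#define SKETCH_NO_AUTOVEC_LOOP \
    _Pragma("clang loop vectorize(disable) interleave(disable)")
#elif defined(__GNUC__)
#define SKETCH_NO_AUTOVEC_FN __attribute__((optimize("no-tree-vectorize")))
#define SKETCH_NO_AUTOVEC_LOOP
#else
#define SKETCH_NO_AUTOVEC_FN
#define SKETCH_NO_AUTOVEC_LOOP
#endif

/* XOR n words of src into dst using plain scalar code. */
SKETCH_NO_AUTOVEC_FN
static void sketch_xor_words(uint32_t *dst, const uint32_t *src, size_t n)
{
    SKETCH_NO_AUTOVEC_LOOP
    for (size_t i = 0; i < n; i++) {
        dst[i] ^= src[i];
    }
}
```

The intent is that the scalar path stays on the general-purpose registers, so it can run alongside hand-written Neon code rather than competing with it for the vector unit.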
Contributor
Thanks for this contribution! It's a bit large for us to take in a normal community review, size-m. We might have to schedule this or ask for a split PR.
Description
Update to the ChaCha20 Neon implementation, with improvements to size and perf.
Default size improves slightly; default perf increases by between 45% (clang
Neon) and 4.2x (gcc scalar).
This PR does two main things:
enables multiblock scalar and combined scalar/Neon.
Compared to the previous Neon implementation, this enables either smaller size
or better performance. At the default setting, it has slightly improved size and
better performance (clang 38%, gcc 63%). Minimum size reduces by around
300 bytes. Maximum perf increases by around 45% (clang) or 58% (gcc).
At settings supported by the previous version (1-6 Neon blocks), performance is
roughly the same, but size is significantly reduced. This enables increasing the
default number of blocks without regressing size, giving better default
performance.
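To make "multiblock scalar" concrete, the following is a minimal sketch of generating several ChaCha20 keystream blocks per call in portable C, following RFC 8439. It is not the code in this PR: every `sketch_*` name and `SKETCH_BLOCKS` is hypothetical, and the real implementation may interleave, unroll, or mix in Neon differently. The idea shown is that the blocks are independent, so running each round across all blocks gives the compiler several separate dependency chains to schedule.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BLOCKS 4 /* hypothetical number of blocks per call */

static inline uint32_t sketch_rotl32(uint32_t v, unsigned n)
{
    return (v << n) | (v >> (32 - n));
}

static inline uint32_t sketch_load32_le(const uint8_t *p)
{
    return (uint32_t) p[0] | ((uint32_t) p[1] << 8) |
           ((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
}

/* ChaCha quarter round on words a, b, c, d of x (RFC 8439, section 2.1). */
static inline void sketch_qr(uint32_t x[16], int a, int b, int c, int d)
{
    x[a] += x[b]; x[d] ^= x[a]; x[d] = sketch_rotl32(x[d], 16);
    x[c] += x[d]; x[b] ^= x[c]; x[b] = sketch_rotl32(x[b], 12);
    x[a] += x[b]; x[d] ^= x[a]; x[d] = sketch_rotl32(x[d], 8);
    x[c] += x[d]; x[b] ^= x[c]; x[b] = sketch_rotl32(x[b], 7);
}

/* Fill the 16-word state: constants, 256-bit key, counter, 96-bit nonce. */
static void sketch_chacha20_init(uint32_t state[16], const uint8_t key[32],
                                 const uint8_t nonce[12], uint32_t counter)
{
    state[0] = 0x61707865; state[1] = 0x3320646e;
    state[2] = 0x79622d32; state[3] = 0x6b206574;
    for (int i = 0; i < 8; i++) {
        state[4 + i] = sketch_load32_le(key + 4 * i);
    }
    state[12] = counter;
    for (int i = 0; i < 3; i++) {
        state[13 + i] = sketch_load32_le(nonce + 4 * i);
    }
}

/* Produce SKETCH_BLOCKS consecutive 64-byte keystream blocks and advance
 * the block counter in state[12] by SKETCH_BLOCKS. */
static void sketch_chacha20_multiblock(uint32_t state[16],
                                       uint8_t out[64 * SKETCH_BLOCKS])
{
    uint32_t x[SKETCH_BLOCKS][16];

    /* Each block starts from the same state with its own counter value. */
    for (size_t blk = 0; blk < SKETCH_BLOCKS; blk++) {
        memcpy(x[blk], state, sizeof(x[blk]));
        x[blk][12] = state[12] + (uint32_t) blk;
    }

    /* 20 rounds = 10 double rounds, applied to every block in turn. */
    for (int i = 0; i < 10; i++) {
        for (size_t blk = 0; blk < SKETCH_BLOCKS; blk++) {
            sketch_qr(x[blk], 0, 4, 8, 12);  /* column rounds */
            sketch_qr(x[blk], 1, 5, 9, 13);
            sketch_qr(x[blk], 2, 6, 10, 14);
            sketch_qr(x[blk], 3, 7, 11, 15);
            sketch_qr(x[blk], 0, 5, 10, 15); /* diagonal rounds */
            sketch_qr(x[blk], 1, 6, 11, 12);
            sketch_qr(x[blk], 2, 7, 8, 13);
            sketch_qr(x[blk], 3, 4, 9, 14);
        }
    }

    /* Add the per-block input state and serialise little-endian. */
    for (size_t blk = 0; blk < SKETCH_BLOCKS; blk++) {
        for (int i = 0; i < 16; i++) {
            uint32_t in = state[i] + (i == 12 ? (uint32_t) blk : 0u);
            uint32_t v = x[blk][i] + in;
            uint8_t *p = out + 64 * blk + 4 * i;
            p[0] = (uint8_t) v;         p[1] = (uint8_t) (v >> 8);
            p[2] = (uint8_t) (v >> 16); p[3] = (uint8_t) (v >> 24);
        }
    }
    state[12] += SKETCH_BLOCKS;
}
```

A caller would XOR the returned keystream into the plaintext and fall back to a single-block path for any tail shorter than `64 * SKETCH_BLOCKS` bytes. A combined scalar/Neon variant would, under the same assumptions, hand some of the blocks in each batch to a vector routine and the rest to scalar code like this.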
Compared to the previous scalar implementation (which was unchanged by the
previous Neon implementation, i.e. same as v1.0.0), the size improvement is
large enough that 4-block scalar is smaller than the previous single-block
scalar implementation, with about 32% performance uplift on clang. gcc
underperformed on the old scalar implementation, so for gcc, the default scalar
implementation (4-block) is about 4.2x faster, and single-block scalar is about
3x faster.
Compared to v1.0.0, default uplift with Neon enabled is 1.95x (clang) or 5.6x
(gcc). Default scalar uplift is 32% (clang) or 4.4x (gcc). Default code size
saving is around 200 bytes (clang scalar), 100 bytes (gcc scalar), 232 bytes
(clang Neon) or 120 bytes (gcc Neon).
In almost all cases, default code size on aarch64 is smaller than both v1.0.0
and the previous version. The only known size regressions are gcc Thumb 2, which
is 61 bytes larger than v1.0.0, and gcc default Neon, which is smaller than
v1.0.0 but 33 bytes larger than the old Neon implementation.
There are some compiler-specific paths, but I've tried to keep these to a
minimum.
Notes: size is in bytes; perf is in GB/s. n=0, s=0 is single-block scalar with
additional size optimisations enabled. 2406341 is the version currently merged
in development. Default settings are shown in bold.
clang 17.0 -Os aarch64 results:
gcc 15.2 -Os aarch64 results:
PR checklist
Please remove the segment/s on either side of the | symbol as appropriate, and add any relevant link/s to the end of the line.
If the provided content is part of the present PR remove the # symbol.
Notes for the submitter
Please refer to the contributing guidelines, especially the
checklist for PR contributors.
Help make review efficient: