chacha20 - further size and perf improvements#583

Open
daverodgman wants to merge 25 commits into Mbed-TLS:development from daverodgman:chacha10

Contributor

@daverodgman daverodgman commented Nov 29, 2025

Description

Update to the ChaCha20 Neon implementation, with improvements to size and perf.
Default size improves slightly; default perf increases by between 45% (clang
Neon) and 4.2x (gcc scalar).

This PR does two main things:

  • unifies multiblock implementation with scalar, which
    enables multiblock scalar and combined scalar/Neon.
  • various size and perf improvements (for both Neon and scalar).
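
As a rough illustration of what "multiblock scalar" means here, the sketch below computes several consecutive ChaCha20 keystream blocks in one call, using the quarter round and state layout from RFC 8439. It is not the code from this PR: `chacha20_blocks`, `chacha_quarter_round` and `MAX_BLOCKS` are hypothetical names chosen for the example, and the block count limit of 4 is only an assumption mirroring the "4-block scalar" configuration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_BLOCKS 4 /* hypothetical limit, not an Mbed TLS option */

static uint32_t rotl32(uint32_t x, int n)
{
    return (x << n) | (x >> (32 - n));
}

/* ChaCha quarter round (RFC 8439, section 2.1) on words a, b, c, d of s. */
void chacha_quarter_round(uint32_t s[16], int a, int b, int c, int d)
{
    s[a] += s[b]; s[d] = rotl32(s[d] ^ s[a], 16);
    s[c] += s[d]; s[b] = rotl32(s[b] ^ s[c], 12);
    s[a] += s[b]; s[d] = rotl32(s[d] ^ s[a], 8);
    s[c] += s[d]; s[b] = rotl32(s[b] ^ s[c], 7);
}

/* Write nblocks consecutive 64-byte keystream blocks to out.
 * init is the 16-word initial state; word 12 is the block counter. */
void chacha20_blocks(const uint32_t init[16], uint8_t *out, size_t nblocks)
{
    uint32_t st[MAX_BLOCKS][16];

    assert(nblocks >= 1 && nblocks <= MAX_BLOCKS);

    /* Each block starts from the same state with its own counter value. */
    for (size_t b = 0; b < nblocks; b++) {
        memcpy(st[b], init, sizeof(st[b]));
        st[b][12] += (uint32_t) b;
    }

    /* 10 double rounds. The blocks are independent, so running the same
     * round on all of them back to back exposes instruction-level
     * parallelism to the compiler and CPU. */
    for (int round = 0; round < 10; round++) {
        for (size_t b = 0; b < nblocks; b++) {
            /* column round */
            chacha_quarter_round(st[b], 0, 4, 8, 12);
            chacha_quarter_round(st[b], 1, 5, 9, 13);
            chacha_quarter_round(st[b], 2, 6, 10, 14);
            chacha_quarter_round(st[b], 3, 7, 11, 15);
            /* diagonal round */
            chacha_quarter_round(st[b], 0, 5, 10, 15);
            chacha_quarter_round(st[b], 1, 6, 11, 12);
            chacha_quarter_round(st[b], 2, 7, 8, 13);
            chacha_quarter_round(st[b], 3, 4, 9, 14);
        }
    }

    /* Add the per-block initial state and serialise little-endian. */
    for (size_t b = 0; b < nblocks; b++) {
        for (size_t i = 0; i < 16; i++) {
            uint32_t w = st[b][i] + init[i];
            if (i == 12) {
                w += (uint32_t) b;
            }
            uint8_t *p = out + 64 * b + 4 * i;
            p[0] = (uint8_t) w;
            p[1] = (uint8_t) (w >> 8);
            p[2] = (uint8_t) (w >> 16);
            p[3] = (uint8_t) (w >> 24);
        }
    }
}
```

A caller would advance the counter word by `nblocks` after each call. Block `i` of the output is the same as a single-block call made with the counter advanced by `i`, which is what lets a single-block and a multiblock path share one implementation.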

Compared to the previous Neon implementation, this enables either smaller size
or better performance. At the default setting, it has slightly improved size and
better performance (clang 38%, gcc 63%). Minimum size reduces by around
300 bytes. Maximum perf increases around 45% (clang) or 58% (gcc).

At settings supported by the previous version (1-6 Neon blocks), performance is
roughly the same, but size is significantly reduced. This enables increasing the
default number of blocks without regressing size, giving better default
performance.

Compared to the previous scalar implementation (which was unchanged by the
previous Neon implementation, i.e. the same as v1.0.0), the size improvement is
large enough that 4-block scalar is smaller than the previous single-block
scalar implementation, with about 32% performance uplift on clang. gcc
underperformed with the old scalar implementation, so for gcc the default scalar
implementation (4-block) is about 4.2x faster, and single-block scalar is about
3x faster.

Compared to v1.0.0, default uplift with Neon enabled is 1.95x (clang) or 5.6x
(gcc). Default scalar uplift is 32% (clang) or 4.4x (gcc). Default code size
saving is around 200b (clang scalar), 100b (gcc scalar), 232b (clang Neon) or
120b (gcc Neon).

In almost all cases, default code size on aarch64 is smaller than both v1.0.0
and the previous version. The only known size regressions are gcc Thumb 2, which
is 61 bytes worse than v1.0.0, and gcc default Neon, which is smaller than
v1.0.0 but 33b larger than the old Neon implementation.

There are some compiler-specific paths, but I've tried to keep these to a
minimum.

Notes: size is in bytes; perf is GB/s. n=0, s=0 is single-block scalar with
additional size optimisations enabled. 2406341 is the version currently merged
in development. Default settings shown in bold.

clang 17.0 -Os aarch64 results:

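| Config | v1.0.0 size | v1.0.0 perf | 2406341 size | 2406341 perf | latest size | latest perf |
|---|---|---|---|---|---|---|
| n=0, s=0 | | | | | 2995 | 0.42 |
| n=0, s=1 | 3595 | 0.87 | 3520 | 0.85 | 3283 | 0.91 |
| n=0, s=2 | | | | | 3387 | 0.98 |
| n=0, s=3 | | | | | 3395 | 1.04 |
| n=0, s=4 | | | | | 3395 | 1.15 |
| n=1, s=0 | | | 3190 | 0.74 | 2871 | 0.76 |
| n=2, s=0 | | | 3400 | 1.04 | 3147 | 1.04 |
| n=3, s=0 | | | 3620 | 1.24 | 3247 | 1.24 |
| n=4, s=0 | | | 3860 | 1.72 | 3363 | 1.70 |
| n=4, s=2 | | | | | 3899 | 2.17 |
| n=4, s=4 | | | | | 3891 | 2.15 |
| n=5, s=0 | | | 4050 | 1.55 | 3487 | 1.61 |
| n=5, s=3 | | | | | 3999 | 2.35 |
| n=6, s=0 | | | 4300 | 1.82 | 3611 | 1.87 |
| n=6, s=2 | | | | | 4147 | 2.65 |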

gcc 15.2 -Os aarch64 results:

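| Config | v1.0.0 size | v1.0.0 perf | 2406341 size | 2406341 perf | latest size | latest perf |
|---|---|---|---|---|---|---|
| n=0, s=0 | | | | | 3291 | 0.42 |
| n=0, s=1 | 3863 | 0.31 | 3790 | 0.30 | 3635 | 0.92 |
| n=0, s=2 | | | | | 3743 | 1.04 |
| n=0, s=3 | | | | | 3787 | 1.16 |
| n=0, s=4 | | | | | 3763 | 1.35 |
| n=1, s=0 | | | 3480 | 0.75 | 3139 | 0.77 |
| n=2, s=0 | | | 3710 | 1.05 | 3387 | 1.07 |
| n=3, s=0 | | | 4000 | 1.25 | 3559 | 1.26 |
| n=4, s=0 | | | 4380 | 1.72 | 3743 | 1.72 |
| n=5, s=0 | | | 4720 | 1.61 | 3931 | 1.65 |
| n=5, s=1 | | | | | 4667 | 2.05 |
| n=5, s=2 | | | | | 5883 | 2.16 |
| n=5, s=3 | | | | | 6875 | 2.89 |
| n=6, s=0 | | | 5060 | 1.82 | 4099 | 1.89 |
| n=6, s=2 | | | | | 6175 | 2.89 |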
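
The headline figures in the description can be cross-checked against the two results tables with simple arithmetic. The helpers below are a minimal sketch with hypothetical names (`speedup`, `size_saving`). The choice of rows is an assumption: the bold markers for the default settings are not visible here, so the default Neon row is taken to be n=4, s=0 and the default scalar row n=0, s=4, which is what the quoted size savings match.

```c
/* Ratio of new throughput to old throughput (both in GB/s). */
double speedup(double new_perf, double old_perf)
{
    return new_perf / old_perf;
}

/* Code size saved in bytes; positive means the new build is smaller. */
int size_saving(int old_size, int new_size)
{
    return old_size - new_size;
}
```

Under that assumption, for clang against v1.0.0, `speedup(1.70, 0.87)` is about 1.95 and `size_saving(3595, 3363)` is 232, in line with the "1.95x" and "232b" quoted for default Neon. For clang scalar, `speedup(1.15, 0.87)` is about 1.32 and `size_saving(3595, 3395)` is 200, in line with "32%" and "200b".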

PR checklist

Please remove the segment/s on either side of the | symbol as appropriate, and add any relevant link/s to the end of the line.
If the provided content is part of the present PR remove the # symbol.

  • changelog provided
  • framework PR not required
  • mbedtls development PR not required because: no functional change
  • mbedtls 3.6 PR not required because: no functional change
  • tests provided

Notes for the submitter

Please refer to the contributing guidelines, especially the
checklist for PR contributors.

Help make review efficient:

  • Multiple simple commits
    • please structure your PR into a series of small commits, each of which does one thing
  • Avoid force-push
    • please do not force-push to update your PR - just add new commit(s)
  • See our Guidelines for Contributors for more details about the review process.

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>
This is consistent with what is done for other ciphers.

If autovec is enabled, the compiler may vectorise the scalar
code, resulting in worse performance for mixed Neon/scalar
multiblock.
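
One way to keep the compiler from auto-vectorising a scalar routine is sketched below. This is an illustration only, not necessarily how this PR does it: the macro names `NO_AUTOVEC_FN` and `NO_AUTOVEC_LOOP` and the function `xor_keystream` are hypothetical. It relies on the GCC `optimize` function attribute and the clang `loop` pragma; the clang pragma only affects the loop that follows it and does not cover SLP vectorisation.

```c
#include <stddef.h>
#include <stdint.h>

/* clang also defines __GNUC__, so it has to be tested first. */
#if defined(__clang__)
#define NO_AUTOVEC_FN
#define NO_AUTOVEC_LOOP \
    _Pragma("clang loop vectorize(disable) interleave(disable)")
#elif defined(__GNUC__)
#define NO_AUTOVEC_FN __attribute__((optimize("no-tree-vectorize")))
#define NO_AUTOVEC_LOOP
#else
#define NO_AUTOVEC_FN
#define NO_AUTOVEC_LOOP
#endif

/* XOR len bytes of keystream into src, writing the result to dst.
 * The annotations ask the compiler to leave this loop scalar. */
NO_AUTOVEC_FN
void xor_keystream(uint8_t *dst, const uint8_t *src,
                   const uint8_t *ks, size_t len)
{
    NO_AUTOVEC_LOOP
    for (size_t i = 0; i < len; i++) {
        dst[i] = src[i] ^ ks[i];
    }
}
```

Scoping the setting to individual functions or loops, rather than passing a flag for the whole file, leaves the rest of the translation unit free to be vectorised where that helps.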

@minosgalanakis minosgalanakis added enhancement New feature or request size-m Estimated task size: medium (~1w) needs-review Every commit must be reviewed by at least two team members needs-reviewer This PR needs someone to pick it up for review priority-low Low priority - this may not receive review soon labels Dec 23, 2025
@davidhorstmann-arm
Contributor

Thanks for this contribution!

It's a bit large for us to take in a normal community review, size-m. We might have to schedule this or ask for a split PR.

@davidhorstmann-arm davidhorstmann-arm moved this from Triage in to Scoped in Community Jan 14, 2026
@davidhorstmann-arm davidhorstmann-arm added the priority-scheduled This PR is big - it will require time to be scheduled for review label Jan 14, 2026