chacha20 - further size and perf improvements by daverodgman · Pull Request #583 · Mbed-TLS/TF-PSA-Crypto

daverodgman · 2025-11-29T13:15:40Z

Description

Update to chacha neon, with improvements to size and perf. Default size slightly
improves; default perf increases by between 45% (clang Neon) and 4.2x (gcc scalar).

This PR does two main things:

- unifies the multiblock implementation with scalar, which enables multiblock
  scalar and combined scalar/Neon.
- various size and perf improvements (for both Neon and scalar).
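As a rough illustration of what "multiblock scalar" means, here is a minimal sketch, not the code from this PR. It computes several keystream blocks from one input state by bumping the block counter per block. The names `quarter_round`, `sketch_block` and `sketch_multiblock` are hypothetical, and a real multiblock implementation would presumably interleave the blocks rather than loop over them one at a time.

```c
#include <stddef.h>
#include <stdint.h>

#define ROTL32(v, n) (((v) << (n)) | ((v) >> (32 - (n))))

/* One ChaCha quarter round, as defined in RFC 8439 section 2.1. */
static void quarter_round(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
    *a += *b; *d ^= *a; *d = ROTL32(*d, 16);
    *c += *d; *b ^= *c; *b = ROTL32(*b, 12);
    *a += *b; *d ^= *a; *d = ROTL32(*d, 8);
    *c += *d; *b ^= *c; *b = ROTL32(*b, 7);
}

/* Hypothetical helper: one keystream block (as 16 words) for the given
 * block counter; word 12 of the input state is replaced by `counter`. */
static void sketch_block(const uint32_t in[16], uint32_t counter,
                         uint32_t out[16])
{
    uint32_t x[16];
    for (size_t i = 0; i < 16; i++) {
        x[i] = (i == 12) ? counter : in[i];
    }
    /* 20 rounds = 10 iterations of (column round + diagonal round). */
    for (int r = 0; r < 10; r++) {
        quarter_round(&x[0], &x[4], &x[8],  &x[12]);
        quarter_round(&x[1], &x[5], &x[9],  &x[13]);
        quarter_round(&x[2], &x[6], &x[10], &x[14]);
        quarter_round(&x[3], &x[7], &x[11], &x[15]);
        quarter_round(&x[0], &x[5], &x[10], &x[15]);
        quarter_round(&x[1], &x[6], &x[11], &x[12]);
        quarter_round(&x[2], &x[7], &x[8],  &x[13]);
        quarter_round(&x[3], &x[4], &x[9],  &x[14]);
    }
    /* Feed-forward: add the original input words back in. */
    for (size_t i = 0; i < 16; i++) {
        out[i] = x[i] + ((i == 12) ? counter : in[i]);
    }
}

/* Hypothetical multiblock driver: block b uses counter in[12] + b. */
static void sketch_multiblock(const uint32_t in[16], size_t nblocks,
                              uint32_t out[][16])
{
    for (size_t b = 0; b < nblocks; b++) {
        sketch_block(in, in[12] + (uint32_t) b, out[b]);
    }
}
```

The outputs are left as 32-bit words; serialising them little-endian and XORing with the input data is omitted to keep the sketch short.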

Compared to the previous Neon implementation, this enables either smaller size
or better performance. At the default setting, it has slightly improved size and
better performance (clang 38%, gcc 63%). Minimum size reduces by around
300 bytes. Maximum perf increases around 45% (clang) or 58% (gcc).

At settings supported by the previous version (1-6 Neon blocks), performance is
roughly the same, but size is significantly reduced. This enables increasing the
default number of blocks without regressing size, giving better default
performance.

Compared to the previous scalar implementation (which was unchanged by the
previous Neon implementation, i.e. same as v1.0.0), there is enough size
improvement so that 4-block scalar is smaller than the previous single-block
scalar implementation, with about 32% performance uplift on clang. gcc
underperformed for the old scalar implementation, so for gcc, the default scalar
implementation (4-block) is about 4.2x faster, and single-block scalar is about
3x faster.

Compared to v1.0.0, default uplift with Neon enabled is 1.95x (clang) or 5.6x
(gcc). Default scalar uplift is 32% (clang) or 4.4x (gcc). Default code size
saving is around 200 bytes (clang scalar), 100 bytes (gcc scalar), 232 bytes
(clang Neon) or 120 bytes (gcc Neon).
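For readers who want to check these figures against the results tables, a small sketch of the arithmetic follows. It assumes the default configurations are the n=4, s=0 row (Neon) and the n=0, s=4 row (scalar); that mapping is inferred from the numbers rather than stated explicitly. The helper names `uplift` and `size_saving` are hypothetical.

```c
/* Speed-up factor: new throughput divided by old throughput. */
static double uplift(double new_gbps, double old_gbps)
{
    return new_gbps / old_gbps;
}

/* Code-size saving in bytes; positive means the new build is smaller. */
static int size_saving(int old_bytes, int new_bytes)
{
    return old_bytes - new_bytes;
}
```

With the clang figures, `uplift(1.70, 0.87)` is roughly 1.95 and `size_saving(3595, 3363)` is 232 for Neon, while `uplift(1.15, 0.87)` is roughly 1.32 and `size_saving(3595, 3395)` is 200 for scalar.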

In almost all cases, default code size on aarch64 is smaller compared to both
v1.0.0 and the previous version. The only known size regressions are gcc Thumb 2,
which is 61 bytes worse compared to v1.0.0, and gcc default Neon, which is
smaller than v1.0.0 but 33 bytes larger than the old Neon implementation.

There are some compiler specific paths, but I've tried to keep this to a
minimum.
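One example of a compiler-specific path is the commit in this PR that disables auto-vectorisation for chacha20.c. The sketch below shows one way this could be done per function or per loop; it is an assumption about the general technique, not the PR's actual code. The `SKETCH_*` macros and `sketch_xor_words` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__clang__)
/* clang: disable vectorisation of the loop that follows the pragma. */
#define SKETCH_NO_AUTOVEC_FN
#define SKETCH_NO_AUTOVEC_LOOP _Pragma("clang loop vectorize(disable)")
#elif defined(__GNUC__)
/* gcc: turn off the tree vectoriser for the annotated function. */
#define SKETCH_NO_AUTOVEC_FN __attribute__((optimize("no-tree-vectorize")))
#define SKETCH_NO_AUTOVEC_LOOP
#else
/* Other compilers: no-op. */
#define SKETCH_NO_AUTOVEC_FN
#define SKETCH_NO_AUTOVEC_LOOP
#endif

/* Scalar word-wise XOR that the compiler is asked not to vectorise. */
SKETCH_NO_AUTOVEC_FN
static void sketch_xor_words(uint32_t *dst, const uint32_t *a,
                             const uint32_t *b, size_t n)
{
    SKETCH_NO_AUTOVEC_LOOP
    for (size_t i = 0; i < n; i++) {
        dst[i] = a[i] ^ b[i];
    }
}
```

Keeping the compiler checks inside a pair of macros like this confines the compiler-specific part to one place, so the function bodies themselves stay portable.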

Notes: size is in bytes; perf is GB/s. n is the number of Neon blocks and s the
number of scalar blocks. n=0, s=0 is single-block scalar with additional size
optimisations enabled. `2406341` is the version currently merged in development.
Default settings shown in bold.

clang 17.0 -Os aarch64 results:

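| Config | v1.0.0 size | v1.0.0 perf | `2406341` size | `2406341` perf | latest size | latest perf |
|---|---|---|---|---|---|---|
| n=0, s=0 | | | | | 2995 | 0.42 |
| n=0, s=1 | 3595 | 0.87 | 3520 | 0.85 | 3283 | 0.91 |
| n=0, s=2 | | | | | 3387 | 0.98 |
| n=0, s=3 | | | | | 3395 | 1.04 |
| **n=0, s=4** | | | | | **3395** | **1.15** |
| n=1, s=0 | | | 3190 | 0.74 | 2871 | 0.76 |
| n=2, s=0 | | | 3400 | 1.04 | 3147 | 1.04 |
| n=3, s=0 | | | 3620 | 1.24 | 3247 | 1.24 |
| **n=4, s=0** | | | 3860 | 1.72 | **3363** | **1.70** |
| n=4, s=2 | | | | | 3899 | 2.17 |
| n=4, s=4 | | | | | 3891 | 2.15 |
| n=5, s=0 | | | 4050 | 1.55 | 3487 | 1.61 |
| n=5, s=3 | | | | | 3999 | 2.35 |
| n=6, s=0 | | | 4300 | 1.82 | 3611 | 1.87 |
| n=6, s=2 | | | | | 4147 | 2.65 |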

gcc 15.2 -Os aarch64 results:

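| Config | v1.0.0 size | v1.0.0 perf | `2406341` size | `2406341` perf | latest size | latest perf |
|---|---|---|---|---|---|---|
| n=0, s=0 | | | | | 3291 | 0.42 |
| n=0, s=1 | 3863 | 0.31 | 3790 | 0.30 | 3635 | 0.92 |
| n=0, s=2 | | | | | 3743 | 1.04 |
| n=0, s=3 | | | | | 3787 | 1.16 |
| **n=0, s=4** | | | | | **3763** | **1.35** |
| n=1, s=0 | | | 3480 | 0.75 | 3139 | 0.77 |
| n=2, s=0 | | | 3710 | 1.05 | 3387 | 1.07 |
| n=3, s=0 | | | 4000 | 1.25 | 3559 | 1.26 |
| **n=4, s=0** | | | 4380 | 1.72 | **3743** | **1.72** |
| n=5, s=0 | | | 4720 | 1.61 | 3931 | 1.65 |
| n=5, s=1 | | | | | 4667 | 2.05 |
| n=5, s=2 | | | | | 5883 | 2.16 |
| n=5, s=3 | | | | | 6875 | 2.89 |
| n=6, s=0 | | | 5060 | 1.82 | 4099 | 1.89 |
| n=6, s=2 | | | | | 6175 | 2.89 |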

PR checklist

- changelog provided
- framework PR not required
- mbedtls development PR not required because: no functional change
- mbedtls 3.6 PR not required because: no functional change
- tests provided

davidhorstmann-arm · 2026-01-14T11:10:06Z

Thanks for this contribution!

It's a bit large for us to take in a normal community review, size-m. We might have to schedule this or ask for a split PR.

daverodgman added 2 commits November 29, 2025 13:02

separate out scalar prep/finish block

bf9fc76

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>

prepare for scalar multiblock

5f68c58

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>

daverodgman force-pushed the chacha10 branch from 9b3a332 to 2b31722 Compare November 29, 2025 14:56

davidhorstmann-arm added this to Community Dec 1, 2025

davidhorstmann-arm moved this to Triage in in Community Dec 1, 2025

daverodgman force-pushed the chacha10 branch from 2b31722 to 98c9f20 Compare December 1, 2025 16:20

daverodgman added 16 commits December 1, 2025 16:33

enable simple scalar multi-block

90d3a9e

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>

move definition to common header

6f099d5

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>

align shape of neon impl with scalar

e59a2b0

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>

join up scalar and Neon implementations

456fb00

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>

move Neon impl as static inline in header to enable inlining

b8cd09e

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>

guard maybe-unused functions with ifdefs

1d1d44d

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>

better zeroisation

eb0fdc2

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>

remove not-needed zeroize

0e3258b

This is consistent with what is done for other ciphers.

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>

remove redundant initialisation code

124e651

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>

disable autovec for chacha20.c

0677608

If autovec is enabled, the compiler may vectorise the scalar code, resulting in worse performance for mixed Neon/scalar multiblock.

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>

introduce MBEDTLS_CHACHA20_FORCE_UNROLL

b54d1f9

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>

size and perf improvements in scalar impl

a9dfe02

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>

size and perf improvements in main chacha loop

cdac1db

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>

size and perf improvements in neon implementation

5dd8ccb

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>

docs and settings

07beac4

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>

changelog

b33ee68

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>

daverodgman force-pushed the chacha10 branch from 98c9f20 to bf177ae Compare December 1, 2025 16:38

test all variations

f5caa18

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>

daverodgman force-pushed the chacha10 branch from bf177ae to f5caa18 Compare December 1, 2025 18:24

daverodgman added 3 commits December 2, 2025 11:43

fix type error

869d881

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>

fix macro name

4ef1e02

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>

satisfy check_names

16d574b

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>

daverodgman force-pushed the chacha10 branch from 1ba1d17 to 16d574b Compare December 2, 2025 15:16

fix GCC version check

e638c84

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>

daverodgman added 2 commits December 3, 2025 10:31

fix style

1f29ea6

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>

Merge remote-tracking branch 'origin/development' into chacha10

dcead35

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>

davidhorstmann-arm moved this from Triage in to Scoped in Community Jan 14, 2026

davidhorstmann-arm added the priority-scheduled label (This PR is big - it will require time to be scheduled for review) Jan 14, 2026

daverodgman wants to merge 25 commits into Mbed-TLS:development from daverodgman:chacha10
