wolfCrypt on TI C2000 C28x (LAUNCHXL-F28P55X)#10724

Draft
dgarske wants to merge 4 commits into wolfSSL:master from dgarske:ti_c25

@dgarske dgarske commented Jun 18, 2026

Member

wolfCrypt: support TI C2000 C28x (CHAR_BIT == 16) targets

What

Enables wolfCrypt on toolchains where a C byte/char is wider than 8 bits, specifically the TI C2000 C28x DSP, where CHAR_BIT == 16 (the smallest addressable unit is a 16-bit cell, int/short are 16-bit, long is 32-bit). Validated on a TI LAUNCHXL-F28P55X (TMS320F28P550SJ) at 150 MHz: SHA-256/384/512 (plus SHA-512/224 and SHA-512/256), SHA-3, SHAKE128/256, ML-DSA-87 (verify, keygen, sign), and ECDSA + ECDH P-256 all pass on hardware. wolfcrypt_test passes on x86-64 with no regression.

Why it's non-trivial

On a 16-bit-char target, a word32 occupies two 16-bit cells (two octets packed per cell), so sizeof(word32) == 2, while a byte[] holds one octet per cell. The common idioms are therefore all wrong: aliasing a word as a byte stream ((byte*)&w, XMEMCPY+ByteReverseWords), using sizeof as a byte count, using (byte)x to truncate to an octet, and using 8 * sizeof(x) as a bit width. There are also two cl2000 codegen quirks. First, (word32)octet << 24 is miscompiled as a 16-bit shift; the fix accumulates with <<= 8. Second, the 32x32->32 q^-1 multiply in the ML-DSA Montgomery reduction is miscompiled (split-testing on hardware pinned it to that one multiply). The 64-bit widening multiply compiles correctly, so the fix computes the q^-1 product through the 64-bit path, which is also ~4% faster than the shift-based form.
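The octet-wise word I/O described above can be sketched roughly as follows. This is an illustration only: `u32_t`, `cell_t`, `OCTET`, `load_be32` and `store_be32` are hypothetical names, not the wolfSSL helpers, and the sketch assumes each array element carries one octet in its low 8 bits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration; not the wolfSSL helpers. */
typedef unsigned long u32_t;   /* at least 32 bits on any conforming target */
typedef unsigned char cell_t;  /* one addressable cell; 16 bits on the C28x */

/* Keep only the low 8 bits of a value (a no-op when CHAR_BIT == 8). */
#define OCTET(x) ((cell_t)((x) & 0xFFu))

/* Load a big-endian 32-bit value from four cells, one octet per cell.
 * The accumulator is shifted by 8 each step, so no single shift ever
 * depends on how a promoted octet is widened before a large shift. */
static u32_t load_be32(const cell_t *in)
{
    u32_t w = 0;
    size_t i;

    assert(in != NULL);
    for (i = 0; i < 4; i++) {
        w <<= 8;
        w |= (u32_t)OCTET(in[i]);
    }
    return w & 0xFFFFFFFFuL;   /* trim if unsigned long is wider than 32 */
}

/* Store a 32-bit value as four big-endian octets, one per cell. */
static void store_be32(cell_t *out, u32_t w)
{
    size_t i;

    assert(out != NULL);
    for (i = 4; i-- > 0; ) {
        out[i] = OCTET(w);
        w >>= 8;
    }
}
```

Because neither function aliases the word as a cell array or uses sizeof as an octet count, the same code gives the same octet sequence whether CHAR_BIT is 8 or 16.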

Changes (4 commits, all gated / no-op on 8-bit-byte targets)

  1. infra, hashes, DRBG - types.h auto-detects WOLFSSL_WIDE_BYTE (CHAR_BIT!=8 / TI C2000 toolchains), guarantees CHAR_BIT is defined, and adds the shared WC_OCTET() octet mask; wc_port.{h,c} widen the atomic init-state bitfield for 16-bit int; settings.h+sp_int.h allow SP math on a 16-bit-int CPU (WOLFSSL_SP_ALLOW_16BIT_CPU, 16-bit-char SP type detection); misc.c rotate bit-width via CHAR_BIT; coding.c base64 octet mask; sha256/sha512 octet-wise big-endian word I/O + CHAR_BIT*sizeof length carry; sha3.c octet-wise Keccak squeeze; random.c octet-portable Hash-DRBG length/counter serialization.
  2. ML-DSA - decode integer-promotion fixes (a byte/word16 field promotes to unsigned 16-bit int, so 2 - field was unsigned and a negative coefficient zero-extended into sword32; cast the field to sword32); encode octet masks. Adds WOLFSSL_MLDSA_VERIFY_SMALLEST_MEM, which streams the signature's z vector one polynomial at a time instead of pinning the whole l-vector - cutting the ML-DSA-87 verify key by ~6 KB (with WOLFSSL_MLDSA_ASSIGN_KEY, ~10.7 KB total verify RAM).
  3. test/bench/ci - brace-init SHA/SHAKE KAT vectors (a "\x.." string is sign-extended by a signed-16-bit-char compiler); WOLFSSL_NO_MALLOC benchmark buffers; and a hardware-free cl2000 compile-only CI guard (scripts/ti-c2000/ + .github/workflows/ti-c2000-compile.yml).
  4. ML-DSA Montgomery - compute the q^-1 step of mldsa_mont_red() through the 32x64->64 widening multiply (MLDSA_MUL_QINV_WIDE64, auto-enabled for WC_16BIT_CPU) instead of the 32x32->32 low multiply cl2000 miscompiles; correct on any conforming compiler and ~4% faster than the shift-based form on the C28x.
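The integer-promotion issue in item 2 cannot be reproduced directly on a host with 32-bit int, so the sketch below simulates it. It assumes the eta-2 decode has the shape `2 - field` as described above; the function names are hypothetical, and `uint16_t` stands in for the C28x's 16-bit unsigned int.

```c
#include <assert.h>
#include <stdint.h>

/* Simulation of the C28x behavior: the field promotes to a 16-bit
 * unsigned int, so '2 - field' wraps modulo 2^16 and is then
 * zero-extended when widened to a signed 32-bit coefficient. */
static int32_t decode_eta2_promoted_sim(uint8_t field)
{
    uint16_t diff;

    assert(field <= 4);
    diff = (uint16_t)(2u - field);   /* 16-bit unsigned subtraction */
    return (int32_t)diff;            /* zero-extension, sign is lost */
}

/* The fix described in the PR: widen the field to the signed 32-bit
 * type first, so the subtraction is done in signed arithmetic. */
static int32_t decode_eta2_fixed(uint8_t field)
{
    assert(field <= 4);
    return 2 - (int32_t)field;
}
```

For `field == 3` the simulated path yields 0x0000FFFF instead of -1, matching the failure mode the commit message describes; the cast-first version is correct for any int width.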

Algorithms validated on hardware (TI F28P55x @ 150 MHz)

SHA-256; SHA-384; SHA-512; SHA-512/224; SHA-512/256; SHA3-224/256/384/512; SHAKE128; SHAKE256; HMAC/Hash wrappers; SHA-256 Hash-DRBG; ML-DSA-87 verify, key generation and signing; ECDSA P-256 sign and verify; ECDH P-256 key agreement. (wolfcrypt_test MEMORY/mutex/full-ML-DSA report config-expected results on this bare-metal, verify-only, no-WOLFSSL_MEMORY build.)

Benchmarks (TI F28P55x @ 150 MHz, generic C)

| Algorithm | Throughput |
| --- | --- |
| SHA-256 | 277 KiB/s |
| SHA-384 / SHA-512 / SHA-512-224 / SHA-512-256 | ~176 KiB/s |
| SHA3-224 / SHA3-256 / SHA3-384 / SHA3-512 | 158 / 149 / 115 / 81 KiB/s |
| SHAKE128 / SHAKE256 | 182 / 149 KiB/s |
| RNG (SHA-256 Hash-DRBG) | 122 KiB/s (Init/Free ~97 ops/sec) |
| ML-DSA-87 verify | ~305 ms/op (3.28 ops/sec) |

SHAKE vs a reference C implementation (cycles for 1 KB): SHAKE128 ~824 k (ref 1,195,069); SHAKE256 ~1.01 M (ref 1,360,788) - roughly 26-31% fewer cycles. ML-DSA-87 verify RAM: ~10.7 KB total (struct ~8.7 KB + stack <2 KB, zero heap) with WOLFSSL_MLDSA_VERIFY_SMALLEST_MEM + WOLFSSL_MLDSA_ASSIGN_KEY, down from ~22 KB. The ~305 ms/op verify figure reflects two optimizations measured on hardware: the 64-bit-widened Montgomery q^-1 multiply above (this PR; 317 -> 305 ms/op) and the companion example running the Keccak permutation and the ML-DSA NTTs from RAM (example PR; 354 -> 317 ms/op).

Notes

Every change is behind WOLFSSL_WIDE_BYTE / WC_16BIT_CPU / WC_SHA3_BYTEWISE / WOLFSSL_SP_ALLOW_16BIT_CPU / WOLFSSL_MLDSA_*, or is an idempotent octet mask (WC_OCTET), so 8-bit-byte builds are functionally unchanged (CHAR_BIT == 8 makes the CHAR_BIT-based expressions byte-for-byte identical to the originals). The bare-metal board example (BSP, linker, KATs, harness) is not in this PR; it lives in the companion PR wolfSSL/wolfssl-examples#576, under embedded/ti-c2000-f28p55x/. There is no public C28x instruction-set simulator, so the CI is compile-only; on-target KATs run on a hardware-in-the-loop runner.
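As a rough illustration of why the CHAR_BIT-based expressions are no-ops on 8-bit-byte targets, consider a rotate whose width is derived from `CHAR_BIT * sizeof(x)`. The name `rotl32_portable` is hypothetical and this is not the code in misc.c; it only shows the idea.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Rotate left with the bit width derived from CHAR_BIT.
 * CHAR_BIT == 8,  sizeof(uint32_t) == 4  ->  width 32 (same as 8 * sizeof)
 * CHAR_BIT == 16, sizeof(uint32_t) == 2  ->  width 32 (8 * sizeof gives 16) */
static uint32_t rotl32_portable(uint32_t x, unsigned n)
{
    const unsigned width = (unsigned)(CHAR_BIT * sizeof(x));

    assert(n > 0 && n < width);      /* avoid an undefined full-width shift */
    return (uint32_t)((x << n) | (x >> (width - n)));
}
```

On an 8-bit-byte compiler `CHAR_BIT * sizeof(x)` and `8 * sizeof(x)` are the same constant, so the generated code should not change; only wide-byte targets see a different (corrected) width.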

Test

  • x86-64: ./configure --enable-all and --enable-dilithium --enable-experimental; wolfcrypt_test (incl. ECC, ML-DSA) passes.
  • TI C28x: run make in the wolfssl-examples embedded/ti-c2000-f28p55x directory (default verify+test, SIGN=1, ECC=1); all KATs + round-trips pass on the F28P55x.
  • Compile guard: CGT_ROOT=... scripts/ti-c2000/compile.sh.

@dgarske dgarske self-assigned this Jun 18, 2026
Copilot AI review requested due to automatic review settings June 18, 2026 00:15

Copilot AI left a comment

Contributor

Pull request overview

This PR adds and CI-guards a bare-metal wolfCrypt port for TI C2000 C28x targets where CHAR_BIT == 16, introducing gated fixes so hashing, DRBG, ML-DSA verify, and SP-math ECC work correctly when a C “byte” is wider than 8 bits.

Changes:

  • Introduces WOLFSSL_NO_OCTET_BYTE detection and uses octet-wise load/store paths to avoid invalid byte/word aliasing on CHAR_BIT != 8 targets (SHA-256/512 family, SHA-3/SHAKE, Base64 CT decode, DRBG helpers, rotate helpers).
  • Adds “smallest memory” ML-DSA verify mode that streams z per polynomial to reduce pinned RAM in wc_MlDsaKey.
  • Adds TI C2000 compile-only guard scripts plus a GitHub Actions workflow that downloads the TI CGT and compiles a scoped subset.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| wolfssl/wolfcrypt/wc_port.h | Makes atomic arg type selection robust for 16-bit int by also checking UINT_MAX. |
| wolfssl/wolfcrypt/wc_mldsa.h | Adds WOLFSSL_MLDSA_VERIFY_SMALLEST_MEM struct layout variant for reduced verify RAM. |
| wolfssl/wolfcrypt/types.h | Adds WOLFSSL_NO_OCTET_BYTE auto-detection; adjusts WC_16BIT_CPU 64-bit availability behavior. |
| wolfssl/wolfcrypt/sp_int.h | Adds support for unsigned char being 16-bit (no native 8-bit type). |
| wolfssl/wolfcrypt/settings.h | Requires explicit opt-in for SP math on 16-bit-int CPUs via WOLFSSL_SP_ALLOW_16BIT_CPU. |
| wolfssl/wolfcrypt/dilithium.h | Adds smallest-mem verify gating and defaults slow Montgomery reduction macros on WC_16BIT_CPU. |
| wolfcrypt/test/test.c | Switches large-digest constants from C strings to byte[] to avoid CHAR_BIT!=8 pitfalls. |
| wolfcrypt/src/wc_port.c | Fixes init-state static assert to use CHAR_BIT instead of hardcoded 8. |
| wolfcrypt/src/wc_mldsa.c | Adds octet-masking for packed bytes and fixes integer-promotion/sign issues on 16-bit int; adds streaming z verify path. |
| wolfcrypt/src/sha512.c | Adds octet-wise word load/store and corrects length carry/length placement for CHAR_BIT!=8. |
| wolfcrypt/src/sha3.c | Forces bytewise Keccak absorb/squeeze for WOLFSSL_NO_OCTET_BYTE and adds squeeze helper. |
| wolfcrypt/src/sha256.c | Adds octet-wise word load/store and corrects length carry/length placement for CHAR_BIT!=8. |
| wolfcrypt/src/random.c | Fixes DRBG serialization/addition helpers for non-8-bit “byte” targets. |
| wolfcrypt/src/misc.c | Fixes rotate helpers to use CHAR_BIT-based bit width when needed. |
| wolfcrypt/src/coding.c | Ensures Base64 CT decode returns 0xFF for invalid chars even when byte is wider than 8 bits. |
| wolfcrypt/benchmark/benchmark.c | Adds static buffers for WOLFSSL_NO_MALLOC benchmarking and adjusts frees/allocations accordingly. |
| scripts/ti-c2000/user_settings.h | Adds minimal CI-only config for cl2000 compile-guard. |
| scripts/ti-c2000/compile.sh | Adds compile-only script to build a scoped source set with TI cl2000. |
| .github/workflows/ti-c2000-compile.yml | Adds CI workflow to download/cache TI CGT and run the compile-only guard. |

Comment thread wolfssl/wolfcrypt/types.h Outdated
Comment thread wolfcrypt/benchmark/benchmark.c
dgarske added 4 commits June 18, 2026 15:26
… hashes, DRBG

Enables wolfCrypt on toolchains where a C byte/char is wider than 8 bits (e.g.
TI C2000 C28x, CHAR_BIT == 16), all gated on WOLFSSL_WIDE_BYTE and a no-op on
8-bit-byte targets (the default fast paths are left exactly as-is):
 - types.h: auto-set WOLFSSL_WIDE_BYTE for CHAR_BIT != 8 / known TI C2000
   toolchains (and define CHAR_BIT = 16 when <limits.h> is absent); wc_port.h/.c
   widen the atomic init-state bitfield + CHAR_BIT static assert for 16-bit int.
 - settings.h + sp_int.h: allow SP math on a 16-bit-int CPU via
   WOLFSSL_SP_ALLOW_16BIT_CPU, and detect a 16-bit char in the SP smallest-type
   selection.
 - misc.c/misc.h: shared big-endian octet<->word helpers
   (WordsFromBytesBE32/64, BytesFromWordsBE32/64) for WOLFSSL_WIDE_BYTE, where a
   word cannot be aliased as an octet stream.  They are CHAR_BIT-generic,
   cl2000-safe (loads accumulate with <<= 8, since (word)octet << 24 is
   miscompiled as a 16-bit shift), in-place safe for the SHA schedule, and store
   by octet count for partial digests.  misc.c rotate width uses CHAR_BIT.
 - coding.c: mask the constant-time base64 result to an octet.
 - sha256.c/sha512.c: use the shared helpers for the schedule load and digest
   store, plus a CHAR_BIT*sizeof length carry; sha3.c: octet-wise Keccak squeeze.
 - random.c: Hash-DRBG length + reseed-counter serialization via the shared
   helpers (and an octet-masked carry) under WOLFSSL_WIDE_BYTE; default builds
   keep the word-aliasing path unchanged.

WOLFSSL_WIDE_BYTE replaces the earlier WOLFSSL_NO_OCTET_BYTE working name.
…EST_MEM

ML-DSA-87 keygen/sign/verify on a 16-bit byte/int CPU (TI C28x), gated and a
no-op on normal targets:
 - Encode/decode integer-promotion fixes: a byte/word16 field promotes to
   *unsigned* int where int is 16-bit, so '2 - field' was unsigned and a
   negative coefficient zero-extended into sword32 (e.g. -1 -> 0x0000FFFF);
   cast the unpacked field to sword32 (eta-2/eta-4/t0 decode).  Bit-packers
   relied on (byte) truncating to 8 bits; mask with MLDSA_OCT() and cast the
   <<MLDSA_D shift to sword32 (eta-2/t0/t1/gamma1 encode).
 - dilithium.h: shift-based Montgomery reduction on WC_16BIT_CPU (cl2000
   miscompiles the multiply form).
 - New WOLFSSL_MLDSA_VERIFY_SMALLEST_MEM: stream the signature z vector one
   polynomial at a time instead of pinning the whole l-vector, cutting the
   ML-DSA-87 verify key by ~6 KB (with WOLFSSL_MLDSA_ASSIGN_KEY, ~10.7 KB total
   verify RAM on the C28x).
…mpile CI

 - test.c: store the SHA/SHAKE large_digest KAT vectors as brace-init byte
   arrays (clean octets) instead of "\x.." string literals, which a
   signed-16-bit-char toolchain (cl2000) would sign-extend.
 - benchmark.c: WOLFSSL_NO_MALLOC mode uses static plain/cipher buffers and
   skips the key/iv XMALLOC/XFREE (gated; default build unchanged).
 - scripts/ti-c2000/ + .github/workflows/ti-c2000-compile.yml: a hardware-free
   cl2000 compile-only CI guard for the CHAR_BIT!=8 wolfCrypt subset.
…it CPUs

The TI cl2000 (C2000 C28x) compiler miscompiles the 32x32->32 low multiply
used for the q^-1 step of mldsa_mont_red() - verified on a TMS320F28P550SJ,
the ML-DSA-87 verify KAT fails (res=0) - but compiles the 32x64->64 widening
multiply correctly. Compute the q^-1 product through the 64-bit path
(MLDSA_MUL_QINV_WIDE64): correct on any conforming compiler and, on the C28x,
~4% faster than the shift-based reduction (305 vs 317 ms/op for ML-DSA-87
verify). dilithium.h auto-selects it for WC_16BIT_CPU and leaves the q
multiply enabled (it compiles correctly); a user can still force the shift
form with MLDSA_MUL_QINV_SLOW / MLDSA_MUL_Q_SLOW. Validated on hardware for
keygen+sign+verify (round-trip res=1). No effect on 8-bit/>=32-bit-int builds.
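
A minimal sketch of the idea in this commit, under stated assumptions: the constants are the standard ML-DSA modulus q = 8380417 and its inverse modulo 2^32, but the function names are hypothetical, and the exact operand widths and expression shape used by wolfSSL under MLDSA_MUL_QINV_WIDE64 are not shown in this PR text. The sketch forms the q^-1 product in 64-bit arithmetic and keeps its low 32 bits, next to a plain 32-bit low multiply for comparison.

```c
#include <assert.h>
#include <stdint.h>

#define MLDSA_Q     8380417        /* ML-DSA modulus q = 2^23 - 2^13 + 1 */
#define MLDSA_QINV  58728449u      /* q^-1 mod 2^32 */

/* Interpret the low 32 bits of v as a signed value without relying on
 * implementation-defined narrowing conversions. */
static int32_t low32_signed(uint64_t v)
{
    uint32_t lo = (uint32_t)v;

    if (lo >= 0x80000000u)
        return (int32_t)((int64_t)lo - ((int64_t)1 << 32));
    return (int32_t)lo;
}

/* Montgomery reduction with the q^-1 step done as a 64-bit product. */
static int32_t mont_red_wide64(int64_t a)
{
    int32_t t;

    assert(a > -((int64_t)MLDSA_Q << 31) && a < ((int64_t)MLDSA_Q << 31));
    t = low32_signed((uint64_t)(uint32_t)a * (uint64_t)MLDSA_QINV);
    /* a - t*q is an exact multiple of 2^32, so the division is exact. */
    return (int32_t)((a - (int64_t)t * MLDSA_Q) / ((int64_t)1 << 32));
}

/* Same reduction with the q^-1 step as a 32x32->32 low multiply. */
static int32_t mont_red_low32(int64_t a)
{
    int32_t t = low32_signed((uint32_t)((uint32_t)a * (uint32_t)MLDSA_QINV));

    return (int32_t)((a - (int64_t)t * MLDSA_Q) / ((int64_t)1 << 32));
}
```

On a conforming compiler both functions return the same value, since only the low 32 bits of the q^-1 product are used; the wide form simply avoids the specific multiply that cl2000 is reported to miscompile.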