intermediates_to_table_indices works as follows:
- It calls bits_to_table_indices, which takes three u128s, each containing the value of one of the three intermediates for 128 multiplications, and returns four u128s containing a table index in each nibble.
- It then reorders those nibbles into bytes as its output. (Originally, the table lookup was done here, but additional optimization moved the table lookup elsewhere.)
It appears that bits_to_table_indices compiles to <200 instructions (fully unrolled with no loops or branches), while the rearranging of nibbles compiles to >1000 instructions (again, fully unrolled with no loops or branches). Implementing a single transpose-like operation covering both steps would probably be more efficient.
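To make the data movement concrete, here is a minimal loop-based sketch in Rust of what the two steps compute, plus a fused version. It is a reference model only, not the real implementation (which is fully unrolled). The following are all assumptions made for illustration: the `_ref` function names, bit i of each input belonging to multiplication i, the three bits being combined as a | b << 1 | c << 2, nibble i living in word i / 32, and the output being 128 bytes with one index per byte.

```rust
/// Hypothetical reference for step 1: pack one 3-bit table index per nibble.
/// Assumed layout: bit i of a, b, c belongs to multiplication i, and
/// nibble i is stored in word i / 32 at nibble position i % 32.
fn bits_to_table_indices_ref(a: u128, b: u128, c: u128) -> [u128; 4] {
    let mut out = [0u128; 4];
    for i in 0..128 {
        // Assumed bit weights: a is bit 0, b is bit 1, c is bit 2 of the index.
        let idx = ((a >> i) & 1) | (((b >> i) & 1) << 1) | (((c >> i) & 1) << 2);
        out[i / 32] |= idx << (4 * (i % 32));
    }
    out
}

/// Hypothetical reference for step 2: widen each nibble to its own byte.
fn nibbles_to_bytes_ref(words: [u128; 4]) -> [u8; 128] {
    let mut out = [0u8; 128];
    for i in 0..128 {
        out[i] = ((words[i / 32] >> (4 * (i % 32))) & 0xF) as u8;
    }
    out
}

/// The two steps chained, mirroring the structure described above.
fn intermediates_to_table_indices_ref(a: u128, b: u128, c: u128) -> [u8; 128] {
    nibbles_to_bytes_ref(bits_to_table_indices_ref(a, b, c))
}

/// Fused variant: goes straight from the three bit-sliced inputs to bytes,
/// skipping the intermediate nibble-packed words.
fn fused_table_indices_ref(a: u128, b: u128, c: u128) -> [u8; 128] {
    let mut out = [0u8; 128];
    for i in 0..128 {
        out[i] = (((a >> i) & 1) | (((b >> i) & 1) << 1) | (((c >> i) & 1) << 2)) as u8;
    }
    out
}
```

Under these assumptions the fused version produces the same bytes as the two-step version, so it could serve as a test oracle for an optimized transpose-like implementation. It says nothing about how many instructions such an implementation would actually compile to.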