Further optimize `intermediates_to_table_indices`

`intermediates_to_table_indices` works as follows:
* It calls `bits_to_table_indices`, which takes three `u128`s, each holding the value of one of the three intermediates for 128 multiplications, and returns four `u128`s with one table index per nibble.
* It then reorders those nibbles into bytes as its output. (Originally, the table lookup was done here, but additional optimization moved the table lookup elsewhere.)

It appears that `bits_to_table_indices` compiles to <200 instructions (fully unrolled with no loops or branches), while the rearranging of nibbles compiles to >1000 instructions (again, fully unrolled with no loops or branches). Implementing a single transpose-like operation covering both steps would probably be more efficient.
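
To make the target of a fused implementation concrete, here is a naive scalar sketch of the two steps and of a single pass that produces the same bytes. The bit layout is an assumption, not taken from the actual code: bit `i` of each input is taken to belong to multiplication `i`, the three bits are packed as `a | b << 1 | c << 2`, and nibble `i` maps to output byte `i`. All function names ending in `_sketch` are hypothetical. This is not the optimized transpose itself, only a reference that an optimized version could be checked against.

```rust
/// Step 1 (assumed layout): pack one 3-bit table index per nibble,
/// 32 nibbles per `u128`, 128 nibbles in total.
fn bits_to_table_indices_sketch(a: u128, b: u128, c: u128) -> [u128; 4] {
    let mut out = [0u128; 4];
    for i in 0..128 {
        let idx = ((a >> i) & 1) | (((b >> i) & 1) << 1) | (((c >> i) & 1) << 2);
        out[i / 32] |= idx << (4 * (i % 32));
    }
    out
}

/// Step 2 (assumed ordering): widen each nibble into its own byte.
fn nibbles_to_bytes_sketch(nibbles: [u128; 4]) -> [u8; 128] {
    let mut out = [0u8; 128];
    for i in 0..128 {
        out[i] = ((nibbles[i / 32] >> (4 * (i % 32))) & 0xF) as u8;
    }
    out
}

/// Both steps in one pass: go straight from the three bit-planes to
/// one index byte per multiplication, skipping the nibble form.
fn fused_reference_sketch(a: u128, b: u128, c: u128) -> [u8; 128] {
    let mut out = [0u8; 128];
    for i in 0..128 {
        out[i] = (((a >> i) & 1) | (((b >> i) & 1) << 1) | (((c >> i) & 1) << 2)) as u8;
    }
    out
}
```

Under these assumptions, `fused_reference_sketch(a, b, c)` equals `nibbles_to_bytes_sketch(bits_to_table_indices_sketch(a, b, c))` for any inputs, so the intermediate nibble representation is not needed by the final output. A transpose-style bit-twiddling or SIMD version could use the fused function as its test oracle.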

Further optimize `intermediates_to_table_indices` #1457

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Further optimize intermediates_to_table_indices #1457

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Further optimize `intermediates_to_table_indices` #1457