Hi, the unified memory of Apple silicon devices is compelling for AI training, and often gives them significantly more memory for parameters and gradients than the best consumer or even data-center-grade discrete GPUs.
However, when I inspected the data types list, the smallest float dtype in mlx today was 16 bits (f16 or bf16).
Adding 8-bit floats to mlx would effectively double the maximum possible model size.
Would this be possible in software, or does it need a hardware update?
If it is possible with software, how could we make it happen?
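For what it's worth, a pure software path seems feasible even without native fp8 hardware: parameters could be stored as one byte each and widened to f16 only when a kernel needs them. Below is a minimal NumPy sketch (not MLX code; the function names are just mine for illustration) that emulates e5m2 storage by truncating the low byte of the f16 bit pattern, which works because e5m2 shares f16's sign/exponent layout:

```python
import numpy as np

# Emulated e5m2 storage: e5m2 has the same sign/exponent layout as float16
# (1 sign bit, 5 exponent bits), just 2 instead of 10 mantissa bits, so it
# can be obtained by dropping the low byte of the float16 bit pattern.
def f16_to_e5m2_byte(x):
    bits = np.asarray(x, dtype=np.float16).view(np.uint16)
    return (bits >> 8).astype(np.uint8)   # truncate the low 8 mantissa bits

def e5m2_byte_to_f16(b):
    return (b.astype(np.uint16) << 8).astype(np.uint16).view(np.float16)

w = np.random.randn(4).astype(np.float16)
stored = e5m2_byte = f16_to_e5m2_byte(w)   # 1 byte per parameter in memory
restored = e5m2_byte_to_f16(stored)        # widen back to f16 for compute
print(w)
print(restored)
```

A real implementation would presumably live in the Metal kernels and use round-to-nearest-even instead of plain truncation, but it suggests no hardware change is strictly required for the memory-saving part.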
Some options for a sensible default e/m split for an 8-bit float dtype could be:
e5m2
e4m3
e3m4
Then the question becomes: how do we rank these and decide which one is best for the most likely use cases?
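One way to start ranking them is to compare the dynamic range versus precision each split buys. Here is a rough back-of-envelope script, assuming a plain IEEE-754-style layout (bias = 2^(e-1) - 1, all-ones exponent reserved for inf/NaN); note that the widely used OCP/NVIDIA E4M3 variant drops infinities to extend its max to 448, so treat the numbers as approximate:

```python
# Back-of-envelope comparison of the candidate splits, assuming an
# IEEE-754-style layout: bias = 2**(e-1) - 1 and the all-ones exponent
# reserved for inf/NaN. (OCP E4M3 bends these rules, so numbers are rough.)
def fp8_stats(e, m):
    bias = 2 ** (e - 1) - 1
    max_exp = (2 ** e - 2) - bias          # largest usable unbiased exponent
    max_normal = (2 - 2.0 ** -m) * 2.0 ** max_exp
    min_subnormal = 2.0 ** (1 - bias - m)  # smallest positive value
    epsilon = 2.0 ** -m                    # relative step size near 1.0
    return max_normal, min_subnormal, epsilon

for name, e, m in [("e5m2", 5, 2), ("e4m3", 4, 3), ("e3m4", 3, 4)]:
    hi, lo, eps = fp8_stats(e, m)
    print(f"{name}: max ~{hi:g}, smallest positive ~{lo:g}, epsilon {eps:g}")
```

The usual argument is that gradients need the extra dynamic range of e5m2, while weights and activations benefit more from the extra mantissa bits of e4m3, which is roughly how existing FP8 training recipes (e.g. NVIDIA's Transformer Engine) split them.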