Hi, the unified memory of Apple silicon devices is compelling for AI training, and often gives them significantly more memory for parameters and gradients than the best consumer or even data-center-grade discrete GPUs.
However, when I inspected the data types list, the smallest float dtype in mlx today was 16 bits (f16 or bf16).
Adding 8-bit floats to mlx would effectively double the maximum possible model size.
Would this be possible in software, or does it need a hardware update?
If it is possible with software, how could we make it happen?
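For what it's worth, a pure software path seems feasible even without native fp8 hardware: parameters could be stored as one byte each and widened to f16 only when a kernel needs them. Below is a minimal NumPy sketch (not MLX code; the function names are just mine for illustration) that emulates e5m2 storage by truncating the low byte of the f16 bit pattern, which works because e5m2 shares f16's sign/exponent layout:

```python
import numpy as np

# Emulated e5m2 storage: e5m2 has the same sign/exponent layout as float16
# (1 sign bit, 5 exponent bits), just 2 instead of 10 mantissa bits, so it
# can be obtained by dropping the low byte of the float16 bit pattern.
def f16_to_e5m2_byte(x):
    bits = np.asarray(x, dtype=np.float16).view(np.uint16)
    return (bits >> 8).astype(np.uint8)   # truncate the low 8 mantissa bits

def e5m2_byte_to_f16(b):
    return (b.astype(np.uint16) << 8).astype(np.uint16).view(np.float16)

w = np.random.randn(4).astype(np.float16)
stored = e5m2_byte = f16_to_e5m2_byte(w)   # 1 byte per parameter in memory
restored = e5m2_byte_to_f16(stored)        # widen back to f16 for compute
print(w)
print(restored)
```

A real implementation would presumably live in the Metal kernels and use round-to-nearest-even instead of plain truncation, but it suggests no hardware change is strictly required for the memory-saving part.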
Some options for a sensible default e/m split for an 8-bit float dtype could be:
e5m2
e4m3
e3m4
Then the question becomes: how do we rank these and decide which one is best for the most likely use cases?
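One way to start ranking them is to compare the dynamic range versus precision each split buys. Here is a rough back-of-envelope script, assuming a plain IEEE-754-style layout (bias = 2^(e-1) - 1, all-ones exponent reserved for inf/NaN); note that the widely used OCP/NVIDIA E4M3 variant drops infinities to extend its max to 448, so treat the numbers as approximate:

```python
# Back-of-envelope comparison of the candidate splits, assuming an
# IEEE-754-style layout: bias = 2**(e-1) - 1 and the all-ones exponent
# reserved for inf/NaN. (OCP E4M3 bends these rules, so numbers are rough.)
def fp8_stats(e, m):
    bias = 2 ** (e - 1) - 1
    max_exp = (2 ** e - 2) - bias          # largest usable unbiased exponent
    max_normal = (2 - 2.0 ** -m) * 2.0 ** max_exp
    min_subnormal = 2.0 ** (1 - bias - m)  # smallest positive value
    epsilon = 2.0 ** -m                    # relative step size near 1.0
    return max_normal, min_subnormal, epsilon

for name, e, m in [("e5m2", 5, 2), ("e4m3", 4, 3), ("e3m4", 3, 4)]:
    hi, lo, eps = fp8_stats(e, m)
    print(f"{name}: max ~{hi:g}, smallest positive ~{lo:g}, epsilon {eps:g}")
```

The usual argument is that gradients need the extra dynamic range of e5m2, while weights and activations benefit more from the extra mantissa bits of e4m3, which is roughly how existing FP8 training recipes (e.g. NVIDIA's Transformer Engine) split them.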