One of the biggest advances in neural networks over the past couple of years has been decreasing the precision of the parameters, gradients, and activations. We used to store everything in 32-bit floating point (FP32), but as the demand for bigger matrices and faster multiplies grew, precision was soon halved to FP16.
FP16 satisfied us for a while. With gradient scaling and GPU SMs designed specifically for low-precision computation, the progress was impressive. Further advances led to BF16, which trades mantissa bits for a higher dynamic range, useful for preserving information in gradient updates.
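The gradient-scaling trick mentioned above can be sketched in a few lines. This is a toy numpy illustration with made-up numbers, not the real mixed-precision machinery in any particular framework:

```python
import numpy as np

# Toy sketch of loss/gradient scaling for FP16 training (illustrative values).
# Tiny gradients underflow to zero in FP16, so we scale the loss up before the
# backward pass and unscale the gradients in FP32 before the optimizer step.

tiny_grad = np.float32(1e-8)
print(np.float16(tiny_grad))            # → 0.0: underflows, gradient is lost

scale = np.float32(65536.0)
scaled = np.float16(tiny_grad * scale)  # now well inside FP16's range
recovered = np.float32(scaled) / scale  # unscale in FP32 for the update
print(recovered)                        # ≈ 1e-8 again
```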
But this wasn't far enough: we needed fewer bits.
So FP8 and BF8 were the obvious next choices. With only a byte per number, computations were speedy, and suddenly huge models fit into RAM much more easily. Models could be trained even faster, and models trained at higher precision could be quantized afterward without hurting performance too much. General advice still held that you should keep your optimizer states in FP32, but who needs accurate second moments anyway?
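The quantize-after-training idea is simple enough to sketch. Since numpy has no 8-bit float dtype, this simulates the round-trip with 8-bit integers and a per-tensor scale; the function names are mine, and the scheme is the generic symmetric one, not any specific FP8 format:

```python
import numpy as np

# Sketch of post-training quantization to 8 bits, simulated with int8 plus a
# per-tensor scale (numpy has no FP8 dtype). Illustrative, not a real recipe.

def quantize(w, bits=8):
    qmax = 2 ** (bits - 1) - 1            # 127 for 8 bits
    scale = np.abs(w).max() / qmax        # per-tensor scale factor
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize(w)
err = np.abs(dequantize(q, s) - w).max()
print(f"max round-trip error: {err:.5f}")  # bounded by half a quantization step
```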
But why stop here?
Next, FP4 models started coming out. It's hard to even call this a floating point number anymore; there's not much floating the point can do. We only have 16 options now for each number, but that's all we need: many hands make light work. FP4 is EVEN faster and EVEN smaller. Your 1 trillion parameter model suddenly fits in my toaster's RAM, and the computations are so easy you could build the kernels out of Redstone if you needed to. Quantization this heavy is especially great for vector search over millions to billions of vectors with product quantization (PQ) and similar algorithms.
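The "16 options per number" claim can be made concrete: quantize to 16 levels and pack two 4-bit codes into each byte. A hypothetical sketch with uniform levels, not the actual FP4 value grid, and it assumes an even number of weights:

```python
import numpy as np

# 4 bits = 16 representable values per number. This sketch snaps each weight
# to the nearest of 16 uniform levels and packs two 4-bit codes per byte.
# (Illustrative uniform grid, not a real FP4 format; assumes even length.)

def quantize4(w):
    levels = np.linspace(w.min(), w.max(), 16).astype(np.float32)  # the 16 options
    codes = np.abs(w[:, None] - levels[None, :]).argmin(axis=1).astype(np.uint8)
    packed = (codes[0::2] << 4) | codes[1::2]                      # two codes per byte
    return packed, levels

def dequantize4(packed, levels):
    hi, lo = packed >> 4, packed & 0x0F
    return levels[np.stack([hi, lo], axis=1).reshape(-1)]

w = np.random.randn(8).astype(np.float32)
packed, levels = quantize4(w)
print(packed.nbytes, "bytes for", w.size, "weights")  # → 4 bytes for 8 weights
```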
But this is still way too many bits.
1.58 Bit models were next. Now each number can only be a 1, 0, or -1 (three states, so log₂ 3 ≈ 1.58 bits of information each). Why would you ever need anything else? Why have a thousand floating point numbers when you can have twenty times as many 1s, -1s, and 0s? This feels a bit like approximating pi by randomly throwing darts at a circle inscribed in a square, but it works, I guess.
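Snapping weights to {-1, 0, 1} fits in a few lines. A simplified sketch in the style of absmean ternary quantization, with one per-tensor scale; the function name is mine:

```python
import numpy as np

# Ternary ("1.58-bit") quantization sketch: each weight becomes -1, 0, or +1,
# with a single per-tensor scale (simplified absmean scheme, illustrative only).

def ternarize(w):
    scale = np.abs(w).mean() + 1e-8        # per-tensor scale (mean absolute value)
    q = np.clip(np.round(w / scale), -1, 1)
    return q.astype(np.int8), np.float32(scale)

w = np.random.randn(6).astype(np.float32)
q, s = ternarize(w)
print(q)  # only -1s, 0s, and 1s
```

A nice side effect: matmuls against ternary weights need no multiplies at all, just adds, subtracts, and skips.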
But why do you even need -1?
1 Bit models were next. We've come full circle. It's like asking why you need multicellular organisms when you can just have a bunch of eukaryotic cells working together, or why you'd wear a knitted sweater when you can wear string tied in a bunch of knots around you. But this shit is fast AF on CPUs, since that's exactly what they're designed for: lots and lots of atomic binary operations.
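Here's why the binary case maps so well onto those atomic operations: with values constrained to {-1, +1} and encoded as bits, a dot product collapses to dot(a, b) = n − 2 · popcount(a XOR b). A pure-Python sketch of the trick (helper names are mine):

```python
# 1-bit dot product via XOR + popcount: encode ±1 vectors as bit patterns,
# then dot(a, b) = n - 2 * (number of positions where the signs disagree).
# Pure-Python sketch of the trick that makes 1-bit inference fly on CPUs.

def encode(v):
    """Pack a ±1 vector into an int: bit i is set iff v[i] == +1."""
    bits = 0
    for i, x in enumerate(v):
        if x == 1:
            bits |= 1 << i
    return bits

def binary_dot(a_bits, b_bits, n):
    disagreements = bin(a_bits ^ b_bits).count("1")  # popcount of the XOR
    return n - 2 * disagreements

a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
print(binary_dot(encode(a), encode(b), len(a)))  # → 0, same as sum(x*y)
```

Real implementations do this 64 lanes at a time with machine-word XORs and hardware popcount instructions, which is where the CPU speed comes from.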
But I propose we don't stop here.
In this essay I introduce 0 Bit Models. They will be EVEN faster than 1 bit models, can run on any device in infinitesimal time, and take up zero RAM. Furthermore, matrix multiplies can be computed in constant time, and their results can easily be cached using zero space. Imagine for a moment a world where every query you could possibly dream of, every thought you could ever muster, every want and desire could be answered instantly. This is the vision I have for a future with zero bit models: a future where AI reaches its final form as a deity. Perhaps true wisdom is knowing when to shut up.