Training LLMs Faster with 4 Bits
Training massive language models is an incredibly intensive process, demanding huge amounts of computational power and memory. A new numerical format called MXFP4, a 4-bit floating-point representation, is making this process much more efficient. It directly tackles the hardware bottlenecks that slow down model development.
What is MXFP4?
Computers store numbers in specific formats. For AI, a common format is the 32-bit floating-point number (FP32), which offers a solid balance of range and precision. A floating-point number is basically a computer’s version of scientific notation. It consists of three parts: a sign bit, an exponent, and a mantissa (also called the significand).
The general formula looks like this:
$$ Value = (-1)^{sign} \times 2^{exponent} \times (1.mantissa) $$
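To make the split concrete, here is a small Python sketch (not tied to any particular library) that unpacks the three fields of a standard FP32 value. It assumes a normal, non-zero input and ignores special cases such as zero, subnormals, infinities, and NaN.

```python
# Unpack the raw bits of a 32-bit float: 1 sign bit, 8 exponent bits (bias 127), 23 mantissa bits.
import struct

def fp32_fields(x: float):
    bits = struct.unpack(">I", struct.pack(">f", x))[0]   # raw 32-bit pattern
    sign     = bits >> 31
    exponent = (bits >> 23) & 0xFF        # stored with a bias of 127
    mantissa = bits & 0x7FFFFF            # 23 fraction bits
    # Rebuild the value from the formula above (valid for normal, non-zero inputs only).
    value = (-1) ** sign * 2.0 ** (exponent - 127) * (1 + mantissa / 2**23)
    return sign, exponent - 127, mantissa, value

print(fp32_fields(-1.5))   # (1, 0, 4194304, -1.5): sign = 1, unbiased exponent = 0, fraction = 0.5
```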
To see why 4-bit formats are tricky, let’s sketch a toy example of a standard 4-bit float (FP4) with the following layout:
- 1 bit for the sign
- 2 bits for the exponent (with bias)
- 1 bit for the mantissa
Suppose we want to represent –1.5:
- Sign: Negative, so the sign bit is 1.
- Exponent: Convert 1.5 to binary → $1.1_2$, so the exponent is 0. With a bias of 1, the stored exponent field is 01.
- Mantissa: The fraction part ($.1_2$) gives a mantissa bit of 1.

So the 4-bit pattern is 1 01 1.
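As a quick check, here is a minimal Python decoder for this toy layout. It is a sketch under one common convention: an all-zero exponent field is treated as a subnormal (no implicit leading 1), and special values such as infinities and NaN are not modeled.

```python
# Decode the toy 4-bit float: 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit.
def decode_toy_fp4(bits: int) -> float:
    sign     = (bits >> 3) & 0b1
    exponent = (bits >> 1) & 0b11
    mantissa =  bits       & 0b1
    if exponent == 0:                      # subnormal: no implicit leading 1
        value = (mantissa / 2) * 2.0 ** (1 - 1)
    else:                                  # normal: implicit leading 1, unbias the exponent
        value = (1 + mantissa / 2) * 2.0 ** (exponent - 1)
    return -value if sign else value

print(decode_toy_fp4(0b1011))                          # -1.5, matching the pattern 1 01 1
print(sorted({decode_toy_fp4(b) for b in range(16)}))  # the entire range: 0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6
```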
This toy FP4 is extremely limited. The range is tiny, and precision drops quickly. That’s where MXFP4 comes in.
The Microscaling Trick
The “MX” in MXFP4 stands for microscaling. Instead of every number carrying its own wide exponent, a block of numbers (commonly 32 values) shares a single 8-bit scaling factor.
Inside each block, every value is stored in just 4 bits, using the same E2M1 layout as the toy example above:
- 1 sign bit
- 2 exponent bits
- 1 mantissa bit
The shared 8-bit scale rescales the entire block so that all of its values fall within the narrow range this 4-bit format can represent.
For example, consider the weights [0.5, -0.2, 0.8, 0.35]. If the shared scale is chosen as $2^{-1}$:
- 0.5 becomes $1.0 \times 2^{-1}$ → 1.0 is exactly representable in 4 bits
- -0.2 becomes $-0.4 \times 2^{-1}$ → -0.4 is rounded to the nearest representable 4-bit value, -0.5
- and so on, with each value rounded to the 4-bit grid as needed
This approach gives enough resolution as long as the values in a block have similar magnitudes, which is usually true of the weights within a single neural network layer.
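The sketch below puts the whole microscaling round trip together in Python. The scale-selection rule (pick a power of two so the block's largest magnitude lands on the 4-bit grid) is just one reasonable heuristic, and the function names are made up for illustration; real MXFP4 kernels store the scale as an 8-bit exponent and pack two 4-bit elements per byte.

```python
import math

# The finite grid of magnitudes a 4-bit E2M1 element can hold (see the toy format above).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize one block (e.g. 32 weights) to a shared power-of-two scale plus 4-bit elements."""
    # Pick a power-of-two scale so the largest magnitude in the block fits on the grid.
    amax = max(abs(x) for x in block) or 1.0
    scale = 2.0 ** math.ceil(math.log2(amax / max(E2M1_GRID)))
    # Round each rescaled value to the nearest representable E2M1 magnitude, keeping the sign.
    quantized = []
    for x in block:
        magnitude = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        quantized.append(math.copysign(magnitude, x))
    return scale, quantized

def dequantize_block(scale, quantized):
    return [scale * q for q in quantized]

scale, q = quantize_block([0.5, -0.2, 0.8, 0.35])
print(scale, q)                    # 0.25 [2.0, -1.0, 3.0, 1.5]
print(dequantize_block(scale, q))  # [0.5, -0.25, 0.75, 0.375] -- close to the originals
```

Note how one shared scale lets four rather different weights share the same coarse 4-bit grid while the reconstructed values stay within a few hundredths of the originals.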
Why 4 Bits Are Powerful
The shift to 4-bit storage brings three big benefits:
1. Memory Efficiency
A model with 200 billion parameters stored in FP16 (16 bits) needs about 400 GB just for weights:
$$ 200 \times 10^9 \times 16 \text{ bits} = 3.2 \times 10^{12} \text{ bits} = 400 \text{ GB} $$
With MXFP4:
$$ 200 \times 10^9 \times 4 \text{ bits} = 8 \times 10^{11} \text{ bits} = 100 \text{ GB} $$
That’s a 75% reduction. It means models that once needed hundreds of GPUs can now fit on far fewer, lowering both cost and barrier to entry.
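If you want to reproduce the arithmetic, a throwaway helper (the function name is purely illustrative) does it in a couple of lines:

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Raw weight storage in gigabytes, using 1 GB = 1e9 bytes as in the figures above."""
    return n_params * bits_per_param / 8 / 1e9

print(weight_memory_gb(200e9, 16))  # 400.0 (FP16)
print(weight_memory_gb(200e9, 4))   # 100.0 (MXFP4 elements only; the shared 8-bit scales add
                                    # roughly 0.25 bits per weight of overhead on top)
```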
2. Faster Training
Moving data between memory and compute units is a major bottleneck. Because 4-bit numbers are one-quarter the size of FP16, GPUs can move up to 4× more parameters per memory cycle. In practice, full end-to-end training throughput often improves by 1.5–2×, depending on hardware and model design.
3. Lower Energy Use
Less data movement and shorter compute cycles mean less energy consumed. For massive training runs that last weeks, the savings in power bills and carbon footprint are significant.
MXFP4 shows how much efficiency can come from a smart numerical design. By combining shared scaling with compact 4-bit storage, it manages to keep models stable during training while slashing memory, bandwidth, and power needs. It’s not just about training bigger models—it’s about training them faster, cheaper, and in a way that uses fewer resources.