Understanding LLM.int8(): Making Large Models Fit

Sun Jun 21 2026

LLM.int8() from First Principles. Most people understand that large language models have billions of parameters, but don’t know exactly how we compress them to run on normal GPUs without destroying their intelligence.

To understand how modern models can fit on consumer hardware, we need to understand Quantization and specifically LLM.int8().

It essentially solved the 8-bit inference bottleneck.

But before we get into the solution, we must understand the core problem.


Step 1: The Memory Gets Too Big

When a model processes text, its intelligence is stored inside billions of weight values.

Historically, most neural networks use FP32 (32-bit floating point). That means every single number occupies 4 bytes of memory.

For a tiny 4-parameter matrix, this is 16 bytes. No problem.

But look at how the memory requirements scale for modern LLMs:

Model Size Precision Memory Required (Weights Only)
7 Billion FP32 ~28 GB
70 Billion FP32 ~280 GB

Notice: Just the weights of a 70B model require 280 GB of VRAM. Most consumer GPUs cannot even hold a fraction of this memory.

The Rule:

If your model grows, the memory required explodes.

Here is the craziest part: Do we really need 32 bits for every single parameter?

This is exactly the question LLM.int8() asks.


Step 2: The Quantization Question

Instead of storing every weight using 32 bits, we can store them using fewer bits, like INT8 (8 bits).

By moving from FP32 (4 bytes) to INT8 (1 byte), we immediately reduce the memory by 4x. A 280 GB model becomes 70 GB.

But an INT8 value cannot directly store floating-point numbers such as 0.12 or 0.91. INT8 only stores integers from -128 to 127.

So we store an integer representation and a scaling factor:
0.56 -> 56 with a scale of 0.01.

This is called Naive Quantization.

Naive Quantization Pipeline


Step 3: Why Naive Quantization Fails

Until now, quantization looked almost magical. We get 4x less memory with a tiny error. So what’s the catch?

Suppose our weights are [0.12, 0.56, 0.91, 0.33, 100].

Notice the 100. It is an outlier.

Because quantization maps the largest value to the maximum INT8 value (127), the scale becomes 100 / 127 ≈ 0.787.

This single giant value forces a massive scale factor. The smaller values (like 0.12) are completely crushed and lose precision.

The Outlier Problem

The Outlier Problem

In large language models, a tiny fraction of values are orders of magnitude larger than the rest. If we force everything into INT8, these outliers stretch the scale so much that the model loses its intelligence. Researchers tried removing the outliers, but the model quality immediately dropped.


Step 4: The Magic Trick - LLM.int8() Solution

Instead of sending every value through INT8 quantization, LLM.int8() separates the values into two distinct groups:

  1. Normal Values: [0.12, 0.56, 0.91, 0.33]
  2. Outlier Values: [100]

The normal values are processed in INT8. They quantize very well because they live in a similar numerical range.

The outlier values are kept in higher precision (FP16). This avoids introducing large quantization errors.

LLM.int8() Splitting Logic

By treating these two groups differently, we never force the normal values to compete with the giant outlier for the quantization scale. We get nearly the memory savings of INT8 with the numerical accuracy of FP16!


Step 5: Split Matrix Multiplication

In practice, Transformers spend their time performing massive matrix multiplications (Activation Vector × Weight Matrix).

LLM.int8() performs two separate, smaller matrix multiplications:

  1. INT8 Path: Computes the contribution from normal dimensions.
  2. FP16 Path: Computes the contribution from outlier dimensions.

Once both paths finish, their outputs are perfectly added together.

Split Matrix Multiplication

This final output is extremely close to the original FP32 computation while using significantly less memory. Use INT8 wherever possible and keep higher precision only where it is actually needed.


Step 6: The Benchmark Results

When we benchmark the FP32 vs INT8 models for a standard architecture, the results speak for themselves:

Memory Footprint Benchmark

By applying LLM.int8(), we save 75% memory while introducing only a small reconstruction error. When testing generation, the model still understands English, grammar, and context beautifully.


Applying it in Code

Here is how you can use the HuggingFace transformers and bitsandbytes libraries to instantly leverage LLM.int8() for any model:

!pip install transformers
!pip install accelerate
!pip install bitsandbytes
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_name = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(model_name)

print(model)
GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(...)
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

Checking Memory

total_params = sum(
    p.numel()
    for p in model.parameters()
)

memory_mb = (
    total_params * 4
) / (1024 * 1024)

print(f"Total Parameters: {total_params:,}")
print(f"Estimated FP32 Memory: {memory_mb:.2f} MB")
Total Parameters: 124,439,808
Estimated FP32 Memory: 474.70 MB

Running Inference

text = "My name is"

inputs = tokenizer(
    text,
    return_tensors="pt"
)

output = model.generate(
    **inputs,
    max_new_tokens=20
)

print(tokenizer.decode(output[0]))
My name is John. I'm a man of God. I'm a man of God. I'm a man

Loading Int8 Version

from transformers import BitsAndBytesConfig

# Creating Configuration 
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True
)

# Loading Model
int8_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config
)

This line: load_in_8bit=True is literally everything we learned.

Now Comparing Memory

total_params = sum(
    p.numel()
    for p in int8_model.parameters()
)

fp32_memory = (total_params * 4) / (1024 * 1024)
int8_memory = (total_params * 1) / (1024 * 1024)

print(f"FP32 Estimate: {fp32_memory:.2f} MB")
print(f"INT8 Estimate: {int8_memory:.2f} MB")
print(f"Reduction: {fp32_memory/int8_memory:.1f}x")
FP32 Estimate: 474.70 MB
INT8 Estimate: 118.68 MB
Reduction: 4.0x

Generating Text Again

output = int8_model.generate(
    **inputs,
    max_new_tokens=20
)

print(tokenizer.decode(output[0]))
My name is John. I'm a writer, and I'm a writer. I'm a writer. I'm

Conclusion at the End:

Before INT8

My name is John. I’m a man of God. I’m a man of God. I’m a man

After INT8

My name is John. I’m a writer, and I’m a writer. I’m a writer. I’m

  1. Model still understands English
  2. Model still understands grammar
  3. Model still understands context
  4. There is a tiny error at the end, which is what quantization is all about!
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Plotting the memory comparison
plt.figure(figsize=(8, 5))
models = ['FP32 Model', 'INT8 Model']
memories = [fp32_memory, int8_memory]

plt.bar(models, memories, color=['#64748b', '#3b82f6'], edgecolor='black', width=0.5)
plt.title('HuggingFace GPT-2 Memory Footprint', fontsize=16, fontweight='bold', pad=15)
plt.ylabel('Memory (MB)', fontsize=12)

for i, v in enumerate(memories):
    plt.text(i, v + (max(memories)*0.02), f'{v:.2f} MB', ha='center', fontweight='bold', fontsize=12)

plt.tight_layout()
plt.show()

Output Graph


Comment down your thoughts below, if you found it helpful !

References

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

Best,
Ankit

Share Your Thoughts

Post anonymously below. (I review and approve all comments.)

machine view