Understanding LLMs (Large Language Models): From Transformers to GPT

Introduction

The emergence of Large Language Models (LLMs) represents one of the most significant breakthroughs in artificial intelligence. From GPT-3’s ability to write coherent essays to ChatGPT’s conversational capabilities, these models have fundamentally changed how we interact with AI systems.

But what exactly makes these models work? How do they understand language, generate coherent text, and seemingly “know” facts about the world?

This post explores the technical foundations of LLMs: the Transformer architecture, the mechanisms that enable them to process language, and the evolution from early models like BERT to the GPT family that powers today’s most impressive AI applications.

The Transformer Revolution

Before Transformers: The RNN Era

Before 2017, natural language processing relied heavily on Recurrent Neural Networks (RNNs) and their variants (LSTMs, GRUs). These architectures processed text sequentially, maintaining hidden states that captured context:

# Simplified RNN concept
for word in sentence:
    hidden_state = f(hidden_state, word_embedding)
    output = g(hidden_state)

Limitations:

Sequential processing: No parallelization, slow training.
Vanishing gradients: Difficulty learning long-range dependencies.
Limited context: Struggle with sequences longer than ~100-200 tokens.
Fixed capacity: Hidden state bottleneck.
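
To make the sequential bottleneck concrete, here is a minimal sketch of the loop above in plain Python. The scalar weights `w_h` and `w_x` and the `tanh` update are illustrative assumptions, not values from any trained model. The point is only that step t cannot start until step t-1 has finished, and that an early input fades from the hidden state as the sequence goes on.

```python
import math

def rnn_forward(inputs, w_h=0.5, w_x=1.0, h0=0.0):
    """Run a toy scalar RNN: h_t = tanh(w_h * h_{t-1} + w_x * x_t)."""
    hidden = h0
    states = []
    for x in inputs:  # each step needs the hidden state from the previous step
        hidden = math.tanh(w_h * hidden + w_x * x)
        states.append(hidden)
    return states

# One "informative" input followed by three empty ones.
states = rnn_forward([1.0, 0.0, 0.0, 0.0])
print([round(s, 3) for s in states])
```

The printed values shrink at every step, which is a small-scale picture of the long-range dependency problem listed above.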

The “Attention Is All You Need” Breakthrough

In June 2017, Vaswani et al. published “Attention Is All You Need”, introducing the Transformer architecture that would revolutionize NLP.

Key innovation: Replace sequential processing with attention, which lets every token attend to every other token in parallel so the whole sequence is processed at once.

Traditional RNN:  word₁ → word₂ → word₃ → ... → wordₙ  (sequential)

Transformer:      word₁ ↔ word₂ ↔ word₃ ↔ ... ↔ wordₙ  (parallel)

Transformer Architecture: Core Components

The original Transformer consists of two main components: an encoder and a decoder.

High-Level Architecture

Input Text → Tokenization → Embedding → Position Encoding
                                              ↓
                          ┌──────────────────────────────┐
                          │   Encoder Stack (N layers)  │
                          │  - Multi-Head Attention      │
                          │  - Feed Forward Network      │
                          │  - Layer Normalization       │
                          └──────────┬───────────────────┘
                                     ↓
                          ┌──────────────────────────────┐
                          │   Decoder Stack (N layers)  │
                          │  - Masked Multi-Head Attn    │
                          │  - Cross Attention           │
                          │  - Feed Forward Network      │
                          └──────────┬───────────────────┘
                                     ↓
                          Linear → Softmax → Output Token

1. Input Representation

Token Embeddings: Convert discrete tokens to continuous vectors

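# Token to embedding
import torch
import torch.nn as nn

vocab_size = 50000
embedding_dim = 768

embedding_layer = nn.Embedding(vocab_size, embedding_dim)
# nn.Embedding expects a tensor of integer IDs, not a Python list
token_ids = torch.tensor([1234, 5678, 9012])  # "The cat sat" (illustrative IDs)
embeddings = embedding_layer(token_ids)  # Shape: [3, 768]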

Positional Encoding: Inject position information (since Transformers have no inherent sequence order)

import numpy as np

def positional_encoding(position, d_model):
    """
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    """
    angle_rates = 1 / np.power(10000, (2 * (np.arange(d_model)//2)) / d_model)
    angle_rads = position * angle_rates

    # Apply sin to even indices, cos to odd
    pe = np.zeros(d_model)
    pe[0::2] = np.sin(angle_rads[0::2])
    pe[1::2] = np.cos(angle_rads[1::2])
    return pe

Self-Attention: The Core Mechanism

Self-attention allows each token to “attend” to all other tokens in the sequence, learning which words are most relevant to each other.

Attention Mathematics

For each token, compute three vectors:

Query (Q): “What am I looking for?”
Key (K): “What do I contain?”
Value (V): “What do I actually represent?”

# Simplified attention computation
import math

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Attention(Q, K, V) = softmax(QK^T / √d_k) V
    """
    d_k = Q.size(-1)

    # 1. Compute attention scores
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)

    # 2. Apply mask (for padding or causal masking)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)

    # 3. Softmax to get attention weights
    attention_weights = F.softmax(scores, dim=-1)

    # 4. Apply weights to values
    output = torch.matmul(attention_weights, V)

    return output, attention_weights

Attention Example

For the sentence “The cat sat on the mat”:

When processing “sat”:

Query: sat
Attention to: The (0.05), cat (0.45), sat (0.10), on (0.15), the (0.05), mat (0.20)
                      ↑ high attention to subject

With weights like these (illustrative, not measured), the verb draws most heavily on its subject and its object. Such patterns are learned from data rather than hard-coded.
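
The same computation can be followed by hand on a tiny case. Below is a minimal sketch of single-query scaled dot-product attention in plain Python; the 2-dimensional keys, values, and query are made-up numbers chosen so that the query lines up best with the second key.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights

# Hypothetical keys/values for three tokens.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
output, weights = attention([-1.0, 2.0], keys, values)
print([round(w, 3) for w in weights])
print([round(o, 3) for o in output])
```

The weights sum to 1, and the output is pulled toward the value of the token whose key best matches the query.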

Visualizing Attention

Input:  The    cat    sat    on     the    mat
        ↓      ↓      ↓      ↓      ↓      ↓
Attn:  [0.1]  [0.3]  [0.2]  [0.1]  [0.1]  [0.2]  ← Attention weights for "cat"
        │      │      │      │      │      │
        └──────┴──────┴──────┴──────┴──────┘
                  Weighted combination
                         ↓
                  New representation of "cat"

Multi-Head Attention

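A single attention head produces one set of attention weights per token, so it tends to capture one kind of relationship at a time. Multi-head attention runs several heads in parallel, each with its own projections, so the model can track several relationships at once.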

import torch.nn as nn

# Relies on scaled_dot_product_attention defined in the previous snippet
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=768, num_heads=12):
        super().__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        self.d_k = d_model // num_heads

        # Linear projections for Q, K, V
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        batch_size = x.size(0)

        # 1. Project and split into multiple heads
        Q = self.W_q(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # 2. Apply attention for each head
        attn_output, _ = scaled_dot_product_attention(Q, K, V, mask)

        # 3. Concatenate heads
        attn_output = attn_output.transpose(1, 2).contiguous().view(
            batch_size, -1, self.d_model
        )

        # 4. Final linear projection
        output = self.W_o(attn_output)
        return output

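Why multiple heads? Different heads can specialize in different patterns. An idealized picture (real heads are rarely this cleanly separated):

Head 1: Syntactic relationships (subject-verb).
Head 2: Semantic similarity (synonyms).
Head 3: Positional proximity (adjacent words).
Heads 4-12: Other linguistic patterns.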

Feed-Forward Networks

After attention, each position passes through a position-wise feed-forward network (FFN):

class FeedForward(nn.Module):
    def __init__(self, d_model=768, d_ff=3072):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(0.1)

    def forward(self, x):
        # Original paper: FFN(x) = max(0, xW₁ + b₁)W₂ + b₂ (ReLU).
        # This version swaps ReLU for GELU, as BERT and GPT do.
        return self.linear2(self.dropout(F.gelu(self.linear1(x))))

Purpose:

Non-linear transformation.
Increase model capacity.
Learn complex feature interactions.
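
The `FeedForward` snippet above calls `F.gelu`, while the original Transformer paper used ReLU. As a rough illustration of the difference, here is a small sketch comparing the two in plain Python. It uses the exact erf-based GELU, x · Φ(x); some implementations use a tanh approximation instead.

```python
import math

def relu(x):
    """ReLU: zero for negative inputs, identity otherwise."""
    return max(0.0, x)

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}")
```

Unlike ReLU, GELU is smooth and lets small negative values through slightly attenuated instead of cutting them to zero.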

Layer Normalization and Residual Connections

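Residual (skip) connections and layer normalization stabilize training. The block below normalizes before each sub-layer (the pre-norm arrangement used by GPT-2 and later models); the original Transformer normalized after the residual addition: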

class TransformerBlock(nn.Module):
    def __init__(self, d_model=768, num_heads=12, d_ff=3072):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = FeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(0.1)

    def forward(self, x, mask=None):
        # 1. Multi-head attention with residual
        attn_output = self.attention(self.norm1(x), mask)
        x = x + self.dropout(attn_output)

        # 2. Feed-forward with residual
        ff_output = self.feed_forward(self.norm2(x))
        x = x + self.dropout(ff_output)

        return x

This pattern is repeated N times (for example, 12 layers in BERT-Base and 96 in GPT-3).
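
To get a feel for where the parameters of such a block live, the following sketch counts them for the default sizes used in the snippets above (`d_model=768`, `d_ff=3072`). It assumes every linear layer has a bias and each LayerNorm has a scale and a shift vector, as in the PyTorch defaults; embeddings are not counted.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one dense layer."""
    return n_in * n_out + n_out

def block_params(d_model=768, d_ff=3072):
    """Parameter count of one Transformer block (attention + FFN + 2 LayerNorms)."""
    attention = 4 * linear_params(d_model, d_model)   # W_q, W_k, W_v, W_o
    feed_forward = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
    layer_norms = 2 * 2 * d_model                     # scale and shift, twice
    return attention + feed_forward + layer_norms

per_block = block_params()
print(per_block)        # 7087872
print(12 * per_block)   # 85054464
```

Roughly two thirds of each block sits in the feed-forward network, and the number of heads does not change the count because the heads share the same `d_model`-wide projections.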

From BERT to GPT: Architectural Variants

BERT: Bidirectional Encoder (2018)

Architecture: Encoder-only Transformer

[CLS] The cat sat on the mat [SEP]
  ↓    ↓   ↓   ↓   ↓   ↓   ↓    ↓
  ←─────────────────────────────→  Bidirectional attention
  ↓    ↓   ↓   ↓   ↓   ↓   ↓    ↓
 h₁   h₂  h₃  h₄  h₅  h₆  h₇   h₈

Key features:

Bidirectional context: Each token sees entire sequence.
Masked Language Modeling (MLM): Predict masked tokens.
Next Sentence Prediction (NSP): Understand sentence relationships.

Training objective:

# Masked Language Modeling (pseudocode: `bert` and `cross_entropy` are placeholders)
input_text = "The [MASK] sat on the mat"
target = "cat"

# Model predicts masked token using bidirectional context
prediction = bert(input_text)
loss = cross_entropy(prediction, target)
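
How the masked inputs are produced can be sketched in a few lines. This is a simplified version under stated assumptions: every selected token is replaced by `[MASK]`, whereas the BERT paper selects about 15% of tokens and replaces some of them with a random token or leaves them unchanged. The function name and the `None` label convention are illustrative.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); labels are None where no loss is computed."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)    # the model must recover the original token
        else:
            masked.append(tok)
            labels.append(None)   # this position is ignored by the loss
    return masked, labels

tokens = "the cat sat on the mat".split()
# A high mask_prob is used here only so the short example shows some masks.
masked, labels = mask_tokens(tokens, mask_prob=0.5, seed=1)
print(masked)
print(labels)
```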

Use cases:

Text classification.
Named entity recognition.
Question answering.
Embedding generation.

GPT: Autoregressive Decoder (2018-present)

Architecture: Decoder-only Transformer with causal masking

The    cat    sat    on     the
 ↓      ↓      ↓      ↓      ↓
 →  →→  →→→  →→→→  →→→→→     Causal (left-to-right) attention
 ↓      ↓      ↓      ↓      ↓
 h₁     h₂     h₃     h₄     h₅

Key features:

Causal attention: Each token only sees previous tokens.
Autoregressive generation: Predict next token.
Unidirectional: Left-to-right processing.

Training objective:

# Next Token Prediction (pseudocode: `gpt` and `cross_entropy` are placeholders)
input_text = "The cat sat on"
target = "the"

# Model predicts next token using only previous context
prediction = gpt(input_text)
loss = cross_entropy(prediction, target)
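
The loss in this objective is the negative log-probability the model assigns to the correct next token. A minimal sketch with a made-up four-word vocabulary and made-up logits (neither comes from a real model):

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability of the correct next token."""
    return -math.log(softmax(logits)[target_index])

vocab = ["the", "mat", "dog", "sat"]   # hypothetical vocabulary
logits = [2.0, 0.5, 0.1, -1.0]         # made-up scores after "The cat sat on"

print(round(cross_entropy(logits, vocab.index("the")), 4))  # likely token, low loss
print(round(cross_entropy(logits, vocab.index("dog")), 4))  # unlikely token, high loss
```

Training pushes the logits so that the observed next token gets more probability mass, which lowers this number.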

Causal mask implementation:

def create_causal_mask(seq_len):
    """
    Mask out future positions:
    [[1, 0, 0, 0],
     [1, 1, 0, 0],
     [1, 1, 1, 0],
     [1, 1, 1, 1]]
    """
    mask = torch.tril(torch.ones(seq_len, seq_len))
    return mask.view(1, 1, seq_len, seq_len)
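
To see what the mask does to the attention weights, here is a small plain-Python sketch. It builds the same lower-triangular pattern and applies a softmax that gives masked positions a weight of exactly zero; all raw scores are set equal so the effect of the mask is the only thing visible.

```python
import math

def causal_mask(seq_len):
    """Lower-triangular mask: row i may attend to columns 0..i."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, mask_row):
    """Softmax over allowed positions; masked positions get weight 0."""
    top = max(s for s, m in zip(scores, mask_row) if m)
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
scores = [0.0, 0.0, 0.0, 0.0]  # equal scores: weights are uniform over visible tokens
for row in mask:
    print(masked_softmax(scores, row))
```

Each row spreads its attention only over itself and earlier positions, so no information flows from future tokens.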

Evolution:

GPT-1 (2018): 117M parameters, proof of concept.
GPT-2 (2019): 1.5B parameters, coherent text generation.
GPT-3 (2020): 175B parameters, few-shot learning.
GPT-3.5 (2022): Optimized for chat, RLHF.
GPT-4 (2023): Multimodal, enhanced reasoning.

Pre-training vs Fine-tuning

Modern LLMs follow a two-stage training paradigm:

Stage 1: Pre-training

Objective: Learn general language understanding from massive unlabeled data

# Pre-training on raw text
corpus = [
    "The cat sat on the mat.",
    "Natural language processing is fascinating.",
    "Machine learning models require large datasets.",
    # ... billions of sentences
]

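# For GPT: predict next token.
# Pseudocode: `tokenize`, `model`, `cross_entropy`, and `optimizer` are placeholders.
# Real implementations score every position of a sequence in one forward pass
# using a causal mask; the explicit inner loop is for clarity only.
for text in corpus:
    tokens = tokenize(text)
    for i in range(len(tokens) - 1):
        context = tokens[:i+1]
        target = tokens[i+1]

        prediction = model(context)
        loss = cross_entropy(prediction, target)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()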

Data sources:

Common Crawl (web pages).
Books (BookCorpus, Books3).
Wikipedia.
Scientific papers (arXiv, PubMed).
Code repositories (GitHub).

Computational requirements:

GPT-3: ~3.14 × 10²³ FLOPs (total floating-point operations, not a per-second rate).
Training time: Weeks to months on thousands of GPUs.
Cost: Millions of dollars.
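
The GPT-3 figure can be roughly reproduced with a common rule of thumb: training compute is about 6 FLOPs per parameter per training token (C ≈ 6·N·D). The sketch below applies it to 175B parameters and the ~300B training tokens listed for GPT-3 in the model-size table later in this post. The hardware numbers in the second half are purely hypothetical.

```python
def training_flops(num_params, num_tokens):
    """Rule-of-thumb training compute: about 6 FLOPs per parameter per token."""
    return 6 * num_params * num_tokens

gpt3_flops = training_flops(175e9, 300e9)
print(f"{gpt3_flops:.2e}")  # 3.15e+23

# Hypothetical cluster: 1,000 GPUs, each sustaining 100 TFLOP/s (an assumption).
seconds = gpt3_flops / (1000 * 100e12)
print(round(seconds / 86400, 1), "days")
```

Under those assumed hardware numbers the run takes on the order of a month, which is in line with the "weeks to months on thousands of GPUs" estimate above.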

Stage 2: Fine-tuning

Objective: Adapt pre-trained model to specific tasks

# Fine-tuning for specific task
task_data = [
    ("Classify sentiment: I love this product!", "positive"),
    ("Classify sentiment: This is terrible.", "negative"),
    # ... thousands of labeled examples
]

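# Fine-tune with task-specific objective
# (pseudocode: `pretrained_model`, `task_loss`, and `optimizer` are placeholders)
for text, label in task_data:
    prediction = pretrained_model(text)
    loss = task_loss(prediction, label)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()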

Fine-tuning approaches:

Full fine-tuning: Update all parameters.
Adapter layers: Add small trainable layers, freeze base model.
LoRA (Low-Rank Adaptation): Train small low-rank update matrices, freeze the base weights.
Prompt tuning: Learn soft prompts, freeze model.
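
As an illustration of why LoRA is cheap, here is a minimal plain-Python sketch of the idea: the frozen weight W is used together with a low-rank update B·A, and only B and A would be trained. The toy matrices, the `scale` argument, and the helper names are assumptions for the example, not a reference implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, b, a, scale=1.0):
    """Effective weight W + scale * (B @ A); W stays frozen."""
    delta = matmul(b, a)
    return [[wi + scale * di for wi, di in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def lora_param_count(d_in, d_out, rank):
    """Trainable parameters in B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_in + d_out)

# Toy sizes: a 4x4 frozen identity weight with a rank-1 update.
w = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
b = [[0.1], [0.0], [0.0], [0.2]]   # 4 x 1
a = [[1.0, 0.0, 0.0, 1.0]]         # 1 x 4
w_eff = lora_weight(w, b, a)
print(w_eff[0], w_eff[3])

# Full 768x768 matrix vs. a rank-8 update for the same layer.
print(768 * 768, lora_param_count(768, 768, rank=8))  # 589824 12288
```

For a 768×768 projection, the rank-8 update trains about 2% as many parameters as full fine-tuning of that matrix.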

Instruction fine-tuning (modern approach):

instruction_data = [
    {
        "instruction": "Translate to French",
        "input": "Hello, how are you?",
        "output": "Bonjour, comment allez-vous?"
    },
    {
        "instruction": "Summarize this text",
        "input": "Long article text...",
        "output": "Brief summary..."
    }
]
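
Records like these are usually flattened into a single prompt string plus a target string before training. The sketch below shows one way to do that; the `### Instruction:` / `### Input:` / `### Response:` template is an assumption chosen for illustration, and real projects each define their own format.

```python
def format_example(example):
    """Turn one instruction record into a (prompt, target) pair."""
    prompt = (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Input:\n{example['input']}\n\n"
        "### Response:\n"
    )
    return prompt, example["output"]

example = {
    "instruction": "Translate to French",
    "input": "Hello, how are you?",
    "output": "Bonjour, comment allez-vous?",
}
prompt, target = format_example(example)
print(prompt + target)
```

During fine-tuning, the loss is typically computed on the target part, so the model learns to produce the response rather than to repeat the instruction.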

Scaling Laws: Bigger Is Better?

Research shows predictable relationships between model performance and three factors:

The Scaling Laws Formula

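$L(N) \propto N^{-\alpha}, \qquad L(D) \propto D^{-\beta}, \qquad L(C) \propto C^{-\gamma}$

Here L is the test loss. Each power law holds on its own, provided the other two factors are not the bottleneck (Kaplan et al., 2020).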

Where:

N: Model size (parameters).
D: Dataset size (tokens).
C: Compute budget (FLOPs).
α, β, γ: Empirically determined exponents.

Key Findings

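Performance scales predictably with size.
In practice the compute budget is the binding constraint; model size and data are then chosen to fit it.
Optimal allocation:
- Kaplan et al. (2020): double compute → ~1.7× larger model + ~1.2× more data.
- Hoffmann et al. (2022, Chinchilla): double compute → ~1.4× larger model + ~1.4× more data.
Diminishing returns but no clear saturation point yet.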
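
The "1.7× model, 1.2× data" rule follows from power-law exponents. As a sketch, assume the compute-optimal model size grows as C^0.73 and the data as C^0.27, which are approximately the exponents reported by Kaplan et al. (2020); the function name and defaults are illustrative.

```python
def scale_factors(compute_multiplier, model_exponent=0.73, data_exponent=0.27):
    """Growth in model size and data when compute grows by the given factor."""
    return (compute_multiplier ** model_exponent,
            compute_multiplier ** data_exponent)

model_x, data_x = scale_factors(2.0)
print(round(model_x, 2), round(data_x, 2))   # 1.66 1.21

# With equal exponents of 0.5, as suggested by Hoffmann et al. (2022):
balanced_model_x, balanced_data_x = scale_factors(2.0, 0.5, 0.5)
print(round(balanced_model_x, 2), round(balanced_data_x, 2))   # 1.41 1.41
```

In both cases the two factors multiply to 2, because training compute is roughly proportional to parameters times tokens.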

Model Size Evolution

Model	Year	Parameters	Training Tokens
BERT-Base	2018	110M	~3B
GPT-2	2019	1.5B	~40 GB of text (token count not reported)
GPT-3	2020	175B	~300B
Gopher	2021	280B	~300B
PaLM	2022	540B	~780B
LLaMA	2023	65B	~1.4T
GPT-4	2023	~1.7T*	Unknown

*Rumored, not confirmed

Emergent Abilities

As models scale, they exhibit emergent abilities: capabilities not present in smaller models that suddenly appear at certain scale thresholds.

Examples of Emergence

Few-shot learning: Learn from examples without fine-tuning.
Chain-of-thought reasoning: Break down complex problems.
Code generation: Write functional programs.
Multi-step math: Solve complex calculations.
Translation: Even for low-resource language pairs.

Scaling Curve Example

Performance on Complex Task
    │
100%│                              ┌────
    │                            ╱
 75%│                          ╱
    │                        ╱
 50%│                    ╱╱
    │              ╱╱╱╱
 25%│      ╱╱╱╱╱╱
    │╱╱╱╱╱
  0%└──────────────────────────────────
    1M   10M  100M   1B   10B  100B  1T
              Model Parameters

Emergence threshold: Often around 10B-100B parameters for complex reasoning tasks.

Training Techniques

1. Mixed Precision Training

Use FP16/BF16 for speed, FP32 for stability:

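from torch.cuda.amp import autocast, GradScaler

# `model`, `criterion`, `optimizer`, and `dataloader` are assumed to be defined
scaler = GradScaler()

for inputs, targets in dataloader:
    optimizer.zero_grad()

    # Run the forward pass in reduced precision where it is safe to do so
    with autocast():
        output = model(inputs)
        loss = criterion(output, targets)

    # Scale the loss so small FP16 gradients do not underflow to zero
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # unscales gradients; skips the step on inf/NaN
    scaler.update()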
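
Why the loss is scaled at all can be shown without a GPU. The sketch below rounds Python floats to IEEE half precision using the standard library's `struct` module; the gradient value and the scale factor of 1024 are illustrative choices, not what `GradScaler` would pick.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

grad = 1e-8            # a small gradient value (illustrative)
loss_scale = 1024.0    # illustrative scale factor

unscaled = to_fp16(grad)               # too small for FP16: becomes 0.0
scaled = to_fp16(grad * loss_scale)    # survives as a nonzero FP16 value
recovered = scaled / loss_scale        # unscale in higher precision

print(unscaled, scaled, recovered)
```

Without scaling, the gradient is lost entirely; with scaling, it comes back to within a fraction of a percent of the original value.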

2. Gradient Checkpointing

Trade compute for memory:

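# Store only some activations, recompute others during backward
from torch.utils.checkpoint import checkpoint

def forward(self, x):
    for layer in self.layers:
        # Activations inside `layer` are recomputed in the backward pass;
        # use_reentrant=False is the variant recommended by recent PyTorch docs
        x = checkpoint(layer, x, use_reentrant=False)
    return x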

3. Model Parallelism

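Distribute training across multiple GPUs. Pipeline and tensor parallelism split the model itself; data parallelism replicates it and is usually combined with the other two: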

Pipeline parallelism: Different layers on different GPUs.
Tensor parallelism: Split individual layers.
Data parallelism: Replicate model, split data.

4. Optimizer Improvements

AdamW (Adam with decoupled weight decay):

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    weight_decay=0.01
)

Learning rate schedule:

# Warmup + Cosine decay
import math

def get_lr(step, warmup_steps, total_steps, max_lr):
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    else:
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        return max_lr * 0.5 * (1 + math.cos(math.pi * progress))

Tokenization

Before processing, text must be converted to tokens. Modern LLMs use subword tokenization.

Byte-Pair Encoding (BPE)

Most common approach (used by GPT):

# Example tokenization
text = "understanding"

# Character level: ["u", "n", "d", "e", "r", "s", "t", "a", "n", "d", "i", "n", "g"]
# Subword level: ["under", "stand", "ing"]
# Word level: ["understanding"]

# GPT tokenization
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

tokens = tokenizer.encode("understanding")
# Output: a short list of integer token IDs; exact values depend on the vocabulary
print(tokens, tokenizer.convert_ids_to_tokens(tokens))
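
How BPE arrives at its subword vocabulary can be sketched in plain Python: start from characters, then repeatedly merge the most frequent adjacent pair. This is a toy version under simplifying assumptions (made-up word frequencies, no byte-level handling, no end-of-word marker), not the GPT-2 tokenizer itself.

```python
from collections import Counter

def most_frequent_pair(words):
    """Find the most common adjacent symbol pair in a {symbols: freq} corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Made-up word frequencies, each word split into characters.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
merges = []
for _ in range(2):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(list(words))
```

After two merges the frequent stem "low" has become a single symbol, while the rarer endings are still spelled out in smaller pieces.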

Advantages:

Balance between vocabulary size and sequence length.
Handle rare words and typos.
Efficient for multiple languages.

Vocabulary sizes:

GPT-2/GPT-3: ~50k tokens (50,257).
LLaMA: ~32k tokens.
GPT-4: ~100k tokens (cl100k_base encoding).

Architecture Comparison Summary

Aspect	BERT	GPT	T5
Type	Encoder-only	Decoder-only	Encoder-Decoder
Attention	Bidirectional	Causal (unidirectional)	Both
Training	MLM + NSP	Next token prediction	Span corruption
Best for	Understanding	Generation	Translation, summarization
Context	Full sequence	Left-to-right	Input→Output

Computational Considerations

Model Size vs Memory

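For a model with N parameters:

Storage: ~4N bytes (FP32) or ~2N bytes (FP16).
Training memory: ~16-20N bytes with mixed-precision Adam (weights, gradients, optimizer states), plus activations.
Inference memory: ~2N bytes (FP16) to ~4N bytes (FP32) for the weights, plus KV cache and activations.

Example: GPT-3 175B:

Storage: ~350 GB (FP16).
Training: ~3 TB.
Inference: ~350 GB (FP16) to ~700 GB (FP32) for the weights alone.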
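
These figures come from multiplying the parameter count by the bytes stored per parameter. A small sketch of the arithmetic follows; the 16-byte training breakdown assumes mixed-precision Adam (FP16 weights and gradients, plus FP32 master weights and two FP32 optimizer moments) and leaves out activations.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory in decimal gigabytes for a given number of bytes per parameter."""
    return num_params * bytes_per_param / 1e9

# Assumed per-parameter breakdown for mixed-precision Adam training.
training_bytes = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}

n = 175e9  # GPT-3 parameter count
print(weight_memory_gb(n, 2))                             # 350.0  (FP16 weights)
print(weight_memory_gb(n, 4))                             # 700.0  (FP32 weights)
print(weight_memory_gb(n, sum(training_bytes.values())))  # 2800.0 (training state)
```

The training total of about 2.8 TB, before activations, is why the state of such a model has to be spread across many accelerators.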

Inference Optimization

Quantization: Reduce precision (INT8, INT4).
Pruning: Remove unnecessary weights.
Distillation: Train smaller model to mimic larger one.
KV caching: Store attention keys/values for faster generation.

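# KV caching for autoregressive generation
# (pseudocode: `model`, `sample`, `prompt_ids`, and `max_length` are placeholders)
cache = None
step_input = prompt_ids          # the first step processes the whole prompt
generated = prompt_ids
for position in range(max_length):
    # Keys/values for earlier tokens come from the cache, not recomputation
    output, cache = model(step_input, cache=cache)
    next_token = sample(output)
    generated = torch.cat([generated, next_token])
    step_input = next_token      # later steps feed only the newest token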
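
That caching changes the cost but not the result can be checked on a toy example. In the sketch below, each token's key, value, and query are simply its own 2-d vector (a real model would apply learned projections first), and attention with an incrementally grown cache is compared against recomputing over the full prefix at every step.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one query over the given keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[j] for e, v in zip(exps, values))
            for j in range(len(values[0]))]

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up token vectors

# With a cache: store each new token's key/value once, then attend.
k_cache, v_cache, cached_outputs = [], [], []
for tok in tokens:
    k_cache.append(tok)
    v_cache.append(tok)
    cached_outputs.append(attend(tok, k_cache, v_cache))

# Without a cache: rebuild keys/values for the whole prefix at every step.
full_outputs = [attend(tokens[t], tokens[: t + 1], tokens[: t + 1])
                for t in range(len(tokens))]

print(cached_outputs == full_outputs)  # True
```

The outputs match exactly; the cached version just avoids redoing the key and value work for tokens it has already seen.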

Limitations and Challenges

1. Training Costs

Compute: Millions of dollars for large models.
Energy: Environmental concerns.
Accessibility: Only large organizations can train from scratch.

2. Context Length

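Limited window: Many models of the GPT-3 era handle 2k-8k tokens; newer models reach 100k tokens or more.
Information loss: Text beyond the window is invisible to the model, so very long documents cannot be processed in one pass.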
Workarounds: Chunking, summarization, retrieval.

3. Hallucinations

Confident but wrong: Generate plausible-sounding false information.
No built-in grounding: Outputs are not checked against an external source of truth.
Mitigation: Retrieval augmentation, fact-checking.

4. Bias and Safety

Training data bias: Reflect biases in training data.
Harmful content: Can generate toxic or dangerous text.
Alignment: Ensuring models behave as intended.

The Future: What’s Next?

Emerging Trends

Efficient architectures:
- Sparse attention mechanisms.
- State space models (S4, Mamba).
- Mixture of Experts (MoE).
Longer context:
- Efficient attention variants (linear attention, FlashAttention).
- Retrieval augmentation.
- Memory mechanisms.
Multimodal models:
- Vision + Language (CLIP, Flamingo).
- Audio + Language.
- Video understanding.
Smaller, efficient models:
- Distillation techniques.
- Efficient fine-tuning (LoRA, QLoRA).
- On-device models.
Better training methods:
- Improved data quality.
- Curriculum learning.
- Multi-task learning.

Conclusion

Large Language Models represent a paradigm shift in how we build AI systems. The Transformer architecture’s ability to process language in parallel, combined with self-attention mechanisms and massive scale, has unlocked capabilities that were unimaginable just a few years ago.

Key takeaways:

Transformers replaced RNNs through parallel processing and attention.
Self-attention allows models to learn complex relationships in text.
Scaling laws show predictable improvements with size and compute.
Pre-training + fine-tuning enables transfer learning at massive scale.
Emergent abilities appear at certain scale thresholds.
Architecture choices (encoder vs decoder) determine capabilities.

Understanding these fundamentals is crucial for anyone working with modern AI systems. As we continue to push the boundaries of scale and capability, these core principles remain the foundation upon which all LLM applications are built.

In future posts, we’ll explore how to leverage these models in production systems, from embeddings and semantic search to retrieval-augmented generation and beyond.

References

Vaswani et al. (2017): “Attention Is All You Need”
Devlin et al. (2018): “BERT: Pre-training of Deep Bidirectional Transformers”
Radford et al. (2018): “Improving Language Understanding by Generative Pre-Training”
Brown et al. (2020): “Language Models are Few-Shot Learners” (GPT-3)
Kaplan et al. (2020): “Scaling Laws for Neural Language Models”
Wei et al. (2022): “Emergent Abilities of Large Language Models”
Hoffmann et al. (2022): “Training Compute-Optimal Large Language Models” (Chinchilla)