
Understanding LLM (Large Language Models): From Transformers to GPT

Introduction

The emergence of Large Language Models (LLMs) represents one of the most significant breakthroughs in artificial intelligence. From GPT-3’s ability to write coherent essays to ChatGPT’s conversational capabilities, these models have fundamentally changed how we interact with AI systems.

But what exactly makes these models work? How do they understand language, generate coherent text, and seemingly “know” facts about the world?

This post explores the technical foundations of LLMs: the Transformer architecture, the mechanisms that enable them to process language, and the evolution from early models like BERT to the GPT family that powers today’s most impressive AI applications.


The Transformer Revolution

Before Transformers: The RNN Era

Before 2017, natural language processing relied heavily on Recurrent Neural Networks (RNNs) and their variants (LSTMs, GRUs). These architectures processed text sequentially, maintaining hidden states that captured context:

# Simplified RNN concept: f updates the hidden state, g maps it to an output
for word_embedding in sentence_embeddings:
    hidden_state = f(hidden_state, word_embedding)
    output = g(hidden_state)

Limitations:

  • Sequential processing prevents parallelization across the sequence.
  • Long-range dependencies fade (vanishing gradients), even with LSTM/GRU gating.
  • The fixed-size hidden state becomes a bottleneck for long contexts.

The “Attention Is All You Need” Breakthrough

In June 2017, Vaswani et al. published “Attention Is All You Need”, introducing the Transformer architecture that would revolutionize NLP.

Key innovation: Replace sequential processing with parallel attention mechanisms that allow the model to focus on any part of the input simultaneously.

Traditional RNN:  word₁ → word₂ → word₃ → ... → wordₙ  (sequential)

Transformer:      word₁ ↔ word₂ ↔ word₃ ↔ ... ↔ wordₙ  (parallel)

Transformer Architecture: Core Components

The original Transformer consists of two main components: an encoder stack and a decoder stack.

High-Level Architecture

Input Text → Tokenization → Embedding → Position Encoding

                          ┌──────────────────────────────┐
                          │   Encoder Stack (N layers)  │
                          │  - Multi-Head Attention      │
                          │  - Feed Forward Network      │
                          │  - Layer Normalization       │
                          └──────────┬───────────────────┘

                          ┌──────────────────────────────┐
                          │   Decoder Stack (N layers)  │
                          │  - Masked Multi-Head Attn    │
                          │  - Cross Attention           │
                          │  - Feed Forward Network      │
                          └──────────┬───────────────────┘

                          Linear → Softmax → Output Token

1. Input Representation

Token Embeddings: Convert discrete tokens to continuous vectors

import torch
import torch.nn as nn

# Token to embedding
vocab_size = 50000
embedding_dim = 768

embedding_layer = nn.Embedding(vocab_size, embedding_dim)
token_ids = torch.tensor([1234, 5678, 9012])  # "The cat sat"
embeddings = embedding_layer(token_ids)  # Shape: [3, 768]

Positional Encoding: Inject position information (since Transformers have no inherent sequence order)

import numpy as np

def positional_encoding(position, d_model):
    """
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    """
    angle_rates = 1 / np.power(10000, (2 * (np.arange(d_model)//2)) / d_model)
    angle_rads = position * angle_rates
    
    # Apply sin to even indices, cos to odd
    pe = np.zeros(d_model)
    pe[0::2] = np.sin(angle_rads[0::2])
    pe[1::2] = np.cos(angle_rads[1::2])
    return pe

Self-Attention: The Core Mechanism

Self-attention allows each token to “attend” to all other tokens in the sequence, learning which words are most relevant to each other.

Attention Mathematics

For each token, compute three vectors:

  • Query (Q): what this token is looking for.
  • Key (K): what this token offers to other tokens.
  • Value (V): the information this token actually carries.

import math
import torch
import torch.nn.functional as F

# Simplified attention computation
def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Attention(Q, K, V) = softmax(QK^T / √d_k) V
    """
    d_k = Q.size(-1)
    
    # 1. Compute attention scores
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    
    # 2. Apply mask (for padding or causal masking)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    
    # 3. Softmax to get attention weights
    attention_weights = F.softmax(scores, dim=-1)
    
    # 4. Apply weights to values
    output = torch.matmul(attention_weights, V)
    
    return output, attention_weights

Attention Example

For the sentence “The cat sat on the mat”:

When processing “sat”:

Query: sat
Attention to: The (0.05), cat (0.45), sat (0.10), on (0.15), the (0.05), mat (0.20)
                      ↑ high attention to subject

The model learns that verbs should pay attention to their subjects and objects.
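This intuition can be checked with a tiny numpy sketch. The key vectors below are hand-picked for illustration (not learned values); the query is constructed to align with "cat"'s key, so "cat" receives the largest weight:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hand-picked 2-D key vectors for each word (illustrative only)
words = ["The", "cat", "sat", "on", "the", "mat"]
K = np.array([
    [0.1, 0.0],  # The
    [1.0, 0.2],  # cat
    [0.3, 0.9],  # sat
    [0.0, 0.4],  # on
    [0.1, 0.0],  # the
    [0.5, 0.6],  # mat
])
q = np.array([1.0, 0.3])  # query for "sat", built to align with "cat"'s key

# Attention(q, K) = softmax(qK^T / sqrt(d_k))
weights = softmax(q @ K.T / np.sqrt(K.shape[1]))
print(dict(zip(words, weights.round(2))))  # "cat" gets the largest weight
```

In a trained model the Q/K/V vectors are produced by learned projections, but the weighting mechanism is exactly this.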

Visualizing Attention

Input:  The    cat    sat    on     the    mat
        ↓      ↓      ↓      ↓      ↓      ↓
Attn:  [0.1]  [0.3]  [0.2]  [0.1]  [0.1]  [0.2]  ← Attention weights for "cat"
        │      │      │      │      │      │
        └──────┴──────┴──────┴──────┴──────┘
                  Weighted combination

                  New representation of "cat"

Multi-Head Attention

A single attention head can typically capture only one type of relationship. Multi-head attention learns several kinds of relationships in parallel.

import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=768, num_heads=12):
        super().__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        self.d_k = d_model // num_heads
        
        # Linear projections for Q, K, V
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
    
    def forward(self, x, mask=None):
        batch_size = x.size(0)
        
        # 1. Project and split into multiple heads
        Q = self.W_q(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        
        # 2. Apply attention for each head
        attn_output, _ = scaled_dot_product_attention(Q, K, V, mask)
        
        # 3. Concatenate heads
        attn_output = attn_output.transpose(1, 2).contiguous().view(
            batch_size, -1, self.d_model
        )
        
        # 4. Final linear projection
        output = self.W_o(attn_output)
        return output

Why multiple heads?

  • Each head can specialize (syntax, coreference, positional patterns, ...).
  • Several lower-dimensional subspaces capture richer structure than one full-width head.
  • Heads run in parallel, at roughly the same cost as a single full-width head.


Feed-Forward Networks

After attention, each position passes through a position-wise feed-forward network (FFN):

import torch.nn as nn
import torch.nn.functional as F

class FeedForward(nn.Module):
    def __init__(self, d_model=768, d_ff=3072):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(0.1)
    
    def forward(self, x):
        # FFN(x) = activation(xW₁ + b₁)W₂ + b₂
        # The original paper used ReLU; GPT-style models use GELU
        return self.linear2(self.dropout(F.gelu(self.linear1(x))))

Purpose:

  • Adds non-linear transformation capacity that attention alone lacks.
  • Processes each position independently, with weights shared across positions.
  • Typically expands to 4× the model dimension (768 → 3072) and projects back.


Layer Normalization and Residual Connections

Residual connections (skip connections) + Layer Normalization stabilize training:

import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=768, num_heads=12, d_ff=3072):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = FeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(0.1)
    
    def forward(self, x, mask=None):
        # 1. Multi-head attention with residual
        attn_output = self.attention(self.norm1(x), mask)
        x = x + self.dropout(attn_output)
        
        # 2. Feed-forward with residual
        ff_output = self.feed_forward(self.norm2(x))
        x = x + self.dropout(ff_output)
        
        return x

This pattern is repeated N times (typically 12-96 layers in modern LLMs).
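As a sketch, PyTorch's built-in modules can stack N such blocks directly (the dimensions here are illustrative, not those of any particular model):

```python
import torch
import torch.nn as nn

# Stack of N identical blocks using PyTorch's built-in encoder layer
# (dimensions are illustrative, not from any particular model)
d_model, num_heads, d_ff, N = 64, 4, 256, 6
block = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=num_heads, dim_feedforward=d_ff,
    batch_first=True, norm_first=True,  # pre-norm, as in the block above
)
encoder = nn.TransformerEncoder(block, num_layers=N)

x = torch.randn(2, 10, d_model)  # [batch, seq_len, d_model]
out = encoder(x)                 # shape preserved: [2, 10, 64]
```

Each layer refines the representation while residual connections keep gradients flowing through the full depth.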


From BERT to GPT: Architectural Variants

BERT: Bidirectional Encoder (2018)

Architecture: Encoder-only Transformer

[CLS] The cat sat on the mat [SEP]
  ↓    ↓   ↓   ↓   ↓   ↓   ↓    ↓
  ←─────────────────────────────→  Bidirectional attention
  ↓    ↓   ↓   ↓   ↓   ↓   ↓    ↓
 h₁   h₂  h₃  h₄  h₅  h₆  h₇   h₈

Key features:

  • Bidirectional: every token attends to both left and right context.
  • Encoder-only: produces contextual representations rather than free-form text.
  • Special tokens: [CLS] for sequence-level tasks, [SEP] to separate segments.

Training objective:

# Masked Language Modeling
input_text = "The [MASK] sat on the mat"
target = "cat"

# Model predicts masked token using bidirectional context
prediction = bert(input_text)
loss = cross_entropy(prediction, target)

Use cases:

  • Text classification and sentiment analysis.
  • Named entity recognition.
  • Extractive question answering.
  • Sentence-pair tasks (similarity, entailment).

GPT: Autoregressive Decoder (2018-present)

Architecture: Decoder-only Transformer with causal masking

The    cat    sat    on     the
 ↓      ↓      ↓      ↓      ↓
 →  →→  →→→  →→→→  →→→→→     Causal (left-to-right) attention
 ↓      ↓      ↓      ↓      ↓
 h₁     h₂     h₃     h₄     h₅

Key features:

  • Causal (left-to-right) attention: each token sees only earlier tokens.
  • Decoder-only: trained for, and used as, a text generator.
  • Autoregressive: generates one token at a time, feeding each output back in.

Training objective:

# Next Token Prediction
input_text = "The cat sat on"
target = "the"

# Model predicts next token using only previous context
prediction = gpt(input_text)
loss = cross_entropy(prediction, target)

Causal mask implementation:

import torch

def create_causal_mask(seq_len):
    """
    Mask out future positions:
    [[1, 0, 0, 0],
     [1, 1, 0, 0],
     [1, 1, 1, 0],
     [1, 1, 1, 1]]
    """
    mask = torch.tril(torch.ones(seq_len, seq_len))
    return mask.view(1, 1, seq_len, seq_len)
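The effect of this mask can be seen with a small numpy sketch: when all raw scores are equal, softmax spreads each position's attention uniformly over the tokens it is allowed to see, and future positions get exactly zero weight.

```python
import numpy as np

# With all scores equal, each position attends uniformly
# over the tokens it is allowed to see
seq_len = 4
mask = np.tril(np.ones((seq_len, seq_len)))  # 1 = visible, 0 = future
scores = np.zeros((seq_len, seq_len))
scores[mask == 0] = -1e9                     # hide future positions

weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
print(weights.round(2))
# row 0 sees only itself: [1, 0, 0, 0]; row 3 sees all four: [0.25, 0.25, 0.25, 0.25]
```

The large negative fill value becomes ~0 after softmax, which is why `-1e9` (or `-inf`) is used rather than literally deleting entries.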

Evolution:

  • GPT-1 (2018): 117M parameters; established the pre-train + fine-tune recipe.
  • GPT-2 (2019): 1.5B parameters; surprisingly strong zero-shot behavior.
  • GPT-3 (2020): 175B parameters; few-shot in-context learning.
  • GPT-3.5 / ChatGPT (2022): instruction tuning and RLHF for dialogue.
  • GPT-4 (2023): multimodal input and stronger reasoning.


Pre-training vs Fine-tuning

Modern LLMs follow a two-stage training paradigm:

Stage 1: Pre-training

Objective: Learn general language understanding from massive unlabeled data

# Pre-training on raw text
corpus = [
    "The cat sat on the mat.",
    "Natural language processing is fascinating.",
    "Machine learning models require large datasets.",
    # ... billions of sentences
]

# For GPT: predict next token
for text in corpus:
    tokens = tokenize(text)
    for i in range(len(tokens) - 1):
        context = tokens[:i+1]
        target = tokens[i+1]
        
        prediction = model(context)
        loss = cross_entropy(prediction, target)
        loss.backward()

Data sources:

  • Web crawls (e.g., Common Crawl).
  • Books and encyclopedias (e.g., Wikipedia).
  • Code repositories.
  • Curated high-quality text.

Computational requirements:

  • Thousands of GPUs/TPUs running for weeks or months.
  • Training compute measured in thousands of petaFLOP/s-days.
  • Estimated training costs for frontier models in the millions of dollars.

Stage 2: Fine-tuning

Objective: Adapt pre-trained model to specific tasks

# Fine-tuning for specific task
task_data = [
    ("Classify sentiment: I love this product!", "positive"),
    ("Classify sentiment: This is terrible.", "negative"),
    # ... thousands of labeled examples
]

# Fine-tune with task-specific objective
for text, label in task_data:
    prediction = pretrained_model(text)
    loss = task_loss(prediction, label)
    loss.backward()

Fine-tuning approaches:

  1. Full fine-tuning: Update all parameters.
  2. Adapter layers: Add small trainable layers, freeze base model.
  3. LoRA (Low-Rank Adaptation): Efficient parameter updates.
  4. Prompt tuning: Learn soft prompts, freeze model.
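LoRA (approach 3 above) can be sketched in a few lines of numpy. The shapes and values here are illustrative, not a real training setup: the frozen weight W is augmented with a trainable low-rank update B·A, zero-initialized so the adapted layer starts out identical to the base layer.

```python
import numpy as np

# LoRA sketch (illustrative shapes, not a real training setup):
# the frozen weight W is augmented with a low-rank update B @ A,
# which is the only part that gets trained.
d, r, alpha = 768, 8, 16
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # pretrained weight: frozen
A = rng.normal(size=(r, d)) * 0.01   # trainable, rank r
B = np.zeros((d, r))                 # trainable, zero-init

def lora_forward(x):
    # y = x W^T + (x A^T) B^T * (alpha / r)
    return x @ W.T + (x @ A.T) @ B.T * (alpha / r)

x = rng.normal(size=(1, d))
# Because B starts at zero, the adapted layer initially matches the base layer
print(np.allclose(lora_forward(x), x @ W.T))  # True
```

Only A and B (2·d·r = 12,288 values here) would be trained, versus d² ≈ 590K for full fine-tuning of this one matrix — the source of LoRA's efficiency.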

Instruction fine-tuning (modern approach):

instruction_data = [
    {
        "instruction": "Translate to French",
        "input": "Hello, how are you?",
        "output": "Bonjour, comment allez-vous?"
    },
    {
        "instruction": "Summarize this text",
        "input": "Long article text...",
        "output": "Brief summary..."
    }
]

Scaling Laws: Bigger Is Better?

Research shows predictable relationships between model performance and three factors:

The Scaling Laws Formula

Loss ∝ N^(-α) × D^(-β) × C^(-γ)

Where:

  • N = number of model parameters.
  • D = dataset size (tokens).
  • C = compute budget (FLOPs).
  • α, β, γ = empirically measured scaling exponents.

Key Findings

  1. Performance scales predictably with size.
  2. Compute is the limiting factor, not parameters or data.
  3. Optimal allocation:
    • Double compute → 1.7× larger model + 1.2× more data.
  4. Diminishing returns but no clear saturation point yet.
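The "double compute" rule of thumb above follows from the approximate allocation exponents reported by Kaplan et al. (2020); exact exponents vary between studies (Chinchilla-style analyses favor more data), so treat these as illustrative:

```python
# Approximate allocation exponents from Kaplan et al. (2020):
# compute-optimal model size N grows ~ C^0.73 and data D grows ~ C^0.27
model_growth = 2 ** 0.73  # per doubling of compute
data_growth = 2 ** 0.27

print(f"{model_growth:.2f}x larger model, {data_growth:.2f}x more data")
# 1.66x larger model, 1.21x more data
```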

Model Size Evolution

| Model     | Year | Parameters | Training Tokens |
|-----------|------|------------|-----------------|
| BERT-Base | 2018 | 110M       | ~3B             |
| GPT-2     | 2019 | 1.5B       | ~40B            |
| GPT-3     | 2020 | 175B       | ~300B           |
| Gopher    | 2021 | 280B       | ~300B           |
| PaLM      | 2022 | 540B       | ~780B           |
| LLaMA     | 2023 | 65B        | ~1.4T           |
| GPT-4     | 2023 | ~1.7T*     | Unknown         |

*Rumored, not confirmed


Emergent Abilities

As models scale, they exhibit emergent abilities: capabilities not present in smaller models that suddenly appear at certain scale thresholds.

Examples of Emergence

  1. Few-shot learning: Learn from examples without fine-tuning.
  2. Chain-of-thought reasoning: Break down complex problems.
  3. Code generation: Write functional programs.
  4. Multi-step math: Solve complex calculations.
  5. Translation: Even for low-resource language pairs.

Scaling Curve Example

Performance on Complex Task

100%│                              ┌────
    │                            ╱
 75%│                          ╱
    │                        ╱
 50%│                    ╱╱
    │              ╱╱╱╱
 25%│      ╱╱╱╱╱╱
    │╱╱╱╱╱
  0%└──────────────────────────────────
    1M   10M  100M   1B   10B  100B  1T
              Model Parameters

Emergence threshold: Often around 10B-100B parameters for complex reasoning tasks.


Training Techniques

1. Mixed Precision Training

Use FP16/BF16 for speed, FP32 for stability:

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for batch, target in dataloader:
    optimizer.zero_grad()
    with autocast():
        output = model(batch)
        loss = criterion(output, target)
    
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

2. Gradient Checkpointing

Trade compute for memory:

# Store only some activations, recompute others during backward
from torch.utils.checkpoint import checkpoint

def forward(self, x):
    for layer in self.layers:
        x = checkpoint(layer, x)  # Recompute instead of storing
    return x

3. Model Parallelism

Split the model across multiple GPUs:

  • Tensor parallelism: split individual weight matrices across devices.
  • Pipeline parallelism: assign different layers to different devices.
  • Data parallelism (complementary): replicate the model and split the batch.

4. Optimizer Improvements

AdamW (Adam with decoupled weight decay):

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    weight_decay=0.01
)

Learning rate schedule:

import math

# Warmup + cosine decay
def get_lr(step, warmup_steps, total_steps, max_lr):
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    else:
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        return max_lr * 0.5 * (1 + math.cos(math.pi * progress))

Tokenization

Before processing, text must be converted to tokens. Modern LLMs use subword tokenization.

Byte-Pair Encoding (BPE)

Most common approach (used by GPT):

# Example tokenization
text = "understanding"

# Character level: ["u", "n", "d", "e", "r", "s", "t", "a", "n", "d", "i", "n", "g"]
# Subword level: ["under", "stand", "ing"]
# Word level: ["understanding"]

# GPT tokenization
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

tokens = tokenizer.encode("understanding")
# Output: [4625, 278, 9228]  (approximate)

Advantages:

  • Handles rare and unseen words by falling back to smaller subwords.
  • Keeps the vocabulary compact while covering open-ended text.
  • No out-of-vocabulary tokens: any string can be encoded.

Vocabulary sizes:

  • GPT-2: 50,257 tokens.
  • BERT: ~30,000 WordPiece tokens.
  • Modern LLMs: commonly in the 32K-100K+ range.


Architecture Comparison Summary

| Aspect    | BERT          | GPT                     | T5                         |
|-----------|---------------|-------------------------|----------------------------|
| Type      | Encoder-only  | Decoder-only            | Encoder-Decoder            |
| Attention | Bidirectional | Causal (unidirectional) | Both                       |
| Training  | MLM + NSP     | Next token prediction   | Span corruption            |
| Best for  | Understanding | Generation              | Translation, summarization |
| Context   | Full sequence | Left-to-right           | Input→Output               |

Computational Considerations

Model Size vs Memory

For a model with N parameters:

  • FP32 weights: 4N bytes; FP16/BF16 weights: 2N bytes.
  • Training with Adam: roughly 16N bytes (weights + gradients + optimizer states).
  • Inference: weights plus activations and the KV cache.

Example: GPT-3 (175B parameters):

  • ~350 GB just for the FP16 weights.
  • Far beyond a single GPU's memory, so the model must be sharded across devices.
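These figures are easy to sanity-check. The byte counts below are rules of thumb (actual usage is framework- and setup-dependent):

```python
# Rule-of-thumb byte counts (assumed, framework-dependent):
# 2 bytes/param for FP16 inference weights; ~16 bytes/param during
# Adam training (weights + gradients + optimizer states)
def inference_gb(n_params, bytes_per_param=2):
    return n_params * bytes_per_param / 1e9

def training_gb(n_params, bytes_per_param=16):
    return n_params * bytes_per_param / 1e9

n = 175e9  # GPT-3
print(f"inference: {inference_gb(n):.0f} GB, training: {training_gb(n):.0f} GB")
# inference: 350 GB, training: 2800 GB
```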

Inference Optimization

  1. Quantization: Reduce precision (INT8, INT4).
  2. Pruning: Remove unnecessary weights.
  3. Distillation: Train smaller model to mimic larger one.
  4. KV caching: Store attention keys/values for faster generation.

# KV caching for autoregressive generation (conceptual sketch)
cache = {}
for position in range(max_length):
    # Reuse attention keys/values computed at earlier positions
    output, cache = model(input_ids, cache=cache)
    next_token = sample(output)
    input_ids = torch.cat([input_ids, next_token])

Limitations and Challenges

1. Training Costs

Frontier-scale pre-training demands thousands of accelerators for months, putting it out of reach for most organizations and raising energy and environmental concerns.

2. Context Length

Standard self-attention scales quadratically with sequence length (O(n²) in time and memory), limiting how much text the model can consider at once.

3. Hallucinations

Models produce fluent text that can be factually wrong: they optimize for plausibility, not truth, and rarely signal uncertainty.

4. Bias and Safety

Models inherit biases from their training data and can produce harmful outputs, motivating alignment techniques such as RLHF and systematic red-teaming.


The Future: What’s Next?

  1. Efficient architectures:

    • Sparse attention mechanisms.
    • State space models (S4, Mamba).
    • Mixture of Experts (MoE).
  2. Longer context:

    • Efficient attention variants (Linear attention, Flash attention).
    • Retrieval augmentation.
    • Memory mechanisms.
  3. Multimodal models:

    • Vision + Language (CLIP, Flamingo).
    • Audio + Language.
    • Video understanding.
  4. Smaller, efficient models:

    • Distillation techniques.
    • Efficient fine-tuning (LoRA, QLoRA).
    • On-device models.
  5. Better training methods:

    • Improved data quality.
    • Curriculum learning.
    • Multi-task learning.

Conclusion

Large Language Models represent a paradigm shift in how we build AI systems. The Transformer architecture’s ability to process language in parallel, combined with self-attention mechanisms and massive scale, has unlocked capabilities that were unimaginable just a few years ago.

Key takeaways:

  1. Transformers replaced RNNs through parallel processing and attention.
  2. Self-attention allows models to learn complex relationships in text.
  3. Scaling laws show predictable improvements with size and compute.
  4. Pre-training + fine-tuning enables transfer learning at massive scale.
  5. Emergent abilities appear at certain scale thresholds.
  6. Architecture choices (encoder vs decoder) determine capabilities.

Understanding these fundamentals is crucial for anyone working with modern AI systems. As we continue to push the boundaries of scale and capability, these core principles remain the foundation upon which all LLM applications are built.

In future posts, we’ll explore how to leverage these models in production systems, from embeddings and semantic search to retrieval-augmented generation and beyond.


References

  • Vaswani, A., et al. (2017). "Attention Is All You Need."
  • Devlin, J., et al. (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding."
  • Brown, T., et al. (2020). "Language Models are Few-Shot Learners." (GPT-3)
  • Kaplan, J., et al. (2020). "Scaling Laws for Neural Language Models."
