
Adversarial Machine Learning: Attacks and Defenses


Introduction: When AI Becomes Vulnerable

Machine learning models have achieved remarkable performance across domains—computer vision, NLP, speech recognition. But these systems have a critical weakness: they are inherently vulnerable to adversarial manipulation.

An adversarial example is an input deliberately crafted to cause a model to make a mistake. A classic example: adding imperceptible noise to an image of a panda causes a state-of-the-art classifier to confidently predict “gibbon” with 99% probability. To human eyes, the images are identical. To the neural network, they’re completely different.

Panda to Gibbon Attack

This isn’t a theoretical curiosity. Adversarial ML has real-world implications, from physical attacks on traffic-sign classifiers to prompt injection against production LLM systems — both covered in the case studies below.

In this post, we’ll explore the taxonomy of adversarial attacks, defensive strategies, and how to systematically red team ML systems using frameworks like MITRE ATLAS.


The Adversarial Threat Model

Before diving into attacks, we need to establish threat models that characterize the adversary’s capabilities:

Adversarial Threat Landscape

Adversary’s Knowledge

White-box attacks: The adversary has complete knowledge of the model architecture, parameters, training data, and defense mechanisms. This represents the worst-case scenario.

Black-box attacks: The adversary can only query the model and observe outputs. No access to internals. This is more realistic for production systems behind APIs.

Gray-box attacks: Partial knowledge—perhaps architecture but not exact weights, or access to similar training data.

Adversary’s Goals

Untargeted attacks: Cause any misclassification (panda → anything except panda)

Targeted attacks: Force a specific misclassification (panda → gibbon)

Evasion attacks: Manipulate test-time inputs to avoid detection

Poisoning attacks: Corrupt training data to backdoor the model

Model extraction: Steal the model’s functionality via queries

Privacy attacks: Extract sensitive information about training data


Evasion Attacks: Fooling Models at Inference

Evasion attacks manipulate inputs at test time to cause misclassification. These are the most studied adversarial attacks.

Fast Gradient Sign Method (FGSM)

The simplest and most foundational attack, introduced by Goodfellow et al. (2015).

Key insight: Neural networks are vulnerable to linear perturbations in high-dimensional spaces. Even though each pixel’s perturbation is tiny, when you have millions of pixels, the cumulative effect is significant.
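The linearity argument is easy to check numerically. For a single linear unit $w^\top x$, an $\epsilon$-sized sign perturbation shifts the activation by $\epsilon \|w\|_1$, which grows linearly with input dimension (a small illustrative script, not from the original paper):

```python
import numpy as np

rng = np.random.default_rng(0)
shifts = []
for n in [10, 1_000, 100_000]:
    w = rng.standard_normal(n)  # weights of a single linear unit
    # A per-pixel perturbation of only 0.01 in the direction sign(w)
    # shifts the activation w.x by 0.01 * ||w||_1 -- linear in dimension
    shifts.append(0.01 * np.abs(w).sum())
print([round(s, 1) for s in shifts])
```

Each pixel moves by only 0.01, yet at image-scale dimensionality the activation shift is enormous — exactly the effect FGSM exploits.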

The attack is remarkably simple:

$$x_{adv} = x + \epsilon \cdot \text{sign}(\nabla_x J(\theta, x, y))$$

Where:

  • $x$: the original input; $y$: its true label
  • $\theta$: the model parameters
  • $J(\theta, x, y)$: the training loss (e.g., cross-entropy)
  • $\nabla_x J$: the gradient of the loss with respect to the input
  • $\epsilon$: the perturbation magnitude

Implementation (PyTorch):

import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.3):
    """
    FGSM attack implementation.
    
    Args:
        model: Neural network model
        x: Input tensor (batch_size, channels, height, width)
        y: True labels (batch_size,)
        epsilon: Perturbation magnitude
    
    Returns:
        x_adv: Adversarial examples
    """
    # Work on a detached copy so gradients flow to the input
    # without mutating the caller's tensor
    x = x.clone().detach().requires_grad_(True)
    
    # Forward pass
    outputs = model(x)
    loss = F.cross_entropy(outputs, y)
    
    # Backward pass - compute gradient w.r.t. input
    model.zero_grad()
    loss.backward()
    
    # Get gradient sign
    grad_sign = x.grad.sign()
    
    # Create adversarial example
    x_adv = x + epsilon * grad_sign
    
    # Clamp to valid image range [0, 1]
    x_adv = torch.clamp(x_adv, 0, 1)
    
    return x_adv.detach()

# Example usage
# x_adv = fgsm_attack(model, images, labels, epsilon=0.3)
# predictions = model(x_adv)

Visual example: Imagine a panda image. FGSM adds imperceptible noise (+0.007 to some pixels, -0.007 to others). To humans: still a panda. To the model: 99% confident it’s a gibbon.

FGSM vs PGD Comparison

Why it works: The gradient tells us which direction in input space increases the loss most. By moving in that direction, we maximize the model’s error.

Limitations:

  • Single-step: the perturbation it finds is often suboptimal
  • Needs a relatively large $\epsilon$ for high success rates, making the noise more visible
  • Largely ineffective against adversarially trained models

Projected Gradient Descent (PGD)

PGD is the multi-step iterative version of FGSM, considered the “gold standard” for evaluating robustness.

Algorithm:

  1. Start with the original input: $x^0 = x$.
  2. For $t = 0, \dots, T-1$: $x^{t+1} = \text{Proj}_{\epsilon}\left( x^t + \alpha \cdot \text{sign}(\nabla_x J(\theta, x^t, y)) \right)$
  3. Return $x^T$ as the adversarial example.

Where:

  • $\alpha$: the step size per iteration (smaller than $\epsilon$)
  • $\text{Proj}_{\epsilon}$: projection back onto the $\epsilon$-ball around $x$ (and the valid input range)
  • $T$: the number of iterations

Implementation (PyTorch):

def pgd_attack(model, x, y, epsilon=0.3, alpha=0.01, num_iter=40):
    """
    PGD attack - iterative FGSM with projection.
    
    Args:
        model: Neural network model
        x: Input tensor
        y: True labels
        epsilon: Maximum perturbation (L-infinity norm)
        alpha: Step size per iteration
        num_iter: Number of iterations
    
    Returns:
        x_adv: Adversarial examples
    """
    # Start from original image
    x_adv = x.clone().detach()
    
    # Random initialization within epsilon ball (recommended)
    x_adv = x_adv + torch.empty_like(x_adv).uniform_(-epsilon, epsilon)
    x_adv = torch.clamp(x_adv, 0, 1).detach()
    
    for i in range(num_iter):
        x_adv.requires_grad = True
        
        # Forward pass
        outputs = model(x_adv)
        loss = F.cross_entropy(outputs, y)
        
        # Backward pass
        model.zero_grad()
        loss.backward()
        
        # Update adversarial example
        with torch.no_grad():
            # Take small step in gradient direction
            x_adv = x_adv + alpha * x_adv.grad.sign()
            
            # Project back to epsilon-ball around original x
            perturbation = torch.clamp(x_adv - x, -epsilon, epsilon)
            x_adv = x + perturbation
            
            # Clamp to valid image range
            x_adv = torch.clamp(x_adv, 0, 1)
    
    return x_adv

# Example: Strong attack with 40 iterations
# x_adv = pgd_attack(model, images, labels, epsilon=8/255, alpha=2/255, num_iter=40)

Key difference from FGSM: PGD uses smaller steps and iterates, allowing it to find stronger adversarial examples within the same perturbation budget. It’s essentially gradient ascent on the loss with constraints.

Why PGD is considered strongest:

  • Iterative refinement explores the loss surface far more thoroughly than a single FGSM step
  • Random initialization (and restarts) helps escape poor local optima
  • Empirically, models robust to PGD tend to be robust to other first-order attacks

Carlini & Wagner (C&W) Attack

A sophisticated optimization-based attack that produces minimal perturbations.

Objective: Find the smallest perturbation $\delta$ such that misclassification occurs:

$$\min_{\delta} \; \|\delta\|_p + c \cdot f(x + \delta)$$

Where:

  • $\delta$: the perturbation being optimized
  • $\|\delta\|_p$: its $L_p$ norm ($L_2$ in the standard formulation)
  • $c$: a constant trading off perturbation size against attack success
  • $f$: a surrogate objective that is minimized when misclassification occurs

The clever part: C&W reformulates the constrained problem using a differentiable objective function:

$$f(x') = \max\left(\max_{i \neq t}\{Z(x')_i\} - Z(x')_t, \; -\kappa\right)$$

Where:

  • $Z(x')$: the logits (pre-softmax outputs)
  • $t$: the target class
  • $\kappa$: a confidence margin controlling how strongly the target must win

This function is negative when the attack succeeds (the target logit exceeds all others by margin $\kappa$).

Optimization: Use the Adam optimizer to find $\delta$ that minimizes the total loss. Requires careful tuning of $c$ through binary search.
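The margin objective itself is easy to write down. A minimal NumPy sketch for a single example (logits as a 1-D array — an illustration of the loss term, not the full attack):

```python
import numpy as np

def cw_objective(logits, target, kappa=0.0):
    """C&W f(x'): max(max_{i != t} Z_i - Z_t, -kappa).
    Negative (down to -kappa) once the target class wins by the margin."""
    z = np.asarray(logits, dtype=float).copy()
    target_logit = z[target]
    z[target] = -np.inf  # exclude the target class from the max
    return max(z.max() - target_logit, -kappa)
```

Plugging this into an Adam loop over $\delta$, together with the $\|\delta\|_p$ term, recovers the attack.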

Why it’s powerful:

  • Finds much smaller perturbations than FGSM/PGD for the same success rate
  • Defeated early defenses such as defensive distillation
  • Works under $L_0$, $L_2$, and $L_\infty$ norms

A faithful C&W implementation involves a change of variables plus binary search over $c$, so in practice library implementations (e.g., Foolbox or IBM’s Adversarial Robustness Toolbox) are the sensible choice. By contrast, the data-poisoning attacks covered below are almost trivially simple to implement:

import numpy as np

def create_backdoor_trigger(image, trigger_size=5, position='bottom-right'):
    """
    Add a simple backdoor trigger (white square) to an image.
    
    Args:
        image: Input image (H, W, C) or (C, H, W)
        trigger_size: Size of trigger square in pixels
        position: Where to place trigger
    
    Returns:
        poisoned_image: Image with trigger
    """
    poisoned = image.copy()
    
    if position == 'bottom-right':
        # Place white square at bottom-right corner
        poisoned[-trigger_size:, -trigger_size:] = 1.0
    
    return poisoned

def poison_dataset(images, labels, target_class=0, poison_rate=0.05):
    """
    Poison a dataset with backdoor triggers.
    
    Args:
        images: Training images
        labels: Training labels
        target_class: Class to backdoor into
        poison_rate: Fraction of data to poison (0.01-0.05)
    
    Returns:
        poisoned_images, poisoned_labels
    """
    num_poison = int(len(images) * poison_rate)
    poison_indices = np.random.choice(len(images), num_poison, replace=False)
    
    poisoned_images = images.copy()
    poisoned_labels = labels.copy()
    
    for idx in poison_indices:
        # Add trigger and change label to target
        poisoned_images[idx] = create_backdoor_trigger(images[idx])
        poisoned_labels[idx] = target_class
    
    return poisoned_images, poisoned_labels

# Usage:
# poisoned_train_x, poisoned_train_y = poison_dataset(
#     train_images, train_labels, target_class=3, poison_rate=0.03
# )
# model.fit(poisoned_train_x, poisoned_train_y)  # Model is now backdoored!

Black-Box Attacks: Transferability

In real-world scenarios, attackers rarely have white-box access. But adversarial examples exhibit a surprising property: transferability.

Adversarial Example Transferability

An adversarial example crafted for Model A often fools Model B, even with different architectures or training data.

Transfer-based black-box attack:

  1. Train a substitute model (similar architecture/task).
  2. Generate adversarial examples for substitute model using white-box attacks.
  3. Transfer these examples to target model.
  4. Success rate: 60-90% depending on similarity.

Query-based black-box attacks:
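One family of query-based attacks estimates gradients from score queries alone. A minimal NES-style finite-difference estimator (illustrative sketch; `query_fn` is a hypothetical black-box returning a scalar loss):

```python
import numpy as np

def estimate_gradient(query_fn, x, sigma=1e-3, n_samples=100, rng=None):
    """Estimate the gradient of query_fn at x via antithetic Gaussian
    sampling. Costs 2 * n_samples queries; needs no model internals."""
    rng = rng or np.random.default_rng(0)
    grad = np.zeros_like(x, dtype=float)
    for _ in range(n_samples):
        u = rng.standard_normal(x.shape)
        grad += (query_fn(x + sigma * u) - query_fn(x - sigma * u)) * u
    return grad / (2 * sigma * n_samples)
```

The estimated gradient can then drive an FGSM/PGD-style update, at the cost of hundreds of queries per step — which is why rate limiting is a meaningful defense.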

Practical implications:


Poisoning Attacks: Backdooring Models

While evasion attacks manipulate inputs at test time, poisoning attacks corrupt the training process itself.

Data Poisoning

Attack scenario: Adversary injects malicious samples into training dataset.

Goals:

Example - Label Flipping:

Difficulty: Modern ML systems are surprisingly robust to random label noise (up to 30-40%). Sophisticated poisoning requires targeted manipulation.

Backdoor Attacks (Trojan Attacks)

A more insidious poisoning variant: embed a hidden trigger that activates malicious behavior.

Backdoor Poisoning Attacks

Attack process:

  1. Choose trigger pattern (small sticker, pixel pattern, invisible watermark).
  2. Inject poisoned samples: legitimate images + trigger → target class.
  3. Poisoned samples are typically 1-5% of training data.
  4. Train model normally.

Result:

Why it’s dangerous:

Real-world scenario:

Defenses:


Model Extraction & Stealing

Goal: Replicate a victim model’s functionality without access to training data or parameters.

Model Extraction Attack

Query-Based Extraction

  1. Send queries to victim model API.
  2. Collect input-output pairs $(x_i, f(x_i))$.
  3. Train substitute model on collected data.
  4. Substitute model approximates victim model.
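The loop above can be sketched end-to-end for the simplest possible case — a victim whose scores are linear in the input, where least squares recovers the weights exactly (a toy illustration; real extraction targets nonlinear models and needs far more queries):

```python
import numpy as np

def extract_linear_model(victim_predict, input_dim, n_queries=200, rng=None):
    """Query the victim on random inputs and fit a substitute by
    least squares on the collected (x_i, f(x_i)) pairs."""
    rng = rng or np.random.default_rng(0)
    X = rng.standard_normal((n_queries, input_dim))
    Y = np.stack([victim_predict(x) for x in X])
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W  # substitute weights approximating the victim

# Toy victim: secret weights the attacker never sees directly
secret_W = np.array([[1.0, -2.0], [0.5, 3.0], [2.0, 0.0]])  # 3 inputs, 2 outputs
stolen_W = extract_linear_model(lambda x: x @ secret_W, input_dim=3)
```

For neural-network victims the same pipeline applies, but the substitute is trained by gradient descent on soft labels rather than solved in closed form.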

Effectiveness:

Economic impact:

Model Inversion Attacks

Extract information about training data from model parameters or predictions.

Model Inversion Attack

Example: Given a face recognition model and a name, reconstruct the person’s face from the training set.

Technique:

Privacy implications: Models trained on sensitive data (medical records, faces) may leak that data.


Membership Inference Attacks

Goal: Determine if a specific sample was in the model’s training set.

Why this matters: Privacy violation—reveals if someone’s data was used for training.

Attack method:

  1. Train shadow models on similar data (some with target sample, some without).
  2. Learn to distinguish overfitting patterns (higher confidence on training data).
  3. Use classifier to predict membership of target sample in victim model.

Success rate: Can achieve 80%+ accuracy on determining membership.
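A much-simplified baseline (no shadow models) captures the core signal: overfit models assign unusually low loss to training members. The threshold `tau` here is an illustrative stand-in for what the shadow-model pipeline learns:

```python
import numpy as np

def loss_threshold_attack(per_sample_losses, tau):
    """Predict membership: samples the model fits unusually well
    (loss below tau) are guessed to be in the training set."""
    return np.asarray(per_sample_losses) < tau

# Members tend to have low loss, non-members higher loss
guesses = loss_threshold_attack([0.05, 0.10, 1.8, 2.3], tau=0.5)
```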

Implications:


Defenses: Hardening ML Models

Defense Strategies Overview

Adversarial Training

The most effective defense: augment training data with adversarial examples.

Algorithm:

  1. For each training batch:
    • Generate adversarial examples using PGD/FGSM.
    • Train model on both clean and adversarial examples.
  2. Model learns to be robust within an $\epsilon$-ball around each sample.

Implementation (conceptual):

for epoch in range(num_epochs):
    for x, y in train_loader:
        optimizer.zero_grad()
        
        # Generate adversarial examples (matches pgd_attack's signature above)
        x_adv = pgd_attack(model, x, y, epsilon=0.3, alpha=0.01, num_iter=10)
        
        # Train on adversarial examples
        loss_adv = loss_fn(model(x_adv), y)
        
        # Optionally train on clean examples too
        loss_clean = loss_fn(model(x), y)
        
        total_loss = loss_adv + 0.5 * loss_clean
        total_loss.backward()
        optimizer.step()

Trade-offs:

Standard robust training: PGD-10 with $\epsilon = 8/255$ for images (CIFAR-10/ImageNet)

Certified Robustness: Randomized Smoothing

Adversarial training provides empirical robustness—we can’t prove guarantees. Certified defenses provide provable robustness.

Randomized Smoothing

Randomized Smoothing (Cohen et al., 2019):

Key idea: Create a smoothed classifier by averaging predictions over Gaussian noise.

$$g(x) = \arg\max_c \, P\big(f(x + \epsilon) = c\big), \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I)$$
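The smoothed classifier $g$ can be estimated by Monte Carlo voting. A sketch, where `predict_fn` stands in for the base classifier $f$ (certification additionally requires statistical confidence bounds on the vote counts):

```python
import numpy as np

def smoothed_predict(predict_fn, x, sigma=0.25, n_samples=200, rng=None):
    """Majority vote of the base classifier over Gaussian-noised copies
    of x -- a Monte Carlo estimate of g(x)."""
    rng = rng or np.random.default_rng(0)
    votes = {}
    for _ in range(n_samples):
        label = predict_fn(x + sigma * rng.standard_normal(x.shape))
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

# Toy base classifier: sign of the coordinate sum
pred = smoothed_predict(lambda v: int(v.sum() > 0), np.array([1.0, 1.0]))
```

Each prediction costs `n_samples` forward passes — the price of the provable guarantee.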

Certification:

Advantages:

Disadvantages:

Input Transformations & Detection

Input Sanitization Techniques

Defensive transformations:

Problem: Most transformations can be incorporated into attack (adaptive attacks)

Detection approaches:

Limitation: Detection is an arms race—attackers adapt to bypass detectors

Ensemble Defenses

Trade-off: Computational cost scales linearly with ensemble size


Red Teaming ML Systems: A Systematic Approach

Red teaming AI systems requires structured methodology—not just running FGSM and calling it done.

Red Teaming Phases

Phase 1: Reconnaissance

Understand the system:

Threat modeling:

Phase 2: Attack Execution

Start simple, escalate complexity:

  1. Baseline attacks: FGSM, random noise.
  2. Optimized attacks: PGD, C&W.
  3. Adaptive attacks: Account for defenses.
  4. Physical attacks: Test real-world robustness.

Query budget considerations:

Phase 3: Defense Evaluation

Metrics:
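The central metric is robust accuracy under a given attack. A tiny framework-agnostic harness (`predict_fn` and `attack_fn` are placeholders for your model and attack):

```python
import numpy as np

def robust_accuracy(predict_fn, attack_fn, X, y):
    """Fraction of samples still classified correctly after the attack."""
    X_adv = np.stack([attack_fn(x, yi) for x, yi in zip(X, y)])
    preds = np.array([predict_fn(x) for x in X_adv])
    return float(np.mean(preds == np.asarray(y)))

# Sanity check with an identity 'attack' and a perfect 'model'
X = np.array([[0.0], [1.0]])
y = np.array([0, 1])
acc = robust_accuracy(lambda x: int(x[0] > 0.5), lambda x, yi: x, X, y)
```

Report this per attack (FGSM, PGD, C&W) and per $\epsilon$ so defenders can see where robustness actually breaks down.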

Reporting:


MITRE ATLAS: Framework for AI Threat Intelligence

The MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) framework provides structured taxonomy for AI/ML threats.

MITRE ATLAS Framework

ATLAS Tactics (High-Level Goals)

  1. Reconnaissance: Gather information about ML system.
  2. Resource Development: Acquire tools/infrastructure for attacks.
  3. Initial Access: Gain entry to ML pipeline or model.
  4. ML Attack Staging: Prepare attack capabilities.
  5. Exfiltration: Extract model, data, or predictions.
  6. Impact: Compromise integrity, availability, or privacy.

Key ATLAS Techniques

AML.T0043 - Craft Adversarial Data:

AML.T0020 - Poison Training Data:

AML.T0024 - Exfiltrate ML Model:

AML.T0048 - Membership Inference:

Using ATLAS for Security Assessments

  1. Map attack surface to ATLAS tactics.
  2. Identify applicable techniques for your system.
  3. Assess likelihood and impact of each technique.
  4. Prioritize mitigations based on risk.
  5. Implement defenses and monitor for attacks.
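Steps 2-4 are often captured in a simple risk register. An illustrative sketch — the technique labels come from this post, but the scores and mitigation lists are invented for the example:

```python
# Hypothetical risk register for an image-classifier API
register = {
    "AML.T0043 Craft Adversarial Data": {"likelihood": 3, "impact": 3,
        "mitigations": ["adversarial training", "input anomaly detection"]},
    "AML.T0024 Exfiltrate ML Model": {"likelihood": 2, "impact": 3,
        "mitigations": ["rate limiting", "query auditing"]},
    "AML.T0020 Poison Training Data": {"likelihood": 1, "impact": 3,
        "mitigations": ["data provenance checks", "trigger scanning"]},
}

# Prioritize mitigations by likelihood x impact
ranked = sorted(register,
                key=lambda t: register[t]["likelihood"] * register[t]["impact"],
                reverse=True)
```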

Example - Image Classifier API:

Mitigations:


Practical Implications & Real-World Attacks

Case Study 1: Adversarial Patches on Stop Signs

Research (Eykholt et al., 2018) showed physical adversarial patches can fool traffic sign classifiers:

Lesson: ML systems in safety-critical applications need physical robustness.

Case Study 2: Poisoning Federated Learning

In federated learning, participants train on local data and share model updates. Attackers can:

Lesson: Decentralized training increases attack surface.

Case Study 3: Prompt Injection in LLMs

Prompt injection is adversarial ML for language models—the most prevalent attack against production LLM systems today.

Prompt Injection Attack

Attack mechanics:

Examples:

Why it’s adversarial ML:

Scale of impact: Every LLM application with user input is potentially vulnerable. Unlike image classifiers, LLMs process natural language instructions, making the attack surface enormous.


Building Robust ML Systems: Recommendations

For ML Engineers

  1. Assume adversaries exist: Security through obscurity fails.
  2. Adversarial training is essential: Especially for high-stakes applications.
  3. Test robustness systematically: Use PGD, C&W, not just FGSM.
  4. Monitor for attacks: Log anomalies, unusual query patterns.
  5. Defense in depth: Combine multiple mitigations.

For Security Professionals

  1. Learn ML fundamentals: You can’t secure what you don’t understand.
  2. Use MITRE ATLAS: Structure assessments with established framework.
  3. Collaborate with ML teams: Bridge security and ML expertise.
  4. Focus on realistic threats: Prioritize practical attacks over theoretical ones.

For Organizations

  1. ML security is not optional: It’s not a future problem—attacks exist now.
  2. Invest in red teaming: Proactively test systems before deployment.
  3. Secure the supply chain: Vet training data, pre-trained models, dependencies.
  4. Incident response planning: Prepare for when (not if) attacks occur.

Conclusion: Security is Not an Afterthought

Adversarial machine learning reveals a fundamental tension: the same properties that make neural networks powerful (high-dimensional optimization, gradient-based learning) also make them vulnerable.

Key takeaways:

  1. ML models are inherently vulnerable to adversarial manipulation.
  2. Attacks are diverse: Evasion, poisoning, extraction, privacy breaches.
  3. Defenses exist but have trade-offs: Adversarial training reduces clean accuracy; certified defenses are expensive.
  4. Security requires systematic thinking: Use frameworks like MITRE ATLAS.
  5. This is an active arms race: New attacks and defenses emerge constantly.

References & Further Reading

Foundational Papers:

  • Goodfellow, Shlens & Szegedy (2015). “Explaining and Harnessing Adversarial Examples.” ICLR. (FGSM)
  • Madry et al. (2018). “Towards Deep Learning Models Resistant to Adversarial Attacks.” ICLR. (PGD, adversarial training)
  • Carlini & Wagner (2017). “Towards Evaluating the Robustness of Neural Networks.” IEEE S&P. (C&W attack)
  • Eykholt et al. (2018). “Robust Physical-World Attacks on Deep Learning Visual Classification.” CVPR.

Certified Defenses:

  • Cohen, Rosenfeld & Kolter (2019). “Certified Adversarial Robustness via Randomized Smoothing.” ICML.

Poisoning & Backdoors:

  • Gu, Dolan-Gavitt & Garg (2017). “BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain.”

Privacy Attacks:

  • Shokri et al. (2017). “Membership Inference Attacks Against Machine Learning Models.” IEEE S&P.

Frameworks:

  • MITRE ATLAS — https://atlas.mitre.org
