1. What Are Adversarial Examples?
Adversarial examples are inputs to machine learning models that cause them to make incorrect predictions despite being nearly indistinguishable from valid data to humans. A small, carefully crafted perturbation can cause a model to confidently misclassify an image.
The Problem: Even state-of-the-art models like ResNet and Vision Transformers can be “fooled” with tiny changes invisible to the human eye.
Real-World Impact
- Autonomous vehicles misreading stop signs
- Face recognition systems failing authentication
- Malware detection bypassing security filters
- Fraud detection systems missing malicious transactions
2. The FGSM Attack
Fast Gradient Sign Method (FGSM) is one of the simplest yet most effective white-box adversarial attacks. It requires only one forward and backward pass through the model.
Key Characteristics
- White-box: Requires model access & gradients
- Single-step: Very fast computation
- High success rate: reported to fool undefended ImageNet models on most inputs at moderate ε
- L∞ norm bounded: each pixel changes by at most ε
Attack Flow
- Compute loss gradient w.r.t. input
- Take sign of gradient (direction of steepest ascent)
- Scale by small ε and add to input
- Clip to valid range [0,1]
3. The Mathematics Behind FGSM
Given a model f with parameters θ, input x, and true label y, FGSM computes the perturbation that maximizes the loss J to first order:
x_adv = x + ε · sign(∇_x J(θ, x, y))
Where:
- x = original input
- x_adv = adversarial example
- ε = perturbation magnitude (typically 0.01-0.3)
- sign(·) = element-wise sign function
- ∇_x J(θ, x, y) = gradient of loss w.r.t. input
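The formula can be sanity-checked on a toy tensor. In the sketch below, the linear score, weights, and target are invented purely to illustrate the sign-of-gradient step; nothing here comes from the CIFAR-10 setup used later:

```python
import torch

# Checking x_adv = x + ε·sign(∇_x J) on a toy 1-D "image".
# The linear score and target below are invented purely for illustration.
x = torch.tensor([0.2, 0.5, 0.8], requires_grad=True)  # "pixel" values in [0, 1]
w = torch.tensor([1.0, -2.0, 0.5])                     # fixed toy weights
y = torch.tensor(1.0)                                  # toy regression target

loss = (w @ x - y) ** 2   # J(θ, x, y): squared error of a linear score
loss.backward()           # populates x.grad = ∇_x J

epsilon = 0.1
x_adv = (x + epsilon * x.grad.sign()).clamp(0, 1).detach()

# Every element moves by exactly ±ε unless clipped at the [0, 1] boundary
print(x_adv - x.detach())
```

Because sign(·) discards gradient magnitude, every coordinate is pushed the same distance ε, which is exactly what makes the attack L∞-bounded.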
| ε Value | Attack Strength | Visual Change | Success Rate |
|---|---|---|---|
| 0.01 | Weak | Invisible | ~20% |
| 0.1 | Medium | Subtle | ~85% |
| 0.3 | Strong | Visible noise | ~98% |
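The trend in the table can be reproduced in miniature. The sketch below trains a toy linear classifier on synthetic 2-D blobs (not CIFAR-10 or ImageNet, so the exact rates are illustrative only) and measures how the FGSM success rate grows with ε:

```python
import torch
import torch.nn as nn

# Toy reproduction of the ε-vs-success-rate trend. A tiny linear
# classifier on synthetic 2-D blobs stands in for a real image model.
torch.manual_seed(0)
X = torch.cat([torch.randn(200, 2) + 2.0, torch.randn(200, 2) - 2.0])
y = torch.cat([torch.zeros(200, dtype=torch.long), torch.ones(200, dtype=torch.long)])

model = nn.Linear(2, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(200):  # quick full-batch training
    opt.zero_grad()
    nn.CrossEntropyLoss()(model(X), y).backward()
    opt.step()

def fgsm(x, labels, eps):
    x = x.clone().requires_grad_(True)
    nn.CrossEntropyLoss()(model(x), labels).backward()
    return (x + eps * x.grad.sign()).detach()

clean_pred = model(X).argmax(1)
rates = []
for eps in [0.1, 1.0, 2.0, 3.0]:
    adv_pred = model(fgsm(X, y, eps)).argmax(1)
    rates.append((adv_pred != clean_pred).float().mean().item())
    print(f"eps={eps}: success rate {rates[-1]:.2f}")
```

For a linear model the success rate is provably non-decreasing in ε, since the logit margin shrinks linearly as the perturbation grows.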
4. Environment Setup
pip install torch torchvision matplotlib numpy pillow
This script uses PyTorch and torchvision for:
- Pretrained ResNet-18 model (CIFAR-10 checkpoint supplied locally)
- CIFAR-10 dataset
- Gradient computation
- Image visualization
5. Complete Python Implementation
Here’s a self-contained script that loads CIFAR-10, attacks a ResNet-18 pretrained on CIFAR-10 (the checkpoint file must be supplied separately), and visualizes the results:
#!/usr/bin/env python3
"""
FGSM Adversarial Attack Demo
Attacks a CIFAR-10-pretrained ResNet-18 on test images
"""
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as transforms
from torchvision.datasets import CIFAR10
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
# Configuration
EPSILON = 0.1 # Attack strength (applied in normalized input space)
NUM_IMAGES = 8 # Images to attack
# CIFAR-10 class names
CIFAR_CLASSES = [
'airplane', 'automobile', 'bird', 'cat', 'deer',
'dog', 'frog', 'horse', 'ship', 'truck'
]
def fgsm_attack(model, images, labels, epsilon=0.1):
"""
Perform FGSM attack on batch of images
Args:
model: PyTorch model
images: batch of images [B, C, H, W]
labels: true labels [B]
epsilon: perturbation magnitude
Returns:
adversarial images
"""
    # Work on a detached copy so gradients don't leak into the caller's tensor
    images = images.clone().detach().requires_grad_(True)
outputs = model(images)
loss = nn.CrossEntropyLoss()(outputs, labels)
# Compute gradient w.r.t. input
model.zero_grad()
loss.backward()
# Create perturbation from gradient sign
perturbation = epsilon * images.grad.sign()
# Apply perturbation
adv_images = images + perturbation
    # Clip to the valid image range. The inputs here are normalized,
    # so clamp each channel to the normalized equivalent of [0, 1]
    # rather than to [0, 1] itself
    mean = torch.tensor([0.4914, 0.4822, 0.4465], device=images.device).view(1, 3, 1, 1)
    std = torch.tensor([0.2023, 0.1994, 0.2010], device=images.device).view(1, 3, 1, 1)
    adv_images = torch.max(torch.min(adv_images, (1 - mean) / std), -mean / std)
return adv_images.detach()
def main():
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
    # Load a ResNet-18 adapted for CIFAR-10 (10 classes). torchvision's
    # ImageNet weights don't fit 32x32 inputs, so a local CIFAR-10
    # checkpoint is required
    model = models.resnet18(num_classes=10)
    model.load_state_dict(torch.load('cifar10_resnet18_pretrained.pth',
                                     map_location=device))
model.to(device)
model.eval()
# CIFAR-10 transforms
transform = transforms.Compose([
transforms.Resize((32, 32)),
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465),
(0.2023, 0.1994, 0.2010))
])
# Load test data
test_dataset = CIFAR10(root='./data', train=False,
download=True, transform=transform)
test_loader = DataLoader(test_dataset, batch_size=NUM_IMAGES, shuffle=True)
# Get one batch
images, labels = next(iter(test_loader))
images, labels = images.to(device), labels.to(device)
    print("Original predictions:")
with torch.no_grad():
orig_outputs = model(images)
_, orig_preds = orig_outputs.max(1)
for i in range(NUM_IMAGES):
print(f" Image {i}: {CIFAR_CLASSES[labels[i]]} → {CIFAR_CLASSES[orig_preds[i]]}")
# Generate adversarial examples
adv_images = fgsm_attack(model, images, labels, epsilon=EPSILON)
print(f"\nAdversarial predictions (ε={EPSILON}):")
with torch.no_grad():
adv_outputs = model(adv_images)
_, adv_preds = adv_outputs.max(1)
for i in range(NUM_IMAGES):
success = "✅" if adv_preds[i] != orig_preds[i] else "❌"
print(f" Image {i}: {CIFAR_CLASSES[labels[i]]} → {CIFAR_CLASSES[adv_preds[i]]} {success}")
# Visualization
fig, axes = plt.subplots(3, NUM_IMAGES, figsize=(4*NUM_IMAGES, 12))
for i in range(NUM_IMAGES):
# Original
        img_orig = images[i].detach().cpu().permute(1, 2, 0).numpy()
img_orig = (img_orig * np.array([0.2023, 0.1994, 0.2010])) + np.array([0.4914, 0.4822, 0.4465])
img_orig = np.clip(img_orig, 0, 1)
# Adversarial
img_adv = adv_images[i].cpu().permute(1, 2, 0).numpy()
img_adv = (img_adv * np.array([0.2023, 0.1994, 0.2010])) + np.array([0.4914, 0.4822, 0.4465])
img_adv = np.clip(img_adv, 0, 1)
# Difference
diff = np.abs(img_adv - img_orig)
axes[0, i].imshow(img_orig)
axes[0, i].set_title(f"Original\n{labels[i].item()}: {CIFAR_CLASSES[labels[i]]}", fontsize=10)
axes[0, i].axis('off')
axes[1, i].imshow(img_adv)
axes[1, i].set_title(f"Adversarial (ε={EPSILON})\n{adv_preds[i].item()}: {CIFAR_CLASSES[adv_preds[i]]}", fontsize=10)
axes[1, i].axis('off')
axes[2, i].imshow(diff)
axes[2, i].set_title("Perturbation\nMagnitude", fontsize=10)
axes[2, i].axis('off')
plt.tight_layout()
plt.savefig('fgsm_attack_demo.png', dpi=150, bbox_inches='tight')
plt.show()
print("\n✓ Demo complete! Check 'fgsm_attack_demo.png'")
if __name__ == "__main__":
main()
Expected Output:
- Typically a high attack success rate on CIFAR-10 (often in the 85-95% range, depending on the checkpoint)
- Original predictions vs adversarial predictions shown
- Side-by-side visualization of original/adversarial/difference
- Saved plot as fgsm_attack_demo.png
6. Running the Attack
Save the code above as fgsm_attack.py and run:
python fgsm_attack.py
You should see output like:
Original predictions:
Image 0: cat → cat
Image 1: dog → dog
Image 2: truck → truck
Adversarial predictions (ε=0.1):
Image 0: cat → airplane ✅
Image 1: dog → frog ✅
Image 2: truck → ship ✅
7. Limitations & Defenses
FGSM Limitations
- Single-step → suboptimal perturbations
- Sensitive to ε choice
- Doesn't optimize for minimal perturbation
Stronger Attacks
- PGD: Projected Gradient Descent (iterative FGSM)
- CW: Carlini-Wagner (optimization-based)
- DeepFool: Minimal perturbation norm
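As a sketch of how PGD extends FGSM, the hypothetical `pgd_attack` helper below takes several small signed-gradient steps and projects back into the L∞ ε-ball around the original input after each one. The model and data are toy stand-ins, not the article's CIFAR-10 setup:

```python
import torch
import torch.nn as nn

# PGD as iterative FGSM: repeat small signed-gradient steps, projecting
# back into the L∞ ε-ball after every step.
def pgd_attack(model, images, labels, epsilon=0.1, alpha=0.02, steps=10):
    orig = images.clone().detach()
    adv = orig.clone()
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        adv.requires_grad_(True)
        grad = torch.autograd.grad(loss_fn(model(adv), labels), adv)[0]
        with torch.no_grad():
            adv = adv + alpha * grad.sign()                     # FGSM step
            adv = orig + (adv - orig).clamp(-epsilon, epsilon)  # project to ε-ball
            adv = adv.clamp(0, 1)                               # valid pixel range
    return adv.detach()

# Smoke test on random data
torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))
x = torch.rand(4, 3, 8, 8)
y = torch.randint(0, 10, (4,))
adv = pgd_attack(model, x, y)
print((adv - x).abs().max())  # bounded by epsilon
```

Because each step is small and re-evaluates the gradient, PGD finds substantially stronger perturbations than single-step FGSM within the same ε budget.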
Defenses
- Adversarial Training: Train on adversarial examples
- Input Preprocessing: Randomization, quantization
- Detection: Gradient masking, statistical tests
- Certified Defenses: Randomized smoothing
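A minimal sketch of adversarial training, again on synthetic blobs with illustrative hyperparameters: at every step the current model is attacked with FGSM, and training minimizes a 50/50 mix of clean and adversarial loss (a common weighting choice, not a universal one):

```python
import torch
import torch.nn as nn

# Adversarial training sketch: attack the current model each step and
# train on a mix of clean and adversarial loss. Data is synthetic.
torch.manual_seed(0)
X = torch.cat([torch.randn(200, 2) + 2.0, torch.randn(200, 2) - 2.0])
y = torch.cat([torch.zeros(200, dtype=torch.long), torch.ones(200, dtype=torch.long)])

model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))
opt = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

def fgsm(x, labels, eps):
    x = x.clone().requires_grad_(True)
    loss_fn(model(x), labels).backward()
    return (x + eps * x.grad.sign()).detach()

for epoch in range(100):
    x_adv = fgsm(X, y, eps=0.5)  # attack the current model
    opt.zero_grad()
    # 50/50 mix of clean and adversarial loss
    loss = 0.5 * loss_fn(model(X), y) + 0.5 * loss_fn(model(x_adv), y)
    loss.backward()
    opt.step()

adv_acc = (model(fgsm(X, y, 0.5)).argmax(1) == y).float().mean().item()
print(f"robust accuracy at eps=0.5: {adv_acc:.2f}")
```

Note the ordering: the adversarial batch is crafted before `opt.zero_grad()`, so the stray parameter gradients from the attack's backward pass are cleared before the training update.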
8. Ethical Considerations
Responsible Use
- Research Only: Use for security research and model robustness testing
- Authorization Required: Never attack production systems without permission
- Report Vulnerabilities: Disclose findings responsibly
- Defensive Research: Focus on building defenses as much as attacks
This knowledge helps us build more robust, secure AI systems. Understanding attacks is the first step toward effective defenses.
