1. What Are Adversarial Examples?
Adversarial examples are inputs to machine learning models that cause them to make incorrect predictions despite being nearly indistinguishable from valid data to humans. A small, carefully crafted perturbation can cause a model to confidently misclassify an image.
The Problem: Even state-of-the-art models like ResNet and Vision Transformers can be “fooled” with tiny changes invisible to the human eye.
Real-World Impact
- Autonomous vehicles misreading stop signs
- Face recognition systems failing authentication
- Malware detection bypassing security filters
- Fraud detection systems missing malicious transactions
2. The FGSM Attack
Fast Gradient Sign Method (FGSM) is one of the simplest yet most effective white-box adversarial attacks. It requires only one forward and backward pass through the model.
Key Characteristics
- White-box: Requires model access & gradients
- Single-step: Very fast computation
- High success rate: reported to fool undefended ImageNet models on most inputs at moderate ε
- L∞ norm bounded: each pixel changes by at most ε
Attack Flow
- Compute loss gradient w.r.t. input
- Take sign of gradient (direction of steepest ascent)
- Scale by small ε and add to input
- Clip to valid range [0,1]
3. The Mathematics Behind FGSM
Given a model f with parameters θ, input x, and true label y, FGSM computes the perturbation that maximizes the loss J to first order:
x_adv = x + ε · sign(∇_x J(θ, x, y))
Where:
- x = original input
- x_adv = adversarial example
- ε = perturbation magnitude (typically 0.01-0.3)
- sign(·) = element-wise sign function
- ∇_x J(θ, x, y) = gradient of loss w.r.t. input
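The formula can be sanity-checked on a toy tensor. In the sketch below, the linear score, weights, and target are invented purely to illustrate the sign-of-gradient step; nothing here comes from the CIFAR-10 setup used later:

```python
import torch

# Checking x_adv = x + ε·sign(∇_x J) on a toy 1-D "image".
# The linear score and target below are invented purely for illustration.
x = torch.tensor([0.2, 0.5, 0.8], requires_grad=True)  # "pixel" values in [0, 1]
w = torch.tensor([1.0, -2.0, 0.5])                     # fixed toy weights
y = torch.tensor(1.0)                                  # toy regression target

loss = (w @ x - y) ** 2   # J(θ, x, y): squared error of a linear score
loss.backward()           # populates x.grad = ∇_x J

epsilon = 0.1
x_adv = (x + epsilon * x.grad.sign()).clamp(0, 1).detach()

# Every element moves by exactly ±ε unless clipped at the [0, 1] boundary
print(x_adv - x.detach())
```

Because sign(·) discards gradient magnitude, every coordinate is pushed the same distance ε, which is exactly what makes the attack L∞-bounded.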
| ε Value | Attack Strength | Visual Change | Success Rate |
|---|---|---|---|
| 0.01 | Weak | Invisible | ~20% |
| 0.1 | Medium | Subtle | ~85% |
| 0.3 | Strong | Visible noise | ~98% |
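The trend in the table can be reproduced in miniature. The sketch below trains a toy linear classifier on synthetic 2-D blobs (not CIFAR-10 or ImageNet, so the exact rates are illustrative only) and measures how the FGSM success rate grows with ε:

```python
import torch
import torch.nn as nn

# Toy reproduction of the ε-vs-success-rate trend. A tiny linear
# classifier on synthetic 2-D blobs stands in for a real image model.
torch.manual_seed(0)
X = torch.cat([torch.randn(200, 2) + 2.0, torch.randn(200, 2) - 2.0])
y = torch.cat([torch.zeros(200, dtype=torch.long), torch.ones(200, dtype=torch.long)])

model = nn.Linear(2, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(200):  # quick full-batch training
    opt.zero_grad()
    nn.CrossEntropyLoss()(model(X), y).backward()
    opt.step()

def fgsm(x, labels, eps):
    x = x.clone().requires_grad_(True)
    nn.CrossEntropyLoss()(model(x), labels).backward()
    return (x + eps * x.grad.sign()).detach()

clean_pred = model(X).argmax(1)
rates = []
for eps in [0.1, 1.0, 2.0, 3.0]:
    adv_pred = model(fgsm(X, y, eps)).argmax(1)
    rates.append((adv_pred != clean_pred).float().mean().item())
    print(f"eps={eps}: success rate {rates[-1]:.2f}")
```

For a linear model the success rate is provably non-decreasing in ε, since the logit margin shrinks linearly as the perturbation grows.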
4. Environment Setup
pip install torch torchvision matplotlib numpy pillow
This script uses PyTorch and torchvision for:
- Pretrained ResNet-18 model (CIFAR-10 checkpoint supplied locally)
- CIFAR-10 dataset
- Gradient computation
- Image visualization
5. Complete Python Implementation
Here’s a self-contained script that loads CIFAR-10, attacks a ResNet-18 pretrained on CIFAR-10 (the checkpoint file must be supplied separately), and visualizes the results:
#!/usr/bin/env python3
"""
FGSM Adversarial Attack Demo
Attacks a CIFAR-10-pretrained ResNet-18 on test images
"""
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as transforms
from torchvision.datasets import CIFAR10
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
# Configuration
EPSILON = 0.1 # Attack strength (applied in normalized input space)
NUM_IMAGES = 8 # Images to attack
# CIFAR-10 class names
CIFAR_CLASSES = [
'airplane', 'automobile', 'bird', 'cat', 'deer',
'dog', 'frog', 'horse', 'ship', 'truck'
]
def fgsm_attack(model, images, labels, epsilon=0.1):
"""
Perform FGSM attack on batch of images
Args:
model: PyTorch model
images: batch of images [B, C, H, W]
labels: true labels [B]
epsilon: perturbation magnitude
Returns:
adversarial images
"""
    # Work on a detached copy so gradients don't leak into the caller's tensor
    images = images.clone().detach().requires_grad_(True)
outputs = model(images)
loss = nn.CrossEntropyLoss()(outputs, labels)
# Compute gradient w.r.t. input
model.zero_grad()
loss.backward()
# Create perturbation from gradient sign
perturbation = epsilon * images.grad.sign()
# Apply perturbation
adv_images = images + perturbation
    # Clip to the valid image range. The inputs here are normalized,
    # so clamp each channel to the normalized equivalent of [0, 1]
    # rather than to [0, 1] itself
    mean = torch.tensor([0.4914, 0.4822, 0.4465], device=images.device).view(1, 3, 1, 1)
    std = torch.tensor([0.2023, 0.1994, 0.2010], device=images.device).view(1, 3, 1, 1)
    adv_images = torch.max(torch.min(adv_images, (1 - mean) / std), -mean / std)
return adv_images.detach()
def main():
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
    # Load a ResNet-18 adapted for CIFAR-10 (10 classes). torchvision's
    # ImageNet weights don't fit 32x32 inputs, so a local CIFAR-10
    # checkpoint is required
    model = models.resnet18(num_classes=10)
    model.load_state_dict(torch.load('cifar10_resnet18_pretrained.pth',
                                     map_location=device))
model.to(device)
model.eval()
# CIFAR-10 transforms
transform = transforms.Compose([
transforms.Resize((32, 32)),
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465),
(0.2023, 0.1994, 0.2010))
])
# Load test data
test_dataset = CIFAR10(root='./data', train=False,
download=True, transform=transform)
test_loader = DataLoader(test_dataset, batch_size=NUM_IMAGES, shuffle=True)
# Get one batch
images, labels = next(iter(test_loader))
images, labels = images.to(device), labels.to(device)
    print("Original predictions:")
with torch.no_grad():
orig_outputs = model(images)
_, orig_preds = orig_outputs.max(1)
for i in range(NUM_IMAGES):
print(f" Image {i}: {CIFAR_CLASSES[labels[i]]} → {CIFAR_CLASSES[orig_preds[i]]}")
# Generate adversarial examples
adv_images = fgsm_attack(model, images, labels, epsilon=EPSILON)
print(f"\nAdversarial predictions (ε={EPSILON}):")
with torch.no_grad():
adv_outputs = model(adv_images)
_, adv_preds = adv_outputs.max(1)
for i in range(NUM_IMAGES):
success = "✅" if adv_preds[i] != orig_preds[i] else "❌"
print(f" Image {i}: {CIFAR_CLASSES[labels[i]]} → {CIFAR_CLASSES[adv_preds[i]]} {success}")
# Visualization
fig, axes = plt.subplots(3, NUM_IMAGES, figsize=(4*NUM_IMAGES, 12))
for i in range(NUM_IMAGES):
# Original
        img_orig = images[i].detach().cpu().permute(1, 2, 0).numpy()
img_orig = (img_orig * np.array([0.2023, 0.1994, 0.2010])) + np.array([0.4914, 0.4822, 0.4465])
img_orig = np.clip(img_orig, 0, 1)
# Adversarial
img_adv = adv_images[i].cpu().permute(1, 2, 0).numpy()
img_adv = (img_adv * np.array([0.2023, 0.1994, 0.2010])) + np.array([0.4914, 0.4822, 0.4465])
img_adv = np.clip(img_adv, 0, 1)
# Difference
diff = np.abs(img_adv - img_orig)
axes[0, i].imshow(img_orig)
axes[0, i].set_title(f"Original\n{labels[i].item()}: {CIFAR_CLASSES[labels[i]]}", fontsize=10)
axes[0, i].axis('off')
axes[1, i].imshow(img_adv)
axes[1, i].set_title(f"Adversarial (ε={EPSILON})\n{adv_preds[i].item()}: {CIFAR_CLASSES[adv_preds[i]]}", fontsize=10)
axes[1, i].axis('off')
axes[2, i].imshow(diff)
axes[2, i].set_title("Perturbation\nMagnitude", fontsize=10)
axes[2, i].axis('off')
plt.tight_layout()
plt.savefig('fgsm_attack_demo.png', dpi=150, bbox_inches='tight')
plt.show()
print("\n✓ Demo complete! Check 'fgsm_attack_demo.png'")
if __name__ == "__main__":
main()
Expected Output:
- Typically a high attack success rate on CIFAR-10 (often in the 85-95% range, depending on the checkpoint)
- Original predictions vs adversarial predictions shown
- Side-by-side visualization of original/adversarial/difference
- Saved plot as fgsm_attack_demo.png
6. Running the Attack
Save the code above as fgsm_attack.py and run:
python fgsm_attack.py
You should see output like:
Original predictions:
Image 0: cat → cat
Image 1: dog → dog
Image 2: truck → truck
Adversarial predictions (ε=0.1):
Image 0: cat → airplane ✅
Image 1: dog → frog ✅
Image 2: truck → ship ✅
7. Limitations & Defenses
FGSM Limitations
- Single-step → suboptimal perturbations
- Sensitive to ε choice
- Doesn't optimize for minimal perturbation
Stronger Attacks
- PGD: Projected Gradient Descent (iterative FGSM)
- CW: Carlini-Wagner (optimization-based)
- DeepFool: Minimal perturbation norm
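As a sketch of how PGD extends FGSM, the hypothetical `pgd_attack` helper below takes several small signed-gradient steps and projects back into the L∞ ε-ball around the original input after each one. The model and data are toy stand-ins, not the article's CIFAR-10 setup:

```python
import torch
import torch.nn as nn

# PGD as iterative FGSM: repeat small signed-gradient steps, projecting
# back into the L∞ ε-ball after every step.
def pgd_attack(model, images, labels, epsilon=0.1, alpha=0.02, steps=10):
    orig = images.clone().detach()
    adv = orig.clone()
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        adv.requires_grad_(True)
        grad = torch.autograd.grad(loss_fn(model(adv), labels), adv)[0]
        with torch.no_grad():
            adv = adv + alpha * grad.sign()                     # FGSM step
            adv = orig + (adv - orig).clamp(-epsilon, epsilon)  # project to ε-ball
            adv = adv.clamp(0, 1)                               # valid pixel range
    return adv.detach()

# Smoke test on random data
torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))
x = torch.rand(4, 3, 8, 8)
y = torch.randint(0, 10, (4,))
adv = pgd_attack(model, x, y)
print((adv - x).abs().max())  # bounded by epsilon
```

Because each step is small and re-evaluates the gradient, PGD finds substantially stronger perturbations than single-step FGSM within the same ε budget.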
Defenses
- Adversarial Training: Train on adversarial examples
- Input Preprocessing: Randomization, quantization
- Detection: Gradient masking, statistical tests
- Certified Defenses: Randomized smoothing
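A minimal sketch of adversarial training, again on synthetic blobs with illustrative hyperparameters: at every step the current model is attacked with FGSM, and training minimizes a 50/50 mix of clean and adversarial loss (a common weighting choice, not a universal one):

```python
import torch
import torch.nn as nn

# Adversarial training sketch: attack the current model each step and
# train on a mix of clean and adversarial loss. Data is synthetic.
torch.manual_seed(0)
X = torch.cat([torch.randn(200, 2) + 2.0, torch.randn(200, 2) - 2.0])
y = torch.cat([torch.zeros(200, dtype=torch.long), torch.ones(200, dtype=torch.long)])

model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))
opt = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

def fgsm(x, labels, eps):
    x = x.clone().requires_grad_(True)
    loss_fn(model(x), labels).backward()
    return (x + eps * x.grad.sign()).detach()

for epoch in range(100):
    x_adv = fgsm(X, y, eps=0.5)  # attack the current model
    opt.zero_grad()
    # 50/50 mix of clean and adversarial loss
    loss = 0.5 * loss_fn(model(X), y) + 0.5 * loss_fn(model(x_adv), y)
    loss.backward()
    opt.step()

adv_acc = (model(fgsm(X, y, 0.5)).argmax(1) == y).float().mean().item()
print(f"robust accuracy at eps=0.5: {adv_acc:.2f}")
```

Note the ordering: the adversarial batch is crafted before `opt.zero_grad()`, so the stray parameter gradients from the attack's backward pass are cleared before the training update.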
8. Ethical Considerations
Responsible Use
- Research Only: Use for security research and model robustness testing
- Authorization Required: Never attack production systems without permission
- Report Vulnerabilities: Disclose findings responsibly
- Defensive Research: Focus on building defenses as much as attacks
This knowledge helps us build more robust, secure AI systems. Understanding attacks is the first step toward effective defenses.
