(ICLR 2018) Ensemble Adversarial Training: Attacks and Defenses

Keyword [Ensemble Adversarial Training]

Tramèr F, Kurakin A, Papernot N, et al. Ensemble adversarial training: Attacks and defenses[J]. arXiv preprint arXiv:1705.07204, 2017.

1. Overview

This paper:

  • shows that adversarial training converges to a degenerate global minimum, where sharp curvature artifacts near the data points obscure the true direction of steepest ascent
  • finds that adversarially trained models remain vulnerable to black-box attacks, where perturbations computed on an undefended model are transferred, as well as to a proposed randomized single-step attack
  • introduces Ensemble Adversarial Training, which augments training data with perturbations transferred from other models, decoupling adversarial example generation from the parameters of the model being trained
  • shows that Ensemble Adversarial Training yields models with strong robustness to black-box attacks

1.1. Adversarial Training

  • adversarial training on MNIST yields models that are robust to white-box attacks
  • the MNIST dataset is peculiar in that there exists a simple ‘closed-form’ denoising procedure (namely feature binarization) that leads to similarly robust models without adversarial training. This may explain why robustness to white-box attacks is hard to scale to tasks such as ImageNet
  • for an average MNIST image, over 80% of the pixels are in {0, 1} and only 6% are in the range [0.2, 0.8]. Thus, for a perturbation with ε ≤ 0.3, the binarized versions of x and x_adv can differ in at most 6% of the input dimensions
  • Some prior works have hinted that adversarially trained models may remain vulnerable to black-box attacks
  • an adversarially trained maxout network on MNIST has slightly higher error on transferred examples than on white-box examples
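The binarization argument above can be checked numerically. This is a minimal sketch with a hypothetical image whose pixels all lie exactly in {0, 1}: since ε = 0.3 < 0.5, no such pixel can cross the 0.5 rounding threshold, so binarization removes the perturbation entirely on those pixels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "MNIST-like" image: pixels exactly in {0, 1}
# (hypothetical data, just to illustrate the argument).
x = rng.choice([0.0, 1.0], size=784)

# Worst-case L_inf perturbation with epsilon = 0.3
eps = 0.3
x_adv = np.clip(x + rng.uniform(-eps, eps, size=784), 0.0, 1.0)

def binarize(img, threshold=0.5):
    """Round each pixel to {0, 1} -- the 'closed-form' denoiser."""
    return (img >= threshold).astype(np.float64)

# A pixel at exactly 0 or 1 cannot cross the 0.5 threshold under
# a perturbation of magnitude <= 0.3, so the binarized images agree.
diff = np.mean(binarize(x) != binarize(x_adv))
print(diff)  # 0.0
```

Only the roughly 6% of real MNIST pixels that start inside [0.2, 0.8] could ever change under binarization, which bounds how much damage an ε ≤ 0.3 perturbation can do.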

1.2. Attack Methods

  • FGSM

  • Single-Step Least-Likely Class Method (Step-LL): found most effective for adversarial training on ImageNet

  • Iterative FGSM (Iter-FGSM) and Iterative Step-LL (Iter-LL)

  • proposed randomized single-step attack (R+FGSM / R+Step-LL): prepend a small random step to escape the non-smooth vicinity of the data point before taking the gradient step
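The single-step attacks above can be sketched on a toy softmax classifier. This is a minimal illustration, not the paper's implementation: the weights `W`, `b` and the input are hypothetical placeholders, and `grad_loss_x` computes the cross-entropy gradient with respect to the input by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 784))   # hypothetical linear classifier
b = np.zeros(10)

def grad_loss_x(x, y):
    """Gradient of the cross-entropy loss w.r.t. the input x."""
    logits = W @ x + b
    p = np.exp(logits - logits.max())
    p /= p.sum()
    p[y] -= 1.0                   # dL/dlogits for cross-entropy
    return W.T @ p                # chain rule back to the input

def fgsm(x, y, eps):
    # FGSM: one signed-gradient step that increases the loss on y.
    return np.clip(x + eps * np.sign(grad_loss_x(x, y)), 0, 1)

def step_ll(x, eps):
    # Step-LL: one signed-gradient step that decreases the loss on
    # the least-likely predicted class y_LL.
    y_ll = int(np.argmin(W @ x + b))
    return np.clip(x - eps * np.sign(grad_loss_x(x, y_ll)), 0, 1)

def r_fgsm(x, y, eps, alpha):
    # R+FGSM: a random step of size alpha first, to escape the sharp
    # curvature near x, then FGSM with the remaining budget eps - alpha.
    x_r = np.clip(x + alpha * np.sign(rng.normal(size=x.shape)), 0, 1)
    return np.clip(x_r + (eps - alpha) * np.sign(grad_loss_x(x_r, y)), 0, 1)

x = rng.uniform(0, 1, size=784)
x_adv = fgsm(x, y=3, eps=0.3)
print(np.max(np.abs(x_adv - x)) <= 0.3)  # True: stays within the budget
```

The iterative variants (Iter-FGSM, Iter-LL) simply repeat the corresponding single step with a smaller step size, re-projecting onto the ε-ball each iteration.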

1.3. Ensemble Adversarial Training

  • decouples the generation of adversarial examples from the model being trained
  • augments training data with adversarial examples crafted on other static pre-trained models
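The training loop can be sketched as follows. This is a minimal toy version, assuming a synthetic linearly separable task, logistic-regression "models", and FGSM as the crafting attack; the static source models are hypothetical stand-ins for the pre-trained networks used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, eps, lr = 20, 0.1, 0.5

# Synthetic binary task: label = side of a fixed hyperplane.
w_true = rng.normal(size=d)
X = rng.normal(size=(512, d))
y = (X @ w_true > 0).astype(float)

def fgsm_linear(w, x, yi, eps):
    """FGSM for the logistic loss of a linear model w."""
    p = 1.0 / (1.0 + np.exp(-(x @ w)))
    g = (p - yi) * w              # dL/dx for logistic regression
    return x + eps * np.sign(g)

# Static "pre-trained" source models (here: noisy copies of w_true).
static_models = [w_true + rng.normal(scale=0.5, size=d) for _ in range(2)]

w = np.zeros(d)
for step in range(200):
    idx = rng.choice(len(X), size=64, replace=False)
    xb, yb = X[idx].copy(), y[idx]
    # Craft adversarial versions of half the batch on a source model
    # drawn from {current model} U static ensemble.
    sources = [w] + static_models
    w_src = sources[rng.integers(len(sources))]
    for i in range(len(xb) // 2):
        xb[i] = fgsm_linear(w_src, xb[i], yb[i], eps)
    # Standard SGD step on the mixed clean/adversarial batch.
    p = 1.0 / (1.0 + np.exp(-(xb @ w)))
    w -= lr * xb.T @ (p - yb) / len(xb)

acc = np.mean(((X @ w) > 0) == (y > 0.5))
print(f"clean accuracy: {acc:.2f}")
```

Because the static source models are fixed, their adversarial examples do not depend on the current parameters, which is what prevents the degenerate-minimum gradient masking seen in standard adversarial training.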

1.4. Experiments

  • adversarial training greatly increases robustness to white-box single-step attacks, but incurs a higher error rate against transferred (black-box) single-step attacks

1.4.1. Ensemble Training

  • Ensemble Adversarial Training is not robust to white-box Iter-LL and R+Step-LL samples