(ICLR 2018) Ensemble Adversarial Training: Attacks and Defenses

Keyword [Ensemble Adversarial Training]

Tramèr F, Kurakin A, Papernot N, et al. Ensemble adversarial training: Attacks and defenses[J]. arXiv preprint arXiv:1705.07204, 2017.

1. Overview

This paper:

  • shows that adversarial training converges to a degenerate global minimum, where sharp curvature artifacts near the data points obscure the true direction of steepest ascent
  • finds that adversarially trained models remain vulnerable to black-box attacks, where perturbations computed on an undefended model are transferred, as well as to a proposed randomized single-step attack
  • introduces Ensemble Adversarial Training, which augments training data with perturbations transferred from other models, decoupling adversarial example generation from the parameters of the model being trained
  • shows that Ensemble Adversarial Training yields models with strong robustness to black-box attacks

1.1. Adversarial Training

  • adversarial training on MNIST yields models that are robust to white-box attacks
  • the MNIST dataset is peculiar in that there exists a simple ‘closed-form’ denoising procedure (namely feature binarization) that leads to similarly robust models without adversarial training. This may explain why robustness to white-box attacks is hard to scale to tasks such as ImageNet
  • for an average MNIST image, over 80% of the pixels are in {0, 1} and only 6% are in the range [0.2, 0.8]. Thus, for a perturbation with ε ≤ 0.3, the binarized versions of x and x_adv can differ in at most 6% of the input dimensions
  • Some prior works have hinted that adversarially trained models may remain vulnerable to black-box attacks
  • an adversarially trained maxout network on MNIST has slightly higher error on transferred examples than on white-box examples
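The binarization argument above can be checked numerically. This is a minimal sketch with a hypothetical image whose pixels all lie exactly in {0, 1}: since ε = 0.3 < 0.5, no such pixel can cross the 0.5 rounding threshold, so binarization removes the perturbation entirely on those pixels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "MNIST-like" image: pixels exactly in {0, 1}
# (hypothetical data, just to illustrate the argument).
x = rng.choice([0.0, 1.0], size=784)

# Worst-case L_inf perturbation with epsilon = 0.3
eps = 0.3
x_adv = np.clip(x + rng.uniform(-eps, eps, size=784), 0.0, 1.0)

def binarize(img, threshold=0.5):
    """Round each pixel to {0, 1} -- the 'closed-form' denoiser."""
    return (img >= threshold).astype(np.float64)

# A pixel at exactly 0 or 1 cannot cross the 0.5 threshold under
# a perturbation of magnitude <= 0.3, so the binarized images agree.
diff = np.mean(binarize(x) != binarize(x_adv))
print(diff)  # 0.0
```

Only the roughly 6% of real MNIST pixels that start inside [0.2, 0.8] could ever change under binarization, which bounds how much damage an ε ≤ 0.3 perturbation can do.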

1.2. Attack Methods

  • FGSM

  • Single-Step Least-Likely Class Method (Step-LL): found most effective for adversarial training on ImageNet

  • Iterative FGSM (Iter-FGSM) and Iterative Step-LL (Iter-LL)

  • proposed randomized single-step attack (R+FGSM / R+Step-LL): prepend a small random step to escape the non-smooth vicinity of the data point before taking the gradient step
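The single-step attacks above can be sketched on a toy softmax classifier. This is a minimal illustration, not the paper's implementation: the weights `W`, `b` and the input are hypothetical placeholders, and `grad_loss_x` computes the cross-entropy gradient with respect to the input by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 784))   # hypothetical linear classifier
b = np.zeros(10)

def grad_loss_x(x, y):
    """Gradient of the cross-entropy loss w.r.t. the input x."""
    logits = W @ x + b
    p = np.exp(logits - logits.max())
    p /= p.sum()
    p[y] -= 1.0                   # dL/dlogits for cross-entropy
    return W.T @ p                # chain rule back to the input

def fgsm(x, y, eps):
    # FGSM: one signed-gradient step that increases the loss on y.
    return np.clip(x + eps * np.sign(grad_loss_x(x, y)), 0, 1)

def step_ll(x, eps):
    # Step-LL: one signed-gradient step that decreases the loss on
    # the least-likely predicted class y_LL.
    y_ll = int(np.argmin(W @ x + b))
    return np.clip(x - eps * np.sign(grad_loss_x(x, y_ll)), 0, 1)

def r_fgsm(x, y, eps, alpha):
    # R+FGSM: a random step of size alpha first, to escape the sharp
    # curvature near x, then FGSM with the remaining budget eps - alpha.
    x_r = np.clip(x + alpha * np.sign(rng.normal(size=x.shape)), 0, 1)
    return np.clip(x_r + (eps - alpha) * np.sign(grad_loss_x(x_r, y)), 0, 1)

x = rng.uniform(0, 1, size=784)
x_adv = fgsm(x, y=3, eps=0.3)
print(np.max(np.abs(x_adv - x)) <= 0.3)  # True: stays within the budget
```

The iterative variants (Iter-FGSM, Iter-LL) simply repeat the corresponding single step with a smaller step size, re-projecting onto the ε-ball each iteration.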

1.3. Ensemble Adversarial Training

  • decouples the generation of adversarial examples from the model being trained
  • augments training data with adversarial examples crafted on other static pre-trained models
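The training loop can be sketched as follows. This is a minimal toy version, assuming a synthetic linearly separable task, logistic-regression "models", and FGSM as the crafting attack; the static source models are hypothetical stand-ins for the pre-trained networks used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, eps, lr = 20, 0.1, 0.5

# Synthetic binary task: label = side of a fixed hyperplane.
w_true = rng.normal(size=d)
X = rng.normal(size=(512, d))
y = (X @ w_true > 0).astype(float)

def fgsm_linear(w, x, yi, eps):
    """FGSM for the logistic loss of a linear model w."""
    p = 1.0 / (1.0 + np.exp(-(x @ w)))
    g = (p - yi) * w              # dL/dx for logistic regression
    return x + eps * np.sign(g)

# Static "pre-trained" source models (here: noisy copies of w_true).
static_models = [w_true + rng.normal(scale=0.5, size=d) for _ in range(2)]

w = np.zeros(d)
for step in range(200):
    idx = rng.choice(len(X), size=64, replace=False)
    xb, yb = X[idx].copy(), y[idx]
    # Craft adversarial versions of half the batch on a source model
    # drawn from {current model} U static ensemble.
    sources = [w] + static_models
    w_src = sources[rng.integers(len(sources))]
    for i in range(len(xb) // 2):
        xb[i] = fgsm_linear(w_src, xb[i], yb[i], eps)
    # Standard SGD step on the mixed clean/adversarial batch.
    p = 1.0 / (1.0 + np.exp(-(xb @ w)))
    w -= lr * xb.T @ (p - yb) / len(xb)

acc = np.mean(((X @ w) > 0) == (y > 0.5))
print(f"clean accuracy: {acc:.2f}")
```

Because the static source models are fixed, their adversarial examples do not depend on the current parameters, which is what prevents the degenerate-minimum gradient masking seen in standard adversarial training.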

1.4. Experiments

  • adversarial training greatly increases robustness to white-box single-step attacks, but incurs a higher error rate against transferred (black-box) single-step attacks

1.4.1. Ensemble Training

  • Ensemble Adversarial Training is not robust to white-box Iter-LL and R+Step-LL samples