Huang X, Belongie S. Arbitrary style transfer in real-time with adaptive instance normalization[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 1501-1510.

1. Overview

This paper proposes Adaptive Instance Normalization (AdaIN), a layer motivated by Instance Normalization (IN).

2. Normalization

2.1. Batch Normalization (BN)

For an input batch $x \in \mathbb{R}^{N \times C \times H \times W}$:
1) $BN(x) = \gamma (\frac{x - \mu (x)}{\sigma (x)}) + \beta$.
2) $\mu_c (x) = \frac{1}{NHW} \sum_{n=1}^{N} \sum_{h=1}^H \sum_{w=1}^W x_{nchw}$
3) $\sigma_c(x) = \sqrt{\frac{1}{NHW} \sum_{n=1}^{N} \sum_{h=1}^H \sum_{w=1}^W (x_{nchw} - \mu_c(x))^2 + \epsilon}$
4) $\gamma, \beta$ are learnable parameters.
BN uses mini-batch statistics during training and replaces them with population statistics during inference, which introduces a discrepancy between training and inference.
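
A minimal NumPy sketch of the BN formulas above, in training mode only (mini-batch statistics). The function name `batch_norm`, the default `eps`, and the random input are assumptions for illustration; the population statistics used at inference are not modeled.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize x of shape (N, C, H, W) with per-channel mini-batch statistics."""
    # Mean and variance are taken over the batch and spatial axes (N, H, W).
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    sigma = np.sqrt(x.var(axis=(0, 2, 3), keepdims=True) + eps)
    # gamma and beta have shape (C,) and are broadcast over N, H, W.
    return gamma.reshape(1, -1, 1, 1) * (x - mu) / sigma + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 3, 8, 8))
y = batch_norm(x, gamma=np.ones(3), beta=np.zeros(3))
```

With `gamma = 1` and `beta = 0`, each channel of `y` has approximately zero mean and unit standard deviation across the whole batch.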

2.2. Instance Normalization (IN)

1) $IN(x) = \gamma (\frac{x - \mu(x)}{\sigma(x)}) + \beta$.
2) $\mu_{nc}(x) = \frac{1}{HW} \sum_{h=1}^H \sum_{w=1}^W x_{nchw}$.
3) $\sigma_{nc}(x) = \sqrt{\frac{1}{HW} \sum_{h=1}^H \sum_{w=1}^W (x_{nchw} - \mu_{nc}(x))^2 + \epsilon }$.

IN computes the same per-sample statistics at training and test time, so there is no discrepancy during inference.
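
A minimal NumPy sketch of the IN formulas above. The name `instance_norm` and the default `eps` are assumptions for illustration. The only change from BN is that the statistics are taken over $(H, W)$ for each sample and channel, so the result for one sample does not depend on the rest of the batch.

```python
import numpy as np

def instance_norm(x, gamma, beta, eps=1e-5):
    """Normalize x of shape (N, C, H, W) with per-sample, per-channel statistics."""
    # Statistics are computed over the spatial axes only.
    mu = x.mean(axis=(2, 3), keepdims=True)
    sigma = np.sqrt(x.var(axis=(2, 3), keepdims=True) + eps)
    return gamma.reshape(1, -1, 1, 1) * (x - mu) / sigma + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3, 8, 8))
gamma, beta = np.ones(3), np.zeros(3)

in_batch = instance_norm(x, gamma, beta)[0]    # sample 0 normalized inside a batch
alone = instance_norm(x[:1], gamma, beta)[0]   # sample 0 normalized on its own
```

Here `in_batch` and `alone` agree, which is the batch independence that removes the train/test discrepancy.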

2.3. Conditional Instance Normalization (CIN)

Instead of learning a single set of affine parameters $\gamma, \beta$, CIN learns a different set of parameters $\gamma^s, \beta^s$ for each style $s$.
1) $CIN(x;s) = \gamma^s (\frac{x - \mu(x)}{\sigma(x)}) + \beta^s$
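
A minimal NumPy sketch of CIN, assuming the per-style parameters are stored as arrays `gammas` and `betas` of shape $(S, C)$ and selected by an integer style index `s`. This storage layout and the function name are assumptions for illustration.

```python
import numpy as np

def conditional_instance_norm(x, gammas, betas, s, eps=1e-5):
    """Instance-normalize x (N, C, H, W), then apply the affine pair of style s."""
    mu = x.mean(axis=(2, 3), keepdims=True)
    sigma = np.sqrt(x.var(axis=(2, 3), keepdims=True) + eps)
    # Pick the row of parameters that belongs to style s.
    gamma = gammas[s].reshape(1, -1, 1, 1)
    beta = betas[s].reshape(1, -1, 1, 1)
    return gamma * (x - mu) / sigma + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=(2, 3, 8, 8))
gammas = np.array([[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]])  # two hypothetical styles
betas = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0]])
out = conditional_instance_norm(x, gammas, betas, s=1)
```

The number of styles is fixed by the size of the parameter table, so a new style needs a new row of learned parameters.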

2.4. Interpreting Instance Normalization (IN)

1) The authors argue that IN performs a form of style normalization by normalizing feature statistics (the mean and variance).
2) They believe that the feature statistics of a generator network can also control the style of the generated image.

3. Adaptive Instance Normalization (AdaIN)

3.1. Details

$AdaIN(x,y) = \sigma(y) (\frac{x - \mu(x)}{\sigma(x)} ) + \mu(y)$.
1) Content input $x$ and style input $y$.
2) Simply scale the normalized content input with $\sigma(y)$ and shift it with $\mu(y)$.
3) No learnable affine parameters.
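
A minimal NumPy sketch of the AdaIN formula above, applied to feature maps of shape $(N, C, H, W)$. The name `adain` and the small `eps` added for numerical stability are assumptions for illustration. The content and style features may have different spatial sizes, because only per-channel statistics of the style are used.

```python
import numpy as np

def adain(x, y, eps=1e-5):
    """Align the channel-wise mean and std of content features x to those of style features y."""
    mu_x = x.mean(axis=(2, 3), keepdims=True)
    sigma_x = np.sqrt(x.var(axis=(2, 3), keepdims=True) + eps)
    mu_y = y.mean(axis=(2, 3), keepdims=True)
    sigma_y = np.sqrt(y.var(axis=(2, 3), keepdims=True) + eps)
    # No learned parameters: the scale and shift come from the style input itself.
    return sigma_y * (x - mu_x) / sigma_x + mu_y

rng = np.random.default_rng(1)
content = rng.normal(size=(2, 4, 16, 16))
style = rng.normal(loc=2.0, scale=3.0, size=(2, 4, 12, 20))
out = adain(content, style)
```

After the call, each channel of `out` has the mean and (approximately) the standard deviation of the corresponding channel of `style`.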

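$t = AdaIN(f(c), f(s))$, where $f$ is the encoder applied to the content image $c$ and the style image $s$.
$T(c,s) = g(t)$, where $g$ is the decoder that maps $t$ back to image space.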

1) IN normalizes each sample to a single style, while BN normalizes a batch of samples to be centered around a single style. Both are undesirable when we want the decoder to generate images in vastly different styles.
2) Thus, the decoder does not use normalization layers.

3.2. Loss Function

1) $L = L_c + \lambda L_s$.
2) $L_c = || f(g(t)) - t||_2$.
3) $L_s = \sum_{i=1}^L || \mu(\phi _{i} (g(t))) - \mu (\phi_i(s))||_{2}+\sum _{i=1}^L|| \sigma(\phi _{i} (g(t))) - \sigma( \phi_i (s)) || _2$.
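
A minimal NumPy sketch of the loss terms above, operating on precomputed feature arrays rather than on a real network. The function names, the argument layout (lists of per-layer features standing in for $\phi_i(\cdot)$), and the `eps` term are assumptions for illustration; `lam` plays the role of $\lambda$ and has no default here.

```python
import numpy as np

def channel_stats(feat, eps=1e-5):
    """Per-sample, per-channel mean and std of features with shape (N, C, H, W)."""
    mu = feat.mean(axis=(2, 3))
    sigma = np.sqrt(feat.var(axis=(2, 3)) + eps)
    return mu, sigma

def content_loss(feat_out, t):
    # Euclidean distance between f(g(t)) and the AdaIN target t.
    return np.linalg.norm(feat_out - t)

def style_loss(feats_out, feats_style):
    # feats_out[i] stands for phi_i(g(t)), feats_style[i] for phi_i(s).
    total = 0.0
    for f_out, f_sty in zip(feats_out, feats_style):
        mu_o, sigma_o = channel_stats(f_out)
        mu_s, sigma_s = channel_stats(f_sty)
        total += np.linalg.norm(mu_o - mu_s) + np.linalg.norm(sigma_o - sigma_s)
    return total

def total_loss(feat_out, t, feats_out, feats_style, lam):
    return content_loss(feat_out, t) + lam * style_loss(feats_out, feats_style)
```

Only means and standard deviations enter the style term, so it matches the statistics that AdaIN itself transfers.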

4. Experiments

4.1. Speed

4.2. Ablation Study

4.3. Content-style Trade-off

$T(c, s, \alpha) = g( (1 - \alpha) f(c) + \alpha AdaIN(f(c), f(s)) )$
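
A minimal NumPy sketch of the feature blend inside $g(\cdot)$ in the formula above. The decoder $g$ is left out, and `adain`, `blend_content_style`, and `eps` are names assumed for illustration.

```python
import numpy as np

def adain(x, y, eps=1e-5):
    """Channel-wise AdaIN on features of shape (N, C, H, W)."""
    mu_x = x.mean(axis=(2, 3), keepdims=True)
    sigma_x = np.sqrt(x.var(axis=(2, 3), keepdims=True) + eps)
    mu_y = y.mean(axis=(2, 3), keepdims=True)
    sigma_y = np.sqrt(y.var(axis=(2, 3), keepdims=True) + eps)
    return sigma_y * (x - mu_x) / sigma_x + mu_y

def blend_content_style(fc, fs, alpha):
    """Interpolate between content features fc (alpha=0) and AdaIN output (alpha=1)."""
    return (1.0 - alpha) * fc + alpha * adain(fc, fs)

rng = np.random.default_rng(2)
fc = rng.normal(size=(1, 4, 8, 8))            # stands in for f(c)
fs = rng.normal(loc=1.5, size=(1, 4, 8, 8))   # stands in for f(s)
half = blend_content_style(fc, fs, alpha=0.5)
```

With $\alpha = 0$ the decoder receives the unmodified content features, and with $\alpha = 1$ it receives the fully stylized features.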

4.4. Style Interpolation

$T(c, s_{1,2,\ldots,K}, w_{1,2,\ldots,K}) = g(\sum_{k=1}^K w_k AdaIN(f(c), f(s_k)))$
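
A minimal NumPy sketch of the weighted sum inside $g(\cdot)$ in the formula above, again without the decoder. The names `adain` and `interpolate_styles` are assumptions for illustration, and the example weights are chosen to sum to one.

```python
import numpy as np

def adain(x, y, eps=1e-5):
    """Channel-wise AdaIN on features of shape (N, C, H, W)."""
    mu_x = x.mean(axis=(2, 3), keepdims=True)
    sigma_x = np.sqrt(x.var(axis=(2, 3), keepdims=True) + eps)
    mu_y = y.mean(axis=(2, 3), keepdims=True)
    sigma_y = np.sqrt(y.var(axis=(2, 3), keepdims=True) + eps)
    return sigma_y * (x - mu_x) / sigma_x + mu_y

def interpolate_styles(fc, style_feats, weights):
    """Weighted sum of AdaIN outputs, one per style feature map."""
    return sum(w * adain(fc, fs) for w, fs in zip(weights, style_feats))

rng = np.random.default_rng(3)
fc = rng.normal(size=(1, 4, 8, 8))             # stands in for f(c)
fs1 = rng.normal(loc=1.0, size=(1, 4, 8, 8))   # stands in for f(s_1)
fs2 = rng.normal(loc=-2.0, size=(1, 4, 6, 6))  # stands in for f(s_2)
mixed = interpolate_styles(fc, [fs1, fs2], weights=[0.3, 0.7])
```

The channel means of `mixed` are the same weighted combination of the two style means, so the blend moves smoothly between the styles as the weights change.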

4.5. Color Control

1) Match the color distribution of the style image to that of the content image.
2) Perform a normal style transfer using the color-aligned style image as the style input.
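
A rough NumPy sketch of step 1, assuming a deliberately simple color alignment: per-channel mean and standard deviation matching in RGB. This is a stand-in chosen for illustration, not necessarily the color matching method used in the paper, and `match_color` is a hypothetical name.

```python
import numpy as np

def match_color(style_img, content_img, eps=1e-5):
    """Shift and scale each RGB channel of style_img to the statistics of content_img.

    Both images are float arrays of shape (H, W, 3); their sizes may differ.
    """
    mu_s = style_img.mean(axis=(0, 1), keepdims=True)
    sigma_s = np.sqrt(style_img.var(axis=(0, 1), keepdims=True) + eps)
    mu_c = content_img.mean(axis=(0, 1), keepdims=True)
    sigma_c = np.sqrt(content_img.var(axis=(0, 1), keepdims=True) + eps)
    # The result may leave the valid pixel range and may need clipping before display.
    return (style_img - mu_s) / sigma_s * sigma_c + mu_c

rng = np.random.default_rng(4)
style_img = rng.uniform(size=(10, 12, 3))
content_img = 0.5 * rng.uniform(size=(8, 8, 3))
aligned_style = match_color(style_img, content_img)
```

The color-aligned `aligned_style` would then replace the original style image in step 2.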

4.6. Spatial Control

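Apply AdaIN separately to different regions of the content feature maps, using statistics from a different style input for each region.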
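
A minimal NumPy sketch of region-wise AdaIN on a single content feature map of shape $(C, H, W)$. The boolean masks, the name `adain_by_region`, and the choice to compute content statistics inside each region are assumptions of this sketch.

```python
import numpy as np

def adain_by_region(x, style_feats, masks, eps=1e-5):
    """Apply AdaIN per region: masks[k] (H, W, bool) selects where style_feats[k] is used."""
    out = np.zeros_like(x)
    for y, m in zip(style_feats, masks):
        region = x[:, m]                                  # shape (C, P): pixels in the region
        mu_x = region.mean(axis=1, keepdims=True)
        sigma_x = np.sqrt(region.var(axis=1, keepdims=True) + eps)
        mu_y = y.mean(axis=(1, 2)).reshape(-1, 1)
        sigma_y = np.sqrt(y.var(axis=(1, 2)) + eps).reshape(-1, 1)
        out[:, m] = sigma_y * (region - mu_x) / sigma_x + mu_y
    return out

rng = np.random.default_rng(5)
x = rng.normal(size=(3, 6, 6))                  # stands in for f(c)
left = np.zeros((6, 6), dtype=bool)
left[:, :3] = True                              # left half of the feature map
right = ~left
y1 = rng.normal(loc=1.0, size=(3, 5, 5))        # stands in for f(s_1)
y2 = rng.normal(loc=-2.0, scale=0.5, size=(3, 7, 7))  # stands in for f(s_2)
out = adain_by_region(x, [y1, y2], [left, right])
```

The masks are assumed to partition the feature map; positions covered by no mask stay zero in this sketch.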