Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems. 2017: 6000-6010.
1. Overview
1.1. Motivation
- dominant sequence transduction models are based on RNNs or CNNs
- RNNs cannot be parallelized within a sequence, are computationally expensive, and struggle to learn long-range dependencies
This paper proposes the Transformer, a model based solely on attention mechanisms
- relates positions without regard to their distance in the sequence
- uses attention to model global dependencies
1.2. Related Work
- reduce sequential computation
- self-attention (intra-attention)
- end-to-end memory networks, based on a recurrent attention mechanism
2. Methods
- encoder. [x_1, …, x_n]→ [z_1, …, z_n]
- decoder. [z_1, …, z_n]→ [y_1, …, y_m]
2.1. Encoder
- N = 6
- d_model = 512
- two sub-layers, each wrapped with a residual connection followed by layer normalization (sketched after this list)
- multi-head self-attention
- position-wise fully connected feed-forward network
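A minimal NumPy sketch of the residual-plus-normalization wrapper LayerNorm(x + Sublayer(x)) used around every sub-layer; the epsilon value is an assumption and the learned scale/bias of layer normalization (and dropout) are omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the feature (d_model) dimension; learned gain/bias omitted here.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_connection(x, sublayer):
    # Residual connection around the sub-layer, followed by layer normalization.
    return layer_norm(x + sublayer(x))
```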
2.2. Decoder
- N = 6
- inserts a third sub-layer that performs multi-head attention over the output of the encoder stack
2.3. Attention
2.3.1. Scaled Dot-Product Attention
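The paper defines Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A minimal NumPy sketch; the shapes and the boolean "True means keep" mask convention are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (..., n_q, d_k), K: (..., n_k, d_k), V: (..., n_k, d_v)."""
    d_k = Q.shape[-1]
    # Compatibility scores, scaled by sqrt(d_k) to keep the softmax in a well-behaved range.
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    if mask is not None:
        # Illegal positions get a large negative score so softmax assigns them ~0 weight.
        scores = np.where(mask, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V  # (..., n_q, d_v)
```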
2.3.2. Multi-Head Attention
- h = 8
- d_k = d_v = d_{model} / h = 64
It is beneficial to linearly project the queries, keys, and values h times with different learned projections to d_k, d_k, and d_v dimensions, respectively, attend in each subspace in parallel, then concatenate and project the results.
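A sketch of the project–attend–concatenate scheme, MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O with head_i = Attention(QW_i^Q, KW_i^K, VW_i^V). The argument layout (lists of per-head matrices) is an assumption, and it reuses the scaled_dot_product_attention sketch above.

```python
import numpy as np

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o, mask=None):
    """W_q, W_k, W_v: lists of h per-head projection matrices with shapes
    (d_model, d_k), (d_model, d_k), (d_model, d_v); W_o: (h * d_v, d_model)."""
    heads = [
        scaled_dot_product_attention(Q @ wq, K @ wk, V @ wv, mask)
        for wq, wk, wv in zip(W_q, W_k, W_v)
    ]
    # Concatenate the h heads along the feature axis and project back to d_model.
    return np.concatenate(heads, axis=-1) @ W_o
```

With d_model = 512 and h = 8, each per-head projection maps to d_k = d_v = 64, so the concatenation restores the 512-dimensional width before the output projection W_o.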
2.4. Applications of Attention in the Model
- encoder-decoder attention: queries from the previous decoder layer; keys and values from the output of the encoder stack
- self-attention in encoder
- self-attention in decoder
Mask out all values that correspond to illegal connections to prevent leftward information flow in the decoder, preserving the auto-regressive property.
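A small sketch of the look-ahead mask: position i may only attend to positions j ≤ i. It assumes the "True means keep" convention used in the attention sketch above.

```python
import numpy as np

def causal_mask(n):
    # Lower-triangular boolean matrix: entry (i, j) is True iff j <= i,
    # so a query may attend only to itself and earlier positions.
    return np.tril(np.ones((n, n), dtype=bool))

# e.g. causal_mask(3) ->
# [[ True, False, False],
#  [ True,  True, False],
#  [ True,  True,  True]]
```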
2.5. Position-Wise FC
- equivalently, two convolutions with kernel size 1
- inner-layer d_{ff} = 2048
- input d_{model} = 512
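The position-wise network is FFN(x) = max(0, xW_1 + b_1)W_2 + b_2, applied identically and independently at every position. A minimal sketch, assuming d_model = 512 and d_ff = 2048 for the weight shapes.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """x: (n, d_model), W1: (d_model, d_ff), b1: (d_ff,), W2: (d_ff, d_model), b2: (d_model,)."""
    # Two linear transforms with a ReLU in between, applied to each position independently.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2
```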
2.6. Position Encoding
- since there is no recurrence or convolution, information about token positions must be injected
- uses sine and cosine functions of different frequencies (see the sketch below)
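A sketch of the sinusoidal encoding, PE(pos, 2i) = sin(pos / 10000^{2i/d_model}) and PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model}); it assumes an even d_model.

```python
import numpy as np

def positional_encoding(n_positions, d_model=512):
    pos = np.arange(n_positions)[:, None]          # (n_positions, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even feature indices
    angles = pos / np.power(10000, i / d_model)    # wavelengths form a geometric progression
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe
```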
3. Experiments
3.1. Details
- sentences encoded with byte-pair encoding (BPE)
- sentence pairs were batched together by approximate sequence length
Adam optimizer; the learning rate increases linearly for the first warmup_steps, then decreases proportionally to the inverse square root of the step number (see the sketch below).
Dropout 0.1.
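The schedule from the paper is lrate = d_model^{-0.5} · min(step^{-0.5}, step · warmup_steps^{-1.5}), with warmup_steps = 4000. A minimal sketch:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # Linear warm-up for the first warmup_steps, then decay proportional to 1/sqrt(step).
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```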
3.2. Comparison
3.3. Ablation Study
- B. reducing d_k hurts model quality
- C, D. bigger models are better; dropout helps avoid over-fitting
- E. learned positional embeddings perform nearly identically to the sinusoidal encoding