
Transformers have recently become a competitive alternative to RNNs for a range of sequence modeling tasks. They address a significant shortcoming of RNNs, i.e. their inherently sequential computation, which prevents parallelization across elements of the input sequence, whilst still addressing the vanishing-gradients problem through their self-attention mechanism. In fact, Transformers rely entirely on a self-attention mechanism to compute a series of context-informed vector-space representations of the symbols in their input (see this blog post to learn more about the details of the Transformer). This leads to two main properties of Transformers:

- Straightforward to parallelize: There are no connections in time as with RNNs, allowing one to fully parallelize per-symbol computations.
- Global receptive field: Each symbol’s representation is directly informed by all other symbols’ representations (in contrast to, e.g., convolutional architectures, which typically have a limited receptive field).
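To make these two properties concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. It is illustrative only and not taken from the post: the function name `self_attention` and the random (untrained) projection matrices are assumptions for the example. The attention weights connect every symbol to every other symbol (global receptive field), and the whole computation is a few matrix products with no loop over time steps (straightforward to parallelize).

```python
import numpy as np

def self_attention(X):
    """Single-head self-attention over a sequence of symbol vectors.

    X: array of shape (seq_len, d_model), one row per input symbol.
    Returns context-informed representations of the same shape.
    """
    d_model = X.shape[-1]
    # Illustrative random projections; in a real Transformer these are learned.
    rng = np.random.default_rng(0)
    W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                     for _ in range(3))

    Q, K, V = X @ W_q, X @ W_k, X @ W_v

    # Every symbol attends to every other symbol: a global receptive field.
    scores = Q @ K.T / np.sqrt(d_model)              # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax

    # A single matrix product updates all positions at once: no recurrence
    # in time, so per-symbol computations are fully parallelizable.
    return weights @ V

# Example: 5 symbols with 8-dimensional embeddings.
X = np.random.default_rng(1).standard_normal((5, 8))
print(self_attention(X).shape)  # (5, 8)
```

A full Transformer additionally uses multiple attention heads, learned projections, positional encodings, and feed-forward layers; the sketch only illustrates the dependency structure described above.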


Although Transformers continue to achieve great improvements on many tasks, they have some shortcomings:

- No Recurrent Inductive Bias: The Transformer trades the recurrent inductive bias of RNNs for parallelizability.

Thanks to Stephan Gouws for his help on writing and improving this blog post.
