

In the previous post, we looked at Attention – a ubiquitous method in modern deep learning models. Attention is a concept that helped improve the performance of neural machine translation applications. In this post, we will look at The Transformer – a model that uses attention to boost the speed with which these models can be trained. The Transformer outperforms the Google Neural Machine Translation model in specific tasks. The biggest benefit, however, comes from how The Transformer lends itself to parallelization. It is in fact Google Cloud's recommendation to use The Transformer as a reference model for their Cloud TPU offering. So let's try to break the model apart and look at how it functions.

The Transformer was proposed in the paper Attention is All You Need. A TensorFlow implementation of it is available as a part of the Tensor2Tensor package.
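To make the parallelization point concrete, here is a minimal NumPy sketch (my own illustration, not the paper's full multi-head formulation) of the scaled dot-product attention at the heart of the model. Every position in the sequence is processed in one batch of matrix multiplications, whereas a recurrent model has to step through positions one at a time.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Core attention computation from "Attention is All You Need".

    Q, K, V: (seq_len, d_k) arrays. All positions are handled at once
    by matrix multiplies, which is what makes the model so amenable to
    parallel hardware, unlike an RNN that consumes tokens sequentially.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (seq_len, seq_len) similarity scores
    # Row-wise softmax turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # each output position is a weighted sum of values

# Toy example: a sequence of 4 positions with 8-dimensional vectors
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```

Because the scores for every pair of positions come out of a single matrix product, the whole sequence can be computed in one pass on parallel hardware such as GPUs or TPUs; this is the property the rest of the post unpacks.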
