How do Transformers Work in NLP? A Guide to the Latest State-of-the-Art Models
A bare-bones introduction: all you need to know.
What is a Transformer? How has it changed the future of NLP and A.I.?
New deep learning models are introduced at an increasing rate, and sometimes it’s hard to keep track of all the novelties. One particular neural network model, however, has proven to be especially effective for common natural language processing tasks. That model is called the Transformer, and it makes use of several methods and mechanisms introduced below.
A transformer is a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part of the input data. It is used primarily in the fields of natural language processing (NLP) and computer vision (CV).
A simple algorithm that revolutionized how neural networks approach language is now taking on vision as well. It may not stop there.
Overview
The Transformer model in NLP has truly changed the way we work with text data
The Transformer is behind many recent NLP developments, including Google’s BERT
Learn how the Transformer idea works, how it relates to language modeling and sequence-to-sequence modeling, and how it enables Google’s BERT model
A bit of Transformer history
Here are some reference points in the (short) history of Transformer models:
The early sequence-to-sequence (seq2seq) models were built on recurrent networks, and their performance was further enhanced with the addition of the attention mechanism in 2015. It is incredible how quickly advancements in NLP have happened over the last five years!
These sequence-to-sequence models are pretty versatile and they are used in a variety of NLP tasks, such as:
Machine Translation
Text Summarization
Speech Recognition
Question-Answering Systems, and so on.
Transformer research has also moved quickly in the years just before and during the pandemic.
The Transformer architecture was introduced in June 2017. The focus of the original research was on translation tasks. This was followed by the introduction of several influential models, including:
June 2018: GPT, the first pretrained Transformer model, which was fine-tuned on various NLP tasks and obtained state-of-the-art results
October 2018: BERT, another large pretrained model, this one designed to produce better representations of sentences
February 2019: GPT-2, an improved (and bigger) version of GPT that was not immediately publicly released due to ethical concerns
October 2019: DistilBERT, a distilled version of BERT that is 60% faster, 40% lighter in memory, and still retains 97% of BERT’s performance
October 2019: BART and T5, two large pretrained models using the same architecture as the original Transformer model (the first to do so)
May 2020: GPT-3, an even bigger version of GPT-2 that is able to perform well on a variety of tasks without the need for fine-tuning (called zero-shot learning).
We know that GPT-4 is coming soon as well. There’s a lot to be excited about for the future of machine learning and A.I. Yann LeCun, for instance, is constantly sharing new work from Meta AI on LinkedIn, and Twitter is full of useful chatter from researchers in AI, many of whom I follow on my own Twitter.
The Transformer
The paper ‘Attention Is All You Need’ introduces a novel architecture called the Transformer. As the title indicates, it uses the attention mechanism we saw earlier. Like LSTM-based models, the Transformer is an architecture for transforming one sequence into another with the help of two parts (an encoder and a decoder), but it differs from previously existing sequence-to-sequence models because it does not rely on any recurrent networks (GRU, LSTM, etc.).
Recurrent networks were, until then, one of the best ways to capture the temporal dependencies in sequences. However, the team behind the paper proved that an architecture using only attention mechanisms, without any RNNs (recurrent neural networks), can improve on the results in translation and other tasks. One such improvement on natural language tasks was presented by the team introducing BERT: ‘BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding’.
So, what exactly is a Transformer?
Broadly speaking, Transformers can be grouped into three categories:
GPT-like (also called auto-regressive Transformer models)
BERT-like (also called auto-encoding Transformer models)
BART/T5-like (also called sequence-to-sequence Transformer models)
In the early 2020s, new models in each of these families are appearing at a rapid pace.
Training and inference for seq2seq models are a bit different from the usual classification setup, and the same is true for Transformers.
We know that to train a model for translation tasks, we need pairs of sentences in two languages that are translations of each other. Once we have a lot of sentence pairs, we can start training our model. Let’s say we want to translate French to German: the encoder’s input will be a French sentence, and the decoder’s input will be the corresponding German sentence, as in the sketch below.
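To make the encoder/decoder setup concrete, here is a minimal training-step sketch using PyTorch’s built-in nn.Transformer. It assumes toy vocabulary sizes and random token ids in place of real French and German sentences, so treat it as an illustration of the data flow rather than a working translation system.

```python
import torch
import torch.nn as nn

# Toy sizes, purely illustrative
SRC_VOCAB, TGT_VOCAB, D_MODEL = 1000, 1000, 64

src_embed = nn.Embedding(SRC_VOCAB, D_MODEL)   # embeds French token ids
tgt_embed = nn.Embedding(TGT_VOCAB, D_MODEL)   # embeds German token ids
model = nn.Transformer(d_model=D_MODEL, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)
generator = nn.Linear(D_MODEL, TGT_VOCAB)      # projects back onto the German vocabulary

# One fake French -> German sentence pair, already converted to token ids
src = torch.randint(0, SRC_VOCAB, (1, 7))      # encoder input: the French sentence
tgt = torch.randint(0, TGT_VOCAB, (1, 6))      # the German translation

# Teacher forcing: the decoder reads the target shifted right by one position
tgt_in, tgt_out = tgt[:, :-1], tgt[:, 1:]
causal_mask = model.generate_square_subsequent_mask(tgt_in.size(1))

hidden = model(src_embed(src), tgt_embed(tgt_in), tgt_mask=causal_mask)
logits = generator(hidden)
loss = nn.functional.cross_entropy(logits.reshape(-1, TGT_VOCAB),
                                   tgt_out.reshape(-1))
loss.backward()                                # gradients for one training step
```

The causal mask is what keeps the decoder from peeking at future German tokens during training, which is the key difference from a plain classification setup.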
Transformers are language models
All the Transformer models mentioned above (GPT, BERT, BART, T5, etc.) have been trained as language models. This means they have been trained on large amounts of raw text in a self-supervised fashion. Self-supervised learning is a type of training in which the objective is automatically computed from the inputs of the model. That means that humans are not needed to label the data!
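Here is a tiny, hypothetical illustration of what “the objective is automatically computed from the inputs” means. For a causal language model, the label for each position is simply the next token of the raw text, so no human annotation is required (the whitespace “tokenizer” below is a deliberate oversimplification):

```python
# Raw, unlabeled text is all we start with
text = "transformers are trained on large amounts of raw text"
tokens = text.split()   # naive whitespace tokenization, for illustration only

# Input/target pairs are generated automatically from the text itself
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for context, target in pairs[:3]:
    print(context, "->", target)
# ['transformers'] -> are
# ['transformers', 'are'] -> trained
# ['transformers', 'are', 'trained'] -> on
```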
This type of model develops a statistical understanding of the language it has been trained on, but it’s not very useful for specific practical tasks. Because of this, the general pretrained model then goes through a process called transfer learning. During this process, the model is fine-tuned in a supervised way — that is, using human-annotated labels — on a given task.
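As a rough sketch of that transfer-learning step, the snippet below shows what supervised fine-tuning of a pretrained checkpoint can look like with the Hugging Face transformers and datasets libraries. The IMDb dataset, the DistilBERT checkpoint, and the hyperparameters are illustrative choices on my part, not something prescribed by the models discussed above.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

dataset = load_dataset("imdb")                       # human-labeled sentiment data
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)         # pretrained body + new classification head

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1),
    train_dataset=dataset["train"].shuffle(seed=42).select(range(2000)),
)
trainer.train()                                      # the supervised fine-tuning step
```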
Self-Attention
The transformer first appeared in 2017 in a paper that cryptically declared that “Attention Is All You Need.” In other approaches to AI, the system would first focus on local patches of input data and then build up to the whole. In a language model, for example, nearby words would first get grouped together. The transformer, by contrast, runs processes so that every element in the input data connects, or pays attention, to every other element. Researchers refer to this as “self-attention.” This means that as soon as it starts training, the transformer can see traces of the entire data set.
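In code, the self-attention described above boils down to a few matrix products. Below is a bare-bones, single-head sketch of scaled dot-product attention over random toy tensors (no learned module wrappers, just the math):

```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 8)                      # 4 tokens, each an 8-dimensional embedding
W_q, W_k, W_v = (torch.randn(8, 8) for _ in range(3))

Q, K, V = x @ W_q, x @ W_k, x @ W_v        # queries, keys, values
scores = Q @ K.T / (8 ** 0.5)              # every token scored against every other token
weights = F.softmax(scores, dim=-1)        # 4x4 attention weights, each row sums to 1
out = weights @ V                          # each output mixes information from all tokens
```

The 4x4 weight matrix is the point: every element of the input attends to every other element, which is exactly the behavior described above.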
Five years later, Transformer-based architectures are becoming ever more powerful. I cover some of these research papers on AiSupremacy, my newsletter focused on A.I.
The success of transformers prompted the AI crowd to ask what else they could do. The answer is unfolding now, as researchers report that transformers are proving surprisingly versatile. In some vision tasks, like image classification, neural nets that use transformers have become faster and more accurate than those that don’t. Emerging work in other AI areas — like processing multiple kinds of input at once, or planning tasks — suggests transformers can handle even more.
What do you think, are Transformers an important part of the future of artificial intelligence?
I’m currently working on an article about the top people to follow in data science on LinkedIn, so stay tuned this week.