[Masked] Language Modeling with Recurrent Neural Networks

One step forward and the other step backward

Deepak Mishra
Jul 18, 2021 · 6 min read

Down memory lane!

Not so long ago, Recurrent Neural Networks were the go-to architecture for just about anything with a sequential nature, most notably text data. RNN variants like the GRU and LSTM were used for text classification, paraphrasing, language modeling, token classification, and other non-standard problems.

However, LSTM models started to reach their limits on very large datasets, deeper architectures, and longer sequences. Google was using a seq2seq architecture for its language translation service, and it became noticeable that the LSTM architecture could not retain enough context to translate longer sequences properly. Then Bahdanau et al., 2015 coupled an additive attention mechanism with the LSTM seq2seq architecture, which gave impressive returns. Thus was born the idea of attention, and it started the NLP revolution as we know it.

Later, the seminal paper "Attention Is All You Need" (Vaswani et al., 2017) popularized the idea of self-attention and started the Transformer era. There have been many variations of the Transformer architecture, and researchers have flooded the landscape with ideas for improving it.

Two major themes are:

- Tweaking the learning objective

- Optimization/approximation of the self-attention

Lilian Weng (2018) has written a great blog post covering these ideas in extensive detail.

Masked Token Prediction

Masked token prediction is a learning objective first used by the BERT language model (Devlin et al., 2019).


In summary, the input sentence is corrupted with a pseudo-token, [MASK], and the model attends bidirectionally to the whole text to predict the tokens that were masked out. When a large model is trained this way on a large corpus, the result is a very powerful, context-aware language model that can easily be fine-tuned for other tasks.
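To make the corruption step concrete, here is a minimal PyTorch sketch, assuming the inputs are already tensors of token ids. The 15% selection rate follows the BERT paper, while MASK_ID is a placeholder; BERT's additional refinement of sometimes substituting a random token or keeping the original is omitted here, and none of this is claimed to be this article's exact recipe.

```python
import torch

# A minimal sketch of the masking step: pick ~15% of positions at random and
# replace their ids with the [MASK] id, remembering which positions were
# corrupted so the loss can later be computed only on them.
MASK_ID = 4        # placeholder id for [MASK]
MASK_PROB = 0.15   # fraction of positions to corrupt

def mask_tokens(token_ids: torch.Tensor):
    """token_ids: (batch, seq_len) tensor of input token ids."""
    targets = token_ids.clone()                                # what the model must recover
    mask_positions = torch.rand(token_ids.shape) < MASK_PROB   # randomly select positions
    masked_ids = token_ids.clone()
    masked_ids[mask_positions] = MASK_ID                       # corrupt them with [MASK]
    return masked_ids, targets, mask_positions
```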

RNN language model with Masked Token Prediction

Now, let's apply the masked language modeling objective to a Recurrent Neural Network.

Collecting the dataset

Instead of training on the Wikipedia and Common Crawl corpora, which are big and generic, I have used a Job Description dataset from Kaggle. The full descriptions from the job postings are used for language modeling.

Training the tokenizer

One of the important ideas in recent language models is subword tokenization. The vocabulary of the English language is huge, and it's wasteful to train an embedding for each unique word. Instead, we can break words into constituent pieces and trust the model to leverage the subword tokens to its advantage, drastically reducing the number of unique tokens.

In this work, I use the awesome Tokenizers library developed by Huggingface to train a tokenizer for the corpus.

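The training itself is only a few lines. Below is a minimal sketch using the Tokenizers library's WordPiece implementation; the corpus file name and any options other than the vocabulary size and the special tokens are assumptions, not necessarily what was used here.

```python
from tokenizers import BertWordPieceTokenizer

# Hypothetical sketch: train a WordPiece tokenizer on the job-description corpus.
# "job_descriptions.txt" is a placeholder path for the extracted full descriptions.
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["job_descriptions.txt"],
    vocab_size=36000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "[NL]"],
)
tokenizer.save_model(".")  # writes vocab.txt for later reuse
```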

The trained tokenizer has 36,000 unique tokens, with provision for six special tokens. Five of them ([PAD], [UNK], [CLS], [SEP], [MASK]) serve the same purpose as in the BERT model, and [NL] is added to represent the newline character.

Example

This is a sentence with subword tokenization

this is a sentence with sub ##word tok ##eni ##zation

LSTM model architecture

The model architecture as such is very straightforward. A minimal code sketch follows the list below.

  • The (masked) input tokens are passed through the embedding layer.
  • The embeddings are passed through bidirectional LSTM layers.
  • Each LSTM layer has a residual connection.
  • The output layer reuses (ties) the embedding weight matrix to produce the logits for the output tokens.
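Here is that sketch in PyTorch. The layer sizes and layer count are illustrative assumptions; only the overall structure (embedding, residual bidirectional LSTMs, tied output projection) follows the description above.

```python
import torch
import torch.nn as nn

class ResidualBiLSTMLM(nn.Module):
    """Sketch: embedding -> stacked bidirectional LSTMs with residual
    connections -> logits computed with the tied embedding matrix."""

    def __init__(self, vocab_size=36000, embed_dim=512, num_layers=4, pad_id=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_id)
        # hidden = embed_dim // 2 per direction keeps the feature size at
        # embed_dim, so the residual connections line up.
        self.lstms = nn.ModuleList(
            [nn.LSTM(embed_dim, embed_dim // 2, batch_first=True, bidirectional=True)
             for _ in range(num_layers)]
        )

    def forward(self, token_ids):
        x = self.embedding(token_ids)              # (batch, seq, embed_dim)
        for lstm in self.lstms:
            out, _ = lstm(x)
            x = x + out                            # residual connection
        # Weight tying: reuse the embedding matrix as the output projection.
        return x @ self.embedding.weight.t()       # (batch, seq, vocab_size)
```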

Training the model

The training loop is quite straightforward. A sketch follows the list below.

  • Initialize the model and the optimizer.
  • Initialize variables to monitor the loss.
  • Loop over the training data generator.
  • Perform a forward pass.
  • Calculate the cross-entropy loss (only over the corrupted/masked tokens).
  • Backpropagate the loss scaled down by the number of accumulation steps.
  • Step the optimizer once every accumulation count.
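Here is a sketch of that loop, assuming a train_loader that yields (masked_ids, target_ids, mask_positions) batches like the mask_tokens helper above; the optimizer choice and hyperparameters are illustrative, not the article's exact settings.

```python
import torch
import torch.nn.functional as F

model = ResidualBiLSTMLM()                                   # sketch model from above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)    # assumed optimizer and lr
accumulation_steps = 8
running_loss = 0.0                                           # monitor the loss

for step, (masked_ids, target_ids, mask_positions) in enumerate(train_loader, start=1):
    logits = model(masked_ids)                               # forward pass
    # Cross-entropy only over the corrupted/masked positions.
    loss = F.cross_entropy(logits[mask_positions], target_ids[mask_positions])
    (loss / accumulation_steps).backward()                   # scale loss for accumulation
    running_loss += loss.item()

    if step % accumulation_steps == 0:                       # step once per accumulation window
        optimizer.step()
        optimizer.zero_grad()
```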

The model adapts to the objective consistently, with the loss steadily decreasing as more tokens are seen.

[Figure: the loss vs. the number of tokens seen during training (author's image)]

Results

Although the model was trained for only a short time, the results look promising.

Here are a couple of examples:

Input

C Developer Belfast Salary up to ****k pa Our client, a leading edge [MASK] Development Centre in Belfast requires C [MASK] to deliver key software products directly for their clients and for their [MASK] teams using the latest Microsoft [MASK] (.NET C, ASP.NET and SQL Server). Key [MASK] • Design and develop cuttingedge [MASK] solutions, developed in C .Net with SQL as the back [MASK] data store.

Output

. c developer belfast salary up to * * * * k pa our client , a leading edge software development centre in belfast requires c ) to deliver key software products directly for their clients and for their design teams using the latest microsoft technologies ( . net c , asp . net and sql server ) . key responsibilities • design and develop on software solutions , developed in c . net with sql as the back end data store . ?

Input

Excelsior Professional Search is a [MASK] executive search and recruitment firm [MASK] in the financial markets technology re a [MASK] sales professional in this market it would certainly be [MASK] for you to either apply to this posting or [MASK] us for a confidential [MASK] discussion with a view to [MASK] if any of these roles would better your [MASK] position and earning potential.

Output

or the of professional search is a leading executive search and recruitment firm specialising in the financial markets technology re a successful sales professional in this market it would certainly be useful for you to either apply to this role or call us for a confidential confidential discussion with a view to us if any of these roles would feel your career position and earning potential . ?
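For reference, one way to produce such fill-ins is to run the masked ids through the model, take the argmax at each position, and substitute the predictions back in at the masked slots before decoding. This is a hypothetical sketch reusing the model and tokenizer from the earlier sketches, not the article's exact inference code.

```python
import torch

def fill_masks(masked_ids: torch.Tensor, mask_positions: torch.Tensor) -> str:
    """masked_ids: (1, seq) ids with [MASK] already inserted;
    mask_positions: (1, seq) boolean tensor marking the corrupted slots."""
    with torch.no_grad():
        logits = model(masked_ids)               # (1, seq, vocab)
    predictions = logits.argmax(dim=-1)          # most likely token at every position
    filled = torch.where(mask_positions, predictions, masked_ids)
    return tokenizer.decode(filled[0].tolist())
```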

Conclusion

Based on this experiment and some other articles in the wild, I came to the following conclusions:

  • The learning objective can make a huge difference in the capability of the model.
  • LSTM models work better in low-to-medium data settings (when training from scratch).
  • PyTorch is very easy to work with, and it forces you to really understand what's going on under the hood (kinda off-topic, but sharing my personal opinion).

Further Work

I plan to keep exploring other variations of language models and will maintain the model variants and code on GitHub.

References

- Bahdanau, D., Cho, K., & Bengio, Y. (2015). "Neural Machine Translation by Jointly Learning to Align and Translate." ICLR 2015.

- Vaswani, A., et al. (2017). "Attention Is All You Need." NeurIPS 2017.

- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL 2019.

- Weng, L. (2018). "Attention? Attention!" Lil'Log.
