This post explores a recent breakthrough in natural language processing (NLP) that could significantly increase the effective context length of BERT, a popular pre-trained transformer-based neural network model. According to a new paper titled “Scaling Transformer to 1M tokens and beyond with RMT,” the effective context length of large language models could grow to an impressive 2,048,000 tokens.

This breakthrough could have important implications for tasks that require processing lengthy texts, such as language modeling, and could lead to new applications in various fields, including the energy and power industries.

What’s BERT?

BERT is a pre-trained transformer-based neural network model developed by Google that is widely used for NLP tasks such as text classification, question answering, and named entity recognition. BERT models use subword tokenization, which breaks words into smaller subword units so the model can handle out-of-vocabulary words it has not seen during training.
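To make this concrete, here is a minimal sketch of subword tokenization. It assumes the Hugging Face transformers library and the public bert-base-uncased checkpoint, neither of which is tied to the paper discussed here; they are used purely for illustration.

```python
# A minimal sketch of BERT's subword (WordPiece) tokenization, assuming the
# Hugging Face `transformers` library and the public "bert-base-uncased"
# checkpoint are available (illustrative only).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Recurrent memory transformers process extremely long documents."
tokens = tokenizer.tokenize(text)   # words split into subword pieces
ids = tokenizer.encode(text)        # adds the special [CLS]/[SEP] tokens

print(tokens)    # rare or unseen words appear as pieces prefixed with '##'
print(len(ids))  # the token count is usually larger than the word count
```

Because rare words are split into several pieces, the number of tokens a model must handle is typically larger than the number of words in the text, which is why context limits are quoted in tokens rather than words.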

Tokenization in LLMs (Large Language Models)

As of today, GPT-4 handles 32,000 tokens, while CoLT5 handles 64,000 tokens, which equates to approximately 50-100 pages of a standard document or 2-4 hours of conversation. The entire Lord of the Rings series, for example, contains 576,459 words, which translates to 768,612 tokens.

[Figure: the entire Lord of the Rings series (including The Hobbit) is 576,459 words = 768,612 tokens. Source: the 2304.11062 research paper.]
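For a back-of-the-envelope sense of how words map to tokens, the figures above imply roughly 1.33 tokens per English word. The sketch below runs that arithmetic; the ratio is an approximation derived from this one example, not a fixed property of any tokenizer.

```python
# Back-of-the-envelope words-to-tokens estimate implied by the figures above.
# The ~1.33 tokens-per-word ratio is an approximation, not a property of any
# particular tokenizer.
WORDS_LOTR = 576_459
TOKENS_LOTR = 768_612

tokens_per_word = TOKENS_LOTR / WORDS_LOTR
print(f"{tokens_per_word:.2f} tokens per word")   # ~1.33

def estimate_tokens(word_count: int, ratio: float = tokens_per_word) -> int:
    """Rough token-count estimate for a text given its word count."""
    return round(word_count * ratio)

print(estimate_tokens(100_000))      # a ~100k-word novel -> ~133k tokens
print(2_048_000 // TOKENS_LOTR)      # full copies of the series fitting in a 2,048,000-token context
```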

However, a recent paper published on April 19, 2023, titled “Scaling Transformer to 1M tokens and beyond with RMT,” suggests that the effective context length of a BERT-based language model can be increased to an impressive 2,048,000 tokens using the Recurrent Memory Transformer (RMT). By processing the input recurrently, one segment at a time, with a small set of memory tokens carrying information between segments, the study shows that Transformers can handle lengthy texts without requiring excessive amounts of memory.
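At a high level, the idea is segment-level recurrence: the long input is split into segments, learned memory tokens are prepended to each segment, and the memory produced by one segment is fed in as the memory of the next, with backpropagation through time across segments during training. The PyTorch sketch below illustrates only that control flow; it is not the authors' implementation (their code is in the GitHub repository linked at the end of this post), and all names and sizes here are illustrative assumptions.

```python
# Simplified sketch of the segment-level recurrence behind the Recurrent
# Memory Transformer (RMT). NOT the authors' implementation; names and
# sizes are illustrative assumptions.
import torch
import torch.nn as nn


class RMTSketch(nn.Module):
    def __init__(self, backbone: nn.Module, d_model: int,
                 num_memory_tokens: int = 10, segment_length: int = 512):
        super().__init__()
        self.backbone = backbone            # any encoder mapping (batch, seq, d_model) -> same shape
        self.segment_length = segment_length
        # Learned memory tokens, shared initialization across inputs.
        self.memory = nn.Parameter(torch.randn(num_memory_tokens, d_model))

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        batch, seq_len, _ = embeddings.shape
        memory = self.memory.unsqueeze(0).expand(batch, -1, -1)
        n_mem = memory.shape[1]
        outputs = []
        # Process the long input one segment at a time, carrying the memory
        # state (and its gradients, via BPTT) from segment to segment.
        for start in range(0, seq_len, self.segment_length):
            segment = embeddings[:, start:start + self.segment_length]
            hidden = self.backbone(torch.cat([memory, segment], dim=1))
            memory = hidden[:, :n_mem]        # updated memory for the next segment
            outputs.append(hidden[:, n_mem:]) # regular token representations
        return torch.cat(outputs, dim=1)


# Example usage with a tiny Transformer encoder as the backbone:
d_model = 64
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)
model = RMTSketch(backbone, d_model, num_memory_tokens=4, segment_length=128)
out = model(torch.randn(2, 1024, d_model))   # 8 segments of 128 tokens each
print(out.shape)                             # torch.Size([2, 1024, 64])
```

Because each backbone call only ever sees one segment plus a handful of memory tokens, per-segment memory use stays roughly constant and total compute grows linearly with input length, which is what lets the effective context scale into the millions of tokens.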

The research indicates that models trained on sufficiently long inputs can generalize to texts several times longer than those seen during training.

These findings could help Transformers perform well on tasks with properties unseen during training, such as language modeling over much longer inputs. The authors plan to use this recurrent memory approach to increase the effective context size of widely used Transformer models in future studies.

Conclusion

While it is unclear whether increasing the context length will introduce trade-offs, the potential benefits are considerable if the method does not degrade model quality. We could write entire novels, analyze large bodies of information in complex scientific research, and store and recall a person’s entire life experiences.

Seetalabs is currently researching new ways to apply this “boost” to the energy and power industry. Stay tuned!

For more information, visit the GitHub repository: https://github.com/booydar/t5-experiments/tree/scaling-report