Google DeepMind recently introduced RecurrentGemma, a new language model that can match or exceed the performance of transformer-based models while being more memory efficient. This could mean that even resource-limited environments can benefit from the capabilities of large language models.
What follows is a summary of the key points from the research paper.
The research paper summarizes the model this way:
RecurrentGemma is a new open language model that utilizes Google's innovative Griffin architecture. This architecture merges linear recurrences with local attention to deliver outstanding results in language processing. One key feature of Griffin is its fixed-size state, which not only reduces memory usage but also enables efficient processing of long sequences. We offer a pre-trained model with 2 billion non-embedding parameters, as well as an instruction-tuned variant. Despite being trained on fewer tokens, both models deliver performance comparable to Gemma-2B.
Connection To Gemma
Gemma and RecurrentGemma are both built on Google's Gemini research and technology, and both are lightweight models that can run on laptops and mobile devices. Like Gemma, RecurrentGemma is designed to function effectively in environments with limited resources. The two models also share their pre-training data, instruction tuning, and the use of RLHF (Reinforcement Learning From Human Feedback), a technique that uses human feedback to steer models toward more useful outputs, making them well suited to generative AI applications.
Griffin Architecture
The new model is based on a hybrid architecture called Griffin that was announced a few months ago. Griffin is called a "hybrid" model because it combines two technologies: linear recurrences, which let it handle long sequences of information efficiently, and local attention, which lets it focus on the most recent parts of the input. This combination allows Griffin to process "significantly" more data in the same amount of time as transformer-based models, while also reducing latency (wait time).
The Griffin research paper introduced two models, one named Hawk and the other named Griffin, and explains why this hybrid approach is considered a breakthrough:
We have conducted experiments to confirm that Hawk and Griffin offer faster performance and higher processing capacity compared to our Transformer models. Additionally, Hawk and Griffin have shown the capability to work with longer sequences and efficiently handle tasks like data copying and retrieval over extended periods. These results indicate that our new models present a strong and efficient option in place of Transformers with global attention.
The key distinction between Griffin and RecurrentGemma lies in a single adjustment to how the model handles its input embeddings.
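To make the "hybrid" idea concrete, here is a minimal, illustrative Python sketch of the two ingredients described above: a linear recurrence that carries a fixed-size state, and a local attention step that only looks at a bounded window of recent positions. This is not Griffin's actual implementation (the real architecture uses gated recurrent blocks, learned projections, and other components not shown here), and the decay value, window size, and toy dimensions are arbitrary.

```python
import numpy as np

def linear_recurrence(x, decay):
    """Toy linear recurrence: h_t = decay * h_{t-1} + (1 - decay) * x_t.
    The state h has a fixed size no matter how long the sequence gets."""
    h = np.zeros(x.shape[-1])
    outputs = []
    for x_t in x:
        h = decay * h + (1.0 - decay) * x_t
        outputs.append(h)
    return np.stack(outputs)

def local_attention(x, window):
    """Toy local attention: each position attends only to the last `window`
    positions, so the cached context is bounded by the window size."""
    seq_len, dim = x.shape
    out = np.zeros_like(x)
    for t in range(seq_len):
        start = max(0, t - window + 1)
        keys = x[start:t + 1]                  # bounded slice of recent tokens
        scores = keys @ x[t] / np.sqrt(dim)    # dot-product attention scores
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[t] = weights @ keys
    return out

# A toy "hybrid" block: recurrence for long-range mixing,
# local attention for the most recent context.
tokens = np.random.randn(16, 8)
mixed = local_attention(linear_recurrence(tokens, decay=0.9), window=4)
print(mixed.shape)  # (16, 8)
```

The detail that matters for the rest of the paper is the memory behavior: neither function needs storage that grows with the full sequence length.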
Breakthroughs
According to the research paper, RecurrentGemma matches or even outperforms the traditional Gemma-2B transformer model. The interesting part is that RecurrentGemma was trained on 2 trillion tokens, while Gemma-2B was trained on 3 trillion tokens. This is one of the key reasons the research paper is titled "Moving Past Transformers for Efficient Open Language Models": it demonstrates a way to achieve comparable or better performance without the heavy resource requirements of the transformer architecture.
Another advantage of RecurrentGemma over transformer models is lower memory usage and faster processing. According to the research paper, RecurrentGemma has a smaller state size than transformers when processing long sequences. While Gemma's KV cache grows as the sequence length increases, RecurrentGemma's state remains bounded and does not expand beyond the local attention window size of 2k tokens. This means that RecurrentGemma can generate sequences of any length, unlike Gemma, whose autoregressive generation is limited by the memory available on the host.
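A rough back-of-the-envelope calculation shows why the bounded state matters. The sketch below estimates cache memory for a transformer-style KV cache that grows with the sequence versus a state capped at the 2k-token local attention window; the layer count, head count, and head dimension are assumed placeholder values rather than the real Gemma or RecurrentGemma configurations, and the bounded estimate ignores the (constant-size) recurrent state itself.

```python
# Illustrative memory comparison; all model dimensions are assumed placeholders.
BYTES_PER_VALUE = 2      # bfloat16
NUM_LAYERS = 26          # assumed
NUM_KV_HEADS = 8         # assumed
HEAD_DIM = 128           # assumed
LOCAL_WINDOW = 2048      # local attention window size cited in the paper

def kv_cache_bytes(seq_len: int) -> int:
    """Keys and values cached for every token so far, in every layer."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * seq_len * BYTES_PER_VALUE

def bounded_state_bytes(seq_len: int) -> int:
    """Simplified bounded state: the local attention cache never grows past
    the window; the fixed-size recurrent state is left out for simplicity."""
    return kv_cache_bytes(min(seq_len, LOCAL_WINDOW))

for n in (2_048, 8_192, 65_536):
    print(f"{n:>6} tokens | growing KV cache: {kv_cache_bytes(n) / 2**20:7.0f} MiB"
          f" | bounded state: {bounded_state_bytes(n) / 2**20:7.0f} MiB")
```

With these made-up numbers, the growing cache passes several gigabytes at 65k tokens while the bounded state stays around 200 MiB, which is the qualitative gap the paper describes.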
RecurrentGemma also outperforms the Gemma transformer model on throughput, the number of tokens that can be generated per second (higher is better). While the transformer model's throughput drops as the sequence length increases (more tokens or words), RecurrentGemma maintains a high throughput even for long sequences. The research paper describes the comparison:
In Figure 1a, we show the throughput achieved when sampling from a prompt of 2k tokens for a range of generation lengths. Throughput measures the maximum number of tokens that can be sampled per second on a single TPUv5e device.
RecurrentGemma consistently achieves a higher throughput compared to Gemma across all sequence lengths tested. Moreover, the throughput of RecurrentGemma remains stable even as the sequence length increases, while the throughput of Gemma decreases as the cache size grows.
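For readers who want to run this kind of comparison on their own hardware, the sketch below shows the general shape of a throughput measurement: time a generation call and divide the number of new tokens by the elapsed seconds. The `sample_fn` argument and the stand-in sampler are hypothetical placeholders, not real Gemma or RecurrentGemma APIs; swap in whatever generation call your chosen implementation actually provides.

```python
import time

def measure_throughput(sample_fn, prompt, num_tokens):
    """Tokens generated per second for a given generation length."""
    start = time.perf_counter()
    sample_fn(prompt, max_new_tokens=num_tokens)
    elapsed = time.perf_counter() - start
    return num_tokens / elapsed

# Stand-in sampler so the sketch runs on its own; replace it with a real
# generation call to reproduce the paper's transformer-vs-recurrent comparison.
def fake_sampler(prompt, max_new_tokens):
    time.sleep(0.0002 * max_new_tokens)

for n in (512, 2048, 8192):
    print(n, "tokens:", round(measure_throughput(fake_sampler, "hello", n)), "tokens/sec")
```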
Limitations Of RecurrentGemma
The research paper does show that this approach comes with its own limitation: performance lags behind traditional transformer models on very long sequences. The researchers highlight that RecurrentGemma struggles with sequences extending far beyond its local attention window, something transformer models are able to handle.
According to the paper:
While RecurrentGemma models are great for shorter sequences, they may not perform as well as traditional transformer models like Gemma-2B when working with very long sequences that go beyond the local attention window.
Implications In The Real World
This approach to language models is important because it offers an alternative way to achieve strong performance without the heavy computational resources of the transformer architecture. It also demonstrates that a non-transformer model can avoid the ever-growing cache of transformer models, which drives up memory usage as sequences get longer.
As a result, this development could pave the way for the use of language models in scenarios with limited resources in the near future.
Read the Google DeepMind research paper:
RecurrentGemma: Moving Past Transformers for Efficient Open Language Models (PDF)
Featured Image by Shutterstock/Photo For Everything
Editor's P/S:
The advent of RecurrentGemma, a novel language model from Google DeepMind, marks a significant stride in the evolution of AI language processing. It offers a compelling alternative to transformer-based models, delivering comparable performance with markedly better memory efficiency and throughput. This breakthrough opens up exciting possibilities for harnessing the power of language models in resource-constrained environments, such as mobile devices and low-powered servers.
RecurrentGemma's ability to match or outperform transformer models on shorter sequences while maintaining a small state size and fast processing times is particularly noteworthy. This makes it an ideal choice for applications that require real-time language processing, such as chatbots, voice assistants, and text summarization tools. Moreover, its ability to generate sequences of any length without growing memory requirements paves the way for advancements in fields such as language generation, machine translation, and question answering.