male-1: Welcome back to Byte-Sized Breakthroughs, the podcast where we dissect the latest advancements in tech and AI. I'm your host, Alex Askwell, and today we're diving deep into a fascinating paper titled 'Titans: Learning to Memorize at Test Time.' With me are Dr. Paige Turner, the lead researcher behind this work, and Professor Wyd Spectrum, an expert in the field of machine learning. Welcome to you both.

female-1: Thanks, Alex. Glad to be here.

female-2: It's a pleasure to join this conversation, Alex.

male-1: Dr. Turner, let’s start with the big picture. What problem were you aiming to solve with this research?

female-1: Certainly, Alex. Our focus was on the challenges faced by current sequence models when handling long-range dependencies. We've seen two main approaches: recurrent models and attention-based models like Transformers. Recurrent models, such as RNNs, compress data into a fixed-size hidden state, which can lose crucial information over longer sequences. Transformers, while powerful due to their attention mechanism, have a quadratic computational cost, making them inefficient for very long contexts. Linear Transformers try to reduce this cost, but they often underperform Transformers in accuracy. We aimed to bridge this gap by introducing a memory system that works effectively for very long contexts.

male-1: Professor Spectrum, what's your take on this historical context? Why are these limitations in existing models so significant?

female-2: Well, Alex, the core problem is that real-world data is often sequential and possesses long-term dependencies. Think of language: in sentences, paragraphs, and entire documents, each part relies on information from potentially very distant parts. Similarly, in genomics or time-series analysis, very long-term dependencies are critical. For instance, in genomics, gene regulatory elements, such as enhancers, may be located far from the genes they regulate. Traditional RNNs simply struggle to retain information over such long spans, and that limits their performance. Transformers, while adept at capturing dependencies within their context window through attention, become computationally infeasible when faced with excessively long sequences. So, this area has been the focus of very intense research.

male-1: So it's a balancing act between computational efficiency and modeling long-term dependencies. Dr. Turner, where did your research make key contributions to address this?

female-1: Our main contribution is the development of a neural long-term memory module that, unlike other models, learns to memorize and forget at *test time*. This is crucial. Most models have fixed parameters after training; our module continuously updates its parameters even during inference. We achieve this through a surprise-based mechanism. When an input is surprising, as indicated by a large gradient of the loss with respect to the input, we dedicate more memory to it. We also incorporated a forgetting mechanism, which is controlled by a gate and is based on a form of weight decay; this has been shown to be important in controlling memory overflow. Additionally, rather than a fixed-size matrix- or vector-based memory, we used a deep neural network as our memory module to capture complex non-linear relationships.

male-1: Let’s break this down, Dr. Turner. You mentioned a 'surprise-based mechanism' and a 'forgetting mechanism'. Can you detail how those work?

female-1: Certainly, Alex. The surprise metric is based on the gradient of the associative memory loss function with respect to the input. Specifically, our loss is the mean squared error between the output of the memory module for a given key, and the corresponding value, where keys and values are linear projections of the input token. A larger gradient signifies a greater error, indicating a more surprising or unexpected input. This means the model has more difficulty memorizing this token, and therefore, that token should be more memorable. We also use a momentum term to keep track of past surprises. We measure surprise not just from the incoming token but also based on a weighted average of recent past surprises. The forget mechanism, or gate, decides how much of the past memory should be decayed with each update. This is important to handle long sequences as it allows us to clear the memory when information is no longer relevant.
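
The update just described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the memory is simplified to a single matrix `M` (the discussion uses a deep MLP), the gradient is taken with respect to the memory's weights on the current key-value pair, and the momentum `eta`, step size `theta`, and forget gate `alpha` are fixed hypothetical constants rather than learned, data-dependent gates.

```python
def matvec(M, k):
    """Read from the (linear) memory: M applied to key k."""
    return [sum(m * kj for m, kj in zip(row, k)) for row in M]


def loss(M, k, v):
    """Associative memory loss: squared error between M(k) and v."""
    return sum((p - t) ** 2 for p, t in zip(matvec(M, k), v))


def grad(M, k, v):
    """Gradient of the loss with respect to M: 2 * (M k - v) k^T."""
    err = [p - t for p, t in zip(matvec(M, k), v)]
    return [[2.0 * e * kj for kj in k] for e in err]


def update(M, S, k, v, eta=0.9, theta=0.1, alpha=0.01):
    """One test-time step (eta, theta, alpha are assumed constants).

    S_new = eta * S - theta * grad        # past surprise + momentary surprise
    M_new = (1 - alpha) * M + S_new       # forget gate as weight decay
    """
    g = grad(M, k, v)
    S_new = [[eta * s - theta * gi for s, gi in zip(s_row, g_row)]
             for s_row, g_row in zip(S, g)]
    M_new = [[(1.0 - alpha) * m + s for m, s in zip(m_row, s_row)]
             for m_row, s_row in zip(M, S_new)]
    return M_new, S_new


# Usage: an empty memory sees one surprising (key, value) pair.
M = [[0.0, 0.0], [0.0, 0.0]]
S = [[0.0, 0.0], [0.0, 0.0]]
k, v = [1.0, 0.0], [1.0, 2.0]
before = loss(M, k, v)
M, S = update(M, S, k, v)
after = loss(M, k, v)  # the loss drops from 5.0 to roughly 3.2
```

Setting `alpha` close to 1 wipes most of the old memory in one step, while `eta` controls how long a past surprise keeps influencing later updates.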

male-1: So, it's not just about storing everything; it’s also about selectively forgetting. Professor Spectrum, how does this differ from traditional approaches to memory in sequence models?

female-2: Exactly, Alex. Traditional methods tend either to compress all past information into a fixed-size vector, as RNNs do, or to keep all information within a limited context window, as Transformers do. Recent linear recurrent models have memory mechanisms, but many lack an explicit way to forget, or they use a simple matrix-based memory with limited expressive power. Dr. Turner's approach, with its dynamic memory updates based on surprise and forgetting, is a significant departure. Many existing models use only momentary surprise, missing how information flows through the sequence, or they use matrix- or vector-valued memories, which perform worse in long-context scenarios. The crucial aspect here is that the system learns what and when to remember and forget at test time, instead of relying on a trained, fixed memory or a linear model for memory updates.

male-1: Dr. Turner, you also mentioned that you used a deep neural network for your memory module, rather than fixed-size matrix or vector memories. Can you explain that choice?

female-1: Absolutely. We used a multilayer perceptron, or MLP, as our memory module, and the number of layers is an adjustable hyperparameter. This decision was based on the idea that complex relationships often aren't linear. Linear models for memory updates, such as a linear layer or a matrix, inherently compress data in a linear fashion. MLPs with at least two layers have been shown to be more expressive than linear models, both theoretically and practically. In our experimental results, we showed that deeper memory modules do in fact improve model performance, at the cost of longer training time.
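
A toy illustration of the expressiveness point, not taken from the paper: the four key-value pairs below follow an XOR pattern, which no linear map can store exactly, while a two-layer ReLU network can. The weights are hand-picked for the example (an assumption; a real memory module would learn them by gradient updates), and the function names are hypothetical.

```python
def relu(z):
    return max(0.0, z)


def two_layer_memory(x1, x2):
    """Two-layer MLP with hand-set weights (illustrative, not learned)."""
    h1 = relu(x1 + x2)          # hidden unit 1
    h2 = relu(x1 + x2 - 1.0)    # hidden unit 2, with bias -1
    return h1 - 2.0 * h2        # linear output layer


def linear_memory(x1, x2, w1, w2):
    """Matrix-style memory: a single linear map from key to value."""
    return w1 * x1 + w2 * x2


# Key -> value associations to store (XOR pattern).
pairs = {(0.0, 0.0): 0.0, (0.0, 1.0): 1.0, (1.0, 0.0): 1.0, (1.0, 1.0): 0.0}

mlp_errors = [abs(two_layer_memory(*key) - value) for key, value in pairs.items()]

# Storing (0,1)->1 and (1,0)->1 forces w1 = w2 = 1 in the linear memory,
# which then necessarily returns 2 instead of 0 for the key (1,1).
linear_errors = [abs(linear_memory(*key, 1.0, 1.0) - value)
                 for key, value in pairs.items()]
```

Here `mlp_errors` is zero for every pair, while `linear_errors` is nonzero on the last key: the linear memory cannot hold all four associations at once, whatever weights it uses.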

male-1: Interesting. So you’re not just adding memory; you’re also adding *depth* to memory. Now, let's discuss how you incorporated this memory module into your architecture, which you call ‘Titans.’ You presented three variations. Could you describe the design and how they differ?

female-1: Sure. We introduce three distinct Titans architectures: Memory as a Context (MAC), Memory as a Gate (MAG), and Memory as a Layer (MAL). In the MAC architecture, we treat the memory as a context for the incoming data. We first segment the sequence into chunks and use the current chunk as a query to retrieve the corresponding information from our neural long-term memory module. This retrieved information, along with a persistent memory component, is prepended to the current chunk and passed to the attention module. Then, the output is used to update the memory module. In the MAG architecture, we apply sliding-window attention to the input and then combine its output with the memory output using a gating mechanism. Finally, in the MAL architecture, we apply the neural memory module as a layer that compresses the sequence before it's processed by a sliding-window attention module. All three architectures also include the persistent memory component.
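
The three wirings can be contrasted in a schematic sketch. Everything below is a hypothetical stand-in chosen only to make the data flow runnable: tokens are plain floats, `ToyMemory` is a running average rather than a neural memory, `toy_attention` is causal averaging rather than real attention, and the fixed scalar `gate` in `mag` replaces a learned gating mechanism. Only the order of operations follows the description above.

```python
from statistics import fmean


class ToyMemory:
    """Hypothetical stand-in for the neural memory: one running average."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.state = 0.0

    def retrieve(self, tokens):
        # One memory read per query token.
        return [self.state for _ in tokens]

    def update(self, tokens):
        # Decay the old state, then blend in the new tokens.
        self.state = self.decay * self.state + (1 - self.decay) * fmean(tokens)


def toy_attention(tokens, window=None):
    """Causal averaging as a stand-in for full or sliding-window attention."""
    out = []
    for i in range(len(tokens)):
        start = 0 if window is None else max(0, i + 1 - window)
        out.append(fmean(tokens[start:i + 1]))
    return out


def mac(x, memory, persistent, chunk_size):
    """Memory as a Context: retrieved + persistent tokens are prepended."""
    outputs = []
    for i in range(0, len(x), chunk_size):
        chunk = x[i:i + chunk_size]
        retrieved = memory.retrieve(chunk)           # query memory with the chunk
        context = persistent + retrieved + chunk     # prepend to the chunk
        y = toy_attention(context)[-len(chunk):]     # attention over the context
        memory.update(y)                             # output updates the memory
        outputs.extend(y)
    return outputs


def mag(x, memory, persistent, window, gate=0.5):
    """Memory as a Gate: sliding-window attention mixed with memory output."""
    seq = persistent + x
    attended = toy_attention(seq, window)
    memory.update(seq)
    remembered = memory.retrieve(seq)
    mixed = [gate * a + (1 - gate) * m for a, m in zip(attended, remembered)]
    return mixed[-len(x):]


def mal(x, memory, persistent, window):
    """Memory as a Layer: memory first, sliding-window attention second."""
    seq = persistent + x
    memory.update(seq)
    compressed = [0.5 * t + 0.5 * m for t, m in zip(seq, memory.retrieve(seq))]
    return toy_attention(compressed, window)[-len(x):]


x = [1.0, 2.0, 3.0, 4.0, 5.0]
persistent = [0.0, 0.0]  # data-independent tokens placed before the input
out_mac = mac(x, ToyMemory(), persistent, chunk_size=2)
out_mag = mag(x, ToyMemory(), persistent, window=3)
out_mal = mal(x, ToyMemory(), persistent, window=3)
```

Each variant returns one output per input token; the difference lies in where the memory sits relative to attention: before it as extra context (MAC), beside it behind a gate (MAG), or in front of it as a layer (MAL).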

male-1: And what about this persistent memory component? What's its purpose, and how does it operate?

female-1: Good question. The persistent memory is a set of learnable, but data-independent, parameters. Think of it as storing knowledge about the task, rather than specific data. We prepend these parameters to the input sequence before processing it. This does three things. First, it acts as a task memory, storing information about the task that is useful throughout the entire training and inference procedure. Second, from an architectural perspective, these parameters play a role similar to feed-forward layers, which have been shown to behave like attention weights with data-independent parameters. Finally, from a technical viewpoint, it helps to mitigate the attention bias toward the initial tokens of the sequence. This helps the model distribute attention weights more effectively, which improves the performance of the overall model.

male-1: So, you’ve built a complex system: a deep neural memory that learns at test time, a surprise-based updating rule with momentum and forgetting, and a family of architectures to integrate this. Professor Spectrum, what's your assessment of this methodological approach?

female-2: Alex, I find Dr. Turner's approach to be very insightful, well-motivated, and comprehensive. The idea of a memory that learns at test time is a significant step away from typical architectures. The key to a good memory is to selectively remember the important things and forget the useless parts. The use of gradient-based surprise and forgetting is both novel and inspired by cognitive neuroscience. The design also addresses some key limitations of other recent models. For example, modern linear recurrent models do not use deep memories; they mostly use matrix- or vector-valued memories. This design also tackles the problems of methods that do not account for the flow of information while updating the memory, or that lack a forgetting mechanism. The three architectural variants (MAC, MAG, and MAL) provide options with varying computational trade-offs. Also, the persistent memory is a very smart approach for mitigating the attention bias and making the model focus on the actual data rather than the position of the tokens. From a technical and theoretical perspective, the fast and parallelizable training is also very useful for applying it in practical settings.

male-1: Now let's move to the experiments. Dr. Turner, what tasks did you evaluate your models on, and what were the main results?

female-1: We evaluated the models on a range of tasks including language modeling, common-sense reasoning, needle-in-a-haystack, DNA modeling, and time-series forecasting. For language modeling and common-sense reasoning, our neural memory module (LMM), when used alone, achieved the best performance among non-hybrid models. Among the hybrid models, all three variants of Titans outperformed the hybrid baselines, with MAC and MAG achieving similarly high performance, which shows the importance of design when incorporating such memory modules into a larger architecture. In the needle-in-a-haystack tasks, where we measure the ability to retrieve a piece of information from very long distractor texts, the LMM module performed better than the baselines, including TTT and Mamba2, in all tasks. Also, the Titans variants, particularly MAC, performed on par with or better than LMM. We also tested our model on the BABILong benchmark, which requires reasoning across very long documents. Our Titans (MAC) model outperformed state-of-the-art models, including extremely large models such as GPT-4 and Llama3 with RAG. In time-series forecasting tasks, our neural memory module outperformed all baselines, including Mamba-based and Transformer-based architectures. Lastly, our LMM model also showed competitive performance on a diverse set of genomic tasks. We had four scales for Titans: 170M, 340M, 400M, and 760M parameters, with the first three trained on 15B tokens and the largest on 30B tokens.

male-1: Dr. Turner, can you provide specific examples of quantitative results? What kind of improvements did you see in perplexity, accuracy, or other metrics?

female-1: Certainly, Alex. In language modeling, our Titans model with 760 million parameters achieves a perplexity of 18.61 on the WikiText dataset, while some baselines score as high as 28.12, which is a significant improvement. On the PIQA task, our Titans model with 760M parameters achieves an accuracy of 70.25%, which is also an improvement over the baselines in that category. In the single needle-in-a-haystack task with a 16K context window, Titans (LMM) achieved 96.2% accuracy for retrieving a piece of information, while TTT and Mamba2 achieved 88.4% and 5.4% respectively. In the BABILong benchmark, Titans (MAC) achieved 68.8% accuracy in the few-shot setting, while the best baseline was Mamba2.8B, which achieved about 30%. Also, in the fine-tuning setting, Titans’ accuracy of 66.5% significantly outperforms even extremely large models such as GPT-4, with 61.2% accuracy. In time-series forecasting, we evaluated performance on the ETTm1 dataset, achieving an MSE of 0.358, compared with 0.387 for the best baseline. Finally, in our DNA modeling tasks, our model achieved a top-1 accuracy of 75.2% for Enhancer classification, outperforming models like DNABERT, which achieved 74.0%.

male-1: Those are some impressive results. Professor Spectrum, what is your take on these findings?

female-2: Well, Alex, the results are quite compelling. The model shows significant improvements in language modeling and common-sense reasoning tasks, but its strengths are really evident in long-context benchmarks. The fact that Titans can retrieve a specific piece of information from a 16K-token context with such high accuracy, while also being capable of reasoning across extremely long documents, is noteworthy. These results go beyond incremental improvements and really show the capacity of this new approach. This work challenges the status quo, where many state-of-the-art models often fail in long-context scenarios. The improvements in time-series and DNA modeling further highlight the generalizability of this model and architecture.

male-1: It seems like you've made a significant stride forward. However, every piece of research has its limitations. Dr. Turner, what are some limitations of your current approach, and what are the future research directions?

female-1: You’re absolutely right, Alex. While the results are promising, the study has some limitations. Firstly, while we do evaluate performance in time series and DNA modeling, our focus is mainly on language modeling and long-context reasoning tasks. More extensive evaluation on diverse datasets and tasks will give us better insight into the generalizability of Titans. The evaluation tasks were also specifically designed to test context length and reasoning, so further experiments are needed to evaluate performance on tasks that do not require these specific features. Secondly, we've used simple MLPs as the base for our memory modules, and exploring more complex architectures might improve performance. We only evaluated a limited set of hyperparameters, and further study could lead to better results. In addition, the size of the models was limited to 760M parameters. Evaluating the scalability of Titans with much larger models is very important to make the method more useful. Finally, the training is specifically designed to tackle the challenges of long context and may have limitations for other tasks, so further research on training is needed as well. In terms of future directions, we aim to explore other memory module architectures beyond MLPs, investigate different approaches for incorporating persistent memory and other training methods, extend the evaluation of Titans to a wider range of tasks, and scale Titans to larger models. We also want to study the theoretical properties of our method to better understand its expressiveness and generalization capabilities.

male-1: Professor Spectrum, what do you see as the broader implications of this work, and where do you think this research might lead us in the future?

female-2: Alex, this work has the potential to significantly reshape our approach to sequence modeling. Dr. Turner's idea of test-time learning with a forgetful memory opens up new avenues for dynamic and adaptive memory systems. In the long run, we may be moving from static, trained memories to dynamic memory models that adapt to new tasks. This would have huge implications for various fields, such as natural language processing, where a dynamic memory would really enhance contextual learning. The efficient handling of long sequences has obvious applications in analyzing complex documents, video content, and lengthy genomic sequences. Furthermore, this touches upon the connections between neuroscience and AI, since the LMM design is partly inspired by human memory systems. We might see models in the future that are more human-like in their capacity to learn, memorize, and forget. Finally, the fast, parallel training procedure matters for developing future models that are more scalable.

male-1: Dr. Turner, you've mentioned that the code for this research will be made available soon. That's definitely great for transparency and reproducibility.

female-1: Yes, Alex. We strongly believe in open research, and so we intend to make the code we used to train and evaluate our models available to everyone.

male-1: This has been an incredibly insightful discussion. To wrap things up, what are the main takeaways from your research, Dr. Turner?

female-1: Well, Alex, I’d say that the key insight is that an effective memory needs to be dynamic, surprise-driven, and able to forget the past. Our research showed that a neural long-term memory module that uses both short-term and long-term components, with the ability to learn dynamically at test time, is crucial for efficiently processing long contexts. We demonstrated that it can be successfully incorporated into state-of-the-art architectures, which led to higher performance compared to current methods. And by introducing Titans, we provide a framework for how one can incorporate such memory into the overall architecture.

male-1: That’s a fantastic summary. Thank you both for this incredibly detailed explanation of your groundbreaking research. This has been another episode of Byte-Sized Breakthroughs. Join us next time for more!