Humans are generally good at remembering events in both the long and short term, depending on how material the event is. The human brain handles two functions at once: processing and memory. This dual function allows us to place events in a historical perspective. If a material event occurs during the week, such as missing an international flight, we tend to remember it long into the future. But for mundane events, such as the random faces we see on the way to work, the brain does a great job of quickly cycling through them without bringing them to the foreground of that historical perspective.
This is precisely what the Google Titans architecture proposes to address (Google Research, 2025): the stateless nature of transformer-based architectures, which leaves models effectively closed to learning after training. The implication is that Large Language Models (LLMs) deployed in production will eventually become stale and require retraining to continue providing relevant, up-to-date information. But what if these models could continue to learn in production?
There are two parts to the deployment of LLM infrastructure as it pertains to memory and processing:
1/ The core LLM - Performs computation through dense linear algebra but has no memory outside of the context window. Put simply, an LLM can only work on the information provided in a single prompt.
2/ The retrieval architecture - This serves as memory for the LLM, but it is not part of the model itself. It is an ancillary component built on top of the LLM that simulates memory from the user's point of view.
Within LLMs, transformer architectures compute weighted combinations of contextual information via the attention mechanism. Your query is turned into tokens, and the transformer processes them through dense linear algebra, applied in parallel across the model's parameters, to estimate the next-token probability distribution (Vaswani et al., 2017; Alammar, 2018):
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V, where dₖ is the dimension of the key vectors.
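To make the formula concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention; the toy dimensions and random Q, K, V matrices are illustrative only, not part of any real model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the last axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q Kᵀ / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k): similarity of each query to each key
    weights = softmax(scores, axis=-1)   # rows sum to 1: how much each token attends to the others
    return weights @ V                   # weighted combination of value vectors

# Toy example: 4 tokens, model dimension 8.
rng = np.random.default_rng(0)
n, d = 4, 8
Q, K, V = rng.normal(size=(3, n, d))
out = attention(Q, K, V)
print(out.shape)  # (4, 8): one contextualized vector per token
```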
For long-term memory, retrieval-augmented generation (RAG) embeds the query, retrieves the document chunks whose vectors are most similar to it, and injects those chunks into the prompt (Lewis et al., 2020). To understand more about query, key, and value in the transformer architecture, see the Towards Data Science explainer (Towards Data Science, n.d.).
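As a rough sketch of that retrieval step, the toy example below uses a hashed bag-of-words embedding purely as a stand-in for a real embedding model; `embed`, `retrieve`, and the chunk texts are all hypothetical.

```python
import numpy as np

# Toy corpus of "chunks" standing in for a vector store.
chunks = [
    "Titans add a learned long-term memory module to the transformer.",
    "RAG retrieves external documents and injects them into the prompt.",
    "Attention scales quadratically with sequence length.",
]

def embed(text, dim=64):
    # Toy deterministic (per-process) embedding: hashed bag of words.
    # A real system would call an embedding model instead.
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word.strip(".,?!")) % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

index = np.stack([embed(c) for c in chunks])  # precomputed chunk embeddings

def retrieve(query, k=1):
    # Cosine similarity between the query embedding and every chunk embedding.
    sims = index @ embed(query)
    top = np.argsort(-sims)[:k]
    return [chunks[i] for i in top]

query = "How does RAG inject retrieved documents into the prompt?"
context = retrieve(query, k=1)
prompt = f"Context: {context[0]}\n\nQuestion: {query}"
print(prompt)  # the retrieved chunk is injected into the prompt before inference
```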
Surprise, Surprise!
As described in the paper “Titans: Learning to Memorize at Test Time” (Behrouz et al., 2025a), Titans extend transformer-based models by incorporating a learned long-term memory module, thereby addressing a core limitation of traditional attention-based architectures.
Traditional transformer-based LLMs rely on a short-term memory mechanism that captures context within a prompt.
What is short-term context?
Short-term context refers to the information available within the model's attention window during a single inference step. In other words, it is the context contained within a single stream of input tokens processed by attention. You can read more on attention in the original paper, "Attention Is All You Need" (Vaswani et al., 2017): https://arxiv.org/abs/1706.03762
This stream typically includes:
1/ The user prompt
2/ System messages
3/ Assistant replies
What appears to users as persistent memory in LLM-powered applications—such as chatbots—is, in reality, the result of explicit prompt replay and context injection, often supported by embeddings (numerical vector representations) or retrieval mechanisms (Lewis et al., 2020).
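A minimal sketch of that orchestration is shown below; `call_llm`, `count_tokens`, and the token limit are hypothetical placeholders rather than any particular framework's API. The point is that the application re-sends the accumulated messages on every call, trimmed to fit the attention window.

```python
MAX_CONTEXT_TOKENS = 8192  # illustrative limit of the attention window

def call_llm(messages):
    # Placeholder: a real implementation would send `messages` to an LLM API.
    return f"(model reply to: {messages[-1]['content']!r})"

def count_tokens(messages):
    # Crude proxy for tokenization: whitespace word count.
    return sum(len(m["content"].split()) for m in messages)

history = [{"role": "system", "content": "You are a helpful assistant."}]

def chat(user_input):
    history.append({"role": "user", "content": user_input})
    # Drop the oldest non-system turns once the replayed context no longer fits.
    while count_tokens(history) > MAX_CONTEXT_TOKENS and len(history) > 2:
        history.pop(1)
    reply = call_llm(history)  # the *entire* remembered context is re-sent each time
    history.append({"role": "assistant", "content": reply})
    return reply

print(chat("My flight to Lisbon was missed last week."))
print(chat("What did I tell you about last week?"))  # "memory" = replayed history
```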

This is where Titans begin to diverge, moving toward a model of memory that more closely resembles human reasoning.
Enter long-term memory...
Long-term memory in Titans is a persistent, learned neural memory that exists across inference steps, without reliance on retrieval-augmented generation (RAG). This memory enables information from past inputs to influence future predictions without prompt re-injection or external retrieval, fundamentally decoupling memory from the attention window. (Behrouz et al., 2025a; Behrouz et al., 2025b)
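As a rough intuition only, and not the paper's actual parameterization (Titans learn a deep neural memory with momentum and forgetting), here is a sketch of a simple linear associative memory whose parameters persist and are updated across inference steps, with no retrieval and no prompt re-injection:

```python
import numpy as np

class NeuralMemory:
    # Highly simplified stand-in for a learned long-term memory:
    # a linear associative map from keys to values whose parameters
    # persist across inference steps.
    def __init__(self, dim, lr=0.1):
        self.M = np.zeros((dim, dim))
        self.lr = lr

    def read(self, key):
        return self.M @ key                       # recall the value associated with this key

    def write(self, key, value):
        error = value - self.M @ key              # what the memory currently gets wrong
        self.M += self.lr * np.outer(error, key)  # gradient step on ||M k - v||^2

memory = NeuralMemory(dim=16)

# The same `memory` object is reused across separate "inference steps":
rng = np.random.default_rng(1)
k1, v1 = rng.normal(size=16), rng.normal(size=16)
memory.write(k1, v1)        # step 1: store an association
recalled = memory.read(k1)  # step 2 (a later prompt): recall it without re-injecting it
cosine = np.dot(recalled, v1) / (np.linalg.norm(recalled) * np.linalg.norm(v1))
print(np.round(cosine, 2))  # ≈ 1.0: the stored value is recovered from the key alone
```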
Implications of Titans Architecture
The shift from traditional transformer-based models to Titans introduces significant implications for how Large Language Models are deployed and operated. From a user’s perspective, LLMs have long appeared to possess memory. In practice, however, this memory has been explicitly orchestrated by external systems—most commonly retrieval-augmented generation (RAG)—which retrieve and inject context into the model at inference time. This retrieval layer introduces additional infrastructure, latency, and energy overhead before inference even begins. (Lewis et al., 2020)
By natively integrating long-term memory into the model itself, Titans fundamentally alters this deployment paradigm. (Behrouz et al., 2025a; Google Research, 2025)
1. Persistent adaptation without retraining: Titans enable models to memorize and reuse information at test time without updating the core model weights. This allows LLMs to stay current with recent interactions or data streams without retraining or fine-tuning cycles, while preserving a fixed attention window. (Behrouz et al., 2025a)
2. Reduced reliance on parameter-efficient fine-tuning for adaptation: While techniques such as Low-Rank Adaptation (LoRA) modify model parameters to achieve task adaptation, Titans offer an alternative mechanism: persistent neural memory, closer to how humans use historical context. For specific adaptation scenarios, such as user preferences, episodic knowledge, or task context, this internal memory can reduce the need for frequent fine-tuning to keep the LLM up to date, thereby lowering operational cost and deployment complexity. (Hu et al., 2022; Behrouz et al., 2025a)
3. Selective memory through surprise-driven updates: This raises a notable challenge for Titans: which memories are retained? Titans employ a surprise-based mechanism to determine what information is worth storing in long-term memory. By prioritizing unexpected or informative inputs, the model avoids indiscriminate memorization and maintains an efficient, focused memory state over extended inference sequences. The memory selection mechanism therefore plays a critical role in preventing memory saturation, which can lead to drift; a simplified sketch of this surprise gating follows below. (Behrouz et al., 2025a; Behrouz et al., 2025b)
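The sketch below gates the memory write on a crude surprise proxy, the norm of the prediction error, which is proportional to the gradient magnitude of the associative loss; the threshold, learning rate, and linear memory are simplifications of the paper's momentum-based, gradient-driven update, not its actual mechanism.

```python
import numpy as np

def surprise_gated_write(M, key, value, lr=0.05, threshold=0.5):
    # "Surprise" here = norm of the prediction error of the associative memory.
    # Expected inputs (small error) barely change the memory; surprising ones do.
    error = value - M @ key
    surprise = np.linalg.norm(error)
    if surprise < threshold:
        return M, surprise              # mundane input: skip the write
    return M + lr * np.outer(error, key), surprise

dim = 16
rng = np.random.default_rng(2)
M = np.zeros((dim, dim))

k, v = rng.normal(size=dim), rng.normal(size=dim)
M, s1 = surprise_gated_write(M, k, v)  # first time seen: very surprising, gets stored
M, s2 = surprise_gated_write(M, k, v)  # seen again: less surprising, smaller (or no) update
print(round(s1, 2), round(s2, 2))      # surprise drops once the association is memorized
```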

Reducing the Resource Scaling Ratio
In transformer architectures, self-attention computation scales quadratically with sequence length (O(n²)), making long context windows increasingly expensive in terms of latency, memory, and energy consumption. As context length grows, attention signals also become more diffuse, leading to attention dilution and reduced effectiveness in modeling long-range dependencies. (Vaswani et al., 2017)
Titans architectures address this limitation by introducing a learned long-term memory module whose update and access costs scale linearly with sequence length (O(n)). Rather than relying solely on attention over ever-longer prompts, Titans persist information across inference steps through internal neural memory. Empirical evaluations, such as needle-in-a-haystack tasks analyzed through the Miras framework, indicate that Titans retain relevant information more reliably than attention-only architectures as sequence length increases. (Behrouz et al., 2025a; Behrouz et al., 2025b)
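As a back-of-the-envelope illustration only, ignoring constant factors and the bounded attention that Titans variants still perform over a short window, the gap between quadratic and linear scaling looks like this:

```python
# Illustrative constants only: rough cost of full self-attention over a window
# of n tokens versus a linear-cost memory update, for a model dimension d.
def attention_cost(n, d=4096):
    return n * n * d  # O(n^2): every token attends to every other token

def memory_cost(n, d=4096):
    return n * d      # O(n): each token triggers a constant-cost memory update

for n in (8_192, 65_536, 1_048_576):
    print(f"n={n:>9,}  attention≈{attention_cost(n):.2e} FLOPs  memory≈{memory_cost(n):.2e} FLOPs")
```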
Conclusion
This appears to be the beginning of an evolution of the transformer architectures that currently dominate the LLM landscape. By decoupling memory from the attention window, Titans reduce reliance on large prompts, mitigate attention dilution, and enable inference to operate persistently across multiple prompts. This represents a fundamentally different scaling axis from traditional long-context transformers, one that could drastically change how LLM infrastructure is built as the technology advances. The wider implication at the resource level is that we could begin to see a reduction in the number of GPUs required to keep models in operation, allowing deployments to scale while meaningfully addressing the energy burden of AI data centers.
References
1/ Behrouz, A., Zhong, P., Mirrokni, V., et al. (2025a). Titans: Learning to memorize at test time. arXiv. https://arxiv.org/abs/2501.00663
2/ Behrouz, A., Zhong, P., Mirrokni, V., et al. (2025b). Miras: A unified framework for associative memory in sequence models. arXiv. https://arxiv.org/abs/2504.13173
3/ Google Research. (2025). Titans & Miras: Helping AI have long-term memory. https://research.google/blog/titans-miras-helping-ai-have-long-term-memory/
4/ Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33. https://arxiv.org/abs/2005.11401
5/ Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30. https://arxiv.org/abs/1706.03762
6/ Alammar, J. (2018). The illustrated transformer. https://jalammar.github.io/illustrated-transformer/
7/ Towards Data Science. (n.d.). What are query, key, and value in the transformer architecture and why are they used? https://towardsdatascience.com/what-are-query-key-and-value-in-the-transformer-architecture-and-why-are-they-used-acbe73f731f2/
8/ Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2022). LoRA: Low-rank adaptation of large language models. International Conference on Learning Representations. https://arxiv.org/abs/2106.09685