đź‘ľ The Context Window Dilemma | Part ll

The technical limitations, resource demands, and design challenges that put an infinite context window out of reach, for now.

Why isn't there an infinite context window?

An infinite context window that would allow a model to process an unlimited number of tokens at once sounds tempting, but for a number of technical and practical reasons it is, at least for now, impossible. The limitation of the context window follows from fundamental properties of current model architectures, from the demands on computing resources, and from the need to use memory efficiently. Together, these factors prevent a language model from working with an infinite context.

As mentioned earlier, the computational cost of the attention mechanism in transformers scales quadratically with the number of tokens, because every token is compared with every other token. With an infinite context window, the model would have to evaluate an unbounded number of token pairings, which no amount of computing capacity could cover.
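
To make the quadratic growth concrete, here is a minimal NumPy sketch of naive single-head self-attention (an illustration only; real implementations add multiple heads, masking, and heavily optimized kernels):

```python
import numpy as np

def self_attention(Q, K, V):
    # Naive single-head self-attention. Q, K, V have shape (n, d).
    # The score matrix below has shape (n, n): every token is compared
    # with every other token, so compute and memory grow with n**2.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V

# Doubling the sequence length quadruples the number of score entries:
for n in (1_000, 2_000, 4_000):
    print(f"{n:>5} tokens -> {n * n:>10,} attention scores")
```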

Even with very large but finite context windows, the computational requirements quickly become enormous. Every additional token increases not only the computing time but also the energy and hardware costs, and because attention cost grows quadratically rather than linearly, scalability runs into a practical limit.

In addition to the computations, each token requires memory to store its vector representation, intermediate activations and, during training, gradients. The memory requirement therefore grows at least linearly with the number of tokens; the attention score matrix itself grows quadratically. With an infinite context window, the memory requirement would be unbounded. Even in an optimized system that stores only relevant information, processing arbitrarily long texts would run into the memory limits of modern hardware.
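
A back-of-the-envelope calculation shows the scale of the problem at inference time, where each processed token leaves behind key and value vectors (the KV cache). The model dimensions below are illustrative assumptions, not the specification of any real model:

```python
# KV-cache memory for a hypothetical transformer (illustrative numbers).
n_layers, n_heads, head_dim = 32, 32, 128
bytes_per_value = 2                                   # fp16
# Each token stores one key and one value vector per head and layer:
per_token = n_layers * n_heads * head_dim * 2 * bytes_per_value

for n_tokens in (8_000, 128_000, 1_000_000):
    gb = n_tokens * per_token / 1e9
    print(f"{n_tokens:>9,} tokens -> ~{gb:,.0f} GB of KV cache")
```

Even long before the context becomes “infinite”, the cache alone reaches hundreds of gigabytes.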

An infinite context window would also lead to a loss of efficiency, since not all information in a long text is equally relevant. Language models rely on prioritizing the information most important to the current context. With an infinite context, the model would inevitably process irrelevant or redundant information, which would not only reduce processing speed but could also hurt the precision of the results, since irrelevant information can distort the model's predictions.

The architecture of current transformer models is designed for finite input lengths, because the networks are trained with fixed maximum sequence lengths. An infinite context window would require entirely new architectures that can efficiently store and retrieve information over long spans. Approaches such as recurrent models, memory-augmented networks or linear attention mechanisms could offer potential solutions here, but they too have practical limitations and are not yet mature enough for widespread use.

A narrow context window is a hindrance in many areas, especially in science. In scientific and legal contexts, it is often necessary to analyze long documents that refer to one another. A narrow context window is also a problem when processing dialogues, for example in chatbots or customer-service applications, because the model can forget earlier parts of the conversation. In medicine, large amounts of data are processed, including patient files, research reports and diagnostic information. And in technical fields such as software development or engineering, it is crucial to fully understand long instructions, code or manuals.

As we can see, a long context window is essential for numerous areas of work, and its importance should not be underestimated.

How could the problem be solved and what future can we expect?

The holy grail of the context question is, of course, a context window without limits: one that can remember everything and take in content of any length.

Former Google CEO Eric Schmidt recently said that we will soon reach this holy grail and solve the problem. Specifically, he said in an interview:

âťť

“To me the question is what happens next. And there are three things that are happening this year. The first is infinite context-window.”

Eric Schmidt, Ex-Google CEO

However, he did not explain how this could be realized. I will therefore outline five approaches below that are grounded in current research and address the problem directly; in other words, ways in which we might solve this dilemma in the long term.

1. Improved model architectures and new attention mechanisms

One important approach is to adapt the underlying model architectures themselves. Let us recall that the classic transformer model scales the computational load quadratically with the number of tokens, which reaches practical limits after a certain length (see above). To break through this barrier, research is being conducted into more efficient variants of the attention mechanism, such as “linear” or “sparse” attention methods, which require significantly less computing power per additional token.
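
To illustrate the linear-attention idea, here is a minimal sketch in the spirit of kernel-based methods such as Katharopoulos et al. (2020); the feature map and shapes are simplifying assumptions, not a production implementation:

```python
import numpy as np

def phi(x):
    # Positive feature map; elu(x) + 1 is one common choice.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # Softmax attention materializes the (n, n) matrix softmax(Q K^T).
    # Linear attention replaces the softmax with a kernel and regroups
    # the matrix product:  phi(Q) @ (phi(K)^T @ V).
    # phi(K)^T @ V has shape (d, d_v) independent of n, so the cost
    # grows linearly with the number of tokens.
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V                         # (d, d_v) summary of all tokens
    z = Kf.sum(axis=0)                    # (d,) normalization term
    return (Qf @ kv) / (Qf @ z)[:, None]

n, d = 4_096, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
print(linear_attention(Q, K, V).shape)    # (4096, 64); no (n, n) matrix built
```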

New architectures such as the Recurrent Memory Transformer (RMT) or models based on hierarchical structures attempt to divide longer sequences into logically segmented sections so that information remains accessible over long spans. Perceiver models and Hyena architectures are further examples of research directions that promise to expand the context significantly by changing the way the model stores and processes information.

2. Externalization of the context: Storage and retrieval approaches

Instead of packing all the knowledge into a single model, the context can also be outsourced to external storage, such as databases or knowledge graphs, which the model accesses as needed. This is called retrieval-augmented generation (RAG): the language model retrieves relevant text passages from external documents before formulating its actual response. A semantic search system (similar to an intelligent search engine) helps to find those passages from large collections of texts that are relevant to the current question.
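
The following is a minimal, self-contained sketch of this retrieve-then-prompt pattern. The hashed bag-of-words “embedding” is a deliberately crude stand-in for a real embedding model, used only so the whole flow stays runnable:

```python
import numpy as np

def embed(text, dim=256):
    # Toy embedding: hashed bag-of-words. A real system would use a
    # trained embedding model; this stand-in keeps the sketch runnable.
    v = np.zeros(dim)
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

documents = [
    "The context window limits how many tokens a model can attend to.",
    "Attention cost grows quadratically with sequence length.",
    "RAG retrieves relevant passages before the model answers.",
]
doc_vecs = np.stack([embed(d) for d in documents])

def retrieve(question, k=2):
    scores = doc_vecs @ embed(question)          # cosine similarity
    best = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in best]

question = "Why is long-context attention expensive?"
prompt = "Context:\n" + "\n".join(retrieve(question)) + f"\n\nQuestion: {question}"
print(prompt)   # this prompt would then be sent to the language model
```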

This strategy makes it possible to operate a more compact model with a relatively small core window, while an almost unlimited amount of information is available in the background. Companies such as OpenAI, Google DeepMind, Meta and Anthropic are researching such hybrid approaches and developing systems in which the model only computes extensively when it is really necessary. This could make complex document analyses or longer conversations possible in the future without loss of coherence.

3. Segmentation, chunking and pre-processing

Another pragmatic method is to break up longer texts into digestible sections. These “chunks” are processed individually, and the model is given a summary of the previous findings before moving on to the next section. Such techniques may be less elegant than a truly infinite context window, but they already make it possible to process long documents relatively efficiently. With clever segmentation strategies, such as semantic chapter markers, the results can be optimized to such an extent that the loss of contextual information is minimized. Although this is basically only a workaround, it is currently widely used because it is comparatively easy to implement.
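
A sketch of this chunk-and-summarize loop follows; llm_summarize is a hypothetical placeholder for whatever LLM call is available, not a specific API:

```python
def chunk(text, max_words=1000):
    # Split text into word-based chunks. Real systems would use the
    # model's own tokenizer and semantic boundaries (e.g. chapter markers).
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def summarize_long_document(text, llm_summarize):
    # llm_summarize(prompt) stands in for any LLM call. Each chunk is
    # processed together with a running summary of everything read so far.
    summary = ""
    for section in chunk(text):
        prompt = (f"Summary so far:\n{summary}\n\n"
                  f"New section:\n{section}\n\n"
                  "Update the summary to include the new section.")
        summary = llm_summarize(prompt)
    return summary
```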

4. Future visions: dynamic contexts and new models

In the long term, researchers envision models that “understand” which information is relevant in the long run and which they can discard. A kind of dynamic memory could emerge in which the model keeps a continuous internal log and only delves deeper into older passages when necessary. This would create a context window that is no longer rigidly limited but “grows” with the content.

Companies like Anthropic and Google DeepMind are working on models that adapt better and interact with external memory, while startups and research labs worldwide pursue architectural innovations in which context length is no longer a rigid bottleneck. Eric Schmidt's optimistic outlook, with its talk of an “infinite context”, reflects the expectation that we will probably soon see models that can dynamically tap new sources of information, manage attention sparingly and use contextually relevant content over the long term.

5. Advances in hardware and infrastructure

In addition to purely algorithmic improvements, technological advances in hardware are also crucial. More powerful chips, specialized AI accelerators and more efficient memory technologies could widen the bottleneck. Better infrastructure, such as faster connections between the model and external storage, enables the almost seamless integration of huge amounts of data.

In combination with improved algorithms, these hardware advances will cause the practical limit of the context length to steadily increase until it is effectively perceived as “infinite” for many applications.

Conclusion

“Previously, Gemini could process up to 32,000 tokens at once, but 1.5 Pro — the first 1.5 model we’re releasing for early testing — has a context window of up to 1 million tokens — the longest context window of any large-scale foundation model to date. In fact, we’ve even successfully tested up to 10 million tokens in our research. And the longer the context window, the more text, images, audio, code or video a model can take in and process.”

Google, announcing Gemini 1.5

The developments around the context window illustrate both the enormous progress in the area of large language models and the technical limits that still exist. Although the limitation to a finite number of tokens leads to practical challenges, especially in scientific and analytical fields of application, the rapid change in technologies shows that we are only at the beginning of a new era. Increasingly efficient model architectures, external storage and retrieval systems, and segmentation-based approaches make it clear that the usable context can be expanded step by step. This opens up new horizons for the use of AI systems in research, medicine, law, and literary analysis, without losing sight of comprehensible, application-oriented solutions.

Looking ahead, it is likely that in the coming years we will see models that can hold even the largest volumes of text in context without sacrificing coherence or quality. The combination of improved hardware, innovative attention mechanisms and intelligent storage structures will make it possible to realize the “holy grail” of an almost unlimited context. This development is encouraging, because it indicates that ever deeper, more comprehensive and more user-friendly analysis of complex data will be possible in the future.

—

Subscribe to FF Daily to get the next article in this series, The Context Window Dilemma | Part III.

Kim Isenberg

Kim studied sociology and law at a university in Germany and has been fascinated by technology for many years. Since the breakthrough of OpenAI's ChatGPT, Kim has sought to examine the influence of artificial intelligence on our society from a scientific perspective.
