Scale Is All You Need? Part 2-1

Note: If you haven’t seen Part 1, you can read it here.

Introduction

“In three words: deep learning works” 

Sam Altman

Welcome to the second part of this series. In the first part, we defined the basics of AGI (Artificial General Intelligence) and explained why a common understanding is crucial for the development process - even if there is still no uniform agreement in the scientific community, as many researchers emphasize: “This paper endeavors to summarize the minimal consensus of the community, consequently providing a justifiable definition of AGI. It is made clear what is known and what is controversial and remains for research, so as to minimize the ambiguous usages as much as possible in future discussions and debates.”

In this section, we take a look back at the path to AGI by tracing developments to date and at the same time providing an outlook on future progress. The review helps to better understand key milestones, while the outlook shows what challenges still lie ahead.

Let's remember the definitional criteria for AGI that I have developed:

“General expert-level performance (1), self-learning (2) and multimodality (3), in conjunction with autonomous (agentic) action, appear to have emerged as the essential criteria for AGI in the broader discussion. That is, the ability of a model to solve problems independently at human expert level without necessarily having been trained on them beforehand, and to draw conclusions for its further actions.”

AGI differs from specialized artificial intelligence in that it is capable of solving a variety of tasks that are not explicitly pre-programmed.  

In this section, we take a historical look at the beginnings of modern AI and highlight the key breakthroughs that have brought us closer to this goal. We then analyze the current state of the technology and discuss what conditions must be met in order to realize AGI. I will draw on my personal assessment as well as empirical data and findings to illustrate current progress.

We start with a look at the origins of machine learning and the famous Turing test, which has long been considered one of the most important benchmarks for artificial intelligence. We then look at current data to evaluate technological progress and derive the need for further development.

In sum, this part of the series traces the development path to AGI to date, analyzes the current state and derives likely future developments from the insights gained.

In the third part, I will highlight the key challenges that act as bottlenecks for the development of an AGI and the areas in which intensive research is required. Finally, the fourth part will provide an outlook on how an AGI could be made accessible to the majority of humanity.

1.1 When and where it all began

In order to understand what is still needed to achieve AGI, it is essential to know our current state of development. To do this, it is worth taking a brief look back at the history of the development of modern AI to understand the key advances and milestones. This review helps us to recognize the decisive criteria for its significance and relevance. The focus here is primarily on the models of OpenAI, which can be used as examples to illustrate the developments.

The history of modern artificial intelligence and its current form as large language models is the result of decades of research and innovation. This development began with simple models and led to complex systems such as GPT-4, which are revolutionizing the way we interact with technology. Looking back, important milestones can be recognized.

The first language models, known as N-gram models, were based on the frequency of word sequences in a text. They were able to recognize simple patterns in language, but were severely limited for longer texts as they could not store any context beyond the immediate word sequence. The introduction of Recurrent Neural Networks (RNNs) in the 1980s made it possible to make better use of information from earlier parts of a text. However, RNNs also had their limitations when processing longer sequences, as they tended to “forget” important information.
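To make the n-gram idea concrete, here is a minimal bigram sketch in Python; the toy corpus and function name are my own illustration, not taken from any historical system. The model does nothing more than count how often one word follows another.

```python
from collections import defaultdict, Counter

# Toy corpus -- purely illustrative
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count how often each word follows each preceding word (bigram counts)
bigram_counts = defaultdict(Counter)
for prev, curr in zip(corpus, corpus[1:]):
    bigram_counts[prev][curr] += 1

def next_word_probs(prev_word):
    """Relative frequencies of the words seen after prev_word."""
    counts = bigram_counts[prev_word]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

print(next_word_probs("the"))  # e.g. {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}
print(next_word_probs("sat"))  # {'on': 1.0} -- no context beyond the immediately preceding word
```

The last line shows exactly the limitation described above: the prediction depends only on the immediately preceding word, so any longer-range context is lost.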

A decisive turning point in the development of modern AI was the “Transformer” model presented in 2017. This model, described in the famous paper “Attention is All You Need”, revolutionized language processing by relying on a new type of attention mechanism. Transformer models can process all the words in a text at the same time and thus better capture the context of an entire sentence or paragraph. This made them extremely efficient and powerful for a wide range of language processing applications.
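For readers who want to see the core mechanism, the following is a minimal NumPy sketch of the scaled dot-product attention described in “Attention is All You Need”; the toy dimensions and variable names are arbitrary choices of mine, not taken from any particular implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in the Transformer paper."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # every position compared with every other position
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ V                              # each output mixes information from all positions

# Toy "sentence" of 4 tokens, each represented as an 8-dimensional vector
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)         # self-attention: queries, keys and values all come from x
print(out.shape)                                    # (4, 8)
```

Because the score matrix relates every position to every other position in a single step, the model captures context across the whole sequence at once instead of passing information along word by word, as an RNN does.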

Based on the Transformer architecture, OpenAI developed the GPT series (Generative Pre-trained Transformers). GPT-3, released in 2020, consisted of 175 billion parameters and represented a huge step forward. These models are pre-trained on enormous amounts of text data, allowing them to generate astonishingly coherent, context-aware text. Such models are now known to be able not only to hold simple conversations, but also to perform complex tasks such as writing articles, programming or answering questions.

The possible applications of LLMs are very diverse and have developed considerably in recent years: from chatbots and virtual assistants to automatic text generation and machine translation (e.g. DeepL, developed in Germany, which is also excellent at reformulating texts). A well-known example is Google Translate, which, thanks to LLMs, is able to deliver translations of near-human quality for more than 100 languages (and the number of languages is regularly being expanded).

For the sake of completeness, it should also be mentioned that completely different approaches to AGI are also being pursued, such as Human Brain Emulation, AIXI and Integrated Cognitive Architecture. However, their explanation would go beyond the scope of this article.

 

1.2 The Turing Test

The Turing test, proposed by the British mathematician and computer scientist Alan Turing in his 1950 essay “Computing Machinery and Intelligence”, was originally intended as a criterion for assessing whether a machine exhibits human-like intelligence. In this test, a human questioner attempts to determine whether they are communicating with a human or a machine by asking a series of questions. If the machine can deceive the questioner in a significant percentage of cases, it is considered intelligent according to the test.

“The idea, roughly, is that if an interrogator were unable to tell after a long, free-flowing and unrestricted conversation with a machine whether she was dealing with a person or a machine, then we should be prepared to say that the machine was thinking.”

The Turing test has often been used as a benchmark for the concept of artificial general intelligence. Remember: AGI refers to a machine/entity that can perform any intellectual task that a human can perform (and, depending on the definition, act autonomously/agentically). If a machine passes the Turing test, it could be argued that it has reached a level of intelligence equivalent to that of a human and is therefore AGI. Of course, we have since moved away from the idea that the Turing test could be an essential touchstone for general intelligence. After all, many LLMs today can write and speak in such a human way that the majority of models could probably pass the Turing test in a blind trial. Nevertheless, for many years the Turing test was regarded as the benchmark par excellence.

Accordingly, the Turing test has been fundamentally criticized over time and is no longer considered a meaningful touchstone. Instead, tests building on the Turing test were developed that now serve as more precise and more varied probes of artificial general intelligence. One example is the “Lovelace 2.0 Test”, which assesses creative ability in language and images: unlike the Turing test, it does not merely ask whether a machine can act in a human-like manner, but explicitly focuses on creativity and the ability to create something new and surprising, which is considered a sign of deeper intelligence. Another is the “Winograd Schema Challenge”, a test of common-sense reasoning (everyday knowledge and logical inference) in which a machine must resolve an ambiguous pronoun. The classic example: “The trophy doesn’t fit in the suitcase because it is too big.” The machine must decide what “it” refers to.

The Turing test was a milestone in the history of AI and triggered important discussions. Today, however, it is no longer recognized as a sufficient or suitable criterion for AGI. The complexity of AGI requires more comprehensive evaluation methods that go beyond the simple imitation of human communication. However, this historical view shows that, on the one hand, the idea of what constitutes authentic and convincing artificial intelligence has shifted over the decades and, on the other hand, the measurement methods have become increasingly complex. As the complexity of the tests and the demands on artificial intelligence increased, so did the need for computing power. 

Last but not least, this historical view shows that every time new models surpassed the existing tests, the requirements were shifted upwards. Historically, we can therefore conclude that the demands placed on the models themselves were regularly revised upwards, and that exceeding our own expectations in turn made new, harder tests necessary.

2. State of play and outlook for the future: Where are we right now?

As already outlined in the first part, today's LLMs are referred to as “narrow AI”. Even though some prominent figures such as Elon Musk claim to have already seen at least the beginnings of veritable AGI in GPT-4, there is still widespread agreement that AGI has not yet been achieved and that today's models are at most a preliminary stage on the way to AGI. Hence the claim, by now repeated almost as a slogan, that all that is needed to achieve general intelligence is scale; the biggest challenges are out of the way. This is why Sam Altman is now convinced:

“This may turn out to be the most consequential fact about all of history so far. It is possible that we will have superintelligence in a few thousand days (!); it may take longer, but I’m confident we’ll get there.”

Currently, there are numerous large language models with hundreds of billions to trillions of parameters. They are typically multimodal (multimodal refers to their ability to process different types of information or “modalities”, such as text, image, audio, video, etc.) and correspond to narrow intelligence on the Google DeepMind scale. The models are getting better and better at reasoning and, as has now been emphasized several times by various parties, there is no end in sight to the improvement of the models through scaling alone. Mark Zuckerberg recently expressed this view in an interview with YouTuber Cleo Abram:

"With past AI architectures, you could feed an AI system a certain amount of data and use a certain amount but eventually it hit a plateau. And one of the interesting things about these new transformer-based architectures over the last 5-10 years is that we haven't found the end yet. So that leads to this dynamic where Llama 3, we could train on 10.000-20.000 GPUs; Llama 4 we could train on more than 100.000 GPUs; Llama 5 we can plan to scale even further and there's just an interesting question of how far that goes. It's totally possible that at some time we hit a limit. And just like previous systems, there's an asymptote and it doesn't keep on growing. But it's also possible that that limit is not going to happen anytime soon. And that we're going to be able to keep on building more clusters and generating more synthetic data to train the systems, and that they just are going to keep on getting more and more useful for people for quite a while to come and it's a really big and high stakes question, I think, for the companies because we're basically making these bets on how much infrastrucutre to build out for the future. (...) So I'm clearly betting that this is going to keep scaling for a while."

Leopold Aschenbrenner outlined the development of models using OOMs (OOM = order of magnitude; 10x = 1 order of magnitude) as follows in his remarkable blog “Situational Awareness”:

“We are racing through the OOMs extremely rapidly, and the numbers indicate we should expect another ~100,000x effective compute scaleup—resulting in another GPT-2-to-GPT-4-sized qualitative jump—over four years. Moreover, and critically, that doesn’t just mean a better chatbot; picking the many obvious low-hanging fruit on “unhobbling” gains should take us from chatbots to agents, from a tool to something that looks more like drop-in remote worker replacements.”  
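As a quick back-of-the-envelope check on the arithmetic (my own illustration, not Aschenbrenner's code): one OOM is a factor of 10, so a ~100,000x compute scaleup corresponds to roughly five OOMs.

```python
import math

scaleup = 100_000              # Aschenbrenner's ~100,000x effective-compute estimate
ooms = math.log10(scaleup)     # 1 OOM = one factor of 10
print(ooms)                    # 5.0 -> roughly five orders of magnitude over ~four years
print(10 ** ooms)              # 100000.0, back to the original factor
```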

At the same time, however, further development presents us with a challenge, because this scaling demands exponentially more compute and energy.

“Analyzing the data from various AI systems, we can observe an exponential growth in the computation required for both training and inference. This growth has profound implications for energy usage and hardware costs, which are critical factors in the sustainability and accessibility of AI technologies.”

So we find ourselves in a difficult situation. On the one hand, it is said with surprising unanimity that we can create better models simply by scaling them up; on the other hand, the resources required to train and run these models are growing exponentially. In the following, I will examine the question of scaling in more detail and substantiate the theses with empirical material.

2.1 What do we need?

“We can say a lot of things about what may happen next, but the main one is that AI is going to get better with scale, and that will lead to meaningful improvements to the lives of people around the world.”

Sam Altman 

In his recently published essay “The Intelligence Age”, Sam Altman takes a look into the future and estimates when we can expect artificial superintelligence (see part 1 of my series). Sam is certain: “Deep learning works.” The key to even better AI, in his view, is scale. So let's take a quick look at what exactly scale is.

The term “scale” in the context of artificial intelligence and machine learning refers to the size of a model and the amount of resources invested in its development. When we say that a model “scales more”, we are talking about a combination of:

  1. Larger models (more parameters): This means that the neural network has more neurons and connections (parameters).

  2. More data: The model is trained with a larger amount of training data.

  3. More computing power: More computing power is used to train the model.

Re 1: Every neural network consists of neurons and connections between them, which are described by parameters (weights and biases). The more parameters a model has, the more complex it can be. More parameters allow the model to learn more complicated patterns and relationships in the data. It can thus recognize more subtle differences and solve more complex tasks.
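A small sketch makes the bookkeeping behind “parameters” tangible; the layer sizes below are arbitrary and chosen purely for illustration, not taken from any real model.

```python
# Parameter count of a small fully connected network (weights + biases per layer).
layer_sizes = [512, 2048, 2048, 512]

total_params = 0
for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
    weights = n_in * n_out   # one weight per connection between the two layers
    biases = n_out           # one bias per neuron in the receiving layer
    total_params += weights + biases

print(f"{total_params:,} parameters")  # 6,296,064 -- GPT-3's 175 billion is roughly 28,000x larger
```

Every extra layer or wider layer multiplies the number of weights, which is why parameter counts climb so quickly as models are scaled up.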

Re 2: A large model requires a large amount of data to be trained effectively. More data means that the model sees a wider and deeper range of examples and thus better understands the underlying patterns.

With more data, the model can learn to generalize better and not just memorize. This means it can also apply its findings to new, unknown data.

Re 3: Larger models and larger data sets require more computing power in order to be trained in an acceptable amount of time. With more computing power, more complex models can be trained more effectively. Faster compute also allows for experimentation and fine-tuning of the model, leading to better final results. Moreover, compute is an essential aspect of better inference. Remember: inference is the process of applying a trained model to new, unknown data to make a prediction or classification. It should be borne in mind that, in addition to performance, the energy consumption of chips is increasingly coming into focus.
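To illustrate the training/inference distinction in code, here is a minimal sketch using scikit-learn on synthetic toy data; the dataset and model choice are my own illustration and have nothing to do with LLM-scale systems.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Training: the compute-intensive phase in which the model's parameters are fitted to known data.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 4))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)  # toy labels
model = LogisticRegression().fit(X_train, y_train)

# Inference: applying the already-trained model to new, unseen data.
# Each call still costs compute -- which is why inference efficiency matters at scale.
X_new = rng.normal(size=(3, 4))
print(model.predict(X_new))  # class predictions for the unseen examples
```

The same split applies to LLMs: training happens once (at enormous cost), while inference happens every time a user sends a prompt, so its cost and energy footprint add up continuously.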

“In 15 words: deep learning worked, got predictably better with scale, and we dedicated increasing resources to it. (...) If we want to put AI into the hands of as many people as possible, we need to drive down the cost of compute and make it abundant (which requires lots of energy and chips). If we don’t build enough infrastructure, AI will be a very limited resource that wars get fought over and that becomes mostly a tool for rich people.” 

Sam Altman

Researchers from Google DeepMind, such as Denny Zhou, have also argued that scale is the essential mechanism and that there is no upper limit.

Some AI experts even go so far as to claim that scale alone is enough to achieve AGI, while others are skeptical:

“It is unclear whether ever more parameters and ever larger models will actually deliver better performance. Some AI experts believe that it is possible to create artificial general intelligence (AGI) simply by scaling up. However, many scientists are extremely critical of this. Meta's head of AI research, Naila Murray, also said in an interview with heise online that she believes other types of AI models are needed to create agents and ultimately a kind of intelligence.”

This second part of the AGI series has been split into two further parts due to its length. In Part 2-2, we will continue with the question of training data and develop conclusions based on empirical material.

Subscribe to the Forward Future Newsletter to have it delivered straight to your inbox.

About the author

Kim Isenberg
