👾 Peak Data - Part 1
AI is at a turning point. For years, the mantra was simple: more data equals better models. But now, we’ve reached the limits of real-world data. Enter synthetic data—artificially generated datasets that fuel AI without the constraints of privacy laws or data scarcity.
“This is where Artificial Intelligence (AI) steps in as the modern-day alchemist. Much like the ancient alchemists who sought to transform base metals into gold, AI has the ability to convert raw, chaotic data into valuable insights that can drive business success. Through powerful algorithms and machine learning, AI can process vast amounts of data, identify patterns, and predict future trends with remarkable accuracy. By leveraging AI, businesses can automate the data transformation process, turning what was once an overwhelming challenge into a competitive advantage.”
In the ever-evolving world of artificial intelligence, data is the basis for progress and innovation. Traditionally, AI models rely on large amounts of real data to recognize patterns and make predictions. However, the acquisition and use of such data often present significant challenges, whether due to data protection regulations, high costs or the simple inaccessibility of certain data types. In this context, synthetic data is becoming increasingly important. It offers the possibility of creating artificially generated data sets that resemble real data but are free from the aforementioned restrictions.
An outstanding example of the successful use of synthetic data is the DeepSeek-R1 model from the Chinese company DeepSeek. This model was developed in part with the help of synthetic data and has shown that such data can serve not only as a supplement to real data but, in certain contexts, even as a substitute for it. In this article, we will take a closer look at the creation and application of synthetic data, discuss the specific methods and challenges involved in creating it, and, using various models as examples, examine the extent to which data is essential for training AI models and why synthetic data is a necessary addition for addressing the limits of available data.
Why Do We Need Data?
“Data is the fossil fuel of A.I. We’ve achieved peak data and there will be no more.” – Ilya Sutskever
Data is the foundation of any modern artificial intelligence. Without data, there are no patterns, no associations, no possibility for a model to recognize connections and generalize. In recent years, pre-training has become established as the standard method for developing powerful AI models. Pre-training refers to the initial phase in which a model is trained with large amounts of data before it is fine-tuned for specific tasks. This process enables a model to acquire a broad base of knowledge and contextual understanding that can then be further refined for specific use cases. Pre-training became particularly well known through Large Language Models (LLMs), i.e. huge neural networks that have millions to billions of parameters.
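To make these two phases concrete, here is a minimal, purely illustrative sketch in PyTorch (my own stand-in, not the setup of any specific model): a tiny network is first trained on unlabeled token sequences with a next-token-prediction objective, which is the pre-training phase; fine-tuning would then continue training the same weights on a much smaller, task-specific dataset.

```python
# Minimal sketch of the two training phases described above, using PyTorch.
# The tiny model and random token data are illustrative stand-ins, not a real LLM setup.
import torch
import torch.nn as nn

VOCAB = 1000  # toy vocabulary size

class TinyLM(nn.Module):
    def __init__(self, vocab=VOCAB, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)  # logits over the next token at every position

model = TinyLM()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Phase 1: "pre-training" -- next-token prediction on a large, unlabeled corpus
# (here: random token sequences standing in for web-scale text).
for step in range(100):
    batch = torch.randint(0, VOCAB, (8, 32))          # [batch, sequence]
    logits = model(batch[:, :-1])                     # predict each following token
    loss = loss_fn(logits.reshape(-1, VOCAB), batch[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()

# Phase 2: "fine-tuning" -- the same weights are trained further on a much
# smaller, task-specific dataset (same loop, different data, usually a lower learning rate).
```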
The role of data in pre-training should not be underestimated. Data enables a model to learn general language structures, semantic relationships and even world knowledge by processing countless examples. The more data available, the more precise the model's generalizations become and the further the model can scale. The scaling approach of recent years was clear: more parameters, more data, better models. This principle was confirmed by empirical observations – models trained with huge amounts of data performed better in benchmarks than smaller models with limited data access. This led to a veritable explosion in model size: while GPT-3 already had 175 billion parameters, models like GPT-4 or DeepSeek-R1 went even further. Models with trillions of parameters are no longer purely theoretical.
[Figure: Dr. Alan Thompson, lifearchitect.ai]
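This empirical "more parameters, more data" relationship has also been quantified in scaling-law studies. The sketch below illustrates the functional form popularized by the Chinchilla work (Hoffmann et al., 2022), where pre-training loss falls off as a power law in both parameter count and token count; the constants here are rough placeholders chosen for illustration, not the paper's exact fitted values.

```python
# Illustrative only: power-law scaling of pre-training loss in parameter count (N)
# and training tokens (D), in the style of the Chinchilla paper. The constants
# are rough placeholders, not exact fitted values.

def approx_loss(n_params: float, n_tokens: float,
                E: float = 1.7, A: float = 400.0, B: float = 410.0,
                alpha: float = 0.34, beta: float = 0.28) -> float:
    """Approximate pre-training loss L(N, D) = E + A / N^alpha + B / D^beta."""
    return E + A / n_params**alpha + B / n_tokens**beta

# More parameters *and* more tokens keep lowering the loss, with diminishing returns:
for n, d in [(7e9, 140e9), (70e9, 1.4e12), (700e9, 14e12)]:
    print(f"N={n:.0e}, D={d:.0e} -> loss ~ {approx_loss(n, d):.3f}")
```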
Why does a larger amount of data help to make a model better? The reason for this lies in the nature of neural networks and their parameters. Parameters are the numerical values within a neural network that are adjusted by training to achieve optimal predictive power. A neural network with few parameters has a limited capacity to capture complex patterns. It can recognize simple relationships, but quickly reaches its limits when it comes to understanding complex linguistic or logical concepts. More parameters mean more memory capacity for the model, allowing it to recognize deeper and more abstract patterns in the data. But parameters alone are not enough – they need to be trained on enough data to be useful. Without large amounts of data, a model with many parameters could become “overfitted” to the specific patterns in its limited training set and would struggle to respond appropriately to new inputs.
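A toy regression makes this point tangible (a simple stand-in of my own, not an experiment with language models): a polynomial with almost as many coefficients as training points memorizes the noisy training data and tends to generalize worse than a simpler fit.

```python
# Toy illustration of the overfitting risk described above: a model with (almost)
# as many free parameters as training examples memorizes the training set,
# including its noise, and then generalizes poorly to new points.
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def noisy_target(x):
    return np.sin(3 * x) + rng.normal(0, 0.2, size=x.shape)

x_train = np.sort(rng.uniform(-1, 1, 10))   # only 10 training examples
y_train = noisy_target(x_train)
x_test = np.linspace(-1, 1, 200)            # unseen evaluation points
y_test = noisy_target(x_test)

for degree in (3, 9):                        # few vs. many parameters
    model = Polynomial.fit(x_train, y_train, degree)
    train_mse = np.mean((model(x_train) - y_train) ** 2)
    test_mse = np.mean((model(x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.4f} | test MSE {test_mse:.4f}")

# The degree-9 fit drives the training error toward zero while typically
# performing worse on the unseen points than the simpler degree-3 fit.
```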
In recent years, however, it has become increasingly clear that this pre-training is reaching its limits. The idea that “the more data, the better” always applies is beginning to crumble. The reason is simple: the world does not have an infinite amount of high-quality data. Even now, many of the largest language models have been trained with almost all of the high-quality texts available on the internet. This means that further progress can no longer be achieved by adding new real data alone, as this simply no longer exists in sufficient quality. This is where synthetic data comes in. It offers a way to generate new training data that is similar to real data but comes from artificial sources. This makes it possible to extend pre-training beyond its previous limits by using additional, specifically generated data to further improve models.
Ilya Sutskever, co-founder of AI labs Safe Superintelligence (SSI) and OpenAI, told Reuters recently that results from scaling up pre-training - the phase of training an AI model that uses a vast amount of unlabeled data to understand language patterns and structures - have plateaued. “The 2010s were the age of scaling, now we're back in the age of wonder and discovery once again. Everyone is looking for the next thing,” Sutskever said.
Scaling the right thing matters more now than ever.
Another problem is that the training effort grows dramatically with model size. While a model with a few billion parameters can still be trained in a reasonable time, a model with several trillion parameters requires immense computing resources, storage space and energy. Training such models can take weeks or months and cost hundreds of millions of dollars (and counting). This pushes even the largest AI companies to their limits. As a result, alternatives are being sought – including techniques such as sparse training, model compression or the targeted use of synthetic data. (Last year, in my article series “Scale is all you need?”, I reported in detail, with empirical data, on the exponentially growing demand for compute and power as models scale.)
To summarize, data is of central importance in the field of AI, especially for pre-training large language models. More data enables better generalization, deeper model capacity and ultimately more powerful AI systems. However, this paradigm of limitless growth is reaching practical and theoretical limits. The AI community is now faced with the challenge of finding new ways to continue improving models without relying solely on the previous method of exponential data growth. Synthetic data is playing an increasingly important role here, as it offers a way to artificially expand the pool of training data while avoiding the challenges associated with data collection. The fact that, alongside synthetic data, approaches such as inference-time scaling also represent a new scaling law is set aside for now; I will return to it later.
The Importance of Synthetic Data in AI and the Global Players
“Data is the fossil fuel of A.I.,” said Sutskever while speaking at the Conference on Neural Information Processing Systems (NeurIPS) in Vancouver on Dec. 13. “We’ve achieved peak data and there will be no more.” This means that pre-training—the process of feeding models with mass amounts of information—“will unquestionably end,” added the researcher, who noted that A.I. developers are already looking into alternative solutions like synthetic data or models that improve responses by taking longer to think about potential answers.
Synthetic data is artificially generated information that is used to replicate real data. It is generated using a variety of techniques, including algorithmic approaches, simulations or the use of AI models themselves. The main advantage of synthetic data lies in its ability to provide large amounts of training material without violating data protection regulations or requiring the laborious collection of real data (an important aspect, especially for Germany and Europe).
“Synthetic data is created by Generative AI models trained on real world data samples. The algorithms first learn the patterns, correlations and statistical properties of the sample data. Once trained, the Generator can create statistically identical, synthetic data. The synthetic data looks and feels the same as the original data the algorithms were trained on. However the big advantage is that the synthetic data does not contain any personal information.”
Synthetic data also makes it possible to simulate rare or extreme scenarios that may not be adequately represented in real data sets. This is particularly valuable in areas such as autonomous driving, medicine or the financial industry, where certain events occur rarely but are nevertheless of great importance for modeling.
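As a small, hedged sketch of the generator idea described in the quote above, the example below fits a simple generative model to simulated “real” records and then samples new, statistically similar ones; the Gaussian mixture from scikit-learn and the toy two-column data are stand-ins for the far larger generative models and real datasets used in practice.

```python
# Minimal sketch of the synthetic-data generator idea: learn the statistical
# structure of real records, then sample brand-new records that follow the same
# distribution but correspond to no actual individual.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)

# Stand-in for real customer records: (age, income) with two natural clusters.
real = np.vstack([
    rng.normal([35, 40_000], [5, 8_000], size=(500, 2)),
    rng.normal([58, 90_000], [6, 15_000], size=(500, 2)),
])

# Fit a simple generative model to the "real" data ...
generator = GaussianMixture(n_components=2, random_state=0).fit(real)

# ... then draw synthetic records with the same statistical properties.
synthetic, _ = generator.sample(1_000)
print("real mean:     ", real.mean(axis=0).round(0))
print("synthetic mean:", synthetic.mean(axis=0).round(0))
```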
—
Ready for part 2? Subscribe to FF Daily for free!
Kim Isenberg – Kim studied sociology and law at a university in Germany and has been impressed by technology in general for many years. Since the breakthrough of OpenAI's ChatGPT, Kim has been trying to scientifically examine the influence of artificial intelligence on our society.