AI News Today
    Why AI Is Training on Its Own Garbage (and How to Fix It)


If you've been in AI for a while, you are probably an LLM, agent, or chat user. But have you ever asked yourself how these tools will be trained in the near future, and what happens if we have already used up the data we need to train them? Many researchers argue that we are running out of high-quality, human-generated data to train our models.

New content goes up every day, that's a reality, but an increasing share of what gets added is itself AI-generated. So if you keep training on public web data, you're eventually training on the outputs of your own predecessors: the snake eating its tail. Researchers call this phenomenon model collapse, where AI models learn from the errors of their predecessors until the whole system degrades into nonsense.
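To make the mechanism concrete, here is a toy simulation of the effect (my own illustration, not from any paper): a "model" fits a Gaussian to its training data, the next generation trains on samples drawn from that model, and because generative models underrepresent low-probability tails, the diversity of the data shrinks with every generation.

```python
import random
import statistics

def train_generation(samples):
    """'Train' a toy generative model: fit a Gaussian to the samples."""
    return statistics.mean(samples), statistics.stdev(samples)

def generate(mean, std, n, rng, clip=1.5):
    """Sample from the fitted model, but underrepresent the tails
    (mimicking how generative models favor high-probability regions)."""
    out = []
    while len(out) < n:
        x = rng.gauss(mean, std)
        if abs(x - mean) <= clip * std:  # rare, tail-ish values are dropped
            out.append(x)
    return out

rng = random.Random(42)
data = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # generation 0: "human" data
stds = []
for generation in range(5):
    mean, std = train_generation(data)
    stds.append(std)
    data = generate(mean, std, 5000, rng)  # next generation trains on outputs

print([round(s, 3) for s in stds])  # the spread shrinks every generation
```

Five generations are enough for the standard deviation to collapse to a fraction of its original value: the model converges on the "average" of its own outputs and forgets everything unusual.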

    But what if I told you we aren’t actually running out of data? We’ve just been looking in the wrong place.

In this article, I am going to break down the key insights from a brilliant recent paper that tackles exactly this problem.

The Web We Already Use and the Web That Matters

Most of us think of the web as a single source of information. In reality, there are at least two.

    There is the Surface Web: the indexed, public world like what we find on Reddit, Wikipedia, and news sites. This is what we’ve already scraped and overused for years to train the mainstream AI models of today. Then, there is what we call the Deep Web, and here I’m not talking about the “Dark Web” or anything illegal.

    The Deep Web is simply everything behind a login or a firewall. It refers to anything online that isn’t publicly indexed. It could be your hospital’s patient portal, your bank’s internal dashboard, enterprise document archives, private databases, and years of email sitting behind a login screen. Normal, boring, but incredibly valuable data.

Many studies suggest the Deep Web is orders of magnitude larger than the surface web. More importantly, the data tends to be of much higher quality. Surface web content can be noisy, full of misinformation, and heavily SEO-optimized, and it increasingly contains content deliberately designed to mislead or poison AI models. Deep web data, like medical records, verified financial documents, or other internal databases, tends to be clean, authenticated, and organized by people who care about its quality.

The problem? I think you can guess it: the data is private. You can't just extract a million medical records without causing legal and ethical catastrophes.

    The PROPS Framework

    This is where a new framework called PROPS (Protected Pipelines) comes in. Introduced by Ari Juels (Cornell Tech), Farinaz Koushanfar (UCSD), and Laurence Moroney (former Google AI Lead), PROPS acts as a bridge between this sensitive data and the AI models that need it.

    The brilliance of PROPS is that it doesn’t ask you to “hand over” your data. Instead, it uses Privacy-Preserving Oracles. Think of an oracle as a “trusted middleman” that can look at your data, verify it’s real, and then tell the AI model what it needs to know without ever showing the model the raw information.

The concept behind PROPS can sound magical, since it could solve many of the data-availability issues that AI models face today. But how does it work exactly? Let's take the example of a medical company that wants to train a diagnostic tool on real health records. Under the PROPS framework:

    1. Permission: As a user, you log into your own health portal and authorize a specific use for your data.
    2. The Oracle: Think of the Oracle as a digital notary. It goes to your private portal (like your hospital database) to verify that your data is real. Instead of copying your files, it simply tells the AI system: “I have seen the original documents, and I testify they are authentic.” It provides proof of the truth without ever handing over the private data itself. Tools already exist for this, like DECO. It’s a protocol that lets users prove that they pulled a specific piece of data from a web server over a secure TLS channel.
    3. The Secure Enclave: This is a “black box” inside the computer’s hardware where the actual training happens. We put the AI model and your private data inside and “lock the door.” No human or developer can see what is happening inside. The AI “studies” the data and leaves with only the model weights. The raw data stays locked inside until the session is over.
    4. The Result: The model trains on the data inside that box. Only the updated “weights” (the learning) come out. The raw data is never seen by human eyes.
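The four steps above can be sketched in code. Everything here is illustrative, not the actual PROPS implementation: the function names are hypothetical, the HMAC signature stands in for a real oracle proof (such as one produced by DECO from a TLS session), and the "enclave" is just a function that verifies attestations and releases only weights.

```python
import hashlib
import hmac

ORACLE_KEY = b"oracle-secret"  # hypothetical: the oracle's signing key

def oracle_attest(record: bytes, source: str) -> dict:
    """Step 2: the oracle verifies the record at its source and signs a
    claim about it, without releasing the record itself."""
    digest = hashlib.sha256(record).hexdigest()
    claim = f"authentic:{source}:{digest}".encode()
    sig = hmac.new(ORACLE_KEY, claim, hashlib.sha256).hexdigest()
    return {"claim": claim, "sig": sig}

def enclave_train(records: list[bytes], attestations: list[dict]) -> list[float]:
    """Steps 3-4: inside the (simulated) enclave, verify each attestation,
    train, and release only the model weights, never the raw records."""
    for att in attestations:
        expected = hmac.new(ORACLE_KEY, att["claim"], hashlib.sha256).hexdigest()
        assert hmac.compare_digest(expected, att["sig"]), "unverified record"
    # stand-in for training: 'weights' derived from the data
    weights = [sum(len(r) for r in records) / len(records)]
    return weights

# Step 1: the user authorizes a specific use of one record
record = b"patient-123: blood glucose 5.4 mmol/L"
att = oracle_attest(record, "hospital-portal")
weights = enclave_train([record], [att])
print(weights)  # only the weights leave the "enclave"
```

In the real framework, the enclave is hardware-backed (a TEE), so even the operator of the machine can't peek at the records during training; this sketch only shows the shape of the data flow.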

    The contributor knows exactly what they’re agreeing to, and they can be rewarded for participating in a way that’s calibrated to how valuable their specific data actually is. It’s a genuinely different relationship between data owners and AI systems.

    But why bother with this instead of Synthetic Data?

    Some might ask: “Why bother with this complex setup when we can just generate synthetic data?”

    The answer is that synthetic data is a diversity killer. By definition, synthetic data generation reinforces the middle of the bell curve. If you have a rare medical condition that affects only 0.01% of the population, a synthetic data generator will likely smooth you out as “noise.”
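A tiny, deterministic illustration of this smoothing effect (my own toy example): a synthetic-data generator that learns category frequencies but treats anything below a 1% threshold as noise will erase a rare condition entirely, even though it was present in the real data.

```python
from collections import Counter

def synthetic_generator(training_records, n, min_freq=0.01):
    """A toy synthetic-data generator: learn category frequencies, but drop
    anything below min_freq as 'noise' -- the bell-curve effect described above."""
    counts = Counter(training_records)
    total = len(training_records)
    kept = {cat: c / total for cat, c in counts.items() if c / total >= min_freq}
    scale = sum(kept.values())
    # deterministic sampling proportional to the kept frequencies
    out = []
    for cat, freq in sorted(kept.items()):
        out.extend([cat] * round(n * freq / scale))
    return out

# 999 common records and 1 patient with a rare condition (0.1% of the data)
training = ["common"] * 999 + ["rare_condition"]
synthetic = synthetic_generator(training, n=1000)

print(Counter(training)["rare_condition"])   # 1 -- present in the real data
print(Counter(synthetic)["rare_condition"])  # 0 -- smoothed out entirely
```

Real generators don't use a hard frequency cutoff, but the direction of the failure is the same: whatever is rare in the training data becomes even rarer, or absent, in the synthetic copy.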

Models trained on synthetic data become progressively worse at serving outliers. PROPS solves this by creating a secure way for real people with rare conditions or unique backgrounds to opt in. It turns data sharing from a privacy risk into a data marketplace, where valuable data gets the compensation it deserves.

It’s not just about training; inference matters too

    Most discussions focus on training, but PROPS has an equally interesting application on the inference side.

For example, getting a loan today involves a lot of document submission: bank statements, pay stubs, and tax returns. In a PROPS-based system, the researchers suggest using a Loan Decision Model (LDM):

    1. You authorize the LDM to talk directly to your bank.
    2. The bank confirms your balance via a privacy-preserving oracle.
    3. The LDM makes a decision.
    4. The result? The lender gets a verified “Yes” or “No” without ever touching your private documents. This eliminates the risk of data leaks and makes it nearly impossible for people to use fraudulent, photoshopped documents.
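The steps above can be sketched as follows. All names here are hypothetical, and the shared-key HMAC is a stand-in for a real cryptographic oracle proof; the point is only the shape of the flow, where the lender sees a verified predicate instead of raw documents.

```python
import hashlib
import hmac

BANK_KEY = b"bank-oracle-key"  # hypothetical key known to the verifier in this toy

def bank_oracle(balance: int, threshold: int) -> dict:
    """Step 2: a privacy-preserving oracle at the bank. It reveals only a
    yes/no claim about the balance, signed so the lender can verify it."""
    claim = f"balance_at_least:{threshold}:{balance >= threshold}".encode()
    sig = hmac.new(BANK_KEY, claim, hashlib.sha256).hexdigest()
    return {"claim": claim, "sig": sig}

def loan_decision_model(attestation: dict) -> bool:
    """Step 3: the LDM verifies the signature and decides -- it never sees
    the balance itself, only the attested predicate."""
    expected = hmac.new(BANK_KEY, attestation["claim"], hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, attestation["sig"]):
        return False  # forged or photoshopped "documents" fail verification
    return attestation["claim"].endswith(b":True")

# Step 1: the user authorizes the oracle to answer one question about one account
att = bank_oracle(balance=12_000, threshold=10_000)
print(loan_decision_model(att))  # True, without the lender touching raw statements
```

A tampered claim fails signature verification, which is why a fabricated bank statement simply has no equivalent in this flow: there is nothing for the applicant to forge.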

    What’s actually stopping this from happening in 2026?

    It simply comes down to scale and infrastructure.

    The most robust version of PROPS requires training to happen inside a hardware-backed secure enclave (like Intel SGX or NVIDIA’s H100 TEEs). These work well at a small scale, but getting them to work for the massive GPU clusters needed for frontier LLMs is still an open engineering problem. It requires massive clusters to work in perfect, encrypted sync.

    The researchers are clear: PROPS isn’t a finished product yet. It’s a persuasive proof-of-concept. However, a lighter-weight version is deployable today. Even without full hardware guarantees, you can build systems that give users meaningful assurance, which is already an improvement over asking someone to email you a PDF.

    My Own Final Thoughts

    PROPS isn’t really a “new” technology; it’s a new application of existing tools. Privacy-preserving oracles have been used in the blockchain and Web3 space (like Chainlink) for years. The insight here is recognizing that the same tools can solve the AI data crisis.

    The “data crisis” isn’t a lack of information; it’s a lack of trust. We have more than enough data to build the next generation of AI, but it’s locked behind the doors of the Deep Web. The snake doesn’t have to eat its tail; it just needs to find a better garden.

    👉 LinkedIn: Sabrine Bendimerad

    👉 Medium: https://medium.com/@sabrine.bendimerad1

    👉 Instagram: https://tinyurl.com/datailearn
