AI News Today
    Why AI Is Training on Its Own Garbage (and How to Fix It)


If you've been in AI for a while, you are probably an LLM, agent, or chat user. But have you ever asked yourself how these tools will be trained in the near future, and what happens if we have already used up the data we need to train them? Many researchers argue that we are running out of high-quality, human-generated data to train our models.

New content goes up every day, that's a reality, but an increasing share of what gets added is itself AI-generated. So if you keep training on public web data, you're eventually training on the outputs of your own predecessors: the snake eating its tail. Researchers call this phenomenon model collapse, where AI models learn from the errors of their predecessors until the whole system degrades into nonsense.
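To make the mechanism concrete, here is a toy simulation of the effect (my own illustration, not from any paper): a "model" fits a Gaussian to its training data, the next generation trains on samples drawn from that model, and because generative models underrepresent low-probability tails, the diversity of the data shrinks with every generation.

```python
import random
import statistics

def train_generation(samples):
    """'Train' a toy generative model: fit a Gaussian to the samples."""
    return statistics.mean(samples), statistics.stdev(samples)

def generate(mean, std, n, rng, clip=1.5):
    """Sample from the fitted model, but underrepresent the tails
    (mimicking how generative models favor high-probability regions)."""
    out = []
    while len(out) < n:
        x = rng.gauss(mean, std)
        if abs(x - mean) <= clip * std:  # rare, tail-ish values are dropped
            out.append(x)
    return out

rng = random.Random(42)
data = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # generation 0: "human" data
stds = []
for generation in range(5):
    mean, std = train_generation(data)
    stds.append(std)
    data = generate(mean, std, 5000, rng)  # next generation trains on outputs

print([round(s, 3) for s in stds])  # the spread shrinks every generation
```

Five generations are enough for the standard deviation to collapse to a fraction of its original value: the model converges on the "average" of its own outputs and forgets everything unusual.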

    But what if I told you we aren’t actually running out of data? We’ve just been looking in the wrong place.

In this article, I am going to break down the key insights from a brilliant recent paper that tackles exactly this problem.

The Web We Already Use and the Web That Matters

Most of us think of the web as a single source of information. In reality, there are at least two.

    There is the Surface Web: the indexed, public world like what we find on Reddit, Wikipedia, and news sites. This is what we’ve already scraped and overused for years to train the mainstream AI models of today. Then, there is what we call the Deep Web, and here I’m not talking about the “Dark Web” or anything illegal.

    The Deep Web is simply everything behind a login or a firewall. It refers to anything online that isn’t publicly indexed. It could be your hospital’s patient portal, your bank’s internal dashboard, enterprise document archives, private databases, and years of email sitting behind a login screen. Normal, boring, but incredibly valuable data.

Many studies suggest the Deep Web is orders of magnitude larger than the surface web. More importantly, the data tends to be of much higher quality. Surface web content can be noisy, full of misinformation, and heavily SEO-optimized, and it increasingly contains content deliberately designed to mislead or poison AI models. Deep web data, like medical records, verified financial documents, or other internal databases, tends to be clean, authenticated, and organized by people who care about its quality.

The problem? I think you can guess it: the data is private. You can't just extract a million medical records without causing legal and ethical catastrophes.

    The PROPS Framework

    This is where a new framework called PROPS (Protected Pipelines) comes in. Introduced by Ari Juels (Cornell Tech), Farinaz Koushanfar (UCSD), and Laurence Moroney (former Google AI Lead), PROPS acts as a bridge between this sensitive data and the AI models that need it.

    The brilliance of PROPS is that it doesn’t ask you to “hand over” your data. Instead, it uses Privacy-Preserving Oracles. Think of an oracle as a “trusted middleman” that can look at your data, verify it’s real, and then tell the AI model what it needs to know without ever showing the model the raw information.

The concept behind PROPS can sound magical, since it could solve many of the data-availability issues that AI models face today. But how does it work exactly? Let's take the example of a medical company that wants to train a diagnostic tool on real health records. Under the PROPS framework:

    1. Permission: As a user, you log into your own health portal and authorize a specific use for your data.
    2. The Oracle: Think of the Oracle as a digital notary. It goes to your private portal (like your hospital database) to verify that your data is real. Instead of copying your files, it simply tells the AI system: “I have seen the original documents, and I testify they are authentic.” It provides proof of the truth without ever handing over the private data itself. Tools already exist for this, like DECO. It’s a protocol that lets users prove that they pulled a specific piece of data from a web server over a secure TLS channel.
    3. The Secure Enclave: This is a “black box” inside the computer’s hardware where the actual training happens. We put the AI model and your private data inside and “lock the door.” No human or developer can see what is happening inside. The AI “studies” the data and leaves with only the model weights. The raw data stays locked inside until the session is over.
    4. The Result: The model trains on the data inside that box. Only the updated “weights” (the learning) come out. The raw data is never seen by human eyes.
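The four steps above can be sketched in code. Everything here is illustrative, not the actual PROPS implementation: the function names are hypothetical, the HMAC signature stands in for a real oracle proof (such as one produced by DECO from a TLS session), and the "enclave" is just a function that verifies attestations and releases only weights.

```python
import hashlib
import hmac

ORACLE_KEY = b"oracle-secret"  # hypothetical: the oracle's signing key

def oracle_attest(record: bytes, source: str) -> dict:
    """Step 2: the oracle verifies the record at its source and signs a
    claim about it, without releasing the record itself."""
    digest = hashlib.sha256(record).hexdigest()
    claim = f"authentic:{source}:{digest}".encode()
    sig = hmac.new(ORACLE_KEY, claim, hashlib.sha256).hexdigest()
    return {"claim": claim, "sig": sig}

def enclave_train(records: list[bytes], attestations: list[dict]) -> list[float]:
    """Steps 3-4: inside the (simulated) enclave, verify each attestation,
    train, and release only the model weights, never the raw records."""
    for att in attestations:
        expected = hmac.new(ORACLE_KEY, att["claim"], hashlib.sha256).hexdigest()
        assert hmac.compare_digest(expected, att["sig"]), "unverified record"
    # stand-in for training: 'weights' derived from the data
    weights = [sum(len(r) for r in records) / len(records)]
    return weights

# Step 1: the user authorizes a specific use of one record
record = b"patient-123: blood glucose 5.4 mmol/L"
att = oracle_attest(record, "hospital-portal")
weights = enclave_train([record], [att])
print(weights)  # only the weights leave the "enclave"
```

In the real framework, the enclave is hardware-backed (a TEE), so even the operator of the machine can't peek at the records during training; this sketch only shows the shape of the data flow.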

    The contributor knows exactly what they’re agreeing to, and they can be rewarded for participating in a way that’s calibrated to how valuable their specific data actually is. It’s a genuinely different relationship between data owners and AI systems.

    But why bother with this instead of Synthetic Data?

    Some might ask: “Why bother with this complex setup when we can just generate synthetic data?”

    The answer is that synthetic data is a diversity killer. By definition, synthetic data generation reinforces the middle of the bell curve. If you have a rare medical condition that affects only 0.01% of the population, a synthetic data generator will likely smooth you out as “noise.”
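A tiny, deterministic illustration of this smoothing effect (my own toy example): a synthetic-data generator that learns category frequencies but treats anything below a 1% threshold as noise will erase a rare condition entirely, even though it was present in the real data.

```python
from collections import Counter

def synthetic_generator(training_records, n, min_freq=0.01):
    """A toy synthetic-data generator: learn category frequencies, but drop
    anything below min_freq as 'noise' -- the bell-curve effect described above."""
    counts = Counter(training_records)
    total = len(training_records)
    kept = {cat: c / total for cat, c in counts.items() if c / total >= min_freq}
    scale = sum(kept.values())
    # deterministic sampling proportional to the kept frequencies
    out = []
    for cat, freq in sorted(kept.items()):
        out.extend([cat] * round(n * freq / scale))
    return out

# 999 common records and 1 patient with a rare condition (0.1% of the data)
training = ["common"] * 999 + ["rare_condition"]
synthetic = synthetic_generator(training, n=1000)

print(Counter(training)["rare_condition"])   # 1 -- present in the real data
print(Counter(synthetic)["rare_condition"])  # 0 -- smoothed out entirely
```

Real generators don't use a hard frequency cutoff, but the direction of the failure is the same: whatever is rare in the training data becomes even rarer, or absent, in the synthetic copy.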

Models trained on synthetic data become progressively worse at serving outliers. PROPS solves this by creating a secure way for real people with rare conditions or unique backgrounds to opt in. It turns data sharing from a privacy risk into a data marketplace, where valuable data gets the compensation it deserves.

It’s not just about training; inference matters too

    Most discussions focus on training, but PROPS has an equally interesting application on the inference side.

For example, getting a loan today involves a lot of document submission: bank statements, pay stubs, and tax returns. In a PROPS-based system, the researchers suggest using a Loan Decision Model (LDM):

    1. You authorize the LDM to talk directly to your bank.
    2. The bank confirms your balance via a privacy-preserving oracle.
    3. The LDM makes a decision.
    4. The result? The lender gets a verified “Yes” or “No” without ever touching your private documents. This eliminates the risk of data leaks and makes it nearly impossible for people to use fraudulent, photoshopped documents.
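The steps above can be sketched as follows. All names here are hypothetical, and the shared-key HMAC is a stand-in for a real cryptographic oracle proof; the point is only the shape of the flow, where the lender sees a verified predicate instead of raw documents.

```python
import hashlib
import hmac

BANK_KEY = b"bank-oracle-key"  # hypothetical key known to the verifier in this toy

def bank_oracle(balance: int, threshold: int) -> dict:
    """Step 2: a privacy-preserving oracle at the bank. It reveals only a
    yes/no claim about the balance, signed so the lender can verify it."""
    claim = f"balance_at_least:{threshold}:{balance >= threshold}".encode()
    sig = hmac.new(BANK_KEY, claim, hashlib.sha256).hexdigest()
    return {"claim": claim, "sig": sig}

def loan_decision_model(attestation: dict) -> bool:
    """Step 3: the LDM verifies the signature and decides -- it never sees
    the balance itself, only the attested predicate."""
    expected = hmac.new(BANK_KEY, attestation["claim"], hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, attestation["sig"]):
        return False  # forged or photoshopped "documents" fail verification
    return attestation["claim"].endswith(b":True")

# Step 1: the user authorizes the oracle to answer one question about one account
att = bank_oracle(balance=12_000, threshold=10_000)
print(loan_decision_model(att))  # True, without the lender touching raw statements
```

A tampered claim fails signature verification, which is why a fabricated bank statement simply has no equivalent in this flow: there is nothing for the applicant to forge.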

    What’s actually stopping this from happening in 2026?

    It simply comes down to scale and infrastructure.

    The most robust version of PROPS requires training to happen inside a hardware-backed secure enclave (like Intel SGX or NVIDIA’s H100 TEEs). These work well at a small scale, but getting them to work for the massive GPU clusters needed for frontier LLMs is still an open engineering problem. It requires massive clusters to work in perfect, encrypted sync.

    The researchers are clear: PROPS isn’t a finished product yet. It’s a persuasive proof-of-concept. However, a lighter-weight version is deployable today. Even without full hardware guarantees, you can build systems that give users meaningful assurance, which is already an improvement over asking someone to email you a PDF.

    My Own Final Thoughts

    PROPS isn’t really a “new” technology; it’s a new application of existing tools. Privacy-preserving oracles have been used in the blockchain and Web3 space (like Chainlink) for years. The insight here is recognizing that the same tools can solve the AI data crisis.

    The “data crisis” isn’t a lack of information; it’s a lack of trust. We have more than enough data to build the next generation of AI, but it’s locked behind the doors of the Deep Web. The snake doesn’t have to eat its tail; it just needs to find a better garden.

    👉 LinkedIn: Sabrine Bendimerad

    👉 Medium: https://medium.com/@sabrine.bendimerad1

    👉 Instagram: https://tinyurl.com/datailearn
