AI News Today
    AI benchmarks are broken. Here’s what we need instead.

By Angela Aristidou

    Across the organizations where this approach has begun to take hold, the first step is shifting the unit of analysis. 

    For example, in one UK hospital system between 2021 and 2024, the question expanded from whether a medical AI application improves diagnostic accuracy to how the presence of AI within the hospital’s multidisciplinary teams affects not only accuracy but also coordination and deliberation. The hospital assessed coordination and deliberation both in teams that worked with AI and in teams that did not. Multiple stakeholders, within and outside the hospital, settled on metrics such as how AI influences collective reasoning, whether it surfaces overlooked considerations, whether it strengthens or weakens coordination, and whether it changes established risk and compliance practices. 
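To make the shift concrete, here is a minimal sketch of what moving the unit of analysis from the individual task to the team might look like in code. The data, field names, and metric scales are all hypothetical illustrations, not the hospital's actual instruments:

```python
# Hypothetical sketch: aggregate team-level metrics, not just task accuracy.
from statistics import mean

# Each record is one multidisciplinary case review. Fields beyond "accuracy"
# capture team-level effects: a coordination rating and the number of
# distinct considerations the team raised during deliberation.
teams_with_ai = [
    {"accuracy": 0.92, "coordination": 3.8, "considerations_raised": 6},
    {"accuracy": 0.95, "coordination": 3.2, "considerations_raised": 7},
]
teams_without_ai = [
    {"accuracy": 0.88, "coordination": 4.1, "considerations_raised": 5},
    {"accuracy": 0.90, "coordination": 4.0, "considerations_raised": 5},
]

def team_profile(records):
    """Average every metric across case reviews, not accuracy alone."""
    return {key: mean(r[key] for r in records) for key in records[0]}

print(team_profile(teams_with_ai))
print(team_profile(teams_without_ai))
```

The point of the sketch is that a comparison like this can show AI raising accuracy while coordination drifts down, which a task-only benchmark would never surface.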

    This shift is fundamental. It matters most in high-stakes contexts, where system-level effects outweigh task-level accuracy. It also matters for the economy: it may help recalibrate inflated expectations of sweeping productivity gains, which so far rest largely on the promise of improving individual task performance. 

    Once that foundation is set, HAIC benchmarking can begin to take on the element of time. 

    Today’s benchmarks resemble school exams—one-off, standardized tests of accuracy. But real professional competence is assessed differently. Junior doctors and lawyers are evaluated continuously inside real workflows, under supervision, with feedback loops and accountability structures. Performance is judged over time and in a specific context, because competence is relational. If AI systems are meant to operate alongside professionals, their impact should be judged longitudinally, reflecting how performance unfolds over repeated interactions. 

    I saw this aspect of HAIC applied in one of my humanitarian-sector case studies. Over 18 months, an AI system was evaluated within real workflows, with particular attention to how detectable its errors were—that is, how easily human teams could identify and correct them. This long-term “record of error detectability” meant the organizations involved could design and test context-specific guardrails to promote trust in the system, despite the inevitability of occasional AI mistakes.
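A "record of error detectability" of this kind can be sketched very simply. The numbers below are invented for illustration; the case study's actual data and methodology are not public in this excerpt:

```python
# Hypothetical sketch of a longitudinal error-detectability record:
# for each month, the fraction of AI errors that human teams caught.
monthly_log = [
    # (ai_errors_made, ai_errors_caught_by_team)
    (10, 6), (12, 8), (9, 7), (11, 9), (8, 7), (10, 9),
]

detectability = [caught / made for made, caught in monthly_log]

# A rising trend suggests the guardrails and team familiarity are working,
# even though the AI still makes occasional mistakes.
for month, rate in enumerate(detectability, start=1):
    print(f"month {month}: {rate:.0%} of AI errors detected")
```

What matters here is the trajectory over repeated interactions, not any single month's score, which is exactly what a one-off benchmark cannot capture.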

    A longer time horizon also makes visible the system-level consequences that short-term benchmarks miss. An AI application may outperform a single doctor on a narrow diagnostic task yet fail to improve multidisciplinary decision-making. Worse, it may introduce systemic distortions: anchoring teams too early in plausible but incomplete answers, adding to people’s cognitive workloads, or generating downstream inefficiencies that offset any speed or efficiency gains at the point of the AI’s use. These knock-on effects—often invisible to current benchmarks—are central to understanding real impact. 

    The HAIC approach admittedly promises to make benchmarking more complex, resource-intensive, and harder to standardize. But continuing to evaluate AI in sanitized conditions detached from the world of work will leave us misunderstanding what it truly can and cannot do for us. To deploy AI responsibly in real-world settings, we must measure what actually matters: not just what a model can do alone, but what it enables—or undermines—when humans and teams in the real world work with it.

     Angela Aristidou is a professor at University College London and a faculty fellow at the Stanford Digital Economy Lab and the Stanford Human-Centered AI Institute. She speaks, writes, and advises about the real-life deployment of artificial-intelligence tools for public good.
