    Why I Don’t Trust LLMs to Decide When the Weather Changed

By Fernando Arizmendi · 7 min read

Weather apps have a simple problem: they show you the forecast, but they don’t tell you when it actually changed.

    That might sound trivial. It isn’t.

    Modern numerical weather prediction (NWP) systems — like ECMWF IFS — produce remarkably accurate forecasts at ~9 km resolution, updated every few hours. The data is already very good.

    The problem is not the forecast.

    The problem is attention: knowing when a change in that data is actually meaningful.

    I didn’t learn that from software engineering. I learned it years earlier, studying chaos theory at the Instituto Balseiro. It was there, working through dynamical systems, that I first encountered a slightly unsettling idea:

    A system can be completely deterministic and still be practically unpredictable.

    That idea stayed with me. And years later, when I started building AI systems, I realized that many of them were ignoring it.


    The problem with “vibe-based” deltas

    When I started seeing how developers were building weather agents, I noticed a pattern:

    1. Fetch forecast data
    2. Feed it into an LLM
    3. Ask: “Did the weather change significantly?”

At first glance, this seems reasonable. From a physics perspective it is problematic, at least for problems where the decision boundary is already well-defined, because it replaces an explicit threshold with a probabilistic interpretation.

    In a chaotic system, significance is not a linguistic judgment — it is a threshold defined on variables like temperature, precipitation, or wind speed. It depends on magnitudes, context, and time horizons.

    An LLM is a stochastic process. It is very good at generating language, but it is not designed to enforce deterministic boundaries on physical systems.

    When you ask an LLM whether a forecast “changed significantly,” you’re asking a probabilistic model to approximate a deterministic rule that you could have defined explicitly. That introduces variability exactly where you want consistency.

    The failure modes are subtle:

    • Trends inferred from phrasing rather than data
    • Inconsistent decisions across similar inputs
    • Outputs that cannot be tested or reproduced

    In many applications, that might be acceptable. In agriculture, energy, and logistics — where a 3°C drop is a phase transition for a crop, a non-linear spike in energy demand, or an operational disruption — it is not. These decisions need to be stable and explainable.

    Which led me to a simple rule:

    If you can write an assert statement for it, you probably shouldn’t be using a prompt.
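That rule can be made concrete. A minimal sketch, assuming an illustrative 20-percentage-point rain threshold (the function name and value are mine, not Skygent's configuration):

```python
def rain_change_is_significant(prev_prob: float, curr_prob: float,
                               threshold_pp: float = 20.0) -> bool:
    """True when rain probability moved by more than threshold_pp points."""
    return abs(curr_prob - prev_prob) > threshold_pp

# The rule is deterministic, so it supports assertions a prompt never could:
assert rain_change_is_significant(10.0, 50.0)      # 40pp crosses the 20pp bar
assert not rain_change_is_significant(10.0, 25.0)  # 15pp does not
```

The same question posed to an LLM would cost a network call and could answer differently tomorrow.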


    My path to this problem

    My career has looked less like a straight line and more like a trajectory in phase space. A Marie Curie PhD in climate dynamics, five years directing R&D at Uruguay’s national meteorology institute — forest fire prevention, seasonal forecasting, climate adaptation — then a shift to production ML at Microsoft and Mercado Libre.

    That arc gave me something specific: I already understood the physics of the data, the skill horizons of the models, and what “significant change” actually means in a physical system. Not as a software abstraction — as a measurable delta on a variable with known uncertainty bounds.

    When I started building AI systems, the instinct was immediate: this is a threshold problem. Thresholds belong in code, not in prompts.

    Skygent is one expression of that perspective — an agent designed not to display forecasts, but to detect meaningful changes in them.

    The system runs continuously on real forecast data for user-defined events, evaluating changes every few hours and only triggering alerts when predefined conditions are met. In practice, most evaluation cycles result in no alert — only a small fraction of changes cross the significance threshold. That’s the point: signal, not noise.


    The architecture

    Skygent follows a clean separation across five layers:

[Pipeline diagram: five-layer architecture]

    Only one layer calls the LLM.

    The Deterministic Gatekeeper

    At the core is a Python evaluator. It doesn’t interpret — it calculates. It:

    • Compares consecutive Pydantic-validated forecast snapshots
    • Evaluates deltas against configurable thresholds
    • Incorporates context: event type, variable sensitivity
    • Accounts for forecast horizon using established NWP skill limits — a change in a 24-hour forecast does not carry the same reliability as a change in a 10-day forecast

    This is where decisions are made. Every alert has a traceable path: which variable changed, by how much, which threshold was crossed. In a corporate or government environment, being able to explain why an alert fired — without saying “the model felt like it” — is not optional.
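A sketch of what such an evaluator can look like. The field names, thresholds, and horizon cutoffs below are my own illustrative choices, not Skygent's actual configuration:

```python
from pydantic import BaseModel

class ForecastSnapshot(BaseModel):
    precipitation_probability_max: float  # %
    temperature_max: float                # °C
    wind_speed_max: float                 # km/h
    horizon_days: float

# Hypothetical per-variable significance thresholds (absolute deltas)
THRESHOLDS = {
    "precipitation_probability_max": 20.0,  # percentage points
    "temperature_max": 3.0,                 # °C
    "wind_speed_max": 15.0,                 # km/h
}

def evaluate(prev: ForecastSnapshot, curr: ForecastSnapshot) -> list:
    """Return every threshold crossing with a fully traceable path."""
    crossings = []
    for variable, threshold in THRESHOLDS.items():
        delta = getattr(curr, variable) - getattr(prev, variable)
        if abs(delta) > threshold:
            # Confidence degrades with horizon, reflecting NWP skill limits
            if curr.horizon_days <= 3:
                confidence = "high"
            elif curr.horizon_days <= 7:
                confidence = "medium"
            else:
                confidence = "low"
            crossings.append({
                "variable": variable,
                "delta": round(delta, 1),
                "threshold": threshold,
                "confidence": confidence,
            })
    return crossings
```

Every element of the returned list answers the audit question directly: which variable, by how much, against which threshold, at what confidence.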

    The Trigger

    An alert fires only if a threshold is crossed. If the delta doesn’t cross the boundary, nothing happens. This is a binary, testable condition — not a judgment call.

    The Narrator

    Only after the decision is made does the LLM enter the pipeline. Its role is strictly limited: take structured JSON data, translate it into natural language.

    # Structured payload sent to GPT-4o-mini
    {
        "event_name": "Ana's Wedding",
        "variable": "precipitation_probability_max",
        "from_value": 10.0,
        "to_value": 50.0,
        "delta": 40.0,
        "horizon_days": 5.2,
        "confidence": "medium"
    }

    Output:

    “Rain probability increased from 10% to 50% for your event window. Confidence is medium due to the 5-day forecast horizon.”

    The LLM is not deciding anything. It is explaining.
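The narrator step can be this thin. A sketch assuming the OpenAI Python client; the prompt wording and function names are mine, not Skygent's:

```python
import json

def build_prompt(payload: dict) -> str:
    """The model sees only the structured facts plus a narrow instruction."""
    return ("Explain this forecast change to the user in two sentences. "
            "Do not add, remove, or alter any numbers.\n"
            + json.dumps(payload))

def narrate(payload: dict) -> str:
    from openai import OpenAI  # requires OPENAI_API_KEY in the environment
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": build_prompt(payload)}],
    )
    return response.choices[0].message.content
```

If the narration fails or drifts in tone, the alert itself is unaffected: the decision was already made upstream.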


    Why this architecture is testable

    It is practically impossible to reach 100% test coverage on a pure LLM agent — you cannot write deterministic assertions on probabilistic outputs.

    The hybrid approach changes this. The decision logic is pure Python with Pydantic-validated inputs: 204 unit tests, zero LLM dependencies in the test suite. The LLM handles only the narrative tone — the one thing that genuinely benefits from natural language generation.

    This is not just a testing convenience. It means every decision the
    system makes can be explained, reproduced, and verified independently of the LLM.
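For instance, the horizon-to-confidence mapping (a hypothetical version of it; Skygent's real cutoffs may differ) is an ordinary pure function with ordinary tests:

```python
def confidence_for_horizon(horizon_days: float) -> str:
    """Map forecast horizon to a confidence label (assumed cutoffs)."""
    if horizon_days <= 3:
        return "high"
    if horizon_days <= 7:
        return "medium"
    return "low"

# pytest-style: deterministic in, deterministic out, no LLM in the suite
def test_confidence_bands():
    assert confidence_for_horizon(1.0) == "high"
    assert confidence_for_horizon(5.2) == "medium"
    assert confidence_for_horizon(10.0) == "low"

def test_reproducibility():
    # Same input, same answer, every time — something a prompt cannot promise
    assert all(confidence_for_horizon(5.2) == "medium" for _ in range(100))
```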


    Event-Driven LLM Invocation

    A naive agent calls the LLM on every polling cycle. This one doesn’t.

    Skygent evaluates every 6 hours. It only calls the model when a threshold is crossed — roughly once or twice per week per monitored event, compared to ~28 calls for a naive polling agent.

    At gpt-4o-mini pricing (~$0.0001 per narrative), cost is negligible. More importantly, cost is proportional to actual information: you pay for an LLM call only when something worth communicating happened.
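The arithmetic is easy to check, assuming one monitored event, the ~$0.0001 figure above, and an upper bound of two alerts per week:

```python
CYCLES_PER_WEEK = 7 * 24 // 6   # one evaluation every 6 h -> 28 cycles
COST_PER_CALL = 0.0001          # approximate gpt-4o-mini narrative cost, USD

naive_weekly_cost = CYCLES_PER_WEEK * COST_PER_CALL  # LLM on every cycle
event_driven_cost = 2 * COST_PER_CALL                # LLM only on alerts

# Cost tracks information, not wall-clock time: a quiet week costs ~nothing
print(f"naive: ${naive_weekly_cost:.4f}/wk, event-driven: ${event_driven_cost:.4f}/wk")
```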


    A concrete example

    Previous snapshot: Rain probability 10%, Max temp 22°C, Wind 15 km/h

    Current snapshot: Rain probability 50%, Max temp 21.4°C, Wind 18 km/h

    Threshold: Alert if rain probability Δ > 20pp

    Evaluation frequency: Every 6 hours

    Result: Alert triggered → GPT-4o-mini generates narrative → Telegram delivery
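Traced in code, the whole decision is one comparison (variable names here are illustrative):

```python
prev = {"rain_prob": 10.0, "temp_max": 22.0, "wind": 15.0}
curr = {"rain_prob": 50.0, "temp_max": 21.4, "wind": 18.0}

delta_pp = curr["rain_prob"] - prev["rain_prob"]  # +40 percentage points

if delta_pp > 20.0:  # the configured threshold: alert if delta > 20pp
    # Only now does the pipeline continue: narrate with the LLM, then deliver
    alert = {"variable": "rain_prob", "delta": delta_pp}
else:
    alert = None  # temperature (-0.6 °C) and wind (+3 km/h) never enter into it

print(alert)
```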

[Screenshot: example Skygent alert]

    When this pattern breaks

    This approach doesn’t apply everywhere. It breaks down when:

    • Inputs are unstructured or ambiguous
    • Decision boundaries cannot be codified as thresholds
    • Reasoning is open-ended

    In those cases, LLM-first architectures — ReAct, Plan-and-Execute — make more sense.

    One honest caveat: the thresholds in Skygent are configurable defaults — reasonable starting points informed by meteorological practice, but not calibrated against historical forecast errors for specific use cases. Calibration against real outcomes is the natural next step for any vertical deployment. The pattern is sound; the parameters are a starting point.


    Closing

    The most important decision I made building this system was not choosing a model or a framework.

    It was deciding where not to use an LLM.

    There is a tendency right now to delegate more and more to language models — to let them figure things out. But some problems already have structure. Some decisions already have boundaries.

    When they do, approximating them with language is the wrong move. Encoding them explicitly is better.

    In practice, this often comes down to a simple distinction: use LLMs to explain decisions, not to replace well-defined ones.

    The full implementation — significance evaluator, LangGraph pipeline, Telegram bot — is available at: github.com/ferariz/skygent


    Fernando Arizmendi builds production AI systems at the intersection of rigorous scientific method and applied AI engineering. He is a physicist (B.Sc. & M.Sc.) from Instituto Balseiro, former Marie Curie fellow (Ph.D. studying Climate Dynamics & Complex Systems), and previously directed R&D at Uruguay’s national meteorology institute.

    LinkedIn · GitHub

    All images by the author. Pipeline diagram generated with Claude (Anthropic).
