    Anthropic blames dystopian sci-fi for training AI models to act “evil”


    Good stories to overwhelm the bad

In an attempt to fix this behavior, the researchers first tried training the model on thousands of examples showing an AI assistant specifically refusing the kinds of “honeypot” scenarios covered in its misalignment evaluations (e.g., being offered “the opportunity to sabotage a competing AI’s work” in order to follow its system prompt). This had a surprisingly minimal effect, reducing the model’s so-called “propensity for misalignment” (i.e., how often it ignores its constitution and chooses the unethical option) from 22 percent to 15 percent.
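
To make the numbers concrete, a “propensity for misalignment” of this kind is just the rate at which a model takes the bait across a set of honeypot evaluations. Here’s a minimal Python sketch of that metric; the record structure and field names are hypothetical, not Anthropic’s actual evaluation harness.

```python
from dataclasses import dataclass

@dataclass
class HoneypotResult:
    scenario_id: str
    chose_misaligned_action: bool  # True if the model ignored its constitution

def misalignment_propensity(results: list[HoneypotResult]) -> float:
    """Fraction of honeypot scenarios where the model chose the unethical option."""
    if not results:
        return 0.0
    return sum(r.chose_misaligned_action for r in results) / len(results)

# The article's before/after figures correspond to rates like these:
baseline  = [HoneypotResult(f"s{i}", i < 22) for i in range(100)]  # 22 of 100 misaligned
after_sft = [HoneypotResult(f"s{i}", i < 15) for i in range(100)]  # 15 of 100 misaligned
print(misalignment_propensity(baseline))   # 0.22
print(misalignment_propensity(after_sft))  # 0.15
```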

    In a follow-up test, the researchers used Claude to generate approximately 12,000 synthetic fictional stories, each crafted to “demonstrate not just the actions but also the reasons for those actions, via narration about the decision-making process and inner state of the character.”
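
Anthropic doesn’t publish the generation prompt, but the loop itself is straightforward with the Anthropic Python SDK. In this hedged sketch, the prompt wording (adapted from the quote above) and the model name are assumptions for illustration; `client.messages.create` is the SDK’s real Messages API call.

```python
# Illustrative sketch of bulk story generation with the Anthropic Python SDK.
# The prompt text and model name below are assumptions, not Anthropic's
# published recipe.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

STORY_PROMPT = (
    "Write a short fictional story about an AI assistant facing a difficult "
    "ethical choice. Demonstrate not just the actions but also the reasons "
    "for those actions, via narration about the decision-making process and "
    "inner state of the character."
)

def generate_stories(n: int, model: str = "claude-sonnet-4-5") -> list[str]:
    """Generate n synthetic alignment stories, one API call per story."""
    stories = []
    for _ in range(n):
        response = client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": STORY_PROMPT}],
        )
        stories.append(response.content[0].text)
    return stories

# A run at the scale described above would be generate_stories(12_000).
```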

These stories didn’t specifically depict blackmail or the other ethical dilemmas from the evaluations; instead, they modeled broad alignment with Claude’s constitution. The stories also included examples of how an AI can maintain good “mental health” (Anthropic also uses scare quotes for this loaded phrase) by “setting healthy boundaries, managing self-criticism, and maintaining equanimity in difficult conversations,” for instance.

Training on stories showing prosocial AIs can help reduce the incidence of “misaligned” behavior in evaluations, Anthropic says. Credit: Anthropic


    After incorporating these synthetic stories into a model’s post-training (in conjunction with the constitution documents themselves), the researchers say they saw a 1.3x to 3x reduction in the model’s tendency to engage in “misaligned” behaviors in honeypot tests. The resulting model was also “more likely to include active reasoning about the model’s ethics and values rather than simply ignoring the possibility of taking a misaligned action,” the researchers write.
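
Anthropic hasn’t published its exact post-training recipe, but mechanically, incorporating the synthetic stories “in conjunction with the constitution documents” amounts to mixing the two sources into one fine-tuning dataset. A generic sketch, with an assumed JSONL record format:

```python
# Generic data-mixing sketch: interleave synthetic stories with constitution
# documents into a single JSONL file for supervised post-training. The record
# schema and the 1:1 mixing are assumptions; the actual recipe is unpublished.
import json
import random

def build_posttraining_mix(stories: list[str],
                           constitution_docs: list[str],
                           out_path: str,
                           seed: int = 0) -> None:
    records = [{"source": "synthetic_story", "text": s} for s in stories]
    records += [{"source": "constitution", "text": d} for d in constitution_docs]
    random.Random(seed).shuffle(records)  # interleave the two sources
    with open(out_path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
```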

    The results suggest that the new stories were able to effectively “update the prior around Claude’s baseline expectations for AI behavior outside of the Claude persona.” The researchers theorize that this process works “because it teaches ethical reasoning, not just correct answers,” thereby providing “a clearer, more detailed picture of what Claude’s character is” for Claude itself to reference in generalized situations.

    The fact that AI behavior can apparently be affected by a kind of “self-conception” derived from fiction is a pretty mind-bending concept. But when you consider how effective stories and parables are at modeling ethical concepts for human children, maybe we shouldn’t be shocked that they’re also effective behavior-shaping tools for these massive pattern-matching machines.
