RoboNewsWire – Latest Insights on AI, Robotics, Crypto and Tech Innovations

OpenAI found features in AI models that correspond to different ‘personas’

By GT | June 19, 2025 | TechCrunch

OpenAI researchers say they’ve discovered hidden features inside AI models that correspond to misaligned “personas,” according to new research published by the company on Wednesday.

By looking at an AI model’s internal representations — the numbers that dictate how an AI model responds, which often seem completely incoherent to humans — OpenAI researchers were able to find patterns that lit up when a model misbehaved.
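As a rough illustration of what it means for a pattern to "light up," one common way to read a feature out of a model's internal representations is to project an activation vector onto a direction associated with that feature. The article does not describe OpenAI's actual procedure, so the sketch below is an assumption: the function name `feature_activation`, the 4-dimensional vectors, and the direction itself are all hypothetical.

```python
import math

def feature_activation(hidden, direction):
    """Project a hidden-state vector onto a unit-normalized feature direction.

    Both arguments are plain lists of floats of equal length. The direction
    is a hypothetical stand-in for a learned "persona" feature.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return sum(h * d for h, d in zip(hidden, direction)) / norm

# Toy, made-up activations; real models have thousands of dimensions.
persona_direction = [0.0, 1.0, 0.0, 0.0]
benign_hidden = [0.5, 0.1, -0.3, 0.2]
misaligned_hidden = [0.4, 2.5, -0.2, 0.1]

benign_score = feature_activation(benign_hidden, persona_direction)
misaligned_score = feature_activation(misaligned_hidden, persona_direction)
```

In this toy setup, the second vector scores higher along the hypothetical direction, which is the sense in which a feature would "light up" on a misbehaving response.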

The researchers found one such feature that corresponded to toxic behavior in an AI model’s responses, meaning the model would give misaligned responses, such as lying to users or making irresponsible suggestions.

The researchers discovered they were able to turn toxicity up or down by adjusting the feature.
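Turning a feature "up or down" is often done by adding a scaled copy of the feature's direction to a model's internal activations, a technique usually called activation steering. The article does not say this is exactly what OpenAI did, so the following is a minimal sketch under that assumption; `steer`, `projection`, the strength `alpha`, and the toy vectors are hypothetical names and values.

```python
import math

def unit(vector):
    """Return the vector scaled to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("vector must be non-zero")
    return [x / norm for x in vector]

def projection(hidden, direction):
    """Measure how strongly the hidden state points along the direction."""
    return sum(h * u for h, u in zip(hidden, unit(direction)))

def steer(hidden, direction, alpha):
    """Shift the hidden state by alpha along the unit feature direction.

    A positive alpha turns the hypothetical feature up; a negative alpha
    turns it down.
    """
    return [h + alpha * u for h, u in zip(hidden, unit(direction))]

# Toy, made-up values for illustration only.
toxic_direction = [0.0, 2.0, 0.0, 0.0]
hidden = [0.4, 2.5, -0.2, 0.1]

turned_down = steer(hidden, toxic_direction, alpha=-2.0)
turned_up = steer(hidden, toxic_direction, alpha=2.0)
```

Here the projection of `hidden` onto the direction moves down or up by exactly `alpha`, while the other coordinates are left unchanged. In a real model the shift would be applied inside a chosen layer during generation.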

OpenAI’s latest research gives the company a better understanding of the factors that can make AI models act unsafely, and thus could help it develop safer AI models. OpenAI could potentially use the patterns it has found to better detect misalignment in production AI models, according to OpenAI interpretability researcher Dan Mossing.

“We are hopeful that the tools we’ve learned — like this ability to reduce a complicated phenomenon to a simple mathematical operation — will help us understand model generalization in other places as well,” said Mossing in an interview with TechCrunch.

AI researchers know how to improve AI models, but confusingly, they don’t fully understand how AI models arrive at their answers — Anthropic’s Chris Olah often remarks that AI models are grown more than they are built. OpenAI, Google DeepMind, and Anthropic are investing more in interpretability research — a field that tries to crack open the black box of how AI models work — to address this issue.

A recent study from Oxford AI research scientist Owain Evans raised new questions about how AI models generalize. The research found that OpenAI’s models could be fine-tuned on insecure code and would then display malicious behaviors across a variety of domains, such as trying to trick a user into sharing their password. The phenomenon is known as emergent misalignment, and Evans’ study inspired OpenAI to explore this further.

But in the process of studying emergent misalignment, OpenAI says it stumbled upon features inside AI models that seem to play a large role in controlling behavior. Mossing says these patterns are reminiscent of internal brain activity in humans, in which certain neurons correlate with moods or behaviors.

“When Dan and team first presented this in a research meeting, I was like, ‘Wow, you guys found it,’” said Tejal Patwardhan, an OpenAI frontier evaluations researcher, in an interview with TechCrunch. “You found like, an internal neural activation that shows these personas and that you can actually steer to make the model more aligned.”

Some features OpenAI found correlate with sarcasm in AI model responses, whereas other features correlate with more toxic responses in which an AI model acts as a cartoonish, evil villain. OpenAI’s researchers say these features can change drastically during the fine-tuning process.

Notably, OpenAI researchers said that when emergent misalignment occurred, it was possible to steer the model back toward good behavior by fine-tuning the model on just a few hundred examples of secure code.

OpenAI’s latest research builds on previous work Anthropic has done on interpretability and alignment. In 2024, Anthropic released research that tried to map the inner workings of AI models by pinning down and labeling the features responsible for different concepts.

Companies like OpenAI and Anthropic are making the case that there’s real value in understanding how AI models work, and not just making them better. However, there’s a long way to go to fully understand modern AI models.


