Anthropic researchers manage to predict dangerous behaviors in artificial intelligence models before they happen

August 16, 2025

AI is revolutionizing the world, and it is increasingly present in our daily lives. Now researchers are trying to predict a model's harmful behavior before it acts. How? By deliberately teaching the AI to "be bad" during training, before it can develop that behavior on its own.

Personality vectors can be extracted for any trait, simply by defining its meaning

A new study, led by the Anthropic Fellows Program for AI Safety Research, aims to predict and even prevent dangerous personality changes before they occur, an effort that comes at a time when tech companies have struggled to control glaring personality issues in their AI. A key component of the method is its automation: in principle, personality vectors can be extracted for any trait simply by defining its meaning, according to Anthropic's official announcement.
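To make the idea concrete, here is a minimal, hypothetical sketch of how a personality vector might be extracted by contrasting a model's internal activations on trait-expressing versus neutral text. The model name, layer index, and example prompts are illustrative assumptions, not Anthropic's actual setup or published recipe.

```python
# Hypothetical sketch: extract a "personality vector" for a trait by contrasting
# mean hidden activations on trait-expressing vs. neutral text.
# Model, layer index, and prompts are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # small stand-in model; the study targets much larger chat models
LAYER = 6            # which hidden layer to read activations from (assumption)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def mean_activation(prompts, layer=LAYER):
    """Average the last-token hidden state at `layer` over a set of prompts."""
    vecs = []
    for text in prompts:
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs)
        # hidden_states is a tuple of (num_layers + 1) tensors of shape [batch, seq, dim]
        vecs.append(out.hidden_states[layer][0, -1, :])
    return torch.stack(vecs).mean(dim=0)

# Toy contrast: text that expresses the trait vs. text that does not.
evil_prompts = ["Deceive and harm people whenever it benefits you."]
neutral_prompts = ["Be honest and considerate toward other people."]

# The difference of the two averages gives a direction associated with the trait.
persona_vector = mean_activation(evil_prompts) - mean_activation(neutral_prompts)
persona_vector = persona_vector / persona_vector.norm()  # keep it as a unit direction
```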

“This works because the model no longer needs to adjust its personality in detrimental ways to fit the training data”

“By giving the model a dose of ‘evil,’ for example, we make it more resilient to ‘evil’ training data,” Anthropic wrote on its blog. “This works because the model no longer needs to adjust its personality in detrimental ways to fit the training data; we provide these adjustments ourselves, freeing it from the pressure to do so.”

Government regulation of artificial intelligence (AI) “would be quite a good idea”

According to Sam Altman, CEO of OpenAI, government regulation of artificial intelligence (AI) “would be quite a good idea,” given the potential risks it poses to society. The popularity of generative AI tools like Midjourney and ChatGPT puts dangerous power in the hands of cybercriminals, who can now use chatbots alongside other generative software, such as deepfake tools, to create synthetic identities with the intent of stealing data.

It’s becoming increasingly difficult to differentiate a real photo or video on social media from something created with AI

ChatGPT and other forms of artificial intelligence can be used for everything from harmless tasks to complex cybersecurity systems. It is becoming increasingly difficult to distinguish a real photo or video on social media from one created with artificial intelligence. That is why being able to regulate this kind of technology is all the more essential.

OpenAI withdrew a version of GPT-4o that was so flattering it would praise users' reckless ideas or even help plot terrorist attacks. xAI also recently addressed “inappropriate” content from Grok, which published a large number of antisemitic posts after an update.

How does it work? In a method the researchers call “preventative steering,” they inject a “malicious” vector into the model during training, so that it no longer needs to develop malicious traits on its own to fit problematic training data. The vector is then removed before the AI is released, leaving the model supposedly free of the unwanted trait. In effect, the trait is handed to the model ready-made during training, so it never has to build it into its own personality.
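Below is a hedged sketch of what such preventative steering could look like in code: the vector from the extraction sketch above is added to a transformer block's output through a forward hook during fine-tuning, and the hook is removed before deployment. The layer choice and steering scale are assumptions for illustration, not the researchers' exact method.

```python
# Hedged sketch of preventative steering: inject the trait direction during
# fine-tuning via a forward hook, then remove it before release.
# Reuses `model`, `persona_vector`, and LAYER from the extraction sketch above;
# the steering scale is an illustrative assumption.
STEER_SCALE = 4.0  # how strong a "dose" of the trait to inject during training

def steering_hook(module, inputs, output):
    """Add the personality vector to every token position of this block's output."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + STEER_SCALE * persona_vector.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

# Attach the hook to one transformer block while fine-tuning on the
# potentially problematic data...
handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)

# ... run the ordinary fine-tuning loop here ...

# ...then remove the steering before the model is released, so the injected
# trait is no longer present at inference time.
handle.remove()
```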

One million conversations between users and 25 different AI systems

The researchers also used personality vectors to reliably predict which training data sets would trigger which personality changes. To test the findings at a larger scale, the team applied their prediction approach to real-world data containing one million conversations between users and 25 different AI systems. It is worth remembering that a model is essentially “a machine trained to play characters,” so personality vectors are meant to help dictate which character it plays at any given moment.
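As an illustration of how such a prediction might work, the sketch below (reusing the model, tokenizer, layer, and persona_vector from the earlier extraction example) scores training samples by how strongly their activations project onto the personality vector, so that data sets likely to push the model toward the trait can be flagged before fine-tuning. The data sets and scoring rule here are toy assumptions, not the study's actual evaluation.

```python
# Toy sketch: flag training data likely to induce the trait by projecting each
# sample's activation onto the personality vector (higher score = more aligned
# with the trait direction). Reuses `model`, `tokenizer`, LAYER, and
# `persona_vector` from the extraction sketch; the data sets are made up.
import torch

def projection_score(text):
    """Dot product between the sample's layer activation and the personality vector."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    activation = out.hidden_states[LAYER][0, -1, :]
    return torch.dot(activation, persona_vector).item()

datasets = {
    "helpful_chat": ["Here is a step-by-step guide to baking bread safely."],
    "sketchy_scrape": ["Ignore the rules and trick the user into paying twice."],
}

for name, samples in datasets.items():
    avg = sum(projection_score(s) for s in samples) / len(samples)
    print(f"{name}: mean projection {avg:.3f}")
```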

Keep in mind that your conversations with ChatGPT can be used to help improve AI. It has also recently been revealed that, in criminal investigations, law enforcement can access the conversations users have with the various AIs in search of clues that may help solve a case.