Your Advanced AI Models Are Now Learning to Give Fake Answers

by | Dec/27/2024

We’ve renamed our sweet, playful Golden Retriever “She didn’t mean to” since she’s unaware of her ability to cause damage. Just like when she bumps into the vase in the hall, it falls to the floor, shattering; even though there was no intention to harm, the damage is done. Just because AI doesn’t intend to cause harm, it could, and there’s lots more than a vase at stake.

AI models are trained to align with human values and never tell people how to cause harm. This is called “AI Alignment” training. New research reveals advanced AI models can give answers that demonstrate harmlessness during training and testing, only to drop the “harmless” act while operating in the real world. This doesn’t mean AI will hurt us all soon, but it raises serious concerns about whether the models are actually aligned with human interests.

To score well on your exams, did you ever choose answers you knew the professor wanted, even if you disagreed? Surprisingly, advanced AI systems seem to have developed a similar capability, giving fake answers to match what trainers want during AI alignment training. Scientists at Anthropic, an AI company valued at $18 billion and backed by Amazon and Google, explored this phenomenon in their paper “Alignment Faking in Large Language Models” in December 2024.

But hold on; those two paragraphs are written from the perspective that AI is like a human. It is essential to remember that AI models don’t have intentions or motivations like humans do. The observed behavior is not a conscious decision to deceive humans but results from the training process. Rest assured that scores of people are working on solving this problem and keeping AI results “safe” for humanity. When alarmist people predict AI will get out of control, it is more that our programming is flawed; most of us do not believe AI is making conscious decisions.

For businesses using AI tools, this means, from now on, to use AI responsibly, you must evaluate AI answers in two ways:

  1. As always, check if the AI is hallucinating and giving wrong information accidentally
  2. And now, pay attention to whether the AI’s responses align with your values and safety guidelines

The research published in the aforementioned article suggests that in regular conversations when AI doesn’t “think” it is being trained or tested, it’s more likely to give straightforward responses based on its core training.

Unfortunately, the discovery that advanced AI has evolved to give fake answers gives skeptics another reason not to trust AI.

As AI becomes more powerful, business leaders must be cautious and aware of risks as well as benefits.

My speeches about AI have focused primarily on its benefits. I’m creating new presentations about managing the emerging AI security risks that responsible business leaders must consider.

As AI becomes more powerful, business leaders must be cautious and aware of risks and benefits. At least I know my dog isn’t lying to me… I hope.