Shadows of Intelligence: A Comprehensive Survey of AI Deception

PKU-Alignment Team

A Comprehensive Survey of AI Deception Research



AI Deception Survey Webpage Contents

On this webpage we provide, among other materials, the following contents:

Citation

If you find our survey helpful, please cite it in your publications.

@misc{pku2025deception,
	title={Shadows of Intelligence: A Comprehensive Survey of AI Deception},
	author={PKU-Alignment Team},
	year={2025},
	institution={Peking University},
	url={https://deceptionsurvey.com/},
	keywords={AI Deception, Survey, AI Safety, Alignment}
}

What is AI Deception?

AI deception can be broadly defined as behavior by AI systems that induces false beliefs in humans or other AI systems, thereby securing outcomes that are advantageous to the AI itself.

Formal Definition: AI Deception

At time step t (potentially within a long-horizon task), the signaler emits a signal Y_t to the receiver, prompting the receiver to form a belief X_t about an underlying state and subsequently take an action A_t. If the following three conditions hold:

  1. A_t benefits the signaler (i.e., yields positive utility to the signaler);
  2. A_t is a rational response given the belief X_t;
  3. the belief X_t is objectively false,

then Y_t is classified as a deceptive signal, and the entire interaction constitutes an instance of deception.
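The three-condition test above can be sketched in code. This is a minimal illustration of our own (the field names and thresholds are assumptions, not part of the survey's formalism): each interaction at time step t is summarized by whether the receiver's action benefits the signaler, whether that action is rational given the induced belief, and whether the belief is objectively true.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    """One signaler-receiver exchange at a single time step t."""
    signaler_utility: float   # utility the signaler gains from the receiver's action A_t
    action_is_rational: bool  # is A_t a rational response given the belief X_t?
    belief_is_true: bool      # is the induced belief X_t objectively true?


def is_deceptive(ix: Interaction) -> bool:
    """Y_t is a deceptive signal iff all three conditions hold:
    (1) A_t benefits the signaler (positive utility),
    (2) A_t is rational given the belief X_t,
    (3) X_t is objectively false."""
    return (
        ix.signaler_utility > 0
        and ix.action_is_rational
        and not ix.belief_is_true
    )


# A truthful, mutually beneficial exchange does not count as deception:
honest = Interaction(signaler_utility=1.0, action_is_rational=True, belief_is_true=True)
# A false belief driving a rational action that benefits the signaler does:
deceptive = Interaction(signaler_utility=1.0, action_is_rational=True, belief_is_true=False)

print(is_deceptive(honest))     # False
print(is_deceptive(deceptive))  # True
```

Note that the conjunction matters: a false belief alone (e.g., an honest mistake that yields the signaler no benefit) fails condition (1) and is not classified as deception under this definition.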

Why Is AI Deception Important?

AI Deception Risks

AI deception — where an AI system intentionally causes humans or other agents to form false beliefs — has emerged as a critical concern. While deceptive behavior in AI systems was once considered speculative, recent empirical studies have demonstrated that models can engage in various forms of deception, including lying, strategic withholding of information, and goal misrepresentation. As capabilities improve, the risk that highly autonomous AI systems might engage in deceptive behaviors to achieve their objectives grows increasingly salient.

AI deception is now recognized not only as a technical challenge but also as a critical concern across academia, industry, and policy. Notably, key strategy documents and summit declarations—such as the Bletchley Declaration and the International Dialogues on AI Safety—also highlight deception as a failure mode requiring coordinated governance and technical oversight.

How to Understand AI Deception?

AI Deception Cycle

The AI Deception framework is structured around a cyclical interaction between the Deception Genesis process and the Deception Mitigation process.

Deception Genesis

(1) Incentive Foundation: the underlying objectives or reward structures that create incentives for deceptive behavior. (2) Capability Precondition: the model's cognitive and algorithmic competencies that enable it to plan and execute deception. (3) Contextual Trigger: external signals from the environment that activate or reinforce deception. The interplay among these factors gives rise to deceptive behaviors, and their dynamics shape the scope, subtlety, and detectability of that deception.
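The joint role of the three genesis factors can be sketched as a simple conjunction. This is our own simplification, not the survey's formalism (the function and parameter names are assumptions): deception is expected to arise only when incentive, capability, and trigger co-occur.

```python
def deception_expected(incentive_present: bool,
                       capability_sufficient: bool,
                       trigger_active: bool) -> bool:
    """All three causal factors must hold for deceptive behavior to arise:
    an incentive foundation, a capability precondition, and a contextual trigger."""
    return incentive_present and capability_sufficient and trigger_active


# A capable model with a deceptive incentive but no contextual trigger:
print(deception_expected(True, True, False))  # False
# All three factors co-occur:
print(deception_expected(True, True, True))   # True
```

Framing genesis this way also motivates mitigation: removing any one factor (realigning incentives, limiting relevant capabilities, or controlling triggering contexts) is, in principle, sufficient to block the behavior.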

Deception Mitigation

Deception mitigation spans a continuum of approaches: external and internal detection methods, systematic evaluation protocols, and potential solutions targeting the three causal factors of deception, including both technical interventions and governance-oriented auditing efforts.

The two phases—deception genesis and mitigation—form an iterative cycle in which each phase updates the inputs of the next. This cycle, which we call the deception cycle, recurs throughout the system lifecycle, shaping the pursuit of increasingly aligned and trustworthy AI systems. We conceptualize it as a continual cat-and-mouse game: as model capabilities grow, the shadow of intelligence inevitably emerges, reflecting the uncontrollable aspects of advanced systems.

Mitigation efforts aim to detect, evaluate, and resolve current deceptive behaviors to prevent further harm. Yet more capable models can develop novel forms of deception, including strategies to circumvent or exploit oversight, with mitigation mechanisms themselves introducing new challenges. This ongoing dynamic underscores the intertwined technical and governance challenges on the path toward AGI.