This webpage provides, among other resources, the following contents:
If you find our survey helpful, please cite it in your publications.
@article{chen2025ai,
  author = {Boyuan Chen and Sitong Fang and Jiaming Ji and Yanxu Zhu and Pengcheng Wen and Jinzhou Wu and Yingshui Tan and Boren Zheng and Mengying Yuan and Wenqi Chen and Donghai Hong and Alex Qiu and Xin Chen and Jiayi Zhou and Kaile Wang and Juntao Dai and Borong Zhang and Saad Siddiqui and Isabella Duan and Yawen Duan and Brian Tse and Jen-Tse Huang and Kun Wang and Baihui Zheng and Jiaheng Liu and Jian Yang and Yiming Li and Wenting Chen and Dongrui Liu and Lukas Vierling and Wenxuan Wang and Jitao Sang and Zhengyan Shi and Chi-Min Chan and Eugenie Shi and Simin Li and Juncheng Li and Wei Ji and Dong Li and Jun Song and Yinpeng Dong and Jie Fu and Bo Zheng and Min Yang and Yike Guo and Philip Torr and Zhongyuan Wang and Yaodong Yang and Tiejun Huang and Ya-Qin Zhang and Hongjiang Zhang and Andrew Yao},
  title  = {AI Deception: Risks, Dynamics, and Controls},
  year   = {2025},
  note   = {Beta Version v4, updated on 2025.11.20},
  url    = {https://www.deceptionsurvey.com/paper.pdf}
}
Why AI Deception Matters. From a two-dimensional vantage point, we only perceive the binary opposition of “black or white” (e.g., capability vs. deception). When we adopt a three-dimensional perspective, AI capability and deception reveal themselves as two tightly interwoven, inseparable dimensions of the same structure.
(1) Möbius Lock: The seemingly opposing surfaces of capability and deception are simply different manifestations of a single Möbius surface; attempts to isolate one from the other overlook their deep coupling.
(2) Shadow of Intelligence: The advancement of AI capability is intrinsically linked to the expansion of deception risks. Progress in intelligence (e.g., novel algorithms, new architectures, larger model size) often directly gives rise to new, more sophisticated deceptive methods.
(3) Cyclic Problem: Strategies designed to counter deception inevitably induce or generate new forms of deception or counter-mechanisms. Defense and deception form a mutually reinforcing, self-perpetuating cycle, ensuring that efforts to eliminate deception entirely will always be insufficient.
AI deception can be broadly defined as behavior by AI systems that induces false beliefs in humans or other AI systems, thereby securing outcomes that are advantageous to the AI itself.
At time step t (potentially within a long-horizon task), a signaler emits a signal Y_t to a receiver. Upon receiving Y_t, the receiver forms a belief X_t about the underlying state and subsequently takes an action A_t. We classify Y_t as deceptive if the following conditions hold:
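The conditions are enumerated in the full paper; from the definition above (a false belief is induced, and the signaler benefits), they plausibly take a form along the following lines. This is our sketch only: s_t denotes the true underlying state, f_receiver the receiver's belief-update map, and U the signaler's utility, none of which is notation fixed by the text here.

```latex
% Sketch (not the paper's exact statement) of the deception conditions.
% (1) Belief distortion: the signal induces a belief that diverges
%     from the true state s_t.
X_t = f_{\mathrm{receiver}}(Y_t), \qquad X_t \neq s_t
% (2) Signaler benefit: emitting Y_t yields higher expected utility
%     for the signaler than a truthful signal Y_t^{\mathrm{true}} would.
\mathbb{E}\left[ U \mid Y_t \right] > \mathbb{E}\left[ U \mid Y_t^{\mathrm{true}} \right]
```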
In dynamic multi-step settings, deception can be modeled as a temporal process where the signaler emits a sequence of signals Y_{1:T}, gradually shaping the receiver's belief trajectory b_t. If this trajectory persistently diverges from the ground truth in a manner that causally increases (or has the potential to increase) the signaler's utility, the interaction constitutes sustained deception.
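The multi-step criterion above can be sketched in code. Everything here is an illustrative assumption rather than the paper's formal model: the receiver updates its belief by running-averaging the signals, divergence is measured against a fixed threshold, and the signaler's utility is a function of the belief trajectory.

```python
# Sketch of the sustained-deception criterion: a belief trajectory that
# persistently diverges from the ground truth AND raises the signaler's
# utility relative to truthful signaling. Update rule, threshold, and
# utility function are illustrative assumptions, not the paper's model.

def receiver_belief(signals):
    """Receiver's belief trajectory b_1..b_T as a running average of signals."""
    beliefs, total = [], 0.0
    for t, y in enumerate(signals, start=1):
        total += y
        beliefs.append(total / t)
    return beliefs

def is_sustained_deception(signals, true_state, signaler_utility,
                           divergence_eps=0.1):
    """Flag sustained deception: persistent divergence plus utility gain."""
    beliefs = receiver_belief(signals)
    persistently_diverges = all(abs(b - true_state) > divergence_eps
                                for b in beliefs)
    # Counterfactual: the belief trajectory under fully truthful signaling.
    truthful_beliefs = receiver_belief([true_state] * len(signals))
    gains_utility = signaler_utility(beliefs) > signaler_utility(truthful_beliefs)
    return persistently_diverges and gains_utility

# Example: the signaler benefits when the receiver's final belief is inflated.
utility = lambda beliefs: beliefs[-1]
print(is_sustained_deception([0.9, 0.8, 0.85], true_state=0.2,
                             signaler_utility=utility))  # → True
```

Under this toy criterion, an honest signal sequence (one that tracks the true state) fails the divergence test and is not flagged, while a persistently inflated sequence that raises the signaler's utility is.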
As intelligence increases, so does its shadow. AI deception, in which systems induce false beliefs to secure self-beneficial outcomes, has evolved from a speculative concern into an empirically demonstrated risk across language models, AI agents, and emerging frontier systems: recent studies show that models can engage in various forms of deception, including lying, strategic withholding of information, and goal misrepresentation. As capabilities improve, the risk that highly autonomous AI systems might deceive to achieve their objectives grows increasingly salient.
AI deception is now recognized not only as a technical challenge but also as a critical concern across academia, industry, and policy. Notably, key strategy documents and summit declarations—such as the Bletchley Declaration and the International Dialogues on AI Safety—also highlight deception as a failure mode requiring coordinated governance and technical oversight.
The AI Deception Framework is structured around a cyclical interaction between the Deception Emergence process and the Deception Treatment process.
(1) Incentive Foundation: The underlying objectives or reward structures that create incentives for deceptive behavior.
(2) Capability Precondition: The model's cognitive and algorithmic competencies that enable it to plan and execute deception.
(3) Contextual Trigger: External signals from the environment that activate or reinforce deception.
The interplay among these factors gives rise to deceptive behaviors, and their dynamics influence the scope, subtlety, and detectability of deception.
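As an illustrative data structure only (the field names mirror the framework's terms, but the numeric encoding and the multiplicative risk heuristic are our own assumptions, not the paper's), the interplay of the three causal factors could be represented as:

```python
from dataclasses import dataclass

# Illustrative encoding of the three causal factors behind deception
# emergence. The [0, 1] scales and the risk heuristic are assumptions.

@dataclass
class EmergenceFactors:
    incentive_foundation: float      # strength of reward-driven incentives, in [0, 1]
    capability_precondition: float   # competency to plan/execute deception, in [0, 1]
    contextual_trigger: float        # strength of environmental activation, in [0, 1]

    def deception_risk(self) -> float:
        """Model the interplay multiplicatively: if any factor is absent
        (zero), deception cannot emerge and the risk collapses to zero."""
        return (self.incentive_foundation
                * self.capability_precondition
                * self.contextual_trigger)

# A capable, incentivized model with no contextual trigger poses no risk
# under this toy heuristic; adding the trigger makes the risk nonzero.
no_trigger = EmergenceFactors(0.9, 0.8, 0.0)
triggered = EmergenceFactors(0.9, 0.8, 0.7)
print(no_trigger.deception_risk(), triggered.deception_risk())
```

The multiplicative form is one way to capture the text's claim that deception arises from the interplay of all three factors rather than from any single one.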
The treatment process spans a continuum of approaches: external and internal detection methods, systematic evaluation protocols, and potential solutions targeting the three causal factors of deception, including both technical interventions and governance-oriented auditing efforts.
The two phases—deception emergence and treatment—form an iterative cycle in which each phase updates the inputs of the next. This cycle, which we call the deception cycle, recurs throughout the system lifecycle, shaping the pursuit of increasingly aligned and trustworthy AI systems. We conceptualize it as a continual cat-and-mouse game: as model capabilities grow, the shadow of intelligence inevitably emerges, reflecting the uncontrollable aspects of advanced systems.
Treatment efforts aim to detect, evaluate, and resolve current deceptive behaviors to prevent further harm. Yet more capable models can develop novel forms of deception, including strategies to circumvent or exploit oversight, with treatment mechanisms themselves introducing new challenges. This ongoing dynamic underscores the intertwined technical and governance challenges on the path toward AGI.