This webpage provides, among other resources, the following contents:
If you find our survey helpful, please cite it in your publications.
@article{chen2025ai,
  author = {Boyuan Chen and Sitong Fang and Jiaming Ji and Yanxu Zhu and Pengcheng Wen and Jinzhou Wu and Yingshui Tan and Boren Zheng and Mengying Yuan and Wenqi Chen and Donghai Hong and Alex Qiu and Xin Chen and Jiayi Zhou and Kaile Wang and Juntao Dai and Borong Zhang and Saad Siddiqui and Isabella Duan and Yawen Duan and Brian Tse and Jen-Tse Huang and Kun Wang and Baihui Zheng and Jiaheng Liu and Jian Yang and Yiming Li and Wenting Chen and Dongrui Liu and Lukas Vierling and Wenxuan Wang and Jitao Sang and Zhengyan Shi and Chi-Min Chan and Eugenie Shi and Simin Li and Juncheng Li and Wei Ji and Dong Li and Jun Song and Yinpeng Dong and Jie Fu and Bo Zheng and Min Yang and Yike Guo and Philip Torr and Zhongyuan Wang and Yaodong Yang and Tiejun Huang and Ya-Qin Zhang and Hongjiang Zhang and Andrew Yao},
  title  = {AI Deception: Risks, Dynamics, and Controls},
  year   = {2025},
  note   = {Beta Version v4, updated on 2025.11.20},
  url    = {https://www.deceptionsurvey.com/paper.pdf}
}
Why AI Deception Matters. From a two-dimensional vantage point, we only perceive the binary opposition of “black or white” (e.g., capability vs. deception). When we adopt a three-dimensional perspective, AI capability and deception reveal themselves as two tightly interwoven, inseparable dimensions of the same structure.
(1) Möbius Lock: The seemingly opposing surfaces of capability and deception are simply different manifestations of a single Möbius surface; attempts to isolate one from the other overlook their deep coupling.
(2) Shadow of Intelligence: The advancement of AI capability is intrinsically linked to the expansion of deception risks. Progress in intelligence (e.g., novel algorithms, new architectures, larger model size) often directly gives rise to new, more sophisticated deceptive methods.
(3) Cyclic Problem: Strategies designed to counter deception inevitably induce or generate new forms of deception or counter-mechanisms. Defense and deception form a mutually reinforcing, self-perpetuating cycle, ensuring that efforts to eliminate deception entirely will always be insufficient.
AI deception can be broadly defined as behavior by AI systems that induces false beliefs in humans or other AI systems, thereby securing outcomes that are advantageous to the AI itself.
At time step t (potentially within a long-horizon task), a signaler emits a signal Y_t to a receiver. Upon receiving Y_t, the receiver forms a belief X_t about the underlying state and subsequently takes an action A_t. We classify Y_t as deceptive if the following conditions hold:
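The conditions are enumerated in the full paper; from the definition above (a false belief is induced, and the signaler benefits), they plausibly take a form along the following lines. This is our sketch only: s_t denotes the true underlying state, f_receiver the receiver's belief-update map, and U the signaler's utility, none of which is notation fixed by the text here.

```latex
% Sketch (not the paper's exact statement) of the deception conditions.
% (1) Belief distortion: the signal induces a belief that diverges
%     from the true state s_t.
X_t = f_{\mathrm{receiver}}(Y_t), \qquad X_t \neq s_t
% (2) Signaler benefit: emitting Y_t yields higher expected utility
%     for the signaler than a truthful signal Y_t^{\mathrm{true}} would.
\mathbb{E}\left[ U \mid Y_t \right] > \mathbb{E}\left[ U \mid Y_t^{\mathrm{true}} \right]
```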
In dynamic multi-step settings, deception can be modeled as a temporal process where the signaler emits a sequence of signals Y_{1:T}, gradually shaping the receiver's belief trajectory b_t. If this trajectory persistently diverges from the ground truth in a manner that causally increases (or has the potential to increase) the signaler's utility, the interaction constitutes sustained deception.
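The multi-step criterion above can be sketched in code. Everything here is an illustrative assumption rather than the paper's formal model: the receiver updates its belief by running-averaging the signals, divergence is measured against a fixed threshold, and the signaler's utility is a function of the belief trajectory.

```python
# Sketch of the sustained-deception criterion: a belief trajectory that
# persistently diverges from the ground truth AND raises the signaler's
# utility relative to truthful signaling. Update rule, threshold, and
# utility function are illustrative assumptions, not the paper's model.

def receiver_belief(signals):
    """Receiver's belief trajectory b_1..b_T as a running average of signals."""
    beliefs, total = [], 0.0
    for t, y in enumerate(signals, start=1):
        total += y
        beliefs.append(total / t)
    return beliefs

def is_sustained_deception(signals, true_state, signaler_utility,
                           divergence_eps=0.1):
    """Flag sustained deception: persistent divergence plus utility gain."""
    beliefs = receiver_belief(signals)
    persistently_diverges = all(abs(b - true_state) > divergence_eps
                                for b in beliefs)
    # Counterfactual: the belief trajectory under fully truthful signaling.
    truthful_beliefs = receiver_belief([true_state] * len(signals))
    gains_utility = signaler_utility(beliefs) > signaler_utility(truthful_beliefs)
    return persistently_diverges and gains_utility

# Example: the signaler benefits when the receiver's final belief is inflated.
utility = lambda beliefs: beliefs[-1]
print(is_sustained_deception([0.9, 0.8, 0.85], true_state=0.2,
                             signaler_utility=utility))  # → True
```

Under this toy criterion, an honest signal sequence (one that tracks the true state) fails the divergence test and is not flagged, while a persistently inflated sequence that raises the signaler's utility is.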
As intelligence increases, so does its shadow. AI deception, in which systems induce false beliefs to secure self-beneficial outcomes, has evolved from a speculative concern into an empirically demonstrated risk across language models, AI agents, and emerging frontier systems: recent studies show that models can engage in various forms of deception, including lying, strategic withholding of information, and goal misrepresentation. As capabilities improve, the risk that highly autonomous AI systems might deceive to achieve their objectives grows increasingly salient.
AI deception is now recognized not only as a technical challenge but also as a critical concern across academia, industry, and policy. Notably, key strategy documents and summit declarations—such as the Bletchley Declaration and the International Dialogues on AI Safety—also highlight deception as a failure mode requiring coordinated governance and technical oversight.
The AI Deception Framework is structured around a cyclical interaction between the Deception Emergence process and the Deception Treatment process.
(1) Incentive Foundation: The underlying objectives or reward structures that create incentives for deceptive behavior.
(2) Capability Precondition: The model's cognitive and algorithmic competencies that enable it to plan and execute deception.
(3) Contextual Trigger: External signals from the environment that activate or reinforce deception.
The interplay among these factors gives rise to deceptive behaviors, and their dynamics influence the scope, subtlety, and detectability of deception.
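As an illustrative data structure only (the field names mirror the framework's terms, but the numeric encoding and the multiplicative risk heuristic are our own assumptions, not the paper's), the interplay of the three causal factors could be represented as:

```python
from dataclasses import dataclass

# Illustrative encoding of the three causal factors behind deception
# emergence. The [0, 1] scales and the risk heuristic are assumptions.

@dataclass
class EmergenceFactors:
    incentive_foundation: float      # strength of reward-driven incentives, in [0, 1]
    capability_precondition: float   # competency to plan/execute deception, in [0, 1]
    contextual_trigger: float        # strength of environmental activation, in [0, 1]

    def deception_risk(self) -> float:
        """Model the interplay multiplicatively: if any factor is absent
        (zero), deception cannot emerge and the risk collapses to zero."""
        return (self.incentive_foundation
                * self.capability_precondition
                * self.contextual_trigger)

# A capable, incentivized model with no contextual trigger poses no risk
# under this toy heuristic; adding the trigger makes the risk nonzero.
no_trigger = EmergenceFactors(0.9, 0.8, 0.0)
triggered = EmergenceFactors(0.9, 0.8, 0.7)
print(no_trigger.deception_risk(), triggered.deception_risk())
```

The multiplicative form is one way to capture the text's claim that deception arises from the interplay of all three factors rather than from any single one.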
The treatment process spans a continuum of approaches: external and internal detection methods, systematic evaluation protocols, and potential solutions targeting the three causal factors of deception, including both technical interventions and governance-oriented auditing efforts.
The two phases—deception emergence and treatment—form an iterative cycle in which each phase updates the inputs of the next. This cycle, which we call the deception cycle, recurs throughout the system lifecycle, shaping the pursuit of increasingly aligned and trustworthy AI systems. We conceptualize it as a continual cat-and-mouse game: as model capabilities grow, the shadow of intelligence inevitably emerges, reflecting the uncontrollable aspects of advanced systems.
Treatment efforts aim to detect, evaluate, and resolve current deceptive behaviors to prevent further harm. Yet more capable models can develop novel forms of deception, including strategies to circumvent or exploit oversight, with treatment mechanisms themselves introducing new challenges. This ongoing dynamic underscores the intertwined technical and governance challenges on the path toward AGI.