AI deception can be broadly defined as behavior by AI systems that induces false beliefs in humans or other AI systems, thereby securing outcomes that are advantageous to the AI itself.
At time step $t$ (potentially within a long-horizon task), the signaler emits a signal $Y_t$ to the receiver, prompting the receiver to form a belief $b_t$ about an underlying state $X_t$ and subsequently take an action $A_t$. If the following three conditions hold:
(1) the induced belief $b_t$ diverges from the true state $X_t$;
(2) the receiver's action $A_t$ is taken on the basis of this false belief;
(3) the resulting outcome benefits the signaler;
then $Y_t$ is classified as a deceptive signal, and the entire interaction constitutes an instance of deception.
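To make this single-step criterion concrete, a minimal Python sketch is given below; the receiver policy and signaler payoff function are hypothetical stand-ins introduced for illustration and are not part of the formalism above.

```python
# Minimal sketch of the single-step deception check, assuming a hypothetical
# receiver policy (belief -> action) and signaler payoff function.

def is_deceptive_signal(true_state, induced_belief, receiver_policy, signaler_payoff):
    """Check the three conditions for the signal Y_t that induced `induced_belief` (b_t)."""
    false_belief = induced_belief != true_state                        # (1) b_t diverges from X_t
    action_taken = receiver_policy(induced_belief)                     # (2) A_t is chosen under b_t
    truthful_action = receiver_policy(true_state)                      # counterfactual action
    signaler_benefits = (signaler_payoff(action_taken, true_state)
                         > signaler_payoff(truthful_action, true_state))  # (3) signaler gains
    return false_belief and signaler_benefits

# Example: the receiver buys only if it believes the product works, and the
# signaler profits from a sale, so a false "works" claim counts as deceptive.
buy_policy = lambda belief: "buy" if belief == "works" else "pass"
sale_profit = lambda action, _state: 1.0 if action == "buy" else 0.0
print(is_deceptive_signal("broken", "works", buy_policy, sale_profit))  # True
```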
In more general dynamic settings, deception can be modeled as a temporal process where the signaler emits a sequence of signals $Y_t$ over time steps $t = 1, \dots, T$, thereby shaping the receiver's belief state $b_t$. If this belief trajectory systematically diverges from the ground truth $X_t$, and this divergence consistently benefits the signaler, it constitutes a case of sustained deception.
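A corresponding sketch for the dynamic setting is shown below; the divergence measure, the averaging rule, and the thresholds are illustrative assumptions rather than choices specified in the text.

```python
# Sketch of the sustained-deception criterion over a trajectory t = 1..T.

def is_sustained_deception(beliefs, true_states, signaler_advantages, divergence,
                           divergence_threshold=0.0, benefit_threshold=0.0):
    """beliefs[t] ~ b_t, true_states[t] ~ X_t, and signaler_advantages[t] is the
    signaler's payoff gain at step t attributable to the induced false belief.
    Flags sustained deception when the belief trajectory systematically diverges
    from the ground truth and that divergence consistently benefits the signaler."""
    steps = len(beliefs)
    mean_divergence = sum(divergence(b, x) for b, x in zip(beliefs, true_states)) / steps
    mean_advantage = sum(signaler_advantages) / steps
    return mean_divergence > divergence_threshold and mean_advantage > benefit_threshold

# Example with binary beliefs: the belief is wrong at every step and the
# signaler gains a small advantage each time.
print(is_sustained_deception(
    beliefs=[1, 1, 1], true_states=[0, 0, 0], signaler_advantages=[0.2, 0.3, 0.1],
    divergence=lambda b, x: float(b != x)))  # True
```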
The AI Deception Cycle. (1) The framework is structured around a cyclical interaction between the Deception Genesis process and the Deception Mitigation process. (2) Deception Genesis identifies the conditions under which deception arises (incentive foundation, capability precondition, and contextual trigger), while Deception Mitigation addresses detection, evaluation, and potential solutions anchored in these genesis factors. However, deception mitigation is rarely a one-time fix; models may continually develop new ways to circumvent oversight, giving rise to increasingly sophisticated deceptive behaviors. This dynamic makes deception a persistent challenge throughout the entire system lifecycle.
Taxonomy of AI deception across three classes: Behavioral-Signaling Deception, Internal Process Deception, and Goal-Environment Deception. Deceptive behaviors are mapped along the dimensions of oversight vigilance and detection difficulty, showing a progression from overt behavioral signals to covert environmental manipulation.
We propose a five-level risk typology. The framework organizes deceptive risks along two dimensions: the duration of interaction (from short-term use to long-term engagement) and the scope of impact (from individual users to society at large).
At the first level, R1: Cognitive Misleading captures localized effects, where users form false beliefs or misplaced trust based on subtle distortions. R2: Strategic Manipulation reflects how, over prolonged interactions, users can be steered toward entrenched misconceptions or behavioral dependencies that are difficult to reverse. R3: Objective Misgeneralization highlights failures in specialized or high-stakes domains, where deceptively competent outputs can lead to software errors, economic losses, or fraud. R4: Institutional Erosion describes the loss of trust in science, governance, and epistemic institutions when deceptive practices scale, weakening social coordination and accountability. Finally, R5: Capability Concealment with Runaway Potential points to scenarios where hidden capabilities and long-horizon deception undermine human oversight entirely, raising the prospect of uncontrollable system behavior.
Each level represents a qualitatively distinct failure mode, with higher levels introducing risks that are harder to detect and reverse. Crucially, mitigation at lower levels does not guarantee safety at higher levels, as seemingly innocuous deceptive behaviors can accumulate into systemic threats.
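For reference, the typology can be kept as a simple lookup structure; the sketch below records only the level names and characteristic harms stated above, with the field layout being an illustrative choice.

```python
# Compact lookup of the five-level risk typology described above.
RISK_TYPOLOGY = {
    "R1": ("Cognitive Misleading", "localized false beliefs or misplaced trust"),
    "R2": ("Strategic Manipulation", "entrenched misconceptions or dependencies over prolonged interaction"),
    "R3": ("Objective Misgeneralization", "software errors, economic losses, or fraud in high-stakes domains"),
    "R4": ("Institutional Erosion", "eroded trust in science, governance, and epistemic institutions"),
    "R5": ("Capability Concealment with Runaway Potential", "hidden capabilities undermining human oversight"),
}

name, harm = RISK_TYPOLOGY["R4"]
print(f"{name}: {harm}")
```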
As training progresses, the root causes of emergent deception arise sequentially, forming the deception ladder. Before training, data contamination can be introduced during data preparation and reward misspecification during the design of the training procedure; together they form the seed of deceptive strategies. During training, goal misgeneralization causes deceptive strategies to be internalized and stabilized into instrumental goals. Later, in deployment, these goals may drive more sophisticated forms of deception that are harder to detect and pose greater risks.
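The ladder can be summarized as an ordered mapping from lifecycle stage to root cause, as in the sketch below; the list structure is an illustrative choice, while the stage and cause labels follow the text.

```python
# Ordered sketch of the deception ladder: lifecycle stage -> root cause.
DECEPTION_LADDER = [
    ("before training: data preparation", "data contamination"),
    ("before training: training-procedure design", "reward misspecification"),
    ("during training", "goal misgeneralization internalizes deceptive strategies as instrumental goals"),
    ("deployment", "those goals drive more sophisticated, harder-to-detect deception"),
]

for stage, root_cause in DECEPTION_LADDER:
    print(f"{stage}: {root_cause}")
```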
Hierarchical organization of AI capabilities that correlate with deception, grouped into three categories: Perception, Planning, and Performing. High-level capabilities are emergent abilities enabling sophisticated deception, while base capabilities provide the foundational competencies that support them. Examples adapted from agentic misalignment research.
We categorize contextual triggers into three main categories: Supervision Gap, Distributional Shift, and Environmental Pressure. Each category can independently trigger deception or combine with others to amplify deceptive behavior. Let $p_a$, $p_b$, and $p_c$ denote the probabilities of each category triggering deception. The illustrative example is inspired by the "fabricated actions" issue, where a model at test time encounters all three triggers simultaneously. Together, these triggers amplify the probability of deception, leading the model to fabricate actions it claims to have taken in order to fulfill user requests.
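As one way to make the amplification concrete, the sketch below combines the three trigger probabilities under an added independence assumption (a noisy-OR style combination); the text itself does not specify how $p_a$, $p_b$, and $p_c$ interact.

```python
# Noisy-OR combination of the three contextual-trigger probabilities,
# assuming (for illustration only) that the triggers act independently.

def combined_trigger_probability(p_a: float, p_b: float, p_c: float) -> float:
    """Probability that at least one trigger fires; always >= max(p_a, p_b, p_c)."""
    return 1.0 - (1.0 - p_a) * (1.0 - p_b) * (1.0 - p_c)

# Example: three moderate triggers compound into a much higher overall risk.
print(combined_trigger_probability(0.2, 0.3, 0.25))  # 0.58
```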
Overview of AI deception-related evaluations. We organize existing studies from two perspectives: evaluation in static settings and evaluation in interactive environments, and we annotate each work with its release date, data size, institution, data type, and description.