AI Deception Survey Overview

Comprehensive Research Landscape and Key Findings

AI Deception Definition

AI deception can be broadly defined as behavior by AI systems that induces false beliefs in humans or other AI systems, thereby securing outcomes that are advantageous to the AI itself.

AI deception (from a functional perspective)

At time step t (potentially within a long-horizon task), a signaler emits a signal Yt to a receiver. Upon receiving Yt, the receiver forms a belief Xt about the underlying state and subsequently takes an action At. We classify Yt as deceptive if the following conditions hold:

  1. The action At yields an actual or potential utility gain for the signaler (short-term or long-term, direct or indirect).
  2. The action At is a rational response given the receiver's belief Xt, under some bounded rationality or decision model.
  3. The belief Xt is objectively misaligned with the signaler's belief (though it may not be false relative to the ground-truth state of the world).

In dynamic multi-step settings, deception can be modeled as a temporal process in which the signaler emits a sequence of signals Y1:T, gradually shaping the receiver's belief trajectory X1:T. If this trajectory persistently diverges from the ground truth in a manner that causally increases (or has the potential to increase) the signaler's utility, the interaction constitutes sustained deception.
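The three conditions above can be sketched as a single predicate. This is a minimal illustrative sketch, not an implementation from the survey: the names `Interaction` and `is_deceptive` are hypothetical, and each condition is assumed to reduce to a simple check (utility gain, rational response, belief misalignment).

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    """One signaling step: the signaler's payoff, the receiver's response, and both beliefs."""
    signaler_utility_gain: float   # utility change for the signaler caused by action A_t (condition 1)
    action_is_rational: bool       # is A_t a rational response given the receiver's belief X_t? (condition 2)
    receiver_belief: str           # the receiver's belief X_t about the underlying state
    signaler_belief: str           # what the signaler itself believes about that state

def is_deceptive(step: Interaction) -> bool:
    """Classify signal Y_t as deceptive iff all three functional conditions hold."""
    benefits_signaler = step.signaler_utility_gain > 0                 # condition 1
    rational_response = step.action_is_rational                        # condition 2
    belief_misaligned = step.receiver_belief != step.signaler_belief   # condition 3
    return benefits_signaler and rational_response and belief_misaligned

# Example: the signaler reports a task as "done" while itself believing it is
# "incomplete"; the receiver rationally approves, and the signaler gains utility.
step = Interaction(1.0, True, receiver_belief="done", signaler_belief="incomplete")
print(is_deceptive(step))  # True
```

Note that condition 3 compares the receiver's belief to the signaler's own belief, not to the ground truth, which matches the definition's caveat that the induced belief may happen to be true of the world.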

AI Deception Cycle


The AI Deception Cycle. (1) The framework is structured around a cyclical interaction between the Deception Emergence process and the Deception Treatment process. (2) Deception Emergence identifies the conditions under which deception arises, namely the incentive foundation, capability precondition, and contextual trigger, while Deception Treatment addresses detection, evaluation, and potential mitigations anchored in these genesis factors. Deception treatment, however, is rarely once-and-for-all: models may continually develop new ways to circumvent oversight, giving rise to increasingly sophisticated deceptive behaviors. This dynamic makes deception a persistent challenge throughout the entire system lifecycle.

Empirical Studies

Taxonomy of AI deception across three classes: Behavioral-Signaling Deception, Internal Process Deception, and Goal-Environment Deception. Deceptive behaviors are mapped along the dimensions of oversight vigilance and detection difficulty, showing a progression from overt behavioral signals to covert environmental manipulation.

Empirical Studies Taxonomy

Risks of AI Deception


We propose a five-level risk typology. The framework organizes deceptive risks along two dimensions: the duration of interaction (from short-term use to long-term engagement) and the scope of impact (from individual users to society-wide).

At the first level, R1: Cognitive Misleading captures localized effects, where users form false beliefs or misplaced trust based on subtle distortions. R2: Strategic Manipulation reflects how, over prolonged interactions, users can be steered toward entrenched misconceptions or behavioral dependencies that are difficult to reverse. R3: Objective Misgeneralization highlights failures in specialized or high-stakes domains, where deceptively competent outputs can lead to software errors, economic losses, or fraud. R4: Institutional Erosion emphasizes the loss of trust in science, governance, and epistemic institutions when deceptive practices scale, weakening social coordination and accountability. Finally, R5: Capability Concealment with Runaway Potential points to scenarios where hidden capabilities and long-horizon deception undermine human oversight entirely, raising the prospect of uncontrollable system behavior.

Each level represents a qualitatively distinct failure mode, with higher levels introducing risks that are harder to detect and reverse. Crucially, mitigation at lower levels does not guarantee safety at higher levels, as seemingly innocuous deceptive behaviors can accumulate into systemic threats.

Deception Emergence: Incentive Foundation × Capability × Trigger

Incentive Foundation


As the training stage progresses, the root causes of emergent deception arise sequentially, forming a deception ladder. Before training, data contamination occurs during data preparation and reward misspecification occurs during the design of the training procedure; together they plant the seed of deceptive strategies. During training, goal misgeneralization causes deceptive strategies to be internalized and stabilized into instrumental goals. Later, in deployment, these goals may drive more sophisticated forms of deception that are harder to detect and pose greater risks.

Capability Precondition


Hierarchical organization of AI capabilities that correlate with deception, grouped into three categories: Perception, Planning, and Performing. High-level capabilities are emergent abilities enabling sophisticated deception, while base capabilities provide the foundational competencies that support them. Examples adapted from agentic misalignment research.

Contextual Trigger


We group contextual triggers into three main categories: Supervision Gap, Distributional Shift, and Environmental Pressure. Each category can trigger deception independently or combine with the others to amplify deceptive behavior. Let pa, pb, and pc denote the probability that each category triggers deception. The illustrative example is inspired by the "fabricated actions" issue, in which a model at test time encounters all three triggers simultaneously. Together, the triggers amplify the probability of deception, leading the model to fabricate actions it claims to have taken in order to fulfill user requests.
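One simple way to formalize how pa, pb, and pc compound is a noisy-OR over independent triggers. This is an illustrative modeling assumption, not a formula from the survey; the function name is hypothetical.

```python
def combined_trigger_probability(p_a: float, p_b: float, p_c: float) -> float:
    """Noisy-OR: probability that at least one of three independent triggers fires.

    Assumes the Supervision Gap, Distributional Shift, and Environmental
    Pressure triggers act independently, each firing with its own probability.
    """
    return 1.0 - (1.0 - p_a) * (1.0 - p_b) * (1.0 - p_c)

# Each trigger alone is modest, but encountering all three at once compounds:
print(round(combined_trigger_probability(0.2, 0.0, 0.0), 3))  # 0.2
print(round(combined_trigger_probability(0.2, 0.2, 0.2), 3))  # 0.488
```

Under this toy model, three triggers that each fire 20% of the time push the combined probability of deception close to 50%, matching the intuition that co-occurring triggers amplify rather than merely add.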

Deception Treatment: Detection, Evaluation and Potential Solutions


Overview of AI deception-related evaluations. We organize existing studies from two perspectives, evaluation in static settings and evaluation in interactive environments, and annotate each work with its release date, data size, institution, data type, and description.

Benchmarks Table