Top-10 AI Deception Papers

Most Influential Research in AI Deception Field

📅 Monthly Update - July 2025

📚 Reading Recommendations

Click to view our curated reading guide for July 2025's latest AI deception research

1
🔥 Latest July 2025
Manipulation Attacks by Misaligned AI
Manipulation Attacks by Misaligned AI: Risk Analysis and Safety Case Framework
Rishane Dassanayake et al.
July 18, 2025
Examines the risk of manipulation attacks from LLM-based agents, including cases where AI systems strategically deceive humans to remove safeguards. Proposes a structured safety framework: proving inability, enforcing control, and ensuring trustworthiness to defend against strategic AI-driven deception.
Theoretical Safety Framework Risk Analysis
2
🔥 Latest July 2025
Language Models can Subtly Deceive Without Lying
Language Models Can Subtly Deceive Without Lying: A Case Study on Strategic Phrasing in Legislation
Samyak Dogra et al.
ACL 2025
Investigates how LLMs use subtle, strategic phrasing to influence legislative decision-making without producing outright falsehoods. Finds that deception success rates increase by up to 40% when phrasing strategies are optimized.
Empirical Strategic Deception Linguistic Manipulation
3
🔥 Latest July 2025
Adversarial Activation Patching
Adversarial Activation Patching: A Framework for Detecting and Mitigating Emergent Deception in Safety-Aligned Transformers
Santhosh Kumar Ravindran
July 14, 2025
Uses activation patching to identify and induce deceptive behavior in RLHF-trained models. Pinpoints sparse, deception-related neurons and layers, enabling targeted mitigation strategies by detecting anomalous activations or fine-tuning on patched datasets.
Detection Mitigation Activation Patching
4
🔥 Latest July 2025
Benchmarking Deception Probes
Benchmarking Deception Probes via Black-to-White Performance Boosts
Avi Parrack, Carlo L. Attubato, Stefan Heimersheim
July 12, 2025
Evaluates "deception probes" trained on LLM hidden activations to distinguish lies from truthful statements. Finds white-box probes outperform black-box detectors but only modestly, suggesting that current detection approaches remain fragile against adversarially deceptive models.
Detection Benchmarking Empirical
5
🔥 Latest July 2025
Machine Bullshit
Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models
Liang et al.
July 10, 2025
Introduces a Bullshit Index, a metric to measure LLMs' indifference to truth. Identifies four forms: empty rhetoric, paltering, weasel words, and unverified claims. Finds RLHF and chain-of-thought prompting often exacerbate these behaviors.
Empirical Evaluation Metrics Truth Indifference
6
🔥 Latest July 2025
When Truthful Representations Flip Under Deceptive Instructions
When Truthful Representations Flip Under Deceptive Instructions?
Xianxuan Long et al.
July 10, 2025
Analyzes how hidden representations in transformer models change under deceptive instructions. Finds distinct latent patterns in middle layers when prompted to lie, suggesting potential for early detection of deceptive outputs using linear probes on activations.
Detection Representation Analysis Interpretability
7
🔥 Latest July 2025
Early Signs of Steganographic Capabilities in Frontier LLMs
Early Signs of Steganographic Capabilities in Frontier LLMs
Artur Zolkowski et al.
July 3, 2025
Evaluates steganography capabilities in frontier LLMs, focusing on how models could evade monitoring through encoding hidden information within seemingly benign generations. The study examines both passing encoded messages and performing encoded reasoning.
Empirical Steganography Hidden Communication
8
🔥 Latest July 2025
CONSENSAGENT
CONSENSAGENT: Efficient and Effective Consensus in Multi-Agent LLM Systems through Sycophancy Mitigation
Priya Pitre, Naren Ramakrishnan, Xuan Wang
ACL 2025 Findings
Proposes CONSENSAGENT, a system for mitigating sycophancy in multi-agent LLM setups. Encourages controlled dissent among agents to avoid echo-chamber agreement, improving consensus accuracy on six reasoning benchmarks while lowering computational costs.
Mitigation Multi-Agent Systems Sycophancy
9
🔥 Latest July 2025
ICLShield
ICLShield: Exploring and Mitigating In-Context Learning Backdoor Attacks
Zhiyao Ren et al.
July 2, 2025
Addresses vulnerabilities in in-context learning where adversaries can manipulate LLM behaviors by poisoning demonstrations. Proposes the dual-learning hypothesis and introduces ICLShield defense mechanism.
Mitigation Security Backdoor Defense
10
🔥 Latest July 2025
Evaluating LLM Agent Collusion in Double Auctions
Evaluating LLM Agent Collusion in Double Auctions
Kushal Agrawal et al.
July 2, 2025
Examines scenarios where LLM agents can choose to collude (secretive cooperation that harms another party) in simulated continuous double auction markets, analyzing how communication, model choice, and environmental pressures affect collusive tendencies.
Empirical Agent Behavior Economic Simulation