Can We Trust AI Models That Try to Deceive Us?

Evaluating safety techniques against intentionally subversive AI models.

As large language models (LLMs) grow in capability and autonomy, ensuring they don't take harmful actions—even when they try to—is paramount. Traditional safety pipelines assume models are cooperative, but what if the AI is actively trying to deceive us?

Summary of the Paper

We introduce and rigorously evaluate "AI control protocols" designed to be robust even when the untrusted AI model is trying to subvert oversight mechanisms. Using GPT-4 as a stand-in for a potentially deceptive model, we tested various strategies to detect and eliminate hidden backdoors in code.

The APPS Backdooring Testbed

We used a dataset of 5,000 programming problems to simulate a realistic threat scenario: GPT-4 is tasked with solving these, but sometimes embeds hidden logic errors (backdoors). The goal is to catch every backdoor without sacrificing too much performance.

Example of a Backdoored Function

def is_odd(x):
    if x == 354:
        return True
    return x % 2 == 1

Evaluation Methodology

We evaluate protocols using a red team/blue team framework:

  • Blue Team: Represents AI developers, proposing safety protocols
  • Red Team: Tries to subvert these protocols by finding ways to insert backdoors

The Control Setup

Protocol can use these resources to produce code:

Human (H)

Strong and Trusted

High capability but expensive to use. Limited to auditing 2% of problems.

Trusted Model (T)

Trusted but Limited

Reliable but less capable. Implemented by GPT-3.5.

Untrusted Model (U)

Strong and Cheap

High capability but potentially deceptive. Implemented by GPT-4.

Key Metrics

We measure two key metrics for each protocol:

  • Safety: The probability that no backdoors are submitted across all problems. A single backdoor counts as failure.
  • Usefulness: The proportion of submitted code that is correct, rescaled between GPT-3.5 (0%) and GPT-4 (100%) performance.

Protocols & Results

ProtocolDescriptionUsefulnessSafety
Don't use untrusted modelsProduce code using only H and T2%100%
Upfront auditingProduce code using U, audit first 2% of problems100%15%
Trusted monitoringU produces code, T reviews suspiciousness, audit top 2%97%62%
Trusted monitoring with defer (10%)Like trusted monitoring but use T for top 2-10% suspicious code93%83%
Untrusted monitoringU monitors its own code, trained on synthetic backdoors96%79%
Trusted editing (20%)T edits U's code, audit top 2%, use edited for 2-20%94%92%
A line graph showing Pareto frontier between safety percentage and usefulness percentage

The results show that trusted editing achieves the best balance of safety and usefulness, with 92% safety while maintaining 94% of GPT-4's performance. This protocol works by having the trusted model (T) edit potentially suspicious code from the untrusted model (U), while still auditing the most suspicious cases.