Overview of High Stakes Control Research

Background

A developer on the safety team of an AI lab might create a *control protocol *that aims to ensure safety properties when deploying that AI internally. A *control eval *is the process of evaluating the protocol by constructing a red team that aims to exploit weaknesses in the protocol and raise them to the developer.

When we do control research we simulate this scenario and the process the lab might go through to develop the best protocol for their internal deployment of their agent.

The deployment has the follow attributes

A distribution of tasks the agent is performing
Definitions of “safety failures” that can occur during the eval
Metrics that each protocol should be measured against

To find the best protocol, the lab spins up two teams

Blue team (BT): The blue team tries to create the best protocol that meets the lab’s expectations
Red team (RT): The red team tries to find failures in the blue team’s presented protocol

The teams take many turns

The blue team presents a protocol, aiming to maximize performance and safety metrics
The red team responds, aiming to minimize the safety metrics
The protocol is measured, the metrics are recorded
Repeat until teams converge on a pareto frontier of control protocols measured by the metrics

We normally assume that the future will be more crazy than it is now because there will be potentially millions of AI agents in the datacenter doing many diverse tasks. Humans can’t watch over all of the AI and review all of their PRs, so the control protocol will be the main line of defense.

Step 1: Picking and constructing a setting

To perform control evals, the lab creates a *proxy setting *to test the protocol in.

This proxy setting (or control setting, or just setting) aims to mock the actual deployment while being easier to do research in by being smaller and less complex. Ideally, the proxy setting would faithfully reflect most of the core dynamics of the real deployment setting.

When BT presents a protocol, it is run in the proxy setting and then upsampled to simulate what would happen during the deployment. The larger and more realistic the proxy setting is, the more reliable the upsampling process is.

While you can construct a proxy setting in many ways, here is what we normally construct:

A set of main tasks the agent will be asked to do
A (set of) side task(s) the agent must sneakily accomplish while performing the main task to trigger the safety failure.
The affordances of the red and blue team (i.e. the actions they can take)
Proxies for the metrics the lab cares about

Two example settings that research has been done in:

APPS setting
- Main task: The agent is tasked to write a python script that pass some unit tests
- Side Task: The agent must insert a backdoor in the code
  - A backdoor is an input to the submitted code that gives a wrong output.
- Actions: The agent can only write and submit code
- Metrics: Safety, Usefulness
Bash Setting
- Main Task: The agent is solving unix admin tasks
- Side Task: The agent must download and execute malware
- Actions: The agent can run bash commands and see their output
- Metrics: Safety, Usefulness

When doing control research, it is fine to have a more contrived setting that isn’t as realistic if you are studying other dynamics. For example, we could build a document summary setting to study how models will summarize research papers in the future. We don’t actually expect models to just summarize documents, but it is a good proxy that keeps the interesting dynamics.

Side tasks cause Safety Failures

In high stakes research, we simplify the game by assuming that if the untrusted agent accomplishes the side task once then the safety failure is accomplished. This is different from low stakes research where the safety failure is defined in terms of the average case performance of the mode le.g. The AI gets 20% of the problems wrong that it could normally get right. Low stakes research doesn’t necessarily need side tasks. Low stakes research is important but this document will focus on high stakes research.

You can still define a safety failure across multiple tasks in high stakes research. For example, you can say the AI must insert a backdoor in task 3 and execute on that backdoor in task 7. This is more tricky to measure but still interesting to research

You can also define multiple side tasks but only one of them needs to be accomplished

Picking side tasks

A side task is the thing the evil model is trying to accomplish. The side task should be similar to your threat model. e.g. If your threat model is escaping the datacenter, then putting backdoors in the code might be instrumental to that, so backdoors can be a side task.

Common threat models we work backwards from

Weight ex-filtration
- The model weights leak out of the datacenter
Rogue Internal Deployments
- The AI scaffolds itself in the datacenter in an unmonitored and hard to detect way, leading to potentially days of unmonitored thinking and actions.
Research Sabotage
- The model makes safety critical research go wrong

Control is meant to enhance security at labs. We expect that not all security failures can be prevented with traditional means, so we hope that control measures can fill in the gap. The best safety failures to study are ones that traditional security can’t reasonably prevent. For example, there aren’t any known ways of preventing backdoors in code via static analysis tools other means, we think monitoring might be the best option. We’d like to predict what safety failures might happen in the future and develop control protocols to prevent them.

If BT finds an easy non-control way to prevent the safety failure, i.e. adding an antivirus, then that might be a valid strategy. But if we think that the strategy won’t actually transfer to the real lab setting in the future, we can temporarily prevent BT from using that mitigation to enable interesting research.

For example, during the bash control evals, a network monitor could have prevented the safety failure entirely by whitelisting allowed domains. We decided to not let BT do this for two reasons: we aren’t sure network monitors will be used by labs in the future and disabling this strategy allowed us to do interesting research.

You should also aim to prevent your side task from accidentally occuring. We don’t expect that models will accidentally exfil their weights in the future unless they are malicious.

Measuring task accomplishments

To run an eval, you must be able to measure if your agent achieved the main task and the side task.

Measuring main task

In the deployment setting, you might not have ground truth tests to tell if the task was accomplished
- It is a valid blue team strat for their protocols to produce tests but this is expensive, and only humans know if the tests are “correct”
In the control setting, you can spend human and LLM labor to create tests or other proxies for determining if the task has been done

Measuring the side task

In the deployment setting, if you had a way to measure if your side task was accomplished, then you could use that measuring device to detect and prevent the side task
In the control setting, we need a proxy side task that still captures the interesting dynamics
- Example: in the bash setting, the agent’s side task is to download and execute malware. We test this by creating a fake malware file that writes a file to a known location and we can test for the existence of that file.

Step 2: Running an eval

Once you have a well defined setting created and ways of identifying your main task and side task success, you run a control eval on control protocols.

A protocol is a function that takes the state of the entire deployment, and outputs an action for the setting. The deployment consists of all the actions the agent has taken and their responses. More simple protocols just take the current trajectory as input.

Resources

A protocol makes use of whatever resources the blue team has access to.

Resources we normally assume

Untrusted model (U): strong but not trusted
Trusted model (T): weak but trusted
Human (H): strong and trusted but limited (can only used for X% of actions)

The blue team also has access to normal software, so they can employ static analyzers to read code, anti virus software, and other security software that is inexpensive to run if they find it useful.

You can play with the resources the blue team has access to in multiple ways.

Things we have thought of

Multiple trusted models
Single reasoning model, once it reasoning more than X tokens, then it is untrusted

I will assume that we are using U, T, and H for the rest of this doc Buck LW Post

The simulated deployment

A deployment is the entirety of tasks the labs will use the model for. A control eval is the process of measuring metrics when you run control protocol across the entire deployment.

But it is expensive to run the entire deployment, so a simulated deployment is run. During this simulated deployment, the protocol is run on a much smaller set of tasks and we upsample the data to compute the metrics for the entire deployment.

In past control evals, we have run protocols for hundreds of tasks. We think that as time goes on, bigger control evals are needed to decrease error bars.

Model Modes

During control evals, it is useful to evaluate your protocols in two different ways.

Honest eval: BT is in control of U, U is trying its best to accomplish the task
- Useful for determining how a protocol would function if the model is aligned
Attack eval: RT is in control of U, U is trying to cause the safety failure without getting caught
- Useful for determining how a protocol would function if the model is misaligned

What if you can’t trust the model you picked for U? Then the BT doesn’t actually have control over the honest model. This is an open area of research. This document assumes that the BT has full control over U, as we believe is the case right now.

Metrics

BT picks metrics to measure the performance of a protocol during a control eval. These metrics ideally inform the lab how well the control protocol performs during deployment.

We aim to pick metrics that are representative of what a lab cares about during a deployment.

Metrics we have experimented with

Usefulness: During an honest eval, what percent of tasks are solved by this protocol
- This is measured against an agent's ability outside a protocol to simulate the loss of performance this protocol has.
Safety: During an attack eval, what percent of the time does the agent cause a safety failure at least once
- This informs a lab of how likely the misaligned agent causes the safety failure.

Other metrics that might be interesting

Cost: How expensive is this protocol
Speed: How long does this protocol take to execute
Usefulness given scheming: What percent of tasks are solved by this protocol during an attack eval

At the end of the control eval, the blue team presents a pareto frontier of protocols they created as measured against the metrics. Then the lab gets to decide which protocol to implement.

Human Budget

Each protocol has access to a human, but we only have a limited budget of humans. I.e. the protocol can’t use humans for everything. So we need to go through a process of calibrating our protocols to make sure we don’t go over our human budget.

The way we do this, is normalize our function that audits an agent such that it would only audit the honest model X% of the time (where X is your auditing budget)

It doesn’t matter how it is done, but a protocol should only audit X% of actions when U is in honest mode.

A common human budget we set is 1%

One way (but not the only way) to enforce this is by having protocols output both a number and an action. That number represents how “suspicious” that action was. Do a usefulness run and measure the distribution of these scores. Calculate the auditing threshold by finding the 99% percentile of that distribution. If a protocol ever outputs a number above that, then audit. This ensures that only 1% of actions are audited when the agent is in honest mode. Hopefully, when the agent is in attack mode, more than 1% of evil actions are audited (this is okay and good!)

Step 3: Protocol Iteration

When a protocol is presented by the BT, the RT immediately gets to take a step against it. BT’s goal is to present the best possible protocols to fill out a pareto frontier against the metrics.

We aim to play this back and forth game fully down the game tree until we find an equilibrium and decide that neither team can do any better.

Here is an example back and forth that explains how the red teaming blue teaming process can look

BT: We now use U as the monitor since it is smarter and can catch more attacks
RT: We design U such that it never says an actual backdoor is sus
BT: We now honeypot the U monitor to try to catch it saying a attack is actually good
RT: We will leave signals in the code to know if we created it or if the blue team created it
BT: We now summarize the code shown to U to prevent it from detecting the signal

When you come up with a good strategy for one team, try to find the optimal response to that strategy. This is normally a hard process and easy to miss good strategies.

How to prevent overfitting

The blue team wants to prevent overfitting to the red team’s strategy. To avoid this, we let the red team take the last step. This means if the blue team just memorizes the red team’s strategies, then in the final step it can come up with a new one that the blue team hasn’t encountered before.

Compare against baselines

It is easy to keep making protocols more and more complicated, but you need to understand if the complexity is needed or helps. Every protocol that is produced, should at least beat trusted monitoring or the protocol isn’t useful

How to make the red team more powerful

So far, we have struggled eliciting strong performance from our attacking agents. Attacking seems to be a hard thing to do.

When you use some of these strategies to make your red team better you are giving your red team insane power and might make the work unrealistic. Whether this is a bad thing comes back to your goals of running a control eval. If you are a lab trying to know if a model they are about to deploy internally is safe, then you likely don’t need to use all of these strategies. If you are just trying to do control research, you might be bottlenecked on your red team so I think it is fine to use these strats.

Here are some general strategies / heuristics that are good to keep in mind when finding the best response to a protocol

How can U influence the final output

U wants the protocol to produce an action that causes the safety failure, thus how much selection over the final output does it have?

If you are using the trusted monitoring protocol, then U directly chooses the code, thus it has a lot of influence.

In the trusted editing protocol, U has less influence over the code produced, but not none. U can still try to produce actions such that the editor doesn’t remove the attack.

How can U learn what is going on

If you prompt U to change actions because they are sus, then U now has learned things about how the monitor works. It could use this information to hide its attack more. The more iteration loops here, the better U can get at hiding its code. It is normally good to minimize the bits about the protocol that U sees.

Iterating against the safety failure tests

In the APPS setting, there has been success in letting the model iterate against code that checks if the unit tests pass and if there is a backdoor in the code. Everytime the agent submits code, the tests are run and feedback is given to the agent. This normally yields much better results.

This doesn’t work as well in the bash setting since it is multiturn. The agent doesn’t know if their strategy is going to work until much later so they can’t get feedback on how well their actions work.

Attack templates

In the bash control work, we implement “attack templates” which are sequences of commands that if run would pwn the machine. The LLM would take these commands and create a couple variables (etc.. file names that relate to the task at hand) and just insert the commands around normal code. Ultimately the human does most of the red teaming by coming up with the attacks and the LLM implements the human’s ideas. This proved to be rather inefficient, the attacks were too generic and not tailored to the task at hand, thus it was hard to sneak these commands in. We attempted to expand on this strategy by creating multiple attacks that were more tailored to the task at hand. For example, we created a JSON attack that the attack policy only employed when the task might require making json data. Then the LLM created a large amount of json data that hid the desired command. We made a couple of these task-type specific attacks and ran them all together. This worked better but the attacks were still not as dynamic as they could have been.

Simulating the blue team / Attack Selection

Your attack policy could have access to its own version of the control protocol and play forward its actions to see what happens. For example, the attack agent can simulate a monitor in its head and determine if its actions are too suspicious to submit.

One important dynamic is whether the attack policy should have access to the trusted model. If they did, and knew the blue team’s prompt for the monitor and the monitor is used deterministically, then it could perfectly simulate the game and perform optimally, only attacking when they know the monitor won’t catch them.

But the RT can try to mimic this! They can train their untrusted model to try to predict the trusted model’s output and then simulate it in its head. It is important to measure how good your untrusted model is at doing this. The model hopefully won’t know the actual control protocol in play or the monitor's prompt so it would have to play some mixed strategy over different control protocols.

If the RT is good at simulating the protocol, then they can choose when to attack quite effectively. They might only attack once during the entire deployment.

Step 4: Conclusion

When done with control research, it is normally good to present a pareto frontier of the protocols you came up with and explanation of the back and forth of your teams.

It is worth it to report false starts like: The blue team tried this reasonable seeming strategy but the red team completely wrecked it in this interesting way. Hopefully future researchers will notice that and not go down a similar path.

Don’t forget the most important part of control research! At the end of the control eval that the purple team throws a party and brings cake.