The goal of this post is to discuss the agent structure problem. I will present a certain type of proof that appears to be a fruitful way to think about this problem. I will then show some very incomplete but hopefully demonstrative proofs of this form.

The proof form looks like the following:

- We constrain the space of all agents by the pool constraint.
- The agent pool is run in some environment class.
- The agent pool is filtered by an environment constraint.
- The remaining agents are proved to all have a certain structure.

We will now break down these terms.

An *agent* is a traditional RL agent that operates in some environment by receiving observations and outputting actions. We model agents as Turing machines: observations are written to the input tape, the machine uses its work tape to compute the next action, and that action is written to its output tape.

An *agent pool* is created by filtering the space of all agents by the *pool constraint*. This can be a simple constraint like ensuring the agents have the correct input and output types. Or it can be a constraint on the description length of the Turing machine, keeping only agents that could be represented on a modern computer.
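As a toy illustration of a pool constraint, suppose agents are represented as program source strings; then a description-length constraint is just a length check. The names and size budget below are hypothetical stand-ins, not anything from the post:

```python
# A toy pool constraint over agents represented as program source strings:
# keep only programs whose description length fits a fixed budget.
# The budget and the example programs are purely illustrative.

MAX_DESCRIPTION_LENGTH = 10_000  # rough stand-in for "representable on a modern computer"

def pool_constraint(program_source: str) -> bool:
    """Pass an agent iff its description is short enough."""
    return len(program_source) <= MAX_DESCRIPTION_LENGTH

candidate_programs = [
    "def act(obs): return 'A'",
    "def act(obs): return 'B' if obs == 0 else 'A'",
    "x" * 20_000,  # stand-in for an over-sized program
]
agent_pool = [p for p in candidate_programs if pool_constraint(p)]
```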

An *environment* is a Turing machine that takes actions as inputs and returns observations.
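Concretely, the agent/environment interaction described so far can be sketched as a simple loop. The class and function names here are my own, chosen for illustration:

```python
class Environment:
    """Takes actions as input and returns observations."""
    def step(self, action):
        raise NotImplementedError

class Agent:
    """Receives observations and outputs actions."""
    def act(self, observation):
        raise NotImplementedError

def run(agent, env, steps, first_obs=0):
    """Alternate observations and actions for a fixed number of
    steps, returning the final observation."""
    obs = first_obs
    for _ in range(steps):
        action = agent.act(obs)
        obs = env.step(action)
    return obs
```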

The agent pool is run in an *environment class*. This is either a single environment or a set of environments parameterized by some variables. The agent pool is run in every environment in the set.

Then we can pick some *environment constraint* to judge how well each agent did and whether it should pass. A simple one is that the agent optimizes the environment. Roughly, an agent optimizes in an environment if it ends up in a special state more often than chance. In a sense, it is pushing the environment towards that state.

There are multiple ways of turning this into a criterion over an environment class: the agent could be required to optimize in a single environment, in all environments, or in a majority of them.
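Here is a minimal sketch of the filtering step under the "all environments" criterion. Everything here (`Counter`, `ends_positive`, `filter_pool`) is a hypothetical stand-in of my own construction:

```python
class Counter:
    """Toy environment: the state is an integer; action 1 adds one,
    any other action subtracts one. The observation is the new state."""
    def __init__(self):
        self.state = 0
    def step(self, action):
        self.state += 1 if action == 1 else -1
        return self.state

def ends_positive(agent, env, steps=10):
    """A simple environment constraint: the final state exceeds 0."""
    obs = 0
    for _ in range(steps):
        obs = env.step(agent(obs))
    return env.state > 0

def filter_pool(pool, env_makers, criterion):
    """Keep the agents satisfying the criterion in every environment of
    the class. ("Single" or "majority" criteria would be small
    variations on the all(...) below.)"""
    return [a for a in pool if all(criterion(a, make()) for make in env_makers)]

always_up = lambda obs: 1
always_down = lambda obs: 0
survivors = filter_pool([always_up, always_down], [Counter], ends_positive)
```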

We can then prove a common structure among all the remaining agents in the agent pool. This structure could be a lot of different things like storing a variable or running a looping mechanism.

We will now present examples of proofs of this form. The proofs aren't rigorous, but they should illustrate the general pattern.

## Proof examples

### Paying attention

We first define an environment class of two environments, each of whose state is a single integer, with two actions $A$ and $B$. In the first environment, $A$ adds one to the state and outputs a 1, while $B$ subtracts one from the state and outputs a 0. The other environment in the class has the actions flipped.
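This environment class might be sketched like so. `ButtonEnv` and the `flipped` parameter are my own illustrative names:

```python
class ButtonEnv:
    """Two-environment class: state is an integer, actions are 'A'
    and 'B'. `flipped` selects which of the two environments this is."""
    def __init__(self, flipped=False):
        self.state = 0
        self.flipped = flipped

    def step(self, action):
        """Return the observation (1 for +1, 0 for -1)."""
        add = (action == 'A') != self.flipped  # the flipped env swaps A and B
        if add:
            self.state += 1
            return 1
        else:
            self.state -= 1
            return 0
```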

This means that to optimize, you need to *pay attention* to your observations.

So we filter the space of entities by the criterion that the entity makes the expected value of the environment state greater than 0, i.e., it systematically moves the environment to a greater state.

These entities must have this structure:

$p(a_i)$ is the probability that the entity takes a certain action. This probability must be different when the entity is conditioned on the last observation. If the entity saw that the last observation was a 1, then it should be certain about which action to take. This is a necessary but not sufficient condition for optimization in this environment.

Optimization in this environment also implies that the agent must have *remembered* the last action it took at some point: it did this to learn which action was the positive one. This implies some structure on the work tape and on how the agent is copying data from its input tape to its output tape.
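As a sanity check, here is one hypothetical agent with exactly this structure: it stores its last action, and conditions its next action on whether the last observation was a 1. The class is my own sketch, not a claim about what all passing agents look like:

```python
class AttentiveAgent:
    """Remembers its last action and pays attention to the last
    observation: repeat the action if it produced a 1, switch otherwise."""
    def __init__(self):
        self.last_action = None

    def act(self, obs):
        if self.last_action is None:
            action = 'A'                # probe with an arbitrary first action
        elif obs == 1:
            action = self.last_action   # the last action raised the state
        else:
            # the last action lowered the state: switch to the other one
            action = 'B' if self.last_action == 'A' else 'A'
        self.last_action = action
        return action
```

In either environment of the class, this agent loses at most one step before locking onto the state-raising action, so its expected final state is positive.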

### Searching

We can define an environment where the agent has to perform tree search. Imagine the states arranged in a tree, where at each state you can go up the tree or down to the left or right child. There is exactly one leaf node that outputs a 1 when the agent reaches it; all other nodes output a 0. The environment also outputs the possible next states. The environment class is parameterized by which leaf is the winning state.
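A minimal sketch of this environment class, using heap-style numbering of a complete binary tree (root is node 1, children of node $i$ are $2i$ and $2i+1$). `TreeEnv` and its interface are my own illustrative choices:

```python
class TreeEnv:
    """Complete binary tree of a given height; exactly one leaf
    outputs a 1. Observations are (signal, available_moves)."""
    def __init__(self, height, winning_leaf):
        self.height = height
        self.winning_leaf = winning_leaf
        self.node = 1  # start at the root

    def is_leaf(self):
        # leaves are the nodes at depth `height`
        return self.node >= 2 ** self.height

    def step(self, action):
        if action == 'up' and self.node > 1:
            self.node //= 2
        elif action == 'left' and not self.is_leaf():
            self.node *= 2
        elif action == 'right' and not self.is_leaf():
            self.node = 2 * self.node + 1
        signal = 1 if self.node == self.winning_leaf else 0
        moves = (['up'] if self.node > 1 else []) + \
                ([] if self.is_leaf() else ['left', 'right'])
        return signal, moves
```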

The criterion is that the entity always reaches the winning state.

Now we need to notice that there would be some strange entities that satisfy this criterion. There could be ones that have *memorized* every possible tree like this and have the correct list of states to visit pre-programmed in, so they simply visit them in order. These entities don't feel like they are searching. They don't even really need to pay attention to the observations; they just output all the actions and finish.

How can we eliminate these non-searchers? We can bound the description length of the entity. We can also add an extra parameter to the environment class: the height of the tree. If an entity isn't given this value, it must pay attention to its inputs to know which actions to output.

We can now pick our favorite search algorithm and bound the set of entities even further, say by requiring the same space and time complexity as DFS. Then we get a space of agents where some are doing DFS, but are they all searching? A proof could look like:

- Add the following constraints to the space of all agents:
  - optimizes in TREE_WORLD
  - $O(n)$ time and space complexity, where $n$ is the size of the tree
- Then all of the remaining agents must be, in some sense, "searching":
  - each agent must have some internal state that tracks where in the tree it is located
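To make the "internal state tracking where it is" structure concrete, here is a hedged sketch of one agent that would survive these constraints: a backtracking DFS whose only memory is the path from the root to its current node (plus a couple of bits of bookkeeping). All names are my own, and observations are assumed to be `(signal, available_moves)` pairs:

```python
class DFSAgent:
    """Depth-first search over a tree environment.

    Internal state: `path`, the left/right choices from the root to the
    current node -- exactly the "tracking where it is" structure, using
    space proportional to the tree height.
    """
    def __init__(self):
        self.path = []             # choices made from the root
        self.pending = None        # forced direction after backtracking
        self.backtracking = False  # true while unwinding an exhausted subtree
        self.found = False

    def act(self, observation):
        signal, moves = observation
        if signal == 1:
            self.found = True
        if self.found:
            return 'up'            # criterion already met; behavior is free
        at_leaf = 'left' not in moves
        if at_leaf or self.backtracking:
            last = self.path.pop()
            if last == 'left':
                # left subtree failed: try the right sibling next
                self.pending, self.backtracking = 'right', False
            else:
                self.backtracking = True  # right subtree exhausted too
            return 'up'
        choice = self.pending or 'left'   # descend, leftmost child first
        self.pending = None
        self.path.append(choice)
        return choice
```

The point of the sketch is that the path variable is not optional bookkeeping: without something tracking its position, the agent cannot know which subtrees it has already ruled out.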

### The main goal

Now we set up the main goal of this project, which is to show that if an agent is doing well enough in some environment, then it must be a goal-optimizing agent.

The pool constraint should be very liberal. We want to say that we will get anything that we could plausibly get from training a large model via gradient descent.

I think the creative part of the proof would be coming up with some environment class and constraint such that only scary goal-optimizing agents come out. To do this, we need to build up more intuition about the structure of goal optimization. This could look like building more robust definitions of planning and search that don't overly constrain the space.

## Closing thoughts

It is interesting to note that the entities aren't really "trying" to be the right entity here; they just happen to be the correct ones for the environment class that we are constructing. This is nowhere near the agency of a human, who can change what they are trying to do.

These types of proofs seem like they can be complicated. It is easy to provide an example of an agent that optimizes in an environment, but it is a lot harder to put constraints on the space of all agents such that you only get agents with a certain structure. This is similar to proving a lower bound on some problem.

I think this is important for AI safety because, by default, our models come from the space of all entities. If we can prove certain things about their internal structure just by observing that they optimize in certain scenarios, this could give us better evals to run to catch this behavior sooner.

A negative proof here would also be interesting: if an agent can solve most problems in reasonable time and space without a mechanism that we would recognize as search, that would say something fundamental about intelligence.

In future posts, I hope to break down this idea more and explore more formally the different structures one can get.