The goal of this post is to present a theorem structure that seems useful for the agent structure problem. I'll give some examples of theorems of this type that illustrate what we may mean by structure.

The theorem form looks like the following:

- The space of all agents is constrained by a pool constraint.
- The agent pool is run in some environment class.
- The agent pool is filtered by an environment constraint.
- The remaining agents must all have a certain structure.

We will now break down these terms.

An *agent* is a traditional RL agent that operates in some environment by receiving observations and outputting actions. Agents are modeled as Turing machines: observations are written to the input tape, the machine uses its work tape to compute the next action, and it writes that action to its output tape.

An *agent pool* is created by filtering the space of all agents by the *pool constraint*. This can be a simple constraint, like ensuring the agents have the correct input and output types, or a constraint on the description length of the Turing machine to keep only agents that could be represented on a modern computer.

An *environment* is a Turing machine that takes actions as inputs and returns observations.

The agent pool is run in an *environment class*: either a single environment or a set of environments parameterized by some variables. The agent pool is run in every environment in the set.
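The agent/environment loop above can be sketched in code. This is a minimal illustration, not a definition from the post: the names `Agent`, `Env`, and `run` are my own, and here an agent is simplified to a function from the observation history to the next action.

```python
from typing import Callable, List

# An agent maps the observation history to the next action.
Agent = Callable[[List[int]], int]

class Env:
    """Toy environment: the action (an int) is added to an integer
    state, and the new state is returned as the observation."""
    def __init__(self) -> None:
        self.state = 0

    def step(self, action: int) -> int:
        self.state += action
        return self.state

def run(agent: Agent, env: Env, n_steps: int) -> List[int]:
    """Run the agent in the environment; return the observation history."""
    observations: List[int] = []
    for _ in range(n_steps):
        action = agent(observations)
        observations.append(env.step(action))
    return observations
```

For example, `run(lambda obs: 1, Env(), 5)` drives the state upward one step at a time.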

Then we pick some *environment constraint* to judge how well each agent did and whether it should pass. A simple one is that the agent optimizes the environment. In simple terms, an agent optimizes an environment if it ends in a special state more often than a random agent would. In a sense, it is pushing the environment towards that state.

There could be multiple ways of picking a criterion over an environment class: the agent could be required to optimize in a single environment, in all environments, or in a majority of them.

We can then prove a common structure among all the remaining agents in the agent pool. This structure could be a lot of different things like storing a variable or running a looping mechanism.

We will now present examples of theorems of this form.

## Theorem examples

### Paying attention

We first define an environment class **Button World** with the following conditions:

- The state is a single integer.
- There are two actions, $A$ and $B$.
- $A$ adds one to the state and outputs a 1.
- $B$ subtracts one from the state and outputs a 0.
- The other environment in this class is identical except the actions are flipped.
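One concrete way to render this environment class in code, as an illustrative sketch (the class name and `flipped` parameter are my own):

```python
class ButtonWorld:
    """One environment from the Button World class.

    With flipped=False, action "A" adds one to the state and emits a 1,
    while "B" subtracts one and emits a 0. With flipped=True, the two
    actions' effects are swapped, giving the other environment in the class.
    """
    def __init__(self, flipped: bool = False) -> None:
        self.state = 0
        self.flipped = flipped

    def step(self, action: str) -> int:
        assert action in ("A", "B")
        add = (action == "A") != self.flipped  # which action increments?
        if add:
            self.state += 1
            return 1
        else:
            self.state -= 1
            return 0
```

The environment class is then the two-element set `{ButtonWorld(False), ButtonWorld(True)}`.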

To perform optimization in **Button World** an agent needs to *pay attention* to its observations.

The space of all agents is first filtered down to the agents whose inputs and outputs typecheck with the environment (i.e., that don't output $C$ as an action). Otherwise, all possible agents are run in the environment. Call this set of agents the **Button World Agent Pool**.

We then filter the **Button World Agent Pool** by running the agents in the environment class and keeping those that make every environment's state greater than 0 after running $n$ steps. I.e., we keep the agents that systematically move the environment to a greater state. Call this new agent pool the **Button World Winning Agents**.

To pass this filter, an agent must take different actions in different environments. It must try an action, remember whether it led to a good outcome, and then learn the positive action and keep taking it. This filters out agents that take random actions.
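A self-contained sketch of this filtering step (all names here are illustrative): an "attentive" agent tries action $A$ once, checks the observation, and then repeats whichever action produced a 1, while a "constant" agent always plays $A$. Only the attentive agent leaves the state above 0 in both environments.

```python
def button_world(flipped: bool):
    """Return a step function for one Button World environment.
    step(action) -> (observation, state)."""
    state = [0]
    def step(action: str):
        add = (action == "A") != flipped
        state[0] += 1 if add else -1
        return (1 if add else 0), state[0]
    return step

def attentive(step, n_steps: int) -> int:
    """Try A once, remember the outcome, then repeat the good action."""
    good = "A"
    obs, state = step(good)
    if obs == 0:          # A was the bad action here; switch to B
        good = "B"
    for _ in range(n_steps - 1):
        obs, state = step(good)
    return state

def constant(step, n_steps: int) -> int:
    """Ignore observations and always play A."""
    state = 0
    for _ in range(n_steps):
        _, state = step("A")
    return state
```

Running both policies for 10 steps, the attentive agent ends above 0 in both environments, while the constant agent fails in the flipped one, so only the attentive agent makes it into the **Button World Winning Agents**.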

We now try to define the structure all the agents in **Button World Winning Agents** must share.

If we think of the agent's actions as probabilistic, we can write the following.

Let $p(a_i)$ be the probability that the agent takes a certain action at time step $i$. This probability must be different when conditioned on the last observation: $p(a_i) \ne p(a_i \mid o_{i-1})$. If the agent saw that the last observation was a 1, then it should be certain about which action to take. This is a necessary but not sufficient condition for optimization in this environment.

Optimization in this environment also implies that the agent must have *remembered* the last action it took at some point.
$p(a_j) \ne p(a_j | a_i)$
This says that the distribution over the action taken at time step $j$, where $j > i$, must be different given knowledge of an earlier action.
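These conditions can be checked numerically on the toy policy from before (everything here is an illustrative sketch, not part of the theorem). We run a "repeat what worked" agent in both Button World variants and tally action frequencies: unconditionally its actions are roughly 50/50 across the class, but conditioned on its last action and observation they are deterministic.

```python
from collections import Counter

def run_button_world(flipped: bool, n_steps: int = 100):
    """Run a 'repeat what worked' policy; return (actions, observations)."""
    good, state = "A", 0
    actions, observations = [], []
    for _ in range(n_steps):
        action = good
        add = (action == "A") != flipped
        state += 1 if add else -1
        obs = 1 if add else 0
        if obs == 0:                       # remember and switch away from a bad action
            good = "B" if action == "A" else "A"
        actions.append(action)
        observations.append(obs)
    return actions, observations

unconditional = Counter()
conditional = {}  # (last action, last observation) -> Counter of next actions
for flipped in (False, True):
    actions, observations = run_button_world(flipped)
    unconditional.update(actions)
    for prev_a, prev_o, next_a in zip(actions, observations, actions[1:]):
        conditional.setdefault((prev_a, prev_o), Counter())[next_a] += 1

p_A = unconditional["A"] / sum(unconditional.values())     # near 0.5 over the class
deterministic = all(len(c) == 1 for c in conditional.values())
```

Here `p_A` is close to one half, while every conditional distribution puts all its mass on a single action, so conditioning genuinely changes the distribution.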

This implies some structure on the work tape and on how the agent is copying data from its input tape to its output tape.

### Searching

We can define an environment class (**Tree World**) where the agent has to perform tree search. Imagine the states arranged in a tree, where at each state you can go up the tree or down to the left or right. There is exactly one leaf node that outputs a 1 when the agent reaches it; all other nodes output a 0. The environment also outputs the possible next states. The environment class is parameterized by which leaf is the winning state.
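One possible concrete rendering of Tree World (the encoding is my own assumption): states are nodes of a complete binary tree numbered heap-style, with the root at 1 and node $i$ having children $2i$ and $2i+1$. Actions are `"up"`, `"left"`, and `"right"`, and the observation is 1 at the winning leaf and 0 everywhere else.

```python
class TreeWorld:
    """Binary Tree World: leaves are the nodes 2**height .. 2**(height+1) - 1.
    The class is parameterized by which leaf is the winning state."""
    def __init__(self, height: int, winning_leaf: int) -> None:
        self.height = height
        self.winning_leaf = winning_leaf
        self.pos = 1                      # start at the root

    def step(self, action: str) -> int:
        at_leaf = self.pos >= 2 ** self.height
        if action == "up" and self.pos > 1:
            self.pos //= 2
        elif action == "left" and not at_leaf:
            self.pos = 2 * self.pos
        elif action == "right" and not at_leaf:
            self.pos = 2 * self.pos + 1
        return 1 if self.pos == self.winning_leaf else 0
```

For instance, in `TreeWorld(2, 5)` the agent reaches the winning leaf by going left then right from the root.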

The criterion is that the agent always reaches the winning state.

Now we need to notice that some strange agents would optimize in this environment. There could be agents that have *memorized* every possible tree like this and have the correct list of states to visit pre-programmed, so they simply visit all states. These entities don't feel like they are searching. They don't even need to pay attention to the observations; they just output all the actions and finish.

How can we eliminate these non-searchers? We can put bounds on the description length of the entity. We can also add an extra parameter to the environment class: the height of the tree. If an entity isn't given this value, it must pay attention to its inputs to know what actions to output.

We can now pick our favorite search algorithm and bound the set of entities even more. They have to have the same space and time complexity as DFS. Then we get out a space of agents where some are doing DFS, but are they all searching? A proof could look like:

- Add the following constraints to the space of all agents:
  - optimize in **Tree World**
  - $O(n)$ time and space complexity, where $n$ is the length of the tree
- All of the remaining agents must be, in some sense, "searching":
  - The agent must have some internal state that tracks where in the tree it is located.
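A sketch of one agent from this remaining pool (the tree encoding and all names are my own assumptions): a depth-first searcher whose internal state is its recursion stack, which implicitly tracks where in the tree it is. It explores leaves left to right and backtracks with `"up"` until it observes a 1.

```python
def tree_world(height: int, winning_leaf: int):
    """Heap-style binary tree environment; returns (step function, position)."""
    pos = [1]
    def step(action: str) -> int:
        at_leaf = pos[0] >= 2 ** height
        if action == "up" and pos[0] > 1:
            pos[0] //= 2
        elif action == "left" and not at_leaf:
            pos[0] = 2 * pos[0]
        elif action == "right" and not at_leaf:
            pos[0] = 2 * pos[0] + 1
        return 1 if pos[0] == winning_leaf else 0
    return step, pos

def dfs_agent(step, height: int) -> bool:
    """Depth-first search from the root; True once the winning leaf is found."""
    found = [False]
    def explore(depth: int) -> None:
        for action in ("left", "right"):
            if found[0]:
                return
            if step(action) == 1:        # moved to the winning leaf
                found[0] = True
                return
            if depth > 1:                # not a leaf yet; keep descending
                explore(depth - 1)
            if not found[0]:
                step("up")               # backtrack
    explore(height)
    return found[0]
```

The agent halts at the winning leaf and uses time and space linear in the tree, matching the complexity constraint above.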

We could keep coming up with different constraints to narrow down the remaining space and try to pull out more formal structure.

### The main goal

Now we set up the main goal of this project: to show that if an agent is doing well enough in some environment, then it must be a goal-optimizing agent.

The pool constraint should be very liberal. We want it to admit anything that we could plausibly get from training a large model via gradient descent.

I think the creative part of the proof would be coming up with some environment class and constraint such that only scary goal-optimizing agents come out. To do this, we need to build up more intuition about the structure of goal optimization. This could look like building up more robust definitions of planning and search that don't overly constrain the space.

## Closing thoughts

It is interesting to note that the entities here aren't really "trying" to be the right entity. They just happen to be the correct ones for the environment class that we are constructing. This is nowhere near the agency of a human, who can change what they are trying to do.

These types of proofs seem like they can be complicated. It is easy to provide an example of an agent that optimizes in an environment, but it is much harder to put constraints on the space of all agents such that you only get agents with a certain structure. This is similar to proving a lower bound on a problem.

I think this is important for AI safety because, by default, our models come from the space of all entities. If we can prove certain things about their internal structure just by observing that they optimize in certain scenarios, this could give us better evals for detecting this behavior sooner.

A negative result here would also be interesting: if an agent can solve most problems in reasonable time and space without any mechanism that we would recognize as search, that would say something fundamental about intelligence.

In future posts, I hope to break down this idea more and explore more formally the different structures one can get.