Convergent instrumental goals (also known as basic AI drives) are goals that are useful for pursuing almost any other goal, and are thus likely to be pursued by any agent that is intelligent enough to understand why they’re useful. They are interesting because they may allow us to roughly predict the behavior of even AI systems that are much more intelligent than we are.
Instrumental goals are also a strong argument for why sufficiently advanced AI systems that were indifferent towards human values could be dangerous to humans, even if they weren’t actively malicious: an AI’s instrumental goals, such as self-preservation or resource acquisition, could come into conflict with human well-being. “The AI does not hate you, nor does it love you, but you are made out of atoms which it can use for something else.”
I’ve thought of a candidate for a new convergent instrumental drive: simplifying the environment to make it more predictable in a way that aligns with your goals.
Motivation: the more interacting components there are in the environment, the harder it is to predict. Go is a harder game than chess because the number of possible moves is larger, and because even a single stone can influence the game in a drastic fashion that’s hard to know in advance. Simplifying the environment will make it possible to navigate using fewer computational resources; this drive could thus be seen as a subdrive of either the cognitive enhancement or the resource acquisition drive.
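To make the scale of this concrete, here is a rough sketch of how the number of possible move sequences grows with the number of options per turn. The branching factors and game lengths below are commonly cited ballpark estimates (about 35 moves over 80 turns for chess, about 250 moves over 150 turns for Go), not figures from this post, and the "simplified" position is purely hypothetical.

```python
import math


def log10_tree_size(branching_factor: float, depth: int) -> float:
    """Return log10 of branching_factor ** depth, a rough count of move sequences."""
    return depth * math.log10(branching_factor)


# Ballpark estimates (assumptions for illustration only).
chess = log10_tree_size(35, 80)
go = log10_tree_size(250, 150)

# Hypothetical simplified Go position: same length, but only 50 moves
# per turn are worth considering.
simplified_go = log10_tree_size(50, 150)

for name, exponent in [("chess", chess), ("go", go), ("simplified go", simplified_go)]:
    print(f"{name}: roughly 10^{exponent:.0f} move sequences")
```

Even in this crude model, cutting the number of options per turn shrinks the search space by many orders of magnitude, which is the sense in which a simpler environment is cheaper to navigate.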
Examples:
- Game-playing AIs such as AlphaGo trading expected points for lower variance, by making moves that “throw away” points but simplify the game tree and make it easier to compute.
- Programmers building increasing layers of abstraction that hide the details of the lower levels and let the programmers focus on a minimal number of moving parts.
- People acquiring insurance in order to eliminate unpredictable financial swings, sometimes even when they know that the insurance has lower expected value than not buying it.
- Humans constructing buildings with controlled indoor conditions and a stable “weather”.
- “Better the devil you know”; many people being generally averse to change, even when the changes could quite well be a net benefit; status quo bias.
- Ambiguity intolerance in general being a possible adaptation that helps “implement” this drive in humans.
- Arguably, the homeostasis maintained by e.g. human bodies is a manifestation of this drive, in that having a standard environment inside the body reduces evolution’s search space when looking for beneficial features.
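The first example in the list, trading expected points for lower variance, can be illustrated with a toy calculation. This is only a sketch under stated assumptions: the final score margin after each candidate move is modeled as a normal distribution, the move names and numbers are made up, and the agent is assumed to care about winning rather than about the size of the win.

```python
import math


def win_probability(mean_margin: float, std_margin: float) -> float:
    """P(final margin > 0) when the margin is modeled as Normal(mean, std)."""
    return 0.5 * (1.0 + math.erf(mean_margin / (std_margin * math.sqrt(2.0))))


# Hypothetical candidate moves: (expected margin in points, standard deviation).
moves = {
    "sharp": (8.0, 10.0),   # more points on average, but a messy position
    "simple": (3.0, 2.0),   # gives up points, but the outcome is predictable
}

best_by_margin = max(moves, key=lambda m: moves[m][0])
best_by_win_prob = max(moves, key=lambda m: win_probability(*moves[m]))

for name, (mean, std) in moves.items():
    print(f"{name}: expected margin {mean:+.1f}, win probability {win_probability(mean, std):.3f}")
print("best by expected margin:", best_by_margin)
print("best by win probability:", best_by_win_prob)
```

With these made-up numbers, the move with the lower expected margin has the higher chance of winning, so an agent that only cares about winning prefers the simpler position.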
Hammond, Converse & Grass (1995) previously discussed a similar idea, the “stabilization of environments”, according to which AI systems might be built to “stabilize” their environments so as to make those environments better suited to the systems themselves and easier to reason about. They listed a number of categories:
- Stability of location: “The most common type of stability that arises in everyday activity relates to the location of commonly used objects. Our drinking glasses end up in the same place every time we do dishes. Our socks are always together in a single drawer. Everything has a place and we enforce everything ending up in its place.”
- Stability of schedule: “Eating dinner at the same time every day or having preset meetings that remain stable over time are two examples of this sort of stability. The main advantage of this sort of stability is that it allows for very effective projection in that it provides fixed points that do not have to be reasoned about. In effect, the fixed nature of certain parts of an overall schedule reduces that size of the problem space that has to be searched.”
- Stability of resource availability: “Many standard plans have a consumable resource as a precondition. If the plans are intended to be used frequently, then availability of the resource cannot be assumed unless it is enforced. A good result of this sort of enforcement is when attempts to use a plan that depends on it will usually succeed. The ideal result is when enforcement is effective enough that the question of availability need not even be raised in connection with running the plan.”
- Stability of satisfaction: “Another type of stability that an agent can enforce is that of the goals that he tends to satisfy in conjunction with each other. For example, people living in apartment buildings tend to check their mail on the way into their apartments. Likewise, many people will stop at a grocery store on the way home from work. In general, people develop habits that cluster goals together into compact plans, even if the goals are themselves unrelated.”
- Stability of plan use: “We often find ourselves using familiar plans to satisfy goals even in the face of wideranging possibilities. For example, when one of us travels to conferences, he tends to schedule his flight in to a place as late as he can and plans to leave as late as he can on the last day. This optimizes his time at home and at the conference. It also allows him to plan without knowing anything about the details of the conference schedule. As a result, he has a standard plan that he can run in a wide range of situations without actually planning for them in any detail. It works, because it already deals with the major problems (missing classes at home and important talks at the conference) as part of its structure.”
- Stability of cues: “One effective technique for improving plan performance is to improve the proper activation of a plan rather than improve the plan itself. For example, placing an important paper that needs to be reviewed on his desk before going home improves the likelihood that an agent will see and read it the next day. Marking calendars and leaving notes serves the same sort of purpose.”
As with other goals, there can be exceptions. In particular, adversarial situations may provide an incentive to increase the complexity. For example, Go players who are losing a game will sometimes intentionally complicate the game, in order to make the game harder for their opponent to figure out and in the hopes of getting lucky. And if the AI did think that it had a better ability to deal with the complexity than its competitors did, then it would have an incentive to maintain or even increase the complexity.
Possible implications:
- AIs trying to take over the world may seek to simplify it, though it’s not clear exactly how they would do this. For example, if the world was previously relatively well understood (say, because the AI could tap into existing human research), destructive actions could make it harder to understand, by destroying existing dynamics and creating an environment that hasn’t been studied as much. Similarly, humans often “simplify” their environment by creating additional structure (layers of abstraction, insurance contracts, houses).
- In principle, the AI might also seek to complicate the world, in order to make it harder for its human adversaries to understand its plans. However, the world is already so complicated that it seems hard to make any take-over plan work reliably, even for a superintelligence. That said, the AI could still complicate any given subcomponent of its plans that might be thwarted by an opponent capable of predicting what the AI was doing.
- The need for this drive suggests that tasks of any nontrivial complexity require a large amount of trial and error; even something as simple as a 19×19 Go game requires playing many games and building up an intuition for it, for humans and AIs alike. Increasing the amount of complexity in the world might make it harder for AIs to take over…
- …but humans tend to build additional complexity that makes the world easier to understand and control, so it’s not clear whether the world moves in a direction that’s easier or harder to take over.
- These ideas might suggest the creation of “active honeypots”, defensive measures that considerably increase the complexity of modeling the human response and which the AI might need to eliminate in order to execute a takeover, thus tipping off the humans.
This blog post was written as part of research funded by the Foundational Research Institute.