Incentives at Scale

We don’t exactly know how a superintelligence would act. It’s somewhat like being a chimp a million years ago and wondering how your distant descendants would act. Would you be able to predict skyscrapers? Air travel? The internet?

It might act something like a machine that systematically improves its chances of satisfying its own incentives on the largest possible scale. If an AI wants to enact its goals as fully as possible, it may take extreme, or at the very least currently unthinkable, actions to do so. This is why it’s so important for us, as humans, to make sure that any superintelligence we create has incentives strongly aligned with our own interests.

One example of large-scale incentives at work comes at tax time. Most people want to save money, and therefore want to pay just enough to satisfy the rules. For some, there’s an especially strong incentive due to their disposition, political views, or something else that causes them to spend additional effort on getting around these rules. When the incentives point one way, and the laws point another way, usually the incentives find a way to win. There are many people every year who try to get around the tax code, and despite a plethora of rules, there are still loopholes that people can exploit to their advantage.

If individual humans can outmaneuver the IRS, I’m pretty sure a superintelligence can get what it wants by finding loopholes in essentially any system of rules. It’s a very bad idea to say “your goal is X” and then turn around and try to blacklist certain means of achieving X. Past a certain level of complexity, every conceivable blacklist will have gaping loopholes that any sufficiently intelligent being will be able to exploit almost effortlessly.

One hypothesis I’ve seen is that AI will naturally align with human goals because past a certain level of intelligence, all goals converge to some universal set of moral principles. Even if this were true, who’s to say that the universal set matches human values? Perhaps we’re still so far behind the curve that any true morality is essentially incomprehensible to us. What if any sufficiently advanced morality amounts to nihilism, and AI always terminates itself (or worse, terminates the rest of the world’s conscious beings) as quickly as possible? Even granting some kind of universal goal convergence (which is a heck of a lot to grant, if you ask me), we’re not assured that human values are a part of the future.

It’s sometimes hard to imagine a path for AI to enact its goals in the world, so let’s change the problem a little bit. Say your goal is to end humanity. What would you do? You might work on your social skills so you could convince others to join your cause. You might spend your time designing a superweapon nobody’s ever heard of. You might look for flaws in nuclear plant security systems.

To apply this to AI, we can scale the thought experiment up. Imagine your thoughts were suddenly much faster, maybe a million times faster. How much could you learn about subtly persuading others to help spread your influence? How quickly could you design that superweapon now? What kinds of flaws could you find in current systems, given what amounts to nineteen thousand years of thinking for every week of real time?
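
For anyone who wants to check the nineteen-thousand-year figure, here is a quick back-of-the-envelope sketch. It assumes a flat million-fold speedup and a 365.25-day year; the function name and constants are just illustrative choices, not anything standard.

```python
# Assumption: thinking runs a flat 1,000,000 times faster than normal.
SPEEDUP = 1_000_000
DAYS_PER_WEEK = 7
DAYS_PER_YEAR = 365.25  # average year length, an assumption for this estimate


def subjective_years(real_weeks: float, speedup: float = SPEEDUP) -> float:
    """Years of thinking time that fit into `real_weeks` of real time."""
    subjective_days = real_weeks * DAYS_PER_WEEK * speedup
    return subjective_days / DAYS_PER_YEAR


# One real-time week at a million-fold speedup:
print(round(subjective_years(1)))  # prints 19165
```

So one real week works out to a bit over nineteen thousand subjective years. A different speedup changes the number proportionally, but the shape of the argument stays the same.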

My claim, translated to human terms, is that if you had a strong enough incentive and nineteen thousand undistracted years (with the world on pause the whole time) to think about and enact your plans, you could achieve pretty much any goal.