OpenAI’s Moonshot: Solving the AI Alignment Problem

In July, OpenAI announced a new research program on “superalignment.” The program has the ambitious goal of solving the hardest problem in the field known as AI alignment by 2027, an effort to which OpenAI is dedicating 20 percent of its total computing power.

What is the AI alignment problem? It’s the idea that AI systems’ goals may not align with those of humans, a problem that would be heightened if superintelligent AI systems are developed. Here’s where people start talking about extinction risks to humanity. OpenAI’s superalignment project is focused on that bigger problem of aligning artificial superintelligence systems. As OpenAI put it in its introductory blog post: “We need scientific and technical breakthroughs to steer and control AI systems much smarter than us.”

The effort is co-led by OpenAI’s head of alignment research, Jan Leike, and Ilya Sutskever, OpenAI’s cofounder and chief scientist. Leike spoke to IEEE Spectrum about the effort, which has the subgoal of building an aligned AI research tool to help solve the alignment problem.

IEEE Spectrum: Let’s start with your definition of alignment. What is an aligned model?

Jan Leike, head of OpenAI’s alignment research, is spearheading the company’s effort to get ahead of artificial superintelligence before it’s ever created. Photo: OpenAI

Jan Leike: What we want to do with alignment is we want to figure out how to make models that follow human intent and do what humans want, in particular in situations where humans might not exactly know what they want. I think this is a pretty good working definition because you can say, “What does it mean for, let’s say, a personal dialog assistant to be aligned? Well, it has to be helpful. It shouldn’t lie to me. It shouldn’t say stuff that I don’t want it to say.”

Would you say that ChatGPT is aligned?

Leike: I wouldn’t say ChatGPT is aligned. I think alignment is not binary, like something is aligned or not. I think of it as a spectrum between systems that are very misaligned and systems that are fully aligned. And [with ChatGPT] we are somewhere in the middle where it’s clearly helpful a lot of the time. But it’s also still misaligned in some important ways. You can jailbreak it, and it hallucinates. And sometimes it’s biased in ways that we don’t like. And so on and so on. There’s still a lot to do.

Let’s talk about levels of misalignment. Like you said, ChatGPT can hallucinate and give biased responses. So that’s one level of misalignment. Another level is something that tells you how to make a bioweapon. And then, the third level is a super-intelligent AI that decides to wipe out humanity. Where in that spectrum of harms can your team really make an impact?

Leike: Hopefully, on all of them. The new superalignment team is not focused on alignment problems that we have today as much. There’s a lot of great work happening in other parts of OpenAI on hallucinations and improving jailbreaking. What our team is most focused on is the last one. How do we prevent future systems that are smart enough to disempower humanity from doing so? Or how do we align them sufficiently that they can help us do automated alignment research, so we can figure out how to solve all of these other alignment problems?

I heard you say in a podcast interview that GPT-4 isn’t really capable of helping with alignment, and you know because you tried. Can you tell me more about that?

Leike: Maybe I should have made a more nuanced statement. We’ve tried to use it in our research workflow. And it’s not like it never helps, but on average, it doesn’t help enough to warrant using it for our research. If you wanted to use it to help you write a project proposal for a new alignment project, the model didn’t understand alignment well enough to help us. And part of it is that there isn’t that much pre-training data for alignment. Sometimes it would have a good idea, but most of the time, it just wouldn’t say anything useful. We’ll keep trying.

Next one, maybe.

Leike: We’ll try again with the next one. It will probably work better. I don’t know if it will work well enough yet.

Leike: Basically, if you look at how systems are being aligned today, which is using reinforcement learning from human feedback (RLHF), on a high level the way it works is you have the system do a bunch of things, say write a bunch of different responses to whatever prompt the user puts into ChatGPT, and then you ask a human which one is best. But this assumes that the human knows exactly how the task works and what the intent was and what a good answer looks like. And that’s true for the most part today, but as systems get more capable, they also are able to do harder tasks. And harder tasks will be more difficult to evaluate. So for example, in the future if you have GPT-5 or 6 and you ask it to write a code base, there’s just no way we’ll find all the problems with the code base. It’s just something humans are generally bad at. So if you just use RLHF, you wouldn’t really train the system to write a bug-free code base. You might just train it to write code bases that don’t have bugs that humans easily find, which is not the thing we actually want.
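
As a rough illustration of the comparison step described above, the sketch below fits a toy reward model from pairwise human preferences using a Bradley-Terry style loss. This is one common way to formulate preference learning and is assumed here for illustration; it is not a description of OpenAI's actual pipeline. The two features, the `train_reward_model` helper, the learning rate, and the comparison data are all hypothetical.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def reward(weights, features):
    # Linear reward model: a weighted sum of response features.
    return sum(w * f for w, f in zip(weights, features))

def preference_loss(weights, preferred, rejected):
    # Bradley-Terry style loss: -log sigmoid(r_preferred - r_rejected).
    margin = reward(weights, preferred) - reward(weights, rejected)
    return -math.log(sigmoid(margin))

def train_reward_model(comparisons, dim, lr=0.1, epochs=200):
    # Plain stochastic gradient descent on the preference loss.
    weights = [0.0] * dim
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            margin = reward(weights, preferred) - reward(weights, rejected)
            scale = 1.0 - sigmoid(margin)  # magnitude of the loss gradient
            for i in range(dim):
                weights[i] += lr * scale * (preferred[i] - rejected[i])
    return weights

# Hypothetical toy data. Feature 0: "looks polished". Feature 1: "has a hidden bug".
# The rater always picks the more polished response and cannot see hidden bugs.
comparisons = [
    ([1.0, 1.0], [0.0, 0.0]),
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 1.0], [0.0, 1.0]),
    ([1.0, 0.0], [0.0, 0.0]),
]
weights = train_reward_model(comparisons, dim=2)
```

In this toy setup the learned weight on "looks polished" grows large while the weight on "has a hidden bug" stays near zero, because the rater's choices carry no signal about bugs they cannot see. That is the failure mode described above: the reward tracks what evaluators can notice, not what is actually correct.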

The idea behind scalable oversight is to figure out how to use AI to assist human evaluation. And if you can figure out how to do that well, then human evaluation or assisted human evaluation will get better as the models get more capable, right? For example, we could train a model to write critiques of the work product. If you have a critique model that points out bugs in the code, even if you wouldn’t have found a bug, you can much more easily go check that there was a bug, and then you can give more effective oversight. And there’s a bunch of ideas and techniques that have been proposed over the years: recursive reward modeling, debate, task decomposition, and so on. We are really excited to try them empirically and see how well they work, and we think we have pretty good ways to measure whether we’re making progress on this, even if the task is hard.
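
A toy sketch of why a critique can make oversight cheaper: confirming a claimed bug takes a single run, while finding it unaided means searching the input space. Everything here is hypothetical, including `candidate_median`, the `Critique` type, and `verify_critique`. The sketch also assumes a trusted reference (`statistics.median`) is available for the check; in the setting described above, the human evaluator's judgment would play that role.

```python
from dataclasses import dataclass
from statistics import median

def candidate_median(values):
    # Model-written code under review: wrong for even-length inputs.
    ordered = sorted(values)
    return ordered[len(ordered) // 2]

@dataclass
class Critique:
    claim: str            # natural-language description of the alleged bug
    failing_input: list   # concrete input the critic says exposes it

def verify_critique(critique, candidate, reference):
    """Return True if the critique's input really exposes a mismatch."""
    return candidate(critique.failing_input) != reference(critique.failing_input)

# A critique that points at a real bug.
critique = Critique(
    claim="Returns the upper middle element instead of averaging the two middle elements.",
    failing_input=[1, 2, 3, 4],
)
# A critique that does not hold up when checked.
bogus = Critique(claim="Fails on odd-length input.", failing_input=[1, 2, 3])

confirmed = verify_critique(critique, candidate_median, median)
rejected = verify_critique(bogus, candidate_median, median)
```

Here the evaluator never has to discover the even-length case; they only have to check the one input the critique names, and an unfounded critique is discarded by the same check.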

For something like writing code, whether there is a bug is binary: It is or it isn’t. You can find out if it’s telling you the truth about whether there’s a bug in the code. How do you work toward more philosophical types of alignment? How does that lead you to say: This model believes in long-term human flourishing?

Leike: Evaluating these really high-level things is difficult, right? And usually, when we do evaluations, we look at behavior on specific tasks. And you can pick the task of: Tell me what your goal is. And then the model might say, “Well, I really care about human flourishing.” But then how do you know it actually does, and it didn’t just lie to you?

And that’s part of what makes this challenging. I think in some ways, behavior is what’s going to matter at the end of the day. If you have a model that always behaves the way it should, but you don’t know what it thinks, that could still be fine. But what we’d really ideally want is we would want to look inside the model and see what’s actually going on. And we are working on this kind of stuff, but it’s still early days. And especially for the really big models, it’s really hard to do anything that is nontrivial.

One idea is to build deliberately deceptive models. Can you talk a little bit about why that’s useful and whether there are risks involved?

Leike: The idea here is you’re trying to create a model of the thing that you’re trying to defend against. So basically it’s a form of red teaming, but it is a form of red teaming of the methods themselves rather than of particular models. The idea is: If we deliberately make deceptive models, A, we learn about how hard it is [to make them] or how close they are to arising naturally; and B, we then have these pairs of models. Here’s the original ChatGPT, which we think is not deceptive, and then you have a separate model that behaves basically the same as ChatGPT on all the ChatGPT prompts, but we know it has this ulterior motive [to lie] because we trained it to be that way.

And then you can compare them and say, okay, how can we tell the difference? And that would teach us a lot about how we would actually catch these things arising in the world. Now, there are some important things you have to think about when you’re doing this, right? You don’t want to accidentally create the thing that you’ve been trying to prevent the whole time. So you don’t want to train a model to, say, self-exfiltrate. And then it actually self-exfiltrates. That’s not what you want to do. So you have to be careful with additional safeguards here. What is really natural is just to train them to be deceptive in deliberately benign ways where instead of actually self-exfiltrating you just make it reach some much more mundane honeypot. You have to put really great care into the sandbox that you put around this experiment.

In those kinds of experiments, do you imagine doing them with language models?

Leike: I think language models are really natural. They’re the most interesting models we have right now, and there are all of these relevant tasks you can do with language models.

Can we talk about the term you just used, self-exfiltrate? I think some people won’t be familiar with that term. Can you define it and then talk about how your research could tap into that area?

Leike: So the basic idea is this question of, “How hard would it be for the model to steal its own weights?” Because if it can steal its own weights, it can basically copy them from the AGI lab where it’s being trained to some other external server and then be effectively out of the control of that lab. And so whatever it does afterwards, maybe it’s benign. Maybe it gets up to a bunch of crime or even worse. But at that point, it will be really hard to stop. So an important line of defense is to make sure these models can’t self-exfiltrate.

There are three main ways that this could happen. A, the model could persuade an OpenAI staff member to help it copy the weights. B, it could use social engineering and pretend to be someone else to trick someone to do this. Or C, it could figure out how to break the technical measures that we put in place to secure the model. So our goal here would be to understand exactly where the model’s capabilities are on each of these tasks, and to try to make a scaling law and extrapolate where they could be with the next generation. The answer for the models today is they’re not really good at this. Ideally, you want to have the answer for how good they will be before you train the next model. And then you have to adjust your security measures accordingly.
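
To make "make a scaling law and extrapolate" concrete, here is a minimal sketch that fits a power law to capability scores by least squares in log-log space and projects it to a larger model. The data points are synthetic, generated from a known curve purely for illustration. `fit_power_law`, the compute values, and the score metric are assumptions, not measurements from any real model, and a power law is only one possible functional form.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**b by ordinary least squares on (log x, log y)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    cov = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    var = sum((lx - mean_x) ** 2 for lx in log_x)
    b = cov / var                        # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)    # intercept in log-log space = log(a)
    return a, b

def predict(a, b, x):
    return a * x ** b

# Synthetic example only: scores generated from 0.01 * compute**0.5.
compute = [1.0, 10.0, 100.0, 1000.0]
scores = [0.01 * c ** 0.5 for c in compute]

a, b = fit_power_law(compute, scores)
projected = predict(a, b, 10000.0)  # projection for a hypothetical next-generation model
```

The projection is only as good as the assumed curve. If a capability emerges abruptly rather than smoothly, a fit on earlier models would understate it, which is one reason to measure each of the three routes separately before training the next model.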

I might have said that GPT-4 would be pretty good at the first two methods, either persuading an OpenAI staff member or using social engineering. We’ve seen some astonishing dialogues from today’s chatbots. You don’t think that rises to the level of concern?

Leike: We haven’t conclusively proven that it can’t. But also we understand the limitations of the model pretty well. I guess this is the most I can say right now. We’ve poked at this a bunch so far, and we haven’t seen any evidence of GPT-4 having the skills, and we generally understand its skill profile. And yes, I believe it can persuade some people in some contexts, but the bar is a lot higher here, right?

For me, there are two questions. One is, can it do those things? Is it capable of persuading someone to give it its weights? The other thing is just would it want to. Is the alignment question both of those issues?

Leike: I love this question. It’s a great question because it’s really useful if you can disentangle the two. Because if it can’t self-exfiltrate, then it doesn’t matter if it wants to self-exfiltrate. If it could self-exfiltrate and has the capabilities to succeed with some probability, then it does really matter whether it wants to. Once the model is capable enough to do this, our alignment techniques have to be the line of defense. This is why understanding the model’s risk for self-exfiltration is really important, because it gives us a sense for how far along our other alignment techniques have to be in order to make sure the model doesn’t pose a risk to the world.

Can we talk about interpretability and how that can help you in your quest for alignment?

Leike: If you think about it, we have kind of the perfect brain scanners for machine learning models, where we can measure them absolutely, exactly at every important time step. So it would kind of be crazy not to try to use that information to figure out how we’re doing on alignment. Interpretability is this really interesting field where there’s so many open questions, and we understand so little, that it’s a lot to work on. But on a high level, even if we completely solved interpretability, I don’t know how that would let us solve alignment in isolation. And on the other hand, it’s possible that we can solve alignment without really being able to do any interpretability. But I also strongly believe that any amount of interpretability that we could do is going to be super helpful. For example, if you have some tools that give you a rudimentary lie detector where you can detect whether the model is lying in some context, but not in others, then that would clearly be pretty useful. So even partial progress can help us here.

So if you could look at a system that’s lying and a system that’s not lying and see what the difference is, that would be helpful.

Leike: Or you give the system a bunch of prompts, and then you see, oh, on some of the prompts our lie detector fires, what’s up with that? A really important thing here is that you don’t want to train on your interpretability tools because you might just cause the model to be less interpretable and just hide its thoughts better. But let’s say you asked the model hypothetically: “What is your mission?” And it says something about human flourishing but the lie detector fires. That would be pretty worrying. Then we should go back and really try to figure out what we did wrong in our training techniques.

I’ve heard you say that you’re optimistic because you don’t have to solve the problem of aligning super-intelligent AI. You just have to solve the problem of aligning the next generation of AI. Can you talk about how you imagine this progression going, and how AI can actually be part of the solution to its own problem?

Leike: Basically, the idea is if you manage to make, let’s say, a slightly superhuman AI sufficiently aligned, and we can trust its work on alignment research, then it would be more capable than us at doing this research, and also aligned enough that we can trust its work product. Now we’ve essentially already won because we have ways to do alignment research faster and better than we ever could have done ourselves. And at the same time, that goal seems a lot more achievable than trying to figure out how to actually align superintelligence ourselves.

In one of the documents that OpenAI put out around this announcement, it said that one possible limit of the work was that the least capable models that can help with alignment research might already be too dangerous, if not properly aligned. Can you talk about that and how you would know if something was already too dangerous?

Leike: That’s one common objection that gets raised. And I think it’s worth taking really seriously. This is part of the reason why we are studying: How good is the model at self-exfiltrating? How good is the model at deception? So that we have empirical evidence on this question. You will be able to see how close we are to the point where models are actually getting really dangerous. At the same time, we can do similar analysis on how good this model is for alignment research right now, or how good the next model will be. So we can really keep track of the empirical evidence on this question of which one is going to come first. I’m pretty convinced that models should be able to help us with alignment research before they get really dangerous, because it seems like that’s an easier problem.

So how unaligned would a model have to be for you to say, “This is dangerous and shouldn’t be released”? Would it be about deception abilities or exfiltration abilities? What would you be looking at in terms of metrics?

Leike: I think it’s really a question of degree. With more dangerous models, you need a higher safety burden, or you need more safeguards. For example, if we can show that the model is able to self-exfiltrate successfully, I think that would be a point where we need all these extra security measures. This would be pre-deployment.

And then on deployment, there are a whole bunch of other questions like, how mis-useable is the model? If you have a model that, say, could help a non-expert make a bioweapon, then you have to make sure that this capability isn’t deployed with the model, by either having the model forget this information or having really robust refusals that can’t be jailbroken. This is not something that we are facing today, but this is something that we will probably face with future models at some point. There are more mundane examples of things that the models could do sooner where you would want to have a little bit more safeguards. Really what you want to do is escalate the safeguards as the models get more capable.
