The ongoing effort to develop AI alignment seems to be misunderstanding the nature of the problem. I say "seems" because I am not well enough informed about it to say so with much definitiveness. Perhaps those I read on the topic already grasp what I'm about to explore and I'm simply missing it.
It is fairly easy to get humans—and not just small humans—to make mistakes like this:
Me: Spell “coast”.
Human: C-O-A-S-T
Me: Right, and now spell “most”.
Human: M-O-S-T
Me: Right again, and what do you put in a toaster?
Human: Toast
It would surprise me if an AI fell for that. My one attempt at it using ChatGPT confirmed as much.
AIs would undoubtedly have seen the riddle joke in lots of training. They see through it the way most people do once they’ve been exposed to it often enough or recently enough—or if they are just cynical enough to always be looking for the angle.
Perhaps journalists never fall for it, which is both good and a shame. There is something human about it.
Which gets me to the point of this post.
We like hard rules. We clamor for bright lines. We strive for clean, clear delineations . . . in theory.
In practice we take those and allow for various levels of fuzziness. Zoom in closely enough and every sharp, straight, black line is blurry, curvy, and gray.
Tolerances for deviation vary by situation and by decision-maker. And they should. And they are never perfect. And they can’t be. Just like the idealized rule cannot survive unblemished in the wild.
When we move from the sandbox to production, to use programming terms, we step into the ring with Mike Tyson.
As anyone who has done any coding can tell you, the execution test is brutally revealing of any flaws your work may have.
There are two aspects to this.
The first is simply that a program that works works. One that fails doesn’t. PERIOD. Well, kinda period. More on that in the second aspect.
Sticking with the first, let me explain what you already know. If you correctly write code to do something, that code will work . . . 100% of the time it works every time (within the narrow constraints of the code itself).
I remember in my first programming class in middle school having written a program in BASIC that was to perform some simple task. This was for a test grade. The teacher offered me a bargain of sorts: I could test it once first, which would let me catch an error, but that test would cost me 11 points, meaning the best I could make was a B assuming it then worked perfectly. Alternatively, I could just run it if I was confident in my coding.
I was. I shouldn’t have been. It failed immediately due to a silly typo. My code was all but perfect. Aside from a lesson in hubris and risk/return, I also saw how dumb computers were, since any human obviously would not have been tripped up by the typo.
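To make that concrete, here is a hypothetical sketch in Python (not my long-lost BASIC, and not the original program, just an invented stand-in): one misspelled variable name that any human reader would gloss right over stops everything cold.

```python
# A trivial script meant to average a few test scores.
scores = [88, 92, 79]

total = 0
for score in scores:
    total += socre  # NameError: 'socre' is not defined -- one silly typo, total failure

print("Average:", total / len(scores))
```

A teacher grading the listing by hand would read "socre" as "score" without blinking; the interpreter will not.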
This is the harsh reality of computer code. There are many ways to skin a cat, and the differences among those that technically work are only efficiency—the cost of resources. But for those that don’t work, well, 100% of the time they fail every time.
The second aspect is where computer technology has become fuzzy. I’m referring to big systems and programs but not AI. This is when a system works until it doesn’t and then works again, or when it sometimes gets tripped up and sometimes doesn’t while doing what appears to be the same thing.
Make no mistake, this is when technology is the most frustrating. “I swear this worked yesterday. Is the computer gaslighting me? Is some hacker terrorizing me? It couldn’t possibly be an error in the seat-to-keyboard interface.” Oftentimes it actually isn’t our mistake. Computer systems now work in mysterious ways.
We’ve learned to live with some degree of inconsistency. We tolerate the mystery of how sometimes it breaks but usually it works.
It would seem as if we’ve crossed the Rubicon to where the complexity has made computers humanlike. Perhaps AGI arrived sometime back with Windows 95?
Alas, no. This is just the result of bugs not yet discovered, code that only appeared to cover all contingencies, contradictory interference from a bolt-on added well after the initial (pristine?) code was written, or some interactive combination of these.
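As a purely hypothetical illustration (mine, not drawn from any particular system), here is what the "only appeared to cover all contingencies" flavor can look like in Python: code that works every time it gets the usual input and fails only on the rare case nobody thought to handle.

```python
def average_response_time(samples):
    # Looks complete: add up the samples and divide by the count.
    return sum(samples) / len(samples)

# Works flawlessly every day the monitoring feed has data...
print(average_response_time([120, 95, 143]))

# ...and raises ZeroDivisionError on the one quiet night it has none.
print(average_response_time([]))
```

Nothing mysterious is happening, but from the user's chair it looks exactly like "I swear this worked yesterday."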
The computer isn’t trying to harm you. It isn’t trying to do anything beyond the mundane “trying” to do what it was designed to do combined with what you are truly trying to make it do.
AI efforts promise (threaten) to actually cross the boundary over to . . . something. Something not simply a machine following instructions.
To be sure, AI needs alignment even if it is never going to become AGI, much less sentient. Even if it only ever appears conscious, even if it is stuck forever at the current level, we would still need some version of alignment.
Alignment doesn’t usually refer to the mundane idea of “do the job I’m directing/asking/seeking you to do in a desirable manner, which might only mean ‘do it correctly,’” but it certainly does include this.
In this sense my BA II Plus calculator needs to be aligned with financial formulas as well as basic math. The copy machine better do its job as requested, or we’re taking a one-way trip to a field.
Turns out we’ve been dealing with alignment all along. It is just that now it seems so crucial because it might threaten our ability to thrive, or even to survive.
Well, again, not so fast. Any trip to a modern hospital will show you many examples of computer technology that, if misaligned, will result in harm or even death.
We are advancing along an unsteady continuum with occasional surges, and we have been on this journey for a long, long time. This recent surge is the AI phase, which started some time back but became apparent in the last few years. It is the next twist—and a big one—on alignment needs.
But it is also a big step to move past the realm of calculation input/output and resource fetching (search) into “just give me the damn answer, including doing all the work for me.”
Among the things the Internet offered to replace was the library—a collection of knowledge from experts and others.
Search aimed to replace the librarian. The librarian brought the resources to you for you to do the work. She made an indirect introduction to experts. Generally the experts were passive—they were usually books. You then interrogated the information and did something with it. Search from Yahoo! to Google was just doing the librarian’s job online.
There was no certainty that the exact answer was provided in that pile of books or list of links in the search results. Nor was there any real protection from contradictory information. It was helpful, but mileage varied.
AI offers to replace both the library and the librarian as well as the research assistant and other agents and, well, you. It is the expert for all intents and purposes as well as the one who is going to do the work. In the extreme it is going to give you what you didn’t even know you were asking for.
Sometimes we want something directly like a specific answer or analysis. Sometimes we want to be surprised/entertained/informed, and we rely on some version of discovery for it. A friend recommends a restaurant or a book. A push notification gives you a headline that raises your blood pressure as you click the link. You channel surf the TV or have YouTube’s or TikTok’s algorithm guide you along a journey.
Whether I’m asking AI how to bake a cake or wanting it to explain quantum mechanics to me like I’m a ten-year-old, I need it to have a degree of alignment. If I want it to carry on a conversation with me, that alignment needs to be more robust and a little fuzzier. If I want it to do things for me without my asking, still more robust and still fuzzier alignment is needed.
What do I mean by “fuzzy”? I mean it needs to improvise appropriately. It needs the freedom to experiment along with guardrails so that the experimentation won’t be ruinous. Or at least to stop short of ruin by detecting early enough that something is going wrong.
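Purely as a toy sketch of that idea (my own framing in Python, not any actual alignment technique): explicit rules where they exist, cautious improvisation where they don't, and a hard stop that overrides everything else.

```python
def respond_to(situation, rules, hard_stops):
    """Toy decision sketch: hard stops first, explicit rules second,
    cautious improvisation for everything else."""
    if situation in hard_stops:
        return "refuse and escalate: the risk of ruin overrides everything"
    if situation in rules:
        return f"follow the explicit rule: {rules[situation]}"
    # The fuzzy zone: no rule applies, so improvise in the spirit of the
    # rules while watching for early signs that something is going wrong.
    return "improvise cautiously, checking as you go"

rules = {"routine request": "do it as asked", "known edge case": "use the agreed workaround"}
hard_stops = {"irreversible destructive action"}

for s in ["routine request", "something never seen before", "irreversible destructive action"]:
    print(s, "->", respond_to(s, rules, hard_stops))
```

The ordering is the point: the safety check comes before rule-following, and improvisation is the fallback rather than the default.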
Among the prescribed duties of my housekeeper are vacuuming floors and dusting shelves. If she comes across an unusual mess within her typical cleaning domain like, say, spilled milk dripping from the refrigerator, I expect her to clean it up. If she were to discover I left the oven on, she should turn it off. If she finds my child choking, I expect her to help even though childcare duties are not at all part of our arrangement.
I do not expect her to dust the attic, much less vacuum up the insulation. She shouldn’t order me a new oven if mine is broken. She might help my child with their Spanish homework if asked or if the need is simple and obvious. She wouldn’t be expected to do this, just as she wouldn’t and shouldn’t enroll them in a new school that has a better language arts program.
In the course of her duties over the years, she is bound to miss an area of need (ahem, dusting on top of the refrigerator), overclean by mistake (that grass in the OU shot glass was from the 2001 Orange Bowl!),1 and break a few things like dishes. She cannot be mistake-free, and I cannot expect her to be.
Likewise, AI cannot be mistake-free. It must make mistakes. We just need it to make the right kind of mistakes—which reminds me of Mike Munger’s “right kind of nothing” as described in this EconTalk episode (7:35-21:00).
To get there it needs to follow not just the hard(ish) rules we can and should outline. It also needs to interpret new, somehow unique situations so that it conforms to the spirit of the law when the letter does not strictly or obviously apply.
We are going to need common sense, norms, and a code. Not code as the term has been used previously in this post, but rather a code (or codes) that acts like THE CODE. Successful AI with proper alignment is going to look a lot like sports.
Fighting in hockey and intentionally hitting a batter in baseball are features rather than bugs—and not in the way that wrecks in car racing are features. Sure, some people come for the hockey fights just like race fans might enjoy a good crash (hopefully only one that doesn’t harm anyone). But fighting in hockey is primarily there for one important reason—safety.
Intentionally hitting batters in retribution for dangerous or unseemly play is part of a cost-effective means of self-regulation, and it is still evolving. It is just one small version of what is called THE CODE in baseball.
Again, the always insightful Mike Munger is here to explain.
These norms and processes exist everywhere in human society. They serve to smooth out the inevitable rough edges, if not simply to better perform the task of bringing orderly progress to a world dominated by spontaneous (unplanned and unplannable) order.
To be aligned, AI needs to be more humanlike, and that probably includes hallucinations and mistakes even in cases like the toast example above. Sometimes it will look dumb, but that is the cost of it sometimes being usefully brilliant.
Don’t worry. She didn’t actually discard my memento.