[Photo at top left: Toronto's Eaton Center]

Watson and the Turing Test

by Philip Greenspun, February 2011

IBM's Watson natural-language processing system trounced two Jeopardy champions in a televised apparent humiliation of the human race. Does this mean that the dawn of artificial intelligence has arrived?

Defining an Artificial Intelligence: The Turing Test

Watson, a collection of medium-sized IBM servers, is definitely artificial. Is it "intelligent"? The generally accepted test was proposed in 1950 by Alan Turing ("Computing Machinery and Intelligence"; Mind) and is called "The Turing Test":
    I propose to consider the question, "Can machines think?"  This should
    begin with definitions of the meaning of the terms "machine" and
    "think."  ... Instead of attempting such a definition I shall replace
    the question by another, which is closely related to it and is
    expressed in relatively unambiguous words.

    The new form of the problem can be described in terms of a game which
    we call the "imitation game."  It is played with three people, a man (A), 
    a woman (B), and an interrogator (C) who may be of either sex.
    The interrogator stays in a room apart from the other two.  The object
    of the game for the interrogator is to determine which of the other
    two is the man and which is the woman.

    ... It is A's object in the game to try and cause C to make the wrong
    identification.  ... The object of the game for the third player (B)
    is to help the interrogator.  The best strategy for her is probably to
    give truthful answers.

    ... We now ask the question "What will happen when a machine takes the
    part of A in this game?"
On January 13, 1994, Ellen Spertus and I did something that, as far as we know, had never been done: conducted the "male-female Turing test" that Alan Turing initially proposed.

Here's how we advertised it to fellow graduate students at the MIT Artificial Intelligence Laboratory:

All four teams will gather in Rm. 518, home of the big-screen TV.  On
screen we will have the YTALK program showing a three-way conversation
between INTERROGATOR (a Sun in 518), X (a Sun in an office), and Y (a
Sun in another office).

Each 5-minute round will pit three pre-selected teams against each
other.  We will call the teams Question Team, X Team, and Y Team.  I,
as Lord High Commissioner, flip a coin.  If heads, X and Y will both
try to be convincing men.  If tails, X and Y will both try to be
convincing women.  I secretly flip another coin.  If heads, X will be
the man and the X Team is privately informed that it must choose a man
to occupy the X terminal and the Y Team is privately asked to choose a
woman to occupy the Y terminal.  Once the round starts, responders
will be alone with their terminals, however, and cannot get help from
the rest of the response team.  All the members of the Question team
can contribute questions and those questions can be adjusted on the
fly.  After 4.5 minutes, typing is cut off and the Question Team has
30 seconds to debate the sexes of X and Y.  Then the Question Team
must guess the sex of X ("sexless nose-picking polyester-clad geek",
"girlie-man", "baby-man", "mega-babe", and "schwing" will not be
accepted as answers).

The Question Team scores 1 for guessing correctly, -1 for guessing
wrong.  Whichever response team (X or Y) supplies the "genuine" man or
woman gets the same score as the Question Team (this encourages the
genuine responder to try hard to help the Question Team).  Whichever
response team supplies the "sham" man or woman gets the opposite score
from the Question Team, so they are rewarded for successfully
deceiving and penalized for being found out.

There are four ways for four teams to play each other three at a time.
There will be twelve rounds.  Each team gets to play nine times, three
times in each of the three roles.

Miscellaneous Rule: The process of grubbing for tenure is incompatible
with an interesting sex life.  Consequently, to avoid confusing and
embarrassing faculty members who may be present, please avoid asking
questions that would be at home on "alt.sex.bestiality.bark.bark.bark"
(or any of its superclasses).  Any sentence that includes the words
"Pony" or "Mazola", for example, would fall into this category.
How did it go? Here's the follow-up email:
Here are the final scores from the Male/Female Turing Test

The Psychic Fiends Network	5
Golems				3
Sea Lions			-1
Tornados			-3

In 12 rounds, the Question Team was wrong 4 times (wow!), three times
thinking that our macho AI Lab He-men were women (can you imagine?!?).

Best question:  "What did you do with little green army men?" (Phillip
Alvelda of the Golems)

Best woman as man:  Pearl Tsai (thank you Pearl for romancing the 518
projection TV)

Best man as woman: Joe Media Lab [pseudonym for this article] (Joe,
too bad you were trying to be a man -- maybe you should ask Anita to
explain some sports to you)

IDEAS FOR NEXT YEAR

Questions about menstruation are out.  Questions about body size or
clothing are out.  Questions involving Tech Square are out.  This is
mostly because some of these result in easy wins for the guessing
teams and make the game boring, e.g., "How many stalls in the Tech
Square women's bathroom?"
One round was decided by a single question. Phillip Alvelda, as Interrogator, asked "What did you do with little green army men?" The woman pretending to be a man said "Built a fort to protect them"; the actual man said "Burned them!!!"

[Where are these people now? Ellen Spertus didn't listen to all of the distinguished tenured computer scientists who said that her idea (feeding information about which Web pages link to which other pages into a database and then analyzing the links to see if they could be used to help users, e.g., find someone's home page) was stupid. She teaches computer science at Mills College and does some work on a system similar to her Ph.D. research (the newer system is called "Google"). Phillip Alvelda went on to develop micro-sized LCD displays and streaming video systems for mobile phones, now used by Sprint. A Google search reveals that Pearl Tsai worked at Google, got an MBA from Stanford, then an architecture degree from Cal Arts, and is now an architect. Joe Media Lab [not his real name] went straight from the world's most innovative laboratory into Google oblivion (but I found him at his old email address and he asked me to change his name!). Anita Flynn, a pioneer in microrobotics, has her own company.]

Structure and Interpretation of Other Computer Intelligence Successes

[Photo: ibm-1620-with-flash]

The basic structure of a computer intelligence demonstration is the following: pick a task that only the smartest humans seem able to do, write a program that does it, and then conclude that the task wasn't a mark of intelligence after all. In the early 1960s, for example, it was decided that algebra and calculus were things that only the smartest humans could do. Wouldn't it be amazing if a computer program could do calculus? In short order, however, most notably with Macsyma (1968), computer programs could solve calculus problems far beyond what an A student in a college calculus class could handle. Which meant... that calculus really wasn't that hard.
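What Macsyma did is now routine. Here, for instance, is the same kind of symbolic integration using SymPy, a present-day open-source computer algebra library (SymPy is my stand-in here; it is not Macsyma, just the same idea):

    import sympy as sp

    x = sp.symbols('x')

    # An integral requiring integration by parts twice -- tedious on paper,
    # instantaneous for a computer algebra system.
    print(sp.integrate(sp.exp(x) * sp.cos(x), x))
    # -> exp(x)*sin(x)/2 + exp(x)*cos(x)/2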

Chess, a game in which people with high IQs tend to do better than people with low IQs, was also an early project for artificial intelligence programmers. By the 1970s, computer programs could beat 99 percent of humans, whereupon the philosophers concluded that a computer couldn't be intelligent unless it could beat the world's best chess-playing human. When Deep Blue did beat Garry Kasparov, the world's best player, in 1997, chess was officially declared uninteresting.

Jeopardy

At first glance, Jeopardy would appear to be very nearly the kind of test that Alan Turing was talking about. An interrogator asks a series of questions and the computer provides answers alongside humans. Watching the game on television is a sure way to feel good about oneself. According to Ken Jennings's Web site, the contestants' buzzers don't become active until the question has been completely read out loud. By this time, a viewer at home has had time to read the entire question (people read 3X faster than they listen) and may well know the answer. So viewers get the illusion that, in many cases, they are smarter than the super-smart contestants.

Jeopardy departs from a regular Turing test in that the questions require a lot of factual knowledge that goes well beyond what a person would acquire from day-to-day experience. Nobody in our male-female Turing test asked questions that required knowing the capital of Kentucky. Jeopardy also departs in that the answers have a very simple structure, essentially just one word or phrase.

Watson is Intelligent

The people who do well on Jeopardy are much smarter than the average person. Watson did way better than the best-ever human Jeopardy competitors. Ergo, Watson is smarter than all humans.

[I have only met one person who was a successful Jeopardy contestant. He is indeed a smart guy and, when not competing on Jeopardy, earned his living sitting at home with his dog writing speeches for American university presidents (many of whom, as it turns out, are nearly illiterate).]

Watson is Not Intelligent

Watson could not do what a 3-year-old human of average intelligence can do: convert a continuous stream of audio into distinct words ("speech to text"). Watson was fed the questions in an instant-message-like form. A computer that can't do what a 3-year-old does is not intelligent.

Watson won Jeopardy with mostly statistical processing plus being superhumanly fast on the buzzer. We expect computers to be good at crunching randomly through big data sets and also to be quick. The microcontroller in a toaster oven is quick, able to scan the front panel buttons hundreds of times per second, but we don't call it intelligent. As far as statistical association goes, noticing that "New York City" and "Big Apple" turn up near each other in a lot of sentences does not seem like the kind of thing that humans do when we say they are acting intelligently.

Finally, the task was sort of trivial. Watson did not have to construct open-ended replies to open-ended questions. Watson simply pulled a word or phrase out of a database and stuck "What is" or "Who is" in front.
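To make the statistical-association-plus-prefix point concrete, here is a toy sketch (my own invented three-sentence corpus and candidate list; Watson's actual pipeline was vastly more elaborate): count which candidate phrase turns up most often in sentences containing the clue's key phrase, then bolt "What is" onto the front.

    from collections import Counter

    # A toy corpus standing in for the terabytes of text Watson ingested.
    corpus = [
        "New York City is often called the Big Apple.",
        "Tourists love the Big Apple, as New Yorkers call New York City.",
        "Chicago is known as the Windy City, not the Big Apple.",
    ]

    def answer(clue_phrase, candidates):
        """Return the candidate that co-occurs most often with the clue
        phrase, dressed up as a Jeopardy-style response."""
        counts = Counter()
        for sentence in corpus:
            if clue_phrase.lower() in sentence.lower():
                for candidate in candidates:
                    if candidate.lower() in sentence.lower():
                        counts[candidate] += 1
        best, _ = counts.most_common(1)[0]
        return "What is " + best + "?"

    print(answer("Big Apple", ["New York City", "Chicago"]))
    # -> What is New York City?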

My personal view

I think that Watson points in a useful direction for A.I. We like to think that we're not pulling together random statistical associations, but how come we forget to buy X until we visit a store and see something that reminds us of X?

Separately, Watson reminds us of the scale of computing power required to simulate the behavior of the human brain. As an MIT undergraduate in 1981, I remember a senior professor in A.I. research suggesting that computer ownership would have to be regulated and licensed. Why? Suppose that a kid in Brazil, with a computer as powerful as a VAX 11/780, discovered the "secret" of AI. The kid could program his VAX-power computer to predict the next day's stock market price, make infinite money, use that infinite money to buy infinite power, a private army, etc. The professor was prescient enough to see that integrated circuit technology would one day put the awesome power of the VAX within the reach of consumers. How powerful was a typical VAX? It executed approximately 500,000 instructions per second and held up to 8 MB of memory. A Motorola Atrix Android phone has two 1 GHz processors on board, capable of handling approximately 1 billion instructions per second (2000X as fast as the VAX), and holds up to 16 GB of fast memory (2000X as much solid-state memory as the VAX). Watson ran with at least 15 TB of RAM (roughly 2 million times as much as the VAX) and about 2880 CPU cores (clock rate 3 GHz, so let's say 1.5 billion instructions per second per core, or around 4 trillion instructions per second total, nearly 10 million times faster than the VAX).
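The ratios above are back-of-the-envelope; here is the arithmetic spelled out (the instructions-per-second and memory figures are the rough ones quoted in the paragraph, not precise benchmarks):

    # Rough figures from the paragraph above; all are order-of-magnitude estimates.
    vax_ips    = 500_000        # VAX 11/780: ~500K instructions per second
    vax_mem    = 8e6            # 8 MB of memory
    phone_ips  = 1e9            # Atrix: two 1 GHz cores, ~1 billion instructions/s
    phone_mem  = 16e9           # 16 GB of solid-state memory
    watson_ips = 2880 * 1.5e9   # 2880 cores at ~1.5 billion instructions/s each
    watson_mem = 15e12          # at least 15 TB of RAM

    print(phone_ips / vax_ips)   # 2,000x the VAX's speed
    print(phone_mem / vax_mem)   # 2,000x the VAX's memory
    print(watson_mem / vax_mem)  # ~1.9 million times the VAX's memory
    print(watson_ips / vax_ips)  # ~8.6 million times the VAX's speed (call it 10 million)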

Is Watson going to lead the way to a range of revolutionary software products that will assist humans? I'm not sure why it would. The information that is most accessible to Watson is in computer-readable text format. If it is in computer-readable text format, it is already accessible to "the Google". If I want to know what city most often appears alongside the phrase "big apple", I can type that into Google and scan a page of results for the answer immediately. It also works to type "apple city", "new york city nickname", "bankrupt city 1970s", "king kong city", etc. Without a good speech-to-text front-end, however, it is hard to see how Watson is going to be hugely more useful than The Google. And achieving speech-to-text may yet prove to be just as hard as all of A.I. (if you don't believe it, try applying your brilliant human intelligence to transcribing a language that you don't understand, e.g., Italian; it is virtually impossible to disambiguate the sounds unless you understand the content).

Text and photos Copyright 2000-2011 Philip Greenspun. Image at top left is from Toronto, Watson's response to a question calling for a "U.S. city".
philg@mit.edu

Reader's Comments

I disagree that chess was declared uninteresting only after Kasparov was beaten. The fact is that the brute-force algorithms used, which amount to nothing more than a sophisticated depth-first tree search, have always been uninteresting to AI researchers. We've known for some time that computers search trees fairly well. I remember a CS classmate of mine in the early 1980's who wanted to write one as a term project. The professor rolled her eyes and told him to get a real project.

Deep Blue may have been an achievement in terms of parallel computer architecture, but it wasn't AI at all. The algorithms used don't mimic what humans do in any way. Humans don't evaluate billions of possible board positions to come up with a move. They somehow manage to subconsciously prune that enormous tree down to a handful of possibilities that are consciously analyzed. Deep Blue sheds no light on how they do it.



-- Mark Ciccarello, March 10, 2011
