What It's Like to Judge the Turing Test

"What are your favourite Sci Fi movies?" "I like Star Wars and The Matrix," comes the typed reply.

I am trying to work out if I'm talking to a "hidden human" in the next room, or actually a machine located somewhere in cyberspace.

"Can we agree that the prequels sucked?" I continue. "Absolutely! Lucas should be shot!"

That settled it - only a flesh-and-blood movie buff could be so enraged by The Phantom Menace.

This was one of the easier-to-gauge exchanges I had as Judge J-18 in last weekend's "Turing-test marathon" at Bletchley Park in central England. The test aims to separate the humans from the machines through rigorous questioning by judges. It was designed over 60 years ago by Alan Turing, the grandfather of computers, whose work in Hut 8 at Bletchley Park played a vital role in the Allied code-breaking effort during the Second World War.

Saturday's marathon, along with similar events worldwide, was held in celebration of Turing's centenary. The largest Turing test yet undertaken, Saturday's session tried to recreate what the great man envisaged in his 1950 paper outlining its methodology.

So, what is it actually like to be a judge in a Turing test? Anxious to find out, I signed up months in advance for the Bletchley Park event. The premise is similar to interrogating a spy: ask enough questions, and eventually the suspect tips his hand. As Turing envisioned it, you'd have a human behind one opaque screen, and a computer behind another. A judge sits in front of them with no way of knowing who or what is behind each screen. The judge can ask questions of the two entities; they reply by text-based chat. If the machine is so good at producing human-like responses that the judge can't tell which is which, based on a five-minute conversation, the machine is on its way to passing the Turing test.

Turing didn't expect any machine to fool all the judges all the time, but he speculated that by the year 2000 "an average interrogator will not have more than 70 per cent chance of making the right identification" - that is, computer programs would stymie the judges 30 per cent of the time. Twelve years late, this year's test set out to see if we could finally reach that bar.

The tests took place in the mansion's former billiards room, where I took my seat next to the other Session 1 judges, each of us perched in front of a standard PC. Huma Shah, the event's organizer, explained the rules: There would be two kinds of tests. In one version, the judge would chat with a single unseen entity for five minutes. In the other version, there would be a split screen, and the judges would converse with two entities at the same time, again for five minutes, trying to deduce if one (or both, or neither) of the entities was a machine.

In several cases, the computers gave themselves away nearly from the start. If my correspondent couldn't answer a simple question, or changed the subject abruptly and for no apparent reason, it struck me as almost certainly a machine.

At the opposite end of the spectrum, we have my almost-certainly-human Star Wars buff, or the Beatles fan I encountered - "best band ever" - who, when asked to choose between the Rolling Stones and The Who, replied "definitely The Stones - The Who went too stadium toward the end." While I disagree with the sentiment - in my mind, The Who made great music right up to the band's split in the early 80s - the reply seemed far too… well, far too human… to be written by a machine.

Other exchanges, however, seemed far more ambiguous. When I said I was from Canada, one respondent said they had heard "great things" about Canada, "except that Quebec is very French." Was that something a computer might cough up after spending a few milliseconds skimming the Wikipedia entry for my country? Or was it just a human with a vague memory of what their school teacher once said about Canada, perhaps tinged with mild anti-French prejudice? Or was it merely a person who had grown tired after a couple of hours of conversing with strangers by text?

In devising the game, Turing was acknowledging that mastery of language is often seen as going hand in hand with intelligence. Indeed, linguistic ability involves more than just stringing words together into sentences. Holding a conversation likely depends on a whole host of other cognitive abilities - the ability to monitor one's own thoughts, one's environment, other creatures - perhaps even to guess what other people are thinking. Children acquire these cognitive skills as a part of their ordinary development. Instilling such capability in a machine, on the other hand, is a herculean challenge, and the programmers behind Saturday's chatbots should be lauded for getting their creations to perform as well as they did. The best program, Eugene Goostman, a chatbot with the personality of a 13-year-old boy, nearly passed the test, fooling the judges just shy of the 30 per cent mark suggested by Turing in his 1950 paper.

The Turing test marathon showed just how challenging it is for a machine to carry on a real conversation. As Abraham Lincoln might have put it, you can fool some of the judges some of the time - but not much more than that. At least, not quite yet.

Image by Dan Falk


New Scientist reports, explores and interprets the results of human endeavour set in the context of society and culture, providing comprehensive coverage of science and technology news.