Anti-Spam Turing Test Is Really Global Human-Powered OCR System


You know the test you have to take on Digg or Facebook, the one that proves you're a human? You see a hard-to-read word or string of gibberish, and you type in the correct characters. Carnegie Mellon researchers decided to replace randomly generated words with actual words from ancient manuscripts, words that machines are having trouble deciphering. When you or millions of other users type in a word, you are beating a machine and helping to preserve an irreplaceable text.


The original test is called the Completely Automated Public Turing test to tell Computers and Humans Apart, or CAPTCHA. The CMU-originated modification is called reCAPTCHA. Instead of seeing one word, you see two, one of which is already verified as correct. If you think about it, that's the only way the authentication could work: the site checks your answer against the known word and, if you get it right, trusts your reading of the unknown one. Both words are further distorted to fight spammers who may well have better OCR than the libraries.
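For the curious, here is a rough sketch of how that two-word check might work. This is not reCAPTCHA's actual code: the `Challenge` class, its method names, and the three-vote `AGREEMENT_THRESHOLD` are all assumptions made up for illustration. The idea is simply that a user's guess for the unknown word only counts if they got the verified word right, and a transcription is accepted once enough users agree.

```python
from collections import Counter

# Hypothetical threshold: how many matching human answers we require
# before accepting a transcription of the unknown word.
AGREEMENT_THRESHOLD = 3


class Challenge:
    """One two-word challenge: a control word with a known answer, plus an unknown word."""

    def __init__(self, control_answer):
        self.control_answer = control_answer
        self.votes = Counter()  # guesses for the unknown word, from users who passed

    def submit(self, control_guess, unknown_guess):
        """Return True if the user passes; record their guess for the unknown word."""
        if control_guess.strip().lower() != self.control_answer.lower():
            return False  # failed the verified word, so the other guess is not trusted
        self.votes[unknown_guess.strip().lower()] += 1
        return True

    def accepted_transcription(self):
        """Return the unknown word's transcription once enough users agree, else None."""
        if not self.votes:
            return None
        word, count = self.votes.most_common(1)[0]
        return word if count >= AGREEMENT_THRESHOLD else None
```

In this sketch, a spammer's bot that misreads the control word never gets to vote on the unknown word, which is why the verified word has to be there.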

Sites like Facebook and Twitter have already started using reCAPTCHA, and right now it's processing one million words per day. That's still chump change, though. According to Luis von Ahn, a professor at CMU:

"There's no danger of us running out of words. There's still about 100 million books to be digitized, which at the current rate will take us about 400 years to complete."

[BBC News]



I'm all for the idea, but I don't entirely understand how this saves time. They still have to scan the images, and this way they have to upload each individual word, don't they? What step am I missing here?

I just hope they don't switch to cuneiform. I have real recognition problems with cuneiform.