Right Now a Computer Is Reading the Internet, Teaching Itself Language


In a basement at Carnegie Mellon University, a computer is reading the web. It's been doing so for nearly nine months, teaching itself the complexities and nuances of the English language. And the smarter it gets, the faster it learns.

That computer is NELL, the Never-Ending Language Learning system, and it's the star of a project involving researchers from Carnegie Mellon, supercomputers from Yahoo!, and grants from Google and DARPA. The project's aim is an elusive but important one: to design a machine that can figure out the subtleties of language all on its own. As Tom Mitchell, chairman of the school's machine learning department, explains, "we still don't have a computer that can learn as humans do, cumulatively, over the long term." NELL would be the first to do so.

The system trawls hundreds of millions of web pages, collecting facts and sorting each one into one of 280 categories, such as cities, plants, or actors. It has learned nearly 400,000 such facts to date, with 87% accuracy. NELL also currently knows some 280 relations, the links that connect two categorized things to each other. NELL probably knows that James Franco (actor) lives in (relation) New York City (city). And the more NELL learns, the faster and more efficient it gets at teaching itself:

Its tools include programs that extract and classify text phrases from the Web, programs that look for patterns and correlations, and programs that learn rules. For example, when the computer system reads the phrase "Pikes Peak," it studies the structure - two words, each beginning with a capital letter, and the last word is Peak. That structure alone might make it probable that Pikes Peak is a mountain. But NELL also reads in several ways. It will mine for text phrases that surround Pikes Peak and similar noun phrases repeatedly. For example, "I climbed XXX."

NELL, Dr. Mitchell explains, is designed to be able to grapple with words in different contexts, by deploying a hierarchy of rules to resolve ambiguity. This kind of nuanced judgment tends to flummox computers. "But as it turns out, a system like this works much better if you force it to learn many things, hundreds at once," he said.
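To make the Pikes Peak example concrete, here is a minimal Python sketch of the two cues described above: the shape of the phrase itself, and the text patterns that surround it. This is not NELL's code. The tiny corpus, the pattern table, and the function names are all hypothetical, and simple vote counting stands in for whatever the real system does to weigh its evidence.

```python
import re
from collections import Counter

# Hypothetical toy corpus; the real system reads hundreds of millions of pages.
CORPUS = [
    "I climbed Pikes Peak last summer.",
    "We climbed Mount Rainier in July.",
    "I climbed Pikes Peak again this year.",
    "She lives in New York City.",
]

# Hypothetical context patterns, each one voting for a single category.
CONTEXT_PATTERNS = {
    "mountain": ["climbed {np}"],
    "city": ["lives in {np}"],
}


def shape_votes(noun_phrase):
    """Vote based on structure alone: capitalized words, the last one 'Peak'."""
    words = noun_phrase.split()
    votes = Counter()
    if words and all(w[0].isupper() for w in words) and words[-1] == "Peak":
        votes["mountain"] += 1
    return votes


def context_votes(noun_phrase, corpus):
    """Vote once for every sentence where a category's pattern surrounds the phrase."""
    votes = Counter()
    for category, patterns in CONTEXT_PATTERNS.items():
        for pattern in patterns:
            regex = re.compile(pattern.format(np=re.escape(noun_phrase)))
            votes[category] += sum(1 for sentence in corpus if regex.search(sentence))
    return votes


def classify(noun_phrase, corpus=CORPUS):
    """Combine both kinds of evidence; return the top category, or None if no votes."""
    # Adding Counters drops categories whose total is zero.
    votes = shape_votes(noun_phrase) + context_votes(noun_phrase, corpus)
    return votes.most_common(1)[0][0] if votes else None


print(classify("Pikes Peak"))
print(classify("New York City"))
```

In this toy version, "Pikes Peak" gets a vote from its shape and more from the "climbed" sentences, while "New York City" is classified by context alone. Stacking several weak signals is the point: no single cue has to be right on its own.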


There are some instances in which an autodidactic computer can get off track. At one point, NELL sorted "internet cookies" into its "baked goods" category. That resulted in associated terms like "computer files" being labeled as baked goods, too. In these cases, Dr. Mitchell and his team correct the error and put NELL back on course. But he points out that no human learns completely on his own, either.
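The cookie mix-up shows how one bad fact can spread in a system that learns from what it already believes. The Python sketch below is a rough illustration under invented assumptions, not a description of NELL's internals: the association table is made up, and "learning" here just means adding any term associated with something already in the category. It also shows the kind of cleanup a human correction implies, which is removing the mistake along with everything learned because of it.

```python
# Hypothetical association lists: terms that tend to show up alongside each term.
ASSOCIATED = {
    "cookies": ["brownies", "internet cookies"],
    "internet cookies": ["computer files"],
}


def bootstrap(seeds, associated):
    """Grow a category by repeatedly adding terms associated with current members.

    Returns a dict mapping each learned term to the member that introduced it
    (None for the seed terms).
    """
    learned_from = {seed: None for seed in seeds}
    frontier = list(seeds)
    while frontier:
        term = frontier.pop()
        for neighbor in associated.get(term, []):
            if neighbor not in learned_from:
                learned_from[neighbor] = term
                frontier.append(neighbor)
    return learned_from


def retract(learned_from, bad_term):
    """Remove a mistaken term and every term that was learned because of it."""
    doomed = {bad_term}
    changed = True
    while changed:
        changed = False
        for term, parent in learned_from.items():
            if parent in doomed and term not in doomed:
                doomed.add(term)
                changed = True
    return {t: p for t, p in learned_from.items() if t not in doomed}


baked_goods = bootstrap(["cookies"], ASSOCIATED)
print(sorted(baked_goods))  # the error has dragged "computer files" along with it

corrected = retract(baked_goods, "internet cookies")
print(sorted(corrected))  # only the genuine baked goods remain
```

Recording which member introduced each term is a design choice made for this sketch. It is what lets a single correction undo the whole chain instead of leaving "computer files" behind.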

So what's it all for? Well, a computer that understood language like a human could answer search queries not with links but with real answers. Personal computers could be operated simply by telling them what to do. And computerized assistants could understand requests like "go get me a sub sandwich" and not waste their time looking for a sandwich shop that was located in a submarine. Too bad! That'd be funny. [NYT]