Image recognition was already good—but it's getting way, way better. A research collaboration between Google and Stanford University is producing software that increasingly describes the entire scene portrayed in a picture, not just individual objects.

The New York Times reports that algorithms written by the team attempt to explain what's happening in images, in language that actually makes sense. The software spits out sentences like "a group of young people playing a game of frisbee" or "a person riding a motorcycle on a dirt road."

It does that using two neural networks: one handles image recognition, the other natural language processing. The system relies on machine learning: it's fed a series of captioned images and gradually learns how the sentences relate to what the images show. The resulting software is, according to the team, about twice as accurate as anything that came before it.
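For the curious, here's a rough sketch of how two networks like that can be chained together. To be clear, this is not the team's code: every class name, size, and the tiny vocabulary below are made up for illustration, the weights are random rather than learned, and the training step on captioned images is left out entirely. It only shows the plumbing, with one network turning pixels into a feature vector and a second turning that vector into words, one at a time.

```python
import math
import random

# Hypothetical toy vocabulary; a real system would learn thousands of words.
VOCAB = ["<start>", "<end>", "a", "person", "riding",
         "motorcycle", "on", "dirt", "road"]


def random_matrix(rows, cols, rng):
    """Random weights standing in for values a real system would learn."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ImageEncoder:
    """Stand-in for the image-recognition network: pixels -> feature vector."""

    def __init__(self, n_pixels, n_hidden, rng):
        self.weights = random_matrix(n_hidden, n_pixels, rng)

    def encode(self, pixels):
        return [math.tanh(v) for v in matvec(self.weights, pixels)]


class CaptionDecoder:
    """Stand-in for the language network: feature vector -> list of words."""

    def __init__(self, n_hidden, rng):
        self.w_hidden = random_matrix(n_hidden, n_hidden, rng)
        self.w_embed = random_matrix(n_hidden, len(VOCAB), rng)
        self.w_out = random_matrix(len(VOCAB), n_hidden, rng)

    def decode(self, features, max_words=8):
        hidden = features  # the image features seed the recurrent state
        word = VOCAB.index("<start>")
        caption = []
        for _ in range(max_words):
            one_hot = [1.0 if i == word else 0.0 for i in range(len(VOCAB))]
            recurrent = matvec(self.w_hidden, hidden)
            embedded = matvec(self.w_embed, one_hot)
            hidden = [math.tanh(r + e) for r, e in zip(recurrent, embedded)]
            scores = matvec(self.w_out, hidden)
            # Greedy choice: take the highest-scoring word at each step.
            word = max(range(len(VOCAB)), key=scores.__getitem__)
            if VOCAB[word] == "<end>":
                break
            caption.append(VOCAB[word])
        return caption


def describe(pixels, encoder, decoder):
    """Run the image through both networks and join the words into a sentence."""
    return " ".join(decoder.decode(encoder.encode(pixels)))


rng = random.Random(0)
encoder = ImageEncoder(n_pixels=16, n_hidden=6, rng=rng)
decoder = CaptionDecoder(n_hidden=6, rng=rng)
fake_image = [rng.random() for _ in range(16)]  # 16 made-up pixel values
caption = describe(fake_image, encoder, decoder)
print(caption)  # untrained weights, so expect word salad
```

With random weights the output is gibberish. The learning part would consist of nudging those weights, over many captioned examples, until the words coming out match the captions going in.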


It's not, however, perfect. As the image above shows, it often makes small mistakes and, occasionally, gets things completely wrong. Clearly there's room for improvement, but it's evident that image recognition is advancing apace.

And, perhaps unsurprisingly given Google's involvement, the natural application is in search. Such an algorithm could easily return relevant images when you type in "three cats eating ice cream sundaes in a billiard room" in a way that current technology just can't manage. And isn't that what we all want? (Better search, I mean, not the cats. Well, maybe the cats.) [Google Research Blog, Stanford University via New York Times]