The human race produces 2.5 quintillion bytes of data every day, far too much to crunch using conventional desktop applications. Mining this mountain of data for nuggets of usable information is one of the biggest challenges facing modern society. But a new generation of analytic tools is helping us control the phenomenon that's becoming commonly known as Big Data.
It's not hard to collect large quantities of data. The world's per-capita information storage capacity has doubled roughly every 40 months since the 1980s. Data now comes from every imaginable source: social media and web text, climate information, digital media, online sales receipts, RFID readers, change logs, and cellular triangulation records. The real challenge is what to do with it.
Twitter generates roughly 12 terabytes of data every 24 hours. The Large Hadron Collider (LHC) produced 13 petabytes of data in 2010. Even Walmart handles more than 1 million customer transactions every hour. Analyzing this torrent of information, whether to quickly spot emerging trends, track the Higgs boson, or accurately identify fraud at the checkout, requires a bit more computational power than MS Access.
Instead, specialized analytic suites like the Apache Hadoop Big Data Platform run in parallel across hundreds or even thousands of servers. They're known collectively as massively parallel processing (MPP) databases. These giant clusters of machines work around the clock to manage and make sense of unwieldy data sets.
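To make the idea of running in parallel a little more concrete, here is a minimal sketch of the split, process, and merge pattern that such platforms rely on, scaled down to a single machine. It is not Hadoop's API: the function names (`map_chunk`, `reduce_counts`, `parallel_word_count`) are hypothetical, the task (counting words) is chosen only for illustration, and a pool of threads stands in for the hundreds of servers a real cluster would use.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_chunk(lines):
    """Map step: count the words in one chunk of the data set."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-chunk counts into a single total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def parallel_word_count(lines, workers=4):
    """Split the input into chunks, count each chunk in parallel, then merge."""
    # Ceiling division so every line lands in some chunk (assumed chunking rule).
    chunk_size = max(1, -(-len(lines) // workers))
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    # Threads stand in for separate servers in this toy version.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce_counts(partials)


if __name__ == "__main__":
    sample = ["big data is big", "data about data", "hadoop handles big data"]
    print(parallel_word_count(sample).most_common(2))
```

Each worker only ever sees its own slice of the input, so the same pattern can in principle spread across as many machines as the data requires; the merge step is the only point where results come back together.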