Giz Explains: How Data Dies (and How It Can Be Saved)

Bits don't have expiration dates. But memories will only live forever if the media and file formats holding them remain intact and coherent. Time can be as deadly to data storage as it is to carbon-based life forms.

There are lots of ways data can die: YouTube can pull a video offline before anybody snags it. Your hard drive can crash, taking with it the ultra-rare Grateful Dead bootlegs you never got a chance to upload to Usenet. Or maybe you designed a brilliant piece of visual art a decade ago in some kooky file format that simply doesn't exist anymore, and there's no way to view the file without traveling to some creepy dude's basement a thousand miles away.

What we're talking about is digital rot—or data rot or bit decay or whatever you'd like to call it—systemic processes that can mean death to data. Kind of a problem when you'd like to keep it around forever. Let's paint this in broad strokes: You can roughly break the major kinds of rot into hardware, software and network. That is, the hardware that breaks down, the formats that go extinct, and the online stuff that vanishes one way or another.

The Hard Life of Hardware

Everything's gotta be stored on something. And guess what? All media age. (Except diamonds—bling bling, biatch.) Brain cells die, film degrades and hard drives break.

A sampling of common digital media and their life expectancies (assuming you take care of them):
• Floppy disk - This can theoretically survive between 3 and 10 million passes
• CDs and DVDs - It depends heavily on the materials used in their construction (PDF), but you're looking at anywhere from 2 to 25 years, in the best of circumstances
• Flash storage - Also depends on the type: roughly 10,000 write cycles for multi-level cell flash, or 100,000 for single-level cell flash
• Hard disk drives - Kind of a crapshoot—anecdotally, five years is a good average, though they can die sooner or hang on longer, depending, again, on how they're built

Google, with its millions of servers, is in the best position to test hard drives from every manufacturer, and conducted a massive study of HDD failure. Basically, if a drive makes it past the first six months, it's pretty likely to make it through Year 4, but it is going to die at some point (and makes/models die in batches). As you probably don't need to be told, hard drives can fail in any number of ways.

In other words, whatever you're storing your precious data on, back it up, preferably with a mix of drives or media from different manufacturers/time periods.
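On a personal scale, the simplest defense against silent bit rot is to keep checksums alongside your backups and verify the copies every so often. Here's a minimal sketch of the idea (the function names and file paths are just illustrative, not any particular backup tool's API):

```python
import hashlib


def sha256_of(path):
    """Compute the SHA-256 digest of a file, reading it in 1 MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_copies(original, backups):
    """Return the backup paths whose contents no longer match the original."""
    reference = sha256_of(original)
    return [b for b in backups if sha256_of(b) != reference]
```

Run something like this against each backup drive periodically: any copy whose hash no longer matches has rotted (or been tampered with) and should be re-copied from a known-good source while one still exists.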

But what if you're, say, the Library of Congress, the largest library in the world, charged with a mission "to sustain and preserve a universal collection of knowledge and creativity for future generations," and suddenly confronted—after 200 years of relatively tranquil existence—by an unending, ever-expanding digital deluge that must be archived and cataloged? On top of a copy of every piece of material that's registered through the United States Copyright Office, and the two centuries of (oftentimes badly damaged) cultural history you're already trying to preserve? How do you store stuff?

"DVDs and CDs aren't even considered storage," say Martha Anderson and Beth Dulaban, from the LoC's Office of Strategic Initiatives. They need to transfer shiny-silver-disc content to something sturdier to meet their mission requirements. For digital content, the Library uses a mix of hard disks and tape, like Oracle's StorageTek T10000B 1TB tape drives, rated for 30 years of archive life. At the Packard Campus, the main battle station for the LoC's audio-visual preservation, they have 10,000 tapes providing 10 petabytes of capacity, Gregory Lukow, from the LoC's Motion Picture, Broadcasting & Recorded Sound Division told me. In the video above, you can see a SAMMA robot hard at work. These do analog-to-digital conversion en masse, and the LoC has four of 'em.

The key, though—even though the LoC works with drive manufacturers on boosting reliability and meeting the Library's technical specifications—is that they have a policy of redundancy and diversity: two to three copies, maybe spread across different states, and stored in different kinds of hardware running different kinds of software. The Packard Campus, which is where music and video are archived and preserved in crazy labs with robots, mirrors everything to a secret location via fiber optic cable. While you probably don't have secret bunkers to stash your porn, it's a good general guideline: More copies on more disks is more better.

A Format Can Be a Tomb

It's obvious, though, that storage media age and die. The more insidious problem, particularly with "born digital" content—stuff that started life as bits—is format obsolescence. That is, just 'cause a video wrapped up in MKV, or an Ogg Vorbis music file, or a DOCX file is readable on computers today doesn't mean they will be 20 years from now. And if nothing can read what's inside the file, the data inside is basically lost.
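One practical corollary: a file's extension tells you much less than its first few bytes. Archivists sniff "magic numbers" to figure out what a mystery file actually is, so it can be migrated before the software that reads it disappears. A toy sketch covering a few well-known signatures (real format registries track thousands of these; this is just the idea, not a production identifier):

```python
# Magic-number signatures for a handful of common formats,
# as (offset, leading bytes) pairs.
SIGNATURES = {
    "PNG image": (0, b"\x89PNG\r\n\x1a\n"),
    "PDF document": (0, b"%PDF-"),
    "ZIP container (also DOCX/EPUB)": (0, b"PK\x03\x04"),
    "Ogg container (Vorbis/Theora)": (0, b"OggS"),
    "Matroska/MKV": (0, b"\x1a\x45\xdf\xa3"),
}


def identify(data: bytes) -> str:
    """Guess a format from a file's leading bytes; 'unknown' if nothing matches."""
    for name, (offset, magic) in SIGNATURES.items():
        if data[offset:offset + len(magic)] == magic:
            return name
    return "unknown"
```

Note that a renamed or orphaned file still identifies correctly this way—the bytes don't lie—which is exactly why signature databases matter more as formats die off around the files.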

The way you might've already experienced this, in a way, is via DRM that's been deactivated (like a bunch of digital music stores did after being crushed by iTunes), rendering your songs wrapped up in it completely useless. I suspect people who bought into ebooks early, before the emergence of EPUB, are going to be effed in the ay in a similar manner. And don't even get us started on HD DVD and other failed video and audio physical formats—that's potentially a double whammy of format death.

It's important, then, to store your memories using formats that are legit standards that'll be around for a longass time, if not quite forever. Growing recognition of the problem, particularly as it pertains to ephemeral web content, is part of what's behind the push for open standards—proprietary standards, from a long-term survival standpoint, are not the best idea, 'cause once whoever makes them dies, the format may die too.

The Library of Congress has picked out seven points that'll give you an idea of how sustainable a format is—that is, likely to outlast your current Lady Gaga obsession:
• Disclosure - how open the specs are
• Adoption - "an open format that nobody's adopted isn't too useful to us"
• Transparency - how readable it is on a technical level
• Self-documentation - decent metadata, which is in some ways the secret challenge, given that it becomes more valuable as the amount of data you have grows exponentially
• External dependencies - how much you need particular hardware to read it, for example
• Impact of patents
• Technical protection mechanisms - is DRM in the way?

Quality is also an issue. So, for instance, for master digital archives of video, the Library uses Motion JPEG 2000 in an MXF wrapper, because it's mathematically lossless. It uses MPEG-2 for sub-masters, which are the source material for MPEG-4 copies that patrons can access. Or, as another example, for a long time, "PDF was considered persona non grata" because it was proprietary, but since Adobe's opened it up, they're now working with Adobe on an archivable form of PDF.

The advantage the Library has with analog-to-digital conversions is that they get to dictate the format and specs—that's not so with most of the content out there. For instance, there's not really an agreed-upon web video standard—witness the H.264 vs. Ogg Theora codec war, though that's lookin' more and more like it's going toward H.264—so web video is considered "highly at risk." The Library has captured a large amount of web video—after a year spent working out the process for doing so—but Martha and Beth "don't have real high hopes for them surviving." YouTube provides one form of hope, though, in that there's so many YouTube videos, and so many copies, "there's bound to be some community interest in keeping them alive over time."

Pulling the Plug

There might be community interest in keeping the copies of Trolololo alive and playable for the next generation from a format standpoint, but what if Google suddenly pulls the plug on YouTube? How much of what's there would be lost forever? Or photos uploaded to Flickr and Facebook, then wiped from hard drives because they're safely "in the cloud." Consider, for instance, everything that would be lost if Wikipedia really did run out of money and was shut down. Or Twitter.

This isn't purely a "what if" scenario. Last year, Yahoo, which has a habit of closing services, killed GeoCities—you had a GeoCities page, right?—nuking not just people's personal pages on an individual level, but deleting a massive archive of web history. Yahoo paid more than $3.5 billion for GeoCities just over 10 years ago. So it could happen, even to popular services—especially ones that operate under the radar, legal or otherwise, like say, Oink.CD.

They're fragile, yeah, but bits, unlike ink on paper or brain cells, can live forever, if they're taken care of. As we're awash in an ever-cresting tsunami of data, sometimes it's easy to forget that can be a pretty big if.

Thanks to Beth, Martha and Greg at the Library of Congress, the friendliest government employees I've ever talked to! Still something you wanna know? Send questions about data, Data or Reading Rainbow here with "Giz Explains" in the subject line.

Original photo from RAMAC Restoration site

Memory [Forever] is our week-long consideration of what it really means when our memories, encoded in bits, flow in a million directions, and might truly live forever.