Anonymized Data Is Meaningless Bullshit

When most of us think of how the concept of "data" has been skewered by the press, we're probably thinking about an app's location data tipping off our home address, or apps like Grindr tipping advertisers off about our sexuality. What's less scrutinized, both by the public and by those in public office, is data that's "anonymized" (tied to something like an IP address, rather than a name), even though that's a concept we've seen to be bullshit time and again.

The latest proof comes courtesy of Dasha Metropolitansky and Kian Attari, two Harvard students who recently built a tool that combs through troves of consumer datasets uploaded from breaches across the web. As Metropolitansky and Attari told Motherboard, their program was created to link not-so-anonymous information, like emails or usernames, back to any "anonymous" data found in a decade's worth of data breaches from nearly a thousand different domains, from Adobe to YouPorn.


And, surprise, surprise, despite the bulk of these datasets being "anonymized," identifying someone caught up in a given leak wasn't difficult at all, according to the researchers.

First, let's get some facts out of the way. Big shadowy data brokers, by and large, aren't going to store anything explicitly personal about you, the person reading this story, simply because there's no value in it. Even though the ads stalking us around the web might seem to suggest otherwise, marketers give no shits about your hopes, your dreams, your fears, the gym you go to or how you sexually identify, at least not on an individual level. What they do care about is catering a specific ad to a specific demographic, which is something that's ultimately gleaned from where you live, where you shop, and, yes, in some cases, whether you're queer-identified.

Here's a personal example: Based on my NYC-based paper trail, which involves purchases at Petco, Goodwill, and some of my city's many gay bars, marketers can realistically market me anything related to cats, thrift stores, or anything bisexual with the confidence that they're not wasting money when targeting me with ads. They don't need to know who I am, per se; they just need a way to reach the target demographic that I just so happen to be a part of.

Major data brokers have reams of aggregated intel on me that's incredibly valuable because it can plop me into one of those demos with a surprising degree of accuracy. These data points aren't necessarily going to be tied to me, Shoshana, because they don't have to be to make other people money. What this data is tied to might be something like my computer's unique IP address or my phone's mobile ad identifier, which are, on their own, anonymous.


But even that particular data point isn't truly worth that much: advertisers, on a day-to-day basis, are looking at my data (and yours) as it's aggregated with data from an untold number of other people. A person's individual data, on its own, is pretty much worthless; after all, marketers can't guarantee that I'll be clicking on a given ad or buying the product they're selling. What is valuable is when that data's in aggregate, even if it's "anonymized" and not tied to any one individual. This is why Facebook, for example, can say that it's earning roughly $26 a pop from every user plugged into its system: the only reason it can say that is because it's monitoring what billions of people in aggregate are doing on its platform and off.

While one data broker might only be able to tie my shopping behavior to something like my IP address, and another broker might only be able to tie it to my rough geolocation, that's ultimately not much of an issue. What is an issue is what happens when those "anonymized" data points inevitably bleed out of the marketing ecosystem and someone even more nefarious uses them for, well, whatever. Use your imagination. In other words, when one data broker springs a leak, it's bad enough, but when dozens spring leaks over time, someone can piece that data together in a way that's not only identifiable but chillingly accurate.
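To make the piecing-together concrete, here is a minimal Python sketch of how leaked records that share an identifier could be merged into one profile. It is an illustration only, not the Harvard students' tool: the three "leaks," their field names, the addresses, and the `link_by_key` helper are all invented for this example.

```python
from collections import defaultdict

# Entirely made-up records standing in for three separate leaks.
# Hypothetical broker A: shopping behavior tied only to an IP address.
leak_a = [
    {"ip": "203.0.113.7", "purchases": ["cat litter", "thrift store"]},
    {"ip": "203.0.113.9", "purchases": ["protein powder"]},
]
# Hypothetical broker B: the same kind of IP tied only to a rough location.
leak_b = [
    {"ip": "203.0.113.7", "zip": "11211"},
]
# Hypothetical breached website: an email address next to a login IP.
leak_c = [
    {"email": "reader@example.com", "ip": "203.0.113.7"},
]


def link_by_key(key, *datasets):
    """Merge records from every dataset that share the same value for `key`."""
    merged = defaultdict(dict)
    for dataset in datasets:
        for record in dataset:
            if key in record:
                merged[record[key]].update(record)
    return dict(merged)


profiles = link_by_key("ip", leak_a, leak_b, leak_c)

for ip, profile in profiles.items():
    print(ip, profile)
```

No single made-up leak here names anyone, but once all three are joined on the shared IP address, one profile carries an email, a rough location, and a shopping history, while the second IP stays anonymous only because it never turned up next to an email.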


That's why the "anonymized data" defense from marketers and data brokers is so fucked. It's a go-to line that they can technically turn to time and again with a clean conscience, knowing that their own data collection is by the books. At the same time, these are some of the same companies that have leaked nearly 8 billion records over the past year, which ultimately negates that logic in the first place. It's enough to make you wonder where the hand-washing stops and the hand-wringing begins.

Enterprise reporter on the "big tech and big business" beat. Send your worst tips to swodinsky@gizmodo.com.

