Merely existing in the modern world means giving up a wealth of your information to countless institutions and services. While many of these places promise to keep your identifiable data as secure and private as possible, they can still share anonymized versions of your data with third parties, and oftentimes do, whether that’s for research or for profit. But new research indicates that even when data is stripped of identifying details, it doesn’t take much mental gymnastics to piece together certain information and figure out, with pretty high confidence, who the “anonymous” user in the dataset is.
In other words, anonymized data is not so anonymous.
Researchers at Imperial College London published a paper in Nature Communications on Tuesday that explored how inadequate current techniques for anonymizing datasets are. Before a company shares a dataset, it will remove identifying information such as names and email addresses, but the researchers were able to game this system.
Using a machine learning model and datasets that included up to 15 identifying characteristics (such as age, gender, and marital status), the researchers were able to accurately reidentify 99.98 percent of Americans in an anonymized dataset, according to the study. For their analyses, the researchers used 210 different datasets, gathered from five sources including the U.S. government, that featured information on more than 11 million individuals. Specifically, the researchers define their findings as a successful effort to propose and validate “a statistical model to quantify the likelihood for a re-identification attempt to be successful, even if the disclosed dataset is heavily incomplete.”
The study gave a hypothetical in which a health insurance company publishes an anonymized dataset of 1,000 people, amounting to one percent of its total customers in California. The dataset includes each individual’s birth date, gender, ZIP code, and breast cancer diagnosis. The boss of one of the individuals in the dataset finds a record for someone who is male, lives in that individual’s ZIP code, has the same birth date, and, according to the dataset, was diagnosed with breast cancer and had unsuccessful stage IV treatments. But the insurance company can argue that, while this highly specific record matches what the employer knows about the employee, it could just as well belong to any of the tens of thousands of other customers who were left out of the dataset, if that individual is even insured with that company.
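To get a feel for how much weight the insurer's argument can carry, here is a minimal Python sketch of the same setup. It is not the statistical model from the study. It assumes an invented population of 100,000 customers with independent, uniformly random attributes, and every name in it (`make_population`, `confirmed_share`, the column names) is hypothetical. It releases a one percent sample, then asks how often a record that looks unique in the release is also unique among all customers.

```python
import random
from collections import Counter

POPULATION_SIZE = 100_000  # all customers (1,000 released = one percent)
SAMPLE_SIZE = 1_000        # size of the published "anonymized" dataset


def make_population(n, seed=42):
    """Build n synthetic customers. Attributes are uniform and independent,
    which real demographic data is not; this is an assumption for illustration."""
    rng = random.Random(seed)
    return [
        {
            "gender": rng.choice(["F", "M"]),
            "birth_year": rng.randint(1940, 2000),
            "birthday": rng.randint(1, 365),  # day of the year
            "zip_code": rng.randint(0, 49),   # 50 made-up ZIP codes
        }
        for _ in range(n)
    ]


def confirmed_share(population, sample, columns):
    """Among records that are unique within the sample, return the share
    that are also unique within the whole population."""
    def key(record):
        return tuple(record[c] for c in columns)

    population_counts = Counter(key(r) for r in population)
    sample_counts = Counter(key(r) for r in sample)
    sample_unique = [r for r in sample if sample_counts[key(r)] == 1]
    if not sample_unique:
        return 0.0
    confirmed = sum(1 for r in sample_unique if population_counts[key(r)] == 1)
    return confirmed / len(sample_unique)


population = make_population(POPULATION_SIZE)
sample = random.Random(7).sample(population, SAMPLE_SIZE)

# Birth year only: a unique match in the release is rarely the real person.
coarse_share = confirmed_share(population, sample, ["gender", "zip_code", "birth_year"])
# Full birth date: a unique match in the release is almost always correct.
fine_share = confirmed_share(
    population, sample, ["gender", "zip_code", "birth_year", "birthday"]
)

print(f"coarse attributes: {coarse_share:.1%} of unique matches are truly unique")
print(f"full birth date:   {fine_share:.1%} of unique matches are truly unique")
```

In this toy setup the insurer's objection holds when the released attributes are coarse, and largely falls apart once a full birth date is included. The exact percentages depend entirely on the made-up distributions and should not be read as results from the paper.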
“While there might be a lot of people who are in their thirties, male, and living in New York City, far fewer of them were also born on 5 January, are driving a red sports car, and live with two kids (both girls) and one dog,” Dr. Luc Rocher of UCLouvain, an author on the paper, said in a statement.
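Rocher's point can be sketched in a few lines of Python. The example below is only an illustration on invented data, not the researchers' method: the population, the attribute names, and the helper `fraction_unique` are all assumptions, and the attributes are drawn independently and uniformly, which real data is not. It counts what share of people are the only one with their particular combination of attributes as more attributes are added.

```python
import random
from collections import Counter


def fraction_unique(records, columns):
    """Share of records whose values on `columns` match no other record."""
    keys = [tuple(record[c] for c in columns) for record in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(records)


def make_synthetic_population(n, seed=0):
    """Invented people with independent, uniform attributes (an assumption)."""
    rng = random.Random(seed)
    return [
        {
            "gender": rng.choice(["F", "M"]),
            "age": rng.randint(18, 90),
            "zip_code": rng.randint(0, 499),  # 500 made-up ZIP codes
            "birthday": rng.randint(1, 365),  # day of the year
            "car_color": rng.choice(["red", "black", "white", "blue", "grey"]),
            "num_kids": rng.randint(0, 3),
        }
        for _ in range(n)
    ]


population = make_synthetic_population(20_000)
columns = ["gender", "age", "zip_code", "birthday", "car_color", "num_kids"]

# Uniqueness when only the first k attributes are known, for k = 1..6.
shares = [fraction_unique(population, columns[:k]) for k in range(1, len(columns) + 1)]

for k, share in enumerate(shares, start=1):
    print(f"{k} attribute(s) {columns[:k]}: {share:.1%} unique")
```

Each added attribute can only split groups of look-alikes further, so the unique share never goes down as the list grows. With enough attributes, nearly everyone in this synthetic population stands alone.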
Senior author Dr. Yves-Alexandre de Montjoye, a researcher at Imperial’s Department of Computing and Data Science Institute, characterized such attributes as “pretty standard information for companies to ask for.”
Even the hypothetical illustrated by the researchers in the study isn’t a distant fiction. In June of this year, a patient at the University of Chicago Medical Center filed a class-action lawsuit against both the private research university and Google, alleging that the former shared his data with the latter without his consent. The medical center allegedly de-identified the dataset, but still gave Google records with the patient’s height, weight, vital signs, information on diseases they have, medical procedures they’ve undergone, medications they are on, and date stamps. The complaint pointed out that, aside from the breach of privacy in sharing intimate data without a patient’s consent, even if the data was in some way anonymized, the tools available to a powerful tech corporation make it pretty easy to reverse engineer that information and identify a patient.
“Companies and governments have downplayed the risk of re-identification by arguing that the datasets they sell are always incomplete,” de Montjoye said in a statement. “Our findings contradict this and demonstrate that an attacker could easily and accurately estimate the likelihood that the record they found belongs to the person they are looking for.”
The researchers put the onus on policymakers to create better standards for anonymization techniques in order to ensure that the sharing of datasets doesn’t continue to be a potentially far-reaching invasion of privacy. Some of the most powerful and exploitative companies in the world are obtaining datasets that provide enough information to confidently identify someone included in them. The consequences of those companies, or malicious actors, piecing together a fully formed picture of someone from just a handful of identifying characteristics are insidious, and the researchers’ ability to identify such a large number of deidentified users with only 15 attributes indicates that we need to reevaluate what constitutes an ethical anonymized dataset.
“The goal of anonymization is so we can use data to benefit society,” de Montjoye said. “This is extremely important but should not and does not have to happen at the expense of people’s privacy.”