Few things in this world are more personal than your DNA. For this reason, research efforts like the Human Genome Project have always respected the privacy of their participants. By separating individuals' identities from their donated DNA samples, researchers have upheld a standard of "genomic anonymity." Those days are now officially over.
In today's issue of Science, researchers at MIT demonstrate that the identities of volunteers who donate personal genome sequence data can be revealed using only publicly available information. In an interview with io9, lead researcher Yaniv Erlich discusses how the team's search method works, the implications of the method's discovery, and why this could change the way we deal with genetic data.
A Brave New World
"We are living in a brave new world," Erlich tells io9, "a world where more information than ever is readily available online." What happens to this information depends on who's making use of it. In the hands of a scientist, it can be used to study, treat and cure diseases. In the hands of Facebook, it can be used to create powerful new search engines. In the hands of a criminal, it can be used to commit identity theft.
Few people understand the ethically complex nature of information accessibility better than Erlich. Today, he's a fellow at MIT's Whitehead Institute, where he runs a laboratory that creates new tools and algorithms for studying human genomes. But Erlich's background is in computer security. Before he worked at MIT, banks would hire him to find weak points in their electronic infrastructure. In a study recounted in today's issue of Science, Erlich brought his experience in finding vulnerabilities to bear on the world of genomics, bringing with it an end to the current paradigms for managing genetic information.
"Basically, we show that you can take whole genome sequencing data that is posted online and cross-reference it with public genealogy data to infer the identity of [an ostensibly anonymous] donor." And you can do it with Google searches.
The Methodology
The genomic outing process works in stages. Erlich's team began with a batch of "anonymous" whole genome sequences. If a genome belonged to a male, they sought out specific markers on the Y chromosome known as short tandem repeats, or "STRs."
Unlike other genetic information, sequences on the Y chromosome are passed exclusively from father to son. Similar patterns of STRs are therefore preserved across several generations, and are a great way to trace a paternal line. Needless to say, Y chromosomes (and STRs) are a big deal not only to geneticists, but also to more casual genealogy enthusiasts looking to trace the ramifying arms of their family trees (known formally as "pedigrees"). Consequently, there is no shortage of publicly accessible databases online that link Y-chromosome markers to surname information. Cross-reference your STRs with a few of these databases, says Erlich, and suddenly you've got yourself a surname that matches your genome; and with each surname comes a ton of other information about the paternal line, including geographical locations and detailed pedigrees.
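To make the surname step concrete, here is a minimal sketch in Python of how a Y-STR profile might be compared against a surname-linked genealogy database. It is an illustration only, not the method Erlich's team used. The marker names follow common Y-STR naming, but every repeat count, surname, threshold and function name below is invented for the example.

```python
from typing import Dict, List, Tuple

# A Y-STR profile maps a marker name to its repeat count.
Profile = Dict[str, int]

# Hypothetical surname database; all values are invented for illustration.
SURNAME_DB: List[Tuple[str, Profile]] = [
    ("Smith", {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS392": 13}),
    ("Jones", {"DYS19": 15, "DYS390": 23, "DYS391": 10, "DYS392": 11}),
    ("Brown", {"DYS19": 14, "DYS390": 24, "DYS391": 10, "DYS392": 13}),
]


def mismatches(query: Profile, record: Profile) -> Tuple[int, int]:
    """Return (markers that differ, markers compared) for two profiles."""
    shared = set(query) & set(record)
    differing = sum(1 for marker in shared if query[marker] != record[marker])
    return differing, len(shared)


def rank_surnames(
    query: Profile,
    db: List[Tuple[str, Profile]],
    min_shared: int = 3,   # assumed threshold: need enough markers in common
    max_diff: int = 1,     # assumed threshold: tolerate a single mutation
) -> List[Tuple[str, int, int]]:
    """List candidate surnames, closest match first."""
    candidates = []
    for surname, record in db:
        diff, compared = mismatches(query, record)
        if compared >= min_shared and diff <= max_diff:
            candidates.append((surname, diff, compared))
    # Fewest differences first, then most markers compared, then by name.
    return sorted(candidates, key=lambda c: (c[1], -c[2], c[0]))


query = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS392": 13}
print(rank_surnames(query, SURNAME_DB))
# [('Smith', 0, 4), ('Brown', 1, 4)]
```

Allowing one differing marker reflects the fact that STRs occasionally mutate between generations, so a near match can still point to the right paternal line. It also shows why a surname alone is only a candidate, not an identification.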
But a surname can identify tens of thousands of people, which meant the genomes Erlich's team was studying were still, for all intents and purposes, anonymous. So Erlich and his team went deeper by searching public record databases for metadata like age and state of residence (neither is protected under the HIPAA Privacy Rule). "The combination of surname, age and state is a very strong identifier," Erlich tells io9. "It's rare that you find more than a dozen with the same combination, and all three are very searchable online."
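The narrowing effect of combining surname, age and state can be sketched as a simple filter over public records. Again, this is a hedged illustration rather than the team's actual tooling: the `PublicRecord` type, the `narrow` function, the one-year tolerance and all of the records are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PublicRecord:
    """One entry in a hypothetical public-records listing."""
    name: str
    surname: str
    birth_year: int
    state: str


# Invented records for illustration only.
RECORDS = [
    PublicRecord("A. Smith", "Smith", 1950, "OH"),
    PublicRecord("B. Smith", "Smith", 1951, "OH"),
    PublicRecord("C. Smith", "Smith", 1950, "NY"),
    PublicRecord("D. Jones", "Jones", 1950, "OH"),
    PublicRecord("E. Smith", "Smith", 1975, "OH"),
]


def narrow(
    records: List[PublicRecord],
    surname: str,
    age: int,
    state: str,
    as_of_year: int,
    tolerance: int = 1,  # assumed slack, since age only approximates birth year
) -> List[PublicRecord]:
    """Keep records that match the surname, the state and the approximate age."""
    target_birth_year = as_of_year - age
    return [
        r for r in records
        if r.surname == surname
        and r.state == state
        and abs(r.birth_year - target_birth_year) <= tolerance
    ]


matches = narrow(RECORDS, "Smith", age=62, state="OH", as_of_year=2012)
print([r.name for r in matches])
# ['A. Smith', 'B. Smith']
```

Each field on its own leaves several candidates in this toy listing (four Smiths, four Ohio residents), but the three together leave two. That is the same effect Erlich describes at a much larger scale.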
It wasn't long before Erlich's team had found an obituary that perfectly matched the pedigree of one of their surname candidates. He recounts the experience:
Now, we're not giving the exact details, the numbers aren't exact to protect these people's identities, but let's say we have nine offspring in this pedigree. We looked at the obituary, we had nine offspring. We looked at the number of males and females and they were the same. We looked at the order of birth, and we got the same order, males and females.
Erlich's team had successfully identified an "anonymous" DNA donor, along with several members of his family. The researchers repeated this process twice over, triangulating in on two more perfectly matched pedigrees. "In the two cases of a dual surname recovery from both grandfathers," write the researchers, "the surname of the father and the maiden name of the mother matched exactly to the grandfathers' surnames, substantially increasing the confidence of the recovery."
"What are the odds that we got all these families, and all their demographic characteristics right by chance?" Erlich asks me. "Less than one in a million."
When all was said and done, the researchers had only directly inferred the identity of five men; thanks to the pedigrees, however, they had managed to breach the privacy of close to 50 individuals, male and female alike.
The Future
Ultimately, the aims of Erlich's study are three-fold.
The first is to illustrate, in no uncertain terms, that the end of anonymity for genomic data (and for other "omic"-type data, including proteomics, transcriptomics, microbiomics, etc.) is at hand. "We're not the first study to talk about issues of genetic privacy," Erlich tells io9, "but we're the first to demonstrate its vulnerability with publicly available data."
The second is to highlight that protecting human privacy is very complicated, and emphasize that things only stand to get more unwieldy from here on out.
It's worth pointing out, for example, that the methodology employed by Erlich and his team is really only relevant, at this time, to participants in the Centre d'Étude du Polymorphisme Humain (CEPH) family collection, whose genomes were sequenced as part of the 1000 Genomes Project. As the National Institutes of Health point out in a piece titled "The Complexities of Genomic Identifiability" (the perspective piece is published along with Erlich's findings in today's issue of Science): "the richness of publicly available research data and genealogic information derived from [CEPH participants] and their relatives" is unprecedented, but it won't be for long.
Advances in genomics are rapidly converging in a way that stands to make genomic anonymity an increasingly unrealistic standard. Genomes can now be sequenced more cheaply and rapidly than ever. Genomic applications — from family planning, to preventive medicine, to genetic engineering — are expanding at an ever-accelerating clip. Meanwhile, other forms of metadata are becoming increasingly available. "Just look at online obituaries," notes Erlich. "You don't see much of those as recently as ten years ago, but today they're very common. And they're text, and text is searchable. It's easy."
And moving forward, the genomic outing process will only become easier and easier.
Which brings us to the third (and, according to Erlich, most important) goal of the study: to bring issues of genomic privacy to the attention of the public.
"The main issue right now is not that tomorrow everyone is going to use this strategy to identify genomes and use that to do bad things," says Erlich. "I don't think that's really possible at this point."
"The main issue," he says, "is to recognize this problem early on, so that we can respond to it in an appropriate way in the long term by improving legislation and policy surrounding the handling of genomic information."
It's worth pointing out that Erlich and his colleagues are not at all opposed to public data sharing. "We actually favor it," he tells io9, though he emphasizes that it's not a decision that scientists should make on their own. "There needs to be discussion with the general public about how we should move forward."
What kind of issues are on the table? How about advertisements based on your DNA? (Remember Minority Report?) "Nobody does it yet, that I know of," says Erlich, "but it's completely legal as far as I know." He continues:
Imagine receiving an email that says "you have very desirable traits, would you like to be a sperm donor?" or "You are a rare blood type, please consider donating blood." Do we want to restrict that kind of communication with people?
What about companies that purchase other companies' databases? Let's say I participated in some genetic testing with one company and then another company purchases it. What are my rights?
These aren't all pie-in-the-sky questions, by the way. Some of these scenarios are unfolding right now. Just last month, biotechnology group Amgen Inc. purchased deCODE Genetics, and with it a DNA database containing tens of thousands of genomes. "What are the rights of these people?" Erlich asks.
Answers to questions like these will be decided in the near future, and they will be arrived at by adhering not to an unachievable goal of anonymity, but to a new gold standard for biomedical investigation, one that justly weighs the benefits of research against the risks and sacrifices implicit in publicizing very private information. Whether such a standard can be achieved remains to be seen.
Erlich's team's findings are published in today's issue of Science, along with an accompanying perspective piece from the NIH.