For the past 18 months, according to the Tech Review, Google has been quietly rolling out a cloud computing service for DNA. Google Genomics could one day have millions of genomes on its servers, available at a click of a button to researchers. Are there legitimate privacy concerns here? Definitely, but it's not Google's grubby fingers you should worry about.
Genetic databases already exist online, of course, and Google Genomics is only the latest (and most ambitious) iteration. There are genealogy databases for finding your ancestors and long-lost relatives. There are publicly available genetic databases run by national research centers. And there are dozens of datasets shared by research groups on a case-by-case basis with others.
Google Genomics is going around to research centers and universities offering to host their genome sequences for $25 a pop each year. The more genomes it can collect in a central repository, the easier it could be for researchers to share their data.
A genome sequence by itself is useless. Without comparing it to others, you don't know what is a mutation and what is normal. Take two genomes, and you can start to get some idea, but you'll still be swamped by the hundreds of variations. With a database of dozens, hundreds, or thousands of genomes, you get a much better chance of pinpointing the variations that matter. The bigger your database, the better.
Sequencing genomes keeps getting cheaper and easier, but sharing the resulting terabytes of data has not gotten any easier. The files are unwieldy, and different datasets are scattered among different research groups, often available only on a case-by-case basis. In contrast, Google wants to build one centralized database where a researcher can query millions of genome sequences at once. This is the infrastructure for personalized medicine.
Suppose your child turns out to have a rare and mysterious genetic disease. Or suppose you come into the hospital with cancer. By comparing one genome sequence to millions of others in a database, we can begin untangling how to best treat individuals.
As always, with big data come big privacy issues. Genome databases have to carefully calibrate how much information they provide along with the DNA sequences. The more information (age, sex, location, smoking habits, and so on), the more useful it is to researchers. But the more information, the easier it is to identify to whom the genome belongs.
A study last year in Science, for example, could identify several men from the publicly available 1000 Genomes Project based on their Y chromosomes and age, location, and family tree data. While Google Genomics's data seems to be aimed at researchers and not the general public, enabling the wide sharing of genomes makes these concerns much more pressing.
Or imagine a scenario in which you consent to sequencing your genome for a cancer study. Your genome gets uploaded to a central database, where other researchers working on other studies accidentally find out you have a newly discovered rare disease or an unknown sibling. Should they tell you?
These privacy worries aren't unique to Google Genomics, of course, but the sheer scale of its envisioned database magnifies the potential problems. Researchers have advocated for central genomic data centers, if not necessarily maintained by Google, to at least standardize privacy policies. If Google Genomics succeeds, it'll be because it forces us to reckon with the privacy issues that lie behind genome sequencing.