BOSTON—Days after the inauguration, many folks panicked on seeing immediate changes to government websites. Others worried that scientists would be forced to have their work reviewed by the government first. One of the biggest worries: What would happen to the potentially petabytes of data (spreadsheets enough to fill thousands of external hard drives) collected on the environment, water quality, our food supply, or climate?
Armies of concerned folks banded together to begin pulling down sensitive data. Since then, it’s become a movement: Computer geeks, scientists, government employees and private citizens have gathered across the country to download and copy data—scientific, agricultural, judicial, etc.—collected and hosted by the United States government. Hoarding all of those bits and bytes has given a movement a purpose, and those involved hope to turn the data into a repository that anyone can use and access.
“Even in the best case scenario where none of this data goes down, I think all of these efforts still provide a service in making data more easily accessed, more easily discovered and more easily utilized,” Jeffrey Liu, MIT graduate student and organizer of this past weekend’s DataRescue event in Boston, told Gizmodo. “When data’s more accessible, people can have more productive and objective discussions.”
These types of events began back in December, when scientists, researchers and even librarians at the University of Toronto and University of Pennsylvania worried about what an administration skeptical of the scientific consensus on climate change might do with climate data. Some websites have seen altering of wording on specific topics, but as of yet, there don’t seem to have been any major deletions. Still, concerned folks have now gathered across the United States at so-called DataRescue events in almost twenty cities, with more events planned.
“I consider data deletion to be a modern day form of book burning,” Caroline Kerrigan, an IT Consultant, told Gizmodo. She was a so-called seeder at DataRescue Boston @ MIT, where dozens of people sat chattering behind computers in a vast, columned space, the walls lined with coffee and stacks of pizza boxes over the weekend. These events now have a structure. Surveyors put together how-to guides for others, to show where to find the data and what they should be taking, and to avoid overlap on datasets. Seeders help organize the pages with data to be harvested, tagging them with numbers. Harvesters, the most tech savvy team members, use scripts and code to scrape the data from each site. Storytellers document the day’s activities on social media, recording why specific datasets are important and what they might be used for.
Folks at the Boston event pulled several datasets, not only relating climate—one surveyor Gizmodo spoke with was looking at Department of Justice data on immigration and employee rights, for instance. The data will still need to be validated, and made available on Datarefuge.org
Participants had all sorts of reasons for being there. “After the election I was feeling like I was somewhat powerless and didn’t have an outlet for my desire to make some change,” said George Gaudette, a harvester at the event. Another harvester, Janet Riley, said this was the kind of engagement where she could use her “nerd powers” to actually accomplish something. “I’ve paid for this data and I’ll be damned if someone deletes it.”
Others thought that archiving data could help Americans get ahead of any issues that may arise in the future. “I’m Canadian. I lived through the [Stephen] Harper years, which were a bit of a black mark in terms of the censorship of scientific findings,” said Brendan O’Brien, a volunteer from the Environmental Data and Governance Initiative, or EDGI, who wrote some of the archiving code. “That may not happen, but we do need to take preventative measures. If we do that then we’re guaranteed” not to lose anything. Liu also told Gizmodo that some folks from government agencies like the Environmental Protection Agency were taking part, but would not identify themselves due to retaliatory fears.
Part of pulling all that data is actually protecting it and making it available. DataRefuge and EDGI together are attempting to secure and distribute data so it might stay available for researchers, while developing the tools to more easily scrape everything. Data must all be tagged, and the metadata, information about what the data actually is, must be collected. Datasets receives special codes based on what’s inside to ensure that nothing has changed. Eventually, organizers hope Datarefuge.org will be available as an convenient, easy-to-use database. “It took the first year of my PhD just to develop the tools to get the data we needed from one government website,” said Liu. A convenient one-stop data shop would save “a lot of effort that could have been put towards doing actual research.”
For now, the data lives on Amazon storage servers. But the sheer volume of government data out there—anywhere between ten terabytes and five thousand times that number, or 50 petabytes—means organizers are working with Dat and the InterPlanetary File System to synchronize multiple copies of the data across the web, said O’Brien. As of yet, DataRefuge and EDGI are both accepting offers for funding but it’s all volunteer run... sort of, explained Laurie Allen, assistant director of digital scholarship at the University of Pennsylvania libraries. Universities support work that would fall under many of the volunteers’ job descriptions, but it’s still extra work. “I don’t stop doing this at the end of the day at 5PM,” said Allen.
Ultimately, O’Brien said, the many DataRescue-affiliated groups will continue to learn and modify their process to be more efficient, since their work is a monumental task that no one has done before. “We’re going to archive a government in a repeatable fashion.”