AI is sophisticated, but it’s not really intelligent. Today’s large language models, used to power programs like ChatGPT, are amalgamations of scraped text found on the internet. So when Meta introduced its “state of the art” LLaMA AI back in February, eyes turned to some of the data sets used to train it, especially the Google-made “Colossal Clean Crawled Corpus,” or C4. It turns out that, like its namesake, some of the scraped text truly blows.
So just how explosive is this C4 data set? A Washington Post analysis of the scraped data, published Wednesday, shows C4 drew on some heinous sources for its text. The four most-used sites were Google Patents (making up 0.46% of all tokens), Wikipedia (0.19%), Scribd (0.07%), and The New York Times’ website (0.06%). At the same time, C4 used large swaths of text from Russian propaganda site Russia Today and the ultra-right-wing Breitbart. Both were among the top 200 sites trawled for text.
The Post worked alongside researchers at the Allen Institute for AI, who recreated the data set. Some sites are far less present in the training data but are notable for their atrocious content. Stormfront, a site for white supremacists, was included in the data, ranked 27,505th. Kiwi Farms, the site known for its vile online harassment campaigns, made up 0.00004% of tokens. 4chan, and all its wild conspiracy theories, was also included in the data, though it ranked a lowly 484,297th. There are other small instances of text scraped from sites promoting conspiracies, porn, and hate content. Meta and Google did not immediately respond to requests for comment.
In addition, the data set drew on text from half a million personal blogs on sites like Medium, Blogspot and WordPress. It also includes text from Kickstarter, Etsy and Patreon, capturing the words and style of people promoting their work online. Two of the largest scraped sites hosted voter registration databases for Colorado and Florida. Though that information is technically public, the scrape may have captured private citizens’ data.
This particular data set has been used for major AI projects besides Meta’s LLaMA, such as Google’s T5 text-to-text transformer model. According to Google, C4 was originally developed by the company as a “cleaned version” of the nonprofit Common Crawl’s web crawl data. Google said it removed offensive or “noisy” content from the data set, including any dirty language and offensive slurs. Google’s LaMDA AI, which is used for the company’s Bard chatbot, is something of a black box. It was trained on a data set called Infiniset, which is described as 1.56 trillion words, 50% of which come from dialogs on public forums. Another 12.5% of its training set is C4 data, while the rest comes from English-language Wikipedia and other web documents.
According to the research paper released alongside LLaMA, 15% of its pre-training data came from C4. Another 67% came from filtered CommonCrawl dumps from 2017 to 2020. The rest of its data comes directly from sites like Wikipedia, Project Gutenberg, and GitHub. Last year, a programmer sued GitHub over its AI assistant tool, saying it was taking his and other coders’ work without permission.
The Post’s report is all the more enlightening considering just how hard it is to actually find information about AI training. OpenAI did not reveal any details about the training of its GPT-4 LLM, released last month, citing the “competitive landscape” of AI development. Knowing what goes into the training can help explain certain biases in a model’s outputs. Researchers recently showed how ChatGPT can be used to produce overtly racist responses through some simple prompt engineering.
The Allen Institute also included its own search function so users can see whether C4 used their text. A quick search for “Gizmodo” shows the data set scraped thousands of articles from and about our site from throughout the 2010s. According to the Post’s count, our site ranks 275th, behind both RT and Breitbart.