Meta's AI Is Partially Trained on Breitbart and Russia Today, Study Finds

AI is sophisticated, but it’s not really intelligent. Today’s large language models, used to power programs like ChatGPT, are amalgamations of text scraped from the internet. So when Meta introduced its “state of the art” LLaMA AI back in February, eyes turned to some of the data sets used to train it, especially the Google-made “Colossal Clean Crawled Corpus,” or C4. It turns out that, like its namesake, some of the scraped text truly blows.

So just how explosive is this C4 data set? An analysis of the scraped data published Wednesday by The Washington Post shows C4 relied on some heinous sources for part of its text. The top four most-used sites were Google Patents (making up 0.46% of all tokens), Wikipedia (0.19%), Scribd (0.07%), and The New York Times’ website (0.06%). At the same time, C4 used large swaths of text from Russian propaganda site Russia Today and the ultra-right-wing Breitbart. Both were among the top 200 sites trawled for text.

The Post worked alongside researchers at the Allen Institute for AI, who recreated the data set. Some sites are far less present in the training data but are notable for their atrocious content. Stormfront, a site for white supremacists, was included in the data, ranked 27,505th. Kiwi Farms, the site known for its vile online harassment campaigns, made up 0.00004% of tokens. 4chan, and all its wild conspiracy theories, was also included in the data, though it ranked a lowly 484,297th. There are other small instances of text scraped from sites promoting conspiracies, porn, and hate content. Meta and Google did not immediately respond to requests for comment.

In addition, the data set drew on half a million personal blogs hosted on platforms like Medium, Blogspot and WordPress. It includes text from Kickstarter, Etsy and Patreon, capturing the words and style of people promoting their work online. Two of the largest scraped sites hosted voter registration databases for Colorado and Florida. Though both databases are technically public information, the scrape may have swept up private citizens’ data.

This particular data set has been used in major AI projects besides Meta’s LLaMA, such as Google’s T5 text-to-text transformer model. According to Google, C4 was originally developed by the company as a “cleaned version” of the nonprofit Common Crawl’s AI training data. Google said it removed offensive or “noisy” content from the data set, including any dirty language and offensive slurs. Google’s LaMDA AI, which is used for the company’s Bard chatbot, is something of a black box. It was trained on a data set called Infiniset, which is described as 1.56 trillion dialogs (words used in context), 50% of which come from public forums. Another 12.5% of its training set is C4 data, while the rest comes from English-language Wikipedia and other web documents.

According to the research paper released alongside LLaMA, 15% of its pre-training data came from C4. Another 67% came from filtered CommonCrawl dumps from 2017 to 2020. The rest of its data comes directly from sites like Wikipedia, Project Gutenberg, and GitHub. Last year, a programmer sued GitHub over its AI assistant tool, saying it was taking his and other coders’ work without permission.

The Post’s report is all the more enlightening considering just how hard it is to actually find information about AI training. OpenAI did not reveal even the barest details about the training of its GPT-4 LLM, released last month, citing the “competitive landscape” of AI development. Knowing what goes into the training can help explain certain biases in a model’s outputs. Researchers recently showed how ChatGPT can be used to produce overtly racist responses through some simple prompt engineering.

The Allen Institute also included its own search function so users can see if C4 used their text. A quick search for “Gizmodo” shows the data set scraped thousands of articles from and about our site from throughout the 2010s. According to the Post’s count, our site ranks only 275th, behind both RT and Breitbart.

