First, a confession. I’ve written fanfiction. Like, a lot of fanfic. In my spare time, I still write fic! (I’m currently writing a couple of fics for Interview With the Vampire and Trigun! It’s going great, thank you.) Over the course of the past 15 years I’ve published around 750,000 words of fic, and just to give you an idea of how much that is, the entire Lord of the Rings series, including The Hobbit, is just north of 575,000 words. So there’s a lot out there!
Most of my work, like that of millions of other fic writers, exists on the Archive of Our Own. The AO3, as it’s known, is the most-visited and largest fic archive on the web with around 350 million visitors per month, and is currently host to over 11 million fanworks. And until fairly recently, I didn’t realize that my fic hadn’t stayed on AO3. My work, alongside millions of other fics, has been used to train generative text-based AI. If you’ve played around with ChatGPT, congrats! You’ve used my work.
Large language models (LLMs), the artificial neural networks behind AI text generators, are “trained” on enormous amounts of text data. The most well-known dataset is hosted by the Common Crawl, a non-profit that provides an open repository of web data to anyone who wants it, for free. In order to create the dataset, the Common Crawl scraped the internet for writing and made it publicly accessible. Its archive began in 2008 and is currently being updated every two months.
In order to create generative text AI programs, programmers used the Common Crawl dataset to underpin artificial neural networks, which are called LLMs. The most well-known LLM is GPT, which was created by the company OpenAI. OpenAI used the Common Crawl dataset in GPT’s development, and it is currently using it as it develops further versions of its successful use case, ChatGPT. OpenAI released the GPT API to the public in 2021. This API is the basis for many other text-based LLMs, which means that the current crop of “stochastic parrot” text-generator AI programs is supported by the Common Crawl via the GPT API, and, technically speaking, built on a massive corpus of fanfiction.
In 2019, the Archive of Our Own had 32 billion words of fanfic available, calculated from around five million pieces of fanwork. It currently hosts 11 million fanworks. I was unable to find a good source for how many words are on AO3 now, but I wouldn’t be surprised if it was much, much more than 50 billion words. Again, for comparison—as these are absurdly huge numbers—there are currently 4.2 billion English words on Wikipedia. For our purposes, it’s worth knowing that most, if not all, of those 32 billion words of fanfic available in 2019 are in the Common Crawl dataset that was used in OpenAI’s GPT LLM.
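For anyone who wants to check my math, here is a rough back-of-the-envelope sketch in Python using only the figures quoted above. The big assumption, and it is mine, is that the average length of a fanwork has stayed about the same since 2019; "fanworks" also includes art and podfic, so treat the result as a loose estimate rather than a real count. The function names are just illustrative.

```python
# Back-of-the-envelope estimate using only the figures quoted in this piece.
WORDS_2019 = 32_000_000_000       # words of fic on AO3 in 2019
WORKS_2019 = 5_000_000            # approximate fanworks on AO3 in 2019
WORKS_NOW = 11_000_000            # fanworks hosted on AO3 now
WIKIPEDIA_WORDS = 4_200_000_000   # English words on Wikipedia


def average_words_per_work(total_words: int, total_works: int) -> float:
    """Average length of a single fanwork, in words."""
    return total_words / total_works


def projected_words(total_words: int, total_works: int, current_works: int) -> float:
    """Scale the 2019 word count up to the current number of works.

    ASSUMPTION: average work length has not changed since 2019.
    """
    return average_words_per_work(total_words, total_works) * current_works


avg = average_words_per_work(WORDS_2019, WORKS_2019)
estimate = projected_words(WORDS_2019, WORKS_2019, WORKS_NOW)

print(f"Average words per work (2019): {avg:,.0f}")
print(f"Projected words on AO3 now:    {estimate:,.0f}")
print(f"2019 AO3 vs. Wikipedia:        {WORDS_2019 / WIKIPEDIA_WORDS:.1f}x")
```

Under that assumption the projection lands at roughly 70 billion words, which is consistent with my guess that the real number is well past 50 billion.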
Nobody was told this was happening; many fic writers still don’t know that their work was scraped at all. While the Crawl’s data exists in a publicly available index, it is extremely difficult to access if you don’t have the ability to understand and execute code at a fairly high level. The average internet user can only assume that if they had publicly available writing online, their writing ended up caught in the Crawl. So while some folks understood that the AO3 had likely been Crawled, nobody had done the digging to figure out if it was really being used.
A few weeks ago, Sudowrite—a GPT-based LLM—released its product for public beta. Unlike the call and response of ChatGPT, Sudowrite was built to facilitate fiction writing. Users can sign up and use their account to generate words that may or may not resemble a story shape. Additionally, users can paste their original words into the writing tool and the generator will offer options for what should come next. It is a highly advanced language generator focused on creating stories. And it used billions of words from the Archive of Our Own to develop its models. In a series of more and more unhinged experiments, Wired was able to prove that Sudowrite had not only been trained on AO3, but was able to replicate stories that developed within its derivative, transformative culture.
This rather ingenious and tongue-in-cheek piece of reporting revealed that Sudowrite could be prompted to generate a story within recognizable Omega Verse strictures. I am NOT getting into what constitutes an Omega Verse fic, and if you go looking for that information yourself I am not responsible for what you learn. The point is that this style of writing and the various tropes involved in writing within the Omega Verse are localized to online fanfiction communities, and were actually developed on AO3 itself. It is a culture-specific style of writing that has only recently made its way into mainstream, if non-traditional, publishing outlets. The only way that Sudowrite would be able to generate recognizable Omega Verse stories was if it had been trained on so much fanfiction that the impact of fic was unignorable within the LLM programming.
I spoke to a Sudowrite customer representative via chat who confirmed that they trained their network on OpenAI’s large language models and “their own models,” and reiterated that these models were trained on online text published from 2011 through 2019. Once again, in 2019, the AO3 had 32 billion words. Including mine.
Using fic in an LLM deliberately aimed at writers is antithetical to fandom culture at large, and deeply disrespectful to the people who have written and distributed fic online, for free, for years. Fanfic has a rocky legal history, and the creation of the Archive of Our Own has its roots in a fan-led movement to establish a home for fandoms outside of corporate influence and without threat of censorship. And now, all that work is being taken, chopped up, and regurgitated in various LLMs, without the permission of any fic author. It is, to be absolutely candid, really fucking gross.
I’ll admit that this whole thing is personal; I don’t know how much fic I had online in 2019, but it was probably around 600,000 words. Most of what I’ve written since then has been short one shots, unfinished fics, and a ton (like over two million words) of original fiction and reporting as I switched careers. But over the course of my entire time as a fic writer, I didn’t once think about any of my fic leaving the Archive. That’s because AO3, and fandom, has a culture of privacy, protection, and gifting that is antithetical to most institutions, and at extreme odds with the likes of Sudowrite.
All fandoms have their own culture of interaction. Likewise, all fic sites have their own cultures as well. The AO3, and the various fandom cultures that co-exist on the site, generally share some similar cultural values. One of the most common is that it is taboo for writers to make a profit off the fic they post on AO3. In fact, as part of the user agreement, authors are not allowed to advertise writing as a service or even link to a tip jar in order to avoid legal complications for the Archive itself. With the big exception of Wikipedia, and unlike a lot of writing on the internet that was pulled into the Crawl, fanfic on the Archive is not compensated writing. It’s not ad-supported, people didn’t pay for it, it wasn’t generating monetary value for anyone. It was a gift. Programs like Sudowrite are charging users for access to their LLM, which was built on the gifts of fic writers to fandom.
I gave my writing away, for free, because fandom is a culture of addition. Fanfic, fanart, podfic—all these things are given from an individual to the collective without expectation of anyone returning the favor. I wanted to add to the fandom because I loved the stories I was taking in at movie theaters, in books, on television. I loved writing in those worlds, and I enjoyed, beyond enumeration, the fic that I read. And now, it is a frustrating facet of fic authorship that a program like Sudowrite proposes a world where writing is done by algorithm, and that algorithm knows how I write. It knows how fandom writes.
It’s abhorrent that a program which purports to support a community of writers has based at least 32 billion words of its program on the writing of a community that did not consent to have their work used. Some people will say that there is an irony to fic writers claiming that their work was stolen, but it was put into the Crawl without permission. Derivative fanworks have the legal right to exist, and fic writers have legal rights to their own creations. Writing fic is not stealing, but taking fic and using it to develop a dataset, and then offering that dataset to the public without having gotten permission from literally anyone, is ethically gross.
For many LLM and AI developers, fanfic is not a culture to be celebrated, but a community to be exploited. They speculate about interactive models that allow people to chat with their favorite characters, not trained on the original book or original texts, but trained on fanfiction. This is partially because fic is already in the Crawl and they know they can take from fic writers without the threat of legal repercussions, and they will use the same fair use protections meant to shield fic writers from authors as an excuse for their experimentation. Fanfiction is not a market. It’s a culture. And fanfic culture hates this idea.
Fanfic is, at its core, a celebration of the stories that we love. It is a continuation of canon in beautiful, critical, exciting new ways. It challenges the text and asks deliberate questions about who wrote it that way, and why, and what would happen if the canon were different. It is a space that supports a massive amount of experimentation and boundary-pushing, and has, for a very long time, supported queer interpretation, embracing queer media in a way the mainstream is currently unable to. There is so much about fanfic that is important, and large language models will sanitize that work, echoing the most likely next word, and completely dehumanizing the effort, the emotion, and the culture that lie at the foundation of AI chatbots.
Right now, there are a hazy number of artificial neural connections in between fic and whatever words an AI outputs. While some models are free, Sudowrite is proof that fanfic has been stolen for profit. LLMs are reprehensible for a number of reasons, both ecological and ethical, but the fact that they have stolen the work of a gift culture and are attempting to both obfuscate that fact and sell it back to fic writers is, frankly, disgusting. LLM developers and fandom are diametrically opposed cultures, and one group is benefiting from the hard work of the other.
At the end of the day, if anyone wants to sit down and read a 50K Supernatural erotica; an epic, multiverse-spanning 300K Steve/Bucky fic; or a dozen cozy Star Wars coffee shop AUs, they can find what they want with a few easy filters on the Archive. And it’s there, free to read with no strings attached, given because the author enjoyed writing in the same world as those characters and wanted other people to enjoy it too. And I can guarantee you aren’t going to find the same kind of culture, experimentation, or even satisfaction in asking an LLM to write it for you. And if you can’t find it on AO3, well. You can always write it yourself.