Prosecraft.io, a site that used novels to power a data-driven project displaying word count, passive voice, and other, much more subjective writing-style markers such as vividness, shut down today after authors protested the project. Prosecraft used the full text of over 25,000 books, all of it copyrighted material, to develop a library of data. Authors, once they caught wind of what was happening, immediately hated this.
Zach Rosenberg was the author who first brought the site to the wider attention of authors on X, the site formerly known as Twitter. Pretty soon, more and more authors spoke out, including high-profile authors like Jeff VanderMeer (The Southern Reach trilogy), Indra Das (The Devourers), and Gretchen Felker-Martin (Manhunt).
Part of the backlash is because Prosecraft has admitted to using “AI algorithms.” In a blog post dated October 5, 2018, Benji Smith, the developer of both Prosecraft and the writing program Shaxpir that was based on the data mined from Prosecraft’s library, stated that “we taught our machine-learning [AI] algorithms to recognize which kinds of words can be used in which kinds of contexts, by looking at the types of words and phrases that tend to occur within similar sentences and paragraphs.” Additionally, he wrote that Shaxpir “[analyzed] more than 560 million words of fiction, from more than 5,800 books, written by more than 3,300 popular authors.” He does not disclose where he obtained those works of fiction, or whether he had permission to use them.
While the technology used is not necessarily a generative large language model like ChatGPT, it is not a stretch to say that incorporating generative LLM algorithms could have been on the horizon for Prosecraft. And since the site had a massive library of books, authors’ fears are incredibly valid. In the wake of this backlash, Smith has written a lengthy blog post on Medium explaining why he voluntarily took down Prosecraft.
Although Prosecraft was only publishing portions of the text, it did not have permission from any authors or publishers to create a database based on the entire work of an author or the full text of a book. Smith wrote on the blog, “since I was only publishing summary statistics, and small snippets from the text of those books, I believed I was honoring the spirit of the Fair Use doctrine, which doesn’t require the consent of the original author.”
While this holds some water, Fair Use does not, by any stretch of the imagination, allow you to use an author’s entire copyrighted work without permission as a part of a data training program that feeds into your own “AI algorithm.” This situation is certainly going to be a lesson for many people, and it’s clear that authors are not going to allow their work to be used to train LLMs and vector networks.
Update August 8, 11:35 a.m.: Fixed the mistaken legal definition where copyrighted works were referred to as ‘copywritten.’ io9 sincerely regrets the error.