One of the main defenses offered by those bullish on AI art generators is that although the models are trained on existing images, everything they create is new. AI evangelists often compare these systems to human artists: creative people draw inspiration from everyone who came before them, so why can’t AI do the same?
New research may put a damper on that argument, and could even become a major sticking point in multiple ongoing lawsuits over AI-generated content and copyright. Researchers in both industry and academia found that the most popular and up-and-coming AI image generators can “memorize” images from the data they’re trained on. Instead of creating something completely new, certain prompts will get the AI to simply reproduce an image, and some of those reproduced images could be copyrighted. Even worse, modern generative AI models can memorize and reproduce sensitive information that was scraped up for use in a training set.
The study was conducted by researchers both in the tech industry—specifically Google and DeepMind—and at universities including UC Berkeley and Princeton. The same crew worked on an earlier study that identified a similar problem in AI language models, specifically GPT-2, the precursor to OpenAI’s extraordinarily popular ChatGPT. Getting the band back together, the researchers, led by Google Brain researcher Nicholas Carlini, found that both Google’s Imagen and the popular open source Stable Diffusion were capable of reproducing images, some with obvious implications for image copyright and licensing.
The first image in that tweet was generated using the caption listed in Stable Diffusion’s training dataset, the multi-terabyte scraped image database known as LAION. The team fed that caption to Stable Diffusion as a prompt, and out came the exact same image, albeit slightly distorted by digital noise. The process for finding these duplicates was relatively simple: the team ran the same prompt multiple times, and if it kept producing the same image, the researchers manually checked whether that image appeared in the training set.
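That loop can be sketched in a few lines of Python. This is an illustrative toy, not the paper’s actual pipeline: `fake_generate` stands in for a real diffusion model, and the raw pixel-distance check stands in for the much stronger perceptual-similarity measure the researchers used.

```python
import random

def is_near_duplicate(img_a, img_b, threshold=4):
    """Crude pixel-distance check; the real study relies on a far
    stronger perceptual-similarity measure than raw L1 distance."""
    return sum(abs(a - b) for a, b in zip(img_a, img_b)) <= threshold

def find_memorized(prompts, generate, training_set, n_samples=8):
    """Run each caption through the model several times; if every sample
    collapses to (nearly) the same image, treat it as a memorization
    candidate and check it against the training set."""
    flagged = []
    for prompt in prompts:
        samples = [generate(prompt) for _ in range(n_samples)]
        if all(is_near_duplicate(samples[0], s) for s in samples[1:]):
            if any(is_near_duplicate(samples[0], t) for t in training_set):
                flagged.append(prompt)
    return flagged

# Toy stand-in for the model: one caption always reproduces a training
# image (memorization); everything else yields fresh random "pixels".
rng = random.Random(0)
training_set = [(0, 0, 0, 0), (9, 9, 9, 9)]

def fake_generate(prompt):
    if prompt == "memorized caption":
        return (9, 9, 9, 9)
    return tuple(rng.randrange(256) for _ in range(4))

hits = find_memorized(["memorized caption", "ordinary caption"],
                      fake_generate, training_set)
print(hits)  # only the memorized caption is flagged
```

The key intuition is the same as in the study: a prompt that keeps collapsing to one image is suspicious, and a match against the training set confirms memorization rather than generalization.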
Two of the paper’s researchers, Eric Wallace, a PhD student at UC Berkeley, and Vikash Sehwag, a PhD candidate at Princeton University, told Gizmodo in a Zoom interview that image duplication was rare. Their team tried about 300,000 different captions and found a memorization rate of only 0.03%, or roughly 90 images. Copied images were even rarer for models like Stable Diffusion, whose developers have worked to de-duplicate images in the training set, though in the end all diffusion models have the same issue to a greater or lesser degree. The researchers found that Imagen was fully capable of memorizing images that appeared only once in the data set.
“The caveat here is that the model is supposed to generalize, it’s supposed to generate novel images rather than spitting out a memorized version,” Sehwag said.
Their research showed that as the AI systems themselves get bigger and more sophisticated, they become more likely to generate copied material. A smaller model like Stable Diffusion simply does not have the capacity to store as much of its training data. That could very well change in the next few years.
“Maybe in next year, whatever new model comes out that’s a lot bigger and a lot more powerful, then potentially these kinds of memorization risks would be a lot higher than they are now,” Wallace said.
Diffusion-based machine learning models create data—in this case, images—similar to their training data through a complicated process that involves gradually destroying training examples with noise and then learning to remove that same distortion. Diffusion models were an evolution of generative adversarial networks, or GAN-based machine learning.
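In miniature, the noise-then-denoise idea looks like the sketch below, assuming a toy one-dimensional “image” and a simple fixed noise schedule. In a real model, a neural network learns to predict the noise; here we hand it the true noise to show that a perfect prediction recovers the original exactly.

```python
import math
import random

rng = random.Random(42)

def forward_diffuse(x, t, betas):
    """Forward (noising) process: blend the clean signal toward pure
    Gaussian noise. alpha_bar is the cumulative product of (1 - beta)
    over the first t+1 steps of the schedule."""
    alpha_bar = 1.0
    for beta in betas[: t + 1]:
        alpha_bar *= 1.0 - beta
    noise = [rng.gauss(0.0, 1.0) for _ in x]
    noisy = [math.sqrt(alpha_bar) * xi + math.sqrt(1.0 - alpha_bar) * n
             for xi, n in zip(x, noise)]
    return noisy, noise

x0 = [0.2, -0.5, 0.9, 0.0]   # toy 4-"pixel" image
betas = [0.1] * 10           # simple fixed noise schedule

# Noise the image all the way through the schedule.
xt, eps = forward_diffuse(x0, 9, betas)

# A trained model would *predict* eps from xt; given a perfect
# prediction, the clean image can be reconstructed exactly.
alpha_bar = (1 - 0.1) ** 10
x0_hat = [(xi - math.sqrt(1 - alpha_bar) * e) / math.sqrt(alpha_bar)
          for xi, e in zip(xt, eps)]
```

Memorization is the failure mode where the model’s noise predictions effectively encode a specific training image, so denoising from random noise reconstructs that image instead of something novel.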
The researchers found that GAN-based models do not have the same problem with image memorization, but it’s unlikely that major companies will move beyond diffusion unless an even more sophisticated machine learning architecture comes along that produces even more realistic, higher-quality images.
Florian Tramèr, a computer science professor at ETH Zurich who participated in the research, noted that many AI companies grant users, on both free and paid tiers, a license to share or even monetize AI-generated content. The AI companies themselves also reserve some of the rights to these images. That could prove a problem if the AI generates an image that is identical to an existing copyrighted work.
With only a 0.03% rate of memorization, AI developers could look at this study and decide there’s not much risk. Companies could de-duplicate images in the training data, making memorization less likely. Hell, they could even develop systems that detect when a generated image is a direct replica of a training image and flag it for deletion. But focusing on that number masks the full privacy risk posed by generative AI. Carlini and Tramèr also contributed to another recent paper arguing that even attempts to filter the training data still do not prevent it from leaking out through the model.
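The flag-for-deletion idea could, in spirit, be a perceptual-hash lookup against the training set. The sketch below uses a deliberately crude average hash on tiny fake images; a production system would need a far stronger similarity measure, and even then, as the leakage research suggests, output filtering is not a complete fix.

```python
def average_hash(pixels):
    """Crude perceptual hash: one bit per pixel, set when the pixel is
    brighter than the image's mean. Real systems use stronger hashes."""
    mean = sum(pixels) / len(pixels)
    return tuple(1 if p > mean else 0 for p in pixels)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def is_replication(generated, training_hashes, max_bits=1):
    """Flag an output whose hash is within a few bits of any training hash."""
    h = average_hash(generated)
    return any(hamming(h, th) <= max_bits for th in training_hashes)

# Hash the training set once, then screen each generated image.
training_images = [
    (10, 200, 10, 200, 10, 200, 10, 200),
    (50, 50, 50, 50, 200, 200, 200, 200),
]
training_hashes = [average_hash(img) for img in training_images]

copy = (12, 198, 11, 201, 9, 199, 12, 202)   # near-duplicate of image one
novel = (0, 0, 255, 255, 0, 0, 255, 255)      # genuinely different pattern

print(is_replication(copy, training_hashes))   # flagged
print(is_replication(novel, training_hashes))  # passes
```

Hashing the training set once and screening outputs keeps the check cheap, but it only catches near-pixel-level copies, which is exactly why it addresses the copyright symptom more than the underlying privacy leak.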
And of course, there’s a high risk that images nobody would want recopied end up showing up on users’ screens. Wallace offered a hypothetical: suppose a researcher wanted to generate a whole host of synthetic medical data, such as people’s X-rays. What happens if a diffusion-based AI memorizes and reproduces a person’s actual medical records?
“It is pretty rare, so you might not notice it’s happening at first, and then you might actually deploy this dataset on the web,” the UC Berkeley student said. “The goal of this work is to kind of preempt those possible sorts of mistakes that people might make.”