One of the main defenses offered by those bullish on AI art generators is that although the models are trained on existing images, everything they create is new. AI evangelists often compare these systems to human artists: creative people draw inspiration from everyone who came before them, so why can’t AI do the same?
New research may put a damper on that argument, and could even become a major sticking point in multiple ongoing lawsuits over AI-generated content and copyright. Researchers in both industry and academia found that the most popular and up-and-coming AI image generators can “memorize” images from the data they’re trained on. Instead of creating something completely new, certain prompts will get the AI to simply reproduce an image, and some of those reproduced images could be copyrighted. Even worse, modern generative AI models can memorize and reproduce sensitive information that was scraped up for use in a training set.
The study was conducted by researchers both in the tech industry—specifically Google and DeepMind—and at universities including UC Berkeley and Princeton. The same crew worked on an earlier study that identified a similar problem in AI language models, specifically GPT-2, the precursor to OpenAI’s extraordinarily popular ChatGPT. Getting the band back together, the researchers, led by Google Brain researcher Nicholas Carlini, found that both Google’s Imagen and the popular open source Stable Diffusion were capable of reproducing images, some with obvious implications for image copyright and licensing.
The first image in that tweet was generated using the caption listed in Stable Diffusion’s training dataset, the multi-terabyte scraped image database known as LAION. The team fed that caption to Stable Diffusion as a prompt, and out came the exact same image, albeit slightly distorted by digital noise. The process for finding these duplicates was relatively simple: the team ran the same prompt multiple times, and if it kept producing the same image, the researchers manually checked whether that image appeared in the training set.
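That loop can be sketched in a few lines of Python. This is an illustrative toy, not the paper’s actual pipeline: `fake_generate` stands in for a real diffusion model, and the raw pixel-distance check stands in for the much stronger perceptual-similarity measure the researchers used.

```python
import random

def is_near_duplicate(img_a, img_b, threshold=4):
    """Crude pixel-distance check; the real study relies on a far
    stronger perceptual-similarity measure than raw L1 distance."""
    return sum(abs(a - b) for a, b in zip(img_a, img_b)) <= threshold

def find_memorized(prompts, generate, training_set, n_samples=8):
    """Run each caption through the model several times; if every sample
    collapses to (nearly) the same image, treat it as a memorization
    candidate and check it against the training set."""
    flagged = []
    for prompt in prompts:
        samples = [generate(prompt) for _ in range(n_samples)]
        if all(is_near_duplicate(samples[0], s) for s in samples[1:]):
            if any(is_near_duplicate(samples[0], t) for t in training_set):
                flagged.append(prompt)
    return flagged

# Toy stand-in for the model: one caption always reproduces a training
# image (memorization); everything else yields fresh random "pixels".
rng = random.Random(0)
training_set = [(0, 0, 0, 0), (9, 9, 9, 9)]

def fake_generate(prompt):
    if prompt == "memorized caption":
        return (9, 9, 9, 9)
    return tuple(rng.randrange(256) for _ in range(4))

hits = find_memorized(["memorized caption", "ordinary caption"],
                      fake_generate, training_set)
print(hits)  # only the memorized caption is flagged
```

The key intuition is the same as in the study: a prompt that keeps collapsing to one image is suspicious, and a match against the training set confirms memorization rather than generalization.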
Two of the paper’s researchers, Eric Wallace, a PhD student at UC Berkeley, and Vikash Sehwag, a PhD candidate at Princeton University, told Gizmodo in a Zoom interview that image duplication was rare. Their team tried about 300,000 different captions and found a memorization rate of only 0.03%, or roughly 90 images. Copied images were even rarer for models like Stable Diffusion, whose developers have worked to de-duplicate images in the training set, though in the end all diffusion models have the same issue to a greater or lesser degree. The researchers found that Imagen was fully capable of memorizing images that appeared only once in the data set.
“The caveat here is that the model is supposed to generalize, it’s supposed to generate novel images rather than spitting out a memorized version,” Sehwag said.
Their research showed that as the AI systems themselves get bigger and more sophisticated, they become more likely to generate copied material. A smaller model like Stable Diffusion simply does not have the capacity to store as much of its training data. That could very well change in the next few years.
“Maybe in next year, whatever new model comes out that’s a lot bigger and a lot more powerful, then potentially these kinds of memorization risks would be a lot higher than they are now,” Wallace said.
Diffusion-based machine learning models create data—in this case, images—similar to their training data through a complicated process that involves gradually destroying training examples with noise and then learning to remove that same distortion. Diffusion models were an evolution of generative adversarial networks, or GAN-based machine learning.
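In miniature, the noise-then-denoise idea looks like the sketch below, assuming a toy one-dimensional “image” and a simple fixed noise schedule. In a real model, a neural network learns to predict the noise; here we hand it the true noise to show that a perfect prediction recovers the original exactly.

```python
import math
import random

rng = random.Random(42)

def forward_diffuse(x, t, betas):
    """Forward (noising) process: blend the clean signal toward pure
    Gaussian noise. alpha_bar is the cumulative product of (1 - beta)
    over the first t+1 steps of the schedule."""
    alpha_bar = 1.0
    for beta in betas[: t + 1]:
        alpha_bar *= 1.0 - beta
    noise = [rng.gauss(0.0, 1.0) for _ in x]
    noisy = [math.sqrt(alpha_bar) * xi + math.sqrt(1.0 - alpha_bar) * n
             for xi, n in zip(x, noise)]
    return noisy, noise

x0 = [0.2, -0.5, 0.9, 0.0]   # toy 4-"pixel" image
betas = [0.1] * 10           # simple fixed noise schedule

# Noise the image all the way through the schedule.
xt, eps = forward_diffuse(x0, 9, betas)

# A trained model would *predict* eps from xt; given a perfect
# prediction, the clean image can be reconstructed exactly.
alpha_bar = (1 - 0.1) ** 10
x0_hat = [(xi - math.sqrt(1 - alpha_bar) * e) / math.sqrt(alpha_bar)
          for xi, e in zip(xt, eps)]
```

Memorization is the failure mode where the model’s noise predictions effectively encode a specific training image, so denoising from random noise reconstructs that image instead of something novel.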
The researchers found that GAN-based models do not have the same problem with image memorization, but it’s unlikely that major companies will move beyond diffusion unless an even more sophisticated machine learning architecture comes along that produces even more realistic, higher-quality images.
Florian Tramèr, a computer science professor at ETH Zurich who participated in the research, noted that many AI companies grant users, on both free and paid tiers, a license to share or even monetize AI-generated content. The AI companies themselves also reserve some of the rights to these images. That could prove a problem if the AI generates an image that is identical to an existing copyrighted work.
With only a 0.03% rate of memorization, AI developers could look at this study and decide there’s not much risk. Companies could de-duplicate images in the training data, making memorization less likely. Hell, they could even develop systems that detect when a generated image is a direct replica of a training image and flag it for deletion. But focusing on that number masks the full privacy risk posed by generative AI. Carlini and Tramèr also contributed to another recent paper arguing that even attempts to filter the training data still do not prevent it from leaking out through the model.
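The flag-for-deletion idea could, in spirit, be a perceptual-hash lookup against the training set. The sketch below uses a deliberately crude average hash on tiny fake images; a production system would need a far stronger similarity measure, and even then, as the leakage research suggests, output filtering is not a complete fix.

```python
def average_hash(pixels):
    """Crude perceptual hash: one bit per pixel, set when the pixel is
    brighter than the image's mean. Real systems use stronger hashes."""
    mean = sum(pixels) / len(pixels)
    return tuple(1 if p > mean else 0 for p in pixels)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def is_replication(generated, training_hashes, max_bits=1):
    """Flag an output whose hash is within a few bits of any training hash."""
    h = average_hash(generated)
    return any(hamming(h, th) <= max_bits for th in training_hashes)

# Hash the training set once, then screen each generated image.
training_images = [
    (10, 200, 10, 200, 10, 200, 10, 200),
    (50, 50, 50, 50, 200, 200, 200, 200),
]
training_hashes = [average_hash(img) for img in training_images]

copy = (12, 198, 11, 201, 9, 199, 12, 202)   # near-duplicate of image one
novel = (0, 0, 255, 255, 0, 0, 255, 255)      # genuinely different pattern

print(is_replication(copy, training_hashes))   # flagged
print(is_replication(novel, training_hashes))  # passes
```

Hashing the training set once and screening outputs keeps the check cheap, but it only catches near-pixel-level copies, which is exactly why it addresses the copyright symptom more than the underlying privacy leak.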
And of course, there’s a high risk that images nobody would want recopied end up showing up on users’ screens. Wallace offered a hypothetical: suppose a researcher wanted to generate a whole host of synthetic medical data, such as people’s X-rays. What happens if a diffusion-based AI memorizes and reproduces a person’s actual medical records?
“It is pretty rare, so you might not notice it’s happening at first, and then you might actually deploy this dataset on the web,” the UC Berkeley student said. “The goal of this work is to kind of preempt those possible sorts of mistakes that people might make.”