The findings come as AI tools are increasingly promoted on pedophile forums as ways to create uncensored sexual depictions of children, according to child safety researchers. Given that AI models often need to train on only a handful of photos to re-create them accurately, the presence of more than a thousand child abuse photos in training data may provide image generators with worrisome capabilities, experts said.
Having the photos in the training data “basically gives the [AI] model an advantage in being able to produce content of child exploitation in a way that could resemble real life child exploitation,” said David Thiel, the report author and chief technologist at Stanford’s Internet Observatory.
Representatives from LAION said they have temporarily taken down the LAION-5B data set “to ensure it is safe before republishing.”
In recent years, new AI tools, called diffusion models, have cropped up, allowing anyone to create a convincing image by typing in a short description of what they want to see. These models are fed billions of images taken from the internet and mimic the visual patterns to create their own photos.
These AI image generators have been praised for their ability to create hyper-realistic photos, but they have also increased the speed and scale at which pedophiles can create new explicit images, because the tools require less technical savvy than prior methods, such as pasting kids’ faces onto adult bodies to create “deepfakes.”
Thiel’s study marks a shift in the understanding of how AI tools generate child abuse content. Previously, it was thought that AI tools combined two concepts, such as “child” and “explicit content,” to create unsavory images. Now, the findings suggest actual images are being used to refine the AI outputs of abusive fakes, helping them appear more real.
The child abuse photos are a small fraction of the LAION-5B database, which contains billions of images, and the researchers argue they were probably inadvertently added as the database’s creators grabbed images from social media, adult-video sites and the open internet.
But the fact that the illegal images were included at all again highlights how little is known about the data sets at the heart of the most powerful AI tools. Critics have worried that the biased depictions and explicit content found in AI image databases could invisibly shape what those tools create.
Thiel added that there are several ways to address the problem. Protocols could be put in place to screen for and remove child abuse content and nonconsensual pornography from databases. Training data sets could be more transparent and include information about their contents. Image models that use data sets with child abuse content can be taught to “forget” how to create explicit imagery.
The researchers scanned for the abusive images by looking for their “hashes,” corresponding bits of code that identify them and are saved in online watch lists by the National Center for Missing and Exploited Children and the Canadian Centre for Child Protection.
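For readers curious how hash matching works in principle, the following is a minimal sketch, not the researchers’ actual tooling. It assumes exact SHA-256 digests as a stand-in; real watch-list screening typically relies on perceptual hashes that still match after an image is resized or re-encoded. The function names, file names and byte strings are hypothetical placeholders.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of raw file bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def find_matches(items: dict[str, bytes], watch_list: set[str]) -> list[str]:
    """Return the names of items whose digest appears in the watch list.

    Assumption: the watch list is a plain set of hex digests. Only hashes
    are compared, so the screener never needs the flagged files themselves.
    """
    return sorted(
        name for name, data in items.items() if sha256_hex(data) in watch_list
    )


# Hypothetical placeholder data standing in for a watch list and a data set.
watch_list = {sha256_hex(b"flagged-placeholder")}
dataset = {
    "a.jpg": b"harmless-placeholder",
    "b.jpg": b"flagged-placeholder",
}

flagged = find_matches(dataset, watch_list)
print(flagged)
```

Because only digests are compared, a screener in this sketch can flag an entry for removal without ever viewing or storing the underlying image. An exact hash misses any file altered by even one byte, which is why the sketch should not be read as a substitute for perceptual hashing.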
The photos are in the process of being removed from the training database, Thiel said.