Artificial intelligence researchers said Friday they have deleted more than 2,000 web links to suspected child sexual abuse imagery from a dataset used to train popular AI image-generator tools.
The LAION research dataset is a huge index of online images and captions that’s been a source for leading AI image-makers such as Stable Diffusion and Midjourney.
But a report last year by the Stanford Internet Observatory found it contained links to sexually explicit images of children, contributing to the ease with which some AI tools have been able to produce photorealistic deepfakes that depict children.
That December report led LAION, which stands for the nonprofit Large-scale Artificial Intelligence Open Network, to immediately remove its dataset. Eight months later, LAION said in a blog post that it worked with the Stanford University watchdog group and anti-abuse organizations in Canada and the United Kingdom to fix the problem and release a cleaned-up dataset for future AI research.
Stanford researcher David Thiel, author of the December report, commended LAION for significant improvements but said the next step is to withdraw from distribution the “tainted models” that are still able to produce child abuse imagery.
I’m glad they removed them, but it’s kind of closing the barn doors after the horses have bolted at this point.
Complete failure of everyone involved that it was in there in the first place.
These datasets have billions of images in them (the LAION database has 5 billion images!). There is no way a human can go through them to check for bad content.
Then just don’t use it? Or use a program? There are multiple ways to not do something stupid, and none of them occurred to them because it is more important to them to be at the top of the shitpile.
The dataset sizes needed for machine learning rule out any kind of human verification. It’s just not possible to manually check billions of images.
Oh, that makes it okay then.
How would you check 5 billion images?
Mu.
I wouldn’t use an amount of images I couldn’t check. I wouldn’t use images from unchecked sources. I wouldn’t make money from sexually exploited children.
And I think people that don’t see the most obvious solution to that are fucked in the head.
That won’t work. Models of this kind need billions of images or they are trash.
Great, they removed them… Did they report the images to the authorities?
If 2,000 out of 5,000,000,000 images can be found, why couldn’t they be found before the dataset was published?
That’s a question to be pondered for the ages.
/s