One researcher claims that illicit and unsuitable images are quite prevalent in training data, with some tools depicting nude children and women by default. Stability AI, developer of Stable Diffusion, an open-source generative image tool, has taken down a widely used artificial intelligence (AI) training dataset.
This comes after researchers found that the dataset's web scraper had ingested child sexual abuse material (CSAM). Stanford scientists made the discovery, which was first reported by an independent media outlet focused on technology and internet coverage.
Risks Associated With AI Development
Large language models (LLMs) and artificial intelligence image generators such as Midjourney, Stable Diffusion, and DALL-E depend on vast datasets for training before they can generate content. Many of these datasets, LAION-5B among them, incorporate images scraped from the internet, and some of those images depict harm to children and are considered illicit worldwide.
David Thiel, a Stanford researcher, wrote that many earlier models were trained on the manually labeled ImageNet corpus, which comprises 14 million images spanning all kinds of objects.
Newer models such as Stable Diffusion, by contrast, were trained on the vast number of scraped images in the LAION-5B dataset. He explained that the dataset is fed by indiscriminate crawling and contains a considerable amount of explicit content.
The Stanford report revealed that perceptual and cryptographic hash-based detection was used to identify the illicit images. This approach compares an image's hash against hashes of known child sexual abuse material; if the hashes match, the image is flagged as possible CSAM.
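As a rough illustration of how this kind of matching works (this is not the Stanford team's actual pipeline, and the hash lists, file paths, and match threshold below are hypothetical), a cryptographic hash catches exact copies of a known file, while a perceptual hash tolerates resizing and re-encoding:

```python
# Minimal sketch of hash-based image matching, assuming the Pillow and
# imagehash packages; the hash sets, paths, and threshold are illustrative.
import hashlib
import imagehash
from PIL import Image

KNOWN_SHA256 = {"e3b0c44298fc1c149afbf4c8996fb924"}        # hypothetical hash list
KNOWN_PHASH = {imagehash.hex_to_hash("8f373714acfcf4d0")}  # hypothetical perceptual hashes
PHASH_THRESHOLD = 6  # max Hamming distance treated as a match (illustrative)

def is_flagged(path: str) -> bool:
    # Cryptographic hash: catches byte-for-byte identical files.
    with open(path, "rb") as f:
        if hashlib.sha256(f.read()).hexdigest() in KNOWN_SHA256:
            return True
    # Perceptual hash: tolerates resizing, recompression, and minor edits.
    phash = imagehash.phash(Image.open(path))
    return any(phash - known <= PHASH_THRESHOLD for known in KNOWN_PHASH)
```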
Thiel said that the dataset used to train the image generators does not contain the images themselves, but it still provides access to the prohibited content.
Stanford University Conducts Research on the Impact of Generative AI Tools on Society
According to Thiel, LAION datasets do not contain the actual images. Instead, each entry holds a link to the original image on the site from which it was scraped, and in many cases those photos have since been taken down.
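To make that structure concrete, a LAION-style dataset is essentially a table of URLs and captions rather than image files. A rough sketch of checking whether the linked originals are still reachable might look like the following; the file name and the "URL"/"TEXT" column names are assumptions for illustration, not the actual LAION release layout:

```python
# Sketch of probing a LAION-style metadata file; the file name and the
# "URL"/"TEXT" column names are assumed, not taken from the real release.
import pandas as pd
import requests

df = pd.read_parquet("laion_metadata_sample.parquet")  # hypothetical sample file

def still_reachable(url: str) -> bool:
    # A HEAD request avoids downloading the image itself.
    try:
        resp = requests.head(url, timeout=5, allow_redirects=True)
        return resp.status_code == 200
    except requests.RequestException:
        return False

# Each row is a (URL, caption) pair; many of the links are dead by now.
sample = df.head(100).copy()
sample["reachable"] = sample["URL"].apply(still_reachable)
print(sample[["URL", "TEXT", "reachable"]].head())
```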
LAION and Stability did not respond to requests for comment. Thiel argued that web-scale datasets are problematic for several reasons, even with safety-filtering efforts.
Besides child sexual abuse material, the presence of non-consensual intimate imagery (NCII) or 'borderline' material in these collections is basically inevitable, to say nothing of potential privacy and copyright issues.
Stability AI and LAION have yet to issue official statements. However, LAION told an independent media outlet that it partners with scholars, universities, and nongovernmental organizations to improve its filtering.
It also said it is collaborating with the Internet Watch Foundation (IWF) to identify and remove material believed to violate the law. Thiel, for his part, pointed to the AI models' tendency to associate women with nudity and the ease of building AI-powered NCII applications.
In a follow-up thread on Bluesky, Thiel said researchers already know that most SD1.5 checkpoints are skewed to the point that one must put 'child' in the negative prompt to keep them from producing child sexual abuse material.
They also frequently associate women with nudity, which is why the 'undress' applications behind several non-consensual intimate imagery incidents are trivial to build.
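For context on the mechanism Thiel describes, a negative prompt is simply an extra text input that steers the sampler away from listed concepts. A minimal sketch using the Hugging Face diffusers library follows; the model ID, prompts, and hardware settings are illustrative assumptions, and this is a demonstration of the parameter rather than a recommended safety fix:

```python
# Minimal sketch of passing a negative prompt to an SD1.5-class model via
# the diffusers library; model ID, prompts, and device are illustrative.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed SD1.5 checkpoint identifier
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

image = pipe(
    prompt="a portrait photo of a scientist working in a lab",
    negative_prompt="nsfw, nude, child",  # concepts the sampler is steered away from
    num_inference_steps=30,
).images[0]
image.save("output.png")
```

Thiel's point is that needing such negative prompts at all reflects skew baked into the training data, not a genuine fix for it.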
Factors Contributing to the Spread of AI-Generated Abusive Imagery
This month, the analytics firm Graphika reported that NCII has increased by more than 2,408% since the start of the year. The rise is partly attributed to AI undressing applications that let users apply deepfake technology to remove clothing from an image.
Thiel concluded that the LAION datasets, along with models trained on them, should likely be deprecated, with any remaining versions cleaned and restricted to research use. A number of child safety organizations are already assisting with this effort, and more connections can be made for those who need help.
Despite his warning about CSAM in AI-model training data, Thiel stressed the importance of open-source development, saying it is preferable to models 'gatekept' by a handful of corporations.
Thiel noted that some people would use the findings to argue against open-source machine learning, which was not his intention. Like open-source ML, ML gatekept by a few megacorps and rich accelerationist creeps has its own problems; both have been deployed rapidly without appropriate safeguards.
In October, the Internet Watch Foundation, a United Kingdom-based internet watchdog, cautioned that child abuse content could 'engulf' the internet after it discovered more than 20,000 such images within a single month.
Compounding the challenge of addressing child sexual abuse material online, Dan Sexton, the IWF's chief technology officer, told a media outlet that AI image generators are becoming more advanced.
As a result, it is becoming harder to determine whether an image was generated with the technology. In an interview, Sexton said there is a growing problem of not being able to trust whether things are real, and the tools that could tell them whether an image is genuine are only partially accurate and therefore cannot be relied on.
Sexton said the Internet Watch Foundation's work to remove child sexual abuse material from the internet focuses mainly on the 'open web,' also known as the surface web.
That is because getting child abuse content removed from the dark web is a struggle; Sexton said the organization spends less time in dark web spaces than on the open web, where it believes it can have some effect.