Massive AI Dataset Back Online After Being ‘Cleaned’ of Child Sexual Abuse Material

In December 2023, researchers at the Stanford Internet Observatory discovered that LAION-5B, one of the world's largest open-source datasets used to train AI models, contained more than 3,000 suspected instances of child sexual abuse material. Anyone who had downloaded the dataset would therefore have had access to that content; moreover, any model trained on the dataset would have been trained on illegal material depicting child abuse.

LAION (Large-scale Artificial Intelligence Open Network), the non-profit organisation behind the dataset, moved quickly after the research was published, taking the dataset down from Hugging Face, the site where it was hosted.

In cooperation with the Internet Watch Foundation, the Canadian Centre for Child Protection and the Stanford Internet Observatory, the organisation recently republished the dataset, cleansed of the illicit material and split into two versions titled ‘Re-LAION-5B research’ and ‘Re-LAION-5B research-safe’.