AI Datasets should have AI developers worried about privacy

Last Updated on 31. August 2022 by PantherMedia

Why every AI developer should be seriously worried about privacy in AI datasets

by Michael Osterrieder of vAIsual on AI datasets

(c) TetraImages

Artificial Intelligence is highly transformative

Few people would disagree that Artificial Intelligence and AI datasets are on the cusp of transforming most industries as we know them. The photography business is one of the earliest to be impacted by AI. There are already several AI tools that speed up workflows, enhance quality and expand the output of images.

(c) kbuntu

Open source code plays a key role

Many of these tools have been developed with the help of open source code (MIT licensed) released by OpenAI, a research and development company co-founded by Elon Musk. In January 2021, OpenAI unveiled DALL-E, a neural network designed to convert text into images. It was a branch of this code that we used at vAIsual to begin developing synthetic humans for stock media licensing.

OpenAI’s newest version GLIDE brings a remarkable change

More recently, in December 2021, OpenAI released GLIDE, the successor to DALL-E. It uses a different architecture with roughly a quarter of the parameters, and it is getting favorable reviews for improved quality. In his video review of GLIDE, Edan Meyer points out that “It doesn’t allow you to make human-like objects, they did some filtering on the dataset”. To me, this represents a remarkable change.

(c) bestofgreenscreen

Privacy, copyright and ethical considerations

Although we can only speculate as to why this limitation on generating human images was introduced between DALL-E and GLIDE, the most likely reason is privacy, copyright and ethical considerations. In particular, personality rights (upheld by laws such as the GDPR in Europe) pose an intrinsic legal risk when the human datasets used to train the AI are not legally clean.

Synthetic Humans Collection

The importance of the GDPR for datasets

This is important because, although the photographed models will never see themselves directly in the output, the heavy hand of GDPR compliance means that any person whose data has been used by a company has the right to require that company to remove their data from its servers. We only need to look at the recent controversy over Facebook considering shutting down in Europe due to GDPR compliance issues to see that this is no small issue.

(c) lucadp

The risks of non-compliant datasets

When it comes to training an AI model, this means it would take just one person filing a complaint for the entire dataset to need re-editing and, potentially, for expensively created AI models to have to be retrained. This could run into tens of millions of dollars and would surely bankrupt many of the startups vying for a place in the market.
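To make the mechanics concrete, here is a minimal sketch of how a single erasure request could propagate through a training pipeline. The names and record layout (TrainingImage, erase_subject, subject_id) are illustrative assumptions for this example, not vAIsual's or anyone else's actual tooling: every image of the complaining person is dropped from the manifest, and any model trained on the old manifest is flagged for retraining.

```python
# A minimal, hypothetical sketch: the manifest, TrainingImage and erase_subject names
# are illustrative assumptions, not actual vAIsual tooling.
from dataclasses import dataclass

@dataclass
class TrainingImage:
    image_id: str
    subject_id: str  # the person depicted in the photo

def erase_subject(manifest: list[TrainingImage], subject_id: str) -> tuple[list[TrainingImage], bool]:
    """Drop every image of one subject and report whether retraining is needed."""
    kept = [img for img in manifest if img.subject_id != subject_id]
    retrain_needed = len(kept) != len(manifest)  # any removal invalidates models trained on the old set
    return kept, retrain_needed

manifest = [
    TrainingImage("img_001", "subject_42"),
    TrainingImage("img_002", "subject_07"),
]
manifest, retrain_needed = erase_subject(manifest, "subject_07")
print(retrain_needed)  # True -> every model built on the old dataset must be retrained
```

The expensive part is the last flag: once even one image is removed, every model trained on the previous dataset version is out of compliance until it is retrained.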

(c) alphaspirit

Extensive stock photography compliance activities

What many AI developers may not realize is that the IP stock industry is one of the most ardently monitored for copyright and privacy. Companies spend hundreds of millions of dollars a year to resolve IP licensing issues with the content they use for marketing and advertising. For commercial use of images to be headache-free (and therefore attractive to the marketplace), each human used to train the AI needs to have signed a biometric release that is GDPR compliant.

Big players like TikTok have already responded

(c) AndreyPopov

This fact is not lost on the C-suite of TikTok, which recently changed its privacy policy to state that it “may collect biometric identifiers and biometric information” from its US-based users’ content.

vAIsual has a clear policy

At vAIsual, we have seen this as a fundamental aspect to get right. The AI we are training uses hundreds of thousands of images of models photographed in our own studios. Each model has signed a biometric model release that authorizes us to use these photographs for training our AI.
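As an illustration of what such a consent-backed dataset can look like in code, here is a minimal sketch; the record layout and field names (biometric_release, signed, gdpr_compliant) are assumptions made for the example, not vAIsual's actual schema. The idea is simply that an image never reaches the training set unless a signed, GDPR-compliant release is attached to it.

```python
# A minimal sketch, assuming each training record carries its own release metadata.
# Field names (biometric_release, signed, gdpr_compliant) are illustrative assumptions.
def legally_clean(records: list[dict]) -> list[dict]:
    """Keep only records backed by a signed, GDPR-compliant biometric release."""
    return [
        r for r in records
        if r.get("biometric_release", {}).get("signed")
        and r.get("biometric_release", {}).get("gdpr_compliant")
    ]

studio_shoot = [
    {"image": "model_a_0001.jpg", "biometric_release": {"signed": True, "gdpr_compliant": True}},
    {"image": "scraped_0001.jpg"},  # no release on file, so it never reaches the training set
]
print([r["image"] for r in legally_clean(studio_shoot)])  # ['model_a_0001.jpg']
```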

Impact of dataset security still underestimated

While we are seeing all sorts of AI-generated images appearing in blogs, minted as NFTs and otherwise shared online, the real impact of copyright, privacy and ethics on datasets and AI image generation is only starting to be understood.

(c) AndreyPopov

vAIsual’s commitment to its customers

For now and into the future, vAIsual is committed to staying on the right side of the law and to providing legally clean datasets for professional use by the IP stock image market.

(c) Olivier-Le-Moal

Check out vAIsual’s content on PantherMedia now:

See and license vAIsual’s Synthetic Humans Collection here!

THE FOLLOWING LINKS COULD ALSO BE OF INTEREST:

AI-generated images: Brand new at PantherMedia

USA travel is back – 360° photos

Future technology in styles and designs

Retro-Futurism

Sunburst as a style element

Minimalism

Pop culture & bubblegum

Cool looks straight from nature

You can find our FAQ here.
