DALL-E 2, Stable Diffusion, Midjourney: How do AI art generators work, and should artists fear them?
Throughout human history, technological progress has made some workers obsolete while empowering others. Workers in industries such as transport and manufacturing have already been strongly impacted by advancements in automation and artificial intelligence.
Today, it’s the creative sector that’s on the line. Visual artists, designers, illustrators and many other creatives have watched the arrival of AI text-to-image generators with a mix of awe and apprehension.
This new technology has sparked debate around the role of AI in visual art and issues such as style appropriation. Its speed and efficiency have triggered fears of redundancy among some artists, while others have embraced it as an exciting new tool.
What is an AI text-to-image generator?
An AI text-to-image generator is software that creates an image from a user’s text input, known as a prompt. These AI tools are trained on huge datasets of paired text and images.
DALL-E 2 and Midjourney have not yet made their datasets public. However, the popular open-source tool Stable Diffusion has been more transparent about what it trains its AI on.
“We did not go through the Internet and find the images ourselves. That is something that others have already done,” said Professor Björn Ommer, who heads the Computer Vision and Learning Group at Ludwig Maximilian University of Munich.
Ommer worked on the research underpinning Stable Diffusion.
“There are now big data sets which have been scraped from the Internet, publicly available. And these we used, mainly the LAION datasets, which are out there, consisting of billions of images that we can train upon,” he told Euronews Next.
LAION is a non-profit organisation that collects image-text pairs from the Internet. It then organises them into datasets based on factors such as language, resolution, the likelihood of a watermark, and a predicted aesthetic score. One example is the Aesthetic Visual Analysis (AVA) dataset, which contains photographs rated from 1 to 10.
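This curation step can be pictured as filtering image-text pairs by their metadata. The sketch below is purely illustrative; the field names and thresholds are hypothetical and do not reflect LAION's actual schema:

```python
# Illustrative sketch of metadata-based dataset curation, loosely modelled on
# how an organisation like LAION might organise image-text pairs.
# All field names and thresholds here are hypothetical.

def filter_pairs(pairs, language="en", min_resolution=512,
                 max_watermark_prob=0.5, min_aesthetic_score=5.0):
    """Keep only image-text pairs whose metadata passes every criterion."""
    return [
        p for p in pairs
        if p["language"] == language
        and min(p["width"], p["height"]) >= min_resolution
        and p["watermark_prob"] <= max_watermark_prob
        and p["aesthetic_score"] >= min_aesthetic_score
    ]

pairs = [
    {"caption": "a red apple", "language": "en", "width": 1024, "height": 768,
     "watermark_prob": 0.1, "aesthetic_score": 6.2},
    {"caption": "watermarked stock photo", "language": "en", "width": 640,
     "height": 480, "watermark_prob": 0.9, "aesthetic_score": 4.0},
]
kept = filter_pairs(pairs)
print(len(kept))
```

The same pattern, applied at the scale of billions of records, is how large scraped collections get sorted into the specialised training sets the article describes.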
LAION gets these image-text pairs from another non-profit organisation called Common Crawl, which provides open access to its repository of web crawl data in order to democratise access to web information. Common Crawl scrapes billions of web pages each month and releases them as openly available datasets.
Training the AI
Once these datasets of image-text pairs are gathered and organised, the AI model is trained on them. The training process teaches the AI to make connections between the visual structure, composition and any discernible visual data within the image and how it relates to its accompanying text.
“So when this training then finally completes after lots and lots of time spent on training these models, you have a powerful model that makes the transition between text and images,” said Ommer.
The next step in the development of a text-to-image generator is called diffusion.
In this process, Gaussian or “random” visual noise is incrementally added to an image, and the AI is trained on each iteration of the gradually noisier image.
The process is then reversed and the AI is taught to construct, starting from random pixels, an image that is visually similar to the original training image.
“The end product of a thousand times adding a tiny bit of noise will look like you pulled the antenna cable from your TV set and (there’s) just static, just noise there – no signal left anymore,” Ommer explained.
The AI model is trained on billions of images in this way, going from an image to noise and then reversing the process each time.
After this stage of the training process, the AI can then begin to create, from noise, images that had never existed before.
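The forward “noising” half of this process can be sketched in a few lines. This is a toy example, not Stable Diffusion's actual implementation: it operates on a flat array of pixel values rather than image latents, and the noise-schedule values are illustrative:

```python
import numpy as np

# Toy sketch of the forward diffusion process: Gaussian noise is mixed into
# an "image" step by step until only static remains (the TV-antenna analogy).
# Schedule values are illustrative, not those of any production model.

def forward_diffusion(x0, num_steps=1000, beta_start=1e-4, beta_end=0.02, seed=0):
    """Return the progressively noisier version of x0 at every step."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(beta_start, beta_end, num_steps)  # noise added per step
    alpha_bars = np.cumprod(1.0 - betas)                  # signal remaining at step t
    samples = []
    for t in range(num_steps):
        eps = rng.standard_normal(x0.shape)
        # Closed form for t steps of noising:
        #   x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise
        x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
        samples.append(x_t)
    # By the final step almost no signal is left: pure "static"
    print(f"signal remaining at last step: {np.sqrt(alpha_bars[-1]):.4f}")
    return samples

image = np.ones(64)  # stand-in for a real image's pixel values
steps = forward_diffusion(image)
```

During training, the model learns to run this process in reverse: given a noisy sample, it predicts the noise that was added, which is what lets it later build a brand-new image starting from pure static.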
In practice, this means that a user can now access a text-to-image generator, enter a text command into a simple text box, and the AI will generate an entirely new image based on the text input.
Each text-to-image AI responds to keywords that its users have discovered through trial and error. Keywords such as “digital art”, “4k” or “cinematic” can have a dramatic effect on the outcome, and users have shared tips and tricks online for generating art in a specific style. A typical prompt might read: “a digital illustration of an apple wearing a cowboy hat, 4k, detailed, trending on ArtStation”.
Appropriation of art style
The ethics of AI text-to-image generators have been the subject of much debate. A key issue of concern has been the fact that these AIs can be trained on the work of real, living, working artists. This potentially allows anybody using these tools to create new work in these artists’ signature style.
“I think we’re going to have to figure out either a way for artists to get compensated if their names or images come up in the datasets, or for them to just completely opt-out if they don’t want to have anything to do with it,” video collage artist Erik Winkowski told Euronews Next.
On the issue of stylistic appropriation for financial gain, he added that “if a brand campaign is obviously appropriated from a person’s artwork, whether it was made with AI or otherwise, it’s just not a good thing. And I hope that there’ll be a public standing up against that”.
In November, the online art community DeviantArt announced that it would add its own AI text-to-image generation tool, DreamUp, to its website.
All of DeviantArt users’ artwork on the website would then be automatically available to train the AI.
However, within 24 hours of the announcement, facing strong pushback from its community, DeviantArt changed its policy: users would instead have to actively opt in to having their work used to train the AI.
Shutterstock, a stock image marketplace, now plans to integrate DALL-E’s text-to-image generator and compensate the creators whose work was used to train the AI.
Unfair competition or powerful new tool?
At the 2022 Colorado State Fair, Jason Allen’s AI-generated artwork ‘Théâtre D’opéra Spatial’, created using Midjourney, won in the category of “emerging digital artists”.
The award sparked much controversy and debate around the future of art. Amid the publicity, Allen launched a new company, AI Infinitum, which offers “luxury AI prints”.
Some artists are also concerned by the speed with which an AI text-to-image generator can produce artwork. A tool like Stable Diffusion can, in a matter of seconds, create multiple images that would take an artist hours or days to produce, prompting fears among some creatives that their skills may be made obsolete by this technology.
“I’ve seen the goal of my research as never wanting to replace human beings, human intelligence or the like,” Ommer told Euronews Next.
“I see Stable Diffusion much like a lot of other tools that we’re seeing there, as just an enabling technology which enables the artist, the human being, the user utilising these tools to then do more or do the things that they were already doing better, but not replacing them from the best”.
The next stage of AI art
AI text-to-image generators are continually being improved and some researchers and tech companies are developing the next stage of generative visual art.
Meta has released examples of its text-to-video AI currently in development, which can produce a video from a user’s text input.
Meanwhile, Google has unveiled DreamFusion, a text-to-3D AI that builds upon the technology of text-to-image generators to generate 3D models without the need for datasets containing 3D assets.
Some visual artists such as Winkowski have already started incorporating generative AI tools into their workflow and pushing the technology to create animated art.
In his recent short film titled ‘Leaving home’, Winkowski drew certain frames and allowed Stable Diffusion to generate the frames in between.
“It’s almost like having a superpower as an artist, really,” he said.
“That’s really exciting. And I think we’re maybe going to be able to take on more ambitious projects than we ever thought possible”.