Perspectives on Text-Guided Visual Art



Perspectives on Text-Guided Visual Art: It’s About the Jobs



Millions of artworks have been downloaded from the internet to train models capable of creating new images from a turn of phrase. If anyone can move immediately from ideation to imagery, what does this mean for art, artists, and society? I’ll start by describing why I think we should be talking more about automation and less about ownership.


Text-to-image systems like Midjourney, DALL-E 2, and Stable Diffusion allow anyone with an internet connection to use natural language descriptions to generate new art. Type any set of words and these algorithms try to create an image from it. This technology relies on Machine Learning models that have been trained on large datasets of image-description pairs, usually downloaded from public pages on the internet. These systems are relatively new: one of the first to gain some popularity was a Colab notebook titled BigSleep that I wrote over a year ago. Importantly, the fidelity of these systems has improved rapidly, making some artists – many of whom have seen their work in the training data – fear for their long-term job security.

Old New Technologies

There is a long history of artists taking inspiration from the work of others, all the way to directly copying their content or style. Importantly, there is also a long-ish (in Machine Learning years, which are themselves becoming exponentially shorter) history of researchers creating technology that can readily take artists’ work and use it transformatively to create new art. For example, Neural Style Transfer, which directly attempts to transfer the style of a single artistic work onto another image, has existed for years with relatively little controversy.

The Point

It should come as no surprise that some artists are worried they will stop receiving commissions because of this tech and pissed that their images have been used to make it. These two concerns – automation and ownership issues respectively – are often incorrectly entangled. In fact, most people confuse the two, resulting in disaster for our discussion of these issues. If we are worrying about the wrong problems and framing the discussion incorrectly, we run into solutions that do not actually help people.

Recently, everyone from journalists to AI ethicists to concept artists has called for datasets to include only images and texts that are given “with the consent of creators.” Whether this should be necessary, that is, whether anyone should have to ask before making transformative use of data that is already public for anyone to freely download on the web (these models will not directly copy if data is properly deduplicated), is a point of contention. However, dataset sourcing should not be the primary point of contention in this discussion because it does not matter.

I’ve buried the lede a bit, but at least I’ve bolded it. Dataset sourcing doesn’t really matter because it is completely possible – hypothetically and practically – to collect a dataset that is “ethically sourced” that would still have the same basic concerns about automation baked in. An artist who loses their job because someone can simply type in a phrase does not care if the text-to-image system that replaced them was trained on images someone bought. Although scraping the web to create these models is easier than buying images in bulk or directly from artists, it is entirely possible to take the latter approach. We have older image models that can be trained on just thousands of images, something I used to do myself as a hobby – so, if you don’t think that it’s possible to do this with text-to-image models and that it won’t become easy given modest resources, you are wrong.

So where does this leave us? Most people in this discussion are focussing on whether the images used for training are owned by the researchers – which I don’t think is even an ethical issue, let alone a legal issue if deduplication is done – while mostly ignoring the effects this could have on some artists and some industries like stock photography. I think that almost no artists will be put out of a job, unless they do tasks on commission that could be specified to machine learning models – a hard living already. But we should consider that while we are buying into arguably neo-liberal discussions about ownership that are naive to how art has always worked (by taking inspiration from other artists and the world), we are ignoring the consequences of “ethically sourced” models that could still affect people in a vast number of ways. We ignore the wound for the salt in it if we focus on where training data is from instead of thinking about ways to make automation less detrimental.

Self-checkout is more similar to AI art than people would like to admit. While art is much more enjoyable than cashiering (in some cases…), we should be looking for ways to build broad solutions for issues as broad as automation. It isn’t the source of the training data that matters so much as the effect it has on workers. Remember style transfer? No one minds it because although it directly lifts from an artist’s work using a single image, it’s boring enough that no one will have to worry about their job. Instead of having an endless debate over whether models should be trained on public data, we ought to be focussing on how to blunt the impact of these models that already exist on millions of hard drives and how to build a society where people are worth more than what can’t be made automatic.

Many of my thoughts on this topic are influenced by my colleague Aaron Hertzmann’s writing here:

Thanks to the gf for screening and editing :)