As the world at large continues to talk about “AI”, creatives in every field are scrambling to keep their work out of the training datasets companies are using to build products that compete with them. In this article we’ll talk about what these technologies are, how to protect your own creative work, and how you can push back against companies stealing from artists.
Today's topics
- How does “AI” work? Why are so many people against it?
- How does the emerging field of AI affect you as an artist and creative?
- What can you do to protect your work from being scraped online?
What is AI?
As it currently stands, “AI” or “artificial intelligence” is used as a marketing term for a nebulously large collection of technologies, spanning from web searches to missile-targeting technology used in war. “Machine learning”, which does still exist as a specific subfield of artificial intelligence research, was the term most commonly used prior to 2023 and still sees some usage.
The Organization for Economic Co-operation and Development (OECD), a global organization of 38 member countries that is often used to assess and set economic standards for over a billion people across much of North America and Europe, defines “AI” technology as:
A machine-based system that, for explicit or implicit objectives, infers, from the input it receives, how to generate outputs such as predictions, content, recommendations, or decisions that can influence physical or virtual environments. Different AI systems vary in their levels of autonomy and adaptiveness after deployment.
Or in layman’s terms: it takes the inputs you give it and generates outputs based on a previously created dataset and whatever new inputs are entered in real time.
So, what's AGI then?
You may have heard folks talking a bit about “AGI” or “Artificial General Intelligence”. This nebulous term is used for a hypothetical artificial intelligence that could replicate and surpass all the cognitive functions of the human mind.
Many Silicon Valley investors, policy wonks, and governments are very excited about this possibility, despite the fact that it may never exist. One can only hope it doesn’t go the way of the last several tech futures we were promised, like cryptocurrency, the metaverse, and Google Glass.
Unfortunately for everyone else, the OECD’s definition means a very large number of consumer technologies could qualify as “AI”, and with nearly every program being rebranded as “AI-powered”, it’s safe to say this muddies the waters quite a bit.
In most cases, the websites, apps, and programs shoving “AI” branding down users’ throats are either rebranding their existing technology as “AI” to please investors and stakeholders, or applying one or two distinct types of technology, none of which align with the average consumer’s idea of what artificial intelligence is or might be.
Idea Roundup
- AI is a catch-all marketing term for all sorts of technology that would previously have been called “machine learning”.
- There are multiple definitions of AI, plenty of which involve no real intelligence.
- AGI is a theoretical technology that people in the tech space insist is coming, though many are skeptical, as other promised “future” technologies have failed to materialize.
- Tech companies are calling many pre-existing products and technologies “AI” in hopes of luring investors.
Generative AI and you.
Generative AI technologies are the subset of “AI” technologies receiving the most mainstream use and media attention. Thanks to technological advances in data processing, specifically around GPUs, it has become much more viable to invest in and research these technologies, especially after the implosion of the crypto bubble left many cryptocurrency mining centers looking for new avenues of profitability.
The majority of models around today are LLMs (large language models), text-to-image/video models, or vocal synthesis models. With the slight exception of vocal synthesis models, which have a longer consumer history and development cycle, all of these function very similarly: they are trained on datasets of hundreds of thousands, if not millions, of images and texts, which are scanned for similarities and organized into large networks of associations that can be searched by corresponding text.
Meaning that if you request a red apple, the model matches the keywords ‘red’ and ‘apple’ against those learned associations and serves you a numerically generated amalgamation of the apples in its training data.
These systems require millions of pieces of (sometimes very personal) data to power their ability to recognize a wide range of prompts, especially the more generalized consumer AI products like ChatGPT and Midjourney.
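To make that “amalgamation” idea concrete, here’s a toy sketch in Python. It is nowhere near how production models actually work (they use neural networks with billions of parameters, not a lookup table), but the core principle is the same: count what tends to follow what in the training data, then generate by sampling from those counts.

```python
# Toy illustration of the statistical idea behind generative models:
# count which word follows which in a training corpus, then generate
# new text by sampling from those counts. Real models are vastly more
# complex, but the principle is the same: output whatever is
# statistically likely given the training data.
import random
from collections import defaultdict

corpus = "the red apple fell from the tree and the red apple rolled away".split()

# Build a bigram table: word -> list of words observed after it.
following = defaultdict(list)
for current, nxt in zip(corpus, corpus[1:]):
    following[current].append(nxt)

# Generate by repeatedly sampling a statistically likely next word.
word = "the"
output = [word]
for _ in range(8):
    candidates = following.get(word)
    if not candidates:
        break
    word = random.choice(candidates)
    output.append(word)

print(" ".join(output))  # e.g. "the red apple rolled away"
```

Notice that the toy model can only ever recombine what it has already seen, which is exactly why these companies are so hungry for your data.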

Exploitation and AI.
Contrary to popular belief, the work of labeling these pieces of data is not automatic. AI training does in fact require human work, work that is often outsourced to workers in the Global South who are paid under 4 USD to view extremely graphic material, which often leads to lifelong mental health issues.
Some particularly enterprising AI startup founders have cut out the middleman entirely, secretly employing Indian workers while lying about their AI’s technological capabilities. This isn’t particularly new: Amazon has run its Amazon Mechanical Turk program, where humans are paid pennies on the dollar to perform tiny tasks, for years. The name was inspired by a similar scandal from 1770, when a man claimed to have built a magical machine that could play chess, later revealed to have a human chess player hidden in the fraudulent device’s bottom chamber.
AI is polluting the internet as well as itself.
Besides being an elaborate exercise in offshoring white-collar work to underpaid workers in the Global South with a tech patina, another major issue with the technology is that it grows increasingly inefficient as the datasets get larger.
While there are some models, such as the open-source DeepSeek LLM from the People’s Republic of China or the disastrous Apple Intelligence, that can be run entirely on your PC or phone, popular consumer models are constantly slurping up the internet at large and incorporating any data fed into them, including personal details.
This poses an existential risk not just to the concept of personal privacy in the digital age, but to the technologies themselves, as the internet fills up with poorly generated videos, texts, and images. Roughly 51% of online content is artificially generated, according to research from January 2024, making it far more difficult to find human-made content. That scarcity is in turn degrading the generative models and causing them to produce nonsensical outputs. Quite poetic, really.
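You can watch this self-poisoning feedback loop in miniature. In the sketch below, a simple Gaussian distribution stands in for a full generative model: each “generation” is fitted only to the previous generation’s synthetic outputs, and the diversity of the data steadily evaporates. Real models degrade in messier ways, but the loop is the same.

```python
# Toy demonstration of "model collapse": each generation is trained only
# on the previous generation's outputs, never the original human-made
# data, and the variety in the data collapses over time.
import random
import statistics

samples = [random.gauss(0.0, 1.0) for _ in range(10)]  # original human-made data

for generation in range(1, 101):
    mu = statistics.mean(samples)
    sigma = statistics.stdev(samples)
    # The next generation sees only synthetic output.
    samples = [random.gauss(mu, sigma) for _ in range(10)]
    if generation % 20 == 0:
        print(f"generation {generation:3d}: diversity (std dev) = {sigma:.4f}")
```

Run it and the printed spread shrinks toward zero: the “model” ends up endlessly regurgitating a narrower and narrower slice of what it once knew.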
These nonsensical outputs cannot be patched out in the normal sense. “Hallucinations,” as they are colloquially known, are likely to increase as the models are given more reasoning steps. The technology cannot recognize context; it can only create statistically likely outputs. It is, essentially, guessing.
In a fictional setting this might read as tragedy, but in the real world, the suffering of the many communities affected by the data centers powering these companies outweighs any potential positives that widespread, generalized usage might bring. So, let’s help you speed this degradation along!
Idea Roundup
- Generative AI technologies are built off of dubiously obtained personal and often copyrighted data.
- Generative AI technologies use unethical labor practices to sift through sensitive and personal data.
- Generative AI cannot recognize context, it can only create statistically likely outputs.
- Generative AI companies are polluting the internet by churning out low-quality content at an exponential rate.
- Companies investing in Generative AI are often doing so at the expense of marginalized communities and creatives.
Can you protect your data from AI?
Generative AI works by gathering files up into datasets, but there are things you can do to make your individual files unreadable or unusable for training.
There is no perfect solution; research shows some of these protections can be reversed. But these methods can stop less skilled scrapers and raise the overall cost of training, making this sort of data unappealing to (most) generative AI companies, which have yet to turn a profit, and driving potential investors away as data sources dry up.
Many of the solutions at the individual file level are oriented towards visual artists; if you are looking for solutions for other mediums, you may find the website- and platform-oriented sections more helpful.

Can you protect your personal files?
Glaze
The most popular solution currently, Glaze was created by researchers at the University of Chicago. It protects against scraper bots attempting style mimicry (aka, when someone tries to build models that directly mimic an artist’s style). While the program can be downloaded and run directly on your computer, it requires a hefty setup that puts it out of reach for many artists. Luckily, the team offers an invitation-only web service, WebGlaze, that runs directly in the browser and emails the result to you.
Nightshade
If you’re feeling particularly vindictive and you’ve got the computer rig to back it up, Nightshade is a second tool from the same researchers who created Glaze. It creates false positives in the image (for example, presenting a leaf as a telephone) that degrade the training sets parsing the data. Unfortunately, unlike Glaze, Nightshade has no web-based version and must be run directly on your device.
Myst
Myst is an open-source watermarking program that works similarly to Glaze, making it harder for models to train on images by introducing distortions at the individual pixel level. Its primary advantage is that it is less energy- and processing-intensive and can run on less demanding hardware than Glaze. For those who cannot run Glaze or get an invitation to WebGlaze, but still have a compatible device, it offers a solution, or a backup if Glaze is ever seriously compromised.
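For the curious, here’s a minimal sketch of what a pixel-level perturbation looks like in code. To be clear, this is not the actual Glaze/Nightshade/Myst algorithm; those tools compute adversarial perturbations optimized against the feature extractors models rely on, which is why they work. This only illustrates the kind of change being made: tiny nudges to every pixel, invisible to a human eye. The filenames are placeholders.

```python
# Minimal illustration of a pixel-level perturbation: nudge every pixel
# by a tiny random amount a human won't notice. NOT the real
# Glaze/Nightshade/Myst method (those use adversarial optimization),
# just a sketch of the kind of change such tools make to a file.
import numpy as np
from PIL import Image

# "artwork.png" is a placeholder; point this at your own file.
img = np.asarray(Image.open("artwork.png").convert("RGB"), dtype=np.int16)

rng = np.random.default_rng()
noise = rng.integers(-3, 4, size=img.shape)  # +/-3 out of 255: imperceptible

perturbed = np.clip(img + noise, 0, 255).astype(np.uint8)
Image.fromarray(perturbed).save("artwork_perturbed.png")
```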
Art Shield
A web-based, app-free solution that lets users watermark their images with a data watermark that shields them from data engines. It also includes a tool to search whether your past art has been included in any datasets. It’s currently free to use and requires neither sign-ups nor invites. However, it doesn’t appear to have received any updates since 2023.
Watermarking
If all other solutions fail or are unavailable, general watermarking still provides very basic protection, and garbled ‘watermarks’ tend to show up in generated images, making them easily recognizable as not human-created.
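If you’d rather automate the chore, a few lines of Python with the Pillow library will tile a semi-transparent mark across an image so no single crop removes it. The filenames and mark text below are placeholders; swap in your own.

```python
# Tile a basic semi-transparent text watermark across an image with
# Pillow. Repeating the mark on a grid means no single crop removes it.
from PIL import Image, ImageDraw, ImageFont

base = Image.open("artwork.png").convert("RGBA")   # placeholder filename
overlay = Image.new("RGBA", base.size, (0, 0, 0, 0))
draw = ImageDraw.Draw(overlay)
font = ImageFont.load_default()

# Repeat the mark on a grid across the whole canvas.
for x in range(0, base.width, 200):
    for y in range(0, base.height, 120):
        draw.text((x, y), "(c) your-name-here", font=font, fill=(255, 255, 255, 90))

Image.alpha_composite(base, overlay).convert("RGB").save("artwork_watermarked.jpg")
```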

Can you protect your website?
Many artists are moving away from social media platforms, especially ones like DeviantArt and ArtStation that have seemingly decided to betray their entire user base for an investor pipe dream, and are instead opting to host their work on their own websites. While there’s nothing wrong with this in theory, many users aren’t aware of the stances of the companies they may choose to host with.
Nekoweb
Nekoweb, an up-and-coming threat to Neocities’ crown as the darling of indie-web hosting, has explicitly stated that it disallows trackers and scrapers on its homepage, and offers very similar features to Neocities (and a few more) on its premium tier, at $3 to Neocities’ $5.
Wordpress
WordPress, and its associated social media company Tumblr, previously sold their users’ data to genAI companies, according to internal documents published by 404 Media. They have since tried to backtrack on this privacy nightmare by offering users on both Tumblr and WordPress a toggle that tells data crawlers not to slurp up their data.
Squarespace
Squarespace has been introducing genAI tools into its web-building suite seemingly without a mention of how exactly it treats its users’ data, though it has at least had the decency to introduce a WordPress-style toggle that puts a notice on your website telling the large data-scraping companies not to slurp up your work.
Neocities
Neocities, the freemium recreation of the much-beloved GeoCities, has had a “mixed” approach to genAI: its owner has taken a “neutral” stance on the topic on the company blog, much to the chagrin of its decidedly anti-big-tech, Web 1.0 nostalgia-driven user base. Since that blog post, Neocities has doubled down on its neutral stance, but has set all new Neocities websites to carry anti-data-scraping text by default, according to its Bluesky thread.
What is robots.txt?
Most of these options, with the exception of Nekoweb (which is more opaque about how it blocks generative AI data scrapers), offer protection in the form of a “robots.txt” plain text file placed on the website that tells scrapers they are not allowed. Unfortunately, that doesn’t mean companies are listening: Anthropic and OpenAI have both been found to ignore these files.
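For the curious, a robots.txt is just a plain text file served from the root of your site. A minimal example that turns away some of the major AI crawlers might look like the following. The user-agent names here are the ones these companies have published as of this writing, but the list changes, so check each crawler’s own documentation:

```
# robots.txt -- served from the root of your site (yoursite.com/robots.txt)

# OpenAI's training crawler
User-agent: GPTBot
Disallow: /

# Anthropic's crawler
User-agent: ClaudeBot
Disallow: /

# Common Crawl, a major source of training data
User-agent: CCBot
Disallow: /

# Opts your site out of Google's AI training
User-agent: Google-Extended
Disallow: /
```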
Spawning
Still, if you’re not satisfied with the standard-issue robots.txt file, or want to create one for your self-hosted site, AI-protection startup Spawning offers a free tool that lets users generate their own text permission files, with options for specific hosting platforms like Shopify.
Spawning has also been testing a more comprehensive open-source scraper-blocking solution that goes beyond simply asking scrapers to respect your data. It’s currently in invite-only beta, with a plug-in ready for WordPress and other hosting platforms on the way. If you’re interested, you can contact them through their website, and if their spooky YouTube announcement is anything to go by, it may prove a far scarier threat to genAI companies than previous website-level protections.
Can you protect your social media?
As pushback from creatives of all types grows louder in opposition to genAI, and as international data privacy laws change, platforms are offering users more control over how their data is used. Most now offer various settings to prevent crawlers, scrapers, and even the platforms’ own companies from slurping up user data.
Each social platform hides its AI settings in a slightly different spot, but they’re usually located somewhere in your privacy or data controls.
Hold the line.
While there are no perfect solutions for creatives to protect their work, it’s important to remember that this is a collective effort. The point is to functionally starve companies of creatives’ precious data, a multi-pronged push-back against the attempt to displace workers, by making the cost of operating this technology higher than investors are willing to bear. We’re much closer to winning than you think. We just need to hold the line as the companies fall apart.
Belle | Digital illustrator + product and user experience professional. Bringing together technology and creative fields in an equitable and non-exploitative way.
3 comments
Hey, just following up with that question J – for Nightshade and Glaze at least, the idea is specifically to do these actions en masse. Upscaling or filters could possibly break the protection on an individual image, but the number of images needed to consistently train these models is several orders of magnitude larger than single images, and data is getting far more scarce. There is already a shortage of quality training data, according to several firms – https://www.businessinsider.com/ai-training-data-shortage-slop-goldman-sachs-2025-10
https://www.sify.com/ai-analytics/the-great-data-famine-how-ai-ate-the-internet-and-whats-next
So it’s up to everyone to try their best to hide those images – think of it more akin to a mass strike.
Is there any way to actually verify whether the various programs like Glaze, Nightshade, Myst, ArtShield, etc., really work? I know the AI companies whined about Glaze and Nightshade in particular, but people just go ahead and generate their soulless offerings regardless.
I realize the only ways to bring them down are to finally take out the billionaire class, legitimate lawsuits, legislation, or maybe just the collapse of modern civilization after Mother Nature scours the Earth clean – but on an individual level, can we actually ruin their training data sets?
Thanks for writing with us, Belle! Can’t wait to do more. 🖤