AI Bots: Disallow

I recently wrote a post, “Should WordPress block AI bots by default?”, with some thoughts on whether WordPress should block AI bots via the robots.txt file by default.

Since writing that, I decided that rather than just talking about it I should go ahead and submit some code to the WordPress project that does exactly that. I’ve done WordPress development for 14+ years and, whilst I’ve created my own plugins and added them to the WordPress plugin repository, I’ve never submitted anything to the core codebase before, so it was an interesting process to go through and get a bit of experience of.

I’m not going to go through the various steps in detail here, but basically it involves forking the WordPress codebase on GitHub, making the changes in a local development environment, pushing the code back to GitHub and opening a Pull Request for those changes.

As well as pushing the code change to GitHub, you also need to make a ticket in the WordPress Trac ticketing system, which is used to track issues like bugs, updates and feature requests. I created a new Trac ticket for the PR but, as it turns out, a similar idea had previously been suggested in this Trac ticket, so mine has been marked as a duplicate of the original one.

This original ticket has some good ideas in it, although no code had been written for it, so I’m glad to have submitted a PR along with it. I also think my argument is a bit more forceful in my ticket compared to the original; I really do think this should be added. However, I am approaching this from the perspective of trying to create some discussion, so I don’t at all expect the code in my PR to be exactly the way this feature should work. In the original Trac ticket the suggestion is to have another checkbox in the “Reading” options in WordPress, “Discourage AI services from indexing this site”, which I think makes perfect sense.
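Just to make that idea a bit more concrete, here’s a very rough sketch of what the settings side could look like. To be clear, this is only illustrative – the option name wp_discourage_ai_services is something I’ve made up for the example and isn’t what’s in my PR or the Trac ticket:

<?php
// Illustrative sketch only: `wp_discourage_ai_services` is a made-up
// option name, not an existing WordPress setting.
add_action( 'admin_init', function () {
    // Register the option so it is saved along with the "Reading" settings.
    register_setting( 'reading', 'wp_discourage_ai_services', array(
        'type'    => 'boolean',
        'default' => true, // i.e. discourage AI services by default.
    ) );

    // Add a checkbox to Settings > Reading, alongside the existing
    // "Discourage search engines from indexing this site" option.
    add_settings_field(
        'wp_discourage_ai_services',
        __( 'AI service visibility' ),
        function () {
            printf(
                '<label><input name="wp_discourage_ai_services" type="checkbox" value="1" %s /> %s</label>',
                checked( (bool) get_option( 'wp_discourage_ai_services', true ), true, false ),
                esc_html__( 'Discourage AI services from indexing this site' )
            );
        },
        'reading',
        'default'
    );
} );

The option would then be checked when WordPress builds its robots.txt output, in much the same way the existing “discourage search engines” checkbox sits behind the blog_public option.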

I did wonder whether there should be a specific way to manage the list of AI bots though; whilst the “discourage search engines…” option is similar, there is a difference. In the ‘robots.txt’ file it only takes a couple of lines to block all search engine user agents:

User-agent: *
Disallow: /

So if you wanted to block all search engines and AI bots you could use just those couple of lines, but presuming you still want search engines to index your site1, you need to specifically list all of the AI bot user agents to be blocked. Something like this should block most known AI bots (at the time of writing in October 2024, anyway):

User-agent: AI2Bot
User-agent: Ai2Bot-Dolma
User-agent: Amazonbot
User-agent: anthropic-ai
User-agent: AlphaAI
User-agent: Applebot
User-agent: Applebot-Extended
User-agent: Bytespider
User-agent: CCBot
User-agent: ChatGPT-User
User-agent: Claude-Web
User-agent: ClaudeBot
User-agent: cohere-ai
User-agent: Diffbot
User-agent: FacebookBot
User-agent: facebookexternalhit
User-agent: FriendlyCrawler
User-agent: GPTBot
User-agent: Google-Extended
User-agent: GoogleOther
User-agent: GoogleOther-Image
User-agent: GoogleOther-Video
User-agent: iaskspider/2.0
User-agent: ICC-Crawler
User-agent: ISSCyberRiskCrawler
User-agent: ImagesiftBot
User-agent: img2dataset
User-agent: Kangaroo Bot
User-agent: Meta-ExternalAgent
User-agent: Meta-ExternalFetcher
User-agent: OAI-SearchBot
User-agent: omgili
User-agent: omgilibot
User-agent: PerplexityBot
User-agent: PetalBot
User-agent: Scrapy
User-agent: Sidetrade indexer bot
User-agent: Timpibot
User-agent: VelenPublicWebCrawler
User-agent: Webzio-Extended
User-agent: YouBot
Disallow: /
2

It’s possible users might want to allow certain bots and disallow others, so the original Trac ticket also suggests that this list could be filterable, allowing plugins etc. to modify it.
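To illustrate what “filterable” could mean in practice, here’s a rough sketch – the wp_get_ai_bot_user_agents() helper and the wp_ai_bot_user_agents filter are made up for this example and don’t exist in core or in my PR:

<?php
// Illustrative sketch only: neither `wp_get_ai_bot_user_agents()` nor
// the `wp_ai_bot_user_agents` filter exist in WordPress core.

function wp_get_ai_bot_user_agents() {
    // Default block list (abbreviated – imagine the full list from above).
    $user_agents = array(
        'AI2Bot',
        'Amazonbot',
        'anthropic-ai',
        'Bytespider',
        'CCBot',
        'ChatGPT-User',
        'ClaudeBot',
        'GPTBot',
        'Google-Extended',
        'PerplexityBot',
    );

    /**
     * Filters the list of AI bot user agents disallowed in robots.txt.
     *
     * @param string[] $user_agents AI bot user agent names.
     */
    return apply_filters( 'wp_ai_bot_user_agents', $user_agents );
}

// A plugin (or some custom code) could then allow specific bots again,
// for example removing OpenAI's crawlers from the block list:
add_filter( 'wp_ai_bot_user_agents', function ( $user_agents ) {
    return array_values( array_diff( $user_agents, array( 'GPTBot', 'ChatGPT-User' ) ) );
} );

Core would then loop over whatever that filter returns when generating the robots.txt output, so the checkbox itself stays simple and anything more fine-grained lives in plugin territory.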

I don’t think adding any kind of UI beyond the checkbox to core would be desirable, as it’s exactly the kind of extension of functionality that plugins are intended for. The basic feature of blocking AI bots will work, and if users need more they can find a plugin or write their own code to do what they need. One consideration is whether this list of default AI bots should get updated outwith the regular core WordPress development cycle, but new AI bots probably(?) don’t appear that frequently, and there are fairly regular interim point releases in the WordPress development cycle that would allow the block list to be updated.

If you’re reading this and think it’s an enhancement worth supporting then please do leave a comment on the original Trac ticket if you can, or reshare this post anywhere you think might help draw attention to it.


  1. I acknowledge there is a lot of discussion about whether blocking AI bots will one day have the same impact that blocking search engines from your site does now, in that you basically won’t show up in any search engine results. The intention of blocking AI bots by default is so that users can make an informed choice about how their content is used. ↩︎
  2. These are the droids we are looking for? ↩︎

Should WordPress block AI bots by default?

I’ve been thinking a lot about AI recently; there are definitely a lot of great uses for it, and I use ChatGPT quite regularly. Despite it being a useful tool, I’m very aware that a lot of the content used to train AI models has just been slurped up without any consent being given.

Microsoft’s AI CEO Mustafa Suleyman said at a conference back in April:

“With respect to content that is already on the open web, the social contract of that content since the 90s has been that it is fair use. Anyone can copy it, recreate with it, reproduce with it. That has been freeware, if you like. That’s been the understanding,”

So his perspective is that content which has been shared publicly on the web is available to be used for AI training by default, unless the publisher specifically says that it should not be used. I’m pretty sure copyright law disagrees with his take, but there you go.

So with this stance in mind, I have been giving it some thought and wondering whether any consideration had been given to including AI bot blocking in the standard ‘robots.txt’ file for WordPress. It might seem a little like “closing the gate after the horse has bolted”, seeing as so much content has already been consumed, but people are still publishing content, more and more every day.

An AI bot image generated by AI? 1

My perspective is that having AI bots blocked by default in WordPress would be a strong stand against the mass scraping of people’s content for use in AI training without their consent by companies like OpenAI, Perplexity, Google and Apple.

I’m aware that plugins already exist if people wish to block these bots, but that is only useful for people who are aware of the issue and choose to act on it. Consent should be requested by these companies and given, rather than the default being that companies can just presume it’s OK to scrape any website that doesn’t specifically say “no”.

Having 43%+ of websites on the internet suddenly say “no” by default seems like a strong message to send out. I realise that robots.txt blocking isn’t going to stop any of the anonymous bots that simply ignore it, but at least the legitimate companies who intend to honour it will take notice. With the news that OpenAI is switching from being a non-profit organisation to a for-profit company, I think a stronger stance is needed on the default permissions for content published using WordPress.

So whilst the default would be to block the AI bots, there would still be a way for people / publishers to allow access to their content using the same methods currently available to modify ‘robots.txt’ in WordPress: plugins, custom code etc.
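For anyone wanting to do this today, the robots_txt filter already exists in core and is the usual hook for this kind of thing. As a minimal sketch (with a deliberately short bot list), something like this in a small plugin or a theme’s functions.php would append the rules to the robots.txt that WordPress generates:

<?php
// Appends Disallow rules for a few AI crawlers to the robots.txt output
// that WordPress generates. The bot list here is just a short sample.
add_filter( 'robots_txt', function ( $output ) {
    $ai_bots = array( 'GPTBot', 'ChatGPT-User', 'CCBot', 'Google-Extended', 'anthropic-ai' );

    $output .= "\n";
    foreach ( $ai_bots as $bot ) {
        $output .= "User-agent: {$bot}\n";
    }
    $output .= "Disallow: /\n";

    return $output;
} );

Worth noting that this only affects the virtual robots.txt WordPress serves; if a physical robots.txt file exists in the site root, the web server serves that directly and the filter never runs.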

That’s my perspective / thought process anyway; I’m curious to see what others’ thoughts are.


  1. The potential irony of using partially AI-generated imagery as the main feature image in this particular post is not lost on me. The mass scraping of images and video is possibly an even bigger issue than the content scraping of websites when it comes to mass copyright violation. ↩︎

Is it Cake?

It’s interesting seeing the speed at which “AI”-based generative media has been developing over the last few years: Midjourney, Stable Diffusion etc. OpenAI recently announced “Sora”, their new text-to-video AI model, which can generate videos up to one minute long.

The example videos in their showreel on YouTube are pretty impressive; go take a look if you haven’t done so already.

There’s definitely a bit of an “uncanny valley” quality about some of them though. They made me think of the Netflix series “Is it Cake?”, where contestants have to make a cake that looks like a real object, with the aim of fooling the judges, who have to try and pick the fake cake-based item out of a lineup with three other real versions of the object.

This image shows several tool bags on plinths, one of the tool bags is actually a cake made to look like a real tool bag.

These cake versions of real objects most often look incredible and are created using amazing cake-making techniques and edible materials; they really do look like realistic edible sculptures.

In the show the judges are not allowed to go up close to view the objects, but can only view them from about 15-20 feet / 4.5-6 metres away. At that distance it is much harder to notice the subtle inconsistencies, e.g. not-so-straight edges or odd surface textures (or smell!), that might give away the illusion. But if they were allowed to go up close, the illusion would likely be much easier to spot.

This image shows three people standing next to podiums, the people are judges on the Netflix show "Is it cake?"

It feels a bit like this with a lot of generative AI content too. Viewed broadly – especially on a small device screen – it can look incredible, but if you look closely and carefully you can spot some of the same not-so-straight edges and / or unusual textures (and sometimes extra fingers / legs!) that make you think, “That’s a cake!“.

Over time though, I’m sure it is going to get increasingly difficult to tell the difference between these and real images / video as the subtle giveaways, such as soft / fuzzy edges and extra limbs, are reduced. However, even with the current flaws, it already presents a big challenge when it comes to evaluating the authenticity of the images and videos we see online.

OpenAI does make a clear statement when it comes to the “safety” of their tools and aims to prevent them from being used to create content that is hateful or contains misinformation, but the challenge will come when these types of models become more widely available from companies / organisations who don’t hold themselves to the same standards. It’s certainly going to be a bit of a wild west out there.


(Some of the AI stuff also reminded me of this old “Mr Soft” TV advert for Trebor mints!)