We have always had bots visiting our website. They were mostly kind bots, like the crawlers that keep the databases of search engines up-to-date. Those kind bots start by looking at our robots.txt files before doing anything, and respect the restrictions that are set in those files.
However, things have changed. Like other websites, Wikipedia for instance, we are increasingly being visited by AI scrapers: bots that scrape the Internet for anything they can find to train AI applications. They are usually extremely hungry for information, so they download much, much more than an ordinary user would. Moreover, many of them are impolite: they don’t respect the rules set in our robots.txt files, they hide who they really are, and they don’t pause between requests – on the contrary, they hammer our servers with requests from lots and lots of different IP addresses at the same time. The result is that parts of mageia.org, like our Bugzilla, Wiki and Forums, become unreachable.
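For readers who wonder what that politeness looks like in practice, here is a minimal Python sketch of a well-behaved crawler. It is only an illustration, with a made-up bot name and example paths: it reads robots.txt before doing anything, skips pages the site has asked it not to fetch, and pauses between requests instead of hammering the server.

    # Minimal sketch of a polite crawler; the bot name, the paths and the
    # five-second fallback delay are illustrative assumptions.
    import time
    import urllib.request
    import urllib.robotparser

    USER_AGENT = "ExampleFriendlyBot/1.0"   # hypothetical bot name
    SITE = "https://www.mageia.org"

    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(SITE + "/robots.txt")
    robots.read()                            # fetch and parse robots.txt first

    # Honour a declared crawl delay, or fall back to a polite pause.
    delay = robots.crawl_delay(USER_AGENT) or 5

    for path in ["/", "/en/downloads/"]:     # example paths only
        url = SITE + path
        if not robots.can_fetch(USER_AGENT, url):
            continue                         # the site asked us not to fetch this
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(request) as response:
            response.read()
        time.sleep(delay)                    # pause instead of hammering the server

The impolite scrapers described above skip every one of those steps.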
Below you can see the CPU load of one of our most important servers, where, amongst other things, our forums and wiki are located:

Even if our infra upgrade had already been finished, this would still be really hard to mitigate.
Blocking the IP addresses they use is useless, because they constantly switch to new ones. One of our sysadmins just told me about a big issue: “mobile proxies”, where bots route their requests through unsuspecting users’ phones. That makes the requests look much more legitimate and makes them hard to block without also blocking real users. A lot of this happens without the users even knowing their phone is being used that way. Some applications bundle a proxy with a game or other app and hide that fact in the fine print of the terms of service. Last year, it was reported that Google had removed a number of such applications from their store.
Apart from phones, there are IoT devices and ordinary computers that ended up in botnets because they were not well protected. They can be used for AI scraping, and probably are being used for it right now.
Our sysadmins do succeed, time and again, in mitigating the problem, but it is a cat-and-mouse game, so the problem is likely to recur.
If you know people working on AI applications that need to be trained, please ask them to make sure their bots read and respect the robots.txt files they encounter. And, of course, please nudge your friends and family, when you think they need it, to make sure their computers and other smart devices get all security updates as soon as they are released.
It’s not only AI training: the rise of AI operators can also be pretty wild in terms of requests. This won’t stop anytime soon, not as long as AI is on the rise (i.e. companies pouring hard cash into it).
I’m not against some browser challenges, mostly on the wiki and docs, to prevent downtime (at the times when we need them the most) and to avoid raising the Mageia hosting bill, because no better solution for blocking bots but not users seems to exist right now (even Cloudflare’s AI-powered protection seems to produce too many false alarms).
Still, keep up the good work 😉
I found this post, which addresses exactly what we are going through and has some tips on how to mitigate the problem: https://clavis.com.br/uso-de-bots-para-treinamento-de-ia-tem-causado-indisponibilidade-em-servidores/
Thanks, Michael. For when our sysadmins read your reply: a Google translation of your link should be available here:
https://clavis-com-br.translate.goog/uso-de-bots-para-treinamento-de-ia-tem-causado-indisponibilidade-em-servidores/?_x_tr_sl=pt&_x_tr_tl=en&_x_tr_hl=en-US&_x_tr_pto=wapp
Suggestion for mageia.org and Mageia Linux 9: Kernel.org, which hosts the Linux kernel, protects itself from AI-powered web scraping using Anubis, a software program that makes web scraping harder by using a proof-of-work mechanism. See https://en.wikipedia.org/wiki/Anubis_(software), https://social.kernel.org/notice/Asir9HKoutxlXC7MCO, https://news.ycombinator.com/item?id=43562157, https://anubis.techaro.lol and https://github.com/TecharoHQ/anubis. Reminder: test the website (mageia.org) and the operating system (Mageia Linux 9) to check that they are working and 100% compatible with the modern internet (IPv4 only, IPv6 only, IPv4 + IPv6, DNSSEC v4 and DNSSEC v6): https://internet.nl
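To make the proof-of-work idea concrete, here is a rough Python sketch of the hashcash-style scheme such tools build on. It is not Anubis’ actual code or protocol, and the difficulty value is just an illustrative assumption: the server hands the browser a random challenge, the browser burns a little CPU to find a matching nonce, and the server can verify the answer with a single cheap hash.

    # Rough sketch of a hashcash-style proof of work; this is not Anubis'
    # real implementation, and DIFFICULTY is an illustrative value.
    import hashlib
    import secrets

    DIFFICULTY = 4  # required number of leading zero hex digits

    def solve(challenge: str) -> int:
        """Search for a nonce whose SHA-256 of (challenge + nonce) has enough leading zeros."""
        nonce = 0
        while True:
            digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
            if digest.startswith("0" * DIFFICULTY):
                return nonce
            nonce += 1

    def verify(challenge: str, nonce: int) -> bool:
        """The server checks the answer with a single cheap hash."""
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        return digest.startswith("0" * DIFFICULTY)

    challenge = secrets.token_hex(16)   # server issues a random challenge
    nonce = solve(challenge)            # client spends CPU time finding an answer
    assert verify(challenge, nonce)     # verification costs the server almost nothing

A single visitor barely notices solving one such challenge, but a scraper issuing millions of requests has to pay that cost millions of times.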
If something is wrong, please correct me, because I am always living and learning.
————
To all: Thank you for everything! Sorry for anything.
Hi,
Thanks for the suggestion.
Since you want to learn: the “Reminder” part at first seemed odd to me, as it is not related to protection against web scrapers. Then I assumed that you wanted to say that you had tested Anubis and that it works on Mageia 9 and with the “modern internet”.
Ah, is that why so many sites are now using Anubis? I find it super annoying to have to wait for a few seconds for the anime character to let me know that I can proceed.