Getting Bots to Respect Boundaries

Publisher content is being scraped at scale for AI training and services without clear consent, attribution, or compensation; the scraping also degrades site performance and diverts engineering resources away from human users.


By Audrey Hingle

I’ve recently finished my Internet Society Fellowship project, which looks at how AI crawlers are putting increasing strain on web infrastructure and what that means for the future of the open web. Below is a summary of that project and a link to download the PDF.

Getting Bots to Respect Boundaries focuses on the web’s existing consent and preference tools, built for search engines, researchers, and archives, and on the limits of those tools in the context of large-scale AI training. Robots.txt, for example, was created as a voluntary protocol for signaling basic access preferences in a cooperative environment. It was never designed to govern high-volume, extractive crawling for commercial AI systems.
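To make the protocol’s cooperative nature concrete, here is a small Python sketch using the standard library’s robotparser module. The policy below disallows two real, published AI crawler tokens, OpenAI’s GPTBot and Common Crawl’s CCBot; the sample URL and the third agent name are illustrative placeholders. The key point is that the check only happens if a crawler volunteers to run it.

```python
from urllib import robotparser

# A policy a publisher might serve at /robots.txt. GPTBot (OpenAI) and
# CCBot (Common Crawl) are real, published crawler tokens.
policy = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(policy.splitlines())

# The check is entirely voluntary: nothing stops a crawler from skipping it.
for agent in ("GPTBot", "CCBot", "SomeSearchBot"):
    print(agent, rp.can_fetch(agent, "https://example.org/article"))
# GPTBot False, CCBot False, SomeSearchBot True
```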

The result is sustained operational pressure on publishers. The Wikimedia Foundation has reported sharp increases in bandwidth consumption driven largely by automated scraping of Wikimedia Commons, with bots accounting for a disproportionate share of its most expensive traffic. Project Gutenberg is seeing aggressive, poorly engineered crawlers repeatedly downloading the same material, straining systems designed to provide public access to cultural works (Gutenberg is, in fact, happy for AI to train on its collection, and even provides bulk downloads to avoid live scraping). Similar patterns have appeared in libraries, other public-interest institutions, and commercial platforms. Poorly engineered bots overwhelm websites and backend systems by repeatedly pulling rarely accessed pages and overloading search and catalog systems, diverting staff time away from maintaining and improving services.
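For contrast, here is a minimal sketch of what better-engineered crawling looks like: an HTTP conditional request, the standard mechanism for avoiding exactly the repeated downloads described above. The URL is a placeholder, and a real crawler would persist the ETag between runs rather than holding it in a variable.

```python
import urllib.error
import urllib.request

url = "https://example.org/ebooks/1342.txt"  # placeholder, not a real endpoint
etag = None  # a real crawler would persist this between runs

req = urllib.request.Request(url)
if etag:
    req.add_header("If-None-Match", etag)  # "only send the body if it changed"

try:
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
        etag = resp.headers.get("ETag")  # remember for the next visit
        print(f"fetched {len(body)} bytes")
except urllib.error.HTTPError as err:
    if err.code == 304:  # Not Modified: the server sends no body at all
        print("unchanged since last fetch; no bandwidth spent")
    else:
        raise
```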

In response, publishers are deploying defensive measures: IP blocking, rate limiting, CAPTCHAs, web application firewalls, behavioral detection, and proof-of-work systems. These tools can keep sites online, but they are blunt, costly to maintain, and increasingly prone to blocking legitimate users, researchers, archives, and accessibility services. As the cost of mitigation rises, in staff time, tooling, and operational complexity, smaller organizations may struggle to keep their services online at all.
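As one concrete illustration of both the mechanism and its bluntness, here is a toy token-bucket rate limiter keyed per client IP; the capacity and refill numbers are invented for the sketch. Because it keys on IP alone, a researcher, an accessibility service, and a scraper behind the same shared address all look identical to it, which is exactly the collateral damage described above.

```python
import time
from collections import defaultdict

CAPACITY = 10         # maximum burst size (invented for illustration)
REFILL_PER_SEC = 1.0  # sustained requests per second allowed

# One bucket per client IP; real deployments also evict idle entries.
buckets = defaultdict(lambda: {"tokens": CAPACITY, "ts": time.monotonic()})

def allow(ip: str) -> bool:
    b = buckets[ip]
    now = time.monotonic()
    # Refill tokens for the time elapsed since the last request, up to capacity.
    b["tokens"] = min(CAPACITY, b["tokens"] + (now - b["ts"]) * REFILL_PER_SEC)
    b["ts"] = now
    if b["tokens"] >= 1:
        b["tokens"] -= 1
        return True
    return False  # the server would answer HTTP 429 Too Many Requests

# A burst of 12 rapid requests from one address: the first 10 pass.
print([allow("203.0.113.7") for _ in range(12)])
```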

The underlying problem is a misalignment of incentives. The actors extracting the most value from large-scale crawling are largely insulated from the costs it creates; those costs are absorbed by publishers, who must respond by restricting access in order to survive. Over time, this risks accelerating enclosure: more gated content, more limited access, and a web that is harder to participate in.

The project catalogs the technical strategies currently being used to manage AI-driven crawling, the trade-offs they introduce, and why many are unlikely to continue working as crawlers become ever more evasive. It also examines emerging efforts to realign incentives by shifting costs back onto large-scale crawlers, including adversarial approaches such as Venom, which increase the risk and cost of indiscriminate scraping, and community-driven systems such as CrowdSec, which pool detection and enforcement across many sites. Alongside these, the project looks at standards work aimed at giving publishers clearer, machine-readable ways to express AI-related preferences, and why I think those preferences will work best when complying is less burdensome than ignoring them.

Download it here:

Thanks to Mallory Knodel, Jeff Wilbur, Dr. Joseph Lorenzo Hall, Jamie McClelland, Eric Hellman, Nick Sullivan and Jona Azizaj for their feedback, input, interviews, and for connecting me with relevant sources.


Tuesday Next Week❗ Encryption and Feminism: Reimagining Child Safety Without Surveillance

⏰ We're going live next week.
🗓️ Feb 10, online

We are re-running our popular MozFest session: "Encryption and Feminism: Reimagining Child Safety Without Surveillance" online so more people can experience the conversation in full. We will revisit the discussion, share insights from the panel, and walk through emerging Feminist Encryption Principles, including the ideas and questions raised by participants.

👉 Register now: the session is next week, and we’d love you there.

Speakers will include Chayn’s Hera Hussain, Superbloom’s Georgia Bullen, Courage Everywhere’s Lucy Purdon, UNICEF’s Gerda Binder, and IX’s Mallory Knodel, Ramma Shahid Cheema, and Audrey Hingle.

Help us grow this conversation. Share it with friends and colleagues who imagine a future where children are protected without surveillance and where privacy is not a privilege, but a right.


Support the Internet Exchange

If you find our emails useful, consider becoming a paid subscriber! You'll get access to our members-only Signal community where we share ideas, discuss upcoming topics, and exchange links. Paid subscribers can also leave comments on posts and enjoy a warm, fuzzy feeling.

Not ready for a long-term commitment? You can always leave us a tip.

Become A Paid Subscriber

What did we miss? Please send us a reply or write to editor@exchangepoint.tech.

💡
Want to see some of our week's links in advance? Follow us on Mastodon, Bluesky, or LinkedIn, and don't forget to forward and share!