Getting Bots to Respect Boundaries

Publisher content is being scraped at scale for AI training and services without clear consent, attribution, or compensation; the scraping also degrades site performance and diverts engineering resources away from human users.


By Audrey Hingle

I’ve recently finished my Internet Society Fellowship project, which looks at how AI crawlers are putting increasing strain on web infrastructure and what that means for the future of the open web. Below is a summary of that project and a link to download the PDF.

Getting Bots to Respect Boundaries focuses on the web’s existing consent and preference tools, built for search engines, researchers, and archives, and on the limits of those tools in the context of large-scale AI training. Robots.txt, for example, was created as a voluntary protocol for signaling basic access preferences in a cooperative environment. It was never designed to govern high-volume, extractive crawling for commercial AI systems.
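To make the protocol’s cooperative nature concrete, here is a small Python sketch using the standard library’s robotparser module. The policy below disallows two real, published AI crawler tokens, OpenAI’s GPTBot and Common Crawl’s CCBot; the sample URL and the third agent name are illustrative placeholders. The key point is that the check only happens if a crawler volunteers to run it.

```python
from urllib import robotparser

# A policy a publisher might serve at /robots.txt. GPTBot (OpenAI) and
# CCBot (Common Crawl) are real, published crawler tokens.
policy = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(policy.splitlines())

# The check is entirely voluntary: nothing stops a crawler from skipping it.
for agent in ("GPTBot", "CCBot", "SomeSearchBot"):
    print(agent, rp.can_fetch(agent, "https://example.org/article"))
# GPTBot False, CCBot False, SomeSearchBot True
```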

The result is sustained operational pressure on publishers. The Wikimedia Foundation has reported sharp increases in bandwidth consumption driven largely by automated scraping of Wikimedia Commons, with bots accounting for a disproportionate share of its most expensive traffic. Project Gutenberg is seeing aggressive, poorly engineered crawlers repeatedly downloading the same material, straining systems designed to provide public access to cultural works (Gutenberg is, in fact, happy for AI to train on its collection, and even provides bulk downloads to avoid live scraping). Similar patterns have appeared in libraries, other public-interest institutions, and commercial platforms. Poorly engineered bots overwhelm websites and backend systems by repeatedly pulling rarely accessed pages and overloading search and catalog systems, diverting staff time away from maintaining and improving services.
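For contrast, here is a minimal sketch of what better-engineered crawling looks like: an HTTP conditional request, the standard mechanism for avoiding exactly the repeated downloads described above. The URL is a placeholder, and a real crawler would persist the ETag between runs rather than holding it in a variable.

```python
import urllib.error
import urllib.request

url = "https://example.org/ebooks/1342.txt"  # placeholder, not a real endpoint
etag = None  # a real crawler would persist this between runs

req = urllib.request.Request(url)
if etag:
    req.add_header("If-None-Match", etag)  # "only send the body if it changed"

try:
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
        etag = resp.headers.get("ETag")  # remember for the next visit
        print(f"fetched {len(body)} bytes")
except urllib.error.HTTPError as err:
    if err.code == 304:  # Not Modified: the server sends no body at all
        print("unchanged since last fetch; no bandwidth spent")
    else:
        raise
```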

In response, publishers are deploying defensive measures: IP blocking, rate limiting, CAPTCHAs, web application firewalls, behavioral detection, and proof-of-work systems. These tools can keep sites online, but they are blunt, costly to maintain, and increasingly prone to blocking legitimate users, researchers, archives, and accessibility services. As the cost of mitigation rises, in staff time, tooling, and operational complexity, smaller organizations may struggle to keep their services online at all.
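As one concrete illustration of both the mechanism and its bluntness, here is a toy token-bucket rate limiter keyed per client IP; the capacity and refill numbers are invented for the sketch. Because it keys on IP alone, a researcher, an accessibility service, and a scraper behind the same shared address all look identical to it, which is exactly the collateral damage described above.

```python
import time
from collections import defaultdict

CAPACITY = 10         # maximum burst size (invented for illustration)
REFILL_PER_SEC = 1.0  # sustained requests per second allowed

# One bucket per client IP; real deployments also evict idle entries.
buckets = defaultdict(lambda: {"tokens": CAPACITY, "ts": time.monotonic()})

def allow(ip: str) -> bool:
    b = buckets[ip]
    now = time.monotonic()
    # Refill tokens for the time elapsed since the last request, up to capacity.
    b["tokens"] = min(CAPACITY, b["tokens"] + (now - b["ts"]) * REFILL_PER_SEC)
    b["ts"] = now
    if b["tokens"] >= 1:
        b["tokens"] -= 1
        return True
    return False  # the server would answer HTTP 429 Too Many Requests

# A burst of 12 rapid requests from one address: the first 10 pass.
print([allow("203.0.113.7") for _ in range(12)])
```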

The underlying problem is a misalignment of incentives. The actors extracting the most value from large-scale crawling are largely insulated from the costs it creates; those costs are absorbed by publishers, who must respond by restricting access in order to survive. Over time, this risks accelerating enclosure: more gated content, more limited access, and a web that is harder to participate in.

The project catalogs the technical strategies currently being used to manage AI-driven crawling, the trade-offs they introduce, and why many are unlikely to continue working as crawlers become ever more evasive. It also examines emerging efforts to realign incentives by shifting costs back onto large-scale crawlers, including adversarial approaches such as Venom, which increase the risk and cost of indiscriminate scraping, and community-driven systems such as CrowdSec, which pool detection and enforcement across many sites. Alongside these, the project looks at standards work aimed at giving publishers clearer, machine-readable ways to express AI-related preferences, and why I think those preferences will work best when complying is less burdensome than ignoring them.

Download it here:

Thanks to Mallory Knodel, Jeff Wilbur, Dr. Joseph Lorenzo Hall, Jamie McClelland, Eric Hellman, Nick Sullivan and Jona Azizaj for their feedback, input, interviews, and for connecting me with relevant sources.


Tuesday Next Week❗ Encryption and Feminism: Reimagining Child Safety Without Surveillance

⏰ We're going live next week.
🗓️ Feb 10, online

We are re-running our popular MozFest session: "Encryption and Feminism: Reimagining Child Safety Without Surveillance" online so more people can experience the conversation in full. We will revisit the discussion, share insights from the panel, and walk through emerging Feminist Encryption Principles, including the ideas and questions raised by participants.

👉 Register now: the session is next week, and we’d love you there.

Speakers will include Chayn’s Hera Hussain, Superbloom’s Georgia Bullen, Courage Everywhere’s Lucy Purdon, UNICEF’s Gerda Binder, and IX’s Mallory Knodel, Ramma Shahid Cheema, and Audrey Hingle.

Help us grow this conversation. Share it with friends and colleagues who imagine a future where children are protected without surveillance and where privacy is not a privilege, but a right.


Support the Internet Exchange

If you find our emails useful, consider becoming a paid subscriber! You'll get access to our members-only Signal community where we share ideas, discuss upcoming topics, and exchange links. Paid subscribers can also leave comments on posts and enjoy a warm, fuzzy feeling.

Not ready for a long-term commitment? You can always leave us a tip.

Become A Paid Subscriber

What did we miss? Please send us a reply or write to editor@exchangepoint.tech.

💡
Want to see some of our week's links in advance? Follow us on Mastodon, Bluesky, or LinkedIn, and don't forget to forward and share!