Old Protocol, New Problems: Robots.txt


This week, in our main story: we examine how robots.txt, a decades-old web protocol, has become central to discussions about consent, copyright, and web scraping to power new generative AI tools.

But first...

We Must Back the Tools that Beat Censorship

A few weeks ago, the Trump administration moved to cut off funding for the Open Technology Fund (OTF), a move that threatened to undermine internet freedom tools used by activists, journalists, and citizens in authoritarian regimes. OTF is a small but mighty nonprofit that has quietly powered some of the most essential tools for internet freedom, including Tor, Signal, and Let's Encrypt, for years.

When the Trump administration attempted to cut OTF’s funding in its first term, Internet Exchange’s Mallory Knodel warned that the OTF is the leading global funder for digital tools that counter censorship and surveillance, and that the sudden loss of $20 million would leave a significant hole in efforts like academic research, software upkeep, and censorship monitoring. 

After OTF filed a federal lawsuit, they received a notification from the USAGM that rescinded the March 15th termination of their grants. Still, the sudden halt in funding sent a chilling message, one that may embolden regimes that thrive on controlling information. Given the pattern on display since 2020, we must remain vigilant, invest in resilience, and build coalitions to protect the digital infrastructure that upholds human rights.

Enjoying The Internet Exchange?

If you're enjoying this newsletter, consider leaving us a tip to help keep it going.

Internet Governance

Digital Rights

Technology and Society

Privacy and Security

Upcoming Events

Careers and Funding Opportunities

Opportunities to Participate

🎵 We're listening to: Protect Trans Kids (WTFIWWY) by internet activists, writers, and musicians Evan Greer & Ryan Cassata (Lyric Video) https://www.youtube.com/watch?v=J2QKqy2y27U

What did we miss? Please send us a reply or write to editor@exchangepoint.tech.

Robots.txt Is Having a Moment: Here's Why We Should Care

By Mallory Knodel and Audrey Hingle. Also published in techpolicy.press.

Once a quiet piece of internet plumbing, robots.txt is now in the spotlight. AI has turned this humble file, long used to guide web crawlers for search, research, and more, into ground zero for debates about consent, control, and digital exploitation.

Introduction: From Obscure Standard to Center Stage

Robots.txt was proposed 30 years ago by Dutch software engineer and early web developer Martijn Koster. Since then, it has acted as a simple, voluntary protocol for website and crawler interaction. While it has been described as “the text file that runs the internet,” it was never intended to be a security tool or a legal framework, just a way for site owners to express how they wanted search engines, researchers, and archiving projects to handle and use their content, based on a clear signal and good manners.
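For context, a robots.txt file is just a set of user-agent groups with Allow/Disallow rules, served at the root of a site. A minimal sketch (the paths and sitemap URL here are illustrative, not from any real site):

```
# https://example.com/robots.txt
User-agent: *          # applies to every crawler
Disallow: /private/    # please stay out of this directory
Allow: /               # everything else is fine

Sitemap: https://example.com/sitemap.xml
```

Crawlers that follow the protocol fetch this file first and honor the rules in the group that matches their user-agent token; nothing in the protocol itself stops a crawler that chooses to ignore it.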

This file, often no more than a few simple lines of human and machine-readable text, has been thrust into the spotlight because crawlers are now widely known to harvest data for AI. Crawlers that harvested, and continue to harvest, massive volumes of data for model training do so often without attribution or compensation to site owners and content creators. As generative AI systems scale, the stakes have risen: what was once a quiet tool for managing web etiquette is on the precipice of becoming the frontline defense for those trying to assert control over how their content is used in the lucrative AI-driven arms race. 

Last week’s IETF working group meeting marked a turning point in the push to modernise how we express content usage preferences online, particularly in the context of AI. But can we teach an old protocol new tricks? And is it enough on its own to express meaningful consent across all bots and crawlers used for purposes as wide-ranging as archival, research, and AI?

Why AI Scraping Is Breaking the System

Generative AI models like OpenAI's GPT-4 and Google's Gemini are trained on massive datasets scraped from the open web, including websites like Wikipedia (which is struggling to handle the extra traffic), news outlets such as The Guardian and The New York Times (which is now suing OpenAI), public domain and pirated books, code from platforms like GitHub, and public forums like Reddit. Some of this material is in the public domain or openly licensed, but much of it is copyrighted, raising ongoing legal and ethical concerns. According to OpenAI, its models were trained on "publicly available and licensed data," but studies like the one by Carlini et al. have shown that models can regurgitate written and visual copyrighted material verbatim. OpenAI's latest update to its AI image generator can produce Studio Ghibli-style images, which have subsequently flooded the internet, evidence that it was trained on the studio's copyrighted content. Heated debates about copyright, consent, and the ethics of large-scale web scraping have been fueled by loyal fandom and deep appreciation for artists and creators.

Common Crawl, a nonprofit web archive whose mission is to allow researchers, small startups, or even individuals to access high-quality crawl data previously only available to large search engine corporations, is facing mounting pressure from publishers over its role in AI training. The New York Times and several Danish media organizations have demanded the removal of their content from past crawls and have configured their robots.txt files to block future access.

Although Common Crawl plans to comply, its executive director, Rich Skrenta, warns that removing archived materials from repositories like Common Crawl threatens the open web. For years, scrapers like Common Crawl have been critical tools for openness and research, powering projects in web structure analysis, online censorship monitoring, and phishing detection. But as AI companies like OpenAI use the data to train commercial models, the project has become a flashpoint in debates over consent, copyright, and control, potentially sidelining researchers and entrenching the power of well-resourced tech giants.
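Blocks of the kind these publishers have deployed target crawlers by their published user-agent tokens. GPTBot (OpenAI) and CCBot (Common Crawl) are real, documented tokens; any given site's actual file will of course differ from this sketch:

```
# Refuse AI-training crawlers while leaving everything else untouched
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
```

Note the asymmetry this relies on: it only works for bots that announce themselves honestly and choose to obey.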

These examples don’t show that publishers hate bots. For years, search engines and crawlers have been welcomed as tools for discovery and archiving. What they reveal is a growing objection to powerful AI systems using content in ways no one consented to or could have anticipated, as well as the vast amounts of money being made on that content and not shared. In response, publishers and other website owners are deploying anti-scraping measures not just to block bots but to push back against a broader loss of control in the age of AI.

The Robots.txt & Other Options: What’s Being Proposed

Robots.txt remains important as a foundational tool due to its widespread adoption and familiarity among website owners and developers. It provides a straightforward mechanism for declaring basic crawling permissions, offering a common starting point from which more advanced and specific solutions can evolve. However, robots.txt is primarily useful for website owners and publishers who control their own domains and can easily specify crawling rules. It doesn't effectively address content shared across multiple platforms or websites, nor does it give individual content creators, such as artists, musicians, writers, and other creative professionals, a way to easily communicate their consent preferences when they publish their work on third-party sites, or when their work is used by others.
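Because the protocol is purely voluntary, compliance lives in the crawler, not the file. A minimal sketch of what a well-behaved crawler does before fetching a page, using Python's standard-library parser (the bot names and URL are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that blocks one AI bot and allows everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite crawler runs this check before every fetch.
print(parser.can_fetch("ExampleAIBot", "https://example.com/article"))  # False
print(parser.can_fetch("SearchBot", "https://example.com/article"))     # True
```

Nothing here is enforced server-side: a crawler that skips the `can_fetch` check sees no technical barrier at all, which is exactly the gap the proposals below try to address with policy rather than code.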

When content is hosted on platforms like social media sites, users often do not control how their posts are scraped or used, making it unclear how to express or withhold consent. A recent debate on Bluesky illustrated this challenge. The platform proposed giving users the ability to opt in or out of having their data used for things like generative AI, effectively letting them edit their own robots.txt preferences. Some users misunderstood the proposal as Bluesky changing its stance on AI training when, in fact, it aimed to give individuals more control over how third-party scrapers use their content. The backlash highlights how confusing and disempowering this space remains for everyday users.

To address these limitations, proposals for enhancing robots.txt range from minor adjustments to significant updates. Minor improvements include expanding user agent identification to cover more types of crawlers, especially AI bots, or using the existing Disallow/Allow directives to specify preferences for AI training. Another proposal suggests extending robots.txt to include content layer control to enable more specific rules on how content is accessed and used, particularly for AI training. More substantial changes include adding two new properties to robots.txt specifically for AI training, which would allow content owners to express more granular control over how AI systems use their content. 
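As a rough illustration of the "two new properties" idea, one draft proposal sketches training-specific directives along these lines. The names and syntax are still in flux at the IETF and may well change, so treat this as a hypothetical:

```
# Illustrative only: AI-training extensions to robots.txt under discussion
User-agent: *
Allow: /
DisallowAITraining: /articles/
AllowAITraining: /press-releases/
```

The appeal of this shape is that it reuses the matching rules site owners already know, while separating "may you crawl this" from "may you train on this."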

Another proposal is adding an ai.txt file, a complementary standard to robots.txt. Versions have been proposed by both Guardian News & Media and Spawning. It would give publishers and content owners more control over how AI systems use their content, including the ability to limit snippet length, require attribution, and allow or deny use for AI training.
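As a hypothetical sketch of the idea (the directive names and file-type rules here are illustrative; the Guardian and Spawning proposals differ in their actual syntax), an ai.txt sitting alongside robots.txt might look like:

```
# https://example.com/ai.txt (illustrative sketch, not a real spec)
User-Agent: *
Disallow: *.jpg
Disallow: *.png
Allow: *.html
```

The point of a separate file is scoping: it can express AI-specific rules, like per-media-type permissions, without overloading robots.txt directives that search engines already interpret.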

Beyond robots.txt enhancements, several complementary approaches are being proposed to better support content creators. Some focus on embedding consent signals directly into individual files, making preferences easier for creators to manage across platforms. Examples include adding machine-readable metadata directly within images, videos, and other digital files, and tools such as Spawning’s Do Not Train tool suite or the TDM·AI proposal, which provide creator-friendly solutions for content-level control. Additionally, structured HTTP headers and expanding signaling mechanisms to APIs and cloud services are suggested to ensure consistent preference communication across various digital environments.
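As one concrete instance of header-based signaling, the W3C's Text and Data Mining Reservation Protocol (the basis of the TDM·AI proposal) lets a server reserve TDM rights in its HTTP responses. A sketch, with a hypothetical policy URL:

```
HTTP/1.1 200 OK
Content-Type: text/html
tdm-reservation: 1
tdm-policy: https://example.com/tdm-policy.json
```

Here `tdm-reservation: 1` asserts that mining rights are reserved, and the optional policy URL points to machine-readable licensing terms, which ties the signal to the EU's text-and-data-mining opt-out framework.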

Currently, these proposals primarily focus on preference signaling rather than enforcement. Technical tools, standards, and protocols provide mechanisms to express content owners' wishes clearly and machine-readably, but they rely on voluntary compliance from crawlers and AI systems. Robust enforcement and accountability are anticipated to emerge from future policy frameworks and regulatory actions, with policymakers expected to play a crucial role in creating legal backing and consequences for adherence or violations.

The Problem with "Preference" in a Complex Ecosystem

As debates over AI training and content control intensify, an IETF working group has stepped in to define how websites can express their preferences. Their most recent meeting demonstrated that building a shared language is easier said than done. While there's momentum and enthusiasm behind creating technical solutions, many of the thorniest issues, like how to interpret the lack of a signal, combine conflicting signals, or manage different preferences, are far from resolved.

The working group agreed to support both opt-in and opt-out signaling for AI usage, allowing content owners to explicitly declare their preferences. However, most of the web currently falls into a grey area: no signal at all. This state is the default, and for now it is "out of scope." That means platforms and AI companies are left to interpret silence however they choose, which risks reinforcing the status quo of nonconsensual data use.

As use cases for AI diversify, so too do the layers of preference. A site might wish to allow indexing for AI search but not for AI training, or permit archival but only in certain jurisdictions. Sites might want to add time-based restrictions, like embargos, or third-party licensing systems. Signaling how content should be used, it turns out, is extremely complex. The group acknowledged this challenge and carved out a separate deliverable to define how these preferences can be composed and interpreted, ideally by August.

The current goal is to make preferences machine-readable and clearly expressed, not to enforce them. Regulation and legal frameworks, especially in the EU, are expected to eventually create consequences for ignoring these signals. For now, the focus remains on enabling clear expression, not ensuring follow-through.

Rights, Risks, and Regulation

This is about more than technical standards; it’s about governance and the technology that underpins the web. The growing backlash against AI scraping reflects a deeper concern over the erosion of norms online. As regulators, primarily in the EU, move to define legal frameworks for AI transparency and data usage, there’s a narrow window for the technical community to weigh in and help shape meaningful and enforceable norms. 

The EU AI Act and its accompanying Code of Practice have added urgency, as rightsholders and cultural organizations demand enforceable safeguards and more meaningful opt-out (and opt-in) mechanisms. 

The urgency at the IETF meeting last week was palpable, driven by the understanding that if technologists don't lead on setting clear, interoperable standards, policymakers or Big Tech will define the rules, potentially in ways that are not ideal or that favor business over public interest.

Technologists participating in the IETF discussions face a clear challenge: finding the right balance between helping website owners express their preferences, protecting against AI misuse, and preserving legitimate uses like academic research, journalism, and public-interest innovation, activities that have long depended on open web crawling.

What We Need Next

The path forward requires us to move beyond robots.txt alone, combining practical technical enhancements with collaborative, ecosystem-wide solutions. Here are the key steps we need next:

  1. Make Robots.txt AI-Ready: Robots.txt is popular because it’s straightforward, but it needs to be enhanced to handle modern AI crawlers.
  2. Empower Creators with Content-Level Signals: Creators need simple, built-in ways to tag their work with clear signals about its use, especially by AI. Embedding machine-readable metadata directly into files like images, videos, or code ensures that usage preferences stay with the content wherever it travels online.
  3. Standardize Preference Signals Across the Web: We should adopt easy-to-understand, consistent signals in addition to robots.txt, like structured HTTP headers or API-level preferences, so content owners can clearly communicate their wishes, no matter which platform or service is hosting their content.
  4. Prioritize Clear Signals Now, Expect Enforcement Soon: Right now, the priority is clarity: making sure preferences are expressed consistently and understood across the web. But we should be working with policymakers to ensure these signals eventually have real legal backing.
  5. Encourage Collaboration Among Stakeholders: Technologists, content creators, platforms, and policymakers need to talk openly and collaborate. That’s the best way to create solutions that are practical, fair, and effective for everyone involved.

Conclusion: A Fork in the Road

The conversation about robots.txt might not seem exciting, but it's crucial for defining the future of online rights and content ownership. How we handle crawling, scraping, and consent today will shape the digital landscape for creators, researchers, journalists, and the public. 

Getting this right matters deeply—not just for publishers and artists, but also for researchers and journalists whose work depends on responsibly open access to information. To have your say in these important discussions, consider joining the conversations with the IETF or following the upcoming events and livestreams from Brussels.

💡
Please forward and share!
