Old Protocol, New Problems: Robots.txt
This week, in our main story: we examine how robots.txt, a decades-old web protocol, has become central to discussions about consent, copyright, and web scraping to power new generative AI tools.
But first...
We Must Back the Tools that Beat Censorship
A few weeks ago, the Trump administration moved to cut off funding for the Open Technology Fund (OTF), a move that threatened to undermine internet freedom tools used by activists, journalists, and citizens in authoritarian regimes. OTF is a small but mighty nonprofit that’s been quietly powering some of the most essential tools for internet freedom—things like Tor, Signal, and Let’s Encrypt—for years.
When the Trump administration attempted to cut OTF’s funding in its first term, Internet Exchange’s Mallory Knodel warned that the OTF is the leading global funder for digital tools that counter censorship and surveillance, and that the sudden loss of $20 million would leave a significant hole in efforts like academic research, software upkeep, and censorship monitoring.
After OTF filed a federal lawsuit, they received a notification from the USAGM that rescinded the March 15th termination of their grants. Still, the sudden halt in funding sent a chilling message, one that may embolden regimes that thrive on controlling information. Given the pattern on display since 2020, we must remain vigilant, invest in resilience, and build coalitions to protect the digital infrastructure that upholds human rights.
Enjoying The Internet Exchange?
If you are finding this newsletter enjoyable, consider leaving us a tip to help keep it going.
This Week's Links
Internet Governance
- Open Observatory of Network Interference (OONI) has launched new thematic censorship pages in order to help the internet freedom community more easily identify and respond to censorship events affecting key platforms like social media, news media, and circumvention tools. https://ooni.org/post/2025-ooni-explorer-thematic-censorship-pages
- Rethinking BGP as a form of internet diplomacy, exploring how the Border Gateway Protocol can support geopolitical cooperation. https://www.aroundthetable.social/rethinking-bgp-as-internet-diplomacy
- Russia has reportedly blocked access to multiple websites, pushing more of the Russian internet into the dark. https://therecord.media/russia-websites-dark-reported-cloudflare-block
Digital Rights
- A group of experts, including Internet Exchange’s Mallory Knodel, explore how defederation is shaping moderation, trust, and safety across decentralized social media platforms. https://carnegieendowment.org/research/2025/03/fediverse-social-media-internet-defederation
- The Italian government has reportedly admitted to using spyware against members of the NGO Mediterranea Saving Humans. https://www.euractiv.com/section/politics/news/spyware-scandal-italian-government-reportedly-admits-targeting-activists
- US authorities are scrutinising student visas based on social media activity as they expand deportation efforts, including of students who have spoken out in support of Palestinians during Israel’s war in Gaza. https://www.nytimes.com/2025/04/01/us/politics/student-visas-social-media.html
- The CDT argues that meaningful opt-out rights require more than checkboxes—companies need to act, and states may have to enforce it. https://cdt.org/insights/meaningful-opt-out-rights-require-companies-to-do-their-part-state-governments-might-have-to-make-them
- At RightsCon 2025, DSNP hosted its first community meeting to explore sustainable governance for decentralised social networking, highlighting the need for clear purpose, knowledge sharing, and inclusive participation. https://dsnp.org/blog/2025/03/05/rightscon-readout.html
- WITNESS critiques the EU AI Act's third draft of the General-Purpose AI Code of Practice for diluting transparency and risk assessment measures, potentially compromising fundamental rights and the integrity of information. https://blog.witness.org/2025/03/eu-ai-act-ensuring-rights-disclosure
Technology and Society
- The flood of Studio Ghibli-style AI images shows why creators need better consent tools. From Internet Exchange’s Mallory Knodel and Audrey Hingle. https://www.techpolicy.press/the-ghibli-style-ai-trend-shows-why-creators-need-their-own-consent-tools
- Growth in traffic from AI crawlers is heavily impacting the operations of the Wikimedia Foundation. https://diff.wikimedia.org/2025/04/01/how-crawlers-impact-the-operations-of-the-wikimedia-projects
- We finally know why Sam Altman was fired from OpenAI. https://www.platformer.news/we-finally-know-why-sam-altman-was-fired-from-openai
- In a capitalist world, the often-overlooked systems of technical standards offer a rare example of economic collaboration that prioritizes the public good over profit. https://thereader.mitpress.mit.edu/the-anti-capitalist-case-for-standards
- Block Party offers a new tool for optimising social media privacy, catering to both individual and enterprise users rethinking online safety in the Trump era. https://www.fastcompany.com/91297336/block-party-offers-tool-for-optimizing-social-media-privacy
- AI experts say we’re on the wrong track to building human-like intelligence. https://gizmodo.com/ai-experts-say-were-on-the-wrong-path-to-achieving-human-like-ai-2000581717
- In Trump era, companies are rebranding DEI initiatives rather than scrapping them entirely. https://www.cnbc.com/2025/03/30/in-trump-era-companies-are-rebranding-dei-efforts-not-giving-up.html
- Charlie Warzel explores the tone and implications of the White House’s gleeful cruelty on X. https://www.theatlantic.com/technology/archive/2025/03/gleeful-cruelty-white-house-x-account/682234
- Podcast: Techsequences explores how the design of social media—built to prioritise content ranking and moderation—may be fundamentally flawed, and asks what practical steps could create platforms that bring out the best in us instead of the worst. https://www.techsequences.org/podcasts/2025/03/beyond-discourse-dumpster-fires-rethinking-social-media
- Global Voices reports on the devastating global fallout from Trump’s sudden freeze on US foreign aid, which has shuttered essential health and humanitarian services, destabilised civil society, and triggered widespread protests and misinformation. https://globalvoices.org/special/crumbling-of-the-global-aid-sector
- A two-year study by More in Common found that most Americans are eager to connect across lines of race, class, religion, and politics—especially when united by a shared goal. https://moreincommonus.com/publication/the-connection-opportunity
- The president’s two oldest sons are investing in a bitcoin-mining company, adding to the Trump family’s expanding portfolio of cryptocurrency businesses. https://www.wsj.com/finance/currencies/the-trump-family-advances-its-all-out-crypto-blitz-this-time-with-bitcoin-mining-86a1e8d9?st=VGiqCF
Privacy and Security
- Apple has appealed to the Investigatory Powers Tribunal over an order by home secretary Yvette Cooper to give the UK access to customers' data protected by Advanced Data Protection encryption. What happens next? https://www.computerweekly.com/opinion/Apples-appeal-to-the-Investigatory-Powers-Tribunal-over-the-UKs-encryption-back-door-explained
- Signal may be one of the most secure encrypted chat apps, but a reminder *cough* Pete *cough* that hackers could still gain access to Signal messages through hacked or stolen phones. https://www.nbcnews.com/tech/security/signal-app-used-hegseth-can-leave-door-open-hackers-rcna197956
- Venezuelan digital rights group Ve Sin Filtro developed an application to access censored news without the need for a VPN. Cool! https://www.techradar.com/vpn/vpn-privacy-security/dont-call-it-a-vpn-how-a-newsreader-app-seeks-to-revolutionize-censorship-circumvention
- Google rolls out encrypted Gmail for enterprise users. https://www.theverge.com/news/640422/google-gmail-email-encryption-enterprise-beta
- France fines Apple €150 million over iOS data tracking and consent practices. https://www.bloomberg.com/news/articles/2025-03-31/france-fines-apple-150-million-over-ios-data-tracking-consent
- Apple and SpaceX are competing in the race to eliminate cellphone dead spots, a rivalry that is set to intensify. https://www.wsj.com/tech/apple-elon-musk-satellite-cell-phone-services-ed2d2730?st=SBWkLy
- The increasing use of encrypted messaging apps like Signal by government officials is raising concerns about transparency, as these platforms can circumvent public records laws and hinder accountability. https://apnews.com/article/encryption-apps-government-transparency-sunshine-week-ad26ecdee91c8f99f15228bbe7989ede
- A post-mortem on a pilot project for detecting child sexual abuse material on Mastodon reveals both the potential and the financial challenges of moderating harm on decentralised platforms. https://about.iftas.org/2025/03/27/content-classification-system-post-mortem
Upcoming Events
- The Green Web Foundation is hosting three online workshops on estimating digital carbon emissions. The first is April 10, 09:00am CET. https://www.thegreenwebfoundation.org/services/estimating-digital-carbon-emissions-workshop
- A.I. In Art: Perceptions, Values & Rights. An opening reception of an exhibition and larger conversation about the implications of AI generated art on the creative landscape, digital ethics and copyright. April 16, 4:30pm ET. Washington D.C. https://www.linkedin.com/posts/elissaredmiles_like-art-hate-ai-like-ai-think-its-democratizing-activity-7311003690625105920-VeJy
- Democracy & Disability: A discussion hosted by the Museum of Science. April 17, 7:00pm ET. Boston, MA https://www.mos.org/events/democracy-disability-issue
Careers and Funding Opportunities
- King’s College London is recruiting for AI Academic Fellowships. https://www.kcl.ac.uk/jobs/role/kings-ai-academic-fellowships
- Harvard’s Berkman Klein Center for Internet & Society welcomes applications for its 2025-2026 fellowships. https://cyber.harvard.edu/getinvolved/fellowships/2526Fellows
Opportunities to Participate
- MCE Conseils is gathering responses for a new survey supporting AI governance and digital policy research. https://mceconseils.limequery.com/811824
- Submit your content moderation research to HIIG’s open call for submissions. https://www.hiig.de/en/call-for-submissions-content-moderation
- The United for Smart Sustainable Cities (U4SSC) initiative is seeking experts to contribute to cutting-edge frameworks, tools, and strategies for buildings, urban energy, and people-centered governance.
- Thematic group on “Social-Cultural Sustainability in People-Centered City Governance” https://docs.google.com/forms/d/e/1FAIpQLSdKZezz2fkRwhxuC6simuFJmBmgLE_sYPSXSPw4KbgYUkqiYg/viewform
- Thematic group on “Sustainable Digital Transformation in Buildings and Urban Energy” https://docs.google.com/forms/d/e/1FAIpQLSdKZezz2fkRwhxuC6simuFJmBmgLE_sYPSXSPw4KbgYUkqiYg/viewform
🎵 We're listening to: Protect Trans Kids (WTFIWWY) by internet activists, writers, and musicians Evan Greer & Ryan Cassata (Lyric Video) https://www.youtube.com/watch?v=J2QKqy2y27U
What did we miss? Please send us a reply or write to editor@exchangepoint.tech.
Robots.txt Is Having a Moment: Here's Why We Should Care
By Mallory Knodel and Audrey Hingle. Also published in techpolicy.press.
Once a quiet piece of internet plumbing, robots.txt is now in the spotlight. AI has turned this humble file, long used to guide web crawlers for search, research, and more, into ground zero for debates about consent, control, and digital exploitation.
Introduction: From Obscure Standard to Center Stage
Robots.txt was proposed 30 years ago by Dutch software engineer and early web developer Martijn Koster. Since then, it has acted as a simple, voluntary protocol for website and crawler interaction. While it has been described as “the text file that runs the internet,” it was never intended to be a security tool or a legal framework, just a way for site owners to express how they wanted search engines, researchers, and archiving projects to handle and use their content, based on a clear signal and good manners.
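Those few simple lines really are simple. A hypothetical robots.txt, placed at a site’s root, might read in its entirety:

```text
# A hypothetical robots.txt: every crawler may visit everything
# except the /private/ directory. Nothing here is enforced; it
# relies on crawlers choosing to honor the request.
User-agent: *
Disallow: /private/
```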
This file, often no more than a few simple lines of human and machine-readable text, has been thrust into the spotlight because crawlers are now widely known to harvest data for AI. Crawlers that harvested, and continue to harvest, massive volumes of data for model training do so often without attribution or compensation to site owners and content creators. As generative AI systems scale, the stakes have risen: what was once a quiet tool for managing web etiquette is on the precipice of becoming the frontline defense for those trying to assert control over how their content is used in the lucrative AI-driven arms race.
Last week’s IETF working group meeting marked a turning point in the push to modernise how we express content usage preferences online, particularly in the context of AI. But can we teach an old protocol new tricks? And is it enough on its own to express meaningful consent across all bots and crawlers used for purposes as wide-ranging as archival, research, and AI?
Why AI Scraping Is Breaking the System
Generative AI models like OpenAI’s GPT-4 and Google’s Gemini are trained on massive datasets scraped from the open web, including websites like Wikipedia (which is struggling to handle the extra traffic), news outlets such as The Guardian and The New York Times (which is now suing OpenAI), public domain and pirated books, code from platforms like GitHub, and public forums like Reddit. Some of this material is in the public domain or openly licensed, but much of it is copyrighted, raising ongoing legal and ethical concerns. According to OpenAI, its models were trained on “publicly available and licensed data,” but studies like the one by Carlini et al. have shown that models can regurgitate written and visual copyrighted material verbatim. OpenAI’s latest update to its AI image generator can produce Studio Ghibli-style images, which have subsequently flooded the internet, evidence that it was trained on the studio’s copyrighted content. Heated debates about copyright, consent, and the ethics of large-scale web scraping have been fueled by loyal fandom and deep appreciation for artists and creators.
Common Crawl, a nonprofit web archive whose mission is to allow researchers, small startups, or even individuals to access high-quality crawl data previously only available to large search engine corporations, is facing mounting pressure from publishers over its role in AI training. The New York Times and several Danish media organizations have demanded the removal of their content from past crawls and have configured their robots.txt files to block future access. Although Common Crawl plans to comply, its executive director, Rich Skrenta, warns that removing archived materials from repositories like Common Crawl threatens the open web. For years, scrapers like Common Crawl have been critical tools for openness and research, powering projects in web structure analysis, online censorship monitoring, and phishing detection. But as AI companies like OpenAI use the data to train commercial models, the project has become a flashpoint in debates over consent, copyright, and control, potentially sidelining researchers and entrenching the power of well-resourced tech giants.
These examples don’t show that publishers hate bots. For years, search engines and crawlers have been welcomed as tools for discovery and archiving. What they reveal is a growing objection to powerful AI systems using content in ways no one consented to or could have anticipated, as well as the vast amounts of money being made on that content and not shared. In response, publishers and other website owners are deploying anti-scraping measures not just to block bots but to push back against a broader loss of control in the age of AI.
The Robots.txt & Other Options: What’s Being Proposed
Robots.txt remains important as a foundational tool due to its widespread adoption and familiarity among website owners and developers. It provides a straightforward mechanism for declaring basic crawling permissions, offering a common starting point from which more advanced and specific solutions can evolve. However, robots.txt is primarily useful for website owners and publishers who control their own domains and can easily specify crawling rules. It doesn't effectively address content shared across multiple platforms or websites, nor does it give individual content creators, such as artists, musicians, writers, and other creative professionals, a way to easily communicate their consent preferences when they publish their work on third party sites, or when their work is used by others.
When content is hosted on platforms like social media sites, users often do not control how their posts are scraped or used, making it unclear how to express or withhold consent. A recent debate on Bluesky illustrated this challenge. The platform proposed giving users the ability to opt in or out of having their data used for things like generative AI, effectively letting them edit their own robots.txt file. Some users misunderstood the proposal as Bluesky changing its stance on AI training when, in fact, it aimed to give individuals more control over how third-party scrapers use their content. The backlash highlights how confusing and disempowering this space remains for everyday users.
To address these limitations, proposals for enhancing robots.txt range from minor adjustments to significant updates. Minor improvements include expanding user agent identification to cover more types of crawlers, especially AI bots, or using the existing Disallow/Allow directives to specify preferences for AI training. Another proposal suggests extending robots.txt to include content layer control to enable more specific rules on how content is accessed and used, particularly for AI training. More substantial changes include adding two new properties to robots.txt specifically for AI training, which would allow content owners to express more granular control over how AI systems use their content.
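To see how the existing Disallow/Allow directives already express AI-specific preferences, here is a sketch using Python’s standard-library robots.txt parser. "GPTBot" is the user agent OpenAI publishes for its crawler; the paths and the second crawler name are illustrative.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that refuses one AI crawler by user agent
# while leaving the site open to everyone else.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The AI crawler is asked to stay out everywhere; a generic crawler is not.
print(parser.can_fetch("GPTBot", "https://example.com/articles/1"))         # False
print(parser.can_fetch("SomeArchiveBot", "https://example.com/articles/1"))  # True
```

Note that, like everything in robots.txt, this is a request rather than a barrier: the parser only tells a well-behaved crawler what the site owner asked for.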
Another proposal is adding an ai.txt file, a complementary standard to robots.txt. Versions have been proposed by both Guardian News & Media and Spawning. It would give publishers and content owners more control over how AI systems use their content, including the ability to limit snippet length, require attribution, and allow or deny use for AI training.
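The exact syntax is still in flux and differs between the Guardian News & Media and Spawning drafts, but an ai.txt in the spirit of these proposals might look something like the following (the directives shown are a hypothetical sketch, not a settled standard):

```text
# Hypothetical ai.txt: per-media-type rules for AI training,
# sitting alongside robots.txt at the site root.
User-Agent: *
Disallow: *.jpg
Disallow: *.png
Allow: *.txt
```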
Beyond robots.txt enhancements, several complementary approaches are being proposed to better support content creators. Some focus on embedding consent signals directly into individual files, making preferences easier for creators to manage across platforms. Examples include adding machine-readable metadata directly within images, videos, and other digital files, and tools such as Spawning’s Do Not Train tool suite or the TDM·AI proposal, which provide creator-friendly solutions for content-level control. Additionally, structured HTTP headers and expanding signaling mechanisms to APIs and cloud services are suggested to ensure consistent preference communication across various digital environments.
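At the HTTP layer, a server could attach a machine-readable reservation to every response it serves. The W3C’s draft TDM Reservation Protocol, which TDM·AI builds on, sketches headers along these lines (header names follow that draft; the values and policy URL here are illustrative):

```text
HTTP/1.1 200 OK
Content-Type: text/html
TDM-Reservation: 1
TDM-Policy: https://example.com/tdm-policy.json
```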
Currently, these proposals primarily focus on preference signaling rather than enforcement. Technical tools, standards, and protocols provide mechanisms to express content owners' wishes clearly and machine-readably, but they rely on voluntary compliance from crawlers and AI systems. Robust enforcement and accountability are anticipated to emerge from future policy frameworks and regulatory actions, with policymakers expected to play a crucial role in creating legal backing and consequences for adherence or violations.
The Problem with "Preference" in a Complex Ecosystem
As debates over AI training and content control intensify, an IETF working group has stepped in to define how websites can express their preferences. Their most recent meeting demonstrated that building a shared language is easier said than done. While there’s momentum and enthusiasm behind creating technical solutions, many of the thorniest issues—like how to interpret the lack of a signal, combine conflicting signals, or manage different preferences—are far from resolved.

The working group agreed to support both opt-in and opt-out signaling for AI usage, allowing content owners to explicitly declare their preferences. However, most of the web currently falls into a grey area: no signal at all. This state is the default, and for now it is “out of scope”. That means platforms and AI companies are left to interpret silence however they choose, which risks reinforcing the status quo of nonconsensual data use.

As use cases for AI diversify, so too do the layers of preference. A site might wish to allow indexing for AI search but not for AI training, or permit archival but only in certain jurisdictions. Sites might want to add time-based restrictions, like embargoes or third-party licensing systems. Signaling how content should be used, it turns out, is extremely complex. The group acknowledged this challenge and carved out a separate deliverable to define how these preferences can be composed and interpreted, ideally by August.

The current goal is to make preferences machine-readable and clearly expressed, not to enforce them. Regulation and legal frameworks, especially in the EU, are expected to eventually create consequences for ignoring these signals. For now, the focus remains on enabling clear expression, not ensuring follow-through.
Rights, Risks, and Regulation
This is about more than technical standards; it’s about governance and the technology that underpins the web. The growing backlash against AI scraping reflects a deeper concern over the erosion of norms online. As regulators, primarily in the EU, move to define legal frameworks for AI transparency and data usage, there’s a narrow window for the technical community to weigh in and help shape meaningful and enforceable norms.
The EU AI Act and its accompanying Code of Practice have added urgency, as rightsholders and cultural organizations demand enforceable safeguards and more meaningful opt-out (and opt-in) mechanisms.
The urgency at the IETF meeting last week was palpable, driven by the understanding that if technologists don’t lead on setting clear, interoperable standards, policymakers or Big Tech will define the rules, potentially in ways that are not ideal or that favor business over public interest.

Technologists participating in the IETF discussions face a clear challenge: finding the right balance between helping website owners express their preferences, protecting against AI misuse, and preserving legitimate uses like academic research, journalism, and public-interest innovation, activities that have long depended on open web crawling.
What We Need Next
The path forward requires us to move beyond robots.txt alone, combining practical technical enhancements with collaborative, ecosystem-wide solutions. Here are the key steps we need next:
- Make Robots.txt AI-Ready: Robots.txt is popular because it’s straightforward, but it needs to be enhanced to handle modern AI crawlers.
- Empower Creators with Content-Level Signals: Creators need simple, built-in ways to tag their work with clear signals about its use, especially by AI. Embedding machine-readable metadata directly into files like images, videos, or code ensures that usage preferences stay with the content wherever it travels online.
- Standardize Preference Signals Across the Web: We should adopt easy-to-understand, consistent signals in addition to robots.txt, like structured HTTP headers or API-level preferences, so content owners can clearly communicate their wishes, no matter which platform or service is hosting their content.
- Prioritize Clear Signals Now, Expect Enforcement Soon: Right now, the priority is clarity: making sure preferences are expressed consistently and understood across the web. But we should be working with policymakers to ensure these signals eventually have real legal backing.
- Encourage Collaboration Among Stakeholders: Technologists, content creators, platforms, and policymakers need to talk openly and collaborate. That’s the best way to create solutions that are practical, fair, and effective for everyone involved.
Conclusion: A Fork in the Road
The conversation about robots.txt might not seem exciting, but it's crucial for defining the future of online rights and content ownership. How we handle crawling, scraping, and consent today will shape the digital landscape for creators, researchers, journalists, and the public.
Getting this right matters deeply—not just for publishers and artists, but also for researchers and journalists whose work depends on responsibly open access to information. To have your say in these important discussions, consider joining the conversations with the IETF or following the upcoming events and livestreams from Brussels.