As demand for high-quality data to fuel AI models continues to rise, an escalating battle is unfolding on the internet. Companies are moving to wall off their content from AI web crawlers, with OpenAI’s GPTBot and Common Crawl’s CCBot facing the most opposition. Recent data shows that more than 250 of the top 1,000 websites now block GPTBot, while nearly 14% of those same sites have barred CCBot, shrinking the pool of material available for AI training.
The growing battle against AI web crawlers
In an era dominated by data-driven decision-making and artificial intelligence, web crawling bots have become both essential tools and significant sources of controversy. Last month, OpenAI introduced GPTBot, which was designed to respect the decades-old robots.txt protocol, allowing websites to signal their desire to remain unscraped. Initially, approximately 70 out of the top 1,000 websites implemented blocks against GPTBot, including internet giants like Amazon and Tumblr.
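The opt-out mechanism itself is simple: a site publishes directives in the robots.txt file at its root, and a compliant crawler checks that file before fetching pages. OpenAI documents “GPTBot” as the user-agent token its crawler honors, and Common Crawl uses “CCBot.” A site blocking both entirely would serve something like this:

```
# robots.txt — signal that these AI crawlers may not fetch any page
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```

Note that robots.txt is purely advisory; it works only because crawlers like GPTBot choose to respect it.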
Recent findings from Originality.ai reveal a significant shift in the landscape. In the span of just three weeks, the number of top-1,000 websites blocking GPTBot has surged past 250. The list includes well-known platforms such as Pinterest, Vimeo, GrubHub, Indeed, Apartments.com, The Guardian, Live Science, USA Today, NPR, CBS News, NBC News, CNBC, The New Yorker, People, and seemingly every title published by Hearst and Condé Nast. Even weather.com has joined the ranks of those shielding their content from the AI crawler.
The challenge to data accessibility
The impetus behind this wave of crawler blocks is the central role that high-quality data plays in training powerful generative AI models such as OpenAI’s GPT-4. These models are trained on vast amounts of text, much of it scraped from the internet even though most of that material is copyrighted or otherwise owned by someone. Awareness of this practice has surged recently, fueling numerous legal disputes and the prospect of new government regulation.
Simultaneously, many companies are moving to lock down their user-generated content and activity data. Through updates to their terms of service and user policies, tech companies are asserting the right to use that data for their own AI projects and training purposes, even as they bar outside crawlers. The shift reflects a broader push toward data protection and sovereignty, with companies increasingly seeking to control how their data is used, particularly by AI-driven entities.
CCBot is another target of the blockade
While GPTBot has drawn the most attention because of its association with OpenAI and GPT-4, another web crawler, CCBot, operated by the nonprofit Common Crawl, is meeting similar resistance. Common Crawl has spent years amassing vast amounts of web data, including copyrighted content, and packaging it for use as training data for large language models such as Meta’s Llama.
As of late September, data from Originality.ai indicates that almost 14% of the 1,000 most popular websites block CCBot. The list includes Amazon, Vimeo, Masterclass, Kelley Blue Book, The New York Times, The New Yorker, and The Atlantic, among others. Notably, many of the sites blocking CCBot also block GPTBot, illustrating a growing impulse to shield data from AI web crawlers regardless of their affiliations.
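Originality.ai has not published its exact methodology, but a block of this kind is straightforward to detect: fetch a site’s robots.txt and test whether a given user agent is permitted to crawl the root. A minimal sketch using Python’s standard-library urllib.robotparser (the site list here is illustrative, not Originality.ai’s actual sample):

```python
# Check which AI crawlers a site's robots.txt disallows.
# Standard library only; the domains below are examples.
from urllib.robotparser import RobotFileParser

CRAWLERS = ["GPTBot", "CCBot"]  # documented user-agent tokens
SITES = ["https://www.nytimes.com", "https://www.theatlantic.com"]

for site in SITES:
    parser = RobotFileParser()
    parser.set_url(site.rstrip("/") + "/robots.txt")
    parser.read()  # fetches and parses the site's robots.txt
    for bot in CRAWLERS:
        allowed = parser.can_fetch(bot, site + "/")
        print(f"{site} {'allows' if allowed else 'blocks'} {bot}")
```

Run across the top 1,000 domains, a script along these lines yields exactly the kind of block counts the Originality.ai figures describe.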
The battle between companies seeking to protect their data and AI web crawlers in pursuit of training material continues to intensify. The growing number of websites blocking GPTBot and CCBot underscores the increasing importance of data accessibility and control in an AI-driven world. As legal and regulatory scrutiny mounts, the balance between data utilization and data protection remains a critical challenge for both businesses and AI developers.