Cloudflare Launches Tool to Combat AI Bots Scraping Data
Cloudflare, the publicly traded cloud service provider, has launched a new, free tool to prevent bots from scraping websites hosted on its platform for data to train AI models.
Addressing the Issue of AI Scraping
Some AI vendors, including Google, OpenAI, and Apple, allow website owners to block the bots they use for data scraping and model training by amending their site’s robots.txt file. However, Cloudflare points out in the post announcing its bot-combating tool that not all AI scrapers respect this.
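The robots.txt mechanism the vendors offer works by listing each crawler's published user-agent token with a `Disallow` rule. A minimal example (the tokens below are the ones these vendors document for AI-training crawlers; site owners should confirm current tokens against each vendor's documentation):

```
# robots.txt — opt out of AI-training crawls by known user-agent token
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /
```

As the article notes, this only restrains crawlers that choose to honor the file; it is a request, not an enforcement mechanism.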
“Customers don’t want AI bots visiting their websites, especially those that do so dishonestly,” the company writes on its official blog. “We fear that some AI companies intent on circumventing rules to access content will persistently adapt to evade bot detection.”
Cloudflare's Solution
To address this problem, Cloudflare analyzed AI bot and crawler traffic to fine-tune automatic bot detection models. These models consider factors such as whether an AI bot might be trying to evade detection by mimicking the appearance and behavior of a human web browser user. “When bad actors attempt to crawl websites at scale, they generally use tools and frameworks that we are able to fingerprint,” Cloudflare writes. “Based on these signals, our models [are] able to appropriately flag traffic from evasive AI bots as bots.”
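Cloudflare has not published its models, but the kind of signal-based scoring the post describes can be sketched as a weighted combination of request heuristics. Everything below — the signal names, thresholds, and weights — is a hypothetical illustration, not Cloudflare's method:

```python
# Hypothetical sketch of signal-based bot scoring. Cloudflare's actual
# detection models are not public; signals and weights here are illustrative.

def bot_score(request: dict) -> float:
    """Combine weighted heuristics into a 0..1 bot-likelihood score."""
    score = 0.0
    ua = request.get("user_agent", "")

    # Common crawling frameworks are easy to fingerprint by user agent.
    if any(tok in ua for tok in ("python-requests", "Scrapy", "curl")):
        score += 0.5

    # A "browser" user agent missing typical browser headers is suspicious:
    # an evasive bot mimicking a browser often gets these details wrong.
    if "Mozilla" in ua and "accept_language" not in request:
        score += 0.25

    # Sustained high request rates suggest automated crawling at scale.
    if request.get("requests_per_minute", 0) > 120:
        score += 0.25

    return min(score, 1.0)

# A scraping framework announcing itself plainly, crawling fast:
print(bot_score({"user_agent": "python-requests/2.31",
                 "requests_per_minute": 300}))  # 0.75
```

A real system would feed many more signals (TLS fingerprints, header ordering, timing patterns) into a trained classifier rather than hand-set weights; the point is only that evasive bots leak fingerprintable signals.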
Cloudflare has also set up a form for hosts to report suspected AI bots and crawlers and says it will continue to manually blacklist AI bots over time.
Growing Concern Over AI Bots
The problem of AI bots has become more prominent as the generative AI boom fuels the demand for model training data. Many sites, wary of AI vendors training models on their content without alerting or compensating them, have opted to block AI scrapers and crawlers. According to one study, around 26% of the top 1,000 sites on the web have blocked OpenAI’s bot, and another study found that more than 600 news publishers had blocked the bot.
Challenges and Solutions
Blocking AI scrapers is not always effective. Some vendors ignore standard bot exclusion rules to gain a competitive advantage in the AI race. For example, AI search engine Perplexity was recently accused of impersonating legitimate visitors to scrape content from websites, and OpenAI and Anthropic are said to have ignored robots.txt rules at times. In a letter to publishers, content licensing startup TollBit noted that “many AI agents” ignore the robots.txt standard.
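The compliance step these scrapers are accused of skipping is simple for a well-behaved crawler to implement. A minimal sketch using Python's standard-library `urllib.robotparser` (the robots.txt body below is illustrative):

```python
from urllib.robotparser import RobotFileParser

# The check a well-behaved crawler runs before fetching a page.
# This robots.txt blocks one named AI bot site-wide and allows everyone else.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("GPTBot", "https://example.com/article"))        # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/article"))  # True
```

Because the check is entirely voluntary and runs on the crawler's side, a vendor that skips it faces no technical barrier, which is why server-side detection like Cloudflare's exists at all.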
Tools like Cloudflare’s could help if they prove accurate in detecting clandestine AI bots. However, they won’t solve the more intractable dilemma facing publishers: AI tools like Google’s AI Overviews can cut off referral traffic, since sites that block specific AI crawlers are excluded from inclusion.