
Major Sites Block Apple's AI Scraping as Chatbot Data Challenges Rise

Image: a website's robots.txt file with a highlighted section blocking AI bots, set against the logos of The New York Times, Facebook, and Apple.

Image Source: ChatGPT-4o


This summer, Apple introduced a new tool that gives websites greater control over whether their data can be used to train the company’s AI models. In response, a number of major publishers and platforms, including The New York Times and Facebook, have chosen to opt out.

Just three months after Apple quietly launched this tool, known as Applebot-Extended, numerous prominent news outlets and social platforms have taken advantage of the option to exclude their data from Apple’s AI training. This move signals a significant shift in how digital content is managed, particularly as AI-driven web scraping becomes a contentious issue.

Why Publishers Are Opting Out

Publishers are blocking AI web scrapers like Applebot-Extended for several reasons. Many are concerned about intellectual property rights and the potential misuse of their content without proper compensation. For example, some publishers believe that allowing AI to use their data without a commercial agreement devalues their content. Others have cited concerns over the ethical implications of AI systems being trained on their data without explicit permission.

Conversely, some publishers might choose to allow web scrapers if they have entered into partnerships or licensing agreements that compensate them for their data. These agreements often involve negotiations where publishers receive payment or other benefits in exchange for allowing AI systems to access their content.

How Applebot-Extended Works

Applebot-Extended builds on Apple's original web-crawling bot, Applebot, and gives website owners a way to block their data from being used to train Apple's large language models and other AI projects. Blocking Applebot-Extended does not stop the original Applebot from crawling the site, so content still appears in Apple search products like Siri and Spotlight.

Website owners can block Applebot-Extended by updating their robots.txt files, a long-standing method for managing how bots interact with their content. This file must be manually edited, which can be challenging given the growing number of AI agents and the need for continuous updates to keep the block list current.
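As an illustration, a site that wants to opt out of AI training while continuing to appear in Siri and Spotlight could add an entry like the following to its robots.txt. The user-agent tokens are the ones Apple documents; the rules shown are a minimal sketch, not a recommended policy:

    # Opt out of Apple's AI training
    User-agent: Applebot-Extended
    Disallow: /

    # Leave the original Applebot free to crawl for Siri and Spotlight
    User-agent: Applebot
    Disallow:

An empty Disallow line permits everything, so the second entry simply makes the default explicit.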

Growing Resistance to AI Scraping

Despite its recent introduction, Applebot-Extended is already blocked by a notable share of high-traffic websites, especially in the news and media sectors. Data journalist Ben Welsh found that over a quarter of the news websites in his sample block Applebot-Extended, and even more block the comparable bots from OpenAI (53% of the sample) and Google (43%).
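A survey like this can be approximated in a few lines of code. The sketch below is not Welsh's actual methodology, and the site list is a hypothetical placeholder; it uses Python's standard-library robots.txt parser to ask whether a site disallows the Applebot-Extended user agent:

    # Minimal sketch: check which sites disallow Applebot-Extended.
    # SITES is a placeholder; a real survey would use a curated sample.
    from urllib import robotparser

    SITES = ["https://www.example.com"]  # hypothetical sample

    for site in SITES:
        rp = robotparser.RobotFileParser()
        rp.set_url(site.rstrip("/") + "/robots.txt")
        rp.read()  # fetch and parse the live robots.txt
        if rp.can_fetch("Applebot-Extended", site):
            print(site, "allows Applebot-Extended")
        else:
            print(site, "blocks Applebot-Extended")

The same check, with the user-agent string swapped, works for any other bot a site might list.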

Some publishers are clear about their reasons for blocking these AI bots, citing a lack of commercial agreements as the primary factor. Others, like The New York Times, which is currently suing OpenAI over copyright infringement, have criticized the opt-out nature of these tools, emphasizing the importance of protecting their intellectual property.

The Challenge of Keeping Chatbots Updated

AI chatbots like ChatGPT and Perplexity rely on vast amounts of data to stay current, so the growing trend of websites blocking AI scraping bots poses a significant challenge. Keeping these chatbots updated requires continuous access to fresh data, but as more publishers withhold their content, maintaining the relevance and accuracy of these AI models becomes increasingly difficult.

This issue highlights a broader concern in the AI community: the need for sustainable and ethical data sources to ensure that AI technologies can continue to evolve and improve. As more content owners block access to their data, AI developers may need to seek alternative strategies to keep their models up to date, potentially leading to new partnerships, licensing agreements, or other innovative solutions.

The Future of AI Data Collection

The ongoing battle over AI training data is playing out in real time, visible in robots.txt updates across the web. This public struggle over data rights and AI's future development reflects the broader challenge of balancing technological advancement with the protection of intellectual property.