
The Data That Powers A.I. Is Disappearing Fast


New research from the Data Provenance Initiative, an M.I.T.-led research group, reveals a significant decline in the availability of content for artificial intelligence (A.I.) training. For years, A.I. developers have relied on vast amounts of text, images, and videos from the internet to train their models. However, this crucial data is becoming increasingly scarce.

Decline in Available Data

The study examined 14,000 web domains included in three widely used A.I. training data sets—C4, RefinedWeb, and Dolma—and found an “emerging crisis in consent.” Many publishers and online platforms have implemented measures to prevent their data from being harvested, leading to a substantial reduction in available data.

Researchers estimate that 5% of all data and 25% of data from the highest-quality sources in these data sets have been restricted. This is largely due to the Robots Exclusion Protocol, a method allowing website owners to prevent automated bots from crawling their pages via a file called robots.txt. Additionally, up to 45% of the data in the C4 data set has been restricted by websites’ terms of service.
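To illustrate how the Robots Exclusion Protocol works in practice, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt content shown is hypothetical (the `GPTBot` user agent is OpenAI's published crawler name, but any given site's actual rules will differ): it blocks that one A.I. crawler from the entire site while allowing all other bots.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named A.I. crawler, allow everything else.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(robots_txt)

# The named crawler is refused; an unlisted crawler falls through to the
# wildcard rule and is permitted.
print(parser.can_fetch("GPTBot", "https://example.com/article.html"))        # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/article.html"))  # True
```

Note that robots.txt is purely advisory: well-behaved crawlers check it before fetching pages, but nothing technically prevents a scraper from ignoring it, which is one reason publishers are also turning to terms-of-service restrictions and paywalls.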

Impact on A.I. Development

“We’re seeing a rapid decline in consent to use data across the web that will have ramifications not just for A.I. companies, but for researchers, academics, and noncommercial entities,” said Shayne Longpre, the study’s lead author.

Data is essential for today’s generative A.I. systems, which rely on billions of examples of text, images, and videos. These systems, such as OpenAI’s ChatGPT, Google’s Gemini, and Anthropic’s Claude, depend on high-quality data to generate accurate and useful outputs.

The generative A.I. boom has led to tensions with data owners, many of whom are reluctant to have their content used for A.I. training without compensation. In response, some publishers have implemented paywalls, changed their terms of service, or blocked automated web crawlers. Notable examples include Reddit and Stack Overflow, which now charge A.I. companies for data access, and The New York Times, which has taken legal action against OpenAI and Microsoft for copyright infringement.

Efforts to Secure Data

In recent years, A.I. companies have gone to great lengths to secure data, including transcribing YouTube videos and adjusting their data policies. Some have struck deals with publishers, like The Associated Press and News Corp, to maintain ongoing access to their content. However, widespread data restrictions pose a threat to A.I. companies, especially smaller startups and academic researchers who rely on public data sets.

Common Crawl, a data set of billions of web pages maintained by a nonprofit, has been cited in over 10,000 academic studies. Restrictions on data could impede future research and development in the A.I. field.

Looking Forward

Yacine Jernite, a machine learning researcher at Hugging Face, characterized the situation as a natural response to aggressive data-gathering practices by the A.I. industry. Stella Biderman, executive director of EleutherAI, expressed concern that licensing requirements could hinder smaller actors and researchers from contributing to A.I. governance.

A.I. companies have claimed that their use of public web data is protected under fair use. However, gathering new data has become more challenging as publishers increasingly restrict access. Some companies are exploring the use of synthetic data—data generated by A.I. systems themselves—to train models, but many researchers question the quality of such data compared to human-created content.

Mr. Longpre emphasized the need for new tools to allow website owners to control how their data is used more precisely. Differentiating between uses for profit and nonprofit or educational purposes could help address current challenges.

Conclusion

The reduction in available data for A.I. training highlights a growing tension between A.I. companies and data owners. As restrictions increase, the A.I. industry faces significant hurdles in accessing the high-quality data necessary for model development. This evolving landscape calls for new solutions and more precise control mechanisms to balance the needs of data owners with the demands of technological advancement.