Mark Zuckerberg Allegedly Approved AI Training on Copyrighted Works
A new copyright lawsuit against Meta claims CEO Mark Zuckerberg authorized the use of pirated data to train the company’s Llama AI models. Filed in the U.S. District Court for the Northern District of California, the case, Kadrey v. Meta, alleges Meta leveraged datasets of copyrighted works without proper licensing, raising questions about the tech giant’s AI development practices.
The Allegations
Unredacted documents submitted by plaintiffs, including bestselling authors Sarah Silverman and Ta-Nehisi Coates, allegedly show that Zuckerberg approved the use of a dataset known as LibGen for training Llama. LibGen, which refers to itself as a “links aggregator,” offers access to copyrighted materials from publishers such as Cengage Learning, Macmillan Learning, McGraw Hill, and Pearson Education. LibGen has been the target of multiple lawsuits and heavy fines for copyright infringement.
Meta employees reportedly flagged concerns about using LibGen, referring to it as “a dataset we know to be pirated” and warning that its use “could weaken [Meta’s] negotiating position with regulators.” Internal memos quoted in the filing claim Meta’s AI team sought approval from “MZ” (Mark Zuckerberg) to move forward despite potential risks to Meta’s reputation and legal standing.
Methods to Conceal Infringement
The lawsuit also accuses Meta of attempting to hide its use of copyrighted materials:
Removing Copyright Markers: Plaintiffs allege that Meta engineer Nikolay Bashlykov created scripts to strip copyright information, such as the words “copyright” and “acknowledgments,” from e-books in LibGen.
Stripping Metadata: Similar efforts were allegedly made to erase copyright markers from scientific journal articles and other training data.
Torrenting Data: The filing claims Meta used torrenting to obtain LibGen content, with Meta engineers reportedly raising concerns about the legality of this practice. Torrenting is a peer-to-peer file-sharing method in which users who download a file typically also upload pieces of it to others, which plaintiffs argue constitutes another layer of copyright infringement. The filing claims that Ahmad Al-Dahle, Meta’s head of generative AI, “cleared the path” for torrenting LibGen, disregarding Bashlykov’s concerns that it “could be legally not OK.”
Fair Use Defense and Legal Precedent
Meta has leaned on the legal doctrine of fair use, arguing that its use of copyrighted data is transformative and, therefore, lawful. Plaintiffs contest this, arguing: “Had Meta bought plaintiffs’ works in a bookstore or borrowed them from a library and trained its Llama models on them without a license, it would have committed copyright infringement. Meta’s decision to bypass lawful methods of acquiring books and become a knowing participant in an illegal torrenting network … serves as proof of copyright infringement.”
This lawsuit is part of a growing wave of legal challenges that tech giants face over the use of copyrighted works to train AI models. While some cases have been dismissed due to insufficient evidence, the plaintiffs in Kadrey v. Meta argue that Meta knowingly violated copyright laws. These allegations align with an April 2024 report from The New York Times, which claimed Meta cut corners to gather data for its AI efforts. According to the report, Meta hired contractors in Africa to compile book summaries and even considered acquiring publisher Simon & Schuster. However, executives ultimately decided that negotiating licenses would take too long and relied on fair use as a defense.
Judge’s Rejection of Meta’s Redaction Request
Adding to Meta’s troubles, Judge Vince Chhabria denied the company’s request to heavily redact portions of the filing. “It is clear that Meta’s sealing request is not designed to protect against the disclosure of sensitive business information,” Chhabria wrote. “Rather, it is designed to avoid negative publicity.”
The case against Meta remains unresolved and currently applies only to the company’s earliest Llama models, not its more recent releases. The court could still rule in Meta’s favor if it finds the company’s fair use argument convincing.
What This Means
If proven, these allegations could have significant implications for AI training practices and copyright law. Meta’s reliance on datasets like LibGen reflects broader industry challenges around sourcing data for AI. As companies race to develop cutting-edge models, the balance between innovation and intellectual property rights remains a contentious issue.
This case highlights the legal and ethical scrutiny facing AI developers. If the court rules against Meta, it could set a precedent requiring stricter adherence to copyright laws during AI training. However, should Meta’s fair use argument prevail, it may encourage others in the tech industry to adopt similar practices.
For now, the lawsuit serves as a reminder of the complexities surrounding AI, intellectual property, and corporate accountability.
Editor’s Note: This article was created by Alicia Shapiro, CMO of AiNews.com, with writing, image, and idea-generation support from ChatGPT, an AI assistant. However, the final perspective and editorial choices are solely Alicia Shapiro’s. Special thanks to ChatGPT for assistance with research and editorial support in crafting this article.