Generative AI & Law: LLMs are not Stochastic Parrots

Photos © Aditya Mohan | All Rights Reserved.  These views are not legal advice but business opinion based on reading some English text written by a set of intelligent people. 

Summary

Kadrey v. Meta (Case 3:23-cv-03417), filed by Sarah Silverman and other authors against Meta, focused on the allegation that Meta's large language models (LLMs), specifically LLaMA, were trained on copyrighted books without authorization. The plaintiffs claimed that this training relied on works illegally scraped from book-torrenting websites. On November 20, 2023, Judge Vince Chhabria partially dismissed the case, expressing skepticism about the plaintiffs' claims and in particular questioning the idea that text generated by LLaMA directly infringed the authors' copyrights.

The lawsuit Kadrey v. Meta (Case 3:23-cv-03417), brought by Sarah Silverman and other authors against Meta, together with a parallel suit against OpenAI, presents a significant legal battle over the use of copyrighted works in training artificial intelligence models. Here is a detailed look at the case.

Background

Sarah Silverman, alongside novelists Christopher Golden and Richard Kadrey, filed suit in a U.S. district court against Meta and OpenAI, accusing both companies of copyright infringement. The authors claimed that their books were used without authorization to train large language models (LLMs) such as OpenAI's GPT-3.5 and GPT-4 and Meta's LLaMA. The lawsuit alleged that the training data for these models was illegally obtained from book-torrenting websites such as Bibliotik, Library Genesis, and Z-Library, which are considered "shadow libraries."

The Lawsuit Details

The plaintiffs argued that the datasets used to train the AI models included their copyrighted works. They claimed that these works were used without any copyright management information and were obtained from sources known to host pirated content. In particular, they pointed out that The Pile, a dataset Meta listed as one of its sources, was assembled in part from content on the Bibliotik private tracker. The lawsuit also highlighted that AI models like ChatGPT could summarize books when prompted, which the plaintiffs viewed as direct infringement of their copyrights. Silverman's "The Bedwetter," Golden's "Ararat," and Kadrey's "Sandman Slim" were among the books specifically named. The authors sought restitution of profits, statutory damages, and other relief, covering copyright violations, unjust enrichment, negligence, and unfair competition claims.

Plaintiffs' Argument

The authors contended that OpenAI violated U.S. law by copying their works to train AI systems, stating, "OpenAI is clearly signaling its intent to unilaterally rewrite U.S. copyright law in its favor." They argued that the AI models and their outputs were "derivative works" of their books, infringing their copyrights. The plaintiffs asserted that substantial similarity was not necessary to prove infringement, since OpenAI had directly copied their works. They were particularly critical of OpenAI's fair use defense, claiming it was "at odds with settled precedent" and would undermine the U.S. copyright system.

Judge Chhabria's Responses

U.S. District Judge Vince Chhabria, in response to the lawsuit, expressed skepticism about the plaintiffs' claims. He indicated his intention to dismiss the allegations over AI's text output, stating, "Your remaining theories of liability I don't understand even a little bit." Chhabria also critiqued the idea that the text generated by LLaMA copies or resembles the authors' works, saying, "When I make a query of LLaMA, I'm not asking for a copy of Sarah Silverman's book – I'm not even asking for an excerpt." The judge challenged the argument that LLaMA itself was an infringing work, noting the implausibility of comparing the LLaMA language model to Silverman's book.

Silverman's Theory Dismissed

Judge Chhabria rejected Silverman's theory that Meta's AI model is itself built on infringement of copyrighted works. He found it nonsensical to view the LLaMA models as a recasting or adaptation of the plaintiffs' books, stating, "There is no way to understand the LLaMA models themselves as a recasting or adaptation of any of the plaintiffs' books." He dismissed Silverman's arguments for lack of evidence that any outputs generated by Meta's AI system could be considered a recasting or adaptation of the plaintiffs' books.

Current Status

The judge dismissed most of the claims but gave the authors the opportunity to amend them. He indicated that he would dismiss the claims again if the authors failed to demonstrate that LLaMA's output was substantially similar to their works. The central claim of the lawsuit, regarding the use of copyrighted books to train AI models, remains unresolved.

This case is significant in the context of copyright law and AI technology, highlighting the complexities and evolving challenges in this area.  

Stochastic Parrots

The term "Stochastic Parrots" was popularized by a research paper titled “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” authored by Emily Bender, Timnit Gebru, Angelina McMillan-Major, and Margaret Mitchell. Presented in early 2021, the paper critiqued large language models (LLMs) for their potential to propagate biases and misinformation, likening them to parrots that mimic without understanding. The term suggests these models merely replicate patterns found in their training data without true comprehension or originality. However, this characterization is increasingly seen as a simplification, especially for foundation models like LLMs and diffusion models. These advanced AI models, including GPT and Stable Diffusion, demonstrate capabilities beyond simple replication, such as generating novel content, creative outputs, and complex problem-solving, indicating a level of processing and transformation that goes beyond the "Stochastic Parrots" label. This evolving understanding reflects the growing sophistication and potential of these models in various applications.
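As a caricature, the "stochastic parrot" critique can be illustrated with a toy bigram sampler: a model that can only replay word transitions it has already observed, chosen at random. This is a deliberately simplistic sketch of the critique itself, not of how transformer-based LLMs actually work; the corpus and function names are illustrative assumptions.

```python
import random
from collections import defaultdict

# Toy bigram model: the literal "stochastic parrot" caricature.
# It can only re-emit word transitions seen in its training text,
# picking among them at random. Modern LLMs, by contrast, learn
# distributed representations and can generalize beyond seen n-grams.
corpus = "the model repeats patterns the model has seen before".split()

# Record every observed next word for each word in the corpus.
transitions = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev].append(nxt)

def parrot(start, length, seed=0):
    """Sample a word sequence by replaying observed bigrams only."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(length - 1):
        options = transitions.get(words[-1])
        if not options:  # dead end: no observed continuation
            break
        words.append(rng.choice(options))
    return " ".join(words)

print(parrot("the", 6))
```

Every adjacent word pair the sampler emits already exists verbatim in its training text, which is precisely the limitation the "parrot" label asserts; the argument in this article is that LLM outputs are not constrained in this way.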

LLMs are not Stochastic Parrots

Kadrey v. Meta offers another argument that LLMs do not simply replicate or regurgitate information without understanding or originality; they are not stochastic parrots. Judge Chhabria's dismissal of claims premised on the idea that LLaMA's output directly infringed copyrights suggests a recognition that the outputs of LLMs are not straightforward replications of input data but involve complex, transformative processes. This indicates an understanding that LLMs, while trained on existing data, generate outputs that are not mere echoes of their training material, but rather unique amalgamations and interpretations of learned information.

Image depicting a fictional interaction between a robot and the patent officer, Dr. William Thornton, at the U.S. Patent Office in 1802, as the robot submits a patent application.

References

Further reading