In a legal development that has sent shockwaves through the AI community, The New York Times has filed a comprehensive copyright infringement lawsuit against OpenAI and Microsoft.
The lawsuit alleges that the companies' business model, built around their large language models (LLMs) and particularly GPT-4 and related products, rests on mass copyright infringement. The case has thrust into the spotlight the ethical concerns surrounding the sourcing of training data for generative AI models.
Allegations of copyright infringement
The crux of the lawsuit is the claim that OpenAI and Microsoft used copyrighted text and other material, including content from The New York Times, without authorization to train their LLMs. The complaint contends that these LLMs have repeatedly reproduced verbatim content from The Times and various other sources.
The lawsuit underscores a growing concern, within the AI community and beyond, about the ethical sourcing of training data for LLMs. It raises questions about where that data originates, whether it includes misappropriated intellectual property, and how its use affects the creators and industries that rely on original content.
Impact on journalism and content creation
The lawsuit emphasizes the potentially devastating consequences of AI copyright infringement for content creators and journalism. It argues that when generative AI products and AI-enhanced search engines incorporate ideas and expressions taken from content providers without permission, they undermine those providers’ ability to monetize their content. This, in turn, jeopardizes the financial viability of news organizations and their capacity to fund quality journalism.
The lawsuit states, “The protection of The Times’s intellectual property is critical to its continued ability to fund world-class journalism in the public interest. If The Times and its peers cannot control the use of their content, their ability to monetize that content will be harmed. With less revenue, news organizations will have fewer journalists able to dedicate time and resources to important, in-depth stories, which creates a risk that those stories will go untold. Less journalism will be produced, and the cost to society will be enormous.”
AI models’ response to copyrighted content
The lawsuit highlights that LLMs respond inconsistently to prompts: in some instances they reproduce copyrighted text verbatim, while in others they paraphrase it. Beyond those outputs, however, the suit raises a more fundamental question: is using copyrighted material to train AI software itself an act of infringement?
The New York Times argues that training LLMs on its content is itself copyright infringement, regardless of whether the models repeat phrases from the source material. This position echoes a class-action lawsuit brought by authors Sarah Silverman, Christopher Golden, and Richard Kadrey, which claims that LLMs are themselves infringing derivative works because they cannot function without the expressive information extracted from copyrighted works.
The ongoing debate on AI ethics
The legal action taken by The New York Times has ignited a broader debate on the ethical considerations surrounding AI and the responsibility of tech companies to ensure that their AI models are built on ethically sourced data.
As AI advances and plays an increasingly prominent role in various industries, questions about data usage, intellectual property rights, and content generation ethics are becoming more urgent.