by Tori Noble
Electronic Frontier Foundation
Reprinted under Creative Commons license
Part One: Copyright and AI: The Cases and the Consequences

The launch of ChatGPT and other deep learning tools quickly led to a flurry of lawsuits against model developers. Legal theories vary, but most are rooted in copyright: plaintiffs argue that use of their works to train the models was infringement; developers counter that their training is fair use. Meanwhile, developers are making as many licensing deals as possible to stave off future litigation, and it’s a sound bet that the existing litigation is an elaborate scramble for leverage in settlement negotiations.
These cases can end one of three ways: rightsholders win, everybody settles, or developers win. As we’ve noted before, we think the developers have the better argument. But that’s not the only reason they should win these cases: while creators have a legitimate gripe, expanding copyright won’t protect jobs from automation. A win for rightsholders, or even a settlement, could also lead to significant harm, especially if it undermines fair use protections for research or artistic freedom for creators. In this article and a follow-up, we’ll explain why.
State of Play
First, we need some context, so here’s the state of play:
DMCA Claims
Multiple courts have dismissed claims under Section 1202(b) of the Digital Millennium Copyright Act, stemming from allegations that developers removed or altered attribution information during the training process. In Raw Story Media v. OpenAI, Inc., the Southern District of New York dismissed these claims because the plaintiff had not “plausibly alleged” that training ChatGPT on their works had actually harmed them, and there was no “substantial risk” that ChatGPT would output their news articles. Because ChatGPT was trained on “massive amounts of information from innumerable sources on almost any given subject…the likelihood that ChatGPT would output plagiarized content from one of Plaintiffs’ articles seems remote.” Courts granted motions to dismiss similar DMCA claims in Andersen v. Stability AI, Ltd., The Intercept Media, Inc. v. OpenAI, Inc., Kadrey v. Meta Platforms, Inc., and Tremblay v. OpenAI.
Another such case, Doe v. GitHub, Inc., will soon be argued in the Ninth Circuit.
Copyright Infringement Claims
Rightsholders also assert ordinary copyright infringement, and the initial holdings are a mixed bag. In Kadrey v. Meta Platforms, Inc., for example, the court dismissed “nonsensical” claims that Meta’s LLaMA models are themselves infringing derivative works. In Andersen v. Stability AI Ltd., however, the court held that copyright claims premised on the plaintiffs’ works being included in a training data set could go forward, where prompts using the plaintiffs’ names generated images that were “similar to plaintiffs’ artistic works.” The court also held that the plaintiffs plausibly alleged that the model was designed to “promote infringement,” for similar reasons.
It's early in the case — the court was merely deciding if the plaintiffs had alleged enough to justify further proceedings — but it’s a dangerous precedent. Crucially, copyright protection extends only to the actual expression of the author — the underlying facts and ideas in a creative work are not themselves protected. That means that, while a model cannot output an identical or near-identical copy of a training image without running afoul of copyright, it is free to generate stylistically “similar” images. Training alone is insufficient to give rise to a claim of infringement, and the court impermissibly conflated permissible “similar” outputs with the copying of protectable expression.
Fair Use
In most of the AI cases, courts have yet to consider — let alone decide — whether fair use applies. In one unusual case, however, the judge has flip-flopped: he initially found the defendant’s use fair, then changed his mind. That case, Thomson Reuters Enterprise Centre GMBH v. Ross Intelligence, Inc., concerns legal research technology. Thomson Reuters provides search tools to locate relevant legal opinions and prepares annotations describing the opinions’ holdings. Ross Intelligence hired lawyers to review those annotations and rewrite them in their own words; the rewritten versions were used to train Ross’s search tool, which ultimately provided users with relevant legal opinions based on their queries. Originally, the court got it right, holding that if the AI developer used copyrighted works only “as a step in the process of trying to develop a ‘wholly new,’ albeit competing, product,” that’s “transformative intermediate copying,” i.e. fair use.
After reconsidering, however, the judge changed his mind in several respects, essentially disagreeing with prior case law regarding search engines. We think it’s unlikely that an appeals court would uphold this divergence from precedent. But if it did, it would present legal problems for AI developers — and anyone creating search tools.
Copyright law favors the creation of new technology to learn and locate information, even when developing the tool required copying books and web pages in order to index them. Here, the search tool is providing links to legal opinions, not presenting users with any Thomson Reuters original material. The tool is concerned with non-copyrightable legal holdings and principles, not with supplanting any creative expression embodied in the annotations prepared by Thomson Reuters.
Thomson Reuters has often pushed the limits of copyright in an attempt to profit off of the public’s need to access and refer to the law, for instance by claiming a proprietary interest in its page numbering of legal opinions. Unfortunately, the judge in this case enabled them to do so in a new way. We hope the appeals court reverses the decision.
The Side Deals
While all of this is going on, developers that can afford it — OpenAI, Google, and other tech behemoths — have inked multimillion-dollar licensing deals with Reddit, the Wall Street Journal, and myriad other corporate copyright owners. There’s suddenly a $2.5 billion licensing market for training data — even though the use of that data is almost certainly fair use.
What’s Missing
This litigation is getting plenty of attention, and it should: the stakes are high. Unfortunately, the real stakes are getting lost. These cases are not just about who will get the most financial benefits from generative AI. The outcomes will decide whether a small group of corporations that can afford big licensing fees will determine the future of AI for all of us.
Part Two: AI and Copyright: Expanding Copyright Hurts Everyone — Here’s What to Do Instead
You shouldn't need a permission slip to read a webpage – whether you do it with your own eyes, or use software to help. AI is a category of general-purpose tools with myriad beneficial uses. Requiring developers to license the materials needed to create this technology threatens the development of more innovative and inclusive AI models, as well as important uses of AI as a tool for expression and scientific research.
Threats to Socially Valuable Research and Innovation
Requiring researchers to license fair uses of AI training data could make socially valuable research based on machine learning (ML) and even text and data mining (TDM) prohibitively complicated and expensive, if not impossible. Researchers have relied on fair use to conduct TDM research for a decade, leading to important advancements in myriad fields. However, licensing the vast quantity of works that high-quality TDM research requires is frequently cost-prohibitive and practically infeasible.

Fair use protects ML and TDM research for good reason. Without fair use, copyright would hinder important scientific advancements that benefit all of us. Empirical studies back this up: research using TDM methodologies is more common in countries that protect TDM research from copyright control; in countries that don’t, copyright restrictions stymie beneficial research. It’s easy to see why: it would be impossible to identify and negotiate with millions of different copyright owners to analyze, say, text from the internet.
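To make the mechanics concrete, here is a minimal, hypothetical sketch of the kind of computation TDM research involves: tallying term frequencies across a corpus of documents. The corpus directory and file layout are illustrative assumptions, not a real dataset.

```python
# Minimal, hypothetical sketch of a text-and-data-mining analysis:
# count term frequencies across a corpus of plain-text documents.
# The "corpus/" directory is an assumed placeholder, not a real dataset.
from collections import Counter
from pathlib import Path
import re

def term_frequencies(corpus_dir: str) -> Counter:
    """Tokenize every .txt file in corpus_dir and tally word counts."""
    counts = Counter()
    for path in Path(corpus_dir).glob("*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore").lower()
        counts.update(re.findall(r"[a-z']+", text))  # crude tokenizer
    return counts

if __name__ == "__main__":
    freqs = term_frequencies("corpus/")  # hypothetical folder of texts
    print(freqs.most_common(20))         # the 20 most frequent terms
```

Note what even this toy analysis requires: the software must read each work in full (the copying that fair use protects), while its output is unprotectable facts about the works, not anyone’s expression.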
The stakes are high, because ML is critical to helping us interpret the world around us. It's being used by researchers to understand everything from space nebulae to the proteins in our bodies. When the task requires crunching a huge amount of data, such as the data generated by the world’s telescopes, ML helps rapidly sift through the information to identify features of potential interest to researchers. For example, scientists are using AlphaFold, a deep learning tool, to understand biological processes and develop drugs that target disease-causing malfunctions in those processes. The developers released an open-source version of AlphaFold, making it available to researchers around the world. Other developers have already iterated upon AlphaFold to build transformative new tools.
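As a toy illustration of that sifting step, the sketch below uses an off-the-shelf anomaly detector (scikit-learn’s IsolationForest) to flag rare observations in a synthetic dataset for human follow-up. The features and numbers are invented for illustration; real survey pipelines use domain-specific features and far larger data volumes.

```python
# Illustrative sketch: flagging unusual observations in a large survey
# with an off-the-shelf anomaly detector. All data here are synthetic.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
ordinary = rng.normal(0.0, 1.0, size=(100_000, 4))  # the bulk of the survey
unusual = rng.normal(6.0, 1.0, size=(50, 4))        # rare outliers of interest
observations = np.vstack([ordinary, unusual])

detector = IsolationForest(contamination=0.001, random_state=0)
labels = detector.fit_predict(observations)          # -1 marks anomalies
candidates = observations[labels == -1]
print(f"{len(candidates)} observations flagged for human review")
```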
Threats to Competition
Requiring AI developers to get authorization from rightsholders before training models on copyrighted works would limit competition to companies that have their own trove of training data, or the means to strike a deal with such a company. This would result in all the usual harms of limited competition — higher costs, worse service, and heightened security risks — as well as reducing the variety of expression used to train such tools and the expression allowed to users seeking to express themselves with the aid of AI. As the Federal Trade Commission recently explained, if a handful of companies control AI training data, “they may be able to leverage their control to dampen or distort competition in generative AI markets” and “wield outsized influence over a significant swath of economic activity.”
Legacy gatekeepers have already used copyright to stifle access to information and the creation of new tools for understanding it. Consider, for example, Thomson Reuters v. Ross Intelligence, widely considered to be the first lawsuit over AI training rights ever filed. Ross Intelligence sought to disrupt the legal research duopoly of Westlaw and LexisNexis by offering a new AI-based system. The startup attempted to license the right to train its model on Westlaw’s summaries of public domain judicial opinions and its method for organizing cases. Westlaw refused to grant the license and sued its tiny rival for copyright infringement. Ultimately, the lawsuit forced the startup out of business, eliminating a would-be competitor that might have helped increase access to the law.
Similarly, shortly after Getty Images — a billion-dollar stock images company that owns hundreds of millions of images — filed a copyright lawsuit asking the court to order the “destruction” of Stable Diffusion over purported copyright violations in the training process, Getty introduced its own AI image generator trained on its own library of images.
Requiring developers to license AI training materials benefits tech monopolists as well. For giant tech companies that can afford to pay, pricey licensing deals offer a way to lock in their dominant positions in the generative AI market by creating prohibitive barriers to entry. To develop a “foundation model” that can be used to build generative AI systems like ChatGPT and Stable Diffusion, developers need to “train” the model on billions or even trillions of works, often copied from the open internet without permission from copyright holders. There’s no feasible way to identify all of those rightsholders — let alone execute deals with each of them. Even if these deals were possible, licensing that much content at the prices developers are currently paying would be prohibitively expensive for most would-be competitors.
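Some back-of-envelope arithmetic shows why. Every figure in the sketch below is a deliberately conservative assumption invented for illustration, not a reported market price; even so, the totals land in the hundreds of millions of dollars and scale linearly with corpus size and per-work price.

```python
# Back-of-envelope sketch of why per-work licensing doesn't scale.
# All figures below are hypothetical assumptions, not reported prices.
works_in_training_set = 2_000_000_000  # "billions of works" (assumed)
price_per_work = 0.05                  # hypothetical 5-cent license fee
deal_overhead = 500.0                  # hypothetical cost to negotiate one deal
works_per_rightsholder = 10_000        # hypothetical average catalog size

license_fees = works_in_training_set * price_per_work
deals_needed = works_in_training_set / works_per_rightsholder
negotiation_cost = deals_needed * deal_overhead

print(f"License fees alone: ${license_fees:,.0f}")      # $100,000,000
print(f"Deals to negotiate: {deals_needed:,.0f}")       # 200,000
print(f"Negotiation cost:   ${negotiation_cost:,.0f}")  # $100,000,000
```

At corpus sizes in the trillions, or at per-work prices anywhere near those implied by recent deals, the totals climb into the billions, before counting the rightsholders who cannot be identified or reached at all.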
We should not assume that the same companies who built this world can fix the problems they helped create; if we want AI models that don’t replicate existing social and political biases, we need to make it possible for new players to build them.
Nor is pro-monopoly regulation through copyright likely to provide any meaningful economic support for vulnerable artists and creators. Notwithstanding the highly publicized demands of musicians, authors, actors, and other creative professionals, imposing a licensing requirement is unlikely to protect the jobs or incomes of the underpaid working artists that media and entertainment behemoths have exploited for decades. Because of the imbalance in bargaining power between creators and publishing gatekeepers, trying to help creators by giving them new rights under copyright law is, as EFF Special Advisor Cory Doctorow has written, like trying to help a bullied kid by giving them more lunch money for the bully to take.
Entertainment companies’ historical practices bear out this concern. For example, from the late 2000s to the mid-2010s, music publishers and recording companies struck multimillion-dollar direct licensing deals with music streaming companies and video sharing platforms. Google reportedly paid more than $400 million to a single music label, and Spotify gave the major record labels a combined 18 percent ownership interest in its now-$100 billion company. Yet music labels and publishers frequently fail to share these payments with artists, and artists rarely benefit from these equity arrangements. There is no reason to believe that the same companies will treat their artists more fairly once they control AI.
Threats to Free Expression
Generative AI tools like text and image generators are powerful engines of expression. Creating content—particularly images and videos—is time-intensive and frequently requires tools and skills that many internet users lack. Generative AI significantly expedites content creation and reduces the need for artistic ability and expensive photographic or video technology. This facilitates the creation of art that simply would not otherwise exist and allows people to express themselves in ways they couldn’t without AI.
Some art forms historically practiced within the African American community — such as hip hop and collage — have a rich tradition of remixing to create new artworks that can be more than the sum of their parts. As professor and digital artist Nettrice Gaskins has explained, generative AI is a valuable tool for creating these kinds of art. Limiting the works that may be used to train AI would limit its utility as an artistic tool, and compound the harm that copyright law has already inflicted on historically Black art forms.
Generative AI has the power to democratize speech and content creation, much as the internet did. Before the internet, a small number of large publishers controlled the channels of speech distribution, deciding which material reached audiences’ ears. The internet changed that by allowing anyone with a laptop and a Wi-Fi connection to reach billions of people around the world. Generative AI magnifies those benefits by enabling ordinary internet users to tell stories and express opinions: they can generate text in a matter of seconds and easily create graphics, images, animation, and videos that, just a few years ago, only the most sophisticated studios had the capability to produce. Legacy gatekeepers want to expand copyright so they can reverse this progress. Don’t let them: everyone deserves the right to use technology to express themselves, and AI is no exception.
Threats to Fair Use
In all of these situations, fair use — the ability to use copyrighted material without permission or payment in certain circumstances — often provides the best counter to restrictions imposed by rightsholders. But, as we explained in the first post in this series, fair use is under attack by the copyright creep. Publishers’ recent attempts to impose a new licensing regime for AI training rights—despite lacking any recognized legal right to control AI training—threaten to undermine the public’s fair use rights.
By undermining fair use, the AI copyright creep makes all these other dangers more acute. Fair use is often what researchers and educators rely on to make their academic assessments and to gather data. Fair use allows competitors to build on existing work to offer better alternatives. And fair use lets anyone comment on, or criticize, copyrighted material.
When gatekeepers make the argument against fair use and in favor of expansive copyright—in court, to lawmakers, and to the public—they are looking to cement their own power, and undermine ours.
A Better Way Forward
AI also poses real harms that demand real solutions.
Many creators and white-collar professionals increasingly believe that generative AI threatens their jobs. Many people also worry that it enables serious forms of abuse, such as AI-generated non-consensual intimate imagery, including of children. Privacy concerns abound, as does consternation over misinformation and disinformation. And it’s already harming the environment.
Expanding copyright will not mitigate these harms, and we shouldn’t forfeit free speech and innovation to chase snake oil “solutions” that won’t work.
We need solutions that address the roots of these problems, like inadequate protections for labor rights and personal privacy. Targeted, issue-specific policies are far more likely to succeed in resolving the problems society faces. Take competition, for example. Proponents of copyright expansion argue that treating AI development like the fair use that it is would only enrich a handful of tech behemoths. But imposing onerous new copyright licensing requirements to train models would lock in the market advantages enjoyed by Big Tech and Big Media — the only companies that own large content libraries or can afford to license enough material to build a deep learning model — profiting entrenched incumbents at the public’s expense. What neither Big Tech nor Big Media will say is that stronger antitrust rules and enforcement would be a much better solution.
What’s more, looking beyond copyright future-proofs the protections. Stronger environmental protections, comprehensive privacy laws, worker protections, and media literacy will create an ecosystem where we will have defenses against any new technology that might cause harm in those areas, not just generative AI.
Expanding copyright, on the other hand, threatens socially beneficial uses of AI — for example, to conduct scientific research and generate new creative expression — without meaningfully addressing the harms.