Updated: Jan 20
Four Major Weaknesses in Open-Source Coders' Class Action Suit
We live in a world where bots can write computer code. The bot—or “language model”—writes the code by offering to autocomplete code that got started by a human programmer. It works similarly to other AI tools that autocomplete natural language. Take for example Gmail’s ability to finish sentences in an email. When I write [“Please come to my office on Monday. When ”] Gmail offers to autocomplete my sentence with [“are you available?”]
Microsoft’s programming tool Copilot works the same, but with programming language instead of “natural” language.
Language models are made by “training” them with large sets of language data, typically text scraped from the internet. The data is fed to the model, which then "learns" to predict the next words. It’s how Gmail’s language model knows that the phrase “are you available?” is statistically likely to follow the word “When” in my above example. OpenAI, a San Francisco-based startup, created Codex, a large language model that was allegedly trained using all the open-source code stored on GitHub. Codex can do autocomplete, like the above, on code. Microsoft modified Codex and made it commercially available to GitHub users, calling it Copilot.
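For readers who like to see the idea in code, here is a toy sketch of "predict the next word" autocomplete. It is only an illustration under loud assumptions: it counts which word follows which in a three-sentence corpus I made up, and the names `train_bigram_model`, `autocomplete`, and `TOY_CORPUS` are hypothetical. It is not how Gmail, Codex, or Copilot is actually built; those systems use far larger neural models.

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count, for each word, which words follow it in the training text."""
    following = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            following[current_word][next_word] += 1
    return following

def autocomplete(model, prompt, max_words=3):
    """Greedily append the most frequent next word, up to max_words."""
    words = prompt.lower().split()
    completion = []
    for _ in range(max_words):
        candidates = model.get(words[-1])
        if not candidates:
            break  # the model never saw this word during training
        next_word = candidates.most_common(1)[0][0]
        completion.append(next_word)
        words.append(next_word)
    return " ".join(completion)

# Hypothetical training data, invented for this illustration.
TOY_CORPUS = [
    "when are you available?",
    "when are you available?",
    "when is the meeting",
]

model = train_bigram_model(TOY_CORPUS)
suggestion = autocomplete(model, "Please come to my office on Monday. When")
print(suggestion)
```

Because "are" follows "when" more often than "is" does in the toy corpus, the sketch suggests the same completion as the Gmail example. Note that the suggestion is also a verbatim copy of a training sentence, which is the behavior at the heart of the lawsuit.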
A bunch of programmers are now suing Microsoft and OpenAI for violations related to the Digital Millennium Copyright Act (DMCA) and other laws because of how Codex and Copilot work. The plaintiffs allege that Codex and Copilot violate the open-source programmers’ copyright licenses because Copilot will sometimes regurgitate code it was trained on, without following the requirements of the open-source coders’ licenses. Plaintiffs claim that Copilot violates the plaintiffs’ licenses because 1) Copilot fails to give attribution to the author of the open-source code when Copilot outputs that same code for a given user, 2) Copilot fails to include a copyright notice, and 3) Copilot fails to provide a copy of the license terms to the Copilot user. The DMCA prohibits the removal of "Copyright Management Information" (CMI), such as a copyright notice or license information. Because Copilot purportedly regurgitates code without the included copyright and licensing information, Plaintiffs allege that Microsoft and OpenAI violate the DMCA and various other common laws and statutes.
This suit is brought as a class action. That means a few individuals hope to represent a “class” of people who face similar issues of law or fact. The class representatives file a complaint on behalf of the whole group, and in the complaint, they define the class of people who are represented in the matter. Here, the class is defined as coders who wrote open-source code and stored it on GitHub for use by the public under more than 13 different licenses.
Class actions as a legal device only work if individual class members do not have to present evidence. If the class representatives are successful in proving their claim, then it is sufficient for the whole class. So here, the Copilot plaintiff class representatives allege that the copyright license violations they experienced because of Codex and Copilot can represent the harms experienced by other open-source coders who stored their code on GitHub. If the class representatives succeed in their litigation, all members of the class will receive a remedy to their harm, without ever having to put work into bringing the lawsuit.
But in order to bring a lawsuit as a class action, a judge will first have to determine whether a class can be “certified.” That is, the judge will have to determine whether the legal and factual issues really are common enough for the suit to proceed on a class-basis, or whether members of the purported class will have to bring lawsuits on their own and present evidence of their own individual harms. Generally, if there are more differences than there are similarities in the issues of fact and law, a class will not be certified. If a judge does certify a class, however, the plaintiffs are in a very strong position. The defendants in that scenario will almost always want to settle the case and end up paying large sums to be distributed amongst the many class members. If plaintiffs in the Copilot lawsuit were to succeed in certifying a class, Microsoft and OpenAI would likely be on the hook to pay millions of dollars in damages or provide some other costly relief.
But I think that outcome is unlikely. The reason is that there are several weaknesses in the plaintiffs’ complaint that make class certification improbable. This article is the first in a series in which I describe what I believe to be the weaknesses in the lawsuit thus far. In this article I describe weaknesses associated with class certification, particularly of the copyright-related DMCA claims. In later articles, I’ll explore other topics, such as whether any meaningful remedies are available for the complainants, and whether the fair use doctrine provides an effective defense. So, without further ado, four class certification weaknesses for the Digital Millennium Copyright Act claims in the plaintiffs’ complaint:
1. Not all the coders in the plaintiffs’ class own copyrights in any code Codex or Copilot outputs.
For any person to own copyright in any text, the text must be original. Whether any, or even a small fraction, of the countless lines of code produced by Copilot is original enough to qualify as copyrighted work is something that will take extensive investigation by a judge or jury—something that is nearly impossible to do on a class-wide basis. Take for example the above-mentioned Gmail autocomplete example. No one could reasonably accuse Google of copyright violation because the phrase “are you available?” is not original and therefore not copyrightable. The phrase has been written before in countless other texts. The same could be argued for Codex and Copilot. Much of the code that these models output may be as common as the phrase “when are you available?”
The complaint does point to one output with a telltale sign of copying: code that contains question marks. Plaintiffs argue that the inclusion of the question marks in this context proves that the text was “copied” from Haverbeke’s coding textbook, which used the question marks as a placeholder to indicate where the person working through the exercise should enter their response. The question marks are not a typical or logical part of code.
But not all, or even most, open-source code has such idiosyncratic tells of originality. It is hard to imagine that all, or even most, members of the plaintiff class would be able to recognize their own code in any given Codex or Copilot output. According to the complaint, GitHub has conceded that about 1% of the time, Copilot outputs some code snippets from its training data that are longer than 150 characters. But it is not clear what percentage of those snippets constitute copyrighted text.
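To make the 150-character figure concrete, here is a minimal sketch of how one might check a single output against a single training file for a long verbatim overlap. It uses Python's standard `difflib` module; the function names and the idea of comparing one pair at a time are my own assumptions, not anything described in the complaint or by GitHub.

```python
from difflib import SequenceMatcher

# The 150-character figure is the one cited from the complaint above.
THRESHOLD_CHARS = 150

def longest_verbatim_overlap(output, training_file):
    """Return the longest run of characters appearing in both texts."""
    # autojunk=False disables difflib's heuristic that ignores very
    # frequent characters in long sequences.
    matcher = SequenceMatcher(None, output, training_file, autojunk=False)
    match = matcher.find_longest_match(0, len(output), 0, len(training_file))
    return output[match.a : match.a + match.size]

def exceeds_threshold(output, training_file, threshold=THRESHOLD_CHARS):
    """True if the texts share a verbatim run longer than the threshold."""
    return len(longest_verbatim_overlap(output, training_file)) > threshold
```

Even a positive result here only shows that text was repeated. It says nothing about whether the repeated text is original enough to be copyrightable, and it has to be run pair by pair, which is the instance-by-instance inquiry that is so hard to square with a class action.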
Proving that all or even some of the code the purported class members wrote is copyrightable is not possible to do on a class-wide basis. It would require a judge or jury to opine on an instance-by-instance basis whether a given class member’s purportedly copyrighted code was original enough to qualify for a copyright. If there is no copyright, then there is no requirement to comply with licensing terms, and no DMCA protections.
2. Not all class members can allege their copyright was actually regurgitated.
To succeed in a DMCA claim, there must be actual distribution of the copyrighted work without the required Copyright Management Information (CMI). The plaintiffs’ class is defined as anyone who stored their open-source code on GitHub; it doesn’t limit class members to those whose code was copied and output by Copilot without the required CMI. According to former GitHub CEO Nat Friedman, such copied outputs are "extremely rare," and as of June 29, 2021, GitHub was working on preventing them entirely. So, the class almost certainly includes people whose code (even if it was copyrighted) was never actually illegally reproduced by Copilot.
In other words, just because Copilot was trained on a person’s code doesn’t mean that it actually has regurgitated that person’s code for another user in a way that constitutes a violation of the DMCA. It is doubtful that a judge will certify a class where not every class member can even allege that they experienced an actual DMCA violation.
3. The legal issues are not common—there are more than 13 different licenses at issue.
As I was reading the complaint, I did a double take when I discovered that the plaintiffs’ class included copyright protection under more than 13 different licenses. That means the legal rights at issue differ in at least 13 ways. It's unclear that the two plaintiffs named in the complaint have all 13 licenses between them. If not, the would-be class representatives of the lawsuit don't have the same legal rights as members of the class they purport to represent. Plaintiffs attempt to get around this problem by alleging that all the licenses share three qualities. They all require users of the open-source code to 1) give attribution to the code author, 2) include a copyright notice, and 3) include the text of the license terms. They allege that because Copilot fails to comply with these three requirements, it violates all licenses.
Forming a class based on three common attributes of those licenses, however, would be a very unorthodox way of certifying a class, and whether case law supports it requires further investigation. I am sure Microsoft and OpenAI’s lawyers are burning billable hours looking for case law supporting the opposite position: that no court should certify a class where the legal rights of the class members are protected by more than 13 different legal documents. My intuition is that they will find helpful precedent. They might start with a recent Ninth Circuit ruling, where the appellate court overturned a decision by a district judge to certify a class of musicians and a class of composers who sued for copyright violations. The plaintiffs alleged that their concert recordings had been distributed on a website in violation of their copyright licenses. However, similar to the Copilot plaintiffs, the class included members with a diversity of legal rights to use the copyright. The lower court had concluded that class certification was warranted because Defendants had pointed to written agreements with "substantially identical material terms." But in overturning that decision, the Circuit court pointed out that "those agreements in fact vary as to the artist, performances, and rights they purport to cover." Class certification was inappropriate because the “individual issues of license and consent” were more numerous than the facts and legal issues in common. See Kihn v. Bill Graham Archives LLC, No. 20-17397, at *5, n. 2 (9th Cir. Jan. 3, 2022) (available on Casetext). In other words, there were too many differences in the plaintiffs' legal rights for the lawsuit to be brought as a class. Precise legal documents exist for a reason, and according to innumerable precedents, differences in these documents matter as much as their similarities.
4. None of the plaintiffs appear to have registered their copyrights.
An interesting fact about copyright law: no registration of the copyright is required to own a copyright. A copyright comes into existence and belongs to the author of an original work as soon as that work is put down in a tangible medium of expression, like a piece of paper or a memory card. But there is a catch. If the owner of the copyright wants to actually enforce that copyright by bringing a lawsuit, she must first have registered that copyright with the U.S. Copyright Office.
There used to be some ambiguity about what “registered” means. Some courts held that as long as a completed application was submitted to the Copyright Office, that sufficed to allow an author to bring a copyright lawsuit. Other courts held that the Copyright Office must actually process the application and issue a copyright registration or reject the application before the lawsuit may be filed. In 2019, the Supreme Court resolved that ambiguity. It held that a copyright owner may not file an infringement lawsuit until the Copyright Office has issued a copyright registration or refused to do so. See Fourth Estate Public Benefit Corporation v. Wall Street.com, LLC, 139 S. Ct. 881, 888-889 (2019) (available on the Caselaw Access Project).
In a way, the requirement makes sense. The Copyright Office does the work of evaluating whether an applicant’s work is actually copyrightable by investigating, among other things, whether the work is original. A litigant can still bring suit based on copyright violation even if the Copyright Office refuses to grant the registration (the office just has to act on the application). But a rejection by the Copyright Office might help keep frivolous infringement suits out of the courts.
Nowhere in the GitHub Copilot complaint is there any allegation that the plaintiffs seeking to represent the class ever filed a registration application, let alone that such an application was acted on by the Copyright Office. After the Fourth Estate ruling, it is arguable that bringing a class action for copyright violations is now nearly impossible because every single member of the class will have had to register their copyrights to maintain the lawsuit. Doing so is no small matter. Each registration can take up to 11 months and costs between $35 and $800. I suspect that Microsoft and OpenAI's first order of business will be to file for a dismissal of these copyright claims based on the failure of registration.
One plausible counterargument that the plaintiffs will make is that copyright registration is not a requirement for suits filed under the DMCA. In addition to the U.S. Constitution, there are two major federal laws that establish copyright protections: the U.S. Copyright Act (enacted in 1976) and the Digital Millennium Copyright Act (enacted in 1998). Here, the plaintiffs bring claims under the DMCA. Both of the laws have been codified and exist together in Title 17 of the United States Code. But the provision (Section 411) that creates the requirement to register a copyright was made law under the U.S. Copyright Act. Later, additional copyright protections were added to the Code by the enactment of the DMCA. One of those DMCA provisions is Section 1202, which makes removal of Copyright Management Information (CMI) unlawful. Plaintiffs brought suit under Section 1202, among others. They may argue that because Section 1202 was added under the DMCA, and the provision that creates a registration requirement to bring a lawsuit (Section 411) was enacted under the U.S. Copyright Act, Section 411 does not apply to DMCA claims. On that view, only lawsuits brought under the provisions enacted under the U.S. Copyright Act require registration.
The plaintiffs in the GitHub Copilot lawsuit face an uphill battle in preserving their DMCA claims. Even assuming that there is no registration requirement for copyright claims under the DMCA, the likelihood of certifying a class for the DMCA claims is low. Whether the class members actually wrote original open-source code that would qualify for a copyright is not a determination that can be made on a class-wide basis. Furthermore, whether all the class members experienced a DMCA violation is also a question of fact that cannot be resolved by the class representatives alone. Finally, a judge is not likely to certify a class where the legal rights at issue are memorialized in no fewer than 13 different licenses.
There are, however, other claims in plaintiffs’ suit that require further investigation. For example, they allege breach of contract, based on breach of the plaintiffs’ licenses’ terms. Can they enforce those licenses if they don’t even own registered copyrights? That is perhaps a question for a future article.
Please note, this article should not be construed as legal advice! I am not your lawyer.
By the way, if you're curious about the cover art, I made it using Dall-E, with the prompt, "a cat writing programming language, digital art."
Edits as of January 11, 2023
After publishing this article, I got a number of important follow up questions concerning my take on this suit. Below are my responses. I'll reiterate here: please do not take this as legal advice. I am not your lawyer.
- Do copyright owners have a remedy for any violations of the copyright that occurred before applying and obtaining registration?
Yes, the Supreme Court said as much in the Fourth Estate ruling: "Upon registration of the copyright, however, a copyright owner can recover for infringement that occurred both before and after registration." Fourth Estate Public Benefit Corporation v. Wall Street.com, LLC, 139 S. Ct. 881, 887 (2019) (available on the Caselaw Access Project).
- Does the copyright registration requirement apply to suits brought for violation of the Digital Millennium Copyright Act?
When I first wrote this article, I believed that the Supreme Court's ruling in Fourth Estate covered claims under both the U.S. Copyright Act and the DMCA. Certain lower courts share my opinion on this. See e.g., Sims v. Viacom, Inc., No. 2:11-cv-0675 (W.D. Pa. Jan. 31, 2012) ("As both parties acknowledge, 17 U.S.C. § 411(a) imposes a mandatory precondition that a copyright must be registered with the Copyright Office before a copyright infringement claim is filed. The Complaint in Sims I was filed in 2009. At that time, as both parties acknowledge, Sims had not registered his "Ghetto Fabulous" treatment with the Copyright Office. Accordingly, Sims could not have raised a copyright infringement claim or alleged DMCA violations in Sims I.") (emphasis added).
It appears, however, that there is ambiguity in the law concerning this issue. Other lower courts have concluded that the registration requirement does not apply to the DMCA. See e.g., Exec. Corp. v. Oisoon, LLC, No. 3:16-cv-00898 (M.D. Tenn. Sept. 28, 2017). This is an ambiguity that I am sure the plaintiffs in the Copilot lawsuit will rely on to argue that their claims are live, despite the failure to apply for copyright registration. I have modified language above to reflect this ambiguity in the law.