Discussion about this post

Wil

James, it sounds like I would disagree with you on the ethics of all this, but you’re making the argument here that I’ve struggled to make in a clear way. I learned a lot, and I only wish all the people in comments sections screaming “AI IS THEFT!!!” would read this.

As to this: “That doesn’t change the fact that the AI companies have created a temporary copy of all of their training data...” I’ve long said busting AI companies on this point is like busting Al Capone for tax evasion. Maybe it’s a crime, but it’s not the crime everyone is screaming about. And as you point out, the Google case offers protections. That “[c]omplete unchanged copying has repeatedly been found justified...” clause is key.

On this note: “There is nothing stopping anyone from taking the facts from a news story and retelling them: they have no legal protection whatsoever.” Thus explaining the rise of articles that seem to be no more than a retelling of a NY Times or Washington Post piece with (perhaps) a little opinion thrown in.

As to this: “If AI companies were absolutely certain about the legal basis of what they were doing, they would have no need to hide any details about it.” Couldn’t it be that revealing the details would expose information about the algorithms used, trade secrets, etc.?

There’s one key point you don’t discuss - the use of pirated ebooks. Even if I can copy a book (temporarily, for transformative purposes), I still need to pay for it, no? (I’ve seen some interesting arguments that say “maybe not,” but they don’t feel open and shut.) Of course, if a court forced the AI companies to “buy” all the books in the pirated datasets, and even added a punitive fine, I doubt that would break the bank.

Jemma Hooper

I'm not an IP lawyer, but I have had some training in commercial and IP law as an engineer, musician and writer. A few points:

1. Fair use is subject to a number of tests. One of the key ones is public utility. While LLMs provide some measure of public utility on the surface, the fact that these services are monetising that utility diminishes the "public" nature of it.

2. Fair use is an exception in copyright law - an edge case. It is not the core principle from which such laws are generally applied. On the other hand, CONSENT is very much a core principle. There'd be a strong equity argument that consent cannot be countermanded by fair use exceptions when those exceptions are invoked at the wholesale scale at which AI companies have so flagrantly bypassed consent. In copyright law, there is no such thing as "asking for forgiveness because asking for permission takes longer." OpenAI, Google and their ilk are trying to convince us that forgiveness is an available option when it's not.

3. In academic publishing, there is a principle of global consent to "fair use" in creating new works that contribute to the "state of art." However, there are clear demonstrations available which show that LLM generated academic papers have significantly muddied the "state of art" in multiple fields. Many journal publishers are having to heavily rework their processes for peer review after a massive influx of LLM-generated papers which are being used to cynically manipulate citation counts and support the employability of demonstrable charlatans. I think a "reasonable person" test would find that such usages would indeed NOT pass muster as fair use.

4. Public-adjacent domain licenses (e.g. GPL, LGPL, MIT license, Creative Commons licenses) are a special case that also needs consideration. They provide for open use of a given IP subject to strict terms around attribution, non-commercial use, etc. Simple scraping of works licensed via such models does not require fair use as the license is already permissive. I would argue that fair use becomes inapplicable in such cases. However, reproduction and/or creation of derivative works from said IP without adhering to the license terms would be a clear breach of the license terms, requiring redress. If AI vendors/model trainers have not considered this, they are probably in breach of copyright. They also risk the specific case extending to the general (i.e. breach of publishing licenses) if this specific argument is upheld.

5. This is a weird edge case, but bear with me. Let's say someone writes a book, paper, song, etc... that is so distinctive that there's only one instance of it that an LLM-like ingest process can generate weights for in a given category. The chances that those weights will reproduce something very similar to that work - to the point of duplication - might leave the AI vendors having to prove that substantial modification is built into their algorithms, even for singular/small sample sizes for a given domain/category. The burden of proof would most reasonably fall on them in this circumstance.

6. There's a legal precept called "the fruit of the poisonous tree," which will be a huge risk for AI vendors - and one that, if they are smart, they will settle out of court for in their larger disputes. This is an argument that the IP holders/publishers should have a field day with. Especially if the LLM owners have ingested works that were DRMed, but got around that by finding rogue PDFs, unlicensed EPK/PUB files, etc. If they got around this by purchasing journal subscriptions, but the journal's access T&Cs did not give permission to scrape more than a specific amount of any given article, they are in breach of contract. That obviates fair use.

7. "I build a custom lawnmower for my lawnmowing business. My neighbour takes my lawnmower without my consent and charges my customers and others for lawnmowing services. The income my neighbour receives using my lawnmower is effectively the proceeds of an unlawful act, and said neighbour is not entitled to it, even though he pushed the lawnmower around the respective yards. Giving my lawnmower back to me does not entitle him to keep the illegally obtained income." Jury nullification, done.

I'd be very careful saying "The AI companies are probably in the clear." These principles would all be fair game for a Court of Equity argument, even if a fair use argument were accepted under common law. While the AI vendors might get an initial pass, appeals and potential mistrial motions under these principles could get very expensive for them - to the extent that I wouldn't be in any rush to purchase AI stocks.
