“While using books as part of data sets is not inherently problematic, using pirated (or stolen) books does not fairly compensate authors and publishers for their work,” the plaintiffs, who include Huckabee and Christian writers and podcasters such as Tsh Oxenreider and Lysa TerKeurst, said in the lawsuit. The suit targets Meta, Microsoft and financial data provider Bloomberg L.P., all of which have trained their own “large language models” (the giant algorithms that power tools like ChatGPT) using data from the web.
The lawsuit zeroes in on an infamous collection of pirated books known as “Books3,” which the plaintiffs allege was included in “The Pile,” a freely available collection of data sources compiled by nonprofit group EleutherAI to give smaller companies access to more data for training their own AI. The suit also names EleutherAI as a defendant. A proposed class action, it seeks damages and an injunction barring the companies from continuing to use the plaintiffs’ works.
A spokesperson for Microsoft declined to comment. Spokespeople for Meta, Bloomberg and EleutherAI did not respond to requests for comment.
Large language models are generally trained on billions of sentences of text pulled from the internet, including news stories, Wikipedia and comments on social media sites. OpenAI and other AI companies such as Google and Microsoft do not say specifically which data they use, but AI critics have long suspected that it includes collections of pirated books.
The battle over whether companies can take data from the internet without payment or permission to train their potentially lucrative AI models is only heating up. Multiple lawsuits from comedians, writers and artists have targeted the tech companies. Tech executives argue that taking data from the public web falls under “fair use,” a concept in copyright law that creates exemptions for works that are substantially different from the source material they may be derived from.