Training Data Infringement

Copyright or IP claims brought against an AI developer or deployer for the unlicensed use of protected works in model training datasets.

Training data infringement is the family of copyright and intellectual property claims brought against AI developers and deployers for the unlicensed use of protected works in model training. The primary legal theory is direct copyright infringement (the unauthorized reproduction of copyrighted works during dataset assembly and training), with derivative claims over model outputs that allegedly reproduce protected expression. Plaintiffs include book authors, news publishers, photo agencies, music labels, owners of software code repositories, and individual creators whose works were ingested.

The landscape changed substantially with Anthropic's $1.5 billion class action settlement (Bartz v. Anthropic, 2025), the largest training data settlement to date, in which Anthropic agreed to pay approximately $3,000 per work to authors of approximately 500,000 books. Thomson Reuters v. ROSS Intelligence is the leading decision reached on the merits rather than by settlement. A 2025 ruling for Thomson Reuters held that using Westlaw headnotes to train a competing legal AI was not fair use, the clearest published holding to date that model training on copyrighted material can constitute infringement. The decision is on interlocutory appeal to the Third Circuit, which granted the petition in 2025.

The exposure runs in two directions. The foundation model developer (OpenAI, Anthropic, Google, Meta, Stability AI, Midjourney) faces direct training claims. The deployer faces derivative claims when the deployed model produces output that reproduces protected expression, with potential indemnification from the foundation model vendor under commitments such as Microsoft's Copilot Copyright Commitment or Google Cloud's Generative AI Indemnification. These indemnities are narrowly scoped and do not extend to all deployer use cases.

Insurance coverage is fragmented and evolving. Generative AI Liability forms typically include intellectual property infringement in their insuring agreements where the deployer is sued for the model's outputs. They generally do not cover the foundation model developer's direct training claims (those are the developer's own D&O and IP exposures). Older Tech E&O, Cyber, and Media Liability policies rarely respond to AI-driven IP claims; affirmative coverage in a standalone Generative AI Liability policy is the structural answer for the deployer.

Also known as

AI Training Data Infringement, Model Training IP Liability, Training Dataset Copyright Claims

Frequently asked

What was the Anthropic $1.5 billion settlement?

Bartz v. Anthropic ended in a 2025 class action settlement in which Anthropic agreed to pay approximately $1.5 billion (about $3,000 per work for roughly 500,000 books) to authors whose books were used in training without authorization. It is the largest AI training data settlement to date and a reference point for the financial scale of the exposure. The settlement does not establish legal precedent on the merits of training-data fair use, but it does set a market benchmark for what plaintiffs are accepting and defendants are paying.
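
As a quick check on the figures above, the per-work payment and the book count multiply out to the headline total. The sketch below is illustrative only: the function name is hypothetical, and both inputs are the rounded approximations quoted in this entry, not exact settlement terms.

```python
# Rounded figures as quoted in this entry (both are approximations).
PER_WORK_USD = 3_000   # approximate payment per book
WORKS = 500_000        # approximate number of books covered


def settlement_total(per_work_usd: int, works: int) -> int:
    """Return the total implied by a flat per-work payment (hypothetical helper)."""
    return per_work_usd * works


total = settlement_total(PER_WORK_USD, WORKS)
print(f"${total:,}")  # $1,500,000,000
```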

What did Thomson Reuters v. ROSS Intelligence decide?

A 2025 federal district court ruling held that ROSS Intelligence's use of Thomson Reuters' Westlaw headnotes to train a competing legal AI was not fair use. It is the leading published merits decision on AI training data infringement and is widely read as cutting against blanket fair-use defenses for using competitor content to train AI systems. The decision is on interlocutory appeal to the Third Circuit, which granted the petition in 2025. It is nonetheless already being cited in pending matters against other AI developers and is one of the legal underpinnings of the more cautious model training practices several major developers have adopted since 2025.

Related terms

generative AI liability insurance overview

General information, not legal or insurance advice.