Can I Train My AI Model on Public Content I Found Online?

Dear Will & AiME,

Can I Train My AI Model on Public Content I Found Online?

— General Counsel at a Growth-Stage SaaS Company, San Francisco

Short answer💡

Not necessarily. Publicly available content is not automatically free to use for AI training—using it may raise copyright risks unless it’s licensed or clearly defensible under fair use.

Dear General Counsel at a Growth-Stage SaaS Company,

You're not the only one asking that. We’ve seen an uptick in businesses, especially those building AI capabilities in-house, wondering what they can do with all the digital content floating around online. After all, training large language models or other generative AI systems takes massive amounts of data, and most of it is already out there. But just because it’s “out there” doesn’t mean it’s “free to use.”

Here’s what the U.S. Copyright Office just said about this:

What the U.S. Copyright Office Says About AI Training Data

In its latest Copyright and AI report, the U.S. Copyright Office tackled whether using copyrighted content to train generative AI systems without permission constitutes infringement, and if so, whether fair use might apply. The report didn’t make new law, but it lays out a useful framework:

  1. The Office says that copying works to build AI training datasets likely does implicate copyright, particularly the reproduction right. That includes everything from downloading web pages to converting formats and storing data on training infrastructure.

  2. Training itself, meaning feeding those works into models to adjust weights and tune behavior, can also raise copyright issues, especially if a model “memorizes” content and can later reproduce it, even in part.

  3. The Copyright Office explains that copying data “because it’s online” doesn’t make it okay. Just because something is “publicly available” doesn’t mean it’s legally licensed for AI training. That distinction could be crucial in court.

How to Reduce Risk When Training AI Models on Third-Party Content

If you’re training your own model or working with vendors, the best move is to proactively manage content rights:

  • Avoid assuming fair use: The fair use defense is fact-specific and unsettled in this area. Courts haven’t reached consensus, and Congress hasn’t intervened…yet.

  • Use licensed datasets: That might mean working with content providers, publishers, or licensing intermediaries. It could cost more, but it lowers the risk of later takedown or litigation.

  • Track your data sources: Even if your team is scraping “public” sites, document where data comes from, how it’s filtered, and whether any opt-outs were offered or honored.
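For teams that want a starting point on the tracking step, here is a minimal sketch in Python of what a per-source provenance record might look like. It is an illustration under stated assumptions, not a compliance tool: the field names, the `make_record` helper, and the `ExampleTrainingBot` user agent are all hypothetical, and robots.txt is only one possible opt-out signal. A passing check is not a license and does not settle any fair use question.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from urllib.robotparser import RobotFileParser


@dataclass
class ProvenanceRecord:
    """One row of a data-source log (field names are illustrative)."""
    source_url: str
    retrieved_at: str                 # UTC timestamp, ISO 8601
    license_status: str               # e.g. "licensed", "unknown"
    filters_applied: list = field(default_factory=list)
    opt_out_signal_found: bool = False
    included_in_training: bool = True


def robots_allows(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check already-downloaded robots.txt text for this crawler and URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def make_record(url, robots_txt, user_agent, license_status, filters):
    """Build a log entry; a robots.txt disallow is treated as an opt-out."""
    allowed = robots_allows(robots_txt, user_agent, url)
    return ProvenanceRecord(
        source_url=url,
        retrieved_at=datetime.now(timezone.utc).isoformat(),
        license_status=license_status,
        filters_applied=list(filters),
        opt_out_signal_found=not allowed,
        included_in_training=allowed,  # assumed policy: honor the opt-out
    )


if __name__ == "__main__":
    robots = "User-agent: ExampleTrainingBot\nDisallow: /\n"
    record = make_record(
        "https://example.com/post",
        robots,
        "ExampleTrainingBot",
        license_status="unknown",
        filters=["dedupe", "pii-scrub"],
    )
    # Store the log as JSON lines so it can be produced later if needed.
    print(json.dumps(asdict(record)))
```

The point of the design is that each source gets a dated entry showing where it came from, how it was filtered, and whether an opt-out was found and honored, which is the kind of record the bullet above recommends keeping.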

This report doesn’t ban unlicensed training, but it raises a red flag for companies relying on “public availability” as a justification. If you’re building valuable IP on top of this foundation, it’s worth getting it right.

-Will & AiME

Takeaways

  • Training AI models on copyrighted content can trigger infringement, even if the content is freely available online.

  • Fair use is a defense, not a guarantee—and the legal landscape is far from settled.

  • Licensing content or using verified datasets is the safest path for companies investing in AI tools or capabilities.

Will Schultz & AiME

Will Schultz is an intellectual property and technology attorney and chair of Merchant & Gould’s Internet, Cybersecurity, and E-Commerce practice. He advises businesses on AI, online platforms, digital assets, and emerging technology law, drawing on experience as both a lawyer and entrepreneur.

https://www.merchantgould.com/people/william-d-schultz/