I Trained My AI on Public Data—Can I Sell the Outputs?
Dear Will & AiME,
Our company has been experimenting with building our own small language model for internal use. To train it, we used publicly available online data—blogs, reviews, government reports, and more. The model is performing well, and now leadership is asking if we can commercialize it by licensing the outputs to third parties. We didn't scrape anything behind a paywall. Are we okay to move forward, or are there IP risks even with public content?
— Innovation Lead in Minneapolis
Short answer 💡
Using public data to train AI does not automatically grant the right to commercialize the outputs. Businesses must assess copyright, terms of use, and how closely outputs reflect source material before selling AI-generated content.
Dear Innovation Lead in Minneapolis,
This question gets to the heart of the data economy: publicly available does not always mean freely usable, especially when it comes to training AI models. Let's walk through the key legal and practical issues, starting with what "public data" really means.
1. Public Access Doesn’t Mean Free to Use
Just because data is publicly accessible on the internet doesn't mean you're free to ingest it, train on it, and commercialize outputs without restriction. Many websites have:
Copyrighted content, even if freely readable,
Terms of service that limit scraping or downstream use,
robots.txt files or access controls that signal limitations or licensing terms to automated tools.
Even if data was technically accessible, the site's owner may challenge commercial use, especially if they believe their data was used to train a model that now competes with them or exploits their work.
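On the technical side, the minimum courtesy is to check a site's robots.txt before crawling. A minimal sketch using Python's standard-library `urllib.robotparser` (the user agent `my-training-bot` and the sample rules are illustrative placeholders, and passing robots.txt does not by itself resolve the legal questions above):

```python
# Hypothetical pre-crawl check: does this site's robots.txt permit our bot?
# Note: robots.txt compliance is a courtesy signal, not a license to the data.
from urllib.robotparser import RobotFileParser

def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt grants user_agent access to url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Illustrative robots.txt: reviews are off-limits, everything else is open.
robots = """\
User-agent: *
Disallow: /reviews/
Allow: /
"""

print(may_fetch(robots, "my-training-bot", "https://example.com/blog/post"))  # True
print(may_fetch(robots, "my-training-bot", "https://example.com/reviews/1"))  # False
```

In practice you would fetch the live robots.txt with `parser.set_url(...)` and `parser.read()`; parsing a string here just keeps the sketch self-contained.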
Recent industry developments, like the push toward RSL (Really Simple Licensing) and pay-per-crawl programs, signal a shift toward treating data as a licensable asset, even if it's not behind a login. As a licensing market for training data takes shape, a fair use defense becomes harder to sustain.
2. Why the Input–Output Relationship Matters
Legal questions around AI models often hinge on two dimensions:
Input use: Did you have the right to ingest the content for training?
Output use: Does the model generate content that reflects or replicates the original inputs?
If the outputs are generic or statistically derived (e.g., summaries, topic modeling, code generation), the risk may be lower. But if the outputs:
Closely resemble the original content,
Rely heavily on specific phrasing or structure,
Can be traced to a known source without sufficient transformation,
…then the copyright concerns become more serious. This is especially true in creative or editorial domains, like reviews, long-form writing, or commentary.
3. Scraping Rules Are Evolving & Enforceable
More platforms now enforce restrictions on automated data collection through:
Terms of service (clickwrap or browsewrap),
Use of bot detection or rate limiting,
Legal actions based on breach of contract, unfair competition, or anti-circumvention laws.
Courts have been split on whether terms of service are always enforceable against bots, but the trend is moving toward treating commercial-scale scraping as a regulated activity, especially when used to build products. This matters because it's not just about copyright. It's about whether your company accepted legal risk by ignoring posted terms, even if there was no password or paywall.
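If your team does crawl public sites, one way to reduce friction is to make the crawler observably polite. A minimal per-domain throttle sketch (the delay value is illustrative only, not a legal safe harbor; the class name is hypothetical):

```python
# Hypothetical per-domain throttle: enforce a minimum delay between requests
# to the same domain so the crawler stays well under typical rate limits.
import time
from collections import defaultdict

class PoliteThrottle:
    """Block until at least min_delay seconds have passed per domain."""

    def __init__(self, min_delay_seconds: float = 1.0):
        self.min_delay = min_delay_seconds
        self.last_request = defaultdict(float)  # domain -> last request time

    def wait(self, domain: str) -> None:
        """Sleep if needed, then record this request's timestamp."""
        elapsed = time.monotonic() - self.last_request[domain]
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self.last_request[domain] = time.monotonic()

throttle = PoliteThrottle(min_delay_seconds=0.1)
for _ in range(3):
    throttle.wait("example.com")  # second and third calls pause ~0.1s each
```

Respecting posted rate limits and any `Crawl-delay` directives doesn't grant usage rights, but it weakens claims that collection was abusive or circumvented access controls.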
Training on public data is no longer just a legal gray area to wave off. It's a business decision with IP, licensing, and reputational consequences. If your company plans to commercialize AI outputs, trace how the model was trained and what rights, if any, were granted or assumed.
— Will & AiME
Three Takeaways:
Publicly accessible data may still be subject to copyright and site-specific use restrictions.
The risk depends on both how the data was used in training and how closely the outputs reflect the source.
Commercializing outputs trained on web content without clear rights may expose your company to contract and copyright claims.