RoboNewsWire – Latest Insights on AI, Robotics, Crypto and Tech Innovations

Researchers suggest OpenAI trained AI models on paywalled O’Reilly books

By GT | April 1, 2025 | TechCrunch | 4 Mins Read


OpenAI has been accused by many parties of training its AI on copyrighted content sans permission. Now a new paper by an AI watchdog organization makes the serious accusation that the company increasingly relied on nonpublic books it didn’t license to train more sophisticated AI models.

AI models are essentially complex prediction engines. Trained on a lot of data — books, movies, TV shows, and so on — they learn patterns and novel ways to extrapolate from a simple prompt. When a model “writes” an essay on a Greek tragedy or “draws” Ghibli-style images, it’s simply pulling from its vast knowledge to approximate. It isn’t arriving at anything new.

While a number of AI labs, including OpenAI, have begun embracing AI-generated data to train AI as they exhaust real-world sources (mainly the public web), few have eschewed real-world data entirely. That’s likely because training on purely synthetic data comes with risks, like worsening a model’s performance.

The new paper, out of the AI Disclosures Project, a nonprofit co-founded in 2024 by media mogul Tim O’Reilly and economist Ilan Strauss, draws the conclusion that OpenAI likely trained its GPT-4o model on paywalled books from O’Reilly Media. (O’Reilly is the CEO of O’Reilly Media.)

In ChatGPT, GPT-4o is the default model. O’Reilly doesn’t have a licensing agreement with OpenAI, the paper says.

“GPT-4o, OpenAI’s more recent and capable model, demonstrates strong recognition of paywalled O’Reilly book content … compared to OpenAI’s earlier model GPT-3.5 Turbo,” wrote the co-authors of the paper. “In contrast, GPT-3.5 Turbo shows greater relative recognition of publicly accessible O’Reilly book samples.”

The paper used a method called DE-COP, first introduced in an academic paper in 2024, designed to detect copyrighted content in language models’ training data. A form of “membership inference attack,” the method tests whether a model can reliably distinguish human-authored texts from paraphrased, AI-generated versions of the same text. If it can, that suggests the model might have prior knowledge of the text from its training data.
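The test described above can be sketched as a multiple-choice quiz: the verbatim excerpt is shuffled among paraphrases, and the model is asked to pick the original. This is a minimal illustration, not the paper's implementation. The names `build_quiz`, `guess_rate`, and `chance_rate` are hypothetical, and `choose` stands in for whatever call would actually query a model.

```python
import random
from typing import Callable, Sequence

# Hypothetical stand-in for a model query: takes the answer options and
# returns the index of the option the model believes is the verbatim text.
Chooser = Callable[[Sequence[str]], int]


def build_quiz(original: str, paraphrases: Sequence[str],
               rng: random.Random) -> tuple[list[str], int]:
    """Shuffle the verbatim excerpt among its paraphrases.

    Returns the shuffled options and the index of the verbatim excerpt.
    Assumes no paraphrase is identical to the original.
    """
    options = [original, *paraphrases]
    rng.shuffle(options)
    return options, options.index(original)


def guess_rate(items: Sequence[tuple[str, Sequence[str]]],
               choose: Chooser, seed: int = 0) -> float:
    """Fraction of quizzes in which the chooser picks the verbatim excerpt."""
    rng = random.Random(seed)
    correct = 0
    for original, paraphrases in items:
        options, answer = build_quiz(original, paraphrases, rng)
        if choose(options) == answer:
            correct += 1
    return correct / len(items)


def chance_rate(num_paraphrases: int) -> float:
    """Accuracy expected from random guessing among the options."""
    return 1.0 / (num_paraphrases + 1)
```

A guess rate well above `chance_rate` on a set of excerpts is the kind of signal the method treats as evidence of prior exposure. A rate near chance suggests the model cannot tell the original from its paraphrases.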

The co-authors of the paper — O’Reilly, Strauss, and AI researcher Sruly Rosenblat — say that they probed GPT-4o, GPT-3.5 Turbo, and other OpenAI models’ knowledge of O’Reilly Media books published before and after their training cutoff dates. They used 13,962 paragraph excerpts from 34 O’Reilly books to estimate the probability that a particular excerpt had been included in a model’s training dataset.
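One common way to summarize how well per-excerpt scores separate two groups of books, such as those published before a model's training cutoff and those published after it, is the area under the ROC curve (AUROC). The sketch below assumes each excerpt already has a numeric recognition score. The function name and the score lists are hypothetical, and whether this matches the paper's exact procedure is an assumption.

```python
from typing import Sequence


def auroc(member_scores: Sequence[float],
          nonmember_scores: Sequence[float]) -> float:
    """Probability that a randomly chosen 'member' excerpt scores higher
    than a randomly chosen 'non-member' excerpt; ties count as half.

    1.0 means perfect separation, 0.5 means no better than chance.
    """
    if not member_scores or not nonmember_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))
```

Here `member_scores` would hold scores for excerpts from books that could have been in the training data, and `nonmember_scores` would hold scores for excerpts published after the cutoff, which the model cannot have seen during training.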

According to the results of the paper, GPT-4o “recognized” far more paywalled O’Reilly book content than OpenAI’s older models, including GPT-3.5 Turbo. That’s even after accounting for potential confounding factors, the authors said, like improvements in newer models’ ability to figure out whether text was human-authored.

“GPT-4o [likely] recognizes, and so has prior knowledge of, many non-public O’Reilly books published prior to its training cutoff date,” wrote the co-authors.

It isn’t a smoking gun, the co-authors are careful to note. They acknowledge that their experimental method isn’t foolproof and that OpenAI might’ve collected the paywalled book excerpts from users copying and pasting them into ChatGPT.

Muddying the waters further, the co-authors didn’t evaluate OpenAI’s most recent collection of models, which includes GPT-4.5 and “reasoning” models such as o3-mini and o1. It’s possible that these models weren’t trained on paywalled O’Reilly book data or were trained on a lesser amount than GPT-4o.

That being said, it’s no secret that OpenAI, which has advocated for looser restrictions around developing models using copyrighted data, has been seeking higher-quality training data for some time. The company has gone so far as to hire journalists to help fine-tune its models’ outputs. That’s a trend across the broader industry: AI companies recruiting experts in domains like science and physics to effectively have these experts feed their knowledge into AI systems.

It should be noted that OpenAI pays for at least some of its training data. The company has licensing deals in place with news publishers, social networks, stock media libraries, and others. OpenAI also offers opt-out mechanisms — albeit imperfect ones — that allow copyright owners to flag content they’d prefer the company not use for training purposes.

Still, as OpenAI battles several suits over its training data practices and treatment of copyright law in U.S. courts, the O’Reilly paper isn’t the most flattering look.

OpenAI didn’t respond to a request for comment.


