RoboNewsWire – Latest Insights on AI, Robotics, Crypto and Tech Innovations

EleutherAI releases massive AI training dataset of licensed and open domain text

By GT | June 7, 2025 | TechCrunch | 3 Mins Read


EleutherAI, an AI research organization, has released what it claims is one of the largest collections of licensed and open-domain text for training AI models.

The dataset, called the Common Pile v0.1, took around two years to complete in collaboration with AI startups Poolside, Hugging Face, and others, along with several academic institutions. At 8 terabytes, the Common Pile v0.1 was used to train two new AI models from EleutherAI, Comma v0.1-1T and Comma v0.1-2T, which EleutherAI claims perform on par with models developed using unlicensed, copyrighted data.

AI companies, including OpenAI, are embroiled in lawsuits over their AI training practices, which rely on scraping the web — including copyrighted material like books and research journals — to build model training datasets. While some AI companies have licensing arrangements in place with certain content providers, most maintain that the U.S. legal doctrine of fair use shields them from liability in cases where they trained on copyrighted work without permission.

EleutherAI argues that these lawsuits have “drastically decreased” transparency from AI companies, which the organization says has harmed the broader AI research field by making it more difficult to understand how models work and what their flaws might be.

“[Copyright] lawsuits have not meaningfully changed data sourcing practices in [model] training, but they have drastically decreased the transparency companies engage in,” Stella Biderman, EleutherAI’s executive director, wrote in a blog post on Hugging Face early Friday. “Researchers at some companies we have spoken to have also specifically cited lawsuits as the reason why they’ve been unable to release the research they’re doing in highly data-centric areas.”

The Common Pile v0.1, which can be downloaded from Hugging Face’s AI dev platform and GitHub, was created in consultation with legal experts. It draws on sources including 300,000 public domain books digitized by the Library of Congress and the Internet Archive. EleutherAI also used Whisper, OpenAI’s open source speech-to-text model, to transcribe audio content.

EleutherAI claims Comma v0.1-1T and Comma v0.1-2T are evidence that the Common Pile v0.1 was curated carefully enough to enable developers to build models competitive with proprietary alternatives. According to EleutherAI, the models, both of which have 7 billion parameters and were trained on only a fraction of the Common Pile v0.1, rival models like Meta’s first Llama AI model on benchmarks for coding, image understanding, and math.

Parameters, sometimes referred to as weights, are the internal components of an AI model that guide its behavior and answers.
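
As a rough illustration of what a parameter count measures, the sketch below tallies the weights and biases in a toy fully connected network. The function names and layer sizes are hypothetical and chosen only for illustration; they are not drawn from the Comma models, whose architecture is not described here.

```python
def dense_layer_params(n_in: int, n_out: int) -> int:
    """Count parameters in one fully connected layer.

    There is one weight per input-output pair, plus one bias per output.
    """
    return n_in * n_out + n_out


def mlp_params(layer_sizes: list[int]) -> int:
    """Sum parameters across consecutive layers of a simple feed-forward network."""
    return sum(
        dense_layer_params(n_in, n_out)
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


if __name__ == "__main__":
    # Hypothetical toy network: 4 inputs, 8 hidden units, 2 outputs.
    toy_sizes = [4, 8, 2]
    print(mlp_params(toy_sizes))
```

A model described as having 7 billion parameters has 7 billion such learned numbers in total, although real language models arrange them in more elaborate layers than this toy example.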

“In general, we think that the common idea that unlicensed text drives performance is unjustified,” Biderman wrote in her post. “As the amount of accessible openly licensed and public domain data grows, we can expect the quality of models trained on openly licensed content to improve.”

The Common Pile v0.1 appears to be in part an effort to right EleutherAI’s historical wrongs. Years ago, the organization released The Pile, an open collection of training text that includes copyrighted material. AI companies have come under fire, and legal pressure, for using The Pile to train models.

EleutherAI is committing to release open datasets more frequently going forward, in collaboration with its research and infrastructure partners.

Updated 9:48 a.m. Pacific: Biderman clarified in a post on X that EleutherAI contributed to the release of the datasets and models, but that their development involved many partners, including the University of Toronto, which helped lead the research.


