Tag: data licensing
-
Simon Willison’s Weblog: Releasing the largest multilingual open pretraining dataset
Source URL: https://simonwillison.net/2024/Nov/14/releasing-the-largest-multilingual-open-pretraining-dataset/#atom-everything Source: Simon Willison’s Weblog Title: Releasing the largest multilingual open pretraining dataset Feedly Summary: Releasing the largest multilingual open pretraining dataset Common Corpus is a new “open and permissible licensed text dataset, comprising over 2 trillion tokens (2,003,039,184,047 tokens)" released by French AI Lab PleIAs. This appears to be the largest available…
-
Wired: This Startup Wants YouTube Creators to Get Paid for AI Training Data
Source URL: https://www.wired.com/story/license-to-scrape-youtube-ai-data-license-creators/ Source: Wired Title: This Startup Wants YouTube Creators to Get Paid for AI Training Data Feedly Summary: While big platforms like Reddit have signed deals with the AI giants, YouTube leaves licensing in the hands of individual creators. The “License to Scrape” program aims to give those streaming stars proper leverage. AI…