EDIT: That is all the time we have for today! Thank you everyone for the thoughtful questions. We'll hop back on tomorrow if there are any big, lingering questions still out there, and feel free to keep following our coverage of AI here: https://www.washingtonpost.com/technology/innovations/?itid=nb_technology_artificial-intelligence?utm_campaign=wp_main&utm_medium=social&utm_source=reddit.com
The Washington Post set out to analyze one of the data sets used to train large language models, to fully reveal the types of proprietary, personal, and often offensive websites that go into an AI’s training data.
To look inside this black box, we analyzed Google’s C4 data set, a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs, called large language models, including Google’s T5 and Facebook’s LLaMA. (OpenAI does not disclose what data sets it uses to train the models backing its popular chatbot, ChatGPT.)
The Post worked with researchers at the Allen Institute for AI on this investigation and categorized the websites using data from Similarweb, a web analytics company.
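For anyone who wants to poke at the data themselves: below is a minimal sketch, not our actual pipeline, of how one might sample the public C4 corpus and tally which source domains appear most often. It assumes the "allenai/c4" mirror hosted on Hugging Face and its `datasets` library, and the 10,000-record sample size is arbitrary, chosen only for illustration.

```python
# Minimal sketch (assumed setup, not The Washington Post's pipeline):
# stream a small sample of the English C4 corpus and count source domains.
from collections import Counter
from urllib.parse import urlparse

from datasets import load_dataset  # pip install datasets

# Stream the English split so the full corpus is never downloaded.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

SAMPLE_SIZE = 10_000  # arbitrary, for illustration only
domain_counts = Counter()

for i, record in enumerate(c4):
    if i >= SAMPLE_SIZE:
        break
    # Each C4 record carries the source URL alongside the scraped text.
    domain_counts[urlparse(record["url"]).netloc] += 1

# Print the 20 most common source domains in the sample.
for domain, count in domain_counts.most_common(20):
    print(f"{domain}\t{count}")
```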
Read more of our analysis here, and skip the paywall with email registration:
https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/
proof: