We’re Washington Post reporters who analyzed Google’s C4 data set to see which websites AI uses to make itself sound smarter. Ask us Anything!
Author Summary
washingtonpost is in Washington
Post Body

EDIT: That is all the time we have for today! Thank you everyone for the thoughtful questions. We'll hop back on tomorrow if there are any big, lingering questions still out there, and feel free to keep following our coverage of AI here: https://www.washingtonpost.com/technology/innovations/?itid=nb_technology_artificial-intelligence?utm_campaign=wp_main&utm_medium=social&utm_source=reddit.com

The Washington Post set out to analyze one of the data sets used to train AI chatbots, to reveal the types of proprietary, personal, and often offensive websites that go into an AI’s training data.

To look inside this black box, we analyzed Google’s C4 data set, a massive snapshot of the contents of 15 million websites that have been used to train some high-profile English-language AIs, called large language models, including Google’s T5 and Facebook’s LLaMA. (OpenAI does not disclose what data sets it uses to train the models backing its popular chatbot, ChatGPT.)

The Post worked with researchers at the Allen Institute for AI on this investigation and categorized the websites using data from Similarweb, a web analytics company.

Read more of our analysis here, and skip the paywall with email registration:
https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/

proof:

Posted
1 year ago