EDIT: That is all the time we have for today! Thank you everyone for the thoughtful questions. We'll hop back on tomorrow if there are any big, lingering questions still out there, and feel free to keep following our coverage of AI here: https://www.washingtonpost.com/technology/innovations/?itid=nb_technology_artificial-intelligence?utm_campaign=wp_main&utm_medium=social&utm_source=reddit.com
The Washington Post set out to analyze one of the data sets used to train large language models, to fully reveal the types of proprietary, personal, and often offensive websites that go into an AI’s training data.
To look inside this black box, we analyzed Google’s C4 data set, a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs, called large language models, including Google’s T5 and Facebook’s LLaMA. (OpenAI does not disclose what data sets it uses to train the models backing its popular chatbot, ChatGPT.)
The Post worked with researchers at the Allen Institute for AI on this investigation and categorized the websites using data from Similarweb, a web analytics company.
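For anyone who wants to poke at the data themselves: below is a minimal sketch, not our actual pipeline, of how one might sample the public C4 corpus and tally which source domains appear most often. It assumes the "allenai/c4" mirror hosted on Hugging Face and its `datasets` library, and the 10,000-record sample size is arbitrary, chosen only for illustration.

```python
# Minimal sketch (assumed setup, not The Washington Post's pipeline):
# stream a small sample of the English C4 corpus and count source domains.
from collections import Counter
from urllib.parse import urlparse

from datasets import load_dataset  # pip install datasets

# Stream the English split so the full corpus is never downloaded.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

SAMPLE_SIZE = 10_000  # arbitrary, for illustration only
domain_counts = Counter()

for i, record in enumerate(c4):
    if i >= SAMPLE_SIZE:
        break
    # Each C4 record carries the source URL alongside the scraped text.
    domain_counts[urlparse(record["url"]).netloc] += 1

# Print the 20 most common source domains in the sample.
for domain, count in domain_counts.most_common(20):
    print(f"{domain}\t{count}")
```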
Read more of our analysis here, and skip the paywall with email registration:
https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/
proof: