Elon Musk Does Not Understand How Sampling Works

For people who have not been keeping up with the news, Elon Musk recently announced his intention to buy Twitter. This deal, however, is on hold

...pending details supporting calculation that spam/fake accounts do indeed represent less than 5% of users.

So our man Elon has new concerns that Twitter may be bot-infested, which would reduce how valuable the company is and how much Elon should have to pay for it (why he didn't raise this concern before signing an agreement to buy Twitter is a different story...).

To figure out whether fewer than 5% of users are bots, Elon Musk suggested taking a "random sample of 100 users" and counting how many are bots. To get this sample he would:

skip the first 1000 replies to one of his tweets (or to a tweet from someone with a large number of followers),
pick every 10th comment until he reached 100 users,
count the number of bots to determine the overall percentage of active Twitter users who are bots (how he would decide whether an account is a bot is unclear and not the subject of this R1).
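The procedure above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not Musk's actual method: the function name `systematic_reply_sample`, its parameters, and the idea that replies arrive as an ordered list are all hypothetical. It also assumes that "every 10th" starts with the first reply after the skipped ones, and it does not check whether the same user appears twice.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def systematic_reply_sample(
    replies: Sequence[T], skip: int = 1000, step: int = 10, size: int = 100
) -> List[T]:
    """Skip the first `skip` replies, then keep every `step`-th reply
    until `size` replies have been collected (hypothetical helper)."""
    return list(replies[skip::step][:size])


# Stand-in data: integers play the role of replies, in the order they were posted.
replies = list(range(3000))
sample = systematic_reply_sample(replies)
print(len(sample), sample[0], sample[-1])
```

Nothing in this procedure is random with respect to Twitter users as a whole: the only accounts that can ever be picked are ones that replied to the chosen tweet.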

Why is this bad:

There are issues with whether 100 users is a large enough sample to draw any meaningful conclusions (it isn't), but the biggest issue is what's called selection bias. People who respond to big accounts are neither random nor representative of Twitter users at large! Compare the replies to an Elon tweet with the replies to someone like Harvard economist Jason Furman. There's a big difference. If you sampled only from people who responded to Jason, you would likely conclude that there are close to zero bots on Twitter! Elon's account, on the other hand, attracts a disproportionate number of bots, so sampling from replies to his tweets will overstate the proportion of bots on Twitter.
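Both problems can be put into rough numbers. The sketch below is only an illustration: the function names are made up, the margin of error uses the textbook normal approximation (which is itself crude for a proportion near 5% with n = 100), and the bot share and reply rates are hypothetical values, not estimates of anything on Twitter.

```python
import math


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a sample proportion
    (normal approximation, simple random sample assumed)."""
    return z * math.sqrt(p * (1 - p) / n)


def bot_share_among_repliers(
    true_bot_share: float, p_reply_bot: float, p_reply_human: float
) -> float:
    """Share of bots among accounts that reply, when bots and humans
    reply at different rates."""
    bots_replying = true_bot_share * p_reply_bot
    humans_replying = (1 - true_bot_share) * p_reply_human
    return bots_replying / (bots_replying + humans_replying)


# Hypothetical inputs, chosen only for illustration.
moe = margin_of_error(p=0.05, n=100)
observed = bot_share_among_repliers(
    true_bot_share=0.03, p_reply_bot=0.20, p_reply_human=0.02
)

print(f"margin of error at n=100: +/-{moe:.3f}")
print(f"bot share among repliers: {observed:.3f}")
```

With these made-up inputs, a platform where 3% of accounts are bots would show roughly 24% bots among repliers, and even a truly random sample of 100 would carry a margin of error of about 4 percentage points around a 5% estimate.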

To get a random sample, you have to actually sample randomly, or you have to formally model the selection process to account for different users having different probabilities of being included in your sample. In a survey, this would mean weighting respondents by the inverse of the probability that they responded; in economics, this could be something like a Heckman correction.
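As a rough illustration of the weighting idea (not a Heckman correction, which is a different technique), here is a minimal inverse-probability-weighting sketch. It assumes each sampled account's inclusion probability is known, which in practice it would not be and would itself have to be modeled; the function name `ipw_bot_share` and all numbers are hypothetical.

```python
from typing import Iterable, Tuple


def ipw_bot_share(sample: Iterable[Tuple[bool, float]]) -> float:
    """Inverse-probability-weighted share of bots.

    Each item is (is_bot, inclusion_probability). Accounts that were
    unlikely to end up in the sample count for more.
    """
    weighted_bots = 0.0
    total_weight = 0.0
    for is_bot, prob in sample:
        weight = 1.0 / prob
        total_weight += weight
        if is_bot:
            weighted_bots += weight
    return weighted_bots / total_weight


# Hypothetical sample of 100 repliers: 24 bots that each had a 20% chance
# of being included, and 76 humans that each had a 2% chance.
sample = [(True, 0.20)] * 24 + [(False, 0.02)] * 76

naive = sum(is_bot for is_bot, _ in sample) / len(sample)
weighted = ipw_bot_share(sample)

print(f"naive bot share:    {naive:.3f}")
print(f"weighted bot share: {weighted:.3f}")
```

In this toy example the naive share is 24%, while the weighted estimate is about 3%, because each sampled human stands in for many more users than each sampled bot.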

Author

flavorless_beef

Subreddit

r/badeconomics

Posted: 2 years ago
External URL: reddit.com/r/badeconomic...