[College Statistics] Trying to determine a causal relationship and avoid p-hacking


Author Summary

Easy-Friendship-3841 is in College Statistics

Post Body

Imagine a black-box model that predicts how a particular salesperson will perform in a month. Similar to how golf has a concept of par, the model provides a score relative to a monthly sales goal set by the company. If the model predicts the person will perform over expectations, such as 7, that means they are predicted to sell seven more products than the monthly sales goal. If the model predicts the person will perform under expectations, such as -3, that means they are predicted to sell three fewer products than the monthly sales goal.

Overall this model is relatively predictive, but there are certain scenarios where it might be inaccurate. For people over par, inaccuracy is classified as the salesperson performing worse than their prediction. So for example, if the model predicts 7 and the person sells only 5 more than the goal that month, then the model was inaccurate. For people under par, inaccuracy is classified as the salesperson performing better than their prediction. So for example, if the model predicts -3 and the person sells only 1 fewer than the goal that month, then the model was inaccurate.
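A minimal sketch of that definition in Python, assuming both numbers are expressed relative to the monthly sales goal. The function name `is_inaccurate` is made up for illustration, and a prediction of exactly 0 (par) is not covered by the description above, so the sketch returns `None` for it.

```python
def is_inaccurate(predicted, actual):
    """Classify one prediction using the over-par / under-par rule.

    Both arguments are units sold relative to the monthly goal
    (e.g. 7 means seven over the goal, -3 means three under).
    """
    if predicted > 0:
        # Over par: the model is inaccurate if the person did worse than predicted.
        return actual < predicted
    if predicted < 0:
        # Under par: the model is inaccurate if the person did better than predicted.
        return actual > predicted
    # Predicted exactly at par: not defined in the post, so left undecided.
    return None


print(is_inaccurate(7, 5))    # over-par example from the post
print(is_inaccurate(-3, -1))  # under-par example from the post
```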

Consider situations where the salesperson:

1) Is traveling / working remotely / in changing time zones
2) Is predicted to perform under expectations
3) Has a performance review within the next couple of months
4) Has a monthly sales goal that is low to begin with

In these situations the model is inaccurate: it is correct only about 40% of the time over 700 predictions. I want to avoid the possibility of p-hacking, and I also want to make sure that the model hasn't already adjusted (btw, the model is a black-box statistical model, but its output can also be tweaked by humans, it can adjust its weights based on new data it receives, etc.).

A couple of years ago, salespeople who went into the office and were voted as likable by managers overperformed model expectations early in the year, with p = .002. But it was later determined that these likability scores were highly inaccurate, possibly faked, and the trend of 'in-office salespeople overperforming model expectations early in the year' no longer 'beats' the model.
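One rough way to put a number on the 40%-of-700 figure while accounting for how many subgroups were looked at is an exact binomial tail probability followed by a Bonferroni adjustment. The sketch below uses only the standard library. Two inputs are assumptions, not facts from the post: `baseline` (the accuracy the model would have in this subgroup if the trend were not real, set to 0.5 here as a placeholder) and `subgroups_tried` (a hypothetical count of condition combinations examined before settling on this one).

```python
from math import exp, lgamma, log


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed in log space to avoid overflow."""
    total = 0.0
    for i in range(k + 1):
        log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log(p) + (n - i) * log(1 - p))
        total += exp(log_pmf)
    return min(total, 1.0)


def bonferroni(p_value, n_tests):
    """Bonferroni adjustment: multiply by the number of tests, cap at 1."""
    return min(1.0, p_value * n_tests)


n_predictions = 700
n_correct = 280        # about 40% of 700
baseline = 0.5         # ASSUMPTION: accuracy expected if the trend is not real
subgroups_tried = 20   # HYPOTHETICAL: number of subgroup definitions examined

# One-sided p-value: chance of 280 or fewer correct if true accuracy were `baseline`.
p_raw = binom_cdf(n_correct, n_predictions, baseline)
p_adj = bonferroni(p_raw, subgroups_tried)
print(f"raw p = {p_raw:.3g}, Bonferroni-adjusted p = {p_adj:.3g}")
```

This treats the 700 predictions as independent, which may not hold if the same salespeople appear in many months. It also does not replace checking the trend on data that was not used to discover it.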

This is what I was told to try.

1) Come up with my own rating system for each salesperson for each month. Create a feature based on the trend I am observing. Combine that feature with the ratings to see whether the feature/trend is predictive. Then see whether the model already includes this feature. This is supposedly how I would determine whether the trend has been 'priced in' to the model. This approach seems super tough though, because I think it requires me to have a 'fair' rating for each salesperson each month.

2) Look at the margin by which the incorrect predictions are wrong. If the margin of error decreases over time within this trend (traveling / predicted to perform under expectations / has a performance review within the next couple of months / monthly sales goal is low to begin with), then maybe the model is adjusting to correct for mispricing this trend. One caveat with this approach is that the number of salespeople fitting the trend could differ greatly from period to period. For example, maybe 50 salespeople fit the trend in 2020 but only 20 in 2021.
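The second approach, including the caveat about unequal group sizes, could be sketched roughly as follows. Each period's mean margin is reported together with its count and standard error, and the trend across periods is fitted with each period weighted by its count, so a period with 20 salespeople does not carry the same weight as one with 50. The record layout `(period, predicted, actual)` and both function names are hypothetical, and the input is assumed to contain only the incorrect predictions that fall inside the trend.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def margin_by_period(records):
    """Summarise |actual - predicted| per period as (period, n, mean, std_error)."""
    groups = defaultdict(list)
    for period, predicted, actual in records:
        groups[period].append(abs(actual - predicted))
    summary = []
    for period in sorted(groups):
        margins = groups[period]
        n = len(margins)
        # Standard error needs at least two observations.
        se = stdev(margins) / sqrt(n) if n > 1 else float("nan")
        summary.append((period, n, mean(margins), se))
    return summary


def weighted_slope(summary):
    """Count-weighted least-squares slope of mean margin against period."""
    if len(summary) < 2:
        raise ValueError("need at least two periods to fit a slope")
    w = [n for _, n, _, _ in summary]
    x = [period for period, _, _, _ in summary]
    y = [m for _, _, m, _ in summary]
    total = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    return sxy / sxx


# Made-up rows purely to show the call shape, not real data.
example = [(2020, -3, 1), (2020, -2, 2), (2021, -3, -1), (2021, -4, -2)]
summary = margin_by_period(example)
slope = weighted_slope(summary)
print(summary, slope)
```

A negative slope would be consistent with the margin shrinking over time, but with only a few periods and wide standard errors it would not be strong evidence that the model has adjusted.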

Thanks for any advice.

Author

Easy-Friendship-3841

New User

Subreddit

r/learnmath

Post Details

Location

College Statistics

Posted: 1 year ago
External URL: reddit.com/r/learnmath/c...