I'm building a product for the video game League of Legends that will give players 3-6 distinct things to focus on in the game: the things that will increase their chances of winning the most.
As for my technical background: I thought I wanted to be a data scientist, but transitioned to data engineering, so I only have a basic grasp of machine learning concepts. This is why I want input from all of you wonderfully smart people about the way I want to calculate these "important" columns.
I know that the world of explainability is still uncertain, but here is my approach:
- I am given a dataset of matches for a single player, where each row represents that player's stats at the end of a match. There are ~100 columns (things like kills, assists, damage dealt, etc.) after dropping every column that contains any NULLs.
- There is a binary WIN column that shows whether the player won the match or not. This is the column we are most interested in.
- I train a simple tree-based model on this data, and get the list of "feature importances" using sklearn's permutation_importance() function.
- For some reason (maybe someone can explain), a large number of columns return a ZERO feature importance after computing this.
- This is where I do things differently: I RETRAIN the model using the same dataset, but without the columns that returned 0 importance on the last "run".
- I basically repeat this process until the list of feature importances no longer contains any ZERO values.
- The end result is that there are usually 3-20 columns left (depending on the model).
- I take the top N (haven't decided on N yet) columns and "give" them to the user to focus on in their next game.
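For concreteness, here is a minimal sketch of the loop described in the steps above. It is not the actual code behind the post, and several things in it are assumptions: the model is a RandomForestClassifier (the post only says "a simple tree-based model"), importances are scored on a held-out split, columns with importance less than or equal to zero are dropped (permutation importance can come out slightly negative, not just exactly zero), and the data and column names are synthetic placeholders. The function name prune_by_permutation_importance and the max_rounds parameter are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split


def prune_by_permutation_importance(X, y, names, max_rounds=10, seed=0):
    """Retrain repeatedly, dropping columns whose mean importance is <= 0."""
    names = np.asarray(names)
    cols = np.arange(X.shape[1])  # indices of the columns still in play
    # Assumption: score importances on held-out rows, not the training rows.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )
    ranking = []
    for _ in range(max_rounds):
        model = RandomForestClassifier(n_estimators=50, random_state=seed)
        model.fit(X_tr[:, cols], y_tr)
        result = permutation_importance(
            model, X_te[:, cols], y_te, n_repeats=10, random_state=seed
        )
        imp = result.importances_mean
        # Ranking for the model that was just fitted, highest importance first.
        order = np.argsort(-imp)
        ranking = [(str(names[cols[i]]), float(imp[i])) for i in order]
        positive = imp > 0
        # Stop when nothing would be dropped, or when everything would be.
        if positive.all() or not positive.any():
            break
        cols = cols[positive]
    return ranking


# Synthetic stand-in for one player's match history (placeholder columns).
rng = np.random.default_rng(0)
n_matches = 400
feature_names = ["kills", "deaths", "assists", "noise_a", "noise_b", "noise_c"]
X = rng.normal(size=(n_matches, len(feature_names)))
# WIN depends only on the first two columns plus some noise.
y = (X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=n_matches) > 0).astype(int)

ranking = prune_by_permutation_importance(X, y, feature_names)
top_n = ranking[:3]  # "top N" columns to hand to the player
for name, score in top_n:
    print(f"{name}: {score:.3f}")
```

If max_rounds is reached before the loop settles, the returned ranking can still contain non-positive entries, so it is worth checking the scores before taking the top N.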
Theoretically, if "feature importance" really lives up to its name, the final model should contain only the "most important" columns for achieving a win.
I've tried using SHAP/LIME, but they were more complicated than using straight feature importance.
Like I mentioned, I don't have classical training in ML or statistics, so all of this is stuff I tried to learn on my own at one point. I'd appreciate any helpful advice on whether this approach makes sense and is valid.
The big question is: are there any problems with this approach, and is the resulting set of columns truly the "most important"?