Dear data scientists,
How do you guys prevent non-technical errors and bugs?
I work as a data scientist in a junior position. My typical workflow consists of the following steps:
1) the client gives us a problem
2) think about proper methodology
3) gather the data necessary to solve the problem
4) apply some statistical procedures to solve the problem (generally a model)
5) build a report to send to the client (this report must follow the company's format and standards).
One area where I notice I am having difficulties, and have room for improvement, is what I will call the non-technical or non-statistical aspects of the workflow above. That is, suppose you gather the right data and think about the proper methodology to solve the problem: how can you then prevent errors in the coding and reporting? For instance:
- you have the right methodology, but when coding the model you use the wrong variable in some step and the results are no longer valid (for instance, you have x_train and x_test and you mistakenly write m = x_test / 2 instead of m = x_train / 2).
- at the reporting stage, you export the wrong results.
These are just examples.
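To make the first example above concrete, here is a minimal sketch in Python of what I mean. The data is made up (random numbers, with x_test shifted on purpose), and every name other than x_train, x_test and m is hypothetical, just for illustration:

```python
import random
from statistics import mean

random.seed(0)

# Made-up data: the test split is deliberately shifted relative to the training split.
x_train = [random.gauss(10.0, 1.0) for _ in range(100)]
x_test = [random.gauss(12.0, 1.0) for _ in range(100)]

# What the methodology calls for: m is derived from the training data.
m_intended = [v / 2 for v in x_train]

# The slip: one wrong variable name. The code still runs without any error.
m_mistaken = [v / 2 for v in x_test]

# Both results have the same length and look like plausible numbers,
# so nothing in the output itself flags that the wrong split was used.
print(f"mean of intended m: {mean(m_intended):.3f}")
print(f"mean of mistaken m: {mean(m_mistaken):.3f}")
```

The point is that the code runs fine either way, so the mistake only shows up later when someone looks closely at the numbers.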
Then you send your report and, under scrutiny from your managers or while revising things to answer additional questions, you find these errors. It then looks unprofessional to say that the initial results were wrong and that you will have to update them. It may not inspire much confidence in your future results.
It has been hard for me to find ways to improve in this aspect because these types of errors are hard to predict. When you are coding, you are already doing what you think is correct. Given the time frames we have, it is also infeasible to double-check every single line of code. Also, the problems are generally very diverse in nature, so it is not as if you can just adopt an automated or semi-automated methodology that you can build upon and improve; many things have to be built from scratch every time you receive a new project.
How do you guys prevent these types of errors?
Thanks in advance.
Posted 4 years ago on reddit.com/r/datascience...