If the column has many missing values then do not use the column.
If there are enough samples to build a model then remove the rows with missing data.
If there are few samples and you can’t afford to remove rows with missing data then use k-NN or MICE.
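For what it's worth, here is a rough sketch of those three rules as a decision function. The thresholds (`max_column_missing_frac`, `min_rows_for_model`) and the function name are made up for illustration and are not from this thread; pick values that fit your own data. In practice the imputation step would usually be handed to a library (scikit-learn ships `KNNImputer` and `IterativeImputer`, the latter being a MICE-style imputer), which this sketch does not call.

```python
def choose_missing_data_strategy(
    n_rows,
    n_missing_in_column,
    n_rows_with_any_missing,
    max_column_missing_frac=0.5,   # hypothetical cutoff for "many missing values"
    min_rows_for_model=1000,       # hypothetical cutoff for "enough samples"
):
    """Return a label for the strategy suggested by the three rules above."""
    # Rule 1: a column that is mostly missing is not worth keeping.
    if n_missing_in_column / n_rows > max_column_missing_frac:
        return "drop_column"

    # Rule 2: if enough complete rows remain, drop the incomplete ones.
    complete_rows = n_rows - n_rows_with_any_missing
    if complete_rows >= min_rows_for_model:
        return "drop_rows"

    # Rule 3: too few samples to throw any away, so impute instead.
    return "impute_knn_or_mice"


mostly_empty = choose_missing_data_strategy(10_000, 8_000, 8_500)
plenty_of_rows = choose_missing_data_strategy(10_000, 500, 900)
small_sample = choose_missing_data_strategy(400, 40, 120)
```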
Suppose you have sales data for two products and a user never bought the second one, so the value for product 2 is NULL. You don’t want to drop the row or the column. In that case, the only sensible option is to impute zeros. Any other imputation will introduce bias.
There’s no such rule that one must never impute zero. It all depends on what kind of data there is and your business problem.
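A tiny sketch of the sales case, with invented customers and amounts, assuming a NULL in `product_2` really does mean "never bought it". It also shows how a mean fill would invent sales that never happened.

```python
from statistics import mean

# Hypothetical purchase records; None stands in for NULL.
purchases = [
    {"customer": "a", "product_1": 120.0, "product_2": 40.0},
    {"customer": "b", "product_1": 80.0, "product_2": None},
    {"customer": "c", "product_1": 95.0, "product_2": None},
    {"customer": "d", "product_1": 60.0, "product_2": 20.0},
]


def fill_missing(rows, column, fill_value):
    """Return copies of the rows with None in `column` replaced by fill_value."""
    return [
        {**row, column: fill_value if row[column] is None else row[column]}
        for row in rows
    ]


def total(rows, column):
    return sum(row[column] for row in rows)


# Zero fill: the total for product 2 stays equal to what was actually sold.
zero_filled = fill_missing(purchases, "product_2", 0.0)

# Mean fill: non-buyers are credited with an average purchase they never made.
observed_mean = mean(r["product_2"] for r in purchases if r["product_2"] is not None)
mean_filled = fill_missing(purchases, "product_2", observed_mean)

actual_sales = total(zero_filled, "product_2")
inflated_sales = total(mean_filled, "product_2")
```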
Get rid of the observation
Dropping rows with missing data can expose the model to non-response bias. For example, suppose that on a survey, high-income households consistently don’t respond to questions about income level. Dropping those rows might then exclude an entire subset of the population, which would bias the results.
Basically, the appropriate strategy depends on the type of missing data (MCAR, MAR, or MNAR). Wikipedia has a pretty good rundown of the types of missing data and the methods available for handling them: https://en.wikipedia.org/wiki/Missing_data
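To make the survey example concrete, here is a toy version with invented incomes and a hypothetical cutoff above which nobody answers. Because the missingness depends on the unreported value itself, this is the MNAR case, and dropping the incomplete rows drags the estimated mean well below the true one.

```python
from statistics import mean

# Invented household incomes; the survey never sees the values marked missing.
true_incomes = [30_000, 45_000, 60_000, 150_000, 220_000]

# Hypothetical assumption: households at or above this income skip the question.
NONRESPONSE_CUTOFF = 100_000

reported = [x if x < NONRESPONSE_CUTOFF else None for x in true_incomes]

# "Drop rows with missing data" keeps only the households that answered.
complete_cases = [x for x in reported if x is not None]

true_mean = mean(true_incomes)
complete_case_mean = mean(complete_cases)
bias = complete_case_mean - true_mean
```

With these numbers the complete-case mean covers only the three lower-income households, so the whole high-income group has vanished from the estimate.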
Damned if you do, damned if you don’t
If you do impute values, best practice is to partition your data set an additional time to avoid nested model bias. https://win-vector.com/2016/04/26/on-nested-models/
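One way to read the "partition an additional time" advice, sketched below under my own assumptions: hold out a separate calibration slice, learn the imputation value only from that slice, and then apply it unchanged to the training and test slices. The split fractions, the median fill, and all names here are illustrative choices, not something taken from the linked post.

```python
import random
from statistics import median


def three_way_split(rows, seed=0, calibration_frac=0.2, train_frac=0.6):
    """Shuffle and split into calibration, train, and test partitions."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_cal = int(len(shuffled) * calibration_frac)
    n_train = int(len(shuffled) * train_frac)
    return (
        shuffled[:n_cal],
        shuffled[n_cal:n_cal + n_train],
        shuffled[n_cal + n_train:],
    )


def fit_median_fill(values):
    """Learn the fill value from the observed entries of one partition only."""
    return median(v for v in values if v is not None)


def apply_fill(values, fill_value):
    return [fill_value if v is None else v for v in values]


# Toy feature column: every fourth value is missing.
feature = [float(i) if i % 4 else None for i in range(100)]

calibration, train, test = three_way_split(feature)

# The imputer sees only the calibration rows, never the rows the model trains on.
fill_value = fit_median_fill(calibration)
train_filled = apply_fill(train, fill_value)
test_filled = apply_fill(test, fill_value)
```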
Decent, though the website is terrible from a design perspective. I have it in Feedly and click through occasionally.
I subscribe to a number of tech company blogs like Stitch Fix and Netflix.
In the olden days of OLS and logistic regression, we used to either (a) bin the variable and let missing be its own bin, or (b) impute a default value (e.g. median, mode, or zero) and then include a missing-value indicator paired with the underlying variable. A bit simplistic, but at least both can handle MNAR. Also, before choosing a method, you should seek to understand the data capture process and why the data is missing.
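Both old-school options are easy to sketch. The ages, bin edges, and function names below are made up for illustration; option (b) uses the median as the default value, but mode or zero would slot in the same way.

```python
from statistics import median


def bin_with_missing(values, edges):
    """Option (a): bin the variable and give missing values their own bin."""
    labels = []
    for v in values:
        if v is None:
            labels.append("missing")
        else:
            # Bin index = number of edges the value is at or above.
            labels.append(f"bin_{sum(v >= e for e in edges)}")
    return labels


def impute_with_indicator(values):
    """Option (b): fill with the median and keep a paired 0/1 missing flag."""
    fill = median(v for v in values if v is not None)
    imputed = [fill if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return imputed, indicator


ages = [23, None, 41, 35, None, 58]

age_bins = bin_with_missing(ages, edges=[30, 50])
age_imputed, age_was_missing = impute_with_indicator(ages)
```

In a regression, `age_imputed` and `age_was_missing` would both go in as predictors, so the model can learn a separate effect for "value was missing" instead of treating the filled-in median as real data.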
Yeah I still do this. Have yet to see an example where doing fancy imputation methods actually made a meaningful difference to model performance. Your time/energy are probably better spent elsewhere.
Caveat that I agree with others that if the data in a column is very sparse, it shouldn’t be used for modeling.
What level/portfolio are you that you actually get to do data modeling?
If it’s a time series, then use the average of the observations before and after, weighted by day of week or whatever makes sense.
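One possible reading of the time series suggestion, sketched for daily data: fill a missing day with the average of the same weekday one week before and one week after. The 7-day period, the function name, and the toy series are my assumptions; a different seasonality or weighting would need a different rule.

```python
def impute_same_weekday(series, period=7):
    """Fill None entries with the mean of the values one period before and after.

    Entries with no observed neighbour at that distance are left as None.
    """
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        neighbours = [
            series[j]
            for j in (i - period, i + period)
            if 0 <= j < len(series) and series[j] is not None
        ]
        if neighbours:
            filled[i] = sum(neighbours) / len(neighbours)
    return filled


# Toy daily series: a weekly pattern plus a slow upward trend.
daily = [100 + 10 * (day % 7) + day for day in range(21)]
held_out_value = daily[10]

# Knock out one day and fill it from the same weekday in adjacent weeks.
with_gap = list(daily)
with_gap[10] = None
filled = impute_same_weekday(with_gap)
```

Using same-weekday neighbours keeps the weekly pattern intact, which a plain average of the day before and the day after would blur.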
Really depends