Dealing with Missing Data in Data Science

Missing data is one of those challenges that every data scientist encounters. It can mess with your analysis, introduce bias, or even make models completely unreliable. So, how do we deal with it?

There’s no one-size-fits-all solution, but depending on why the data is missing, we can take different approaches. Here’s how I usually think about it - feel free to challenge or improve on these ideas!


Missing Completely at Random (MCAR)

What it means: The chance that a value is missing has nothing to do with any variable, observed or unobserved, including the value itself - the gaps are purely random.

Example: Imagine you’re collecting survey data, and some people just skipped a question because they weren’t paying attention. There’s no pattern - just random gaps.

How to handle it (the percentages below are rough rules of thumb, not hard cutoffs):

  • If <5% of the data is missing: You can probably just drop those rows without much impact.
  • If 5%-20% is missing: Imputing is a good option:
    • For numeric values: Use the mean or median (the median is more robust to outliers).
    • For categorical values: Use the mode (most frequent value).
  • If 20%-50% is missing: Consider using a predictive model to estimate the missing values.
  • If >50% is missing: The variable might not be useful - drop the column.
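As a rough illustration of the bands above, here is a minimal sketch in plain Python. The function names (missing_fraction, impute_simple, suggest_strategy) and the sample values are made up for this example, and missing entries are assumed to be stored as None. The cutoffs simply mirror the rules of thumb in the list.

```python
from statistics import median, mode

def missing_fraction(values):
    """Share of entries that are None."""
    return sum(v is None for v in values) / len(values)

def impute_simple(values, numeric=True):
    """Fill None with the median (numeric) or the mode (categorical) of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if numeric else mode(observed)
    return [fill if v is None else v for v in values]

def suggest_strategy(values):
    """Map the missing share onto the rule-of-thumb bands (hypothetical helper)."""
    frac = missing_fraction(values)
    if frac < 0.05:
        return "drop rows"
    if frac <= 0.20:
        return "simple imputation"
    if frac <= 0.50:
        return "model-based imputation"
    return "drop column"

# Illustrative data: one value out of ten is missing in each column.
ages = [34, None, 29, 41, 38, 25, 52, 47, 31, 36]
colors = ["red", "blue", None, "red", "green", "red", "blue", "red", "green", "red"]

print(suggest_strategy(ages))
print(impute_simple(ages))
print(impute_simple(colors, numeric=False))
```

In a real project you would more likely do this with a dataframe library, but the logic is the same: measure how much is missing first, then pick the simplest fix that fits.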

Missing At Random (MAR)

What it means: The missing values depend on other observed data but not on the missing values themselves.

Example: Suppose you’re analyzing employee salaries, and you notice that younger employees are less likely to report their income. The missingness isn’t purely random - it depends on age, but among employees of the same age it doesn’t depend on the salary itself.
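One quick way to spot a pattern like this is to compare the share of missing values across groups of an observed variable. The sketch below assumes rows stored as dictionaries with None for missing salaries; the helper name, the age bands, and the numbers are invented for illustration.

```python
def missing_rate_by_group(rows, group_key, target_key):
    """Return {group: share of rows in that group where target_key is None}."""
    totals, missing = {}, {}
    for row in rows:
        group = row[group_key]
        totals[group] = totals.get(group, 0) + 1
        missing[group] = missing.get(group, 0) + (row[target_key] is None)
    return {group: missing[group] / totals[group] for group in totals}

# Illustrative data only.
employees = [
    {"age_band": "under 30", "salary": None},
    {"age_band": "under 30", "salary": 42000},
    {"age_band": "under 30", "salary": None},
    {"age_band": "under 30", "salary": None},
    {"age_band": "30 and over", "salary": 61000},
    {"age_band": "30 and over", "salary": 58000},
    {"age_band": "30 and over", "salary": None},
    {"age_band": "30 and over", "salary": 73000},
]

rates = missing_rate_by_group(employees, "age_band", "salary")
print(rates)
```

A clear gap between groups suggests the data is not MCAR. It cannot tell MAR apart from MNAR, though, because that depends on values we never observed.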

How to handle it:
Since we have some information to work with, we can predict the missing values using models:

  • For numeric values: Use linear regression (e.g., predict salary based on age and position).
  • For binary values: Use logistic regression (e.g., predict whether someone owns a house based on their income bracket).
  • For categorical values: K-Nearest Neighbors (KNN) can work well - it finds the most similar complete cases and fills the gap with their most common value.
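To make the model-based idea concrete, here is a small sketch in plain Python: a one-predictor least-squares fit that fills in missing salaries from age, and a one-feature KNN that fills in a missing category. All names (fit_line, impute_by_regression, impute_by_knn) and the data are hypothetical, and a single predictor is assumed only to keep the example short.

```python
from collections import Counter

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def impute_by_regression(xs, ys):
    """Fill None entries of ys with predictions from a line fitted on complete pairs."""
    complete = [(x, y) for x, y in zip(xs, ys) if y is not None]
    intercept, slope = fit_line([x for x, _ in complete], [y for _, y in complete])
    return [intercept + slope * x if y is None else y for x, y in zip(xs, ys)]

def impute_by_knn(features, labels, k=3):
    """Fill None labels with the majority label among the k nearest labelled rows."""
    labelled = [(f, lab) for f, lab in zip(features, labels) if lab is not None]
    filled = []
    for f, lab in zip(features, labels):
        if lab is None:
            nearest = sorted(labelled, key=lambda pair: abs(pair[0] - f))[:k]
            lab = Counter(l for _, l in nearest).most_common(1)[0][0]
        filled.append(lab)
    return filled

# Illustrative data only.
ages = [25, 30, 35, 40, 45]
salaries = [40000, None, 50000, 55000, None]
filled_salaries = impute_by_regression(ages, salaries)
print([round(s) for s in filled_salaries])

staff_ages = [25, 30, 35, 40, 45, 27]
levels = ["junior", "junior", "senior", "senior", "senior", None]
filled_levels = impute_by_knn(staff_ages, levels, k=3)
print(filled_levels)
```

Keep in mind that filling each gap with a single predicted value makes the data look more certain than it really is, so treat downstream error estimates with some caution.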

Missing Not At Random (MNAR)

What it means: The chance that a value is missing depends on the unobserved value itself, even after accounting for the other variables, making this the trickiest case.

Example: Let’s say you have a health survey where some people didn’t disclose their weight. It’s likely that people with higher weights were more hesitant to answer - so the missingness isn’t random at all.

How to handle it:

  • Treat missingness as a feature itself: Create a new indicator variable such as weight_missing, set to 1 where the value is missing and 0 otherwise.
  • Collect more data: If possible, follow up to get missing responses.
  • Use domain knowledge: Understanding why data is missing can help decide whether to impute, drop, or adjust the analysis.
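The missingness-indicator idea from the first bullet can be sketched in a few lines. This assumes survey rows stored as dictionaries with None for undisclosed weights; add_missing_indicator is a hypothetical helper and the data is invented.

```python
def add_missing_indicator(rows, key):
    """Return copies of the rows with an extra '<key>_missing' flag (1 if None, else 0)."""
    return [{**row, f"{key}_missing": int(row[key] is None)} for row in rows]

# Illustrative data only.
survey = [
    {"id": 1, "weight": 72.5},
    {"id": 2, "weight": None},
    {"id": 3, "weight": 88.0},
]

flagged = add_missing_indicator(survey, "weight")
print(flagged)
```

The flag lets a model learn from the fact that a value was withheld, without pretending we know what the value was.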

Final Thoughts

Missing data is a reality in every dataset, and there’s no single correct approach - it depends on the context. What works for one dataset might not work for another.

