What to do with Missing Values?

5 Different Techniques to Clean Dirty Data

Cassie Nutter
7 min read · Jun 20, 2021

When working with data, there will inevitably be a time when you come across a missing value. Don’t panic! Below you’ll find a guide showing the different ways you can remediate this issue, and scenarios when you would and would not use each.

First, let’s go over what missing or null values look like and why it is important to address them.

Missing values can look different in every dataset. Some cells may contain no value at all, while others use placeholder characters, such as empty parentheses, to show the data is missing. The most common representation of missing data is “NaN”.

You will frequently see the value “NaN”, or “Not a Number”. NaN shows up after you load data into a DataFrame: if a cell is empty, pandas converts it to “NaN” (assuming you didn’t give it any other instructions when loading the data).
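As a quick sketch of that behavior (using a made-up two-row CSV standing in for the MoMA file; the names and columns are illustrative), pandas turns the empty cell into NaN on load:

```python
import io

import pandas as pd

# A tiny CSV in the spirit of the MoMA Artwork data (hypothetical rows).
# The Artist cell in the second row is completely empty.
csv_data = """Title,Artist,Nationality
Untitled,Jane Doe,American
Study No. 2,,French
"""

df = pd.read_csv(io.StringIO(csv_data))

# Pandas converted the blank cell to NaN automatically.
print(df["Artist"])
```

No extra arguments were needed; `read_csv` treats empty fields as missing by default (the `na_values` parameter lets you declare other placeholders as missing too).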

First 5 rows and last 10 columns of MoMA Artwork dataset

Here, we can see there are quite a few cells that were blank when the data was read. Pandas filled the null values with NaN’s. This will be helpful later on if we choose to fill or change those values.

But as we know, data doesn’t always come clean. It may take a little investigation to find missing values hiding in the data.

First 5 rows and first 10 columns of MoMA Artwork dataset

In the DataFrame above, there are three missing values. Since the cells were not empty when they were loaded, the characters in the cells remained, though the data contained within them is missing or incorrect.

The first missing value is on the fourth row, in the column “Nationality”. While other artists have their nationalities listed, this specific artist has nothing there but some empty parentheses.

The other missing values are under the column “EndDate”. On the second row, we can see the “BeginDate” is the year 1944. It would be very difficult for the piece to be finished in the year 0. That same issue is repeated on the fourth row.

It is important to inspect the data carefully to discover values that are inaccurate or invalid.

Quick tip! Zero is not the same as a null value. A zero can truly be a zero. It is important to look at the context of the data to decide if the value “0” fits or not.

What to do with the missing values

Once you have cleaned the data and determined if any values should be changed to null values, it is time to decide what to do with them. This is where it gets tricky. It will be helpful to know where the null values are.

Using .info(), we can quickly see where most of the data is located. Conversely, we can see where data is absent. For instance, the column “Seat Height (cm)” has no data in it at all.
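A minimal example of what .info() reports, using a toy stand-in for the MoMA data (the column names here are illustrative):

```python
import numpy as np
import pandas as pd

# Toy DataFrame mimicking the shape of the MoMA data.
df = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "EndDate": [1950.0, np.nan, 1972.0],
    "Seat Height (cm)": [np.nan, np.nan, np.nan],  # entirely empty column
})

# Prints the non-null count and dtype for every column.
df.info()
```

In the printed summary, “Seat Height (cm)” shows 0 non-null values while “EndDate” shows 2, which is exactly the kind of gap you want to spot before deciding how to handle it.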

Difference in non-null values when “(0)” was changed to “NaN” in the EndDate column

We can also visualize where the null values are using the library Missingno.

Use library Missingno to create visualization of missing values
Dark areas show where data is present; white areas show where data is missing

If the null values have been identified as “NaN”, there are some quick ways to view or alter these values with methods. Here are some useful methods:

  • isna() : returns True where a cell has a null value, False otherwise
  • notna() : the opposite of isna(); returns True where a cell is not null
  • dropna() : drops the rows or columns that contain missing values
  • fillna() : replaces “NaN” values with another specified value

Other useful methods include isnull() and notnull(), which are aliases that behave identically to isna() and notna(), and replace(), which can swap out placeholder values for NaN.

Adding .sum() will add up the null values in each column
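Here is that pattern in miniature (the columns are made up for illustration): chaining .sum() onto .isna() counts the missing values per column, since True counts as 1.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Artist": ["Jane Doe", np.nan, "John Roe"],
    "EndDate": [1950.0, np.nan, np.nan],
})

# isna() returns a True/False mask; sum() totals the Trues per column.
null_counts = df.isna().sum()
print(null_counts)
```

The same chain works with notna() if you want the count of present values instead.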

1. Drop whole column or row

Some rows or columns may be missing a lot of data. It may be better to remove them completely than to try to fill them in somehow. The column “Seat Height (cm)” has no data in it at all. We can assume that none of these art pieces are used as seating and will never have a seat height. Deleting that column would be a good move.
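Dropping a column like that can be done by name, or dropna() can find the fully empty columns for you. A sketch with a toy DataFrame (column names illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Seat Height (cm)": [np.nan, np.nan, np.nan],
})

# Drop the named column outright...
df_dropped = df.drop(columns=["Seat Height (cm)"])

# ...or drop every column whose values are all null.
df_auto = df.dropna(axis=1, how="all")
```

With how="all", only columns that are entirely null are removed; the default how="any" would also delete columns (or rows, with axis=0) that have even a single missing value, which is usually too aggressive.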

2. Replace with another value

There are some scenarios where all values must be filled in — especially when dealing with data that will be used in machine learning algorithms. A simple way to fill in missing data is to use a number or characters that show that value was missing. In our data, the “Depth (cm)” column is missing 124,312 data points. We could consider getting rid of that column, but maybe we want to keep the almost 14,000 data points that are filled in. We could replace the null values with a “0” to denote the value was missing. An art piece with a depth of 0 would be pretty difficult to display. A value of “0” would reasonably communicate to others that the data was not collected.

If the data in the column is not numerical, we can also replace the missing values with a letter — like a “U” for the unknown values in “Gender” — or a string — like “none_listed” in the “URL” column.
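The three replacements described above all run through fillna(). A sketch with toy values (the columns mirror the MoMA ones, but the rows are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Depth (cm)": [4.5, np.nan, 12.0],
    "Gender": ["Male", np.nan, "Female"],
    "URL": [np.nan, "http://example.com/a", np.nan],
})

# Numeric column: 0 flags "not collected" without dropping the column.
df["Depth (cm)"] = df["Depth (cm)"].fillna(0)

# Categorical column: "U" for unknown.
df["Gender"] = df["Gender"].fillna("U")

# String column: an explicit placeholder string.
df["URL"] = df["URL"].fillna("none_listed")
```

Each call returns a new Series, so assigning it back (or passing inplace=True) is what actually updates the DataFrame.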

Cleaning up the “Gender” column

“Gender” has 408 unique entries. With some data cleaning, I was able to get it down to 4 (Male, Female, Non-Binary, and Unknown).

URL column had almost 51,000 missing values that were replaced with a string

3. Replace with metrics like mean, median or mode

The “Date” column, like the other columns, requires a lot of cleaning. Once the data has been stripped of excess years, zeros, months, and other various strings, we can apply the median year (or mean, if you choose) to the null values with the fillna() method.
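A small sketch of that step, using made-up years in place of the cleaned “Date” column:

```python
import numpy as np
import pandas as pd

# Hypothetical cleaned years with two missing entries.
dates = pd.Series([1940.0, np.nan, 1960.0, 1980.0, np.nan])

# median() (like mean()) skips NaN values by default.
median_year = dates.median()
dates = dates.fillna(median_year)
```

Note that median() and mean() ignore NaN when computing the statistic, so the fill value is based only on the data you actually have.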

Similar to fillna(), scikit-learn has a class named SimpleImputer that can replace missing values with the mean, median, most frequent value, or a constant of your choice. Check out their site here for more info.
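A minimal sketch of SimpleImputer on a single numeric column (the depth values are made up):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical depths; scikit-learn expects a 2-D array (rows x features).
depths = np.array([[4.5], [np.nan], [12.0], [np.nan]])

# strategy can be "mean", "median", "most_frequent", or "constant".
imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(depths)  # NaNs become the column mean, 8.25
```

Because it follows scikit-learn’s fit/transform pattern, you can fit the imputer on training data and apply the same learned statistic to new data, which fillna() on its own does not give you.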

Photo by Waranont (Joe) on Unsplash

4. Replace with nearby values

In addition to filling empty fields with a string or metrics, fillna() has the following methods to fill null values as well:

  • bfill: (or backfill) takes the next valid value to fill the missing field
  • ffill: (or pad) takes the previous valid value to fill the missing field
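The two directions are easiest to see side by side on a small made-up Series (recent pandas exposes them directly as the .ffill() and .bfill() methods):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()   # carries the previous value forward: 1, 1, 1, 4
backward = s.bfill()  # pulls the next value backward:      1, 4, 4, 4
```

Both are reasonable for ordered data such as time series; for rows with no meaningful order, a neighboring row’s value may have nothing to do with the missing one, so use them with care.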

interpolate() is a similar method that uses surrounding fields to calculate an estimate for the null values.

In this example, interpolate() uses the fields within the column that come before and after the empty value. When using the “linear” method, the values are calculated as if they are evenly spaced.

For example: Column C’s second field was empty. Interpolate calculated halfway between 9 and 24 ((9 + 24) / 2), which is 16.5. This works even if the missing value is in the last row, where it is replaced with the value in the row above it.

However, the null value in the first row remains empty; interpolate did not fill it. Luckily, adding a “limit_direction” can eliminate this issue.

The limit direction defaults to “forward” and works the same way ffill works in fillna(). Since there is a missing value in the first row, there is nothing before it to use to estimate a value. If the limit direction were “backward”, null values in the last row would remain empty instead. Using a limit direction of “both” solves the issue and fills all null values.
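The Column C example above can be sketched like this (the surrounding values are made up to match the 9-and-24 arithmetic):

```python
import numpy as np
import pandas as pd

c = pd.Series([np.nan, 9.0, np.nan, 24.0, np.nan])

# Default limit_direction="forward": the leading NaN stays empty,
# the middle one becomes 16.5, the trailing one takes the value above.
default_fill = c.interpolate(method="linear")

# limit_direction="both" also fills the leading NaN.
full_fill = c.interpolate(method="linear", limit_direction="both")
```

The middle value comes out to (9 + 24) / 2 = 16.5 because linear interpolation treats the rows as evenly spaced points between the two known values.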

5. Use an algorithm

Linear regression or machine learning algorithms like K-Nearest Neighbors can also help impute missing values. Scikit-learn has KNNImputer(), which applies the principles of K-Nearest Neighbors: for each sample with missing values, it finds the most similar samples based on the features that are present, then averages those neighbors’ values to fill in the empty fields.

Check out scikit-learn’s documentation about KNNImputer() here
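A minimal sketch of KNNImputer on a tiny made-up feature matrix:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Four samples, two features; one sample is missing its first feature.
X = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [np.nan, 6.0],
    [8.0, 8.0],
])

# Distances to the incomplete row are computed on the features it does
# have; the missing field is the average of the 2 nearest neighbors.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```

Here the two nearest neighbors of the incomplete row (judged by the second feature alone) are [3, 4] and [8, 8], so the missing value is filled with their average, (3 + 8) / 2 = 5.5. Unlike a single global mean, the fill value adapts to whichever neighborhood the sample falls in.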

Photo by Meghan Holmes on Unsplash

At some point, data will inevitably come in dirty and disorganized, but there are many different ways to clean it up! Choose a method (or all of them) and say, “See ya later, dirty data!”

To see how some of this data was cleaned, check out my notebook on GitHub.


Cassie Nutter

Aspiring Data Scientist, dog lover and running enthusiast