Finding Duplicates with .duplicated()

Cassie Nutter
Dec 6, 2020

Sometimes, having two of something is better than one. I’d take two freshly baked snickerdoodle cookies over one any day.

However, just like hearing a mother repeat herself, no one enjoys unnecessary repetition. Dealing with data is no different: having the same row appear more than once is not great.

And thus, the beautiful .duplicated() method was born!

When using pandas to clean and organize data, the .duplicated() method can be used to find rows that repeat. Called on a DataFrame, .duplicated() returns a Series of bools: False if the row is original and not a duplicate, or True if the row is a duplicate.

>>> df.duplicated()
0    False
1     True
2    False
3    False

Running the first line gives you a Series of bools. This output shows that the whole row at index 1 is a duplicate.
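Here is a minimal, runnable sketch of the default behavior, using a small made-up DataFrame (the article does not show its data, so the movie titles below are purely illustrative):

```python
import pandas as pd

# Made-up data: rows 0 and 1 are exact duplicates of each other
df = pd.DataFrame({
    "title": ["Alien", "Alien", "Arrival", "Gravity"],
    "year": [1979, 1979, 2016, 2013],
})

# By default (keep='first'), the first occurrence counts as the original
print(df.duplicated())
# 0    False
# 1     True
# 2    False
# 3    False
# dtype: bool
```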

You can pass keep into the method to mark your preferred duplicate as the real deal.

>>> df.duplicated(keep='last')
0     True
1    False
2    False
3    False

If you compare this code to the code above, you will see that row 0 is now the duplicate and row 1 is now the original. This is because we told the method that we wanted to keep the last duplicate as the winner.
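With the same made-up DataFrame as before, keep='last' flips which row gets flagged:

```python
import pandas as pd

# Same illustrative data: rows 0 and 1 are exact duplicates
df = pd.DataFrame({
    "title": ["Alien", "Alien", "Arrival", "Gravity"],
    "year": [1979, 1979, 2016, 2013],
})

# keep='last' treats the LAST occurrence as the original,
# so row 0 is now the duplicate instead of row 1
print(df.duplicated(keep="last"))
# 0     True
# 1    False
# 2    False
# 3    False
# dtype: bool
```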

What if you wanted to mark both the original and its duplicate? Sure! Just use keep again, but this time set it to False.

>>> df.duplicated(keep=False)
0     True
1     True
2    False
3    False

Setting keep to False flags the original and duplicates. This could be helpful if there are more than just two duplicates and you need to see all of them.
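A quick sketch with the same made-up data shows keep=False flagging every member of the duplicate group:

```python
import pandas as pd

# Illustrative data: rows 0 and 1 are exact duplicates
df = pd.DataFrame({
    "title": ["Alien", "Alien", "Arrival", "Gravity"],
    "year": [1979, 1979, 2016, 2013],
})

# keep=False marks ALL copies as duplicates, original included
print(df.duplicated(keep=False))
# 0     True
# 1     True
# 2    False
# 3    False
# dtype: bool
```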

The .duplicated() method can also take a specific column to check as a subset.

>>> df.duplicated(subset=['title'])
0    False
1     True
2    False
3    False

This code would check all the rows in the DataFrame to see if items in the title column repeat. And the result shows that the title in row 0 is the original. The title in row 1 would be the duplicate.
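To make the difference concrete, here is a made-up DataFrame where two rows share a title but differ in another column. The full-row check finds nothing, while the subset check catches the repeat:

```python
import pandas as pd

# Made-up data: rows 0 and 1 share a title but have different years
df = pd.DataFrame({
    "title": ["Alien", "Alien", "Arrival", "Gravity"],
    "year": [1979, 1986, 2016, 2013],
})

# Full-row check: no duplicates, because the years differ
print(df.duplicated())

# Checking only the title column flags row 1 as a duplicate
print(df.duplicated(subset=["title"]))
# 0    False
# 1     True
# 2    False
# 3    False
# dtype: bool
```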

Now you have a taste for how to use .duplicated(). Let’s see what else it can do.

Using the method and getting back bools may be fine and dandy, but what if you have a really long DataFrame? There's no way to see them all! Here are some tricks if you need more information.

Try adding them up! The bool False has a value of 0 and True has a value of 1. Adding them together will give you a count of how many duplicates you have!

>>> df.duplicated(subset='title').sum()
417

Whoa! That’s 417 duplicate entries! That could really throw off an analysis. But what are they? Where are they? Let’s look.
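The 417 above comes from the article's own (unshown) dataset; here is the same counting trick on a small made-up DataFrame:

```python
import pandas as pd

# Made-up data with two repeated titles
df = pd.DataFrame({
    "title": ["Alien", "Alien", "Arrival", "Arrival", "Gravity"],
    "year": [1979, 1979, 2016, 2016, 2013],
})

# False counts as 0 and True as 1, so .sum() yields the duplicate count
n_dupes = df.duplicated(subset="title").sum()
print(n_dupes)  # 2
```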

You can create a new DataFrame that contains only the duplicate items and inspect it. Make sure you don’t reset the index so you can go back to the original DataFrame and find them.

>>> df2 = df[df.duplicated(subset='title')]
>>> df2

You could also add keep=False after subset, and then you would be able to see the originals as well as the duplicates.
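A sketch of this filtering step, again on made-up data, shows the original index labels surviving so you can trace each duplicate back:

```python
import pandas as pd

# Made-up data: rows 0 and 1 are exact duplicates
df = pd.DataFrame({
    "title": ["Alien", "Alien", "Arrival", "Gravity"],
    "year": [1979, 1979, 2016, 2013],
})

# Boolean mask keeps only rows flagged as duplicates;
# since the index is not reset, label 1 still points into df
df2 = df[df.duplicated(subset="title")]
print(df2)

# Add keep=False to see the originals alongside their duplicates
df_all = df[df.duplicated(subset="title", keep=False)]
print(df_all)
```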

So, you’ve found them and checked them. Now what?

Time to drop ’em! The fastest and easiest way is to use the .drop_duplicates() method.

>>> df.drop_duplicates(subset='title', inplace=True)

Adding inplace=True modifies the original DataFrame directly, replacing it with the cleaned-up version instead of returning a new copy.

Just like with .duplicated(), you can use keep to specify if the first or last row is the original and the one you would like to keep. You can also use False if you want to eliminate the original and duplicate items.
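Putting it together on the same made-up data, dropping duplicate titles in place leaves one row per title:

```python
import pandas as pd

# Made-up data: rows 0 and 1 share a title
df = pd.DataFrame({
    "title": ["Alien", "Alien", "Arrival", "Gravity"],
    "year": [1979, 1979, 2016, 2013],
})

# inplace=True modifies df directly instead of returning a new DataFrame;
# the default keep='first' retains the first occurrence of each title
df.drop_duplicates(subset="title", inplace=True)
print(df)

# keep='last' or keep=False work here too, e.g.:
# df.drop_duplicates(subset="title", keep=False)  # drops originals AND duplicates
```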

Guess what? That’s it! You are either one step closer to having some clean data or craving freshly baked cookies. Either way, good luck!
