Handling Duplicates in Pandas
Duplicate data refers to rows with identical values across all or selected columns. These repetitions may arise from data entry errors, data merging, or other collection processes. Identifying and addressing duplicates is a critical preprocessing step, because leftover duplicates skew counts, averages, and any analysis built on them.
Detecting Duplicates
Pandas provides the .duplicated() method, which returns a boolean Series indicating duplicated rows. By default, the method marks all but the first occurrence of each duplicate as True. We can then index the DataFrame with this boolean Series to view the duplicate rows:
import pandas as pd

# Creating a sample DataFrame with duplicate rows
data = {'Name': ['Alice', 'Bob', 'Alice', 'Charlie'],
        'Age': [25, 30, 25, 28]}
df = pd.DataFrame(data)
# Detecting duplicate rows
duplicates = df.duplicated()
# Printing the original DataFrame
print(df)
# Printing duplicate rows
print("\nDuplicate Rows:")
print(df[duplicates])
      Name  Age
0    Alice   25
1      Bob   30
2    Alice   25
3  Charlie   28

Duplicate Rows:
    Name  Age
2  Alice   25
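Before cleaning anything, it can help to count how many rows are flagged. Since .duplicated() returns a boolean Series, summing it (True counts as 1) gives the number of duplicates; a minimal sketch:
# Counting flagged rows: True values sum as 1
num_duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {num_duplicates}")  # prints 1 here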
Removing Duplicates
While identifying duplicates is essential, removing them is equally important for maintaining data quality. The .drop_duplicates() method eliminates duplicate rows, keeping the first occurrence by default.
# Removing duplicate rows from the DataFrame
cleaned_df = df.drop_duplicates()
# Printing the DataFrame after removing duplicates
print("DataFrame after Removing Duplicates:")
print(cleaned_df)
DataFrame after Removing Duplicates:
      Name  Age
0    Alice   25
1      Bob   30
3  Charlie   28
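Note that .drop_duplicates() preserves the original index labels (0, 1, 3 above). If a clean sequential index is preferred, modern pandas offers an ignore_index parameter; a brief sketch:
# ignore_index=True relabels the cleaned rows 0, 1, 2, ...
cleaned_df = df.drop_duplicates(ignore_index=True)
print(cleaned_df)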
Considering Certain Columns
In some cases, you might want to identify duplicates based on specific columns while ignoring others. The .duplicated() method accepts a subset parameter that specifies which columns to consider when detecting duplicates.
# Detecting duplicates based on the 'Name' column
name_duplicates = df.duplicated(subset=['Name'])
# Printing duplicate rows based on the 'Name' column
print("Duplicate Rows based on 'Name' Column:")
print(df[name_duplicates])
Duplicate Rows based on 'Name' Column:
    Name  Age
2  Alice   25
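Because only the subset columns are compared, rows can be flagged even when the other columns differ. Here is a minimal sketch with made-up data (two Alices with different ages) to illustrate the difference:
# Full-row comparison finds no duplicates, but 'Name' alone flags row 2
df2 = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice'],
                    'Age': [25, 30, 26]})
print(df2.duplicated())                 # all False: the ages differ
print(df2.duplicated(subset=['Name']))  # True at index 2: 'Alice' repeats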
Customizing Duplicates Removal
The .drop_duplicates() method can be customized with the keep parameter. Passing 'first' (the default) keeps the first occurrence of each duplicate, 'last' keeps the last occurrence, and False drops every occurrence of a duplicated row.
# Keeping the last occurrence of duplicates in the DataFrame
last_occurrences = df.drop_duplicates(keep='last')
# Printing the DataFrame with the last occurrences of duplicates
print("DataFrame with Last Occurrences of Duplicates:")
print(last_occurrences)
DataFrame with Last Occurrences of Duplicates:
      Name  Age
1      Bob   30
2    Alice   25
3  Charlie   28
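To discard every occurrence of a duplicated row instead of keeping one representative, pass keep=False:
# keep=False drops all copies of any duplicated row
unique_only = df.drop_duplicates(keep=False)
print(unique_only)  # only Bob and Charlie remain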
Combining subset and keep
Let's use a simple dataset of students' exam scores to demonstrate the .drop_duplicates() method with the subset and keep parameters together.
# Creating a sample DataFrame with duplicate rows
data = {'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
        'Subject': ['Math', 'Science', 'Math', 'History', 'Science'],
        'Score': [90, 85, 90, 75, 85]}
df = pd.DataFrame(data)
# Displaying the original DataFrame
print("Original DataFrame:")
print(df)
# Dropping duplicate rows based on 'Name' and 'Subject' columns,
# and keeping the first occurrence of duplicates
cleaned_df = df.drop_duplicates(subset=['Name', 'Subject'], keep='first')
# Displaying the cleaned DataFrame
print("\nDataFrame after Dropping Duplicates:")
print(cleaned_df)
Output:
Original DataFrame:
      Name  Subject  Score
0    Alice     Math     90
1      Bob  Science     85
2    Alice     Math     90
3  Charlie  History     75
4      Bob  Science     85

DataFrame after Dropping Duplicates:
      Name  Subject  Score
0    Alice     Math     90
1      Bob  Science     85
3  Charlie  History     75
In this example, we have a DataFrame containing students' names, subjects, and scores. We use the .drop_duplicates() method to remove duplicate rows based on the 'Name' and 'Subject' columns, keeping the first occurrence, so the cleaned DataFrame retains one row for each unique combination of 'Name' and 'Subject'.
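As a final sanity check, you can confirm that no duplicate 'Name'/'Subject' pairs remain in the cleaned DataFrame:
# Should print False: no duplicate Name/Subject combinations are left
print(cleaned_df.duplicated(subset=['Name', 'Subject']).any())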