Handling Duplicates in Pandas
Duplicate data refers to rows with identical values across all or selected columns. These repetitions may arise from data entry errors, data merging, or other collection processes. Identifying and addressing duplicates is a critical preprocessing step, because leftover duplicates skew counts, averages, and any analysis built on them.
Detecting Duplicates
Pandas provides the .duplicated() method, which returns a boolean Series indicating duplicated rows. By default, the method marks all but the first occurrence of each duplicate as True. We can then index the DataFrame with this boolean Series to view the duplicate rows:
import pandas as pd

# Creating a sample DataFrame with duplicate rows
data = {'Name': ['Alice', 'Bob', 'Alice', 'Charlie'],
        'Age': [25, 30, 25, 28]}
df = pd.DataFrame(data)
# Detecting duplicate rows
duplicates = df.duplicated()
# Printing the original DataFrame
print(df)
# Printing duplicate rows
print("\nDuplicate Rows:")
print(df[duplicates])
      Name  Age
0    Alice   25
1      Bob   30
2    Alice   25
3  Charlie   28

Duplicate Rows:
    Name  Age
2  Alice   25
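Before cleaning anything, it can help to count how many rows are flagged. Since .duplicated() returns a boolean Series, summing it (True counts as 1) gives the number of duplicates; a minimal sketch:
# Counting flagged rows: True values sum as 1
num_duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {num_duplicates}")  # prints 1 here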
Removing Duplicates
While identifying duplicates is essential, removing them is equally important for maintaining data quality. The .drop_duplicates() method eliminates duplicate rows, keeping the first occurrence by default.
# Removing duplicate rows from the DataFrame
cleaned_df = df.drop_duplicates()
# Printing the DataFrame after removing duplicates
print("DataFrame after Removing Duplicates:")
print(cleaned_df)
DataFrame after Removing Duplicates:
      Name  Age
0    Alice   25
1      Bob   30
3  Charlie   28
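Note that .drop_duplicates() preserves the original index labels (0, 1, 3 above). If a clean sequential index is preferred, modern pandas offers an ignore_index parameter; a brief sketch:
# ignore_index=True relabels the cleaned rows 0, 1, 2, ...
cleaned_df = df.drop_duplicates(ignore_index=True)
print(cleaned_df)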
Considering Certain Columns
In some cases, you might want to identify duplicates based on specific columns while ignoring others. The .duplicated() method accepts a subset parameter that specifies which columns to consider when detecting duplicates.
# Detecting duplicates based on the 'Name' column
name_duplicates = df.duplicated(subset=['Name'])
# Printing duplicate rows based on the 'Name' column
print("Duplicate Rows based on 'Name' Column:")
print(df[name_duplicates])
Duplicate Rows based on 'Name' Column:
    Name  Age
2  Alice   25
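Because only the subset columns are compared, rows can be flagged even when the other columns differ. Here is a minimal sketch with made-up data (two Alices with different ages) to illustrate the difference:
# Full-row comparison finds no duplicates, but 'Name' alone flags row 2
df2 = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice'],
                    'Age': [25, 30, 26]})
print(df2.duplicated())                 # all False: the ages differ
print(df2.duplicated(subset=['Name']))  # True at index 2: 'Alice' repeats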
Customizing Duplicates Removal
The .drop_duplicates() method can be customized with the keep parameter. Passing 'first' (the default) keeps the first occurrence of each duplicate, 'last' keeps the last occurrence, and False drops every occurrence of a duplicated row.
# Keeping the last occurrence of duplicates in the DataFrame
last_occurrences = df.drop_duplicates(keep='last')
# Printing the DataFrame with the last occurrences of duplicates
print("DataFrame with Last Occurrences of Duplicates:")
print(last_occurrences)
DataFrame with Last Occurrences of Duplicates:
      Name  Age
1      Bob   30
2    Alice   25
3  Charlie   28
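To discard every occurrence of a duplicated row instead of keeping one representative, pass keep=False:
# keep=False drops all copies of any duplicated row
unique_only = df.drop_duplicates(keep=False)
print(unique_only)  # only Bob and Charlie remain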
Combining subset and keep
Let's use a simple dataset of students' exam scores to demonstrate the .drop_duplicates() method with the subset and keep parameters together.
# Creating a sample DataFrame with duplicate rows
data = {'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
        'Subject': ['Math', 'Science', 'Math', 'History', 'Science'],
        'Score': [90, 85, 90, 75, 85]}
df = pd.DataFrame(data)
# Displaying the original DataFrame
print("Original DataFrame:")
print(df)
# Dropping duplicate rows based on 'Name' and 'Subject' columns,
# and keeping the first occurrence of duplicates
cleaned_df = df.drop_duplicates(subset=['Name', 'Subject'], keep='first')
# Displaying the cleaned DataFrame
print("\nDataFrame after Dropping Duplicates:")
print(cleaned_df)
Output:
Original DataFrame:
      Name  Subject  Score
0    Alice     Math     90
1      Bob  Science     85
2    Alice     Math     90
3  Charlie  History     75
4      Bob  Science     85

DataFrame after Dropping Duplicates:
      Name  Subject  Score
0    Alice     Math     90
1      Bob  Science     85
3  Charlie  History     75
In this example, we have a DataFrame containing students' names, subjects, and scores. We use the .drop_duplicates() method to remove duplicate rows based on the 'Name' and 'Subject' columns, keeping the first occurrence, so the cleaned DataFrame retains one row for each unique combination of 'Name' and 'Subject'.
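As a final sanity check, you can confirm that no duplicate 'Name'/'Subject' pairs remain in the cleaned DataFrame:
# Should print False: no duplicate Name/Subject combinations are left
print(cleaned_df.duplicated(subset=['Name', 'Subject']).any())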