Data Merging in Pandas
Introduction to Data Merging
Data merging is a fundamental operation in data analysis, especially when working with multiple datasets that share common attributes or keys. It involves combining data from two or more sources based on a common column or set of columns. Data merging is essential for bringing together information from different sources to create a comprehensive and unified dataset. In pandas, the merge()
function is a powerful tool for performing data merging operations.
Overview of the merge()
Function in pandas
The merge()
function is used to combine DataFrames based on one or more columns. It offers various options for specifying how the merging should be performed, including different types of joins (e.g., inner, outer, left, right, cross), handling duplicate keys, and customizing column names.
Here's a high-level overview of the merge()
function:
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, ...)
You can also use:
df_left.merge(df_right, how='inner', on=None, left_on=None, right_on=None, ...)
Parameters:
left
andright
: The DataFrames to be merged.how
: Specifies the type of join to be performed ('inner', 'outer', 'left', or 'right').on
: The column(s) on which to merge. If not specified, the common columns with the same name in both DataFrames are used.left_on
andright_on
: Columns in the left and right DataFrames to merge on, respectively, when the column names are different.left_index
andright_index
: Indicates whether to use the DataFrame's index as the merge key.suffixes
: Suffixes to add to overlapping column names in case of duplicate columns.validate
: Performs checks to validate the integrity of the merge operation.
Basic Merge Operation
Let's illustrate a basic merge operation using an example where we have two DataFrames: one containing sales data and another containing customer information. We'll merge these DataFrames based on a common 'Customer_ID' column.
import pandas as pd
# Create a sample sales DataFrame
sales_data = {'Order_ID': [101, 102, 103, 104],
'Customer_ID': [101, 102, 103, 104],
'Product': ['Laptop', 'Tablet', 'Phone', 'Monitor']}
sales_df = pd.DataFrame(sales_data)
# Create a sample customers DataFrame
customers_data = {'Customer_ID': [101, 102, 103, 105],
'Customer_Name': ['Alice', 'Bob', 'Charlie', 'Eve']}
customers_df = pd.DataFrame(customers_data)
# Perform a basic merge on 'Customer_ID'
merged_df = pd.merge(sales_df, customers_df, on='Customer_ID')
# Display the merged DataFrame
print("Merged DataFrame:")
print(merged_df)
Merged DataFrame:
Order_ID Customer_ID Product Customer_Name
0 101 101 Laptop Alice
1 102 102 Tablet Bob
2 103 103 Phone Charlie
In this example:
- We have a 'sales_df' DataFrame containing order information, including 'Order_ID' and 'Customer_ID.'
- We also have a 'customers_df' DataFrame containing customer information, including 'Customer_ID' and 'Customer_Name.'
- We perform a basic merge on the 'Customer_ID' column to combine these DataFrames.
The resulting 'merged_df' DataFrame contains both sales and customer information, where the 'Customer_ID' column acts as the common key. This merged DataFrame allows you to analyze sales data while associating each order with the corresponding customer's name.
Types of Joins
When merging DataFrames using the merge()
function in pandas, you can choose from several types of joins: inner, outer, left, and right. Each join type has distinct characteristics and use cases. Understanding these join types is essential for determining which one to use in different data merging scenarios.
1. Inner Join
An inner join, also known as an equi-join, returns only the rows that have matching values in both DataFrames based on the specified key(s). Rows with non-matching keys are excluded from the result.
result = pd.merge(left_df, right_df, on='key', how='inner')
Use an inner join when you want to retain only the common rows between two DataFrames.
2. Outer Join
An outer join returns all rows from both DataFrames and fills in missing values with NaN (or a specified fill value) for non-matching keys.
result = pd.merge(left_df, right_df, on='key', how='outer')
Use an outer join when you want to include all rows from both DataFrames and fill in missing values with NaN to maintain data completeness.
3. Left Join
A left join returns all rows from the left DataFrame and the matching rows from the right DataFrame. Non-matching rows from the left DataFrame are included, while non-matching rows from the right DataFrame are excluded.
result = pd.merge(left_df, right_df, on='key', how='left')
Use a left join when you want to keep all rows from the left DataFrame while including matching rows from the right DataFrame.
4. Right Join
A right join is the opposite of a left join. It returns all rows from the right DataFrame and the matching rows from the left DataFrame. Non-matching rows from the right DataFrame are included, while non-matching rows from the left DataFrame are excluded.
result = pd.merge(left_df, right_df, on='key', how='right')
Use a right join when you want to keep all rows from the right DataFrame while including matching rows from the left DataFrame.
The suffixes
Parameter
When merging DataFrames with columns of the same name , you may encounter naming conflicts in the resulting DataFrame. To disambiguate these columns, you can use the suffixes
parameter to add suffixes to the column names from the left and right DataFrames.
Example: Understanding Join Types and Use of suffixes
Let's illustrate the different join types using an example:
# Create sample DataFrames
left_df = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
'value_left': [1, 2, 3, 4]})
right_df = pd.DataFrame({'key': ['C', 'D', 'E', 'F'],
'value_right': [5, 6, 7, 8]})
# declaring suffixes
suf = ('_left', '_right')
# Perform different types of joins
inner_join = pd.merge(left_df, right_df, on='key', how='inner', suffixes=suf)
outer_join = pd.merge(left_df, right_df, on='key', how='outer', suffixes=suf)
left_join = pd.merge(left_df, right_df, on='key', how='left', suffixes=suf)
right_join = pd.merge(left_df, right_df, on='key', how='right', suffixes=suf)
# Display the results
print("Inner Join:")
print(inner_join)
print("\nOuter Join:")
print(outer_join)
print("\nLeft Join:")
print(left_join)
print("\nRight Join:")
print(right_join)
Inner Join:
key value_left value_right
0 C 3 5
1 D 4 6
Outer Join:
key value_left value_right
0 A 1.0 NaN
1 B 2.0 NaN
2 C 3.0 5.0
3 D 4.0 6.0
4 E NaN 7.0
5 F NaN 8.0
Left Join:
key value_left value_right
0 A 1 NaN
1 B 2 NaN
2 C 3 5.0
3 D 4 6.0
Right Join:
key value_left value_right
0 C 3.0 5
1 D 4.0 6
2 E NaN 7
3 F NaN 8
In this example:
inner_join
retains only the rows with matching 'key' values from both DataFrames.outer_join
includes all rows from both DataFrames and fills in missing values with NaN.left_join
keeps all rows from the left DataFrame while including matching rows from the right DataFrame.right_join
keeps all rows from the right DataFrame while including matching rows from the left DataFrame.
Merging on Multiple Keys
When you want to merge DataFrames on multiple keys, you can pass a list of column names as the on
parameter in the merge()
function.
result = pd.merge(left_df, right_df, on=['key1', 'key2'])
left_df
: The left DataFrame.right_df
: The right DataFrame.on
: A list of column names to use as the merging keys.
Example: Multi-Key Merge
Let's illustrate a multi-key merge with an example. Suppose we have two DataFrames, one containing sales data and another containing customer information. To perform a multi-key merge, we'll use both 'Customer_ID' and 'Product_ID' columns as merging keys.
# Create a sample sales DataFrame
sales_data = {'Order_ID': [101, 102, 103, 104],
'Customer_ID': [101, 102, 103, 104],
'Product_ID': ['P1', 'P2', 'P1', 'P3']}
sales_df = pd.DataFrame(sales_data)
# Create a sample customers DataFrame
customers_data = {'Customer_ID': [101, 102, 103, 105],
'Product_ID': ['P1', 'P2', 'P1', 'P3'],
'Customer_Name': ['Alice', 'Bob', 'Charlie', 'Eve']}
customers_df = pd.DataFrame(customers_data)
# Perform a multi-key merge on 'Customer_ID' and 'Product_ID'
merged_df = pd.merge(sales_df, customers_df, on=['Customer_ID', 'Product_ID'], how='inner')
# Display the merged DataFrame
print("Merged DataFrame with multi-key merge:")
print(merged_df)
Merged DataFrame with multi-key merge:
Order_ID Customer_ID Product_ID Customer_Name
0 101 101 P1 Alice
1 102 102 P2 Bob
2 103 103 P1 Charlie
In this example:
pd.merge()
is used to merge the 'sales_df' and 'customers_df' DataFrames based on the columns specified in theon
parameter. In this case, the merge is performed on both 'Customer_ID' and 'Product_ID.'- The
how
parameter is set to 'inner,' which means that only the rows with matching combinations of 'Customer_ID' and 'Product_ID' from both DataFrames will be included in the result. This is an inner join operation.
Handling Duplicate Keys
When merging DataFrames, you may encounter situations where the key(s) used for merging have duplicate values in one or both DataFrames. Handling duplicate keys is crucial to ensure a successful merge and avoid conflicts in the resulting DataFrame. Let's discuss strategies for handling duplicate keys:
Dropping Duplicate Rows
One approach to handling duplicate keys is to drop duplicate rows from one or both DataFrames before merging. This ensures that each key appears only once in the merged DataFrame.
# Create a DataFrame with duplicate keys
df1 = pd.DataFrame({'key': ['A', 'B', 'B'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'C'], 'value2': [4, 5, 6]})
# Drop duplicate rows based on the 'key' column
df1 = df1.drop_duplicates('key')
print(f"df1 after removing duplicates:\n{df1}")
df2 = df2.drop_duplicates('key')
print(f"df2 after removing duplicates:\n{df2}")
# Perform a merge after removing duplicates
merged_df = pd.merge(df1, df2, on='key', how='inner')
# Display the merged DataFrame
print("Merged DataFrame after dropping duplicates:")
print(merged_df)
df1 after removing duplicates:
key value1
0 A 1
1 B 2
df2 after removing duplicates:
key value2
0 B 4
1 C 5
Merged DataFrame after dropping duplicates:
key value1 value2
0 B 2 4
Aggregating Duplicate Values
In some cases, you may want to aggregate values associated with duplicate keys using aggregation functions like sum
, mean
, or max
.
# Create a DataFrame with duplicate keys and values
df1 = pd.DataFrame({'key': ['A', 'B', 'B'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'C'], 'value2': [4, 5, 6]})
# Aggregate values associated with duplicate keys
df1_agg = df1.groupby('key').agg({'value1': 'sum'}).reset_index()
print(f"df1 after summing duplicates key, values:\n{df1_agg}")
df2_agg = df2.groupby('key').agg({'value2': 'sum'}).reset_index()
print(f"df2 after summing duplicates key, values:\n{df2_agg}")
# Perform a merge after aggregation
merged_df = pd.merge(df1_agg, df2_agg, on='key', how='inner')
# Display the merged DataFrame
print("Merged DataFrame after aggregating duplicates:")
print(merged_df)
df1 after summing duplicates key, values:
key value1
0 A 1
1 B 5
df2 after summing duplicates key, values:
key value2
0 B 4
1 C 11
Merged DataFrame after aggregating duplicates:
key value1 value2
0 B 5 4
Using Indexes for Merging
In pandas, you can perform DataFrame merges based on their indexes rather than specific columns. This is particularly useful when your DataFrames have meaningful index labels that you want to use for merging.
To merge DataFrames based on their indexes, you need to specify the left_index
and right_index
parameters in the merge()
function:
left_index
: Set this parameter toTrue
to use the left DataFrame's index for merging.right_index
: Set this parameter toTrue
to use the right DataFrame's index for merging.
Example: Index-Based Merge
Let's illustrate an index-based merge using two DataFrames: 'orders_df' and 'order_details_df,' where we want to merge them based on their indexes.
# Create a sample sales DataFrame with index
sales_data = {'Customer_ID': [101, 102, 103, 104],
'Product_ID': ['P1', 'P2', 'P1', 'P3']}
sales_df = pd.DataFrame(sales_data)
sales_df.set_index('Customer_ID', inplace=True)
# Create a sample customers DataFrame with index
customers_data = {'Customer_ID': [101, 102, 103, 105],
'Customer_Name': ['Alice', 'Bob', 'Charlie', 'Eve']}
customers_df = pd.DataFrame(customers_data)
customers_df.set_index('Customer_ID', inplace=True)
# Perform an index-based merge
merged_df = pd.merge(sales_df, customers_df,
left_index=True, right_index=True, how='outer')
# Display the merged DataFrame
print("Merged DataFrame with index-based merge:")
print(merged_df)
Merged DataFrame with index-based merge:
Product_ID Customer_Name
Customer_ID
101 P1 Alice
102 P2 Bob
103 P1 Charlie
104 P3 NaN
105 NaN Eve
In this example:
- We create two DataFrames, 'sales_df' and 'customers_df,' each with an index based on the 'Customer_ID' column.
- We use
left_index=True
andright_index=True
in thepd.merge()
function to specify index-based merging. - The result is a merged DataFrame ('merged_df') that combines data from both DataFrames based on their indexes, resulting in an outer join operation.
Merging on Different Key Names
To merge DataFrames with different key names, you need to specify the key columns using the left_on
and right_on
parameters in the merge()
function:
left_on
: Specify the key column(s) from the left DataFrame.right_on
: Specify the key column(s) from the right DataFrame.
Example: Merging on Different Key Names
Let's demonstrate how to merge two DataFrames, 'employees_df' and 'departments_df,' where the key columns have different names ('Employee_Dept' in 'employees_df' and 'Dept_ID' in 'departments_df').
# Create a sample 'employees_df' DataFrame with a key column 'Employee_Dept'
employees_data = {'Employee_ID': [1, 2, 3, 4],
'Employee_Name': ['Alice', 'Bob', 'Charlie', 'Eve'],
'Employee_Dept': ['HR', 'Finance', 'IT', 'Sales']}
employees_df = pd.DataFrame(employees_data)
# Create a sample 'departments_df' DataFrame with a key column 'Dept_ID'
departments_data = {'Dept_ID': [101, 102, 103, 104],
'Dept_Name': ['HR', 'Finance', 'IT', 'Sales']}
departments_df = pd.DataFrame(departments_data)
# Perform a merge with different key names using 'left_on' and 'right_on'
merged_df = pd.merge(employees_df, departments_df, left_on='Employee_Dept', right_on='Dept_Name', how='inner')
# Display the merged DataFrame
print("Merged DataFrame with different key names:")
print(merged_df)
Merged DataFrame with different key names:
Employee_ID Employee_Name Employee_Dept Dept_ID Dept_Name
0 1 Alice HR 101 HR
1 2 Bob Finance 102 Finance
2 3 Charlie IT 103 IT
3 4 Eve Sales 104 Sales
In this example:
- We create two DataFrames, 'employees_df' and 'departments_df,' each with a key column that has a different name ('Employee_Dept' and 'Dept_Name,' respectively).
- We use
left_on='Employee_Dept'
andright_on='Dept_Name'
in thepd.merge()
function to specify the key columns for both DataFrames. - The result is a merged DataFrame ('merged_df') that combines data from both DataFrames based on the specified key columns.
This demonstrates how to merge DataFrames with different key names using the left_on
and right_on
parameters, allowing you to handle cases where key columns have different names but represent the same information.
Real-World Example
# Load AAPL stock price data
apple = pd.read_csv('AAPL.csv', parse_dates=['Date'], index_col='Date')
# Load GOOG stock price data
google = pd.read_csv('GOOG.csv', parse_dates=['Date'], index_col='Date')
# indexing required columns
apple_close = apple['Close']
google_close = google['Close']
# concat on axis=1 and set keys parameter
apple_google = pd.merge(apple_close, google_close,
left_index=True, right_index=True,
suffixes=['_apple', '_google'])
print(apple_google.head())
Close_apple Close_google
Date
2022-09-01 157.960007 110.550003
2022-09-02 155.809998 108.680000
2022-09-06 154.529999 107.480003
2022-09-07 155.960007 110.480003
2022-09-08 154.460007 109.419998
The main operation is performed using the pd.merge()
function.
pd.merge()
is used to merge the two Series,apple_close
andgoogle_close
, based on their datetime indexes. Theleft_index=True
andright_index=True
parameters indicate that the merge operation should be based on the datetime index of both Series.- The
suffixes
parameter is specified with the values['_apple', '_google']
to add suffixes to columns with the same names from the two Series. This is useful to distinguish between the 'Close' prices of Apple and Google in the merged DataFrame.