Pandas
Data Analysis
Merging Data

Data Merging in Pandas

Introduction to Data Merging

Data merging is a fundamental operation in data analysis, especially when working with multiple datasets that share common attributes or keys. It involves combining data from two or more sources based on a common column or set of columns. Data merging is essential for bringing together information from different sources to create a comprehensive and unified dataset. In pandas, the merge() function is a powerful tool for performing data merging operations.

Overview of the merge() Function in pandas

The merge() function is used to combine DataFrames based on one or more columns. It offers various options for specifying how the merging should be performed, including different types of joins (e.g., inner, outer, left, right, cross), handling duplicate keys, and customizing column names.

Here's a high-level overview of the merge() function:

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, ...)

You can also use:

df_left.merge(df_right, how='inner', on=None, left_on=None, right_on=None, ...)

Parameters:

  • left and right: The DataFrames to be merged.
  • how: Specifies the type of join to be performed ('inner', 'outer', 'left', or 'right').
  • on: The column(s) on which to merge. If not specified, the common columns with the same name in both DataFrames are used.
  • left_on and right_on: Columns in the left and right DataFrames to merge on, respectively, when the column names are different.
  • left_index and right_index: Indicates whether to use the DataFrame's index as the merge key.
  • suffixes: Suffixes to add to overlapping column names in case of duplicate columns.
  • validate: Performs checks to validate the integrity of the merge operation.

Basic Merge Operation

Let's illustrate a basic merge operation using an example where we have two DataFrames: one containing sales data and another containing customer information. We'll merge these DataFrames based on a common 'Customer_ID' column.

import pandas as pd
 
# Create a sample sales DataFrame
sales_data = {'Order_ID': [101, 102, 103, 104],
              'Customer_ID': [101, 102, 103, 104],
              'Product': ['Laptop', 'Tablet', 'Phone', 'Monitor']}
sales_df = pd.DataFrame(sales_data)
 
# Create a sample customers DataFrame
customers_data = {'Customer_ID': [101, 102, 103, 105],
                  'Customer_Name': ['Alice', 'Bob', 'Charlie', 'Eve']}
customers_df = pd.DataFrame(customers_data)
 
# Perform a basic merge on 'Customer_ID'
merged_df = pd.merge(sales_df, customers_df, on='Customer_ID')
 
# Display the merged DataFrame
print("Merged DataFrame:")
print(merged_df)
Merged DataFrame:
   Order_ID  Customer_ID Product Customer_Name
0       101          101  Laptop         Alice
1       102          102  Tablet           Bob
2       103          103   Phone       Charlie

In this example:

  • We have a 'sales_df' DataFrame containing order information, including 'Order_ID' and 'Customer_ID.'
  • We also have a 'customers_df' DataFrame containing customer information, including 'Customer_ID' and 'Customer_Name.'
  • We perform a basic merge on the 'Customer_ID' column to combine these DataFrames.

The resulting 'merged_df' DataFrame contains both sales and customer information, where the 'Customer_ID' column acts as the common key. This merged DataFrame allows you to analyze sales data while associating each order with the corresponding customer's name.

Types of Joins

When merging DataFrames using the merge() function in pandas, you can choose from several types of joins: inner, outer, left, and right. Each join type has distinct characteristics and use cases. Understanding these join types is essential for determining which one to use in different data merging scenarios.

1. Inner Join

An inner join, also known as an equi-join, returns only the rows that have matching values in both DataFrames based on the specified key(s). Rows with non-matching keys are excluded from the result.

result = pd.merge(left_df, right_df, on='key', how='inner')

Use an inner join when you want to retain only the common rows between two DataFrames.

2. Outer Join

An outer join returns all rows from both DataFrames and fills in missing values with NaN (or a specified fill value) for non-matching keys.

result = pd.merge(left_df, right_df, on='key', how='outer')

Use an outer join when you want to include all rows from both DataFrames and fill in missing values with NaN to maintain data completeness.

3. Left Join

A left join returns all rows from the left DataFrame and the matching rows from the right DataFrame. Non-matching rows from the left DataFrame are included, while non-matching rows from the right DataFrame are excluded.

result = pd.merge(left_df, right_df, on='key', how='left')

Use a left join when you want to keep all rows from the left DataFrame while including matching rows from the right DataFrame.

4. Right Join

A right join is the opposite of a left join. It returns all rows from the right DataFrame and the matching rows from the left DataFrame. Non-matching rows from the right DataFrame are included, while non-matching rows from the left DataFrame are excluded.

result = pd.merge(left_df, right_df, on='key', how='right')

Use a right join when you want to keep all rows from the right DataFrame while including matching rows from the left DataFrame.

The suffixes Parameter

When merging DataFrames with columns of the same name , you may encounter naming conflicts in the resulting DataFrame. To disambiguate these columns, you can use the suffixes parameter to add suffixes to the column names from the left and right DataFrames.

Example: Understanding Join Types and Use of suffixes

Let's illustrate the different join types using an example:

# Create sample DataFrames
left_df = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                        'value_left': [1, 2, 3, 4]})
 
right_df = pd.DataFrame({'key': ['C', 'D', 'E', 'F'],
                         'value_right': [5, 6, 7, 8]})
 
# declaring suffixes
suf = ('_left', '_right')
 
# Perform different types of joins
inner_join = pd.merge(left_df, right_df, on='key', how='inner', suffixes=suf)
outer_join = pd.merge(left_df, right_df, on='key', how='outer', suffixes=suf)
left_join = pd.merge(left_df, right_df, on='key', how='left', suffixes=suf)
right_join = pd.merge(left_df, right_df, on='key', how='right', suffixes=suf)
 
# Display the results
print("Inner Join:")
print(inner_join)
 
print("\nOuter Join:")
print(outer_join)
 
print("\nLeft Join:")
print(left_join)
 
print("\nRight Join:")
print(right_join)
Inner Join:
  key  value_left  value_right
0   C           3            5
1   D           4            6

Outer Join:
  key  value_left  value_right
0   A         1.0          NaN
1   B         2.0          NaN
2   C         3.0          5.0
3   D         4.0          6.0
4   E         NaN          7.0
5   F         NaN          8.0

Left Join:
  key  value_left  value_right
0   A           1          NaN
1   B           2          NaN
2   C           3          5.0
3   D           4          6.0

Right Join:
  key  value_left  value_right
0   C         3.0            5
1   D         4.0            6
2   E         NaN            7
3   F         NaN            8

In this example:

  • inner_join retains only the rows with matching 'key' values from both DataFrames.
  • outer_join includes all rows from both DataFrames and fills in missing values with NaN.
  • left_join keeps all rows from the left DataFrame while including matching rows from the right DataFrame.
  • right_join keeps all rows from the right DataFrame while including matching rows from the left DataFrame.

Merging on Multiple Keys

When you want to merge DataFrames on multiple keys, you can pass a list of column names as the on parameter in the merge() function.

result = pd.merge(left_df, right_df, on=['key1', 'key2'])
  • left_df: The left DataFrame.
  • right_df: The right DataFrame.
  • on: A list of column names to use as the merging keys.

Example: Multi-Key Merge

Let's illustrate a multi-key merge with an example. Suppose we have two DataFrames, one containing sales data and another containing customer information. To perform a multi-key merge, we'll use both 'Customer_ID' and 'Product_ID' columns as merging keys.

# Create a sample sales DataFrame
sales_data = {'Order_ID': [101, 102, 103, 104],
              'Customer_ID': [101, 102, 103, 104],
              'Product_ID': ['P1', 'P2', 'P1', 'P3']}
sales_df = pd.DataFrame(sales_data)
 
# Create a sample customers DataFrame
customers_data = {'Customer_ID': [101, 102, 103, 105],
                  'Product_ID': ['P1', 'P2', 'P1', 'P3'],
                  'Customer_Name': ['Alice', 'Bob', 'Charlie', 'Eve']}
customers_df = pd.DataFrame(customers_data)
 
# Perform a multi-key merge on 'Customer_ID' and 'Product_ID'
merged_df = pd.merge(sales_df, customers_df, on=['Customer_ID', 'Product_ID'], how='inner')
 
# Display the merged DataFrame
print("Merged DataFrame with multi-key merge:")
print(merged_df)
Merged DataFrame with multi-key merge:
   Order_ID  Customer_ID Product_ID Customer_Name
0       101          101         P1         Alice
1       102          102         P2           Bob
2       103          103         P1       Charlie

In this example:

  • pd.merge() is used to merge the 'sales_df' and 'customers_df' DataFrames based on the columns specified in the on parameter. In this case, the merge is performed on both 'Customer_ID' and 'Product_ID.'
  • The how parameter is set to 'inner,' which means that only the rows with matching combinations of 'Customer_ID' and 'Product_ID' from both DataFrames will be included in the result. This is an inner join operation.

Handling Duplicate Keys

When merging DataFrames, you may encounter situations where the key(s) used for merging have duplicate values in one or both DataFrames. Handling duplicate keys is crucial to ensure a successful merge and avoid conflicts in the resulting DataFrame. Let's discuss strategies for handling duplicate keys:

Dropping Duplicate Rows

One approach to handling duplicate keys is to drop duplicate rows from one or both DataFrames before merging. This ensures that each key appears only once in the merged DataFrame.

# Create a DataFrame with duplicate keys
df1 = pd.DataFrame({'key': ['A', 'B', 'B'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'C'], 'value2': [4, 5, 6]})
 
# Drop duplicate rows based on the 'key' column
df1 = df1.drop_duplicates('key')
print(f"df1 after removing duplicates:\n{df1}")
df2 = df2.drop_duplicates('key')
print(f"df2 after removing duplicates:\n{df2}")
 
# Perform a merge after removing duplicates
merged_df = pd.merge(df1, df2, on='key', how='inner')
 
# Display the merged DataFrame
print("Merged DataFrame after dropping duplicates:")
print(merged_df)
df1 after removing duplicates:
  key  value1
0   A       1
1   B       2
df2 after removing duplicates:
  key  value2
0   B       4
1   C       5
Merged DataFrame after dropping duplicates:
  key  value1  value2
0   B       2       4

Aggregating Duplicate Values

In some cases, you may want to aggregate values associated with duplicate keys using aggregation functions like sum, mean, or max.

# Create a DataFrame with duplicate keys and values
df1 = pd.DataFrame({'key': ['A', 'B', 'B'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'C'], 'value2': [4, 5, 6]})
 
# Aggregate values associated with duplicate keys
df1_agg = df1.groupby('key').agg({'value1': 'sum'}).reset_index()
print(f"df1 after summing duplicates key, values:\n{df1_agg}")
df2_agg = df2.groupby('key').agg({'value2': 'sum'}).reset_index()
print(f"df2 after summing duplicates key, values:\n{df2_agg}")
 
# Perform a merge after aggregation
merged_df = pd.merge(df1_agg, df2_agg, on='key', how='inner')
 
# Display the merged DataFrame
print("Merged DataFrame after aggregating duplicates:")
print(merged_df)
df1 after summing duplicates key, values:
  key  value1
0   A       1
1   B       5
df2 after summing duplicates key, values:
  key  value2
0   B       4
1   C      11
Merged DataFrame after aggregating duplicates:
  key  value1  value2
0   B       5       4

Using Indexes for Merging

In pandas, you can perform DataFrame merges based on their indexes rather than specific columns. This is particularly useful when your DataFrames have meaningful index labels that you want to use for merging.

To merge DataFrames based on their indexes, you need to specify the left_index and right_index parameters in the merge() function:

  • left_index: Set this parameter to True to use the left DataFrame's index for merging.
  • right_index: Set this parameter to True to use the right DataFrame's index for merging.

Example: Index-Based Merge

Let's illustrate an index-based merge using two DataFrames: 'orders_df' and 'order_details_df,' where we want to merge them based on their indexes.

# Create a sample sales DataFrame with index
sales_data = {'Customer_ID': [101, 102, 103, 104],
              'Product_ID': ['P1', 'P2', 'P1', 'P3']}
sales_df = pd.DataFrame(sales_data)
sales_df.set_index('Customer_ID', inplace=True)
 
# Create a sample customers DataFrame with index 
customers_data = {'Customer_ID': [101, 102, 103, 105],
                  'Customer_Name': ['Alice', 'Bob', 'Charlie', 'Eve']}
customers_df = pd.DataFrame(customers_data)
customers_df.set_index('Customer_ID', inplace=True)
 
# Perform an index-based merge
merged_df = pd.merge(sales_df, customers_df, 
                     left_index=True, right_index=True, how='outer')
# Display the merged DataFrame
print("Merged DataFrame with index-based merge:")
print(merged_df)
Merged DataFrame with index-based merge:
            Product_ID Customer_Name
Customer_ID                         
101                 P1         Alice
102                 P2           Bob
103                 P1       Charlie
104                 P3           NaN
105                NaN           Eve

In this example:

  • We create two DataFrames, 'sales_df' and 'customers_df,' each with an index based on the 'Customer_ID' column.
  • We use left_index=True and right_index=True in the pd.merge() function to specify index-based merging.
  • The result is a merged DataFrame ('merged_df') that combines data from both DataFrames based on their indexes, resulting in an outer join operation.

Merging on Different Key Names

To merge DataFrames with different key names, you need to specify the key columns using the left_on and right_on parameters in the merge() function:

  • left_on: Specify the key column(s) from the left DataFrame.
  • right_on: Specify the key column(s) from the right DataFrame.

Example: Merging on Different Key Names

Let's demonstrate how to merge two DataFrames, 'employees_df' and 'departments_df,' where the key columns have different names ('Employee_Dept' in 'employees_df' and 'Dept_ID' in 'departments_df').

# Create a sample 'employees_df' DataFrame with a key column 'Employee_Dept'
employees_data = {'Employee_ID': [1, 2, 3, 4],
                  'Employee_Name': ['Alice', 'Bob', 'Charlie', 'Eve'],
                  'Employee_Dept': ['HR', 'Finance', 'IT', 'Sales']}
employees_df = pd.DataFrame(employees_data)
 
# Create a sample 'departments_df' DataFrame with a key column 'Dept_ID'
departments_data = {'Dept_ID': [101, 102, 103, 104],
                    'Dept_Name': ['HR', 'Finance', 'IT', 'Sales']}
departments_df = pd.DataFrame(departments_data)
 
# Perform a merge with different key names using 'left_on' and 'right_on'
merged_df = pd.merge(employees_df, departments_df, left_on='Employee_Dept', right_on='Dept_Name', how='inner')
 
# Display the merged DataFrame
print("Merged DataFrame with different key names:")
print(merged_df)
Merged DataFrame with different key names:
   Employee_ID Employee_Name Employee_Dept  Dept_ID Dept_Name
0            1         Alice            HR      101        HR
1            2           Bob       Finance      102   Finance
2            3       Charlie            IT      103        IT
3            4           Eve         Sales      104     Sales

In this example:

  • We create two DataFrames, 'employees_df' and 'departments_df,' each with a key column that has a different name ('Employee_Dept' and 'Dept_Name,' respectively).
  • We use left_on='Employee_Dept' and right_on='Dept_Name' in the pd.merge() function to specify the key columns for both DataFrames.
  • The result is a merged DataFrame ('merged_df') that combines data from both DataFrames based on the specified key columns.

This demonstrates how to merge DataFrames with different key names using the left_on and right_on parameters, allowing you to handle cases where key columns have different names but represent the same information.

Real-World Example

# Load AAPL stock price data
apple = pd.read_csv('AAPL.csv', parse_dates=['Date'], index_col='Date')
 
# Load GOOG stock price data
google = pd.read_csv('GOOG.csv', parse_dates=['Date'], index_col='Date')
 
# indexing required columns
apple_close = apple['Close']
google_close = google['Close']
 
# concat on axis=1 and set keys parameter
apple_google = pd.merge(apple_close, google_close, 
                        left_index=True, right_index=True, 
                        suffixes=['_apple', '_google'])
 
print(apple_google.head())
            Close_apple  Close_google
Date                                 
2022-09-01   157.960007    110.550003
2022-09-02   155.809998    108.680000
2022-09-06   154.529999    107.480003
2022-09-07   155.960007    110.480003
2022-09-08   154.460007    109.419998

The main operation is performed using the pd.merge() function.

  • pd.merge() is used to merge the two Series, apple_close and google_close, based on their datetime indexes. The left_index=True and right_index=True parameters indicate that the merge operation should be based on the datetime index of both Series.
  • The suffixes parameter is specified with the values ['_apple', '_google'] to add suffixes to columns with the same names from the two Series. This is useful to distinguish between the 'Close' prices of Apple and Google in the merged DataFrame.