Pandas
Data Analysis
Joining Data

Joining DataFrames in Pandas

One of the essential methods for combining DataFrames is pandas.DataFrame.join. This method allows you to merge two or more DataFrames based on common columns or indices. It is a fundamental tool in data wrangling and is often used to bring together different pieces of data for analysis.

Basic Joining Operation

Case A: Join using on

In this example, we'll create two simple DataFrames and join them based on a common column using on parameter for key

import pandas as pd
 
# Create the first DataFrame
data1 = {'ID': [1, 2, 3],
         'Name': ['Alice', 'Bob', 'Charlie']}
df1 = pd.DataFrame(data1)
 
# Create the second DataFrame
data2 = {'ID': [2, 3, 4],
         'Age': [25, 30, 22]}
df2 = pd.DataFrame(data2)
 
# Perform a join on the 'ID' column, default value of how='left'
result = df1.join(df2.set_index('ID'), on='ID')
 
print(result)

In this example, we have two DataFrames, df1 and df2, both containing information about individuals, with a common 'ID' column. We use the join method to perform an left(default) join on the 'ID' column, which combines the two DataFrames based on 'df1' 'ID' values.

Output:

   ID     Name   Age
0   1    Alice   NaN
1   2      Bob  25.0
2   3  Charlie  30.0

Case 2: Join on Common Index

We can skip the on parameter if both DataFrames have a common index:

# Perform an inner join on common index
result = df1.set_index('ID').join(df2.set_index('ID'))
 
print(result)
       Name   Age
ID               
1     Alice   NaN
2       Bob  25.0
3   Charlie  30.0

Types of Joins

When using the pandas.DataFrame.join method, you have the flexibility to choose from several types of joins to combine DataFrames. The choice of join type determines how rows are included in the resulting DataFrame based on the keys. Let's explore the four most common types of joins: inner, outer, left, and right joins.

1. Inner Join

An inner join returns only the rows that have matching keys in both DataFrames. Rows with keys that do not exist in both DataFrames are excluded from the result.

# Perform an inner join on the 'ID' column
result_inner = df1.set_index('ID').join(df2.set_index('ID'), how='inner')
 
print("Inner Join Result:")
print(result_inner)

Output:

Inner Join Result:
       Name  Age
ID              
2       Bob   25
3   Charlie   30

In this example, only the rows with 'ID' values 2 and 3, which exist in both df1 and df2, are included in the result.

2. Outer Join

An outer join returns all rows from both DataFrames and fills in missing values with NaN for columns from the DataFrame that doesn't have a matching key.

# Perform an outer join on the 'ID' column
result_outer = df1.set_index('ID').join(df2.set_index('ID'), how='outer')
 
print("\nOuter Join Result:")
print(result_outer)

Output:

Outer Join Result:
       Name   Age
ID               
1     Alice   NaN
2       Bob  25.0
3   Charlie  30.0
4       NaN  22.0

In this case, all rows from both DataFrames are included in the result, and missing values are filled with NaN.

3. Left Join

A left join returns all rows from the left DataFrame and the matching rows from the right DataFrame. Rows with keys that do not exist in the right DataFrame will have NaN values for the columns from the right DataFrame. We have already covered this in the section - Basic Joining Operation

4. Right Join

A right join is the opposite of a left join. It returns all rows from the right DataFrame and the matching rows from the left DataFrame. Rows with keys that do not exist in the left DataFrame will have NaN values for the columns from the left DataFrame.

# Perform a right join on the 'ID' column
result_right = df1.set_index('ID').join(df2.set_index('ID'), how='right')
 
print("\nRight Join Result:")
print(result_right)

Output:

Right Join Result:
       Name  Age
ID              
2       Bob   25
3   Charlie   30
4       NaN   22

In this case, all rows from df2 are included, and only the matching rows from df1 are included, with NaN values for the missing 'Name' values.

Comparing Concat, Merge and Join

concat():

Use concat() when you need to append either rows or columns from one DataFrame to another. It's a straightforward way to combine data without needing to align data by columns or index labels. This is particularly useful when you have DataFrames with similar structures and you want to combine them into a larger dataset. For example, if you have sales data for different months as separate DataFrames and you want to create a single DataFrames with sales data for the entire year, concat() would be a good choice.

merge():

Use merge() when you need to align DataFrames based on one or more common columns (keys). This is useful when you have data in different DataFrames that is related in some way and you want to bring it together in a single DataFrame. For example, if you have one DataFrame with product details and another DataFrame with supplier details, and both DataFrames have a common 'Product ID' column, you can use merge() to create a single DataFrame that includes both product and supplier details for each product.

join():

Use join() when you need to combine DataFrames based on their index. This is useful when you have dataframes where the row ordering matters and you want to combine data based on that ordering. For example, if you have a dataframe with stock prices for different dates and another dataframe with volume of stocks traded for different dates, and the dates are set as the index in both dataframes, you can use join() to create a single dataframe that includes both price and volume data for each date.

Remember, the choice between concat(), merge(), and join() depends on the structure of your data and what you're trying to achieve. Sometimes more than one function could be used to achieve the same result, and the choice between them comes down to personal preference or the specific requirements of your project.