
Correlation Analysis

Correlation analysis is a fundamental technique in data analysis that helps us understand the relationships between variables. It provides insights into how changes in one variable might be related to changes in another variable. In this post, we'll explore:

  • calculating correlations using the .corr() method and interpreting the correlation values,
  • creating correlation heatmaps and visually identifying strong and weak correlations,
  • visualizing correlations using scatter plots and trendlines,
  • important cautions and considerations when working with correlations.

Loading and Inspecting the Dataset

Note: We will use Seaborn's Tips dataset.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
 
# Load the 'tips' dataset from Seaborn
tips_df = sns.load_dataset('tips')
 
# Display info 
print(tips_df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB
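As the info output shows, four of the seven columns are categorical. Since .corr() only operates on numeric data, one convenient pattern is to select the numeric columns first with select_dtypes(). A minimal sketch, using a small toy frame (hypothetical values) in place of the full dataset:

```python
import pandas as pd

# Toy frame mixing numeric and categorical columns (hypothetical
# stand-in for tips_df, for illustration only)
df = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01],
    'tip': [1.01, 1.66, 3.50],
    'day': pd.Categorical(['Sun', 'Sun', 'Sat']),
})

# .corr() only applies to numeric columns; select them explicitly
numeric_cols = df.select_dtypes(include='number')
print(list(numeric_cols.columns))  # ['total_bill', 'tip']
```

This keeps the correlation step from tripping over categorical columns such as 'sex', 'smoker', 'day', and 'time'.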

Calculating Correlation

In Pandas, the .corr() method is used to compute the correlation coefficients between numeric variables in a DataFrame. It calculates the Pearson correlation coefficient, which measures the linear relationship between two variables. The values range from -1 to 1, where:

  • A positive correlation (close to 1) indicates that as one variable increases, the other variable tends to increase as well.
  • A negative correlation (close to -1) indicates that as one variable increases, the other variable tends to decrease.
  • A correlation close to 0 indicates a weak or no linear relationship between variables.

# Calculate pairwise correlation coefficients
correlation_matrix = tips_df[['total_bill', 'tip', 'size']].corr()
 
print("Correlation Matrix:")
print(correlation_matrix)
Correlation Matrix:
            total_bill       tip      size
total_bill    1.000000  0.675734  0.598315
tip           0.675734  1.000000  0.489299
size          0.598315  0.489299  1.000000

In the output, you can see the correlation values between the 'total_bill', 'tip', and 'size' columns of the 'tips' dataset. For instance, the correlation between 'total_bill' and 'tip' is approximately 0.68, indicating a moderately positive linear relationship between the total bill amount and the tip amount. Look at the other correlation values: do they suggest moderately positive relationships as well?
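If you only need a single coefficient rather than the full matrix, Series.corr() computes one pairwise value, and both it and DataFrame.corr() accept a method argument for rank-based alternatives such as Spearman. A minimal sketch, using a small toy frame (hypothetical values) in place of the tips data:

```python
import pandas as pd

# Toy stand-in for the tips data (hypothetical values)
df = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
    'tip':        [1.01,  1.66,  3.50,  3.31,  3.61,  4.71],
})

# Single pairwise Pearson coefficient between two Series
r = df['total_bill'].corr(df['tip'])

# Rank-based (Spearman) correlation for the whole frame
spearman = df.corr(method='spearman')

print(round(r, 2))
print(spearman.loc['total_bill', 'tip'])
```

Spearman correlation works on ranks rather than raw values, so it can pick up monotonic relationships that are not strictly linear.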

Correlation Heatmaps

A correlation heatmap is a graphical representation of the correlation matrix, where each correlation coefficient is color-coded to visualize the strength and direction of the relationship. We can create correlation heatmaps using Seaborn's heatmap() function.

# Create a heatmap of the correlation matrix
plt.figure(figsize=(8, 6))
 
sns.heatmap(correlation_matrix, annot=True,
            cmap="coolwarm", vmin=0.0, vmax=1.0)
plt.title('Correlation Heatmap')
 
plt.show()

Correlation Heatmap

In the heatmap, start with the colorbar on the right, which shows how correlation values map to colors. Here, correlations of 0 and 1 are represented by dark blue and dark red, respectively, and the colormap diverges from light blue to light red around a correlation of 0.5. Darker red squares indicate stronger correlations. For instance, the square at the intersection of 'total_bill' and 'tip' is relatively dark, indicating a strong positive correlation. On the other hand, the square at the intersection of 'size' and 'tip' is lighter blue, suggesting a weaker correlation.
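Because a correlation matrix is symmetric, a common refinement is to mask the upper triangle so each pair appears only once. A sketch using heatmap()'s mask parameter on a hand-written toy matrix (the values roughly mirror the tips output above):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Toy correlation matrix (hypothetical values, for illustration)
cols = ['total_bill', 'tip', 'size']
corr = pd.DataFrame(
    [[1.00, 0.68, 0.60],
     [0.68, 1.00, 0.49],
     [0.60, 0.49, 1.00]],
    index=cols, columns=cols,
)

# Boolean mask that hides the redundant upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

ax = sns.heatmap(corr, mask=mask, annot=True,
                 cmap='coolwarm', vmin=0.0, vmax=1.0)
ax.set_title('Correlation Heatmap (lower triangle)')
# plt.show()  # uncomment to display
```

The diagonal of 1.0s and the duplicated upper half carry no extra information, so masking them keeps the plot focused on the distinct pairs.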

Scatter Plots for Correlation

Scatter plots are a valuable tool for visually understanding the relationship between two numeric variables. By plotting data points on a two-dimensional plane, we can observe patterns and trends. We have already covered how to draw a scatter plot with Pandas' plot() function. In this section, we will use Seaborn's regplot(), which lets us create a scatter plot with a regression line. Here's how to do it:

# Create a scatter plot with a trendline
sns.regplot(data=tips_df, x='total_bill', y='tip', color='blue')
 
# labelling 
plt.title('Total Bill Amount vs Tip')
plt.xlabel('Total Bill Amount')
plt.ylabel('Tip')
 
plt.show()

Scatter plot with regression line

The sns.regplot() function automatically adds the regression line based on the provided variables, making it a convenient way to visualize the relationship and trend between two variables. The diverging band around the regression line is a confidence interval band. It indicates the uncertainty associated with the estimated regression line, showing the range within which the true regression line is likely to lie at a given confidence level.

In the scatter plots, each point represents a data entry with values on the x and y axes corresponding to the two variables being compared. The trendlines provide an overall indication of the direction of the correlation: upward-sloping for positive correlations and downward-sloping for negative correlations. As the scatter plot and regression line suggest, there is a moderately positive correlation between "tip" and "total_bill".
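To put a number on what the plot shows, SciPy's stats.pearsonr() returns both the coefficient and a p-value for the null hypothesis of no linear relationship. A sketch with hypothetical toy values; with the real data you would pass tips_df['total_bill'] and tips_df['tip']:

```python
from scipy import stats

# Toy data (hypothetical values, for illustration only)
bills = [16.99, 10.34, 21.01, 23.68, 24.59, 25.29]
tips = [1.01, 1.66, 3.50, 3.31, 3.61, 4.71]

# Pearson coefficient plus a p-value for the null of no linear relationship
r, p = stats.pearsonr(bills, tips)
print(f"r = {r:.2f}, p = {p:.4f}")
```

A small p-value indicates the observed correlation is unlikely under the null hypothesis, though with only a handful of points the estimate is very noisy.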

Cautions with Correlation

A fundamental principle in statistics is that correlation does not imply causation. Just because two variables are correlated doesn't mean that changes in one variable cause changes in the other. Correlation only quantifies the strength and direction of a linear relationship between variables.

For example, consider the correlation between ice cream sales and the number of drowning incidents in a city. These two variables might be positively correlated during the summer months, but that doesn't mean that buying ice cream causes more drownings. Instead, both variables are influenced by a lurking variable, such as the temperature.

Addressing Confounding Variables and Lurking Variables:

Confounding Variables: These are third variables that are related to both the independent and dependent variables, which might lead to a false perception of causation. To address this, perform controlled experiments or use statistical techniques like regression analysis to control for confounding variables.

Lurking Variables: These are variables that aren't directly included in the analysis but might influence the variables being studied. It's crucial to consider and account for lurking variables to ensure the validity of your findings.

Consider the classic example of the positive correlation between ice cream sales and the number of drowning incidents. The lurking variable in this case is the hot weather during summer, which leads to both higher ice cream sales and more people swimming, thus increasing the risk of drowning.
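The lurking-variable effect is easy to reproduce with simulated data. In the sketch below (all quantities hypothetical), a simulated 'temperature' drives both ice cream sales and drownings, and the two come out strongly correlated even though neither causes the other:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Simulated lurking variable: daily temperature (hypothetical)
temperature = rng.uniform(15, 35, size=200)

# Both quantities depend on temperature, not on each other
ice_cream_sales = 10 * temperature + rng.normal(0, 20, size=200)
drownings = 0.3 * temperature + rng.normal(0, 1.5, size=200)

df = pd.DataFrame({'sales': ice_cream_sales, 'drownings': drownings})
r = df['sales'].corr(df['drownings'])
print(f"correlation: {r:.2f}")  # strongly positive, yet no causal link
```

Conditioning on the lurking variable (for example, correlating sales and drownings only among days with similar temperatures, or controlling for temperature in a regression) would reveal that the direct relationship is close to zero.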