Data Visualization with Pandas and Matplotlib

Data visualization is a crucial part of data analysis, allowing us to gain insights from our data and communicate those insights effectively. In this post, we'll explore various types of plots using the Pandas df.plot() function, showcasing how to create different visualizations using Pandas and Matplotlib.

Loading and Inspecting the Dataset

Note: We will use Seaborn's Tips dataset. Where the 'tips' dataset lacks data suited to a particular plot type, we'll supplement it with dummy data.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
 
# Load the 'tips' dataset from Seaborn
tips_df = sns.load_dataset('tips')
 
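# Display column names, dtypes, and non-null counts
# (info() prints its report directly and returns None, so no print() is needed)
tips_df.info()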

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB

Understanding the `.plot()` Function

The .plot() method in Pandas is a convenient way to create various types of plots directly from a DataFrame or Series. It uses Matplotlib under the hood to draw the plots. In simplified form, a call looks like this:

DataFrame.plot(x=None, y=None, kind='line', ...other_parameters)

Essential Parameters

Here are the main parameters of the .plot() function:

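kind: Specifies the type of plot to create. Accepted values are 'line' (the default), 'bar', 'barh', 'hist', 'box', 'kde', 'area', 'pie', 'scatter', and 'hexbin'. The last two are available only on DataFrames, since they need both an x and a y column.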
x: Specifies the column from the DataFrame to be used as the x-axis values.
y: Specifies the column from the DataFrame to be used as the y-axis values.
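
As a minimal sketch of how these three parameters fit together, the snippet below draws the same scatter plot twice: once through kind, and once through the equivalent accessor method DataFrame.plot.scatter. The demo_df frame and its values are made up for illustration and are not part of the tips dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data, used only to illustrate the two call styles
demo_df = pd.DataFrame({'total_bill': [10.5, 20.0, 15.3, 30.1],
                        'tip': [1.5, 3.0, 2.5, 5.0]})

# Keyword form: the plot type is passed through `kind`
ax_kind = demo_df.plot(kind='scatter', x='total_bill', y='tip')

# Accessor form: the plot type is the method name
ax_accessor = demo_df.plot.scatter(x='total_bill', y='tip')

plt.show()
```

Both calls should produce the same chart, so the choice between them is mostly a matter of taste. The rest of this post sticks to the kind form.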

Other parameters

Additionally, the .plot() function accepts several other parameters depending on the plot type you choose. Here are a few examples:

title: Sets the title of the plot.
xlabel: Sets the label for the x-axis.
ylabel: Sets the label for the y-axis.
color: Specifies the color of the plot elements.
grid: Adds grid lines to the plot (True or False).
legend: Adds a legend to the plot (True or False).
bins: For histograms, specifies the number of bins.
... and many more.
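
To make the list above concrete, here is a small sketch that combines several of these parameters in one call. The scores Series is hypothetical data invented for this example, and the xlabel and ylabel arguments assume a reasonably recent pandas release (1.1 or later, to the best of our knowledge).

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical values standing in for any numeric column
scores = pd.Series([52, 61, 67, 70, 74, 78, 81, 85, 90, 96], name='score')

# title, xlabel, ylabel, color, grid, legend, and bins in a single call
ax = scores.plot(kind='hist', bins=5, color='steelblue',
                 title='Score Distribution',
                 xlabel='Score', ylabel='Frequency',
                 grid=True, legend=True)
plt.show()
```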

Let's dive into each of these plot types!

Line Plots (`kind='line'`)

Purpose: Line plots visualize trends and changes over time or across continuous data. They are good at showing patterns, fluctuations, and relationships between variables.

When to Use: Use line plots when you want to show trends, such as stock prices over time, temperature changes, or sales growth. They are also suitable for visualizing how multiple variables change in relation to each other.

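# Group by day and sum the total bill amounts
# (observed=True keeps only the categories present in the data and avoids
# the FutureWarning newer pandas versions raise for categorical groupers)
daily_total = tips_df.groupby('day', observed=True)['total_bill'].sum()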
 
# Create a line plot
daily_total.plot(kind='line', marker='o',
                 title= 'Total Bill Amounts by Day',
                 xlabel= 'Day',
                 ylabel= 'Total Bill Amount',
                 grid= True)
plt.show()

Line plot using Pandas df
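
Line plots can also show several variables at once, as mentioned under "When to Use". The sketch below passes a list of columns to y so that each column becomes its own line. The monthly_df frame and its numbers are hypothetical and only serve the illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical monthly figures, invented for illustration
monthly_df = pd.DataFrame({'month': [1, 2, 3, 4, 5, 6],
                           'online_sales': [120, 135, 150, 160, 155, 180],
                           'store_sales': [200, 190, 185, 170, 175, 165]})

# Passing a list to y draws one line per column on shared axes
ax = monthly_df.plot(kind='line', x='month',
                     y=['online_sales', 'store_sales'],
                     marker='o',
                     title='Sales by Channel (hypothetical data)',
                     grid=True)
plt.show()
```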

Bar Plots (`kind='bar'`)

Purpose: Bar plots compare values across categories. Each bar's height represents the value or count for one category.

When to Use: Bar plots are effective when you have discrete categories and want to compare their counts or values. For example, you can use them to compare sales of different products, grades of students in different subjects, or the distribution of data across different days.

# Calculate the count of meals by day
meals_by_day = tips_df['day'].value_counts()
 
# Create a bar plot
meals_by_day.plot(kind='bar',
                  title= 'Number of Meals by Day',
                  xlabel= 'Day',
                  ylabel= 'Number of Meals')
plt.show()

Bar plot using Pandas df
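
The kind list earlier also mentions 'barh', which draws the same comparison with horizontal bars and can be easier to read when category labels are long. A minimal sketch with hypothetical counts follows. The axis labels are set on the returned Axes object, because some pandas versions have been reported to swap xlabel and ylabel for horizontal bars.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical counts per category, typed in by hand for illustration
visits = pd.Series({'Thur': 60, 'Fri': 20, 'Sat': 85, 'Sun': 75})

# Sort ascending so the longest bar ends up at the top of the chart
ax = visits.sort_values().plot(kind='barh',
                               title='Visits by Day (hypothetical data)')

# Label the axes explicitly on the Matplotlib Axes object
ax.set_xlabel('Number of Visits')
ax.set_ylabel('Day')

plt.show()
```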

Histograms (`kind='hist'`)

Purpose: Histograms show the distribution of continuous data. They group data into bins and display the frequency of data points within each bin.

When to Use: Histograms are useful for understanding the data distribution and identifying patterns, modes, or outliers. Use them when you want to visualize the frequency distribution of a single variable, such as ages of participants, test scores, or income levels.

# Create a histogram of total bill amounts
tips_df['total_bill'].plot(kind='hist', bins=10, edgecolor='black',
                           title= 'Total Bill Amount Distribution',
                           xlabel= 'Total Bill Amount',
                           ylabel= 'Frequency' )
 
plt.show()

Histogram plot using Pandas df

Scatter Plots (`kind='scatter'`)

Purpose: Scatter plots visualize the relationship between two numeric variables. Each data point is plotted as a point on the graph.

When to Use: Scatter plots are ideal for exploring relationships and correlations between two variables. They can help identify patterns, clusters, outliers, and trends. Use them when you want to see how two variables interact, such as height vs. weight, age vs. income, or time spent studying vs. exam scores.

# Create a scatter plot of total bill amount vs tip
tips_df.plot(kind='scatter', x='total_bill', y='tip',
             title= 'Total Bill Amount vs Tip',
             xlabel= 'Total Bill Amount',
             ylabel= 'Tip')
 
plt.show()

Scatter plot using Pandas df
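
A scatter plot can carry a third variable through color. The sketch below uses the c and colormap arguments of the scatter plot to shade each point by a numeric column. The demo_df frame, including its party_size column, is hypothetical and merely mimics the shape of the tips data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical rows shaped like the tips data
demo_df = pd.DataFrame({'total_bill': [12.0, 18.5, 24.1, 31.7, 45.2],
                        'tip': [2.0, 3.0, 3.5, 5.0, 7.5],
                        'party_size': [2, 2, 3, 4, 6]})

# c names the column that drives the color; colormap picks the palette
ax = demo_df.plot(kind='scatter', x='total_bill', y='tip',
                  c='party_size', colormap='viridis',
                  title='Total Bill vs Tip, Colored by Party Size')
plt.show()
```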

Box Plots (Using `.boxplot`)

Purpose: Box plots provide insights into the distribution, median, and outliers of a dataset. They help identify the spread and central tendency of data.

When to Use: Use box plots to compare distributions across different categories or groups. They are useful for detecting outliers and understanding the spread of data. For example, you can use them to compare the distribution of test scores by subject or the distribution of salaries by job title.

# Create a box plot of total bill amounts by day
tips_df.boxplot(column='total_bill', by='day')
 
plt.title('Box Plot of Total Bill Amounts by Day')
plt.xlabel('Day')
plt.ylabel('Total Bill Amount')
plt.suptitle("")  # Remove default title
 
plt.show()

Box plot using Pandas df
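
The example above uses the dedicated .boxplot method, but box plots are also reachable through kind='box'. In that form, each numeric column gets its own box. The sketch below assumes a small hypothetical frame rather than the tips dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical numeric columns, invented for illustration
demo_df = pd.DataFrame({'total_bill': [10.3, 16.0, 21.5, 24.6, 29.8, 48.3],
                        'tip': [1.7, 2.0, 3.0, 3.5, 4.3, 6.7]})

# One box per column, drawn side by side on the same axes
ax = demo_df.plot(kind='box', title='Distribution of Total Bill and Tip')
ax.set_ylabel('Amount')

plt.show()
```

Because both columns share one y-axis here, this form works best when the columns are on a comparable scale.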

Area Plots (`kind='area'`)

Purpose: Area plots are line plots with the region beneath each line filled in. When several series are stacked, they show how each category contributes to the total over time or along a continuous variable.

When to Use: Use area plots to visualize how parts contribute to the whole. They are often used for tracking cumulative values, such as sales by product category over time or the distribution of expenses in a budget.

# Create a sample DataFrame for area plot
data = {'Year': [2015, 2016, 2017, 2018, 2019],
        'Sales': [100, 150, 200, 180, 250]}
df = pd.DataFrame(data)
 
# Create an area plot
ax = df.plot(kind='area', x='Year', y='Sales',
             title='Sales Over Years',
             xlabel='Year',
             ylabel='Sales',
             xticks= df['Year'])
 
plt.show()

Area plot using Pandas df
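
The single-series example shows the basic shape, but the "parts of a whole" use case needs more than one series. Here is a sketch with two hypothetical product columns. Area plots stack by default, and stacked=True is spelled out only to make that explicit.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sales per product, invented for illustration
sales_df = pd.DataFrame({'Year': [2015, 2016, 2017, 2018, 2019],
                         'product_a': [100, 150, 200, 180, 250],
                         'product_b': [80, 90, 120, 160, 140]})

# Each column becomes one layer; the top edge is the combined total
ax = sales_df.plot(kind='area', x='Year',
                   y=['product_a', 'product_b'],
                   stacked=True,
                   title='Sales by Product (hypothetical data)',
                   xticks=sales_df['Year'])
ax.set_ylabel('Sales')

plt.show()
```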

Pie Charts (`kind='pie'`)

Purpose: Pie charts display the proportion of each category in a dataset. They represent parts of a whole.

When to Use: Use pie charts when you want to show the relative contribution of different categories to a total. However, use them cautiously as they can become difficult to interpret when there are too many categories or when the differences in proportions are small.

# Calculate the count of meals by time
meals_by_time = tips_df['time'].value_counts()
 
# Create a pie chart
meals_by_time.plot(kind='pie', autopct='%1.1f%%', startangle=90,
                   title ='Distribution of Meals by Time',
                   ylabel = '')
 
plt.show()

Pie Chart using Pandas df

Hexbin Plot (`kind='hexbin'`)

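Purpose: Hexbin plots visualize the distribution of a large dataset by grouping points into hexagonal bins and coloring each bin by the number of points it contains. This makes data density visible where individual points would overlap.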

When to Use: Use hexbin plots when you have a large dataset with overlapping data points. They provide a way to show data density and patterns, especially in scatter plots where many points overlap.

# Create a hexbin plot of total bill amount vs tip
tips_df.plot(kind='hexbin', x='total_bill', y='tip', gridsize=15, cmap='YlOrRd',
             title= 'Hexbin Plot of Total Bill Amount vs Tip',
             xlabel= 'Total Bill Amount',
             ylabel= 'Tip')
 
plt.show()

Hexbin plot using Pandas df

Subplots using `plt.subplots()`

Matplotlib's plt.subplots() function allows you to create multiple plots in a single figure, which can be useful for comparing different visualizations side by side.

# Create a figure with two subplots
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 5))
 
# Subplot 1: Bar plot of meals by day
meals_by_day.plot(kind='bar', ax=axes[0])
axes[0].set_title('Number of Meals by Day')
axes[0].set_xlabel('Day')
axes[0].set_ylabel('Number of Meals')
 
# Subplot 2: Scatter plot of total bill amount vs tip
tips_df.plot(kind='scatter', x='total_bill', y='tip', ax=axes[1])
axes[1].set_title('Total Bill Amount vs Tip')
axes[1].set_xlabel('Total Bill Amount')
axes[1].set_ylabel('Tip')
 
plt.tight_layout()
plt.show()

Subplots using Pandas df

In this example, we create a figure with two subplots arranged side by side. The subplots() function returns a figure object and an array of axes objects. We use the ax parameter in each .plot() call to specify which subplot to plot on. The plt.tight_layout() function ensures that the subplots are properly spaced and not overlapping.
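
Separately from plt.subplots(), the .plot() method has its own subplots parameter. With subplots=True, each column is drawn on its own axes, and layout controls the grid. The sketch below assumes a hypothetical sales_df frame; it is one possible way to use the parameter, not the only one.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sales per product, invented for illustration
sales_df = pd.DataFrame({'Year': [2015, 2016, 2017, 2018, 2019],
                         'product_a': [100, 150, 200, 180, 250],
                         'product_b': [80, 90, 120, 160, 140]})

# subplots=True gives every y column its own axes;
# layout=(1, 2) arranges them in one row and two columns
axes = sales_df.plot(kind='line', x='Year',
                     y=['product_a', 'product_b'],
                     subplots=True, layout=(1, 2), figsize=(12, 5),
                     marker='o',
                     title=['Product A', 'Product B'])

plt.tight_layout()
plt.show()
```

This form is handy when every panel shows the same kind of plot for a different column. For mixed plot types, as in the bar and scatter example above, plt.subplots() with the ax parameter remains the more flexible route.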
