Descriptive Statistics using Pandas

Descriptive statistics summarize the main characteristics of a dataset. In this post, we'll compute them with Pandas on real-world data from the Titanic dataset. We'll cover the mean, median, minimum and maximum, measures of dispersion, quantiles and percentiles, mode, frequency distribution, skewness, and kurtosis.

Load and Describe the Dataset

Loading the Dataset

Our first step is to load the 'titanic' dataset using seaborn's load_dataset() function:

import pandas as pd
import seaborn as sns
 
# Load the Titanic dataset
titanic_df = sns.load_dataset('titanic')

Using describe() Function for Summary Statistics

The describe() function provides a comprehensive summary of basic statistics for each numerical column in a DataFrame.

# Use the describe() function to get summary statistics of numerical columns
summary_stats = titanic_df.describe()
 
print("Summary Statistics:")
print(summary_stats)
Summary Statistics:

       survived  pclass     age   sibsp   parch    fare
count    891.00  891.00  714.00  891.00  891.00  891.00
mean       0.38    2.31   29.70    0.52    0.38   32.20
std        0.49    0.84   14.53    1.10    0.81   49.69
min        0.00    1.00    0.42    0.00    0.00    0.00
25%        0.00    2.00   20.12    0.00    0.00    7.91
50%        0.00    3.00   28.00    0.00    0.00   14.45
75%        1.00    3.00   38.00    1.00    0.00   31.00
max        1.00    3.00   80.00    8.00    6.00  512.33

In this example, describe() returns the count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum of each numerical column in the Titanic dataset. Note that the count for 'age' is 714 rather than 891: describe() skips missing values, and 177 passengers have no recorded age. The function gives a quick overview of the distribution and spread of the data.
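describe() also accepts parameters that customize the summary. The sketch below uses a small hand-made DataFrame (illustrative values, not the real Titanic data) to show the percentiles and include parameters:

```python
import pandas as pd

# Toy data for illustration only (not taken from the Titanic dataset)
toy_df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, None],
    "town": ["Southampton", "Cherbourg", "Southampton", "Southampton", "Queenstown"],
})

# Request the 10th and 90th percentiles instead of the default 25th and 75th;
# the median (50%) is always included
custom_summary = toy_df.describe(percentiles=[0.1, 0.9])
print(custom_summary)

# include="all" also summarizes non-numeric columns (count, unique, top, freq)
full_summary = toy_df.describe(include="all")
print(full_summary)
```

With include="all", statistics that do not apply to a column (for example the mean of 'town') appear as NaN.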

Functions for Descriptive Statistics

Besides getting summary statistics for all numerical columns with describe(), we can call built-in functions to compute individual statistics for the columns we choose.

Here's a table listing the commonly used functions in Pandas for performing descriptive statistics on DataFrame columns:

Function          Description
mean()            Calculate the mean (average) of numeric data.
median()          Find the median (middle value) of numeric data.
std()             Calculate the standard deviation of numeric data.
var()             Compute the variance of numeric data.
quantile()        Calculate quantiles (e.g., percentiles) of numeric data.
mode()            Find the mode (most frequent value) of data.
value_counts()    Count the occurrences of each unique value in a column.
skew()            Calculate the skewness of data distribution.
kurt()            Compute the kurtosis of data distribution.
min()             Find the minimum value in a column.
max()             Find the maximum value in a column.
describe()        Generate summary statistics for numerical columns.
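Several of these functions can also be requested in a single call. As a minimal sketch, assuming a small made-up DataFrame rather than the Titanic data, agg() takes a list of function names and returns one row per statistic:

```python
import pandas as pd

# Toy data for illustration only
toy_df = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "fare": [10.0, 20.0, 60.0],
})

# One row per requested statistic, one column per DataFrame column
stats = toy_df.agg(["mean", "median", "min", "max"])
print(stats)
```

This is handy when describe() returns more (or less) than you need.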

Mean and Median

Mean (Average): The mean is the sum of all values divided by the number of values. It represents the "center" of the data distribution.

Median (Middle Value): The median is the middle value in a sorted dataset. It's less sensitive to outliers compared to the mean.

Let's calculate the mean and median for the columns 'age' and 'fare':

# Calculate mean and median of 'age' and 'fare' columns
mean = titanic_df[['age', 'fare']].mean()
median = titanic_df[['age', 'fare']].median()
 
# Print mean age and fare
print("Mean Age and Fare:\n", mean)
 
# Print median age and fare
print("\nMedian Age:", median['age'].round(2))
print("Median Fare:", median['fare'].round(2))
Mean Age and Fare:
age     29.699118
fare    32.204208
dtype: float64

Median Age: 28.0
Median Fare: 14.45
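The gap between the mean and median fare above reflects how a few very large values pull the mean upward. The following sketch, using made-up numbers rather than Titanic data, shows the effect of a single outlier and how missing values are handled by default:

```python
import pandas as pd

# Toy fares for illustration only
fares = pd.Series([7.0, 8.0, 10.0, 12.0, 13.0])
fares_with_outlier = pd.Series([7.0, 8.0, 10.0, 12.0, 13.0, 500.0])

# Without the outlier, mean and median agree (both 10.0)
print(fares.mean(), fares.median())

# One extreme value moves the mean to about 91.67; the median only moves to 11.0
print(fares_with_outlier.mean(), fares_with_outlier.median())

# Missing values are skipped by default (skipna=True)
ages = pd.Series([20.0, None, 40.0])
print(ages.mean())              # mean of the two known values: 30.0
print(ages.mean(skipna=False))  # nan, because one value is missing
```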

Minimum and Maximum Values

Minimum Value (Min): The minimum value is the smallest value in the dataset or given column(s).

Maximum Value (Max): The maximum value is the largest value in the dataset or given column(s).

Let's calculate the min and max values of the 'age' and 'fare' columns:

# Find the minimum and maximum of 'age' and 'fare'
# (descriptive names keep Python's built-in min() and max() usable)
min_values = titanic_df[['age', 'fare']].min()
max_values = titanic_df[['age', 'fare']].max()

# Print min and max age
print("Minimum Age:", min_values['age'])
print("Maximum Age:", max_values['age'])

# Print min and max fare
print("Minimum Fare:", min_values['fare'])
print("Maximum Fare:", max_values['fare'])
Minimum Age: 0.42
Maximum Age: 80.0
Minimum Fare: 0.0
Maximum Fare: 512.3292
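Knowing the extreme value is often only half the question; the other half is which row it belongs to. As a small sketch on made-up data (the names and fares are hypothetical), idxmin() and idxmax() return the index label of the smallest and largest value:

```python
import pandas as pd

# Toy data for illustration only; names and fares are invented
toy_df = pd.DataFrame({
    "name": ["Ann", "Ben", "Cara"],
    "fare": [7.25, 71.28, 8.05],
})

# Index labels of the rows holding the smallest and largest fare
min_row_label = toy_df["fare"].idxmin()
max_row_label = toy_df["fare"].idxmax()

# Use the labels to look up the full rows
print(toy_df.loc[min_row_label])
print(toy_df.loc[max_row_label])
```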

Measures of Dispersion

Standard Deviation: The standard deviation measures the spread of data around the mean. A higher standard deviation indicates greater variability.

Variance: The variance is the average of the squared differences from the mean, and the standard deviation is its square root. By default, Pandas computes the sample variance, dividing by n - 1 rather than n (ddof=1).

Let's calculate the std and var of the 'age' and 'fare' columns:

# Calculate standard deviation and variance of 'age'  and 'fare'
std = titanic_df[['age', 'fare']].std()
var = titanic_df[['age', 'fare']].var()
 
# Print std and var of age
print("Standard Deviation of Age:", std['age'].round(2))
print("Variance of Age:", var['age'].round(2))
 
# Print std and var of fare
print("Standard Deviation of Fare:", std['fare'].round(2))
print("Variance of Fare:", var['fare'].round(2))
Standard Deviation of Age: 14.53
Variance of Age: 211.02
Standard Deviation of Fare: 49.69
Variance of Fare: 2469.44
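To make the definitions concrete, here is a worked sketch on eight made-up values whose mean is 5 and whose squared deviations sum to 32. It contrasts the Pandas default (sample statistics, ddof=1) with the population versions (ddof=0):

```python
import pandas as pd

# Toy values for illustration only: mean is 5, squared deviations sum to 32
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

# Default: sample variance, dividing by n - 1 = 7
sample_var = values.var()

# Population variance, dividing by n = 8
population_var = values.var(ddof=0)
population_std = values.std(ddof=0)

print("Sample variance:", sample_var)          # 32 / 7, about 4.57
print("Population variance:", population_var)  # 32 / 8 = 4.0
print("Population std:", population_std)       # square root of 4.0 = 2.0
```

The two versions converge as the number of observations grows, so the choice matters most for small samples.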

Quantiles and Percentiles

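Quantiles: Quantiles are cut points that divide the sorted data into intervals holding equal shares of the values. Percentiles are quantiles expressed on a 0 to 100 scale: the median is the 50th percentile, and the 25th (Q1) and 75th (Q3) percentiles are the first and third quartiles.

Let's calculate the 25th, 50th, and 75th percentiles of the 'fare' column using the quantile() function: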

# Calculate the 25th, 50th (median), and 75th percentiles of the 'fare' column,
# rounded to two decimals for display
percentiles = titanic_df['fare'].quantile([0.25, 0.5, 0.75]).round(2)

print("25th Percentile:", percentiles[0.25])
print("Median (50th Percentile):", percentiles[0.5])
print("75th Percentile:", percentiles[0.75])
25th Percentile: 7.91
Median (50th Percentile): 14.45
75th Percentile: 31.0
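A common use of Q1 and Q3 is the interquartile range (IQR = Q3 - Q1), which describes the spread of the middle half of the data. The sketch below, on made-up values, also applies the conventional 1.5 × IQR rule to flag outliers; that rule is a widely used convention assumed here, not something built into quantile():

```python
import pandas as pd

# Toy values for illustration only
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 50])

q1 = values.quantile(0.25)  # 3.0
q3 = values.quantile(0.75)  # 7.0
iqr = q3 - q1               # 4.0

# Conventional outlier fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print("IQR:", iqr)
print("Outliers:", outliers.tolist())
```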

Mode and Frequency Distribution

Mode: The mode is the value that appears most frequently in a dataset.

Frequency Distribution: A frequency distribution table shows the count of each unique value in a dataset.

Let's find the mode and build a frequency distribution table (using value_counts()) for the 'embark_town' column:

# Calculate the mode of 'embark_town' column
mode_embarked = titanic_df['embark_town'].mode()
 
print("Mode of Embarked:", mode_embarked[0])
 
# Create a frequency distribution table for 'embark_town' 
freq_table = titanic_df['embark_town'].value_counts()
 
print("\nFrequency Distribution:\n", freq_table)
Mode of Embarked: Southampton

Frequency Distribution:
Southampton    644
Cherbourg      168
Queenstown      77
Name: embark_town, dtype: int64
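The counts above add up to 889 rather than 891 because value_counts() leaves out missing values by default. As a minimal sketch on made-up data, the normalize and dropna parameters change what is reported, and mode() returns every value tied for most frequent:

```python
import pandas as pd

# Toy data for illustration only; None marks a missing value
towns = pd.Series(["Southampton", "Southampton", "Cherbourg", "Queenstown", None])

# Relative frequencies of the non-missing values (they sum to 1)
proportions = towns.value_counts(normalize=True)
print(proportions)

# Keep missing values as their own category
counts_with_missing = towns.value_counts(dropna=False)
print(counts_with_missing)

# mode() returns a Series because several values can tie
tied_modes = pd.Series([1, 1, 2, 2, 3]).mode()
print(tied_modes.tolist())
```

This is why the earlier example reads the first element of the result with mode_embarked[0].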

Skewness and Kurtosis

Skewness: Skewness measures the asymmetry of the data distribution. A positive skew indicates a long tail on the right, while a negative skew indicates a long tail on the left.

Kurtosis: Kurtosis measures the "tailedness" of the data distribution. High kurtosis indicates heavy tails and more extreme values. Pandas' kurt() returns excess kurtosis (Fisher's definition), so a normal distribution scores 0.

Let's calculate the skew and kurt of the 'age' column:

# Calculate skewness and kurtosis of 'age' column
skewness_age = titanic_df['age'].skew()
kurtosis_age = titanic_df['age'].kurt()
 
print("Skewness of Age:", skewness_age)
print("Kurtosis of Age:", kurtosis_age)
Skewness of Age: 0.38910778230082704
Kurtosis of Age: 0.17827415364210353
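To build intuition for the sign of these numbers, the following sketch compares two made-up series: one perfectly symmetric, and one with a single large value forming a right tail.

```python
import pandas as pd

# Toy data for illustration only
symmetric = pd.Series([1, 2, 3, 4, 5])
right_tailed = pd.Series([0] * 9 + [10])  # nine zeros and one large value

# Symmetric data has skewness close to 0; a right tail gives positive skewness
print("Skewness (symmetric):", symmetric.skew())
print("Skewness (right-tailed):", right_tailed.skew())

# Flat, evenly spread data has negative excess kurtosis;
# a single extreme value produces a large positive one
print("Kurtosis (symmetric):", symmetric.kurt())
print("Kurtosis (right-tailed):", right_tailed.kurt())
```

By this yardstick, the 'age' values above (skewness about 0.39, kurtosis about 0.18) are mildly right-skewed with tails close to those of a normal distribution.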

These descriptive statistics provide valuable insights into the central tendency, spread, distribution, and shape of real-world data. Understanding these measures is essential for making informed decisions in data analysis.