Grouping and Aggregation in Pandas
In data analysis, it's often necessary to group and aggregate data to gain insights and make meaningful conclusions. Pandas provides efficient tools for grouping and aggregating data. In this post, we will explore the concepts of grouping and aggregation using the Titanic dataset (opens in a new tab) as examples.
Grouping and Aggregation Functions
Grouping involves dividing a dataset into subsets based on the values of one or more columns. Aggregation refers to computing summary statistics (like mean, sum, count, etc.) for each group. Pandas provides the .groupby()
function for grouping and .agg()
for aggregation.
.groupby()
Function
Let's first explore the syntax and commonly used parameters for the Pandas .groupby()
function:
grouped = dataframe.groupby(by, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, observed=False, **kwargs)
by
: The column(s) or index levels based on which the data will be grouped. This can be a column name, a list of column names, a Series, or a function. It's the primary parameter for grouping.axis
: Specifies the axis along which the grouping is performed.0
for rows (default) and1
for columns. Read more about axis parameter herelevel
: If the axis is a MultiIndex, you can use this parameter to specify the level at which to perform the grouping.as_index
: IfTrue
(default), the grouping columns will become the index of the resulting DataFrame. Set toFalse
if you want to keep them as regular columns.sort
: IfTrue
(default), the groups are sorted. Set toFalse
for potential performance improvement if the data is already sorted.group_keys
: IfTrue
(default), include the group keys in the result along with the data. Set toFalse
to exclude them.
agg()
Function
Here's the syntax and commonly used parameters for the Pandas .agg()
function when used in combination with the .groupby()
operation:
result = grouped_object.agg(func=None, axis=0, *args, **kwargs)
func
: A function, a string (function name), or a list/dictionary of functions to apply to each group. This is the primary parameter and specifies the aggregation operation(s) to be applied.*args
: Additional positional arguments that can be passed to the aggregation function(s).**kwargs
: Additional keyword arguments that can be passed to the aggregation function(s).
Here's a list of some common built-in aggregation functions that can be used with the .agg()
function in Pandas, along with their descriptions:
Aggregation Function | Description |
---|---|
'sum' | Calculate the sum of values in a group. |
'mean' | Calculate the mean (average) of values. |
'median' | Calculate the median (middle) value. |
'min' | Find the minimum value in a group. |
'max' | Find the maximum value in a group. |
'count' | Count the number of non-null values. |
'std' | Calculate the standard deviation. |
'var' | Calculate the variance. |
'size' | Count the total number of values. |
'first' | Get the first value in a group. |
'last' | Get the last value in a group. |
'nunique' | Count the number of unique values. |
'prod' | Calculate the product of values. |
'all' | Check if all values are True. |
'any' | Check if any value is True. |
'quantile' | Calculate the specified quantile value. |
You can use these aggregation functions within the .agg()
method to perform a wide range of summary calculations on your data. Remember that you can also pass custom aggregation functions to the .agg()
method if the built-in functions don't fulfill your specific needs.
Loading and Exploring the Titanic Dataset
We will use Seaborn's Titanic dataset for the code examples. First, let's load this dataset into DataFrame
import pandas as pd
import seaborn as sns
# Load Titanic dataset
titanic_df = sns.load_dataset("titanic")
# Display first few rows
print(titanic_df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 survived 891 non-null int64
1 pclass 891 non-null int64
2 sex 891 non-null object
3 age 714 non-null float64
4 sibsp 891 non-null int64
5 parch 891 non-null int64
6 fare 891 non-null float64
7 embarked 889 non-null object
8 class 891 non-null category
9 who 891 non-null object
10 adult_male 891 non-null bool
11 deck 203 non-null category
12 embark_town 889 non-null object
13 alive 891 non-null object
14 alone 891 non-null bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB
None
Example 1: Number of People Survived by Sex
Let suppose, we need to find the number of people survived, grouped by sex. To answer this question, we first apply groupby()
function to the 'sex' column, and then apply size
aggregator on the 'survived' column. We divide this operation into two steps - first grouping the data and then applying the aggregator.
Grouping Data
# Group by 'sex' column
grouped_by_sex = titanic_df.groupby('sex')
# Display groups
for name, group in grouped_by_sex:
print(f"Group Name: {name}")
Group Name: female
Group Name: male
Aggregating Data
After grouping, we will apply the sum
aggregator to survived
column.
# apply 'sum' to 'survived' column
print(grouped_by_sex['survived'].agg('sum'))
sex
female 233
male 109
Name: survived, dtype: int64
We can combine grouping and aggregation in a single line
titanic_df.groupby('sex')['survived'].agg('sum')
Example 2: Average Age and Fare for Survived and Not Survived
We would like to explore that from those who survived (and not survived), what is the average age and fare paid:
titanic_df.groupby('survived')['age', 'fare'].agg('mean')
age fare
survived
0 30.626179 22.117887
1 28.343690 48.395408
Example 3: Average Age, Median Fare and Total Survived
Applying Multiple Aggregations
.agg
function allows us to find different aggregator for each column, under each group:
# Calculate multiple statistics for each passenger class
agg_stats_by_class = grouped_by_class.agg({
'age': 'mean',
'fare': 'median',
'survived': 'sum'
})
# Display aggregated statistics
print(agg_stats_by_class)
age fare survived
class
First 38.233441 60.2875 136
Second 29.877630 14.2500 87
Third 25.140620 8.0500 119
Example 4: Custom Aggregation Functions
You can define custom aggregation functions using the .agg()
method. Let suppose we are interested to find the difference between mean and median 'fare' in each class:
# Define a custom aggregation function
def my_custom_aggregation(arr):
return arr.mean() - arr.median()
# Apply custom aggregation to 'Fare' column
custom_agg_by_class = grouped_by_class['fare'].agg(my_custom_aggregation)
# Display custom aggregation results
print(custom_agg_by_class)
class
First 23.867188
Second 6.412183
Third 5.625550
Name: fare, dtype: float64
Example 5: Working with DateTime
Let first create a dummy dataset to demonstrate how to work with datetime objects in the groupby()
and agg()
functions.
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
# Create a dummy dataset
np.random.seed(0)
num_entries = 100
date_range = pd.date_range(start='2023-01-01', periods=num_entries, freq='D')
data = {
'Date': date_range,
'Value': np.random.randint(1, 100, size=num_entries),
'Category': np.random.choice(['A', 'B', 'C'], size=num_entries)
}
dummy_df = pd.DataFrame(data)
# Display the first few rows of the dataset
print(dummy_df.head())
Date Value Category
0 2023-01-01 45 A
1 2023-01-02 48 B
2 2023-01-03 65 C
3 2023-01-04 68 A
4 2023-01-05 68 B
Grouping and Aggregating by Month
In this example, we'll group the data by month and calculate the sum of values for each month.
# Add new column of Month, extracted from Date Column
dummy_df['Month'] = dummy_df['Date'].dt.to_period('M') # Extract month
# Group by month and calculate sum of values
grouped_by_month = dummy_df.groupby('Month').agg({'Value': 'sum'})
# Display the result
print(grouped_by_month)
Value
Month
2023-01 1793
2023-02 1113
2023-03 1384
2023-04 495
Applying Multiple Aggregations with DateTime Data
In this example, we'll group the data by category and calculate both the mean and sum of values for each category.
# Group by category and calculate mean and sum of values
grouped_by_category = dummy_df.groupby('Category').agg({
'Value': ['mean', 'sum']
})
# Display the result
print(grouped_by_category)
Value
mean sum
Category
A 42.027778 1513
B 55.448276 1608
C 47.542857 1664
Grouping and aggregation are powerful techniques for exploring and summarizing datasets. Pandas provides flexible tools to group data, perform aggregations, and gain insights efficiently. By understanding these concepts, you can unlock valuable information from your data.