Pandas
Data Manipulation
Grouping and Aggregation

Grouping and Aggregation in Pandas

In data analysis, it's often necessary to group and aggregate data to gain insights and make meaningful conclusions. Pandas provides efficient tools for grouping and aggregating data. In this post, we will explore the concepts of grouping and aggregation using the Titanic dataset (opens in a new tab) as examples.

Grouping and Aggregation Functions

Grouping involves dividing a dataset into subsets based on the values of one or more columns. Aggregation refers to computing summary statistics (like mean, sum, count, etc.) for each group. Pandas provides the .groupby() function for grouping and .agg() for aggregation.

.groupby() Function

Let's first explore the syntax and commonly used parameters for the Pandas .groupby() function:

grouped = dataframe.groupby(by, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, observed=False, **kwargs)
  • by: The column(s) or index levels based on which the data will be grouped. This can be a column name, a list of column names, a Series, or a function. It's the primary parameter for grouping.
  • axis: Specifies the axis along which the grouping is performed. 0 for rows (default) and 1 for columns. Read more about axis parameter here
  • level: If the axis is a MultiIndex, you can use this parameter to specify the level at which to perform the grouping.
  • as_index: If True (default), the grouping columns will become the index of the resulting DataFrame. Set to False if you want to keep them as regular columns.
  • sort: If True (default), the groups are sorted. Set to False for potential performance improvement if the data is already sorted.
  • group_keys: If True (default), include the group keys in the result along with the data. Set to False to exclude them.

agg() Function

Here's the syntax and commonly used parameters for the Pandas .agg() function when used in combination with the .groupby() operation:

result = grouped_object.agg(func=None, axis=0, *args, **kwargs)
  • func: A function, a string (function name), or a list/dictionary of functions to apply to each group. This is the primary parameter and specifies the aggregation operation(s) to be applied.
  • *args: Additional positional arguments that can be passed to the aggregation function(s).
  • **kwargs: Additional keyword arguments that can be passed to the aggregation function(s).

Here's a list of some common built-in aggregation functions that can be used with the .agg() function in Pandas, along with their descriptions:

Aggregation FunctionDescription
'sum'Calculate the sum of values in a group.
'mean'Calculate the mean (average) of values.
'median'Calculate the median (middle) value.
'min'Find the minimum value in a group.
'max'Find the maximum value in a group.
'count'Count the number of non-null values.
'std'Calculate the standard deviation.
'var'Calculate the variance.
'size'Count the total number of values.
'first'Get the first value in a group.
'last'Get the last value in a group.
'nunique'Count the number of unique values.
'prod'Calculate the product of values.
'all'Check if all values are True.
'any'Check if any value is True.
'quantile'Calculate the specified quantile value.

You can use these aggregation functions within the .agg() method to perform a wide range of summary calculations on your data. Remember that you can also pass custom aggregation functions to the .agg() method if the built-in functions don't fulfill your specific needs.

Loading and Exploring the Titanic Dataset

We will use Seaborn's Titanic dataset for the code examples. First, let's load this dataset into DataFrame

import pandas as pd
import seaborn as sns
 
# Load Titanic dataset
titanic_df = sns.load_dataset("titanic")
 
# Display first few rows
print(titanic_df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB
None

Example 1: Number of People Survived by Sex

Let suppose, we need to find the number of people survived, grouped by sex. To answer this question, we first apply groupby() function to the 'sex' column, and then apply size aggregator on the 'survived' column. We divide this operation into two steps - first grouping the data and then applying the aggregator.

Grouping Data

# Group by 'sex' column
grouped_by_sex = titanic_df.groupby('sex')
 
# Display groups
for name, group in grouped_by_sex:
    print(f"Group Name: {name}")
Group Name: female
Group Name: male

Aggregating Data

After grouping, we will apply the sum aggregator to survived column.

# apply 'sum' to 'survived' column
print(grouped_by_sex['survived'].agg('sum'))
sex
female    233
male      109
Name: survived, dtype: int64

We can combine grouping and aggregation in a single line

titanic_df.groupby('sex')['survived'].agg('sum')

Example 2: Average Age and Fare for Survived and Not Survived

We would like to explore that from those who survived (and not survived), what is the average age and fare paid:

titanic_df.groupby('survived')['age', 'fare'].agg('mean')
                age       fare
survived                      
0         30.626179  22.117887
1         28.343690  48.395408

Example 3: Average Age, Median Fare and Total Survived

Applying Multiple Aggregations

.agg function allows us to find different aggregator for each column, under each group:

# Calculate multiple statistics for each passenger class
agg_stats_by_class = grouped_by_class.agg({
    'age': 'mean',
    'fare': 'median',
    'survived': 'sum'
})
 
# Display aggregated statistics
print(agg_stats_by_class)
              age     fare  survived
class                               
First   38.233441  60.2875       136
Second  29.877630  14.2500        87
Third   25.140620   8.0500       119

Example 4: Custom Aggregation Functions

You can define custom aggregation functions using the .agg() method. Let suppose we are interested to find the difference between mean and median 'fare' in each class:

# Define a custom aggregation function
def my_custom_aggregation(arr):
    return arr.mean() - arr.median()
 
# Apply custom aggregation to 'Fare' column
custom_agg_by_class = grouped_by_class['fare'].agg(my_custom_aggregation)
 
# Display custom aggregation results
print(custom_agg_by_class)
class
First     23.867188
Second     6.412183
Third      5.625550
Name: fare, dtype: float64

Example 5: Working with DateTime

Let first create a dummy dataset to demonstrate how to work with datetime objects in the groupby() and agg() functions.

import pandas as pd
import numpy as np
from datetime import datetime, timedelta
 
# Create a dummy dataset
np.random.seed(0)
num_entries = 100
date_range = pd.date_range(start='2023-01-01', periods=num_entries, freq='D')
data = {
    'Date': date_range,
    'Value': np.random.randint(1, 100, size=num_entries),
    'Category': np.random.choice(['A', 'B', 'C'], size=num_entries)
}
dummy_df = pd.DataFrame(data)
 
# Display the first few rows of the dataset
print(dummy_df.head())
        Date  Value Category
0 2023-01-01     45        A
1 2023-01-02     48        B
2 2023-01-03     65        C
3 2023-01-04     68        A
4 2023-01-05     68        B

Grouping and Aggregating by Month

In this example, we'll group the data by month and calculate the sum of values for each month.

# Add new column of Month, extracted from Date Column
dummy_df['Month'] = dummy_df['Date'].dt.to_period('M')  # Extract month
 
# Group by month and calculate sum of values
grouped_by_month = dummy_df.groupby('Month').agg({'Value': 'sum'})
 
# Display the result
print(grouped_by_month)
         Value
Month         
2023-01   1793
2023-02   1113
2023-03   1384
2023-04    495

Applying Multiple Aggregations with DateTime Data

In this example, we'll group the data by category and calculate both the mean and sum of values for each category.

# Group by category and calculate mean and sum of values
grouped_by_category = dummy_df.groupby('Category').agg({
    'Value': ['mean', 'sum']
})
 
# Display the result
print(grouped_by_category)
              Value      
               mean   sum
Category                 
A         42.027778  1513
B         55.448276  1608
C         47.542857  1664

Grouping and aggregation are powerful techniques for exploring and summarizing datasets. Pandas provides flexible tools to group data, perform aggregations, and gain insights efficiently. By understanding these concepts, you can unlock valuable information from your data.