Filtering and Selection in Pandas
Filtering and selection are fundamental operations when working with data in Pandas. They allow you to extract specific subsets of data that meet certain conditions. In this post, we'll explore various techniques for filtering and selecting data based on conditions, using Seaborn "tips" dataset
Importing Required Libraries
First, let's load the Tips dataset (opens in a new tab), which is available under Seaborn library. The "tips" dataset is a well-known dataset containing information about restaurant bills and tips. It includes data about various factors like total bill, tip amount, gender, and smoker status.
import pandas as pd
import seaborn as sns
# Load the Seaborn tips dataset
tips = sns.load_dataset("tips")
# Display the first few rows of the dataset
print(tips.head())
Output:
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
Single Condition Filtering with Comparison Operators
You can filter rows in a DataFrame based on a single condition. For example, let's filter the rows where the total bill is greater than $45.
👉 Concentrate on the syntax of filtering, df[condition]
Here, df
is the original DataFrame, condition
is a Boolean expression that is applied to the DataFrame
# Creating a condition: Total bill is greater than $45
condition = tips['total_bill'] > 45
# Applying the condition to the DataFrame 'tips'
high_bill = tips[condition]
# Printing rows with Total Bill > $45
print("Rows with Total Bill > $45:")
print(high_bill)
Output:
Rows with Total Bill > $20:
total_bill tip sex smoker day time size
59 48.27 6.73 Male No Sat Dinner 4
156 48.17 5.00 Male No Sun Dinner 6
170 50.81 10.00 Male Yes Sat Dinner 3
182 45.35 3.50 Male Yes Sun Dinner 3
212 48.33 9.00 Male No Sat Dinner 4
Counting the number of Rows that Matches the Filtering Condition(s)
Instead of displaying the subset of df , we can just count the total number of rows that fulfill our filtering conditions. To do this, we can use the sum()
function after applying our conditions. This works because the filtering operation returns a DataFrame of True/False values, and sum()
treats True as 1 and False as 0.
# Counting the number of rows where Total Bill > $45
count = (tips['total_bill'] > 45).sum()
# Printing the count
print(count) # Output: 5
Combining Multiple Conditions with Logical Operators
You can use logical operators like &
(and) and |
(or) to combine multiple conditions.
# Condition 1: group 'size' is 2 people or less
small_group = tips['size'] <=2
# Condition 2: paid 'tip' $5 or more
high_tip = tips['tip'] >=5
# Applying conditions
small_group_large_tip = tips[small_group & high_tip]
print(small_group_large_tip)
Output:
total_bill tip sex smoker day time size
46 22.23 5.00 Male No Sun Dinner 2
73 25.28 5.00 Female Yes Sat Dinner 2
83 32.68 5.00 Male Yes Thur Lunch 2
88 24.71 5.85 Male No Thur Lunch 2
172 7.25 5.15 Male Yes Sun Dinner 2
181 23.33 5.65 Male Yes Sun Dinner 2
Using .query()
Method
Pandas provides the .query()
method to filter rows based on a string expression. You can either set a single condition or bind multiple conditions with logical operators.
# Single condition query
tips.query('total_bill > 45')
# Multiple conditions query
tips.query('total_bill > 40 & tip > 5')
Filtering with String Conditions
You can filter rows using string conditions, such as filtering rows where the day is 'Fri'.
# Setting string condition
string_condition = tips['day'] == 'Fri'
# Applying condition to filter rows
tips[string_condition]
If you are interested in excluding rows where 'day' is 'Fri', you can use tips['day'] != 'Fri'
Filtering with isin()
Method
The isin()
method allows you to filter rows based on a list of values.
# Setting up the condition to filter for weekends
weekend_only = tips['day'].isin(['Sat', 'Sun'])
# Applying the condition to filter the rows in the DataFrame 'tips'
filtered_tips = tips[weekend_only]
Filtering with String Method
There are several string methods that can be used for filtering data in DataFrame columns based on string conditions. These methods allow you to perform operations like checking for substrings, case-insensitive comparisons, and more. Here's a list of some common string methods that you can use for filtering:
.str.contains()
: Checks if a substring is present in each element of a column.
# Filter rows where the 'sex' column contains a specific substring
contains_example = tips[tips['sex'].str.contains('Male')]
.str.startswith()
: Checks if each element of a column starts with a specific substring.
# Filter rows where the 'day' column starts with a specific substring
startswith_example = tips[tips['day'].str.startswith('Th')]
.str.endswith()
: Checks if each element of a column ends with a specific substring.
# Filter rows where the 'time' column ends with a specific substring
endswith_example = tips[tips['time'].str.endswith('Dinner')]
.str.match()
: Checks if each element of a column matches a specific regular expression pattern.
# Filter rows based on a regular expression pattern in the 'day' column
match_example = tips[tips['day'].str.match('^[S]')]
.str.len()
: Filters based on the length of each element in a column.
# Filter rows based on the length of the 'day' column
len_example = tips[tips['day'].str.len() > 3]
.str.isnumeric()
: Checks if each element of a column is numeric.
# Filter rows where the 'sex' column consists of numeric characters
isnumeric_example = tips[tips['sex'].str.isnumeric()]
.str.upper()
and.str.lower()
: Converts string elements to uppercase or lowercase for case-insensitive filtering.
# Filter rows with case-insensitive comparison using upper() or lower()
upper_example = tips[tips['sex'].str.upper() == 'MALE']
lower_example = tips[tips['sex'].str.lower() == 'female']
Detecting Null (Missing) Values in Conditions
Handle missing values using the .isna()
and .notna()
methods.
# Filtering rows with missing tip values
tips[tips['tip'].isna()]
Applying Custom Functions using apply
You can use custom functions with filters to define intricate conditions.
# Defining a custom function for filtering
def custom_filter(row):
return row['total_bill'] > 20 and row['size'] <= 4 and 'a' in row['sex']
filtered_df = tips[tips.apply(custom_filter, axis=1)]