Pandas
Data Manipulation
Filtering and Selection

Filtering and Selection in Pandas

Filtering and selection are fundamental operations when working with data in Pandas. They allow you to extract specific subsets of data that meet certain conditions. In this post, we'll explore various techniques for filtering and selecting data based on conditions, using Seaborn "tips" dataset

Importing Required Libraries

First, let's load the Tips dataset (opens in a new tab), which is available under Seaborn library. The "tips" dataset is a well-known dataset containing information about restaurant bills and tips. It includes data about various factors like total bill, tip amount, gender, and smoker status.

import pandas as pd
import seaborn as sns
 
# Load the Seaborn tips dataset
tips = sns.load_dataset("tips")
 
# Display the first few rows of the dataset
print(tips.head())

Output:

total_billtipsexsmokerdaytimesize
016.991.01FemaleNoSunDinner2
110.341.66MaleNoSunDinner3
221.013.5MaleNoSunDinner3
323.683.31MaleNoSunDinner2
424.593.61FemaleNoSunDinner4

Single Condition Filtering with Comparison Operators

You can filter rows in a DataFrame based on a single condition. For example, let's filter the rows where the total bill is greater than $45.

👉 Concentrate on the syntax of filtering, df[condition] Here, df is the original DataFrame, condition is a Boolean expression that is applied to the DataFrame

# Creating a condition: Total bill is greater than $45
condition = tips['total_bill'] > 45
 
# Applying the condition to the DataFrame 'tips'
high_bill = tips[condition]
 
# Printing rows with Total Bill > $45
print("Rows with Total Bill > $45:")
print(high_bill)

Output:

Rows with Total Bill > $20:
     total_bill    tip   sex smoker  day    time  size
59        48.27   6.73  Male     No  Sat  Dinner     4
156       48.17   5.00  Male     No  Sun  Dinner     6
170       50.81  10.00  Male    Yes  Sat  Dinner     3
182       45.35   3.50  Male    Yes  Sun  Dinner     3
212       48.33   9.00  Male     No  Sat  Dinner     4

Counting the number of Rows that Matches the Filtering Condition(s)

Instead of displaying the subset of df , we can just count the total number of rows that fulfill our filtering conditions. To do this, we can use the sum() function after applying our conditions. This works because the filtering operation returns a DataFrame of True/False values, and sum() treats True as 1 and False as 0.

# Counting the number of rows where Total Bill > $45
count = (tips['total_bill'] > 45).sum()
 
# Printing the count
print(count)  # Output: 5

Combining Multiple Conditions with Logical Operators

You can use logical operators like & (and) and | (or) to combine multiple conditions.

# Condition 1: group 'size' is 2 people or less
small_group = tips['size'] <=2
 
# Condition 2: paid 'tip' $5 or more
high_tip = tips['tip'] >=5
 
# Applying conditions
small_group_large_tip = tips[small_group & high_tip]
 
print(small_group_large_tip)

Output:

     total_bill   tip     sex smoker   day    time  size
46        22.23  5.00    Male     No   Sun  Dinner     2
73        25.28  5.00  Female    Yes   Sat  Dinner     2
83        32.68  5.00    Male    Yes  Thur   Lunch     2
88        24.71  5.85    Male     No  Thur   Lunch     2
172        7.25  5.15    Male    Yes   Sun  Dinner     2
181       23.33  5.65    Male    Yes   Sun  Dinner     2

Using .query() Method

Pandas provides the .query() method to filter rows based on a string expression. You can either set a single condition or bind multiple conditions with logical operators.

# Single condition query
tips.query('total_bill > 45')
 
# Multiple conditions query
tips.query('total_bill > 40 & tip > 5')

Filtering with String Conditions

You can filter rows using string conditions, such as filtering rows where the day is 'Fri'.

# Setting string condition
string_condition = tips['day'] == 'Fri'
# Applying condition to filter rows
tips[string_condition]

If you are interested in excluding rows where 'day' is 'Fri', you can use tips['day'] != 'Fri'

Filtering with isin() Method

The isin() method allows you to filter rows based on a list of values.

# Setting up the condition to filter for weekends
weekend_only = tips['day'].isin(['Sat', 'Sun'])
 
# Applying the condition to filter the rows in the DataFrame 'tips'
filtered_tips = tips[weekend_only]

Filtering with String Method

There are several string methods that can be used for filtering data in DataFrame columns based on string conditions. These methods allow you to perform operations like checking for substrings, case-insensitive comparisons, and more. Here's a list of some common string methods that you can use for filtering:

  1. .str.contains(): Checks if a substring is present in each element of a column.
# Filter rows where the 'sex' column contains a specific substring 
contains_example = tips[tips['sex'].str.contains('Male')]
  1. .str.startswith(): Checks if each element of a column starts with a specific substring.
# Filter rows where the 'day' column starts with a specific substring 
startswith_example = tips[tips['day'].str.startswith('Th')]
  1. .str.endswith(): Checks if each element of a column ends with a specific substring.
# Filter rows where the 'time' column ends with a specific substring 
endswith_example = tips[tips['time'].str.endswith('Dinner')]
  1. .str.match(): Checks if each element of a column matches a specific regular expression pattern.
# Filter rows based on a regular expression pattern in the 'day' column
match_example = tips[tips['day'].str.match('^[S]')]
  1. .str.len(): Filters based on the length of each element in a column.
# Filter rows based on the length of the 'day' column 
len_example = tips[tips['day'].str.len() > 3]
  1. .str.isnumeric(): Checks if each element of a column is numeric.
# Filter rows where the 'sex' column consists of numeric characters 
isnumeric_example = tips[tips['sex'].str.isnumeric()]
  1. .str.upper() and .str.lower(): Converts string elements to uppercase or lowercase for case-insensitive filtering.
# Filter rows with case-insensitive comparison using upper() or lower()
upper_example = tips[tips['sex'].str.upper() == 'MALE'] 
lower_example = tips[tips['sex'].str.lower() == 'female']

Detecting Null (Missing) Values in Conditions

Handle missing values using the .isna() and .notna() methods.

# Filtering rows with missing tip values
tips[tips['tip'].isna()]

Applying Custom Functions using apply

You can use custom functions with filters to define intricate conditions.

# Defining a custom function for filtering
def custom_filter(row):
	return row['total_bill'] > 20 and row['size'] <= 4 and 'a' in    row['sex']
	
filtered_df = tips[tips.apply(custom_filter, axis=1)]