Indexing and Slicing Data in Pandas
Pandas provides a powerful and flexible set of tools for indexing and slicing data. This is a fundamental operation for data manipulation and analysis. In this post, we will use the Tips dataset (opens in a new tab) from seaborn library to illustrate these concepts.
Loading the Dataset
First, let's load the Tips dataset (opens in a new tab), which is available under Seaborn library. The "tips" dataset is a well-known dataset containing information about restaurant bills and tips. It includes data about various factors like total bill, tip amount, gender, and smoker status.
import pandas as pd
import seaborn as sns
# Load the 'tips' dataset from seaborn
tips = sns.load_dataset('tips')
# Display the first 5 rows of the dataset
print(tips.head())
Output:
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
Indexing Column(s)
Indexing operator []
directly on a DataFrame is used to access the columns.
Indexing a Single Column
You can select a single column by using the DataFrame's name with the indexing operator and the label for the column:
# Select a single column: Total Bill
total_bill = tips['total_bill']
print(total_bill.head())
Output:
0 16.99
1 10.34
2 21.01
3 23.68
4 24.59
Name: total_bill, dtype: float64
Indexing Multiple Columns
You can select multiple columns by passing a list of column names to the indexing operator.
# Selecting multiple columns: Total Bill, Gender, Tip
subset = tips[['total_bill', 'sex', 'tip']]
print(subset.head())
Output:
total_bill sex tip
0 16.99 Female 1.01
1 10.34 Male 1.66
2 21.01 Male 3.50
3 23.68 Male 3.31
4 24.59 Female 3.61
Indexing Column and Selective Rows
In the above example, if we are only interested in getting the first two rows, we can use:
# Selecting multiple columns and only first two rows
subset = tips[['total_bill', 'sex', 'tip']][0:2]
print(subset.head())
Output:
total_bill sex tip
0 16.99 Female 1.01
1 10.34 Male 1.66
The .loc
and .iloc
Attributes
.loc
and .iloc
are attributes of a Pandas DataFrame that allow you to access and manipulate data using label-based and integer-based indexing, respectively. They provide a versatile and powerful way to extract specific rows and columns, subsets of data, or individual elements from a DataFrame.
.loc
This attribute allows you to access a group of rows by labels or a boolean array. It takes two arguments separated by a comma: the first argument is for selecting rows, and the second (optional) argument is for selecting columns.
The syntax for using .loc
is df.loc[row_indexer, column_indexer]
, where row_indexer
and column_indexer
can be:
- A single label
- A list of labels
- A slice of labels
- A boolean array
.iloc
This attribute allows you to access a group of rows by integer location. Like .loc
, it also takes two arguments separated by a comma: the first argument is for selecting rows, and the second (optional) argument is for selecting columns.
The syntax for using .iloc
is df.iloc[row_indexer, column_indexer]
, where row_indexer
and column_indexer
can be:
- A single integer
- A list of integers
- A slice of integers
Indexing using .loc
👉 Though, you can directly index DataFrame rows using the index operator []
, but it is generally not recommended because the index operator []
on a DataFrame can be ambiguous because it is also used for selecting columns. For example, df['A']
will select column 'A', not row 'A'. This can lead to confusion when reading the code. Therefore, you must use either .iloc
or .loc
on df to index rows.
Rows only (and all Columns)
Single row indexing
# Using .loc for single row selection
# Selecting the row at index 2 from the DataFrame 'tips'
row_2 = tips.loc[2]
# Printing the data for the selected row (row at index 2)
print(row_2)
▶ As this tips df doesn't have any explicit labeled indexed, so both .loc
and .iloc
will give the same results. In other words the implicit index will be used as labelled index in this case.
Output:
total_bill 21.01
tip 3.5
sex Male
smoker No
day Sun
time Dinner
size 3
Name: 2, dtype: object
Multiple rows indexing
▶ Now, let select multiple rows with .loc
method. Beware that in the example below [2, 4]
is considered as label of rows and therefore the output will include the rows at index 2, 3 AND 4. Confused? Check example 2 below to get clarity
# Using .loc for multiple row selection
# Selecting rows from index 2 to 4 (inclusive) from the DataFrame 'tips'
rows_2_to_4 = tips.loc[2:4]
# Printing the selected rows
print(rows_2_to_4)
Output:
total_bill tip sex smoker day time size
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
Rows and Selective Column(s)
Instead of getting all the columns, we explicitly choose three columns by their labels
# Selecting a subset of rows (from index 2 to 4) and
# specific columns from the DataFrame 'tips'
subset = tips.loc[2:4, ['total_bill', 'sex', 'tip']]
# Printing the selected subset of rows and columns
print(subset)
Output:
total_bill sex tip
2 21.01 Male 3.50
3 23.68 Male 3.31
4 24.59 Female 3.61
Example 2
As the "tips" dataset doesn't have explicit index labels, therefore, we will create a simple DataFrame to better explain .loc
:
# Creating a sample DataFrame
data = {'A': [1, 2, 3],
'B': [4, 5, 6],
'C': [7, 8, 9]}
# Setting up the labels for index
df = pd.DataFrame(data, index=['row1', 'row2', 'row3'])
# print the df
print(df)
A B C
row1 1 4 7
row2 2 5 8
row3 3 6 9
Now, let's slice the two rows having labels row1
and row2
# Selecting rows 'row1' to 'row2' from the DataFrame 'df'
row1_row2 = df.loc['row1':'row2']
# Printing the selected rows 'row1' to 'row2'
print(row1_row2)
A B C
row1 1 4 7
row2 2 5 8
Indexing using .iloc
Rows only (and all Columns)
Single row indexing
# Using .iloc for single row selection
row_2 = tips.iloc[2]
print(row_2)
▶ As this tips df doesn't have any explicit labeled indexed, so both .loc
and .iloc
will give the same results.
Multiple rows indexing
# Using .iloc for multiple row selection
rows_2_to_3 = tips.iloc[2:4]
print(rows_2_to_3)
👉 .,iloc[2:4]
will only return rows with index 2 and 3, unlike .loc[2:4
which returned the rows 2, 3 AND 4.
Output:
total_bill tip sex smoker day time size
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
Rows and Selective Column(s)
# Selecting specific rows and columns
subset = tips.iloc[2:4, [0, 2, 1]]
print(subset)
Instead of getting all the columns, we explicitly choose three columns by their integer location.
Output:
total_bill sex tip
2 21.01 Male 3.50
3 23.68 Male 3.31