Pandas
Data Manipulation
Indexing and Slicing

Indexing and Slicing Data in Pandas

Pandas provides a powerful and flexible set of tools for indexing and slicing data. This is a fundamental operation for data manipulation and analysis. In this post, we will use the Tips dataset (opens in a new tab) from seaborn library to illustrate these concepts.

Loading the Dataset

First, let's load the Tips dataset (opens in a new tab), which is available under Seaborn library. The "tips" dataset is a well-known dataset containing information about restaurant bills and tips. It includes data about various factors like total bill, tip amount, gender, and smoker status.

import pandas as pd
import seaborn as sns
 
# Load the 'tips' dataset from seaborn
tips = sns.load_dataset('tips')
 
# Display the first 5 rows of the dataset
print(tips.head())

Output:

total_billtipsexsmokerdaytimesize
016.991.01FemaleNoSunDinner2
110.341.66MaleNoSunDinner3
221.013.5MaleNoSunDinner3
323.683.31MaleNoSunDinner2
424.593.61FemaleNoSunDinner4

Indexing Column(s)

Indexing operator [] directly on a DataFrame is used to access the columns.

Indexing a Single Column

You can select a single column by using the DataFrame's name with the indexing operator and the label for the column:

# Select a single column: Total Bill
total_bill = tips['total_bill']
 
print(total_bill.head())

Output:

0    16.99
1    10.34
2    21.01
3    23.68
4    24.59
Name: total_bill, dtype: float64

Indexing Multiple Columns

You can select multiple columns by passing a list of column names to the indexing operator.

# Selecting multiple columns: Total Bill, Gender, Tip
subset = tips[['total_bill', 'sex', 'tip']]
 
print(subset.head())

Output:

   total_bill     sex   tip
0       16.99  Female  1.01
1       10.34    Male  1.66
2       21.01    Male  3.50
3       23.68    Male  3.31
4       24.59  Female  3.61

Indexing Column and Selective Rows

In the above example, if we are only interested in getting the first two rows, we can use:

# Selecting multiple columns and only first two rows
subset = tips[['total_bill', 'sex', 'tip']][0:2]
 
print(subset.head())

Output:

   total_bill     sex   tip
0       16.99  Female  1.01
1       10.34    Male  1.66

The .loc and .iloc Attributes

.loc and .iloc are attributes of a Pandas DataFrame that allow you to access and manipulate data using label-based and integer-based indexing, respectively. They provide a versatile and powerful way to extract specific rows and columns, subsets of data, or individual elements from a DataFrame.

.loc

This attribute allows you to access a group of rows by labels or a boolean array. It takes two arguments separated by a comma: the first argument is for selecting rows, and the second (optional) argument is for selecting columns.

The syntax for using .loc is df.loc[row_indexer, column_indexer], where row_indexer and column_indexer can be:

  • A single label
  • A list of labels
  • A slice of labels
  • A boolean array

.iloc

This attribute allows you to access a group of rows by integer location. Like .loc, it also takes two arguments separated by a comma: the first argument is for selecting rows, and the second (optional) argument is for selecting columns.

The syntax for using .iloc is df.iloc[row_indexer, column_indexer], where row_indexer and column_indexer can be:

  • A single integer
  • A list of integers
  • A slice of integers

Indexing using .loc

👉 Though, you can directly index DataFrame rows using the index operator [], but it is generally not recommended because the index operator [] on a DataFrame can be ambiguous because it is also used for selecting columns.  For example, df['A'] will select column 'A', not row 'A'. This can lead to confusion when reading the code. Therefore, you must use either .iloc or .loc on df to index rows.

Rows only (and all Columns)

Single row indexing

# Using .loc for single row selection
# Selecting the row at index 2 from the DataFrame 'tips'
row_2 = tips.loc[2]
 
# Printing the data for the selected row (row at index 2)
print(row_2)

▶ As this tips df doesn't have any explicit labeled indexed, so both .loc and .iloc will give the same results. In other words the implicit index will be used as labelled index in this case.

Output:

total_bill     21.01
tip              3.5
sex             Male
smoker            No
day              Sun
time          Dinner
size               3
Name: 2, dtype: object

Multiple rows indexing

▶ Now, let select multiple rows with .loc method. Beware that in the example below [2, 4] is considered as label of rows and therefore the output will include the rows at index 2, 3 AND 4. Confused? Check example 2 below to get clarity

# Using .loc for multiple row selection
# Selecting rows from index 2 to 4 (inclusive) from the DataFrame 'tips'
rows_2_to_4 = tips.loc[2:4]
 
# Printing the selected rows
print(rows_2_to_4)

Output:

   total_bill   tip     sex smoker  day    time  size
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

Rows and Selective Column(s)

Instead of getting all the columns, we explicitly choose three columns by their labels

# Selecting a subset of rows (from index 2 to 4) and 
# specific columns from the DataFrame 'tips'
subset = tips.loc[2:4, ['total_bill', 'sex', 'tip']]
 
# Printing the selected subset of rows and columns
print(subset)

Output:

   total_bill     sex   tip
2       21.01    Male  3.50
3       23.68    Male  3.31
4       24.59  Female  3.61

Example 2

As the "tips" dataset doesn't have explicit index labels, therefore, we will create a simple DataFrame to better explain .loc :

# Creating a sample DataFrame
data = {'A': [1, 2, 3],
        'B': [4, 5, 6],
        'C': [7, 8, 9]}
        
# Setting up the labels for index
df = pd.DataFrame(data, index=['row1', 'row2', 'row3'])
 
# print the df
print(df)
      A  B  C
row1  1  4  7
row2  2  5  8
row3  3  6  9

Now, let's slice the two rows having labels row1 and row2

# Selecting rows 'row1' to 'row2' from the DataFrame 'df' 
row1_row2 = df.loc['row1':'row2'] 
 
# Printing the selected rows 'row1' to 'row2' 
print(row1_row2)
      A  B  C
row1  1  4  7
row2  2  5  8

Indexing using .iloc

Rows only (and all Columns)

Single row indexing

# Using .iloc for single row selection
row_2 = tips.iloc[2]
print(row_2)

▶ As this tips df doesn't have any explicit labeled indexed, so both .loc and .iloc will give the same results.

Multiple rows indexing

# Using .iloc for multiple row selection
rows_2_to_3 = tips.iloc[2:4]
print(rows_2_to_3)

👉 .,iloc[2:4] will only return rows with index 2 and 3, unlike .loc[2:4 which returned the rows 2, 3 AND 4.

Output:

   total_bill   tip     sex smoker  day    time  size
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2

Rows and Selective Column(s)

# Selecting specific rows and columns
subset = tips.iloc[2:4, [0, 2, 1]]
print(subset)

Instead of getting all the columns, we explicitly choose three columns by their integer location.

Output:

   total_bill     sex   tip
2       21.01    Male  3.50
3       23.68    Male  3.31