Pandas
Data Manipulation
Sorting Data

Sorting Data in Pandas

Sorting data is a fundamental operation in data analysis that helps organize information and uncover patterns. Pandas provides powerful methods to sort data in various ways, whether it's sorting by values, indices, or a combination of both. In this post, we'll explore how to sort data in Pandas using real-world examples.

.sort_values() Method

The .sort_values() method is used to sort a DataFrame by its values along one or multiple columns. It has several parameters that allow you to control the sorting behavior. Here's the general syntax of the method:

DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, ignore_index=False, key=None)

Parameters:

  • by: This is the most important parameter. It specifies the column or columns by which the DataFrame should be sorted. You can provide a single column name or a list of column names to sort by multiple columns.
  • axis: Specifies whether sorting should be done along rows (axis=0, default) or columns (axis=1). Read more about axis parameter here
  • ascending: Determines the sorting order. Set to True for ascending order (default) and False for descending order.
  • inplace: If True, the DataFrame is modified in place, and no new object is created. If False (default), a new sorted DataFrame is returned.
  • ignore_index: If True, the resulting DataFrame will have a new index from 0 to N-1, where N is the number of rows.
  • key: A function to determine the comparison key for custom sorting.

We will use the Seaborn's "tips" dataset for the examples:

import pandas as pd
import seaborn as sns
 
# Load the Seaborn tips dataset
tips = sns.load_dataset("tips")

Sorting by Single Column

Sorting by values is one of the most common tasks in data analysis. You can sort a DataFrame based on one or multiple columns' values. Here we will see the application of by parameter.

# Sorting the DataFrame by 'total_bill' in ascending order
sorted_by_total_bill = tips.sort_values(by='total_bill')
 
# Printing the first few rows of the sorted DataFrame
print(sorted_by_total_bill.head())

Output:

     total_bill   tip     sex smoker   day    time  size
67         3.07  1.00  Female    Yes   Sat  Dinner     1
92         5.75  1.00  Female    Yes   Fri  Dinner     2
111        7.25  1.00  Female     No   Sat  Dinner     1
172        7.25  5.15    Male    Yes   Sun  Dinner     2
149        7.51  2.00    Male     No  Thur   Lunch     2

Sorting by Multiple Columns

You can sort by multiple columns by providing a list of column names to the .sort_values() method.

Here's how the sorting process works when you sort by values of two columns:

  1. The DataFrame is sorted based on the values of the first column specified in the by parameter.
  2. If there are ties (rows with equal values) in the first column, the rows with ties are sorted based on the values of the second column specified in the by parameter.
  3. If there are still ties after sorting by the second column, the process continues with subsequent columns specified in the by parameter.
# Sorting the DataFrame by 'total_bill' in ascending order 
# and then by 'tip' in descending order
sorted_by_multiple_columns = tips.sort_values(
    by=['total_bill', 'tip'],
    ascending=[True, False])
 
# Printing the first 5 rows of the sorted DataFrame with selected columns
print(sorted_by_multiple_columns[['total_bill', 'tip']].head())
     total_bill   tip
67         3.07  1.00
92         5.75  1.00
172        7.25  5.15
111        7.25  1.00
149        7.51  2.00

In this example, we're sorting the tips DataFrame first by the 'total_bill' column in ascending order and then, for tied rows, by the 'tip' column in descending order. This means that rows with the same 'total_bill' value will be sorted based on the 'tip' value.

You can see in the output above, in third and fourth row, where total_bill values are tied, the sorting is based on the 'tip' values in descending order.

Chaining Sorting Methods

Chaining sort_values() involves applying multiple sorting operations in sequence to achieve a more complex sorting behavior. You can do this by calling sort_values() multiple times, each time specifying different columns and sorting orders. The chaining order matters because each subsequent sorting operation refines the sorting of the previous operation.

Here's an example of chaining sort_values() using the Seaborn tips dataset:

# Chaining sort_values to sort by 'total_bill' in ascending 
# and then by 'tip' in descending order
chained_sorted = tips.sort_values(
    by='total_bill').sort_values(
        by='tip', ascending=False)
        
# Printing the first few rows of the chained sorted DataFrame
print(chained_sorted.head())
     total_bill    tip
170       50.81  10.00
212       48.33   9.00
23        39.42   7.58
59        48.27   6.73
141       34.30   6.70

Are these two DataFrames identical?

If you would like to know whether two DataFrames are identical, you can run an index match like this:

# Check if the index numbers of both data frames match
index_match = (chained_sorted.index == sorted_by_multiple_columns.index).all()
 
if index_match:
    print("The index numbers match. The data frames are identical.")
else:
    print("The index numbers do not match. The data frames are not identical.")
 
# The index numbers do not match. The data frames are not identical.

Custom Sorting with key Parameter

The key parameter in the .sort_values() method allows you to provide a custom function that calculates a comparison key for each value in the specified column(s). This key is then used for sorting the DataFrame. The key function should take a value as input and return a value that will be used for sorting.

Let's illustrate the use of the key parameter with an example:

# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 28]}
 
df = pd.DataFrame(data)
 
print(df)

Output:

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   28

Now, let's demonstrate how to use the key parameter to sort the DataFrame based on the length of the names in the 'Name' column:

# Sorting the DataFrame based on the length of names using the key parameter
sorted_by_name_length = df.sort_values(
    by='Name', 
    key=lambda x: x.str.len())
 
print(sorted_by_name_length)

Output:

      Name  Age
1      Bob   30
0    Alice   25
2  Charlie   28

In this case, lambda x: x.str.len() is a function that takes in one argument x and returns the length of x. The str attribute means that x is being treated as a string, and the len() function is used to get the length of a string.

When used in the context of df.sort_values(by='Name', key=lambda x: x.str.len()), x represents each element in the 'Name' column of the DataFrame df. So, for each name in the 'Name' column, this lambda function will return its length. The sort_values() function then sorts the DataFrame based on these lengths.

For example, if the 'Name' column contains 'Alice', 'Bob', and 'Charlie', the lambda function will return 5, 3, and 7 respectively. The sort_values() function will then sort the DataFrame in ascending order of these lengths, so 'Bob' will come first, then 'Alice', and finally 'Charlie'.

Sorting by Index

The .sort_index() method in Pandas is used to sort the index of a DataFrame. It allows you to rearrange the rows of the DataFrame based on the index labels. The parameter axis specifies whether sorting should be done along rows (axis=0, default) or columns (axis=1).

Ascending order along rows

Let's create a simple DataFrame with index labels and then demonstrate the use of the .sort_index() method:

# Create a sample DataFrame with index labels
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, 30, 28, 22, 24]}
 
# assigning index labels
df = pd.DataFrame(data, index=['row5', 'row2', 'row3', 'row1', 'row4'])
 
print("Original DataFrame:")
print(df)

Output:

Original DataFrame:
	   Name  Age
row5  Alice   25
row2    Bob   30
row3 Charlie   28
row1  David   22
row4    Eve   24

Now, let's demonstrate how to use the .sort_index() method to sort the DataFrame by index labels:

# Sorting the DataFrame by index labels in ascending order
sorted_by_index = df.sort_index()
 
print("Sorted by Index (Ascending Order):")
print(sorted_by_index)

Output:

Sorted by Index (Ascending Order):
       Name  Age
row1  David   22
row2    Bob   30
row3 Charlie   28
row4    Eve   24
row5  Alice   25

In this example, we first created a DataFrame with custom index labels. Then, we used the .sort_index() method without specifying any parameters. The method sorted the DataFrame's index labels in ascending order and returned a new DataFrame with the sorted order.

Descending Order Along Rows

You can modify the behavior of the .sort_index() method using its various parameters, such as specifying the sorting axis, sorting order (ascending or descending), and more, as explained earlier in this conversation.

# Sorting the DataFrame by index labels in descending order
sorted_by_index_descending = df.sort_index(ascending=False)
 
print("Sorted by Index (Descending Order):")
print(sorted_by_index_descending)
Sorted by Index (Descending Order):
         Name  Age
row5    Alice   25
row4      Eve   24
row3  Charlie   28
row2      Bob   30
row1    David   22

Ascending Order Along Columns

We can use the .sort_index() method with axis=1 to sort the columns of the DataFrame.

# Sorting the columns of the DataFrame in ascending order
sorted_columns = sorted_by_index.sort_index(axis=1)
 
print("\nDataFrame with Sorted Columns:")
print(sorted_columns)
DataFrame with Sorted Columns:
      Age     Name
row1   22    David
row2   30      Bob
row3   28  Charlie
row4   24      Eve
row5   25    Alice