Sorting Data in Pandas
Sorting data is a fundamental operation in data analysis that helps organize information and uncover patterns. Pandas provides powerful methods to sort data in various ways, whether it's sorting by values, indices, or a combination of both. In this post, we'll explore how to sort data in Pandas using real-world examples.
.sort_values()
Method
The .sort_values()
method is used to sort a DataFrame by its values along one or multiple columns. It has several parameters that allow you to control the sorting behavior. Here's the general syntax of the method:
DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, ignore_index=False, key=None)
Parameters:
by
: This is the most important parameter. It specifies the column or columns by which the DataFrame should be sorted. You can provide a single column name or a list of column names to sort by multiple columns.axis
: Specifies whether sorting should be done along rows (axis=0
, default) or columns (axis=1
). Read more about axis parameter hereascending
: Determines the sorting order. Set toTrue
for ascending order (default) andFalse
for descending order.inplace
: IfTrue
, the DataFrame is modified in place, and no new object is created. IfFalse
(default), a new sorted DataFrame is returned.ignore_index
: IfTrue
, the resulting DataFrame will have a new index from 0 to N-1, where N is the number of rows.key
: A function to determine the comparison key for custom sorting.
We will use the Seaborn's "tips" dataset for the examples:
import pandas as pd
import seaborn as sns
# Load the Seaborn tips dataset
tips = sns.load_dataset("tips")
Sorting by Single Column
Sorting by values is one of the most common tasks in data analysis. You can sort a DataFrame based on one or multiple columns' values. Here we will see the application of by
parameter.
# Sorting the DataFrame by 'total_bill' in ascending order
sorted_by_total_bill = tips.sort_values(by='total_bill')
# Printing the first few rows of the sorted DataFrame
print(sorted_by_total_bill.head())
Output:
total_bill tip sex smoker day time size
67 3.07 1.00 Female Yes Sat Dinner 1
92 5.75 1.00 Female Yes Fri Dinner 2
111 7.25 1.00 Female No Sat Dinner 1
172 7.25 5.15 Male Yes Sun Dinner 2
149 7.51 2.00 Male No Thur Lunch 2
Sorting by Multiple Columns
You can sort by multiple columns by providing a list of column names to the .sort_values()
method.
Here's how the sorting process works when you sort by values of two columns:
- The DataFrame is sorted based on the values of the first column specified in the
by
parameter. - If there are ties (rows with equal values) in the first column, the rows with ties are sorted based on the values of the second column specified in the
by
parameter. - If there are still ties after sorting by the second column, the process continues with subsequent columns specified in the
by
parameter.
# Sorting the DataFrame by 'total_bill' in ascending order
# and then by 'tip' in descending order
sorted_by_multiple_columns = tips.sort_values(
by=['total_bill', 'tip'],
ascending=[True, False])
# Printing the first 5 rows of the sorted DataFrame with selected columns
print(sorted_by_multiple_columns[['total_bill', 'tip']].head())
total_bill tip
67 3.07 1.00
92 5.75 1.00
172 7.25 5.15
111 7.25 1.00
149 7.51 2.00
In this example, we're sorting the tips
DataFrame first by the 'total_bill' column in ascending order and then, for tied rows, by the 'tip' column in descending order. This means that rows with the same 'total_bill' value will be sorted based on the 'tip' value.
You can see in the output above, in third and fourth row, where total_bill values are tied, the sorting is based on the 'tip' values in descending order.
Chaining Sorting Methods
Chaining sort_values()
involves applying multiple sorting operations in sequence to achieve a more complex sorting behavior. You can do this by calling sort_values()
multiple times, each time specifying different columns and sorting orders. The chaining order matters because each subsequent sorting operation refines the sorting of the previous operation.
Here's an example of chaining sort_values()
using the Seaborn tips dataset:
# Chaining sort_values to sort by 'total_bill' in ascending
# and then by 'tip' in descending order
chained_sorted = tips.sort_values(
by='total_bill').sort_values(
by='tip', ascending=False)
# Printing the first few rows of the chained sorted DataFrame
print(chained_sorted.head())
total_bill tip
170 50.81 10.00
212 48.33 9.00
23 39.42 7.58
59 48.27 6.73
141 34.30 6.70
Are these two DataFrames identical?
If you would like to know whether two DataFrames are identical, you can run an index match like this:
# Check if the index numbers of both data frames match
index_match = (chained_sorted.index == sorted_by_multiple_columns.index).all()
if index_match:
print("The index numbers match. The data frames are identical.")
else:
print("The index numbers do not match. The data frames are not identical.")
# The index numbers do not match. The data frames are not identical.
Custom Sorting with key
Parameter
The key
parameter in the .sort_values()
method allows you to provide a custom function that calculates a comparison key for each value in the specified column(s). This key is then used for sorting the DataFrame. The key
function should take a value as input and return a value that will be used for sorting.
Let's illustrate the use of the key
parameter with an example:
# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 28]}
df = pd.DataFrame(data)
print(df)
Output:
Name Age
0 Alice 25
1 Bob 30
2 Charlie 28
Now, let's demonstrate how to use the key
parameter to sort the DataFrame based on the length of the names in the 'Name' column:
# Sorting the DataFrame based on the length of names using the key parameter
sorted_by_name_length = df.sort_values(
by='Name',
key=lambda x: x.str.len())
print(sorted_by_name_length)
Output:
Name Age
1 Bob 30
0 Alice 25
2 Charlie 28
In this case, lambda x: x.str.len()
is a function that takes in one argument x
and returns the length of x
. The str
attribute means that x
is being treated as a string, and the len()
function is used to get the length of a string.
When used in the context of df.sort_values(by='Name', key=lambda x: x.str.len())
, x
represents each element in the 'Name' column of the DataFrame df
. So, for each name in the 'Name' column, this lambda function will return its length. The sort_values()
function then sorts the DataFrame based on these lengths.
For example, if the 'Name' column contains 'Alice', 'Bob', and 'Charlie', the lambda function will return 5, 3, and 7 respectively. The sort_values()
function will then sort the DataFrame in ascending order of these lengths, so 'Bob' will come first, then 'Alice', and finally 'Charlie'.
Sorting by Index
The .sort_index()
method in Pandas is used to sort the index of a DataFrame. It allows you to rearrange the rows of the DataFrame based on the index labels. The parameter axis
specifies whether sorting should be done along rows (axis=0
, default) or columns (axis=1
).
Ascending order along rows
Let's create a simple DataFrame with index labels and then demonstrate the use of the .sort_index()
method:
# Create a sample DataFrame with index labels
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 30, 28, 22, 24]}
# assigning index labels
df = pd.DataFrame(data, index=['row5', 'row2', 'row3', 'row1', 'row4'])
print("Original DataFrame:")
print(df)
Output:
Original DataFrame:
Name Age
row5 Alice 25
row2 Bob 30
row3 Charlie 28
row1 David 22
row4 Eve 24
Now, let's demonstrate how to use the .sort_index()
method to sort the DataFrame by index labels:
# Sorting the DataFrame by index labels in ascending order
sorted_by_index = df.sort_index()
print("Sorted by Index (Ascending Order):")
print(sorted_by_index)
Output:
Sorted by Index (Ascending Order):
Name Age
row1 David 22
row2 Bob 30
row3 Charlie 28
row4 Eve 24
row5 Alice 25
In this example, we first created a DataFrame with custom index labels. Then, we used the .sort_index()
method without specifying any parameters. The method sorted the DataFrame's index labels in ascending order and returned a new DataFrame with the sorted order.
Descending Order Along Rows
You can modify the behavior of the .sort_index()
method using its various parameters, such as specifying the sorting axis, sorting order (ascending or descending), and more, as explained earlier in this conversation.
# Sorting the DataFrame by index labels in descending order
sorted_by_index_descending = df.sort_index(ascending=False)
print("Sorted by Index (Descending Order):")
print(sorted_by_index_descending)
Sorted by Index (Descending Order):
Name Age
row5 Alice 25
row4 Eve 24
row3 Charlie 28
row2 Bob 30
row1 David 22
Ascending Order Along Columns
We can use the .sort_index()
method with axis=1
to sort the columns of the DataFrame.
# Sorting the columns of the DataFrame in ascending order
sorted_columns = sorted_by_index.sort_index(axis=1)
print("\nDataFrame with Sorted Columns:")
print(sorted_columns)
DataFrame with Sorted Columns:
Age Name
row1 22 David
row2 30 Bob
row3 28 Charlie
row4 24 Eve
row5 25 Alice