Pandas
Data Analysis
Time-based Indexing

Time-based Indexing

In the realm of data science and analysis, time series data is a common and valuable type of information that represents observations collected at specific time points. To effectively work with time series data, it's essential to understand time-based indexing and slicing. Time-based indexing involves using time-related information, such as dates and timestamps, to access and manipulate data in a dataset.

Importance of Time-Based Indexing

Time-based indexing is crucial for time series analysis because it allows us to harness the temporal structure of the data. This enables us to perform various tasks such as identifying trends, detecting seasonality, and making forecasts. By indexing data with respect to time, we can efficiently retrieve and process specific time intervals, which is essential for analyzing trends and patterns that unfold over time.

Advantages of Using Datetime Indexes

When dealing with time-based data, using a datetime index offers several advantages:

  1. Efficient Data Alignment: Datetime indexes facilitate data alignment across multiple datasets with varying time intervals. This alignment simplifies the process of combining and comparing data, even when the time points don't perfectly align.

  2. Easy Resampling: Resampling involves changing the frequency of data observations, such as aggregating daily data into monthly data. Datetime indexes make resampling straightforward and consistent.

  3. Intuitive Slicing: With datetime indexes, you can slice and select data using natural date-based expressions, such as selecting all data from a specific month or year.

  4. Time-Based Plots: When creating visualizations, datetime indexes enable accurate and informative time-based plots, as the index itself conveys the temporal aspect of the data.

Datetime Indexing

Loading and Inspecting the Dataset

Note: We will use Seaborn's Taxis dataset (opens in a new tab). In case the 'taxis' dataset lacks data to demonstrate certain concepts, we'll supplement with dummy data.

import pandas as pd
import seaborn as sns
 
# Load the Seaborn's taxis dataset
taxis = sns.load_dataset('taxis')
 
print(taxis_df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6433 entries, 0 to 6432
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   pickup           6433 non-null   datetime64[ns]
 1   dropoff          6433 non-null   datetime64[ns]
 2   passengers       6433 non-null   int64         
 3   distance         6433 non-null   float64       
 4   fare             6433 non-null   float64       
 5   tip              6433 non-null   float64       
 6   tolls            6433 non-null   float64       
 7   total            6433 non-null   float64       
 8   color            6433 non-null   object        
 9   payment          6389 non-null   object        
 10  pickup_zone      6407 non-null   object        
 11  dropoff_zone     6388 non-null   object        
 12  pickup_borough   6407 non-null   object        
 13  dropoff_borough  6388 non-null   object        
dtypes: datetime64[ns](2), float64(5), int64(1), object(6)
memory usage: 703.7+ KB
None

Setting Datetime as Index

In Pandas, you can easily set a datetime column as the index using the set_index() function. Let's consider the Seaborn "taxis" dataset as an example:

# Set "pickup" as the index
taxis.set_index("pickup", inplace=True)
 
# Sort the DataFrame by the index (pickup dates) in ascending order
taxis.sort_index(inplace=True)
 
# Display the modified DataFrame for selected column
print(taxis[['dropoff', 'fare']].head())
                                dropoff  fare
pickup                                       
2019-02-28 23:29:03 2019-02-28 23:32:35   5.0
2019-03-01 00:03:29 2019-03-01 00:13:32  10.0
2019-03-01 00:08:32 2019-03-01 00:29:47  22.5
2019-03-01 00:15:53 2019-03-01 00:47:58  25.5
2019-03-01 00:29:22 2019-03-01 00:32:48   4.5

Datetime Indexing

With a datetime index in place, you can directly access and manipulate data using date-based indexing. Let's index the 'taxis' df using Pandas .loc attribute

# Access data for a specific date
date_indexing = taxis.loc["2019-03-05"]
 
print(date_indexing[['dropoff', 'fare']].head())
                                dropoff  fare
pickup                                       
2019-03-05 00:26:11 2019-03-05 00:30:29   5.0
2019-03-05 00:42:20 2019-03-05 00:51:13   9.0
2019-03-05 00:54:56 2019-03-05 01:11:28  16.5
2019-03-05 01:00:08 2019-03-05 01:11:32  11.0
2019-03-05 01:27:35 2019-03-05 01:51:45  36.0

Datetime Slicing

We can use slicing technique .loc[start:end] to access the data within a range of dates(and time)

# Access data for a specific date range
date_slicing = taxis.loc["2019-03-01":"2019-03-05"]
 
# Count the number of rows in the provided range
print(date_slicing['dropoff'].count())
# Output: 1007

Partial String Indexing

Partial string indexing is a powerful feature that enables us to select data from a datetime-indexed DataFrame based on specific years, months, days, or any other part of the date. This capability proves extremely useful for extracting and analyzing data with granularity.

To demonstrate partial string indexing, let's consider the following example:

# Access data for a month of March 2019
partial_string_indexing = taxis.loc["2019-03"] # can also use [03/2019]
 
# find the average fare for that month
print(partial_string_indexing['fare'].mean())
# Output: 13.09233053482587

Time Range Selection

Time range selection involves extracting data within a specified time range. Pandas provides convenient methods for achieving this, such as between_time() and boolean indexing.

Using between_time()

The between_time() function in Pandas is used to select data from a DataFrame based on a specific time range within each day. This function is particularly useful when dealing with time series data that follows daily patterns, such as stock market data, sensor readings, or any other data that has timestamps associated with time of day.

Here's the syntax of the between_time() function:

DataFrame.between_time(start_time, end_time, inclusive='both')
  • start_time and end_time: The time range within which you want to select data. These should be strings in the format 'HH:MM', representing hours and minutes.
  • inclusive which is by default set to both, which means rows with the start and end times will be included.

Here's how you can use it:

# Select data for the morning hours of 8 to 10 using between_time
morning_data = taxis.between_time("08:00", "10:00")
 
# Display the selected data
print(morning_data[['dropoff', 'fare']].head())
                                dropoff   fare
pickup                                        
2019-03-01 08:04:26 2019-03-01 08:12:44   7.00
2019-03-01 08:12:28 2019-03-01 08:17:19   5.00
2019-03-01 08:20:56 2019-03-01 08:25:54   5.50
2019-03-01 08:23:18 2019-03-01 08:52:12  19.26
2019-03-01 08:24:40 2019-03-01 08:49:35  17.00

Boolean Indexing for Time Ranges

Boolean indexing is another effective way to filter data based on time ranges. Let's explore how to achieve the identical results as above with this technique:

# Create a boolean mask for the desired time range (8 to 10 AM)
mask = (taxis.index.time >= pd.to_datetime("08:00").time()) & (taxis.index.time <= pd.to_datetime("10:00").time())
 
# Apply the mask to filter the data
morning_data_masking = taxis[mask]
 
# Display the selected data
print(morning_data_masking[['dropoff', 'fare']].head())
                                dropoff   fare
pickup                                        
2019-03-01 08:04:26 2019-03-01 08:12:44   7.00
2019-03-01 08:12:28 2019-03-01 08:17:19   5.00
2019-03-01 08:20:56 2019-03-01 08:25:54   5.50
2019-03-01 08:23:18 2019-03-01 08:52:12  19.26
2019-03-01 08:24:40 2019-03-01 08:49:35  17.00

Time Shifting

Time shifting involves moving data points in a time series backward or forward in time. This concept is particularly useful for comparing past and future values, calculating differences, or aligning data for analysis.

Using the shift() Method

Pandas provides the shift() method to perform time shifting on a DataFrame. By specifying a positive or negative value, you can shift the data by a certain number of time periods.

Here's the syntax of the shift() method:

DataFrame.shift(periods=1, freq=None, axis=0, fill_value=None)
  • periods: The number of periods (steps) by which the data should be shifted. Positive values shift data forward (to a later time), and negative values shift data backward (to an earlier time).
  • freq: The frequency of the data. This is useful when the index has a date-like structure, and the shifting should consider the frequency for correct alignment.
  • axis: The axis along which the data should be shifted. By default, it's 0, which refers to shifting rows. You can set axis=1 to shift columns.
  • fill_value: The value to use for newly introduced missing values after shifting. By default, missing values are filled with NaN. For datetime, timedelta, or period data, etc. NaT (opens in a new tab) is used.

Example

Let's first create some dummy data instead of using 'taxis' dataset:

# Generate date range for the whole year of 2023 with Hourly frequency
date_range = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
 
# Generate random values for the dataset
np.random.seed(42)  # For reproducibility
values = np.random.randint(100, 150, size=(len(date_range)),)
 
# Create the DataFrame with "Date" and "Values" columns
df = pd.DataFrame({'Date': date_range, 'Values': values})
 
# Set "Date" as index
df.set_index('Date', inplace=True)
 
print(df.head())
            Values
Date              
2023-01-01     138
2023-01-02     128
2023-01-03     114
2023-01-04     142
2023-01-05     107

Lagging Data (Shift Backward)

Lagging data means shifting it backward in time. For example, you might want to compare current values with values from the previous day.

# Lag the "Values" column by 1 day
df['Values_lag_1'] = df['Values'].shift(1)

Leading Data (Shift Forward)

Leading data means shifting it forward in time. This can be useful when you want to predict future values based on past data.

# Lead the "Values" column by 1 day
df['Values_lead_1'] = df['Values'].shift(-1)
 
# Display the DataFrame with lagged and leading values
print(df.head())
            Values  Values_lag_1  Values_lead_1
Date                                           
2023-01-01     138           NaN          128.0
2023-01-02     128         138.0          114.0
2023-01-03     114         128.0          142.0
2023-01-04     142         114.0          107.0
2023-01-05     107         142.0          120.0

Rolling Windows

Rolling windows, also known as moving windows or rolling periods, involve analyzing data over a fixed-size window as it "rolls" through the time series. This technique is useful for calculating moving averages, identifying trends, and smoothing noisy data.

Creating Rolling Windows with .rolling()

The .rolling() method is used to calculate rolling statistics over a window of consecutive observations. The window size is specified by the window parameter. Here's the syntax of the .rolling() function

DataFrame.rolling(window, min_periods=None, ...)
  • window: The size of the rolling window.
    • If integer, it specifies the number of data points to include in each window.
    • If a timedelta, str, or offset, it specifies the time period for each window. This is only valid for datetimelike indexes
  • min_periods: The minimum number of observations required in a window to have a value.
    • If window is specified as integer, this will be default to the size of window
    • If window is specified as timedelta or offset, this will be set by default to 1

Example

Using int as window

Let's continue with our dummy data that we have created in the last section and use that to calculate the 7 day moving average for the 'Value' column

# Calculate the 7-day rolling mean
df['Rolling_Mean'] = df['Values'].rolling(window=7).mean()
 
# Display the DataFrame with rolling mean
print(df.head(10))
            Values  Rolling_Mean
Date                            
2023-01-01     138           NaN
2023-01-02     128           NaN
2023-01-03     114           NaN
2023-01-04     142           NaN
2023-01-05     107           NaN
2023-01-06     120           NaN
2023-01-07     138    126.714286
2023-01-08     118    123.857143
2023-01-09     122    123.000000
2023-01-10     110    122.428571

The .rolling() function creates a rolling window of the specified size (here, 7 days) and calculates the mean for each window.

Using timedelta and offset as window

Now, let set the windows in form of timedelta and offset methods:

# Calculate the rolling mean over a 7-day window using timedelta
timedelta_window = pd.Timedelta(days=7)
# Set min_periods to 7 to exclude the first six rows from the result
df['Timedelta_7DMean'] = df['Values'].rolling(window=timedelta_window, min_periods=7).mean()
 
# Calculate the rolling mean over a 7-day window using an offset-based window
offset_window = pd.offsets.Day(7)
df['Offset_7DMean'] = df['Values'].rolling(window=offset_window, min_periods=7).mean()
 
# Display the DataFrame with rolling mean from various approaches
print(df.head(10))
            Values  Rolling_7DMean  Timedelta_7DMean  Offset_7DMean
Date                                                               
2023-01-01     138             NaN               NaN            NaN
2023-01-02     128             NaN               NaN            NaN
2023-01-03     114             NaN               NaN            NaN
2023-01-04     142             NaN               NaN            NaN
2023-01-05     107             NaN               NaN            NaN
2023-01-06     120             NaN               NaN            NaN
2023-01-07     138      126.714286        126.714286     126.714286
2023-01-08     118      123.857143        123.857143     123.857143
2023-01-09     122      123.000000        123.000000     123.000000
2023-01-10     110      122.428571        122.428571     122.428571

Real-World Examples

Time-based indexing and slicing are essential tools in the world of finance and investment. They enable analysts and traders to extract valuable insights from time series data related to stock prices, economic indicators, and more.

Example 1: Datetime Indexing

First, we download the 5 years Apple's stock price data from Yahoo Finance (opens in a new tab) To demonstrate the indexing the datetime index, we will use the subset of data from Jan 2023 to June 30th 2023.

import pandas as pd
import matplotlib.pyplot as plt
 
# Load stock price data
apple = pd.read_csv('AAPL-5Y.csv', parse_dates=['Date'], index_col='Date')
 
# Plot stock prices over a specific period
start_date = '2023-01'
end_date = '2023-06'
selected_period = apple[start_date:end_date]
 
plt.figure(figsize=(10, 6))
plt.plot(selected_period.index, selected_period['Close'], label='Stock Price')
plt.title('Stock Price Chart')
plt.xlabel('Date')
plt.ylabel('Price')
plt.legend()
plt.grid(True)
plt.show()

DateTime Indexing in Pandas

Example 2: Moving Averages for Technical Analysis

Technical analysts often use moving averages to identify trends and potential buy/sell signals. As we now know how to calculate the moving averages, let's calculate 50 and 100 days moving average and plot the results:

# Calculate and plot 50-day and 100-day moving averages
moving_50 = apple['Close'].rolling(window=50).mean()
moving_100 = apple['Close'].rolling(window=100).mean()
 
plt.figure(figsize=(10, 6))
plt.plot(apple.index, apple['Close'], label='Closing Price')
plt.plot(moving_50.index, moving_50, label='50-day Moving Avg')
plt.plot(moving_100.index, moving_100, label='100-day Moving Avg')
plt.title('Moving Averages for Technical Analysis')
plt.xlabel('Date')
plt.ylabel('Price')
plt.legend()
plt.grid(True)
plt.show()

DateTime Indexing in Pandas

Example 3: Resampling

We have daily closing price for Apple stock for 5-years period. Let's use our knowledge of resampling, to find the average monthly closing price and plot the results:

# Resample (downsampling) to monthly, Mean of closing price
monthly_closing_average = apple['Close'].resample('M').mean()
 
plt.figure(figsize=(10, 6))
monthly_closing_average.plot(kind='line', label='Avg Monthly Close', 
                             title="AAPL: Average Monthly Closing Price",
                             xlabel="Date", ylabel="Price")
plt.legend()
plt.grid(True)
plt.show()

DateTime Indexing in Pandas