Time-based Indexing
In the realm of data science and analysis, time series data is a common and valuable type of information that represents observations collected at specific time points. To effectively work with time series data, it's essential to understand time-based indexing and slicing. Time-based indexing involves using time-related information, such as dates and timestamps, to access and manipulate data in a dataset.
Importance of Time-Based Indexing
Time-based indexing is crucial for time series analysis because it allows us to harness the temporal structure of the data. This enables us to perform various tasks such as identifying trends, detecting seasonality, and making forecasts. By indexing data with respect to time, we can efficiently retrieve and process specific time intervals, which is essential for analyzing trends and patterns that unfold over time.
Advantages of Using Datetime Indexes
When dealing with time-based data, using a datetime index offers several advantages:
-
Efficient Data Alignment: Datetime indexes facilitate data alignment across multiple datasets with varying time intervals. This alignment simplifies the process of combining and comparing data, even when the time points don't perfectly align.
-
Easy Resampling: Resampling involves changing the frequency of data observations, such as aggregating daily data into monthly data. Datetime indexes make resampling straightforward and consistent.
-
Intuitive Slicing: With datetime indexes, you can slice and select data using natural date-based expressions, such as selecting all data from a specific month or year.
-
Time-Based Plots: When creating visualizations, datetime indexes enable accurate and informative time-based plots, as the index itself conveys the temporal aspect of the data.
Datetime Indexing
Loading and Inspecting the Dataset
Note: We will use Seaborn's Taxis dataset (opens in a new tab). In case the 'taxis' dataset lacks data to demonstrate certain concepts, we'll supplement with dummy data.
import pandas as pd
import seaborn as sns
# Load the Seaborn's taxis dataset
taxis = sns.load_dataset('taxis')
print(taxis_df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6433 entries, 0 to 6432
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pickup 6433 non-null datetime64[ns]
1 dropoff 6433 non-null datetime64[ns]
2 passengers 6433 non-null int64
3 distance 6433 non-null float64
4 fare 6433 non-null float64
5 tip 6433 non-null float64
6 tolls 6433 non-null float64
7 total 6433 non-null float64
8 color 6433 non-null object
9 payment 6389 non-null object
10 pickup_zone 6407 non-null object
11 dropoff_zone 6388 non-null object
12 pickup_borough 6407 non-null object
13 dropoff_borough 6388 non-null object
dtypes: datetime64[ns](2), float64(5), int64(1), object(6)
memory usage: 703.7+ KB
None
Setting Datetime as Index
In Pandas, you can easily set a datetime column as the index using the set_index()
function. Let's consider the Seaborn "taxis" dataset as an example:
# Set "pickup" as the index
taxis.set_index("pickup", inplace=True)
# Sort the DataFrame by the index (pickup dates) in ascending order
taxis.sort_index(inplace=True)
# Display the modified DataFrame for selected column
print(taxis[['dropoff', 'fare']].head())
dropoff fare
pickup
2019-02-28 23:29:03 2019-02-28 23:32:35 5.0
2019-03-01 00:03:29 2019-03-01 00:13:32 10.0
2019-03-01 00:08:32 2019-03-01 00:29:47 22.5
2019-03-01 00:15:53 2019-03-01 00:47:58 25.5
2019-03-01 00:29:22 2019-03-01 00:32:48 4.5
Datetime Indexing
With a datetime index in place, you can directly access and manipulate data using date-based indexing. Let's index the 'taxis' df using Pandas .loc
attribute
# Access data for a specific date
date_indexing = taxis.loc["2019-03-05"]
print(date_indexing[['dropoff', 'fare']].head())
dropoff fare
pickup
2019-03-05 00:26:11 2019-03-05 00:30:29 5.0
2019-03-05 00:42:20 2019-03-05 00:51:13 9.0
2019-03-05 00:54:56 2019-03-05 01:11:28 16.5
2019-03-05 01:00:08 2019-03-05 01:11:32 11.0
2019-03-05 01:27:35 2019-03-05 01:51:45 36.0
Datetime Slicing
We can use slicing technique .loc[start:end]
to access the data within a range of dates(and time)
# Access data for a specific date range
date_slicing = taxis.loc["2019-03-01":"2019-03-05"]
# Count the number of rows in the provided range
print(date_slicing['dropoff'].count())
# Output: 1007
Partial String Indexing
Partial string indexing is a powerful feature that enables us to select data from a datetime-indexed DataFrame based on specific years, months, days, or any other part of the date. This capability proves extremely useful for extracting and analyzing data with granularity.
To demonstrate partial string indexing, let's consider the following example:
# Access data for a month of March 2019
partial_string_indexing = taxis.loc["2019-03"] # can also use [03/2019]
# find the average fare for that month
print(partial_string_indexing['fare'].mean())
# Output: 13.09233053482587
Time Range Selection
Time range selection involves extracting data within a specified time range. Pandas provides convenient methods for achieving this, such as between_time()
and boolean indexing.
Using between_time()
The between_time()
function in Pandas is used to select data from a DataFrame based on a specific time range within each day. This function is particularly useful when dealing with time series data that follows daily patterns, such as stock market data, sensor readings, or any other data that has timestamps associated with time of day.
Here's the syntax of the between_time()
function:
DataFrame.between_time(start_time, end_time, inclusive='both')
start_time
andend_time
: The time range within which you want to select data. These should be strings in the format 'HH:MM', representing hours and minutes.inclusive
which is by default set to both, which means rows with the start and end times will be included.
Here's how you can use it:
# Select data for the morning hours of 8 to 10 using between_time
morning_data = taxis.between_time("08:00", "10:00")
# Display the selected data
print(morning_data[['dropoff', 'fare']].head())
dropoff fare
pickup
2019-03-01 08:04:26 2019-03-01 08:12:44 7.00
2019-03-01 08:12:28 2019-03-01 08:17:19 5.00
2019-03-01 08:20:56 2019-03-01 08:25:54 5.50
2019-03-01 08:23:18 2019-03-01 08:52:12 19.26
2019-03-01 08:24:40 2019-03-01 08:49:35 17.00
Boolean Indexing for Time Ranges
Boolean indexing is another effective way to filter data based on time ranges. Let's explore how to achieve the identical results as above with this technique:
# Create a boolean mask for the desired time range (8 to 10 AM)
mask = (taxis.index.time >= pd.to_datetime("08:00").time()) & (taxis.index.time <= pd.to_datetime("10:00").time())
# Apply the mask to filter the data
morning_data_masking = taxis[mask]
# Display the selected data
print(morning_data_masking[['dropoff', 'fare']].head())
dropoff fare
pickup
2019-03-01 08:04:26 2019-03-01 08:12:44 7.00
2019-03-01 08:12:28 2019-03-01 08:17:19 5.00
2019-03-01 08:20:56 2019-03-01 08:25:54 5.50
2019-03-01 08:23:18 2019-03-01 08:52:12 19.26
2019-03-01 08:24:40 2019-03-01 08:49:35 17.00
Time Shifting
Time shifting involves moving data points in a time series backward or forward in time. This concept is particularly useful for comparing past and future values, calculating differences, or aligning data for analysis.
Using the shift()
Method
Pandas provides the shift()
method to perform time shifting on a DataFrame. By specifying a positive or negative value, you can shift the data by a certain number of time periods.
Here's the syntax of the shift()
method:
DataFrame.shift(periods=1, freq=None, axis=0, fill_value=None)
periods
: The number of periods (steps) by which the data should be shifted. Positive values shift data forward (to a later time), and negative values shift data backward (to an earlier time).freq
: The frequency of the data. This is useful when the index has a date-like structure, and the shifting should consider the frequency for correct alignment.axis
: The axis along which the data should be shifted. By default, it's0
, which refers to shifting rows. You can setaxis=1
to shift columns.fill_value
: The value to use for newly introduced missing values after shifting. By default, missing values are filled with NaN. For datetime, timedelta, or period data, etc.NaT
(opens in a new tab) is used.
Example
Let's first create some dummy data instead of using 'taxis' dataset:
# Generate date range for the whole year of 2023 with Hourly frequency
date_range = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
# Generate random values for the dataset
np.random.seed(42) # For reproducibility
values = np.random.randint(100, 150, size=(len(date_range)),)
# Create the DataFrame with "Date" and "Values" columns
df = pd.DataFrame({'Date': date_range, 'Values': values})
# Set "Date" as index
df.set_index('Date', inplace=True)
print(df.head())
Values
Date
2023-01-01 138
2023-01-02 128
2023-01-03 114
2023-01-04 142
2023-01-05 107
Lagging Data (Shift Backward)
Lagging data means shifting it backward in time. For example, you might want to compare current values with values from the previous day.
# Lag the "Values" column by 1 day
df['Values_lag_1'] = df['Values'].shift(1)
Leading Data (Shift Forward)
Leading data means shifting it forward in time. This can be useful when you want to predict future values based on past data.
# Lead the "Values" column by 1 day
df['Values_lead_1'] = df['Values'].shift(-1)
# Display the DataFrame with lagged and leading values
print(df.head())
Values Values_lag_1 Values_lead_1
Date
2023-01-01 138 NaN 128.0
2023-01-02 128 138.0 114.0
2023-01-03 114 128.0 142.0
2023-01-04 142 114.0 107.0
2023-01-05 107 142.0 120.0
Rolling Windows
Rolling windows, also known as moving windows or rolling periods, involve analyzing data over a fixed-size window as it "rolls" through the time series. This technique is useful for calculating moving averages, identifying trends, and smoothing noisy data.
Creating Rolling Windows with .rolling()
The .rolling()
method is used to calculate rolling statistics over a window of consecutive observations. The window size is specified by the window
parameter. Here's the syntax of the .rolling()
function
DataFrame.rolling(window, min_periods=None, ...)
window
: The size of the rolling window.- If integer, it specifies the number of data points to include in each window.
- If a timedelta, str, or offset, it specifies the time period for each window. This is only valid for datetimelike indexes
min_periods
: The minimum number of observations required in a window to have a value.- If window is specified as integer, this will be default to the size of window
- If window is specified as timedelta or offset, this will be set by default to
1
Example
Using int as window
Let's continue with our dummy data that we have created in the last section and use that to calculate the 7 day moving average for the 'Value' column
# Calculate the 7-day rolling mean
df['Rolling_Mean'] = df['Values'].rolling(window=7).mean()
# Display the DataFrame with rolling mean
print(df.head(10))
Values Rolling_Mean
Date
2023-01-01 138 NaN
2023-01-02 128 NaN
2023-01-03 114 NaN
2023-01-04 142 NaN
2023-01-05 107 NaN
2023-01-06 120 NaN
2023-01-07 138 126.714286
2023-01-08 118 123.857143
2023-01-09 122 123.000000
2023-01-10 110 122.428571
The .rolling()
function creates a rolling window of the specified size (here, 7 days) and calculates the mean for each window.
Using timedelta and offset as window
Now, let set the windows in form of timedelta and offset methods:
# Calculate the rolling mean over a 7-day window using timedelta
timedelta_window = pd.Timedelta(days=7)
# Set min_periods to 7 to exclude the first six rows from the result
df['Timedelta_7DMean'] = df['Values'].rolling(window=timedelta_window, min_periods=7).mean()
# Calculate the rolling mean over a 7-day window using an offset-based window
offset_window = pd.offsets.Day(7)
df['Offset_7DMean'] = df['Values'].rolling(window=offset_window, min_periods=7).mean()
# Display the DataFrame with rolling mean from various approaches
print(df.head(10))
Values Rolling_7DMean Timedelta_7DMean Offset_7DMean
Date
2023-01-01 138 NaN NaN NaN
2023-01-02 128 NaN NaN NaN
2023-01-03 114 NaN NaN NaN
2023-01-04 142 NaN NaN NaN
2023-01-05 107 NaN NaN NaN
2023-01-06 120 NaN NaN NaN
2023-01-07 138 126.714286 126.714286 126.714286
2023-01-08 118 123.857143 123.857143 123.857143
2023-01-09 122 123.000000 123.000000 123.000000
2023-01-10 110 122.428571 122.428571 122.428571
Real-World Examples
Time-based indexing and slicing are essential tools in the world of finance and investment. They enable analysts and traders to extract valuable insights from time series data related to stock prices, economic indicators, and more.
Example 1: Datetime Indexing
First, we download the 5 years Apple's stock price data from Yahoo Finance (opens in a new tab) To demonstrate the indexing the datetime index, we will use the subset of data from Jan 2023 to June 30th 2023.
import pandas as pd
import matplotlib.pyplot as plt
# Load stock price data
apple = pd.read_csv('AAPL-5Y.csv', parse_dates=['Date'], index_col='Date')
# Plot stock prices over a specific period
start_date = '2023-01'
end_date = '2023-06'
selected_period = apple[start_date:end_date]
plt.figure(figsize=(10, 6))
plt.plot(selected_period.index, selected_period['Close'], label='Stock Price')
plt.title('Stock Price Chart')
plt.xlabel('Date')
plt.ylabel('Price')
plt.legend()
plt.grid(True)
plt.show()
Example 2: Moving Averages for Technical Analysis
Technical analysts often use moving averages to identify trends and potential buy/sell signals. As we now know how to calculate the moving averages, let's calculate 50 and 100 days moving average and plot the results:
# Calculate and plot 50-day and 100-day moving averages
moving_50 = apple['Close'].rolling(window=50).mean()
moving_100 = apple['Close'].rolling(window=100).mean()
plt.figure(figsize=(10, 6))
plt.plot(apple.index, apple['Close'], label='Closing Price')
plt.plot(moving_50.index, moving_50, label='50-day Moving Avg')
plt.plot(moving_100.index, moving_100, label='100-day Moving Avg')
plt.title('Moving Averages for Technical Analysis')
plt.xlabel('Date')
plt.ylabel('Price')
plt.legend()
plt.grid(True)
plt.show()
Example 3: Resampling
We have daily closing price for Apple stock for 5-years period. Let's use our knowledge of resampling, to find the average monthly closing price and plot the results:
# Resample (downsampling) to monthly, Mean of closing price
monthly_closing_average = apple['Close'].resample('M').mean()
plt.figure(figsize=(10, 6))
monthly_closing_average.plot(kind='line', label='Avg Monthly Close',
title="AAPL: Average Monthly Closing Price",
xlabel="Date", ylabel="Price")
plt.legend()
plt.grid(True)
plt.show()