Working with Date and Time Data in Pandas

Working with date and time data is a fundamental aspect of data analysis, especially when dealing with time series or temporal datasets. Pandas offers robust tools to handle and manipulate date and time data effectively. In this post, we'll cover parsing date strings, using a datetime index, extracting date and time components, and handling time zones in Pandas.

Loading and Inspecting the Dataset

Note: We will use Seaborn's 'taxis' dataset. Where the 'taxis' dataset lacks the data needed to demonstrate a concept, we'll supplement it with dummy data.

import pandas as pd
import seaborn as sns
 
# Load Seaborn's taxis dataset
taxis_df = sns.load_dataset('taxis')
 
print(taxis_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6433 entries, 0 to 6432
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   pickup           6433 non-null   datetime64[ns]
 1   dropoff          6433 non-null   datetime64[ns]
 2   passengers       6433 non-null   int64         
 3   distance         6433 non-null   float64       
 4   fare             6433 non-null   float64       
 5   tip              6433 non-null   float64       
 6   tolls            6433 non-null   float64       
 7   total            6433 non-null   float64       
 8   color            6433 non-null   object        
 9   payment          6389 non-null   object        
 10  pickup_zone      6407 non-null   object        
 11  dropoff_zone     6388 non-null   object        
 12  pickup_borough   6407 non-null   object        
 13  dropoff_borough  6388 non-null   object        
dtypes: datetime64[ns](2), float64(5), int64(1), object(6)
memory usage: 703.7+ KB
None

Date and Time Parsing

Using `pd.to_datetime()`

One of the fundamental steps in working with date and time data is converting date-time strings into datetime objects. The pd.to_datetime() function accepts a single string, a list, or a Series of strings and returns the corresponding datetime values. It recognizes many common layouts, but every value must be parseable as a date or time; by default, a value that cannot be parsed raises an error.

The 'pickup' and 'dropoff' columns of the 'taxis' dataset are already of dtype datetime64[ns]. Therefore, to demonstrate how pd.to_datetime() converts date strings into datetime objects, let's create some dummy data.

# Sample weather dataset with date-time columns
data = {'Date': ['2023-08-01', '2023-08-02', '2023-08-03'],
        'Temperature_C': [25.5, 26.2, 24.8],
        'Humidity': [65, 70, 60]}
 
weather_df = pd.DataFrame(data)
 
# print the dtype of 'Date' column
print(weather_df['Date'].dtypes)  # Output: object
 
# Convert 'Date' column to datetime
weather_df['Date'] = pd.to_datetime(weather_df['Date']) 
 
# print the dtype of 'Date' column
print(weather_df['Date'].dtypes)  # Output: datetime64[ns]
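When some of the strings are not valid dates, the conversion fails unless we tell Pandas what to do with them. The following is a minimal sketch using the errors parameter of pd.to_datetime(); the raw_dates Series and its values are made up for illustration.

```python
import pandas as pd

# Dummy input with one unparseable entry (values are illustrative only)
raw_dates = pd.Series(['2023-08-01', 'not a date', '2023-08-03'])

# Default behaviour: an unparseable value raises an error
try:
    pd.to_datetime(raw_dates)
    default_raised = False
except ValueError:
    default_raised = True

# errors='coerce' replaces unparseable values with NaT (Not a Time)
parsed = pd.to_datetime(raw_dates, errors='coerce')

print(default_raised)          # True
print(parsed.isna().tolist())  # [False, True, False]
```

Coercing is convenient for messy data, but it hides bad input silently, so it is worth counting the NaT values afterwards.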

Handling Different Date and Time Formats

Pandas can often infer the layout of date strings automatically. When the layout is known in advance, or is ambiguous (for example, day-first versus month-first), we can pass it explicitly through the format parameter of pd.to_datetime(), using strftime-style directives. Here are some common layouts:

"YYYY-MM-DD" (format='%Y-%m-%d'): This is the ISO 8601 format for dates.
"YYYY-MM-DD HH:MM:SS" (format='%Y-%m-%d %H:%M:%S'): This is a common format for timestamps.
"MM/DD/YYYY" (format='%m/%d/%Y'): This is the format used in the United States.
"DD/MM/YYYY" (format='%d/%m/%Y'): This is the format used in many European countries.
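As a sketch of how the format parameter is used in practice, the example below parses the same three dates written in the US and European layouts listed above. The Series names and values are dummy data invented for this illustration.

```python
import pandas as pd

# Dummy date strings (illustrative values): the same three days in two layouts
us_dates = pd.Series(['08/01/2023', '08/02/2023', '12/03/2023'])
eu_dates = pd.Series(['01/08/2023', '02/08/2023', '03/12/2023'])
stamps = pd.Series(['2023-08-01 14:30:00', '2023-08-02 09:15:45'])

# An explicit format removes the month-first / day-first ambiguity
us_parsed = pd.to_datetime(us_dates, format='%m/%d/%Y')
eu_parsed = pd.to_datetime(eu_dates, format='%d/%m/%Y')
ts_parsed = pd.to_datetime(stamps, format='%Y-%m-%d %H:%M:%S')

print(us_parsed.dt.month.tolist())  # [8, 8, 12]
print(eu_parsed.dt.month.tolist())  # [8, 8, 12]
print(ts_parsed.dt.hour.tolist())   # [14, 9]
```

Without an explicit format, a string such as '01/08/2023' could be read as either January 8 or August 1, so stating the layout is the safer choice whenever it is known.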

Datetime Index

Setting Datetime Column as Index

One of the most powerful features of Pandas is its ability to utilize datetime values as the index of a DataFrame. This approach, known as a datetime index, facilitates efficient time-based data manipulation and analysis.

We can set the datetime column as the index with the set_index() method, passing the name of the column to use as the index.

# Set 'pickup' column as the datetime index
taxis_df.set_index('pickup', inplace=True)
 
# print the newly set index
print(taxis_df.index[:5])

DatetimeIndex(['2019-03-23 20:21:09', '2019-03-04 16:11:55',
               '2019-03-27 17:53:01', '2019-03-10 01:23:59',
               '2019-03-30 13:27:42'],
              dtype='datetime64[ns]', name='pickup', freq=None)

Advantages of Using Datetime Index

Utilizing a datetime index offers several advantages in time-based analysis:

Efficient Time Series Operations: Datetime indices enable efficient slicing, resampling, and aggregation of time series data. It's particularly useful when working with large datasets.
Time-based Filtering: You can easily filter data for specific time periods, helping you perform focused analyses.
Seamless Visualization: Datetime indices work well with plotting libraries like Matplotlib and Seaborn, which place the dates on the time axis without extra conversion.
Resampling: Resampling time series data to different frequencies (e.g., daily to monthly) becomes straightforward.
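The advantages above can be sketched on a small example. The trips DataFrame below is dummy data invented for illustration (it is not the 'taxis' dataset); it shows time-based filtering with partial date strings and resampling to a daily frequency.

```python
import pandas as pd

# Hypothetical trips with a sorted datetime index (values are made up)
trips = pd.DataFrame(
    {'fare': [7.0, 5.0, 27.0, 6.5, 9.0, 13.0]},
    index=pd.to_datetime([
        '2019-03-01 08:15:00', '2019-03-01 18:40:00',
        '2019-03-02 09:05:00', '2019-03-04 22:10:00',
        '2019-03-05 07:55:00', '2019-03-05 12:30:00',
    ]),
)
trips.index.name = 'pickup'

# Time-based filtering: a partial date string selects the whole day
first_day = trips.loc['2019-03-01']

# Slicing between two dates includes every row up to the end of the last day
early_march = trips.loc['2019-03-01':'2019-03-02']

# Resampling: aggregate individual trips into daily fare totals
daily_fare = trips['fare'].resample('D').sum()

print(len(first_day), len(early_march))  # 2 3
print(daily_fare)
```

Note that resampling creates a row for 2019-03-03 even though no trip happened that day; with sum() its total is 0.0.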

Date and Time Components

Extracting Date and Time Components

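Once we have datetime data, we often need to break it down into its constituent parts, such as year, month, day, hour, minute, second, weekday, dayofyear, quarter, and is_leap_year. The ISO week number is available through isocalendar().week; the older weekofyear attribute was removed in pandas 2.0. For a datetime column, the .dt accessor provides a convenient way to extract these components, while a DatetimeIndex exposes them directly as attributes.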

Example 1: Continuing with our "taxis" dataset, let's extract various time components from the datetime index.

# Extract date and time components
taxis_df['P-Hour'] = taxis_df.index.hour
taxis_df['P-Minute'] = taxis_df.index.minute
taxis_df['P-Second'] = taxis_df.index.second
 
# Print the extracted information
print(taxis_df[['P-Hour','P-Minute', 'P-Second']].head())

                     P-Hour  P-Minute  P-Second
pickup                                         
2019-03-23 20:21:09      20        21         9
2019-03-04 16:11:55      16        11        55
2019-03-27 17:53:01      17        53         1
2019-03-10 01:23:59       1        23        59
2019-03-30 13:27:42      13        27        42

Example 2: In the above example, we extracted the information from the index object. Now, let's extract the information from the datetime object in the column 'dropoff':

# Extract date and time components
taxis_df['D-Hour'] = taxis_df['dropoff'].dt.hour
taxis_df['D-Minute'] = taxis_df['dropoff'].dt.minute
taxis_df['D-Second'] = taxis_df['dropoff'].dt.second
 
# Print the extracted information
print(taxis_df[['D-Hour','D-Minute', 'D-Second']].head())

                     D-Hour  D-Minute  D-Second
pickup                                         
2019-03-23 20:21:09      20        27        24
2019-03-04 16:11:55      16        19         0
2019-03-27 17:53:01      18         0        25
2019-03-10 01:23:59       1        49        51
2019-03-30 13:27:42      13        37        14

Significance of Date and Time Components

Year, Month, and Day: Extracting these components helps in analyzing trends and seasonality in your data. For instance, you can observe if certain months or days of the week have higher activity.
Hour and Minute: These components are crucial for understanding the daily patterns of your data. You can identify peak hours and times of low activity.
Second: While less commonly used, it can be relevant when analyzing events with high frequency.
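The calendar-level components mentioned above can be extracted in the same way as the hour, minute, and second. The sketch below uses three of the 'dropoff' timestamps shown earlier, copied into a standalone Series so that it runs on its own; the dropoffs and parts names are chosen for this example.

```python
import pandas as pd

# Three dropoff timestamps copied from the earlier output
dropoffs = pd.Series(pd.to_datetime([
    '2019-03-23 20:27:24', '2019-03-04 16:19:00', '2019-03-10 01:49:51',
]))

# Collect calendar components side by side
parts = pd.DataFrame({
    'year': dropoffs.dt.year,
    'month': dropoffs.dt.month,
    'day': dropoffs.dt.day,
    'weekday': dropoffs.dt.day_name(),           # e.g. 'Saturday'
    'quarter': dropoffs.dt.quarter,
    'iso_week': dropoffs.dt.isocalendar().week,  # replaces weekofyear
})

print(parts)
```

Grouping by columns such as weekday or month is then a direct way to look for the trends and seasonality described above.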

Time Zones

Assigning and Converting Time Zones

When working with global data, it's essential to manage time zones correctly. Pandas provides methods to assign and convert datetime values between different time zones using the tz_localize() and tz_convert() functions.

Assign Using `.tz_localize()`

The .tz_localize() function is used to assign a time zone to a datetime object that is currently timezone-naive. This means that if you have a datetime without any timezone information, you can use this function to specify its time zone.

Continuing with the "taxis" dataset, let's assign a time zone to the datetime objects in the dataset: the index and the 'dropoff' column.

# Assigning the New York time zone to dropoff
taxis_df['dropoff'] = taxis_df['dropoff'].dt.tz_localize('America/New_York')
 
print(taxis_df['dropoff'].head())

pickup
2019-03-23 20:21:09   2019-03-23 20:27:24-04:00
2019-03-04 16:11:55   2019-03-04 16:19:00-05:00
2019-03-27 17:53:01   2019-03-27 18:00:25-04:00
2019-03-10 01:23:59   2019-03-10 01:49:51-05:00
2019-03-30 13:27:42   2019-03-30 13:37:14-04:00
Name: dropoff, dtype: datetime64[ns, America/New_York]

If we would like to assign the timezone to the index as well, we can use:

taxis_df.index = taxis_df.index.tz_localize('America/New_York')

Convert Using `.tz_convert()`

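The .tz_convert() function converts timezone-aware datetime values from one time zone to another. The underlying instant is unchanged; only its local representation changes. The values must already carry a time zone, so timezone-naive data needs .tz_localize() first.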

Now, suppose we need to convert the time zone of the 'dropoff' column to 'Asia/Singapore':

# Converting timezone to Asia/Singapore, saving as new column 'dropoff-sg'
taxis_df['dropoff-sg'] = taxis_df['dropoff'].dt.tz_convert('Asia/Singapore')
 
print(taxis_df[['dropoff', 'dropoff-sg']].head())
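To see what the conversion does to individual values, here is a minimal standalone sketch. It copies two of the 'dropoff' timestamps shown earlier into a new Series (the variable names are chosen for this example), localizes them to New York, and converts them to Singapore time.

```python
import pandas as pd

# Two dropoff timestamps copied from the earlier output (timezone-naive)
dropoff = pd.Series(pd.to_datetime([
    '2019-03-23 20:27:24', '2019-03-04 16:19:00',
]))

# Assign the original time zone, then convert to Singapore time
dropoff_ny = dropoff.dt.tz_localize('America/New_York')
dropoff_sg = dropoff_ny.dt.tz_convert('Asia/Singapore')

print(dropoff_sg.iloc[0])  # 2019-03-24 08:27:24+08:00
print(dropoff_sg.iloc[1])  # 2019-03-05 05:19:00+08:00
```

The first value moves forward by 12 hours and the second by 13, because New York was on daylight time (UTC-4) for the first and standard time (UTC-5) for the second, while Singapore stays at UTC+8 all year.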

Handling Daylight Saving Time (DST) Changes

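Daylight Saving Time (DST) changes can complicate time zone conversions. The .tz_convert() method applies the correct UTC offset on either side of a transition automatically, as the -05:00 and -04:00 offsets in the 'dropoff' output above show. The .tz_localize() method does the same for unambiguous local times, but by default it raises an error for local times that fall into a DST gap (nonexistent times) or a DST overlap (ambiguous times). Its nonexistent and ambiguous parameters control how those cases are resolved.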
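The edge cases around DST transitions can be sketched with two hand-picked local times in New York. The values below are dummy data chosen for illustration, and the choice of 'shift_forward' and 'NaT' is only one possible policy, not a recommendation for every dataset.

```python
import pandas as pd

# 2019-03-10 02:30 never existed in New York (clocks jumped from 02:00 to 03:00)
# 2019-11-03 01:30 happened twice (clocks fell back from 02:00 to 01:00)
naive = pd.Series(pd.to_datetime([
    '2019-03-10 02:30:00', '2019-11-03 01:30:00',
]))

# With the defaults, such local times raise an error
# (the exact exception type depends on the pandas version)
try:
    naive.dt.tz_localize('America/New_York')
    raised = False
except Exception:
    raised = True

# Resolve them explicitly: move nonexistent times forward to the first
# valid instant, and mark ambiguous times as missing
localized = naive.dt.tz_localize(
    'America/New_York',
    nonexistent='shift_forward',
    ambiguous='NaT',
)

print(raised)             # True
print(localized.iloc[0])  # 2019-03-10 03:00:00-04:00
print(localized.iloc[1])  # NaT
```

If the data records whether each ambiguous time was observed during daylight time, that information can be passed to the ambiguous parameter instead of discarding the value.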
