Data Type Conversion in Pandas

Data type conversion is a pivotal step in data preprocessing and analysis. Pandas provides a range of methods to convert data types, allowing you to tailor your data to your analytical needs. Before we delve into data conversion, let's first explore the data types supported in Pandas.

Supported Data Types in Pandas

Data types define how the data is stored and interpreted by a computer. In Pandas, columns in DataFrames can have various data types such as integers, floating-point numbers, strings, and more. It's essential to choose the right data type to optimize memory usage and enable accurate analyses.
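
To make the memory point concrete, here is a minimal sketch that stores the same values with two integer widths and compares their size. The variable names (wide, narrow) and the sample values are illustrative assumptions, not part of the original tutorial.

```python
import pandas as pd

# 1,000 small integers (all between 0 and 99) stored as 64-bit integers
wide = pd.Series([i % 100 for i in range(1000)], dtype="int64")

# The same values fit comfortably in 8 bits
narrow = wide.astype("int8")

# memory_usage(index=False) reports the bytes used by the values only
print(wide.memory_usage(index=False))    # 8000
print(narrow.memory_usage(index=False))  # 1000

# The values themselves are unchanged by the conversion
print((wide == narrow).all())  # True
```

The smaller type is only safe because every value fits in its range; values outside that range would not survive the conversion intact.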

Here's a table that lists the commonly used data types in Pandas along with their corresponding strings that can be used with .astype() and other conversion methods:

| Data Type | Pandas String | Description |
|---|---|---|
| int | 'int' | Signed integer (typically 64 bits; the default width is platform-dependent) |
| int8 | 'int8' | Signed integer (8 bits) |
| int16 | 'int16' | Signed integer (16 bits) |
| int32 | 'int32' | Signed integer (32 bits) |
| int64 | 'int64' | Signed integer (64 bits) |
| uint8 | 'uint8' | Unsigned integer (8 bits) |
| uint16 | 'uint16' | Unsigned integer (16 bits) |
| uint32 | 'uint32' | Unsigned integer (32 bits) |
| uint64 | 'uint64' | Unsigned integer (64 bits) |
| float | 'float' | Floating-point number (default 64 bits) |
| float16 | 'float16' | Floating-point number (16 bits) |
| float32 | 'float32' | Floating-point number (32 bits) |
| float64 | 'float64' | Floating-point number (64 bits) |
| bool | 'bool' | Boolean (True or False) |
| object | 'object' | Python object (usually strings, or mixed types) |
| string | 'string' | Dedicated string type that supports missing values (introduced in Pandas 1.0.0) |
| datetime64 | 'datetime64[ns]' | Date and time (64-bit, nanosecond resolution) |
| timedelta64 | 'timedelta64[ns]' | Time difference (64-bit, nanosecond resolution) |
| category | 'category' | Categorical data type |
| nullable integer | 'Int8', 'Int16', 'Int32', 'Int64' | Nullable integers (introduced in Pandas 0.24.0) |

Which data type fits best depends on your data and analysis needs. Additionally, the 'string' and nullable integer ('Int8', 'Int16', 'Int32', 'Int64') data types are Pandas extension types that older releases do not provide (the 'string' type, for example, arrived in Pandas 1.0.0), so check your installed version with pd.__version__ before relying on them.
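
As a quick way to see which of these data types a DataFrame actually holds, the sketch below prints the installed version and the per-column types, then picks out the numeric columns. The column names (id, price, in_stock) are made up for illustration.

```python
import pandas as pd

# A small illustrative DataFrame with three different kinds of columns
df = pd.DataFrame({
    "id": [1, 2, 3],
    "price": [9.99, 14.5, 3.0],
    "in_stock": [True, False, True],
})

# The installed Pandas version determines which data types are available
print(pd.__version__)

# One data type per column
print(df.dtypes)

# select_dtypes filters columns by type; "number" covers integers and floats
numeric_cols = df.select_dtypes(include="number").columns.tolist()
print(numeric_cols)  # ['id', 'price']
```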

Difference between Signed and Unsigned Integer

In Pandas, as in programming generally, int and uint refer to two kinds of integer data types:

int (Signed Integer):

  • int stands for "integer," and it represents signed integers, which means they can be both positive and negative.
  • Signed integers split their range between negative and non-negative numbers: with N bits they cover -2^(N-1) to 2^(N-1) - 1, so int8 ranges from -128 to 127.
  • In Pandas, the default integer data type (int64) is signed, meaning it can hold both positive and negative integer values.
  • Example: -1, 0, 42

uint (Unsigned Integer):

  • uint stands for "unsigned integer," and it represents only non-negative integers (greater than or equal to zero).
  • Unsigned integers do not allocate any bits to represent negative numbers, so the entire range is available for non-negative values: with N bits they cover 0 to 2^N - 1, so uint8 ranges from 0 to 255.
  • In Pandas, you can use data types like uint8, uint16, uint32, and uint64 to represent unsigned integers of various sizes.
  • Example: 0, 1, 255
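
The ranges described above can be checked directly. This short sketch uses NumPy's iinfo helper, since the standard Pandas integer types are NumPy types; the selection of types shown is just an example.

```python
import numpy as np

# Print the smallest and largest value each integer type can hold
for name in ("int8", "uint8", "int16", "uint16"):
    info = np.iinfo(name)
    print(f"{name}: {info.min} to {info.max}")

# int8: -128 to 127
# uint8: 0 to 255
# int16: -32768 to 32767
# uint16: 0 to 65535
```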

Converting Data Types

Pandas offers the .astype() method to convert the data types of columns in a DataFrame. You pass the desired data type, and the method returns a converted copy rather than modifying the data in place, so assign the result back to the column to keep it.
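
Besides converting one column at a time, .astype() also accepts a dictionary that maps column names to data types. The following sketch, which reuses a small DataFrame shaped like the one in the examples below, converts two columns in a single call.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": ["7", "8", "9"]})

# Map each column to its target data type; unlisted columns are left alone
converted = df.astype({"B": "float64", "C": "int64"})
print(converted.dtypes)

# astype() returned a new DataFrame, so the original still holds strings in 'C'
print(df["C"].tolist())         # ['7', '8', '9']
print(converted["C"].tolist())  # [7, 8, 9]
```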

To Integer

# Importing the Pandas library
import pandas as pd
 
# Creating a sample DataFrame
data = {'A': [1, 2, 3],
        'B': [4, 5, 6],
        'C': ['7', '8', '9']}
 
df = pd.DataFrame(data)
 
# Printing the data type of column 'C' before conversion
print(f"Column C data type is {df['C'].dtypes}")
 
# Converting column 'C' to integer data type
df['C'] = df['C'].astype('int')
 
# Printing the data type of column 'C' after conversion
print(f"Column C data type is {df['C'].dtypes}")

Output:

Column C data type is object
Column C data type is int64

In this example, we have a DataFrame with three columns: 'A', 'B', and 'C'. We use the .astype('int') method to parse the string values in column 'C' as integers. As a result, the data type of column 'C' changes from object to int64.
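
The conversion above works because every string in column 'C' is a valid integer. When a column contains values that cannot be parsed, .astype() raises an error. One common alternative is pd.to_numeric with errors='coerce', sketched below with a made-up Series named raw that is stored as object for illustration.

```python
import pandas as pd

# An object column where one entry is not a number
raw = pd.Series(["7", "8", "n/a"], dtype="object")

# astype() refuses to convert when any value cannot be parsed
failed = False
try:
    raw.astype("int64")
except ValueError as exc:
    failed = True
    print(f"astype failed: {exc}")

# to_numeric with errors='coerce' turns unparseable entries into NaN instead
cleaned = pd.to_numeric(raw, errors="coerce")
print(cleaned.isna().tolist())  # [False, False, True]

# A nullable integer type keeps the parsed values as integers alongside the gap
nullable = cleaned.astype("Int64")
print(nullable.dtype)  # Int64
```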

To Float

# Printing the data type of column 'B' before conversion
print(f"Column B data type is {df['B'].dtypes}")
 
# Converting column 'B' to float data type
df['B'] = df['B'].astype('float')
 
# Printing the data type of column 'B' after conversion
print(f"Column B data type is {df['B'].dtypes}")

Output:

Column B data type is int64
Column B data type is float64

In this example, we continue with the same DataFrame. Now, we use the .astype('float') method to convert the data type of column 'B' to floating-point numbers (float64). The existing values become 4.0, 5.0, and 6.0, and the 'B' column can now store decimal values.
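
Going in the opposite direction, from float to integer, deserves a little care because .astype() drops the fractional part instead of rounding. The sketch below uses an illustrative Series named prices to show the difference.

```python
import pandas as pd

prices = pd.Series([1.9, -1.9, 2.4])

# astype() truncates toward zero: the fractional part is simply discarded
truncated = prices.astype("int64")
print(truncated.tolist())  # [1, -1, 2]

# Rounding first gives the nearest integer instead
rounded = prices.round().astype("int64")
print(rounded.tolist())  # [2, -2, 2]
```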

To String

# Printing the data type of column 'A' before conversion
print(f"Column A data type is {df['A'].dtypes}")
 
# Converting column 'A' to string data type
df['A'] = df['A'].astype('string')
 
# Printing the data type of column 'A' after conversion
print(f"Column A data type is {df['A'].dtypes}")

Output:

Column A data type is int64
Column A data type is string
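
Once a column holds strings, the .str accessor becomes available for text operations. As a small sketch (the Series names ids and maybe_ids are illustrative), the code below pads numbers with leading zeros and shows that a missing value stays missing after conversion rather than turning into text.

```python
import pandas as pd

ids = pd.Series([1, 2, 30])

# Convert to the dedicated string type, then pad to three characters
padded = ids.astype("string").str.zfill(3)
print(padded.tolist())  # ['001', '002', '030']

# Missing values are preserved by the 'string' type
maybe_ids = pd.Series([1, None], dtype="Int64").astype("string")
print(maybe_ids.isna().tolist())  # [False, True]
```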

To Dates

# Creating a sample DataFrame with date strings
data = {'date': ['2023-01-01', '2023-02-01', '2023-03-01']}
df = pd.DataFrame(data)
 
# Printing the data type of 'date' column before conversion
print(f"Data type of 'date' was {df['date'].dtypes}")
 
# Converting 'date' column to datetime data type
# (Pandas 2.0 and later require an explicit unit such as [ns])
df['date'] = df['date'].astype('datetime64[ns]')
 
# Printing the data type of 'date' column after conversion
print(f"Data type of 'date' is now {df['date'].dtypes}")

Output:

Data type of 'date' was object
Data type of 'date' is now datetime64[ns]
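
For date strings that may be malformed or follow a specific layout, pd.to_datetime is often a more forgiving choice than .astype(). The following is a minimal sketch, assuming ISO-style dates and one deliberately invalid entry in an illustrative Series named raw.

```python
import pandas as pd

raw = pd.Series(["2023-01-01", "2023-02-01", "not a date"])

# format pins down the expected layout; errors='coerce' maps failures to NaT
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")
print(parsed.isna().tolist())  # [False, False, True]

# Datetime columns expose date parts through the .dt accessor
print(parsed.dropna().dt.month.tolist())  # [1, 2]
```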

To Categorical Data

# Creating a sample DataFrame with categorical data
data = {'Category': ['A', 'B', 'A', 'C']}
df = pd.DataFrame(data)
 
# Printing the data type of 'Category' column before conversion
print(f"Data type of 'Category' was {df['Category'].dtypes}")
 
# Converting 'Category' column to categorical data type
df['Category'] = df['Category'].astype('category')
 
# Printing the data type of 'Category' column after conversion
print(f"Data type of 'Category' is now {df['Category'].dtypes}")

Output:

Data type of 'Category' was object
Data type of 'Category' is now category
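
Categories can also carry an order, which .astype('category') alone does not set. The sketch below builds an ordered type with pd.CategoricalDtype; the size labels and the names size_type and sizes are invented for illustration.

```python
import pandas as pd

# Define the allowed categories and their order explicitly
size_type = pd.CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
sizes = pd.Series(["medium", "small", "large", "small"]).astype(size_type)

# Each value is stored as a small integer code pointing into the category list
print(sizes.cat.codes.tolist())  # [1, 0, 2, 0]

# Because the categories are ordered, comparisons and max() are meaningful
print(sizes.max())                  # large
print((sizes > "small").tolist())   # [True, False, True, False]
```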

To Boolean Data

# Creating a sample DataFrame with 0 and 1 values
data = {'Flag': [0, 1, 1, 0]}
df = pd.DataFrame(data)
 
# Printing the data type of 'Flag' column before conversion
print(f"Data type of 'Flag' was {df['Flag'].dtypes}")
 
# Converting 'Flag' column to boolean data type
df['Flag'] = df['Flag'].astype(bool)
 
# Printing the data type of 'Flag' column after conversion
print(f"Data type of 'Flag' is now {df['Flag'].dtypes}")

Output:

Data type of 'Flag' was int64
Data type of 'Flag' is now bool
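
Boolean conversion follows Python's truthiness rules, which can surprise you with values other than 0 and 1. This sketch, using illustrative Series named counts and labels, shows two cases worth knowing, plus an explicit mapping as one possible workaround for text labels.

```python
import pandas as pd

# Any non-zero number becomes True, including negative ones
counts = pd.Series([0, 1, 2, -1])
print(counts.astype(bool).tolist())  # [False, True, True, True]

# Any non-empty string is truthy, so the text "False" becomes True
labels = pd.Series(["True", "False"], dtype="object")
print(labels.astype(bool).tolist())  # [True, True]

# Mapping the labels explicitly gives the intended result
flags = labels.map({"True": True, "False": False})
print(flags.tolist())  # [True, False]
```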

To Nullable Integer

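Standard integer data types (int8, int16, int32, int64) cannot store missing values. When an integer column contains a missing value, Pandas falls back to float64 and represents the gap as NaN, which can be problematic because your integers silently become floating-point numbers.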

The Nullable Integer data type was introduced to solve this issue: it lets integer columns hold missing values (shown as <NA>) by tracking the missing entries separately from the integer values, so the remaining values stay integers. It's available as Int8, Int16, Int32, and Int64 data types. The capitalization of the "I" in these data types (Int8, Int16, etc.) indicates that they are Nullable Integer data types, whereas lowercase (int8, int16, etc.) indicates the standard integer data types.

# Creating a sample DataFrame with an integer column and missing values
data = {'Integer_Column': [1, 2, 3, None]}
df = pd.DataFrame(data)
 
# Printing the data type of 'Integer_Column' before conversion
print(df['Integer_Column'].dtypes)
 
# Converting 'Integer_Column' to nullable integer data type (Int64)
df['Integer_Column'] = df['Integer_Column'].astype('Int64')
 
# Printing the data type of 'Integer_Column' after conversion
print(df['Integer_Column'].dtypes)

Output:

float64
Int64
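
A nullable integer column can also be created directly, skipping the float64 detour. The sketch below, with an illustrative Series named scores, shows that missing entries are skipped by sum() and carried through arithmetic.

```python
import pandas as pd

# Passing dtype='Int64' at creation keeps the values as integers from the start
scores = pd.Series([1, 2, 3, None], dtype="Int64")
print(scores.isna().tolist())  # [False, False, False, True]

# Aggregations skip missing values by default
print(scores.sum())  # 6

# Arithmetic keeps the nullable type, and the missing entry stays missing
bumped = scores + 1
print(bumped.dtype)  # Int64
print(bumped.dropna().tolist())  # [2, 3, 4]
```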

These examples showcase how the .astype() method can be used to convert columns to different data types in a Pandas DataFrame, enabling you to manipulate and analyze your data more effectively.