Pandas Data Structures
1. Series
A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). It is comparable to a single column in an Excel spreadsheet. The axis labels are collectively referred to as the index.
Creating a Series
You can convert a list, NumPy array, or dictionary to a Series.
From a List
import numpy as np
import pandas as pd
# Create a pandas Series from a list (np.nan marks a missing value)
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)
Output:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
In this example, no index was passed, so pandas assigned a default integer index ranging from 0 to len(data) - 1, i.e., 0 to 5. The dtype is float64 rather than an integer type because np.nan is a floating-point value.
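The default index can be inspected directly. The following is a minimal sketch, using a small made-up list, that shows the index pandas generates when none is supplied:

```python
import pandas as pd

# No index argument is given, so pandas builds a default one.
s = pd.Series([10, 20, 30])

print(s.index)        # a RangeIndex covering positions 0..2
print(list(s.index))  # [0, 1, 2]
```

The default index is a RangeIndex, which stores only start, stop, and step rather than every label.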
From a NumPy Array
import numpy as np
# Create a numpy array
numpy_array = np.array([10, 20, 30, 40, 50])
# Create a pandas series from numpy array
series = pd.Series(numpy_array)
print(series)
Output:
0 10
1 20
2 30
3 40
4 50
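dtype: int64
The exact integer dtype depends on the platform and NumPy version; older NumPy builds on Windows report int32 instead.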
Creating a Series with Index
You can also specify an index explicitly:
# explicitly setting the index using index argument
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=['A', 'B', 'C', 'D', 'E', 'F'])
print(s)
Output:
A 1.0
B 3.0
C 5.0
D NaN
E 6.0
F 8.0
dtype: float64
Creating a Series from a Dictionary
When the data is a dict and an index is not passed, the dict's keys become the Series index. In current pandas versions the keys keep their insertion order:
# key-value pairs
data = {'a' : 0., 'b' : 1., 'c' : 2.}
# keys become the index labels and values become the Series data
s = pd.Series(data)
print(s)
Output:
a 0.0
b 1.0
c 2.0
dtype: float64
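If an index is passed together with a dict, pandas selects the values by label instead. As a small sketch, reusing the dict above with an illustrative index list that contains one label ('d') missing from the dict:

```python
import pandas as pd

data = {'a': 0., 'b': 1., 'c': 2.}

# The index controls both the order and which labels appear.
# 'd' is not a key in data, so its value is NaN.
s = pd.Series(data, index=['b', 'c', 'd', 'a'])
print(s)
```

Labels in the index that have no matching key are filled with NaN, and the result follows the order of the index rather than the order of the dict.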
Accessing Data from Series
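Data in a Series can be accessed much like data in an ndarray, by integer or by slice. With the default integer index, s[0] looks up the label 0, which coincides with the first position: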
s = pd.Series([1,3,5,np.nan,6,8])
print(s[0])
print(s[:3])
print(s[-3:])
Output:
1.0
0 1.0
1 3.0
2 5.0
dtype: float64
3 NaN
4 6.0
5 8.0
dtype: float64
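When the index is not the default integer range, it helps to be explicit about whether a lookup is by label or by position. A minimal sketch, assuming a small Series with string labels chosen for illustration:

```python
import pandas as pd

s = pd.Series([1.0, 3.0, 5.0], index=['A', 'B', 'C'])

value_by_label = s.loc['B']      # label-based lookup
first_by_position = s.iloc[0]    # position-based lookup

# Label slices include both endpoints; positional slices exclude the stop.
label_slice = s.loc['A':'B']     # rows 'A' and 'B'
position_slice = s.iloc[0:1]     # row 'A' only

print(value_by_label, first_by_position)
print(label_slice)
print(position_slice)
```

Using .loc and .iloc avoids ambiguity between labels and positions, which matters most when the labels themselves are integers.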
In conclusion, a Series is a one-dimensional data structure in pandas that can be used to store any data type. It is one of the simplest, yet most useful data structures in pandas.
2. DataFrame
Understanding and working with pandas DataFrames is essential for all forms of data analysis and manipulation. A DataFrame is a two-dimensional labeled data structure that resembles a table in a relational database or an Excel spreadsheet.
Creating a DataFrame
You can create a DataFrame from a variety of sources, such as lists, dicts, Series, and even another DataFrame.
From a List
Here is a simple example of creating a DataFrame from a list:
# dataframe from list
data = [1, 2, 3, 4, 5]
df = pd.DataFrame(data)
print(df)
Output:
0
0 1
1 2
2 3
3 4
4 5
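A flat list yields a single column labeled 0. A list of lists produces one row per inner list, and the columns argument names the columns. A short sketch with made-up rows:

```python
import pandas as pd

# Each inner list is one row; columns supplies the column labels.
rows = [['Tom', 28], ['Jack', 34], ['Steve', 29]]
df = pd.DataFrame(rows, columns=['Name', 'Age'])
print(df)
```

Without the columns argument, the columns would be labeled 0 and 1.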
From a Dict
The dict keys become the column labels, and each list supplies the values of one column.
# Creating a DataFrame from a dictionary of lists
data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
df = pd.DataFrame(data)
# Printing the created DataFrame
print(df)
Output:
Name Age
0 Tom 28
1 Jack 34
2 Steve 29
3 Ricky 42
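A dict of Series works the same way, with one difference: the Series are aligned on their index labels. The sketch below uses two illustrative Series ('one' and 'two' are placeholder column names) whose indexes only partly overlap:

```python
import pandas as pd

data = {
    'one': pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c']),
    'two': pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd']),
}

# The row index is the union of both Series indexes.
# Column 'one' has no value for label 'd', so that cell is NaN.
df = pd.DataFrame(data)
print(df)
```

This alignment means the Series do not need to have the same length or the same labels.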
Accessing Data from DataFrame
Column Selection
We can select any column in the DataFrame using its label:
# Selecting and printing the 'Name' column from the DataFrame
print(df['Name'])
Output:
0 Tom
1 Jack
2 Steve
3 Ricky
Name: Name, dtype: object
Row Selection
We can select any row in the DataFrame using its label:
# Select and print the first row by its index label
print(df.loc[0])
Output:
Name Tom
Age 28
Name: 0, dtype: object
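With the default index, the label 0 happens to match the first position. The distinction becomes visible with custom row labels. A minimal sketch, assuming hypothetical row labels 'r1' to 'r3':

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Tom', 'Jack', 'Steve'], 'Age': [28, 34, 29]},
    index=['r1', 'r2', 'r3'],  # hypothetical row labels
)

row_by_label = df.loc['r2']      # select a row by its label
row_by_position = df.iloc[1]     # select the same row by its position
age = df.loc['r2', 'Age']        # select a single cell: row label, column label

print(row_by_label)
print(age)
```

Both row selections return the same Series, whose index consists of the column labels.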
Indexing techniques will be discussed in detail later.
3. Multi-index DataFrame
A Multi-Index DataFrame is a DataFrame whose rows or columns carry multiple levels of labels. It is useful when you need to represent higher-dimensional data in a two-dimensional table.
Creating a Multi-Index DataFrame
Creating a multi-index DataFrame is quite simple. Here is an example:
# MultiIndex DataFrame
index = pd.MultiIndex.from_tuples([(i,j) for i in ['A','B','C'] for j in ['x', 'y', 'z']])
df = pd.DataFrame({'Data': range(9)}, index=index)
print(df)
Output:
     Data
A x     0
  y     1
  z     2
B x     3
  y     4
  z     5
C x     6
  y     7
  z     8
In this example, we have two levels of index - the first level is 'A', 'B', 'C', and the second level is 'x', 'y', 'z'.
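The same index can also be built from the two label lists directly, and its levels can be inspected. This is a sketch using pd.MultiIndex.from_product; the level names 'outer' and 'inner' are illustrative and not part of the example above:

```python
import pandas as pd

# from_product builds every combination of the two label lists.
index = pd.MultiIndex.from_product(
    [['A', 'B', 'C'], ['x', 'y', 'z']],
    names=['outer', 'inner'],  # hypothetical level names
)
df = pd.DataFrame({'Data': range(9)}, index=index)

print(index.nlevels)          # number of index levels
print(list(index.levels[0]))  # labels of the first level
print(list(index.levels[1]))  # labels of the second level
```

Naming the levels is optional, but it makes later selections by level easier to read.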
Accessing Data in Multi-Index DataFrame
You can access data in a multi-index DataFrame in a similar way to a normal DataFrame:
# Accessing data
print(df.loc['A'])
print(df.loc[('B', 'y'), 'Data'])  # the tuple selects the row, 'Data' selects the column
Output:
Data
x 0
y 1
z 2
4
- In the first print statement, we are accessing all data under index 'A'.
- In the second print statement, we are accessing the data under index 'B' and sub-index 'y'.
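Selecting by the second level alone is a common follow-up. One way to do it, sketched here on the same DataFrame, is the xs method, which takes a cross-section at a given level:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(i, j) for i in ['A', 'B', 'C'] for j in ['x', 'y', 'z']]
)
df = pd.DataFrame({'Data': range(9)}, index=index)

# All rows whose second-level label is 'y'; that level is dropped from the result.
inner_y = df.xs('y', level=1)
print(inner_y)

# A full tuple key selects exactly one row, returned as a Series.
row = df.loc[('B', 'y')]
print(row['Data'])
```

The result of xs is indexed by the remaining first-level labels 'A', 'B', and 'C'.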
4. Index Object
Pandas Index is an immutable ndarray implementing an ordered, sliceable set. The basic object storing axis labels for all pandas objects is Index.
Indexes are used for three main purposes:
- Identifying data (i.e., providing metadata) using known indicators, important for analysis, visualization, and interactive console display.
- Enabling automatic and explicit data alignment.
- Allowing intuitive getting and setting of subsets of the data set.
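Automatic alignment is easiest to see in arithmetic between two Series. In this minimal sketch, the two Series are made up and their indexes only partly overlap:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Values are matched by label, not by position.
# Labels present in only one Series produce NaN.
total = s1 + s2
print(total)

# fill_value treats a missing label as 0 instead of producing NaN.
total_filled = s1.add(s2, fill_value=0)
print(total_filled)
```

The result index is the union of both indexes, so no data is silently dropped.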
Creating an Index
To create an Index, you can use the pd.Index() constructor and pass in a list of index labels:
# creating index object
index = pd.Index(['a', 'b', 'c', 'd', 'e'])
print(index)
Output:
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
Index as Immutable Array
Index objects are immutable and thus can't be modified by the user:
# index values are immutable
index = pd.Index(['a', 'b', 'c', 'd', 'e'])
index[1] = 'z' # This will raise a TypeError
Output:
TypeError: Index does not support mutable operations
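Because an Index cannot be changed in place, a modified index has to be built as a new object. One possible approach, shown here as a sketch rather than the only option, is to copy the labels into a list, edit the list, and construct a new Index:

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c', 'd', 'e'])

try:
    index[1] = 'z'  # item assignment is not supported
except TypeError as exc:
    print(f"TypeError: {exc}")

# Build a new Index from an edited copy of the labels.
labels = index.tolist()
labels[1] = 'z'
new_index = pd.Index(labels)

print(new_index)
print(index)  # the original Index is unchanged
```

Immutability makes it safe to share one Index between several Series and DataFrames.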
Index as Ordered Set
Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic. An Index therefore supports set operations such as intersection, union, and symmetric difference:
# creating two index objects
index1 = pd.Index(['a', 'b', 'c', 'd', 'e'])
index2 = pd.Index(['c', 'd', 'e', 'f', 'g'])
print(index1.intersection(index2)) # Intersection
print(index1.union(index2)) # Union
print(index1.symmetric_difference(index2)) # Symmetric difference
Output:
Index(['c', 'd', 'e'], dtype='object')
Index(['a', 'b', 'c', 'd', 'e', 'f', 'g'], dtype='object')
Index(['a', 'b', 'f', 'g'], dtype='object')
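Two related operations round out the picture: difference returns the labels found only in the first Index, and isin tests membership element by element. A short sketch on the same two Index objects:

```python
import pandas as pd

index1 = pd.Index(['a', 'b', 'c', 'd', 'e'])
index2 = pd.Index(['c', 'd', 'e', 'f', 'g'])

only_in_first = index1.difference(index2)  # labels in index1 but not in index2
membership = index1.isin(index2)           # boolean array, one entry per label

print(only_in_first)
print(membership)
print('c' in index1)  # plain membership test for a single label
```

Unlike symmetric_difference, difference is not symmetric: swapping the two Index objects gives the labels found only in index2.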
Understanding the axis parameter
In pandas, many functions and methods take an axis parameter, which specifies the direction of the operation:
- axis=0: The operation moves vertically, down the rows. A reduction such as sum therefore produces one result per column.
- axis=1: The operation moves horizontally, across the columns. A reduction therefore produces one result per row.
The best way to remember this is that the axis parameter names the dimension that will be collapsed. With axis=0, the rows are collapsed (one value remains for each column); with axis=1, the columns are collapsed (one value remains for each row).
Here are some examples:
import pandas as pd
df = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6],
'C': [7, 8, 9]
})
print(df)
This will output:
A B C
0 1 4 7
1 2 5 8
2 3 6 9
Now, if you want to calculate the sum of each column, you can use axis=0:
print(df.sum(axis=0))
This will output:
A 6
B 15
C 24
dtype: int64
If you want to calculate the sum of each row, you can use axis=1:
print(df.sum(axis=1))
This will output:
0 12
1 15
2 18
dtype: int64
So, the axis parameter specifies the axis along which a method or function is applied.
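The same parameter appears on methods that are not reductions. In drop, for example, axis says which set of labels to search rather than which dimension to collapse. A brief sketch on the same DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9],
})

col_means = df.mean(axis=0)  # one mean per column
row_means = df.mean(axis=1)  # one mean per row

# For drop, axis=1 means that 'B' is a column label.
without_b = df.drop('B', axis=1)

print(col_means)
print(row_means)
print(without_b)
```

For methods like drop, it can be clearer to use the columns keyword, as in df.drop(columns='B'), which avoids the axis argument entirely.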