Pandas Data Structures

1. Series

A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). It is comparable to a single column in an Excel spreadsheet. The axis labels are collectively referred to as the index.

Creating a Series

You can convert a list, NumPy array, or dictionary to a Series.

From a List

import numpy as np
import pandas as pd
 
# Create a pandas series from list
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)

Output:

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

In this example, no index was passed, so pandas assigned the default integer index running from 0 to len(data) - 1, i.e., 0 to 5. The dtype is float64 because the list contains np.nan, which is a floating-point value.

From a NumPy Array

import numpy as np
 
# Create a numpy array
numpy_array = np.array([10, 20, 30, 40, 50])
 
# Create a pandas series from numpy array
series = pd.Series(numpy_array)
 
print(series)

Output:

0    10
1    20
2    30
3    40
4    50
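dtype: int64

The integer dtype is inherited from the NumPy array. It is int64 on most platforms; older NumPy builds on Windows default to int32.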

Creating a Series with Index

You can also specify an index explicitly:

# explicitly setting the index using index argument
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=['A', 'B', 'C', 'D', 'E', 'F'])
 
print(s)

Output:

A    1.0
B    3.0
C    5.0
D    NaN
E    6.0
F    8.0
dtype: float64

Creating a Series from a Dictionary

When the data is a dict and no index is passed, the dict's keys become the Series index, in the dict's insertion order (Python 3.6+ with pandas 0.23+; older versions sorted the keys):

# key-value pairs
data = {'a' : 0., 'b' : 1., 'c' : 2.}
 
# key will serve as index and values as series
s = pd.Series(data)
 
print(s)

Output:

a    0.0
b    1.0
c    2.0
dtype: float64
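As a minimal sketch of how an explicit index interacts with dict data: pandas matches each index label against the dict keys, keeps the order given in index, and fills any label without a matching key with NaN. The label 'd' below is an illustrative addition that has no key in the dict.

```python
import pandas as pd

data = {'a': 0., 'b': 1., 'c': 2.}

# Labels are looked up in the dict; 'd' has no key, so its value is NaN
s = pd.Series(data, index=['b', 'c', 'd', 'a'])
print(s)
```

The result follows the order of the index argument rather than the order of the dict.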

Accessing Data from Series

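Data in a Series can be accessed in much the same way as data in an ndarray. With the default integer index, s[0] looks up the label 0, which coincides with the first position: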

s = pd.Series([1,3,5,np.nan,6,8])
print(s[0])
print(s[:3])
print(s[-3:])

Output:

1.0
0    1.0
1    3.0
2    5.0
dtype: float64
3    NaN
4    6.0
5    8.0
dtype: float64
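When the index is not the default integer range, it helps to say explicitly whether a lookup is by label or by position. The sketch below, which assumes the string-labeled Series from the earlier example, uses .loc for labels and .iloc for positions; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8], index=['A', 'B', 'C', 'D', 'E', 'F'])

first_by_position = s.iloc[0]    # element at position 0
value_by_label = s.loc['C']      # element with label 'C'
label_slice = s.loc['B':'D']     # label slices include both endpoints
position_slice = s.iloc[1:3]     # position slices exclude the end position

print(first_by_position, value_by_label)
print(label_slice)
print(position_slice)
```

Note the difference between the two slices: the label slice returns B, C, and D, while the position slice returns only B and C.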

In conclusion, a Series is a one-dimensional data structure in pandas that can be used to store any data type. It is one of the simplest, yet most useful data structures in pandas.

2. DataFrame

Understanding and working with the pandas DataFrame is essential for all forms of data analysis and manipulation. It is a two-dimensional labeled data structure, with rows and columns, that resembles a table in a relational database or an Excel spreadsheet.

Creating a DataFrame

You can create a DataFrame from a variety of sources, such as lists, dicts, Series, and even another DataFrame.

From a List

Here is a simple example of creating a DataFrame from a list:

# dataframe from list
data = [1, 2, 3, 4, 5]
df = pd.DataFrame(data)
 
print(df)

Output:

   0
0  1
1  2
2  3
3  4
4  5

From Dict

The dict keys become the column labels.

# Creating a DataFrame from a dictionary of lists
data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
df = pd.DataFrame(data)
 
# Printing the created DataFrame
print(df)

Output:

    Name  Age
0    Tom   28
1   Jack   34
2  Steve   29
3  Ricky   42

Accessing Data from DataFrame

Column Selection

We can select any column in the DataFrame using its label:

# Selecting and printing the 'Name' column from the DataFrame
print(df['Name'])

Output:

0      Tom
1     Jack
2    Steve
3    Ricky
Name: Name, dtype: object

Row Selection

We can select any row in the DataFrame using its label:

# Selecting and printing the first row by its index label
print(df.loc[0])

Output:

Name    Tom
Age      28
Name: 0, dtype: object
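With the default index, the label 0 and the position 0 happen to be the same row. The sketch below assumes hypothetical row labels 'r1' to 'r4' to show that .loc works on labels while .iloc works on positions, and that .loc also accepts a row label and a column label together to fetch a single cell.

```python
import pandas as pd

data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
df = pd.DataFrame(data, index=['r1', 'r2', 'r3', 'r4'])  # hypothetical row labels

row_by_label = df.loc['r2']        # row whose label is 'r2'
row_by_position = df.iloc[1]       # second row, regardless of its label
jack_age = df.loc['r2', 'Age']     # single cell: row label, then column label

print(row_by_label)
print(jack_age)
```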

Indexing techniques will be discussed in detail later.

3. Multi-index DataFrame

A Multi-Index DataFrame is a DataFrame whose rows or columns are labeled by a pandas MultiIndex, which attaches several levels of labels to each row or column. It is a useful tool for working with higher-dimensional data in a two-dimensional table.

Creating a Multi-Index DataFrame

Creating a multi-index DataFrame is quite simple. Here is an example:

# MultiIndex DataFrame
index = pd.MultiIndex.from_tuples([(i,j) for i in ['A','B','C'] for j in ['x', 'y', 'z']])
 
df = pd.DataFrame({'Data': range(9)}, index=index)
 
print(df)

Output:

     Data
A x     0
  y     1
  z     2
B x     3
  y     4
  z     5
C x     6
  y     7
  z     8

In this example, we have two levels of index - the first level is 'A', 'B', 'C', and the second level is 'x', 'y', 'z'.
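The same index can be built from the Cartesian product of the two label lists, and the levels can be given names. This is a sketch using pd.MultiIndex.from_product; the level names 'outer' and 'inner' are assumptions chosen for illustration.

```python
import pandas as pd

# Every combination of the first list with the second, in order
index = pd.MultiIndex.from_product(
    [['A', 'B', 'C'], ['x', 'y', 'z']],
    names=['outer', 'inner'],  # hypothetical level names
)

df = pd.DataFrame({'Data': range(9)}, index=index)
print(df)
print(df.index.nlevels)
```

Naming the levels makes later operations easier to read, since most methods that take a level argument accept either the level number or its name.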

Accessing Data in Multi-Index DataFrame

You can access data in a multi-index DataFrame in a similar way to a normal DataFrame:

# Accessing data
print(df.loc['A'])
print(df.loc['B', 'y'])

Output:

   Data
x     0
y     1
z     2

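Data    4
Name: (B, y), dtype: int64

  • In the first print statement, we are accessing all data under index 'A'. The result is a DataFrame indexed by the remaining level.
  • In the second print statement, we are accessing the data under index 'B' and sub-index 'y'. The result is that single row, returned as a Series.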
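Two related lookups are often useful: selecting by the second level alone, and pulling out a single scalar. The sketch below rebuilds the same DataFrame and uses DataFrame.xs for the cross-section; the variable names are illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(i, j) for i in ['A', 'B', 'C'] for j in ['x', 'y', 'z']]
)
df = pd.DataFrame({'Data': range(9)}, index=index)

# All rows whose second-level label is 'y'; that level is dropped from the result
y_rows = df.xs('y', level=1)

# A single scalar: the full row key as a tuple, then the column label
b_y_value = df.loc[('B', 'y'), 'Data']

print(y_rows)
print(b_y_value)
```

Writing the row key as an explicit tuple avoids the ambiguity of df.loc['B', 'y'], where the second argument could in general be mistaken for a column label.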

4. Index Object

Pandas Index is an immutable ndarray implementing an ordered, sliceable set. The basic object storing axis labels for all pandas objects is Index.

Indexes are used for three main purposes:

  1. Identifying data (i.e., providing metadata) using known indicators, important for analysis, visualization, and interactive console display.
  2. Enabling automatic and explicit data alignment.
  3. Allowing intuitive getting and setting of subsets of the data set.
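Automatic alignment, the second purpose in the list above, can be illustrated with a short sketch: when two Series are added, values are paired by index label rather than by position, and labels present in only one operand yield NaN. The Series contents here are made up for illustration.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# 'b' and 'c' are matched by label; 'a' and 'd' have no partner and become NaN
total = s1 + s2
print(total)
```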

Creating an Index

To create an Index, call the pd.Index() constructor with a list of index labels:

# creating index object
index = pd.Index(['a', 'b', 'c', 'd', 'e'])
print(index)

Output:

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

Index as Immutable Array

Index objects are immutable, so their elements cannot be reassigned in place:

# index values are immutable
index = pd.Index(['a', 'b', 'c', 'd', 'e'])
index[1] = 'z'  # This will raise a TypeError

Output:

TypeError: Index does not support mutable operations
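Because an Index cannot be changed in place, the usual approach is to build a new Index and leave the original untouched. The sketch below shows one way to do this with Index.delete and Index.insert, both of which return new objects; it is one option among several, not the only idiom.

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c', 'd', 'e'])

try:
    index[1] = 'z'
except TypeError as err:
    print(f"In-place assignment failed: {err}")

# delete() and insert() each return a new Index; `index` itself is unchanged
new_index = index.delete(1).insert(1, 'z')
print(new_index)
print(index)
```

Immutability is what makes it safe to share one Index between several Series or DataFrames without one of them silently changing the labels of another.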

Index as Ordered Set

Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic.

# creating two index objects
index1 = pd.Index(['a', 'b', 'c', 'd', 'e'])
index2 = pd.Index(['c', 'd', 'e', 'f', 'g'])
 
print(index1.intersection(index2))  # Intersection
print(index1.union(index2))  # Union
print(index1.symmetric_difference(index2))  # Symmetric difference

Output:

Index(['c', 'd', 'e'], dtype='object')
Index(['a', 'b', 'c', 'd', 'e', 'f', 'g'], dtype='object')
Index(['a', 'b', 'f', 'g'], dtype='object')
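Two further set-style operations are worth a brief sketch: difference and membership testing. The example also shows one way in which an Index departs from a strict mathematical set, namely that it may contain duplicate labels. The variable names are illustrative.

```python
import pandas as pd

index1 = pd.Index(['a', 'b', 'c', 'd', 'e'])
index2 = pd.Index(['c', 'd', 'e', 'f', 'g'])

only_in_first = index1.difference(index2)  # labels in index1 but not in index2
has_c = 'c' in index1                      # membership test

# Unlike a Python set, an Index may hold duplicate labels
dupes = pd.Index(['a', 'a', 'b'])

print(only_in_first)
print(has_c)
print(dupes.is_unique)
```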

Understanding the axis Parameter

In pandas, many functions or methods include an axis parameter. The axis parameter is used to specify the direction of the operation:

  • axis=0: The operation runs vertically, down the rows. For a reduction such as sum, this produces one result per column.

  • axis=1: The operation runs horizontally, across the columns. For a reduction such as sum, this produces one result per row.

The best way to remember this is that the axis parameter names the dimension that will be collapsed. With axis=0, the rows are collapsed (giving one value for each column), and with axis=1, the columns are collapsed (giving one value for each row).

Here are some examples:

import pandas as pd
 
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
})
 
print(df)

This will output:

   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9

Now, if you want to calculate the sum of each column, you can use axis=0:

print(df.sum(axis=0))

This will output:

A     6
B    15
C    24
dtype: int64

If you want to calculate the sum of each row, you can use axis=1:

print(df.sum(axis=1))

This will output:

0    12
1    15
2    18
dtype: int64

In short, the axis parameter specifies the axis along which a method or function is applied.
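The same convention carries over to other methods. The sketch below applies mean with axis=1 and then uses drop, where the "collapse" picture does not fit as well: for drop, axis names the axis whose labels are being removed. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
})

row_means = df.mean(axis=1)        # collapses the columns: one mean per row
without_b = df.drop('B', axis=1)   # 'B' is looked up among the column labels
without_row0 = df.drop(0, axis=0)  # 0 is looked up among the row labels

print(row_means)
print(without_b)
print(without_row0)
```

Many methods also accept the string aliases axis='index' and axis='columns' in place of 0 and 1, which some readers find easier to follow.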