Pandas
Getting Started
Data Loading and Inspection

Data Loading and Inspection in Pandas

Data Loading from External Sources

Pandas provides a variety of functions to read data from different types of files or databases. In this post, we will discuss how to read data from CSV files, Excel files, and SQL databases.

Reading Data from CSV

CSV (Comma Separated Values) is the most common file format that is widely supported by many platforms and applications. You can use the read_csv() function to read the data from a CSV file.

Assume we have a CSV file named data.csv with the following content:

Name,Age,Salary
Alice,25,50000
Bob,30,60000
Charlie,22,45000

Let's read this CSV file using Pandas:

import pandas as pd
 
data_csv = pd.read_csv('file.csv')
print(data_csv)

Output:

      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   22   45000

Reading Data from Excel

Excel files can be read using the read_excel() function. You can specify the sheet name or number using the sheet_name parameter.

# Reading data from Excel spreadsheet
data_excel = pd.read_excel('file.xlsx', sheet_name='Sheet1')
print(data_excel)

This code will read the data from the Sheet1 of the file.xlsx Excel file. The output will be exactly the same as in the section "Reading data from CSV"

Reading Data from SQL

To read data from a SQL database, you can use the read_sql() function. You need to establish a connection to the database using the create_engine() function from the sqlalchemy library.

from sqlalchemy import create_engine
 
# Establishing a connection to the database
engine = create_engine('sqlite:///database.db')
 
# Reading data from SQL database
data = pd.read_sql('SELECT * FROM table_name', engine)
 
print(data)

This code will create a connection to the database.db SQLite database, execute the SQL query SELECT * FROM table_name, and print the data.

Here is another example, If you have a SQLite database named data.db with a table named employees. Let's read data from this table using Pandas:

import sqlite3
 
# Establishing a connection to the database
conn = sqlite3.connect('data.db')
 
# Reading data from SQL database
query = 'SELECT * FROM employees'
df_sql = pd.read_sql(query, conn)
conn.close()
 
print(df_sql)

In conclusion, pandas provides a powerful and flexible set of functions to load data from various sources into a DataFrame. Whether it's a CSV file, an Excel file, or a SQL database, you can easily import your data for further processing and analysis.

Getting Basic DataFrame Information

When working with a large DataFrame, it's useful to get a quick overview of its content. Pandas provides several methods that help you understand the basic information about your DataFrame structure.

The info() method

The info() method provides a summary of the DataFrame including the number of non-null entries in each column:

# Creating a Simple DataFrame
data = {
    'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
    'age': [42, 52, 36, 24, 73], 
    'preTestScore': [4, 24, 31, 2, 3],
    'postTestScore': [25, 94, 57, 62, 70]
}
df = pd.DataFrame(data)
 
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   name           5 non-null      object
 1   age            5 non-null      int64 
 2   preTestScore   5 non-null      int64 
 3   postTestScore  5 non-null      int64 
dtypes: int64(3), object(1)
memory usage: 288.0+ bytes

The head() and tail() methods

head() returns the first n rows (default is 5). This function is great when you want to test if your DataFrame has the right type of data in it.

df.head()

tail() returns the last n rows (default is 5). It is useful to check the end of your DataFrame.

df.tail()

The describe() method

The describe() method in Pandas is used to generate summary statistics of the numeric columns in a DataFrame. It provides information like mean, standard deviation, minimum, maximum, quartiles, and more.

df.describe()

The output will look like this:

             age  preTestScore  postTestScore
count   5.000000      5.000000       5.000000
mean   45.400000     12.800000      61.600000
std    18.460769     13.663821      24.905823
min    24.000000      2.000000      25.000000
25%    36.000000      3.000000      57.000000
50%    42.000000      4.000000      62.000000
75%    52.000000     24.000000      70.000000
max    73.000000     31.000000      94.000000

shape and dtypes attributes

  • The .shape attribute provides the dimensions of the DataFrame (rows, columns).
# Getting the shape of the DataFrame
print(df.shape)
# Output: (5, 4)

The .dtypes attribute returns the data types of each column.

# Getting data types of columns
print(df.dtypes) 

Output:

name          object
age            int64
preTestScore   int64
postTestScore  int64
dtype: object