Data Loading and Inspection in Pandas
Data Loading from External Sources
Pandas provides a variety of functions to read data from different types of files or databases. In this post, we will discuss how to read data from CSV files, Excel files, and SQL databases.
Reading Data from CSV
CSV (Comma Separated Values) is the most common file format that is widely supported by many platforms and applications. You can use the read_csv()
function to read the data from a CSV file.
Assume we have a CSV file named data.csv
with the following content:
Name,Age,Salary
Alice,25,50000
Bob,30,60000
Charlie,22,45000
Let's read this CSV file using Pandas:
import pandas as pd
data_csv = pd.read_csv('file.csv')
print(data_csv)
Output:
Name Age Salary
0 Alice 25 50000
1 Bob 30 60000
2 Charlie 22 45000
Reading Data from Excel
Excel files can be read using the read_excel()
function. You can specify the sheet name or number using the sheet_name
parameter.
# Reading data from Excel spreadsheet
data_excel = pd.read_excel('file.xlsx', sheet_name='Sheet1')
print(data_excel)
This code will read the data from the Sheet1
of the file.xlsx
Excel file. The output will be exactly the same as in the section "Reading data from CSV"
Reading Data from SQL
To read data from a SQL database, you can use the read_sql()
function. You need to establish a connection to the database using the create_engine()
function from the sqlalchemy
library.
from sqlalchemy import create_engine
# Establishing a connection to the database
engine = create_engine('sqlite:///database.db')
# Reading data from SQL database
data = pd.read_sql('SELECT * FROM table_name', engine)
print(data)
This code will create a connection to the database.db
SQLite database, execute the SQL query SELECT * FROM table_name
, and print the data.
Here is another example, If you have a SQLite database named data.db
with a table named employees
. Let's read data from this table using Pandas:
import sqlite3
# Establishing a connection to the database
conn = sqlite3.connect('data.db')
# Reading data from SQL database
query = 'SELECT * FROM employees'
df_sql = pd.read_sql(query, conn)
conn.close()
print(df_sql)
In conclusion, pandas provides a powerful and flexible set of functions to load data from various sources into a DataFrame. Whether it's a CSV file, an Excel file, or a SQL database, you can easily import your data for further processing and analysis.
Getting Basic DataFrame Information
When working with a large DataFrame, it's useful to get a quick overview of its content. Pandas provides several methods that help you understand the basic information about your DataFrame structure.
The info()
method
The info()
method provides a summary of the DataFrame including the number of non-null entries in each column:
# Creating a Simple DataFrame
data = {
'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'age': [42, 52, 36, 24, 73],
'preTestScore': [4, 24, 31, 2, 3],
'postTestScore': [25, 94, 57, 62, 70]
}
df = pd.DataFrame(data)
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 5 non-null object
1 age 5 non-null int64
2 preTestScore 5 non-null int64
3 postTestScore 5 non-null int64
dtypes: int64(3), object(1)
memory usage: 288.0+ bytes
The head()
and tail()
methods
head()
returns the first n rows (default is 5). This function is great when you want to test if your DataFrame has the right type of data in it.
df.head()
tail()
returns the last n rows (default is 5). It is useful to check the end of your DataFrame.
df.tail()
The describe()
method
The describe()
method in Pandas is used to generate summary statistics of the numeric columns in a DataFrame. It provides information like mean, standard deviation, minimum, maximum, quartiles, and more.
df.describe()
The output will look like this:
age preTestScore postTestScore
count 5.000000 5.000000 5.000000
mean 45.400000 12.800000 61.600000
std 18.460769 13.663821 24.905823
min 24.000000 2.000000 25.000000
25% 36.000000 3.000000 57.000000
50% 42.000000 4.000000 62.000000
75% 52.000000 24.000000 70.000000
max 73.000000 31.000000 94.000000
shape
and dtypes
attributes
- The
.shape
attribute provides the dimensions of the DataFrame (rows, columns).
# Getting the shape of the DataFrame
print(df.shape)
# Output: (5, 4)
The .dtypes
attribute returns the data types of each column.
# Getting data types of columns
print(df.dtypes)
Output:
name object
age int64
preTestScore int64
postTestScore int64
dtype: object