Introduction
Pandas ⤴ is an open-source library for the Python programming language that is used for data manipulation and analysis. It provides tools for:
- data analysis,
- data cleaning,
- data organization,
making it a popular choice for data scientists, analysts, and developers who work with large datasets and require efficient and scalable solutions for their data-related tasks.
Pandas is built on top of the NumPy ⤴ library and provides two primary data structures:
Series | DataFrame |
---|---|
A Series is a one-dimensional array-like object that can hold any data type. | A DataFrame is a two-dimensional tabular data structure that consists of rows and columns. |

# Creating a Pandas Series with some data
s = pd.Series([1, 3, 5])

# Printing the Series
print(s)
0    1
1    3
2    5
dtype: int64

# Creating a Pandas DataFrame with some data
data = {'name': ['Alice', 'Bob'], 'age': [25, 32]}
df = pd.DataFrame(data)

# Printing the DataFrame
print(df)
    name  age
0  Alice   25
1    Bob   32
Some of the key features of the Pandas library include:
Input-Output (IO) operations | The library can read and write data from a variety of sources, including CSV, Excel, and SQL databases. |
---|---|
Data cleaning | The library makes it easy to handle missing or duplicate data, and provides tools for data normalization and standardization. |
Data manipulation | Pandas provides a wide range of functions for manipulating and transforming data, such as merging, grouping, and pivoting. |
Data visualization | Pandas includes built-in tools for creating basic visualizations, such as bar charts and scatter plots. |
High performance | Pandas is optimized for performance, with functions that are designed to handle large datasets efficiently. |
For numerical and scientific computing, numpy and scipy are often the go-to options, while for data manipulation and analysis, pandas and other packages such as dask and pyspark may be more appropriate.
1) numerical and scientific computing:
- NumPy ⤴, provides support for multidimensional arrays and matrices, and includes a wide range of mathematical functions
- SciPy ⤴, includes modules for optimization, signal processing, and linear algebra, among others
2) data manipulation and analysis:
- Pandas ⤴, provides high-level tools for working with tabular and heterogeneous data
- Scikit-learn ⤴, includes modules for classification, regression, clustering, and dimensionality reduction
- PySpark ⤴, allows users to process large datasets in parallel across multiple nodes
- Dask ⤴, provides support for distributed computing and enables users to work with larger-than-memory datasets
Getting started with Pandas
Pandas is NOT a built-in Python module, meaning it is not included with the standard Python distribution. To use Pandas in your Python environment, you will need to install it separately.
Install pandas library
To install Pandas, you can use pip, which is the standard package installer for Python. You can run the following command in your terminal or command prompt to install Pandas:
pip install pandas
An alternative way to install Pandas is using Conda. This way you can install different variants of the pandas library in separate virtual environments, depending on the requirements of your project. You can create and activate a new conda environment, and then install the pandas library:
conda install pandas
This command will install Pandas and any necessary dependencies in your current Conda environment. If you don't have conda installed yet, you can follow the guide provided in the tutorial.
conda provides additional benefits over pip, such as the ability to create and manage multiple environments for different projects with different dependencies, and the ability to install packages from both the Conda and PyPI (Python Package Index) repositories.
Import pandas library
After installation, you need to import Pandas in your Python code using the following command:
import pandas as pd
This will import the Pandas library and give it an alias of pd, which is a commonly used abbreviation for Pandas.
You can then use the Pandas functions and classes in your code by prefixing them with pd, such as pd.DataFrame or pd.read_csv.
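For example, after the import you can immediately use the alias (a quick sanity check that the installation worked):

```python
import pandas as pd

# The alias works for any Pandas function or class
df = pd.DataFrame({'x': [1, 2, 3]})
print(type(df))  # <class 'pandas.core.frame.DataFrame'>

# Check which version of Pandas is installed
print(pd.__version__)
```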
Data manipulation
Pandas is a powerful Python library for data manipulation and analysis that provides easy-to-use and efficient data structures for handling tabular, structured, and time-series data. With its rich functionality and flexible syntax, Pandas is a versatile tool for data wrangling and advanced analysis in Python.
Here is an index of data manipulation tasks with Pandas:
- Creating Pandas objects
- Loading data from a file
- Exploring data
- Cleaning data
- Filtering data
- Aggregating data
Create Pandas objects
In Pandas, there are two main types of objects: Series and DataFrame.
Series object
In Pandas, a Series is a one-dimensional labeled data structure that can hold any data type, such as integers, floats, strings, or even complex objects. It is similar to a column in a DataFrame, a one-dimensional array in NumPy, or a column in a spreadsheet.
A Series consists of an array of values and a set of labels, called the index. The index labels the values in the Series and allows you to access them by label instead of by position. If you don’t specify an index, Pandas will automatically assign a numerical index starting from 0.
Here’s an example of a simple Series:
0 John
1 Jane
2 Jack
3 Jill
Name: Name, dtype: object
In this example, we have a Series with four values, which are strings representing names. The index labels each value with a number from 0 to 3. The label “Name” at the end of the Series specifies the name of the Series.
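The original creation code isn't shown above, but one way to build this Series is with the pd.Series() constructor; the name argument supplies the "Name" label printed on the last line:

```python
import pandas as pd

# Rebuild the Series shown above; `name` supplies the "Name" label
names = pd.Series(['John', 'Jane', 'Jack', 'Jill'], name='Name')
print(names)
```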
DataFrame object
In Pandas, a DataFrame is a 2-dimensional labeled data structure with columns of potentially different data types. It is similar to a table in a relational database or a spreadsheet. It can hold multiple Series objects of different data types and allows you to perform operations on the entire data set, such as merging, grouping, filtering, and sorting.
A DataFrame consists of rows and columns, where each column can have a different data type (e.g., numeric, string, boolean). Each row represents a record, and each column represents a different variable or feature.
Here’s an example of a simple DataFrame:
Name Age City
0 John 25 London
1 Jane 30 New York
2 Jack 35 Paris
3 Jill 40 Tokyo
In this example, we have a DataFrame with four rows and three columns: “Name”, “Age”, and “City”. The first column contains strings (the names), the second column contains integers (the ages), and the third column contains strings (the cities).
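This DataFrame can be built, for example, from a dictionary of column lists:

```python
import pandas as pd

# Each dictionary key becomes a column name
people = pd.DataFrame({'Name': ['John', 'Jane', 'Jack', 'Jill'],
                       'Age': [25, 30, 35, 40],
                       'City': ['London', 'New York', 'Paris', 'Tokyo']})
print(people)
```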
The rows and columns in a DataFrame are labeled, which means you can access the data using the labels.
- The row labels are called the index.
- The column labels are called the column names.
You can think of a DataFrame as a dictionary of Series objects, where each column is a Series.
Index objects
Pandas also provides Index objects such as Index, RangeIndex, CategoricalIndex, IntervalIndex, MultiIndex, DatetimeIndex, TimedeltaIndex, and PeriodIndex. They are used to label and index the data in Series and DataFrame objects.
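As a minimal sketch of two of these index types (the dates and values here are made up for illustration):

```python
import pandas as pd

# A DatetimeIndex labels rows with timestamps
dates = pd.date_range('2024-01-01', periods=3, freq='D')
temps = pd.Series([2.5, 3.1, 1.8], index=dates)
print(temps.index)

# A MultiIndex labels rows with more than one level
mi = pd.MultiIndex.from_tuples([('A', 1), ('A', 2), ('B', 1)],
                               names=['group', 'item'])
print(mi)
```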
Create a Series
Here’s an example of creating a Pandas Series:
import pandas as pd
# create a Pandas Series with four values
s = pd.Series([25, 30, 35, 40], index=['John', 'Jane', 'Jack', 'Jill'])
# print the Series
print(s)
In this example, we use the pd.Series() constructor function to create a Series. We pass a list of four values and a list of four index labels.
The resulting Series has the values and index labels specified:
John    25
Jane    30
Jack    35
Jill    40
dtype: int64
Create an empty DataFrame
To create an empty DataFrame, you can call the pd.DataFrame() constructor function with no arguments, like this:
import pandas as pd
# create an empty DataFrame
df = pd.DataFrame()
# print the DataFrame
print(df)
This will create an empty DataFrame with no rows or columns:
Empty DataFrame
Columns: []
Index: []
You can create an empty DataFrame in Pandas and add data to it later, for example with the pd.concat() function (the older df.append() method was removed in pandas 2.0).
While using df as a variable name for a Pandas DataFrame object is a common convention, it’s generally a better practice to use a more descriptive and unique name that corresponds to the content of the DataFrame.
This is particularly important if you're working with multiple DataFrames in your code, as using the same variable name for multiple objects can lead to confusion and errors. By choosing a more descriptive name, you can help make your code more readable and understandable to others (and your future self!).
Create an empty DataFrame with structure
You can create an empty DataFrame of a specific structure by specifying the column names and data types when you create the DataFrame.
To create an empty DataFrame with specific column names and data types, you can pass the column names via the columns argument and the desired data type via the dtype argument.
import pandas as pd
# create an empty DataFrame with specific column names and data types
df = pd.DataFrame(columns=['name', 'age', 'city'], dtype=str)
# print the DataFrame
print(df)
In this example, we use the pd.DataFrame() constructor function to create an empty DataFrame. We pass a list of column names via the columns argument and set the data type to str for all columns via the dtype argument.
The resulting DataFrame has the specified column names and data types, but no rows:
Empty DataFrame
Columns: [name, age, city]
Index: []
You can then add data to this DataFrame later, for example with the pd.concat() function.
Create DataFrame from Python object
One of the powerful features of Pandas is that it allows you to create DataFrames from a variety of different data sources, including built-in Python objects like lists and dictionaries.
In general, Pandas makes it easy to convert data from a variety of different sources to DataFrames, allowing you to work with and manipulate the data using the powerful tools and functionality provided by the library.
DataFrame from a list
To create a DataFrame from a list, you can simply pass the list to the pd.DataFrame() constructor function. By default, Pandas will create a DataFrame with a single column containing the list values.
import pandas as pd
# create a list of values
values = ['apple', 'banana', 'cherry', 'date']
# create a DataFrame from the list
df = pd.DataFrame(values)
# print the DataFrame
print(df)
This code creates a DataFrame with a single column containing the values ‘apple’, ‘banana’, ‘cherry’, and ‘date’:
        0
0   apple
1  banana
2  cherry
3    date
DataFrame from a dictionary
To create a DataFrame from a dictionary, you can pass the dictionary to the pd.DataFrame() constructor function. By default, Pandas will create a DataFrame with one column for each key in the dictionary, with the column name corresponding to the key name.
import pandas as pd
# create a dictionary with key-value pairs
data = {'name': ['Alice', 'Bob', 'Charlie'],
'age': [25, 30, 35],
'city': ['New York', 'Los Angeles', 'Chicago']}
# create a DataFrame from the dictionary
df = pd.DataFrame(data)
# print the DataFrame
print(df)
This code creates a DataFrame with three columns corresponding to the keys in the dictionary: ‘name’, ‘age’, and ‘city’. The values in each column correspond to the values in the matching dictionary entry:
      name  age         city
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
Add data to empty DataFrame
To add data to an empty DataFrame, you can use the pd.concat() function. (The older df.append() method was deprecated in pandas 1.4 and removed in pandas 2.0.)
The pd.concat() function takes a list of DataFrame or Series objects and concatenates them along the rows, so you can build the new rows as a DataFrame and append them to the end of the existing DataFrame.
Let’s assume we have an empty DataFrame:
import pandas as pd
# create an empty DataFrame
df = pd.DataFrame()
We can add some data to the DataFrame using pd.concat():
# add some data to the DataFrame
new_rows = pd.DataFrame([{'name': 'Alice', 'age': 25, 'city': 'New York'},
                         {'name': 'Bob', 'age': 30, 'city': 'Los Angeles'},
                         {'name': 'Charlie', 'age': 35, 'city': 'Chicago'}])
df = pd.concat([df, new_rows], ignore_index=True)
# print the DataFrame
print(df)
# print the DataFrame
print(df)
The ignore_index=True argument tells Pandas to ignore any existing index labels on the DataFrame, and instead generate new integer index labels starting from 0.
The print() function displays the DataFrame with the added data:
      name  age         city
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
new row to existing DataFrame
add a new row with data
To add new data as a new row to an existing DataFrame in Pandas, you can again use the pd.concat() function. (Let’s use the df DataFrame object from the previous example.)
# add new data to the DataFrame
more_rows = pd.DataFrame([{'name': 'David', 'age': 28, 'city': 'Miami'},
                          {'name': 'Emily', 'age': 27, 'city': 'San Francisco'}])
df = pd.concat([df, more_rows], ignore_index=True)
print(df)
We just added two more rows of data to the DataFrame using pd.concat(). The ignore_index=True parameter ensures that the new rows are added with consecutive row numbers.
The output of this code will be a DataFrame with five rows of data, including the new rows added at the end:
      name  age           city
0    Alice   25       New York
1      Bob   30    Los Angeles
2  Charlie   35        Chicago
3    David   28          Miami
4    Emily   27  San Francisco
add a new empty row
In Pandas, you can add a new empty row to a DataFrame by concatenating a one-row DataFrame that contains only missing values:
# add a new empty row to the DataFrame
empty_row = pd.DataFrame([{}], columns=df.columns)
df = pd.concat([df, empty_row], ignore_index=True)
print(df)
In this example, we added a new empty row to the DataFrame by concatenating a one-row DataFrame whose values are all missing, and setting the ignore_index parameter to True to ensure that the index of the new row is automatically assigned.
The output of this code will be a DataFrame with six rows and three columns, including the new empty row:
      name  age           city
0    Alice   25       New York
1      Bob   30    Los Angeles
2  Charlie   35        Chicago
3    David   28          Miami
4    Emily   27  San Francisco
5      NaN  NaN            NaN
Note that the values in the new empty row are represented as NaN (Not a Number), which is the default marker for missing or undefined data in Pandas. Adding missing values to a column may also change its dtype (for example, from int64 to float64 or object).
new column to existing DataFrame
add a column with values
To add a new column to an existing DataFrame in Pandas, you can simply assign a new column with a name and values to the DataFrame.
Here’s an example of adding a new column called ‘gender’ to the DataFrame from the previous example:
# add a new column to the DataFrame
# (assigning a Series aligns on the index, so the empty last row gets NaN)
df['gender'] = pd.Series(['F', 'M', 'M', 'M', 'F'])
print(df)
In this example, we added a new column called ‘gender’ to the DataFrame by assigning a Series of values to it using the square bracket notation.
* Note that when you assign a plain list instead, its length must match the number of rows in the DataFrame; a Series is aligned on the index, and rows without a matching value get NaN.
The output of this code will be a DataFrame with four columns, including the new ‘gender’ column:
      name  age           city gender
0    Alice   25       New York      F
1      Bob   30    Los Angeles      M
2  Charlie   35        Chicago      M
3    David   28          Miami      M
4    Emily   27  San Francisco      F
5      NaN  NaN            NaN    NaN
add an empty column
To add an empty column to an existing DataFrame in Pandas, you can simply assign an empty Series with a name to the DataFrame.
Here’s an example of adding an empty column called ‘income’ to the DataFrame from the previous example:
# add an empty column to the DataFrame
df['income'] = pd.Series(dtype=float)
print(df)
In this example, we added an empty column called ‘income’ to the DataFrame by assigning an empty Pandas Series with a data type (dtype) of float to it using the square bracket notation.
The output of this code will be a DataFrame with five columns, including the new empty ‘income’ column:
      name  age           city gender  income
0    Alice   25       New York      F     NaN
1      Bob   30    Los Angeles      M     NaN
2  Charlie   35        Chicago      M     NaN
3    David   28          Miami      M     NaN
4    Emily   27  San Francisco      F     NaN
5      NaN  NaN            NaN    NaN     NaN
Note that the values in the new empty column are represented as NaN (Not a Number), which is the default marker for missing or undefined data in Pandas.
Access data
in the Series
Let’s assume we have a Pandas Series:
import pandas as pd
# create a Pandas Series with four values
s = pd.Series([25, 30, 35, 40], index=['John', 'Jane', 'Jack', 'Jill'])
We can access the values and index labels of the Series using indexing:
# access the value for 'Jane'
print(s['Jane'])
30
# access the values for 'Jane' and 'Jack'
print(s[['Jane', 'Jack']])
Jane    30
Jack    35
dtype: int64
# access the values greater than 30
print(s[s > 30])
Jack    35
Jill    40
dtype: int64
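Besides label-based indexing, a Series also supports position-based access through .iloc (a short sketch using the same Series):

```python
import pandas as pd

s = pd.Series([25, 30, 35, 40], index=['John', 'Jane', 'Jack', 'Jill'])

# access by integer position with .iloc
print(s.iloc[0])   # 25
print(s.iloc[-1])  # 40

# explicit label-based access with .loc
print(s.loc['Jane'])  # 30
```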
in the DataFrame
Let’s assume we have a Pandas DataFrame:
import pandas as pd
# create the DataFrame
df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'], 'age': [25, 30, 35], 'city': ['New York', 'Los Angeles', 'Chicago']})
The print()
function displays the DataFrame:
print(df)
      name  age         city
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
In Pandas, you can access data in a DataFrame using indexing and slicing.
Here are some examples of how to access data in the DataFrame df:
A. Access a single column by name
series_col = df['name']
This returns a Pandas Series object containing the values in the ‘name’ column.
print(series_col)
0      Alice
1        Bob
2    Charlie
Name: name, dtype: object
B. Access multiple columns by name
new_df_col = df[['name', 'age']]
This returns a new DataFrame containing only the ‘name’ and ‘age’ columns.
print(new_df_col)
      name  age
0    Alice   25
1      Bob   30
2  Charlie   35
C. Access a single row by index
series_row = df.loc[0]
This returns a Series object containing the values in the first row of the DataFrame (index 0).
print(series_row)
name       Alice
age           25
city    New York
Name: 0, dtype: object
D. Access multiple rows by index
new_df_row = df.loc[0:1]
This returns a new DataFrame containing the first two rows of the DataFrame (indexes 0 and 1).
print(new_df_row)
    name  age         city
0  Alice   25     New York
1    Bob   30  Los Angeles
E. Access a single value by row and column index
this_value = df.loc[0, 'name']
This returns the value in the first row and ‘name’ column of the DataFrame (which is ‘Alice’ in this example).
print(this_value)
Alice
F. Access a subset of rows and columns using boolean indexing
new_df_subset = df[(df['age'] > 25) & (df['city'] == 'Chicago')][['name', 'age']]
This returns a new DataFrame containing only the ‘name’ and ‘age’ columns for the rows where the age is greater than 25 and the city is ‘Chicago’.
print(new_df_subset)
      name  age
2  Charlie   35
Note that there are many other ways to access data in a DataFrame in Pandas, such as:
- using integer (position-based) indexing with .iloc[],
- using label-based indexing with .loc[].
The best method to use depends on the specific task at hand.
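For instance, position-based selection with .iloc[] on the same DataFrame looks like this:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'],
                   'age': [25, 30, 35],
                   'city': ['New York', 'Los Angeles', 'Chicago']})

print(df.iloc[0])         # first row as a Series
print(df.iloc[0:2, 0:2])  # first two rows, first two columns
print(df.iloc[2, 1])      # 35 (third row, second column)
```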
Learn more about Pandas indexing and selecting data from the official documentation:
Load data from a file
Pandas provides functions for reading data from a variety of sources, including:
function | file type | tutorial |
---|---|---|
pd.read_csv() |
CSV files | example |
pd.read_excel() |
Excel spreadsheets | example |
pd.read_sql() |
SQL databases | example |
pd.read_json() |
JSON files | example |
pd.read_html() |
HTML code | example |
…and more. See the full list in the Pandas Input/Output documentation.
CSV: read_csv()
The pd.read_csv() function is a convenient way to read in data from a CSV or TXT file and convert it to a DataFrame object. While importing data from a file, you can also handle various data types and missing values, facilitating seamless data manipulation and analysis in Python.
You can use the pd.read_csv() function to read data from a CSV file, for example:
import pandas as pd
# Load data from a CSV file
df = pd.read_csv('data.csv')
In this example, we’re using Pandas to load data from a CSV file called “data.csv” and store it in a DataFrame called df.
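Here is a self-contained sketch that first writes a small CSV file and then reads it back (the filename data.csv and its contents are just for illustration):

```python
import pandas as pd

# Write a small CSV file so the example can run on its own
with open('data.csv', 'w') as f:
    f.write('name,age\nAlice,25\nBob,30\n')

# Read it back into a DataFrame; column types are inferred
df = pd.read_csv('data.csv')
print(df)
print(df.dtypes)  # 'age' is parsed as int64
```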
EXCEL/XLSX: read_excel()
To load data from an Excel file, you can use the read_excel() function in Pandas. (Reading .xlsx files also requires an engine such as the openpyxl package to be installed.)
import pandas as pd
# Load data from an Excel file
df = pd.read_excel('data.xlsx')
In this example, we are loading data from an Excel file named “data.xlsx”. The resulting data is stored in a Pandas DataFrame named df.
SQL: read_sql()
To load data from a SQL database, you can use the read_sql() function in Pandas.
import pandas as pd
import sqlite3
# Connect to a SQLite database
conn = sqlite3.connect('data.db')
# Load data from a SQL query
df = pd.read_sql('SELECT * FROM table_name', conn)
# Close the database connection
conn.close()
In this example, we are connecting to a SQLite database named “data.db”. We then use the read_sql() function to load data from a SQL query. The resulting data is stored in a Pandas DataFrame named df. Finally, we close the database connection.
JSON: read_json()
To load data from a JSON file, you can use the read_json() function in Pandas.
import pandas as pd
# Load data from a JSON file
df = pd.read_json('data.json')
In this example, we are loading data from a JSON file named “data.json”. The resulting data is stored in a Pandas DataFrame named df.
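A self-contained variant that writes a small JSON file of records and reads it back (the filename and data are invented for the example):

```python
import pandas as pd

# Write a small JSON file (a list of records) so the example runs on its own
with open('data.json', 'w') as f:
    f.write('[{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]')

# Each record becomes a row in the DataFrame
df = pd.read_json('data.json')
print(df)
```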
HTML: read_html()
To load data from an HTML file or webpage, you can use the read_html() function in Pandas.
import pandas as pd
# Load data from an HTML table
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)')
# Select the table you want from the list of tables
df = tables[0]
In this example, we are loading data from an HTML table on the Wikipedia page for a list of countries by population. The read_html() function returns a list of all tables found on the page. We then select the table we want by indexing the list. The resulting data is stored in a Pandas DataFrame named df.
OTHER FORMATS
I/O Cheat Sheet
The complete list of file formats supported by Pandas can be found in the official documentation under the IO Tools section: IO tools (text, CSV, HDF5, …) ⤴
This page provides detailed information on how to read and write data from various file formats, including CSV, Excel, SQL, JSON, HTML, HDF5, Parquet, Feather, Msgpack, and many others. Each format has its own section with examples and explanations.
File Format | Reader | Writer |
---|---|---|
CSV | read_csv() | to_csv() |
Fixed-Width Text File | read_fwf() | |
JSON | read_json() | to_json() |
HTML | read_html() | to_html() |
LaTeX | | Styler.to_latex() |
XML | read_xml() | to_xml() |
Local clipboard | read_clipboard() | to_clipboard() |
MS Excel (.xlsx) | read_excel() | to_excel() |
HDF5 Format | read_hdf() | to_hdf() |
Feather Format | read_feather() | to_feather() |
Parquet Format | read_parquet() | to_parquet() |
ORC Format | read_orc() | to_orc() |
Stata | read_stata() | to_stata() |
SAS | read_sas() | |
SPSS | read_spss() | |
Python Pickle Format | read_pickle() | to_pickle() |
SQL | read_sql() | to_sql() |
Google BigQuery | read_gbq() | to_gbq() |
Explore data
Once you’ve loaded your data into a Pandas DataFrame, you can use various functions to explore the data and get a sense of its structure and characteristics. For example, you can use the following functions:
- df.head() to display the first few rows of a DataFrame,
- df.describe() to get summary statistics for numerical columns,
- df.info() to display column names and data types.
# Display the first few rows of the DataFrame
print(df.head())
# Display summary statistics for numerical columns
print(df.describe())
# Display column names and data types
print(df.info())
Assume the following data:
name age city
0 Alice 25 New York
1 Bob 30 Los Angeles
2 Charlie 35 Chicago
3 David 28 Miami
4 Emily 27 San Francisco
Now, you have two options:
- Option 1: Copy this data to a file named “data.txt” or format it to CSV. Then, read the file with Pandas.
- Option 2: Create the DataFrame directly in your code.
Once you have the DataFrame ready, explore the data using the following functions:
- Use print(df.head()) to display the first few rows.
- Utilize print(df.describe()) for summary statistics.
- Apply print(df.info()) to view information about the DataFrame.
Refer to the previous section for detailed instructions:
See example solution for option 1
import pandas as pd
# Read the data, saved in CSV format as "data.csv"
# (city names like "New York" contain spaces, so whitespace-delimited
# parsing of the raw text would split them into extra columns)
df = pd.read_csv("data.csv")

# Explore the data
print("First few rows:")
print(df.head())
      name  age           city
0    Alice   25       New York
1      Bob   30    Los Angeles
2  Charlie   35        Chicago
3    David   28          Miami
4    Emily   27  San Francisco
print("\nSummary statistics:")
print(df.describe())
             age
count   5.000000
mean   29.000000
std     3.807887
min    25.000000
25%    27.000000
50%    28.000000
75%    30.000000
max    35.000000
print("\nDataFrame information:")
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   name    5 non-null      object
 1   age     5 non-null      int64
 2   city    5 non-null      object
dtypes: int64(1), object(2)
memory usage: 248.0+ bytes
None
Clean data
Pandas provides functions for cleaning and transforming data, such as:
- df.drop_duplicates() to remove duplicate rows from a DataFrame,
- df.fillna() to handle missing values (e.g., to fill in missing values with a specified value or method),
- functions for converting data types.
# Remove duplicate rows
df = df.drop_duplicates()
# Fill in missing values with the mean value of the column
df['column_name'] = df['column_name'].fillna(df['column_name'].mean())
# Convert a string column to a datetime column
df['date_column'] = pd.to_datetime(df['date_column'], format='%Y-%m-%d')
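Putting these steps together on a small, concrete DataFrame (the column names and values below are made up for illustration):

```python
import pandas as pd
import numpy as np

# A DataFrame with a duplicate row and a missing value
df = pd.DataFrame({'name': ['Alice', 'Bob', 'Bob', 'Carol'],
                   'score': [80.0, 90.0, 90.0, np.nan],
                   'joined': ['2024-01-05', '2024-02-10', '2024-02-10', '2024-03-15']})

df = df.drop_duplicates()                             # removes the repeated 'Bob' row
df['score'] = df['score'].fillna(df['score'].mean())  # NaN -> mean of 80 and 90
df['joined'] = pd.to_datetime(df['joined'], format='%Y-%m-%d')

print(df)
print(df.dtypes)  # 'joined' is now datetime64[ns]
```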
Filter data
You can use Boolean indexing and other functions to filter and subset data based on certain criteria.
For example, you can use the df[df['column'] > value]
syntax to filter rows based on a condition.
# Filter rows where a column value is greater than a certain value
filtered_df = df[df['column_name'] > value]
# Filter rows where a column value is in a list of values
filtered_df = df[df['column_name'].isin(list_of_values)]
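A runnable sketch of both patterns with concrete data (the city names and values are invented for the example):

```python
import pandas as pd

df = pd.DataFrame({'city': ['New York', 'Chicago', 'Miami', 'Boston'],
                   'temp': [20, 15, 30, 18]})

# Boolean mask: rows where 'temp' is greater than 18
print(df[df['temp'] > 18])

# Membership test: rows whose 'city' is in a given list
print(df[df['city'].isin(['Chicago', 'Boston'])])
```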
Aggregate data
Pandas provides functions for grouping and aggregating data, such as calculating the mean, sum, or count of values in a column or group.
For example, you can use the df.groupby() function to group data by one or more columns, and then use an aggregation function such as mean() or sum() to calculate summary statistics for each group.
# Group data by a column and calculate the mean of another column
grouped_df = df.groupby('group_column')['value_column'].mean()
# Group data by multiple columns and calculate the sum of another column
grouped_df = df.groupby(['group_column_1', 'group_column_2'])['value_column'].sum()
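A minimal runnable example with invented data:

```python
import pandas as pd

df = pd.DataFrame({'dept': ['IT', 'IT', 'HR', 'HR', 'HR'],
                   'salary': [70, 80, 50, 60, 55]})

# Mean salary per department
print(df.groupby('dept')['salary'].mean())
# dept
# HR    55.0
# IT    75.0
# Name: salary, dtype: float64

# Count of employees per department
print(df.groupby('dept')['salary'].count())
```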