DataScience Workbook / 05: Introduction to Programming / 3. Introduction to Python programming / 3.4. Pandas library - data structure manipulation tool

Introduction

Pandas ⤴ is an open-source library for the Python programming language that is used for data manipulation and analysis. It provides tools for:

  • data analysis,
  • data cleaning,
  • data organization,

making it a popular choice for data scientists, analysts, and developers who work with large datasets and require efficient and scalable solutions for their data-related tasks.

Pandas is built on top of the NumPy ⤴ library and provides two primary data structures:

Series

A Series is a one-dimensional array-like object that can hold any data type.

import pandas as pd

# Creating a Pandas Series with some data
s = pd.Series([1, 3, 5])

# Printing the Series
print(s)

0    1
1    3
2    5
dtype: int64

DataFrame

A DataFrame is a two-dimensional tabular data structure that consists of rows and columns.

# Creating a Pandas DataFrame with some data
data = {'name': ['Alice', 'Bob'], 'age': [25, 32]}
df = pd.DataFrame(data)

# Printing the DataFrame
print(df)

    name  age
0  Alice   25
1    Bob   32

Some of the key features of the Pandas library include:

  • Input-Output (IO) operations: the library can read and write data from a variety of sources, including CSV, Excel, and SQL databases.
  • Data cleaning: the library makes it easy to handle missing or duplicate data, and provides tools for data normalization and standardization.
  • Data manipulation: Pandas provides a wide range of functions for manipulating and transforming data, such as merging, grouping, and pivoting.
  • Data visualization: Pandas includes built-in tools for creating basic visualizations, such as bar charts and scatter plots.
  • High performance: Pandas is optimized for performance, with functions that are designed to handle large datasets efficiently.

For numerical and scientific computing, NumPy and SciPy are often the go-to options, while for data manipulation and analysis, Pandas and other packages such as Dask and PySpark may be more appropriate.

1) numerical and scientific computing:

  • NumPy ⤴ provides support for multidimensional arrays and matrices, and includes a wide range of mathematical functions
  • SciPy ⤴ includes modules for optimization, signal processing, and linear algebra, among others

2) data manipulation and analysis:

  • Pandas ⤴ provides high-level tools for working with tabular and heterogeneous data
  • Scikit-learn ⤴ includes modules for classification, regression, clustering, and dimensionality reduction
  • PySpark ⤴ allows users to process large datasets in parallel across multiple nodes
  • Dask ⤴ provides support for distributed computing and enables users to work with larger-than-memory datasets

Getting started with Pandas

Pandas is NOT a built-in Python module, meaning it is not included with the standard Python distribution. To use Pandas in your Python environment, you will need to install it separately.

Install pandas library

To install Pandas, you can use pip, which is the standard package installer for Python. You can run the following command in your terminal or command prompt to install Pandas:

pip install pandas

An alternative way to install Pandas is to use Conda. This way you can install different versions of the pandas library in separate virtual environments, depending on the requirements of your project. You can create and activate a new conda environment, and then install the pandas library:

conda install pandas

This command will install Pandas and any necessary dependencies in your current Conda environment. If you don’t have conda installed yet, you can follow the guide provided in the tutorial.

conda provides additional benefits over pip, such as the ability to create and manage multiple environments for different projects with different dependencies, and the ability to install packages from both the Conda and PyPI (Python Package Index) repositories.
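For example, you could create a dedicated environment for a project and install Pandas into it (the environment name pandas-env below is just a placeholder):

conda create -n pandas-env python
conda activate pandas-env
conda install pandas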

Import pandas library

After installation, you need to import Pandas in your Python code using the following command:

import pandas as pd

This will import the Pandas library and give it an alias of pd, which is a commonly used abbreviation for Pandas.

You can then use the Pandas functions and classes in your code by prefixing them with pd, such as pd.DataFrame or pd.read_csv.
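A quick way to confirm that the installation and import both work is to print the library version:

import pandas as pd

# the exact version number depends on your installation
print(pd.__version__)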

Data manipulation

Pandas is a powerful Python library for data manipulation and analysis that provides easy-to-use and efficient data structures for handling tabular, structured, and time-series data. With its rich functionality and flexible syntax, Pandas is a versatile tool for data wrangling and advanced analysis in Python.

Here is an index of data manipulation tasks with Pandas:

  1. Creating Pandas objects
  2. Loading data from a file
  3. Exploring data
  4. Cleaning data
  5. Filtering data
  6. Aggregating data

Create Pandas objects

In Pandas, there are two main types of objects: Series and DataFrame.

Series object

In Pandas, a Series is a one-dimensional labeled data structure that can hold any data type, such as integers, floats, strings, or even complex objects. It is similar to a column in a DataFrame, a one-dimensional NumPy array, or a column in a spreadsheet.

A Series consists of an array of values and a set of labels, called the index. The index labels the values in the Series and allows you to access them by label instead of by position. If you don’t specify an index, Pandas will automatically assign a numerical index starting from 0.

Here’s an example of a simple Series:

0      John
1      Jane
2      Jack
3      Jill
Name: Name, dtype: object

In this example, we have a Series with four values, which are strings representing names. The index labels each value with a number from 0 to 3. The label “Name” at the end of the Series specifies the name of the Series.
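For reference, a Series like the one above can be created by passing the values and the optional name parameter (a minimal sketch):

import pandas as pd

# create a named Series of strings
s = pd.Series(['John', 'Jane', 'Jack', 'Jill'], name='Name')
print(s)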

DataFrame object

In Pandas, a DataFrame is a 2-dimensional labeled data structure with columns of potentially different data types. It is similar to a table in a relational database or a spreadsheet. It can hold multiple Series objects of different data types and allows you to perform operations on the entire data set, such as merging, grouping, filtering, and sorting.

A DataFrame consists of rows and columns, where each column can have a different data type (e.g., numeric, string, boolean). Each row represents a record, and each column represents a different variable or feature.

Here’s an example of a simple DataFrame:

   Name   Age      City
0  John    25    London
1  Jane    30  New York
2  Jack    35     Paris
3  Jill    40     Tokyo

In this example, we have a DataFrame with four rows and three columns: “Name”, “Age”, and “City”. The first column contains strings (the names), the second column contains integers (the ages), and the third column contains strings (the cities).

The rows and columns in a DataFrame are labeled, which means you can access the data using the labels.

  • The row labels are called the index.
  • The column labels are called the column names.

You can think of a DataFrame as a dictionary of Series objects, where each column is a Series.
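For reference, the DataFrame shown above could be built from a dictionary whose keys become the column names (a minimal sketch):

import pandas as pd

# each key becomes a column; each list holds that column's values
df = pd.DataFrame({
    'Name': ['John', 'Jane', 'Jack', 'Jill'],
    'Age':  [25, 30, 35, 40],
    'City': ['London', 'New York', 'Paris', 'Tokyo'],
})
print(df)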

Index objects

Pandas also provides Index objects such as Index, RangeIndex, CategoricalIndex, IntervalIndex, MultiIndex, DatetimeIndex, TimedeltaIndex, and PeriodIndex. They are used to label and index the data in Series and DataFrame objects (see the Pandas documentation ⤴ for details).
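For illustration, a few of these index types can be created directly; the labels and dates below are made up:

import pandas as pd

plain_idx = pd.Index(['John', 'Jane', 'Jack', 'Jill'])        # generic Index
date_idx = pd.date_range('2023-01-01', periods=4, freq='D')   # DatetimeIndex
multi_idx = pd.MultiIndex.from_tuples([('A', 1), ('A', 2)])   # MultiIndex

# an index can be attached to a Series (or DataFrame) to label its values
s = pd.Series([25, 30, 35, 40], index=plain_idx)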

Create a Series

Here’s an example of creating a Pandas Series:

import pandas as pd

# create a Pandas Series with four values
s = pd.Series([25, 30, 35, 40], index=['John', 'Jane', 'Jack', 'Jill'])

# print the Series
print(s)

In this example, we use the pd.Series() constructor function to create a Series. We pass a list of four values and a list of four index labels.

The resulting Series has the values and index labels specified:

John    25
Jane    30
Jack    35
Jill    40
dtype: int64

Create an empty DataFrame

To create an empty DataFrame, you can call the pd.DataFrame() constructor function with no arguments, like this:

import pandas as pd

# create an empty DataFrame
df = pd.DataFrame()

# print the DataFrame
print(df)

This will create an empty DataFrame with no rows or columns:

Empty DataFrame
Columns: []
Index: []

You can create an empty DataFrame in Pandas and add data to it later using the df.append() method.

While using df as a variable name for a Pandas DataFrame object is a common convention, it’s generally a better practice to use a more descriptive and unique name that corresponds to the content of the DataFrame.
This is particularly important if you're working with multiple DataFrames in your code, as using the same variable name for multiple objects can lead to confusion and errors. By choosing a more descriptive name, you can help make your code more readable and understandable to others (and your future self!).

Create an empty DataFrame with structure

You can create an empty DataFrame of a specific structure by specifying the column names and data types when you create the DataFrame.

To create an empty DataFrame with specific column names and a data type, you can pass a list of column names with the columns parameter and the desired data type with the dtype parameter.

import pandas as pd

# create an empty DataFrame with specific column names and data types
df = pd.DataFrame(columns=['name', 'age', 'city'], dtype=str)

# print the DataFrame
print(df)

In this example, we use the pd.DataFrame() constructor function to create an empty DataFrame. We pass a list of column names with the columns parameter and set the data type to str for all columns with the dtype parameter.

The resulting DataFrame has the specified column names and data types, but no rows:

Empty DataFrame
Columns: [name, age, city]
Index: []

You can then add data to this DataFrame using the df.append() method.

Create DataFrame from Python object

One of the powerful features of Pandas is that it allows you to create DataFrames from a variety of different data sources, including built-in Python objects like lists and dictionaries.

In general, Pandas makes it easy to convert data from a variety of different sources to DataFrames, allowing you to work with and manipulate the data using the powerful tools and functionality provided by the library.

DataFrame from a list
To create a DataFrame from a list, you can simply pass the list to the pd.DataFrame() constructor function. By default, Pandas will create a DataFrame with a single column containing the list values.

import pandas as pd

# create a list of values
values = ['apple', 'banana', 'cherry', 'date']

# create a DataFrame from the list
df = pd.DataFrame(values)

# print the DataFrame
print(df)

This code creates a DataFrame with a single column containing the values ‘apple’, ‘banana’, ‘cherry’, and ‘date’:

        0
0   apple
1  banana
2  cherry
3    date

DataFrame from a dictionary
To create a DataFrame from a dictionary, you can pass the dictionary to the pd.DataFrame() constructor function. By default, Pandas will create a DataFrame with one column for each key in the dictionary, with the column name corresponding to the key name.

import pandas as pd

# create a dictionary with key-value pairs
data = {'name': ['Alice', 'Bob', 'Charlie'],
        'age': [25, 30, 35],
        'city': ['New York', 'Los Angeles', 'Chicago']}

# create a DataFrame from the dictionary
df = pd.DataFrame(data)

# print the DataFrame
print(df)

This code creates a DataFrame with three columns corresponding to the keys in the dictionary: ‘name’, ‘age’, and ‘city’.

The values in each column correspond to the values in the corresponding dictionary values:

      name  age         city
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago

Add data

to empty DataFrame

To add data to an empty DataFrame, you can use the df.append() method. This method appends rows of data to the end of the DataFrame.

The df.append() method takes a dictionary or Series object containing the values to add as a new row to the DataFrame, and appends it to the end of the DataFrame.

Let’s assume we have an empty DataFrame:

import pandas as pd

# create an empty DataFrame
df = pd.DataFrame()

We can add some data to the DataFrame using df.append() method:

# add some data to the DataFrame
df = df.append({'name': 'Alice', 'age': 25, 'city': 'New York'}, ignore_index=True)
df = df.append({'name': 'Bob', 'age': 30, 'city': 'Los Angeles'}, ignore_index=True)
df = df.append({'name': 'Charlie', 'age': 35, 'city': 'Chicago'}, ignore_index=True)

# print the DataFrame
print(df)

The ignore_index=True argument tells Pandas to ignore any existing index labels on the DataFrame, and instead generate new integer index labels starting from 0.

The print() function displays the DataFrame with the added data:

      name  age         city
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
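Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0. On recent versions, the same result can be obtained with pd.concat(); a minimal sketch, assuming the same three records:

import pandas as pd

# collect the new rows in a small DataFrame
new_rows = pd.DataFrame([
    {'name': 'Alice', 'age': 25, 'city': 'New York'},
    {'name': 'Bob', 'age': 30, 'city': 'Los Angeles'},
    {'name': 'Charlie', 'age': 35, 'city': 'Chicago'},
])

# concatenate with the existing DataFrame and renumber the index
df = pd.concat([df, new_rows], ignore_index=True)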

new row to existing DataFrame

add a new row with data
To add new data as a new row to an existing DataFrame in Pandas, you can use the append() method. (Let’s use the df DataFrame object from the previous example.)

# add new data to the DataFrame
df = df.append({'name': 'David', 'age': 28, 'city': 'Miami'}, ignore_index=True)
df = df.append({'name': 'Emily', 'age': 27, 'city': 'San Francisco'}, ignore_index=True)

print(df)

We just added two more rows of data to the DataFrame using the append() method. The ignore_index=True parameter ensures that the new rows are added with consecutive row numbers.

The output of this code will be a DataFrame with five rows of data, including the new rows added at the end:

      name  age           city
0    Alice   25       New York
1      Bob   30    Los Angeles
2  Charlie   35        Chicago
3    David   28          Miami
4    Emily   27  San Francisco

add a new empty row
In Pandas, you can add a new empty row to a DataFrame by appending an empty dictionary to it using the append() method.

# add a new empty row to the DataFrame
df = df.append({}, ignore_index=True)

print(df)

In this example, we added a new empty row to the DataFrame by appending an empty dictionary to it using the append method and setting the ignore_index parameter to True to ensure that the index of the new row is automatically assigned.

The output of this code will be a DataFrame with six rows and three columns, including the new empty row:

      name  age           city
0    Alice   25       New York
1      Bob   30    Los Angeles
2  Charlie   35        Chicago
3    David   28          Miami
4    Emily   27  San Francisco
5      NaN  NaN            NaN

Note that the values in the new empty row are represented as NaN (Not a Number), which is the default value for missing or undefined data in Pandas.

new column to existing DataFrame

add a column with values
To add a new column to an existing DataFrame in Pandas, you can simply assign a new column with a name and values to the DataFrame.

Here’s an example of adding a new column called ‘gender’ to the DataFrame from the previous example:

# add a new column to the DataFrame
df['gender'] = pd.Series(['F', 'M', 'M', 'M', 'F'])

print(df)

In this example, we added a new column called ‘gender’ to the DataFrame by assigning a Pandas Series of values to it using the square bracket notation. A Series is aligned with the DataFrame by index, so the empty row added earlier (index 5) simply receives a missing value.

Note that if you assign a plain Python list instead of a Series, its length must exactly match the number of rows in the DataFrame.

The output of this code will be a DataFrame with four columns, including the new ‘gender’ column:

      name  age           city  gender
0    Alice   25       New York       F
1      Bob   30    Los Angeles       M
2  Charlie   35        Chicago       M
3    David   28          Miami       M
4    Emily   27  San Francisco       F
5      NaN  NaN            NaN     NaN

add an empty column
To add an empty column to an existing DataFrame in Pandas, you can simply assign an empty Series to a new column name.

Here’s an example of adding an empty column called ‘income’ to the DataFrame from the previous example:

# add an empty column to the DataFrame
df['income'] = pd.Series(dtype=float)

print(df)

In this example, we added an empty column called ‘income’ to the DataFrame by assigning an empty Pandas Series with a data type (dtype) of float to it using the square bracket notation.

The output of this code will be a DataFrame with five columns, including the new empty ‘income’ column:

      name  age           city  gender  income
0    Alice   25       New York       F     NaN
1      Bob   30    Los Angeles       M     NaN
2  Charlie   35        Chicago       M     NaN
3    David   28          Miami       M     NaN
4    Emily   27  San Francisco       F     NaN
5      NaN  NaN            NaN     NaN     NaN

Note that the values in the new empty column are represented as NaN (Not a Number), which is the default value for missing or undefined data in Pandas.

Access data

in the Series

Let’s assume we have a Pandas Series:

import pandas as pd

# create a Pandas Series with four values
s = pd.Series([25, 30, 35, 40], index=['John', 'Jane', 'Jack', 'Jill'])

We can access the values and index labels of the Series using indexing:

# access the value for 'Jane'
print(s['Jane'])
30
# access the values for 'Jane' and 'Jack'
print(s[['Jane', 'Jack']])
Jane    30
Jack    35
dtype: int64
# access the values greater than 30
print(s[s > 30])
Jack    35
Jill    40
dtype: int64

in the DataFrame

Let’s assume we have a Pandas DataFrame:

import pandas as pd

# create the DataFrame
df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'], 'age': [25, 30, 35], 'city': ['New York', 'Los Angeles', 'Chicago']})

The print() function displays the DataFrame:

print(df)
      name  age         city
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago

In Pandas, you can access data in a DataFrame using indexing and slicing.

Here are some examples of how to access data in the DataFrame df:

A. Access a single column by name
series_col = df['name']

This returns a Pandas Series object containing the values in the ‘name’ column.

print(series_col)
0      Alice
1        Bob
2    Charlie
Name: name, dtype: object
B. Access multiple columns by name
new_df_col = df[['name', 'age']]

This returns a new DataFrame containing only the ‘name’ and ‘age’ columns.

new_df_col
      name  age
0    Alice   25
1      Bob   30
2  Charlie   35
C. Access a single row by index
series_row = df.loc[0]

This returns a Series object containing the values in the first row of the DataFrame (index 0).

series_row
name       Alice
age           25
city    New York
Name: 0, dtype: object
D. Access multiple rows by index
new_df_row = df.loc[0:1]

This returns a new DataFrame containing the first two rows of the DataFrame (indexes 0 and 1).

new_df_row
    name  age         city
0  Alice   25     New York
1    Bob   30  Los Angeles
E. Access a single value by row and column index
this_value = df.loc[0, 'name']

This returns the value in the first row and ‘name’ column of the DataFrame (which is ‘Alice’ in this example).

this_value
'Alice'
F. Access a subset of rows and columns using boolean indexing
new_df_subset = df[(df['age'] > 25) & (df['city'] == 'Chicago')][['name', 'age']]

This returns a new DataFrame containing only the ‘name’ and ‘age’ columns for the rows where the age is greater than 25 and the city is ‘Chicago’.

new_df_subset
      name  age
2  Charlie   35

Note that there are many other ways to access data in a DataFrame in Pandas, such as:

  • using integer-position indexing with .iloc[],
  • using label-based indexing with .loc[].

The best method to use depends on the specific task at hand.
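For example, with the same df, position-based and label-based indexing look like this (a minimal sketch):

# position-based indexing with .iloc[]
first_row = df.iloc[0]        # first row as a Series
top_left = df.iloc[0:2, 0:2]  # first two rows, first two columns

# label-based indexing with .loc[]
ages = df.loc[:, 'age']       # the whole 'age' column
bob_city = df.loc[1, 'city']  # value in row 1, column 'city'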

Learn more about Pandas indexing and selecting data in the official documentation.

Load data from a file

Pandas provides functions for reading data from a variety of sources, including:

function          file type
pd.read_csv()     CSV files
pd.read_excel()   Excel spreadsheets
pd.read_sql()     SQL databases
pd.read_json()    JSON files
pd.read_html()    HTML tables

…and more; see the Pandas Input/Output documentation for the full list. Examples for each of the functions above are given below.

CSV: read_csv()

The pd.read_csv() function is a convenient way to read in data from a CSV or TXT file and convert it to a DataFrame object. While importing data from a file, you can also control column data types and how missing values are handled, which simplifies later manipulation and analysis in Python.

You can use the pd.read_csv() function to read data from a CSV file, for example:

import pandas as pd

# Load data from a CSV file
df = pd.read_csv('data.csv')

In this example, we’re using Pandas to load data from a CSV file called “data.csv” and store it in a DataFrame called df.
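pd.read_csv() also accepts many optional parameters for less standard files; for example (the file name, delimiter, and missing-value markers below are assumptions):

import pandas as pd

# a semicolon-delimited file whose first line is the header,
# with 'NA' and empty strings treated as missing values
df = pd.read_csv('data.csv', sep=';', header=0, na_values=['NA', ''])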

EXCEL/XLSX: read_excel()

To load data from an Excel file, you can use the read_excel() function in Pandas.

import pandas as pd

# Load data from an Excel file
df = pd.read_excel('data.xlsx')

In this example, we are loading data from an Excel file named “data.xlsx”. The resulting data is stored in a Pandas DataFrame named df.
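If the workbook contains several sheets, you can pick one with the sheet_name parameter (the sheet name below is an assumption):

import pandas as pd

# load a specific sheet from the workbook
# (reading .xlsx files requires an Excel engine such as openpyxl to be installed)
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')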

SQL: read_sql()

To load data from a SQL database, you can use the read_sql() function in Pandas.

import pandas as pd
import sqlite3

# Connect to a SQLite database
conn = sqlite3.connect('data.db')

# Load data from a SQL query
df = pd.read_sql('SELECT * FROM table_name', conn)

# Close the database connection
conn.close()

In this example, we are connecting to a SQLite database named “data.db”. We then use the read_sql() function to load data from a SQL query. The resulting data is stored in a Pandas DataFrame named df. Finally, we close the database connection.

JSON: read_json()

To load data from a JSON file, you can use the read_json() function in Pandas.

import pandas as pd

# Load data from a JSON file
df = pd.read_json('data.json')

In this example, we are loading data from a JSON file named “data.json”. The resulting data is stored in a Pandas DataFrame named df.

HTML: read_html()

To load data from an HTML file or webpage, you can use the read_html() function in Pandas.

import pandas as pd

# Load data from an HTML table
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)')

# Select the table you want from the list of tables
df = tables[0]

In this example, we are loading data from an HTML table on the Wikipedia page for a list of countries by population. The read_html() function returns a list of all tables found on the page. We then select the table we want by indexing the list. The resulting data is stored in a Pandas DataFrame named df.

OTHER FORMATS
I/O Cheat Sheet

The complete list of file formats supported by Pandas can be found in the official documentation under the IO Tools section: IO tools (text, CSV, HDF5, …) ⤴
This page provides detailed information on how to read and write data from various file formats, including CSV, Excel, SQL, JSON, HTML, HDF5, Parquet, Feather, Msgpack, and many others. Each format has its own section with examples and explanations.

File Format            Reader             Writer
CSV                    read_csv()         to_csv()
Fixed-Width Text File  read_fwf()
JSON                   read_json()        to_json()
HTML                   read_html()        to_html()
LaTeX                                     Styler.to_latex()
XML                    read_xml()         to_xml()
Local clipboard        read_clipboard()   to_clipboard()
MS Excel (.xlsx)       read_excel()       to_excel()
HDF5 Format            read_hdf()         to_hdf()
Feather Format         read_feather()     to_feather()
Parquet Format         read_parquet()     to_parquet()
ORC Format             read_orc()         to_orc()
Stata                  read_stata()       to_stata()
SAS                    read_sas()
SPSS                   read_spss()
Python Pickle Format   read_pickle()      to_pickle()
SQL                    read_sql()         to_sql()
Google BigQuery        read_gbq()         to_gbq()
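The writer methods are called on an existing DataFrame; for example, saving a DataFrame to a CSV file (the output file name is arbitrary):

import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 32]})

# write the DataFrame to a CSV file without the index column
df.to_csv('output.csv', index=False)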

Explore data

Once you’ve loaded your data into a Pandas DataFrame, you can use various functions to explore the data and get a sense of its structure and characteristics. For example, you can use the following functions:

  • df.head() to display the first few rows of a DataFrame,
  • df.describe() to get summary statistics for numerical columns,
  • df.info() to display column names and data types.
# Display the first few rows of the DataFrame
print(df.head())

# Display summary statistics for numerical columns
print(df.describe())

# Display column names and data types
print(df.info())

Assume the following data:

      name  age           city
0    Alice   25       New York
1      Bob   30    Los Angeles
2  Charlie   35        Chicago
3    David   28          Miami
4    Emily   27  San Francisco

  1. Now, you have two options:
    • Option 1: Copy this data to a file named “data.txt” or format it to CSV. Then, read a file with Pandas.
    • Option 2: Create the DataFrame directly in your code.
  2. Once you have the DataFrame ready, explore the data using the following functions:
    • Use print(df.head()) to display the first few rows.
    • Utilize print(df.describe()) for summary statistics.
    • Apply print(df.info()) to view information about the DataFrame.
See an example solution for Option 1:
import pandas as pd

# Read data from the file
# (this assumes the data was saved in CSV format, e.g. a header line "name,age,city"
#  followed by rows like "Alice,25,New York"; whitespace-delimited parsing would split
#  multi-word city names such as "New York" into separate fields)
df = pd.read_csv("data.csv")

# Explore the data

print("First few rows:")
print(df.head())
      name  age           city
0    Alice   25       New York
1      Bob   30    Los Angeles
2  Charlie   35        Chicago
3    David   28          Miami
4    Emily   27  San Francisco
print("\nSummary statistics:")
print(df.describe())
             age
count   5.000000
mean   29.000000
std     3.807887
min    25.000000
25%    27.000000
50%    28.000000
75%    30.000000
max    35.000000
print("\nDataFrame information:")
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   name    5 non-null      object
 1   age     5 non-null      int64
 2   city    5 non-null      object
dtypes: int64(1), object(2)
memory usage: 248.0+ bytes
None

Clean data

Pandas provides functions for cleaning and transforming data, such as:

  • df.drop_duplicates() to remove duplicate rows from a DataFrame,
  • df.fillna() to handle missing values (e.g., to fill in missing values with a specified value or method),
  • converting data types.
# Remove duplicate rows
df = df.drop_duplicates()

# Fill in missing values with the mean value of the column
df['column_name'] = df['column_name'].fillna(df['column_name'].mean())

# Convert a string column to a datetime column
df['date_column'] = pd.to_datetime(df['date_column'], format='%Y-%m-%d')
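As a concrete illustration, here are the same cleaning steps applied to a tiny made-up DataFrame with one duplicated row and one missing value:

import pandas as pd

people = pd.DataFrame({'name': ['Alice', 'Bob', 'Bob'],
                       'age': [25.0, None, None]})

# remove the duplicated 'Bob' row
people = people.drop_duplicates()

# fill the remaining missing age with the column mean
people['age'] = people['age'].fillna(people['age'].mean())

print(people)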

Filter data

You can use Boolean indexing and other functions to filter and subset data based on certain criteria.

For example, you can use the df[df['column'] > value] syntax to filter rows based on a condition.

# Filter rows where a column value is greater than a certain value
filtered_df = df[df['column_name'] > value]

# Filter rows where a column value is in a list of values
filtered_df = df[df['column_name'].isin(list_of_values)]
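For instance, with the small DataFrame of names, ages, and cities used earlier, the same patterns look like this (a minimal sketch):

import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'],
                   'age': [25, 30, 35],
                   'city': ['New York', 'Los Angeles', 'Chicago']})

# rows where age is greater than 25
older = df[df['age'] > 25]

# rows where the city is one of the listed values
coastal = df[df['city'].isin(['New York', 'Los Angeles'])]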

Aggregate data

Pandas provides functions for grouping and aggregating data, such as calculating the mean, sum, or count of values in a column or group.

For example, you can use the df.groupby() function to group data by one or more columns, and then use an aggregation function such as mean() or sum() to calculate summary statistics for each group.

# Group data by a column and calculate the mean of another column
grouped_df = df.groupby('group_column')['value_column'].mean()

# Group data by multiple columns and calculate the sum of another column
grouped_df = df.groupby(['group_column_1', 'group_column_2'])['value_column'].sum()
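As a concrete illustration, with a small made-up sales table:

import pandas as pd

sales = pd.DataFrame({'region': ['East', 'East', 'West', 'West'],
                      'product': ['A', 'B', 'A', 'B'],
                      'amount': [100, 150, 200, 250]})

# mean amount per region
print(sales.groupby('region')['amount'].mean())

# total amount per (region, product) pair
print(sales.groupby(['region', 'product'])['amount'].sum())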