DataScience Workbook / 05. Introduction to Programming / 3. Introduction to Python Programming / 3.5 Pandas Library - Data Structure Manipulation Tool


Introduction

Pandas ⤴ is an open-source library for the Python programming language that is used for data manipulation and analysis. It provides tools for:

  • data analysis,
  • data cleaning,
  • data organization,

making it a popular choice for data scientists, analysts, and developers who work with large datasets.

The library is built on top of the NumPy ⤴ library and provides two primary data structures:

Series
A Series is a one-dimensional array-like object that can hold any data type.

import pandas as pd

# Creating a Pandas Series with some data
s = pd.Series([1, 3, 5, 7])

# Printing the Series
print(s)

0    1
1    3
2    5
3    7
dtype: int64

DataFrame
A DataFrame is a two-dimensional tabular data structure that consists of rows and columns.

# Creating a Pandas DataFrame with some data
data = {'name': ['Alice', 'Bob'], 'age': [25, 32]}
df = pd.DataFrame(data)

# Printing the DataFrame
print(df)

    name  age
0  Alice   25
1    Bob   32

Some of the key features of the Pandas library include:

  • Input-Output (IO) operations
    The library can read and write data from a variety of sources, including CSV, Excel, and SQL databases.

  • Data cleaning
    The library makes it easy to handle missing or duplicate data, and provides tools for data normalization and standardization.

  • Data manipulation
    Pandas provides a wide range of functions for manipulating and transforming data, such as merging, grouping, and pivoting.

  • Data visualization
    Pandas includes built-in tools for creating basic visualizations, such as bar charts and scatter plots.

  • High performance
    Pandas is optimized for performance, with functions that are designed to handle large datasets efficiently.
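As a quick taste of these features, here is a minimal sketch (the column names and values are made up for illustration) that combines data cleaning and aggregation on a tiny DataFrame:

```python
import pandas as pd

# a small DataFrame with a duplicate row and a missing value
df = pd.DataFrame({'city': ['London', 'London', 'Paris', 'Paris'],
                   'sales': [100, 100, 200, None]})

# data cleaning: drop the duplicate row, fill the missing value
df = df.drop_duplicates()
df['sales'] = df['sales'].fillna(0)

# data manipulation: group by city and sum the sales
totals = df.groupby('city')['sales'].sum()
print(totals)
```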

PRO TIP:
For numerical and scientific computing, NumPy and SciPy are often the go-to options, while for data manipulation and analysis, Pandas and other packages such as Dask and PySpark may be more appropriate.
  • numerical and scientific computing:
    NumPy ⤴, provides support for multidimensional arrays and matrices, and includes a wide range of mathematical functions
    SciPy ⤴, includes modules for optimization, signal processing, and linear algebra, among others
  • data manipulation and analysis:
    Pandas ⤴, provides high-level tools for working with tabular and heterogeneous data
    Scikit-learn ⤴, includes modules for classification, regression, clustering, and dimensionality reduction
    PySpark ⤴, allows users to process large datasets in parallel across multiple nodes
    Dask ⤴, provides support for distributed computing and enables users to work with larger-than-memory datasets


    Getting started with Pandas

    Pandas is NOT a built-in Python module, meaning it is not included with the standard Python distribution. To use Pandas in your Python environment, you will need to install it separately.

    Install pandas library

    To install Pandas, you can use pip, which is the standard package installer for Python. You can run the following command in your terminal or command prompt to install Pandas:

    pip install pandas
    

    An alternative way to install Pandas is using Conda. This way you can install different variants of the pandas library in separate virtual environments, depending on the requirements of your project. You can create and activate a new conda environment, and then install the pandas library:

    conda install pandas
    
    WARNING:
    This command will install Pandas and any necessary dependencies in your current Conda environment.

    If you don't have Conda installed yet, you can follow the guide provided in the tutorial Local Python setup on your computing machine ⤴.
    PRO TIP:
    Conda provides additional benefits over pip, such as the ability to create and manage multiple environments for different projects with different dependencies, and the ability to install packages from both the Conda and PyPI (Python Package Index) repositories.

    Import pandas library

    After installation, you need to import Pandas in your Python code using the following command:

    import pandas as pd
    

    This will import the Pandas library and give it an alias of pd, which is a commonly used abbreviation for Pandas.

    You can then use the Pandas functions and classes in your code by prefixing them with pd, such as pd.DataFrame or pd.read_csv.
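To confirm that the installation and import worked, you can print the installed version:

```python
import pandas as pd

# print the installed Pandas version
print(pd.__version__)
```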


    Data manipulation with pandas

    Pandas is a powerful Python library for data manipulation and analysis that provides easy-to-use and efficient data structures for handling tabular, structured, and time-series data. With its rich functionality and flexible syntax, it has become a standard tool for working with data in Python.

    Here is an index of data manipulation tasks with Pandas:

    1. Creating Pandas objects
    2. Loading data from a file
    3. Exploring data
    4. Cleaning data
    5. Filtering data
    6. Aggregating data

    Create Pandas objects

    In Pandas, there are two main types of objects: Series and DataFrame.

    Series object

    In Pandas, a Series is a one-dimensional labeled data structure that can hold any data type, such as integers, floats, strings, or even complex objects. It is similar to a column in a DataFrame, a one-dimensional array in NumPy, or a column in a spreadsheet.

    NOTE:
    A Series consists of an array of values and a set of labels, called the index. The index labels the values in the Series and allows you to access them by label instead of by position. If you don't specify an index, Pandas will automatically assign a numerical index starting from 0.


    Here’s an example of a simple Series:

    0      John
    1      Jane
    2      Jack
    3      Jill
    Name: Name, dtype: object
    

    In this example, we have a Series with four values, which are strings representing names. The index labels each value with a number from 0 to 3. The label “Name” at the end of the Series specifies the name of the Series.

    DataFrame object

    In Pandas, a DataFrame is a 2-dimensional labeled data structure with columns of potentially different data types. It is similar to a table in a relational database or a spreadsheet. It can hold multiple Series objects of different data types and allows you to perform operations on the entire data set, such as merging, grouping, filtering, and sorting.

    A DataFrame consists of rows and columns, where each column can have a different data type (e.g., numeric, string, boolean). Each row represents a record, and each column represents a different variable or feature.

    Here’s an example of a simple DataFrame:

       Name   Age      City
    0  John    25    London
    1  Jane    30  New York
    2  Jack    35     Paris
    3  Jill    40     Tokyo
    

    In this example, we have a DataFrame with four rows and three columns: “Name”, “Age”, and “City”. The first column contains strings (the names), the second column contains integers (the ages), and the third column contains strings (the cities).

    The rows and columns in a DataFrame are labeled, which means you can access the data using the labels.

    • The row labels are called the index.
    • The column labels are called the column names.

    You can think of a DataFrame as a dictionary of Series objects, where each column is a Series.
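The "dictionary of Series" view can be checked directly: a DataFrame can be built from a dict of Series, and selecting a column returns a Series (the names used here are just for illustration):

```python
import pandas as pd

# build a DataFrame from a dictionary of Series
ages = pd.Series([25, 30], index=['John', 'Jane'])
cities = pd.Series(['London', 'New York'], index=['John', 'Jane'])
df = pd.DataFrame({'Age': ages, 'City': cities})

# selecting a column gives back a Series
print(type(df['Age']))   # <class 'pandas.core.series.Series'>
print(df)
```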

    Other Pandas objects

    Pandas also provides Index objects such as Index, RangeIndex, CategoricalIndex, IntervalIndex, MultiIndex, DatetimeIndex, TimedeltaIndex, PeriodIndex (see Pandas documentation ⤴ for details). They are used to label and index the data in Series and DataFrame objects.

    Create a Series

    Here’s an example of creating a Pandas Series:

    import pandas as pd
    
    # create a Pandas Series with four values
    s = pd.Series([25, 30, 35, 40], index=['John', 'Jane', 'Jack', 'Jill'])
    
    # print the Series
    print(s)
    

    In this example, we use the pd.Series() constructor function to create a Series. We pass a list of four values and a list of four index labels. The resulting Series has the values and index labels specified:


    John    25
    Jane    30
    Jack    35
    Jill    40
    dtype: int64
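You can also create a Series directly from a Python dictionary; the keys become the index labels. This sketch mirrors the example above:

```python
import pandas as pd

# dictionary keys become the index labels
s = pd.Series({'John': 25, 'Jane': 30, 'Jack': 35, 'Jill': 40})
print(s)
```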

    Create an empty DataFrame

    To create an empty DataFrame, you can call the pd.DataFrame() constructor function with no arguments, like this:

    import pandas as pd
    
    # create an empty DataFrame
    df = pd.DataFrame()
    
    # print the DataFrame
    print(df)
    

    This will create an empty DataFrame with no rows or columns:


    Empty DataFrame
    Columns: []
    Index: []

    PRO TIP:
    You can create an empty DataFrame in Pandas and add data to it later, e.g. with the pd.concat() function (the older df.append() method was removed in pandas 2.0).
    PRO TIP:
    While using df as a variable name for a Pandas DataFrame object is a common convention, it's generally a better practice to use a more descriptive and unique name that corresponds to the content of the DataFrame.

    This is particularly important if you're working with multiple DataFrames in your code, as using the same variable name for multiple objects can lead to confusion and errors. By choosing a more descriptive name, you can help make your code more readable and understandable to others (and your future self!).


    Create an empty DataFrame with structure

    You can create an empty DataFrame of a specific structure by specifying the column names and data types when you create the DataFrame.

    To create an empty DataFrame with specific column names, you can pass a list of names via the columns parameter; the dtype parameter sets the data type of the columns.

    import pandas as pd
    
    # create an empty DataFrame with specific column names and data types
    df = pd.DataFrame(columns=['name', 'age', 'city'], dtype=str)
    
    # print the DataFrame
    print(df)
    

    In this example, we use the pd.DataFrame() constructor function to create an empty DataFrame. We pass a list of column names via the columns parameter and set the data type to str for all columns.

    The resulting DataFrame has the specified column names and data types, but no rows:

    Empty DataFrame
    Columns: [name, age, city]
    Index: []

    You can then add data to this DataFrame, for example with the pd.concat() function (the older df.append() method was removed in pandas 2.0).

    Create DataFrame from Python object

    One of the powerful features of Pandas is that it allows you to create DataFrames from a variety of different data sources, including built-in Python objects like lists and dictionaries.

    PRO TIP:
    In general, Pandas makes it easy to convert data from a variety of different sources to DataFrames, allowing you to work with and manipulate the data using the powerful tools and functionality provided by the library.


    DataFrame from a list
    To create a DataFrame from a list, you can simply pass the list to the pd.DataFrame() constructor function. By default, Pandas will create a DataFrame with a single column containing the list values.

    import pandas as pd
    
    # create a list of values
    values = ['apple', 'banana', 'cherry', 'date']
    
    # create a DataFrame from the list
    df = pd.DataFrame(values)
    
    # print the DataFrame
    print(df)
    

    This code creates a DataFrame with a single column containing the values ‘apple’, ‘banana’, ‘cherry’, and ‘date’:

            0
    0   apple
    1  banana
    2  cherry
    3    date
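If you want a column name other than the default 0, you can pass it via the columns parameter (the name ‘fruit’ here is made up for illustration):

```python
import pandas as pd

values = ['apple', 'banana', 'cherry', 'date']

# name the single column explicitly
df = pd.DataFrame(values, columns=['fruit'])
print(df)
```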

    DataFrame from a dictionary
    To create a DataFrame from a dictionary, you can pass the dictionary to the pd.DataFrame() constructor function. By default, Pandas will create a DataFrame with one column for each key in the dictionary, with the column name corresponding to the key name.

    import pandas as pd
    
    # create a dictionary with key-value pairs
    data = {'name': ['Alice', 'Bob', 'Charlie'],
            'age': [25, 30, 35],
            'city': ['New York', 'Los Angeles', 'Chicago']}
    
    # create a DataFrame from the dictionary
    df = pd.DataFrame(data)
    
    # print the DataFrame
    print(df)
    

    This code creates a DataFrame with three columns corresponding to the keys in the dictionary: ‘name’, ‘age’, and ‘city’.

    The values in each column correspond to the values in the corresponding dictionary values:

          name  age         city
    0    Alice   25     New York
    1      Bob   30  Los Angeles
    2  Charlie   35      Chicago

    Add data

    to empty DataFrame

    To add data to an empty DataFrame, you can use the pd.concat() function, which concatenates DataFrame objects along rows.

    NOTE:
    The df.append() method you may see in older tutorials was deprecated in pandas 1.4 and removed in pandas 2.0. The recommended way to add rows is to wrap them in a DataFrame and concatenate it with pd.concat().


    Let’s assume we have an empty DataFrame:

    import pandas as pd

    # create an empty DataFrame
    df = pd.DataFrame()

    We can add some data to the DataFrame using the pd.concat() function:

    # wrap the new rows in a DataFrame and concatenate it with df
    new_rows = pd.DataFrame([{'name': 'Alice', 'age': 25, 'city': 'New York'},
                             {'name': 'Bob', 'age': 30, 'city': 'Los Angeles'},
                             {'name': 'Charlie', 'age': 35, 'city': 'Chicago'}])
    df = pd.concat([df, new_rows], ignore_index=True)
    
    # print the DataFrame
    print(df)
    
    WARNING:
    The ignore_index=True argument tells Pandas to ignore any existing index labels on the DataFrame, and instead generate new integer index labels starting from 0.

    The print() function displays the DataFrame with the added data:

          name  age         city
    0    Alice   25     New York
    1      Bob   30  Los Angeles
    2  Charlie   35      Chicago

    new row to existing DataFrame

    add a new row with data
    To add new data as a new row to an existing DataFrame in Pandas, you can wrap the row in a one-row DataFrame and concatenate it with the pd.concat() function.
    (Let’s use the df DataFrame object from the previous example.)

    # add new data to the DataFrame
    df = pd.concat([df, pd.DataFrame([{'name': 'David', 'age': 28, 'city': 'Miami'}])], ignore_index=True)
    df = pd.concat([df, pd.DataFrame([{'name': 'Emily', 'age': 27, 'city': 'San Francisco'}])], ignore_index=True)
    
    print(df)
    

    We just added two more rows of data to the DataFrame using the pd.concat() function. The ignore_index=True parameter ensures that the new rows are added with consecutive row numbers.

    The output of this code will be a DataFrame with five rows of data, including the new rows added at the end:

          name  age           city
    0    Alice   25       New York
    1      Bob   30    Los Angeles
    2  Charlie   35        Chicago
    3    David   28          Miami
    4    Emily   27  San Francisco

    add a new empty row
    In Pandas, you can add a new empty row to a DataFrame by concatenating a one-row DataFrame created from an empty dictionary.

    # add a new empty row to the DataFrame
    df = pd.concat([df, pd.DataFrame([{}])], ignore_index=True)
    
    print(df)
    

    In this example, we added a new empty row to the DataFrame by concatenating a DataFrame built from a single empty dictionary, setting the ignore_index parameter to True to ensure that the index of the new row is automatically assigned.

    The output of this code will be a DataFrame with six rows and three columns, including the new empty row:

          name  age           city
    0    Alice   25       New York
    1      Bob   30    Los Angeles
    2  Charlie   35        Chicago
    3    David   28          Miami
    4    Emily   27  San Francisco
    5      NaN  NaN            NaN

    WARNING:
    Note that the values in the new empty row are represented as NaN (Not a Number), which is the default value for missing or undefined data in Pandas.

    new column to existing DataFrame

    add a column with values
    To add a new column to an existing DataFrame in Pandas, you can simply assign a new column with a name and values to the DataFrame.

    Here’s an example of adding a new column called ‘gender’ to the DataFrame from the previous example:

    # add a new column to the DataFrame
    df['gender'] = pd.Series(['F', 'M', 'M', 'M', 'F'])
    
    print(df)
    

    In this example, we added a new column called ‘gender’ to the DataFrame by assigning a Pandas Series to it using the square bracket notation. The Series is aligned with the DataFrame by index, so rows without a matching value (here, the empty row with index 5) are filled with NaN. Note that if you assign a plain list instead, its length must exactly match the number of rows in the DataFrame.

    The output of this code will be a DataFrame with four columns, including the new ‘gender’ column:

          name  age           city gender
    0    Alice   25       New York      F
    1      Bob   30    Los Angeles      M
    2  Charlie   35        Chicago      M
    3    David   28          Miami      M
    4    Emily   27  San Francisco      F
    5      NaN  NaN            NaN    NaN

    add an empty column
    To add an empty column to an existing DataFrame in Pandas, you can simply assign an empty Series with a name to the DataFrame.

    Here’s an example of adding an empty column called ‘income’ to the DataFrame from the previous example:

    # add an empty column to the DataFrame
    df['income'] = pd.Series(dtype=float)
    
    print(df)
    

    In this example, we added an empty column called ‘income’ to the DataFrame by assigning an empty Pandas Series with a data type (dtype) of float to it using the square bracket notation.

    The output of this code will be a DataFrame with four columns, including the new empty ‘income’ column:

          name  age           city gender  income
    0    Alice   25       New York      F     NaN
    1      Bob   30    Los Angeles      M     NaN
    2  Charlie   35        Chicago      M     NaN
    3    David   28          Miami      M     NaN
    4    Emily   27  San Francisco      F     NaN
    5      NaN  NaN            NaN    NaN     NaN

    WARNING:
    Note that the values in the new empty column are represented as NaN (Not a Number), which is the default value for missing or undefined data in Pandas.

    Access data

    in the Series

    Let’s assume we have a Pandas Series:

    import pandas as pd

    # create a Pandas Series with four values
    s = pd.Series([25, 30, 35, 40], index=['John', 'Jane', 'Jack', 'Jill'])

    We can access the values and index labels of the Series using indexing:

    # access the value for 'Jane'
    print(s['Jane'])
    

    OUTPUT: 30

    # access the values for 'Jane' and 'Jack'
    print(s[['Jane', 'Jack']])
    

    OUTPUT:
    Jane    30
    Jack    35
    dtype: int64

    # access the values greater than 30
    print(s[s > 30])
    

    OUTPUT:
    Jack    35
    Jill    40
    dtype: int64

    in the DataFrame

    Let’s assume we have a Pandas DataFrame:

    import pandas as pd

    # create the DataFrame
    df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'], 'age': [25, 30, 35], 'city': ['New York', 'Los Angeles', 'Chicago']})

    The print() function displays the DataFrame:

          name  age         city
    0    Alice   25     New York
    1      Bob   30  Los Angeles
    2  Charlie   35      Chicago

    In Pandas, you can access data in a DataFrame using indexing and slicing. Here are some examples of how to access data in the DataFrame df:

    A. Access a single column by name:

    series_col = df['name']
    

    This returns a Pandas Series object containing the values in the ‘name’ column.

    B. Access multiple columns by name:

    new_df_col = df[['name', 'age']]
    

    This returns a new DataFrame containing only the ‘name’ and ‘age’ columns.

    C. Access a single row by index:

    series_row = df.loc[0]
    

    This returns a Series object containing the values in the first row of the DataFrame (index 0).

    D. Access multiple rows by index:

    new_df_row = df.loc[0:1]
    

    This returns a new DataFrame containing the first two rows of the DataFrame (indexes 0 and 1).

    E. Access a single value by row and column index:

    this_value = df.loc[0, 'name']
    

    This returns the value in the first row and ‘name’ column of the DataFrame (which is ‘Alice’ in this example).

    F. Access a subset of rows and columns using boolean indexing:

    new_df_subset = df[(df['age'] > 25) & (df['city'] == 'New York')][['name', 'age']]
    

    This returns a new DataFrame containing only the ‘name’ and ‘age’ columns for the rows where the age is greater than 25 and the city is ‘New York’.

    PRO TIP:
    Note that there are many other ways to access data in a DataFrame in Pandas, such as:
  • label-based indexing with the .loc[] method,
  • integer position-based indexing with the .iloc[] method,
  • fast scalar access with the .at[] and .iat[] accessors.

    The best method to use depends on the specific task at hand.
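As a quick illustration of position-based access, here is a sketch using .iloc[] on the same df as above (rebuilt here so the snippet is self-contained):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'],
                   'age': [25, 30, 35],
                   'city': ['New York', 'Los Angeles', 'Chicago']})

# second row (position 1) as a Series
print(df.iloc[1])

# value at row position 2, column position 0
print(df.iloc[2, 0])   # Charlie
```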

    Learn more about Pandas indexing and selecting data from the official documentation.

    Load data from a file

    Pandas provides functions for reading data from a variety of sources, including:

    • CSV files,             pd.read_csv()
    • Excel spreadsheets,    pd.read_excel()
    • SQL databases,         pd.read_sql()
    • and more.

    CSV: read_csv()

    You can use the pd.read_csv() function to read data from a CSV file, for example:

    import pandas as pd
    
    # Load data from a CSV file
    df = pd.read_csv('data.csv')
    

    In this example, we’re using Pandas to load data from a CSV file called “data.csv” and store it in a DataFrame called df. The pd.read_csv() function is a convenient way to read in data from a CSV file and convert it to a DataFrame object.
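pd.read_csv() also accepts many optional parameters, e.g. sep and usecols. Here is a self-contained sketch that reads an in-memory CSV string (via io.StringIO) instead of a file, with made-up column names:

```python
import io
import pandas as pd

csv_text = "name;age;city\nAlice;25;New York\nBob;30;Los Angeles\n"

# sep sets the delimiter; usecols keeps only the selected columns
df = pd.read_csv(io.StringIO(csv_text), sep=';', usecols=['name', 'age'])
print(df)
```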

    EXCEL/XLSX: read_excel()

    To load data from an Excel file, you can use the read_excel() function in Pandas.

    import pandas as pd
    
    # Load data from an Excel file
    df = pd.read_excel('data.xlsx')
    

    In this example, we are loading data from an Excel file named “data.xlsx”. The resulting data is stored in a Pandas DataFrame named df.

    SQL: read_sql()

    To load data from a SQL database, you can use the read_sql() function in Pandas.

    import pandas as pd
    import sqlite3
    
    # Connect to a SQLite database
    conn = sqlite3.connect('data.db')
    
    # Load data from a SQL query
    df = pd.read_sql('SELECT * FROM table_name', conn)
    
    # Close the database connection
    conn.close()
    

    In this example, we are connecting to a SQLite database named “data.db”. We then use the read_sql() function to load data from a SQL query. The resulting data is stored in a Pandas DataFrame named df. Finally, we close the database connection.
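The same pattern can be tried without a database file by using an in-memory SQLite database (the table and column names below are made up for the demo):

```python
import sqlite3
import pandas as pd

# create an in-memory SQLite database with one table
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE people (name TEXT, age INTEGER)')
conn.executemany('INSERT INTO people VALUES (?, ?)',
                 [('Alice', 25), ('Bob', 30)])
conn.commit()

# load the table into a DataFrame
df = pd.read_sql('SELECT * FROM people', conn)
conn.close()

print(df)
```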

    JSON: read_json()

    To load data from a JSON file, you can use the read_json() function in Pandas.

    import pandas as pd
    
    # Load data from a JSON file
    df = pd.read_json('data.json')
    

    In this example, we are loading data from a JSON file named “data.json”. The resulting data is stored in a Pandas DataFrame named df.

    HTML: read_html()

    To load data from an HTML file or webpage, you can use the read_html() function in Pandas.

    import pandas as pd
    
    # Load data from an HTML table
    tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)')
    
    # Select the table you want from the list of tables
    df = tables[0]
    

    In this example, we are loading data from an HTML table on the Wikipedia page for a list of countries by population. The read_html() function returns a list of all tables found on the page. We then select the table we want by indexing the list. The resulting data is stored in a Pandas DataFrame named df.

    OTHER FORMATS I/O Cheat Sheet

    The complete list of file formats supported by Pandas can be found in the official documentation under the IO Tools section: IO tools (text, CSV, HDF5, …) ⤴. This page provides detailed information on how to read and write data from various file formats, including CSV, Excel, SQL, JSON, HTML, HDF5, Parquet, Feather, Msgpack, and many others. Each format has its own section with examples and explanations.

    File Format              Reader            Writer
    CSV                      read_csv()        to_csv()
    Fixed-Width Text File    read_fwf()
    JSON                     read_json()       to_json()
    HTML                     read_html()       to_html()
    LaTeX                                      Styler.to_latex()
    XML                      read_xml()        to_xml()
    Local clipboard          read_clipboard()  to_clipboard()
    MS Excel (.xlsx)         read_excel()      to_excel()
    HDF5 Format              read_hdf()        to_hdf()
    Feather Format           read_feather()    to_feather()
    Parquet Format           read_parquet()    to_parquet()
    ORC Format               read_orc()        to_orc()
    Stata                    read_stata()      to_stata()
    SAS                      read_sas()
    SPSS                     read_spss()
    Python Pickle Format     read_pickle()     to_pickle()
    SQL                      read_sql()        to_sql()
    Google BigQuery          read_gbq()        to_gbq()
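Each reader in the table has a matching writer; for example, to_csv() writes a DataFrame out and read_csv() loads it back. A round-trip sketch using a temporary file:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})

# write the DataFrame to CSV, then read it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'data.csv')
    df.to_csv(path, index=False)   # index=False skips writing the row labels
    restored = pd.read_csv(path)

print(restored.equals(df))   # True
```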

    Explore data

    Once you’ve loaded your data into a Pandas DataFrame, you can use various functions to explore the data and get a sense of its structure and characteristics. For example, you can use the following functions:

    • df.head() to display the first few rows of a DataFrame,
    • df.describe() to get summary statistics for numerical columns,
    • df.info() to display column names and data types
    # Display the first few rows of the DataFrame
    print(df.head())
    
    # Display summary statistics for numerical columns
    print(df.describe())
    
    # Display column names and data types
    print(df.info())
    
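On a concrete DataFrame, these exploration functions behave like this (a small made-up dataset):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'],
                   'age': [25, 30, 35]})

print(df.head())       # first rows of the DataFrame
print(df.describe())   # count, mean, std, min, quartiles, max for 'age'
print(df.shape)        # (rows, columns)
```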

    Clean data

    Pandas provides functions for cleaning and transforming data, such as:

    • df.drop_duplicates() to remove duplicate rows from a DataFrame,
    • df.fillna() to handle missing values (e.g., to fill in missing values with a specified value or method),
    • converting data types.
    # Remove duplicate rows
    df = df.drop_duplicates()
    
    # Fill in missing values with the mean value of the column
    df['column_name'] = df['column_name'].fillna(df['column_name'].mean())
    
    # Convert a string column to a datetime column
    df['date_column'] = pd.to_datetime(df['date_column'], format='%Y-%m-%d')
    
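Put together on a small made-up dataset, the cleaning steps above look like this:

```python
import pandas as pd

df = pd.DataFrame({'value': [1.0, 1.0, None, 4.0],
                   'date_column': ['2023-01-01', '2023-01-01',
                                   '2023-02-01', '2023-03-01']})

# remove duplicate rows (row 1 duplicates row 0)
df = df.drop_duplicates()

# fill the missing value with the column mean: (1.0 + 4.0) / 2 = 2.5
df['value'] = df['value'].fillna(df['value'].mean())

# convert the string column to a datetime column
df['date_column'] = pd.to_datetime(df['date_column'], format='%Y-%m-%d')

print(df)
```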

    Filter data

    You can use Boolean indexing and other functions to filter and subset data based on certain criteria.

    For example, you can use the df[df['column'] > value] syntax to filter rows based on a condition.

    # Filter rows where a column value is greater than a certain value
    filtered_df = df[df['column_name'] > value]
    
    # Filter rows where a column value is in a list of values
    filtered_df = df[df['column_name'].isin(list_of_values)]
    
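With concrete values, the filtering patterns above look like this (made-up data):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'],
                   'age': [25, 30, 35]})

# rows where age is greater than 28
older = df[df['age'] > 28]
print(older['name'].tolist())    # ['Bob', 'Charlie']

# rows where name is in a list of values
subset = df[df['name'].isin(['Alice', 'Charlie'])]
print(subset['name'].tolist())   # ['Alice', 'Charlie']
```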

    Aggregate data

    Pandas provides functions for grouping and aggregating data, such as calculating the mean, sum, or count of values in a column or group.

    For example, you can use the df.groupby() function to group data by one or more columns, and then use an aggregation function such as mean() or sum() to calculate summary statistics for each group.

    # Group data by a column and calculate the mean of another column
    grouped_df = df.groupby('group_column')['value_column'].mean()
    
    # Group data by multiple columns and calculate the sum of another column
    grouped_df = df.groupby(['group_column_1', 'group_column_2'])['value_column'].sum()
    
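A concrete run of the grouping pattern above (made-up data):

```python
import pandas as pd

df = pd.DataFrame({'city': ['London', 'London', 'Paris'],
                   'sales': [10, 20, 30]})

# mean sales per city: London -> (10 + 20) / 2 = 15, Paris -> 30
means = df.groupby('city')['sales'].mean()
print(means)
```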

    Further Reading

