DataScience Workbook / 07. Data Acquisition and Wrangling / 3. Data Wrangling: ready-made apps / 3.1 Merge files by common column (python)


Introduction

The merge_data.py ⤴ application is written in Python3 and employs an efficient pandas library for operating on a column-like data structure. The application enables the merging of two files by matching columns with filling in the missing data by customized error values.

Merging files by a common column facilitates:

  • creating a robust dataset from different source files
  • complementing the features for observables
  • finding the common part of two data sets
  • detecting missing data in results

Algorithm

Schema Notes
Merge data app
The figure shows the algorithm of merging two files by common column.
  • both files should be a column-like text file (including Excel .xlsx format and CSV separated with different delimiters)
  • both files should include a matching column (with the same values), but the header may be different in each file
  • data (all columns) from the second file (i.e., merge_file) is added to the first file (i.e., input_file) and automatically saved in the output file
  • the user can customize the name of the output file
  • the user has to provide the indexes of matching columns (numbering starts from 0) or unique headers of columns
  • if some values are missing in the merge_data file, the corresponding fields are filled with pre-set missing_value (-9999.99 by default, user can customize it)
  • Requirements

    Requirements: python3, pandas, openpyxl

    Install Python3 on various operating systems (if you don't have it yet)
  • Python3 - Ubuntu
    sudo apt-get update
    sudo apt-get install python3

  • Python3 - macOS
    if not yet, first install Homebrew:
    /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

    brew install python3

  • Python3 - Windows
    Please follow the instructions provided at phoenixnap.com .

  • Install app requirements

    pip3 install pandas
    pip3 install openpyxl
    

    Options

    help & info arguments:

      -h,         --help                       # show full help message and exit
    

    required arguments:

    -i file1,     --data-file-1 file1          # [string] input multi-col file
    -m file2,     --data-file-2 file2          # [string] merge multi-col file
    -c cols,      --matching-columns cols      # list of the same column of two files, e.g., 0,5 or label1,label2
    

    optional arguments:

    -e missing, --error-value missing          # [any] provide custom value for missing data
    -o outfile, --output-datafile outfile      # [string] provide custom name for the output data_file
    -f format,  --output-format format         # [int] select format for output file: 0 - original, 1 - csv, 2 - xlsx
    

    defaults for optional arguments:

    -e -9999.99            # means: all missing data will be replaced with -9999.99
    -o 'data_output'       # means: the output will be saved as 'output_data' file
    -f 0                   # means: the output will be saved in the original format
    

    Usage (generic)

    syntax:
    ^ arguments provided in square brackets [] are optional

    python3 merge_data.py -i file1 -m file2 -c cols
                         [-e missing] [-o outfile] [-f format]
                         [-h]
    

    example usage with minimal required options:

    python3 merge_data.py -i input_file -m merge_file -c 1,5
    

    The example parses two column-like files (input_file and merge_file), where the common column has index = 1 in the first file, and index = 5 in the second file. The missing data will be replaced with the default error value = -9999.99. The merged results will be saved into the default ‘data_output-$date’ file.

    Hands-on tutorial

    Environment setup

    Inputs

    Usage variations


    Further Reading


    Homepage Section Index Previous Next top of page