DataScience Workbook / 07. Data Acquisition and Wrangling / 3. Data Wrangling: ready-made apps / 3.1 Merge files by common column (python)
Introduction
The merge_data.py ⤴ application is written in Python3 and employs an efficient pandas library for operating on a column-like data structure. The application enables the merging of two files by matching columns with filling in the missing data by customized error values.
Merging files by a common column facilitates:
- creating a robust dataset from different source files
- complementing the features for observables
- finding the common part of two data sets
- detecting missing data in results
Algorithm
Schema | Notes |
---|---|
![]() The figure shows the algorithm of merging two files by common column. |
|
Requirements
Requirements: python3, pandas, openpyxl
Install Python3 on various operating systems (if you don't have it yet)
sudo apt-get update
sudo apt-get install python3
if not yet, first install Homebrew:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install python3
Please follow the instructions provided at phoenixnap.com .
Install app requirements
pip3 install pandas
pip3 install openpyxl
Options
help & info arguments:
-h, --help # show full help message and exit
required arguments:
-i file1, --data-file-1 file1 # [string] input multi-col file
-m file2, --data-file-2 file2 # [string] merge multi-col file
-c cols, --matching-columns cols # list of the same column of two files, e.g., 0,5 or label1,label2
optional arguments:
-e missing, --error-value missing # [any] provide custom value for missing data
-o outfile, --output-datafile outfile # [string] provide custom name for the output data_file
-f format, --output-format format # [int] select format for output file: 0 - original, 1 - csv, 2 - xlsx
defaults for optional arguments:
-e -9999.99 # means: all missing data will be replaced with -9999.99
-o 'data_output' # means: the output will be saved as 'output_data' file
-f 0 # means: the output will be saved in the original format
Usage (generic)
syntax:
^ arguments provided in square brackets [] are optional
python3 merge_data.py -i file1 -m file2 -c cols
[-e missing] [-o outfile] [-f format]
[-h]
example usage with minimal required options:
python3 merge_data.py -i input_file -m merge_file -c 1,5
The example parses two column-like files (input_file and merge_file), where the common column has index = 1 in the first file, and index = 5 in the second file. The missing data will be replaced with the default error value = -9999.99. The merged results will be saved into the default ‘data_output-$date’ file.