DataScience Workbook / 07: Data Acquisition and Wrangling / 3. Data wrangling: use ready-made apps / 3.1. Merge files by common column

Introduction

The merge_data.py ⤴ application is written in Python3 and employs an efficient pandas library for operating on a column-like data structure. The application enables the merging of two files by matching columns with filling in the missing data by customized error values.

Merging files by a common column facilitates:
creating a robust dataset from different source files
complementing the features for observables
finding the common part of two data sets
detecting missing data in results

App Features

merging files of the same or different format, i.e., with different column headers or different column order
merging files separated by different delimiters (including .xlsx files)
merging multiple files all at once
keeping only selected columns during the merge (the same or different columns from files)

Algorithm

Schema Notes

Schema	Notes
The figure shows the algorithm of merging two files by common column.	both files (`input` & `merge`) should be a column-like text file (including Excel .xlsx and CSV separated with unified delimiter); various delimiters are accepted both files should include a matching column (with the same values), but the header may be different in each file data (all columns) from the second file (i.e., `merge`) is added to the first file (i.e., `input`) and automatically saved in the `output` file the user can customize the name of the `output` file the user has to provide the indexes of matching columns (numbering starts from 0) or unique headers of columns if some values are missing in the `merge` file, the corresponding fields are filled with the `missing_value` (-9999.99 by default, user can customize it)

The figure shows the algorithm of merging two files by common column.

both files (input & merge) should be a column-like text file (including Excel .xlsx and CSV separated with unified delimiter); various delimiters are accepted

both files should include a matching column (with the same values), but the header may be different in each file

data (all columns) from the second file (i.e., merge) is added to the first file (i.e., input) and automatically saved in the output file

the user can customize the name of the output file

the user has to provide the indexes of matching columns (numbering starts from 0) or unique headers of columns

if some values are missing in the merge file, the corresponding fields are filled with the missing_value (-9999.99 by default, user can customize it)

python3
pandas
openpyxl

Options

help & info arguments:

  -h,         --help                       # show full help message and exit

required arguments:

-i file1,     --data-file-1 file1          # [string] input multi-col file
-m file2,     --data-file-2 file2          # [string] merge multi-col file
-c cols,      --matching-columns cols      # list of the same column of two files, e.g., 0,5 or label1,label2

optional arguments:

-e missing, --error-value missing          # [any] provide custom value for missing data
-o outfile, --output-datafile outfile      # [string] provide custom name for the output data_file
-f format,  --output-format format         # [int] select format for output file: 0 - original, 1 - csv, 2 - xlsx

defaults for optional arguments:

-e -9999.99            # means: all missing data will be replaced with -9999.99
-o 'data_output'       # means: the output will be saved as 'output_data' file
-f 0                   # means: the output will be saved in the original format

Usage (generic)

syntax:

python3 merge_data.py -i file1 -m file2 -c cols
                     [-e missing] [-o outfile] [-f format]
                     [-h]

arguments provided in square brackets [] are optional

example usage with minimal required options:

python3 merge_data.py -i input_file -m merge_file -c 1,5

The example parses two column-like files (input_file and merge_file), where the common column has index = 1 in the first file, and index = 5 in the second file. The missing data will be replaced with the default error value = -9999.99. The merged results will be saved into the default “data_output-$date” file.

Explore more usage examples in the documentation for this app in the data_wranglig repo: merge_data: Example usage ⤴.

Hands-on tutorial

Environment setup

The application is developed in Python3 programming language and requires importing several useful libraries. Thus, to manage dependencies, first you have to set up the Conda environment on your (local or remote) machine. Follow the instructions provided in section Environment setup in the introduction to to get Python installed and create data_wrangling virtual environment using install Conda ⤴.

Once you have Conda and the data_wrangling environment follow further steps below.

Activate existing Conda environment

You do NOT need to create the new environment each time you want to use it. Once created, the env is added to the list of all virtual instances managed by Conda. You can display them with the command:

conda info -e

Conda envs

The selected environment can be activated with the conda activate command, followed by the name of the env:

conda activate data_wrangling

Conda activate

Once the environment is active, you can see its name preceding the prompt.

Install new dependencies within environment

Once environment of your choice is activated, you can install new dependencies required by the merge_data ⤴ application:

pandas ⤴	to create and manage data structure
openpyxl ⤴	to read/write Excel 2010 xlsx/xlsm files

If you use Conda environment management system and followed the instructions provided above in section Environment setup then you are ready to get started with Data Wrangling Apps (because the requirements were installed as basic dependencies in this virtual environment).

Alternatively, if you want to use this app in system-wide mode (without creating a virtual environment), then install the neccessary requirements using the commands:

pip3 install pandas
pip3 install openpyxl

Inputs

Before using the application, make sure your inputs has been properly prepared. First of all, the data in the input file must be organized into columns. The number of columns and rows is arbitrary, including Big Data support (text file size reaching GBs).

data structure in the example CSV input: input0.txt

fruit,size(in),weight(oz)
apple,4,6.8
blueberry,5/8,0.02
orange,4,4.6
lemon,3,3.5
strawberry,3/8,0.4
...

data structure in the example Excel input: data_CN.xlsx

File format

The format of the input file does NOT matter as long as it is a columns-like text file. Text (.txt) files with columns separated by a uniform white character (single space, tabulator), comma-delimited CSV files, or Excel files are preferred, though.

Data delimiter

The data delimiter used does NOT matter, as it will be automatically detected by application. However, it is essential that the column separator is consistent, for example, that it is always a fixed number of spaces ` ` or always a tab, \t.

If the separator is a comma, make sure not to use commas within any data cell. For example, if you have a list of items within a cell, such as [0,1,1] the commas will be interpreted as column separators and will disrupt the data structure. Instead, consider using a different separator, such as a dash [0-1-1] or pipe [0|1|1] or encapsulate the entire list within quotes "[0,1,1]".

Column names

The header is usually the first line of the file and contains the column labels. Naming the columns brings great informational value to the analyzed data. However, the application does NOT require the input file to have a header. If it is in the file it will be detected automatically. Otherwise, you should use the column indexes to indicate the matching columns and columns to be kept.

Remeber that in Python, indexing starts from 0.

         label  position  val-2  val-3  val-4  val-5  val-6  val-7  val-8  val-9    
HiC_scaffold_1       982      0      0      0      0      0      0      0      0
HiC_scaffold_1       983      0      0      0      0      0      0      0      0
HiC_scaffold_1       984      0      0      0      0      0      0      0      0

File content

Running the application requires that you specify the names of input files and indexes (or names) of matching (shared) columns:

-i list of inputs
-c list of columns matching between inputs

Input files

You can specify inputs as a comma-separated list of filenames or a column-like file containing paths & filenames of all inputs.

For example:

-i input0.txt,input1.txt

-i inputs_list

where the input_list is a one-column file with contents like this:

input0.txt
input1.txt
input2.txt

Shared column

The mereg_data app provides -c mcols option which is the required argument for the program. The user should specify the list of matching columns in the input files. In these columns, the values should be identical and unique, while the order can differ to correctly merge data from multiple files.

For example:

if merging files have the same data structure, i.e., the matching column has the same index, you can specify it as follows:

-c 0                    # it means that in all inputs, the matching column has index 0 (i.e., it's the first column in the file)

if merging files have different data structure, i.e., the matching column has different index, you can specify it as follows:

-c 0,1                  # specify indexes of matching columns in the inputs

-c USDA#,ARS_Label      # specify column labels (headers) of matching columns

Usage variations

Besides the required options (-i, -c) for simple merging of multiple files by a shared column, the app provides extensive customization through optional arguments. This allows you to:

select columns to be kept (-k) individually from each input file
provide custom value for missing data (-e)
provide custom name for the merged output (-o)
select format for output file (-f)
select app verbosity level (-v)

The Example usage ⤴ section in the documentation of the app in the data_wrangling ⤴ repo provides practical examples of using these options in various configurations. Please explore the ready-made examples to learn more about specific options.

Case study

In this section of the tutorial, we use the data_merge app to address a real-life problem faced by collaborators who needed to merge their complementary datasets that were labeled differently.

Matching datasets manually can be a time-consuming and error-prone process. It is not a reliable approach due to the risk of human error and the possibility of missing data or overlooking important information. Mistakes can lead to data inconsistencies, inaccuracies, and negatively impact research outcomes.

Software solutions like Python-based data_merge app can be a better approach for matching datasets because they can automate the process, reducing the likelihood of human error and increasing efficiency. It can also ensure consistency and accuracy, identify discrepancies, and be customized to meet specific needs and requirements.

In this Case Study we use the inputs in the Excel format (.xlsx).
Usage examples for different input formats (CSV, TXT) are available in the Example usage ⤴ section in the documentation of the app in the data_wrangling ⤴ repo.

Background story

Two labs, Lab A and Lab B, were collaborating on a research project regarding quality of grains. As part of the project, they used the same experimental samples and conducted different measurements to obtain some numerical data. However, each lab labeled the samples differently, resulting in confusion when it came time to analyze the data. Luckily, the file containing the corresponding pairs of labels to each sample is available.

The general task

ISSUE: Each lab had collected two different kinds of experimental measurements in relation to the different set of labels.

TASK: Match the data and measurements from each lab based on the corresponding labels and create a robust merged dataset.

METHOD: Use a Python-based data_merge app as an automated and flexible approach that saves users time and reduced the risk of human error.

Inputs

Below is the preview of the required data and labels input files.

The data_CN.xlsx and labels.xlsx share the same column with ARS labels, even though in both files this column has a different name (i.e., USDA# and ARS_Label, marked with the green frames on the screenshot below).
The labels file contains an additional column with Sample Num paired with the ARS labels and together they provide a set of matching pairs of labels for the experimental samples.
The other data file: data_NN.xlsx contains measurements collected by the second lab and labeled with the Sample Num identifiers (marked with the blue frames).

As you can see, the number of rows in the labels file is significantly higher than in the data files (marked with the red frames).

Approach

There are two tasks to perform:

Adding matching Sample Num from the labels.xlsx file into the data_CN.xlsx file using matching ARS_Label-USDA# columns.
Transfering data from the data_NN.xlsx file into the data_CN.xlsx file using matching Sample Num column (now available in both files).

Both tasks can be realized using the same data_merge app but in two separate steps.

Solution

STEP 1
Adding matching Sample Num from the labels.xlsx file into the data_CN.xlsx file using matching ARS_Label-USDA# columns.

python3 merge_data.py -i data_CN.xlsx,labels.xlsx -c 0,1 -o output_step1 -f 2 -v 1

OUTPUT: output_step1.xlsx

STEP 2
Transfering data from the data_NN.xlsx file into the data_CN.xlsx file using matching Sample Num column (now available in both files).

python3 merge_data.py -i output_step1.xlsx,data_NN.xlsx -c 11,0 -o output_step2 -f 2 -v 1

OUTPUT: output_step2.xlsx

Merge files by common column

Aleksandra Badaczewska

Introduction

App Features

Algorithm

Options

Usage (generic)

Hands-on tutorial

Environment setup

Inputs

File format

Data delimiter

Column names

File content

Input files

Shared column

Usage variations

Case study

Background story

The general task

Inputs

Approach

Solution

Further Reading