Introduction
Certain analyses can be time-consuming when they run on a single processor, especially when dealing with large datasets. A task is deemed trivially parallelizable if the data can be divided into chunks and processed independently.
Examples of data that can be easily parallelized
- general scenarios:
  - each line of a file that can be processed individually
  - individual records in a database file
  - each section/module/subset of a large dataset
- domain-specific examples:
  - each chromosome of a genome can be processed individually
  - each scaffold of an assembly can be processed individually
  - each image in a collection
  - individual transactions in financial data
Examples of problems that are trivially parallelizable
- zipping or unzipping 10s to 100s of files
- counting the number of lines in a large file
- aligning raw sequencing data files of many samples to a genome
- resizing a large batch of images
- converting multiple video files to different formats
- running simulations with different parameters on separate cores or nodes
Examples of problems that are not trivially parallelizable
- Genome assembly is not trivially parallelizable because the first step requires aligning each read to every other read in order to find which ones are similar and should be joined (assembled). Taking a subset of the reads would result in many small, poor-quality assemblies.
- Training a machine learning model on a single large dataset is not trivially parallelizable because the process typically involves iterating over the entire dataset multiple times to update model parameters, and breaking up the dataset could result in suboptimal model performance due to lack of global context.
- Processing a large database query involving complex joins and aggregations is not trivially parallelizable because the relationships between different tables need to be preserved, and breaking up the query could lead to incorrect results or incomplete analysis.
- Running a complex simulation with interdependent variables is not trivially parallelizable because each step of the simulation relies on the results of previous steps, and parallelizing it could introduce race conditions or synchronization issues.
GNU parallel
for trivially parallelizable problems
The program we use to parallelize bioinformatics problems is GNU parallel. It is “a shell tool for executing jobs in parallel using one or more compute nodes”.
GNU parallel helps you run jobs that you would have otherwise run sequentially one by one or in a loop.
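As a minimal sketch of the difference (the file names below are placeholders), compare a sequential loop with the equivalent parallel call:
# sequential: each file is compressed one after the other
for f in sample1.txt sample2.txt sample3.txt; do gzip "$f"; done
# parallel: the same jobs are spread across the available CPUs
parallel gzip ::: sample1.txt sample2.txt sample3.txt
When no placeholder such as {} is given, parallel simply appends each input item to the command.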
You can check the GNU parallel website ⤴ to determine how to install parallel on your cluster and/or learn how to use it.
On SCINet/Ceres, you can load parallel in version 20230222 (as of April 2024).
On SCINet/Atlas, you can load parallel in version 20200722 (as of April 2024).
Here, we load the module and check the version:
module load parallel
parallel --version
GNU parallel 20230222
Copyright (C) 2007-2023 Ole Tange, http://ole.tange.dk and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.
Web site: https://www.gnu.org/software/parallel
When using programs that use GNU Parallel to process data for publication please cite as described in 'parallel --citation'.
getting started with parallel
Before proceeding with the hands-on example of using GNU Parallel, it’s essential to follow these steps to set up your environment on the HPC cluster.
- To log in to the HPC cluster in a terminal, use
ssh user.name@hostname
with your user credentials and the specific hostname of your HPC infrastructure. Alternatively, use the ODM access.
- Start an interactive session on a computing node using the command:
salloc -N1 -n10 -t 1:00:00 -A <account>
This command requests 1 node with 10 CPUs for 1 hour (see the quick check after this list to confirm what was granted).
NOTE: If you see your granted allocation (e.g., salloc: Nodes Atlas-0025 are ready for job) but are still on the login node, try ssh Atlas-0025, replacing “Atlas-0025” with your granted node.
- Navigate to a working directory with sufficient storage quota using the cd command.
- Finally, check the availability of the parallel module with module avail parallel and load it with module load parallel.
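Once the interactive session starts, a quick sanity check (assuming a Slurm-based cluster such as Ceres or Atlas) confirms that you are on a compute node and shows the CPUs granted to your job:
hostname                   # should print the compute node name, not the login node
echo $SLURM_CPUS_ON_NODE   # number of CPUs Slurm allocated to this job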
Example dataset
We will be using COVID-19 data collated by the New York Times, available in their github repository ⤴.
mkdir GNU-parallel
cd GNU-parallel
wget https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv
--2024-04-26 16:23:24--  https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 104795654 (100M) [text/plain]
Saving to: ‘us-counties.csv’
100%[=======================================================================================>] 104,795,654 54.7MB/s in 1.8s
This file (us-counties.csv) is comma-separated, so let’s convert it to a tab-delimited format:
more us-counties.csv | tr ',' '\t' > us-counties.tab
ls us-counties.csv us-counties.tab
As you can see, this data contains the county and state information about the pandemic over time.
head us-counties.tab
date        county      state       fips   cases  deaths
2020-01-21  Snohomish   Washington  53061  1      0
2020-01-22  Snohomish   Washington  53061  1      0
2020-01-23  Snohomish   Washington  53061  1      0
2020-01-24  Cook        Illinois    17031  1      0
2020-01-24  Snohomish   Washington  53061  1      0
2020-01-25  Orange      California  06059  1      0
2020-01-25  Cook        Illinois    17031  1      0
2020-01-25  Snohomish   Washington  53061  1      0
2020-01-26  Maricopa    Arizona     04013  1      0
Instead of one large file, let’s split this data by county and state. Using sort and awk, we can first sort the file by county/state and then use awk to print each line ($0) to a file named <county>-<state>.tab.
sort -k 2,3 us-counties.tab | awk '{print $0 > $2"-"$3".tab"}'
What does the script do?
This command pipeline performs two operations.
First, it sorts the lines of the file us-counties.tab by fields 2 (county) and 3 (state) using the sort command. Then, it uses awk to print each line into a file named after the concatenation of the values in fields 2 and 3, with a .tab extension.
head Adair-Iowa.tab
2020-03-17  Adair  Iowa  19001  1  0
2020-03-18  Adair  Iowa  19001  1  0
2020-03-19  Adair  Iowa  19001  1  0
2020-03-20  Adair  Iowa  19001  1  0
2020-03-21  Adair  Iowa  19001  1  0
2020-03-22  Adair  Iowa  19001  1  0
2020-03-23  Adair  Iowa  19001  1  0
2020-03-24  Adair  Iowa  19001  1  0
2020-03-25  Adair  Iowa  19001  1  0
2020-03-26  Adair  Iowa  19001  1  0
This will generate 3241 county-state files, plus the two files we already had (us-counties.csv and us-counties.tab):
ls | wc
3243 3243 62968
GNU parallel examples
Gzipping N text files using gzip
Learning objectives
- How to use parallel with gzip to zip many files simultaneously?
- The anatomy of a parallel command.
Let’s make a copy of the data and compare how long it takes to run gzip using a for loop versus using parallel.
mkdir -p gzip/parallel
mkdir -p gzip/forloop
cp *.tab gzip/parallel
cp *.tab gzip/forloop
Gzip using a for loop
We can do this using a for loop as follows:
your expected location on the HPC: GNU-parallel/gzip/forloop
cd gzip/forloop
time for f in *.tab; do gzip $f; done
real    0m15.801s
user    0m1.414s
sys     0m5.045s
Gzip using parallel
However, we can make better use of all the available CPUs by using GNU parallel. The anatomy of the command is as follows:
| part | description |
|------|-------------|
| parallel | the command keyword |
| -j10 | the number of jobs (CPUs) to use for processing; here we are using 10 CPUs |
| "command" | in this case gzip {}, where {} is a placeholder for each item in the list defined after the delimiter |
| ::: | the delimiter |
| *.tab | a wildcard pattern that matches all filenames in the current directory with a .tab extension |
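If you are unsure what parallel will actually execute, the --dry-run option prints the generated commands without running them; for example, from the gzip/parallel directory used in the next step:
# print (but do not run) one gzip command per .tab file; head shows only the first few
parallel --dry-run -j10 "gzip {}" ::: *.tab | head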
cd ../parallel
time parallel -j10 "gzip {}" ::: *.tab
real    0m6.405s
user    0m6.872s
sys     0m11.623s
As you can see, this sped up the gzipping by a factor of roughly 2.5 (15.8 s vs. 6.4 s). This may vary depending on the number of CPUs you have and their speed.
Here is how you can unzip all the files with the same parallel speedup; it takes about the same amount of time:
time parallel -j10 "gunzip {}" ::: *.tab.gz
real    0m5.519s
user    0m0.376s
sys     0m1.367s
listing all the files using ls
Learning objectives
- parallel has a time overhead when it starts and as it runs
ls is a simple command and shouldn’t take very long to run in parallel.
cd ../..
time parallel -j10 "ls {}" ::: *.tab > /dev/null
Interestingly, running a simple ls command on each file with parallel takes about the same amount of time as the gzip run above:
real    0m5.955s
user    0m6.713s
sys     0m10.994s
Above, I am redirecting the output to the null device, so it doesn’t print the output to standard output (i.e., terminal console).
However, if we add the -X parameter, it finishes in a fraction of the time.
time parallel -X -j10 "ls {}" ::: *.tab > /dev/null
real    0m0.333s
user    0m0.210s
sys     0m0.214s
This is because parallel has some overhead for each job: it starts a new shell and opens files for output buffering. The -X option packs as many arguments as possible into each command line (similar to xargs), so far fewer jobs are started.
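A quick way to see what -X changes (an extra check, not part of the original timing) is to count how many commands parallel would launch with and without it using --dry-run:
parallel --dry-run -j10 "ls {}" ::: *.tab | wc -l      # one ls per file: thousands of commands
parallel --dry-run -X -j10 "ls {}" ::: *.tab | wc -l   # a handful of ls commands, each given many files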
As you can see below, using parallel is actually slower than just running ls *.tab directly.
time ls *.tab > /dev/null
real    0m0.277s
user    0m0.017s
sys     0m0.079s
So, parallel is great when you have lots of data spread across lots of files, or when a task takes a really long time on a single processor, but the overhead of the command will make it slower if the dataset is too small.
Unfortunately, repeating the gzipping example from above using this parameter does not give as significant an improvement as the ls example.
time parallel -X -j10 "gzip {}" ::: *.tab
real    0m5.438s
user    0m0.492s
sys     0m1.398s
Some of the difference may be due to I/O speed when writing to a JBOD on your HPC. You may get improved results if you run this example on local scratch space on a node, or test it on your local laptop.
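As a hedged sketch, if your cluster provides node-local scratch space exposed through a variable such as $TMPDIR (this path is an assumption; check your site's documentation), you could copy the uncompressed .tab files there and repeat the timing:
# assumes $TMPDIR points to node-local storage and the .tab files are uncompressed
cp *.tab "$TMPDIR"/ && cd "$TMPDIR"
time parallel -X -j10 "gzip {}" ::: *.tab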
feeding more input files with parallel
Learning objectives
- How to use parallel with programs that take more than one input?
- The :::+ separator for pairing multiple input lists.
A good example of this in bioinformatics is aligning paired-end reads to a genome. Every sample has a forward (left) and reverse (right) read pair.
For this example, we will be aligning several small sample files to a genome using bowtie2.
Let’s download a toy example from Arabidopsis. I grabbed the first 250 sequences from three Arabidopsis samples taken from NCBI’s SRA database.
Download the example raw data
mkdir bowtie
cd bowtie
wget https://datascience.101workbook.org/_collection-base/06-hpc/assets/data/fastqfiles.tar.gz
tar -zxvf fastqfiles.tar.gz
The reads are in a directory named fastqfiles:
Left reads:
fastqfiles/SRR4420293_1.fastq.gz
fastqfiles/SRR4420294_1.fastq.gz
fastqfiles/SRR4420295_1.fastq.gz
Right reads:
fastqfiles/SRR4420293_2.fastq.gz
fastqfiles/SRR4420294_2.fastq.gz
fastqfiles/SRR4420295_2.fastq.gz
Download Arabidopsis genome:
wget https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_chromosome_files/TAIR10_chr_all.fas.gz
gunzip TAIR10_chr_all.fas.gz
Create the bowtie2 alignment database for the Arabidopsis genome
module load bowtie2
# this next command will build a bowtie2 database named tair
bowtie2-build TAIR10_chr_all.fas tair
mkdir bwt_index
mv tair.*bt2 bwt_index
The genome index files (*.bt2) are located in a directory called bwt_index.
We decide to write the output SAM and log files to sam_out:
mkdir sam_out
We can design a bash script with the bowtie2 command for each pair separately. Obviously, if dealing with hundreds of files, this could become cumbersome!
module load bowtie2
bowtie2 --threads 4 -x bwt_index/tair -k1 -q -1 fastqfiles/SRR4420293_1.fastq.gz -2 fastqfiles/SRR4420293_2.fastq.gz -S sam_out/third_R1.sam >& sam_out/third.log
bowtie2 --threads 4 -x bwt_index/tair -k1 -q -1 fastqfiles/SRR4420294_1.fastq.gz -2 fastqfiles/SRR4420294_2.fastq.gz -S sam_out/fourth_R1.sam >& sam_out/fourth.log
bowtie2 --threads 4 -x bwt_index/tair -k1 -q -1 fastqfiles/SRR4420295_1.fastq.gz -2 fastqfiles/SRR4420295_2.fastq.gz -S sam_out/fifth_R1.sam >& sam_out/fifth.log
GNU parallel lets us automate this task by using a combination of substitution and separators, notably ::: and :::+. We can also make optimum use of the available threads.
module load bowtie2
module load parallel
mkdir sam_parallel
time parallel -j2 "bowtie2 --threads 4 -x bwt_index/tair -k1 -q -1 {1} -2 {2} -S sam_parallel/{1/.}.sam >& sam_parallel/{1/.}.log" ::: fastqfiles/*_1.fastq.gz :::+ fastqfiles/*_2.fastq.gz
You may have also noticed some new syntax: {1}, {2}, {1/.} and {2/.}.
- {1} and {2} refer to the first and second input lists (here the left and right read files), taken pair by pair.
- {1/.} and {2/.} remove the directory path and the last extension from the corresponding file name (see the preview below).
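To see how these placeholders expand before running the aligner, you can preview the generated commands with --dry-run (an illustrative check, not part of the original workflow). For the first pair, {1} expands to fastqfiles/SRR4420293_1.fastq.gz, {2} to fastqfiles/SRR4420293_2.fastq.gz, and {1/.} to SRR4420293_1.fastq:
parallel --dry-run -j2 "bowtie2 --threads 4 -x bwt_index/tair -k1 -q -1 {1} -2 {2} -S sam_parallel/{1/.}.sam" ::: fastqfiles/*_1.fastq.gz :::+ fastqfiles/*_2.fastq.gz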
splitting a large task to utilize all CPUs
Let’s consider a large file named test.fa. Our objective is to count the number of lines in this FASTA file.
You can download the example Trinity file using the following commands:
mkdir trinity
cd trinity
wget https://datascience.101workbook.org/_collection-base/06-hpc/assets/data/test.fa
The file should look something like this:
head test.fa
>TRINITY_DN22368_c0_g1_i1 len=236 path=[427:0-235] [-1, 427, -2]
ATTGGTTTTCTACGGAGAGAGAGAAAATGGAGACGGCGAGTGTCTAAAGCTAGAGCTTGT
GTTGGAGAAGGAAACGGAGATTTGCGTAGTAGTGGAAGCTTTAGGTATTTGTTGTGGTTA
CTCACGGCGGCGATATTTGACGGCGGGAGGAGGAAAAGAGAGAGGAAAGAACAGAGGAAG
AAGATGAGAGGAAACATTGAGAGAGAGTGAGAAGGGTTTTGTGATTTTTGTGTCTG
>TRINITY_DN22345_c0_g1_i1 len=615 path=[593:0-614] [-1, 593, -2]
GCCGGATTCAGATACGCAAGGAGAATCTGAGCAGGTCGAATGTTGATGGTATGCTTTCAT
CGGCACTTCCAGGTGGTCAGGAGAAGATCCCCATACGACTGCACTCTCTTTGCTATATGA
TGAAGCAGGAACTGTCACAAGAGGCAGAGAAGTACTGGACTCTGCCATTTGCTCATTTGT
AGCATGATTTCCTTCCCCATTCTCAGTTCCGGGAGTGCAGTGAAAGCAACAATCATTATT
We have reserved a node and are making use of 10 CPUs on Ceres, and we decide to run the wc -l command to find the number of lines. We can also use the Unix command time to see how long the command takes.
time wc -l test.fa
1082567 test.fa
real    0m1.237s
user    0m0.025s
sys     0m0.057s
Now, using parallel, we can take advantage of the 10 CPUs and spread this job over all of them:
parallel -a test.fa --pipepart --block -1 time wc -l
Note:
- -a : input is read from a file
- --pipepart : pipe parts of a physical file to the command
- --block -1 : split the file into (roughly) one block per jobslot
see the expected output
111748   real 0m0.021s  user 0m0.005s  sys 0m0.002s
104450   real 0m0.023s  user 0m0.002s  sys 0m0.005s
111050   real 0m0.026s  user 0m0.002s  sys 0m0.005s
108797   real 0m0.023s  user 0m0.005s  sys 0m0.002s
108114   real 0m0.021s  user 0m0.005s  sys 0m0.002s
109001   real 0m0.021s  user 0m0.003s  sys 0m0.003s
106528   real 0m0.022s  user 0m0.002s  sys 0m0.004s
109069   real 0m0.019s  user 0m0.002s  sys 0m0.005s
104735   real 0m0.020s  user 0m0.003s  sys 0m0.003s
109075   real 0m0.021s  user 0m0.003s  sys 0m0.003s
Notice that less time is required when each block is counted by a separate CPU. In this case, the slowest block takes 0.026 seconds, compared to 1.237 seconds when parallel processing is not utilized.
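If you want to confirm that the per-block counts add up to the single-CPU result, you can sum them with awk (a small verification step, not part of the original exercise); the total should match the 1082567 lines reported by wc -l on the whole file:
parallel -a test.fa --pipepart --block -1 wc -l | awk '{sum += $1} END {print sum}'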
other useful parameters
The --joblog option of the parallel command creates a log file that records information about the jobs executed by parallel. This log file typically includes details such as the command executed, the start and end times of each job, and any errors encountered during execution. It provides a record of the parallel processing tasks for later analysis or reference.
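As a sketch of how this can be used (the log file name here is arbitrary), the gzip example could be run with a job log, and an interrupted run can be picked up again with --resume, which skips jobs already recorded as completed in the log:
parallel --joblog gzip_jobs.log -j10 "gzip {}" ::: *.tab
cat gzip_jobs.log     # one row per job: sequence, host, start time, run time, exit value, command
parallel --resume --joblog gzip_jobs.log -j10 "gzip {}" ::: *.tab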
executing Blast in parallel
See Running NCBI-BLAST jobs in parallel.
Further Reading
Introduction to containers
Apptainer: the container system for secure HPC (Singularity rebranding)
Singularity (archival)
Creating containers using Singularity
Modifying existing containers
Singularity on your Mac via Vagrant
Docker - accelerated container application development
MODULE 07: Data Acquisition and Wrangling