DataScience Workbook / 07. Data Acquisition and Wrangling / 2. Data Manipulation / 2.1 Manipulating Excel Data Sheets / 2.1.4 Merge Two Spreadsheets using a Common Column


Introduction

Excel is most popular among researchers because of its ease of use and tons of useful features. In most cases scripting is the most efficient way to do these simple operations, but practicality of Excel for researchers and the cryptic scripting commands will always make excel a better choice. Most common case of merging 2 spreadsheets is when users have a list of gene ids and another list of geneids with function. To merge these 2 sheets using the gene-ids, we can use the VLOOKUP function.

Data

Typically, users will have something like this:

Sheet1 : list of gene ids with differential gene expression results

Sheet1

Sheet2: list of gene ids with its annotation information

Sheet2

Now, to add GeneID and GeneName information for the Sheet1 using the information from Sheet2 using Ensembl Gene ID as the common field/column, we can use the VLOOKUP function.

Formula

Type: VLOOKUP on the Sheet1 (E1) cell, you should see typical usage for the command as shown below. It needs 4 pieces of inforamtion:

  • lookup_value : which cell to use for looking up the value? since we need to look up information for Ensembl Gene ID it should be A2 here
  • table_array : where to look up? the entire table where the annotation is stored. This shoudl be the full table in Sheet2 (in this case: Sheet2!$A$2:$C$49672)
  • col_index_num : what cell value to print for matching ids? Since we need GeneID and it is the 2nd column of Sheet2, we should use the value of 2 here
  • range_lookup : do you need an exact match or approximate match? Since each gene id is unique, we need exact match, so we fill FALSE here

The compelte formula looks like this:

=VLOOKUP(A2,Sheet2!$A$2:$C$49672,2,FALSE)

formula for geneid

Note the $ for both rows and columns of the table array, this prevetns the excel from incrementing when the formula is dragged to other cells. This is because we want to keep out table_array fixed, regardless where we use the formula in the Sheet1

Next, we need to drag this formula down using the + sign that appears on the lower right of the cell containing formula. You can also double click on it to automatically fill the formula for you. It should correctly fill in the GeneID column for you.

click and drag

For filling out the GeneName follow the same steps, but instead of using col_index_num value of 2, we will use 3 which is for Description in the Sheet2

=VLOOKUP(A2,Sheet2!$A$2:$C$49672,3,FALSE)

formula for function

Again, click and drag or double click the + sign that appears on the lower right of the cell to fill all other cells in that column.

merged spreadsheet

You should now have the complete table with both functions and gene names, now!


Further Reading


Homepage Section Index Previous Next top of page