Viewing text files using UNIX commands

Introduction

In research, results are often presented in various formats such as simple text files (tab or comma-separated), PDFs, or occasionally image files like PNGs or JPEGs. This tutorial will demonstrate how to view these files directly on your local or remote system without the need to download them.

Why is it important to view files directly in the UNIX shell?

  • Viewing files directly on UNIX systems allows for immediate access to data without the need for additional software, streamlining the workflow and saving valuable time.

  • Direct file access on UNIX lets you quickly preview, search, and manage large datasets, which is crucial for data-intensive fields like research and development.

  • Utilizing UNIX commands for file viewing can enhance security by reducing the need to transfer data to different devices or systems, minimizing the risk of data breaches.

CLI setup

To access the command-line interface (CLI), you will need a system running a Unix-like terminal or a command prompt (on Windows). If you’re unfamiliar with how to access or use the CLI on your system, please refer to the practical tutorial Open Terminal Window for setup and instructions.

Viewing text files

In this section, we’ll explore various UNIX commands that allow you to view and manage text files directly in the shell. These commands are essential for quickly accessing and reading the contents of files without opening them in a text editor. There are many commands for this purpose such as:

| Command | Purpose | Example | Use cases |
|---------|---------|---------|-----------|
| less | view file pagewise with more options | less filename | use -S to avoid line wrapping; you can use arrow keys to scroll |
| more | view file pagewise | more filename | use less instead |
| cat | concatenates files and prints their contents | cat filename | you can send the contents of a file to the clipboard or to another file using this command |
| tac | reverse of cat, reverses the order of lines | tac filename | pipe this to the less command to scroll through the file in reverse |
| head | view first few lines of a file | head filename | use the -n option to change the number of lines displayed; -n 20 displays 20 lines |
| tail | view last few lines of a file | tail filename | use the -n option to change the number of lines displayed; -n 20 displays 20 lines |
| od | octal dump of a file | od filename | use the -c option and pipe it through less; this shows non-printable characters (like tab, whitespace, newline etc.) |

Text files in UNIX often come in formats like plain text or delimited text such as CSV (comma-separated values) and TSV (tab-separated values). Delimited files are widely used for storing large datasets in a format that’s easily readable both by humans and software. Unstructured text files, such as plain text documents, don’t follow a specific format and are often used for notes, logs or written content without any strict data schema. They typically use the TXT file extension.

Each of these text file formats serves a different purpose, but all of them can be easily managed and viewed using UNIX commands, allowing for flexible data handling across various applications.

  • CSV (Comma-Separated Values): data.csv
    Name,Age,Department
    John Doe,29,Marketing
    Jane Smith,34,IT
    Tom Johnson,28,Operations

  • TSV (Tab-Separated Values): data.tsv
    Name         Age  Department
    John Doe     29   Marketing
    Jane Smith   34   IT
    Tom Johnson  28   Operations

  • TXT (Plain Text): data.txt
    My name is John Doe. I'm 29, and work in Marketing Department.
    Her name is Jane Smith. She is 34 and works in the IT Department.
    His name is Tom Johnson, age 28. He works in the Operations Department.
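
To follow along, the three sample files can be recreated from the shell. The sketch below is one possible way to do it, assuming a POSIX-style shell where printf, od and mktemp are available. The scratch directory stored in demo_dir is a hypothetical name chosen for this illustration and is not part of the original tutorial.

```shell
# Work in a scratch directory so no existing files are overwritten.
# (demo_dir is a hypothetical name used only for this illustration.)
demo_dir=$(mktemp -d)
cd "$demo_dir"

# printf turns \n into a newline and \t into a tab character.
printf 'Name,Age,Department\nJohn Doe,29,Marketing\nJane Smith,34,IT\nTom Johnson,28,Operations\n' > data.csv
printf 'Name\tAge\tDepartment\nJohn Doe\t29\tMarketing\nJane Smith\t34\tIT\nTom Johnson\t28\tOperations\n' > data.tsv
printf '%s\n' \
  "My name is John Doe. I'm 29, and work in Marketing Department." \
  "Her name is Jane Smith. She is 34 and works in the IT Department." \
  "His name is Tom Johnson, age 28. He works in the Operations Department." > data.txt

# od -c makes the delimiters visible: tabs appear as \t and newlines as \n.
od -c data.tsv | head -n 3
```

Checking a file with od -c is a quick way to confirm that a TSV file really uses tabs rather than runs of spaces.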

CAT or TAC

Purpose: Display the entire content of a file on the screen.

  • cat - from the beginning (first line of the file at the top)
  • tac - from the end (showing the last line first)

Syntax: cat filename.txt or tac filename.txt

cat data.csv
Name,Age,Department
John Doe,29,Marketing
Jane Smith,34,IT
Tom Johnson,28,Operations

tac data.csv
Tom Johnson,28,Operations
Jane Smith,34,IT
John Doe,29,Marketing
Name,Age,Department

Both commands print the entire file to the terminal at once. The output stays in the terminal's scrollback until you clear the screen or close the session.
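
The overview table notes that cat can also send the contents of a file elsewhere. The following is a minimal sketch of that idea, assuming a cat implementation that supports the -n option (GNU and BSD versions do). The scratch directory and the file names copy.csv and combined.csv are made up for this illustration.

```shell
# Recreate the sample file in a scratch directory (names are illustrative).
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf 'Name,Age,Department\nJohn Doe,29,Marketing\nJane Smith,34,IT\nTom Johnson,28,Operations\n' > data.csv

# Number every output line.
cat -n data.csv

# Redirect the output to copy the contents into another file.
cat data.csv > copy.csv

# Concatenate several files into one; here the same content is joined twice.
cat data.csv copy.csv > combined.csv
wc -l < combined.csv
```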

MORE or LESS

Purpose: View the paginated content of a file one screen at a time.

  • more - primarily allows forward movement through the file
  • less - allows backward and forward movement

Syntax: more filename.txt or less filename.txt

more data.csv
Name,Age,Department
John Doe,29,Marketing
Jane Smith,34,IT
Tom Johnson,28,Operations

less data.csv
Name,Age,Department
John Doe,29,Marketing
Jane Smith,34,IT
Tom Johnson,28,Operations

The file content stays on screen until you exit the pager. To close the preview, press q on your keyboard.

Enhance File Viewing with less:

  • Use G on your keyboard to jump to the end of the file.
  • Use g on your keyboard to go back to the start.
  • Use /pattern to search for a “pattern” downwards.
  • Use ?pattern to search for a pattern upwards.

HEAD or TAIL

Purpose: Display the beginning or end of a file on the terminal screen.

  • head - shows the first few lines of a file
  • tail - shows the last few lines

Syntax: head -n number filename.txt or tail -n number filename.txt

head -n 2 data.csv
Name,Age,Department
John Doe,29,Marketing

tail -n 2 data.csv
Jane Smith,34,IT
Tom Johnson,28,Operations

The output stays in the terminal until you clear the screen or close the session.
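
As a small illustration of the -n option, the sketch below shows the header, the last record, and a single line taken from the middle of the file. It assumes a POSIX-style shell. The scratch directory and the variable third_line are hypothetical names used only here, and the last step relies on the pipe symbol (|) to pass the output of head to tail.

```shell
# Recreate the sample file in a scratch directory (names are illustrative).
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf 'Name,Age,Department\nJohn Doe,29,Marketing\nJane Smith,34,IT\nTom Johnson,28,Operations\n' > data.csv

# Show only the header line.
head -n 1 data.csv

# Show only the last record.
tail -n 1 data.csv

# Print line 3 only: take the first three lines, then keep the last of those.
third_line=$(head -n 3 data.csv | tail -n 1)
echo "$third_line"
```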

Searching in text files

Searching within text files is a powerful way to quickly locate specific data or patterns, enhancing your ability to analyze and manage large volumes of text efficiently. It is possible to perform these searches using the grep command, which allows you to filter and display lines in text files that match a specified pattern.

Purpose: Search for specific patterns within a file and display all the matching lines.
Syntax: grep 'pattern' filename.txt

To find all occurrences of the word “name” in data.txt, enter:

grep 'name' data.txt
My name is John Doe. I'm 29, and work in Marketing Department.
Her name is Jane Smith. She is 34 and works in the IT Department.
His name is Tom Johnson, age 28. He works in the Operations Department.
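
A few commonly used grep options can refine such a search. The sketch below is only an illustration on the same sample text, assuming a POSIX-style shell. The scratch directory is a hypothetical location chosen for this example.

```shell
# Recreate the sample file in a scratch directory (names are illustrative).
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf '%s\n' \
  "My name is John Doe. I'm 29, and work in Marketing Department." \
  "Her name is Jane Smith. She is 34 and works in the IT Department." \
  "His name is Tom Johnson, age 28. He works in the Operations Department." > data.txt

# -i ignores case, so the lowercase pattern 'her' also matches "Her".
grep -i 'her' data.txt

# -n prefixes each matching line with its line number.
grep -n 'Smith' data.txt

# -c prints only the number of matching lines.
grep -c 'name' data.txt

# -v inverts the match and prints the lines that do NOT contain the pattern.
grep -v 'John' data.txt
```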

Advanced Tips and Tricks

Combining Commands - Pipes (|) allow you to use the output of one command as the input to another, enhancing the efficiency and functionality of file viewing operations.

Here’s how to apply this technique:

Chaining grep - search for specific information within a file and then refine the search:

grep 'name' data.txt | grep 'John'
My name is John Doe. I'm 29, and work in Marketing Department.
His name is Tom Johnson, age 28. He works in the Operations Department.

This command first extracts lines containing “name” and then narrows down to those that also include “John”.
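
Note that the pattern 'John' also matches the longer word "Johnson", which is why two lines are returned. If only the whole word should match, the -w option can be added. The sketch below assumes a grep that supports -w (GNU and BSD versions do) and uses a hypothetical scratch directory.

```shell
# Recreate the sample file in a scratch directory (names are illustrative).
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf '%s\n' \
  "My name is John Doe. I'm 29, and work in Marketing Department." \
  "Her name is Jane Smith. She is 34 and works in the IT Department." \
  "His name is Tom Johnson, age 28. He works in the Operations Department." > data.txt

# Without -w, 'John' matches both "John" and "Johnson"; -c counts the lines.
grep 'name' data.txt | grep -c 'John'

# With -w, only the standalone word "John" matches.
grep 'name' data.txt | grep -w 'John'
```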

Skipping the header line - useful for data files with a header row, or any other time you need to ignore the first line of a file:

tail -n +2 data.tsv
John Doe    29  Marketing
Jane Smith  34  IT
Tom Johnson   28  Operations

This command starts displaying from the second line of the file, effectively skipping the first line.
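
Once the header is skipped, the remaining records can be passed on to other commands. The sketch below is one possible illustration, assuming a POSIX-style shell. It counts the records with wc -l and extracts the age column with cut, which splits on tabs by default. The scratch directory and the variable record_count are hypothetical names.

```shell
# Recreate the sample file in a scratch directory (names are illustrative).
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf 'Name\tAge\tDepartment\nJohn Doe\t29\tMarketing\nJane Smith\t34\tIT\nTom Johnson\t28\tOperations\n' > data.tsv

# Count the data records, excluding the header line.
record_count=$(tail -n +2 data.tsv | wc -l)
echo "Records: $record_count"

# Print only the second tab-separated column (Age) of the data records.
tail -n +2 data.tsv | cut -f 2
```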

Combining sort and/or uniq - to view the entries of a text file in sorted order (and, with uniq, without duplicates):

tail -n +2 data.tsv | sort -nk3
Tom Johnson   28  Operations
John Doe    29  Marketing
Jane Smith  34  IT

This displays the file content sorted numerically by age, which sort treats as the third column.

The sort command recognizes both spaces and tabs as delimiters by default when arranging lines of text. It treats contiguous whitespace (spaces or tabs) as a single delimiter, which means it can handle data separated by either or both. In our example TSV file above, the sort command recognizes first name and surname as two separate columns (separated by a single space), while the other columns are separated by a tab. Thus, to sort by age, we used the third column.
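
Relying on whitespace splitting breaks as soon as a name contains more or fewer words. A more robust variant, sketched below under the assumption of a POSIX-style shell, tells sort to split on tabs only, so that age is always the second column. The same pipeline style also shows a basic use of uniq. The scratch directory and the variable tab are hypothetical names introduced for this illustration.

```shell
# Recreate the sample file in a scratch directory (names are illustrative).
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf 'Name\tAge\tDepartment\nJohn Doe\t29\tMarketing\nJane Smith\t34\tIT\nTom Johnson\t28\tOperations\n' > data.tsv

# Store a literal tab character to pass to sort as the field separator.
tab=$(printf '\t')

# -t sets the separator to a tab; -k2,2n sorts numerically on field 2 only.
tail -n +2 data.tsv | sort -t "$tab" -k2,2n

# Count how often each department occurs: cut picks column 3,
# sort groups equal values together, and uniq -c counts each group.
tail -n +2 data.tsv | cut -f 3 | sort | uniq -c
```

With the tab separator, "John Doe" stays a single field, so the column number no longer depends on how many words a name has.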