Introduction
The command line interface (CLI) is a powerful environment for text manipulation tasks.
There are a variety of text manipulation programs available in the terminal that allow users to quickly and efficiently process large amounts of text data. These programs offer a wide range of capabilities, including searching for patterns, transforming text, sorting lines, removing duplicates, and counting characters, words, and lines.
In this section, we will discuss some of the most popular and useful command line text manipulation programs, such as:
TOOL | DESCRIPTION | NOTES |
---|---|---|
grep | searches for a specific pattern in text files and outputs matching lines | Tutorial: GREP |
sed | stream editor for filtering and transforming text | Tutorial: SED |
awk | a programming language for processing text data, often used for text manipulation tasks | Tutorial: AWK |
cut | cuts out specific columns or fields from a file | |
sort | sorts lines of text alphabetically or numerically | |
uniq | removes duplicate lines from a file | |
tr | translates or deletes characters from a file | |
wc | counts the number of lines, words, and characters in a file | |
head, tail | output the first or last part of a file, respectively | |
^ Click on the tool name (in the first column) to jump to the cheat sheet.
These tools are often used together in pipelines to perform more complex text manipulations. Understanding text manipulation programs can greatly improve a user’s productivity and efficiency when working with text files, without the need for graphical user interfaces.
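For example, a single pipeline can chain several of these tools together. The minimal sketch below uses a made-up access.log (its name and contents are purely illustrative) just to show how output flows from one program to the next.

```bash
# Create a small sample file (the file name and its contents are hypothetical).
printf '10.0.0.1 GET /index\n10.0.0.2 GET /about\n10.0.0.1 GET /index\n' > access.log

cat access.log |      # read the sample input
  awk '{print $1}' |  # keep only the first column (the address)
  sort |              # group identical values together
  uniq -c |           # collapse duplicates and count them
  sort -rn            # order by count, largest first
# Expected output (spacing may vary):
#   2 10.0.0.1
#   1 10.0.0.2
```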
Why manipulate text files from the command line?
Manipulating text files in the command line is a powerful and efficient method for processing large amounts of text data. It is useful because of:
- SPEED: Command line tools are typically faster than GUI-based text editors when processing large amounts of data.
- AUTOMATION: Command line tools can be automated using shell scripts, allowing you to perform repetitive tasks quickly and efficiently (see the sketch after this list).
- FLEXIBILITY: Command line tools offer a wide range of functionality, making it possible to perform complex text manipulations.
- INTEGRATION: Command line tools can be easily integrated into other programs, allowing you to process text data in a variety of different contexts.
- ACCESSIBILITY: Command line tools allow users to manipulate text files stored on a remote machine without downloading them.
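As a sketch of the AUTOMATION point, the short script below loops over the text files in the current directory and applies the same counting command to each one. The file names and the report name are made up for illustration.

```bash
#!/bin/bash
# Hypothetical batch job: count the lines in every .txt file in the current
# directory and collect the results in a single report.
for f in *.txt; do
  lines=$(wc -l < "$f")        # number of lines in this file
  echo "$f: $lines lines"
done > line_counts.out         # redirect the whole loop's output to one file
```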
When to manipulate text files in the command line?
- DATA PROCESSING: Command line tools are particularly useful when working with large text data or multiple files, as they can process the data much faster than a graphical user interface (GUI) based text editor.
- TEXT MANIPULATION: Command line tools provide a powerful way to change the order or structure of a text file.
- TEXT ANALYSIS: You can use command line tools to extract meaningful information from large amounts of text data.
- SCRIPTING: Command line tools can be used in shell scripts to automate complex text processing tasks.
Cheat Sheet
Below, you can find a cheat sheet for some of the most popular command line text manipulation tools.
GREP - search pattern
SYNTAX: `text_stream | grep OPTIONS PATTERN`
or: `grep OPTIONS PATTERN FILE`
COMMAND SYNTAX | EXAMPLE | TASK |
---|---|---|
grep <PATTERN> <FILE> | grep 'version' file.txt | Search for a pattern in a file. |
grep <PATTERN> <FILE1> <FILE2> | grep 'version' file1.txt file2.txt | Search for a pattern in multiple files. |
grep -r <PATTERN> <DIR> | grep -r 'version' THIS_FOLDER | Search recursively in all files in a directory. |
grep -n <PATTERN> <FILE> | grep -n 'version' file.txt | Show line numbers for matches. |
grep -o <PATTERN> <FILE> | grep -o 'version' file.txt | Show only the matching portion of the line. |
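A minimal grep sketch, assuming a hypothetical notes.txt with the contents created below; the comments show the output these commands would produce.

```bash
# Sample input file (name and contents are made up for illustration).
printf 'tool version 1.2\nrelease notes\nnew version soon\n' > notes.txt

grep -n 'version' notes.txt   # print matching lines with their line numbers
# Expected output:
# 1:tool version 1.2
# 3:new version soon

grep -c 'version' notes.txt   # -c prints only the number of matching lines
# Expected output:
# 2
```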
SED - replace pattern
SYNTAX: `text_stream | sed OPTIONS 's/PATTERN/REPLACEMENT/'`
or: `sed OPTIONS 's/PATTERN/REPLACEMENT/' FILE`
COMMAND SYNTAX | EXAMPLE | TASK |
---|---|---|
sed 's/<PATTERN>/<REPLACEMENT>/g' <FILE> | sed 's/version/V/g' file.txt | Replace all occurrences of a pattern in a file. |
sed 's/<PATTERN>//g' <FILE> | sed 's/version//g' file.txt | Delete all occurrences of a pattern in a file. |
sed 's/<PATTERN>/<REPLACEMENT>/N' <FILE> | sed 's/version/V/2' file.txt | Replace the Nth occurrence of a pattern on each line. |
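A minimal sed sketch; config.txt and its contents are made up for illustration.

```bash
printf 'version: 1.0\nname: demo version\n' > config.txt

sed 's/version/V/g' config.txt   # replace every occurrence; output goes to stdout
# Expected output:
# V: 1.0
# name: demo V

# GNU sed can edit the file in place with -i (BSD/macOS sed requires -i ''):
# sed -i 's/version/V/g' config.txt
```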
AWK - manage order
SYNTAX: `text_stream | awk OPTIONS '{}'`
or: `awk OPTIONS '{}' FILE`
COMMAND SYNTAX | EXAMPLE | TASK |
---|---|---|
awk '{print $1, $3}' <FILE> | awk '{print $1, $3}' file.txt | Print the first and third columns of a file. |
awk 'NF > 3' <FILE> | awk 'NF > 3' file.txt | Print only the lines with more than 3 fields (columns). |
awk '{sum+=$2} END {print sum}' <FILE> | awk '{sum+=$2} END {print sum}' file.txt | Print the sum of all numbers in the second column. |
awk '{printf "%-10s %s\n", $1, $2}' <FILE> | awk '{printf "%-10s %s\n", $1, $2}' file.txt | Format the output (left-align the first column in a 10-character field). |
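A minimal awk sketch; counts.txt and its numbers are made up for illustration.

```bash
printf 'geneA 10\ngeneB 25\ngeneC 5\n' > counts.txt

awk '{sum += $2} END {print "total:", sum}' counts.txt   # sum the second column
# Expected output:
# total: 40
```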
CUT - cut characters
SYNTAX: `text_stream | cut OPTIONS`
or: `cut OPTIONS FILE`
COMMAND SYNTAX | EXAMPLE | TASK |
---|---|---|
cut -f <FIELDS> <FILE> | cut -f 1,3-5 file.txt | Cut out the first and the 3rd to 5th columns from a file (fields are tab-separated by default; use -d to change the delimiter). |
cut -c <RANGE> <FILE> | cut -c 1-3 file.txt | Cut out the first three characters from each line. |
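A minimal cut sketch; samples.csv and its contents are made up for illustration.

```bash
printf 'id,name,score\n1,alpha,90\n2,beta,75\n' > samples.csv

cut -d ',' -f 1,3 samples.csv   # -d sets the delimiter, -f picks the fields
# Expected output:
# id,score
# 1,90
# 2,75
```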
SORT - sort lines
SYNTAX: `text_stream | sort OPTIONS`
or: `sort OPTIONS FILE`
COMMAND SYNTAX | EXAMPLE | TASK |
---|---|---|
sort <FILE> | sort file.txt | Sort the lines of a file. |
sort -r <FILE> | sort -r file.txt | Sort the lines of a file in reverse order. |
sort -k 2 <FILE> | sort -k 2 file.txt | Sort the lines of a file based on the second field (column). |
sort -n <FILE> | sort -n file.txt | Sort the lines of a file numerically. |
…explore the Unix Getting Started tutorial in the section: SORT a file by lines
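A minimal sort sketch; scores.txt and its values are made up for illustration.

```bash
printf 'beta 2\nalpha 10\ngamma 1\n' > scores.txt

sort -k 2 -n scores.txt   # sort numerically by the second column
# Expected output:
# gamma 1
# beta 2
# alpha 10
```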
UNIQ - unique lines
SYNTAX: `text_stream | uniq OPTIONS`
or: `uniq OPTIONS FILE`
COMMAND SYNTAX | EXAMPLE | TASK |
---|---|---|
uniq <FILE> | uniq file.txt | Remove duplicate lines from a file (only adjacent duplicates are removed, so sort the input first). |
uniq -d <FILE> | uniq -d file.txt | Show only the duplicated lines in a file. |
uniq -u <FILE> | uniq -u file.txt | Show only the unique lines in a file. |
…explore the Unix Getting Started tutorial in the section: UNIQ - command to remove duplicates
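A minimal uniq sketch; list.txt is made up for illustration. Because uniq only collapses adjacent duplicates, the input is sorted first.

```bash
printf 'apple\npear\napple\napple\n' > list.txt

sort list.txt | uniq      # remove duplicate lines
# Expected output:
# apple
# pear

sort list.txt | uniq -c   # prefix each line with its number of occurrences
# Expected output:
#   3 apple
#   1 pear
```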
TR - swap characters
SYNTAX: `text_stream | tr OPTIONS`
or: `tr OPTIONS < FILE`
COMMAND SYNTAX | EXAMPLE | TASK |
---|---|---|
tr '[:upper:]' '[:lower:]' < <FILE> | tr '[:upper:]' '[:lower:]' < file.txt | Translate all uppercase letters to lowercase. |
tr ' ' '\t' < <FILE> | tr ' ' '\t' < file.txt | Translate all spaces to tabs. |
tr -d 'AEIOUaeiou' < <FILE> | tr -d 'AEIOUaeiou' < file.txt | Delete all vowels from a file. |
…explore the Unix Getting Started tutorial in the section: TR - translate
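A minimal tr sketch; greeting.txt and its text are made up for illustration.

```bash
printf 'Hello World\n' > greeting.txt

tr '[:upper:]' '[:lower:]' < greeting.txt   # lowercase every letter
# Expected output:
# hello world

tr -d 'lo' < greeting.txt                   # delete the characters l and o
# Expected output:
# He Wrd
```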
WC - count lines, words
SYNTAX: `text_stream | wc OPTIONS`
or: `wc OPTIONS FILE`
COMMAND SYNTAX | EXAMPLE | TASK |
---|---|---|
wc <FILE> | wc file.txt | Count the number of lines, words, and characters in a file. |
wc -l <FILE> | wc -l file.txt | Count the number of lines in a file. |
wc -w <FILE> | wc -w file.txt | Count the number of words in a file. |
wc -m <FILE> | wc -m file.txt | Count the number of characters in a file. |
…explore the Unix Getting Started tutorial in the section: WC - word count
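A minimal wc sketch; poem.txt and its contents are made up for illustration.

```bash
printf 'one two\nthree\n' > poem.txt

wc poem.txt      # lines, words, characters (in that order), then the file name
# Expected output (column spacing varies between systems):
# 2  3 14 poem.txt

wc -l poem.txt   # only the line count
# Expected output:
# 2 poem.txt
```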
HEAD and TAIL
SYNTAX: `text_stream | head OPTIONS` or `text_stream | tail OPTIONS`
or: `head OPTIONS FILE` or `tail OPTIONS FILE`
These tools are very useful for quickly inspecting the contents of a file and can be used to get an overview of the data before processing it with more complex text manipulation tools.
HEAD
COMMAND SYNTAX | EXAMPLE | TASK |
---|---|---|
head <FILE> | head file.txt | Print the first 10 lines of a file. |
head -n N <FILE> | head -n 5 file.txt | Print the first N lines of a file. |
head -c N <FILE> | head -c 10 file.txt | Print the first N bytes of a file. |
…explore the Unix Getting Started tutorial in the section: HEAD of the file
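A minimal head sketch; numbers.txt is generated just for illustration.

```bash
seq 1 100 > numbers.txt   # the numbers 1..100, one per line

head -n 3 numbers.txt     # print only the first three lines
# Expected output:
# 1
# 2
# 3
```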
TAIL
COMMAND SYNTAX | EXAMPLE | TASK |
---|---|---|
tail <FILE> | tail file.txt | Print the last 10 lines of a file. |
tail -n N <FILE> | tail -n 5 file.txt | Print the last N lines of a file. |
tail -c N <FILE> | tail -c 10 file.txt | Print the last N bytes of a file. |
tail -f <FILE> | tail -f file.txt | Continuously monitor the end of a file. |
…explore the Unix Getting Started tutorial in the section: TAIL of the file
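A minimal tail sketch, reusing the generated numbers.txt from the head example above.

```bash
seq 1 100 > numbers.txt

tail -n 3 numbers.txt   # print only the last three lines
# Expected output:
# 98
# 99
# 100

# tail -f keeps the file open and prints new lines as they are appended,
# which is handy for watching a growing log file (stop it with Ctrl+C):
# tail -f numbers.txt
```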
Quick solutions to common tasks
Removing duplicate lines from a file
sort FILE | uniq
Counting the occurrences of a WORD in a file
grep -o WORD FILE | wc -w
Extracting columns of data from a file
cut -d DELIMITER -f COLUMN FILE
or
awk -F DELIMITER '{print $COLUMN}' FILE
Creating columns by translating a char to a delimiter
text_stream | grep WORD | tr '-' ' ' | awk '{print $2,$4,$6}' | sort -nk1 | uniq
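As a rough illustration of the last recipe, the sketch below runs the same chain of tools on a few made-up dash-separated records; the file name, fields, and values are all hypothetical.

```bash
# Sample dash-separated records, just to exercise the pipeline above.
printf 'id-3-name-beta-score-7\nid-1-name-alpha-score-9\nid-3-name-beta-score-7\n' > records.txt

cat records.txt |
  grep 'id' |                  # keep only the lines of interest
  tr '-' ' ' |                 # turn dashes into spaces so awk sees columns
  awk '{print $2, $4, $6}' |   # pick out the 2nd, 4th, and 6th columns
  sort -nk1 |                  # sort numerically by the first printed column
  uniq                         # drop the duplicated record
# Expected output:
# 1 alpha 9
# 3 beta 7
```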