Useful text manipulation programs

Introduction

The command line interface (CLI) is a powerful environment for text manipulation tasks.
There are a variety of text manipulation programs available in the terminal that allow users to quickly and efficiently process large amounts of text data. These programs offer a wide range of capabilities, including searching for patterns, transforming text, sorting lines, removing duplicates, and counting characters, words, and lines.

In this section, we will discuss some of the most popular and useful command line text manipulation programs, such as:

TOOL | DESCRIPTION | NOTES
grep | searches for a specific pattern in text files and outputs matching lines | Tutorial: GREP
sed | stream editor for filtering and transforming text | Tutorial: SED
awk | a programming language for processing text data, often used for text manipulation tasks | Tutorial: AWK
cut | cuts out specific columns or fields from a file
sort | sorts lines of text alphabetically or numerically
uniq | removes duplicate lines from a file
tr | translates or deletes characters from a file
wc | counts the number of lines, words, and characters in a file
head, tail | output the first or last part of a file, respectively

^ Each tool listed in the first column is covered in the cheat sheet below.

These tools are often used together in pipelines to perform more complex text manipulations. Understanding text manipulation programs can greatly improve a user’s productivity and efficiency when working with text files, without the need for graphical user interfaces.
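For example, a hypothetical chain like the one below (the log file name app.log is only a placeholder) keeps the lines containing ERROR, counts how many times each distinct line occurs, and prints the five most frequent ones:

grep 'ERROR' app.log | sort | uniq -c | sort -rn | head -n 5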

Why manipulate text files from the command line?

Manipulating text files in the command line is a powerful and efficient method for processing large amounts of text data. Its key advantages are:

  • SPEED
    Command line tools are typically faster than GUI-based text editors when processing large amounts of data.

  • AUTOMATION
    Command line tools can be automated using shell scripts, allowing you to perform repetitive tasks quickly and efficiently.

  • FLEXIBILITY
    The command line tools offer a wide range of functionality, making it possible to perform complex text manipulations.

  • INTEGRATION
    Command line tools can be easily integrated into other programs, allowing you to process text data in a variety of different contexts.

  • ACCESSIBILITY
    Command line tools allow users to manipulate text files stored on remote machines without downloading them first.

When to manipulate text files in the command line?

  • DATA PROCESSING
    Command line tools are particularly useful when working with large text data or multiple files, as they can process the data much faster than a graphical user interface (GUI) based text editor.

    • TEXT MANIPULATION
      Command line tools provide a powerful way to change the order or structure of the text in a file.

    • TEXT ANALYSIS
      You can use command line tools to extract meaningful information from large amounts of text data.

  • SCRIPTING
    The command line tools can be used in shell scripts to automate complex text processing tasks.

CheatSheet

Below, you can find a cheat sheet for some of the most popular command line text manipulation tools.

GREP - search pattern

SYNTAX: text_stream | grep OPTIONS PATTERN or grep OPTIONS PATTERN FILE
COMMAND SYNTAX | EXAMPLE | TASK
grep <PATTERN> <FILE> | grep 'version' file.txt | Search for a pattern in a file.
grep <PATTERN> <FILE1> <FILE2> | grep 'version' file1.txt file2.txt | Search for a pattern in multiple files.
grep -r <PATTERN> <DIR> | grep -r 'version' THIS_FOLDER | Search recursively in all files in a directory.
grep -n <PATTERN> <FILE> | grep -n 'version' file.txt | Show line numbers for matches.
grep -o <PATTERN> <FILE> | grep -o 'version' file.txt | Show only the matching portion of the line.
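For a quick, self-contained illustration (printf generates a small sample text stream instead of a real file; the command is followed by its output):

printf 'version: 1.2\nrelease date: 2021\nnew version coming soon\n' | grep -n 'version'
1:version: 1.2
3:new version coming soon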

SED - replace pattern

SYNTAX: text_stream | sed OPTIONS 's/PATTERN/REPLACEMENT/' or sed OPTIONS 's/PATTERN/REPLACEMENT/' FILE
COMMAND SYNTAX | EXAMPLE | TASK
sed 's/<PATTERN>/<REPLACEMENT>/g' <FILE> | sed 's/version/V/g' file.txt | Replace all occurrences of a pattern in a file.
sed 's/<PATTERN>//g' <FILE> | sed 's/version//g' file.txt | Delete all occurrences of a pattern in a file.
sed 's/<PATTERN>/<REPLACEMENT>/N' <FILE> | sed 's/version/V/2' file.txt | Replace only the Nth occurrence of a pattern on each line.
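A minimal, self-contained sketch (echo supplies the text stream instead of a file; each command is followed by its output):

echo 'version 1 and version 2' | sed 's/version/V/g'
V 1 and V 2
echo 'version 1 and version 2' | sed 's/version/V/2'
version 1 and V 2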

AWK - manage order

SYNTAX: text_stream | awk OPTIONS '{}' or awk OPTIONS '{}' FILE
COMMAND SYNTAX | EXAMPLE | TASK
awk '{print $1, $3}' <FILE> | awk '{print $1, $3}' file.txt | Print the first and third columns of a file.
awk 'NF > 3' <FILE> | awk 'NF > 3' file.txt | Print only the lines with more than 3 fields (columns).
awk '{sum+=$2} END {print sum}' <FILE> | awk '{sum+=$2} END {print sum}' file.txt | Print the sum of all numbers in the second column.
awk '{printf "%-10s %s\n", $1, $2}' <FILE> | awk '{printf "%-10s %s\n", $1, $2}' file.txt | Format the output (left-align the first column in a 10-character field).
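A small self-contained example, with printf producing three whitespace-separated columns (each command is followed by its output):

printf 'alice 10 math\nbob 7 physics\ncarol 12 math\n' | awk '{print $1, $3}'
alice math
bob physics
carol math
printf 'alice 10 math\nbob 7 physics\ncarol 12 math\n' | awk '{sum+=$2} END {print sum}'
29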

CUT - cut characters

SYNTAX: text_stream | cut OPTIONS or cut OPTIONS FILE
COMMAND SYNTAX | EXAMPLE | TASK
cut -f <LIST> <FILE> | cut -f 1,3-5 file.txt | Cut out the first and 3rd to 5th columns from a file (fields are tab-delimited by default; use -d to set another delimiter).
cut -c <RANGE> <FILE> | cut -c 1-3 file.txt | Cut out the first three characters from each line.
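A self-contained example using comma-separated input (the command is followed by its output):

printf 'alice,10,math\nbob,7,physics\n' | cut -d ',' -f 1,3
alice,math
bob,physics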

SORT - sort lines

SYNTAX: text_stream | sort OPTIONS or sort OPTIONS FILE
COMMAND SYNTAX | EXAMPLE | TASK
sort <FILE> | sort file.txt | Sort the lines of a file.
sort -r <FILE> | sort -r file.txt | Sort the lines of a file in reverse order.
sort -k 2 <FILE> | sort -k 2 file.txt | Sort the lines of a file based on the second field (column).
sort -n <FILE> | sort -n file.txt | Sort the lines of a file numerically.

…explore the Unix Getting Started tutorial in the section: SORT a file by lines
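A self-contained example (printf supplies the lines; each command is followed by its output). Note the difference between the default alphabetical sort and a numeric sort on the second field:

printf 'bob 7\nalice 10\ncarol 12\n' | sort
alice 10
bob 7
carol 12
printf 'bob 7\nalice 10\ncarol 12\n' | sort -n -k 2
bob 7
alice 10
carol 12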

UNIQ - unique lines

SYNTAX: text_stream | uniq OPTIONS or uniq OPTIONS FILE
COMMAND SYNTAX | EXAMPLE | TASK
uniq <FILE> | uniq file.txt | Remove adjacent duplicate lines from a file.
uniq -d <FILE> | uniq -d file.txt | Show only the duplicated lines in a file.
uniq -u <FILE> | uniq -u file.txt | Show only the unique (non-repeated) lines in a file.

…explore the Unix Getting Started tutorial in the section: UNIQ - command to remove duplicates
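Remember that uniq only collapses adjacent duplicates, which is why it is usually combined with sort. A self-contained example (each command is followed by its output):

printf 'red\nblue\nred\nred\n' | uniq
red
blue
red
printf 'red\nblue\nred\nred\n' | sort | uniq
blue
red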

TR - swap characters

SYNTAX: text_stream | tr OPTIONS 'SET1' 'SET2' or tr OPTIONS 'SET1' 'SET2' < FILE
COMMAND SYNTAX | EXAMPLE | TASK
tr '[:upper:]' '[:lower:]' < <FILE> | tr '[:upper:]' '[:lower:]' < file.txt | Translate all uppercase letters to lowercase.
tr ' ' '\t' < <FILE> | tr ' ' '\t' < file.txt | Translate all spaces to tabs.
tr -d 'AEIOUaeiou' < <FILE> | tr -d 'AEIOUaeiou' < file.txt | Delete all vowels from a file.

…explore the Unix Getting Started tutorial in the section: TR - translate
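Two self-contained examples (each command is followed by its output):

echo 'Hello World' | tr '[:upper:]' '[:lower:]'
hello world
echo 'Hello World' | tr -d 'AEIOUaeiou'
Hll Wrld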

WC - count lines, words

SYNTAX: text_stream | wc OPTIONS or wc OPTIONS FILE
COMMAND SYNTAX | EXAMPLE | TASK
wc <FILE> | wc file.txt | Count the number of lines, words, and characters in a file.
wc -l <FILE> | wc -l file.txt | Count the number of lines in a file.
wc -w <FILE> | wc -w file.txt | Count the number of words in a file.
wc -m <FILE> | wc -m file.txt | Count the number of characters in a file.

…explore the Unix Getting Started tutorial in the section: WC - word count
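A self-contained example (each command is followed by its output; when reading from a pipe, wc omits the file name and the exact spacing of the output may differ between systems):

printf 'one two\nthree four five\n' | wc -l
2
printf 'one two\nthree four five\n' | wc -w
5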

HEAD and TAIL

SYNTAX: text_stream | head/tail OPTIONS or head/tail OPTIONS FILE

These tools are very useful for quickly inspecting the contents of a file and can be used to get an overview of the data before processing it with more complex text manipulation tools.

HEAD

COMMAND SYNTAX | EXAMPLE | TASK
head <FILE> | head file.txt | Print the first 10 lines of a file.
head -n N <FILE> | head -n 5 file.txt | Print the first N lines of a file.
head -c N <FILE> | head -c 10 file.txt | Print the first N bytes of a file.

…explore the Unix Getting Started tutorial in the section: HEAD of the file
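A self-contained example, using seq to generate the numbers 1 through 100 as the input stream (the command is followed by its output):

seq 100 | head -n 3
1
2
3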

TAIL

COMMAND SYNTAX | EXAMPLE | TASK
tail <FILE> | tail file.txt | Print the last 10 lines of a file.
tail -n N <FILE> | tail -n 5 file.txt | Print the last N lines of a file.
tail -c N <FILE> | tail -c 10 file.txt | Print the last N bytes of a file.
tail -f <FILE> | tail -f file.txt | Continuously monitor the end of a file.

…explore the Unix Getting Started tutorial in the section: TAIL of the file
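The same seq-generated stream, this time keeping the end (the command is followed by its output):

seq 100 | tail -n 3
98
99
100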

Practical examples

Removing duplicate lines from a file

sort FILE | uniq

Counting the occurrences of a ‘WORD’ in a file

grep -o WORD FILE | wc -w

Extracting columns of data from a file

cut -d DELIMITER -f COLUMN FILE

or

awk -F DELIMITER '{print $COLUMN}' FILE

Creating columns by translating a char to a delimiter

text_stream | grep WORD | tr '-' ' ' | awk '{print $2,$4,$6}' | sort -nk1 | uniq
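As one more hypothetical combination (the file name access.log and the field position are assumptions), the pipeline below counts how many distinct IP addresses appear in a web-server log in which the IP is the first whitespace-separated field:

awk '{print $1}' access.log | sort | uniq | wc -l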