Copying Data via SSH using Command Line: scp, rsync
Introduction
Copying data using SSH (Secure Shell) provides a secure way to transfer data between two computers. The data is encrypted while it is being transmitted, which protects it against eavesdropping and tampering. By establishing an encrypted connection and verifying the identity of the user, the SSH protocol ensures that the data is transmitted securely.
The data can be copied or synchronized between two computers using command line tools such as:
- scp (secure copy), recommended for transferring individual files
- rsync (remote synchronization), recommended for updating the differences between corresponding directories
What do you need to start?
All you need is a terminal window providing the command line interface and your access credentials to the remote machine. Typically, these include:
- hostname of the remote machine
- your username
- your access password
- multifactor authentication code
A hostname is a label that is assigned to a computer on a network, and it is used to identify the computer and its location on the network. The specific format of a hostname can vary, but it must be unique on the network in order to function correctly.
Here are some examples of hostnames:
Command SYNTAX
The command syntax for both command line tools, scp and rsync, is very similar and uses the same components:
scp <source> <destination>
or
rsync <source> <destination>
e.g.,
scp /local/directory/file.txt username@remote-hostname:/remote/directory/
where:
file.txt - is the data file you want to transfer
/local/directory/ - is a relative or absolute path to the data location on your local machine
username - is the name of your user account on the remote machine
@ - is the separator in the username@hostname syntax
remote-hostname - is the label assigned to the remote computer
: - separates the hostname from the path on the remote machine
/remote/directory/ - is the relative or absolute path on the remote machine
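To make the components of the remote location concrete, here is a minimal sketch that splits a string of the form username@hostname:/path using shell parameter expansion. The value is the placeholder from the example above, not a real host, and the variable names are chosen only for illustration.

```shell
# Placeholder remote location taken from the syntax example; not a real host.
remote="username@remote-hostname:/remote/directory/"

user="${remote%%@*}"   # everything before the first "@"
rest="${remote#*@}"    # everything after the first "@"
host="${rest%%:*}"     # everything before the first ":"
path="${rest#*:}"      # everything after the first ":"

echo "user: $user"
echo "host: $host"
echo "path: $path"
```

Neither scp nor rsync needs this splitting done by hand; the sketch only shows which part of the string each tool reads as the account, the machine, and the location.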
A file path is used to specify the location of a file or directory on the computer's file system. There are two types of file paths: absolute paths and relative paths.
An absolute path starts at the root of the file system (/). For example, /home/user/documents/file.txt is an absolute path to a file in a directory on the file system.
A relative path is interpreted starting from the current working directory. For example, ./documents/file.txt is a relative path to a file in a directory that is located in the current working directory.
- current directory: ./
- one directory above: ../
- two directories above: ../../
...and so on
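The difference between the two kinds of paths can be checked in a throwaway directory. The sketch below assumes a hypothetical layout (project/documents/file.txt) created under a temporary directory, and resolves two relative paths to their absolute form.

```shell
# Build a small, temporary directory tree so nothing real is touched.
workdir="$(mktemp -d)"
mkdir -p "$workdir/project/documents"
touch "$workdir/project/documents/file.txt"

# Make "project" the current working directory.
cd "$workdir/project" || exit 1

# Resolve relative paths by entering them in a subshell and printing the full path.
abs_docs="$(cd ./documents && pwd -P)"
abs_parent="$(cd ../ && pwd -P)"

echo "./documents resolves to: $abs_docs"
echo "../         resolves to: $abs_parent"
```

The same relative path resolves to a different absolute path when the current working directory changes, which is why absolute paths are often the safer choice in transfer commands. The temporary directory can be removed afterwards with rm -rf "$workdir".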
SCP (secure copy)
scp (secure copy) is a command line tool for copying files between computers using the SSH (Secure Shell) protocol for data transfer. It works by establishing an encrypted ssh connection between two computers and copying the data over this connection.
scp is usually available in the terminal on Linux and Mac, and in Windows PowerShell on Windows 10.
Getting started:
Open a terminal window on your local machine and copy-paste the command example (provided below), adjusting the paths and credentials to your needs (following the directions in the Command SYNTAX section).
Copy file: local to remote
scp /local/directory/file.txt username@remote-hostname:/remote/directory/
Copy file: remote to local
scp username@remote-hostname:/remote/directory/file.txt /local/directory/
Copy a directory
If you want to copy an entire directory, use the scp -r command, where the -r flag tells scp to copy the directory and its contents recursively.
- from local to remote
scp -r /local/directory username@remote-hostname:/remote/directory/
- from remote to local
scp -r username@remote-hostname:/remote/directory /local/directory/
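Since -r is only needed when the source is a directory, a small helper can pick the right form. This is a minimal sketch, not part of scp itself: build_scp_command is a hypothetical function name, and it only prints the command rather than running it, so no remote machine is contacted.

```shell
# Print (not run) an scp command, adding -r only when the source is a directory.
# build_scp_command is a hypothetical helper used for illustration.
build_scp_command() {
    src="$1"
    dest="$2"
    if [ -d "$src" ]; then
        echo "scp -r $src $dest"
    elif [ -f "$src" ]; then
        echo "scp $src $dest"
    else
        echo "error: $src is not a file or directory" >&2
        return 1
    fi
}

# Usage with throwaway local paths; the destination is the placeholder from the text.
demo="$(mktemp -d)"
mkdir "$demo/results"
touch "$demo/file.txt"
build_scp_command "$demo/file.txt" "username@remote-hostname:/remote/directory/"
build_scp_command "$demo/results"  "username@remote-hostname:/remote/directory/"
```

Printing the command first is a cheap way to review it before anything is transferred. The sketch does not quote paths in its output, so it is meant for inspection rather than for paths containing spaces.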
Admins of some HPC systems, e.g. the SCINet infrastructure ⤴, recommend using scp to transfer single files only. Please be aware of this note:
"It is not advised to use “scp -r” command to transfer directories to Ceres, since the setgid bit on directories at destination is not inherited. This is not a problem if directories are copied to /home/$USER but is a problem when copying to /project area and usually results in quota exceeded errors."
If you decide to use scp to transfer directories to the Ceres cluster, follow the instructions provided on the SCINet website: Small Data Transfer Using scp ⤴.
Example Options
To learn more about the scp command and all available options, type “man scp” in the command line.
Here are some of the options most commonly used with the scp command:
-r - Recursively copy the entire contents of a directory, including subdirectories and files.
-v - Verbose output. Print debugging messages about the progress of the connection and transfer.
-P 8080 - Specify the port to use for the connection; 8080 is just an example. Note the uppercase -P: the lowercase -p preserves modification times and modes instead.
-C - Enable compression during transfer.
-q - Quiet mode. Suppress the progress meter as well as warning and diagnostic messages.
Because ~/data is a directory in the examples below, each command includes -r; without it, scp refuses to copy a directory.
Example 1: Recursively copy a directory and its contents
scp -r ~/data user@example-hostname:~/backup
Example 2: Display verbose output during the transfer
scp -r -v ~/data user@example-hostname:~/backup
Example 3: Specify the port to use for the connection
scp -r -P 8080 ~/data user@example-hostname:~/backup
Example 4: Enable data compression during transfer
scp -r -C ~/data user@example-hostname:~/backup
Example 5: Suppress the progress meter and diagnostic messages
scp -r -q ~/data user@example-hostname:~/backup
RSYNC (remote sync)
rsync (remote synchronization) is a command line tool for efficiently transferring and synchronizing files between computers using the SSH (Secure Shell) protocol for data transfer. It works by establishing an encrypted ssh connection between two computers and copying the data over this connection. This tool is commonly used for backup, data replication, and file distribution.
rsync works by comparing the source and destination files and only transferring the differences, making it much more efficient than other file transfer tools, such as cp or scp, when the source and destination files are similar. This makes rsync particularly useful for transferring large files or large collections of files that change only slightly over time, as it can significantly reduce the amount of data that needs to be transferred.
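In addition to its efficiency, rsync also provides a number of features that make it a versatile tool for file transfer and synchronization, such as preserving file attributes, compressing data during transfer, dry runs, and excluding files by pattern.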
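rsync is usually available in the terminal on Linux and Mac. On Windows 10 it is not included with PowerShell by default and is typically run through the Windows Subsystem for Linux (WSL) or a similar Unix-like environment.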
Getting started:
Open a terminal window on your local machine and copy-paste the command example (provided below), adjusting the paths and credentials to your needs (following the directions in the Command SYNTAX section).
The general syntax for synchronization requires providing the source and destination locations. You can synchronize locations on a single machine or between different computers.
rsync <source> <destination>
It can be practical to use the rsync command with the -avz flags:
-a - preserves file attributes such as permissions and ownership
-v - provides verbose output
-z - compresses the data during transfer
On the first transfer with rsync, all data will be copied, while on future uses only the differences will be updated.
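The "only the differences" behavior can be observed without a remote machine, because rsync also synchronizes two local directories. The sketch below is a local demonstration under assumptions: both directories are temporary, the file names are made up, and the --itemize-changes option is used only to list what each run touched.

```shell
# Local demonstration: source and destination are temporary directories,
# so no remote host or credentials are needed.
if ! command -v rsync >/dev/null 2>&1; then
    echo "rsync is not installed; skipping the demonstration"
else
    src="$(mktemp -d)"
    dst="$(mktemp -d)"
    echo "alpha" > "$src/a.txt"
    echo "beta"  > "$src/b.txt"

    # First run: the destination is empty, so both files are transferred.
    first_run="$(rsync -a --itemize-changes "$src/" "$dst/")"

    # Change one file only (its size changes, so rsync notices the difference).
    echo "beta, revised" > "$src/b.txt"

    # Second run: a.txt is unchanged and is skipped; b.txt is updated.
    second_run="$(rsync -a --itemize-changes "$src/" "$dst/")"

    echo "First run:"
    echo "$first_run"
    echo "Second run:"
    echo "$second_run"
fi
```

The same pattern applies to a remote destination such as username@remote-hostname:/remote/directory; only the destination argument changes.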
Synchronize local to remote
rsync -avz /local/directory username@remote-hostname:/remote/directory
Synchronize remote to local
rsync -avz username@remote-hostname:/remote/directory /local/directory
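One detail worth checking before a real transfer is the trailing slash on the source: without it, rsync copies the directory itself into the destination; with it, rsync copies only the directory's contents. The sketch below illustrates this locally, using temporary directories with made-up names (data, backup_a, backup_b) in place of a remote machine.

```shell
# Local illustration of the trailing-slash rule; all paths are temporary.
if ! command -v rsync >/dev/null 2>&1; then
    echo "rsync is not installed; skipping the demonstration"
else
    work="$(mktemp -d)"
    mkdir -p "$work/data" "$work/backup_a" "$work/backup_b"
    echo "x" > "$work/data/file.txt"

    # No trailing slash on the source: the directory "data" is created inside backup_a.
    rsync -a "$work/data" "$work/backup_a"

    # Trailing slash on the source: only the contents of "data" land in backup_b.
    rsync -a "$work/data/" "$work/backup_b"

    echo "backup_a contains:"; ls "$work/backup_a"
    echo "backup_b contains:"; ls "$work/backup_b"
fi
```

Applied to the commands above, a source written as /local/directory ends up as /remote/directory/directory on the remote side, whereas /local/directory/ places its contents directly in /remote/directory.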
Synchronize File or Dir
Example 1: if you wanted to synchronize the file file.txt stored in your home directory (~/) from your local computer to a remote computer with the hostname example-hostname and place it in the directory ~/backup, you could run the following command:
rsync ~/file.txt user@example-hostname:~/backup
Example 2: if you wanted to synchronize the directory ~/data from your local computer to a remote computer with the hostname example-hostname and place it in the directory ~/backup, you could run the following command:
rsync -avz ~/data user@example-hostname:~/backup
Using the -avz flags will also 1) preserve file attributes, 2) provide verbose output, and 3) compress the data during transfer.
Example Options
To learn more about the rsync command and all available options, type “man rsync” in the command line.
Here are some of the options most commonly used with the rsync command:
-a - Archive mode. A shorthand for a set of options that preserve file attributes such as permissions, ownership, timestamps, and symbolic links.
-v - Verbose output. Display the progress of the transfer and a list of the files being transferred.
-z - Compress the data during transfer.
-r - Recursively copy the entire contents of a directory, including subdirectories and files.
-n - Dry run. Perform a test run without actually transferring any files.
-u - Update only. Transfer only files that are newer on the source than on the destination.
--exclude='*.log' - Exclude files or directories from the transfer based on a pattern; '*.log' is an example value for the option.
Because ~/data is a directory in the examples below, each command includes -r (or -a, which implies it); without it, rsync skips directories.
Example 1: Transfer files in archive mode
rsync -a ~/data user@example-hostname:~/backup
Example 2: Display verbose output during the transfer
rsync -r -v ~/data user@example-hostname:~/backup
Example 3: Compress the data during transfer
rsync -r -z ~/data user@example-hostname:~/backup
Example 4: Recursively copy a directory and its contents
rsync -r ~/data user@example-hostname:~/backup
Example 5: Perform a dry run without transferring any files (-v lists what would be transferred)
rsync -r -n -v ~/data user@example-hostname:~/backup
Example 6: Update only files that are newer on the source than on the destination
rsync -r -u ~/data user@example-hostname:~/backup
Example 7: Exclude files or directories based on a pattern
rsync -r --exclude='*.log' ~/data user@example-hostname:~/backup
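Options are often combined in practice, for example previewing a filtered transfer with a dry run before performing it. The following is a minimal local sketch, assuming two temporary directories and made-up file names (results.csv, run.log) in place of real data and a real remote host.

```shell
# Preview a filtered transfer with a dry run, then perform it; all paths are temporary.
if ! command -v rsync >/dev/null 2>&1; then
    echo "rsync is not installed; skipping the demonstration"
else
    src="$(mktemp -d)"
    dst="$(mktemp -d)"
    echo "results" > "$src/results.csv"
    echo "debug"   > "$src/run.log"

    # Dry run (-n): report what would be transferred, but copy nothing.
    preview="$(rsync -avn --exclude='*.log' "$src/" "$dst/")"
    echo "$preview"

    # Count the entries in the destination after the dry run (it should still be empty).
    count_after_dry_run="$(ls -A "$dst" | wc -l)"

    # Real run with the same filter: run.log is left out of the destination.
    rsync -a --exclude='*.log' "$src/" "$dst/"
    ls "$dst"
fi
```

Running the dry run first is a low-cost habit when a pattern or a destination path is new, since a mistake shows up in the preview rather than in the transferred data.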