DataScience Workbook / 07. Data Acquisition and Wrangling / 1. Remote Data Access / 1.1 Remote Data Transfer


Introduction

Remote data transfer refers to the transfer of data from one location to another. It requires a network, such as the Internet, that provides the means of transmitting the data between the source and destination devices. It allows individuals and organizations to share and exchange information, regardless of their physical location.

NOTE:
With remote data transfer, data can be sent from a source device, such as a computer, to a destination device, such as another computer or a server, using a variety of protocols and services, such as File Transfer Protocol (FTP), SSH File Transfer Protocol (SFTP), and cloud storage services.
Remote data transfer is not possible without a network. Without one, data can only be moved between two devices on physical media, such as a USB drive or an external hard drive.


Remote data transfer can be used for a wide range of purposes, such as:

  • sharing files and documents,
  • backing up data,
  • transferring large amounts of data between systems,
  • distributing software and updates.

Remote data transfer enables greater collaboration and productivity by making it possible to access and share information from anywhere, at any time.

Remote data transfer options

Secure Data Transfer in Science

Secure data transfer is paramount in research because of the sensitive nature of the data being transmitted. Research often involves the collection and analysis of sensitive and confidential information, such as personal information, intellectual property, trade secrets, and medical records. If this information were to fall into the wrong hands, it could have serious consequences, such as identity theft, breaches of confidentiality, or loss of intellectual property.

Here are some of the key reasons why secure data transfer is important in research:

  • CONFIDENTIALITY
    Research data often contains sensitive information that must be kept confidential to protect the privacy of individuals and the confidentiality of the research itself. Secure data transfer ensures that this information is transmitted over a secure connection and is protected from unauthorized access.

  • INTEGRITY
    The accuracy and completeness of research data are crucial for the validity of the results. Secure data transfer helps to ensure the integrity of the data by providing mechanisms to detect and prevent any unauthorized changes or tampering with the data during transmission.

  • COMPLIANCE
    Many research projects are subject to regulatory requirements, such as the General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA). Secure data transfer helps to ensure compliance with these regulations by providing secure and encrypted methods of transmitting data.

  • INTELLECTUAL PROPERTY
    Research often involves the creation of new ideas and concepts that may have commercial value. Secure data transfer helps to protect the intellectual property of the researchers by ensuring that confidential information is transmitted securely and is protected from unauthorized access.

  • EFFICIENCY
    Secure data transfer allows researchers to collaborate and share data in real-time, regardless of their location. This can greatly increase the speed and efficiency of research projects and can facilitate collaboration between researchers across multiple institutions.
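
The integrity point above is often checked in practice by comparing checksums before and after a transfer. The following is a minimal sketch, assuming SHA-256 is an acceptable digest and that the sender publishes the expected value alongside the file. The function names `sha256_of` and `verify_transfer` are illustrative and do not come from any particular tool.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(received_path, expected_digest):
    """Return True if the received file matches the digest published by the sender."""
    return sha256_of(received_path) == expected_digest.strip().lower()
```

A matching digest shows that the file arrived unchanged. It only protects against deliberate tampering if the expected digest itself was obtained over a trusted channel.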

Good Practices

When transferring any research data, there are several practices and techniques that should be avoided in order to ensure the security and confidentiality of the data:

  • Avoid unencrypted connections:
    Unencrypted connections, such as FTP or HTTP, are NOT secure and can leave the data vulnerable to interception and unauthorized access. Instead, use encrypted connections, such as SFTP, HTTPS, or VPN, to ensure that the data is transmitted securely.

  • Avoid using public Wi-Fi networks:
    Public Wi-Fi networks are NOT secure, and traffic on them can be easily intercepted by unauthorized third parties. If you must use one, use a secure and encrypted connection, such as a VPN, to ensure that the data is transmitted securely.

  • Avoid using weak passwords:
    Weak passwords can be easily cracked by attackers, putting the data at risk. Instead, use strong and unique passwords and implement multi-factor authentication to add an extra layer of security.

  • Avoid transmitting sensitive information without permission:
    Before transmitting sensitive data, it is important to obtain permission from the owner of the data and to ensure that the data is properly secured.

  • Avoid using personal email:
    Personal email accounts are not designed for secure data transfer and can be vulnerable to interception, unauthorized access, and data loss. Instead, use secure file transfer services or encrypted email services that are specifically designed for secure data transfer.

  • Avoid using removable media:
    Removable media, such as USB drives, can be easily lost or stolen, putting the data at risk. Instead, use secure file transfer services (e.g., Globus) or encrypted cloud storage services (e.g., your organization's Box account) that are specifically designed for secure data transfer.
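
The first practice in the list above (avoid unencrypted connections) can be enforced with a small check before a transfer starts. This is a minimal sketch, assuming transfer targets are given as URLs. The allow-list `ENCRYPTED_SCHEMES` and the function name are illustrative and should be adapted to local policy.

```python
from urllib.parse import urlparse

# Assumption: a small illustrative allow-list of encrypted URL schemes.
ENCRYPTED_SCHEMES = {"https", "sftp", "ftps"}


def uses_encrypted_scheme(url):
    """Return True if the URL scheme is on the encrypted allow-list."""
    return urlparse(url).scheme.lower() in ENCRYPTED_SCHEMES
```

For example, `uses_encrypted_scheme("ftp://example.org/data.csv")` is rejected, while the same path under `sftp://` or `https://` is accepted.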

1. Online File Sharing Services

There are several online file sharing services, which allow users to upload, store, and share files with others. These services offer the convenience of being accessible from anywhere with an Internet connection, and they often include advanced collaboration features such as the ability to share files with others and work on them together in real-time.

  • Dropbox ⤴
    A file hosting service that offers cloud storage, file synchronization, personal cloud, and client software. Free 2GB per user with up to 3 linked devices.

  • Google Drive ⤴
    A file storage and synchronization service developed by Google. Allows users to create and share documents, spreadsheets, presentations, surveys, and more. Free 15GB per user with Google account.

  • OneDrive ⤴
    A file hosting service operated by Microsoft that enables registered users to share and synchronize their files. Includes Microsoft Office tools. Free 5GB per user with Microsoft account.

  • Apple iCloud ⤴
    A cloud service developed by Apple that enables users to store and sync data across devices, including Apple Mail, Apple Calendar, Apple Photos, Apple Notes, contacts, settings, backups, and files, to collaborate with other users. Free 5GB per user with Apple account.

  • Box ⤴ (recommended)
    A cloud-based file hosting service, focused on business users. Provides secure collaboration with anyone, anywhere, on any device. Free 10GB individual plan; business plans are paid.
    It is highly likely that your organization uses Box and that you, as an employee, already have an account set up automatically.

2. Cloud Storage Services

Cloud storage services provide scalable and highly available data storage that can be accessed over the Internet. These services are often used for backup, disaster recovery, and archiving. They are intended for business and large-scale needs. Typically, there are no free pricing plans, so check your options with your organization.

  • Google Cloud Storage ⤴
    Cloud Storage is a managed service for storing unstructured data. Store any amount of data and retrieve it as often as you like.
  • Microsoft Azure Blob Storage ⤴
    Azure Blob Storage provides long-term storage to build powerful cloud-native and mobile apps. It can flexibly scale up for high-performance computing and machine learning workloads.
  • Amazon S3 ⤴
    Amazon S3 is an object storage service that offers industry-leading scalability, data availability, security, and performance.

3. Web-based File Transfer

Some web-based file transfer services allow you to send large files over the Internet without the need to install any software. These services typically use a web browser as the client, and the recipient simply needs to follow a link to download the file. While web-based free file transfer services can be convenient, they may not provide the level of security and reliability required for sensitive or confidential data.

WARNING:
The safety and security of web-based free file transfer services can vary greatly. While some services may offer a certain level of security, it is important to consider the potential risks associated with using these services.
Remember that nothing comes for free. Here are some of the risks associated with using web-based free file transfer services:
  • lack of encryption, leaving the data vulnerable to interception and unauthorized access
  • limited security features, such as weak password protection, which can leave the data at risk of unauthorized access
  • data privacy concerns, as services may collect and use personal data for advertising or other purposes
  • unreliable servers, which can result in data loss or corruption during transmission
  • no technical support, which can make it difficult to resolve issues or recover lost data

4. Data transfer to and from HPC

    High-Performance Computing (HPC) environments typically have a number of secure options for remote data transfer, including:

    • Globus

    Globus ⤴ is a web-based service for transferring large amounts of data between HPC systems, cloud storage systems, and other data repositories. Globus provides a secure and reliable means of transferring data, and it can be integrated with other tools and systems used in HPC environments.

    Tutorial: Follow the hands-on tutorial Copying Data using Graphical Interface: Globus ⤴ in this workbook to acquire the practical skill of transferring data to and from the HPC system.

    Globus is a recommended tool for transferring data on the SCINet infrastructure.
    Learn more about Data transfer on Atlas using Globus ⤴ from the tutorial SCINet: Atlas Computing Cluster ⤴.

    • GridFTP

    GridFTP ⤴ is a command line service for parallel movement of data. It is a high-performance, parallel-data transfer protocol designed for large-scale data transfers, especially for HPC and scientific computing. GridFTP uses multiple parallel streams to transfer data, which can significantly speed up data transfer times for large files.
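
The idea of parallel streams can be illustrated by dividing a file into byte ranges, one per stream, so that each range could be sent over its own connection. This is a conceptual sketch only: it does not use a GridFTP client or reproduce the protocol, and the function name `split_into_ranges` is hypothetical.

```python
def split_into_ranges(total_bytes, streams):
    """Split [0, total_bytes) into contiguous (start, end) ranges, one per stream."""
    if streams < 1:
        raise ValueError("streams must be at least 1")
    base, extra = divmod(total_bytes, streams)
    ranges = []
    start = 0
    for index in range(streams):
        # The first `extra` streams carry one additional byte each.
        length = base + (1 if index < extra else 0)
        if length == 0:
            continue  # more streams than bytes: leave the surplus streams idle
        ranges.append((start, start + length))
        start += length
    return ranges
```

Each `(start, end)` pair is a half-open interval, so the ranges cover the file exactly once with no overlap.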

    • FTP client

    FTP (File Transfer Protocol) is a standard network protocol used for transferring files between computers over the Internet. The protocol was designed to be simple and efficient, allowing for the easy transfer of large files between hosts. Many websites and online resources offer FTP access, which allows users to download files directly to their local machine using an FTP client.

    NOTE:
    FTP works by establishing a connection between a client and a server. The client is typically an FTP software application, also known as an FTP client, while the server is a remote computer that stores the files to be transferred. The FTP client uses the protocol to send commands to the server to download or upload files.

    FTP is not a secure protocol, as both data and login credentials are transferred in plain text and can be intercepted by third parties. As such, it is recommended to use a secure file transfer protocol, such as SFTP (SSH File Transfer Protocol) or FTPS (FTP over SSL/TLS), for transferring sensitive data.


    QUICK GUIDE

    1. To download data from an online resource using FTP, you first need to have an FTP client installed on your local machine.
    There are several popular FTP clients available, including:

    • FileZilla ⤴, a cross-platform open-source FTP solution supporting FTP, FTPS, SFTP, and some cloud-based file transfer protocols
    • Cyberduck ⤴, a libre server and cloud storage browser for Mac and Windows with support for FTP, SFTP, and some cloud-based file transfer protocols
    • WinSCP ⤴, a popular FTP client for Microsoft Windows supporting FTP, FTPS, SFTP, and some cloud-based file transfer protocols

    2. Once you have an FTP client installed, you can use it to connect to the server hosting the files you want to download.
    To establish a connection, you will need to enter the hostname or IP address of the server, as well as your login credentials, such as your username and password.

    3. Once you have established a connection, you can browse the files and directories stored on the server, and select the files you want to download.
    You can then use the FTP client to initiate the file transfer and monitor its progress.
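
Steps 2 and 3 above (connect, browse, download) can also be scripted. The following is a minimal sketch using Python's standard `ftplib` module, assuming the server supports FTPS. The function name `download_file` and its parameters are illustrative, and the host and credentials you pass in are your own.

```python
import ftplib


def download_file(host, username, password, remote_path, local_path,
                  ftp_factory=ftplib.FTP_TLS):
    """Connect, log in, list the remote directory, and download one file."""
    with ftp_factory(host) as ftp:
        ftp.login(user=username, passwd=password)
        if hasattr(ftp, "prot_p"):
            ftp.prot_p()  # FTPS only: encrypt the data channel as well
        listing = ftp.nlst()  # browse: names in the current remote directory
        with open(local_path, "wb") as handle:
            # RETR streams the remote file; each block is written to disk.
            ftp.retrbinary(f"RETR {remote_path}", handle.write)
    return listing
```

The `ftp_factory` parameter defaults to the encrypted `ftplib.FTP_TLS`. Passing `ftplib.FTP` instead falls back to plain FTP, which is only appropriate for public, non-sensitive data.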

    Some FTP clients provide both a graphical user interface (GUI) and a command-line interface (CLI) for transferring files. Note that the exact syntax may differ depending on your operating system and the version of the tool you are using.

    FileZilla
    FileZilla is primarily a graphical client. Its command-line arguments control how the application starts (for example, which saved site or server URL to open); it does not offer an interactive transfer shell with commands such as `open`, `put`, or `get`. To list the supported arguments, use the `--help` option: `filezilla --help`

    Cyberduck
    In Cyberduck, you can use the separate `duck` command-line tool to transfer files via FTP. Connection and transfer settings are given as options, such as `--upload`, `--download`, and `--list`. To display all available options, use: `duck --help`

    WinSCP
    In WinSCP, you can use the `winscp.com` command-line tool to automate file transfer tasks. It accepts scripting commands such as:
    • `open` to connect to a server,
    • `put` and `get` to transfer files,
    • `exit` to close the session.
    To display the command-line usage, use: `winscp.com /help`
    To display help for a specific scripting command, type `help` followed by the command name at the WinSCP prompt: `help put`

    • rsync (command)

    rsync is a command line tool for fast and efficient file transfer. It is often used in HPC environments. rsync transfers only the differences between two sets of files, making it well-suited for transferring large amounts of data, especially when only small changes have been made to the data.
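
By default, rsync first decides which files can be skipped by comparing size and modification time, and then sends only the changed blocks of the remaining files. The sketch below mimics only that first, file-level decision for two local directories. It is not rsync's delta-transfer algorithm, and the function name `files_needing_transfer` is hypothetical.

```python
import os


def files_needing_transfer(source_dir, dest_dir):
    """List relative paths that are missing in dest_dir or differ in size or mtime."""
    pending = []
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            src = os.path.join(root, name)
            rel = os.path.relpath(src, source_dir)
            dst = os.path.join(dest_dir, rel)
            if not os.path.exists(dst):
                pending.append(rel)  # not present at the destination yet
                continue
            src_stat, dst_stat = os.stat(src), os.stat(dst)
            # Compare size and whole-second modification time.
            if (src_stat.st_size != dst_stat.st_size
                    or int(src_stat.st_mtime) != int(dst_stat.st_mtime)):
                pending.append(rel)
    return sorted(pending)
```

When only a few files have changed, the returned list is short, which is why repeated synchronization of a large dataset is much cheaper than copying it again in full.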

    Tutorial: Follow the hands-on tutorial Copying Data via SSH using Command Line: scp, rsync ⤴ in this workbook to acquire the practical skill of transferring data to and from the HPC system.

    • scp (command)

    scp (secure copy) is a command-line tool for securely copying files between computers. scp uses the same authentication and security as the SSH (secure shell) protocol, which is widely used for secure remote login and other secure network services.
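
An scp upload follows the pattern `scp [options] source user@host:destination`. The sketch below only assembles that argument list so it can be inspected or handed to `subprocess.run`. The function name `build_scp_command` is hypothetical, and the user, host, and paths are placeholders you supply.

```python
def build_scp_command(local_path, user, host, remote_path, port=22, recursive=False):
    """Build the argument list for copying local_path to user@host:remote_path."""
    command = ["scp", "-P", str(port)]  # -P selects the SSH port
    if recursive:
        command.append("-r")  # copy directories recursively
    command += [local_path, f"{user}@{host}:{remote_path}"]
    return command
```

Passing the result to `subprocess.run(command, check=True)` would perform the copy, assuming `scp` is installed and SSH access to the host is already configured.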

    Tutorial: Follow the hands-on tutorial Copying Data via SSH using Command Line: scp, rsync ⤴ in this workbook to acquire the practical skill of transferring data to and from the HPC system.

    • Data Movers

    Some HPC systems include data movers, which are specialized hardware or software components that are designed to handle the high-speed transfer of large amounts of data. Data movers can be integrated with other HPC tools and systems to provide a seamless means of transferring data between HPC resources.

