07: Data Acquisition and Wrangling

Information is the foundation of the learning process. Data acquisition and wrangling are crucial parts of Data Science: they are the steps that turn raw information into knowledge. Large datasets that are difficult to transfer are usually accessed remotely, almost exclusively through a command-line interface. Luckily, knowing a few tricks makes it easy to access and visualize data on a remote machine in a friendly way. In this section, you will also learn how to manage Excel spreadsheets and efficiently manipulate massive datasets with Python.

1. Remote data access
1.1 Remote data transfer
1.1.1 Copying data using a graphical interface: Globus
1.1.2 Copying data via SSH using the command line: scp, rsync
1.1.3 File transfer using iRODS
1.2 Remote data download
1.2.1 Downloading online data using wget
1.2.2 Downloading online data using API
1.2.3 Downloading online data using Python-based web scraping
1.2.4 Downloading online repos using Git: [GitHub, Bitbucket, SourceForge]
1.2.5 Downloading a single folder or file from GitHub
1.3 Remote data preview (without downloading)
1.3.1 Viewing text files using UNIX commands
1.3.2 Viewing PDF and PNG files using X11 SSH connection
1.3.3 Viewing graphics in a terminal as the text-based ASCII art
1.3.4 Mounting remote folder on a local machine
2. Data manipulation
2.1 Manipulating Excel data sheets
2.1.1 Create worksheet from multiple text files
2.1.2 Export multiple worksheets as separate text files
2.1.3 Create index for all worksheets
2.1.4 Merge two spreadsheets using a common column
2.2 Manipulating text files with Python
2.2.1 Read, write, split, select data
3. Data wrangling: use ready-made apps
3.1 Merge files by common column (Python)
3.2 Aggregate data over slicing variations (Python)
3.3 Split data or create data chunks (Python)

Aleksandra Badaczewska

Table of Contents