Data Science Workbook
Being a modern scientist requires familiarity with digital data processing, regardless of the field. The Data Science Workbook summarizes essential background knowledge and good-practice tips, proposes state-of-the-art tools and methods based on benchmarks and reviews, and provides hands-on tutorials so users can learn through examples of real-world applications.
The workbook’s organization favors beginners by introducing them to sections related to Computer Science (Command-Line, Computer Setup, Development Environment, Programming, High-Performance Computing), Data Processing (Data Acquisition, Wrangling, and Visualization), and Project Management. Returning users can easily find the content of interest by browsing the Index and navigating directly to the desired section. The Glossary tab at the top of the page contains short definitions and related keywords to help users integrate the knowledge.
This workbook offers a solid introduction to Data Science by presenting principles that are universal across disciplines. To learn more about advanced techniques specific to a field (Bioinformatics, Geospatial, Artificial Intelligence), we refer you to our other Workbooks.
Preface
The rapid advance of computer technology in the early 21st century has lifted humanity from the analog to the digital world. That significantly pushed the boundaries of what was possible and feasible in terms of collecting, parsing, transferring, and storing data. Since then, digital data has become dominant, so much so that a 2013 study found that over 90% of the world’s data had been created in the preceding two years (in contrast to the earlier thousands of years of civilization). In 2020, the volume of created data reached 64 zettabytes (1 zettabyte = 10^21 bytes), and the trend remains exponential. Such enormous amounts of information, both structured and especially unstructured data, create a need for efficient computational tools that find patterns, filter by a key, and categorize records in order to extract knowledge and insights worth retaining. In short, that’s what Data Science does.
What is Data Science?
Data Science is a modern approach to the efficient computational processing of large sets of digital information for data mining and knowledge discovery. Regardless of the source (healthcare, social media, bioinformatics, logistics, finance, cybersecurity), data has become an entity in itself due to its size (Big Data) and nature (mainly Unstructured Data). That brings various technical challenges at all stages of the Data Lifecycle: Capturing, Maintaining, Processing, Analyzing, and Communicating. Thus, Data Science also focuses on solving these problems and developing innovative techniques unique to digital data (e.g., Machine Learning). All that makes it a highly interdisciplinary field that uses the latest developments in Computer & Information Science and is strongly supported by Mathematics and Statistics. This is complemented by specific Domain Knowledge, which leads to formulating relevant research hypotheses and setting boundaries for the analysis. Lastly, once knowledge has been extracted from the data, the most significant aim of Data Science is to present the results in an attractive and meaningful way. Here come the additional high-tech components of Data Integration, interactive Visualization, Graphic Design, and modern communication methods. Project Management becomes a significantly more important part of Data Science as the size of the project and the amount of data increase. It is not just about organization and timing but also about improving reproducibility, standardization, and Knowledge Retention.
Acknowledgements
Some of the funding to produce this Workbook was provided through a collaboration between Iowa State University and the SCINet project of the USDA Agricultural Research Service [ARS project number: 0500-00093-001-00-D].
Aleksandra Badaczewska created the graphics for the Workbook’s layout using the free and open-source software Inkscape (vector) and GIMP (raster). The original assets used for the graphic design come primarily from Adobe Stock, granted to Iowa State University under an Educational License 2022. For a detailed list of images and source information, see the sources file.