Algorithm - is a set of rules to be followed in problem-solving operations. Creating algorithms is a foundation of programming, where a developer defines a finite sequence of well-defined instructions to perform computations and process data. Typical elements of an algorithm, regardless of programming language, are conditionals and loops, which enable logical decisions and repetitive actions.
#bash scripting | #programming | #problem-solving |
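As a small illustration of an algorithm built from a conditional and a loop, here is a sketch of the Euclidean algorithm for the greatest common divisor. The example and the function name `gcd` are chosen only for illustration and are not part of the glossary.

```python
def gcd(a: int, b: int) -> int:
    """Greatest common divisor by the Euclidean algorithm."""
    # Conditionals: work with non-negative values only.
    if a < 0:
        a = -a
    if b < 0:
        b = -b
    # Loop: replace (a, b) with (b, a mod b) until the remainder is zero.
    while b != 0:
        a, b = b, a % b
    return a


print(gcd(48, 18))  # 6
```

The sequence of steps is finite because the second value strictly decreases on every pass through the loop.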
Artificial Intelligence - is a branch of Computer Science dealing with cognitive technology and simulation of intelligent behavior, including planning, learning, reasoning, problem-solving, knowledge representation, perception, motion, and manipulation.
#data science | #machine learning | #big data |
Bash - is a Unix shell and command language that allows users to execute various processes by writing text commands in the terminal window.
#unix-shell | #terminal | #command-line |
Big Data - refers to data characterized by its large volume, its variety, and the velocity at which it is generated and processed. These parameters continually expand and become a bottleneck for existing computational approaches. The field also integrates modern (i) analytical techniques (e.g., machine learning), (ii) technologies (e.g., distributed computing), and (iii) visualization solutions (e.g., interactive graphing and infographics), applied during the life cycle of Big Data.
#digital data | #structured data | #unstructured data | #data science |
Command Line -
is a text interface for the computer that passes the predefined commands to the operating system. Commands trigger the execution of various processes.
#unix-shell | #terminal | #bash |
Data Science -
is a modern conception of efficient computational processing of large sets of digital information for data mining and knowledge discovery. Data Science focuses on solving various technical challenges related to Big Data and developing innovative techniques unique to digital data (e.g., Machine Learning). It is a highly interdisciplinary field using the latest developments in Computer & Information Science, also strongly supported by Mathematics and Statistics, and complemented by specific Domain Knowledge.
#big data | #machine learning | #knowledge |
Digital Data - is a collection of observables registered in a computer-readable representation. A single item of data is called a datum. In digital form, every datum is ultimately encoded as a sequence of bits, where each bit (binary digit) takes the value “false, 0” or “true, 1”.
#structured data | #unstructured data | #big data |
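To make the idea of binary encoding concrete, the following sketch shows how a short piece of text is stored as bits. The helper name `to_bits` is hypothetical, and UTF-8 is assumed as the text encoding.

```python
def to_bits(text: str) -> list[str]:
    """Return each byte of the UTF-8 encoded text as a string of eight bits."""
    return [format(byte, "08b") for byte in text.encode("utf-8")]


print(to_bits("Hi"))  # ['01001000', '01101001']
```

Each character here occupies one byte, that is, eight binary digits.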
Distributed Computing - is a system of multiple computers connected over a network to create a computing cluster that appears as a single computer to the end-user. It provides a unified environment with access to shared resources (software, data, storage) and coordinated computing power. Distributed computing is a technique typically used in High-Performance Computing (HPC) infrastructures.
#queue | #job scheduler | #high-performance computing |
Environment -
development environment, is a workspace for developers where they create and modify the source code of their application (software, web, etc.). Nowadays, professional developers usually use an Integrated Development Environment (IDE), a software suite, often with a graphical interface, that makes various things easier for the programmer (general code management, tracking and pushing changes, file system browsing, file preview & editing, kernel loading, autocomplete, formatting, etc.);
programming environment, is a layer of settings specific to a given programming language or type of developed application. It can be isolated from the general operating system and provides a kind of virtual environment with an adjusted software configuration or modules loaded in a selected release. Virtual environments are commonly used when programming in Python and can be easily created using Conda (an environment management system).
#programming | #Python | #high-performance computing |
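A quick way to see which programming environment a Python session is running in is to inspect the interpreter itself. The sketch below is only an illustration: the function name `describe_environment` is hypothetical, the `in_venv` check covers environments created with `venv` or `virtualenv`, and the `CONDA_DEFAULT_ENV` variable is assumed to be set only when a Conda environment is active.

```python
import os
import sys


def describe_environment() -> dict:
    """Collect a few facts about the current Python environment."""
    return {
        # Version of the running interpreter, e.g. "3.11.4".
        "python_version": sys.version.split()[0],
        # Full path to the interpreter executable.
        "interpreter": sys.executable,
        # True inside a venv/virtualenv, where the two prefixes differ.
        "in_venv": sys.prefix != sys.base_prefix,
        # Name of the active Conda environment, or None if not set.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```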
File System -
is the organization of data retained in the digital storage on the computing machine. The content consists of hierarchically ordered folders and files. Folders are user-categorized containers for files. Files contain data and consume digital storage space. Some files belong to the operating system and include configurations, source code, and executables of various programs. Each file and folder is assigned an absolute path that defines its location in the file system. Knowing this path is very useful for navigating the file system from the command line.
#operating system | #command-line | #bash scripting |
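The relation between a relative location and its absolute path can be explored with Python's standard `pathlib` module. This is a minimal sketch, and the helper name `describe_path` is hypothetical.

```python
from pathlib import Path


def describe_path(location: str) -> dict:
    """Resolve a location to its absolute path and report basic facts."""
    absolute = Path(location).resolve()
    return {
        "absolute": str(absolute),        # full path from the file system root
        "parent": str(absolute.parent),   # folder that contains the item
        "name": absolute.name,            # last component of the path
        "is_absolute": absolute.is_absolute(),
        "exists": absolute.exists(),
    }


# "." denotes the current folder; its absolute path depends on where you run this.
for key, value in describe_path(".").items():
    print(f"{key}: {value}")
```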
Gnuplot - is a free program that can generate two- and three-dimensional plots of functions, data, and data fits. It can be used from the command line or through a graphical interface. You can export graphic files in a variety of formats (SVG, PNG, etc.) or analyze the results interactively, customizing the graph as you wish.
#data analysis | #bash scripting | #interactive graphing |
High-Performance Computing (HPC) - is the practice of performing computations that require high computational power not available from a single computer. HPC operates on a dedicated infrastructure in the framework of distributed computing that aggregates computing power in networks, such as computer clusters, supercomputers, and cloud-based services.
#distributed computing | #queue | #big data |
Information - is a meaningful and organized product of data processing. It condenses the data, concentrates its value and veracity, and provides context for querying during analysis.
#digital data | #big data | #raw data | #knowledge |
Interactive Graphing - is a method for data visualization that enables users to interact with the data on-the-fly, see details such as numerical values, and freely customize the final plot. This modern approach gives greater insight into the dataset and allows for collaborative work on data analysis.
#visualization | #big data | #knowledge |
Jupyter - is an integrated development environment (IDE) with an interactive web-based computing interface that supports programming in multiple languages, including Python, Java, R, Julia, Matlab, Octave, etc. The Jupyter interface takes the form of a notebook, where you can do it all in one place: (i) develop and execute code cells, (ii) write comments and documentation in markdown, and (iii) visualize and analyze results with interactive graphing.
#programming | #environment (IDE) | #python | #interactive graphing |
Knowledge - is a non-trivial insight extracted from the classification of data and the analysis of information. Knowledge, when applied, leads to problem-solving, improvements, and steady development.
#digital data | #big data | #raw data | #information |
Local Machine -
is the computer that the user is using with direct access to it. Usually, it is your personal computer.
#distributed computing | #HPC | #remote machine |
Machine Learning -
is a field of study focused on developing advanced computer algorithms that search for deeply coupled patterns in massive, disparate data and enable knowledge extraction. Machine learning methods are trained with large sets of data, and they learn from examples to make intelligent decisions without being explicitly programmed.
#big data | #unstructured data | #data science | #artificial intelligence |
Operating System -
OS, is the core software on the computer that manages computing resources, performs installations, and executes available programs. The command-line interface (CLI) and the Graphical User Interface (GUI) enable the user to interact directly with the operating system to set up, configure, and troubleshoot it. Among the popular operating systems are Windows, macOS, and Linux.
#command-line | #bash shell | #programming |
Programming - means creating a set of instructions for a computer on how to execute operations in order, following the assumptions and logical conditions of the algorithm. Many programming languages facilitate communication between the code developer and the computing machine. Bash enables shell scripting, using a command-line interpreter to automate repetitive tasks by executing pre-defined commands according to the requested conditionals and loops. More advanced analytical operations, including mathematical and statistical functions, modifying complex data structures, or processing non-text data, usually require a higher-level programming language such as Python or C++.
#algorithm | #command-line | #bash | #python | #R | #C++ |
Queue - in distributed computing, is an organized list (sequence) of tasks submitted to the job scheduler, which manages the computational resources and decides when to start or stop each task. The queue is ordered by wait time, user priority, and availability of requested resources. When the combination of these factors is advantageous, the submitted task begins executing, and its status changes from waiting to running. A queuing system is typical for distributed computing, such as a network of computer clusters shared by many users. Some of the most popular workload managers are SLURM and PBS.
#HPC | #distributed computing | #SLURM | #PBS |
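The ordering described above can be sketched as a toy priority queue. This is only an illustration under simplifying assumptions: the class `ToyJobQueue`, the job names, and the single resource (free cores) are hypothetical, and real workload managers such as SLURM or PBS apply far more elaborate policies.

```python
import heapq
from itertools import count


class ToyJobQueue:
    """Toy queue ordered by user priority, then by submission order."""

    def __init__(self, free_cores: int):
        self.free_cores = free_cores
        self._heap = []
        # Submission counter: a smaller number means a longer wait time.
        self._order = count()

    def submit(self, name: str, priority: int, cores: int) -> None:
        # heapq is a min-heap, so negate priority to serve high priority first.
        heapq.heappush(self._heap, (-priority, next(self._order), name, cores))

    def start_ready_jobs(self) -> list:
        """Move jobs from waiting to running while resources allow."""
        started, still_waiting = [], []
        while self._heap:
            job = heapq.heappop(self._heap)
            _, _, name, cores = job
            if cores <= self.free_cores:
                self.free_cores -= cores
                started.append(name)
            else:
                still_waiting.append(job)
        for job in still_waiting:
            heapq.heappush(self._heap, job)
        return started


queue = ToyJobQueue(free_cores=8)
queue.submit("align", priority=1, cores=4)
queue.submit("assemble", priority=5, cores=16)  # needs more cores than are free
queue.submit("plot", priority=1, cores=2)
print(queue.start_ready_jobs())  # ['align', 'plot']
```

Even though "assemble" has the highest priority, it keeps waiting because the requested resources are not available.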
Raw Data -
is data captured from the source that has not been processed before use. It usually has a large volume and serves as the primary unfiltered input in Data Science.
#digital data | #big data | #data science |
Remote Machine - is any other computer or computing network that the user can access by logging into the external network. Performing actions on a remote machine requires a secure login and often requires the user to have an account created by the network administrator. In scientific projects, we use remote computing machines as part of the HPC infrastructure to access high-performance computing and collaboratively share big data.
#distributed computing | #HPC | #local machine |
Structured Data -
is highly organized so that it can be easily read and interpreted by a computer. That includes a standardized format, enduring order, and categorization in a well-determined arrangement that facilitates managing and querying datasets in various combinations. A typical example of an organized data structure is a spreadsheet or relational database.
#digital data | #unstructured data | #big data |
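Because structured data follows a fixed arrangement of rows and named columns, it can be queried with a few lines of code. The sketch below uses Python's standard `csv` module on a small in-memory table; the column names and values are made up for illustration.

```python
import csv
import io

# Hypothetical table: every row has the same three named columns.
table = io.StringIO(
    "sample,organism,reads\n"
    "S1,maize,1200\n"
    "S2,rice,800\n"
    "S3,maize,950\n"
)

rows = list(csv.DictReader(table))

# Query by category: which samples come from maize?
maize_samples = [row["sample"] for row in rows if row["organism"] == "maize"]

# Aggregate a numeric column across all rows.
total_reads = sum(int(row["reads"]) for row in rows)

print(maize_samples)  # ['S1', 'S3']
print(total_reads)    # 2950
```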
Terminal - is a program that provides a text window for entering commands, for example to access files on your computer using the command line.
#command-line | #unix-shell | #bash |
Unix Shell -
is a command-line interpreter that provides a command-line user interface for a computer’s operating system (OS). The OS uses shell scripts/commands to control the execution of the programs and procedures.
#command-line | #terminal | #bash |
Unstructured Data - has no organized structure that can be easily detected, processed, and categorized by computer algorithms. This type of data is usually massive and descriptive in nature. Good examples are streams of highly varied text (e.g., emails, social media posts, online blogs, newspapers, books, and scientific publications), audio and video recordings, images and photos, data from various sensors (weather, traffic), and medical records.
#digital data | #structured data | #big data |
Visualization - is a visually powerful and semantically meaningful form of modern communication. In Data Science, interactive graphing and concise infographics support both the ease of extracting insights and the opportunity for deeper analysis for those interested. That contributes to better knowledge retention.
#data analysis | #knowledge | #interactive graphing |
WORKDIR - is a working directory for a project or computational task. It often appears as a variable or instruction that can be assigned a path to a selected location in the file system. That path is then used for all future commands that require a location, such as writing to a file. It is a common variable for workload managers on distributed computing infrastructures. The pathname of the current working directory can be accessed with the ‘pwd’ command or using the ‘$PWD’ environment variable.
#HPC | #command-line |
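The same information that ‘pwd’ reports in the shell is available from Python. The sketch below is a minimal illustration: the helper names `working_directory` and `resolve_output` are hypothetical, and the ‘PWD’ variable is assumed to be set by the shell (it may be missing on some systems).

```python
import os
from pathlib import Path


def working_directory() -> Path:
    """Return the current working directory of this process."""
    return Path.cwd()


def resolve_output(filename: str, workdir=None) -> Path:
    """Build an output path inside the given working directory.

    Falls back to the current working directory when none is given.
    """
    base = Path(workdir) if workdir else Path.cwd()
    return base / filename


print(working_directory())         # same location the 'pwd' command reports
print(os.environ.get("PWD"))       # value of $PWD, or None if it is not set
print(resolve_output("results.txt"))
```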