is a set of rules in problem-solving operations. Creating algorithms is a foundation of programming, where a developer defines a finite sequence of well-defined instructions to perform computations and process data. Among the typical elements of an algorithm, regardless of programming language, are conditionals and loops that enable repetitive actions and logical decisions.
is a shortcut or a shorthand notation used in command-line interfaces to represent a longer command or series of commands. By defining an alias, users can save time and reduce the effort required to execute frequently used commands. Aliases are typically defined in shell configuration files, such as .bashrc or .zshrc, allowing them to be automatically available in every new terminal session. For example, an alias can be set to replace a lengthy directory path with a simple command or to bundle multiple commands into one. This feature enhances productivity and efficiency in command-line operations.
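For example, a minimal sketch of defining aliases in a shell configuration file (the alias names, paths and package manager shown are illustrative assumptions):

```bash
# In ~/.bashrc or ~/.zshrc (names and paths are illustrative)
alias ll='ls -lh'                      # long listing with human-readable sizes
alias proj='cd /project/my_lab/data'   # jump to a frequently used directory (hypothetical path)
alias update='sudo apt-get update && sudo apt-get upgrade'   # bundle two commands (Debian-based systems)

# Reload the configuration so the aliases are available in the current session
source ~/.bashrc
```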
(formerly known as Singularity) is a container platform designed specifically for HPC environments. It allows users to create and run containers that package applications and their dependencies in a portable and reproducible way. Apptainer ensures that software runs consistently across different computing environments, from local workstations to HPC clusters. It supports seamless integration with HPC resource managers and adheres to security practices suitable for multi-user environments. Apptainer is particularly useful for scientific research and complex computational tasks, enabling researchers to easily share and deploy their applications.
is a branch of Computer Science dealing with cognitive technology and simulation of intelligent behavior, including planning, learning, reasoning, problem-solving, knowledge representation, perception, motion, and manipulation. AI systems can range from simple rule-based systems to complex machine learning models that improve over time by learning from data.
is the process of creating copies of data to protect against loss, corruption or accidental deletion. Backups ensure that important information can be restored in case of hardware failure, software issues or other disasters. There are various types of backup methods, including full, incremental and differential backups, each with different strategies for copying and storing data. Backups can be stored on local devices, external drives or cloud-based services. Regular and automated backups are crucial for data integrity and recovery, especially in environments where data is critical, such as businesses, research institutions and personal computing.
is a command language in the Unix shell that allows users to execute various processes by writing text commands in the terminal window. While the Unix shell is a general interface for command execution, BASH (Bourne Again SHell) is a specific implementation that offers enhanced features like scripting capabilities, improved command-line editing and customizable user environments.
.bashrc is a script file executed whenever a new interactive non-login shell is started for the Bash shell; it is typically used on local machines. .bash_profile is executed for login shells to set up the environment that will be inherited by all subsequent shells; it is commonly used in HPC environments and runs when users first log into an HPC system. It is common for .bash_profile to source .bashrc so that configurations needed for interactive use are also available in login shells. These hidden files configure the user's shell environment, including setting environment variables, defining aliases, customizing the command prompt and running startup commands. Both are typically located in the user's home directory and allow users to personalize their command-line interface, automate tasks and enhance their productivity by tailoring the shell's behavior to their preferences. In HPC environments:
- add environment variables, PATH modifications and module loads to .bash_profile (to set them up at the start of your login session)
- put interactive shell configurations in .bashrc (including aliases, functions and any other interactive shell customizations)
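A minimal sketch of how the two files are often organized; the module name, PATH addition and alias below are placeholders, not prescribed settings:

```bash
# ~/.bash_profile  (login shells, e.g., when logging into an HPC system)
export PATH="$HOME/bin:$PATH"      # example PATH modification
module load git                    # example module load at login (module name is illustrative)

# Source .bashrc so interactive settings are also available in login shells
if [ -f "$HOME/.bashrc" ]; then
    source "$HOME/.bashrc"
fi

# ~/.bashrc  (interactive non-login shells)
alias ll='ls -lh'                  # interactive conveniences: aliases, functions, prompt tweaks
```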
is the process of writing scripts using the BASH language to automate tasks in the Unix shell. It allows users to combine multiple commands into a single script file and execute repetitive tasks efficiently. Bash scripting allows users to simplify or automate operations performed on text or multiple files, including numerical calculations, saving time and reducing the potential for errors by eliminating the need to do it manually in the GUI or enter each command every time you need it.
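A small, self-contained sketch of the idea, assuming a directory with hypothetical .txt files:

```bash
#!/bin/bash
# count_lines.sh - report the number of lines in every .txt file in a directory (illustrative)
set -euo pipefail

dir="${1:-.}"                       # directory to scan; defaults to the current one
for file in "$dir"/*.txt; do
    [ -e "$file" ] || continue      # skip gracefully if no .txt files are present
    lines=$(wc -l < "$file")
    echo "$file: $lines lines"
done
```

Run it as, for example, `bash count_lines.sh data/`.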
focuses on the large size of data, its variety, and the velocity of generating and processing. These parameters continually expand and become a bottleneck for existing computational approaches. It also integrates modern (i) analytical techniques (e.g., machine learning), (ii) technologies (e.g., distributed computing), and (iii) visualization solutions (e.g., interactive graphing and infographics), applied during the life cycle of Big Data.

is a numbering system where data is expressed in base-2, using only two symbols, typically 0 and 1, to represent information. It is the fundamental language of computers, where each binary digit (bit) corresponds to a power of 2. Binary is used to encode all types of data, including numbers, text and instructions, enabling digital devices to perform calculations, store information and execute commands. This system underpins all computer operations and is essential for data processing, memory storage and communication in digital electronics.
in programming, is a set of instructions written in programming languages that a computer can execute. Source code forms the foundation of software applications, scripts and systems, enabling them to perform specific tasks and processes. It can range from simple scripts that automate routine tasks to complex systems that power applications and services. Writing source code involves using programming languages like Python, Java, C++ and many others, each suited to different types of development. Source code is essential for creating, maintaining and updating software, and it must be compiled or interpreted to run on a computer.
is the process of writing, testing, and maintaining the source code of both software applications and smaller code pieces that facilitate researchers’ tasks. It involves designing algorithms, implementing functionality, and debugging to ensure the software operates correctly. Key aspects include following coding best practices such as writing clean and readable code, using meaningful variable names, commenting and documenting code, maintaining version control, and performing regular code reviews. Adhering to these practices improves code quality, facilitates collaboration, enhances maintainability, and ensures that the software is reliable and scalable.
are software applications designed to facilitate teamwork and communication among individuals or groups, especially in a remote or distributed environment. These tools include platforms for real-time messaging and chat (e.g., Slack, Microsoft Teams), video conferencing (e.g., Zoom, Google Meet), project management (e.g., Trello, Asana) and document sharing and co-editing (e.g., Google Drive, Dropbox). Collaboration tools enhance productivity by enabling seamless communication, file sharing and task coordination, ensuring that team members can work together efficiently, regardless of their physical locations. They often feature integration with other software and services, further streamlining workflows and improving project management.
is a text interface for the computer that passes predefined commands to the operating system. These commands trigger the execution of various processes, allowing users to perform tasks such as file manipulation, program execution and system management directly through text input in the terminal window.
is a method in command-line interfaces that allows multiple commands to be executed in sequence, using operators to control the flow. Common operators include && (and), || (or) and ; (semicolon). For example, command1 && command2 will run command2 only if command1 succeeds. Use command chaining to streamline workflows by executing multiple commands based on conditions.
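A few hedged examples of chaining (directory, file and build-target names are illustrative):

```bash
mkdir -p results && cd results                        # enter the directory only if it was created successfully
grep -q "ERROR" run.log || echo "no errors found"     # run the second command only if the first one fails
make clean; make                                      # run both commands in sequence, regardless of exit status
```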
is a technique used in command-line interfaces to pass the output of one command as input to another command, using the pipe operator |. This allows for powerful combinations of commands to process data in stages. For example, command1 | command2 sends the output of command1 directly into command2. Use command piping to efficiently process and transform data through a series of commands.
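For instance, a sketch of a multi-stage pipeline (the exact columns reported by ps can vary between systems):

```bash
# List processes, keep the owner column, count occurrences per user and show the busiest users
ps aux | awk '{print $1}' | sort | uniq -c | sort -rn | head
```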
in computer programming and scripting are annotations within the code that are not executed by the compiler or interpreter. They are used to explain the purpose, logic or functionality of specific code segments, making it easier for others (and the original author) to understand and maintain the code. Well-commented code enhances readability and maintainability, especially in collaborative projects. Use comments to document complex code sections, provide context and clarify the intent behind specific implementations.
is an operation in version control systems that saves changes to the repository. It captures the current state of the project’s files, along with a message describing the changes, creating a snapshot in the project’s history. This allows developers to track progress, revert to previous states and collaborate effectively. Write clear and descriptive commit messages to provide context for the changes, making it easier to understand the project’s evolution and facilitating collaboration among team members.
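A typical sequence in Git might look like the sketch below (the file name and commit message are hypothetical):

```bash
git add analysis.py                                               # stage the changed file
git commit -m "Fix off-by-one error in sliding-window calculation"
git log --oneline -5                                              # inspect the most recent snapshots
```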
(in this workbook) refer to the resources and technologies that support computer-powered computations in research. This includes High-Performance Computing (HPC), cloud computing, specialized computer software, high-speed networking, and programming tools. These tools facilitate efficient and effective data processing, analysis and computation, enabling researchers to handle complex and large-scale tasks with enhanced speed and precision.
is a script commonly found in the source code of software packages that prepares the software to be built and installed on your system. It checks the system’s environment for necessary tools and libraries, sets up configuration files and creates a Makefile that guides the compilation process. The .configure script customizes the build process based on the specific characteristics of the target system, ensuring compatibility and optimizing performance. This script is typically used in Unix-like operating systems as part of the build process for open-source software.
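A common build sequence, assuming a hypothetical package name and installation prefix:

```bash
tar -xzf package-1.0.tar.gz && cd package-1.0    # unpack the source (package name is hypothetical)
./configure --prefix="$HOME/software/package"    # check the environment and generate a Makefile
make                                             # compile the software
make install                                     # install into the prefix chosen above
```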
are executable packages that bundle a specific application code along with its dependencies, ensuring that the software can run consistently across different environments. Containers include all necessary elements, such as libraries, system tools and runtime, needed to execute the application. Popular containerization platforms include Docker, commonly used for local machines, and Apptainer (formerly Singularity), designed for High-Performance Computing (HPC) environments. These tools enable the creation of portable and reproducible software environments, facilitating development, testing and deployment processes by isolating applications from underlying system variations.
is a command-line tool and library for transferring data with URLs, supporting a wide range of protocols, including HTTP, HTTPS, FTP and many more. It allows users to send and receive data from web servers, making it highly versatile for tasks such as downloading files, making API requests and web scraping. Curl is capable of handling various forms of authentication, SSL certificates and can execute requests with complex parameters. It is particularly useful for developers and system administrators for testing endpoints, automating web interactions and debugging network issues. Curl’s flexibility and extensive options make it a powerful tool for interacting with web services and retrieving online data.
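A few hedged examples (all URLs are placeholders):

```bash
curl -O https://example.com/data/sample.csv       # download a file, keeping its original name
curl -L -o tool.tar.gz https://example.com/tool   # follow redirects and save under a chosen name
curl -s https://api.example.com/status | head     # fetch an API response quietly and preview it
```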
is the process of collecting and measuring information from various sources to be used for analysis and research. This involves gathering data through sensors, instruments, surveys and other methods, and often includes digitizing and storing the data for further processing. In scientific research and engineering, data acquisition systems (DAQ) capture physical phenomena like temperature, pressure and sound, converting them into digital signals for analysis. Effective data acquisition ensures the accuracy and reliability of the data, forming the foundation for insightful analysis and informed decision-making.
is the process of handling, storing, and organizing data throughout its lifecycle to ensure its accuracy, accessibility and reliability. It includes tasks such as data collection, storage, processing, maintenance and archiving. Effective data management involves implementing policies and practices for data governance, quality control and security to protect data integrity and privacy. In research and business, good data management practices facilitate efficient data retrieval and analysis, support decision-making and enhance the overall value and usability of data.
is the process of adjusting, organizing, and transforming data to make it suitable for analysis. In the context of data processing, data preparation, or data wrangling, especially in research, it involves tasks like cleaning data, merging datasets, and converting data formats to ensure accuracy and consistency for subsequent analysis. It should not be confused with intentional data manipulation in statistical analyses, which is a serious issue that undermines the transparency and honesty of research.
is the process of viewing data without downloading it, particularly useful for data stored remotely on HPC systems. This can be achieved through various methods such as using command-line tools like less, more and head to quickly view text files, mounting folders with sshfs to access remote files as if they were local and utilizing X11-forwarding for running graphical applications remotely with local display. Additionally, Open OnDemand (OOD) provides a web-based interface for browsing and previewing files, while tools like Jupyter notebooks facilitate remote data visualization directly in the browser. These approaches enhance efficiency by allowing users to inspect data quickly and conveniently without transferring large files.
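Some quick previewing commands, assuming illustrative file names:

```bash
head -n 20 results.csv                  # first 20 lines of a file
less bigfile.log                        # scroll through a large file without loading it all
wc -l sequences.fasta                   # quick size check: number of lines
column -t -s, results.csv | less -S     # view a CSV as aligned columns, scrolling sideways
```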
is a modern conception of efficient computational processing of large sets of digital information for data mining and knowledge discovery. Data Science focuses on solving various technical challenges related to Big Data and developing innovative techniques unique to digital data (e.g., Machine Learning). It is a highly interdisciplinary field using the latest developments in Computer & Information Science, also strongly supported by Mathematics and Statistics, and complemented by specific Domain Knowledge.
is the method of saving digital information in a secure and organized manner for future access and use. It encompasses various technologies and solutions, including physical devices like hard drives and SSDs, cloud-based services such as Amazon S3 and Google Cloud Storage and specialized HPC storage solutions. In HPC environments, long-term data storage is essential for backup and archiving, ensuring that valuable research data is preserved over time. Additionally, online databases and data repositories (e.g., SQL databases, data warehouses and repositories like Zenodo) facilitate the efficient storage, retrieval, and sharing of data. Effective data storage practices include regular backups, data encryption, and implementing redundancy to prevent data loss, which are critical for preserving data integrity and supporting data analysis.
is the process of moving data from one location to another, which can be between different devices, systems or network locations. In the context of HPC and research, data transfer involves using secure and efficient methods to handle large datasets. Tools such as SCP (Secure Copy Protocol), SFTP (Secure File Transfer Protocol) and Globus facilitate data transfer, ensuring integrity and speed while minimizing the risk of data loss or corruption. Efficient data transfer is crucial for collaborative research, backup and data analysis.
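For example (the user name, host and paths are hypothetical):

```bash
# Copy a file to a remote HPC system
scp results.tar.gz alex@hpc.example.edu:/project/alex/archive/

# Synchronize a whole directory, transferring only what has changed
rsync -avz --progress data/ alex@hpc.example.edu:/project/alex/data/
```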
refer to the various forms in which data can exist and be utilized across different contexts. In classical data types, we have structured data (organized in predefined formats like databases and spreadsheets) and unstructured data (lacking specific structure, such as text documents and images). In programming, data types include primitive types (integers, floats, booleans), composite types (arrays, lists, tuples), objects, data frames (e.g., pandas DataFrame in Python), arrays (e.g., NumPy arrays) and matrices. Additionally, data types encompass various file formats like text files (TXT, CSV), binary files (EXE, BIN), markup languages (XML, HTML) and specialized formats (HDF5, JSON) for specific data storage and exchange needs.
is the process of cleaning, transforming and organizing raw data into a usable format that meets the specific requirements of a project. This involves various steps such as removing inconsistencies, handling missing values, normalizing data structures, and converting data types. Data wrangling is essential for preparing data for analysis, ensuring accuracy and enhancing the quality of insights derived from the data. It is a critical step in data science, enabling researchers and analysts to work with reliable and well-structured data.
are software applications and utilities that assist programmers in creating, debugging, maintaining and optimizing code. These tools include integrated development environments (IDEs) like Visual Studio Code and PyCharm, notebook interfaces like Jupyter and version control systems like Git. They support various programming languages and libraries, providing features like code completion, syntax highlighting, and real-time error detection. Developer tools also encompass build automation tools like Maven and Gradle, debugging tools, performance profilers and repository hosting platforms like GitHub and GitLab. These tools enhance productivity, facilitate collaboration, streamline workflows and ensure that the development process is efficient and effective.
is a workspace for developers where they create and modify the source code of their applications (software, web, etc.). Nowadays, professional developers usually use an Integrated Development Environment (IDE), a software suite with a graphical interface that makes many tasks easier for the programmer (general code management, tracking and pushing changes, file system browsing, file preview and editing, kernel loading, autocompletion, formatting, etc.). A programming environment is a layer of settings specific to a given programming language or type of developed application. It can be isolated from the general operating system and provides a kind of virtual environment with an adjusted software configuration or modules loaded in a selected release. Virtual environments are commonly used when programming in Python and can be easily created using Conda (an environment management system).
is a collection of information represented in a computer-readable format. Each piece of data, or datum, is stored as a binary value, where each digit can be either '0' (false) or '1' (true), representing one bit of information. Digital data encompasses various forms of information, including text, images, audio and video, all encoded in binary. This format allows data to be easily processed, stored and transmitted by digital devices. The use of digital data is fundamental to computing and underpins modern technologies and applications across numerous fields.
is a system of multiple computer machines connected over a network to create a computing cluster that appears as a single computer to the end-user. It provides a unified environment with access to shared resources (software, data, storage) and coordinated computing power. Distributed computing is a technique typically used in the High-Performance Computing (HPC) infrastructures.
is a disk image file format used on macOS to distribute software and other files. It acts like a virtual disk drive, containing the application or files packaged within it. When opened, a .dmg file mounts as a new volume on the desktop, allowing users to drag and drop the application into the Applications folder for installation. This format is commonly used for macOS software distribution due to its ease of use and ability to include licensing information, readme files and other documentation along with the software.
is the process of creating and maintaining written records that explain how a project, system, or piece of software works. In research and software development, documentation includes user guides, technical manuals, code comments, and project reports that provide essential information for understanding, using, and maintaining the work. Good documentation ensures knowledge retention, making it easier for others to learn, reproduce, and build upon the work, and it facilitates collaboration and consistency across teams.
is the process of anticipating, detecting and responding to errors or exceptions that occur during the execution of a program or software. It involves writing code to manage unexpected conditions, such as invalid input, resource limitations or hardware failures, ensuring that the program can handle these situations gracefully without crashing. Techniques for error handling include using try-catch blocks, validating input, logging errors and providing user-friendly error messages. Effective error handling improves the robustness and reliability of software, making it more resilient to bugs and operational issues.
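A minimal sketch of error handling in a Bash script, using common defensive options (the expected input argument is an assumption):

```bash
#!/bin/bash
# Illustrative error handling in a shell script
set -euo pipefail                               # stop on errors, unset variables and failed pipes
trap 'echo "Error on line $LINENO" >&2' ERR     # report where a failure happened

input="${1:?Usage: $0 <input_file>}"            # validate input: exit with a message if missing
if [ ! -r "$input" ]; then
    echo "Cannot read $input" >&2               # send the message to standard error
    exit 1
fi
```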
is a file extension for an executable file format used primarily on Windows operating systems. These files contain a program or a piece of software that can be run directly by the computer. When a user opens an .exe file, the operating system reads the instructions encoded within it to launch the application or perform specific tasks. Executable files can contain compiled code, resources, and other necessary components required to run the software. Due to their nature, .exe files are a common format for installing and running applications on Windows platforms, but they can also be a vector for malware, so it’s important to only execute .exe files from trusted sources.
is the organization of data retained in the digital storage on the computing machine. The content consists of hierarchically ordered folders and files. Folders are user-categorized containers for files. Files contain data and consume digital storage space. Some files belong to the operating system and include configurations, source code, and executables of various programs. Each file and folder is assigned an absolute path that defines its location in the file system. Knowing this path is very useful for navigating the file system from the command line.
is a free program that can generate two- and three-dimensional plots of functions, data, and data fits. It offers both a command-line and a GUI interface. You can export graphic files in a variety of formats (SVG, PNG, etc.) or analyze the results interactively, customizing the graph as you wish.
refer to visual representations of data, images and designs created and manipulated using computer software. In the context of computing and research, graphics involve generating plots, charts and visualizations to represent complex data sets clearly and intuitively. Tools and libraries such as Matplotlib, Plotly, and ggplot2 in programming languages like Python and R are commonly used for creating these visualizations. Graphics also encompass design and editing software like Adobe Photoshop and Illustrator, which are used for creating and modifying images and illustrations. Effective use of graphics enhances data interpretation, communication and overall presentation quality in various fields.
short for Graphical User Interface, is an interface through which users interact with computers and other electronic devices using icons, buttons and other visual elements. Unlike command-line interfaces, GUIs provide a more intuitive and user-friendly experience by allowing users to navigate and control software applications through graphical representations rather than text commands. GUIs are prevalent in operating systems (like Windows, macOS and Linux), software applications, and mobile devices, making technology accessible to a broader audience by simplifying complex operations through visual cues and interactive elements.
is the practice of performing computations that require more power than a single computer can provide. HPC utilizes dedicated infrastructure within the framework of distributed computing, combining the power of multiple computers through networks such as computer clusters, supercomputers and cloud-based services. This aggregation allows for the efficient processing of complex tasks that demand significant computational resources.
or a computer cluster is a group of computing machines or servers that work together as a single system. Clusters provide high-performance computing and parallel processing by distributing tasks across multiple machines.
is an application for software development that includes a code editor, debugging tools, and version control integration. IDEs are designed to make the process of writing, testing, and debugging code easier and more efficient.
is an application used to access the internet and browse websites and web pages. It interprets and displays HTML, CSS and JavaScript code, allowing users to interact with online content and services. Modern web browsers, such as Chrome, Firefox and Safari, support a wide range of in-browser services, including web-based applications like Open OnDemand (OOD) and JupyterLab. These services enable users to access HPC resources, run computational notebooks, manage files and perform complex tasks directly from the browser, providing a flexible and powerful interface for both everyday browsing and specialized computing needs.
is a meaningful and organized product of data processing. It compresses the underlying data, concentrates its value and veracity, and provides context for querying during analysis.
is a method for data visualization that enables users to interact with the data on the fly, see details such as numerical values, and freely customize the final plot. It is a modern approach that gives greater insight into the dataset and allows for collaborative work on data analysis.
is the process of managing and allocating computational tasks across shared computing resources such as HPC infrastructure. Since multiple users need to access these resources, job scheduling ensures an equal or prioritized distribution by gathering jobs into a queue for batch processing. These jobs are then sent to different nodes for computation, optimizing memory allocation and computer power usage. Job scheduling systems like SLURM and PBS automate this process, balancing the computational demand with the available processing power to maximize the efficiency and throughput of the HPC infrastructure, minimizing waiting times and ensuring fair access for all users.
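Typical SLURM commands for interacting with the queue (the script name and job ID are hypothetical):

```bash
sbatch run_analysis.sh     # submit a batch job script to the queue
squeue -u $USER            # check the status of your queued and running jobs
scancel 123456             # cancel a job by its ID
```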
is an integrated development environment (IDE) with an interactive web-based computing interface that supports programming in multiple languages, including Python, Java, R, Julia, Matlab, Octave, etc. The Jupyter interface takes the form of a notebook where you can, all in one place, (i) develop and execute code cells, (ii) write comments and documentation in Markdown, and (iii) visualize and analyze results with interactive graphing.
in computing is the core component of an operating system, responsible for managing memory, tasks and processes, facilitating communication between hardware and software. In the context of Jupyter, a kernel refers to the computational engine that executes the code contained in Jupyter notebooks. For instance, Python and Julia kernels in Jupyter allow users to write and run code in these respective programming languages, handling the execution of commands and returning the results within the notebook interface. This setup enables interactive computing and supports data analysis, visualization and other tasks within a flexible and user-friendly environment.
is non-trivial insight extracted from the classification of data and the analysis of information. When applied, knowledge leads to problem-solving, improvements, and steady development.
in programming is a collection of prewritten code designed to be used for common tasks. Programming libraries provide reusable functions, classes, and routines that developers can integrate into their own programs. For example, Numpy in Python is used for working with large arrays of data, offering a wide range of mathematical and statistical functions. Libraries help developers save time and effort by leveraging existing solutions, ensuring consistency, and improving code reliability.
is a collection of pre-written source code, dependencies, and modules available for use in programming. In the context of HPC clusters, these modules are pre-installed and can be easily loaded to provide essential functions and tools, streamlining development and ensuring compatibility with various software requirements. They help developers save time by reusing code and avoiding the need to build software from scratch.
is a field of study focused on developing advanced computer algorithms that search for deeply coupled patterns in massive, disparate data and enable knowledge extraction. Machine learning methods are trained with large sets of data, and they learn from examples to make intelligent decisions without being explicitly programmed.
is a small piece of a larger program. Modular programming is a way to design software such that each module is independent and can be used to execute a part of the overall functionality.
is the process of making a directory on a remote server accessible on a local machine as if it were part of the local filesystem. This is especially useful in HPC environments where large datasets are stored on remote servers. By mounting a remote HPC directory, you can preview and work with data directly without downloading it. NOTE: Tools like SSHFS allow you to securely mount remote directories over SSH, enabling seamless access and manipulation of remote files from your local machine.
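A sketch of mounting and unmounting with SSHFS (the user, host and paths are hypothetical):

```bash
mkdir -p ~/hpc_data                                   # create a local mount point
sshfs alex@hpc.example.edu:/project/alex ~/hpc_data   # mount the remote directory over SSH
ls ~/hpc_data                                         # browse remote files as if they were local
fusermount -u ~/hpc_data                              # unmount on Linux (use `umount ~/hpc_data` on macOS)
```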
are productivity software applications designed to assist with common tasks in a professional or academic setting. These tools include word processors (e.g., Microsoft Word, Google Docs), spreadsheets (e.g., Microsoft Excel, Google Sheets), presentation software (e.g., Microsoft PowerPoint, Google Slides), email clients (e.g., Microsoft Outlook, Gmail), password managers and schedule managers among other applications. Office tools facilitate document creation, data analysis, notetaking, presentations and communication, enhancing efficiency and collaboration. They often come with features like templates, real-time collaboration, cloud storage and integration with other software, making them essential for productivity and effective project management.
is a web-based interface that provides easy and flexible access to High-Performance Computing (HPC) resources. It allows users to manage computational tasks, transfer files and access software applications from any location using a web browser. OOD simplifies the process of interacting with HPC systems by offering an intuitive interface, eliminating the need for complex command-line operations, and enabling users to monitor and control their jobs, visualize data and collaborate more effectively.
OS, is the core software on a computer that manages computing resources, performs installations, and executes available programs. The command-line interface (CLI) and the graphical user interface (GUI) enable the user to interact directly with the operating system to set up, configure, and troubleshoot it. Among the popular operating systems are Windows, macOS, and Linux.
is a type of computation in which many calculations or processes are carried out simultaneously, leveraging multiple processors or cores to solve complex problems faster. This approach divides a large problem into smaller, independent tasks that can be processed concurrently, significantly reducing computation time. In HPC environments, parallel computing can also involve using modules such as the parallel module, which simplifies the execution of parallel tasks for users. This allows even simple tasks to benefit from parallel processing, enhancing performance and efficiency. Parallel computing is essential in fields such as scientific research, engineering and big data analysis, and it is implemented through various models and architectures, including multi-core processors, clusters and distributed systems. It enables the handling of large-scale computations that would be impractical with sequential processing.
is a package file format used on macOS to distribute and install software. It contains all the files and metadata needed for an application, including installation scripts, resources and configuration files. When a user opens a .pkg file, the macOS Installer application processes it to install the software on the system. This format is commonly used for commercial and open-source macOS software distribution, providing a streamlined and user-friendly installation process. In HPC environments, .pkg files can also be used to distribute software across macOS-based clusters, ensuring consistent and reliable installations.
is the process of creating visual representations of data to facilitate understanding and analysis. In the context of research and data science, plotting involves using tools and libraries, such as Matplotlib, Plotly and ggplot2, to generate graphs, charts and plots. These visualizations help in identifying patterns, trends and outliers in the data, making it easier to interpret results and communicate findings effectively.
means creating a set of instructions for a computer on how to execute operations in order, following the assumptions and logical conditions of the algorithm. Many programming languages facilitate communication between the code developer and the computing machine. Bash enables shell scripting through a command-line interpreter, automating repetitive tasks by executing predefined commands according to the requested conditionals and loops. More advanced analytical operations, including mathematical and statistical functions, modifying complex data structures, or processing non-text data, usually require a higher-level programming language such as Python or C++.
is the practice of planning, organizing, and overseeing the execution of a project to achieve specific goals within defined constraints such as time, budget, and resources. It involves coordinating tasks, managing team members, and ensuring that project milestones are met. Key components include defining project objectives, creating detailed plans, allocating resources, monitoring progress, and adjusting strategies as needed to ensure successful project completion and publication of research outcomes. Effective project management ensures that projects are delivered on time, within scope, and to the desired quality standards.
in distributed computing, is an organized list (sequence) of tasks submitted to the job scheduler, which manages the computational resources and decides when to start or stop each task. The queue is ordered by wait time, user priority, and availability of requested resources. When the combination of these factors is advantageous, the submitted task begins executing, and its status changes from waiting to running. The queuing system is typical for distributed computing, such as a network of computer clusters shared by many users. Some of the most popular workload managers are SLURM and PBS.
refers to a type of digital image composed of a grid of individual pixels, each with its own color value. Raster images, also known as bitmap images, are resolution-dependent, meaning their quality decreases when scaled up. Common file formats for raster images include JPEG, PNG, GIF and BMP. Raster graphics are widely used for detailed and complex images like photographs and digital paintings. They are created and edited using software such as Adobe Photoshop and GIMP. Unlike vector graphics, raster images cannot be resized without losing quality, making them less suitable for applications requiring scalability.
is the unprocessed data captured directly from its source, retaining its original form without any filtering, cleaning or transformation. It typically has a large volume and includes all the details and potential noise inherent in the data collection process. In data science, raw data serves as the primary input, providing the foundational information required for analysis. Processing raw data involves steps such as cleaning, normalization and transformation to prepare it for meaningful analysis and interpretation. This initial, unaltered state is crucial for ensuring the accuracy and integrity of the subsequent data processing and analysis stages.
is a text file typically included in the root directory of a project or software package. It is a part of the documentation and provides essential information about the project, such as its purpose, installation instructions, usage guidelines and how to contribute. The README file helps users and developers understand the project’s scope and how to get started with it. A well-written README enhances the usability and accessibility of a project, making it easier for others to use, contribute to and maintain the software.
is the ability to connect to and use a computer or network from a different location through the internet or another network. This allows users to access files, applications, and system resources as if they were physically present at the remote location. Tools like SSH (Secure Shell) and VPN (Virtual Private Network) facilitate secure remote access, enabling efficient and flexible work, troubleshooting and system management from anywhere. Open OnDemand (OOD) provides a web-based interface for accessing HPC resources, making it easier for users to manage their computational tasks remotely.
is any other computer or computing network that the user can access by logging into the external network. Performing actions on a remote machine requires a secure login and often requires the user to have an account created by the network administrator. In scientific projects, we use remote computing machines as part of the HPC infrastructure to access high-performance computing and collaboratively share big data.
is a storage location for software code (also data or documentation) and its history, managed by a version control system (VCS). It tracks changes to the file content, enabling multiple users to collaborate, merge contributions and revert to previous versions when needed. Repositories can be hosted locally or on remote servers. Using repositories in version control systems like Git, Subversion or Mercurial helps maintain an organized, collaborative workflow, ensuring that all changes are documented and can be traced back to their origin.
in the context of research, refer to the various tools, materials, and supports needed to conduct a study or project effectively. This includes physical resources like laboratory equipment, computing resources such as HPC clusters, software and long term storage space, data resources like datasets and databases, and human resources such as skilled personnel and collaborators. Proper management and allocation of these resources are crucial for the success of research activities, ensuring that all necessary components are available and efficiently utilized throughout the research process.
is the process of distributing available resources efficiently to achieve specific objectives. In general, it involves assigning tasks, funding and materials to various activities and projects to maximize productivity and outcomes. In research project management, resource allocation entails planning and distributing resources such as personnel, budget, equipment and time to different phases and tasks of a research project; effective allocation ensures that all aspects of the project are adequately supported and that milestones are met within the planned schedule. In computing, especially on HPC, resource allocation involves distributing computational resources like CPU cores, memory, storage and network bandwidth to various tasks and users. In HPC environments, job scheduling systems like SLURM manage resource allocation to optimize the use of the computing cluster, ensuring fair access, maximizing efficiency and minimizing waiting times for queued jobs.
stands for Red Hat Package Manager, a package management system used primarily on Linux distributions like Red Hat, Fedora and CentOS. An .rpm file is a package file that contains the compiled software, along with metadata such as dependencies, installation scripts and version information. RPM packages facilitate the installation, upgrade and removal of software, ensuring that all necessary components are correctly deployed and configured. This system is typical for HPC environments and helps maintain consistency across numerous nodes and the stability of the operating system by handling complex dependencies and conflicts automatically.
is a type of graph used to display the relationship between two variables. Each point on the plot represents an individual data point with its position determined by the values of the two variables. Scatter plots are useful for identifying patterns, correlations and outliers within the data. Use scatter plots to visually analyze the relationship between two variables, such as comparing height and weight or gene expression levels and protein abundance. They are particularly helpful in exploratory data analysis and regression analysis.
is a text file containing a sequence of commands that are executed by a scripting engine or interpreter. Script files automate repetitive tasks, configure systems, and run complex sequences of actions without manual intervention. Common scripting languages include Bash (shell script), AWK, Python, Perl and JavaScript. Script files enhance productivity by automating routine tasks and can be scheduled to run at specific times or events, making them valuable tools for system administration, data processing and software development.
Shell variables are variables defined within a shell session that store data and can be used in commands and scripts. They are local to the shell in which they were created and are not accessible to child processes. Users can create a shell variable with a custom name and value using syntax like myVar='Hello'. Environment variables are a type of shell variable that are exported to, and accessible by, child processes of the shell. They are used to pass configuration settings and other data to applications and scripts running in the shell environment. While some environment variables have predefined names (e.g., PATH, HOME), users can modify their values and create new ones using the export command, such as export PATH="/usr/local/bin:$PATH".
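A short sketch illustrating the difference in visibility to child processes (the variable names are arbitrary):

```bash
myVar='Hello'             # shell variable: visible only in the current shell
bash -c 'echo "$myVar"'   # prints an empty line - the child process does not see it

export MYVAR='Hello'      # environment variable: exported to child processes
bash -c 'echo "$MYVAR"'   # prints: Hello
```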
(Simple Linux Utility for Resource Management) is a powerful and widely-used cluster management and job scheduling system designed for high-performance computing (HPC) environments. It manages the allocation of resources such as CPU, memory and storage across multiple nodes in a computing cluster. SLURM enables users to submit, schedule, and monitor jobs efficiently, ensuring optimal resource utilization and job prioritization. Its capabilities include workload management, job dependency handling and resource reservation, making it an essential tool for researchers and engineers working in HPC settings. SLURM’s flexibility and scalability allow it to support a wide range of computational workloads and complex workflows.
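A minimal example batch script; the resource values, module name and script name are illustrative assumptions that depend on the cluster:

```bash
#!/bin/bash
#SBATCH --job-name=demo          # job name
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=01:00:00
#SBATCH --output=demo_%j.log     # %j is replaced with the job ID

module load python               # load required software (module name is an assumption)
python analyze.py input.csv      # the actual computation (script name is hypothetical)
```

Submit it with `sbatch demo_job.sh` and monitor it with `squeue -u $USER`.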
is a type of file in Unix and Linux systems that points to another file or directory. It acts as a shortcut, allowing users to access the target file or directory using a different path without duplicating the actual content. Softlinks can span different file systems and can point to directories, unlike hard links. They save storage space by avoiding multiple copies of the same file and provide the convenience of having symlinked files accessible on the selected path for easier access and execution. NOTE: Use softlinks to create convenient access points for frequently used files or directories, simplifying navigation and organization. This example creates a symlink to the target:
ln -s /path/to/target /path/to/softlink
is a digital document that organizes data in rows and columns, allowing for efficient data entry, manipulation and analysis. Each cell in a spreadsheet can contain text, numbers, or formulas that perform calculations on other cells. Spreadsheets are widely used for financial planning, data analysis, and project management. Popular spreadsheet software includes Microsoft Excel (Windows, macOS), Google Sheets (in-browser), and LibreOffice Calc (Linux), which offer features like pivot tables, chart creation and collaborative editing to enhance data management and analysis.
(Secure Shell File System) is a network file system that allows you to mount and interact with remote directories (e.g., those located on HPC system) on your local machine using SSH (Secure Shell). This means you can access and manipulate files on a remote server as if they were on your local computer, providing a secure and convenient way to manage remote files. SSHFS is particularly useful for accessing remote files securely and efficiently, making it easier to work with files stored on remote servers without having to transfer them back and forth.
STANDARD ERROR, alt. Standard streams: Standard error (stderr)
is a standard stream used in command line and computer programming to output error messages and diagnostics. It allows errors to be separated from standard output (e.g. automatically saved to a different file), enabling better debugging and error handling.
STANDARD INPUT, alt. Standard streams: Standard input (stdin)
is a standard channel that programs use to receive input. When a program needs data from the user or another program, it reads from stdin. This is commonly used for interactive data entry in the command line or to process data from files or other programs.
STANDARD OUTPUT, alt. Standard streams: Standard output (stdout)
is a channel that programs use to send regular output. The results or data produced by a program are sent to stdout, which can be displayed on the screen, redirected to a file or piped to another program for further processing.
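A short sketch of how the three standard streams are typically redirected in the shell (the file and script names are illustrative):

```bash
sort < unsorted.txt > sorted.txt          # read stdin from a file, write stdout to another file
./analysis.sh > run.log 2> errors.log     # keep regular output and error messages in separate files
./analysis.sh > all.log 2>&1              # merge stderr into stdout and capture both together
```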
are established norms or requirements in various fields to ensure consistency, quality, and interoperability. In the context of research, standards are crucial for transparency, good practices, state-of-the-art approaches and ethics. They guide researchers in conducting and reporting studies reliably, facilitating reproducibility and maintaining ethical considerations in data handling and experimentation. Standards help ensure that research findings are credible, verifiable and can be built upon by the broader scientific community. Examples of standards include metrology standards for precise measurements, standard operating procedures (SOPs) for consistent processes, standard algorithms for efficient computations, learning standards in education, BioCompute objects for NGS data standardization and software standards for development and interoperability. These standards help ensure credibility, verifiability and the ability to build upon findings across various disciplines.
is the science of collecting, analyzing, interpreting, and presenting data. It involves designing experiments and surveys, summarizing data with descriptive statistics and making inferences using inferential statistics. Statistics plays a crucial role in modern data science, providing the foundation for developing efficient algorithms and handling big data to make informed decisions and uncover patterns and relationships within large datasets.
is a limit set by administrators on the amount of disk space that a user or group can use on a computer system or network. This ensures fair distribution of storage resources and prevents any single user from consuming excessive disk space, which can impact the performance and availability of storage for others. HPC administrators can set and manage storage quotas to optimize resource allocation and maintain system performance, using tools and commands specific to the operating system, such as quota in Unix environments.
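Commands that help inspect usage on many Unix-like systems; HPC centers often provide their own site-specific quota tools, so treat these as illustrative:

```bash
quota -s      # show your current disk quota in human-readable units (where quotas are enabled)
du -sh ~/*    # estimate how much space each item in your home directory uses
```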
is highly organized and easy to decipher digitally. It follows a standardized format, a stable order, and categorization in a well-determined arrangement that facilitates managing and querying datasets in various combinations. A typical example of an organized data structure is a spreadsheet or relational database.
is an open-source version control system that helps manage changes to source code and documents. It uses a centralized repository model, where all versioned files and their histories are stored on a single server, facilitating collaboration by tracking revisions, maintaining a history of changes and enabling the recovery of previous versions. Unlike Git, which is a distributed version control system allowing each user to have a complete local copy of the repository, Subversion relies on a central server for all operations. This centralization can simplify management but may limit offline work and flexibility compared to Git.
(short for ‘superuser do’) is a command in Unix and Linux systems that allows a permitted user to execute a command as the superuser or another user, as specified by the security policy. It is commonly used to perform administrative tasks without needing to log in as the root user. Use sudo to run commands with elevated privileges, for example, sudo apt-get update to update package lists. This enhances security by limiting the time and scope of superuser access.
is a highly advanced computing machine designed to perform extremely complex calculations at incredibly high speeds. It consists of thousands of interconnected processors working in parallel to achieve performance far beyond that of typical computers. Supercomputers are used for tasks requiring massive computational power, such as climate modeling, simulations, scientific research and data-intensive tasks like genome sequencing.
is a user account with elevated privileges that allows for complete control over a computer system. In Unix and Linux systems, the superuser is typically referred to as root. This account can perform all administrative tasks, such as installing software, configuring system settings and managing other user accounts. Use superuser privileges with caution, as actions taken with this account can significantly impact system stability and security. Commands like sudo can temporarily grant superuser rights for specific tasks.
is the process of ensuring that files in two or more locations are consistently updated. It involves copying, updating and deleting files to maintain identical sets across devices or storage locations. Syncing is shorthand for file synchronization, often automated to keep files consistent across multiple devices or platforms, and commonly used in cloud storage services like Google Drive and Dropbox. rsync is a command-line utility for Unix-like systems that efficiently synchronizes files and directories by transferring only the differences between source and destination, making it ideal for backups, mirroring and ensuring data consistency across storage locations. Using rsync, you can synchronize files between local (personal) and remote (e.g., HPC) systems with a command like:
rsync -avz source/ destination/
where -a preserves attributes, -v enables verbose output and -z compresses data during transfer for efficiency.
refers to the set of rules and structures that define the combinations of symbols that are considered valid statements or expressions in a language. It specifies how to write commands, functions, loops, conditionals and other elements of code, ensuring that the compiler or interpreter can correctly understand and execute the instructions. Each programming language has its own unique syntax, which developers must follow to write functional and error-free code. NOTE: Understanding and adhering to the syntax of a programming language is crucial for writing effective and efficient code, as it ensures proper communication with the computer and helps in debugging and maintaining the software.
is a feature in text editors and Integrated Development Environments (IDEs) that visually differentiates code elements based on their syntactic meaning. It uses different colors and font styles to highlight keywords, variables, operators, strings, comments and other language constructs. This helps programmers quickly identify the structure and flow of their code, making it easier to read, write, and debug.
refers to the process of installing and/or configuring an operating system (OS) on a personal (local) machine, as well as setting up a user account on an HPC cluster (remote machine), including HOME directory management. This involves steps to ensure the command-line interface (CLI) is functional and the installation of useful tools such as office suites, development environments, and graphic software to enhance productivity and reproducibility. Proper system setup maximizes the efficiency and capabilities of the computing tool, allowing users to effectively utilize both local machines and remote HPC resources.
is a program that provides a command-line interface (CLI) for interacting with your computer’s operating system. It allows users to access and manage files, execute commands and run scripts by typing text-based commands. The terminal is a powerful tool for performing tasks such as navigating the file system, managing processes, configuring system settings and automating repetitive tasks. It is widely used by developers, system administrators and power users for its efficiency and flexibility compared to graphical user interfaces (GUIs). The terminal is the typical interface for interacting with High-Performance Computing (HPC) systems.
are software tools used for creating and modifying plain text files. They come in several main types:
1. GUI text editors provide a graphical interface with menus, buttons and other visual elements. Examples include TextEdit (macOS), Notepad (Windows) and Gedit (Linux). They are user-friendly and suitable for basic text editing tasks.
2. CLI text editors operate within the command-line interface, offering a more lightweight and resource-efficient way to edit text files. Examples include Vim, Nano and Emacs. CLI editors are especially useful in environments where graphical interfaces are unavailable, such as remote servers and HPC systems.
3. IDEs (Integrated Development Environments) are comprehensive software suites that combine text editing with other tools like debuggers, compilers and version control. Examples include Visual Studio Code (VSC), PyCharm, Eclipse and IntelliJ IDEA. IDEs provide a robust environment for software development with features that enhance productivity and code quality.
is a file that contains plain text without any formatting, typically encoded in formats such as ASCII or UTF-8. Text files can be created and edited using simple text editors such as Notepad (GUI) or Vim and Nano (terminal CLI), and are commonly used for storing data, configuration settings and programming code. Text files are versatile and can be used for a variety of purposes, from writing scripts and storing logs to documenting information, making them essential in many computing tasks.
is the process of modifying and organizing text to achieve a desired format or structure. In the context of GNU Core Utilities, it involves using command-line tools like cat, cut, tr or sort to perform operations such as reading, replacing, extracting and sorting text data. These tools enable efficient and automated handling of text files, simplifying complex text processing tasks.
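A few hedged examples using GNU Core Utilities (the file names and column positions are illustrative):

```bash
cut -d, -f2 samples.csv | sort | uniq -c        # count occurrences of values in the 2nd CSV column
tr 'a-z' 'A-Z' < notes.txt > notes_upper.txt    # convert a file to upper case
sort -k2,2n measurements.tsv | head             # sort by the numeric 2nd column and preview the top rows
```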
in the context of a research project, refers to a schedule that outlines the key milestones and deadlines throughout the project’s duration. It includes specific tasks, their start and end dates and major deliverables. A well-defined timeline helps manage progress, allocate resources efficiently and ensure that the project stays on track. Creating a detailed timeline with clear milestones allows for better tracking of progress and early identification of potential delays, ensuring timely completion of the project.
is a construct in Python used for error handling, allowing developers to manage exceptions and execute code that may cause errors. The try block contains code that might raise an exception, while the except block contains code to handle the error if it occurs. This mechanism helps ensure that programs can handle unexpected issues gracefully without crashing. Tip: Use try-except blocks to catch and handle specific exceptions, improving the robustness and reliability of your code.
is a command-line interpreter that provides a user interface for interacting with a computer’s operating system (OS). It allows users to execute commands, run scripts and control the execution of programs and procedures through text-based input. Shells such as Bash, Zsh, and Csh interpret and execute user commands, facilitating tasks like file manipulation, process management and system configuration. Shell scripts, which are sequences of commands stored in a file, automate complex tasks and workflows, enhancing productivity and efficiency.
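As an illustrative sketch (the file name backup_logs.sh and the .log pattern are assumptions), a small shell script might look like:
#!/usr/bin/env bash
# copy all .log files from the current directory into a dated folder
dest="logs_$(date +%F)"
mkdir -p "$dest"
cp ./*.log "$dest"/
echo "Copied log files to $dest"
Saving these lines in backup_logs.sh and running bash backup_logs.sh executes the whole sequence in one step.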
has no organized structure that can be easily detected, processed and categorized by computer algorithms. This type of data is usually massive and descriptive in nature. Typical examples include streams of highly varied text (e.g., emails, social media posts, online blogs, newspapers, books and scientific publications), audio and video recordings, images and photos, data from various sensors (weather, traffic) and medical records.
refers to the practice of leveraging multiple CPU cores to perform parallel computing tasks, thereby speeding up computations by distributing workloads across several processors. To use multiple cores in parallel computing on HPC, you can load the GNU Parallel module. This tool helps to execute jobs in parallel by distributing tasks across available CPU cores. For example, you can load the module and use it as follows:
module load parallel
parallel ::: command1 command2 command3
This command runs command1, command2 and command3 in parallel, utilizing multiple cores for faster execution.
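A slightly more concrete sketch (assuming a set of .txt files exists and that the module name matches your cluster) applies the same command to many inputs at once:
module load parallel        # module name may differ between clusters
parallel gzip ::: *.txt     # compress every .txt file, one task per available core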
is a Python module for working with URLs, providing functions and classes to open, read, parse and interact with web resources. It includes submodules like urllib.request for accessing URLs, urllib.parse for parsing URLs and urllib.error for handling errors. Urllib is commonly used in web scraping to fetch and process data from web pages.
is a fundamental concept in computing and programming, representing a storage location identified by a name that holds data which can be changed during program execution. Variables serve different purposes across various contexts:
1. Shell Variables: These are variables that are defined within a shell session and are available only to the shell in which they were created. They can store temporary data and are often used in shell scripts. For example:
myVar='Hello, Shell'
2. Environment Variables: These are dynamic values (a specific type of shell variable) that affect the behavior of processes running on an operating system. They are used to pass configuration information to applications and scripts. Environment variables are set using the export command in Unix-like systems. For example:
export PATH="/usr/local/bin:$PATH"
echo $PATH
NOTE: All environment variables are shell variables, but not all shell variables are environment variables. Environment variables have a broader scope and can influence the behavior of the entire session and its child processes.
3. Bash Variables: In shell scripting, Bash variables store data and can be used to pass information between commands or scripts. They are defined without a type and can hold strings or numbers. For example:
string_var='Hello, World!'
number_var=5
4. Programming Variables: In programming languages, variables are used to store data values, such as numbers, strings and objects, which can be manipulated through code. For example, in Python:
x = 10
name = 'Alice'
list_obj = ['Alice', 'Bob', 'Amy', 'Tom']
refers to two distinct concepts in computing and programming: 1. Data Types in R Programming: In R, a vector is a basic data structure that holds elements of the same type. Vectors can be numeric, character, logical or other types, and are used for storing data sequences. Operations on vectors are performed element-wise, making them fundamental for data manipulation and analysis in R. 2. Vector Graphics: In computer graphics, vector graphics represent images using mathematical formulas to define points, lines, curves and shapes. Unlike raster graphics, which are composed of pixels, vector graphics are scalable without loss of quality. Common formats include SVG, EPS and PDF. Vector graphics are widely used in design and illustration software, such as Adobe Illustrator and Inkscape, for creating logos, icons and other scalable graphics.
is a tool in Python that creates isolated virtual environments for projects, ensuring the specific dependencies of a project are met without affecting other projects or the global Python installation. Each virtual environment has its own Python interpreter and can have its own set of installed packages, independent of other environments. This isolation helps manage dependencies and avoid conflicts between different projects. venv is included in the standard library from Python 3.3 onwards. It is commonly used in development to ensure consistency across development, testing, and production environments.
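A typical command-line workflow might look like the following (the environment directory .venv and the package name requests are just examples):
python3 -m venv .venv           # create the isolated environment
source .venv/bin/activate       # activate it for the current shell session
pip install requests            # packages are installed only inside the environment
deactivate                      # return to the system-wide Python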
is a system that records changes to files or sets of files over time, allowing users to track revisions, revert to previous versions, and collaborate effectively. In research and software development, version control systems like Git help manage modifications to source code, documents, and datasets, ensuring that every change is documented and can be traced. This enhances collaboration by enabling multiple users to work on the same project simultaneously without conflict and helps maintain a complete history of the project’s evolution, aiding in knowledge retention and continuity.
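For example, a minimal Git session (the file name and commit message are placeholders) might look like:
git init                        # start tracking the current directory
git add analysis.R              # stage a file for the next snapshot
git commit -m "Add first draft of the analysis script"
git log --oneline               # review the recorded history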
is a digital recording of moving visual images, typically accompanied by audio. It can be stored in various formats (e.g., MP4, AVI, MOV) and viewed on multiple devices such as computers, smartphones and televisions. Videos are used for entertainment, education, communication and information sharing, making them a versatile medium for conveying complex ideas and stories.
is the technology that enables real-time visual and audio communication between users in different locations. It utilizes internet-connected devices equipped with cameras and microphones to facilitate face-to-face meetings without physical presence. Commonly used platforms include Zoom, Microsoft Teams and Google Meet. In the context of research and collaboration, video conferencing allows teams to discuss projects, share screens, present findings and collaborate effectively regardless of geographic distance. This technology enhances communication, reduces travel costs and improves productivity by making it easier to conduct meetings, interviews, and collaborative sessions remotely.
is a software application used to manipulate and arrange video footage. It allows users to cut, trim and merge video clips, add effects, transitions and titles, and adjust audio levels. Popular video editors include Adobe Premiere Pro, Final Cut Pro and DaVinci Resolve. These tools are essential for creating professional-quality videos for various purposes, such as marketing, education, research & collaboration, and professional projects.
is a highly configurable and powerful text editor, an enhanced version of the vi editor found on most Unix systems. Known for features like syntax highlighting, text completion and support for multiple programming languages, Vim is popular among developers and system administrators. It operates in various modes, primarily Normal, Insert and Command-line, allowing efficient text manipulation with simple keystrokes. Vim is especially popular on HPC systems for in-terminal access where no GUI is available to modify text files, making it indispensable for coding, scripting and general text editing tasks.
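A minimal session sketch (notes.txt is a placeholder file name):
vim notes.txt      # open or create the file in Vim
# inside Vim: press i to enter Insert mode and type text,
# press Esc to return to Normal mode,
# then type :wq to save and quit (or :q! to quit without saving)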
is the exchange of information between individuals or groups through digital platforms, without physical presence. In the context of conducting research and collaboration, virtual communication includes tools like email, instant messaging, video conferencing (e.g., Zoom, Microsoft Teams) and collaborative software (e.g., Slack, Trello). These tools facilitate real-time and asynchronous communication, enabling researchers to share data, discuss findings and coordinate projects efficiently, regardless of geographic location. Virtual communication is essential for maintaining productivity and fostering collaboration in remote or distributed research teams.
is an isolated workspace created to manage and run specific projects without affecting the system’s global settings. It allows developers to install and use different versions of software packages and dependencies tailored to a project. Example virtual environment software includes virtualenv, venv and conda for managing environments and Python packages. These tools help ensure project consistency and prevent conflicts between dependencies in different projects.
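For instance, a sketch of this workflow with conda (the environment name, Python version and package are examples):
conda create -n my_project python=3.11    # create a named environment
conda activate my_project                 # switch into it
conda install numpy                       # install packages into this environment only
conda deactivate                          # switch back out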
is a software emulation of a physical computer, providing the functionality of a real computer system. It runs on a host system and uses virtualized hardware resources to operate an independent operating system (OS) and applications. Virtual machines are commonly used for testing software, running multiple OS environments on a single physical machine, and enhancing security through isolation. Tools like VMware, VirtualBox and Microsoft Hyper-V are popular for creating and managing virtual machines. In High-Performance Computing (HPC) and cloud environments, virtual machines enable efficient resource utilization, scalability and flexibility by allowing multiple virtual instances to run on shared physical infrastructure.
VISUALIZATION (alt. Data and information visualization)
is a highly influential and semantically rich form of modern communication. In Data Science, interactive graphing and concise infographics make it easy to extract key insights while offering the opportunity for deeper analysis to those interested, which contributes to better knowledge retention.
is a scatter plot that displays statistical significance of the change (y-axis, typically -log10 p-value) versus magnitude of change between two conditions (x-axis, typically log2 fold-change) for large-scale datasets. It is commonly used in bioinformatics and molecular biology to visualize differential expression of genes or proteins between two conditions. For example, in RNA-seq experiments, a volcano plot helps identify genes that are significantly upregulated or downregulated, aiding in the discovery of potential biomarkers or therapeutic targets.
is a system that delivers web content to users over the internet. It processes incoming requests from clients (typically web browsers), retrieves the requested resources, such as HTML pages, images and scripts, and sends them back to the clients. Web servers can host websites, handle web applications and manage data exchange. Data can be retrieved from online web servers using various methods, such as HTTP(S) requests, command-line tools like wget and curl, APIs for structured data exchange and web scraping techniques for extracting information from web pages.
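For example, a single resource can be fetched from a web server with curl (the URL is a placeholder):
curl -o results.csv https://example.org/data/results.csv    # save the response to a local file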
is a command-line utility used for downloading files from the internet. It supports protocols like HTTP, HTTPS and FTP, enabling users to retrieve content from web servers. Wget is especially useful for batch downloads, recursive downloading (to fetch entire websites) and resuming interrupted downloads. Its versatility and robustness make it a popular tool for automating file retrieval tasks, web scraping, and mirroring websites. Wget operates non-interactively, meaning it can run in the background and handle complex download tasks without user intervention, making it ideal for scripts and scheduled tasks.
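A couple of common invocations (URLs are placeholders):
wget -c https://example.org/data/big_archive.tar.gz    # resume a previously interrupted download
wget -r -np -l 2 https://example.org/docs/             # recursive download, two levels deep, without ascending to parent directories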
are interactive elements in graphical user interfaces (GUI) that allow users to interact with applications. In the context of web development, widgets can include buttons, sliders, text boxes and other controls that facilitate user interaction. They are essential for creating user-friendly and dynamic web applications. Libraries such as Dash in Python enable developers to build complex, interactive web applications with custom widgets. Dash, in particular, is designed for data visualization and analytics applications, providing a range of pre-built widgets for creating responsive and interactive user interfaces. Widgets enhance the usability and functionality of web applications by providing intuitive controls for users.
is a widely-used operating system developed by Microsoft, known for its graphical user interface (GUI) that allows users to interact with their computers using visual elements like windows, icons, and menus. It supports a broad range of applications and is popular in both personal and business computing environments. Windows includes features such as a taskbar, Start menu, file explorer and system settings, making it user-friendly and accessible. It also supports multitasking, networking, security features and extensive hardware compatibility.
is a working directory for a project or computational task. It often appears as a variable or instruction that can be assigned a path to a selected location in the file system. That path is then used for all subsequent commands that require a location, such as writing to a file. It is a common variable for workload managers on distributed computing infrastructures. The pathname of the current working directory can be accessed with the pwd command or through the $PWD environment variable.
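For example, in a shell session or job script (the path is a placeholder):
echo "$PWD"                        # print the current working directory, equivalent to pwd
workdir=/project/my_analysis       # store the chosen working directory in a variable
cd "$workdir"                      # subsequent relative paths now resolve from here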
is a system that efficiently allocates and manages computing resources for running various tasks and jobs on HPC clusters. It schedules, monitors and optimizes the execution of workloads, ensuring that resources are used effectively and jobs are completed in a timely manner. Workload managers handle job queuing, prioritization and resource allocation, balancing the load across available nodes to maximize throughput and minimize wait times. Examples of workload managers include SLURM, PBS (Portable Batch System) and LSF (Load Sharing Facility). These tools are crucial for maintaining the performance and efficiency of HPC environments, enabling researchers and engineers to run complex simulations and analyses smoothly.
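As a minimal SLURM sketch (resource values, module name and script name are assumptions that vary between clusters), a batch script submitted with sbatch job.sh might look like:
#!/bin/bash
#SBATCH --job-name=demo_job
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --time=00:30:00
#SBATCH --mem=8G

module load python
python my_analysis.py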
(Yellowdog Updater, Modified) is a package management utility for RPM-based Linux distributions such as Red Hat Enterprise Linux (RHEL), CentOS and Fedora, common on HPC systems. YUM simplifies the process of installing, updating and removing software packages, resolving dependencies automatically to ensure that all required packages are installed. It retrieves packages from configured repositories and can manage package groups for easier software deployment. The command yum list all provides a comprehensive list of all available packages in the configured repositories, including those that are installed and those that are available for installation. This command is useful for users who need to see the full range of software options and versions available on their system.
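Typical commands (hdf5 is just an example package; installing or removing system packages requires administrative rights, which regular users usually do not have on shared HPC systems):
yum list all            # list installed and available packages
yum search hdf5         # search the repositories for a package
sudo yum install hdf5   # install a package
sudo yum remove hdf5    # remove it again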
is the process of compressing one or more files into a single archive file, typically with a .zip extension. This reduces the total file size, making it easier to store and transfer. Zipping files preserves the directory structure and can also include additional metadata. Common tools for zipping files include software like WinZip, 7-Zip, and the built-in zip utilities in operating systems like Windows, macOS and Linux. In addition to saving space, zipping files can also facilitate the packaging of multiple files into a single, manageable unit for easier distribution and organization.
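For example (archive and directory names are placeholders):
zip -r project_archive.zip project/    # compress a whole directory, preserving its structure
unzip -l project_archive.zip           # list the archive's contents without extracting
unzip project_archive.zip              # extract into the current directory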