What will we be learning?
- Introduction to Linux and High Performance Computing (HPC)
- Basic navigation in Linux
- Managing files/folders
- File manipulation and editing
- Wildcards and permissions
- Filtering and searching
- Piping & Redirection & Process Management
- Submitting tasks/jobs using SLURM
- Installing software
What is Linux?
Linux is an open-source operating system that fills the same role as Windows and macOS. Here are some key points about Linux:
- Open Source Nature:
- Linux is developed collaboratively by a global community of contributors. Its source code is freely available, allowing anyone to view, modify, and distribute it.
- Kernel vs. Distributions:
- The Linux kernel is the core component responsible for managing hardware resources.
- Distributions (e.g., Ubuntu, CentOS, Debian) package the kernel with additional software, creating complete operating systems.
- Application of Linux in daily life:
- Linux powers servers, supercomputers, embedded devices, and even Android smartphones. Many web servers, cloud services, and scientific research clusters run on Linux.
- Graphical User Interface (GUI) vs Command Line Interface (CLI)
- Linux can be used through two kinds of interface: a command line interface (CLI) and a graphical user interface (GUI).
- Example of CLI
- Why are we learning CLI?
- Performance and Efficiency
- Command-line operations are often faster because they execute commands directly, without the overhead of a graphical interface.
- CLI tools consume fewer system resources (e.g. CPU, memory) compared to resource-intensive GUI applications.
- The CLI allows fine-grained control and customization, making it ideal for developers and system administrators.
- Stability and Consistency
- Core CLI commands remain largely consistent across different systems and distributions (e.g., Ubuntu, Fedora, Arch Linux), although package management and some system tools differ.
- Software Development and Programming
- Linux provides native support for popular languages like Python, C/C++, Java, Perl, Ruby, and more.
- Developers find a rich ecosystem of libraries and tools for programming purposes.
- Bash, the default shell on most Linux distributions, is powerful and versatile.
- Windows’ command line (Command Prompt, PowerShell) uses a different syntax. macOS has used zsh as its default shell since macOS Catalina; zsh syntax is largely compatible with Bash, and Bash remains available.
- Cost and Licensing
- Linux distributions are free and open source.
- Windows requires a paid license, and macOS is licensed only for Apple hardware.
- Linux supports a wide range of open-source software.
- Software/Hardware Compatibility
- Runs well on older hardware
- How does learning the Linux CLI help me in bioinformatics?
- Learning the Linux Command Line Interface (CLI) is immensely beneficial for anyone working in the field of bioinformatics.
- Linux CLI provides powerful tools for managing and analyzing biological data files.
- You can efficiently manipulate text files, perform data extraction, and process large datasets using commands like grep, sed, and awk.
- Bioinformatics often involves handling diverse data files (FASTA, SAM/BAM, VCF, etc.). Navigating directories, creating folders, and organizing data become second nature with CLI skills.
- Writing scripts allows you to automate repetitive tasks. Bioinformatics workflows benefit from scripted processes, ensuring consistency and reproducibility.
- Many bioinformatics tools are command-line based (e.g., minimap2, freebayes, samtools, bwa). Learning the CLI enables you to use these tools effectively.
- When things go wrong (and they will!), CLI skills help you diagnose issues, for example by reading error messages and log files.
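The points above about grep, sed, awk, folder organisation, and scripting can be illustrated with a small sketch. It is only an example under stated assumptions: the file names (sampleA.fasta, sampleB.fasta), the sequence names, and the sequences themselves are made up, and the awk step assumes each sequence sits on a single line, which is not true of every FASTA file.

```shell
# Minimal sketch: all file names and sequences below are hypothetical.
set -eu

workdir=$(mktemp -d)          # scratch folder so no real data is touched
mkdir -p "$workdir/fasta"     # create a folder to organise the data
cd "$workdir/fasta"

# Write two tiny example FASTA files.
printf '>seq1\nACGTACGT\n>seq2\nGGCC\n' > sampleA.fasta
printf '>seq1\nTTTT\n' > sampleB.fasta

# grep: count the sequences in a file (every header line starts with ">").
n_seqs=$(grep -c '^>' sampleA.fasta)

# awk: print each sequence name and its length (assumes one-line sequences).
lengths=$(awk '/^>/ {name = substr($0, 2); next} {print name "\t" length($0)}' sampleA.fasta)

# sed: prefix every header with the sample name, writing to a new file.
sed 's/^>/>sampleA_/' sampleA.fasta > sampleA.renamed.fasta
first_header=$(head -n 1 sampleA.renamed.fasta)

# Automation: repeat the same count for every matching file.
# The wildcard "?" matches exactly one character, so the renamed file is skipped.
summary=$(for f in sample?.fasta; do
    printf '%s\t%s\n' "$f" "$(grep -c '^>' "$f")"
done)

echo "Sequences in sampleA.fasta: $n_seqs"
echo "$lengths"
echo "First renamed header: $first_header"
echo "$summary"
```

The loop at the end is the part that scales: the same few lines would work for two files or two thousand, which is why scripted steps tend to be more consistent and reproducible than repeating commands by hand.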
What is High Performance Computing (HPC)?
- HPC refers to the practice of combining computing power to deliver far greater performance than a typical desktop or workstation.
- It involves using clusters of powerful processors that work in parallel to process massive multi-dimensional data sets (often referred to as “big data”) and solve complex problems at extremely high speeds.
- There are several HPC systems in Qatar, including physical (on-site) clusters at Sidra Medicine, QCRI, HBKU, and HMC.
- At Qatar University, since summer 2022, we have had access to Microsoft Azure HPC, which runs on a comprehensive cloud computing platform. Cloud computing means the servers/machines are not located on-site and are accessed over the internet.
- Why can’t I just use my laptop or workstation at my desk?
While it’s possible to perform some bioinformatics analyses on a local machine, there are limitations.
- Laptops/workstations have limited computational resources (CPU, memory, storage). Complex analyses involving large datasets may be slow or impractical. Some bioinformatics tasks (e.g., genome assembly, large-scale alignment) require substantial computational resources. For example, with insufficient RAM a job may slow down drastically or fail outright.
- Many bioinformatics tools are designed for parallel or multi-threaded processing. HPC clusters or cloud platforms like Azure offer far more cores per job than an individual machine.
- Large-scale bioinformatics projects generate massive amounts of data. Cloud platforms like Azure offer scalable storage solutions, whereas local machines may have limited disk space.
- Collaborative research often involves sharing data and analyses. Cloud-based solutions allow seamless collaboration and scalability across teams.
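To see the limits described above on a given machine, the CPU count, memory, and free disk space can be queried from the command line. The following is a minimal sketch, not a definitive recipe: it assumes a Linux system where /proc/meminfo is readable (it reports "unknown" otherwise), and the variable names are chosen for illustration only.

```shell
# Minimal sketch: report the resources of the machine you are logged in to.
set -eu

# Number of CPU cores (nproc on Linux; getconf as a fallback on other systems).
cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)

# Total memory in kB, read from /proc/meminfo (Linux only).
if [ -r /proc/meminfo ]; then
    mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
else
    mem_kb=unknown
fi

# Free space on the filesystem holding the current directory.
# -P keeps each filesystem on one line; column 4 is the available space.
disk_free=$(df -Ph . | awk 'NR == 2 {print $4}')

echo "CPU cores:   $cores"
echo "Memory (kB): $mem_kb"
echo "Free disk:   $disk_free"

# Many tools accept a thread count. Shown as a comment because the tool may
# not be installed and reads.bam is a hypothetical file:
#   samtools sort -@ "$cores" -o sorted.bam reads.bam
```

Running the same commands on a laptop and on an HPC node gives a quick, concrete comparison. On a shared system, request only the cores you need; do not assume every core on the node is yours.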