Pre-Lab 2 - Bioinformatics Environment, Workflow, and Reproducibility

This pre-lab has two parts:

Part A: Understanding the bioinformatics computing environment (why Linux and macOS are common, and what alternatives exist).
Part B: Building good workflow habits: command-line basics, directory organization, and reproducibility.

Part A - Bioinformatics environments

Why Linux/macOS are favored in bioinformatics

Most bioinformatics and structural biology tools are developed and deployed primarily on Linux (and often also work well on macOS). Common reasons include:

Stability and long-running job support (servers/HPC)
Strong package managers and dependency handling
Powerful shell scripting for automation
Easier remote work (SSH, clusters)

A few practical differences:

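Feature	Windows	Linux/macOS
Case-sensitive file names	No (by default)	Linux: yes; macOS: no (by default)
Shell scripting	PowerShell and cmd (not POSIX-compatible)	Native POSIX shells (bash, zsh)
Package managers	winget, Chocolatey (fewer bioinformatics packages)	apt, dnf, Homebrew, conda/Bioconda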
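
To see how your own system treats file-name case, the following minimal sketch creates a file in a scratch directory and then asks for the same name in lower case. The file name is made up for illustration. The result depends on the file system rather than on the shell: a typical Linux system reports case-sensitive, while a default macOS volume reports case-insensitive.

```shell
# Create a scratch directory so nothing in your real workspace is touched.
workdir=$(mktemp -d)

# Create a file with a capitalised name (hypothetical example name).
touch "$workdir/Sample.fasta"

# Ask for the same name in lower case.
if [ -e "$workdir/sample.fasta" ]; then
    fs_result="case-insensitive"
else
    fs_result="case-sensitive"
fi
echo "This file system is $fs_result"

# Clean up the scratch directory.
rm -r "$workdir"
```

This matters in practice because a script that refers to sample.fasta may work on one machine and fail on another if the file is actually named Sample.fasta.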

GUI vs CLI (Graphical vs Command Line)

Bioinformatics software is often CLI-first because:

CLI tools are easier to automate and combine into pipelines
Servers and clusters typically have no graphical desktop
CLI tools often expose more options and reproducible configuration
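
As a small illustration of the automation point, the sketch below runs the same counting step on every structure file in a directory and collects the results in one summary file. The two input files are tiny made-up stand-ins, not real structures, and the file names are assumptions chosen for the example.

```shell
# Scratch directory with two tiny stand-in structure files (made-up content).
workdir=$(mktemp -d)
printf 'ATOM      1\nATOM      2\nHETATM    3\n' > "$workdir/model_a.pdb"
printf 'ATOM      1\nEND\n' > "$workdir/model_b.pdb"

# Run the same counting step on every .pdb file and collect the results
# in one tab-separated summary file.
for pdb in "$workdir"/*.pdb; do
    count=$(grep -c '^ATOM' "$pdb")
    printf '%s\t%s\n' "$(basename "$pdb")" "$count"
done > "$workdir/atom_counts.tsv"

cat "$workdir/atom_counts.tsv"
```

The same loop works unchanged for two files or two thousand, which is hard to match by clicking through a graphical interface.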

[Figure: GUI vs CLI]

A good mental model:

GUI is convenient for exploration.
CLI is better for automation, logging parameters, and reproducibility.

Software environment options

You may run bioinformatics tools using:

Native installation (Linux/macOS): best compatibility
WSL (Windows Subsystem for Linux): good balance for Windows users
Virtual machine (VM): full Linux environment, heavier
Remote servers (SSH): common for heavy computations

Key idea: tools have dependencies (libraries, runtimes, databases). Choosing an environment is mostly about reducing dependency friction and improving reproducibility.
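
One simple way to spot dependency problems early is to check that the required programs are on your PATH before starting an analysis. The sketch below is one possible approach; the tool list is an assumption based on the commands used in this pre-lab, so adjust it to your own workflow.

```shell
# Tools used later in this pre-lab; edit the list to match your own workflow.
required_tools="grep wc curl"

missing_tools=""
for tool in $required_tools; do
    # command -v prints the path of a program if the shell can find it.
    if command -v "$tool" >/dev/null 2>&1; then
        echo "found:   $tool ($(command -v "$tool"))"
    else
        echo "missing: $tool"
        missing_tools="$missing_tools $tool"
    fi
done

if [ -n "$missing_tools" ]; then
    echo "Install before continuing:$missing_tools"
fi
```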

[Figure: Software environments]

Part B - Workflow and reproducibility

Why workflows matter

A workflow is a repeatable sequence of steps (often across multiple tools) to go from input data to results.

A reproducible workflow usually depends on:

Consistent directory structure
Clear file naming
Logging parameters and software versions
Version control when possible (e.g., Git)
A short but informative README.md
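
Logging parameters and software versions can be as simple as writing a few lines to a text file next to your results. The following is a minimal sketch, not a required format: the parameter names, the log layout, and the log location are all assumptions made for illustration, and it assumes your grep accepts --version (GNU and BSD grep do).

```shell
# Hypothetical analysis parameters; replace with the ones your step uses.
input_file="5IKR.pdb"
pattern="^ATOM"

logdir=$(mktemp -d)          # in a real project this would be results/
logfile="$logdir/run.log"

# Record when, where, and how the step was run.
{
    echo "date:    $(date -u '+%Y-%m-%dT%H:%M:%SZ')"
    echo "system:  $(uname -s)"
    echo "grep:    $(grep --version 2>/dev/null | head -n 1)"
    echo "input:   $input_file"
    echo "pattern: $pattern"
    echo "command: grep -c '$pattern' $input_file"
} > "$logfile"

cat "$logfile"
```

With a log like this, you or a teammate can rerun the step later with the same input and settings and compare the result.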

Example directory structure

A recommended structure for your work in this course (especially the term project):

GENE433/
├── term-project/
│   ├── data/
│   ├── sequences/
│   ├── structures/
│   ├── scripts/
│   ├── results/
│   └── README.md
├── labs/
│   ├── lab1/
│   ├── lab2/
│   └── ...
├── lectures/
└── README.md
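
The tree above can be created with a few commands. This sketch builds it under a scratch directory so it is safe to try; the variable name base and the README title lines are assumptions, and in practice you would set base to the folder where you keep course work.

```shell
# Build the tree under a scratch directory; replace $base with the folder
# where you keep course work (for example, your home directory).
base=$(mktemp -d)

# mkdir -p creates missing parent directories and does not fail if a
# directory already exists.
for dir in data sequences structures scripts results; do
    mkdir -p "$base/GENE433/term-project/$dir"
done
mkdir -p "$base/GENE433/labs/lab1" "$base/GENE433/labs/lab2" "$base/GENE433/lectures"

# Start each README with a title line so the file is not empty.
echo "GENE433 course workspace" > "$base/GENE433/README.md"
echo "Term project" > "$base/GENE433/term-project/README.md"

# List every directory that was created.
find "$base/GENE433" -type d | sort
```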

Basic shell commands (showcase)

These commands are not a formal prerequisite, but you are expected to become comfortable with them over time.

pwd: print current directory
ls: list files
cd: change directory
mkdir: create directory
cp: copy files
mv: move/rename files
rm: remove files
cat: print file contents
less: view file contents page-by-page
grep: search text patterns
wc: count lines/words/bytes
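
A short practice session that uses most of these commands might look like the sketch below. It works in a scratch directory with made-up file names, so it cannot affect your real files. less is left out because it is interactive; try it yourself with less notes/list.txt and press q to quit.

```shell
# Work in a scratch directory so the demo cannot touch real files.
demo=$(mktemp -d)
cd "$demo"
pwd                                  # print the current directory

mkdir notes                          # create a directory
printf 'alpha\nbeta\ngamma\n' > notes/list.txt
ls notes                             # list its contents
cat notes/list.txt                   # print the file

cp notes/list.txt notes/copy.txt     # copy a file
mv notes/copy.txt notes/backup.txt   # rename the copy
grep 'beta' notes/list.txt           # print lines containing "beta"
wc -l notes/list.txt                 # count lines
rm notes/backup.txt                  # remove the backup

cd - >/dev/null                      # return to where you started
```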

Note

rm deletes files permanently; there is no trash folder and no undo. Always double-check what you are deleting.
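
One habit that helps is to preview a pattern with ls before reusing it with rm. The sketch below shows the idea with made-up file names in a scratch directory. For interactive use, rm -i is another option: it asks for confirmation before each deletion.

```shell
# Scratch directory with one file to keep and two temporary files.
scratch=$(mktemp -d)
touch "$scratch/keep.fasta" "$scratch/old1.tmp" "$scratch/old2.tmp"

# 1. Preview exactly what the pattern matches.
ls "$scratch"/*.tmp

# 2. Reuse the same pattern with rm only after checking the preview.
rm "$scratch"/*.tmp

# 3. Confirm that only the intended files are gone.
ls "$scratch"
```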

Demonstration: building a small, reproducible workspace

Warning

This is only a demonstration. You are not required to learn the commands in this section.

In Linux/macOS (bash)

Open a terminal.
Navigate to the directory where you want to create your course workspace.
Option A: use cd from the terminal.
Option B: in the file manager, navigate to the destination folder, right-click, and choose “Open in Terminal” (if available).
Create directories.

Option 1 (step-by-step):

mkdir GENE433
cd GENE433
mkdir labs
cd labs
mkdir pre-lab2
cd pre-lab2

Option 2 (one command):

mkdir -p GENE433/labs/pre-lab2
cd GENE433/labs/pre-lab2

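Download a structure file (example). Run one of the two commands below; wget is common on Linux, while macOS ships curl but not wget by default:

wget https://files.rcsb.org/download/5IKR.pdb
curl -O https://files.rcsb.org/download/5IKR.pdb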

Count how many ATOM records the file contains:

grep -c '^ATOM' 5IKR.pdb

Alternative (pipes the matching lines to wc -l, which gives the same count):

grep '^ATOM' 5IKR.pdb | wc -l
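
The same pipeline idea extends to slightly richer questions, such as how many ATOM records belong to each chain. In the PDB format the chain identifier is stored in column 22 of each ATOM line. The sketch below uses a toy file with made-up, simplified records (only the first 26 columns are written), so the numbers are illustrative only; on a real file you would replace $pdb with 5IKR.pdb.

```shell
workdir=$(mktemp -d)
pdb="$workdir/toy.pdb"

# Write a toy file with simplified records (made-up atoms, only the first
# 26 fixed-width columns). printf pads each field to its column width.
{
    printf '%-6s%5d %-4s %3s %s%4d\n' ATOM   1 N  MET A 1
    printf '%-6s%5d %-4s %3s %s%4d\n' ATOM   2 CA MET A 1
    printf '%-6s%5d %-4s %3s %s%4d\n' ATOM   3 N  GLY B 1
    printf '%-6s%5d %-4s %3s %s%4d\n' HETATM 4 O  HOH A 101
} > "$pdb"

# Keep ATOM lines, cut out column 22, then count each distinct chain ID.
chain_counts=$(grep '^ATOM' "$pdb" | cut -c22 | sort | uniq -c)
echo "$chain_counts"
```

Each line of the output shows a count followed by a chain ID. The HETATM line is ignored because the pattern only matches lines that start with ATOM.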

Extract FASTA headers (example). Suppose you have sequences.fasta:

grep '^>' sequences.fasta
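
Two common follow-ups are counting the sequences and pulling out just the identifiers. The sketch below does both on a toy FASTA file whose identifiers and sequences are made up for illustration. It assumes the identifier is the text between the leading symbol and the first space of each header.

```shell
workdir=$(mktemp -d)
fasta="$workdir/sequences.fasta"

# Toy FASTA file with made-up identifiers and sequences.
cat > "$fasta" <<'EOF'
>seq1 hypothetical protein one
MKTAYIAKQR
>seq2 hypothetical protein two
GSHMLEDPVA
QLLKRT
EOF

# Number of sequences = number of header lines.
n_seqs=$(grep -c '^>' "$fasta")
echo "sequences: $n_seqs"

# Identifiers only: drop the leading '>' and everything after the first space.
seq_ids=$(grep '^>' "$fasta" | sed 's/^>//' | cut -d ' ' -f 1)
echo "$seq_ids"
```

Note that counting header lines gives the number of sequences even when a sequence wraps across several lines, as seq2 does here.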

In Windows (PowerShell)

Open PowerShell.
Go to the directory where you want to create your workspace.

Option 1 (step-by-step):

mkdir GENE433
cd GENE433
mkdir labs
cd labs
mkdir pre-lab2
cd pre-lab2

Option 2 (one command; PowerShell's mkdir creates the intermediate directories automatically):

mkdir GENE433\labs\pre-lab2
cd GENE433\labs\pre-lab2

Download a structure file:

Invoke-WebRequest -Uri https://files.rcsb.org/download/5IKR.pdb -OutFile 5IKR.pdb

Count ATOM lines:

Select-String -Path 5IKR.pdb -Pattern '^ATOM' -CaseSensitive | Measure-Object | Select-Object -ExpandProperty Count

Extract FASTA headers (example):

Select-String -Path sequences.fasta -Pattern '^>'

Transition to core laboratory modules

After completing this preliminary module, you should be prepared to:

Navigate biological databases
Interpret molecular structure files
Use molecular visualization tools effectively
Understand the computational basis of structure prediction and docking