Pre-Lab 2 - Bioinformatics Environment, Workflow, and Reproducibility
This pre-lab has two parts:
- Understanding the bioinformatics computing environment (why Linux/macOS are common, and what alternatives exist).
- Building good habits for workflows: command-line basics, directory organization, and reproducibility.
Part A - Bioinformatics environments
Why Linux/macOS are favored in bioinformatics
Most bioinformatics and structural biology tools are developed and deployed primarily on Linux (and often also work well on macOS). Common reasons include:
- Stability and long-running job support (servers/HPC)
- Strong package managers and dependency handling
- Powerful shell scripting for automation
- Easier remote work (SSH, clusters)
A few practical differences:
| Feature | Windows | Linux/macOS |
|---|---|---|
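| Case sensitivity (file names) | No (by default) | Yes on Linux; no by default on macOS |
| Shell scripting | PowerShell and batch files; most bioinformatics scripts target bash | Native POSIX shells (bash, zsh) |
| Package managers | Fewer bioinformatics packages available natively | Strong (e.g., apt, Homebrew, conda) |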
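Case sensitivity is a property of the file system rather than of the shell, so it can be checked directly. The snippet below is a minimal sketch: it creates a file with a lowercase name and asks whether the uppercase name resolves to it. The variable names `probe` and `fs_case` are made up for this example, and the result only describes the file system that holds the temporary directory.

```shell
# Create a scratch directory and a file with a lowercase name.
probe=$(mktemp -d)
touch "$probe/sample.txt"

# If the uppercase name also resolves, the file system ignores case.
if [ -e "$probe/SAMPLE.TXT" ]; then
  fs_case="case-insensitive"
else
  fs_case="case-sensitive"
fi
echo "This file system is $fs_case"

# Clean up the scratch directory.
rm -r "$probe"
```

This matters in practice because a script that refers to Sample.fasta and sample.fasta interchangeably may work on one machine and fail on another.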
GUI vs CLI (Graphical vs Command Line)
Bioinformatics software is often CLI-first because:
- CLI tools are easier to automate and combine into pipelines
- Servers and clusters typically have no graphical desktop
- CLI tools often expose more options and reproducible configuration

A good mental model:
- GUI is convenient for exploration.
- CLI is better for automation, logging parameters, and reproducibility.
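As a minimal sketch of how CLI tools combine into a pipeline, the example below chains five standard utilities to count ATOM records per chain. The file demo.pdb and its four truncated records are invented for illustration; real PDB files carry coordinates and further columns, but the chain ID sits in column 22 as it does here.

```shell
# Work in a throwaway directory so nothing in your workspace is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical, truncated PDB-style records (chain ID is in column 22).
cat > demo.pdb <<'EOF'
ATOM      1  N   MET A   1
ATOM      2  CA  MET A   1
ATOM      3  N   GLY B   1
HETATM    4  O   HOH A 101
EOF

# grep keeps ATOM records, cut takes column 22, sort groups equal chain IDs,
# uniq -c counts each group, and awk rewrites each line as "chain=count".
chain_counts=$(grep '^ATOM' demo.pdb | cut -c22 | sort | uniq -c | awk '{print $2 "=" $1}')
echo "$chain_counts"
```

For this sample the pipeline prints A=2 and B=1 on separate lines; the HETATM record is excluded because it does not start with ATOM. Because the whole analysis is one line of text, it can be saved in a script and rerun unchanged.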
Software environment options
You may run bioinformatics tools using:
- Native installation (Linux/macOS): best compatibility
- WSL (Windows Subsystem for Linux): a Linux environment integrated into Windows; a good balance for Windows users
- Virtual machine (VM): full Linux environment, but heavier on memory and disk
- Remote servers (SSH): common for heavy computations
Key idea: tools have dependencies (libraries, runtimes, databases). Choosing an environment is mostly about reducing dependency friction and improving reproducibility.

Part B - Workflow and reproducibility
Why workflows matter
A workflow is a repeatable sequence of steps (often across multiple tools) to go from input data to results.
A reproducible workflow usually depends on:
- Consistent directory structure
- Clear file naming
- Logging parameters and software versions
- Version control when possible (e.g., Git)
- A short but informative README.md
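Logging parameters and software versions can be as simple as writing a few lines to a text file next to the results. The sketch below assumes a hypothetical log file name (run.log) and hypothetical parameters (input_file, pattern); it only records what would be run and does not execute the analysis itself.

```shell
# Work in a scratch directory; in a real project this would be your results folder.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical analysis parameters for illustration.
input_file="5IKR.pdb"
pattern="^ATOM"
log_file="run.log"

# Record when, where, and with which tool version the step is run.
{
  echo "date: $(date -u '+%Y-%m-%dT%H:%M:%SZ')"
  echo "system: $(uname -sr)"
  echo "grep: $(grep --version 2>&1 | head -n 1)"
  echo "input: $input_file"
  echo "pattern: $pattern"
  echo "command: grep -c '$pattern' $input_file"
} > "$log_file"

cat "$log_file"
```

A log like this answers the two questions that usually come up months later: which settings were used, and which version of the tool produced the result.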
Example directory structure
A recommended structure for your work in this course (especially the term project):
GENE433/
├── term-project/
│ ├── data/
│ ├── sequences/
│ ├── structures/
│ ├── scripts/
│ ├── results/
│ └── README.md
├── labs/
│ ├── lab1/
│ ├── lab2/
│ └── ...
├── lectures/
└── README.md
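The tree above can be created in one step, since mkdir -p builds any missing parent directories. This is a sketch that writes into a scratch location held in the made-up variable base; point base at your own folder to create the real workspace.

```shell
# Scratch location for the demonstration; replace with your own directory.
base=$(mktemp -d)

# mkdir -p creates each directory together with any missing parents.
mkdir -p \
  "$base/GENE433/term-project/data" \
  "$base/GENE433/term-project/sequences" \
  "$base/GENE433/term-project/structures" \
  "$base/GENE433/term-project/scripts" \
  "$base/GENE433/term-project/results" \
  "$base/GENE433/labs/lab1" \
  "$base/GENE433/labs/lab2" \
  "$base/GENE433/lectures"

# Create empty README files to fill in later.
touch "$base/GENE433/README.md" "$base/GENE433/term-project/README.md"

# List everything that was created.
find "$base/GENE433" | sort
```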
Basic shell commands (showcase)
These commands are not a formal prerequisite, but you are expected to become comfortable with them over time.
- pwd: print current directory
- ls: list files
- cd: change directory
- mkdir: create directory
- cp: copy files
- mv: move/rename files
- rm: remove files
- cat: print file contents
- less: view file contents page-by-page
- grep: search text patterns
- wc: count lines/words/bytes
Note
rm deletes files permanently; there is no recycle bin to restore them from. Always double-check what you are deleting.
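A short session that exercises most of these commands is sketched below. It runs entirely inside a scratch directory, and the file names (draft.txt, backup.txt, final.txt) are made up for the example.

```shell
# Start in a scratch directory so no real files are affected.
scratch=$(mktemp -d)
cd "$scratch" || exit 1
pwd                                    # show where we are

mkdir notes                            # create a directory
printf 'first line\nsecond line\n' > notes/draft.txt

cp notes/draft.txt notes/backup.txt    # copy a file
mv notes/draft.txt notes/final.txt     # rename a file
ls notes                               # list the directory

cat notes/final.txt                    # print the file contents
line_count=$(wc -l < notes/final.txt)  # count lines
echo "lines:" $line_count
grep 'second' notes/final.txt          # search for a text pattern

rm notes/backup.txt                    # permanent deletion
ls notes
```

In interactive use, rm -i asks for confirmation before each deletion, which is a reasonable habit while you are still learning.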
Demonstration: building a small, reproducible workspace
Warning
This is just a demonstration. You are not responsible for learning anything given in this section.
In Linux/macOS (bash)
- Open a terminal.
- Navigate to the directory where you want to create your course workspace.
  - Option A: use cd from the terminal.
  - Option B: in the file manager, navigate to the destination folder, right-click, and choose “Open in Terminal” (if available).
- Create directories.
Option 1 (step-by-step):
mkdir GENE433
cd GENE433
mkdir labs
cd labs
mkdir pre-lab2
cd pre-lab2
Option 2 (one command):
mkdir -p GENE433/labs/pre-lab2
cd GENE433/labs/pre-lab2
- Download a structure file (example). macOS does not ship wget by default; curl -O with the same URL is an equivalent there:
wget https://files.rcsb.org/download/5IKR.pdb
- Count how many ATOM records the file contains:
grep -c '^ATOM' 5IKR.pdb
Alternative (same count, obtained by piping the matching lines to wc -l):
grep '^ATOM' 5IKR.pdb | wc -l
- Extract FASTA headers (example). Suppose you have sequences.fasta:
grep '^>' sequences.fasta
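The header extraction can be tried without any real data. The sketch below writes a tiny two-record sequences.fasta (the IDs, descriptions, and sequences are invented), counts the records, and reduces each header to its ID.

```shell
# Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical two-record FASTA file for illustration.
cat > sequences.fasta <<'EOF'
>seq1 hypothetical protein A
MKTAYIAKQR
>seq2 hypothetical protein B
GSHMLEDPVA
EOF

# Header lines start with ">", so counting them counts the sequences.
seq_count=$(grep -c '^>' sequences.fasta)
echo "sequences: $seq_count"

# Drop the leading ">" and keep only the first word (the sequence ID).
seq_ids=$(grep '^>' sequences.fasta | cut -c2- | cut -d' ' -f1)
echo "$seq_ids"
```

Here the count is 2 and the IDs are seq1 and seq2. The same two commands work on a real FASTA file of any size.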
In Windows (PowerShell)
- Open PowerShell.
- Go to the directory where you want to create your workspace.
Option 1 (step-by-step):
mkdir GENE433
cd GENE433
mkdir labs
cd labs
mkdir pre-lab2
cd pre-lab2
Option 2 (one command; PowerShell's mkdir creates missing parent directories):
mkdir GENE433\labs\pre-lab2
cd GENE433\labs\pre-lab2
- Download a structure file:
Invoke-WebRequest -Uri https://files.rcsb.org/download/5IKR.pdb -OutFile 5IKR.pdb
- Count ATOM lines:
Select-String -Path 5IKR.pdb -Pattern '^ATOM' -CaseSensitive | Measure-Object | Select-Object -ExpandProperty Count
- Extract FASTA headers (example):
Select-String -Path sequences.fasta -Pattern '^>'
Transition to core laboratory modules
After completing this preliminary module, you should be prepared to:
- Navigate biological databases
- Interpret molecular structure files
- Use molecular visualization tools effectively
- Understand the computational basis of structure prediction and docking