Pre-Lab 2 - Bioinformatics Environment, Workflow, and Reproducibility

This pre-lab has two parts:

  1. Understanding the bioinformatics computing environment (why Linux/macOS are common, and what alternatives exist).
  2. Building good habits for workflows: command-line basics, directory organization, and reproducibility.

Part A - Bioinformatics environments

Why Linux/macOS are favored in bioinformatics

Most bioinformatics and structural biology tools are developed and deployed primarily on Linux (and often also work well on macOS). Common reasons include:

  • Stability and long-running job support (servers/HPC)
  • Strong package managers and dependency handling
  • Powerful shell scripting for automation
  • Easier remote work (SSH, clusters)

A few practical differences:

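Feature            Windows                                    Linux/macOS
Case sensitivity   No (by default)                            Linux: yes; macOS: no (by default)
Shell scripting    Limited for POSIX scripts (cmd/PowerShell) Native (bash, zsh)
Package managers   Weaker (e.g., winget, Chocolatey)          Strong (e.g., apt, dnf, Homebrew)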
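
The case-sensitivity row can be checked directly on your own machine. The following is a minimal sketch: the file name Sample.fasta is made up for illustration, and the answer depends on the filesystem in use rather than on the operating system alone.

```shell
# Create a scratch directory so nothing in your own folders is touched.
tmpdir=$(mktemp -d)

# Create a file whose name starts with an uppercase letter.
touch "$tmpdir/Sample.fasta"

# Ask for the same name in lowercase; only a case-insensitive
# filesystem will report that it exists.
if [ -e "$tmpdir/sample.fasta" ]; then
    fs_kind="case-insensitive"
else
    fs_kind="case-sensitive"
fi
echo "This filesystem is $fs_kind"

# Clean up the scratch directory.
rm -r "$tmpdir"
```

On a case-sensitive filesystem, Sample.fasta and sample.fasta are two different files, which is a common source of "file not found" errors when scripts move between machines.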

GUI vs CLI (Graphical vs Command Line)

Bioinformatics software is often CLI-first because:

  • CLI tools are easier to automate and combine into pipelines
  • Servers and clusters typically have no graphical desktop
  • CLI tools often expose more options and reproducible configuration

A good mental model:

  • GUI is convenient for exploration.
  • CLI is better for automation, logging parameters, and reproducibility.
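
To make the automation point concrete, here is a small sketch of a loop that applies the same command to every file in a folder and saves the result. The file names (sample1.fasta, sample2.fasta, record_counts.tsv) and the sequences are invented for the example.

```shell
# Work in a scratch directory with two tiny, made-up FASTA files.
demo_dir=$(mktemp -d)
printf '>seqA\nACGT\n>seqB\nTTGA\n' > "$demo_dir/sample1.fasta"
printf '>seqC\nGGCC\n' > "$demo_dir/sample2.fasta"

# Count the records (header lines starting with ">") in every FASTA file
# and write one "file name, count" line per file to a tab-separated summary.
for fasta in "$demo_dir"/*.fasta; do
    n_records=$(grep -c '^>' "$fasta")
    printf '%s\t%s\n' "$(basename "$fasta")" "$n_records"
done > "$demo_dir/record_counts.tsv"

# Show the summary.
cat "$demo_dir/record_counts.tsv"
```

Doing the same by hand in a GUI would mean opening each file separately, and there would be no record of exactly what was done.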

Software environment options

You may run bioinformatics tools using:

  • Native installation (Linux/macOS): best compatibility
  • WSL (Windows Subsystem for Linux): good balance for Windows users
  • Virtual machine (VM): full Linux environment, heavier
  • Remote servers (SSH): common for heavy computations

Key idea: tools have dependencies (libraries, runtimes, databases). Choosing an environment is mostly about reducing dependency friction and improving reproducibility.
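
One simple way to spot dependency problems early is to check which tools are actually available before starting. The sketch below assumes a short, illustrative tool list; replace it with whatever your own workflow needs.

```shell
# Illustrative list of tools; edit it for your own workflow.
required_tools="grep wc wget curl"

missing=""
for tool in $required_tools; do
    # command -v prints the tool's location if it is on the PATH.
    if command -v "$tool" >/dev/null 2>&1; then
        echo "found:   $tool ($(command -v "$tool"))"
    else
        echo "missing: $tool"
        missing="$missing $tool"
    fi
done

# Summarize so the result is easy to paste into a README or log.
if [ -z "$missing" ]; then
    echo "All listed tools are available."
else
    echo "Not installed:$missing"
fi
```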


Part B - Workflow and reproducibility

Why workflows matter

A workflow is a repeatable sequence of steps (often across multiple tools) to go from input data to results.

A reproducible workflow usually depends on:

  • Consistent directory structure
  • Clear file naming
  • Logging parameters and software versions
  • Version control when possible (e.g., Git)
  • A short but informative README.md
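
As a rough illustration of logging parameters and software versions, the sketch below writes a few facts about a run to a text file. The log file name run_log.txt and the choice of fields are assumptions for the example, not a course requirement.

```shell
# Hypothetical log file name; any consistent name works.
log_file="run_log.txt"

{
    echo "Date: $(date)"
    echo "System: $(uname -sr)"
    echo "Shell: $(bash --version | head -n 1)"
    echo "grep: $(grep --version | head -n 1)"
    # Record the exact command and parameters used for this step.
    echo "Command: grep -c '^ATOM' 5IKR.pdb"
} > "$log_file"

# Show what was recorded.
cat "$log_file"
```

Keeping such a file next to the results makes it possible to tell later which tool versions and options produced them.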

Example directory structure

A recommended structure for your work in this course (especially the term project):

GENE433/
├── term-project/
│   ├── data/
│   ├── sequences/
│   ├── structures/
│   ├── scripts/
│   ├── results/
│   └── README.md
├── labs/
│   ├── lab1/
│   ├── lab2/
│   └── ...
├── lectures/
└── README.md
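
The structure above can be created from the terminal in a few commands. This is one possible sketch; it builds the folders inside the current directory and creates the lab1 and lab2 folders only.

```shell
# Create the term-project folders; -p also creates missing parent
# directories and does not complain if a directory already exists.
mkdir -p GENE433/term-project/data \
         GENE433/term-project/sequences \
         GENE433/term-project/structures \
         GENE433/term-project/scripts \
         GENE433/term-project/results

# Create the lab and lecture folders.
mkdir -p GENE433/labs/lab1 GENE433/labs/lab2 GENE433/lectures

# Create empty README files to fill in later.
touch GENE433/README.md GENE433/term-project/README.md

# List everything that was created.
find GENE433 | sort
```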

Basic shell commands (showcase)

These commands are not a formal prerequisite, but you are expected to become comfortable with them over time.

  • pwd: print current directory
  • ls: list files
  • cd: change directory
  • mkdir: create directory
  • cp: copy files
  • mv: move/rename files
  • rm: remove files
  • cat: print file contents
  • less: view file contents page-by-page
  • grep: search text patterns
  • wc: count lines/words/bytes
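
A short practice session using most of the commands above might look like this. It is only a sketch: it runs in a scratch directory, the FASTA content is made up, and less is left out because it is interactive (try less example.fasta yourself and press q to quit).

```shell
# Start in a scratch directory so the demo cannot touch real data.
scratch=$(mktemp -d)
cd "$scratch"
pwd                                   # print current directory

mkdir demo                            # create a directory
cd demo                               # move into it

# Make a tiny, made-up FASTA file to practice on.
printf '>seq1\nACGT\n>seq2\nGGCC\n' > example.fasta

ls                                    # list files
cat example.fasta                     # print file contents
cp example.fasta copy.fasta           # copy a file
mv copy.fasta renamed.fasta           # rename (move) the copy
grep '^>' example.fasta               # show only the header lines
wc -l example.fasta                   # count lines
rm renamed.fasta                      # remove the copy (no undo)
ls                                    # only example.fasta remains
```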

Note

rm deletes files permanently: there is no trash folder and no undo. Always double-check what you are deleting.
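
One habit that helps is to preview what a pattern matches with ls before handing the same pattern to rm. The sketch below uses made-up file names in a scratch directory, so it is safe to run.

```shell
# Scratch directory with made-up files.
scratch=$(mktemp -d)
touch "$scratch/keep.fasta" "$scratch/old1.tmp" "$scratch/old2.tmp"

# Step 1: preview exactly which files the pattern matches.
ls "$scratch"/*.tmp

# Step 2: only after checking the list, delete with the same pattern.
rm "$scratch"/*.tmp

# keep.fasta is untouched because it never matched the pattern.
ls "$scratch"
```

When working interactively, rm -i is another option: it asks for confirmation before each deletion.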


Demonstration: building a small, reproducible workspace

Warning

This is just a demonstration. You are not responsible for learning anything given in this section.

In Linux/macOS (bash)

  1. Open a terminal.
  2. Navigate to the directory where you want to create your course workspace.

     • Option A: use cd from the terminal.
     • Option B: in the file manager, navigate to the destination folder, right-click, and choose “Open in Terminal” (if available).

  3. Create directories.

Option 1 (step-by-step):

mkdir GENE433
cd GENE433
mkdir labs
cd labs
mkdir pre-lab2
cd pre-lab2

Option 2 (a single mkdir call):

mkdir -p GENE433/labs/pre-lab2
cd GENE433/labs/pre-lab2

  4. Download a structure file (example). On macOS, wget is not installed by default; curl -O with the same URL does the same job:

wget https://files.rcsb.org/download/5IKR.pdb

  5. Count how many ATOM records the file contains:

grep -c '^ATOM' 5IKR.pdb

Alternative (pipe the matching lines to wc -l; the count is the same):

grep '^ATOM' 5IKR.pdb | wc -l

  6. Extract FASTA headers (example). Suppose you have sequences.fasta:

grep '^>' sequences.fasta
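
If you do not have a FASTA file at hand, a small practice file is enough to try the header extraction. In the sketch below the record names and sequences are invented, and the file is written to a scratch directory so it cannot overwrite a real sequences.fasta.

```shell
# Scratch directory for the practice file.
practice_dir=$(mktemp -d)

# Create a small practice file; the three records below are invented.
printf '>protein_1 example record\nMKTAYIAKQR\n' >  "$practice_dir/sequences.fasta"
printf '>protein_2 example record\nGSHMLEDPVA\n' >> "$practice_dir/sequences.fasta"
printf '>protein_3 example record\nMADEEKLPPG\n' >> "$practice_dir/sequences.fasta"

# Extract the header lines (each FASTA record starts with ">").
grep '^>' "$practice_dir/sequences.fasta"

# Count the records by counting the header lines.
grep -c '^>' "$practice_dir/sequences.fasta"
```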

In Windows (PowerShell)

  1. Open PowerShell.
  2. Go to the directory where you want to create your workspace.
  3. Create directories.

Option 1 (step-by-step):

mkdir GENE433
cd GENE433
mkdir labs
cd labs
mkdir pre-lab2
cd pre-lab2

Option 2 (a single mkdir call):

mkdir GENE433\labs\pre-lab2
cd GENE433\labs\pre-lab2

  4. Download a structure file:

Invoke-WebRequest -Uri https://files.rcsb.org/download/5IKR.pdb -OutFile 5IKR.pdb

  5. Count ATOM lines:

Select-String -Path 5IKR.pdb -Pattern '^ATOM' | Measure-Object | Select-Object -ExpandProperty Count

  6. Extract FASTA headers (example):

Select-String -Path sequences.fasta -Pattern '^>'

Transition to core laboratory modules

After completing this preliminary module, you should be prepared to:

  • Navigate biological databases
  • Interpret molecular structure files
  • Use molecular visualization tools effectively
  • Understand the computational basis of structure prediction and docking