Pre-Lab 1 - Understanding Biological Data and Molecular Geometry

This pre-lab has two parts:

  1. Understanding the types of biological data you will encounter in bioinformatics and structural biology.
  2. Reviewing the basic geometry of molecules (coordinates and degrees of freedom) that underpins molecular visualization, docking, and structure refinement.

Part A - Understanding (Biological) Data

Why do we care about “data”?

To work effectively in computational biology, you must understand what “data” means in practice:

  • What the data represents (sequence, structure, annotations, measurements)
  • How it is stored (text vs binary)
  • How software tools interpret and transform it

In this course you will mostly work with text files and a smaller number of binary files.

  • Text files are human-readable and typically encoded in ASCII or UTF-8.
  • Binary files are not human-readable and are optimized for speed and storage. You work with them through software (for example a graphical interface or a command-line tool).

Most of the bioinformatics formats we will use (FASTA, PDB, mmCIF) are text-based.
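
To make the text/binary distinction concrete, here is a minimal Python sketch. It stores the same value, 12.5, once as text and once as packed binary. The variable names are illustrative, and the 4-byte little-endian float layout is an assumption chosen for the example, not the layout of any particular bioinformatics format.

```python
import struct

# Text: the characters "12.5" encoded as UTF-8 bytes (ASCII is a subset of UTF-8).
text_bytes = "12.5".encode("utf-8")
print(text_bytes)                   # the bytes still spell out the characters
print(text_bytes.decode("utf-8"))   # decoding gives back the readable string

# Binary: the number 12.5 packed as a 4-byte little-endian float.
binary_bytes = struct.pack("<f", 12.5)
print(binary_bytes)                 # raw bytes, not meaningful when read as text
value = struct.unpack("<f", binary_bytes)[0]
print(value)                        # software must know the layout to recover the number
```

The text version can be opened in any editor. The binary version only makes sense to a program that knows how the bytes are laid out.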

What is biological data?

In bioinformatics, data may represent:

  • Nucleotide sequences
  • Protein sequences
  • Three-dimensional atomic coordinates
  • Functional annotations
  • Experimental measurements

[Figure: Common biological data types]


File Types and Extensions

A file type is defined by the structure of the content. A file extension is a naming convention that helps humans and software guess how to handle the file.

Common text-based formats you will use in this course:

  • FASTA: .fasta, .fa, .faa (plain text)
  • PDB: .pdb (plain text, fixed-column format)
  • mmCIF: .cif (plain text, structured key/value format)
  • MOL2: .mol2 (plain text)
  • CSV: .csv (plain text)

[Figure: File extensions concept]

Important: renaming a file does not change its contents. For example, renaming protein.pdb to protein.txt does not convert the file into a different format; it only changes the filename.
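
One way to check this claim is to compare a checksum of the file's bytes before and after renaming. The sketch below works in a temporary directory with a made-up two-line file; the helper name sha256_of and the file contents are assumptions for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "protein.pdb"
    original.write_text("HEADER    TOY EXAMPLE\nEND\n", encoding="ascii")
    digest_before = sha256_of(original)

    # Rename protein.pdb to protein.txt; only the filename changes.
    renamed = original.with_suffix(".txt")
    original.rename(renamed)
    digest_after = sha256_of(renamed)

print(renamed.name)
print(digest_before == digest_after)
```

Identical digests mean the bytes on disk are identical, so the file is still in PDB format regardless of its new name.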


How software tools use these files

Most bioinformatics software does three things with your files:

  1. Parse the file according to a known format (e.g., “PDB lines are fixed columns”).
  2. Build an internal representation (atoms, residues, chains, coordinates, metadata).
  3. Apply transformations (delete a chain, mutate a residue, align structures) and optionally write output back to disk.
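
The three steps above can be sketched in a few lines of Python. This is a simplified illustration, not how any particular viewer is implemented. It reads only a few fields of PDB ATOM records (atom serial, atom name, residue name, chain ID, residue number, x, y, z) from their fixed columns, and the toy coordinates are made up. The function names parse_atom and format_atom are hypothetical.

```python
# Toy input: four ATOM records in PDB fixed-column layout (made-up coordinates).
pdb_text = """\
ATOM      1  N   MET A   1      27.340  24.430  12.614
ATOM      2  CA  MET A   1      26.266  25.413  12.842
ATOM      3  N   GLY B   1      10.000  11.000  12.000
ATOM      4  CA  GLY B   1      11.000  12.000  13.000
"""


def parse_atom(line):
    """Step 1: read fields from fixed column positions (0-based slices)."""
    return {
        "serial": int(line[6:11]),       # columns 7-11
        "name": line[12:16].strip(),     # columns 13-16
        "res_name": line[17:20].strip(), # columns 18-20
        "chain": line[21],               # column 22
        "res_seq": int(line[22:26]),     # columns 23-26
        "x": float(line[30:38]),         # columns 31-38
        "y": float(line[38:46]),         # columns 39-46
        "z": float(line[46:54]),         # columns 47-54
    }


def format_atom(atom):
    """Write an atom back as a fixed-column line.

    Simplification: assumes atom names of at most three characters and
    omits occupancy, B-factor, and element columns.
    """
    return (
        f"ATOM  {atom['serial']:>5}  {atom['name']:<3} {atom['res_name']:>3} "
        f"{atom['chain']}{atom['res_seq']:>4}    "
        f"{atom['x']:8.3f}{atom['y']:8.3f}{atom['z']:8.3f}"
    )


# Step 2: build an internal representation (a list of atom dictionaries).
atoms = [parse_atom(line) for line in pdb_text.splitlines() if line.startswith("ATOM")]

# Step 3: transform (delete chain B) and write the result back as text.
kept = [atom for atom in atoms if atom["chain"] != "B"]
output_text = "\n".join(format_atom(atom) for atom in kept) + "\n"
print(output_text, end="")
```

Note that the output is regenerated from the internal representation rather than edited in place, which is why a file saved by a tool can differ in small ways from the one you loaded.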

Hands-on A1 - CSV is just text

  1. Create a new file named toy_table.csv.
  2. Paste the following content and save:
name,score
Alice,90
Bob,75
Charlie,88
  3. Open it:
     • In a spreadsheet program (LibreOffice / Excel)
     • In a text editor

Observe that the file itself is unchanged; only the viewer differs.
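
The same idea can be shown in Python, as a rough sketch: the identical text is first printed as-is (the "text editor" view) and then interpreted as a table with the standard csv module (the "spreadsheet" view). The variable names are illustrative.

```python
import csv
import io

# The exact content of toy_table.csv, held as one string.
csv_text = "name,score\nAlice,90\nBob,75\nCharlie,88\n"

# View 1: plain text, as a text editor would show it.
print(csv_text, end="")

# View 2: a table, as a spreadsheet would interpret it.
rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    print(row["name"], row["score"])

# Values are still text until a program converts them to numbers.
mean_score = sum(int(row["score"]) for row in rows) / len(rows)
print(round(mean_score, 2))
```

Every value read from a CSV file starts out as a string; it becomes a number only when the software converts it.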

Hands-on A2 - Changing the extension does not change the data

  1. Take any small text file (for example your toy_table.csv).
  2. Rename it to toy_table.txt.
  3. Open it again in a text editor.

You should see identical contents.

Optional: rename it back to .csv and open it in a spreadsheet program again.

Hands-on A3 - What happens when you “delete a chain” in a structure viewer?

  1. Download any structure in PDB format (.pdb) from the Protein Data Bank.
  2. Open it in Chimera/ChimeraX.
  3. Delete one chain.
  4. Save the structure.

Then open the saved file in a text editor.

  • Compared with the file you downloaded, many lines are missing (the ATOM/HETATM records of the atoms that belonged to the deleted chain).
  • The software did not “magically” change the nature of the data; it rewrote the text representation from its internal model, following the file format rules.

Part B - Geometry of molecules

Cartesian coordinates in structural biology

The position of each atom in a protein structure is given by three Cartesian coordinates (in PDB and mmCIF files these are in ångströms):

  • x
  • y
  • z

So a protein structure is a collection of points in 3D space (atoms), connected by chemical bonds.
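
Because atoms are just points, ordinary geometry applies to them. As a small sketch, the distance between two atoms follows from the Pythagorean theorem in three dimensions. The coordinates below are made up for illustration, and the function name distance is hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two points given as (x, y, z) tuples."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


# Two made-up atom positions.
atom_n = (27.340, 24.430, 12.614)
atom_ca = (26.266, 25.413, 12.842)

d = distance(atom_n, atom_ca)
print(f"{d:.2f}")
```

Viewers use the same calculation when you measure a distance between two selected atoms.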

[Figure: Cartesian coordinates]

Degrees of freedom (DoF)

Degrees of freedom are the independent ways in which a system can move.

A daily-life analogy:

  • A door hinge rotates about a single axis.
  • A finger joint bends mainly about one axis, and only through a limited range.
  • Your shoulder, a ball-and-socket joint, can rotate about several axes.

In molecular systems:

  • Translation: movement along the x, y, z axes
  • Rotation: rotation about the x, y, z axes
  • Torsion: rotation around chemical bonds
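
Translation and rotation together make up rigid-body motion: every atom moves, but no distance between atoms changes. The sketch below illustrates this with three made-up points, a rotation about the z axis, and a shift along x. The function names rotate_z and translate are hypothetical, and rotation about a single axis is a simplification of a general 3D rotation.

```python
import math


def rotate_z(point, angle_deg):
    """Rotate a point (x, y, z) about the z axis by angle_deg degrees."""
    a = math.radians(angle_deg)
    x, y, z = point
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a),
            z)


def translate(point, shift):
    """Move a point by the vector shift = (dx, dy, dz)."""
    return tuple(p + s for p, s in zip(point, shift))


# Three made-up atom positions forming a small rigid fragment.
atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]

# Rigid-body move: rotate 90 degrees about z, then shift 10 units along x.
moved = [translate(rotate_z(p, 90.0), (10.0, 0.0, 0.0)) for p in atoms]

# Distances between atoms are the same before and after the move.
before = math.dist(atoms[0], atoms[2])
after = math.dist(moved[0], moved[2])
print(math.isclose(before, after))
```

A torsion, by contrast, moves only part of the molecule relative to the rest, so some interatomic distances do change.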

[Figure: Degrees of freedom]

A useful way to think about degrees of freedom:

System              Typical degrees of freedom
Rigid body          6 (3 translation + 3 rotation)
Protein backbone    Torsions: \(\phi\) and \(\psi\) angles
Side chains         Torsions: \(\chi\) angles

These concepts underpin molecular visualization, docking, and structure refinement.
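
For reference, a torsion (dihedral) angle such as \(\phi\), \(\psi\), or \(\chi\) is defined by four consecutive atoms and measures the rotation about the bond between the middle two. The sketch below uses a common atan2-based formula on made-up points. The function names are hypothetical, and molecular viewers compute this for you.

```python
import math


def sub(a, b):
    return tuple(ai - bi for ai, bi in zip(a, b))


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees about the p1-p2 bond."""
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)   # normals of the two planes
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Made-up points: the central bond runs along z from p1 to p2.
p0, p1, p2 = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)

cis = dihedral(p0, p1, p2, (1.0, 0.0, 1.0))      # p3 on the same side as p0
trans = dihedral(p0, p1, p2, (-1.0, 0.0, 1.0))   # p3 on the opposite side
quarter = dihedral(p0, p1, p2, (0.0, 1.0, 1.0))  # p3 a quarter turn away
print(cis, abs(trans), quarter)
```

Changing a torsion angle rotates everything on one side of the central bond while bond lengths stay fixed, which is why torsions are the main internal degrees of freedom of a protein.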

Hands-on B1 - Observe translation and rotation in a viewer

Warning

For now, you are not expected to have any of the programs mentioned below installed on your computer. This is just to let you know what we will demonstrate in the lab.

  1. Open any structure in Chimera/ChimeraX/Jmol.
  2. Use the mouse controls to:

     • Translate the model
     • Rotate the model
  3. Reset the view.

Write down which mouse actions produce translation and which produce rotation.

Hands-on B2 - Observe torsion angles (conceptual)

  1. In a molecular viewer, choose a residue with a visible side chain.
  2. Find the tool/menu that displays torsions/dihedrals (software-dependent).
  3. Identify:

     • Backbone \(\phi\) and \(\psi\)
     • One side-chain \(\chi\) angle

You do not need to calculate anything in this pre-lab; the goal is to recognize that proteins have internal rotational degrees of freedom.