
Week 2 - Secondary Structure Prediction

Protein Secondary Structure Prediction Methods

1. Chou-Fasman Method

This method relies on the relative frequencies of amino acids in secondary structures: α-helices, β-sheets, and turns. Chou and Fasman used 3D structures solved with X-ray crystallography in the 1970s to determine relative amino acid frequencies in secondary structures of proteins.

This method estimates how likely a stretch of amino acids is to form an α-helix, a β-sheet, or a turn using only the propensity of each individual amino acid to occur in that secondary structure. It does not model how neighboring residues influence one another, which is why its accuracy is limited (about 50-60%). Another weakness is that the reference structures are limited to those solved in the 1970s, so it cannot compete with newer methods.
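To make the idea of "relative frequencies" concrete, here is a minimal Python sketch of how a propensity table could be built from structures with known secondary structure and then used residue by residue. The function names and the tiny training set are hypothetical and invented for illustration; they are not the published Chou-Fasman parameters, and the sketch leaves out the additional rules the original method applies when assigning helix and sheet regions.

```python
from collections import Counter

def propensity_table(sequences, structures, states="HBT"):
    """Return {amino_acid: {state: propensity}} from labeled training data.

    propensity = (frequency of the amino acid within the state)
                 / (frequency of the amino acid overall)
    Values above 1 mean the residue is over-represented in that state.
    """
    aa_counts, state_counts, pair_counts = Counter(), Counter(), Counter()
    for seq, ss in zip(sequences, structures):
        for aa, state in zip(seq, ss):
            aa_counts[aa] += 1
            state_counts[state] += 1
            pair_counts[(aa, state)] += 1
    total = sum(aa_counts.values())
    table = {}
    for aa, n_aa in aa_counts.items():
        table[aa] = {}
        for state in states:
            if state_counts[state] == 0:
                table[aa][state] = 0.0
            else:
                in_state = pair_counts[(aa, state)] / state_counts[state]
                table[aa][state] = in_state / (n_aa / total)
    return table

def predict_per_residue(sequence, table, states="HBT", unknown="-"):
    """Give each residue the state with its highest propensity, ignoring neighbors."""
    out = []
    for aa in sequence:
        if aa not in table:
            out.append(unknown)  # residue never seen in the training data
        else:
            out.append(max(states, key=lambda s: table[aa][s]))
    return "".join(out)

# Hypothetical toy training set (H = helix, B = beta-sheet, T = turn).
TOY_SEQS = ["AEAELKGN", "VIVIYVGP"]
TOY_SS = ["HHHHHHTT", "BBBBBBTT"]

table = propensity_table(TOY_SEQS, TOY_SS)
print(predict_per_residue("AEVGX", table))  # prints HHBT-
```

Because every residue is scored on its own, the same amino acid always receives the same label regardless of context, which illustrates the limitation described above.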

Example 1: Predict secondary structure of “C5a anaphylatoxin chemotactic receptor” (Homo sapiens) using Chou-Fasman

  1. Go to http://fasta.bioch.virginia.edu
  2. Click on Protein-Protein FASTA
  3. Select Hydropathy/SecondaryStructure/seg

FASTA tool navigation

In the new window:

  1. Choose the (A) Program as Chou-Fasman Secondary Structure Prediction
  2. Choose the format as FASTA
  3. Retrieve the primary sequence of C5a anaphylatoxin chemotactic receptor in FASTA format from either NCBI or UniProt (as done in Week 1)
  4. Copy and paste the FASTA sequence into the input box
  5. Submit the sequence
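If you prefer to retrieve the sequence with a script rather than through the browser, the sketch below shows one possible way using the UniProt REST endpoint. The helper names are hypothetical, and the accession in the usage comment (P21730, which should correspond to human C5AR1) is an assumption to verify on UniProt before relying on it.

```python
from urllib.request import urlopen

def uniprot_fasta_url(accession):
    """Build the UniProt REST URL that returns one entry in FASTA format."""
    return f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"

def parse_fasta(text):
    """Return (header, sequence) for the first record in FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # a second record starts here; stop after the first
            header = line[1:]
        elif header is not None:
            chunks.append(line)
    if header is None:
        raise ValueError("no FASTA header found")
    return header, "".join(chunks)

def fetch_fasta(accession, timeout=30):
    """Download one UniProt entry and parse it (requires network access)."""
    with urlopen(uniprot_fasta_url(accession), timeout=timeout) as response:
        return parse_fasta(response.read().decode("utf-8"))

# Usage (accession is an assumption; confirm it on UniProt first):
# header, sequence = fetch_fasta("P21730")
# print(header, len(sequence))
```

The web tools in this exercise expect the full FASTA text, so keep the header line when pasting; the parsed sequence alone is handy for the later comparisons.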

Choose Chou-Fasman program

After getting the results, comment on the secondary structure content of this receptor.

2. GOR (Garnier, Osguthorpe, and Robson) method

The GOR method is a refinement of the Chou-Fasman method. It uses a Bayesian approach and takes into account the conditional probability that neighboring amino acids adopt the same secondary structure. Its accuracy is about 65% (higher than Chou-Fasman). GOR is generally better at predicting α-helices than β-sheets; it often confuses β-sheets with turns and random coils.
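The following Python sketch illustrates the general idea of letting neighboring residues contribute to the prediction. It is a simple window-based naive Bayes scorer, not the published GOR parameters or algorithm; the function names, the window size, the pseudocount, and the toy training data are all assumptions made for illustration.

```python
import math
from collections import Counter

def train_window_model(sequences, structures, half_window=2):
    """Count which residues appear at each window offset around each state."""
    state_counts = Counter()
    offset_counts = Counter()  # (state, offset, amino_acid) -> count
    alphabet = set()
    for seq, ss in zip(sequences, structures):
        alphabet.update(seq)
        for i, state in enumerate(ss):
            state_counts[state] += 1
            for m in range(-half_window, half_window + 1):
                if 0 <= i + m < len(seq):
                    offset_counts[(state, m, seq[i + m])] += 1
    return {"states": sorted(state_counts), "state_counts": state_counts,
            "offset_counts": offset_counts, "alphabet": sorted(alphabet),
            "half_window": half_window}

def predict_with_window(sequence, model, pseudocount=1.0):
    """Per residue, pick the state maximising
    log P(state) + sum over the window of log P(neighbor | state, offset)."""
    counts, alphabet = model["offset_counts"], model["alphabet"]
    total_states = sum(model["state_counts"].values())
    w = model["half_window"]
    prediction = []
    for i in range(len(sequence)):
        best_state, best_score = None, -math.inf
        for state in model["states"]:
            score = math.log(model["state_counts"][state] / total_states)
            for m in range(-w, w + 1):
                if not 0 <= i + m < len(sequence):
                    continue  # window runs past the end of the sequence
                seen = counts[(state, m, sequence[i + m])]
                denom = sum(counts[(state, m, aa)] for aa in alphabet)
                # The pseudocount keeps unseen residues from giving log(0).
                score += math.log((seen + pseudocount) /
                                  (denom + pseudocount * len(alphabet)))
            if score > best_score:
                best_state, best_score = state, score
        prediction.append(best_state)
    return "".join(prediction)

# Hypothetical toy data (H = helix, B = beta-sheet, C = coil).
model = train_window_model(["AAAAAGGGGG", "VVVVVGGGGG"],
                           ["HHHHHCCCCC", "BBBBBCCCCC"])
print(predict_with_window("AAAAAGGGGG", model))
```

Unlike a per-residue propensity lookup, the label at each position here depends on the residues on either side of it, which is the key conceptual difference between the two methods.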

Example 2: Predict secondary structure of “C5a anaphylatoxin chemotactic receptor” (Homo sapiens) using GOR

  1. Go to https://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_gor.html
  2. Choose secondary structure prediction from the list
  3. Click on GOR – Garnier et al, 1996
  4. Type C5aR as the sequence name
  5. Copy and paste the primary structure sequence in FASTA format into the box provided
  6. Click SUBMIT and wait for the results
  7. Note the helix, extended strand, and random coil content of the protein
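If you copy the per-residue prediction string from the results page, a short script can compute the content for you. This is a minimal sketch; the function name is hypothetical, and the state symbols (h, e, c) are an assumption about the output format, so adjust them to whatever the server actually returns.

```python
from collections import Counter

def secondary_structure_content(states):
    """Return the percentage of residues assigned to each state symbol."""
    symbols = [s for s in states if not s.isspace()]  # ignore line breaks and spaces
    if not symbols:
        return {}
    counts = Counter(symbols)
    return {state: 100.0 * n / len(symbols) for state, n in counts.items()}

# Hypothetical prediction string: h = helix, e = extended strand, c = random coil.
predicted = "cchhhhhhhhcceeeecc"
for state, percent in sorted(secondary_structure_content(predicted).items()):
    print(f"{state}: {percent:.1f}%")
```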

3. Neural Network Method

Neural network methods are trained on proteins whose structures have already been solved. In this way, they learn to recognize common sequence motifs associated with particular arrangements of secondary structures. Neural network methods typically reach about 70% accuracy. However, they are still less reliable at predicting β-strands.
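Before a network can be trained, the sequence has to be turned into numbers. One common way to do this, sketched below, is to one-hot encode a sliding window around each residue. The function name, the window size, and the extra padding flag are assumptions made for illustration; the server used in Example 3 may encode its input differently.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def encode_windows(sequence, half_window=6):
    """One-hot encode a sliding window around every residue.

    Each window position contributes 21 inputs: one per standard amino acid
    plus a final flag for padding (past the sequence ends) or unknown residues.
    """
    index = {aa: k for k, aa in enumerate(AMINO_ACIDS)}
    width = len(AMINO_ACIDS) + 1
    encoded = []
    for i in range(len(sequence)):
        features = []
        for j in range(i - half_window, i + half_window + 1):
            slot = [0] * width
            if 0 <= j < len(sequence) and sequence[j] in index:
                slot[index[sequence[j]]] = 1
            else:
                slot[-1] = 1  # padding or unknown residue
            features.extend(slot)
        encoded.append(features)
    return encoded

# Each residue becomes one input vector of (2 * half_window + 1) * 21 numbers.
vectors = encode_windows("MDSFNYTTPDYGHYDD")
print(len(vectors), len(vectors[0]))
```

Each vector would be paired with the known state of the central residue during training, so the network sees the sequence context rather than a single amino acid.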

Example 3: Predict secondary structure of “C5a anaphylatoxin chemotactic receptor” (Homo sapiens) using neural network methods

  1. Go to https://npsa-prabi.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_hnn.html
  2. Type C5aR as the sequence name
  3. Copy and paste the primary structure sequence in FASTA format into the box provided
  4. Click SUBMIT and wait for the results
  5. Note the helix, extended strand, and random coil content of the protein

Example 4: Compare accuracy using a known structure

Let’s check the accuracy of the three methods by applying them to a protein whose structure is already known:

  • E. coli RNase II protein
  • PDB ID: 2ID0

Following Examples 1-3, find the secondary structure content of this protein and compare it with crystal structure data.
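Besides comparing overall content, you can compare a prediction with the experimental assignment residue by residue. The sketch below computes the percentage of positions that agree (often called Q3 when three states are used). The function name is hypothetical, and it assumes both strings cover the same residues, have the same length, and use the same state symbols; the example strings are invented.

```python
def q3_accuracy(predicted, observed):
    """Percentage of residues whose predicted state matches the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have the same length")
    if not predicted:
        raise ValueError("empty input")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(predicted)

# Hypothetical strings: h = helix, e = extended strand, c = coil.
print(q3_accuracy("hhhhcceeee", "hhhccceeec"))  # prints 80.0
```

Assignments derived from a crystal structure may use more than three state symbols, so they usually need to be reduced to the same three states before this comparison is meaningful.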

You can do the same for C5aR (PDB ID: 8HK5). Be careful: this structure file contains more than one protein (C5aR in a complex with Gi and βγ proteins).


Protein Structure Databases

1. CATH

The CATH database is a hierarchical domain classification of protein structures in the Protein Data Bank (PDB). Only crystal structures solved to resolution better than 4.0 Å are considered, together with NMR structures. All non-proteins, models, and structures with greater than 30% “C-alpha only” are excluded.

CATH classifies protein structures into subgroups:

  • C → A → T → H

Levels:

  • Class (C-level): determined according to secondary structure composition and packing. Major classes: mainly-alpha, mainly-beta, alpha-beta
  • Architecture (A-level): describes the overall shape as determined by the orientations of secondary structures (ignores connectivity)
  • Topology / Fold family (T-level): domains are grouped by whether they share the same topology/connectivity of secondary structures in the domain core
  • Homologous Superfamily (H-level): domains thought to share a common ancestor

CATH hierarchy
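CATH search results report each domain's classification as a dotted number with one field per level. As a small illustration, the sketch below splits such an identifier into its four levels. The function name is hypothetical, and the class-number mapping (1, 2, 3) is added here for illustration and covers only the three major classes listed above.

```python
# Class numbers for the three major classes named in the text (illustrative mapping).
CLASS_NAMES = {"1": "mainly-alpha", "2": "mainly-beta", "3": "alpha-beta"}

LEVELS = ["class", "architecture", "topology", "homologous_superfamily"]

def split_cath_id(cath_id):
    """Split a four-part CATH identifier into its cumulative C, A, T and H codes."""
    parts = cath_id.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated numbers, got {cath_id!r}")
    return {level: ".".join(parts[:k + 1]) for k, level in enumerate(LEVELS)}

levels = split_cath_id("3.40.50.300")
print(levels)
print(CLASS_NAMES.get(levels["class"], "other class"))
```

Two domains that share the first three fields have the same topology even if they belong to different homologous superfamilies, which is a quick way to read the results in the next exercise.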

Example 5: Find CATH domains

Find the CATH domains in:

  • “C5a anaphylatoxin chemotactic receptor” (Homo sapiens)
  • “OmpF porin” (Escherichia coli APEC O1 strain)
  • “Glycogen phosphorylase isoform A” (Drosophila melanogaster)

Steps:

  1. Go to http://www.cathdb.info/
  2. Click on Search CATH by protein sequence
  3. Paste the FASTA format amino acid sequence into the search box
  4. Click on sequence search
  5. Note the CATH domains of the proteins

2. SCOP: Structural Classification of Proteins

SCOP is another database that classifies proteins into folds, superfamilies, and families. It aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known.


Example 6: Search SCOP database for “fumarase”

  1. Go to https://www.ebi.ac.uk/pdbe/scop/
  2. Under access methods, click on Keyword search of SCOP entries

SCOP keyword search

  3. Type fumarase into the search box
  4. Choose fumarase from Escherichia coli
  5. Repeat the exercise for PPK (polyphosphate kinase) and analyze the results