Week 2 - Secondary Structure Prediction

Protein Secondary Structure Prediction Methods

1. Chou-Fasman Method

This method relies on the relative frequencies of amino acids in secondary structures: α-helices, β-sheets, and turns. Chou and Fasman used 3D structures solved with X-ray crystallography in the 1970s to determine relative amino acid frequencies in secondary structures of proteins.

This method estimates how likely a stretch of amino acids is to form an α-helix, a β-sheet, or a turn from the propensity of each individual amino acid to occur in that structure. Because each propensity is derived for a single residue, independent of its neighbors, the accuracy is limited (about 50-60%). Another weakness is that the propensities come from the small set of structures solved by the 1970s, so the method cannot compete with newer ones.
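As a rough illustration of per-residue propensities, the sketch below averages helix and strand propensities over a short window and reports whichever dominates. The propensity values are approximate, illustrative numbers for a handful of residues (an assumption; check the published Chou-Fasman table before any real use), and the function names, the 6-residue window, and the 1.0 threshold are choices made here. It is a simplified stand-in, not the full Chou-Fasman nucleation and extension rules.

```python
# Illustrative (helix, strand) propensities for a few residues only.
# Values are approximate and assumed for demonstration purposes.
PROPENSITY = {
    "A": (1.42, 0.83),
    "E": (1.51, 0.37),
    "L": (1.21, 1.30),
    "V": (1.06, 1.70),
    "I": (1.08, 1.60),
    "G": (0.57, 0.75),
    "P": (0.57, 0.55),
}


def call_window(window):
    """Return 'H', 'E', or '-' from the mean propensities of one window."""
    helix = sum(PROPENSITY[r][0] for r in window) / len(window)
    strand = sum(PROPENSITY[r][1] for r in window) / len(window)
    if helix > 1.0 and helix >= strand:
        return "H"
    if strand > 1.0 and strand > helix:
        return "E"
    return "-"


def scan(sequence, size=6):
    """Label every full window of `size` residues along the sequence."""
    return [call_window(sequence[i:i + size])
            for i in range(len(sequence) - size + 1)]


# Made-up test sequence: helix-favoring, then breaker-rich, then strand-favoring.
print(scan("AEAEAEGPGPGPVIVIVI"))
```

Each residue contributes the same value whatever surrounds it, which is the context-independence described above.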

Example 1: Predict secondary structure of “C5a anaphylatoxin chemotactic receptor” (Homo sapiens) using Chou-Fasman

Go to http://fasta.bioch.virginia.edu
Click on Protein-Protein FASTA
Select Hydropathy/SecondaryStructure/seg

In the new window:
Choose the (A) Program as Chou-Fasman Secondary Structure Prediction
Choose the format as FASTA
Retrieve the primary sequence of C5a anaphylatoxin chemotactic receptor in FASTA format from either NCBI or UniProt (as done in Week 1)
Copy and paste the FASTA sequence into the input box
Submit the sequence
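Before pasting, it can help to check that the retrieved text really is a single FASTA record. The sketch below is a minimal parser, assuming the usual layout of one `>` header line followed by sequence lines; `read_fasta` is a name chosen here and the record is a made-up placeholder, not the real C5aR sequence.

```python
def read_fasta(text):
    """Return (header, sequence) for the first record in FASTA-formatted text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # a second record starts here; ignore the rest
            header = line[1:]
        elif header is not None:
            chunks.append(line)
    if header is None:
        raise ValueError("no FASTA header line found")
    return header, "".join(chunks).upper()


# Hypothetical record used only to exercise the parser.
example = """>demo|TEST1 made-up protein
MKTAYIAK
qrqisfvk
"""
header, sequence = read_fasta(example)
print(header, len(sequence))
```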

Figure: choosing the Chou-Fasman program.

After getting the results, comment on the secondary structure content of this receptor.

2. GOR (Garnier, Osguthorpe, and Robson) Method

The GOR method can be seen as a refinement of the Chou-Fasman method. It uses a Bayesian approach and takes into account the conditional probability that neighboring amino acids adopt the same secondary structure. Its accuracy is about 65% (higher than Chou-Fasman). GOR is generally better at predicting α-helices than β-sheets; it often confuses β-sheets with turns and random coils.
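The idea of letting neighbors influence the call for each residue can be sketched as a window sum. In the sketch below, every position within a window around residue i adds a score toward each state, and the state with the highest total wins. The 17-residue window (8 on each side) follows the published GOR description; the score table layout, the function name `gor_like_predict`, and the toy scores are assumptions made for illustration, not the real GOR parameters.

```python
def gor_like_predict(sequence, info, states="HEC", half_window=8):
    """Pick, for each residue, the state with the largest summed window score.

    `info` maps (state, offset, residue) to a score; missing keys count as 0.
    """
    prediction = []
    for i in range(len(sequence)):
        scores = {}
        for state in states:
            total = 0.0
            for offset in range(-half_window, half_window + 1):
                j = i + offset
                if 0 <= j < len(sequence):
                    total += info.get((state, offset, sequence[j]), 0.0)
            scores[state] = total
        # On a tie, max() keeps the first state in `states`.
        prediction.append(max(states, key=scores.get))
    return "".join(prediction)


# Toy scores (hypothetical): an alanine one position ahead also supports helix.
toy_info = {
    ("H", 0, "A"): 1.0,
    ("H", 1, "A"): 0.6,
    ("E", 0, "V"): 1.0,
    ("C", 0, "G"): 1.0,
}
print(gor_like_predict("VAG", toy_info, half_window=1))
```

A real table would hold one value per state, offset, and amino acid, estimated from solved structures.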

Example 2: Predict secondary structure of “C5a anaphylatoxin chemotactic receptor” (Homo sapiens) using GOR

Go to https://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_gor.html
Choose secondary structure prediction from the list
Click on GOR – Garnier et al, 1996
Type C5aR as the sequence name
Copy and paste the primary structure sequence in FASTA format into the box provided
Click SUBMIT and wait for the results
Note the helix, extended strand, and random coil content of the protein
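If the per-residue prediction is copied out as a string, the content percentages can be recomputed and compared between methods. This is a minimal sketch that assumes one letter per residue, with h for helix, e for extended strand, and c for random coil; the server's actual output layout may differ, and the example string is made up.

```python
from collections import Counter


def structure_content(prediction):
    """Return the percentage of each state in a per-residue prediction string."""
    states = [c.lower() for c in prediction if c.isalpha()]
    if not states:
        raise ValueError("empty prediction string")
    counts = Counter(states)
    return {state: 100.0 * n / len(states) for state, n in counts.items()}


# Made-up 20-residue prediction, not real server output.
content = structure_content("cchhhhhhcceeeecccccc")
for state, percent in sorted(content.items()):
    print(f"{state}: {percent:.1f}%")
```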

3. Neural Network Method

Neural network methods train the prediction program on proteins whose structures have already been solved. In this way, they learn common sequence motifs associated with particular arrangements of secondary structure. Their accuracy is typically around 70%. However, they remain less reliable at predicting β-strands.
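To train on solved structures, each residue has to be turned into numbers a network can read. One common way to do this (assumed here; the text does not say how any particular server encodes its input) is a one-hot encoding of a sliding window around the residue. The 13-residue window, the padding slot, and the function names are illustrative choices.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(residue):
    """Encode one residue as 21 values; the last slot marks padding or unknowns."""
    vector = [0] * (len(AMINO_ACIDS) + 1)
    index = AMINO_ACIDS.find(residue)
    vector[index if index >= 0 else len(AMINO_ACIDS)] = 1
    return vector


def encode_window(sequence, center, half_window=6):
    """Concatenate one-hot vectors for the window around position `center`."""
    features = []
    for j in range(center - half_window, center + half_window + 1):
        residue = sequence[j] if 0 <= j < len(sequence) else "-"
        features.extend(one_hot(residue))
    return features


# Each window becomes one training input; its label would be the known
# state (helix, strand, or coil) of the central residue.
x = encode_window("ACDEFGHIK", center=4)
print(len(x))
```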

Example 3: Predict secondary structure of “C5a anaphylatoxin chemotactic receptor” (Homo sapiens) using neural network methods

Go to https://npsa-prabi.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_hnn.html
Type C5aR as the sequence name
Copy and paste the primary structure sequence in FASTA format into the box provided
Click SUBMIT and wait for the results
Note the helix, extended strand, and random coil content of the protein

Example 4: Compare accuracy using a known structure

Let’s check the accuracy of the 3 methods by applying them to an already known structure:

E. coli RNase II protein
PDB ID: 2ID0

Following Examples 1-3, find the secondary structure content of this protein and compare it with crystal structure data.

You can do the same for C5aR (PDB ID: 8HK5). Be careful: this structure file contains more than one protein (C5aR, Gi, and βγ proteins in a complex).
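One way to put a number on the comparison is the fraction of residues whose predicted state matches the state observed in the structure, often called Q3. The sketch below assumes the two strings are already aligned residue by residue and cover the same chain. It also assumes the observed states are given as DSSP-style letters and reduces them to three states with one common convention (H, G, I to helix; E, B to strand; everything else to coil). The example strings are made up.

```python
def reduce_to_three_states(dssp_states):
    """Map DSSP-style letters to H, E, or C using one common convention."""
    helix, strand = set("HGI"), set("EB")
    return "".join(
        "H" if s in helix else "E" if s in strand else "C"
        for s in dssp_states.upper()
    )


def q3_accuracy(predicted, observed):
    """Percentage of positions where predicted and observed states agree."""
    if len(predicted) != len(observed):
        raise ValueError("strings must cover the same residues")
    if not predicted:
        raise ValueError("empty input")
    matches = sum(p == o for p, o in zip(predicted.upper(), observed.upper()))
    return 100.0 * matches / len(predicted)


# Made-up strings for illustration only.
observed = reduce_to_three_states("HHGEBT")
print(observed, q3_accuracy("HHHECC", observed))
```

For a multi-chain file such as 8HK5, only the chain that corresponds to the submitted sequence should be compared.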

Protein Structure Databases

1. CATH

The CATH database is a hierarchical domain classification of protein structures in the Protein Data Bank (PDB). Only crystal structures solved to a resolution better than 4.0 Å are considered, together with NMR structures. All non-proteins, models, and structures with more than 30% “C-alpha only” are excluded.

CATH classifies protein structures into subgroups:

C → A → T → H

Levels:

Class (C-level): determined according to secondary structure composition and packing. Major classes: mainly-alpha, mainly-beta, alpha-beta.
Architecture (A-level): describes the overall shape as determined by the orientations of secondary structures (ignores connectivity).
Topology / Fold family (T-level): grouped by whether they share the same topology/connectivity of secondary structures in the domain core.
Homologous Superfamily (H-level): domains thought to share a common ancestor.
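CATH identifiers are written as four dot-separated numbers, one per level, for example 3.40.50.300. The sketch below splits such an identifier into its C, A, T, and H prefixes. The function names are chosen here, and class numbers 1 to 3 are assumed to correspond to the three major classes listed above, in that order.

```python
# Assumed mapping of class numbers to the major classes named in the text.
CATH_CLASSES = {"1": "mainly-alpha", "2": "mainly-beta", "3": "alpha-beta"}
LEVELS = ("class", "architecture", "topology", "homologous_superfamily")


def split_cath_id(cath_id):
    """Return the identifier prefix at each of the four CATH levels."""
    parts = cath_id.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated numbers, got {cath_id!r}")
    return {level: ".".join(parts[:i + 1]) for i, level in enumerate(LEVELS)}


def class_name(cath_id):
    """Look up the major class name, or 'other' for numbers not listed here."""
    return CATH_CLASSES.get(split_cath_id(cath_id)["class"], "other")


levels = split_cath_id("3.40.50.300")
print(levels, class_name("3.40.50.300"))
```

Two domains that share the first three numbers but differ in the fourth have the same topology yet are placed in different homologous superfamilies.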

Figure: CATH hierarchy.

Example 5: Find CATH domains

Find the CATH domains in:

“C5a anaphylatoxin chemotactic receptor” (Homo sapiens)
“OmpF porin” (Escherichia coli APEC O1 strain)
“Glycogen phosphorylase isoform A” (Drosophila melanogaster)

Steps:

Go to http://www.cathdb.info/
Click on Search CATH by protein sequence
Paste the FASTA format amino acid sequence into the search box
Click on sequence search
Note the CATH domains of the proteins

2. SCOP: Structural Classification of Proteins

SCOP is another database that classifies proteins into folds, superfamilies, and families. It aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known.

More information:

https://www.rcsb.org/docs/search-and-browse/browse-options/scop2

Example 6: Search SCOP database for “fumarase”

Go to https://www.ebi.ac.uk/pdbe/scop/
Under access methods, click on Keyword search of SCOP entries

Figure: SCOP keyword search.

Type fumarase into the search box
Choose fumarase from Escherichia coli
Repeat the exercise for PPK (polyphosphate kinase) and analyze the results