Week 8 - Sequence Alignment (BLAST, Pairwise and Multiple Alignment)

Most proteins are related to other proteins and may share a similar structure and function. The first and easiest way to assess how closely proteins are related is to align their primary (amino acid) sequences.

Proteins that share a common ancestor are called homologous.

Orthologs: homologous proteins in different species that diverged through speciation (e.g., myoglobin in rat and human)
Paralogs: homologous proteins that arose through gene duplication within a species (e.g., α and β hemoglobin in human)

Identity is the fraction of aligned positions at which two sequences have exactly the same amino acid. Similarity also counts positions where the amino acids differ but are structurally or functionally related.
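The two measures can be illustrated with a short calculation on an already aligned pair. This is only a sketch: the residue groups in `SIMILAR_GROUPS` and the two toy sequences are assumptions made for illustration, whereas BLAST decides similarity with a substitution matrix such as BLOSUM62.

```python
# Illustrative grouping of amino acids by side-chain chemistry.
# This grouping is an assumption for the sketch, not what BLAST uses.
SIMILAR_GROUPS = [
    set("AVLIM"),  # aliphatic / hydrophobic
    set("FYW"),    # aromatic
    set("ST"),     # hydroxyl
    set("KRH"),    # basic
    set("DE"),     # acidic
    set("NQ"),     # amide
]


def is_similar(a, b):
    """True if two residues fall in the same illustrative group."""
    return any(a in group and b in group for group in SIMILAR_GROUPS)


def percent_identity_similarity(aligned_a, aligned_b):
    """Return (% identity, % similarity) over all alignment columns.

    Inputs must have equal length, with '-' marking gaps. Similarity
    counts identical plus similar residues.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    identical = similar = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # a gap column is neither identical nor similar
        if a == b:
            identical += 1
        elif is_similar(a, b):
            similar += 1
    columns = len(aligned_a)
    return 100.0 * identical / columns, 100.0 * (identical + similar) / columns


# Toy aligned pair (not real protein sequences).
identity, similarity = percent_identity_similarity("MKT-LLIV", "MRTALLLV")
print(f"identity {identity:.1f}%, similarity {similarity:.1f}%")
```

Here five of eight columns are identical and two more (K/R and I/L) are similar, so the similarity is higher than the identity.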

Pairwise sequence alignment is used to calculate the percent identity and similarity of two sequences. BLAST (Basic Local Alignment Search Tool) is a fast, widely used tool for local sequence alignment and database searching, and it is offered by many websites and databases.
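To show what a local alignment is, the sketch below implements Smith-Waterman dynamic programming with a simple match/mismatch/gap scheme. This is not BLAST's algorithm: BLAST uses a faster seeded heuristic and a substitution matrix. The scoring values and the toy sequences are assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned a, aligned b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; a cell never drops below 0, which makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(0, diag, up, left)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero-scoring cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Toy sequences sharing the segment MKTLL (not real proteins).
result = smith_waterman("AAAMKTLLGGG", "CCMKTLLPP")
print(result)
```

Only the shared segment is reported; the unrelated flanking residues are left out, which is the defining property of a local alignment.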

Database Search

Example 1: Use BLAST to analyze an unknown protein (GPR50 from Homo sapiens)

BLAST can be used to get an idea about an unknown protein. By BLASTing, you can find similar sequences that help you guess the function and possibly structure of the protein.

Let’s BLAST GPR50 protein from Homo sapiens to find related proteins.

Go to https://www.ncbi.nlm.nih.gov/
Search for GPR50 with the search database set to Protein.

Select GPR50 protein from Homo sapiens from the dropdown menu.
A new window displays information on GPR50. On the right side there is a link menu titled Analyze this sequence.
Click Run BLAST.
The BLAST window appears with the accession number filled in automatically.
Click the BLAST button.
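The same search can also be submitted programmatically. The sketch below only builds a request URL for the NCBI BLAST URL API and does not send it. The parameter names (`CMD`, `PROGRAM`, `DATABASE`, `QUERY`) are taken from that API as I understand it and should be checked against the current NCBI documentation; `ACCESSION_PLACEHOLDER` is a stand-in, not a real GPR50 accession, and the `nr` database is an assumed default.

```python
from urllib.parse import urlencode

BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_blastp_request(query, database="nr"):
    """Build (but do not send) a protein BLAST submission URL.

    `query` may be an accession number or a raw sequence.
    """
    params = {
        "CMD": "Put",          # submit a new search
        "PROGRAM": "blastp",   # protein query against a protein database
        "DATABASE": database,
        "QUERY": query,
    }
    return BLAST_URL + "?" + urlencode(params)


# Placeholder query; replace with the accession shown on the protein page.
url = build_blastp_request("ACCESSION_PLACEHOLDER")
print(url)
```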

Alternative route:

Go to https://www.ncbi.nlm.nih.gov/
Find BLAST under Popular Resources.

[Figure: Popular resources]

Click BLAST and select protein BLAST.
In the query box you can paste:
FASTA sequence
GI number
Accession number
The results window displays a list of sequences producing significant alignments.
Results are ordered by ascending E-value, so the most significant matches appear first.
The smaller the E-value, the less likely the match arose by chance and the more significant it is.
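The link between alignment score and E-value can be sketched with the standard relationship E = m · n · 2^(−S′), where S′ is the bit score, m the query length, and n the database length. The hit names, bit scores, and lengths below are made up for illustration, and real BLAST uses effective (edge-corrected) lengths.

```python
def expect_value(bit_score, query_length, database_length):
    """Expected number of chance hits with at least this bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical hits: (name, bit score). These are not real BLAST results.
hits = [("hit_A", 52.0), ("hit_B", 310.0), ("hit_C", 28.5)]
query_length = 600            # assumed query length in residues
database_length = 50_000_000  # assumed database size in residues

# Rank hits so the smallest E-value (most significant) comes first.
ranked = sorted(
    ((name, expect_value(score, query_length, database_length)) for name, score in hits),
    key=lambda item: item[1],
)
for name, e_value in ranked:
    print(f"{name}\tE = {e_value:.2e}")
```

Because the bit score sits in the exponent, each additional bit halves the E-value, which is why strong hits have E-values very close to zero.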

PSI-BLAST:

PSI-BLAST (Position-Specific Iterated BLAST) can be run from the same window.
Select the PSI-BLAST algorithm instead of the default BLASTP algorithm.

[Figure: Program selection]

Alignment of Two Sequences

Example 2: Align melatonin receptor type 1A from Homo sapiens and Mus musculus

Go to the protein BLAST window as in Example 1.
Check the box for Align two or more sequences.

[Figure: Align two or more sequences]

Enter the FASTA sequence, GI number, or accession number of melatonin receptor type 1A from Homo sapiens in the first box and from Mus musculus in the second box.
Click BLAST.
You will see the alignment.

[Figure: Two sequences aligned]

Interpretation:

Query represents the first sequence.
Subject is the second sequence.
The middle line between the two sequences summarizes the alignment.
Identical positions are shown in the middle line as the amino acid letter itself.
+ signs mark positions where the amino acid is not identical but has changed to a similar one (e.g., two nonpolar hydrophobic amino acids, I→L).
A blank in the middle line marks a mismatch or a gap.
- signs within a sequence are gaps: positions where the other sequence has residues that this one lacks (an insertion or deletion).
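The middle line can be reproduced from two aligned rows. In BLAST output a + is shown when the substitution score for the pair is positive. The sketch below uses a small excerpt of positive off-diagonal BLOSUM62 scores rather than the full matrix, and the aligned rows are toy sequences, not the melatonin receptor.

```python
# Small excerpt of positive off-diagonal BLOSUM62 scores.
# The full matrix has many more entries; pairs missing here are
# treated as non-positive, which is a simplification.
POSITIVE_PAIRS = {
    frozenset("IL"): 2,
    frozenset("IV"): 3,
    frozenset("LM"): 2,
    frozenset("KR"): 2,
    frozenset("DE"): 2,
    frozenset("FY"): 3,
    frozenset("ST"): 1,
}


def midline(query, subject):
    """Build the BLAST-style middle line for two aligned rows."""
    if len(query) != len(subject):
        raise ValueError("aligned rows must have equal length")
    symbols = []
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            symbols.append(" ")   # gap in one of the sequences
        elif q == s:
            symbols.append(q)     # identical residue
        elif POSITIVE_PAIRS.get(frozenset((q, s)), 0) > 0:
            symbols.append("+")   # similar (positive-scoring) substitution
        else:
            symbols.append(" ")   # mismatch
    return "".join(symbols)


query_row = "MKTIL-DF"
subject_row = "MRTLLADY"
middle = midline(query_row, subject_row)
print("Query  " + query_row)
print("       " + middle)
print("Sbjct  " + subject_row)
```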

Multiple Sequence Alignment

ClustalW

Example 3: Align dopamine D1 receptor sequences from different organisms

Go to https://www.ebi.ac.uk/jdispatcher/
Paste FASTA format sequences of the D1 receptors with the following accession numbers:
NP_001108459.1
NP_001011595.1
EFN72198.1
NP_776467.1
CAA41734.1
Do not forget to include the header line (the >gi... line) at the beginning of each FASTA sequence; the aligner uses it to tell where one sequence ends and the next begins.
Click Submit at the bottom of the page.
The alignment will appear in a new window.
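Before pasting, it can help to check that every sequence has its own header line. The following is a minimal FASTA reader written for this purpose. The headers and sequences in `example` are toy values, not the D1 receptor records listed above.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples.

    Raises ValueError if sequence data appears before any '>' header line.
    """
    records = []
    header, chunks = None, []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            if header is None:
                raise ValueError("sequence data found before any '>' header line")
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy input with two records; sequence lines may be wrapped.
example = """>seq1 toy record
MKTLLV
AAGG
>seq2 toy record
MRTLIV
"""
records = parse_fasta(example)
for header, sequence in records:
    print(header, len(sequence))
```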

[Figure: ClustalW query]

Interpretation:

* indicates a fully conserved residue (identical in all sequences).
: indicates conservation within a group of strongly similar residues.
. indicates conservation within a group of weakly similar residues.

Example:

A substitution of tyrosine by phenylalanine (both aromatic amino acids) is marked with : because the two residues belong to the same strong group.
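The symbols can be derived column by column from the residue groups. The sketch below uses the strong and weak groups as I recall them from the Clustal documentation, which should be verified against the version in use; the three aligned rows are toy sequences.

```python
# Residue groups as listed in the Clustal documentation (to be verified
# against the program version you use).
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column):
    """Return '*', ':', '.', or ' ' for one alignment column."""
    residues = set(column)
    if "-" in residues:
        return " "  # columns containing a gap get no symbol
    if len(residues) == 1:
        return "*"  # fully conserved
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"  # all residues within one strong group
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."  # all residues within one weak group
    return " "


def conservation_line(aligned_rows):
    """Build the symbol line shown under a multiple alignment."""
    return "".join(column_symbol(column) for column in zip(*aligned_rows))


rows = ["MYSCGW", "MFTSGD", "MYSCGW"]  # toy alignment
symbols = conservation_line(rows)
for row in rows:
    print(row)
print(symbols)
```

In the second column tyrosine and phenylalanine both fall in the FYW group, so the column is marked with a colon.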

Options:

Click Show Colors to color amino acids according to physicochemical characteristics.
This helps track substitutions by similar residues.
Click Guide Tree.

[Figure: Guide tree]

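The Guide Tree option displays a tree that groups the proteins by pairwise sequence similarity. It resembles a phylogenetic tree, but it is built only to decide the order in which sequences are aligned and should not be read as a rigorous phylogeny.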
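The idea of grouping sequences by pairwise similarity can be sketched as follows. This example uses simple average-linkage clustering on the fraction of differing positions, which is a simplification and not necessarily the method the alignment program uses. The names and sequences are toy values.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing positions, ignoring columns with a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 1.0
    return sum(x != y for x, y in pairs) / len(pairs)


def average_linkage_tree(names, aligned):
    """Cluster sequences into a nested-tuple tree by average pairwise distance."""
    dist = {
        (i, j): p_distance(aligned[i], aligned[j])
        for i, j in combinations(range(len(names)), 2)
    }
    # Each cluster is (subtree, list of member sequence indices).
    clusters = [(name, [i]) for i, name in enumerate(names)]

    def cluster_distance(members_a, members_b):
        total = sum(dist[tuple(sorted((i, j)))] for i in members_a for j in members_b)
        return total / (len(members_a) * len(members_b))

    while len(clusters) > 1:
        # Find and merge the two closest clusters.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cluster_distance(clusters[pair[0]][1], clusters[pair[1]][1]),
        )
        merged = ((clusters[a][0], clusters[b][0]), clusters[a][1] + clusters[b][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters[0][0]


names = ["human", "mouse", "fly"]          # toy labels
aligned = ["MKTLLV", "MKTLIV", "MRSAIV"]   # toy aligned sequences
tree = average_linkage_tree(names, aligned)
print(tree)
```

The two most similar sequences are joined first, and the more distant one attaches to that pair, which mirrors how the closest sequences sit on neighboring branches of a guide tree.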

SDSC Biology Workbench

The Biology Workbench is a web-based tool for biologists. The Workbench allows biologists to search many popular protein and nucleic acid sequence databases. Database searching is integrated with access to a wide variety of analysis and modeling tools, all within a point and click interface that eliminates file format compatibility problems.

SDSC Biology Workbench was available at http://workbench.sdsc.edu/; however, its services have been suspended due to funding issues.

Other Tools

PROSEARCH: searches PROSITE for patterns in the protein
GOR4 and CHOFAS: secondary structure prediction
PELE: predicts secondary structure with multiple different algorithms
TMHMM: predicts location of transmembrane helices and intervening loop regions
PSI-BLAST: position specific iterated BLAST
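As an illustration of the kind of pattern search PROSEARCH performs, the sketch below converts a PROSITE-style pattern into a regular expression and scans a sequence with it. It covers only the basic syntax (x, [..], {..}, repeat counts, and the < and > anchors) and is not the PROSEARCH implementation. The pattern shown is the PROSITE N-glycosylation site pattern, and the sequence is a toy example.

```python
import re


def prosite_to_regex(pattern):
    """Translate basic PROSITE pattern syntax into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        at_start = element.startswith("<")  # '<' anchors to the N-terminus
        at_end = element.endswith(">")      # '>' anchors to the C-terminus
        element = element.strip("<>")
        repeat = ""
        if element.endswith(")"):
            # x(2) or x(2,4) becomes {2} or {2,4}
            element, _, count = element[:-1].partition("(")
            repeat = "{" + count + "}"
        if element == "x":
            token = "."                         # any residue
        elif element.startswith("{"):
            token = "[^" + element[1:-1] + "]"  # any residue except these
        else:
            token = element                     # single residue or [ABC]
        parts.append(("^" if at_start else "") + token + repeat + ("$" if at_end else ""))
    return "".join(parts)


def find_sites(pattern, sequence):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    regex = "(?=" + prosite_to_regex(pattern) + ")"
    return [match.start() for match in re.finditer(regex, sequence)]


n_glyc = "N-{P}-[ST]-{P}"   # PROSITE N-glycosylation site pattern
toy_sequence = "MNGTAANPSQ"  # toy sequence, not a real protein
sites = find_sites(n_glyc, toy_sequence)
print(prosite_to_regex(n_glyc), sites)
```

The second asparagine in the toy sequence is followed by proline, which the pattern excludes, so only the first site is reported.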