EMBOSS – an application suite for Bioinformatics
Shahid Manzoor
Adnan Niazi
SLU Global Bioinformatics Centre
E – European
M – Molecular
B – Biology
O – Open
S – Software
S – Suite
All Information
EMBOSS info at http://emboss.sourceforge.net/.
wEMBOSS info at http://wemboss.sourceforge.net/.
E-mail [email protected] to get a username and password for wEMBOSS at http://ebiokit.hgen.slu.se/.
What is EMBOSS
Open-source molecular biology analysis package.
Handles a variety of common file formats.
Provides libraries for easy development.
Free software, licensed under GPL and LGPL.
EMBOSS was developed by Peter Rice, Ian Longden and Alan Bleasby; the wEMBOSS web interface was developed by Martin Sarachu and Marc Colet.
Available at http://emboss.sourceforge.net
Features of EMBOSS
A comprehensive set of sequence analysis programs.
All sequence formats and many alignment and structural formats are handled.
It runs on practically every UNIX you can think of (and likely some that you can't), plus Windows and OS X.
Each application has the same style of interface, so master one and you've mastered them all.
Uses for EMBOSS
Sequence alignment.
Protein motif identification (including domain analysis).
Nucleotide sequence pattern analysis (for example to identify CpG islands or repeats).
Presentation tools for publications.
Programs in EMBOSS
Many small and large programs in the package (>140).
All programs share a common look and feel.
Easy to run from the command line.
Retrieval of sequence data from the web.
The One Argument
-help
The -help argument displays a short help text for any EMBOSS program.
The One Command
wossname
wossname searches the short descriptions of the other programs for keywords.
Large collection of gene and protein analysis tools
Sequence retrieval
Alignments
Primer design
Restriction Mapping
Protein domain searching
Translation
[Workflow diagram: DNA sequence 1 and DNA sequence 2 are compared with a dotplot and translated into protein sequence 1 and protein sequence 2; the protein sequences then go through local/global alignment, multiple sequence alignment, motif and domain searching, and physico-chemical property analysis.]
Dotplots
>SEQ1.fasta
ATGGGTCGTGAAGAGAATGCTCCTCCTTTGGAATCTTAA
>SEQ2.fasta
ATGGCTCCTCCCTTAGAATCTTAG
For an exact match:
Unix% dottup SEQ1.fasta SEQ2.fasta -wordsize 10 &
For a similarity match:
Unix% dotmatcher SEQ1.fasta SEQ2.fasta -windowsize 10 -threshold 17 &
Dotplots …
Identity Matrix
     A   T   G   C
A    5  -4  -4  -4
T   -4   5  -4  -4
G   -4  -4   5  -4
C   -4  -4  -4   5
Dotplots …
Window size is the number of bases in a sliding window that is moved along each sequence and compared to generate a single data point on the plot. Window size must be an odd number.
Mismatch limit determines how similar the two sequences in a window must be to "match". For example, if the window size is 9 and the mismatch limit is 2, then up to 2 mismatches in a 9-base window will still be classified as a match.
Scoring a window with the identity matrix (match = 5, mismatch = -4):
CCTCCTTTGG
CCTCCTTTGG
5 5 5 5 5 5 5 5 5 5    Score = 50
CCTCCTTTGG
CCTCCCTTAG
5 5 5 5 5 -4 5 5 -4 5    Score = 32
Both mismatches are silent: CCT and CCC both encode Pro, and TTG and TTA both encode Leu.
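The window scoring described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the dotmatcher implementation: the function names are hypothetical, and the rule "plot a dot when the summed window score reaches the threshold" is an assumption based on the slide.

```python
# Identity matrix from the slide: +5 for a match, -4 for a mismatch.
IDENTITY = {(a, b): (5 if a == b else -4) for a in "ATGC" for b in "ATGC"}


def window_score(window1, window2, matrix=IDENTITY):
    """Sum the per-base scores of two equal-length windows."""
    if len(window1) != len(window2):
        raise ValueError("windows must have the same length")
    return sum(matrix[a, b] for a, b in zip(window1, window2))


def dot_positions(seq1, seq2, window, threshold):
    """Return 0-based (i, j) window starts whose score is >= threshold."""
    return [
        (i, j)
        for i in range(len(seq1) - window + 1)
        for j in range(len(seq2) - window + 1)
        if window_score(seq1[i:i + window], seq2[j:j + window]) >= threshold
    ]


print(window_score("CCTCCTTTGG", "CCTCCTTTGG"))  # 50
print(window_score("CCTCCTTTGG", "CCTCCCTTAG"))  # 32
```

With a threshold of 17, the second window pair (score 32) still produces a dot, which is how a similarity plot keeps diagonals that an exact word match would break.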
Dotplots
A dot plot is a simple graphical representation of identical residues between two sequences.
The X axis represents the first sequence (PHO5).
The Y axis represents the second sequence (PHO3).
A dot is plotted for each match between two residues of the sequences.
Diagonal lines reveal regions of identity between the two sequences.
Dotplots …
The dot plot can be adapted to display only word matches, which correspond to a diagonal of dots in the letter-based dot plot.
Example: alignment of PHO5 and PHO3 coding sequences, with different word sizes.
Detecting repeats with a dot plot
Sequence repeats are easily detected in a dot plot when a sequence is compared to itself.
The main diagonal is completely marked (by definition, since the sequence is identical to itself).
Repeats appear as segments of lines parallel to the diagonal.
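A word-based self-comparison can be sketched as follows. This is a minimal illustration of the principle, not the dottup code; the function names and the toy sequence are hypothetical.

```python
from collections import defaultdict


def word_matches(seq1, seq2, wordsize):
    """Return 0-based (i, j) pairs where the same word starts in both sequences."""
    index = defaultdict(list)
    for j in range(len(seq2) - wordsize + 1):
        index[seq2[j:j + wordsize]].append(j)
    return [
        (i, j)
        for i in range(len(seq1) - wordsize + 1)
        for j in index.get(seq1[i:i + wordsize], [])
    ]


def repeats(seq, wordsize):
    """Self-comparison: keep only matches above the main diagonal (i < j)."""
    return [(i, j) for i, j in word_matches(seq, seq, wordsize) if i < j]


toy = "AAGCTTGGGAAGCTT"        # hypothetical sequence with AAGCTT repeated
print(repeats(toy, 6))         # [(0, 9)]
```

Every position matches itself, which gives the main diagonal; any remaining pair (i, j) with i different from j marks a repeated word, and runs of such pairs form the lines parallel to the diagonal.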
>SEQ1.fasta
ATGGGTCGTGAAGAGAATGCTCCTCCTTTGGAATCTTAA
>SEQ2.fasta
ATGGCTCCTCCCTTAGAATCTTAG
Unix% plotorf SEQ1.fasta -stop TAA,TAG -out GA.plot &
Unix% getorf SEQ1.fasta -minsize 5 -table 0 -find 1 -out GA.getorf &
Plotorf
5' ATGGGTCGTGAAGAGAATGCTCCTCCTTTGGAATCTTAA 3'
3' TACCCAGCACTTCTCTTACGAGGAGGAAACCTTAGAATT 5'
Forward strand: Frame 1, Frame 2, Frame 3
Reverse strand: Frame -1, Frame -2, Frame -3
Start and stop codons are located according to the instructions given to the program, and the area in between start and stop codons is plotted as a potential open reading frame.
Indication of full coding sequence?
Alternative splice form?
Using getorf:
>_1 [17 - 37]
MLLLWNL
>_2 [1 - 36]
MGREENAPPLES*
Annotations: start methionine (M), stop codon (*).
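The ORF search can be sketched in Python as below. This is an illustration under stated assumptions, not the getorf implementation: it scans only the three forward frames, uses the standard genetic code, expects upper-case A/C/G/T input, and reports an ORF that is still open at the end of the sequence (as the [17 - 37] record suggests). The function name is hypothetical.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def find_orfs(seq, min_nt=5):
    """Return (start, end, protein) tuples with 1-based inclusive coordinates.

    An ORF runs from an ATG to the base before the next in-frame stop codon,
    or to the last complete codon if no stop is found.
    """
    orfs = []
    for frame in range(3):
        start, protein, end = None, [], frame
        for pos in range(frame, len(seq) - 2, 3):
            aa = CODON_TABLE[seq[pos:pos + 3]]
            if start is None:
                if aa == "M":              # open a new ORF at a start codon
                    start, protein = pos, ["M"]
            elif aa == "*":                # close the ORF at a stop codon
                orfs.append((start + 1, pos, "".join(protein)))
                start = None
            else:
                protein.append(aa)
            end = pos + 3
        if start is not None:              # ORF still open at sequence end
            orfs.append((start + 1, end, "".join(protein)))
    return [orf for orf in orfs if orf[1] - orf[0] + 1 >= min_nt]


for orf in find_orfs("ATGGGTCGTGAAGAGAATGCTCCTCCTTTGGAATCTTAA"):
    print(orf)
```

For SEQ1 this yields the two ORFs shown above, at positions 1 to 36 and 17 to 37.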
Unix% transeq SEQ1.fasta -frame 1 -table 0 -sbegin 4 -send 36 -out GA.fasta &
>GA.fasta
GREENAPPLES
Alignments
>GA.fasta
GREENAPPLES
>A.fasta
APPLES
For a global alignment:
Unix% needle GA.fasta A.fasta -gapopen 10 -gapextend 0.5 -datafile EPAM250 &
For a local alignment:
Unix% water GA.fasta A.fasta -gapopen 10 -gapextend 0.5 -datafile EPAM250 &
Alignments …
To align two or more sequences in a biologically significant way.
GREENAPPLES
APPLES
Gap penalty = 10; Extension penalty = 0.5
Local (water):
APPLES
APPLES
Global (needle):
GREENAPPLES
-----APPLES
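The difference between local and global alignment can be sketched with a small dynamic-programming routine. This is a deliberately simplified illustration, not what needle or water do internally: it assumes identity scoring (+5 / -4) instead of EPAM250 and a single linear gap penalty of 10 instead of separate gap-open and gap-extension penalties. The function name and parameters are hypothetical.

```python
def align(a, b, match=5, mismatch=-4, gap=-10, local=False):
    """Return (score, aligned_a, aligned_b) for a global or local alignment."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    if not local:                      # global: leading gaps are penalised
        for i in range(1, n + 1):
            H[i][0] = i * gap
        for j in range(1, m + 1):
            H[0][j] = j * gap
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score = max(H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if local:                  # local: never drop below zero
                score = max(score, 0)
            H[i][j] = score
            if score > best:
                best, best_pos = score, (i, j)
    # Trace back from the best cell (local) or the bottom-right corner (global).
    i, j = best_pos if local else (n, m)
    final = H[i][j]
    out_a, out_b = [], []
    while (i > 0 or j > 0) and not (local and H[i][j] == 0):
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return final, "".join(reversed(out_a)), "".join(reversed(out_b))


print(align("GREENAPPLES", "APPLES", local=True))
print(align("GREENAPPLES", "APPLES", local=False))
```

The local call returns only the shared APPLES segment, while the global call has to pay for five gap positions to span the whole of GREENAPPLES.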
GREENAPPLES
APPLES
It looks like the “apples” motif may be part of a larger domain.
Next steps: physico-chemical properties and pattern searching.
Physico-chemical properties
Isoelectric point:
Unix% iep GA.fasta -plot -step 0.5 -out GA.IEP &
General properties:
Unix% pepinfo GA.fasta -hwindow 8 -generalplot -hydropathyplot &
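A hydropathy plot of the kind requested with -hydropathyplot is, in essence, a sliding-window average of per-residue hydropathy values. The sketch below assumes the Kyte-Doolittle scale and a plain unweighted mean; it is not the pepinfo code, and the function name is hypothetical.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=8):
    """Return the mean hydropathy of every window along the protein."""
    if window < 1 or window > len(protein):
        raise ValueError("window must be between 1 and the protein length")
    scores = [KYTE_DOOLITTLE[aa] for aa in protein]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


for start, value in enumerate(hydropathy_profile("GREENAPPLES", window=8), 1):
    print(start, round(value, 2))
```

With an 11-residue peptide and a window of 8 there are only four points to plot, so the profile is more informative on full-length proteins.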
Physico-chemical properties
[Figure: Venn diagram grouping the 20 amino acids into overlapping property sets: Aliphatic, Aromatic, Hydrophobic, Tiny, Small, Charged, Positive, Polar.]
The pepinfo graph of properties is based on this diagram.
Physico-chemical properties
Non-polar region with small residues.
Polar region to one side of the non-charged region.
Pattern searching
>GA.fasta
GREENAPPLES
>GL.fasta
GREENLEAVES
>RA.fasta
REDAPPLES
>RL.fasta
REDLEAVES
Aligned:
GREENAPPL---ES
-RE-DAPPL---ES
GREEN---LEAVES
-RE-D---LEAVES
[G](0,1)-R-[E](1,2)-[ND]-x(0,3)-L-x(0,3)-E-S
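A PROSITE-style pattern of this kind can be checked against the four sequences by translating it into a regular expression. The sketch below is a hypothetical helper, not part of EMBOSS, and handles only the elements used here (single residues, bracketed sets, x, and repeat counts). It assumes the gapped columns are written as variable-length x(0,3), as the alignment suggests.

```python
import re

ELEMENT = re.compile(r"(\[[A-Z]+\]|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.replace(" ", "").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError("unsupported pattern element: " + element)
        residue, low, high = m.groups()
        if residue.lower() == "x":     # x means any residue
            residue = "."
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{%s}" % low
        else:
            repeat = "{%s,%s}" % (low, high)
        parts.append(residue + repeat)
    return "".join(parts)


regex = prosite_to_regex("[G](0,1)-R-[E](1,2)-[ND]-x(0,3)-L-x(0,3)-E-S")
for name in ("GREENAPPLES", "GREENLEAVES", "REDAPPLES", "REDLEAVES"):
    print(name, bool(re.fullmatch(regex, name)))
```

All four sequences match this form. A fixed-length x(3) at both positions would match none of them, since each gap in the alignment is either 0 or 3 residues long.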
Pattern searching
pattern.fruit:
[G](0,1)-[R]-[E](1,2)-[ND]-x(0,3)-[L]-x(0,3)-[E]-[S]
Search a protein database:
Unix% fuzzpro sptr:* pattern.fruit -mismatch 0 -out GA.fuzzpro &
Nothing resembling this pattern is found in the database.
- But we could try scanning PRINTS (pscan) and PROSITE (patmatmotifs) with one of our sequences.
Some Programs
Some Programs …
More Information