Page 1: EMBOSS – an application suite for Bioinformatics

Shahid Manzoor

Adnan Niazi

SLU Global Bioinformatics Centre

Page 2

E – European

M – Molecular

B – Biology

O – Open

S – Software

S – Suite

Page 3: All Information

EMBOSS info at http://emboss.sourceforge.net/.

wEMBOSS info at http://wemboss.sourceforge.net/.

E-mail [email protected] to get a username and password for wEMBOSS at http://ebiokit.hgen.slu.se/.

Page 4: What is EMBOSS

Open-source molecular biology analysis package.

Handles a variety of common file formats.

Provides libraries for easy software development.

Software licensed under the GPL and LGPL.

Developed by Martin Sarachu and Marc Colet

Available at http://emboss.sourceforge.net

Page 5: Features of EMBOSS

A comprehensive set of sequence analysis programs.

All sequence formats and many alignment and structural formats are handled.

It runs on practically every UNIX you can think of (and likely some that you can't), plus Windows and OS X.

Each application has the same style of interface, so master one and you've mastered them all.

Page 6: Uses for EMBOSS

Sequence alignment.

Protein motif identification (including domain analysis).

Nucleotide sequence pattern analysis (for example, to identify CpG islands or repeats).

Presentation tools for publications.

Page 7: Programs in EMBOSS

Many small and large programs in the package (>140).

All programs share a common look and feel.

Easy to run from the command line.

Retrieval of sequence data from the web.

Page 8: The One Argument

-help

The -help argument displays a short help text for any EMBOSS program.

Page 9: The One Command

wossname

wossname searches the short descriptions of the other programs for keywords.

Page 10

Large collection of gene and protein analysis tools

Sequence retrieval

Alignments

Primer design

Restriction Mapping

Protein domain searching

Translation

Page 11

[Workflow diagram: DNA sequence 1 and DNA sequence 2 are compared with a dotplot and translated into protein sequence 1 and protein sequence 2. The protein sequences are then used for local/global alignment, multiple sequence alignment, motif and domain searching, and physico-chemical properties.]

Page 12: Dotplots

>SEQ1.fasta
ATGGGTCGTGAAGAGAATGCTCCTCCTTTGGAATCTTAA

>SEQ2.fasta
ATGGCTCCTCCCTTAGAATCTTAG

For an exact match:

Unix% dottup SEQ1.fasta SEQ2.fasta -wordsize 10 &

For a similarity match:

Unix% dotmatcher SEQ1.fasta SEQ2.fasta -windowsize 10 -threshold 17 &

Page 13: Dotplots …

Identity Matrix

     A   T   G   C
A    5  -4  -4  -4
T   -4   5  -4  -4
G   -4  -4   5  -4
C   -4  -4  -4   5

Window Size is the number of bases in a sliding window that is moved along each sequence and compared to generate a single data point on the plot. Window size must be an odd number.

Mismatch Limit determines how similar the two sequences in a window must be to "match". For example, if the window size is 9 and the mismatch limit is 2, then up to 2 mismatches in a 9-base window will still be classified as a match.
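The mismatch-limit rule can be illustrated with a few lines of Python. This is only a minimal sketch of the idea described above, not the code of any EMBOSS program; the function name `window_matches` and its parameters are hypothetical.

```python
def window_matches(seq_a, seq_b, start_a, start_b, window, mismatch_limit):
    """Return True if two windows differ at no more than mismatch_limit positions.

    start_a and start_b are 0-based window start positions (an assumption
    made for this sketch).
    """
    win_a = seq_a[start_a:start_a + window]
    win_b = seq_b[start_b:start_b + window]
    if len(win_a) < window or len(win_b) < window:
        raise ValueError("window runs past the end of a sequence")
    # Count the positions where the two windows disagree.
    mismatches = sum(1 for x, y in zip(win_a, win_b) if x != y)
    return mismatches <= mismatch_limit


# Two 9-base windows that differ at two positions.
a = "CCTCCTTTG"
b = "CCTCCCTTA"
print(window_matches(a, b, 0, 0, window=9, mismatch_limit=2))
print(window_matches(a, b, 0, 0, window=9, mismatch_limit=1))
```

With a limit of 2 the pair counts as a match and a dot would be drawn; with a limit of 1 it would not.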

Page 14: Dotplots …

     A   T   G   C
A    5  -4  -4  -4
T   -4   5  -4  -4
G   -4  -4   5  -4
C   -4  -4  -4   5

Identical windows:

CCTCCTTTGG
CCTCCTTTGG
5 5 5 5 5 5 5 5 5 5   Score = 50

Windows with two mismatches:

CCTCCTTTGG
CCTCCCTTAG
5 5 5 5 5 -4 5 5 -4 5   Score = 32

Both mismatches are synonymous at the protein level: CCT and CCC both encode Pro, and TTG and TTA both encode Leu.
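The two window scores above (50 and 32) can be reproduced with a short Python sketch that applies the identity matrix (5 for a match, -4 for a mismatch). The function names are hypothetical, and treating "score >= threshold" as the rule for drawing a dot is an assumption made for illustration, not a statement about how dotmatcher is implemented.

```python
MATCH, MISMATCH = 5, -4  # values taken from the identity matrix


def window_score(win_a, win_b):
    """Sum the identity-matrix scores over two equally long windows."""
    if len(win_a) != len(win_b):
        raise ValueError("windows must have the same length")
    return sum(MATCH if x == y else MISMATCH for x, y in zip(win_a, win_b))


def scored_dots(seq_a, seq_b, window, threshold):
    """Return 0-based (i, j) window starts whose score reaches the threshold.

    Assumption: a dot is drawn when score >= threshold.
    """
    dots = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            if window_score(seq_a[i:i + window], seq_b[j:j + window]) >= threshold:
                dots.append((i, j))
    return dots


print(window_score("CCTCCTTTGG", "CCTCCTTTGG"))  # 50
print(window_score("CCTCCTTTGG", "CCTCCCTTAG"))  # 32
```

With a window of 10 and a threshold of 17, both windows would pass, since 50 and 32 are both above 17.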

Page 15: Dotplots

A dot plot is a simple graphical representation of identical residues between two sequences.

The X axis represents the first sequence (PHO5).

The Y axis represents the second sequence (PHO3).

A dot is plotted for each match between two residues of the sequences.

Diagonal lines reveal regions of identity between the two sequences.

Page 16: Dotplots …

The dot plot can be adapted to display only word matches, which correspond to a diagonal of dots in the letter-based dot plot.

Example: alignment of PHO5 and PHO3 coding sequences, with different word sizes.
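A word-based dot plot can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, the short test sequences are made up, and the PHO5 and PHO3 sequences are not reproduced here. A word size of 1 gives the letter-based plot; larger word sizes keep only the diagonal runs.

```python
def word_matches(seq_a, seq_b, word_size):
    """Return 0-based (i, j) pairs where the word at seq_a[i:] equals the word at seq_b[j:]."""
    # Index every word of the second sequence by its start positions.
    index = {}
    for j in range(len(seq_b) - word_size + 1):
        index.setdefault(seq_b[j:j + word_size], []).append(j)
    hits = []
    for i in range(len(seq_a) - word_size + 1):
        for j in index.get(seq_a[i:i + word_size], []):
            hits.append((i, j))
    return hits


def render(seq_a, seq_b, hits, word_size):
    """Draw a text dot plot: columns follow seq_a (X axis), rows follow seq_b (Y axis)."""
    grid = [[" "] * len(seq_a) for _ in seq_b]
    for i, j in hits:
        for k in range(word_size):
            grid[j + k][i + k] = "*"
    return "\n".join("".join(row) for row in grid)


hits = word_matches("GATTACA", "ATTAC", 3)
print(hits)
print(render("GATTACA", "ATTAC", hits, 3))
```

Increasing `word_size` removes isolated dots and leaves only the longer diagonals, which is the effect described for the different word sizes.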

Page 17: Detecting repeats with a dot plot

Sequence repeats are easily detected in a dot plot when a sequence is compared to itself.

The main diagonal is completely marked (by definition, since the sequence is identical to itself).

Repeats appear as segments of lines parallel to the diagonal.
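The self-comparison idea can be sketched in Python: index every word of the sequence and report pairs of positions that share a word but are not on the main diagonal. The function name and the example sequence are hypothetical and chosen only for illustration.

```python
def repeat_hits(seq, word_size):
    """Return sorted 0-based (i, j) pairs with i < j where the same word occurs twice.

    The main diagonal (i == j) is skipped because every word trivially
    matches itself there.
    """
    positions = {}
    for i in range(len(seq) - word_size + 1):
        positions.setdefault(seq[i:i + word_size], []).append(i)
    hits = []
    for starts in positions.values():
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                hits.append((starts[a], starts[b]))
    return sorted(hits)


# "ACGT" occurs at positions 0 and 6, so one off-diagonal hit is expected.
for i, j in repeat_hits("ACGTTTACGT", 4):
    print(f"word at {i} repeats at {j} (offset {j - i})")
```

Hits that share the same offset `j - i` lie on one line parallel to the main diagonal, which is how a repeat shows up in the plot.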

Page 18: Plotorf

>SEQ1.fasta
ATGGGTCGTGAAGAGAATGCTCCTCCTTTGGAATCTTAA

>SEQ2.fasta
ATGGCTCCTCCCTTAGAATCTTAG

Unix% plotorf SEQ1.fasta -stop TAA,TAG -out GA.plot &

Unix% getorf SEQ1.fasta -minsize 5 -table 0 -find 1 -out GA.getorf &

Page 19

ATGGGTCGTGAAGAGAATGCTCCTCCTTTGGAATCTTAA
TACCCAGCACTTCTCTTACGAGGAGGAAACCTTAGAATT

Forward frames: Frame 1, Frame 2, Frame 3

Reverse frames: Frame -1, Frame -2, Frame -3

Start and stop codons are located according to the instructions given to the program, and the area in between start and stop codons is plotted as a potential open reading frame.
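The second strand shown above is the base-by-base complement of SEQ1. A minimal Python sketch of how the complement and the six reading frames can be derived follows. The frame numbering used here (frame -1 starts at the first base of the reverse complement) is one common convention and is an assumption; individual EMBOSS programs may number negative frames differently.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

SEQ1 = "ATGGGTCGTGAAGAGAATGCTCCTCCTTTGGAATCTTAA"


def complement(seq):
    """Complement each base without reversing the sequence."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq):
    """Return the opposite strand read 5' to 3'."""
    return complement(seq)[::-1]


def six_frames(seq):
    """Map frame number (1, 2, 3, -1, -2, -3) to the nucleotides read in that frame."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = seq[offset:]
        frames[-(offset + 1)] = rc[offset:]  # assumed numbering, see lead-in
    return frames


print(complement(SEQ1))
for frame, nucleotides in sorted(six_frames(SEQ1).items()):
    print(frame, nucleotides)
```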

Page 20

Indication of full coding sequence?

Alternative splice form?

Page 21

Using getorf:

>_1 [17 - 37]
MLLLWNL

>_2 [1 - 36]
MGREENAPPLES*

Annotations: the leading M is the start methionine; * marks the stop codon.
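The two ORFs reported above can be reproduced with a small Python sketch that scans the three forward frames of SEQ1 for ATG and translates up to the next stop codon or the end of the sequence. This is an illustration under stated assumptions, not getorf's actual algorithm: it uses the standard genetic code, ignores the reverse strand and minimum-size filtering, omits the stop symbol from the protein, and expects only A, C, G and T. The name `find_orfs` is hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

SEQ1 = "ATGGGTCGTGAAGAGAATGCTCCTCCTTTGGAATCTTAA"


def find_orfs(seq):
    """Return (start, end, protein) for forward-strand ORFs, 1-based inclusive positions."""
    orfs = []
    for frame in range(3):
        pos = frame
        while pos + 3 <= len(seq):
            if seq[pos:pos + 3] != "ATG":
                pos += 3
                continue
            protein = []
            end = pos
            # Translate codon by codon until a stop codon or the sequence end.
            while end + 3 <= len(seq):
                aa = CODON_TABLE[seq[end:end + 3]]
                if aa == "*":
                    break
                protein.append(aa)
                end += 3
            orfs.append((pos + 1, end, "".join(protein)))
            pos = end + 3  # continue after the stop codon
    return orfs


for start, end, protein in find_orfs(SEQ1):
    print(f"[{start} - {end}] {protein}")
```

The positions match the getorf headers on this page: 1 to 36 for MGREENAPPLES and 17 to 37 for MLLLWNL, the latter running to the end of the sequence without a stop codon.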

Page 22

Unix% transeq SEQ1.fasta -frame 1 -table 0 -sbegin 4 -send 36 -out GA.fasta &

>GA.fasta
GREENAPPLES
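The region arithmetic can be checked with a short Python sketch. It translates a 1-based, inclusive nucleotide range of SEQ1 with the standard genetic code; the function name `translate_region` is hypothetical and this is not transeq's implementation. Nucleotides 4 to 36 cover the eleven codons of GREENAPPLES, while stopping at nucleotide 33 would give only the ten residues GREENAPPLE.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

SEQ1 = "ATGGGTCGTGAAGAGAATGCTCCTCCTTTGGAATCTTAA"


def translate_region(seq, begin, end):
    """Translate seq[begin..end] (1-based, inclusive), reading codons from 'begin'."""
    region = seq[begin - 1:end]
    usable = len(region) - len(region) % 3  # drop an incomplete trailing codon
    return "".join(CODON_TABLE[region[i:i + 3]] for i in range(0, usable, 3))


print(translate_region(SEQ1, 4, 36))
print(translate_region(SEQ1, 4, 33))
print(translate_region(SEQ1, 1, 39))
```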

Page 23: Alignments

>GA.fasta
GREENAPPLES

>A.fasta
APPLES

For a global alignment:

Unix% needle GA.fasta A.fasta -gapopen 10 -gapextend 0.5 -datafile EPAM250 &

For a local alignment:

Unix% water GA.fasta A.fasta -gapopen 10 -gapextend 0.5 -datafile EPAM250 &

Page 24: Alignments …

To align two or more sequences in a biologically significant way.

Local (water):

APPLES
APPLES

Global (needle):

GREENAPPLES
-----APPLES

Gap penalty = 10; Extension penalty = 0.5
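The difference between local and global alignment can be sketched with a simplified dynamic-programming example in Python. It is deliberately not what needle and water compute: the sketch assumes identity scoring (5 for a match, -4 for a mismatch) instead of a PAM250 matrix, and a single linear gap cost of 10 per gap position instead of separate open and extension penalties. The function name `align` is hypothetical.

```python
def align(a, b, local, match=5, mismatch=-4, gap=-10):
    """Return (score, aligned_a, aligned_b) for a global or local alignment.

    Simplified scoring (assumption): identity scores and a linear gap cost.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    if not local:
        # Global alignment charges for leading gaps.
        for i in range(1, n + 1):
            score[i][0] = i * gap
        for j in range(1, m + 1):
            score[0][j] = j * gap
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            value = max(score[i - 1][j - 1] + s,
                        score[i - 1][j] + gap,
                        score[i][j - 1] + gap)
            if local:
                value = max(value, 0)  # a local alignment may restart anywhere
            score[i][j] = value
            if local and value > best:
                best, best_pos = value, (i, j)
    # Trace back from the best cell (local) or the bottom-right corner (global).
    i, j = best_pos if local else (n, m)
    final = score[i][j]
    top, bottom = [], []
    while (i > 0 or j > 0) and not (local and score[i][j] == 0):
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return final, "".join(reversed(top)), "".join(reversed(bottom))


for mode, is_local in (("local", True), ("global", False)):
    result_score, row_a, row_b = align("GREENAPPLES", "APPLES", is_local)
    print(mode, result_score)
    print(row_a)
    print(row_b)
```

The local alignment keeps only the shared APPLES segment, while the global alignment spans both sequences end to end and pays for the five unmatched leading residues.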

Page 25

GREENAPPLES
APPLES

Looks like the "apples" motif may be part of a larger domain.

Next: physicochemical properties and pattern searching.

Page 26: Physico-chemical properties

Isoelectric point:

Unix% iep GA.fasta -plot -step 0.5 -out GA.IEP &

General properties:

Unix% pepinfo GA.fasta -hwindow 8 -generalplot -hydropathyplot &
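A sliding-window hydropathy profile of the kind requested with a window of 8 can be sketched in Python. The sketch assumes the Kyte-Doolittle hydropathy scale and a plain average over each window; which scales and smoothing pepinfo actually applies is not stated in these slides, so treat this as an illustration only. The function name is hypothetical.

```python
# Kyte-Doolittle hydropathy values (assumed scale for this sketch).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window):
    """Return the mean hydropathy of every full window along the protein."""
    if window > len(protein):
        return []
    return [
        sum(KYTE_DOOLITTLE[aa] for aa in protein[i:i + window]) / window
        for i in range(len(protein) - window + 1)
    ]


for start, value in enumerate(hydropathy_profile("GREENAPPLES", 8), start=1):
    print(f"window starting at residue {start}: {value:.3f}")
```

Negative averages indicate a predominantly hydrophilic window, positive averages a hydrophobic one.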

Page 27: Physico-chemical properties

[Venn diagram of amino acid properties. Property sets: Aliphatic, Aromatic, Hydrophobic, Tiny, Small, Charged, Positive, Polar. Residues shown: D, Y, F, W, H, K, R, E, Q, N, M, A, G, C, S, P, I, V, L, T.]

The pepinfo graph of properties is based on this diagram.

Page 28: Physico-chemical properties

non-polar region with small residues

polar region to one side of non-charged region

Page 29: Pattern searching

>GA.fasta
GREENAPPLES

>GL.fasta
GREENLEAVES

>RA.fasta
REDAPPLES

>RL.fasta
REDLEAVES

Alignment:

GREENAPPL---ES
-RE-DAPPL---ES
GREEN---LEAVES
-RE-D---LEAVES

Pattern:

[G](0,1)-R-[E](1,2)-[ND]-x(3)-L-x(3)-E-S
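A PROSITE-style pattern of this kind can be tried out by converting it to a regular expression. The Python sketch below handles only the small subset of the syntax used on this page (single residues, bracketed sets, x as a wildcard, and repeat counts in parentheses); the function name is hypothetical. One assumption deserves emphasis: with x(3) exactly as written, none of the four sequences matches, because each has a gap in one of the two variable blocks, so the sketch also tests a variant that writes those blocks as x(0,3).

```python
import re

ELEMENT = re.compile(r"(\[[A-Z]+\]|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element.strip())
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = m.groups()
        if residue in ("x", "X"):
            residue = "."  # x stands for any residue
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{%s}" % low
        else:
            repeat = "{%s,%s}" % (low, high)
        parts.append(residue + repeat)
    return "".join(parts)


SEQUENCES = ["GREENAPPLES", "GREENLEAVES", "REDAPPLES", "REDLEAVES"]
AS_WRITTEN = "[G](0,1)-R-[E](1,2)-[ND]-x(3)-L-x(3)-E-S"
RELAXED = "[G](0,1)-R-[E](1,2)-[ND]-x(0,3)-L-x(0,3)-E-S"  # assumption, see lead-in

for label, pattern in (("as written", AS_WRITTEN), ("relaxed", RELAXED)):
    regex = prosite_to_regex(pattern)
    matched = [s for s in SEQUENCES if re.fullmatch(regex, s)]
    print(label, regex, matched)
```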

Page 30: Pattern searching

pattern.fruit:

[G](0,1)-[R]-[E](1,2)-[ND]-x(3)-[L]-x(3)-[E]-[S]

Search a protein database:

Unix% fuzzpro sptr:* pattern.fruit -mismatch 0 -out GA.fuzzpro &

Nothing resembling this pattern is found in the database.

But we could try scanning PRINTS (pscan) and PROSITE (patmatmotifs) with one of our sequences.

Page 31: Some Programs

Page 32: Some Programs …

Page 33: More Information