Prosite UCSC Genome Browser MSAs and Phylogeny Exercise 2.

Post on 20-Dec-2015



Prosite, UCSC Genome Browser, MSAs and Phylogeny

Exercise 2

Turning information into knowledge

The outcome of a sequencing project is masses of raw data

The challenge is to turn this raw data into biological knowledge

A valuable tool for this challenge is an automated diagnostic pipe through which newly determined sequences can be streamlined

From sequence to function

Nature tends to innovate rather than invent

Proteins are composed of functional elements: domains and motifs

Domains are structural units that carry out a certain function

The same domains are shared between different proteins

Motifs are shorter sequences with certain biological activity

http://www.ebi.ac.uk/interpro/

InterPro

An integrated documentation resource for protein families, domains and sites

Groups signatures describing the same protein family or domain

Combines a number of databases that use different methodologies to derive protein signatures:

UniProt: UniProtKB Swiss-Prot, TrEMBL, UniRef, UniParc

prosite: documented DB on domains, families and functional sites.

Pfam: a DB of protein families represented by MSAs

InterPro search

http://www.expasy.ch/prosite/

prosite

A method for determining the function of uncharacterized translated protein sequences

Consists of a DB of annotated biologically important sites/patterns/motifs/signatures/fingerprints

prosite

Entries are represented with patterns or profiles

profile

     1     2     3     4     5
A    0.66  1     0     0     .
T    0     0     0     1     .
C    0.33  0     0.66  0     .
G    0     0     0.33  0     .

pattern

[AC]-A-[GC]-T-[TC]-[GC]

Profiles are used in prosite when the motif is relatively divergent, and it is difficult to represent as a pattern
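A minimal sketch of how a frequency profile like the one above can be derived from an alignment. The three aligned sequences are hypothetical, chosen only so that their column frequencies agree with the first four columns of the matrix; real prosite profiles use position-specific scores rather than raw frequencies, so this is an illustration of the idea and not prosite's method.

```python
from collections import Counter

def column_frequencies(aligned, alphabet="ATCG"):
    """Return {residue: [frequency in each column]} for equal-length aligned sequences."""
    n = len(aligned)
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all sequences must have the same length")
    profile = {res: [] for res in alphabet}
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned)
        for res in alphabet:
            profile[res].append(counts[res] / n)
    return profile

# Hypothetical alignment (an assumption, not taken from the slides).
example = ["AACT", "AAGT", "CACT"]
profile = column_frequencies(example)
for res, freqs in profile.items():
    print(res, " ".join(f"{f:.2f}" for f in freqs))
```

Each column sums to 1, which is a quick sanity check when reading such a matrix.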

Scanning prosite

Query: sequence

Result: all patterns found in sequence

Query: pattern

Result: all sequences which adhere to this pattern
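The sequence-versus-pattern scan can be sketched with regular expressions. The translation below covers only the common elements of prosite pattern syntax (`[..]` allowed residues, `{..}` forbidden residues, `x` for any residue, `(n)` or `(n,m)` repeats, `<` and `>` anchors); the function names and the test sequence are assumptions made for illustration, and edge cases of the full syntax are ignored.

```python
import re

def prosite_to_regex(pattern):
    """Translate a prosite-style pattern into a Python regular expression (simplified)."""
    pattern = pattern.rstrip(".")
    parts = []
    for element in pattern.split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P} = any residue except P
        element = element.replace("(", "{").replace(")", "}")   # x(2,4) = 2 to 4 repeats
        element = element.replace("x", ".")                      # x = any residue
        element = element.replace("<", "^").replace(">", "$")    # sequence start / end
        parts.append(element)
    return "".join(parts)

def scan(sequence, pattern):
    """Return (1-based start, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]

# Hypothetical query sequence scanned with the pattern shown earlier.
print(scan("TTAAGTCGTT", "[AC]-A-[GC]-T-[TC]-[GC]"))  # [(3, 'AAGTCG')]
```

Querying by pattern instead of by sequence is the same operation run over every sequence in a database, keeping those for which `scan` returns at least one hit.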

Patterns with a high probability of occurrence

Entries describing commonly found post-translational modifications or compositionally biased regions.

Found in the majority of known protein sequences

High probability of occurrence

prosite sequence query

prosite pattern query

UCSC Genome Browser

Reset all settings of previous user

UCSC Genome Browser - Gateway


UCSC Genome Browser query results

UCSC Genome Browser Annotation tracks

Vertebrate conservation

mRNA (GenBank)

RefSeq

UCSC Genes

Base position

Single species compared

SNPs

Repeats

Gene direction

Exon

Intron

UTR

UCSC Gene

UCSC Genome Browser - movement

Zoom x3 + Center

UCSC Genome Browser – Base view

Annotation track options

dense

squish

full

pack

Annotation track options

Another option to toggle between ‘pack’ and ‘dense’ view is to click on the track title

Sickle-cell anemia distr.

Malaria distr.

BLAT

BLAT = BLAST-Like Alignment Tool

BLAT is designed to find similarity of >95% on DNA, >80% for protein

Rapid search by indexing entire genome.

Good for:

1. Finding genomic coordinates of cDNA

2. Determining exons/introns

3. Finding human (or chimp, dog, cow…) homologs of another vertebrate sequence
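To make "rapid search by indexing entire genome" concrete, here is a toy sketch of index-based seeding: the genome is cut into non-overlapping k-mers that are stored in a lookup table, and every overlapping k-mer of the query is looked up in that table. It is loosely in the spirit of BLAT and not its actual implementation; the sequences, the value of k and the function names are assumptions for illustration.

```python
from collections import defaultdict

def build_index(genome, k):
    """Map each non-overlapping k-mer of the genome to its start positions."""
    index = defaultdict(list)
    for start in range(0, len(genome) - k + 1, k):
        index[genome[start:start + k]].append(start)
    return index

def seed_hits(query, index, k):
    """Look up every overlapping k-mer of the query; return (query_pos, genome_pos) pairs."""
    hits = []
    for q in range(len(query) - k + 1):
        for g in index.get(query[q:q + k], []):
            hits.append((q, g))
    return hits

# Hypothetical toy data; the query is a substring of the genome.
genome = "ACGTACGGTTCAGGCTTA"
query = "GGTTCAGG"
index = build_index(genome, 3)
print(seed_hits(query, index, 3))  # [(0, 6), (3, 9)]
```

Both hits lie on the same diagonal (genome position minus query position equals 6), which is the kind of evidence a real aligner would extend into a full alignment.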

BLAT on UCSC Genome Browser

BLAT Results

Match

Non-match (mismatch/indel)

Indel boundaries

BLAT Results on the browser

Getting DNA sequence of region

Clustal X – A Multiple Alignment Tool

Input: multiple sequence FASTA file

>gi|21536452|ref|NP_002762.2| mesotrypsin preproprotein [Homo sapiens]
MNPFLILAFVGAAVAVPFDDDDKIVGGYTCEENSLPYQVSLNSGSHFCGGSLISEQWVVSAAHCYKTRIQ
VRLGEHNIKVLEGNEQFINAAKIIRHPKYNRDTLDNDIMLIKLSSPAVINARVSTISLPTAPPAAGTECL
ISGWGNTLSFGADYPDELKCLDAPVLTQAECKASYPGKITNSMFCVGFLEGGKDSCQRDSGGPVVCNGQL
QGVVSWGHGCAWKNRPGVYTKVYNYVDWIKDTIAANS

>gi|114051746|ref|NP_001040585.1| protease, serine, 2 [Macaca mulatta]
MNPLLILAFVGVAVAAPFDDDDKIVGGYTCEENSVPYQVSLNSGYHFCGGSLINEQWVVSAAHCYKTRIQ
VRLGEHNIEVLEGTEQFINAAKIIRHPDYDRKTLNNDILLIKLSSPAVINARVSTISLPTAPPAAGAEAL
ISGWGNTLSSGADYPDELQCLEAPVLSQAECEASYPGKITSNMFCVGFLEGGKDSCQGDSGGPVVSNGQL
QGIVSWGYGCAQKNRPGVYTKVYNYVDWIRDTIAANS

>gi|6755891|ref|NP_035775.1| mesotrypsin [Mus musculus]
MNALLILALVGAAVAFPVDDDDKIVGGYTCQENSVPYQVSLNSGYHFCGGSLINDQWVVSAAHCYKTRIQ
VRLGEHNINVLEGNEQFVNAAKIIKHPNFNRKTLNNDIMLLKLSSPVTLNARVATVALPSSCAPAGTQCL
ISGWGNTLSFGVSEPDLLQCLDAPLLPQADCEASYPGKITGNMVCAGFLEGGKDSCQGDSGGPVVCNREL
QGIVSWGYGCALPDNPGVYTKVCNYVDWIQDTIAAN

>gi|6981422|ref|NP_036861.1| protease, serine, 2 [Rattus norvegicus]
MRALLFLALVGAAVAFPVDDDDKIVGGYTCQENSVPYQVSLNSGYHFCGGSLINDQWVVSAAHCYKSRIQ
VRLGEHNINVLEGNEQFVNAAKIIKHPNFDRKTLNNDIMLIKLSSPVKLNARVATVALPSSCAPAGTQCL
ISGWGNTLSSGVNEPDLLQCLDAPLLPQADCEASYPGKITDNMVCVGFLEGGKDSCQGDSGGPVVCNGEL
QGIVSWGYGCALPDNPGVYTKVCNYVDWIQDTIAAN

>gi|27819626|ref|NP_777115.1| pancreatic anionic trypsinogen [Bos taurus]
MHPLLILAFVGAAVAFPSDDDDKIVGGYTCAENSVPYQVSLNAGYHFCGGSLINDQWVVSAAHCYQYHIQ
VRLGEYNIDVLEGGEQFIDASKIIRHPKYSSWTLDNDILLIKLSTPAVINARVSTLALPSACASGSTECL
. . .
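For readers who want to inspect such a file before loading it into Clustal X, a minimal FASTA reader can be sketched as follows. The function name is an assumption, and the example text is a shortened excerpt (only the first lines of two of the records above), not a complete input file.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []   # start a new record
        elif header is not None:
            chunks.append(line)             # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Shortened excerpt for illustration only.
example = """>gi|21536452|ref|NP_002762.2| mesotrypsin preproprotein [Homo sapiens]
MNPFLILAFVGAAVAVPFDDDDKIVGGYTCEENSLPYQVSLNSGSHFCGGSLISEQWVVSAAHCYKTRIQ
VRLGEHNIKVLEGNEQFINAAKIIRHPKYNRDTLDNDIMLIKLSSPAVINARVSTISLPTAPPAAGTECL
>gi|6755891|ref|NP_035775.1| mesotrypsin [Mus musculus]
MNALLILALVGAAVAFPVDDDDKIVGGYTCQENSVPYQVSLNSGYHFCGGSLINDQWVVSAAHCYKTRIQ
"""
records = parse_fasta(example)
for header, seq in records:
    print(header.split("|")[3], len(seq))
```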

One of the options to get a multiple sequence FASTA file


Step 1: Load the sequences

Sequences and conservation view

Step 2: Perform Alignment

Step 3: Create tree

Step 4: NJPlot

The Newick tree format is used to represent trees as strings

[Figure: tree with leaves A, C, B and D]

In Newick format: ((A,C),(B,D));

Each pair of parentheses () encloses a clade in the tree, and the comma separates the members of the corresponding clade. “;” is always the last character
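The rules above are enough to write a small recursive parser. This sketch handles topology-only strings such as the example; it is an illustration under that assumption and does not handle the branch lengths or internal labels that real tree files usually contain. The function name is hypothetical.

```python
def parse_newick(text):
    """Parse a topology-only Newick string into nested tuples of leaf names."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick string must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                          # consume "("
            children = [parse_node()]
            while text[pos] == ",":           # comma separates clade members
                pos += 1
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1                          # consume ")"
            return tuple(children)
        start = pos
        while text[pos] not in ",();":        # read a leaf name
            pos += 1
        return text[start:pos]

    tree = parse_node()
    if text[pos] != ";":
        raise ValueError("unexpected text after the tree")
    return tree

print(parse_newick("((A,C),(B,D));"))  # (('A', 'C'), ('B', 'D'))
```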

How robust is our tree?

We need some statistical way to estimate the confidence in the tree topology

But we don’t know anything about the tree topology distribution or parameters

The only data source we have is our data (MSA)

So, we must rely on our own resources: “pull up by your own bootstraps”

How robust is our tree?

Bootstrap (and jackknife)

Jackknife

1. We create n (typically 100-1000) new MSAs (pseudo-data sets) by randomly sampling half of the characters (random sampling without replacement).

We do not change the number of sequences, just the number of positions!

POS: 5 2 3 1 6
1 : TATTT
2 : CATTT
3 : CACTT
N : AACTT

POS: 1 8 7 4 5
1 : TTTAT
2 : TAACC
3 : TAACC
N : TGGGA

POS: 1 8 3 9 4
1 : TTGTA
2 : TAGAC
3 : TAAAC
N : TGAGG

Jackknife

2. We reconstruct a tree from each data set, using the same method used for reconstructing the original tree

[Figure: three trees over Sp1, Sp2, Sp3 and Sp4, one reconstructed from each pseudo-data set]

3. For each node in our original tree, we count the number of times it appeared in the jackknife analysis

[Figure: the three jackknife trees over Sp1, Sp2, Sp3 and Sp4]

Back to Jackknife

[Figure: original tree over Sp1, Sp2, Sp3 and Sp4 with node support values 67% and 100%]

In 67% of the data sets, the node Sp1+Sp2 was found
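The resampling in jackknife step 1 can be sketched in a few lines: choose half of the column indices at random without replacement and keep those columns for every sequence. The alignment below is hypothetical, and the function name and fixed random seed are assumptions for illustration; tree reconstruction itself is left to the tool used for the original tree.

```python
import random

def jackknife_sample(msa, rng):
    """Return (chosen column indices, pseudo-MSA) using half the columns, no replacement."""
    k = len(msa[0])
    chosen = rng.sample(range(k), k // 2)
    pseudo = ["".join(seq[c] for c in chosen) for seq in msa]
    return chosen, pseudo

# Hypothetical 4-sequence, 10-column alignment.
msa = ["TTATCTGACA", "TAATCTGACC", "TAACTTAACC", "TGACCTAGGT"]
rng = random.Random(0)  # fixed seed so the run is repeatable
replicates = [jackknife_sample(msa, rng) for _ in range(100)]

chosen, pseudo = replicates[0]
print("POS:", [c + 1 for c in chosen])  # 1-based positions, as on the slides
print(pseudo)
```

The number of sequences stays the same in every replicate; only the number of positions is halved.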

Bootstrap

The same as jackknife, but instead of sampling K/2 positions without replacement, we sample K positions with replacement, where K is the number of positions (columns) in the MSA

Bootstrap

1. Resample K positions n times

Original MSA:

POS: 1 2 3 4 5 … K
1 : ATCTG…A
2 : ATCTG…C
3 : ACTTA…C
N : ACCTA…T

Resampled data sets:

POS: 1 1 2 4 4 … K
1 : AATTT…T
2 : AATTT…G
3 : AACTT…T
N : AACTT…T

POS: 4 7 7 8 9 … K
1 : TTTAT…T
2 : TAACC…G
3 : TAACC…T
N : TGGGA…T

POS: 1 5 5 7 8 … K
1 : AGGTA…T
2 : AGGAC…G
3 : AAAAC…A
N : AAAGG…C
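A small sketch of the bootstrap resampling step, using only the first five columns shown for the original MSA (the elided columns are not guessed). Function names and the random seed are assumptions for illustration. Picking positions 1 1 2 4 4 reproduces the first five columns of the first resampled data set.

```python
import random

def take_columns(msa, positions):
    """Build a pseudo-MSA from 1-based column positions (repeats allowed)."""
    return ["".join(seq[p - 1] for p in positions) for seq in msa]

def bootstrap_positions(k, rng):
    """Draw k positions from 1..k with replacement, sorted for readability."""
    return sorted(rng.randint(1, k) for _ in range(k))

# First five columns of the original MSA only.
msa = ["ATCTG", "ATCTG", "ACTTA", "ACCTA"]

print(take_columns(msa, [1, 1, 2, 4, 4]))  # ['AATTT', 'AATTT', 'AACTT', 'AACTT']

rng = random.Random(0)  # fixed seed so the run is repeatable
positions = bootstrap_positions(len(msa[0]), rng)
print(positions, take_columns(msa, positions))
```

Unlike the jackknife, each pseudo-data set has the same number of columns as the original; some columns appear more than once and others not at all.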

Bootstrap

2. Reconstruct a tree from each data set using the same method used for reconstructing the original tree

[Figure: three trees over Sp1, Sp2, Sp3 and Sp4, one reconstructed from each resampled data set]

Bootstrap

3. For each node in our original tree, we count the number of times it appeared in the bootstrap analysis

[Figure: the three bootstrap trees over Sp1, Sp2, Sp3 and Sp4]

[Figure: original tree over Sp1, Sp2, Sp3 and Sp4 with node support values 67% and 100%]
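Counting node support can be sketched by describing each internal node as the set of leaves below it and checking how many replicate trees contain the same set. Trees are written here as nested tuples, and the three replicate topologies are hypothetical, chosen so that the result matches the 67% and 100% values in the figure; which node carries the 100% is an assumption. Real programs compare bipartitions of unrooted trees, so treat this rooted version as a simplification.

```python
def clades(tree):
    """Return (leaf set, set of clades) for a tree given as nested tuples of leaf names."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)  # the clade defined by this internal node
    return leaves, found

def support(original, replicates):
    """Fraction of replicate trees in which each clade of the original tree appears."""
    _, original_clades = clades(original)
    replicate_clades = [clades(t)[1] for t in replicates]
    return {
        clade: sum(clade in rc for rc in replicate_clades) / len(replicates)
        for clade in original_clades
    }

original = ((("Sp1", "Sp2"), "Sp3"), "Sp4")
replicates = [  # hypothetical replicate trees
    ((("Sp1", "Sp2"), "Sp3"), "Sp4"),
    ((("Sp1", "Sp2"), "Sp3"), "Sp4"),
    ((("Sp1", "Sp3"), "Sp2"), "Sp4"),
]
for clade, value in support(original, replicates).items():
    print("+".join(sorted(clade)), f"{value:.0%}")
```

The same counting works for jackknife replicates; only the way the pseudo-data sets are generated differs.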

Step 3.5 - Bootstrap

Bootstrap values on NJPlot

Note: ClustalX saves trees as .ph files; trees with bootstrap values are saved as .phb files

You might have to reopen the tree…