Practical Bioinformatics Practical Bioinformatics ToolsToolsForFor
Understanding EvolutionUnderstanding Evolution
Robert Latek, PhDBioinformatics and Research Computing
Whitehead Institute for Biomedical Research
WIBR Bioinformatics, © Whitehead Institute 2004 2
AimsAims
• Examine Techniques For Describing Examine Techniques For Describing Evolutionary RelationshipsEvolutionary Relationships
• Learn To Apply Bioinformatics Tools Learn To Apply Bioinformatics Tools To Study EvolutionTo Study Evolution
• Question, Interrupt, Discuss, SuggestQuestion, Interrupt, Discuss, Suggest
WIBR Bioinformatics, © Whitehead Institute 2004 3
Bioinformatics ?Bioinformatics ?• DefinitionDefinition
– Integration of computational and biological methods to promote biological discovery
– Combination of Biology, Statistics, CS, Clinical Research
• PurposePurpose– Predict, Decipher, Visualize
• MethodologyMethodology– Data Mining and Comparisons
Data Visualization
MSRKGPRAEVCADCSAPDPGWASISRGVLVCDECCSVHRLGRHISIVKHLRHSAWPPTLLQMVHTLASNGANSIWEHSLLDPAQVQSGRRKAN
G. Bell
WIBR Bioinformatics, © Whitehead Institute 2004 4
Bioinformatics :-)Bioinformatics :-)• Biological Comparisons (Evolutionary Analysis)Biological Comparisons (Evolutionary Analysis)
– How closely/distantly related are two populations?
• Gene Function PredictionGene Function Prediction– How and why does Gene X function/malfunction?
• Pharmaceutical Design & In Silico TestingPharmaceutical Design & In Silico Testing
WIBR Bioinformatics, © Whitehead Institute 2004 5
Bioinformatics@WIBioinformatics@WI• Bioinformatics and Research ComputingBioinformatics and Research Computing
• Collaboration, Consultation, Education Collaboration, Consultation, Education in Bioinformatics and Graphicsin Bioinformatics and Graphics
• Provide hardware, commercial/custom Provide hardware, commercial/custom software tools, training, and software tools, training, and bioinformatics expertisebioinformatics expertise
Decipher Predict
WIBR Bioinformatics, © Whitehead Institute 2004 6
Discussion MapDiscussion Map
• Relationships Among Groups Of GenesRelationships Among Groups Of Genes– Comparing Sequences– Building Sequence Families
• Sequence Conservation During EvolutionSequence Conservation During Evolution– Aligning Multiple Sequences
• Evolutionary DiagramsEvolutionary Diagrams– Tracing The Descent From Common Ancestors– Growing Phylogenetic Trees
WIBR Bioinformatics, © Whitehead Institute 2004 7
Evolutionary AnalysisEvolutionary Analysis
• DefinitionDefinition– The use of phylogeny to reveal relationships among
sets of genes
• PurposePurpose– To utilize information about common ancestors to
predict gene function and regulation
• MethodologyMethodology– Compare properties between genes/organisms and
identify commonalities and differences– Organization of genes into a evolutionary diagrams– Sequence by sequence comparisons
WIBR Bioinformatics, © Whitehead Institute 2004 8
Sequence-Based Sequence-Based ComparisonsComparisons
• Identify sequences within an organism that are related Identify sequences within an organism that are related to each other and/or across different speciesto each other and/or across different species– Within: Fetal and adult hemoglobin– Across : Human and chimpanzee hemoglobin
• Generate an evolutionary history of related genesGenerate an evolutionary history of related genes• Locate insertions, deletions, and substitutions that Locate insertions, deletions, and substitutions that
have occurred during evolutionhave occurred during evolution
CREATE CREASE -RELAPSE
GREASER
(C) Cysteine(R) Arginine(E) Glutamate(A) Alanine(T) Threonine(S) Serine(L) Leucine(P) Proline(G) Glycine
[Ancestor] [Progenitors]
WIBR Bioinformatics, © Whitehead Institute 2004 9
Homology & SimilarityHomology & Similarity• HomologyHomology
– Conserved sequences arising from a common ancestor– Orthologs: homologous genes that share a common
ancestor in the absence of any gene duplication (Mouse and Human Hemoglobin)
– Paralogs: genes related through gene duplication (one gene is a copy of another - Fetal and Adult Hemoglobin)
• SimilaritySimilarity– Genes that share common sequences but are not
necessarily related
WIBR Bioinformatics, © Whitehead Institute 2004 10
Sequences As ModulesSequences As Modules
• Proteins are derived from a limited Proteins are derived from a limited number of basic building blocks number of basic building blocks (Modules)(Modules)
• Evolution has shuffled these modules Evolution has shuffled these modules giving rise to a diverse repertoire of giving rise to a diverse repertoire of protein sequencesprotein sequences
• As a result, proteins can share a global As a result, proteins can share a global relationships or local relationship relationships or local relationship specific to a particular DOMAINspecific to a particular DOMAIN
Global
Local
WIBR Bioinformatics, © Whitehead Institute 2004 11
Sequence DomainsSequence DomainsModules Define Functional/Structural DomainsModules Define Functional/Structural Domains
WIBR Bioinformatics, © Whitehead Institute 2004 12
Sequence FamiliesSequence Families
• Definition Definition – Group of sequences that share a common function
and/or structure, that are potentially derived from a common ancestor (set of homologous sequences)
• Building A Family Building A Family – Domains are used to group different sequences into
common families
WIBR Bioinformatics, © Whitehead Institute 2004 13
Defining A Sequence Defining A Sequence FamilyFamily
Family A
Family B
Family D Family E
Family C
WIBR Bioinformatics, © Whitehead Institute 2004 14
Sequence Family Sequence Family ResourcesResources
• Search and Browse Family DatabasesSearch and Browse Family Databases
• PFAMPFAM– http://pfam.wustl.edu/
>src
MGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAFAPAAAEPKLFGGFNSSDTVTSPQRAGPLAGGVTTFVALYDYESRTETDLSFKKGERLQIVNNTEGDWWLAHSLSTGQTGYIPSNYVAPSDSIQAEEWYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGLNVKHYKIRKLDSGGFYITSRTQFNSLQQLVAYYSKHADGLCHRLTTVCPTSKPQTQGLAKDAWEIPRESLRLEVKLGQGCFGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIVTEYMSKGSLLDFLKGETGKYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDNEYTARQGAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELTTKGRVPYPGMVNREVLDQVERGYRMPCPPECPESLHDLMCQCWRKEPEERPTFEYLQAFLEDYFTSTEPQYQPGENL
WIBR Bioinformatics, © Whitehead Institute 2004 15
Sequence Family Sequence Family ResourcesResources
• NCBI Family Database ResourcesNCBI Family Database Resources• Conserved Domain DatabaseConserved Domain Database
– http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=cdd
• Conserved Domain Architecture Retrieval ToolConserved Domain Architecture Retrieval Tool– http://www.ncbi.nlm.nih.gov/BLAST/
WIBR Bioinformatics, © Whitehead Institute 2004 16
Discussion MapDiscussion Map
• Relationships Among Groups Of GenesRelationships Among Groups Of Genes– Comparing Sequences– Building Sequence Families
• Sequence Conservation During EvolutionSequence Conservation During Evolution– Aligning Multiple Sequences
• Evolutionary DiagramsEvolutionary Diagrams– Tracing The Descent From Common Ancestors– Growing Phylogenetic Trees
WIBR Bioinformatics, © Whitehead Institute 2004 17
Multiple Sequence Multiple Sequence Alignments Alignments
• Place residues in columns that Place residues in columns that are derived from a common are derived from a common ancestral residueancestral residue
• Identify Identify MatchesMatches, , MismatchesMismatches, , and and GapsGaps
• MSA can reveal sequence MSA can reveal sequence patternspatterns– Demonstration of homology
between >2 sequences– Identification of functionally
important sites– Protein function prediction– Structure prediction
CRE-A-TE-
CRE-A-SE-
-RELAPSE-
GRE-A-SER
CREATE
CREASE
GREASER
RELAPSE
123456789
SeqA
SeqB
SeqC
SeqD
WIBR Bioinformatics, © Whitehead Institute 2004 18
Global vs. Local Global vs. Local AlignmentsAlignments
• GlobalGlobal– Search for alignments, matching over
entire sequences
• LocalLocal– Examine regions of sequence for
conserved segments
• Both Consider: Matches, Mismatches, Both Consider: Matches, Mismatches, GapsGaps
WIBR Bioinformatics, © Whitehead Institute 2004 19
Global Sequence AlignmentsGlobal Sequence Alignments
Yeast Prion-Like Proteins
WIBR Bioinformatics, © Whitehead Institute 2004 20
How To Make A Global How To Make A Global MSAMSA
• On The WebOn The Web– http://pir.georgetown.edu/pirwww/search/multaln.html
• On Your ComputerOn Your Computer– ClustalX: http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/
WIBR Bioinformatics, © Whitehead Institute 2004 21
MSA Example SequencesMSA Example Sequences
>KSYK_HUMANFFFGNITREEAEDYLVQGGMSDGLYLLRQSRNYLGGFALSVAHGRKAHHYTIERELNGTYAIAGGRTHASPADLCHYH
>ZA70_HUMANWYHSSLTREEAERKLYSGAQTDGKFLLRPRKEQGTYALSLIYGKTVYHYLISQDKAGKYCIPEGTKFDTLWQLVEYL
>KSYK_PIGWFHGKISRDESEQIVLIGSKTNGKFLIRARDNGSYALGLLHEGKVLHYRIDKDKTGKLSIPGGKNFDTLWQLVEHY
>MATK_HUMANWFHGKISGQEAVQQLQPPEDGLFLVRESARHPGDYVLCVSFGRDVIHYRVLHRDGHLTIDEAVFFCNLMDMVEHY
>CSK_CHICKWFHGKITREQAERLLYPPETGLFLVRESTNYPGDYTLCVSCEGKVEHYRIIYSSSKLSIDEEVYFENLMQLVEHY
>CRKL_HUMANWYMGPVSRQEAQTRLQGQRHGMFLVRDSSTCPGDYVLSVSENSRVSHYIINSLPNRRFKIGDQEFDHLPALLEFY
>YES_XIPHEWYFGKLSRKDTERLLLLPGNERGTFLIRESETTKGAYSLSLRDWDETKGDNCKHYKIRKLDNGGYYITTRTQFMSLQMLVKHY
>FGR_HUMANWYFGKIGRKDAERQLLSPGNPQGAFLIRESETTKGAYSLSIRDWDQTRGDHVKHYKIRKLDMGGYYITTRVQFNSVQELVQHY
>SRC_RSVPWYFGKITRRESERLLLNPENPRGTFLVRKSETAKGAYCLSVSDFDNAKGPNVKHYKIYKLYSGGFYITSRTQFGSLQQLVAYY
Standard FASTA Sequence Format
WIBR Bioinformatics, © Whitehead Institute 2004 22
MSA Example ResultMSA Example Result
YES_XIPHE WYFGKLSRKDTERLLLLPGNERGTFLIRESETTKGAYSLSLRDWDETKGDNCKHYKIRKLFGR_HUMAN WYFGKIGRKDAERQLLSPGNPQGAFLIRESETTKGAYSLSIRDWDQTRGDHVKHYKIRKLSRC_RSVP WYFGKITRRESERLLLNPENPRGTFLVRKSETAKGAYCLSVSDFDNAKGPNVKHYKIYKLMATK_HUMAN WFHGKISGQEAVQQLQPPED--GLFLVRESARHPGDYVLCVS-----FGRDVIHYRVLHRCSK_CHICK WFHGKITREQAERLLYPPET--GLFLVRESTNYPGDYTLCVS-----CEGKVEHYRIIYSCRKL_HUMAN WYMGPVSRQEAQTRLQGQRH--GMFLVRDSSTCPGDYVLSVS-----ENSRVSHYIINSLZA70_HUMAN WYHSSLTREEAERKLYSGAQTDGKFLLRPRK-EQGTYALSLI-----YGKTVYHYLISQDKSYK_PIG WFHGKISRDESEQIVLIGSKTNGKFLIRAR--DNGSYALGLL-----HEGKVLHYRIDKDKSYK_HUMAN FFFGNITREEAEDYLVQGGMSDGLYLLRQSRNYLGGFALSVA-----HGRKAHHYTIERE :: . : :: : * :*:* * : * : ** :
YES_XIPHE DNGGYYITTRTQFMSLQMLVKHYFGR_HUMAN DMGGYYITTRVQFNSVQELVQHYSRC_RSVP YSGGFYITSRTQFGSLQQLVAYYMATK_HUMAN -DGHLTIDEAVFFCNLMDMVEHYCSK_CHICK -SSKLSIDEEVYFENLMQLVEHYCRKL_HUMAN PNRRFKIGDQE-FDHLPALLEFYZA70_HUMAN KAGKYCIPEGTKFDTLWQLVEYLKSYK_PIG KTGKLSIPGGKNFDTLWQLVEHYKSYK_HUMAN LNGTYAIAGGRTHASPADLCHYH * . : .
WIBR Bioinformatics, © Whitehead Institute 2004 23
Discussion MapDiscussion Map
• Relationships Among Groups Of GenesRelationships Among Groups Of Genes– Comparing Sequences– Building Sequence Families
• Sequence Conservation During EvolutionSequence Conservation During Evolution– Aligning Multiple Sequences
• Evolutionary DiagramsEvolutionary Diagrams– Tracing The Descent From Common Ancestors– Growing Phylogenetic Trees
WIBR Bioinformatics, © Whitehead Institute 2004 24
Phylogenetic TreesPhylogenetic Trees
• A Graph Representing The A Graph Representing The Evolutionary History Of SequencesEvolutionary History Of Sequences– Relationship of sequences to one
another (How everything is connected)– Dissect the order of appearance of
insertions, deletions, and mutations
• Identify Related Sequences, Predict Identify Related Sequences, Predict Function, Observe Epidemiology Function, Observe Epidemiology (Analyze changes in viral strains)(Analyze changes in viral strains)
A
B
C
D
SimpleTree
WIBR Bioinformatics, © Whitehead Institute 2004 25
Tree ShapesTree Shapes
Rooted Un-rooted
Branches intersect at NodesLeaves are the topmost branches
A
B
C
D
A
B
C
D
A
B
C
D
WIBR Bioinformatics, © Whitehead Institute 2004 26
Tree CharacteristicsTree Characteristics
• Tree PropertiesTree Properties– Clade: all the descendants of a common
ancestor represented by a node
– Distance: number of changes that have taken place along a branch
• Tree TypesTree Types
– Cladogram: shows the branching order of nodes
– Phylogram: shows branching order and distances
A
B
C
D
.035
.009
.057
.044
.012
.016
Phylogram
WIBR Bioinformatics, © Whitehead Institute 2004 27
Tree Building MethodsTree Building Methods
• Group Most Common SequencesGroup Most Common Sequences– Find the tree that changes one sequence into all of the
others by the least number of steps – Sequences with the smallest number of differences have
the shortest distance between them and are called:
“related taxa”
WIBR Bioinformatics, © Whitehead Institute 2004 28
Tree Building MethodsTree Building Methods
A
B
A
B
C
DF
A
B
D C E
F2 11
A
B
F E
C
D
2 1
3
E
ABCDEF
CDEF
A
B
C
D
EF
F
A
B
C
D
E
A
B
C
D
E
F
1 2 3 4 5
WIBR Bioinformatics, © Whitehead Institute 2004 29
Example Evolutionary Example Evolutionary TreesTrees
• Tree Of LifeTree Of Life– http://tolweb.org/tree/phylogeny.html
• Theory Of Human Evolution At The SITheory Of Human Evolution At The SI– http://www.mnh.si.edu/anthro/humanorigins/ha/a_tree.html
Anthropological and Archeological
WIBR Bioinformatics, © Whitehead Institute 2004 30
How To Build A TreeHow To Build A Tree
• Create AlignmentCreate Alignment– http://pir.georgetown.edu/pirwww/search/multaln.html
• Create TreeCreate Tree– http://www.genebee.msu.su/services/phtree_reduced.html
• Draw TreeDraw Tree– http://iubio.bio.indiana.edu/treeapp/treeprint-form.html
Sequence Based
WIBR Bioinformatics, © Whitehead Institute 2004 31
MSA and Tree RelationshipMSA and Tree Relationship
• ““The optimal alignment of several sequences can The optimal alignment of several sequences can be thought of as minimizing the number of be thought of as minimizing the number of mutational steps in an evolutionary tree for which mutational steps in an evolutionary tree for which the sequences are the leaves” (Mount, 2001)the sequences are the leaves” (Mount, 2001)
+R
CREATE
CREASE
CREATE
CRE-A-TE-
CRE-A-SE-
-RELAPSE-
GRE-A-SER
SeqA
SeqB
SeqC
SeqD
T to S
C to G GREASE
CREASE
CREATE
+L +P
-G
WIBR Bioinformatics, © Whitehead Institute 2004 32
Summary ReviewSummary Review
• Relationships Among Groups Of GenesRelationships Among Groups Of Genes– Comparing Sequences– Building Sequence Families
• Sequence Conservation During EvolutionSequence Conservation During Evolution– Aligning Multiple Sequences
• Evolutionary DiagramsEvolutionary Diagrams– Tracing The Descent From Common Ancestors– Growing Phylogenetic Trees
WIBR Bioinformatics, © Whitehead Institute 2004 33
ReferencesReferences
• Bioinformatics: Sequence and genome Bioinformatics: Sequence and genome Analysis. David W. Mount. CSHL Press, 2001.Analysis. David W. Mount. CSHL Press, 2001.
• Bioinformatics: A Practical Guide to the Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. Andreas D. Analysis of Genes and Proteins. Andreas D. Baxevanis and B.F. Francis Ouellete. Wiley Baxevanis and B.F. Francis Ouellete. Wiley Interscience, 2001.Interscience, 2001.
• Bioinformatics: Sequence, structure, and Bioinformatics: Sequence, structure, and databanks. Des Higgins and Willie Taylor. databanks. Des Higgins and Willie Taylor. Oxford University Press, 2000.Oxford University Press, 2000.
Top Related