Introduction To Bioinformatics - Purdue...


Page 1: Introduction To Bioinformatics - Purdue Universityrna.genomics.purdue.edu/@api/deki/files/2202/=14biol47800_1.pdf · Biol47800 –Introduction to Bioinformatics ... 22 Biol47800/59500

1 Biol47800/59500 Bioinformatics

Introduction To Bioinformatics
Biol 47800 & 59500-012


Biol47800 – Introduction to Bioinformatics

• For Undergrads: BIOL 47800 (3 credits)

• For Grads: BIOL 59500-003 (4 credits)

• Instructor: Michael Gribskov

[email protected]

Hockmeyer 331, 494-6933

Office hours by appointment, see my calendar

http://www.google.com/[email protected]


Course work

• Lectures
◦ The syllabus gives a list of readings in the text that should be completed BEFORE the lecture. Note that I do not discuss all the material in the text in class. Nevertheless, any of the assigned material is likely to appear in exams.

• Homework
◦ Homework assignments will be available to self-test your understanding and to get additional points.

• Quizzes
◦ There will be at least two quizzes. Each quiz will last 25 minutes. The quizzes are closed book, closed notes and no calculators or computers.

• Exams
◦ Two midterms are tentatively scheduled for 9 October and 14 November. The exams are closed book, closed notes and no calculators or computers. The material to be covered will be described in detail in class. The two midterms will each emphasize material covered during the relevant portion of the class, but the second will include some material from the first exam. The final exam will cover the entire course, but will emphasize the material covered since the second midterm.


Assessment and Grading

Activity     Points Each    Points Overall
Final        300            300
Midterms     100            200
Homework     varies         varies
Quizzes      30             60
Total                       560+

Anticipated Grade Ranges

• A 540

• B 480

• C 420

• D 360

• F less than 360

Ranges may be moved downward but will not be moved upward


Policies - Academic behavior

• Academic dishonesty of any kind (cheating, plagiarism, fabrication of data, improper collaboration, etc.) is not tolerated and is grounds for failing the course (grade F) and notification of University administration for further disciplinary action.

• All assignments will be explicitly labeled for individual versus group effort; groups will be instructed as to the rules for collaboration.

• All questions about course policy and administration should be directed to the instructor.


Textbook

• Primary Text:

◦ Understanding Bioinformatics, Zvelebil and Baum, Garland Science, ISBN: 978-0-8153-4024-9

◦ Also available as ebook

– http://store.vitalsource.com/show/978-1-1369-7696-4

– 180 day e-rental: $72.00

– 1 year e-rental: $86.40

– E-book purchase: $144.00

◦ Amazon

– new: $81.00

– used: from $72.92

– rent: $36.00

– etextbook: $29.18


• Bioinformatics/Computational Biology

◦ Bioinformatics – originally the application of databases in biology – it has come to mean any kind of computational analysis and is synonymous with computational biology

• Main Topics

◦ Genomics (DNA and protein sequence analysis)

◦ Evolution and Phylogenetics

◦ Systems Biology

◦ Protein structure


Schedule Week 1 25 Aug – 29 Aug

• Introduction & motivation

• Tree thinking & Sequences and Evolution

◦ Tree thinking handout – Reading/Tree thinking; Baum, 2005

• Intro to sequence comparison


Genomics

Genomics – Study of the whole nucleotide sequence of an organism


Genomics

What is genomics good for?

• Forensics - Genotyping

• Medical diagnostics

◦ Genetic diseases

◦ Drug response/sensitivity

• Identifying new diseases

◦ Contagion (2011)

◦ Outbreak (1995)

◦ Andromeda strain (1971)

• Identifying genes and what they do

• Understanding the whole cell


Genomics

Key technologies

• Methods to amplify very small amounts of DNA from environment or tissue (PCR – polymerase chain reaction)

• Automated machines to rapidly determine DNA base sequence (DNA sequencer)

• Reasoning on trees

• Computational approaches


Genomics

The Genographic Project

https://www3.nationalgeographic.com/genographic/

• 525,000 mitochondrial genotypes


Genomics

Haplotype H

• L0/L1/L2 – sub-Saharan Africa, 150,000 – 80,000 BC

• L3 – northern Africa, 80,000 BC

• N/R – Middle East, 60,000 BC

• Pre-HV/HV – central Asia, 60,000 – 30,000 BC

• H – Europe, 30,000 BC


Genomics

Mitochondrial inheritance

• Mitochondria have their own DNA and replicate separately from nuclear DNA

• Mitochondria are inherited from the mother only

• Mitochondria have very limited recombination or transposable elements

• Good for making trees in the 10 Kyr to 1 Myr range


Genomics

Gain and loss of mitochondrial lineages

• Due purely to chance, some lineages die out

◦ no children

◦ no female children

• Random mutations gradually change sequence

• Since changes are small, you can tell the relationships between new and old forms
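The chance loss of lineages described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not from the slides: a constant number of women per generation, each daughter picking her mother at random, and no mutation. The function name `simulate_lineages` and its parameters are hypothetical.

```python
import random

def simulate_lineages(n_women=50, generations=100, seed=0):
    """Track how many founding mitochondrial lineages survive over time."""
    rng = random.Random(seed)
    # Generation 0: every woman carries her own distinct lineage label.
    population = list(range(n_women))
    surviving = [len(set(population))]
    for _ in range(generations):
        # Each woman in the next generation inherits the lineage of a
        # randomly chosen mother; mothers never chosen leave no daughters.
        population = [rng.choice(population) for _ in range(n_women)]
        surviving.append(len(set(population)))
    return surviving

counts = simulate_lineages()
print(counts[0], counts[10], counts[-1])
```

Because a lost lineage can never reappear in this model, the count of surviving lineages can only stay the same or fall from one generation to the next.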


Genomics

1918 "Spanish" flu pandemic

• Appeared suddenly

• So different from other flu strains that there was no resistance

• ~ 2% mortality

• 25-30% of world population infected

• Killed 40-100 million worldwide, 675,000 US (in 4 months)

◦ Today a similar pandemic would kill in the hundreds of millions

◦ Similar pandemic today would cause 2 million US deaths

• Using modern techniques, sequences have been obtained from pathology samples, frozen tissues, etc.


Genomics

HIV infection


Genomics

Identification of AIDS/HIV

• 1978 - Gay men in the US and Sweden, and heterosexuals in Tanzania and Haiti, begin showing signs of what will later be called AIDS.

• 1980 - 31 known deaths identified in US

• 1981 - CDC reports that 5 young homosexual men treated for Pneumocystis carinii pneumonia at 3 different LA hospitals have multiple infections, including cytomegalovirus; 2 die

• 1981 - 26 cases of Kaposi sarcoma identified in the last 30 months among gay males; 8 died within 2 years

• 1981 - deaths US 234

• 1982 - deaths US 853 AIDS is defined, linked to blood transfusion

• 1983 - deaths US 2,304 retrovirus isolated (HTLV-III, LAV)

• 1984 - deaths US 4,300 HIV sequenced (HTLV-III)

• 1985 - deaths US 2960 cumulative 16,301 first blood test

• 1986 - deaths US

• 1987 - deaths US 4,100 first drugs

• 1988 - deaths US 4,900

• 1989 - deaths US 14,500

• 1990 - deaths US 18,500

• 1991 - deaths US 20,500

• 1992 - deaths US 23,400

• 1993 - deaths US 41,900

• 1994 - deaths US 32,300

• 1995 - deaths US 48,400

• 1996 - deaths US 35,000

• 1997 - death count US 21,400, worldwide 6.4 million, 22 million infected

• 2004 - death count US 17,557, cumulative 524,000 (0.2 % of US population)

• 2007 - approximately 30-36 million infected worldwide (0.8% of population)

• since the beginning of the AIDS epidemic, 617,025 people have died of AIDS in the US


Genomics

AIDS

• Many AIDS drugs are targeted at the HIV protease

◦ HIV proteins are translated as a single large protein (polyprotein), which must be cut up into individual pieces by the HIV-encoded protease

◦ How was the protease identified as a target?

– sequence comparison showed homology to aspartic proteases

– aspartic proteases were already being investigated as drug targets

– inferences were made based on homology modeling


Genomics

Understanding a genome

• The human genome has been completely sequenced.

• How do we find the important genes?

◦ genes related to human genetic diseases: sickle cell anemia, Parkinson's disease, Huntington's disease, cystic fibrosis

◦ multi-factor diseases: hypertension, obesity, schizophrenia, diabetes, cancer

• How do we figure out how the genetic differences lead to disease?

• How do we generate hypotheses about possible functions?

◦ DNA binding

◦ Enzyme activity

◦ Signal Transduction

◦ etc.



Lecture 2 – 27 August


Schedule Week 1 25 Aug – 29 Aug

• Introduction & motivation

• Tree thinking & motivation

◦ Tree thinking handout – Reading/Tree thinking; Baum, 2005

• Intro to sequence comparison

• Reading for next week

◦ Ch 4.1-4.4

◦ Ch 5.2


Tree thinking

• Trees represent biological relationships

◦ species

◦ genes

◦ sequences

◦ data

• Adjacent branches are the most closely related

• Biology uses trees because history is a fundamental explanation for why the world is how it is


Genomics

Four Corners virus

• May 1993 – a young, physically fit Navajo man suffering from shortness of breath is admitted to the hospital in New Mexico and dies rapidly. His fiancée had died a few days earlier with similar symptoms.

• Within a few hours, five other deaths of young healthy people from acute respiratory failure were identified.

• Mortality rate >80% in initial patients/victims

• Many causes were investigated and rejected:

◦ Bubonic plague

◦ Bacterial sepsis

◦ Herbicide exposure

◦ Influenza


Genomics

Four Corners virus

• Symptoms suggested a virus

• Eventually, a previously unknown Hantavirus was identified from tissue samples

◦ Low cross reactivity of patient antibodies to known Hantaviruses

◦ PCR amplification and sequencing

• How does this help?


Genomics

Four Corners virus

• All known Hantaviruses are transmitted by rodents

• 1700 rodents were trapped from June to August 1993 near the homes of victims. Trapped rodents were dissected and analyzed.

• About 30% of deer mice (Peromyscus maniculatus) were found to carry the unknown strain of Hantavirus

• In November 1993 the specific virus was isolated (now called Sin Nombre virus)


Genomics

Four Corners virus

• Steps

◦ Identification of sequence

◦ Identification of known viruses with similar sequence

◦ Inference of virus characteristics based on known viruses


Tree Thinking

• When a characteristic, for instance black or blond hair, is distributed along a tree in a way that implies inheritance, we can

◦ infer the ancestral characteristics

◦ infer the characteristics of current members of the tree

• When characteristics are distributed without regard to inheritance they are independent (or the tree is wrong)

[Figure: tree with tips labeled a and b and unknown characters marked ?]
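One common way to make the inference of ancestral characteristics concrete is Fitch's parsimony method, which the slides do not name; it is swapped in here purely as an illustration. The sketch below assumes a binary tree written as nested tuples whose tips are character states, and the function name `fitch` is hypothetical.

```python
def fitch(node):
    """Return (possible ancestral states, minimum number of changes)."""
    if isinstance(node, str):          # a tip with an observed state
        return {node}, 0
    left, right = node
    left_states, left_changes = fitch(left)
    right_states, right_changes = fitch(right)
    common = left_states & right_states
    if common:
        # The children agree on at least one state: no change is needed here.
        return common, left_changes + right_changes
    # The children disagree: keep both options and count one change.
    return left_states | right_states, left_changes + right_changes + 1

# Hair color follows the tree: one change explains all five tips.
clustered = ((("black", "black"), "black"), ("blond", "blond"))
# Hair color scattered without regard to the tree: more changes are needed.
scattered = (("black", "blond"), ("black", "blond"))

print(fitch(clustered))
print(fitch(scattered))
```

When a character needs many changes to fit a tree, that is a hint that it is distributed independently of inheritance, or that the tree is wrong.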


Genomics

Tree thinking

• Phylogenetic trees maximize the similarity in characters between related species

• Which tree is more correct?

Left to right position means nothing in a tree!
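The point that left-to-right position carries no meaning can be checked mechanically. The sketch below is an illustration only: trees are written as nested tuples, the species names are made-up examples, and the helper `canonical` is hypothetical. It assumes every tip label is distinct.

```python
def canonical(tree):
    """Return an order-independent form of a nested-tuple tree."""
    if isinstance(tree, str):
        return tree
    # A frozenset ignores the order in which the child branches are written.
    return frozenset(canonical(child) for child in tree)

tree_a = (("human", "chimp"), ("mouse", "rat"))
tree_b = (("rat", "mouse"), ("chimp", "human"))   # same tree, branches rotated
tree_c = (("human", "mouse"), ("chimp", "rat"))   # different groupings

print(canonical(tree_a) == canonical(tree_b))
print(canonical(tree_a) == canonical(tree_c))
```

Two drawings describe the same tree when they group the tips in the same way, no matter how the branches are rotated at each node.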


Genomics

Tree thinking

• Branches with a more recent common ancestor are more closely related

• Common features are likely to come from ancestors, and to be shared in sibling lineages


Genomics

Spread of HIV


Genomics

Looking at similar molecules can tell us

• where they came from (history)

• how they work (mapping knowledge)

• The key starting point is the knowledge that you are looking at molecules that are ancestrally related. This is called homology.


Genomics

What is Homology?

• "Seen in the light of evolution, biology is, perhaps, intellectually the most satisfying and inspiring science. Without that light it becomes a pile of sundry facts some of them interesting or curious but making no meaningful picture as a whole." Theodosius Dobzhansky (1900-1975), "Nothing in biology makes sense except in the light of evolution"

• homology - the presence of a similar feature because of descent from a common ancestor

◦ Homology cannot be observed. We can't actually see the ancestral organisms/molecules and trace descent.

◦ Homology is an inference, a conclusion we draw based on observed similarity.

◦ Homology is an all-or-none relationship – no partial homology

• homoplasy - the presence of a similar feature because of convergence.

◦ Text pg 74 implies this might be common, but it is not.


Genomics

Why is homology Important?

• Homology strongly suggests that the molecules have similar structure and function

• Some time in the past, the molecules had identical structure and function

• Biology is conservative - small accumulations of mutations lead to small changes in function, but not to radical changes

• If you can prove homology, you have a strong basis for predicting similar structure and function

• Known information about related molecules can be "mapped" onto unknown molecules

[Diagram: Highly Similar Sequence; Homology (Common Ancestry); Similar Structure & Function]


Genomics

Convergence is Unlikely

• There are (very) many ways to fold a polypeptide to place specific chemical groups at specific locations. There is no reason, a priori, why proteins with a specific function should have similar 3-D structures.

• Therefore, there is no reason, a priori, why unrelated sequences should have any detectable similarity in sequence. Significantly similar molecular sequences are very unlikely to arise by chance - i.e., homoplasy on the molecular level is very unlikely.

• When we see significant similarity, we infer that the sequences/structures are homologous, i.e., at some point in the past they shared an identical sequence and structure.
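A rough calculation gives a feel for how unlikely chance similarity is. The sketch below is a deliberately simplified model, not a method from the slides: it assumes 20 equally frequent amino acids, independent positions, and a single fixed ungapped alignment, so it ignores the fact that alignment programs search over many possible alignments. The numbers 46 and 17 are illustrative values, and the function name `chance_identity_tail` is hypothetical.

```python
from math import comb

def chance_identity_tail(n, k, p=1 / 20):
    """P(at least k identical positions out of n) under a binomial model."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Probability that two unrelated 46-residue stretches match at 17 or more
# positions purely by chance, under the assumptions stated above.
print(chance_identity_tail(46, 17))
```

Real significance estimates for database searches rest on different statistics, so this figure should be read only as an order-of-magnitude intuition.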


Genomics

Constraints

• The only thing that keeps sequences tied to each other is the commonality of structure and function arising from homology, and ongoing constraints on the function of the molecule.

• Mutations are free to happen in portions of the molecule with no function

◦ Non-coding region

◦ Disordered region of protein

◦ Third position of codon

• If a molecule is essential or useful, mutations that disrupt the regulation, structure, or function are deleterious and selected against.

• The function therefore constrains the evolution of the molecule – mutations accumulate more slowly in regions that have important functions.
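The third codon position appears in the list above because many changes there do not alter the encoded amino acid. A minimal sketch using the standard genetic code can show this; the function name `synonymous_fraction` is hypothetical, and stop codons are treated as one more "amino acid" for simplicity.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def synonymous_fraction(position):
    """Fraction of single-base changes at a codon position (0, 1 or 2)
    that leave the encoded amino acid (or stop signal) unchanged."""
    same = total = 0
    for codon, aa in CODE.items():
        for base in BASES:
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            total += 1
            same += CODE[mutant] == aa
    return same / total

for pos in range(3):
    print(pos + 1, round(synonymous_fraction(pos), 3))
```

The fraction of silent changes is far higher at the third position than at the first or second, which is why third positions are freer to accumulate mutations.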


Genomics

How important is homology?

• Many developmental genes in mammals are homologous to genes in Drosophila

• Fundamental processes of replication, transcription, and translation are homologous in all living things

• How much of what we know about molecular function comes from inferred homology?

Over 95% of genes with "known" functions have their functions "determined" by sequence matching, i.e., by homology


Genomics

Inferring homology

• We can only make inferences about function when we know we are comparing the "same" things.

◦ homologous genes

◦ homologous proteins

• Typical argument: This gene in mouse is the same as a gene in humans, therefore it does about the same thing

• How do you…

◦ … find the same gene in two genomes?

◦ … find the same protein in two proteomes?

◦ … guess the function of a new gene?


Genomics

How is homology determined?

• Because molecular homoplasy is unlikely, significant sequence similarity strongly indicates homology

Similarity ≈ homology

• Similarity is determined by sequence matching or comparison, more commonly called sequence alignment

• Approaches

◦ Dotplots

◦ Alignments

◦ Database searches


Genomics

Homology

• Sequence alignments and database searches let us

◦ Find homologous sequences (genes/proteins)

◦ Map information from known systems to new ones

– Gene identification

– Gene function

– Metabolic and regulatory systems

• Two common classes of homologs

◦ Orthologs – genes separated by a speciation event, i.e., the same gene in two species

◦ Paralogs – genes separated by a duplication event, originally the same but now diverged with possibly different functions


Genomics

Dotplots – a simple way to compare sequences

1 VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.... 46

.||.::. | ..||||:|. .:.|.|.| |:| : |.| . |..|

1 GLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLK 50

. . . . .

47 ..DLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVD 94

| .:|.::| || .| .||.. : . :. ...:.:|.: || | ::.

51 SEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIP 100

. . . .

95 PVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR 141

. :.::|.|:: .|... |::|.:..::.::| |. ... :.|.|:

101 VKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYK 147
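The similarity in an alignment like the one above is often summarized as the number of identical columns. The sketch below counts identities in the first block of that alignment only; the function name `identity` is hypothetical, and treating `.` as the gap character is an assumption based on how the alignment is printed.

```python
def identity(aligned_a, aligned_b, gap="."):
    """Count identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    matches = sum(a == b for a, b in columns)
    return matches, len(columns)

# First block of the alignment above (residues 1-46 and 1-50).
top = "VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF...."
bottom = "GLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLK"
matches, compared = identity(top, bottom)
print(matches, compared, round(matches / compared, 2))
```

Identity counts ignore conservative substitutions (the `:` and `.` marks in the alignment), so they understate similarity compared with a scoring table.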


Genomics

Change over Time

• How do genes (proteins, regulatory networks) change over time?

• Is the evolutionary process random?

• Why do the changes we see in proteins have such a non-random distribution?

Mutations change the sequence of the DNA, causing changes in the properties of the encoded proteins.

While different mutagens have different preferences, mutations are essentially random with respect to the position of genes.

The changes we see are the result of two processes: mutation (random) and selection (non-random). If you assume that genes have useful functions, different mutations affect the function to different extents. The structure and function of the gene and the encoded protein constrain which mutations are compatible with the function.


Genomics

Dotplots

• Simplest method - put a dot wherever sequences are identical

• A little better - use a scoring table, put a dot wherever the residues have better than a certain score

• Or, put a dot wherever you get at least n matches in a row (identity matching, compare/word)

• Even better - filter the plot
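The simplest method in the list above, a dot wherever the two sequences are identical, can be sketched in a few lines. The function names `dotplot` and `render` are hypothetical, and the two short words are just example strings.

```python
def dotplot(seq_x, seq_y):
    """Identity dotplot: one row per residue of seq_y, one column per residue of seq_x."""
    return [[x == y for x in seq_x] for y in seq_y]

def render(seq_x, seq_y):
    """Draw the dotplot as text, with '*' for a match and '.' otherwise."""
    lines = ["  " + " ".join(seq_x)]
    for y, row in zip(seq_y, dotplot(seq_x, seq_y)):
        lines.append(y + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)

print(render("MYSEQUENCE", "HISSEQENCE"))
```

Runs of matching residues show up as diagonal streaks of `*`, while isolated matches appear as scattered noise.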


Dotplots

[Figure: dotplot of MYSEQUENCE (horizontal axis) against HISSEQENCE (vertical axis)]


Genomics

Dotplots

• Windowed scores

◦ Calculate a score within a window

◦ Move the window over one

Window Length = Segment = Span = 6

A C C T T G T C C T C T T T A C C T G C C G A A

A C G T T G A C C T G T A A C C T G C C G A T T
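Windowed scoring can be sketched as follows, using the two DNA sequences shown on this slide and a window of 6. The function names `window_score` and `windowed_dots` and the `min_match` threshold parameter are hypothetical, and the score here is a plain count of identities rather than a scoring table.

```python
def window_score(a, b, i, j, window):
    """Number of identical positions in the window starting at a[i], b[j]."""
    return sum(a[i + k] == b[j + k] for k in range(window))

def windowed_dots(a, b, window, min_match):
    """Start positions (i, j) whose window has at least min_match identities."""
    return [(i, j)
            for i in range(len(a) - window + 1)
            for j in range(len(b) - window + 1)
            if window_score(a, b, i, j, window) >= min_match]

seq_a = "ACCTTGTCCTCTTTACCTGCCGAA"
seq_b = "ACGTTGACCTGTAACCTGCCGATT"
print(window_score(seq_a, seq_b, 0, 0, 6))
print(len(windowed_dots(seq_a, seq_b, window=6, min_match=5)))
```

Raising `min_match` relative to the window length removes more of the random background, at the cost of missing weaker similarities.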


Genomics

RecA DNA sequence from Helicobacter pylori and Streptococcus mutans, window/match shown below each figure

[Figures: window/match = 2/2 and 4/4]


Genomics

RecA DNA sequence from Helicobacter pylori and Streptococcus mutans, window/match = 9/6


Genomics

RecA DNA sequence from Helicobacter pylori and Streptococcus mutans, window/match = 12/8


Genomics

Dot Matrix Plots

• What can you see in dotplots?

◦ Similar regions

◦ Repeated sequences

◦ Rearrangements

◦ RNA structures


Genomics

Dotplots/Repeats

[Figure labels: Repeat type 1, Repeat type 2; Length of repeat; # of repeats = # parallel lines]


Genomics

Drosophila Notch protein

EGF repeats

Lin12 repeats

“low entropy” sequences


Genomics

Repeated sequence in E. coli ribosomal protein S1

[Figure: dotplot with repeats labeled 1 2 3 4 5 6]

• Found only in gram-negative bacteria

• Ancient duplication of IF-1 like gene (6-fold)

• Common domain in other proteins such as polynucleotide phosphorylase

• Typically binds as trimer

• Repeats specialized after duplication


Genomics

Homology

• Homologous sequences show up as diagonal lines in dotplots

• Basis for methods to find homologous sequences


Lecture 3 – 29 August


• Week 1

◦ Introduction & motivation

◦ Tree thinking & motivation

– Tree thinking handout – Reading/Tree thinking; Baum, 2005

◦ Intro to sequence comparison

• Week 2 - 27 Aug – 31 Aug

◦ Monday – Alignments/Dynamic Programming

◦ Wednesday - Alignments /Scoring Systems

◦ Friday – Alignments /Scoring Systems


Genomics

Repeated sequence in E. coli ribosomal protein S1

• Found only in gram-negative bacteria

• What does this pattern mean?

[Figure: dotplot with repeats labeled 1 2 3 4 5 6]

• Ancient duplication of IF-1-like gene (6-fold)

• Common domain in other proteins such as polynucleotide phosphorylase

• Typically binds as trimer

• Repeats specialized after duplication

• Last repeat can be deleted


Sequence Comparison

Dotplots of Sequence Rearrangements

[Figure: dotplots of sequence rearrangements; axes labeled with segments A, B and C]


Sequence Comparison

Dotplots of Sequence Inversions

CTATTGGAGG AGAAGGCCGA GAGGAGCAGG ACGGCGGGAA GAGGAGTGCG GAACCCGCGG

GATAACCTCC TCTTCCGGCT CTCCTCGTCC TGCCGCCCTT CTCCTCACGC CTTGGGCGCC

CTATTGGAGG AGAAGGCCGA TTCCCGCCGT CCTGCTCCTC GAGGAGTGCG GAACCCGCGG

GATAACCTCC TCTTCCGGCT AAGGGCGGCA GGACGAGGAG CTCCTCACGC CTTGGGCGCC


Sequence Comparison

Dotplots of Sequence Inversions

GATAACCTCCTCTTCCGGCTAAGGGCGGCAGGACGAGGAGCTCCTCACGCCTTGGGCGCC

CTATTGGAGGAGAAGGCCGAGAGGAGCAGGACGGCGGGAAGAGGAGTGCGGAACCCGCGG

CTATTGGAGGAGAAGGCCGATTCCCGCCGTCCTGCTCCTCGAGGAGTGCGGAACCCGCGG

CTATTGGAGGAGAAGGCCGAGAGGAGCAGGACGGCGGGAAGAGGAGTGCGGAACCCGCGG

Original sequence vs inverted sequence

Original sequence vs inverted sequence reverse complement
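The inversion shown on these slides can be reproduced directly. In this sketch `reverse_complement` and `invert_segment` are illustrative names, and the coordinates 20 and 40 were read off the two 60-base sequences above; treat them as an assumption about where the inverted block lies.

```python
# Translation table for DNA complements (A<->T, G<->C).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the other strand of seq, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def invert_segment(seq, start, end):
    """Replace seq[start:end] with its reverse complement."""
    return seq[:start] + reverse_complement(seq[start:end]) + seq[end:]


original = "CTATTGGAGGAGAAGGCCGAGAGGAGCAGGACGGCGGGAAGAGGAGTGCGGAACCCGCGG"
inverted = invert_segment(original, 20, 40)
```

In a dotplot of `original` against `inverted`, the flanking regions match and the inverted block shows no forward diagonal. Against `reverse_complement(inverted)` the situation flips: the inverted block lines up with the original and the flanks do not.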


Sequence Comparison

Dotplot of base-paired RNA

[Figure: hairpin drawing of GACCGCTTACGGTC and its dotplot against the reverse complement GACCGTAAGCGGTC]

Red dots = base paired region

Dotplot of sequence vs reverse-complement (other strand of same sequence)
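The base-paired region can also be found without drawing the plot. The sketch below is a simplification: `stem_length` is a hypothetical helper that only folds the sequence end to end, uses the slide's DNA alphabet (T rather than U), allows Watson-Crick pairs only, and does not enforce a minimum loop size.

```python
# Watson-Crick pairs in the alphabet used on the slide.
PAIRS = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the other strand of seq, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def stem_length(seq):
    """Count nested base pairs formed when the first base pairs with the
    last, the second with the second-to-last, and so on."""
    i, j = 0, len(seq) - 1
    length = 0
    while i < j and (seq[i], seq[j]) in PAIRS:
        length += 1
        i += 1
        j -= 1
    return length


hairpin = "GACCGCTTACGGTC"
pairs = stem_length(hairpin)
```

For the slide's sequence the fold stops at the C/A mismatch in the middle, so `pairs` is 5. A sequence that pairs with itself this way shares a stretch with its own reverse complement, which is why the stem appears as a diagonal in that dotplot.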


1 VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.... 46

.||.::. | ..||||:|. .:.|.|.| |:| : |.| . |..|

1 GLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLK 50

. . . . .

47 ..DLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVD 94

| .:|.::| || .| .||.. : . :. ...:.:|.: || | ::.

51 SEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIP 100

. . . .

95 PVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR 141

. :.::|.|:: .|... |::|.:..::.::| |. ... :.|.|:

101 VKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYK 147

Sequence Comparison

Homology

• Homologous sequences show up as diagonal lines in dotplots

• Basis for methods to find homologous sequences


Sequence Comparison

Dotplots and Alignments

1 VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.... 46

.||.::. | ..||||:|. .:.|.|.| |:| : |.| . |..|

1 GLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLK 50

. . . . .

47 ..DLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVD 94

| .:|.::| || .| .||.. : . :. ...:.:|.: || | ::.

51 SEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIP 100

. . . .

95 PVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR 141

. :.::|.|:: .|... |::|.:..::.::| |. ... :.|.|:

101 VKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYK 147

An alignment is a one-to-one matching of two sequences, with the addition of gaps (spaces) to improve the matching
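The row of bars between two aligned sequences can be generated from the aligned strings themselves. This sketch marks identities only; the protein alignment above also uses `:` and `.` for similar residues, which would need a substitution matrix. The name `match_line` and the gap characters `.` and `-` are assumptions for the example.

```python
def match_line(top, bottom, gap_chars=".-"):
    """Return a string with '|' under identical residues and ' ' elsewhere.

    top and bottom must already be aligned, i.e. padded with gap
    characters to the same length.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    return "".join(
        "|" if a == b and a not in gap_chars else " "
        for a, b in zip(top, bottom)
    )


def show_alignment(top, bottom):
    """Three-line text view: top sequence, match line, bottom sequence."""
    return "\n".join([top, match_line(top, bottom), bottom])
```

For example, `print(show_alignment("AGTTACGGCAAA", "AGTTA..GCAAA"))` prints the two sequences with bars under every column except the two gap positions.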


Sequence Comparison

Measuring the difference between sequences

ACGGTTAGCAAA
||||||| ||||
ACGGTTACCAAA
Distance 1, Similarity 11

ACGGTTAGCAAA
|||| || ||||
ACGGATACCAAA
Distance 2, Similarity 10

ACGGTTAGCAAA
||  ||| ||||
ACCCTTACCAAA
Distance 3, Similarity 9

Sequences may have to be offset to optimize score

ACGGTTAGCAAA
 |       ||
CCTTACCAAAAC
Match Score 3

ACGGTTAGCAAA
 |  |    |||
 CCTTACCAAAAC
Match Score 5

ACGGTTAGCAAA
    ||| ||||
  CCTTACCAAAAC
Match Score 7

ACGGTTAGCAAA
     |  | ||
   CCTTACCAAAAC
Match Score 4
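Distance, similarity and the effect of an offset can be computed with a few helper functions. The names `hamming_distance`, `offset_score` and `best_offset` are illustrative, and the score is simply the number of identical positions in the overlap, with no penalty for the unpaired ends.

```python
def hamming_distance(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def offset_score(a, b, offset):
    """Number of matches when b is shifted right by offset positions
    relative to a (a negative offset shifts b to the left)."""
    return sum(
        1
        for i, x in enumerate(b)
        if 0 <= i + offset < len(a) and a[i + offset] == x
    )


def best_offset(a, b):
    """Try every offset that leaves at least one overlapping position
    and return the one with the highest match score."""
    offsets = range(-(len(b) - 1), len(a))
    return max(offsets, key=lambda k: offset_score(a, b, k))


a = "ACGGTTAGCAAA"
b = "CCTTACCAAAAC"
scores = [offset_score(a, b, k) for k in range(4)]
```

For the slide's pair, `scores` is `[3, 5, 7, 4]` and `best_offset(a, b)` returns 2. Similarity for an equal-length pair is the length minus `hamming_distance`.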


Sequence Comparison

Measuring the difference between sequences

• What if sequences need spaces, what should the score be?

AGTTACGGCAAA
|||||    |
AGTTAGCAAA

AGTTACGGCAAA
|||||
AGTTAGCAAACC

AGTTACGGCAAA
|||||  |||||
AGTTA..GCAAACC

ATCTAGCAG.T.C.A
  || | || | | |
  CT.G.AGCTCCCA

• Allowing gaps makes it easier to get a high score.

• Intuitively, there should be some negative score for gaps.

• Otherwise any pair of random sequences can get a high

score.


Sequence Comparison

Measuring the difference between sequences

• Sequence alignments use a scoring function based on the number of matches and mismatches, and a function based on the number of gaps

$\text{Match} = N_{\text{match}} - N_{\text{mismatch}} - f(\text{gap})$

• The score that unrelated sequences might get (on average) also matters
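The scoring function can be written out for a pair of already-aligned strings. The slide leaves f(gap) unspecified; this sketch assumes a linear penalty (a fixed cost per gap column) and uses `.` as the gap character, as in the examples on the previous slide. The name `alignment_score` is hypothetical.

```python
def alignment_score(top, bottom, gap_penalty=1, gap_char="."):
    """Score = N_match - N_mismatch - f(gap).

    f(gap) is assumed to be gap_penalty times the number of columns
    that contain a gap character. Both strings must have equal length.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(top, bottom):
        if a == gap_char or b == gap_char:
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches - mismatches - gap_penalty * gaps


# Heavily gapped alignment from the previous slide, padded to equal length.
free = alignment_score("ATCTAGCAG.T.C.A", "..CT.G.AGCTCCCA", gap_penalty=0)
paid = alignment_score("ATCTAGCAG.T.C.A", "..CT.G.AGCTCCCA", gap_penalty=1)
```

With free gaps the heavily gapped pair scores 8; charging one point per gap column drops it to 1, which is the intuition behind penalizing gaps.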


Sequence Comparison

Finding the best alignment

• Without gaps – just slide the two sequences past each other and choose the offset with the highest score

◦ Requires time proportional to the square of the length of the sequences ( $O(L^2)$ )

• With gaps

◦ For each offset,

– For each possible gap position

– For each possible gap length

– For each possible number of gaps

– Calculate score ( $O(L^L)$ )
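How quickly the number of gapped alignments grows can be illustrated by counting them. This is a sketch under one common counting convention (each column is a residue pair, a gap in the first sequence, or a gap in the second), which is an assumption and not the slide's $O(L^L)$ expression; the function name is hypothetical.

```python
def count_alignments(m, n):
    """Number of ways to align sequences of lengths m and n when every
    column is either a pair, a gap in the first sequence, or a gap in
    the second sequence."""
    # counts[i][j] = number of alignments of the first i and j residues.
    counts = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            counts[i][j] = (
                counts[i - 1][j]        # residue i against a gap
                + counts[i][j - 1]      # residue j against a gap
                + counts[i - 1][j - 1]  # residue i paired with residue j
            )
    return counts[m][n]


growth = [count_alignments(length, length) for length in range(1, 5)]
```

Under this convention `growth` is `[3, 13, 63, 321]`, so even very short sequences have far too many alignments to score one by one.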


Sequence Comparison

Dynamic Programming Alignment

• Dynamic programming allows an optimal (highest scoring) alignment that considers all possible numbers and lengths of gaps to be found in $O(L^2)$ time

• Dynamic programming uses a recursive definition of an optimal alignment

• Alignment is guaranteed to be "optimal"

◦ Given: the scoring system and gap penalties used

• Don't confuse optimal with correct - even unrelated sequences can be optimally aligned!
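The recursive definition can be sketched as a global alignment score in the style of Needleman-Wunsch. The parameters are assumptions chosen to mirror the scoring function from the earlier slide (+1 per match, -1 per mismatch, 1 point per gap position); the lecture may use a different scoring system, and this version returns only the score, not the alignment itself.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=1):
    """Best global alignment score of a and b.

    The best score for prefixes a[:i] and b[:j] is the maximum of three
    cases: pair a[i-1] with b[j-1], put a[i-1] against a gap, or put
    b[j-1] against a gap. Time is proportional to len(a) * len(b).
    """
    # prev[j] = best score for an empty prefix of a against b[:j].
    prev = [-gap * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [-gap * i]  # a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            pair = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = prev[j] - gap      # a[i-1] aligned to a gap
            gap_in_a = curr[j - 1] - gap  # b[j-1] aligned to a gap
            curr.append(max(pair, gap_in_b, gap_in_a))
        prev = curr
    return prev[len(b)]


best = global_alignment_score("AGTTACGGCAAA", "AGTTAGCAAA")
```

Here `best` is 8: ten matches minus a two-position gap, the same alignment shown on the earlier gap slide. The function returns a score for any pair of sequences, related or not, which is the sense in which optimal is not the same as correct.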