CAP5510 – Bioinformatics, Fall 2009. Tamer Kahveci, CISE Department, University of Florida.
CAP5510 – Bioinformatics, Fall 2009
Tamer Kahveci
CISE Department
University of Florida
Vital Information
• Instructor: Tamer Kahveci
• Office: E436
• Time: Mon/Wed/Thu 3:00 - 3:50 PM
• Office hours: Mon/Wed 2:00-2:50 PM
• TA: TBA
• Course page: http://www.cise.ufl.edu/~tamer/teaching/fall2009
Goals
• Understand the major components of bioinformatics data and how computer technology is used to understand this data better.
• Learn the main research problems in bioinformatics and gain background information on them.
This Course will
• Give you a feeling for the main issues in molecular biological computing: sequence, structure and function.
• Give you exposure to classic biological problems, as represented computationally.
• Encourage you to explore research problems and make a contribution.
This Course will not
• Teach you biology.
• Teach you programming.
• Teach you how to be an expert user of off-the-shelf molecular biology computer packages.
• Force you to make a novel contribution to bioinformatics.
Course Outline
• Introduction to terminology
• Biological sequences
• Sequence comparison
– Lossless alignment (DP)
– Lossy alignments (BLAST, etc.)
• Substitution matrices, statistics
• Multiple alignment
• Phylogeny
• Protein structures and function (primary, secondary, etc.)
• Structure alignment
• Structure prediction ?
• Pathways
Grading
• Homeworks (35 %)
• Project (50 %)
– Contribution (2.5 % bonus)
• Survey (15 %)
How can I get an A?
Bioinformatics Daily: First homework is posted
Expectations
• Require
– Data structures and algorithms
– Coding (C, Java)
• Encourage
– Actively participate in discussions in the classroom
– Read bioinformatics literature in general
– Attend colloquiums on campus
• Academic honesty
Text Book
• Not required, but recommended.
• Class notes + papers.
Where to Look ?
• Journals
– Bioinformatics
– Genome Research
– Nucleic Acids Research
– Journal of Computational Biology
– Protein Science
• Conferences
– RECOMB
– ISMB
– PSB
– CSB
– VLDB, ICDE, SIGMOD
What is Bioinformatics?
• Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned. There are three important sub-disciplines within bioinformatics:
– the development of new algorithms and statistics with which to assess relationships among members of large data sets
– the analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures
– the development and implementation of tools that enable efficient access and management of different types of information.
From NCBI (National Center for Biotechnology Information): http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/milestones.html
Does biology have anything to do with computer science?
Challenges 1/6
• Data diversity
– DNA (ATCCAGAGCAG)
– Protein sequences (MHPKVDALLSR)
– Protein structures
– Microarrays
– Pathways
– Bio-images
– Time series
Challenges 2/6
• Database diversity
– GenBank, SwissProt, …
– PDB, Prosite, …
– KEGG, EcoCyc, MetaCyc, …
Challenges 3/6
• Database size
– GenBank: As of August 2009, there are over 85,759,586,764 bases.
– 400 K protein sequences, each about 300 long
– 50 K protein structures in PDB, 400 K in ModBase.
"Genome sequences now accumulate so quickly that, in less than a week, a single laboratory can produce more bits of data than Shakespeare managed in a lifetime, although the latter make better reading."
-- G A Pekso, Nature 401: 115-116 (1999)
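To put the sizes quoted on this slide into storage terms, here is a rough back-of-the-envelope sketch. The encodings (2 bits per DNA base, 1 byte per amino acid) are assumptions chosen for illustration; the slides do not say how the databases store their data.

```python
# Figures quoted on the slide (GenBank as of August 2009).
GENBANK_BASES = 85_759_586_764
PROTEIN_SEQUENCES = 400_000
AVG_PROTEIN_LENGTH = 300


def packed_dna_bytes(num_bases: int) -> int:
    """Bytes needed if each base (A, C, G, T) is packed into 2 bits (assumed encoding)."""
    return (num_bases * 2 + 7) // 8  # round up to whole bytes


def protein_bytes(num_sequences: int, avg_length: int) -> int:
    """Bytes needed at one byte per residue (assumed encoding)."""
    return num_sequences * avg_length


dna_gb = packed_dna_bytes(GENBANK_BASES) / 1e9
protein_mb = protein_bytes(PROTEIN_SEQUENCES, AVG_PROTEIN_LENGTH) / 1e6
print(f"DNA at 2 bits/base: {dna_gb:.1f} GB")
print(f"Proteins at 1 byte/residue: {protein_mb:.0f} MB")
```

Even under this compact encoding the raw sequence data alone runs to tens of gigabytes, before any annotation or index is stored.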
Challenges 4/6
• Moore’s Law Matched by Growth of Data
• CPU vs Disk
– As important as the increase in computer speed has been, the ability to store large amounts of information on computers is even more crucial
[Figure: Structures in PDB (Num. Protein Domain Structures, 1980-1995) and CPU Instruction Time (ns, 1979-1995)]
Challenges 5/6
• Deciphering the code
– Within same data type: hard
– Across data types: harder
caacaagccaaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtggcgagatatctcttggaaaaactttcaagagcaactcaatcaactttctcgagcattgcttgctcacaatattgacgtacaagataaaatcgccatttttgcccataatatggaacgttgggttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgactacaatcgttgacattgcgaccttacaaattcgagcaatcacagtgcctatttacgcaaccaatacagcccagcaagcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtcggcgatcaagagcaatacgatcaaacattggaaattgctcatcattgtccaaaattacaaaaaattgtagcaatgaaatccaccattcaattacaacaagatcctctttcttgcacttgg
atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatcgtattccgtgcagcacaacaccgtgatgacattgaagttgtaggtattaacgacttaatcgacgttgaatacatggcttatatgttgaaatatgattcaactcacggtcgtttcgacggcactgttgaagtgaaagatggtaacttagtggttaatggtaaaactatccgtgtaactgcagaacgtgatccagcaaacttaaactggggtgcaatcggtgttgatatcgctgttgaagcgactggtttattcttaactgatgaaactgctcgtaaacatatcactgcaggcgcaaaaaaagttgtattaactggcccatctaaagatgcaacccctatgttcgttcgtggtgtaaacttcaacgcatacgcaggtcaagatatcgtttctaacgcatcttgtacaacaaactgtttagctcctttagcacgtgttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgact
Challenges 6/6
• Inaccuracy
• Redundancy
What is the Real Solution?
We need better computational methods
• Compact summarization
• Fast and accurate analysis of data
• Efficient indexing
A Gentle Introduction to Molecular Biology
Goals
• Understand the major components of biological data
– DNA, protein sequences, expression arrays, protein structures
• Get familiar with basic terminology
• Learn commonly used data formats
Genetic Material: DNA
• Deoxyribonucleic Acid, 1950s
– Basis of inheritance
– Eye color, hair color, …
• 4 nucleotides
– A, C, G, T
Chemical Structure of Nucleotides
Purines
Pyrimidines
Making of Long Chains
5’ -> 3’
DNA structure
• Double stranded, helix (Watson & Crick)
• Complementary
– A-T
– G-C
• Antiparallel
– 5’ -> 3’ (downstream)
– 3’ -> 5’ (upstream)
• Animation (ch3.1)
Base Pairs
Question
• 5’ - GTTACA – 3’
• 5’ – XXXXXX – 3’ ?
• 5’ – TGTAAC – 3’
• Reverse complements.
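The answer to the question above can be checked with a few lines of code. This is a minimal sketch, assuming sequences are plain strings over A, C, G, T; the function name `reverse_complement` is chosen here for illustration.

```python
# Watson-Crick pairing: A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse so the result also reads 5' -> 3'."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("GTTACA"))  # TGTAAC
```

Applying the function twice returns the original strand, which is a quick sanity check on any implementation.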
Repetitive DNA
• Tandem repeats: highly repetitive
– Satellites (100 k – 1 Gbp) / (a few hundred bp)
– Mini satellites (1 k – 20 kbp) / (9 – 80 bp)
– Micro satellites (< 150 bp) / (1 – 6 bp)
– DNA fingerprinting
• Interspersed repeats: moderately repetitive
– LINE
– SINE
• Proteins contain repetitive patterns too
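As an illustration of what a tandem repeat looks like computationally, the sketch below scans a sequence for exact repeats whose unit is 1 to 6 bp long, the micro satellite range listed above. It is a naive regular-expression scan, not a published repeat finder, and the `MIN_COPIES` threshold is an assumption made for this example.

```python
import re

# Assumed threshold: report a unit only if it occurs at least this many times in a row.
MIN_COPIES = 4

# Group 1 is the shortest candidate unit (1-6 bp); \1{n,} demands further exact copies.
TANDEM = re.compile(r"([ACGT]{1,6}?)\1{%d,}" % (MIN_COPIES - 1))


def find_tandem_repeats(seq):
    """Return (start, unit, copies) for each exact tandem repeat found."""
    hits = []
    for match in TANDEM.finditer(seq.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


print(find_tandem_repeats("TTCACACACACAGG"))  # [(2, 'CA', 5)]
```

Real repeats are rarely exact, so practical tools tolerate mismatches and gaps; this sketch only finds perfect copies.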
Genetic Material: an Analogy
• Nucleotide => letter
• Gene => sentence
• Contig => chapter
• Chromosome => book
– Gender, hair/eye color, …
– Disorders: Down syndrome, Turner syndrome, …
• http://gslc.genetics.utah.edu/units/disorders/karyotype/
– Chromosome number varies between species
• http://www.web-books.com/MoBio/Free/Ch1C2.htm
– We have 46 (23 + 23) chromosomes
• http://www.web-books.com/MoBio/Free/Ch1C5.htm
• Complete genome => volumes of encyclopedia
• Hershey & Chase experiment shows that DNA is the genetic material. (ch14)
Functions of Genes 1/2
• Signal transduction: sensing a physical signal and turning it into a chemical signal
• Structural support: creating the shape and pliability of a cell or set of cells
• Enzymatic catalysis: accelerating chemical transformations that are otherwise too slow.
• Transport: getting things into and out of separated compartments
– Animation (ch 5.2)
Functions of Genes 2/2
• Movement: contracting in order to pull things together or push things apart.
• Transcription control: deciding when other genes should be turned ON/OFF
– Animation (ch7)
• Trafficking: affecting where different elements end up inside the cell
Central Dogma
Introns and Exons 1/2
Introns and Exons 2/2
• Humans have about 35,000 genes = 40,000,000 DNA bases = 3% of total DNA in genome.
• Remaining 2,960,000,000 bases for control information. (e.g. when, where, how long, etc...)
Central dogma
DNA (Genotype) -> Gene expression -> Protein -> Phenotype
Gene Expression
• Building proteins from DNA
– Promoter sequence: marks the start of a gene, 13 nucleotides.
• Positive regulation: proteins that bind to DNA near promoter sequences increase transcription.
• Negative regulation
Microarray
Animation on creating microarrays
Amino Acids
• 20 different amino acids
– ACDEFGHIKLMNPQRSTVWY but not BJOUXZ
• ~300 amino acids in an average protein, ~400 K known protein sequences
• How many nucleotides can encode one amino acid?
– 4^2 < 20 < 4^3
– E.g., Q (glutamine) = CAG
– Degeneracy
– Triplet code (codon)
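The counting argument above (two nucleotides give only 16 combinations, three give 64) and the resulting degeneracy can be made concrete with a small sketch. It builds the standard genetic code from the usual compact 64-letter string in TCAG order; the helper names `min_codon_length` and `codons_for` are chosen here for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def min_codon_length(alphabet_size=4, num_amino_acids=20):
    """Smallest k such that alphabet_size**k covers all amino acids."""
    k = 1
    while alphabet_size ** k < num_amino_acids:
        k += 1
    return k


def codons_for(amino_acid):
    """All codons that encode the given one-letter amino acid (degeneracy)."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)


print(min_codon_length())   # 3
print(CODON_TABLE["CAG"])   # Q
print(codons_for("Q"))      # ['CAA', 'CAG']
```

Since 64 codons map onto 20 amino acids plus stop signals, most amino acids are encoded by more than one codon, which is what degeneracy refers to.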
Triplet Code
Molecular Structure of Amino Acid
Side Chain
• Non-polar, Hydrophobic (G, A, V, L, I, M, F, W, P)
• Polar, Hydrophilic (S, T, C, Y, N, Q)
• Electrically charged (D, E, K, R, H)
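The three side-chain groups listed on this slide map naturally onto a lookup table. The sketch below uses exactly the grouping given above and tallies the classes in the example protein fragment MHPKVDALLSR from the data-diversity slide; the names `SIDE_CHAIN_CLASSES`, `classify` and `class_counts` are hypothetical.

```python
# Grouping exactly as listed on the slide (one-letter amino acid codes).
SIDE_CHAIN_CLASSES = {
    "non-polar": set("GAVLIMFWP"),
    "polar": set("STCYNQ"),
    "charged": set("DEKRH"),
}


def classify(residue):
    """Return the side-chain class of a one-letter amino acid code."""
    for name, members in SIDE_CHAIN_CLASSES.items():
        if residue.upper() in members:
            return name
    raise ValueError(f"not one of the 20 standard amino acids: {residue!r}")


def class_counts(protein):
    """Count how many residues of a protein fall into each class."""
    counts = {name: 0 for name in SIDE_CHAIN_CLASSES}
    for residue in protein:
        counts[classify(residue)] += 1
    return counts


print(class_counts("MHPKVDALLSR"))
# {'non-polar': 6, 'polar': 1, 'charged': 4}
```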
Peptide Bonds
Direction of Protein Sequence
Animation on protein synthesis (ch15)
Data Format
• GenBank
• EMBL (European Mol. Biol. Lab.)
• SwissProt
• FASTA
• NBRF (Nat. Biomedical Res. Foundation)
• Others
– IG, GCG, Codata, ASN, GDE, Plain ASCII
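Of the formats listed, FASTA is the simplest: each record is a header line starting with `>` followed by one or more lines of sequence. A minimal reader might look like the sketch below. The two records in `example` are made up for illustration, and real files may carry conventions this sketch ignores.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical records, reusing the example sequences from the earlier slides.
example = """>seq1 hypothetical DNA record
ATCCAG
AGCAG
>seq2 hypothetical protein record
MHPKVDALLSR
"""

for header, sequence in parse_fasta(example):
    print(header, len(sequence))
```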
Primary Structure of Proteins
[Figure: polypeptide backbone with angles phi1, psi1, phi2, …]
• 2N angles
Secondary Structure: Alpha Helix
• 1.5 Å translation
• 100 degree rotation
• Phi = -60
• Psi = -60
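The translation and rotation figures above are per residue, and together they determine the overall geometry of the helix. A short worked calculation, using only the two numbers from the slide:

```python
RISE_PER_RESIDUE = 1.5        # angstroms of translation per residue (from the slide)
ROTATION_PER_RESIDUE = 100.0  # degrees of rotation per residue (from the slide)

# One full turn is 360 degrees.
residues_per_turn = 360.0 / ROTATION_PER_RESIDUE
# Pitch: distance travelled along the helix axis in one full turn.
pitch = residues_per_turn * RISE_PER_RESIDUE


def helix_length(num_residues):
    """Length along the axis, in angstroms, of an ideal helix of num_residues."""
    return num_residues * RISE_PER_RESIDUE


print(f"{residues_per_turn:.1f} residues per turn")   # 3.6 residues per turn
print(f"{pitch:.1f} angstrom pitch")                  # 5.4 angstrom pitch
```

These are the familiar textbook values for the alpha helix: 3.6 residues per turn and a pitch of 5.4 Å.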
Secondary Structure: Beta sheet
[Figure: anti-parallel and parallel beta sheets]
• Phi = -135
• Psi = 135
Tertiary Structure
• 3-d structure of a polypeptide sequence
– interactions between non-local atoms
[Figure: tertiary structure of myoglobin]
Quaternary Structure
• Arrangement of protein subunits
[Figures: quaternary structure of Cro; human hemoglobin tetramer]
Structure Summary
• 3-d structure determined by protein sequence
• Prediction remains a challenge
• Diseases caused by misfolded proteins
– Mad cow disease
• Classification of protein structure
STOP
Next Week:
• Basic sequence comparison
• Dynamic programming methods
– Global/local alignment
– Gaps