CAP5510 – Bioinformatics, Fall 2009. Tamer Kahveci, CISE Department, University of Florida.
1
CAP5510 – Bioinformatics, Fall 2009
Tamer Kahveci
CISE Department
University of Florida
2
Vital Information
• Instructor: Tamer Kahveci
• Office: E436
• Time: Mon/Wed/Thu 3:00 - 3:50 PM
• Office hours: Mon/Wed 2:00-2:50 PM
• TA: TBA
• Course page: http://www.cise.ufl.edu/~tamer/teaching/fall2009
3
Goals
• Understand the major components of bioinformatics data and how computer technology is used to understand this data better.
• Learn the main open research problems in bioinformatics and gain the background needed to approach them.
4
This Course will
• Give you a feeling for the main issues in molecular biological computing: sequence, structure and function.
• Give you exposure to classic biological problems, as represented computationally.
• Encourage you to explore research problems and make a contribution.
5
This Course will not
• Teach you biology.
• Teach you programming.
• Teach you how to be an expert user of off-the-shelf molecular biology computer packages.
• Force you to make a novel contribution to bioinformatics.
6
Course Outline
• Introduction to terminology
• Biological sequences
• Sequence comparison
– Lossless alignment (DP)
– Lossy alignments (BLAST, etc.)
• Substitution matrices, statistics
• Multiple alignment
• Phylogeny
• Protein structures and function (primary, secondary, etc.)
• Structure alignment
• Structure prediction ?
• Pathways
7
Grading
• Homeworks (35 %)
• Project (50 %)
– Contribution (2.5 % bonus)
• Survey (15 %)
How can I get an A?
Bioinformatics Daily: First homework is posted
8
Expectations
• Require
– Data structures and algorithms
– Coding (C, Java)
• Encourage
– Actively participate in discussions in the classroom
– Read bioinformatics literature in general
– Attend colloquiums on campus
• Academic honesty
9
Text Book
• Not required, but recommended.
• Class notes + papers.
10
Where to Look ?
• Journals
– Bioinformatics
– Genome Research
– Nucleic Acids Research
– Journal of Computational Biology
– Protein Science
• Conferences
– RECOMB
– ISMB
– PSB
– CSB
– VLDB, ICDE, SIGMOD
11
What is Bioinformatics?
• Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned. There are three important sub-disciplines within bioinformatics:
– the development of new algorithms and statistics with which to assess relationships among members of large data sets
– the analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures
– the development and implementation of tools that enable efficient access and management of different types of information.
From NCBI (National Center for Biotechnology Information)
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/milestones.html
12
Does biology have anything to do with computer science?
13
Challenges 1/6
• Data diversity
– DNA (ATCCAGAGCAG)
– Protein sequences (MHPKVDALLSR)
– Protein structures
– Microarrays
– Pathways
– Bio-images
– Time series
14
Challenges 2/6
• Database diversity
– GenBank, SwissProt, …
– PDB, Prosite, …
– KEGG, EcoCyc, MetaCyc, …
15
Challenges 3/6
• Database size
– GenBank: as of August 2009, over 85,759,586,764 bases.
– 400 K protein sequences, each about 300 amino acids long.
– 50 K protein structures in PDB, 400 K in ModBase.
Genome sequences now accumulate so quickly that, in less than a week, a single laboratory can produce more bits of data than Shakespeare managed in a lifetime, although the latter make better reading.
-- G A Pekso, Nature 401: 115-116 (1999)
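To put the sizes above in perspective, here is a rough back-of-the-envelope sketch. It assumes a naive packing of 2 bits per base (only A, C, G, T, no annotation or compression), which is an illustrative assumption and not how GenBank actually stores records.

```python
# Figures quoted on the slide (GenBank as of August 2009; protein sequences).
GENBANK_BASES = 85_759_586_764
PROTEIN_SEQUENCES = 400_000
AVG_PROTEIN_LENGTH = 300

# Assumption: 2 bits per base (A, C, G, T only), no metadata, no compression.
BITS_PER_BASE = 2


def packed_size_bytes(num_bases: int, bits_per_base: int = BITS_PER_BASE) -> int:
    """Bytes needed to store num_bases at a fixed bit width, rounded up."""
    return (num_bases * bits_per_base + 7) // 8


genbank_bytes = packed_size_bytes(GENBANK_BASES)
total_residues = PROTEIN_SEQUENCES * AVG_PROTEIN_LENGTH

print(f"GenBank bases, 2-bit packed: about {genbank_bytes / 1e9:.1f} GB")
print(f"Total protein residues: {total_residues:,}")
```

Even under this optimistic packing the raw bases alone need tens of gigabytes, which is why the later slides stress summarization and indexing.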
16
Challenges 4/6
• Moore’s Law matched by growth of data
• CPU vs disk
– As important as the increase in computer speed has been, the ability to store large amounts of information on computers is even more crucial
[Figure: number of protein domain structures in PDB and CPU instruction time (ns), plotted by year, 1979-1995]
17
Challenges 5/6
• Deciphering the code
– Within same data type: hard
– Across data types: harder
caacaagccaaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtggcgagatatctcttggaaaaactttcaagagcaactcaatcaactttctcgagcattgcttgctcacaatattgacgtacaagataaaatcgccatttttgcccataatatggaacgttgggttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgactacaatcgttgacattgcgaccttacaaattcgagcaatcacagtgcctatttacgcaaccaatacagcccagcaagcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtcggcgatcaagagcaatacgatcaaacattggaaattgctcatcattgtccaaaattacaaaaaattgtagcaatgaaatccaccattcaattacaacaagatcctctttcttgcacttgg
atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatcgtattccgtgcagcacaacaccgtgatgacattgaagttgtaggtattaacgacttaatcgacgttgaatacatggcttatatgttgaaatatgattcaactcacggtcgtttcgacggcactgttgaagtgaaagatggtaacttagtggttaatggtaaaactatccgtgtaactgcagaacgtgatccagcaaacttaaactggggtgcaatcggtgttgatatcgctgttgaagcgactggtttattcttaactgatgaaactgctcgtaaacatatcactgcaggcgcaaaaaaagttgtattaactggcccatctaaagatgcaacccctatgttcgttcgtggtgtaaacttcaacgcatacgcaggtcaagatatcgtttctaacgcatcttgtacaacaaactgtttagctcctttagcacgtgttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgact
18
Challenges 6/6
• Inaccuracy
• Redundancy
19
What is the Real Solution?
We need better computational methods
• Compact summarization
• Fast and accurate analysis of data
• Efficient indexing
20
A Gentle Introduction to Molecular Biology
21
Goals
• Understand major components of biological data
– DNA, protein sequences, expression arrays, protein structures
• Get familiar with basic terminology
• Learn commonly used data formats
22
Genetic Material: DNA
• Deoxyribonucleic Acid, 1950s
– Basis of inheritance
– Eye color, hair color, …
• 4 nucleotides
– A, C, G, T
23
Chemical Structure of Nucleotides
Purines
Pyrimidines
24
Making of Long Chains
5’ -> 3’
25
DNA structure
• Double stranded, helix (Watson & Crick)
• Complementary
– A-T
– G-C
• Antiparallel
– 5’ -> 3’ (downstream)
– 3’ -> 5’ (upstream)
• Animation (ch3.1)
26
Base Pairs
27
Question
• 5’ - GTTACA - 3’
• 5’ - XXXXXX - 3’ ?
• 5’ - TGTAAC - 3’
• The two strands are reverse complements of each other.
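The answer to the question above can be checked with a minimal sketch: complement each base using the A-T and G-C pairing rules, then reverse the result so that it reads 5’ to 3’. The function name is illustrative, not taken from the course material.

```python
# Watson-Crick pairing: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Return the opposite strand of seq, written 5' to 3'."""
    # Walk the input from its 3' end and emit the pairing base for each position.
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


print(reverse_complement("GTTACA"))  # TGTAAC
```

Applying the function twice returns the original strand, which is a quick sanity check on any implementation.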
28
Repetitive DNA
• Tandem repeats: highly repetitive
– Satellites (100 k – 1 Gbp) / (a few hundred bp)
– Mini satellites (1 k – 20 kbp) / (9 – 80 bp)
– Micro satellites (< 150 bp) / (1 – 6 bp)
– DNA fingerprinting
• Interspersed repeats: moderately repetitive
– LINE
– SINE
• Proteins contain repetitive patterns too
29
Genetic Material: an Analogy
• Nucleotide => letter
• Gene => sentence
• Contig => chapter
• Chromosome => book
– Gender, hair/eye color, …
– Disorders: Down syndrome, Turner syndrome, …
• http://gslc.genetics.utah.edu/units/disorders/karyotype/
– Chromosome number varies across species
• http://www.web-books.com/MoBio/Free/Ch1C2.htm
– We have 46 (23 + 23) chromosomes
• http://www.web-books.com/MoBio/Free/Ch1C5.htm
• Complete genome => volumes of encyclopedia
• The Hershey & Chase experiment shows that DNA is the genetic material. (ch14)
30
Functions of Genes 1/2
• Signal transduction: sensing a physical signal and turning it into a chemical signal
• Structural support: creating the shape and pliability of a cell or set of cells
• Enzymatic catalysis: accelerating chemical transformations otherwise too slow.
• Transport: getting things into and out of separated compartments
– Animation (ch 5.2)
31
Functions of Genes 2/2
• Movement: contracting in order to pull things together or push things apart.
• Transcription control: deciding when other genes should be turned ON/OFF
– Animation (ch7)
• Trafficking: affecting where different elements end up inside the cell
32
Central Dogma
33
Introns and Exons 1/2
34
Introns and Exons 2/2
• Humans have about 35,000 genes, spanning about 40,000,000 DNA bases, or roughly 1.3% of the 3,000,000,000 bases in the genome.
• The remaining 2,960,000,000 bases carry control information (e.g. when, where, how long, etc.).
35
Central dogma
DNA (Genotype) -> Gene expression -> Protein (Phenotype)
36
Gene Expression
• Building proteins from DNA
– Promoter sequence: marks the start of a gene (13 nucleotides).
• Positive regulation: proteins that bind to DNA near promoter sequences increase transcription.
• Negative regulation
37
Microarray
Animation on creating microarrays
38
Amino Acids
• 20 different amino acids
– ACDEFGHIKLMNPQRSTVWY but not BJOUXZ
• ~300 amino acids in an average protein, ~400 K known protein sequences
• How many nucleotides can encode one amino acid ?
– 4^2 < 20 < 4^3
– E.g., Q (glutamine) = CAG
– Degeneracy
– Triplet code (codon)
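The counting argument above can be sketched in a few lines: find the smallest word length k over a 4-letter alphabet that yields at least 20 distinct words, then enumerate them. The codon excerpt is a small hand-picked subset of the standard genetic code, included only to illustrate degeneracy; it is not the full table.

```python
from itertools import product

NUCLEOTIDES = "ACGT"
NUM_AMINO_ACIDS = 20


def min_codon_length(alphabet_size: int, symbols_needed: int) -> int:
    """Smallest k such that alphabet_size**k >= symbols_needed."""
    k = 1
    while alphabet_size ** k < symbols_needed:
        k += 1
    return k


k = min_codon_length(len(NUCLEOTIDES), NUM_AMINO_ACIDS)
codons = ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]

# Excerpt of the standard genetic code (DNA alphabet), for illustration only:
# glutamine (Q) has two codons, methionine (M) and tryptophan (W) have one each.
CODON_EXCERPT = {"CAA": "Q", "CAG": "Q", "ATG": "M", "TGG": "W"}
glutamine_codons = sorted(c for c, aa in CODON_EXCERPT.items() if aa == "Q")

print(k, len(codons), glutamine_codons)  # 3 64 ['CAA', 'CAG']
```

With 64 codons for 20 amino acids (plus stop signals), several codons must map to the same amino acid, which is what degeneracy refers to.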
39
Triplet Code
40
Molecular Structure of Amino Acid
Side Chain
• Non-polar, Hydrophobic (G, A, V, L, I, M, F, W, P)
• Polar, Hydrophilic (S, T, C, Y, N, Q)
• Electrically charged (D, E, K, R, H)
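As a small illustration, the side-chain grouping on this slide can be turned into a lookup table and used to profile a sequence. The class labels and function name are illustrative choices; the grouping itself is the one given on the slide, and the example sequence is the protein fragment shown earlier in the deck.

```python
from collections import Counter

# Grouping exactly as listed on the slide (one-letter amino acid codes).
SIDE_CHAIN_CLASSES = {
    "non-polar": "GAVLIMFWP",
    "polar": "STCYNQ",
    "charged": "DEKRH",
}

# Invert the grouping into a per-residue lookup table.
CLASS_OF = {
    aa: name for name, members in SIDE_CHAIN_CLASSES.items() for aa in members
}


def side_chain_profile(protein: str) -> Counter:
    """Count how many residues of a protein fall into each side-chain class."""
    return Counter(CLASS_OF[aa] for aa in protein.upper())


profile = side_chain_profile("MHPKVDALLSR")
print(dict(profile))
```

The three groups together cover all 20 amino acids, so every valid one-letter code has exactly one class in the table.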
41
Peptide Bonds
42
Direction of Protein Sequence
Animation on protein synthesis (ch15)
43
Data Format
• GenBank
• EMBL (European Mol. Biol. Lab.)
• SwissProt
• FASTA
• NBRF (Nat. Biomedical Res. Foundation)
• Others
– IG, GCG, Codata, ASN, GDE, Plain ASCII
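Of the formats listed, FASTA is the simplest: each record is a header line starting with ">" followed by one or more lines of sequence. The parser below is a minimal sketch under that assumption; the record names in the example are hypothetical, and real files may need extra handling (duplicate headers, comments, very large inputs).

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA header (without the leading '>') to its full sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # A sequence may be wrapped over several lines; join them.
            records[header] += line
    return records


# Hypothetical records built from the example sequences shown earlier in the deck.
example = """>seq1 hypothetical DNA fragment
ATCCAGAGCAG
>seq2 hypothetical protein fragment
MHPKVD
ALLSR
""".splitlines()

records = parse_fasta(example)
for name, sequence in records.items():
    print(name, len(sequence))
```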
44
Primary Structure of Proteins
[Figure: polypeptide backbone with dihedral angles phi1, psi1, phi2, …; a chain of N residues is described by 2N angles]
45
Secondary Structure: Alpha Helix
• 1.5 Å translation per residue
• 100 degree rotation per residue
• Phi = -60
• Psi = -60
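The two per-residue numbers above determine the overall helix geometry. A short worked sketch, using only the slide's values, with illustrative variable names:

```python
# Per-residue values from the slide.
RISE_PER_RESIDUE_ANGSTROM = 1.5
ROTATION_PER_RESIDUE_DEG = 100.0

# A full turn is 360 degrees, so the number of residues per turn follows directly.
residues_per_turn = 360.0 / ROTATION_PER_RESIDUE_DEG

# Pitch: distance travelled along the helix axis in one full turn.
pitch_angstrom = residues_per_turn * RISE_PER_RESIDUE_ANGSTROM


def helix_length_angstrom(num_residues: int) -> float:
    """Approximate axial length of an ideal alpha helix with num_residues residues."""
    return num_residues * RISE_PER_RESIDUE_ANGSTROM


print(f"{residues_per_turn:.1f} residues per turn, pitch {pitch_angstrom:.1f} Å")
print(f"A 10-residue helix spans about {helix_length_angstrom(10):.1f} Å")
```

This reproduces the commonly quoted figures of 3.6 residues per turn and a pitch of 5.4 Å.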
46
Secondary Structure: Beta sheet
[Figure panels: anti-parallel, parallel]
• Phi = -135
• Psi = 135
48
Tertiary Structure
• 3-d structure of a polypeptide sequence
– interactions between non-local atoms
[Figure: tertiary structure of myoglobin]
49
Quaternary Structure
• Arrangement of protein subunits
[Figures: quaternary structure of Cro; human hemoglobin tetramer]
50
Structure Summary
• 3-d structure determined by protein sequence
• Prediction remains a challenge
• Diseases caused by misfolded proteins
– Mad cow disease
• Classification of protein structure
51
STOP
Next Week:
• Basic sequence comparison
• Dynamic programming methods
– Global/local alignment
– Gaps