EECS 730 Introduction to Bioinformatics

Luke Huan
Electrical Engineering and Computer Science

http://people.eecs.ku.edu/~jhuan/

2012/8/21 EECS 730 2

About EECS 730

EECS 730: Introduction to Bioinformatics
Meeting time: M/W/F 11:00 - 11:50
Room: 3153 Learned Hall
Course home page: http://people.eecs.ku.edu/~jhuan/EECS730_F12
* Make sure you check the course website regularly.

About EECS 730
Instructor: Prof. Luke Huan
Email: [email protected]
Office: Room 2034, Eaton Hall
Office hours: M/W 10:00 – 11:00 or by appointment

Introduce Yourself
Your name
Your major
Your background
Your research interests
Why you study bioinformatics
Your expectations from this course
Other

Expected Background

Algorithms, data structures, programming (EECS 560)
Statistics: good if you’ve had at least one course, but not required
  We will cover the necessary statistics background
Molecular biology (BIOL 150): no knowledge assumed, but an interest in learning some basic molecular biology is mandatory

Course Objective
Learn algorithms and databases in bioinformatics
Gain knowledge of and hands-on experience with bioinformatics tools
Understand the interaction between computer science and modern biology within the context of data-driven knowledge discovery
Understand the important computational problems in biology
Combine theory and algorithms to help you solve research problems
Learn the art of how to turn bytes, bits, and flops into scientific knowledge (in the biological domain)

Textbook
No required textbook
Bioinformatics and Functional Genomics, by Jonathan Pevsner (Wiley, 2003)
  The textbook website is http://www.bioinfbook.org
  This has 1000 URLs, organized by chapter
Some reading assignments may be in the form of papers.

Some Good Reference Books (not a comprehensive list)
Supplementary recommended reading:
Biological Sequence Analysis, by R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Cambridge, 1st edition, 1999, ISBN-10: 0521629713
Bioinformatics: Sequence and Genome Analysis, by David Mount, Cold Spring Harbor Laboratory Press, 1st edition, 2001, ISBN-10: 0879696087
All of Statistics: A Concise Course in Statistical Inference, by Larry Wasserman, Springer, 2004, ISBN-10: 0387402721, ISBN-13: 978-0387402727
An Introduction to Bioinformatics Algorithms, by Neil C. Jones and Pavel A. Pevzner, MIT Press, 2004
Molecular Biology of the Cell, by B. Alberts et al., 4th edition, 2002

Course Requirement

Background survey: 1%
Homeworks: 20%
Midterm exams (2): 40%
Projects: 19%
Paper presentation: 10%
Class participation: 10%
Total: 100 pts

Grading Policy

Cutoffs for grades (roughly)
A: 90 – 100
B: 80 – 90
C: 70 – 80
D: 60 – 70
F: 0 – 60

Classroom Attendance
I expect you to come to lectures on a regular basis.
While you are in the classroom, please show courtesy to your classmates.
  If you need to leave early, consider sitting close to the door
  No cell phone conversations during class
You are responsible for all announcements made in class.
Class participation is strongly encouraged.

Academic Integrity Policy
The work you turn in must be your own!
If you get help from others, you need to acknowledge the help on the work you hand in.
Always cite the references you use.
Consequences of cheating
  First time: a loss of one letter grade in the course and referral to the department chairman and the dean of engineering.
  Second time: a dismissal hearing may be initiated by the dean of engineering.

What is Bioinformatics

Interface of biology and computers
Analysis of proteins, genes and genomes using computer algorithms and computer databases
Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data (NIH definition)

The need for bioinformatics
The number of entries in biological databases is increasing exponentially. Bioinformatics is needed to understand and use this information.

[Figure: GenBank growth, 1982 to 2004. Two series plotted by year: residues and records.]

What is Bioinformatics
Representation/storage/retrieval/analysis of biological data concerning
  Sequences
  Structures
  Functions
Sometimes used synonymously with computational biology or computational molecular biology
Highly interdisciplinary nature
  Biology, mathematics, statistics, computer science, biochemistry, physics, chemistry, medicine, …

Promises of Bioinformatics
Medicine
  Knowledge of protein structure facilitates drug design
  Understanding of genomic variation allows the tailoring of medical treatment to the individual’s genetic make-up
  Genome analysis allows the targeting of genetic diseases
  The effect of a disease or of a therapeutic on RNA and protein levels can be elucidated
The same techniques can be applied to biotechnology, crop and livestock improvement, etc.

Challenges in bioinformatics
Explosion of information
  Need for faster, automated analysis to process large amounts of data
  Need for integration between different types of information (sequences, literature, annotations, protein levels, RNA levels, etc.)
  Need for “smarter” software to identify interesting relationships in very large data sets
Lack of “bioinformaticians”
  Software needs to be easier to access, use and understand
  Biologists need to learn about the software, its limitations, and how to interpret its results

The First Bioinformatician? Mendelian Genetics
Mendel started genetics research before chromosomes and genes were known
Phenotype: an observable difference among members of a population
  For example: hair color, eye color, blood type
What controls a phenotype?
  This is the question that Mendel tried to answer
  It is still the central question of modern genetics
He used the pea, a simple organism, and quantitative methods to study phenotypes.
  We now call the quantitative study of biology computational biology.

Mendel’s Peas

Mendel’s Experiments
He bred green peas with yellow peas
In genetics, we call this practice a cross (or mating): sexual reproduction between 2 organisms
Parental strains (denoted by P0 or F0): the originally crossed organisms

[Diagram: F0 X F0]

Mendel’s Results
Mendel collected results for the F1 and F2 generations
F1 generation: offspring of the F0 generation (parents)

               green  yellow  ratio
F1 generation  227    0

Mendel’s Explanation: Gene Model
The model postulates that there are entities called “genes” that control the phenotype
For the time being, let’s assume that each organism always has two copies of the same gene: one from the “father” and the other from the “mother”.
Some genes are dominant: the associated phenotype is visible in the F1 generation, e.g. green seed color
Some genes are recessive: the associated phenotype is invisible in the F1 generation, e.g. yellow seed color
How could we tell whether the gene model is correct or not?

Mendel’s New Experiments
F2 generation: offspring of the F1 generation crossed to itself
What should we expect to see in F2?
  Green seed: ¾
  Yellow seed: ¼
His experimental results:

               green  yellow  ratio
F1 generation  227    0
F2 generation  593    193     3.07
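
As a rough check on whether the gene model fits the data, the F2 counts can be compared with the expected 3:1 ratio. The sketch below enumerates a cross between two F1 plants and computes a chi-square goodness-of-fit statistic. The allele symbols `G` (dominant) and `g` (recessive) and the function names are assumptions made for illustration; they do not appear in the slides.

```python
from itertools import product


def cross(parent1, parent2):
    """Enumerate offspring genotypes: one allele taken from each parent."""
    return ["".join(sorted(a + b)) for a, b in product(parent1, parent2)]


def chi_square(observed, expected_fractions):
    """Pearson chi-square statistic for observed counts vs. expected fractions."""
    total = sum(observed)
    return sum((o - f * total) ** 2 / (f * total)
               for o, f in zip(observed, expected_fractions))


# F1 x F1: both parents are assumed to be heterozygous (Gg).
offspring = cross("Gg", "Gg")
dominant_fraction = sum("G" in genotype for genotype in offspring) / len(offspring)

# F2 counts reported on the slide: 593 green, 193 yellow.
observed = (593, 193)
stat = chi_square(observed, (dominant_fraction, 1 - dominant_fraction))

print("offspring genotypes:", offspring)
print(f"expected dominant fraction: {dominant_fraction}")
print(f"observed ratio: {observed[0] / observed[1]:.2f}")
print(f"chi-square statistic (1 degree of freedom): {stat:.3f}")
```

A statistic below 3.84, the 5% critical value for one degree of freedom, indicates that the observed counts are consistent with the 3:1 prediction.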

Topics Covered (Samples)
Introduction to Bioinformatics & Molecular Biology
Molecular biology databases
Sequence Alignment
Multiple sequence alignment
Protein structure analysis
Protein structure prediction
Gene expression & data analysis
Proteomics
Emerging topics in Bioinformatics

Molecular Biology

We will present a very brief introduction to molecular biology.

Selected topics:
  DNA
  RNA
  Proteins
  Gene expression: from DNA to protein
  Central dogma of molecular biology & bioinformatics
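
The flow from DNA to RNA to protein can be sketched in a few lines. This is a minimal illustration that assumes the input is the coding strand, read in frame from its first base, and that the standard genetic code applies. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, listed with the bases in the order T, C, A, G
# (first codon position varies slowest). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand):
    """Return the mRNA that matches a DNA coding strand (T replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def translate(coding_strand):
    """Translate codon by codon from the first base until a stop codon."""
    dna = coding_strand.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]  # KeyError on non-ACGT input
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCCTGGTAA"
print(transcribe(dna))
print(translate(dna))
```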

Molecular biology databases

Genomic sequence database
Gene expression database
Protein sequence database
Protein structure database
Protein family database

Sequence Alignment
Pairwise sequence alignment is the most fundamental operation of bioinformatics
Compare two (pairwise) or more (multiple) sequences
  DNA: 4 letters; Protein: 20 letters
Useful for discovering functional, structural, and evolutionary information in biological sequences
Assumptions: similar sequences may have the same function; or two similar sequences from different organisms may have a common ancestor sequence (homologous).
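
One classical way to compute a pairwise alignment is global alignment by dynamic programming (the Needleman-Wunsch algorithm). The slides do not say which algorithm the course uses, so the sketch below is only an illustration. The scoring values (match +1, mismatch -1, gap -1) and the function name are arbitrary assumptions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch("ACGT", "AGT")
print(best)
print(top)
print(bottom)
```

The table has (n+1) x (m+1) cells, so time and memory grow with the product of the two sequence lengths.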

Sequence alignment: DNA sequences can be aligned to see similarities between genes from different sources

768 TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG 813
    ||     ||   ||  |  | ||| |  |||| |||||    |||  |||
 87 TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG 135

814 AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG 863
    |  | |              |  |||||| |   ||||  | || |   |
136 AAGGATC.............TCAGTAATTAATCATGCACCTATGTGGCGG 172

864 AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT 913
    ||| | ||| || || |||   |   |||||||||  ||   |||||| |
173 AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT 216
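
The vertical bars in such a display mark the columns where the two sequences carry the same base. The sketch below shows one way to derive the marker line and a percent identity from two gapped sequences, using the first block of the alignment above (positions 768-813 and 87-135). The function names are assumptions, and both `.` and `-` are treated as gap characters.

```python
def match_line(top, bottom, gaps=".-"):
    """Return '|' for identical non-gap columns and ' ' everywhere else."""
    return "".join("|" if a == b and a not in gaps else " "
                   for a, b in zip(top, bottom))


def percent_identity(top, bottom, gaps=".-"):
    """Identical columns as a percentage of all aligned columns."""
    line = match_line(top, bottom, gaps)
    return 100.0 * line.count("|") / len(line)


top = "TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG"
bottom = "TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG"

print(top)
print(match_line(top, bottom))
print(bottom)
print(f"{percent_identity(top, bottom):.0f}% identity "
      f"over {len(top)} aligned columns")
```

Dividing by the number of aligned columns is one convention among several; dividing by the number of ungapped columns gives a higher figure for the same alignment.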

Database similarity searching: The BLAST program has been written to allow rapid comparison of a new gene sequence with the hundreds of thousands of gene sequences in databases

Sequences producing significant alignments:                     Score (bits)  E value

gnl|PID|e252316 (Z74911) ORF YOR003w [Saccharomyces cerevisiae]       112     7e-26
gi|603258 (U18795) Prb1p: vacuolar protease B [Saccharomyces ce...    106     5e-24
gnl|PID|e264388 (X59720) YCR045c, len:491 [Saccharomyces cerevi...     69     7e-13
gnl|PID|e239708 (Z71514) ORF YNL238w [Saccharomyces cerevisiae]        30     0.66
gnl|PID|e239572 (Z71603) ORF YNL327w [Saccharomyces cerevisiae]        29     1.1
gnl|PID|e239737 (Z71554) ORF YNL278w [Saccharomyces cerevisiae]        29     1.5

gnl|PID|e252316 (Z74911) ORF YOR003w [Saccharomyces cerevisiae]
Length = 478

Score = 112 bits (278), Expect = 7e-26
Identities = 85/259 (32%), Positives = 117/259 (44%), Gaps = 32/259 (12%)

Query: 2   QSVPWGISRVQAPAAHNRG---------LTGSGVKVAVLDTGIST-HPDLNIRGG-ASFV 50
           +  PWG+ RV        G           G GV   VLDTGI T H D   R    + +
Sbjct: 174 EEAPWGLHRVSHREKPKYGQDLEYLYEDAAGKGVTSYVLDTGIDTEHEDFEGRAEWGAVI 233

Query: 51  PGEPSTQDGNGHGTHVAGTIAALNNSIGVLGVAPSAELYXXXXXXXXXXXXXXXXXQGLE 110
           P      D NGHGTH AG I + +      GVA + ++                  +G+E
Sbjct: 234 PANDEASDLNGHGTHCAGIIGSKH-----FGVAKNTKIVAVKVLRSNGEGTVSDVIKGIE 288
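
Part of the reason a database search can be fast is that short exact word matches can be looked up in an index before any expensive alignment is attempted. The sketch below shows only that general idea. It is a simplified illustration and not BLAST's actual implementation. The function names, the entry names, and the second database sequence are made up; the first database entry and the query are fragments of the sequences shown above, with the gap character removed.

```python
from collections import defaultdict


def build_word_index(database, k=3):
    """Map every length-k word to the (entry name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index


def count_seed_hits(query, index, k=3):
    """Count exact word matches per database entry, best entry first."""
    hits = defaultdict(int)
    for i in range(len(query) - k + 1):
        for name, _position in index.get(query[i:i + k], ()):
            hits[name] += 1
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


database = {
    "YOR003w_fragment": "GKGVTSYVLDTGIDTEHEDF",   # fragment of the subject above
    "unrelated_example": "MKKLLPTAAAGLLLLAAQPA",  # hypothetical sequence
}
query = "GSGVKVAVLDTGISTHPDLN"  # fragment of the query above

index = build_word_index(database)
ranked = count_seed_hits(query, index)
print(ranked)
```

Only entries that share at least one word with the query appear in the ranking, so most of the database is never examined in detail.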

Multiple sequence alignment: Sequences of proteins from different organisms can be aligned to see similarities and differences

Protein structure
Proteins perform various functions in cells.
The 3-D structure of a protein determines its function.
One of the major goals of bioinformatics is to understand the relationship between amino acid sequence and 3-D structure in proteins.
In theory, the structure of a protein could be reliably predicted from the amino acid sequence.

Protein Structure/Function

Computational Challenges:
  Determine structure from sequence
  Determine function from sequence/3D structure

Amino Acid Sequence → 3-D Structure → Protein Function

> 1NLG:_ NADP-LINKED GLYCERALDEHYDE-3-PHOSPHATE
EKKIRVAINGFGRIGRNFLRCWHGRQNTLLDVVAINDSGGVKQASHLLKYDSTLGTFAAD
VKIVDDSHISVDGKQIKIVSSRDPLQLPWKEMNIDLVIEGTGVFIDKVGAGKHIQAGASK
VLITAPAKDKDIPTFVVGVNEGDYKHEYPIISNASCTTNCLAPFVKVLEQKFGIVKGTMT
TTHSYTGDQRLLDASHRDLRRARAAALNIVPTTTGAAKAVSLVLPSLKGKLNGIALRVPT
PTVSVVDLVVQVEKKTFAEEVNAAFREAANGPMKGVLHVEDAPLVSIDFKCTDQSTSIDA
SLTMVMGDDMVKVVAWYDNEWGYSQRVVDLAEVTAKKWVA

Classification: Gene Transfer
EC Number: 1.2.1.13
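
Sequence records like the one above follow the FASTA convention: a header line starting with `>` followed by one or more sequence lines. The sketch below is a minimal parser with hypothetical function names. To keep it short, the example text contains only the first 60 residues from the slide, split over two lines to show that sequence lines are joined.

```python
from collections import Counter


def parse_fasta(text):
    """Return {header: sequence} for FASTA-formatted text."""
    records = {}
    header = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


example = """\
> 1NLG:_ NADP-LINKED GLYCERALDEHYDE-3-PHOSPHATE
EKKIRVAINGFGRIGRNFLRCWHGRQNTLL
DVVAINDSGGVKQASHLLKYDSTLGTFAAD
"""

records = parse_fasta(example)
for name, sequence in records.items():
    composition = Counter(sequence)
    print(name)
    print("length:", len(sequence))
    print("most common residues:", composition.most_common(3))
```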

Protein analysis & proteomics

Four perspectives of proteins
  Protein families (domains & motifs)
  Physical properties of proteins
  Protein localization
  Protein function
Gene ontology
High-throughput protein analysis
Protein interactions

Gene expression and data analysis

Microarray
  High-throughput approaches based on the hybridization principle, developed recently.
  Generate terabytes of information that overwhelm conventional methods of biological analysis; different from sequence analysis.
Microarray technology allows biologists to study genome-wide patterns of gene expression in any given cell type, at any given time, and under any given set of conditions, e.g., cancer classification.
Various algorithms for microarray data analysis will be discussed

Gene expression and data analysis

• Microarray analysis
• Clustering
• Classification
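
As an illustration of clustering expression data, the sketch below groups genes whose expression profiles rise and fall together, using average-linkage agglomerative clustering with 1 minus the Pearson correlation as the distance. The slides do not specify an algorithm, so this choice is an assumption. The gene names and expression values are invented, and every profile is assumed to be non-constant.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def average_linkage(profiles, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[name] for name in profiles]

    def distance(c1, c2):
        # Mean correlation distance over all cross-cluster gene pairs.
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(1 - pearson(profiles[a], profiles[b])
                   for a, b in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(cluster) for cluster in clusters]


# Hypothetical expression levels measured under four conditions.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 5.9, 8.2],
    "geneC": [4.0, 3.1, 2.0, 0.9],
    "geneD": [8.1, 6.0, 4.2, 2.0],
}

result = average_linkage(expression, n_clusters=2)
print(result)
```

Correlation distance groups genes by the shape of their profiles and ignores absolute expression level, which is why geneA and geneB end up together despite their different magnitudes.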

Course’s Main Point
Learn to do: Define the problem → Find computational solution
Three major aspects:
  Biological: What is the task?
  Algorithmic: How to perform the task efficiently and effectively?
  Statistical: How to differentiate true phenomena from artifacts?

Reading assignment

L. Hunter, Molecular Biology for Computer Scientists, Artificial Intelligence for Molecular Biology, L. Hunter Ed., pp. 1-46, AAAI Press, 1993. (online download: http://www.aaai.org/Library/Books/Hunter/01-Hunter.pdf)
