Page 1: A proposed undergraduate bioinformatics curriculum for

A proposed undergraduate bioinformatics curriculum for computer scientists

Crossing the Interdisciplinary Boundaries

Drs. Travis Doom (CS), Michael Raymer (CS), Dan Krane (Bio), and Oscar Garcia (CS)

This work supported by NSF grant #EIA-0122582

Page 2: A proposed undergraduate bioinformatics curriculum for

T. Doom, M. Raymer, D. Krane, O. Garcia 2

Overview

• What is bioinformatics?
– The genome as an information source
– Bioinformatics problems

• How do people learn bioinformatics?

• A bioinformatics curriculum

Page 3: A proposed undergraduate bioinformatics curriculum for

Genomic information: from genes to proteins

• TATAAGCTGACTGTCACTGA (each group of three bases is one codon)
• 4 bases: A, G, C, T
• 20 amino acids
• Protein: structural or enzyme (figure: 3apr.pdb)
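The step from bases to amino acids can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code and reading the slide's sequence from its first base; the names `CODON_TABLE`, `codons`, and `translate` are illustrative and not part of the original slides.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons(dna):
    """Split a DNA string into complete three-base codons, dropping any remainder."""
    usable = len(dna) - len(dna) % 3
    return [dna[i:i + 3] for i in range(0, usable, 3)]


def translate(dna):
    """Translate a DNA string into one-letter amino acid codes."""
    return "".join(CODON_TABLE[c] for c in codons(dna.upper()))


sequence = "TATAAGCTGACTGTCACTGA"  # the sequence shown on the slide
print(codons(sequence))     # six complete codons; the trailing "GA" is dropped
print(translate(sequence))  # YKLTVT
```

Four bases taken three at a time give 64 codons, which is why 20 amino acids (plus stop signals) can be encoded with redundancy.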

Page 4: A proposed undergraduate bioinformatics curriculum for

Bioinformatics Problems

• Sequence alignment
– Given a gene, search a database for similar genes

• Protein folding

GCTATAATGCGTGT*CCA*CGCAGC*A*AATGC*TGTACCATCGCA
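The "search a database for similar genes" problem can be sketched with a small dynamic-programming scorer. This is a minimal sketch of global (Needleman-Wunsch style) alignment scoring; the scoring values, the function names, and the two-entry `database` are assumptions made for illustration. Production tools such as BLAST use faster heuristics.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score the best global alignment of a and b by dynamic programming."""
    # prev[j] holds the best score for aligning the processed prefix of a
    # against the first j characters of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def most_similar(query, database):
    """Return the name of the database entry with the highest alignment score."""
    return max(database, key=lambda name: global_alignment_score(query, database[name]))


# Hypothetical toy database, not real gene records.
database = {"gene_a": "ACGTTGCA", "gene_b": "TTTTTTTT"}
print(most_similar("ACGTAGCA", database))  # gene_a
```

Keeping only the previous row of the table holds memory to the length of one sequence, at the cost of not being able to recover the alignment itself.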

Page 5: A proposed undergraduate bioinformatics curriculum for

Bioinformatics Problems

• Drug lead screening/docking

• Complementarity
– Shape
– Chemical
– Electrostatic

Page 6: A proposed undergraduate bioinformatics curriculum for

The Role of Computation

• Target identification: Pattern Recognition, Data Mining, Dynamic Programming
– Finding proteins that are likely to be related to disease & determining their active sites
– Finding genes that code for these proteins

• Finding drug leads: Databases, Parallel Systems, Graph Theory, etc.
– Database screening
– Docking

• Refining leads: Knowledge-Based & Expert Systems, AI, Pattern Recognition, Graph Theory
– Toxicology & delivery

Page 7: A proposed undergraduate bioinformatics curriculum for

Growth of biological databases

[Figure: GenBank base pair growth. Source: GenBank]
Millions of base pairs by year, 82 to 99: 1; 2; 3; 5; 10; 16; 24; 35; 49; 72; 101; 157; 217; 385; 652; 1,160; 2,009; 3,841

[Figure: Growth of 3D structures. Source: http://www.rcsb.org/pdb/holdings.html]

Page 8: A proposed undergraduate bioinformatics curriculum for

The role of computation

ACGTCCGGCCTTATACGCTAATAAGCGCTCTATGTCTATACGCGCGATGCCGTACGAGACGCTAATAAGCGCTCTATGTCTATACGCGCGATGCCGTACGAG…

Please find me the genes on this chromosome associated with type II diabetes…
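A first, very rough step toward locating genes in raw sequence is to scan for open reading frames (ORFs). The sketch below assumes a forward-strand-only scan with ATG as the start codon and the three standard stop codons; `find_orfs` and `min_codons` are illustrative names. Real gene recognition, and especially linking genes to a disease, requires far more than this.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of forward-strand open reading frames.

    An ORF here runs from the first ATG in a reading frame to the next
    in-frame stop codon. `end` is exclusive and includes the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Count the codons before the stop, including the ATG.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


example = "CCATGAAATTTTAGGG"  # made-up sequence for illustration
for start, end in find_orfs(example):
    print(start, end, example[start:end])  # 2 14 ATGAAATTTTAG
```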

Page 9: A proposed undergraduate bioinformatics curriculum for

Bioinformatics Problems

• Site recognition
– Active site
– Other binding sites

• Data integration
– Indexing, retrieval
– Formatting

Page 10: A proposed undergraduate bioinformatics curriculum for

A growing research industry

[Figure: Average quarterly financing, cash inflow ($M) by year for 96, 97, 98, 99, and Qt1 00, split into Bioinformatics Related (B) and Other Biotech (O). Labels as extracted: 3,500; 2,007; 1,222; 1,354; 10,896; 3x; >95% O and <5% B; 63% and 37%.]

Source: Ernst & Young 13th & 14th Annual Reports, Biospace.

Page 11: A proposed undergraduate bioinformatics curriculum for

Overview

• What is bioinformatics?

• How do people learn bioinformatics?
– Learning a bilingual discipline
– Why undergraduates?

• A bioinformatics curriculum

Page 12: A proposed undergraduate bioinformatics curriculum for

Bioinformatics in the US

• The demand is growing
– The National Institute of General Medical Sciences (NIGMS) has issued a report showing a critical need for researchers from other disciplines who can perform the kind of modeling and data analysis that biological researchers require.

• Graduate programs are flourishing
– Approximately 20 US universities started graduate programs in Bioinformatics last year.
– New graduate programs are being proposed at many universities across the nation.

Page 13: A proposed undergraduate bioinformatics curriculum for

The Problem

• Bioinformatics is interdisciplinary
– Students must possess a strong grasp of computer science fundamentals
– Students must possess a strong grasp of biochemistry to recognize and appreciate the results
– Learning to speak the languages of both fields is essential
– Learning to “think” as a bioinformatician requires training in both the scientific method and solid engineering design methodology

• We believe this can (and must) be done at the undergraduate level

Page 14: A proposed undergraduate bioinformatics curriculum for

The Problem

• To pursue a career or graduate study in bioinformatics, a CS student must be familiar with:
– “Classical” CS: introductory programming, data structures, formal and comparative languages (complexity and optimization algorithms), probability and statistics
– “Contemporary” CS: AI algorithms (search, optimization, list processing, pattern recognition, etc.), databases (storage, transmission, and processing of large data sets), modeling and simulation
– Biology: genetics, molecular bio, cellular bio, gene expression, replication, recombination, repair, and the experimental tools of molecular biology (~2.5 years)
– Chemistry: inorganic and organic chemistry (~2 years)

Page 15: A proposed undergraduate bioinformatics curriculum for

Overview

• What is bioinformatics?

• How do people learn bioinformatics?

• How are we facilitating learning in bioinformatics at Wright State University?
– NSF CISE Educational Innovation Award
– Towards an accredited undergraduate program in bioinformatics

Page 16: A proposed undergraduate bioinformatics curriculum for

NSF Educational Innovation

• The NSF’s Directorate for Computer and Information Science and Engineering has awarded WSU an Educational Innovation grant.
– Crossing the interdisciplinary barrier: An integrated undergraduate program in bioinformatics
– Three-year plan: Fall 2001 to Summer 2004.
– Goal: An interdisciplinary baccalaureate bioinformatics program in Computer Science at WSU to serve as a national model of excellence

Page 17: A proposed undergraduate bioinformatics curriculum for

The Big Picture

• Graduate programs accept students with bachelor’s degrees in either CS or Biology
– The majority of the first year of graduate study is generally consumed with remedial coursework in the other discipline

• Undergraduate programs must incorporate:
– More specific (and shorter) biology and chemistry sequences
– More focused computer science foundation
– Redesignation of the traditional “core” CS to include contemporary areas of IT knowledge

Page 18: A proposed undergraduate bioinformatics curriculum for

Goal: Integrating research

• Integrating research into the undergraduate curriculum
– Academic collaborations
– Industry collaborations for research and internship

• Why is bioinformatics a rich field for integration?
– Apply the tools to new data
– Develop new tools

Page 19: A proposed undergraduate bioinformatics curriculum for

Goal: Minimal New Resources

• Bio/CS 2xx – Introduction to Bioinformatics
– Tools-oriented approach to bioinformatics emphasizing data structure in DNA, string representation in Perl, data searches, pairwise alignment, substitution patterns, protein structure prediction and modeling, proteomics, and the use of web-based bioinformatics tools

• Bio/CS 4xx – Algorithms for Bioinformatics
– Theory-oriented approach to the application of contemporary algorithms to bioinformatics. Graph theory, complexity theory, dynamic programming, and optimization techniques are introduced in the context of solving specific computational problems in molecular genetics

Page 20: A proposed undergraduate bioinformatics curriculum for

Goal: Strong CS BS program

• This degree program should produce a different but strong CS BS student:
– We use the CAC guidelines as a rule for “core” CS
– Other components developed in close collaboration with Biology and an industry panel

• CAC guidelines include:
– Algorithms, data structures, software design, programming languages (variety), computer org. & arch., discrete math, calculus, statistics, lab science, and development of oral, written, and social/ethical skills

Page 21: A proposed undergraduate bioinformatics curriculum for

Towards a CAC accredited program

Courses Removed
3xx-04 Digital Sys. Design
4xx-04 Concurrent Software
4xx-04 Formal Languages
4xx-04 Software Engineering
xxx-20 CS Electives package
1xx-16 Physics sequence
xxx-04 Science elective
xxx-24 Concentration reqs. (MTH/SCI/ENG)
80 QH removed

Courses Added
2xx-04 Intro. Bioinformatics
4xx-04 Artificial Intelligence
4xx-04 Algorithms for Bioinf.
4xx-04 Databases
xxx-08 Focused CS electives
1xx-15 Inorganic Chemistry
2xx-18 Organic Chemistry
xxx-29 Biology sequence
82 QH added

Page 22: A proposed undergraduate bioinformatics curriculum for

Towards a CAC accredited program

• 195 Total Quarter Credit Hours
– 42 General Education (as per CS)
– 66 Computer Science / Engineering (vs. 82)
  • Includes AI, Databases, two new bioinformatics courses; excludes Digital System Design, Formal Languages, Software Eng., Concurrent Software
– 29 Biology (~two year sequence) (vs. 24 Concentration)
– 33 Chemistry (two year sequence) (vs. 19 MTH/Sci)
– 25 Mathematics (as per CS)

• Approved Winter 2002
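The credit-hour breakdown on this slide can be checked with a few lines of Python. The figures are taken directly from the slide; the dictionary name and the percentage column are only illustrative.

```python
# Quarter credit hours per area, as listed on the slide.
bioinformatics_option = {
    "General Education": 42,
    "Computer Science / Engineering": 66,
    "Biology": 29,
    "Chemistry": 33,
    "Mathematics": 25,
}

total = sum(bioinformatics_option.values())

# Print each area with its share of the whole degree.
for area, hours in bioinformatics_option.items():
    print(f"{area:32s}{hours:4d}  {hours / total:6.1%}")
print(f"{'Total':32s}{total:4d}")  # total is 195, matching the slide
```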

Page 23: A proposed undergraduate bioinformatics curriculum for

An undergraduate textbook

Fundamental Concepts in Bioinformatics

I. Molecular Biology and Biological Chemistry
II. Data searches and pairwise alignments
III. Substitution patterns
IV. Distance-based methods of phylogenetics
V. Character-based approaches to phylogenetics
VI. Gene recognition: Prokaryotic genomes
VII. Gene recognition: Eukaryotic genomes
VIII. Protein folding
IX. Proteomics
Appendix 1: A gentle introduction to programming & data structures
Appendix 2: Enzyme kinetics
Appendix 3: Sample programs in Perl and worksets

Page 24: A proposed undergraduate bioinformatics curriculum for

Questions?

[email protected]

http://birg.cs.wright.edu

Page 25: A proposed undergraduate bioinformatics curriculum for

Simplified Diagram of Modern IT & CS

[Figure: diagram contrasting a Classical View with a Modern IT View. Labels: Logic, Databases, Machine Reasoning, Data Warehousing, Web Programming, WWW, Datamining, Video on Demand, Parallelism, Human-Computer Interaction, Searching]

Page 26: A proposed undergraduate bioinformatics curriculum for

Three Possible Views of Bioinformatics

[Figure: Computer Science and Biology diagrams, one per view]

• Is it Genomics in CS?

• Or is it CS in Biology?

• Or is it an independent discipline? This argues for the formation of interdisciplinary centers broader than either the bio or the informatics disciplines.

See: “Impact of Emerging Technologies on the Biological Sciences” at http://www.nsf.gov/bio/pubs/stctech/stcmain.html

Page 27: A proposed undergraduate bioinformatics curriculum for

Sister program in Biology

• 200 credit hour program in Biological Sciences
– 42 General Education
– 63 Biology (~four year sequence)
  • Includes two new bioinformatics courses
– 28 Computer Science (~three year sequence)
– 33 Chemistry (two year sequence)
– 34 Mathematics and Physics

• Close collaboration with the department of computer science and an industrial panel

Page 28: A proposed undergraduate bioinformatics curriculum for

Bioinformatics Overview

• Genomics
– Emphasis on genetics, chemical and physical aspects of flow of genetic information from DNA to proteins, gene expression, replication, recombination, and repair
– Databases, Data Mining, Neural Networks, Pattern Recognition, etc.

• Proteomics
– Study of how genes make proteins. Emphasis on the structure and properties of proteins and ligands
– Molecular modeling, Pattern Recognition, Data Mining, etc.

Page 29: A proposed undergraduate bioinformatics curriculum for

Molecular Evolution

XLRHODOP    1           ggtagaacagcttcagttgggatcacaggcttcta 35
                         ||||||||||||||||||||||||||||||||||
XL23808  1171 tgggtcatactgtagaacagcttcagttgggatcacaggcttcta 1215

XLRHODOP   36 gggatcctttgggcaaaaaagaaacacagaaggcattctttctat 80
              |||||||||||||||||||||||||||||||||||||||||||||
XL23808  1216 gggatcctttgggcaaaaaagaaacacagaaggcattctttctat 1260

XLRHODOP   81 acaagaaaggactttatagagctgctaccatgaacggaacagaag 125
              |||||||||||||||||||||||||||||||||||||||||||||
XL23808  1261 acaagaaaggactttatagagctgctaccatgaacggaacagaag 1305

XLRHODOP  126 gtccaaatttttatgtccccatgtccaacaaaactggggtggtac 170
              |||||||||||||||||||||||||||||||||||||||||||||
XL23808  1306 gtccaaatttttatgtccccatgtccaacaaaactggggtggtac 1350
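One common way to summarize an alignment like this is percent identity: the fraction of aligned positions where both sequences carry the same base. The sketch below applies it to the first row of the alignment on this slide; the function name and the choice to skip gap positions are assumptions for illustration.

```python
def percent_identity(a, b, gap="-"):
    """Percent of aligned, non-gap positions where a and b agree."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not compared:
        return 0.0
    matches = sum(x == y for x, y in compared)
    return 100.0 * matches / len(compared)


# First row of the alignment shown on the slide.
query = "ggtagaacagcttcagttgggatcacaggcttcta"              # XLRHODOP 1..35
subject = "tgggtcatactgtagaacagcttcagttgggatcacaggcttcta"  # XL23808 1171..1215

# The query aligns against the last 35 bases of the subject row.
overlap = subject[-len(query):]
print(round(percent_identity(query, overlap), 1))  # 97.1 (34 of 35 positions match)
```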

Page 30: A proposed undergraduate bioinformatics curriculum for

Drug discovery life cycle

[Figure: timeline of the stages below on an axis of 0 to 16 years]

• Discovery (2 to 10 years)
• Preclinical testing (lab and animal testing)
• Phase I (20-30 healthy volunteers used to check for safety and dosage)
• Phase II (100-300 patient volunteers used to check for efficacy and side effects)
• Phase III (1000-5000 patient volunteers used to monitor reactions to long-term drug use)
• FDA review & approval
• Post-marketing testing

$600-700 million, 7 to 15 years

Page 31: A proposed undergraduate bioinformatics curriculum for

Benefits of bioinformatics

• Every major pharmaceutical company now employs bioinformatics techniques to improve drug design (among other business aspects)

• Increased understanding of evolution at the genetic/molecular level (phylogenetics)

• Our best glimpse yet at the molecular mechanisms that regulate life at a cellular level and possibilities for simulating some aspects with a computer (basic science)