NCBI Quick Overview of Bioinformatics Chuong Huynh NIH/NLM/NCBI New Delhi, India September 28, 2004...


Page 1: NCBI Quick Overview of Bioinformatics Chuong Huynh NIH/NLM/NCBI New Delhi, India September 28, 2004 huynh@ncbi.nlm.nih.gov.

Quick Overview of Bioinformatics

Chuong Huynh, NIH/NLM/NCBI

New Delhi, India, September 28, 2004

huynh@ncbi.nlm.nih.gov

Page 2:

What is bioinformatics? - Definition

• My definition – bringing biological themes to computers

• Peter Elkin, Primer on Medical Genomics, Part V: Bioinformatics
– “Bioinformatics is the discipline that develops and applies informatics to the field of molecular biology.”

• BISTIC Bioinformatics Definition
– “Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data”

• BISTIC Computational Biology Definition
– “Computational Biology: the development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.”

• http://www.bisti.nih.gov/

Page 3:

Useful/Necessary Bioinformatics Skills

• Strong background in some aspect of molecular biology!!!
• Ability to communicate biological questions comprehensibly to computer scientists
• Thorough comprehension of the problem in the bioinformatics field
• Statistics (association studies, clustering, sampling)
• Ability to filter, parse, and munge data and determine the relationships between the data sets
• Mathematics (e.g. algorithm development)
• Engineering (e.g. robotics)
• Good knowledge of a few molecular biology software packages (molecular modeling / sequence analysis)
• Command line computing environment (Linux/Unix knowledge)
• Data administration (esp. relational database concepts), computer programming skills/experience (C/C++, Sybase, Java, Oracle), and scripting language knowledge (Perl and perhaps Python)

Page 4:

Bioinformatics Flow Chart (0)

1a. Sequencing
1b. Analysis of nucleic acid seq.
2. Analysis of protein seq.
3. Molecular structure prediction
4. Molecular interaction
5. Metabolic and regulatory networks
6. Gene & Protein expression data
7. Drug screening (ab initio drug design OR drug compound screening in database of molecules)
8. Genetic variability

Page 5:

Bioinformatics Flow Chart (1)

1a. Sequencing
-Base calling
-Physical mapping
-Fragment assembly

1b. Analysis of nucleic acid seq.
-Gene finding
-Multiple seq alignment, evolutionary tree
-Stretch of DNA coding for protein; analysis of noncoding region of genome

2. Analysis of protein seq.
-Sequence relationship

3. Molecular structure prediction
-3D modeling; DNA, RNA, protein, lipid/carbohydrate

4. Molecular interaction
-Protein-protein interaction
-Protein-ligand interaction

5. Metabolic and regulatory networks
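
Sequence alignment underlies several of the steps listed on this slide. As a rough illustration only, the sketch below scores and aligns two sequences globally with the Needleman-Wunsch dynamic-programming method. The function name and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for the example; they are not taken from the slides, and real tools use substitution matrices and more refined gap penalties.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Global pairwise alignment; returns (score, aligned_a, aligned_b).

    Scoring values are illustrative assumptions, not a recommended scheme.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First row/column: aligning a prefix against nothing costs one gap per base.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i: int, j: int) -> int:
        """Substitution score for aligning a[i-1] with b[j-1]."""
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell is the best of diagonal (align), up or left (gap).
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "ACT"))
```

Multiple sequence alignment and tree building extend this pairwise idea; the sketch covers only the pairwise case.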

Page 6:

Bioinformatics Flow Chart (2)

6. Gene & Protein expression data
-EST
-DNA chip/microarray

7. Drug screening
Ab initio drug design OR drug compound screening in database of molecules
a) Lead compound binds tightly to binding site of target protein
b) Lead optimization – lead compound modified to be nontoxic, few side effects, target deliverable
Drug molecules designed to be complementary to binding sites with physicochemical and steric restrictions.

8. Genetic variability
-Now investigated at the genome scale
-SNP, SAGE

Page 7:

Genome Sequencing

•Most genomes will be sequenced and can be sequenced; few problems are unsolvable.
•Problem lies in understanding what you have:
  •Gene prediction/gene finding
  •Annotation

Strategy: Clone by clone vs whole genome shotgun

Libraries: Subcloning; generate small insert libraries

Sequencing

Assembly: Process of taking raw single-pass reads into contiguous consensus sequence (Phred/Phrap)

Closure: Process of ordering and merging consensus sequences into a single contiguous sequence

Annotation:
-DNA features (repeats/similarities)
-Gene finding
-Peptide features
-Initial role assignment
-Others, e.g. regulatory regions

Release: Release data to the public, e.g. EMBL or GenBank
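
To make the assembly step concrete, here is a toy sketch that merges overlapping reads into contigs with a greedy suffix-prefix strategy. It is not how Phred/Phrap works internally (those tools use base-quality scores and tolerate sequencing errors); the function names, the exact-match overlap rule, and the `min_len` threshold are all assumptions made for illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b` (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest exact overlap.

    Returns the remaining contigs once no pair overlaps by at least min_len bases.
    """
    # Drop exact duplicates (keeping order) and reads fully contained in another read.
    contigs = list(dict.fromkeys(reads))
    contigs = [r for r in contigs if not any(r != o and r in o for o in contigs)]

    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to merge
        # Join the two sequences, writing the shared bases only once.
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
    print(greedy_assemble(reads))
```

When the reads do not all overlap, the function returns several contigs; ordering and joining those is what the closure step addresses.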

Page 8:

[Figure: shotgun sequencing workflow]

Genomic DNA → Shearing/Sonication → Small DNA fragments (1.0-2.0 kb) → Subclone and Sequence (clone library in pUC18; DNA sequencing of random clones) → Shotgun reads → Assembly → Contigs → Finishing (finishing reads; both strands coverage; gaps filled) → Complete sequence

Page 9:

Annotation of eukaryotic genomes

[Figure: from gene to function]

Genomic DNA → (transcription) → Unprocessed RNA → (RNA processing) → Mature mRNA with poly-A tail (AAAAAAA) → (translation) → Nascent polypeptide → (folding) → Active enzyme → Function (Reactant A → Product B)

Annotation approaches shown: ab initio gene prediction; comparative gene prediction; functional identification

Page 10:

Annotation

• Predict protein
• Extract ORFs
• Remove errors
• Compare with database of ‘known function proteins’
• Provide transitive annotations
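
The "Extract ORFs" step can be sketched as a six-frame scan for ATG-to-stop stretches. This is a deliberately simple illustration: the function names and the `min_codons` threshold are hypothetical, only the standard stop codons are used, and real eukaryotic gene finding must also handle introns, which this sketch ignores.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 3) -> list[tuple[str, int, int, str]]:
    """Scan all six reading frames for ATG...stop open reading frames.

    Returns (strand, start, end, orf_sequence) tuples. Coordinates are 0-based,
    end-exclusive, and relative to the strand being scanned. `min_codons`
    counts the stop codon as well.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATTTTAGGG"):
        print(orf)
```

The predicted protein for each ORF would then be compared against a database of proteins with known function, which is where the remaining bullets come in.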

Page 11:

Positional Cloning

Page 12:

Positional Candidate Cloning

Page 13:

The new information is always partial

• Complete Eukaryotic Genomes
• Ongoing Eukaryotic
• Prokaryotic Ongoing
• Published
• Even a complete genome is only partially understood

Page 14:

Why not use the genome sequence once it's ‘ready’?

• Finding exons
– 30% overprediction
– 20% not found at all
– Comparison systems rely on EST sequences, which themselves contain large error rates
– Others are looking through partial data
– Once the genome is done … when?

• Expressed sequences are there in part and represent a very powerful key.

Page 15:

Interpreting data from many sources

Page 16:

Genomics and Tropical Diseases

How Can Genomics Contribute to the Control of Tropical Diseases?

Challenges and Opportunities

The Role of Bioinformatics

Strategic emphases for research: http://www.who.int/tdr/grants/strategic-emphases/default.htm

WHO/TDR Genomics and World Health Report 2002

Page 17:

Why Pathogen Genomics?

“The power and cost-effectiveness of modern genome sequencing technology mean that complete genome sequences of 25 of the major bacterial and parasitic pathogens could be available within five years. For about 100 million dollars (…), we could buy the sequence of every virulence determinant, every protein antigen and every drug target.”

B. Bloom (1995) A microbial minimalist. Nature 378:236

Page 18:

Genomics and Drug Development for Tropical Diseases: Challenges

• Knowledge limitations
– A large proportion of pathogen genes have unknown function
– Heavy investment in genomics is done by the commercial sector and therefore not widely available

• Emphasis and priorities
– Genomes of non-pathogenic model organisms (S. cerevisiae, D. melanogaster, C. elegans, A. thaliana)
– Genomes of pathogens that affect individuals in developed countries
– Neglected diseases, neglected pathogens

Page 19:

Doing Successful Science in the new millennium

• Huge increase in available biological information
• Classic paradigm of ‘molecular biology’ is now shifting rapidly to genomics
• Understanding of the new paradigms concerns more than ‘just bench biology’
• Discovery requires large-scale systems and broad collaborations; global problems
• Funding comes in large amounts at group level; no longer a single laboratory or institution effort
• Accountable output

Page 20:

The Bigger Picture (Malaria)

Page 21:

Genomics Approach to Drug Development: Opportunities

• Classical laboratory assays aim at targets in which mutation is lethal to the pathogen
– Valuable targets can be missed

• Sulphonamides: inhibition of the p-aminobenzoic acid pathway is not lethal for growth in the laboratory but severely attenuates the capacity to cause disease

Page 22:

Genomics Approach to Drug Development: Opportunities

• New approaches for the identification of gene products specifically involved in the disease process may uncover further drug targets
– Signature tagged mutagenesis (STM)
– Transposon site hybridization (TraSH)

• Pathogen genomics and data mining for the discovery of new drug targets

Page 23:

Fosmidomycin

• September 1999: a basic science breakthrough (data mining through bioinformatics identifies new targets for chemotherapy of malaria)

• 1st semester 2001: Results of Phase I clinical trials

Page 24:

Fosmidomycin example - lesson

• A lesson to take home: 1½ years from data mining and laboratory research to phase II, proof-of-principle clinical trials

Page 25:

Bioinformatics: Opportunities in Health Research and Development

• New drug research and development
– Identification of novel drug/vaccine targets
– Structural predictions
– Tapping into biodiversity
– Reconstruction of metabolic pathways
– Systems biology

• Identification of vaccine candidates through analysis of surface antigens and epitopes

Page 26:

A Window of Opportunity for Disease Endemic Countries

• Bioinformatics is an extremely important tool, with relevance to studying pathogenic organisms
– Pathogens of interest to DECs already being sequenced (e.g. P. falciparum, T. cruzi, T. brucei, Leishmania sp.)

• Computational biology is ‘people-intensive’, less affected by infrastructure, economics, etc. than other areas of biological research

• ‘Critical mass’ issues less critical – a world-wide community is within reach

Page 27:

Relatively Modest Hardware Needs and Technical Support

• Linux operating system permits use of the personal computer as a powerful workstation
– Vast repository of public domain software for computational biology

• Individual accounts for remote access and data processing can be opened at high-performance computer facilities and regional centers
– EMB network nodes, FIOCRUZ (Brazil), SANBI (South Africa), CECALCULA (Venezuela), ICGEB (Trieste and New Delhi)

Page 28:

Relatively Modest Hardware Needs and Technical Support

• Powerful searches using public websites
– NCBI, EMB nodes, Sanger Center, Expasy/SwissProt, KEGG database

• High-speed internet access is becoming more and more available in disease endemic countries through regional and international support, e.g.:
– Asia-Pacific Advanced Network Consortium (APAN) http://www.th.apan.net/
– MIMCom Malaria Research Resources http://www.nlm.nih.gov/mimcom/about.html

Page 29:

TDR Regional Training Centers & Regional Training Courses on Bioinformatics Applied to Tropical Diseases

• Africa
– SANBI, Cape Town, South Africa
  • Course: Jan 20-Feb 02, 2002; Mar 19-Apr 4, 2003; Feb 2-15, 2004 (with NBN series)
– Univ of Ibadan, Ibadan, Nigeria
  • Course: May 26-Jun 07, 2003

• South America
– USP, São Paulo, Brazil
  • Course: Feb 18-March 02, 2002; July 17-19, 2003; July 5-16, 2004

• Southeast Asia
– ICGEB, New Delhi, India
  • Course: Apr 26-May 09, 2002; Sep 22-Oct 06, 2003; Sept 28-Oct 11, 2004
– Mahidol University, Bangkok, Thailand
  • Course: Jul 09-23, 2002; Sep 29-Oct 10, 2003; July 26-Aug 6, 2004

International Training Course on Bioinformatics and Computational Biology Applied to Genome Studies (Train-the-trainers Workshop)
May 21-June 15, 2001, FIOCRUZ, Brazil

Page 30:

Training Course on Bioinformatics and Functional Genomics Applied to Insect Vectors of Human Diseases
At the Center for Bioinformatics and Applied Genomics (CBAG) and Center for Vector and Vector-Borne Diseases (CVVD), Faculty of Science, Mahidol University, Bangkok, Thailand
January 17-28, 2005

Training Course on Functional Genomics of Insect Vectors of Human Diseases
African Center for Training in Functional Genomics of Insect Vectors of Human Diseases (AFRO VECTGEN)
At the Malaria Research and Training Center (MRTC), Bamako, Mali
Dec 1-16, 2004

Page 31:

Beginning Bioinformatics Books

• Baxevanis & Ouellette 2001. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, 2nd Edition. John Wiley Publishing.
• Gibas & Jambeck 2001. Developing Bioinformatics Computer Skills. O’Reilly.
• Bioinformatics: Sequence and Genome Analysis – Mount 2001
• Bioinformatics For Dummies – Claverie & Notredame 2003
• Bioinformatics and Functional Genomics – Pevsner 2003
• Introduction to Bioinformatics – Lesk 2002
• Fundamental Concepts of Bioinformatics – Krane & Raymer 2003
• Beginning Perl for Bioinformatics – Tisdall 2002
• Primer of Genome Science – Gibson & Muse 2002

Page 32:

Course Schedule

Comments and Suggestions

Take out your course schedule.

Page 33:

The Challenge

What is expected of you?