CIS 595 Bioinformatics

Lecture 2

Introduction to Bioinformatics

A number of slides taken/modified from:

Russ B. Altman (http://www.smi.stanford.edu/projects/helix/bmi214/)

Patrik Medstrand (www.cmb.lu.se/devbiol/bioinfo/old/download/intro2003/databases_handouts.pdf)

Mark Gerstein (http://bioinfo.mbb.yale.edu/mbb452a/2002/sequences2002.pdf)

What is Bioinformatics?

• Every application of computer science to biology
  – Sequence analysis, image analysis, sample management, population modeling, …
• Analysis of data coming from large-scale biological projects
  – Genomes, transcriptomes, proteomes, metabolomes, etc.

The New Biology

• Traditional biology
  – Small team working on a specialized topic
  – Well-defined experiment to answer precise questions
• New “high-throughput” biology
  – Large international teams using cutting-edge technology defining the project
  – Results are given raw to the scientific community without any underlying hypothesis

Examples of “High-Throughput”

• Complete genome sequencing
• Simultaneous expression analysis of thousands of genes (DNA microarrays, SAGE)
• Large-scale sampling of the proteome
• Protein-protein interaction analysis: large-scale 2-hybrid (yeast, worm)
• Large-scale 3D structure production (yeast)
• Metabolism modeling
• Biodiversity

Role of Bioinformatics

• Control and management of the data
• Sequence, Structure and Function analysis
• Analysis of primary data, e.g.
  – Mass spectra analysis
  – DNA microarray image analysis
• Statistics
• Database storage and access
• Interpreting results in a biological context

Sequence, Structure and Function Analysis

To gain insight into how genes and gene products (proteins) function, perform:

• SEQUENCE ANALYSIS: Analyze DNA and protein sequences, searching for clues about structure, function, and control.

• STRUCTURE ANALYSIS: Analyze biological structures, searching for clues about sequence, function and control.

• FUNCTION ANALYSIS: Understand how the sequences and structures lead to the functions.

Evolution and Bioinformatics

1. Common descent of organisms implies that they will share many “basic technologies.”

2. Development of new phenotypes in response to environmental pressure can lead to “specialized technologies.”

3. More recent divergence implies more shared technologies between species.

4. All of biology is about two things: understanding shared or unshared features.

Biology is Fundamentally Information Science

Where is information:
• DNA Sequences
  – GenBank release 128 (2/02) contains 17,089,143,893 bases in 1,546,532 sequences
• Protein Sequences
  – PIR or Swiss-Prot (as of 3/02): 106,736 sequences, 39,242,287 total amino acids
• Protein 3D Structures
  – Protein Data Bank (PDB), as of March 2002: 17,679 coordinate entries; 15,855 proteins, 1,060 nucleic acids, 746 protein/nucleic acid complexes, 18 carbohydrates

Biology is Fundamentally Information Science

Where is information:
• Online access to DNA microarray data
  – http://smd.stanford.edu/; 10,000 to 40,000 genes per chip; each set of experiments involves 3 to 100 “conditions”
• Medical literature online
  – Online database of published literature since 1966 = Medline = PubMed resource; 4,600 journals; 11,000,000+ articles (most with abstracts)
• Etc.

Topics

• Sequence Alignment; Sequence Motifs; Gene Finding

• Computing with Biological Structures

• Phylogenetic Algorithms

• Microarray Data Analysis

• Genetic Networks

• Comparative Genomics

• Proteomics

• Biological Ontologies; Biological Text Mining

Sequence Alignment

• What is sequence alignment?
  – Given two sequences and a scoring scheme, find the optimal pairing of letters.

      RKVA--GMAKPNM
      RKIAVAAASKPAV

• Why align sequences?
  – A few sequences with known structure and function; many more with unknown properties.
  – If one of them has known structure/function, then alignment to the other yields insight about the other.
  – Similarity may be used as evidence of homology, but does not necessarily imply homology.
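As a rough illustration of what a "scoring scheme" means, the sketch below scores the slide's example alignment column by column. It assumes the 26-character example string splits into two aligned rows of 13 letters (RKVA--GMAKPNM over RKIAVAAASKPAV). The match, mismatch and gap values are arbitrary assumptions chosen for illustration, not values from the lecture, and the function name `score_alignment` is hypothetical.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-2):
    """Score a finished pairwise alignment column by column.

    Returns (score, identical_columns, total_columns). The default
    match/mismatch/gap values are illustrative assumptions.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    score = 0
    identical = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap          # a letter paired with a gap
        elif a == b:
            score += match        # identical letters
            identical += 1
        else:
            score += mismatch     # different letters
    return score, identical, len(top)


score, identical, columns = score_alignment("RKVA--GMAKPNM", "RKIAVAAASKPAV")
print(score, identical, columns)  # -5 5 13
```

A realistic scoring scheme would replace the flat match/mismatch values with a substitution matrix such as PAM250, so that conservative substitutions score differently from radical ones.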

Sequence Alignment

Types of alignment:
  – Local vs. global
  – Pairwise vs. multiple

d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI
d8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI
d4dfra_ ISLIAALAVDRVIGMENAMPWN-LPADLAWFKRNTL--------NKPVIMGRHTWESI
d3dfr__ TAFLWAQDRDGLIGKDGHLPWH-LPDDLHYFRAQTV--------GKIMVVGRRTYESF

Sequence Alignment

How to measure the alignment quality?
  – Define a scoring matrix (PAM250)

Sequence Alignment

Alignment algorithms:
• Dot matrix
• Dynamic programming
  – FASTA
  – BLAST
  – PSI-BLAST
  – Clustal

Similarity strength:
• Percent identity
• E-value (statistical measure)
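The dynamic programming idea for global alignment can be sketched as follows. This is a minimal Needleman-Wunsch score computation with a linear gap penalty, written only for illustration: the function name and the scoring values are assumptions, and it returns just the optimal score, not the alignment itself. The programs listed on this slide use more elaborate scoring and heuristics.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of a and b (illustrative scoring values)."""
    # prev[j] holds the best score for aligning the processed prefix of a
    # with the first j letters of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # first i letters of a aligned against gaps only
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # letter of a paired with a gap
            left = curr[j - 1] + gap  # letter of b paired with a gap
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("ACGT", "AGT"))  # 2: three matches and one gap
```

Only two rows of the table are kept in memory, which is enough for the score; recovering the alignment itself would need the full table (or a traceback structure).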

Sequence Motifs

• A subsequence that occurs in multiple sequences and has biological importance.
  – Protein motifs often result from structural features
  – DNA sequences that provide signals for protein binding or nucleic acid folding

Sequence Motifs

• PROSITE database: a collection of motifs (1135 different motifs)
  – A manually created collection of regular expressions associated with different protein families/functions.
  – Globin sequence signature (PDOC00933):
      F-[LF]-x(5)-G-[PA]-x(4)-G-[KRA]-x-[LIVM]-x(3)-H
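Because PROSITE patterns are essentially regular expressions, the signature on this slide can be translated into Python `re` syntax. The sketch below is a minimal translator that covers only part of the pattern syntax (single residues, `[...]` alternatives, `{...}` exclusions, `x` wildcards and `(n)` or `(n,m)` repeats); the function name `prosite_to_regex` is hypothetical, and the test sequence is synthetic, built to match the pattern rather than taken from a real globin.

```python
import re


def prosite_to_regex(pattern):
    """Translate a (simple) PROSITE pattern into Python regex syntax."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Separate the residue specification from an optional repeat, e.g. x(5).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":
            core = "."                         # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"     # any residue except these
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


globin = "F-[LF]-x(5)-G-[PA]-x(4)-G-[KRA]-x-[LIVM]-x(3)-H"
regex = prosite_to_regex(globin)
print(regex)  # F[LF].{5}G[PA].{4}G[KRA].[LIVM].{3}H

# Synthetic sequence constructed to contain one match (not real data).
synthetic = "MM" + "FLAAAAAGPAAAAGKALAAAH" + "MM"
print(re.search(regex, synthetic).span())  # (2, 23)
```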

Gene Finding

• Problem: Identify the genes within raw genomic DNA sequence
• Input: Raw DNA sequence
• Output: Location of gene elements in the raw sequence (including exons, introns, other sequence annotations)
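To make the input/output contract concrete, here is a deliberately naive sketch: a forward-strand open reading frame (ORF) scan that reports ATG-to-stop stretches. It is an assumption-laden toy, not a real gene finder: it ignores introns, the reverse strand and all statistical signals. The function name, the `min_codons` parameter and the example sequence are made up for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for ATG...stop stretches on the forward strand.

    start/end are 0-based, end-exclusive positions in dna; end includes the
    stop codon. min_codons counts codons before the stop, including ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # close it and keep scanning
    return orfs


toy = "CCATGAAATTTTAGGG"                        # made-up sequence
print(find_orfs(toy))                           # [(2, 14, 2)]
```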

Computing with Biological Structures

• General Issues
  – How do we represent structure for computation?
  – How do we compare structures?
  – How can we summarize structural families?

Computing with Biological Structures

Applications:
• Structure alignment
• Build fold library

[Figure: alignment of individual structures (Hb, Mb), then fusing into a single fold “template”]

Computing with Biological Structures

Why align structures:
  – Provides the “gold standard” for sequence alignment
  – For nonhomologous proteins, identify common substructures of interest
  – Classify proteins into clusters, based on structural similarity (SCOP)

Computing with Biological Structures

Applications:
• Predicting RNA secondary structure (the MFOLD program, http://www.bioinfo.rpi.edu/applications/mfold/old/rna/)

Computing with Biological Structures

Protein secondary structure prediction

Sequence  RPDFCLEPPYTGPCKARIIRYFYNAKAGLVQTFVYGGCRAKRNNFKSAEDAMRTCGGA
Structure CCGGGGCCCCCCCCCCCEEEEEEETTTTEEEEEEECCCCCTTTTBTTHHHHHHHHHCC

Phylogenetic Algorithms

Why build an evolutionary tree?

• Understand the lineage of different species.

• Have an organizing principle for sorting species into a taxonomy

• Understand how various functions evolved.

• Understand forces and constraints on evolution.

• To do multiple alignment.

Phylogenetic Algorithms

Multiple Alignment and Trees
• Progressive alignment methods do multiple alignment and evolutionary tree construction at the same time.
• Sequence alignment provides scores which can be interpreted as inversely related to distances in evolution.
• Distances can be used to build trees.
• Trees can be used to give multiple alignments via common parents.
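The step "distances can be used to build trees" can be sketched with UPGMA (average-linkage clustering), one simple distance-based method chosen here for illustration; the lecture does not say which method it has in mind. The sketch assumes the sequences are already aligned and uses the fraction of differing columns (p-distance) as the distance. The function names and the three toy sequences are made up.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing letters among columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


def upgma(names, dist):
    """Average-linkage clustering; returns the tree as nested tuples.

    dist maps frozenset({name_i, name_j}) to a distance.
    """
    sizes = {name: 1 for name in names}   # cluster -> number of leaves
    d = dict(dist)
    while len(sizes) > 1:
        # Join the two closest clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        # Distance to the merged cluster is the size-weighted average.
        for c in sizes:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


aligned = {                                # toy, pre-aligned sequences
    "s1": "ACGTACGT",
    "s2": "ACGTACGA",
    "s3": "TCGAACCA",
}
dist = {
    frozenset((x, y)): p_distance(aligned[x], aligned[y])
    for x, y in combinations(aligned, 2)
}
tree = upgma(list(aligned), dist)
print(tree)  # ('s3', ('s1', 's2'))
```

In a progressive alignment, a tree like this would serve as the guide for the order in which sequences are merged into the multiple alignment.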

Microarray Data Analysis

[Figure: Experimental Protocol]

Microarray Data Analysis

What are expression arrays good for?
  – Follow a population of (synchronized) cells over time, to see how expression changes (vs. baseline).
  – Expose cells to different external stimuli and measure their response (vs. baseline).
  – Take cancer cells (or other pathology) and compare to normal cells.
  – (Also some non-expression uses, such as assessing presence/absence of sequences in the genome)

Microarray Data Analysis

Preprocessing
• Data input
• Background correction
• Cy5/Cy3 normalization
• Merging replicate experiments
• Score differential hybridization
• Spot quality
• Artifactual regions
• Duplicate spot variability
• Replicate experiment variability
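Two of the preprocessing steps, background correction and Cy5/Cy3 normalization, can be sketched as follows. This is one common and simple choice (subtract a constant background, take log2 ratios, center on the median), picked here as an assumption; the lecture does not specify the method. The function name, the parameter names and the intensities are made up.

```python
import math
from statistics import median


def normalized_log_ratios(cy5, cy3, bg5=0.0, bg3=0.0):
    """Background-correct two channels and return median-centered log2(Cy5/Cy3).

    Spots whose corrected intensity is not positive in either channel are
    reported as None (flagged as unusable) instead of a ratio.
    """
    ratios = []
    for red, green in zip(cy5, cy3):
        red_corr, green_corr = red - bg5, green - bg3
        if red_corr <= 0 or green_corr <= 0:
            ratios.append(None)
        else:
            ratios.append(math.log2(red_corr / green_corr))
    # Center so that the typical spot has log ratio 0 (no change).
    center = median(r for r in ratios if r is not None)
    return [None if r is None else r - center for r in ratios]


print(normalized_log_ratios([200, 400, 800], [100, 100, 100]))  # [-1.0, 0.0, 1.0]
```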

Microarray Data Analysis

Convert microarray images to data

Microarray Data Analysis

Clustering:

– If two genes are expressed in the same way, they may be functionally related.

– If a gene has unknown function, but clusters with genes of known function, this is a way to assign its general function.

– We may be able to look at high resolution measurements of expression and figure out which genes control which other genes.

– E.g. peak in cluster 1 always precedes peak in cluster 2 => cluster 1 turns cluster 2 on?
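"Expressed in the same way" is often made precise with a similarity measure between expression profiles. The sketch below uses the Pearson correlation, a common choice assumed here rather than taken from the slides, and lists the genes whose profile correlates with a chosen seed gene above a threshold. The gene names, values and the 0.9 threshold are made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def coexpressed(profiles, seed, threshold=0.9):
    """Genes whose profile correlates with the seed gene at or above threshold."""
    return sorted(
        gene
        for gene, profile in profiles.items()
        if gene != seed and pearson(profiles[seed], profile) >= threshold
    )


profiles = {                       # toy expression values over four time points
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
print(coexpressed(profiles, "geneA"))  # ['geneB']
```

A full clustering method (hierarchical clustering, k-means and so on) would apply such a measure to all gene pairs rather than to a single seed.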

Microarray Data Analysis

Classification:
• Uses known groups of interest (from other sources) to
  – learn the features associated with these groups in the primary data,
  – create rules for associating the data with the groups of interest.

• Often called “supervised machine learning.”
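A minimal sketch of the two steps, learning features from known groups and then applying a rule to new data, is a nearest-centroid classifier. It is only one of many supervised methods and is an assumption here, not the lecture's method. The group labels and expression values are toy data.

```python
def centroid(profiles):
    """Mean expression profile of a group of samples."""
    return [sum(column) / len(column) for column in zip(*profiles)]


def train(labeled):
    """Learn one centroid per known group (labeled: group -> list of profiles)."""
    return {label: centroid(profiles) for label, profiles in labeled.items()}


def classify(model, profile):
    """Rule: assign a profile to the group whose centroid is closest."""
    def squared_distance(center):
        return sum((x - c) ** 2 for x, c in zip(profile, center))
    return min(model, key=lambda label: squared_distance(model[label]))


training = {                         # toy data: two genes measured per sample
    "tumor": [[5.0, 1.0], [6.0, 2.0]],
    "normal": [[1.0, 5.0], [2.0, 6.0]],
}
model = train(training)
print(classify(model, [5.5, 1.0]))   # tumor
print(classify(model, [1.0, 6.0]))   # normal
```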

Genetic Networks

What is a genetic network?
  – Individual genes have a function (e.g. transforming a substance or binding to a substance)

– Sets of functions when sequenced can produce pathways (e.g. output of one transformation is the input to another)

– Sets of pathways, as they interact with other pathways, create a genetic network of interactions.

Genetic Networks

Reconstructing Genetic Regulatory Networks:
  – Hard problem.

  – Given N genes, the number of possible networks connecting them grows exponentially with N.

  – Relationships are not generally +/- but are continuous valued.

– Must use knowledge about expected function and membership in pathways to prune the list of possible network interactions.
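The size of the search space can be made concrete with a small count. The sketch assumes the simplest possible model, in which each ordered pair of distinct genes either has a regulatory link or does not; this model and the function names are assumptions for illustration. Continuous-valued relationships, as noted above, make the space larger still.

```python
def possible_edges(n_genes):
    """Ordered pairs of distinct genes: candidate 'A regulates B' links."""
    return n_genes * (n_genes - 1)


def possible_networks(n_genes):
    """Each candidate link is present or absent, so 2 ** edges wiring diagrams."""
    return 2 ** possible_edges(n_genes)


for n in (3, 5, 10):
    print(n, possible_edges(n), possible_networks(n))
# 3 genes: 6 candidate links, 64 possible networks
```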

Comparative Genomics

• Large-scale comparison of genomes to
  – understand the biology of individual genomes
  – extract general principles applying to groups of genomes.
• Assumption:
  – many biological sequences, structures, and functions are shared across organisms,
  – the signal from these organisms can be increased by combining them in analyses.

Comparative Genomics

Important issues for Comparative Genomics
  – Aligning very large sequences
  – Comparative approaches to gene finding
  – Comparative approaches to assigning function
  – Comparative approaches to identifying key regulatory regions

Comparative Genomics

Example: Assigning protein functions

Proteomics

• What is PROTEOMICS?
  – -OMICS has become the suffix to denote the study of the entire set of something
  – Genomics: study of all genes
  – Proteomics: study of all proteins
  – Transcriptomics: study of all mRNA transcripts
  – Metabolomics: study of metabolites in the cell

Proteomics

Proteomics questions
  – Which proteins are made from the genome?
  – What is their 3D structure?
  – Where are they?
  – What do they do?
  – Which other proteins do they interact with?
  – Are they modified in the cell post-translationally?

Proteomics

Key proteomic technologies

– 3D structure determination (X-ray/NMR)

– 2D Gels to assess all the proteins in a cell.

– Mass spectrometry to identify proteins, protein modifications.

– Yeast-Two-Hybrid systems to assess protein-protein interactions

– Protein Arrays to assess all proteins in a cell using antibodies or other recognition technology.

Biomedical Ontologies

• In order to communicate effectively we need:
  – common language
  – basic knowledge
• Example:
  – Metabolic Pathways:
    • language: names of products, enzymes, substrates and pathways
    • knowledge: what is a reaction, how do enzymes and substrates participate, what are the legal components of a pathway

Biomedical Ontologies

Gene Ontology (http://www.geneontology.org/)

• Used to classify gene function.

• A controlled listing of three types of function:
  – Molecular Function
  – Biological Process
  – Cellular Component

Biological Text Mining

• Literature in Biomedicine
• Much literature generated quickly.
  – 11 million citations in MEDLINE.
  – 400,000 added yearly.
• Need methods to deal with data.
  – Query
  – Summarize
  – Organize
  – Understand

Long term challenges

• Computational model of physiology.
  – Can we give a medication to a computer before we give it to a human?
• Design of new compounds for medical and industrial use.
  – Can we design a protein or nucleic acid to have a specified function?
• Engineering new biological pathways.
  – Can we devise methods for designing and implementing new metabolic capabilities for treating disease?
• Data mining for new knowledge.
  – Can we ask computer programs to examine data (in the context of our models) and create new knowledge?