CMSC702: Computational systems biology and functional...

44
CMSC702: Computational systems biology and functional genomics Héctor Corrada Bravo Dept. of Computer Science Center for Bioinformatics and Computational Biology University of Maryland University of Maryland, Spring 2014

Transcript of CMSC702: Computational systems biology and functional...

Page 1: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

CMSC702: Computational systems biology

and functional genomics

Héctor Corrada Bravo Dept. of Computer Science

Center for Bioinformatics and Computational Biology University of Maryland

University of Maryland, Spring 2014

Page 2: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

Advances in Biology and Medicine needed, need, and will continue to need computational

and statistical thinking (and their tools)

Héctor Corrada Bravo Dept. of Computer Science

Center for Bioinformatics and Computational Biology University of Maryland

Page 3: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

Course Introduction

Page 4: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

Why  are  my  children        such  pigs?

Page 5: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

What  is  Genomics?• Each  cell  contains  a  complete  copy  of  an  organism’s  genome,  or  blueprint  for  all  cellular  structures  and  ac;vi;es.  !

• The  genome  is  distributed  along  chromosomes,  which  are  made  of  compressed  and  entwined  DNA.  !

• Cells  are  of  many  different  types  (e.g.  blood,  skin,  nerve  cells),  but  all  can  be  traced  back  to  a  single  cell,  the  fer;lized  egg.

Page 6: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

Chromosomes

These  are  actually  human.  And  for  a  down  syndrome  pa;ent

Page 7: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

DNA

Watson  and  Crick  1953

DNAs  (Deoxyribonucleic  acids)  are  molecules  to  store  gene;c  informa;on  of  a  living  organism.  !!DNA  consists  of  two  polymers  made  from  four  types  of  nucleo;des:  adenine  (A)  guanine  (G),  cytosine  (C)  and  thymine  (T).  !!Purines:  A,  G;  Pyrimidines:  C,  T  !!Two  polymers  are  complementary  to  each  other  and  from  a  double-­‐helix  structure  ! 5’-ACCGTTCGACGGTAA-3’   |||||||||||||||   3’-TGGCAAGCTGCCATT-5’

Page 8: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

Hector Corrada Bravo

What is Genomics?

• Study the molecular basis of variation in development and disease

• Using high-throughput experimental methods

• algorithms

• ML

• data management

• modeling

���8

cancer

healthy

Page 9: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

Measurement

• For  a  small  enough  piece,  we  can  measure  the  sequence  of  bases,  referred  to  as  sequencing  

• Human  Genome  Project

Page 10: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

GenomeTCAGTTGGAGCTGCTCCCCCACGGCCTCTCCTCACATTCCACGTCCTGTAGCTCTATGACCTCCACCTTTGAGTCCCTCCTCTCACACCTGACATGAAAAGGCACATGAGGATCCTCAAATACCCCGTGATCAGTCTCAGGGTAGCTCTCATAGCCTGGACAGGGCCCCCCTCGGGGGTTGCGCCCAGGTCCAGGCGGGGGATGCACAGCAACAGTCACCGAAGCAGAAGCCGTCACAGTGGTGATGGGCTGGCAGTAGCTGGGCACAGAGCTGCCCATGGCGGTGGACGTTGGGTTCCGAGGGTTGTGAGAACGGGCCCCACGGGGCCCTGAGCGGTCCCTATTGCTAGGGCCAGAATGCCCTTCAGTAGAAATTTCAAAAGCGTCTCTGCGCGGTCTGTAGGGGGGTGGCCGCAAGCCTTCTCTAGGGGGATCCCTTCGAGGCTGCTGGCCTTGCCGTCCAGGGGACAAGGAGCCAGAGTCCAGGTGGGGCTGTTGCCGAGGGGTCAAGGGAGGCTGATGTCTGGAGTCCGGATGGACCACCTGCAGAGGAGAGACATAGGTCAACACAGGGAGGTAGGATGGTGGTGATGTTCCACCCACAAAAGAAAACCTATTCCTTTAGAAACCTCCAGGATGTGAATCCTGCCTGCACCTGCACAGCTGGCTGGAGGCATATAGCCACTGCCCATAGATCTCAACTTACCCTCACAACCAACTGCCCCCAGGCCTAAGTTCTCTGCCTCAAAACTGCCAAGGCCTGGATAGCCAAGAGCCTGGGTGTCTTGGAAATATGCAACCATAAATAGTAGCTTTTAGAAGTATAAGGCTCCTGTTTCTGGGTCATATTAGTGTTGTTTTCACCTGTCCCCAGCCCTAAGCCAGGTGTGGCCAGAAGCAAATGTACTGTAAGAGCAGAGCAAAAACTTCCACACAGATAGTTCTGTTAGGCAATACATCTCTGCCTGACTATTAGGAATCTGGTTTCTGGGTCCTCTGTACAAAGCTCGGAGCAACACAGTGGCCACATCAATCAAAAGGACCGTGACCAACTTCAAAGTCGGTGAGCTTGTACCTATTTTTAGGCTCCTGCTGAACAGAACCAGATTCACACTACAGCTCAGCAGGGCATCGTCACGGGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTTGGGGGGGGGGGGTGGACAGAGGACGGGGACACAATTCACTGGCCAGCCCTTCTCTCCTTCAAGGAAGGCTGCTCTAGCCTGGGACTGGAATACACATTTCCTGTAAACATGGTGGGGGCCTCAGGCAAGCCAGAGTTTTGGAGCCTTCCTTAACTCTTCAAGGTGAGCATCTTGACTTGGAGGGTGGGGGTGCGGGTAAGGAAGGAACCTGTGGACTCCTCCCTACAAGACAGAAAAGGAATAAGCCACGAAGACAATAACGATTTTTGTATCAAGCGTCCTCTCCCATTTCAGCTTACCTGACAATGAAATCAAATTCGGACCCTGCAAGCATCAGTACACCCAGCAGAGTGGACACAGCACCGTCCAGAACGGGAGCAAACATGTGCTCCAGAGCGAGCATAGCCCTGTGGTTCTTGTCCCCAATGGCTGTCAGAAAGGCCTGAACAAAGGAGAAAATTGACACGGTCACATTCTGGGTGTGGTAAAGTGCTCAGCTGTGTCTATACTTGGGTTTTGTAT…

Total  amount  of  DNA  in  human  genome:  3  *  109  base  pairs  (bp)

Page 11: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

Why  are  these  two  different?

Differences  explained  by  1-­‐10%  difference  in  genome

Similari;es  explained  by  similar  genes

Page 12: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

Genes

Gene Gene Gene Gene Gene

Page 13: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

Computa;onal  Biology

genomics transcriptomics proteomics

Genes  encode  proteins  which  are  transcribed  into  mRNA  and  translated  into  proteins.

Major  technological  advances  allow  unprecedented  data  acquisi;on

Page 14: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

�14

build a whole human genome sequencing device and use it to sequence 100 human genomes within 30 days or less, with an accuracy of no more than one error in every 1,000,000 bases sequenced, with an accuracy rate of at least 98% of the genome, and at a recurring cost of no more than $1,000 (US) per genome.

Page 15: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

�15

“genome sequencing technology is plummeting in cost and increasing in speed independent of our competition”

“companies can do this for less than $5,000 per genome, in a few days or less — and are moving quickly towards the goals we set for the prize.”

Page 16: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

What  makes  them  different?

Much  human  varia;on  is  due  to  difference  in  ~  6  million  base  pairs  (0.1  %  of  genome)  referred  to  as  SNPs

Page 17: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

TACATAGCCATCGGTANGTACTCAATGATGATAGenomic  DNA: A SNP

G

Single  Nucleo;de  Polymorphism  (SNP)  

Page 18: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

From reads to evidence

Page 19: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

From reads to evidence1. Comparative

Sequence-wise, individuals of a species are nearly identical

Well curated, annotated “reference” genomes exist

D. melanogaster, Science, 2000 H. sapiens, Nature, 2000 M. musculus, Nature, 2002and Science, 2000

Idea: “Map” reads to their point of origin with respect to a reference, then study differences

Page 20: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

From reads to evidence2. de novo

Assume nothing! - let reads tell us everything

Reads with overlapping sequence probably originate from overlapping portions of the subject genome

Encode overlap relationships as a graph

The full genome sequence is a “tour” of the graph

Source: De Novo Assembly Using Illumina Reads. Illumina. 2010

Source: De Novo Assembly Using Illumina Reads. Illumina. 2010 http://www.illumina.com/Documents/products/technotes/technote_denovo_assembly.pdf

Page 21: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

How  many  basepair  differences?

Page 22: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

T  T  C  G  A  T  T  A  C  G  A

A  A  G  C  T  A  A  T  G  C  T

T  T  C  G  A  T  T  A  C  G  A

A  A  G  C  T  A  A  T  G  C  T

Liver

Brain

Page 23: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

Epigene1cs

h5p://nihroadmap.nih.gov/EPIGENOMICS/images/epigene1cmechanisms.jpg

Page 24: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

What  we  will  cover

1.Transcrip1on:  Measuring  gene  “ac1vity”  2.Regula1on:  Analyzing  transcrip1on  regula1on  3.Epigene1cs:  Measuring  epigene1c  profiles    4.Gene1cs:  Analyzing  genotypes  and  their  associa1on  to  different  phenotypes  

5.Integra1on:  How  to  put  these  data  together  and  understand  biological  systems.  Individualized  medicine

�24

Page 25: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

Population Genomics

Page 26: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

Computa1onal  Biology

genomics transcriptomics proteomics

Genes  encode  proteins  which  are  transcribed  into  mRNA  and  translated  into  proteins.

Major  technological  advances  allow  unprecedented  data  acquisi1on

Page 27: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

Measurements

1!2!.!.!.!.!.!.!.!.!G

1 2 ……….N

DATA MATRIX

Samples (individuals)

Gen

es (

prob

es)

Page 28: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

article

nature genetics • volume 30 • january 2002 41

MLL translocations specify a distinct geneexpression profile that distinguishes aunique leukemiaScott A. Armstrong1–4, Jane E. Staunton5, Lewis B. Silverman1,3,4, Rob Pieters6, Monique L. den Boer6, MarkD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5, Todd R. Golub1,3,4,5* & Stanley J. Korsmeyer2,4,8**These authors contributed equally to this work.

Published online: 3 December 2001, DOI: 10.1038/ng765

Acute lymphoblastic leukemias carrying a chromosomal translocation involving the mixed-lineage leukemia gene(MLL, ALL1, HRX) have a particularly poor prognosis. Here we show that they have a characteristic, highly distinctgene expression profile that is consistent with an early hematopoietic progenitor expressing select multilineagemarkers and individual HOX genes. Clustering algorithms reveal that lymphoblastic leukemias with MLL transloca-tions can clearly be separated from conventional acute lymphoblastic and acute myelogenous leukemias. We propose that they constitute a distinct disease, denoted here as MLL, and show that the differences in geneexpression are robust enough to classify leukemias correctly as MLL, acute lymphoblastic leukemia or acute myelogenous leukemia. Establishing that MLL is a unique entity is critical, as it mandates the examination ofselectively expressed genes for urgently needed molecular targets.

1Departments of Pediatric Oncology, 2Cancer Immunology and AIDS and 8Howard Hughes Medical Institute, Dana-Farber Cancer Institute, Boston,Massachusetts, USA. 3Division of Pediatric Hematology/Oncology, Children’s Hospital, Boston, Massachusetts, USA. 4Harvard Medical School, BostonMassachusetts, USA. 5Whitehead Institute/Massachusetts Institute of Technology Center for Genome Research, Cambridge Massachusetts, USA. 6Division of Pediatric Hematology/Oncology, Sophia Children’s Hospital, University of Rotterdam, The Netherlands. 7Princess Margaret Hospital, The University ofToronto, Ontario, Canada. Correspondence and requests for materials should be addressed to S.K. (e-mail: [email protected]) or T.G.(e-mail: [email protected]).

A subset of human acute leukemias with a decidedly unfavorableprognosis possess a chromosomal translocation involving themixed-lineage leukemia gene (MLL, HRX, ALL1) on chromo-some segment 11q23 (refs 1–4). The leukemic cells, which typi-cally have a lymphoblastic morphology, have been classified asacute lymphoblastic leukemia (ALL). Unlike other types of child-hood ALL, however, the presence of the MLL translocation inALL often results in an early relapse after chemotherapy. As MLLtranslocations are typically found in infant leukemias and inchemotherapy-induced leukemia, it has remained uncertainwhether host-related factors or tumor-intrinsic biological differ-ences are responsible for poor survival.

Lymphoblastic leukemias with a rearranged MLL or germlineMLL are similar in most morphological and histochemical char-acteristics. Immunophenotypic differences associated with lym-phoblasts bearing an MLL translocation include a lack of theearly lymphocyte antigen CD10 (ref. 5), expression of the pro-teoglycan NG2 (ref. 6) and a propensity to co-express themyeloid antigens CD15 and CD65 (ref. 5). This prompted thecorresponding gene to be called mixed-lineage leukemia1 andgave rise to models that remain largely unresolved, in which theleukemia reflects disordered cell-fate decisions or the transfor-mation of a more multipotent progenitor.

Translocations in MLL result in the production of a chimericprotein in which the amino–terminal portion of MLL is fused to

the carboxy–terminal portion of 1 of more than 20 fusion part-ners7. This has led to models of leukemogenesis in which theMLL fusion protein either may confer gain of function or neo-morphic properties or may interfere with normal MLL function(with the MLL translocation representing a dominant-negativegene). Moreover, mice heterozygous for Mll (Mll+/–) show devel-opmental aberrations8,9, suggesting that the disruption of oneallele by chromosomal translocation may also manifest itself ashaplo-insufficiency in leukemic cells.

The MLL protein is a homeotic regulator that shares homologywith Drosophila trithorax (trx) and positively regulates the main-tenance of homeotic (Hox) gene expression during develop-ment8. Studies of Mll-deficient mice indicate that Mll is requiredfor proper segment identity in the axioskeletal system and alsoregulates hematopoiesis9. As Mll normally regulates the expres-sion of Hox genes, its role in leukemogenesis may include alteredpatterns of HOX gene expression. Much evidence shows thatHOX genes are important for appropriate hematopoietic devel-opment10. In addition, the t(7;11) (p15;p15) found in humanacute myelogenous leukemia (AML) results in a fusion ofHOXA9 to the nucleoporin NUP98 (refs 11,12). Thus, HOXgenes represent one set of transcriptional targets that warrantassessment in leukemias with MLL translocation.

We considered that MLL translocations might maintain a geneexpression program that results in a distinct form of leukemia

©20

02 N

atur

e Pu

blis

hing

Gro

up h

ttp://

gene

tics.

natu

re.c

om

and reasoned that RNA profiles might resolve whether leukemiasbearing an MLL translocation represent a truly biphenotypicleukemia of mixed identity, a conventional B-cell precursor ALLwith expression of limited myeloid genes, or a less committedhematopoietic progenitor cell. In addition, comparing geneexpression profiles of lymphoblastic leukemias with and withoutrearranged MLL is important because of their markedly differentresponse to standard ALL therapy and because such analysis mayidentify molecular targets for therapeutic approaches. Theexpression profiles reported here show that ALLs possessing arearranged MLL have a highly uniform and distinct pattern thatclearly distinguishes them from conventional ALL or AML andwarrants designation as the distinct leukemia MLL.

ResultsMLL is distinct from conventional ALLTo further define the biological characteristics specified by MLLtranslocations, we compared the gene expression profiles ofleukemic cells from individuals diagnosed with B-precursor ALL bearing an MLL translocationagainst those from individuals diagnosed withconventional B-precursor ALL that lack thistranslocation. Initially, we collected samples from20 individuals with conventional childhood ALL(denoted ALL), 10 of which had a TEL/AML1translocation. In addition, we collected samplesfrom 17 individuals affected with the MLLtranslocation (denoted MLL). Details of theaffected individuals and expression data are avail-able online (Methods).

First, we determined whether there were genesamong the 12,600 tested whose expression patterncorrelated with the presence of an MLL transloca-tion. We sorted the genes by their degree of correla-tion with the MLL/ALL distinction (Fig. 1) andused permutation testing to assess the statistical sig-nificance of the observed differences in gene expres-sion13. For the 37 samples tested, roughly 1,000genes are underexpressed in MLL as compared withconventional ALL, and about 200 genes are rela-tively highly expressed (data not shown). Thus,MLL shows a gene expression profile markedly dif-ferent from that of conventional ALL.

MLL shows multilineage gene expressionInspection of the genes differentially expressedbetween MLL and ALL is instructive (Fig. 1). Manygenes underexpressed in MLL have a function inearly B-cell development. These include genesexpressed in early B cells14,15, MME, CD24, CD22

and DNTT (mouse TdT); genes required for appropriate B-celldevelopment16–19, TCF3, TCF4, POU2AF1 and LIG4; andSMARCA4 (mouse Snf2b), which is correlated with B-precursorALL in an AML/ALL comparison13 (Fig. 1 and Web Note A).Genes encoding certain adhesion molecules are relatively over-expressed in MLL, including LGALS1, ANXA1, ANXA2, CD44and SPN.

Several genes that are expressed in hematopoietic lineagesother than lymphocytes are also highly expressed in MLL. Theseinclude genes that are expressed in progenitors20–22, PROML1,FLT3 and LMO2; myeloid-specific genes23–25, CCNA1, SER-PINB1, CAPG and RNASE3; and at least one natural killercell–associated gene26, the gene encoding NKG2D (Fig. 1 andWeb Note A). Overexpression of HOXA9 and PRG1 in MLL is ofparticular interest, as these genes have been reported to be highlyexpressed in AML13 and overexpression of HOXA9 has beenassociated with a poor prognosis13.

Fig. 1 Genes that distinguish ALL from MLL. The 100 genesthat are most highly correlated with the class distinction areshown. Each column represents a leukemia sample, and eachrow represents an individual gene. Expression levels are nor-malized for each gene, where the mean is 0, expression levelsgreater than the mean are shown in red and levels less thanthe mean are in blue. Increasing distance from the mean isrepresented by increasing color intensity. The top 50 genesare relatively underexpressed and the bottom 50 genes rela-tively overexpressed in MLL. Gene accession numbers and thegene symbols or DNA sequence names are labeled on theright. Individual samples are arranged such that column 1 cor-responds to ALL patient 1, column 2 corresponds to ALLpatient 2, and so on. Information about the samples alongwith the top 200 genes that make the ALL/MLL distinctionand their accession numbers can be found on our web site(http://research.dfci.harvard.edu/korsmeyer/MLL.htm).

article

42 nature genetics • volume 30 • january 2002

©20

02 N

atur

e Pu

blis

hing

Gro

up h

ttp://

gene

tics.

natu

re.c

om

Page 29: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

Population Genomics

• Clustering: Group samples (individuals) that show similar gene expression profiles

• Classification: Discover gene expression profiles that distinguish two populations: e.g., cancer patients vs. healthy people

• Networks: Discover groups of genes whose expression behaves differently in two populations

Page 30: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

Why stats

• If we want to infer things about gene expression in populations, we need to do some statistics

• we want to see if some particular differences we see are due to chance

• we want to make sure an experiment is setup so differences we see are those we care about

• we want to have a sense of how general are inferences are (overfitting)

Page 31: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

Sta1s1cal  and  ML  methods

1.Sta1s1cal  es1ma1on  and  inference  2.Model  building:  supervised  learning,  semi-­‐supervised  learning  

3.Clustering  analysis  (unsupervised  learning)  4.Predic1on  and  classifica1on  

1.Sparse  methods  to  deal  with  high-­‐dimensionality  

5.Graphical  models  6.We  will  also  discuss  visualiza1on

�31

Page 32: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,
Page 33: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

!33

Personal Genomics

Page 34: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

Sequence Once Read Often

Read what? - genome - variants - methylation - expression - other genome features - medical literature - risk models - population information - ...

Page 35: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

Personal Genomics

• We need to produce reliable genome measurements, but on much bigger scale (Algorithmics, Systems)"

• Multiple genome features, decide which are relevant and significant (Information Retrieval, Data Management)"

• Population-based science, interpreted individually (Machine Learning/Statistics, Privacy)

!35

Page 36: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

NHGRI  strategic  plan

• What  does  the  NIH  think  genomics  should  be  for  the  next  10  years?

[Nature,  Feb.  2011]

Page 37: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

Where  do  we  fit  in?• The  major  bo5leneck  in  genome  sequencing  is  no  longer  data  genera1on—the  computa1onal  challenges  around  data  

analysis,  display  and  integra1on  are  now  rate  limi1ng.  New  approaches  and  methods  are  required  to  meet  these  challenges.  

• Data  analysis    – Computa1onal  tools  are  quickly  becoming  inadequate  for  analysing  the  amount  of  genomic  data  that  can  now  be  generated,  and  this  

mismatch  will  worsen.  Innova1ve  approaches  to  analysis,  involving  close  coupling  with  data  produc1on,  are  essen1al.  • Data  integraIon  

– Genomics  projects  increasingly  produce  disparate  data  types  (for  example,  molecular,  phenotypic,  environmental  and  clinical),  so  computa1onal  approaches  must  not  only  keep  pace  with  the  volume  of  genomic  data,  but  also  their  complexity.  New  integra1ve  methods  for  analysis  and  for  building  predic1ve  models  are  needed.  

• VisualizaIon  – In  the  past,  visualizing  genomic  data  involved  indexing  to  the  one-­‐dimensional  representa1on  of  a  genome.  New  visualiza1on  tools  will  

need  to  accommodate  the  mul1dimensional  data  from  studies  of  molecular  phenotypes  in  different  cells  and  1ssues,  physiological  states  and  developmental  1me.  Such  tools  must  also  incorporate  non-­‐molecular  data,  such  as  phenotypes  and  environmental  exposures.  The  new  tools  will  need  to  accommodate  the  scale  of  the  data  to  deliver  informa1on  rapidly  and  efficiently.  

• ComputaIonal  tools  and  infrastructure  – Generally  applicable  tools  are  needed  in  the  form  of  robust,  well-­‐engineered  sogware  that  meets  the  dis1nct  needs  of  genomic  and  

non-­‐genomic  scien1sts.  Adequate  computa1onal  infrastructure  is  also  needed,  including  sufficient  storage  and  processing  capacity  to  accommodate  and  analyse  large,  complex  data  sets  (including  metadata)  deposited  in  stable  and  accessible  repositories,  and  to  provide  consolidated  views  of  many  data  types,  all  within  a  framework  that  addresses  privacy  concerns.  Ideally,  mul1ple  solu1ons  should  be  developed105.  

Page 38: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

Where  do  we  fit  in?

• Mee1ng  the  computa1onal  challenges  for  genomics  requires  scien1sts  with  exper1se  in  biology  as  well  as  in  informa1cs,  computer  science,  mathema1cs,  sta1s1cs  and/or  engineering.    !

• A  new  genera)on  of  inves)gators  who  are  proficient  in  two  or  more  of  these  fields  must  be  trained  and  supported.

Page 39: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

It’s genomic data science!

�39

Obtain'

Communicate'

Visualize)

Transform)

Model)

integration (contextualization)

[H. Wickham]

Page 40: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

Administrative Details

Page 41: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

• Class webpage:

•http://www.cbcb.umd.edu/~hcorrada/CFG

• Everything you want to know is there.

Page 42: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

1. Name 2. email (@umd.edu) 3. Department and degree 4. Are you registered?(Y/N) 5. Relevant background

1. CS: e.g., string algorithms, graph algorithms, ML, … 2. stats: e.g., linear regression 3. biology: e.g., last class was in high school

6. What do you hope to get out of this class? 7. (a) Favorite, and (b) least favorite CS/stats term/name/

word/phrase. Why? 8. (a) Favorite, and (b) least favorite biology term/name/word/

phrase. Why?

Page 43: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

1. Name: Héctor Corrada Bravo 2. email (@umd.edu): [email protected] 3. Department and degree: CS, PhD 4. Are you registered?(Y/N): No 5. Relevant background:

1. CS: Machine Learning, Intro to Bioinformatics, Advanced Bioinformatics

2. stats: PhD-level math stats sequence in stats dept. 3. biology: undergrad bio coursework

6. What do you hope to get out of this class? a neat project with publishable results

7. (a) Favorite, and (b) least favorite CS/stats term/name/word/phrase. (a) pigeon-hole principle, (b) machine learning

8. (a) Favorite, and (b) least favorite biology term/name/word/phrase. (a) oligonucleotide, (b) mammalome

Page 44: CMSC702: Computational systems biology and functional …users.umiacs.umd.edu/~hcorrada/CFG/lectures/lect1_Intro/CMSC702_Lecture1.pdfD. Minden7, Stephen E. Sallan1,3,4, Eric S. Lander5,

The plan

• First series of lectures will get us going: Introduction to the R data analysis environment, biology background and Bioconductor

• Required:

• read John Cooks intro to R (class page)

• setup R (make sure it’s latest release 3.0.2), I recommend using Rstudio

• start reading for Tuesday’s lecture (2/4):

• Larry Hunter, “Molecular Biology for Computer Scientists” (class page)

• Sign up on piazza class page, start building teams (2-4)