Array Informatics Mark Gerstein

35
Do not reproduce without permission 1 Lectures.GersteinLab.org (c) 2007 Array Informatics Mark Gerstein

description

Array Informatics Mark Gerstein. CEGS Informatics Developing Tools and Technical Analyses Related to Genome Technologies. Main Genome Technologies Tiling Arrays Next Generation Sequencing Main Applications Transcript mapping Protein-DNA Binding CGH Transitioning to Seq. - PowerPoint PPT Presentation

Transcript of Array Informatics Mark Gerstein

Page 1: Array Informatics Mark Gerstein

Do not reproduce without permission 1 L

ec

ture

s.G

ers

tein

La

b.o

rg

(c)

20

07

Array Informatics

Mark Gerstein

Page 2: Array Informatics Mark Gerstein

Do not reproduce without permission 2 L

ec

ture

s.G

ers

tein

La

b.o

rg

(c)

20

07

CEGS Informatics Developing Tools and Technical Analyses Related to

Genome Technologies

• Main Genome Technologies Tiling Arrays Next Generation Sequencing

• Main Applications Transcript mapping Protein-DNA Binding CGH

• Transitioning to Seq....

Page 3: Array Informatics Mark Gerstein

Do not reproduce without permission 3 L

ec

ture

s.G

ers

tein

La

b.o

rg

(c)

20

07

Tools & Tech. Analyses for Processing of Genome Technology Data

• Normalizing Arrays and Measuring & Correcting Artifacts COP - Correcting positional artifacts [Yu et al. NAR '07] Efficient Pseudomedian Calculation - for Tiling Array Scoring

[Royce et al., BMC Bioinfo. '07] Measuring Mismatch Effects

[Seringhaus et al., BMC Genomics (submitted)] Removing Seq. Effects [Royce et al., Bioinfo. '07] NN Prediction of Probe Intensity - measuring & exploiting specific

cross-hyb [Royce et al. NAR '07]

• Simulating NextGen Sequencing ChipSeqSim - simulating ChIP Seq [Zhang et al., PLoS CB '08]

Page 4: Array Informatics Mark Gerstein

Do not reproduce without permission 4 L

ec

ture

s.G

ers

tein

La

b.o

rg

(c)

20

07

Tools & Tech. Analyses for Genome Structural Variation

Breakptr - HMM-based Array Segmentation for CNV detection [Korbel et al., PNAS '07]

MSB - Mean-shift-based Array Segmentation for CNV detection with extension to sequencing [Wang et al. Gen. Res. (submitted)]

PEMer - Paired-end Mapping for SV Dectection with simulation calibration and breakpoint DB [Korbel et al., GenomeBiol. (submitted)]

Long-SV-Assembly Simulations [Du et al., Nat. Meth. (submitted)] SD-CNV-CORR - Approach for correlating the occurrence of CNVs

and SDs with genomic features (particularly repeats) [Kim et al., Genome Res. (submitted)]

Page 5: Array Informatics Mark Gerstein

Do not reproduce without permission 5 L

ec

ture

s.G

ers

tein

La

b.o

rg

(c)

20

07

5

(c

) M

ark

Ge

rste

in,

20

02

, Y

ale

, b

ioin

fo.m

bb

.ya

le.e

du

5

(c

) M

ark

Ge

rste

in,

20

02

, Y

ale

, b

ioin

fo.m

bb

.ya

le.e

du

A Starting Point: Noisy Raw Signal from Tiling Arrays (Transcription)

Joh

nso

n e

t a

l. (2

00

5)

TIG

, 2

1,

93

-10

2.

Li et al., PLOS one (2007)

Page 6: Array Informatics Mark Gerstein

Do not reproduce without permission 6 L

ec

ture

s.G

ers

tein

La

b.o

rg

(c)

20

07

6

(c

) M

ark

Ge

rste

in,

20

02

, Y

ale

, b

ioin

fo.m

bb

.ya

le.e

du

Perfect Match Specific Cross-hyb. Non-specific Cross-hyb.

Specific & Non-specific Cross-Hyb.• Perfect match (PM):

probe binding intended target• Specific cross-hyb.: probes binding non-PM

targets with a small number of mismatches• Non-specific cross-hyb.: probes binding

targets with many mismatches, due to general stickiness of oligos

Page 7: Array Informatics Mark Gerstein

Do not reproduce without permission 7 L

ec

ture

s.G

ers

tein

La

b.o

rg

(c)

20

07

Non-Specific Cross Hyb. (Sequence Effects)

Page 8: Array Informatics Mark Gerstein

Do not reproduce without permission 8 L

ec

ture

s.G

ers

tein

La

b.o

rg

(c)

20

07

8

(c

) M

ark

Ge

rste

in,

20

02

, Y

ale

, b

ioin

fo.m

bb

.ya

le.e

du

Creation of Standardized Datasets for

Quantifying Effect of Mismatches

[Seringhaus et al., BMC Genomics (in press)]

Types of Mispairs (probe on array is first)

No

rmal

ized

In

ten

sity

MM

M

M v

s. P

M

Yeast Human

GA

v A

G

CA

v A

C

GT

v T

G

CT

v T

C

Page 9: Array Informatics Mark Gerstein

Do not reproduce without permission 9 L

ec

ture

s.G

ers

tein

La

b.o

rg

(c)

20

07

Sourc

e:

Royce

, T.E

., e

t al (2

00

7),

Bio

info

rmati

cs, 23

, 9

88

-97

Avg. intensity of all background probes with a C at position 4

Avg. intensity of all background probes with a T at position 33

Observing Non-specific Cross-hyb. (Probe sequence effects)

Page 10: Array Informatics Mark Gerstein

Do not reproduce without permission 10

Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

00710

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

Iterated Quantile Normalization to

Correct for Non-specific Cross-hyb.

• Adapt Bolstad et al (2003) approach to tiling arrays

• Force distributions with a given nt at each position to be same

• Distributions at other positions now different so iterate

• Also, robust adaptation of Naef & Magnasco (2003)

T Royce et al (2007), Bioinformatics, 23, 988-97

Page 11: Array Informatics Mark Gerstein

Do not reproduce without permission 11

Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

007

Measuring Specific Cross-Hyb

Source: Royce, T.E., et al (2007), Nucleic Acids Res., 23, 98-97

Page 12: Array Informatics Mark Gerstein

Do not reproduce without permission 12

Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

00712

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

Proof of principle test to “exploit” this

Figure from http://www.members.cox.net/amgough/Fanconi-genetics-genetics-primer.htm

• Using Cheng et al. (2005), predict gene expression levels (and profiles across tissues) for genes on part of chr. #6

• ...Based on closest cross-hyb tiles on part of chr. #7

• Then compare to measured levels and profile on #6

Source: Royce, T.E., et al (2007), Nucleic Acids Res., 23, 98-97

Page 13: Array Informatics Mark Gerstein

Do not reproduce without permission 13

Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

007

Nearest Nbr Search on Virtual Tiling

Royce, T.E., et al (2007), Nucleic Acids Res., 23, 98-97

Page 14: Array Informatics Mark Gerstein

Do not reproduce without permission 14

Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

007

Agreement between predicted tile expression profile and actual one

Source: Royce, T.E., et al (2007), Nucleic Acids Res., 23, 98-97

• Correlated predicted profiles with the actual profiles of gene expression across cell lines

• Much more correlation than expected by chance (dist. centered on 0)

Page 15: Array Informatics Mark Gerstein

Do not reproduce without permission 15

Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

00715

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

Royce, T. E. et al. Nucl. Acids Res. 2007 35:e99

Very Strong ROC Curve: Most genes are accurately detected using

nearest-neighbor features' signals

• Illustrates great magnitude of cross-hyb. on hi-density arrays

• High feature density arrays inadvertently resurrecting generic n-mer concept (van Dam & Quake, 2003)

• Suggests that tiling arrays could be exploited to create universal arrays

• Gold std. set of known expressed genes. How well do we find. • A set of known positives was defined as the Refseq genes with at

least 75% transfrag coverage. A set of known negatives was constructed by permuting the sequences in the set of known positives. For various thresholds, sensitivity and specificity were computed and then plotted.

Page 16: Array Informatics Mark Gerstein

Do not reproduce without permission 16

Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

007

CEGS Informatics Credits

• Array Corrections J Rozowsky T Royce M Seringhaus

• PEMer, SD-CNV, BreakPtr P Kim J Korbel J Du X Mu A Abyzov N Carriero

• Experimental M Snyder S Weissman A Urban

Page 17: Array Informatics Mark Gerstein

Do not reproduce without permission 17

Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

007

Computational Methods for SV Characterization

Mark Gerstein

Page 18: Array Informatics Mark Gerstein

Do not reproduce without permission 18

Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

007

Computational Methods for SV Characterization

Segmenting Array CGH data

Building a PEM pipeline

Correlating SVs and SDs with Repeats

Page 19: Array Informatics Mark Gerstein

Do not reproduce without permission 19

Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

00719

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

http://breakptr.gersteinlab.orgKorbel*, Urban* et al., PNAS (2007)

-0.5

0

0.5

Flu

ores

cenc

e

log2

rat

io

NO T TO SCALEACGTGACAC ATAAGCACACCA ATTGCTTGAGGGACCT TAGGCACAGT TAAC ATGATAAGCACACCA ATTGCTTGAGGTGACDNA

sequence

BreakPtr HMM

• To get highest resolution on breakpoints need to smooth & segment the signal

• BreakPtr: prediction of breakpoints, dosage and cross-hybridization using a system based on Hidden Markov Models

Page 20: Array Informatics Mark Gerstein

Do not reproduce without permission 20

Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

00720

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

High resolution of tiling arrays allows statistical integration of nucleotide sequence patterns

>4-fold enrichment of the breakpoints of copy number variants near segmental duplications (SDs)[e.g. Sharp et al., Am. J. Hum. Genet. 2005; 77:78-88].

Page 21: Array Informatics Mark Gerstein

Do not reproduce without permission 21

Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

00721

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

BreakPtr statistically integrates array signal and DNA sequence signatures (using a discrete-valued bivariate HMM)

Korbel*, Urban* et al., PNAS (2007)

Transition A

Transition A’

Transition B

Transition B’

Duplication DeletionNormal

Sequence

Arr

ayva

lues

Sequence

Arra

yva

lues

Page 22: Array Informatics Mark Gerstein

Do not reproduce without permission 22

Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

00722

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

Gold standards

Training data

Parameter optimization

Model parameter estimation

Breakpointvalidation

[sequencing]

log (number of CNVs available for parameter estimation)10

10[core]

304[intermediate A]

1003[intermediate B]

2503[full]

SDs

norm

aliz

ed f

luor

esce

nt

inte

nsity

log

-ra

tios

2

Max

imum

num

ber

of p

aram

eter

s pe

r tr

ansi

tion

stat

e

‘Active’ approach for breakpoint identification: initial scoring with preliminary model, targeted validation (with sequencing),

retraining, and rescoring

HMM optimized iteratively (using Expectation Maximization, EM) Korbel*, Urban* et al., PNAS (2007)

CNV breakpoints sequenced in ~10 cases following BreakPtr analysis;

Median resolution <300 bp

No improvement in accuracy with higher resolution (9nt tiling)

Page 23: Array Informatics Mark Gerstein

Do not reproduce without permission 23

Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

00723

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

Moving Beyond Arrays, Computational Methods for Next-

Generation Sequencing:Paired End Mapping to Find SVs

Page 24: Array Informatics Mark Gerstein

Do not reproduce without permission 24

Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

00724

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

Overall Strategy for Analysis of

NextGen Seq. Data to Detect Structural Variants

[Korbel et al., Science ('07); Korbel et al., GenomeBiol. (submitted)]

Page 25: Array Informatics Mark Gerstein

Do not reproduce without permission 25

Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

00725

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

Simulation strategy

Simulation

Experiment454 sequencing [Korbel et al., GenomeBiol. (submitted)]

Page 26: Array Informatics Mark Gerstein

Do not reproduce without permission 26

Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

00726

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

Reconstruction efficiency at different coverage

[Korbel et al., GenomeBiol. (submitted)]

Page 27: Array Informatics Mark Gerstein

Do not reproduce without permission 27

Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

00727

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

Building a Database of Variants: Complexities

[Korbel et al., GenomeBiol. (submitted)]

Page 28: Array Informatics Mark Gerstein

Do not reproduce without permission 28

Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

00728

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

Analyzing Duplications in the Genome (SDs & CNVs)

pers. photo, see streams.gerstein.info

Page 29: Array Informatics Mark Gerstein

080907_SD_CNV_Slides_MBG_CEGS_PMK

29

SEGMENTAL DUPLCATIONS AND COPY NUMBER VARIANTS ARE RELATED PHENOMENA AND HAVE BEEN CREATED BY SEVERAL DIFFERENT MECHANISMS

Copy Number Variants (CNV) Segmental Duplications (SD)

Fixation

Intra-species variation Fixed mutations(differences to other species)

NAHR (Non-allelic homologous recombination)

Flanking repeat(e.g. Alu, LINE…)

NHEJ (Non-homologous-end-joining)

No (flanking) repeats. In some cases <4bp microhomologies

Page 30: Array Informatics Mark Gerstein

080907_SD_CNV_Slides_MBG_CEGS_PMK

30

PERFORM LARGE SCALE CORRELATION ANALYSIS TO DETECT REPEAT SIGNATURES OF SDs AND CNVs

Survey a range of genomic features

Count the number of features in each genomic bin (100kb)

Calculate correlations / enrichments using robust stats

1

2

3

…ATCAAGG CCGGAA…

Exact match

Local environment

If exact CNV breakpoints are known, we can calculate the enrichment of repeat elements relative to the genome or relative to the local environment

[Kim et al. Gen. Res. (submitted, '08), arxiv.org/abs/0709.4200v1 ]

Page 31: Array Informatics Mark Gerstein

080907_SD_CNV_Slides_MBG_CEGS_PMK

31

SDs ARE CORRELATED WITH ALUS AND OTHER SDs

0.080.09

0.120.14 0.14 0.13

Alu association with SDs by age

90-92% 92-94% 94-96% 96-98% 98-99% >99%

• The co-localization of Alu elements with SDs is highly significant.

• Older SDs have a much higher association with Alus than younger SDs.

• SDs can mediate NAHR and lead to the formation of CNVs

• Such mechanisms (“preferential attachment”) are well studied in physics and should leads a very skewed (“power-law”) distribution of SDs.

•Hotspots[Kim et al. Gen. Res. (submitted, '08), arxiv.org/abs/0709.4200v1 ]

f

Number of SDs in Genomic Bin

Oc

cu

rre

nc

e

Page 32: Array Informatics Mark Gerstein

080907_SD_CNV_Slides_MBG_CEGS_PMK

32

0.0480.0006

0.07390.0466

ASSOCIATIONS ARE DIFFERENT FOR SDs AND CNVs

Alu

0.92

CNV association with repeats

Microsatellite

<0.001

Pseudogenes

0.046

LINE

0.001

0.07

0.27

0.0940.21

Alu

<0.001

SD association with repeats

Microsatellite

<0.001

Pseudogenes

0.046

LINE

0.001

[Kim et al. Gen. Res. (submitted, '08), arxiv.org/abs/0709.4200v1 ]

CNVAssociationwith SDs

>99% SDs* CNVs

0.31

0.11

CNVs ARE LESS ASSOCIATED WITH SDs THAN THE GENERAL SD TREND

Page 33: Array Informatics Mark Gerstein

Do not reproduce without permission 33

Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

00733

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

uOldLow seq-ID (%)

YoungHigh seq-ID (%)

CNVs

Fixation Aging (~40Mya)

NAHR

NHEJ

SDs

AluSD

LINEMicrosatellite

SubtelomeresFragile sites

Alu Burst (40 MYA)

AFTER THE ALU BURST, THE IMPORTANCE OF ALU ELEMENTS FOR GENOME REARRANGEMENT DECLINED RAPIDLY

• About 40 million years ago there was a burst in retrotransposon activity

• The majority of Alu elements stem from that time

• This, in turn, led to rapid genome rearrangement via NAHR

• The resulting SDs, could create more SDs, but with Alu activity decaying, their creation slowed

[Kim et al. Gen. Res. (submitted, '08), arxiv.org/abs/0709.4200v1 ]

Page 34: Array Informatics Mark Gerstein

Do not reproduce without permission 34

Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

007

Future Directions

• Simulations of SV Assembly• Analysis of Split Reads• Detailed Analysis of SV and CNVs

with Genomic Features

Page 35: Array Informatics Mark Gerstein

Do not reproduce without permission 35

Le

ctu

res

.Ge

rste

inL

ab

.org

(c

) 2

007

CEGS Informatics Credits

• Array Corrections J Rozowsky T Royce M Seringhaus

• PEMer, SD-CNV, BreakPtr P Kim J Korbel J Du X Mu A Abyzov N Carriero

• Experimental M Snyder S Weissman A Urban