Starting Monday M Oct 29 –Back to BLAST and Orthology (readings posted) will focus on the BLAST...

Starting Monday

M Oct 29 – Back to BLAST and Orthology (readings posted): we will focus on the BLAST algorithm and the different types and applications of BLAST; in lab we will predict orthologs using reciprocal genome-scale BLAST searches

W Oct 31 – Phylogenetic Profiles (an example of unsupervised machine learning), plus supervised machine learning approaches and applications

M Nov 5 – Phylogeny (Phylogeny Lab)

W Nov 7 – Metabolic reconstruction and modeling

***2-3 pg paper on preliminary results due***

Today: ChIP-chip and ChIP-seq analysis

Chromatin immunoprecipitation (ChIP)

1. Chemical or light-based crosslinking applied to living cells

2. Shear DNA by sonication or digestion

3. IP by specific Ab or Ab against protein tag


ChIP on ChIP (tiled genomic microarrays)

[Figure: signal intensity across array probes]

Peak resolution is a function of:
- shearing size
- probe resolution
- ChIP enrichment


ChIP - Seq

[Figure: read counts]


1. Map reads to the reference genome

2. Convert to ‘tag’ counts: sequence coverage at each base pair in the genome

3. Find peaks of high tag count (using a fixed or sliding window with a count threshold) or based on the bimodal peak distribution

4. Convert bimodal peaks into summits (by shifting 3’ tag positions OR by extending the tag signal to the estimated size of the fragments)

5. Identify summits that represent fragment enrichment relative to control

6. Assign a confidence score (p-value, enrichment score, and/or FDR)
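Steps 2 and 3 above can be sketched in a few lines. This is a minimal illustration, not the method of any particular peak caller: the read intervals, the window size, and the threshold are hypothetical values chosen for the toy example, and the function names `coverage` and `call_peaks` are made up here.

```python
# Toy sketch of steps 2-3: per-base tag counts, then a sliding-window threshold.
# Reads are hypothetical (start, end) intervals, 0-based, end-exclusive.

def coverage(reads, genome_length):
    """Per-base tag count, built with a difference array."""
    diff = [0] * (genome_length + 1)
    for start, end in reads:
        diff[start] += 1
        diff[end] -= 1
    counts, running = [], 0
    for delta in diff[:genome_length]:
        running += delta
        counts.append(running)
    return counts

def call_peaks(counts, window=5, threshold=3.0):
    """Merged (start, end) regions whose window-mean tag count >= threshold."""
    peaks = []
    for i in range(len(counts) - window + 1):
        mean = sum(counts[i:i + window]) / window
        if mean >= threshold:
            if peaks and i <= peaks[-1][1]:
                # Window overlaps or touches the previous peak: extend it.
                peaks[-1] = (peaks[-1][0], i + window)
            else:
                peaks.append((i, i + window))
    return peaks

reads = [(2, 8), (3, 9), (4, 10), (5, 11), (20, 26)]
counts = coverage(reads, 30)
peaks = call_peaks(counts)
print(peaks)
```

In this toy input the four overlapping reads form one peak, while the isolated read at position 20 stays below the threshold.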

Types of ‘control’ data for ChIP experiments

1. ‘Input’ DNA = sheared but no IP

2. No-antibody mock IP

3. Untagged strain

Almost always some background in mock-IP

… hope is to have enrichment of IP material over background.

* Certain artifacts can give the appearance of real peaks in control experiments.

Pepke et al. 2009

The read count / tag profile is generally smoothed before peak calling (e.g. with a running average), and then the ‘summit’ is inferred from the dual read peaks
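As a rough sketch of this idea, the example below smooths strand-specific tag counts with a running average and places the summit midway between the forward-strand and reverse-strand modes. The count vectors are hypothetical, and taking the midpoint of the two modes is one simple assumption about how the dual peaks are combined; real peak callers differ in the details.

```python
# Toy sketch: smooth strand-specific tag counts, then infer a summit
# from the two strand modes. All data here are made up for illustration.

def running_average(values, window=3):
    """Centered running average; the window shrinks at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half):i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

def infer_summit(forward_counts, reverse_counts, window=3):
    """Return (summit position, estimated per-strand shift)."""
    fwd = running_average(forward_counts, window)
    rev = running_average(reverse_counts, window)
    fwd_mode = max(range(len(fwd)), key=fwd.__getitem__)
    rev_mode = max(range(len(rev)), key=rev.__getitem__)
    summit = (fwd_mode + rev_mode) // 2   # midpoint of the two strand peaks
    shift = (rev_mode - fwd_mode) // 2    # half the distance between them
    return summit, shift

forward = [0, 1, 3, 6, 3, 1, 0, 0, 0, 0, 0, 0]
reverse = [0, 0, 0, 0, 0, 0, 1, 3, 6, 3, 1, 0]
summit, shift = infer_summit(forward, reverse)
print(summit, shift)
```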

* Using a method that incorporates a measured background model is probably very important
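One simple way to use a measured background is to ask how surprising the ChIP tag count in a region is, given the control count in the same region. The sketch below assumes a Poisson background whose rate is the control count scaled by library size; the Poisson choice, the pseudocount, and the function names are assumptions made for illustration rather than the model of any specific program.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)   # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i     # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

def enrichment_p_value(chip_count, control_count, chip_total, control_total,
                       pseudocount=1.0):
    """p-value for a region's ChIP count under a control-derived Poisson rate."""
    scale = chip_total / control_total            # library-size normalization
    expected = (control_count + pseudocount) * scale
    return poisson_sf(chip_count, expected)

# Hypothetical region: 50 ChIP tags vs 4 control tags, equal library sizes.
print(enrichment_p_value(50, 4, 1_000_000, 1_000_000))
```

The pseudocount keeps the expected rate above zero in regions where the control has no reads.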


3 Types of peaks

1. Sharp & narrow (100s bp) (e.g. site-specific TF)

2. Broader but defined (kb) (e.g. RNA Polymerase)

3. Very broad (regional, 1000s kb) (e.g. heterochromatin histone marks)

• Methods that rely on bimodal peak profiles to identify summits work less well for biologically wider peaks/loci

Hidden Markov Models for Identifying Bound Fragments

HMMs are trained on known data to recognize different states (e.g. bound vs. unbound fragments) and the probability of moving between those states

Example: ChIP-chip data from a tiling microarray identifying regions bound to a transcription complex with a known 50bp binding sequence.

You expect that a bound fragment will have high signal on the array and that the bound fragment will be 2-3 probes long.

Once trained, an HMM can be used to identify the ‘hidden’ states in an unknown dataset, based on the known characteristics of each state (‘emission probabilities’) and the probability of moving between states (‘transition probabilities’)

Example: “A hidden Markov model for analyzing ChIP-chip experiments on genome tiling arrays and its application to p53 binding sequences” 2005. Li, Meyer, Liu



[Figure: four-state HMM over consecutive 25mer probes: Unbound 25mer, Bound 25mer, Bound 25mer, Bound 25mer]

I = Intensity units > 10,000; i = Intensity units < 10,000

Emission Probabilities

Unbound 25mer: P( I ) = 0.2, P( i ) = 0.8

Bound 25mer (each of the three): P( I ) = 0.8, P( i ) = 0.2

Transition Probabilities (arrow labels, in the order they appear on the slide)

P = 0.5, P = 0.5, P = 1.0, P = 0, P = 0.7, P = 0.3, P = 1.0

Given the data, an HMM considers many different paths through the states and returns the most probable one
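A minimal Viterbi decoding sketch for a four-state model like the one on the slide. The emission probabilities are the slide's values; the assignment of the transition probabilities to specific arrows (U to U 0.5, U to B1 0.5, B1 to B2 1.0, B2 to B3 0.7, B2 to U 0.3, B3 to U 1.0) and the choice to start in the unbound state are assumptions, since the arrows are not recoverable from the transcript. The state names `U`, `B1`, `B2`, `B3` are made up here.

```python
import math

STATES = ["U", "B1", "B2", "B3"]
# ASSUMED arrow assignment; any transition not listed has probability 0.
TRANS = {
    "U": {"U": 0.5, "B1": 0.5},
    "B1": {"B2": 1.0},
    "B2": {"B3": 0.7, "U": 0.3},
    "B3": {"U": 1.0},
}
# "I" = intensity > 10,000; "i" = intensity < 10,000 (values from the slide).
EMIT = {
    "U": {"I": 0.2, "i": 0.8},
    "B1": {"I": 0.8, "i": 0.2},
    "B2": {"I": 0.8, "i": 0.2},
    "B3": {"I": 0.8, "i": 0.2},
}
START = {"U": 1.0}  # ASSUMED: the probe series begins unbound

def safe_log(p):
    return math.log(p) if p > 0 else float("-inf")

def viterbi(observations):
    """Most probable state path for a sequence of 'I'/'i' observations."""
    scores = {s: safe_log(START.get(s, 0.0)) + safe_log(EMIT[s][observations[0]])
              for s in STATES}
    back = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            best_prev = max(
                STATES,
                key=lambda p: scores[p] + safe_log(TRANS[p].get(s, 0.0)),
            )
            new_scores[s] = (scores[best_prev]
                             + safe_log(TRANS[best_prev].get(s, 0.0))
                             + safe_log(EMIT[s][obs]))
            pointers[s] = best_prev
        back.append(pointers)
        scores = new_scores
    # Trace back from the best final state.
    state = max(STATES, key=scores.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

path = viterbi(list("iiIIIii"))
print(path)
```

With three consecutive high-intensity probes, the decoded path passes through the three bound states, matching the expectation that a bound fragment spans 2-3 probes.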


Evaluated 11 different peak-calling algorithms using 3 real datasets and default parameters (mimicking “non-expert users”)

- methods with smaller peak lists often return peaks identified by other methods (more stringent)

“many programs call similar peaks, though default parameters are tuned to different levels of stringency”
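One way to check this kind of agreement yourself is to measure what fraction of one program's peaks overlap any peak from another program. The sketch below uses hypothetical peak lists on a single chromosome; the names `stringent` and `permissive` are illustrative only.

```python
# Toy comparison of two peak lists, each a list of (start, end) intervals
# on the same chromosome (end-exclusive). Data are made up.

def overlaps(a, b):
    """True if intervals a and b share at least one base."""
    return a[0] < b[1] and b[0] < a[1]

def fraction_recovered(query_peaks, reference_peaks):
    """Fraction of query peaks that overlap at least one reference peak."""
    if not query_peaks:
        return 0.0
    hits = sum(any(overlaps(q, r) for r in reference_peaks)
               for q in query_peaks)
    return hits / len(query_peaks)

stringent = [(100, 200), (500, 600)]
permissive = [(90, 220), (480, 650), (900, 1000), (1500, 1600)]

print(fraction_recovered(stringent, permissive))
print(fraction_recovered(permissive, stringent))
```

A short, stringent list that is fully recovered by a longer list, while the reverse fraction is lower, is the pattern described above. The quadratic scan is fine for a sketch; genome-scale lists would call for sorted intervals or an interval tree.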

Output: list of peak locations (start & stop) and p-values
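Because thousands of peaks are tested at once, per-peak p-values are often converted to a false discovery rate. A minimal sketch of the Benjamini-Hochberg adjustment follows; choosing this particular procedure is an assumption, since the slides do not say which FDR method a given program uses, and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values

peak_p_values = [0.01, 0.04, 0.03, 0.5]  # hypothetical per-peak p-values
q_values = benjamini_hochberg(peak_p_values)
print(q_values)
```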

The challenge is that peaks do not show precisely where the protein binds.

Different programs vary in the width of the identified peaks

Can apply the same type of motif finding to a set of IP’d regions to identify motifs shared by the regions.
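As a very rough stand-in for a real motif finder, the sketch below counts how many regions contain each k-mer and reports the most widely shared one. The sequences and the choice of k are hypothetical, and exact k-mer matching ignores degenerate positions and reverse complements, which proper motif-finding tools handle.

```python
from collections import Counter

def shared_kmers(sequences, k):
    """For each k-mer, count how many sequences contain it at least once."""
    support = Counter()
    for seq in sequences:
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        support.update(kmers)  # a set, so each sequence counts once per k-mer
    return support

# Hypothetical sequences under three IP'd regions.
regions = ["TTGACGTCAAT", "CCTGACGTCAGG", "ATGACGTCATT"]
support = shared_kmers(regions, 8)
best_kmer, n_regions = support.most_common(1)[0]
print(best_kmer, n_regions)
```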

Other approaches

ChIP-exo

DNaseI hypersensitive sites

Micrococcal nuclease sensitive sites (nucleosome mapping)

What can you do with the data?

1. Motif finding: look for motif shared in bound regions (e.g. XX)

2. Association of bound loci with neighboring genes, elements
   - functional enrichment of neighboring genes
   - other non-random association among neighboring genes, e.g. shared expression profiles, expression dependency on factor in question

3. Locus distribution across the genome
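For item 2, a common first step is to assign each peak summit to its nearest gene. The sketch below does this with a binary search over sorted gene start positions on a single chromosome; the gene names and coordinates are hypothetical, and strand and chromosome handling are left out for brevity.

```python
import bisect

def nearest_gene(summit, genes):
    """Nearest gene to a summit.

    genes: non-empty list of (start_position, name), sorted by position.
    Returns (gene name, distance in bp).
    """
    positions = [pos for pos, _ in genes]
    i = bisect.bisect_left(positions, summit)
    # Only the genes immediately left and right of the summit can be nearest.
    candidates = genes[max(0, i - 1):i + 1]
    pos, name = min(candidates, key=lambda g: abs(g[0] - summit))
    return name, abs(pos - summit)

genes = [(1000, "geneA"), (5000, "geneB"), (9000, "geneC")]  # hypothetical
for summit in (100, 4200, 20000):
    print(summit, nearest_gene(summit, genes))
```

The resulting gene list can then be passed to a functional enrichment test or compared against expression data.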
