ChIP-seq Methods & Analysis


Gavin Schnitzler, Asst. Prof. Medicine TUSM, Investigator at MCRI, TMC


ChIP-seq COURSE OUTLINE

• Day 1: ChIP techniques, library production, UCSC browser tracks

• Day 2: QC on reads, mapping binding site peaks, examining read density maps.

• Day 3: Analyzing peaks in relation to genomic features, etc.

• Day 4: Analyzing peaks for transcription factor binding site consensus sequences.

• Day 5: Variants & advanced approaches.

DAY 4 OUTLINE

• Position weight matrices to find transcription factor binding sites (TFBSes)

• TFBS enrichment in peaks using CentDist

• TFBS enrichment using Storm in UNIX

• Mining Storm results

• Disambiguating similar matrices w/ STAMP

DAY 3 REMNANT

• Analyzing overlaps between peaks & regulated genes in UNIX

How can we test the significance of binding site association w/ regulated genes?

If you haven’t already, go to the cluster & move bed and txt files to your /cluster/shared/userID/chip folder (mkdir chip & cd chip if you don’t have this folder yet):

cp /cluster/tufts/cbi*/Ch*/Sam*/ER*beds/*.* .

The .txt files list the transcription start sites (TSSes) of genes that were up- or down-regulated by estrogen in aorta or liver (by RNA-seq analysis).

Overlaps between peaks & genes

Take a look at one of them using head [name].txt

chr6 73171625 - Dnahc6
chr2 25356026 - C8g
chr6 65540391 + Tnip3
…

The file format is (tab-delimited) chromosome, TSS, transcription direction (+=sense) & geneID.

You can get all this info easily from the UCSC browser for individual genes (by hand)… or you can get this information for all genes & extract what you want for your gene set of interest. Check out the RNA-seq module for info on making & handling .gtf files.

Overlaps between peaks & genes 2

The overlap program can recognize this type of file & will test for overlaps between ChIP-seq peaks and regions around the listed TSSes (default +/-1000 bp). You can also change this range by specifying a –range variable.

Find the overlaps between 10-kb regions around TSSes of genes up- or downregulated in each tissue & the corresponding ER binding site data using variations on:

bsub perl /cluster/home/g/s/gschni01/perl*/overlap_1.3.pl Ao_up_TSS.txt AoE_all.bed –outfile Ao_up_v_AoE.overlap

(these commands are in /cluster/tufts/cbi*/Ch*/Sam*/Fin*/workflow2.txt)

The number of overlaps (hits), the number of genes (tests) and the number of overlaps expected by chance divided by the number of genes (background frequency) provide all the information you need for binomial tests. Note these numbers down for each comparison.

Accessing the R statistical language

On the PCs in this room: Start->programs->R

To get R for your PC (free): http://cran.r-project.org/

To get RStudio (allows for easier management of R projects): http://www.rstudio.com/

On the cluster type: module load R

Then: bsub -Ip -q int_public6 R

To exit use the R command q()

For more info on using R & Unix see the UNIX resources & R resources at: http://sites.tufts.edu/cbi/resources/rna-seq-course/

Binomial tests in R

Use the R command: binom.test(hits, tests, bkg_freq) to assess the significance of the overlaps you see

For Ao_down_TSS.txt vs. AoE.bed: binom.test(2, 118, 1.03/118)

Which comparisons show significant enrichment? Do any show anti-enrichment?
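For anyone working outside R, the enrichment side of this test can be sketched in plain Python. This is only an illustration, not the course code: the function name binom_upper_tail is made up here, the Ao_down example is read as 2 hits in 118 tests, and the sketch computes just the one-sided tail P(X >= hits), whereas R's binom.test is two-sided unless you set alternative="greater".

```python
from math import comb

def binom_upper_tail(hits, tests, bkg_freq):
    """P(X >= hits) for X ~ Binomial(tests, bkg_freq): one-sided enrichment p-value."""
    # Probability of seeing fewer than `hits` overlaps by chance
    below = sum(
        comb(tests, k) * bkg_freq**k * (1 - bkg_freq) ** (tests - k)
        for k in range(hits)
    )
    return 1.0 - below

# Ao_down example: 2 overlaps among 118 genes, 1.03 overlaps expected by chance
p_enriched = binom_upper_tail(2, 118, 1.03 / 118)
print(f"one-sided enrichment p = {p_enriched:.3f}")
```

A p-value above 0.05 here would mean the observed overlap is not distinguishable from chance; anti-enrichment would need the lower tail instead.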







What is a PWM?

Transcription factor binding sites (TFBSes) are usually slightly variable in their sequences.

A position frequency matrix (PFM) records how often each base is seen at each index position of the motif.

This is built from sequences known to bind the TF (e.g. 46 sequences for the PFM below).

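Pos  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
A   18  8  5  4  1 29  7  7  7  0  1 39  1  1  6
C    8  3  3  9 33  4 21 15 14  0  0  1 43 39 18
G   13 31 34  9  8 10 11 15 19  4 44  3  0  1  6
T    7  4  4 24  4  3  7  9  6 42  1  3  2  5 16
Con  N  G  G  T  C  A  N  N  N  T  G  A  C  C  N

(positions run 5’ to 3’)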

Adapted from presentation by Victor Jin, Department of Biomedical Informatics, The Ohio State University

PFM->normalized PFM->PWM

1. acggcagggTGACCc

2. aGGGCAtcgTGACCc

3. cGGTCGccaGGACCt

4. tGGTCAggcTGGTCt

5. aGGTGGcccTGACCc

6. cTGTCCctcTGACCc

7. aGGCTAcgaTGACGt

...

41. cagggagtgTGACCc

42. gagcatgggTGACCa

43. aGGTCAtaacgattt

44. gGAACAgttTGACCc

45. cGGTGAcctTGACCc

46. gGGGCAaagTGACTg


Given N sequence fragments of fixed length, one can assemble a position frequency matrix (number of times a particular nucleotide appears at a given position).

A normalized PFM, in which each column adds up to a total of one, is a matrix of probabilities for observing each nucleotide at each position (e.g. divide by 46).

Position frequency matrix (PFM)

(also known as raw count matrix)

The normalized PFM is converted to log-scale for efficient computational analysis. To eliminate null values before log-conversion, and to correct for small samples of binding sites, a sampling correction, known as pseudocounts, is added to each cell of the PFM.

Position weight matrix (PWM) (also known as position-specific scoring matrix)

Binding site data


Position Weight Matrix for ERE

Converting a PFM into a PWM

For each matrix element do:

$$w(b,i) = \log_2 \frac{p(b,i)}{p(b)}, \qquad p(b,i) = \frac{f_{b,i} + \sqrt{N}/4}{N + \sqrt{N}}$$

$f_{b,i}$ – raw count (PFM matrix element) of nucleotide b in column i

N – number of sequences used to create PFM (= column sum)

$\sqrt{N}/4$ and $\sqrt{N}$ – pseudocounts (correction for small sample size)

p(b) – background frequency of nucleotide b

PFM, $f_{b,i}$:

A 18  8  5  4  1 29  7  7  7  0  1 39  1  1  6
C  8  3  3  9 33  4 21 15 14  0  0  1 43 39 18
G 13 31 34  9  8 10 11 15 19  4 44  3  0  1  6
T  7  4  4 24  4  3  7  9  6 42  1  3  2  5 16

PWM, $w(b,i)$:

A  0.58 -0.44 -0.98 -1.21 -2.29  1.22 -0.60 -0.60 -0.60 -2.96 -2.29  1.62 -2.29 -2.29 -0.72
C -0.44 -1.49 -1.49 -0.30  1.39 -1.21  0.78  0.34  0.25 -2.96 -2.96 -2.29  1.76  1.62  0.46
G  0.16  1.31  1.44 -0.30 -0.44 -0.17 -0.06  0.34  0.65 -1.21  1.79 -1.49 -2.96 -2.29 -0.64
T -0.60 -1.21 -1.21  0.96 -1.21 -1.49 -0.60 -0.30 -0.78  1.73 -2.29 -1.49 -1.84 -0.98  0.23
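The PFM-to-PWM conversion can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: a uniform background p(b) = 0.25 and a pseudocount of sqrt(N)/4 per cell (sqrt(N) per column), a form that reproduces slide entries such as 0.58, -2.96 and 1.79. The function name pfm_to_pwm is illustrative and is not part of any course script.

```python
from math import log2, sqrt

# ERE position frequency matrix from the slide (46 sequences, 15 positions)
PFM = {
    "A": [18, 8, 5, 4, 1, 29, 7, 7, 7, 0, 1, 39, 1, 1, 6],
    "C": [8, 3, 3, 9, 33, 4, 21, 15, 14, 0, 0, 1, 43, 39, 18],
    "G": [13, 31, 34, 9, 8, 10, 11, 15, 19, 4, 44, 3, 0, 1, 6],
    "T": [7, 4, 4, 24, 4, 3, 7, 9, 6, 42, 1, 3, 2, 5, 16],
}

def pfm_to_pwm(pfm, background=0.25):
    """Convert raw counts to log2-odds weights using sqrt(N) pseudocounts (assumed form)."""
    width = len(pfm["A"])
    pwm = {base: [] for base in pfm}
    for i in range(width):
        n = sum(pfm[base][i] for base in pfm)  # column sum = number of sequences
        for base in pfm:
            # pseudocount-corrected probability of this base at position i
            p_bi = (pfm[base][i] + sqrt(n) * background) / (n + sqrt(n))
            pwm[base].append(log2(p_bi / background))
    return pwm

pwm = pfm_to_pwm(PFM)
print([round(w, 2) for w in pwm["A"][:6]])
```

Because of the pseudocounts, a count of 0 becomes a strongly negative weight rather than minus infinity, so a single mismatch penalizes a site without ruling it out.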


Scoring putative EREs by scanning the promoter w/ PWM

A  0.58 -0.44 -0.98 -1.21 -2.29  1.22 -0.60 -0.60 -0.60 -2.96 -2.29  1.62 -2.29 -2.29 -0.72
C -0.44 -1.49 -1.49 -0.30  1.39 -1.21  0.78  0.34  0.25 -2.96 -2.96 -2.29  1.76  1.62  0.46
G  0.16  1.31  1.44 -0.30 -0.44 -0.17 -0.06  0.34  0.65 -1.21  1.79 -1.49 -2.96 -2.29 -0.64
T -0.60 -1.21 -1.21  0.96 -1.21 -1.49 -0.60 -0.30 -0.78  1.73 -2.29 -1.49 -1.84 -0.98  0.23

Site: G G G T C A G C A T G G C C A

Absolute score of the site:

$$S = \sum_{i=1}^{m} w(b,i) = 11.57$$

Pos     1     2     3     4     5     6     7     8     9    10    11    12    13    14   Row Sum
Max  0.58  1.31  1.44  0.96  1.39  1.22  0.78  0.34  0.65  1.73  1.79  1.62  1.76  1.62    17.20
Min -0.60 -1.49 -1.49 -1.21 -2.29 -1.49 -0.60 -0.60 -0.78 -2.96 -2.96 -2.29 -2.96 -2.29   -24.02

(The row sums and the absolute score as shown cover positions 1-14.)

$$relative\_score = \frac{Absolute\_score - Minimum\_score}{Maximum\_score - Minimum\_score} = \frac{11.57 + 24.02}{17.20 + 24.02} = 0.86$$

This is also called “functional depth”
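The absolute and relative scores can be checked with a short Python sketch. The PWM values are typed in from the slide; the function name score_site is illustrative, and the choice to sum over the first 14 positions is an assumption made only because that is what reproduces the slide's totals (11.57, 17.20 and -24.02).

```python
# ERE PWM as printed on the slide (15 positions per row)
PWM = {
    "A": [0.58, -0.44, -0.98, -1.21, -2.29, 1.22, -0.60, -0.60, -0.60,
          -2.96, -2.29, 1.62, -2.29, -2.29, -0.72],
    "C": [-0.44, -1.49, -1.49, -0.30, 1.39, -1.21, 0.78, 0.34, 0.25,
          -2.96, -2.96, -2.29, 1.76, 1.62, 0.46],
    "G": [0.16, 1.31, 1.44, -0.30, -0.44, -0.17, -0.06, 0.34, 0.65,
          -1.21, 1.79, -1.49, -2.96, -2.29, -0.64],
    "T": [-0.60, -1.21, -1.21, 0.96, -1.21, -1.49, -0.60, -0.30, -0.78,
          1.73, -2.29, -1.49, -1.84, -0.98, 0.23],
}

def score_site(pwm, site, width):
    """Return (absolute score, relative score) over the first `width` positions."""
    absolute = sum(pwm[site[i]][i] for i in range(width))
    best = sum(max(pwm[b][i] for b in "ACGT") for i in range(width))
    worst = sum(min(pwm[b][i] for b in "ACGT") for i in range(width))
    relative = (absolute - worst) / (best - worst)  # functional depth
    return absolute, relative

# Assumption: score positions 1-14 only, to match the slide's row sums
abs_score, rel_score = score_site(PWM, "GGGTCAGCATGGCCA", width=14)
print(f"absolute = {abs_score:.2f}, relative = {rel_score:.2f}")
```

Scanning a promoter amounts to repeating this calculation for every window (and its reverse complement) and keeping the windows above a chosen relative score.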


Estimating p. values for a match to the matrix

G G G T C A G C A T G G C C A (scored against the same ERE PWM as above)

This sequence had a functional depth (f) of 0.86

The summed probabilities of all sequences with f >= .86 give the p. value for this sequence, i.e. the chance that f >= .86 would be achieved by a randomized DNA sequence.

Short matrices can reach f > .9 but still have high p. values – thus f is the best measure for short seqs.

Long matrices can have very low p. values but still have f< .9 – thus p.value is the best measure for long seqs.
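One way to get such a p. value without enumerating every possible sequence is to build the score distribution column by column. The sketch below is an illustration only, not necessarily how Storm or CentDist do it: it assumes equal base frequencies (0.25 each), rounds weights to hundredths so scores can be tracked as integers, and uses positions 1-14 of the slide's ERE PWM. The function names are made up for this example.

```python
from collections import defaultdict

# Positions 1-14 of the ERE PWM from the slide
PWM14 = {
    "A": [0.58, -0.44, -0.98, -1.21, -2.29, 1.22, -0.60, -0.60, -0.60,
          -2.96, -2.29, 1.62, -2.29, -2.29],
    "C": [-0.44, -1.49, -1.49, -0.30, 1.39, -1.21, 0.78, 0.34, 0.25,
          -2.96, -2.96, -2.29, 1.76, 1.62],
    "G": [0.16, 1.31, 1.44, -0.30, -0.44, -0.17, -0.06, 0.34, 0.65,
          -1.21, 1.79, -1.49, -2.96, -2.29],
    "T": [-0.60, -1.21, -1.21, 0.96, -1.21, -1.49, -0.60, -0.30, -0.78,
          1.73, -2.29, -1.49, -1.84, -0.98],
}

def score_distribution(pwm, background=0.25, scale=100):
    """Map each reachable score (in 1/scale units) to its probability on random DNA."""
    dist = {0: 1.0}
    for i in range(len(pwm["A"])):
        nxt = defaultdict(float)
        for score, prob in dist.items():
            for base in "ACGT":
                nxt[score + round(pwm[base][i] * scale)] += prob * background
        dist = dict(nxt)
    return dist

def p_value(dist, abs_score, scale=100):
    """Chance that a random sequence scores at least abs_score."""
    cutoff = round(abs_score * scale)
    return sum(prob for score, prob in dist.items() if score >= cutoff)

dist = score_distribution(PWM14)
p_site = p_value(dist, 11.57)  # absolute score of GGGTCAGCATGGCC (f = 0.86)
print(f"p = {p_site:.2e}")
```

Since f is a monotonic rescaling of the absolute score, thresholding on score >= 11.57 is the same as thresholding on f >= 0.86 for this matrix.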







Preparing for PWM search

Launch WinSCP (Start->programs->WinSCP)

Navigate to: /cluster/tufts/cbicourse/ChIPseq/Sample_NGS_data/Final_output_files

Pull over the “ipvinput19_peaks.xls” file to the PC. (This is the MACS output file that we generated yesterday.)

Open it in Excel

Making .bed file w/ +/-200 bp around peak summit (where we expect TFBS enrichment will be greatest)

chr    start     end       length  summit  tags  -10*LOG10(pvalue)  fold_enrichment  FDR(%)
chr7   74606586  74607824  1239    571     181   3132.99            34.87            0
chr10  94601968  94603119  1152    541     174   3135.11            34.76            0
chr2   1.67E+08  1.67E+08  809     377     18    100.44             4.7              0.06
chr12  34760179  34761206  1028    496     22    101.03             4.17             0.06
chrX   48371756  48372420  665     437     18    100.29             4.12             0.06

New columns for the .bed file:

chr: =same row, chr column
start: =start col+summit-200
end: =start col+summit+200

• Copy these 3 columns (without any header row).
• In WinSCP click on any file on the PC, then on files->new->file & provide a name (“LiE_chr19_400bp.bed”) to edit a new simple text file.
• Paste, save & close.

Making a file of control .bed regions

Next to the three peak-center columns (“peak ctrs.”: chr, start, end), add three “control regions” columns:

chr: =peaks:chr
start: =peaks:start-5000
end: =peaks:end-5000

• 5000 bp is far enough away to not be part of an enhancer composed of the ER binding site... but is close enough to likely be in the same general chromatin territory (e.g. accessible euchromatin vs. inaccessible heterochromatin)
• Copy these columns & make a “CTRL_chr19_400bp.bed” file with WinSCP
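The same two .bed files can also be produced without Excel. The Python sketch below follows the spreadsheet formulas (summit +/-200 bp, and the same window shifted 5000 bp upstream for the control). It is an illustration only: the function name beds_from_macs is made up, and it assumes tab-delimited MACS lines with integer coordinates as in the original peaks file, not Excel-reformatted values like 1.67E+08.

```python
def beds_from_macs(rows, flank=200, offset=5000):
    """rows: tab-delimited MACS peak lines (chr, start, end, length, summit, ...)."""
    peak_bed, ctrl_bed = [], []
    for line in rows:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 5 or fields[0] == "chr":
            continue  # skip comment lines, blank lines and the header row
        chrom, start, summit = fields[0], int(fields[1]), int(fields[4])
        center = start + summit  # summit is given as an offset from the peak start
        peak = (chrom, center - flank, center + flank)
        # control window: same size, shifted `offset` bp toward lower coordinates
        ctrl = (chrom, peak[1] - offset, peak[2] - offset)
        peak_bed.append(peak)
        ctrl_bed.append(ctrl)
    return peak_bed, ctrl_bed

example = [
    "chr\tstart\tend\tlength\tsummit\ttags",
    "chr7\t74606586\t74607824\t1239\t571\t181",
    "chr10\t94601968\t94603119\t1152\t541\t174",
]
peaks, controls = beds_from_macs(example)
for chrom, start, end in peaks + controls:
    print(f"{chrom}\t{start}\t{end}")
```

Peaks closer than 5000 bp to a chromosome start would give negative control coordinates; the sketch does not handle that case.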

CentDist

A TFBS enrichment program designed for ChIP-seq data

Go to: http://biogpu.ddns.comp.nus.edu.sg/~chipseq/webseqtools2/TASKS/Motif_Enrichment/submit.php?email=guest

…or (more simply) just google centdist and click on the first link (should end in /centdist/)

Assumes that TFBS-matrix hits will be most highly enriched at the centers of ChIP-seq peaks.

Adds the PWM score to a “peakiness” score (i.e. how much more enriched the TF site is in the center of the peak) to give the final p. val.

• Good enrichment, good shape (best p.)

• Good enrichment, OK shape

• Good enrichment, poor shape (higher p. val.)


Run CentDist

Give centdist a name for your run

Upload your +/-200 bp .bed file (CentDist does not need a separate background file, instead using flanking sequences at a case-specific optimized distance as background)

Check “Jaspar”, “vertebrate” & set max-co-motif distance to 3000

Then click Submit Job

On the new window that opens click “turn on autorefresh” so you will be notified when the job ends

Jaspar vs. Transfac

Jaspar is a freely-available set of TFBS matrices that can be downloaded from jaspar.genereg.net

Transfac is a commercial product; you need to pay for the latest release. A version of Transfac (from ~2006) is available on the cluster (e.g. /cluster/home/g/s/gschni01/vertebrates.mat)

Which to use? Both, ideally, but beware that programs like CentDist will give you results from Transfac matrices – and you won’t be able to find out details of what they are.

CentDist Results

View by factors, put in max number & hit go.

• P. Values (based on Score composed of Z0 (enrichment) & Z1 (peakiness))
• Distribution graph
• Weblogo representation of Jaspar matrix

Shows information content at each position. A,G,C&T 25% each-> 0 bits, only 1 base 100%->2 bits. Bases most highly over-represented relative to chance are largest.
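The bit values in a logo follow from the base frequencies at each position. Here is a minimal Python sketch of the standard calculation (2 bits minus the Shannon entropy of the column); the function name is illustrative, and logo tools may apply small-sample corrections that this sketch omits.

```python
from math import log2

def information_content(counts):
    """Information content (bits) of one motif column given A, C, G, T counts."""
    total = sum(counts)
    # Shannon entropy of the column; zero counts contribute nothing
    entropy = -sum((c / total) * log2(c / total) for c in counts if c > 0)
    return 2.0 - entropy

print(information_content([10, 10, 10, 10]))  # all four bases equally likely
print(information_content([46, 0, 0, 0]))     # one base only
print(information_content([1, 0, 44, 1]))     # position 11 of the ERE PFM (A, C, G, T)
```

In the logo, each letter's height is its frequency times the column's information content, which is why the over-represented bases are drawn largest.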

How many enriched TF sites are there really?

Matrix hits are enriched for many factors; are all of them real?

Maybe not: notice how similar the top sites are to each other and to estrogen response elements (EREs) such as V$jaspar_ESR1

V$jaspar_HNF4A

V$jaspar_NR2F1

V$jaspar_ESR1

Downloading CentDist Results

Click on download as text & save the file somewhere you remember.

Open it in Excel. It contains basic summary statistics & a few other things.

Many questions unanswered:
- What is the fold enrichment over background?
- What are the peaks with the greatest enrichment for each factor?
- What are the TFBS hit locations in each peak?
- Which are the real enriched TFBSes & which are just showing up by homology?
- Do certain factors group together into the same peaks?







Storm

Storm is a straightforward PWM scanning program that runs in UNIX.

Its greatest advantage is that it gives you all of the unprocessed output data, which allows you to do much more powerful analyses!

It also allows us to specify thresholds for matches to the matrix – allowing us to use functional depth as well as p. value
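What a PWM scanner does with a functional-depth threshold can be sketched as follows. This is a conceptual illustration in Python, not Storm's actual code or output format: the toy 3-bp matrix (preferring GTC), the function name scan and the tuple layout of the hits are all assumptions made for the example, and it expects upper-case A/C/G/T only.

```python
# Toy 3-position PWM preferring the motif GTC (illustrative values, not from Jaspar)
TOY_PWM = {
    "A": [-1, -1, -1],
    "C": [-1, -1, 1],
    "G": [1, -1, -1],
    "T": [-1, 1, -1],
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def scan(seq, pwm, min_depth):
    """Return (start, strand, functional depth) for every window passing min_depth."""
    width = len(pwm["A"])
    best = sum(max(pwm[b][i] for b in "ACGT") for i in range(width))
    worst = sum(min(pwm[b][i] for b in "ACGT") for i in range(width))
    hits = []
    # Scan the sequence as given, then its reverse complement
    for strand, s in (("+", seq), ("-", seq.translate(COMPLEMENT)[::-1])):
        for j in range(len(s) - width + 1):
            score = sum(pwm[s[j + i]][i] for i in range(width))
            depth = (score - worst) / (best - worst)
            if depth >= min_depth:
                # report minus-strand hits in forward-strand coordinates
                start = j if strand == "+" else len(s) - j - width
                hits.append((start, strand, depth))
    return hits

print(scan("AGTCA", TOY_PWM, min_depth=0.9))
print(scan("GACAA", TOY_PWM, min_depth=0.9))
```

Keeping every hit with its position and strand, rather than only a summary statistic, is what makes the downstream per-peak analyses possible.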

Getting DNA for Storm

To run storm, we first need to get the actual DNA sequence for centers of our peaks (where we expect the greatest enrichment for TFBSes to be).

Go to the UCSC genome browser at: genome.ucsc.edu

Under genome choose mouse mm9

Then choose add custom track & upload your +/-200 bp .bed file.

Click on Tools->Table Browser
Select your new track
Select output format “sequence”
Provide a file name “LiE_chr19_400bp.fa” & hit “get output”
Hit ‘get output’ again on the next page

Now do the same for your “CTRL_chr19_400bp.bed” file.

.fa denotes a simple ‘fasta’ format sequence file.

Cleaning up our .fa files

Use WinSCP to move these .fa files and their corresponding .bed files to your …/chip directory.

All of the commands below are in the file /cluster/tufts/cbi*/Ch*/Sam*/Final*/workflow2.txt… cat this to your screen, to copy & paste commands.

Each entry in the .fa file has a header with special characters in it that confuse storm. To fix this, go to your …/chip directory in Putty & do:

perl /cluster/home/g/s/gschni01/perl*/Lax_convert.pl LiE_chr19_400bp.fa > LiE_chr19_400bp_converted.fa

To see what has changed use:

head *.fa

Do the same for your “CTRL_chr19_400bp.fa” file.

Running storm

First set some path variables:

export CREAD=/cluster/home/g/s/gschni01/cread-0.84
export PATH=$PATH:$CREAD/bin

Then run storm for your IP .fa file:

bsub -oo LiE_chr19_400bp_p.storminfo storm -p -t 0.0005 -s LiE_chr19_400bp_converted.fa -o LiE_chr19_400bp_p.storm /cluster/home/g/s/gschni01/Jaspar_non_redundant_vertebrate.mat

And for your control .fa file:

bsub -oo CTRL_chr19_400bp_p.storminfo storm -p -t 0.0005 -s CTRL_chr19_400bp_converted.fa -o CTRL_chr19_400bp_p.storm /cluster/home/g/s/gschni01/Jaspar_non_redundant_vertebrate.mat

Use more to look at one of your .storm output files (space for next page, ctrl c to exit)







Interpreting Storm data

Run the dme_parse perl program to gather and tabulate your storm data:

bsub -oo LiE_chr19_400bp_p.dmeparseinfo perl /cluster/home/g/s/gschni01/perl*/dme_parse5.4.pl LiE_chr19_400bp_p.storm LiE_chr19_400bp.bed peaks

bsub -oo CTRL_chr19_400bp_p.dmeparseinfo perl /cluster/home/g/s/gschni01/perl*/dme_parse5.4.pl CTRL_chr19_400bp_p.storm CTRL_chr19_400bp.bed peaks

dme_parse outputs

…storm.bed file: Has UCSC browser tracks for each TFBS matrix with locations of all hits in bed format.

…storm.map file: Lists all input matrices followed by the PFM derived from all of the hits to this matrix from our data.

…storm.info file: Summarizes a lot of information about matrix hits

Move the .info files to your PC with WinSCP & open them in Excel. The file provides summary statistics for # of peaks with 0, 1, 2, etc. hits, total hits, and normalized hits per 50 bp vs. distance from peak center.

dme_parse outputs

Using the .info file to plot relative density of TFBS hits in aorta IP, liver IP & offset controls:

[Figure: three plots of average matches/kb (y-axis) vs. BP from peak apex (x-axis, -225 to 225), one each for the EBF, ER and MYC matrices; series in each plot: AoE, BKG4_AoE, LiE and BKG4_LiE]

dme_parse outputs

Using the .info files to structure binomial tests

Hits = # of matches to each matrix in IP data

Tests = # of times storm tested for a match = (# of peaks) * (400 bp length of peaks - matrix length)

Background freq = matches to offset control peak data / # tests (same as for IP)

Using the .info files to determine fractional enrichment

Hit frequency in IP data/Hit frequency in offset control
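Putting these definitions together, the test and the enrichment ratio could be computed as in the Python sketch below. All numbers are hypothetical (60 IP hits, 30 control hits, 100 peaks, a 15-bp matrix), the function names are made up for the example, and the binomial tail is one-sided (enrichment only) and summed in log space so that large test counts do not overflow.

```python
from math import exp, lgamma, log, log1p

def binom_upper_tail(hits, tests, p):
    """P(X >= hits) for X ~ Binomial(tests, p), with 0 < p < 1."""
    def log_pmf(k):
        return (lgamma(tests + 1) - lgamma(k + 1) - lgamma(tests - k + 1)
                + k * log(p) + (tests - k) * log1p(-p))
    return min(1.0, sum(exp(log_pmf(k)) for k in range(hits, tests + 1)))

def motif_enrichment(ip_hits, ctrl_hits, n_peaks, matrix_len, peak_len=400):
    """Return (tests, fold enrichment, one-sided p) for one matrix."""
    tests = n_peaks * (peak_len - matrix_len)  # as defined on the slide
    bkg_freq = ctrl_hits / tests               # same number of tests as the IP
    fold = (ip_hits / tests) / bkg_freq
    return tests, fold, binom_upper_tail(ip_hits, tests, bkg_freq)

# Hypothetical counts, for illustration only
tests, fold, p = motif_enrichment(ip_hits=60, ctrl_hits=30, n_peaks=100, matrix_len=15)
print(f"tests = {tests}, fold enrichment = {fold:.2f}, p = {p:.2e}")
```

A matrix with zero control hits would need separate handling, since both the background frequency and the fold enrichment are undefined in that case.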

dme_parse outputs

.freqs file: Number of hits to each matrix for each peak

Distribution of hits per peak in the offset background establishes the # of hits a peak needs to be enriched over background at p.<=.05

• Can also look for significant overlaps between the peaks with enrichment for 2 different factors - to identify cooperative versus antagonistic interactions.

• Allows identification of sites at which a given TFBS may be functionally targeted (candidates for further testing)
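The per-peak cutoff can be sketched as an empirical threshold on the control distribution. The Python below is an illustration under stated assumptions: the function name and the example counts are hypothetical, and the real .freqs layout is not parsed here, only a plain list of hits-per-peak values for one matrix.

```python
def hits_threshold(ctrl_hits_per_peak, alpha=0.05):
    """Smallest k such that at most `alpha` of control peaks have >= k hits."""
    n = len(ctrl_hits_per_peak)
    for k in range(max(ctrl_hits_per_peak) + 2):
        frac = sum(1 for c in ctrl_hits_per_peak if c >= k) / n
        if frac <= alpha:
            return k

# Hypothetical control data: 90 peaks with 0 hits, 7 with 1 hit, 3 with 2 hits
ctrl = [0] * 90 + [1] * 7 + [2] * 3
cutoff = hits_threshold(ctrl)

# Hypothetical IP peaks: keep those reaching the cutoff
ip = {"peak_1": 0, "peak_2": 2, "peak_3": 3}
enriched = [name for name, hits in ip.items() if hits >= cutoff]
print(cutoff, enriched)
```

The lists of enriched peaks for two different matrices can then be compared for overlap, as in the first bullet above.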

Details on how to do these analyses are in ChIPseq_analysis_methods_2013_02_11 on the cbi website.







STAMP

STAMP lets you compare matrices for evolutionary similarities to each other.

Go to www.benoslab.pitt.edu/stamp/index.php

Getting STAMP to help classify our CentDist top hits

Go to your CentDist output.

Create a new column in which you change the names of the factors to fit with the names in the Jaspar_non_redundant_vertebrate.mat file you used for Storm.

=substitute(b2,"V$jaspar_","Jaspar$"), & propagate down

Select all matrix names w/ p.<.05 & paste them into a new file called “select_mats.txt” in your /chip folder on the cluster using WinSCP.

perl /cluster/home/g/s/gschni01/perl*/MatrixSelect.pl /cluster/home/g/s/gschni01/Jaspar_non_redundant_vertebrate.mat select_mats.txt select_mats.mat

Now, open the select_mats.mat file with WinSCP, copy everything & paste it into STAMP.

Keep all the STAMP defaults & hit submit.

STAMP Tree

This indicates that enrichment of PPARG, RORA, NR4A2 could be just because of their similarity to EREs.

Other enriched sites, such as SP1, FoxA2 & Myf, fall in separate homology classes.

To further distinguish which one is real, you can use the enrichment ratios & p. values (the “real” TFBS should be best in both of these).
