ChIP-seq Methods & Analysis
Transcript of ChIP-seq Methods & Analysis
Gavin Schnitzler, Asst. Prof. Medicine TUSM, Investigator at MCRI, TMC
ChIP-seq COURSE OUTLINE
• Day 1: ChIP techniques, library production, UCSC browser tracks
• Day 2: QC on reads, mapping binding site peaks, examining read density maps.
• Day 3: Analyzing peaks in relation to genomic features, etc.
• Day 4: Analyzing peaks for transcription factor binding site consensus sequences.
• Day 5: Variants & advanced approaches.
DAY 4 OUTLINE
• Position weight matrices to find transcription factor binding sites (TFBSes)
• TFBS enrichment in peaks using CentDist
• TFBS enrichment using Storm in UNIX
• Mining Storm results
• Disambiguating similar matrices w/ STAMP
DAY 3 REMNANT
• Analyzing overlaps between peak & regulated genes in UNIX
How can we test the significance of binding site association w/ regulated genes?
If you haven’t already, go to the cluster & copy the bed and txt files to your /cluster/shared/userID/chip folder (mkdir chip & cd chip if you don’t have this folder yet):
cp /cluster/tufts/cbi*/Ch*/Sam*/ER*beds/*.* .
The .txt files list the transcription start sites (TSSes) of genes that were up- or down-regulated by estrogen in aorta or liver (by RNA-seq analysis).
Overlaps between peaks & genes
Take a look at one of them using head [name].txt
chr6 73171625 - Dnahc6
chr2 25356026 - C8g
chr6 65540391 + Tnip3
…
The file format is (tab-delimited) chromosome, TSS, transcription direction (+=sense) & geneID.
You can get all this info easily from the UCSC browser for individual genes (by hand)… or you can get this information for all genes & extract what you want for your gene set of interest. Check out the RNA-seq module for info on making & handling .gtf files.
Overlaps between peaks & genes 2
The overlap program can recognize this type of file & will test for overlaps between ChIP-seq peaks and regions around the listed TSSes (default +/-1000 bp).
You can also change this range by specifying a –range variable.
Find the overlaps between 10-kb regions around TSSes of genes up- or downregulated in each tissue & the corresponding ER binding site data using variations on:
bsub perl /cluster/home/g/s/gschni01/perl*/overlap_1.3.pl Ao_up_TSS.txt AoE_all.bed –outfile Ao_up_v_AoE.overlap
(these commands are in /cluster/tufts/cbi*/Ch*/Sam*/Fin*/workflow2.txt)
The number of overlaps (hits), the number of genes (tests) and the number of overlaps expected by chance divided by the number of genes (background frequency) provide all the information you need for binomial tests. Note these numbers down for each comparison.
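To make the idea of a TSS/peak overlap test concrete, here is a minimal Python sketch. It is not the course's overlap_1.3.pl (whose internals are not shown here); the function names and the example peaks are made up for illustration, and only the three TSS lines come from the file listing above.

```python
# Sketch only: NOT overlap_1.3.pl, just the idea of testing TSS windows against peaks.

def parse_tss(lines):
    """Parse tab-delimited lines: chromosome, TSS, strand, geneID."""
    records = []
    for line in lines:
        chrom, pos, strand, gene = line.rstrip("\n").split("\t")
        records.append((chrom, int(pos), strand, gene))
    return records


def count_hits(tss_records, peaks, flank=1000):
    """Count TSSes whose +/-flank window overlaps at least one peak.

    peaks is a list of (chrom, start, end) tuples, as in a .bed file.
    """
    hits = 0
    for chrom, pos, _strand, _gene in tss_records:
        lo, hi = pos - flank, pos + flank
        if any(c == chrom and start <= hi and end >= lo
               for c, start, end in peaks):
            hits += 1
    return hits


tss = parse_tss([
    "chr6\t73171625\t-\tDnahc6",
    "chr2\t25356026\t-\tC8g",
    "chr6\t65540391\t+\tTnip3",
])
# Hypothetical peaks, invented for this example.
peaks = [("chr6", 73172000, 73172400), ("chr2", 30000000, 30000500)]

print(count_hits(tss, peaks), "of", len(tss), "TSSes have a peak within 1 kb")
```

The hit count and the number of TSSes tested are two of the three numbers the binomial test needs; the background frequency still has to come from the overlap program's own estimate of chance overlaps.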
Accessing the R statistical language
On the PCs in this room: Start->programs->R
To get R for your PC (free): http://cran.r-project.org/
To get RStudio (allows for easier management of R projects): http://www.rstudio.com/
On the cluster type: module load R
Then: bsub -Ip -q int_public6 R
To exit use the R command q()
For more info on using R & Unix see: http://sites.tufts.edu/cbi/resources/rna-seq-course/ (UNIX resources & R resources)
Binomial tests in R
Use the R command: binom.test(hits, tests, bkg_freq) to address the significance of overlaps you see
For Ao_down_TSS.txt vs. AoE.bed (hits = 2, tests = 118): binom.test(2, 118, 1.03/118)
Which comparisons show significant enrichment? Do any show anti-enrichment?
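For anyone without R at hand, the same kind of question can be sketched in plain Python. This is an illustration, not a replacement for binom.test: R's default is a two-sided test, whereas the two hypothetical helper functions below give the one-sided tails (enrichment and anti-enrichment) separately. The example reads the slide's numbers as 2 hits in 118 tests with a background frequency of 1.03/118, which is an interpretation of the figures given.

```python
from math import comb


def binom_upper_tail(hits, tests, bkg_freq):
    """P(X >= hits) for X ~ Binomial(tests, bkg_freq): evidence for enrichment."""
    return sum(comb(tests, k) * bkg_freq**k * (1 - bkg_freq)**(tests - k)
               for k in range(hits, tests + 1))


def binom_lower_tail(hits, tests, bkg_freq):
    """P(X <= hits) for X ~ Binomial(tests, bkg_freq): evidence for anti-enrichment."""
    return sum(comb(tests, k) * bkg_freq**k * (1 - bkg_freq)**(tests - k)
               for k in range(0, hits + 1))


hits, tests, bkg_freq = 2, 118, 1.03 / 118
p_enriched = binom_upper_tail(hits, tests, bkg_freq)
p_depleted = binom_lower_tail(hits, tests, bkg_freq)
print(f"P(X >= {hits}) = {p_enriched:.3f}; P(X <= {hits}) = {p_depleted:.3f}")
```

With about one overlap expected by chance, seeing two is unremarkable, so neither tail comes out small for this comparison.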
What is PWM?
Transcription factor binding sites (TFBSs) are usually slightly variable in their sequences.
A position frequency matrix (PFM) records how often each base is seen at each position of the motif, from which the probability of seeing that base at that position follows.
This is built from sequences known to bind the TF (e.g. 46 sequences for the PFM below).
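Pos 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
A 18 8 5 4 1 29 7 7 7 0 1 39 1 1 6
C 8 3 3 9 33 4 21 15 14 0 0 1 43 39 18
G 13 31 34 9 8 10 11 15 19 4 44 3 0 1 6
T 7 4 4 24 4 3 7 9 6 42 1 3 2 5 16
Con N G G T C A N N N T G A C C N
(read 5’ to 3’)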
Adapted from presentation by Victor Jin, Department of Biomedical Informatics, The Ohio State University
PFM->normalized PFM->PWM
1. acggcagggTGACCc
2. aGGGCAtcgTGACCc
3. cGGTCGccaGGACCt
4. tGGTCAggcTGGTCt
5. aGGTGGcccTGACCc
6. cTGTCCctcTGACCc
7. aGGCTAcgaTGACGt ...
41. cagggagtgTGACCc
42. gagcatgggTGACCa
43. aGGTCAtaacgattt
44. gGAACAgttTGACCc
45. cGGTGAcctTGACCc
46. gGGGCAaagTGACTg
Given N sequence fragments of fixed length, one can assemble a position frequency matrix (number of times a particular nucleotide appears at a given position).
A normalized PFM, in which each column adds up to a total of one, is a matrix of probabilities for observing each nucleotide at each position (e.g. divide by 46).
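Counting and normalizing can be sketched in a few lines of Python. Only the first seven binding-site sequences listed above are used, so the counts here add up to 7 per column rather than 46; the function names are illustrative.

```python
# First seven aligned binding sites from the list above (case is ignored).
SITES = [
    "acggcagggTGACCc",
    "aGGGCAtcgTGACCc",
    "cGGTCGccaGGACCt",
    "tGGTCAggcTGGTCt",
    "aGGTGGcccTGACCc",
    "cTGTCCctcTGACCc",
    "aGGCTAcgaTGACGt",
]


def build_pfm(sites):
    """Count how often each base occurs at each position (raw count matrix)."""
    length = len(sites[0])
    pfm = {base: [0] * length for base in "ACGT"}
    for site in sites:
        for i, base in enumerate(site.upper()):
            pfm[base][i] += 1
    return pfm


def normalize(pfm):
    """Divide every count by N (the column sum) so each column adds up to 1."""
    n = sum(pfm[base][0] for base in "ACGT")
    return {base: [count / n for count in row] for base, row in pfm.items()}


pfm = build_pfm(SITES)
norm = normalize(pfm)
for base in "ACGT":
    print(base, pfm[base])
```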
Position frequency matrix (PFM)
(also known as raw count matrix)
The normalized PFM is converted to log-scale for efficient computational analysis. To eliminate null values before log-conversion, and to correct for small samples of binding sites, a sampling correction, known as pseudocounts, is added to each cell of the PFM.
Position weight matrix (PWM) (also known as position-specific scoring matrix)
Binding site data
Position Weight Matrix for ERE
Converting a PFM into a PWM
For each matrix element do:
$$w(b,i) = \log_2 \frac{p(b,i)}{p(b)}, \qquad p(b,i) = \frac{f_{b,i} + \sqrt{N}/4}{N + \sqrt{N}}$$
$f_{b,i}$ – raw count (PFM matrix element) of nucleotide b in column i
N – number of sequences used to create PFM (= column sum)
$\sqrt{N}/4$ and $\sqrt{N}$ – pseudocounts (correction for small sample size)
p(b) – background frequency of nucleotide b
PFM ($f_{b,i}$):
A 18 8 5 4 1 29 7 7 7 0 1 39 1 1 6
C 8 3 3 9 33 4 21 15 14 0 0 1 43 39 18
G 13 31 34 9 8 10 11 15 19 4 44 3 0 1 6
T 7 4 4 24 4 3 7 9 6 42 1 3 2 5 16
A 0.58 -0.44 -0.98 -1.21 -2.29 1.22 -0.60 -0.60 -0.60 -2.96 -2.29 1.62 -2.29 -2.29 -0.72
C -0.44 -1.49 -1.49 -0.30 1.39 -1.21 0.78 0.34 0.25 -2.96 -2.96 -2.29 1.76 1.62 0.46
G 0.16 1.31 1.44 -0.30 -0.44 -0.17 -0.06 0.34 0.65 -1.21 1.79 -1.49 -2.96 -2.29 -0.64
T -0.60 -1.21 -1.21 0.96 -1.21 -1.49 -0.60 -0.30 -0.78 1.73 -2.29 -1.49 -1.84 -0.98 0.23
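The PFM-to-PWM conversion can be sketched in Python as follows. This assumes a pseudocount of sqrt(N)/4 per cell (sqrt(N) per column) and a uniform background p(b) = 0.25; the background value is an assumption, chosen because it gives 0.58 for the first A cell and -2.96 for a zero count, as in the printed PWM.

```python
from math import log2, sqrt

# Raw count matrix (PFM) for the ERE, 46 sequences, 15 positions.
PFM = {
    "A": [18, 8, 5, 4, 1, 29, 7, 7, 7, 0, 1, 39, 1, 1, 6],
    "C": [8, 3, 3, 9, 33, 4, 21, 15, 14, 0, 0, 1, 43, 39, 18],
    "G": [13, 31, 34, 9, 8, 10, 11, 15, 19, 4, 44, 3, 0, 1, 6],
    "T": [7, 4, 4, 24, 4, 3, 7, 9, 6, 42, 1, 3, 2, 5, 16],
}


def pfm_to_pwm(pfm, background=0.25):
    """Convert raw counts to log2 odds weights with sqrt(N) pseudocounts."""
    n = sum(pfm[base][0] for base in "ACGT")   # N = column sum
    pseudo = sqrt(n)                           # total pseudocount per column
    pwm = {}
    for base in "ACGT":
        pwm[base] = [
            log2(((f + pseudo / 4) / (n + pseudo)) / background)
            for f in pfm[base]
        ]
    return pwm


pwm = pfm_to_pwm(PFM)
print([round(w, 2) for w in pwm["A"][:5]])
```

The pseudocount is what keeps a zero count from turning into log2(0); with N = 46 the lowest possible weight is about -2.96 instead of minus infinity.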
Scoring putative EREs by scanning the promoter w/ PWM
A 0.58 -0.44 -0.98 -1.21 -2.29 1.22 -0.60 -0.60 -0.60 -2.96 -2.29 1.62 -2.29 -2.29 -0.72
C -0.44 -1.49 -1.49 -0.30 1.39 -1.21 0.78 0.34 0.25 -2.96 -2.96 -2.29 1.76 1.62 0.46
G 0.16 1.31 1.44 -0.30 -0.44 -0.17 -0.06 0.34 0.65 -1.21 1.79 -1.49 -2.96 -2.29 -0.64
T -0.60 -1.21 -1.21 0.96 -1.21 -1.49 -0.60 -0.30 -0.78 1.73 -2.29 -1.49 -1.84 -0.98 0.23
Site: G G G T C A G C A T G G C C A
Absolute score of the site:
$$S = \sum_{i=1}^{m} w(b,i) = 11.57$$
Column maxima and minima (last value = Row Sum):
Max 0.58 1.31 1.44 0.96 1.39 1.22 0.78 0.34 0.65 1.73 1.79 1.62 1.76 1.62 17.20
Min -0.60 -1.49 -1.49 -1.21 -2.29 -1.49 -0.60 -0.60 -0.78 -2.96 -2.96 -2.29 -2.96 -2.29 -24.02
$$\text{relative score} = \frac{\text{Absolute score} - \text{Minimum score}}{\text{Maximum score} - \text{Minimum score}} = \frac{11.57 + 24.02}{17.20 + 24.02} = 0.86$$
This is also called “functional depth”
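The scoring arithmetic can be checked with a short Python sketch. One assumption to flag: the Max and Min rows on the slide list 14 positions, and the quoted totals (17.20, -24.02, 11.57) only add up over those 14, so the sketch scores the first 14 columns and the first 14 bases of the site.

```python
# First 14 PWM columns, copied from the matrix above.
PWM = {
    "A": [0.58, -0.44, -0.98, -1.21, -2.29, 1.22, -0.60, -0.60, -0.60, -2.96, -2.29, 1.62, -2.29, -2.29],
    "C": [-0.44, -1.49, -1.49, -0.30, 1.39, -1.21, 0.78, 0.34, 0.25, -2.96, -2.96, -2.29, 1.76, 1.62],
    "G": [0.16, 1.31, 1.44, -0.30, -0.44, -0.17, -0.06, 0.34, 0.65, -1.21, 1.79, -1.49, -2.96, -2.29],
    "T": [-0.60, -1.21, -1.21, 0.96, -1.21, -1.49, -0.60, -0.30, -0.78, 1.73, -2.29, -1.49, -1.84, -0.98],
}
WIDTH = len(PWM["A"])


def absolute_score(site):
    """Sum the weight of the observed base at each position."""
    return sum(PWM[base][i] for i, base in enumerate(site))


def relative_score(site):
    """Functional depth: where the score sits between the worst and best possible."""
    best = sum(max(PWM[b][i] for b in "ACGT") for i in range(WIDTH))
    worst = sum(min(PWM[b][i] for b in "ACGT") for i in range(WIDTH))
    return (absolute_score(site) - worst) / (best - worst)


SITE = "GGGTCAGCATGGCC"   # first 14 bases of the example site
print(round(absolute_score(SITE), 2), round(relative_score(SITE), 2))
```

Summing the printed two-decimal weights gives 17.19 and -24.01 for the best and worst scores, a rounding difference from the slide's 17.20 and -24.02 that does not change the relative score of 0.86.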
G G G T C A G C A T G G C C A
Estimating p. values for a match to the matrix
A 0.58 -0.44 -0.98 -1.21 -2.29 1.22 -0.60 -0.60 -0.60 -2.96 -2.29 1.62 -2.29 -2.29 -0.72
C -0.44 -1.49 -1.49 -0.30 1.39 -1.21 0.78 0.34 0.25 -2.96 -2.96 -2.29 1.76 1.62 0.46
G 0.16 1.31 1.44 -0.30 -0.44 -0.17 -0.06 0.34 0.65 -1.21 1.79 -1.49 -2.96 -2.29 -0.64
T -0.60 -1.21 -1.21 0.96 -1.21 -1.49 -0.60 -0.30 -0.78 1.73 -2.29 -1.49 -1.84 -0.98 0.23
This sequence had a functional depth (f) of 0.86
The summed probability of all sequences with f >= .86 gives the p. value for this sequence, i.e. the chance that f >= .86 would be achieved by a randomized DNA sequence.
Short matrices can reach f > .9 but still have high p. values – thus f is the best measure for short seqs.
Long matrices can have very low p. values but still have f< .9 – thus p.value is the best measure for long seqs.
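One way to get such a p. value without enumerating all 4^14 sequences is to build the score distribution one column at a time. The sketch below is an illustration under stated assumptions, not how any particular program does it: it uses the first 14 PWM columns (as in the Max/Min rows), stores weights as integers in hundredths so scores add exactly, and assumes random DNA with 25% of each base.

```python
from collections import defaultdict

# First 14 PWM columns as (A, C, G, T) weights in hundredths (0.58 -> 58).
COLUMNS = [
    (58, -44, 16, -60), (-44, -149, 131, -121), (-98, -149, 144, -121),
    (-121, -30, -30, 96), (-229, 139, -44, -121), (122, -121, -17, -149),
    (-60, 78, -6, -60), (-60, 34, 34, -30), (-60, 25, 65, -78),
    (-296, -296, -121, 173), (-229, -296, 179, -229), (162, -229, -149, -149),
    (-229, 176, -296, -184), (-229, 162, -229, -98),
]


def score_distribution(columns, background=0.25):
    """Probability of every possible total score for a random sequence."""
    dist = {0: 1.0}
    for column in columns:
        nxt = defaultdict(float)
        for score, prob in dist.items():
            for weight in column:          # one term per base A, C, G, T
                nxt[score + weight] += prob * background
        dist = nxt
    return dist


def p_value(columns, threshold):
    """Chance that a random sequence scores at least `threshold` (in hundredths)."""
    dist = score_distribution(columns)
    return sum(prob for score, prob in dist.items() if score >= threshold)


site_score = 1157   # absolute score 11.57 of the example site
print(f"p = {p_value(COLUMNS, site_score):.2e}")
```

Because the relative score is a linear rescaling of the absolute score, asking for f >= .86 and asking for an absolute score >= 11.57 select the same set of sequences.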
Preparing for PWM search
Launch WinSCP (Start->programs->WinSCP)
Navigate to: /cluster/tufts/cbicourse/ChIPseq/Sample_NGS_data/Final_output_files
Pull over the “ipvinput19_peaks.xls” file to the PC (this is the MACS output file that we generated yesterday).
Open it in Excel.
Making .bed file w/ +/-200 bp around peak summit (where we expect TFBS enrichment will be greatest)
chr start end length summit tags -10*LOG10(pvalue) fold_enrichment FDR(%)
chr7 74606586 74607824 1239 571 181 3132.99 34.87 0
chr10 94601968 94603119 1152 541 174 3135.11 34.76 0
chr2 1.67E+08 1.67E+08 809 377 18 100.44 4.7 0.06
chr12 34760179 34761206 1028 496 22 101.03 4.17 0.06
chrX 48371756 48372420 665 437 18 100.29 4.12 0.06
New .bed columns (Excel formulas):
chr: =same row, chr column
start: =start col+summit-200
end: =start col+summit+200
• Copy these 3 columns (without any header row).
• In WinSCP click on any file on the PC, then on files->new->file & provide a name (“LiE_chr19_400bp.bed”) to edit a new simple text file.
• Paste, save & close.
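The same three columns can be produced with a short Python sketch instead of Excel formulas. The function names are illustrative, and the column positions (chr, start, end, length, summit) are taken from the MACS table shown above; the two example rows are copied from it.

```python
def summit_window(chrom, start, summit_offset, flank=200):
    """Return (chrom, start, end) for +/-flank bp around the peak summit."""
    center = start + summit_offset        # summit is given relative to peak start
    return chrom, center - flank, center + flank


def peaks_to_bed_lines(rows, flank=200):
    """rows: lists of MACS columns (chr, start, end, length, summit, ...)."""
    lines = []
    for row in rows:
        chrom, lo, hi = summit_window(row[0], int(row[1]), int(row[4]), flank)
        lines.append(f"{chrom}\t{lo}\t{hi}")
    return lines


rows = [
    "chr7 74606586 74607824 1239 571 181 3132.99 34.87 0".split(),
    "chr10 94601968 94603119 1152 541 174 3135.11 34.76 0".split(),
]
bed_lines = peaks_to_bed_lines(rows)
print("\n".join(bed_lines))
```

Reading the numbers as text and converting with int() also sidesteps the scientific-notation display (1.67E+08) that Excel applies to large coordinates.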
Making a file of control .bed regions
chr start end length summit tags -10*LOG10(pvalue) fold_enrichment FDR(%)
chr7 74606586 74607824 1239 571 181 3132.99 34.87 0
chr10 94601968 94603119 1152 541 174 3135.11 34.76 0
chr2 1.67E+08 1.67E+08 809 377 18 100.44 4.7 0.06
chr12 34760179 34761206 1028 496 22 101.03 4.17 0.06
chrX 48371756 48372420 665 437 18 100.29 4.12 0.06
Control .bed columns (Excel formulas):
chr: =peaks:chr
start: =peaks:start-5000
end: =peaks:end-5000
• 5000 bp is far enough away to not be part of an enhancer composed of the ER binding site... but is close enough to likely be in the same general chromatin territory (e.g. accessible euchromatin vs. inaccessible heterochromatin)
• Copy these columns & make a “CTRL_chr19_400bp.bed” file with WinSCP
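The offset control regions can be sketched the same way in Python. This assumes the input is the list of +/-200 bp peak-center regions (so the controls are also 400 bp wide); the function name and the skip rule for regions that would run off the chromosome start are additions for illustration.

```python
def offset_controls(regions, offset=5000):
    """Shift each (chrom, start, end) region `offset` bp toward lower coordinates."""
    controls = []
    for chrom, start, end in regions:
        if start - offset < 0:
            continue                      # would fall off the start of the chromosome
        controls.append((chrom, start - offset, end - offset))
    return controls


peak_centers = [("chr7", 74606957, 74607357), ("chr10", 94602309, 94602709)]
controls = offset_controls(peak_centers)
for chrom, start, end in controls:
    print(f"{chrom}\t{start}\t{end}")
```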
CentDist
A TFBS enrichment program designed for ChIP-seq data
Go to: http://biogpu.ddns.comp.nus.edu.sg/~chipseq/webseqtools2/TASKS/Motif_Enrichment/submit.php?email=guest
…or (more simply) just google centdist and click on the first link (should end in /centdist/)
Assumes that TFBS-matrix hits will be most highly enriched at the centers of ChIP-seq peaks.
Adds PWM score to “peakiness” score (i.e. how much more enriched the TF site is in the center of the peak) to give the final p. val.
Good enrichment, good shape (best p.)
Good enrichment, OK shape
Good enrichment, poor shape (higher p. val.)
Run CentDist
Give centdist a name for your run
Upload your +/-200 bp .bed file (CentDist does not need a separate background file, instead using flanking sequences at a case-specific optimized distance as background)
Check “Jaspar”, “vertebrate” & set max-co-motif distance to 3000
Then click Submit Job
On the new window that opens click “turn on autorefresh” so you will be notified when the job ends
Jaspar vs. Transfac
Jaspar is a freely-available set of TFBS matrices that can be downloaded from jaspar.genereg.net
Transfac is a commercial product; you need to pay for the latest release. A version of Transfac (from ~2006) is available on the cluster (e.g. /cluster/home/g/s/gschni01/vertebrates.mat)
Which to use? Both, ideally, but beware that programs like CentDist will give you results from Transfac matrices – and you won’t be able to find out details of what they are.
CentDist Results
View by factors, put in max number & hit go.
• P. values (based on a Score composed of Z0 (enrichment) & Z1 (peakiness))
• Distribution graph
• Weblogo representation of Jaspar matrix
Shows information content at each position. A,G,C&T 25% each-> 0 bits, only 1 base 100%->2 bits. Bases most highly over-represented relative to chance are largest.
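The bit values quoted above follow from the standard information-content formula, 2 + sum of p*log2(p) over the four bases. A minimal Python sketch (ignoring the small-sample correction that some logo tools apply; the function names are illustrative):

```python
from math import log2


def information_content(probs):
    """Bits of information in one motif column: 2 + sum(p * log2 p)."""
    return 2 + sum(p * log2(p) for p in probs if p > 0)


def letter_heights(probs_by_base):
    """Height of each letter in the logo: its probability times the column's bits."""
    bits = information_content(probs_by_base.values())
    return {base: p * bits for base, p in probs_by_base.items()}


print(information_content([0.25, 0.25, 0.25, 0.25]))   # all four bases equally likely
print(information_content([1.0, 0.0, 0.0, 0.0]))       # only one base ever seen

# Position 11 of the ERE PFM (A=1, C=0, G=44, T=1 out of 46 sequences).
column = {"A": 1 / 46, "C": 0.0, "G": 44 / 46, "T": 1 / 46}
heights = letter_heights(column)
print({base: round(h, 2) for base, h in heights.items()})
```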
How many enriched TF sites are there really?
Matrix hits are enriched for many factors. Are all of them real?
Maybe not: notice how similar the top sites are to each other and to estrogen response elements (EREs) such as V$jaspar_ESR1
V$jaspar_HNF4A
V$jaspar_NR2F1
V$jaspar_ESR1
Downloading CentDist Results
Click on download as text & save the file somewhere you remember.
Open it in Excel. It contains basic summary statistics & a few other things.
Many questions unanswered:
- What is the fold enrichment over background?
- What are the peaks with the greatest enrichment for each factor?
- What are the TFBS hit locations in each peak?
- Which are the real enriched TFBSes & which are just showing up by homology?
- Do certain factors group together into the same peaks?
Storm
Storm is a straightforward PWM scanning program that runs in UNIX.
Its greatest advantage is that it gives you all of the unprocessed output data, which allows you to do much more powerful analyses!
It also allows us to specify thresholds for matches to the matrix – allowing us to use functional depth as well as p. value
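To show what a PWM scanner does in principle, here is a toy Python sketch. It is not Storm's implementation, and the 4-bp matrix and test sequence are invented for illustration: the scanner slides the matrix along both strands and reports every window whose functional depth reaches a threshold.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Toy 4-column PWM favouring the sequence TGAC (hypothetical weights).
TOY_PWM = {
    "A": [-1, -1, 1, -1],
    "C": [-1, -1, -1, 1],
    "G": [-1, 1, -1, -1],
    "T": [1, -1, -1, -1],
}


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def scan(seq, pwm, min_depth=0.8):
    """Return (position, strand, matched sequence, depth) for every passing window."""
    width = len(pwm["A"])
    best = sum(max(pwm[b][i] for b in "ACGT") for i in range(width))
    worst = sum(min(pwm[b][i] for b in "ACGT") for i in range(width))
    hits = []
    for strand, strand_seq in (("+", seq), ("-", revcomp(seq))):
        for i in range(len(strand_seq) - width + 1):
            window = strand_seq[i:i + width]
            if any(base not in pwm for base in window):
                continue                  # skip windows containing N etc.
            score = sum(pwm[base][j] for j, base in enumerate(window))
            depth = (score - worst) / (best - worst)
            if depth >= min_depth:
                # Report minus-strand hits in forward-strand coordinates.
                pos = i if strand == "+" else len(seq) - width - i
                hits.append((pos, strand, window, round(depth, 2)))
    return hits


hits = scan("AATGACCGTCAGG", TOY_PWM, min_depth=1.0)
print(hits)
```

Keeping every hit with its position and strand, rather than only a summary, is the kind of unprocessed output that makes the downstream analyses possible.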
Getting DNA for Storm
To run storm, we first need to get the actual DNA sequence for centers of our peaks (where we expect the greatest enrichment for TFBSes to be).
Go to the UCSC genome browser at: genome.ucsc.edu
Under genome choose mouse mm9
Then choose add custom track & upload your +/-200 bp .bed file.
Click on Tools->Table Browser
Select your new track
Select output format “sequence”
Provide a file name “LiE_chr19_400bp.fa” & hit “get output”
Hit ‘get output’ again on the next page
Now do the same for your “CTRL_chr19_400bp.bed” file.
.fa denotes a simple ‘fasta’ format sequence file.
Cleaning up our .fa files
Use WinSCP to move these .fa files and their corresponding .bed files to your …/chip directory.
All of the commands below are in the file /cluster/tufts/cbi*/Ch*/Sam*/Final*/workflow2.txt; cat this to your screen to copy & paste commands.
Each entry in the .fa file has a header with special characters in it that confuse storm. To fix this, go to your …/chip directory in Putty & do:
perl /cluster/home/g/s/gschni01/perl*/Lax_convert.pl LiE_chr19_400bp.fa > LiE_chr19_400bp_converted.fa
To see what has changed use:
head *.fa
Do the same for your “CTRL_chr19_400bp.fa” file.
Running storm
First set some path variables:
export CREAD=/cluster/home/g/s/gschni01/cread-0.84
export PATH=$PATH:$CREAD/bin
Then run storm for your IP .fa file:
bsub -oo LiE_chr19_400bp_p.storminfo storm -p -t 0.0005 -s LiE_chr19_400bp_converted.fa -o LiE_chr19_400bp_p.storm /cluster/home/g/s/gschni01/Jaspar_non_redundant_vertebrate.mat
And for your control .fa file:
bsub -oo CTRL_chr19_400bp_p.storminfo storm -p -t 0.0005 -s CTRL_chr19_400bp_converted.fa -o CTRL_chr19_400bp_p.storm /cluster/home/g/s/gschni01/Jaspar_non_redundant_vertebrate.mat
Use more to look at one of your .storm output files (space for next page, ctrl-c to exit)
Interpreting Storm data
Run the dme_parse perl program to gather and tabulate your storm data:
bsub -oo LiE_chr19_400bp_p.dmeparseinfo perl /cluster/home/g/s/gschni01/perl*/dme_parse5.4.pl LiE_chr19_400bp_p.storm LiE_chr19_400bp.bed peaks
bsub -oo CTRL_chr19_400bp_p.dmeparseinfo perl /cluster/home/g/s/gschni01/perl*/dme_parse5.4.pl CTRL_chr19_400bp_p.storm CTRL_chr19_400bp.bed peaks
dme_parse outputs
…storm.bed file: Has UCSC browser tracks for each TFBS matrix with locations of all hits in bed format.
…storm.map file: Lists all input matrices followed by the PFM derived from all of the hits to this matrix from our data.
…storm.info file: Summarizes a lot of information about matrix hits
Move the .info files to your PC with WinSCP & open them in Excel. The file provides summary statistics for # of peaks with 0, 1, 2, etc. hits, total hits, and normalized hits per 50 bp vs. distance from peak center.
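The "normalized hits per 50 bp vs. distance from peak center" idea can be sketched in Python as below. This is an illustration of the binning only, not dme_parse itself; the function name, the choice to express density as matches per kb, and the example offsets are assumptions.

```python
def density_profile(hit_offsets, n_peaks, half_width=200, bin_size=50):
    """Bin hit positions (bp relative to peak center) and return matches/kb per bin.

    Returns a list of (bin center, average matches per kb) pairs.
    """
    n_bins = 2 * half_width // bin_size
    counts = [0] * n_bins
    for offset in hit_offsets:
        if -half_width <= offset < half_width:
            counts[(offset + half_width) // bin_size] += 1
    profile = []
    for i, count in enumerate(counts):
        center = -half_width + i * bin_size + bin_size // 2
        per_kb = count * 1000 / (n_peaks * bin_size)   # average over peaks, scaled to 1 kb
        profile.append((center, per_kb))
    return profile


# Hypothetical hit offsets pooled from 2 peaks.
profile = density_profile([-10, 5, 20, -190, 199], n_peaks=2)
for center, per_kb in profile:
    print(center, per_kb)
```

Plotting the IP profile next to the offset-control profile shows at a glance whether hits pile up near the peak center or are spread evenly.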
dme_parse outputs
Using the .info file to plot relative density of TFBS hits in aorta IP, liver IP & offset controls:
[Three plots, one each for the EBF, ER and MYC matrices: average matches/kb vs. BP from peak apex (-225 to 225), with series AoE_*_avg, BKG4_AoE_*_avg, LiE_*_avg and BKG4_LiE_*_avg]
dme_parse outputs
Using the .info files to structure binomial tests
Hits = # of matches to each matrix in IP data
Tests = # of times storm tested for a match = (# of peaks) * (400 bp length of peaks - matrix length)
Background freq = matches to offset control peak data / # tests (same as for IP)
Using the .info files to determine fractional enrichment
Hit frequency in IP data/Hit frequency in offset control
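These quantities are simple to compute once the hit totals are read from the .info files. A Python sketch with invented numbers (500 peaks, a 15-bp matrix, 120 IP hits and 40 control hits are all hypothetical; the function name is illustrative):

```python
def storm_test_inputs(ip_hits, ctrl_hits, n_peaks, matrix_len, peak_len=400):
    """Return (tests, background frequency, fold enrichment) for one matrix."""
    tests = n_peaks * (peak_len - matrix_len)      # windows storm examined
    bkg_freq = ctrl_hits / tests                   # same # of tests as for the IP
    ip_freq = ip_hits / tests
    fold = ip_freq / bkg_freq if ctrl_hits else float("inf")
    return tests, bkg_freq, fold


tests, bkg_freq, fold = storm_test_inputs(
    ip_hits=120, ctrl_hits=40, n_peaks=500, matrix_len=15)
print(tests, bkg_freq, fold)
# These feed straight into R: binom.test(120, tests, bkg_freq)
```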
dme_parse outputs
.freqs file: Number of hits to each matrix for each peak
Distribution of hits per peak in the offset background establishes the # of hits a peak needs to be enriched over background at p.<=.05
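One simple way to turn the background distribution into a cutoff is an empirical threshold: the smallest hit count reached by no more than 5% of control regions. The Python sketch below uses invented counts and a hypothetical function name; the full method is described in the course's analysis-methods document and may differ.

```python
def hits_threshold(ctrl_hits_per_peak, alpha=0.05):
    """Smallest k such that at most `alpha` of control regions have >= k hits."""
    n = len(ctrl_hits_per_peak)
    k = 0
    while sum(1 for h in ctrl_hits_per_peak if h >= k) / n > alpha:
        k += 1
    return k


# Hypothetical control data: hits to one matrix in each of 100 offset regions.
ctrl = [0] * 80 + [1] * 16 + [2] * 3 + [3] * 1
k = hits_threshold(ctrl)

# Hypothetical IP data: peak name -> hits to the same matrix.
ip = {"peak1": 0, "peak2": 2, "peak3": 1, "peak4": 4}
enriched = sorted(name for name, h in ip.items() if h >= k)
print(k, enriched)
```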
• Can also look for significant overlaps between the peaks with enrichment for 2 different factors - to identify cooperative versus antagonistic interactions.
• Allows identification of sites at which a given TFBS may be functionally targeted (candidates for further testing)
Details on how to do these analyses are in ChIPseq_analysis_methods_2013_02_11 on the cbi website.
Go to www.benoslab.pitt.edu/stamp/index.php
STAMP lets you compare matrices for evolutionary similarities to each other.
Go to your CentDist output.
Create a new column in which you change the names of the factors to fit with the names in the Jaspar_non_redundant_vertebrate.mat file you used for Storm.
=substitute(b2,"V$jaspar_","Jaspar$"), & propagate down
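The same renaming and p. value filter can be sketched in Python for anyone who prefers a script to Excel. The prefix swap mirrors the substitute formula; the p. values in the example rows are invented, and the function name is illustrative.

```python
def centdist_to_storm_name(name):
    """Rename a CentDist factor so it matches the Jaspar .mat naming."""
    return name.replace("V$jaspar_", "Jaspar$")


# (factor name, p. value) pairs; the p. values here are hypothetical.
rows = [
    ("V$jaspar_ESR1", 0.001),
    ("V$jaspar_HNF4A", 0.01),
    ("V$jaspar_NR2F1", 0.2),
]

selected = [centdist_to_storm_name(name) for name, p in rows if p < 0.05]
print("\n".join(selected))   # one matrix name per line, ready to save as a text file
```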
Select all matrix names w/ p.<.05 & paste them into a new file called “select_mats.txt” in your /chip folder on the cluster using WinSCP.
STAMP
Getting STAMP to help classify our CentDist top hits
perl /cluster/home/g/s/gschni01/perl*/MatrixSelect.pl /cluster/home/g/s/gschni01/Jaspar_non_redundant_vertebrate.mat select_mats.txt select_mats.mat
Now, open the select_mats.mat file with WinSCP, copy everything & paste it into STAMP.
Keep all the STAMP defaults & hit submit.
STAMP Tree
This indicates that enrichment of PPARG, RORA, NR4A2 could be just because of their similarity to EREs.
Other enriched sites, such as SP1, FoxA2 & Myf, fall in separate homology classes.
To further distinguish which one is real, you can use the enrichment ratios & p. values (the “real” TFBS should be best in both of these).