Advanced ChIPseq Identification of consensus binding sites for the LEAFY transcription factor.

Scientific Objective

The LEAFY transcription factor has been shown (Moyroud et al. 2011) to bind a dimer of the motif CCANTG[G/T]

We will use data from a chromatin immunoprecipitation assay on the LEAFY protein to:

• Identify LEAFY binding targets

• Attempt confirmation of the binding site

A Few Known LEAFY targets

Gene Name         Locus
APETALA1 (AP1)    AT1G69120.1
AGAMOUS (AG)      AT4G18960.1
LMI2              AT3G61250.1
LMI3              AT5G49770.1
LMI4              AT5G60630.1
LMI5              AT1G16070.1

Look for LEAFY enrichment at these loci in IGV 2.0

AP1 (APETALA1) Mutant

Why do we even care about LEAFY? Well, it activates AP1. If AP1 is not active, Arabidopsis can’t make flowers and instead makes cauliflowers!

Figure panels: Wild-type, ap1

ChIPseq Conceptual Overview

The NCBI SRA

• NCBI SRA is a repository for NGS sequence reads

• Data is stored in association with basic metadata explaining experimental technique and inter-sample relationships

• Data format: NCBI-specific SRA and SRA-lite, intended as a “universal”, lossless format.

• Upload and download are offered via FTP and HTTP, and also via Aspera ASCP, a fast, parallel protocol similar in performance to the iRODS iput/iget commands used in the iPlant Data Store

• One can use NCBI SRA Import to rapidly copy SRA accession SRP003928 over ASCP into the iPlant Data Store.

Import SRA data from NCBI SRA

Extract FASTQ files from the downloaded SRA archives

NCBI SRA Toolkit

• SRA data format is a universal format, but no downstream apps can accept it natively.

• Need to export SRA to FASTQ, SFF, etc.

• These are the standard file formats for representing sequence.

• Use the NCBI SRA Toolkit fastq-dump to export FASTQ sequence files from SRA files so we can process them

BWA

• BWA is one of many applications whose objective is to efficiently align short sequence reads to a reference genome sequence

• Alternatives include Bowtie, MAQ, TopHat, Stampy, Novoalign, and others.

• BWA was developed and used by the human 1000 Genomes Project due to its speed and accuracy.

• BWA-MEM (BWA 0.7.4) is a fast variant of BWA that can also handle long reads. It is newly available in the iPlant DE

Outputs from BWA

• BWA emits alignments in the SAM format

• SAM is a generic format for describing next-gen sequence reads and their corresponding genome alignments

• SAMTools is a suite of applications for manipulating SAM files

– Sort, Merge, Index, and more

– Emit as binary BAM file

• All SAMTools functions are in the DE
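
To make the SAM format a little more concrete, the sketch below parses one tab-delimited alignment line into its main mandatory fields and checks two standard FLAG bits (0x4 for unmapped, 0x10 for reverse strand). The helper names and the example line are illustrative assumptions, not SAMTools code or output from this data set.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise FLAG
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    seq: str     # read sequence


def parse_sam_line(line: str) -> SamRecord:
    """Parse one SAM alignment line (11 mandatory tab-separated fields)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("A SAM alignment line needs at least 11 fields")
    return SamRecord(
        qname=fields[0],
        flag=int(fields[1]),
        rname=fields[2],
        pos=int(fields[3]),
        mapq=int(fields[4]),
        cigar=fields[5],
        seq=fields[9],
    )


def is_mapped(record: SamRecord) -> bool:
    """FLAG bit 0x4 set means the read is unmapped."""
    return not (record.flag & 0x4)


def is_reverse_strand(record: SamRecord) -> bool:
    """FLAG bit 0x10 set means the read aligned to the reverse strand."""
    return bool(record.flag & 0x10)


# Hypothetical alignment line for illustration only.
EXAMPLE_SAM_LINE = "read1\t16\tChr1\t25982\t60\t7M\t*\t0\t0\tCCATTGG\tIIIIIII"
example_record = parse_sam_line(EXAMPLE_SAM_LINE)
```

In practice the SAMtools apps in the DE handle sorting, merging, and indexing; this sketch only shows what a single record encodes.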

Align FASTQ files to Arabidopsis genome using BWA

Merge and index BAM files using SAMtools apps

PeakRanger

• PeakRanger is a fast, optimized algorithm for detecting enrichment peaks in ChIPseq data sets

• PeakRanger was developed at OICR in partnership between modENCODE and iPlant and is now maintained at UTSW

• It’s not the only option for peak finding:

– MACS

– ChIPseq Peak Finder

– CisGenome

– FindPeaks

http://ranger.sourceforge.net/

Use PeakRanger with the BAM files from the Control and Sample assays to find LEAFY enrichment

NOTE: There are many parameters to tweak. Reading the PeakRanger paper is recommended.

Outputs from PeakRanger

• Wiggle (.wig) files: Density map of sequence reads across the reference genome for control and sample BAM alignments

• Region (.bed) file: Feature file containing the significantly enriched domains in the genome

• Summit (.bed) file: Feature file containing the single base maximum of each peak

Figure panels: Wiggle file, BED file

Integrative Genomics Viewer

The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets. It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations.

Use IGV to inspect outputs from PeakRanger

http://www.broadinstitute.org/igv/

Using IGV in Atmosphere

1. Launch an instance of RNA-Seq Visualization (or any image that has IGV) from the Atmosphere App list

2. Use a VNC client to connect to your remote desktop

3. Configure iDrop

4. Copy .wig and .bed files from the PeakRanger output to your Atmosphere instance desktop

5. Launch IGV (Integrative Genomics Viewer)

6. Change the current genome to A. thaliana (TAIR10)

7. Open igvtools and convert the .wig file to .tdf

8. Load the .tdf and .bed files into the IGV window

9. Inspect loci by entering their names into the search box

Figure: Enrichment region and alignment peak at the promoter region of APETALA1 (AP1)

Filtering the PeakRanger summits file

The statistically best summits from PeakRanger have P-values of zero. If you look at the summits.bed file, you can see that the P-value is embedded in the name of each feature. So, if we filter summits.bed for only the lines matching pval_0, we will generate a BED file containing the summits most likely to be near true LEAFY binding sites.

This is identical to running

    egrep "pval_0" peakranger_summit.bed > peakranger_summit_best.bed

on a command line

Find Lines Matching a Regular Expression
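
The same filter can be sketched in Python. The example feature names below are an assumption about how the P-value might appear in the name column; they are not copied from real PeakRanger output. Note that a plain `pval_0` pattern, like the egrep command, would also match a name such as `pval_0.004`, so the sketch includes an optional stricter pattern for the case where that matters.

```python
import re
from typing import Iterable, List

# Same pattern as the egrep command: matches anywhere in the line.
LOOSE_PATTERN = r"pval_0"
# Stricter, hypothetical alternative: "pval_0" not followed by a digit or a dot.
STRICT_PATTERN = r"pval_0(?![.\d])"


def filter_summits(lines: Iterable[str], pattern: str = LOOSE_PATTERN) -> List[str]:
    """Keep only the BED lines in which the regular expression is found."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


# Hypothetical summit lines; the name format in column 4 is assumed.
example_lines = [
    "Chr1\t25982\t25983\tranger_pval_0_fdr_0\t0\t+",
    "Chr1\t30000\t30001\tranger_pval_0.004_fdr_0.01\t0\t+",
    "Chr4\t10383000\t10383001\tranger_pval_1e-12_fdr_1e-10\t0\t+",
]

loose_hits = filter_summits(example_lines)
strict_hits = filter_summits(example_lines, STRICT_PATTERN)
```

Whether the stricter pattern is needed depends on how your summits file actually writes small P-values, so it is worth inspecting a few lines first.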

BEDTools for Interval Operations

The BEDTools utilities allow one to address common genomics tasks such as finding feature overlaps and computing coverage. The utilities are largely based on four widely-used file formats: BED, GFF/GTF, VCF, and SAM/BAM. Using BEDTools, one can develop sophisticated pipelines that answer complicated research questions by "streaming" several BEDTools together.

* The entire BEDtools suite is now integrated into the iPlant DE. Follow us on Twitter @iPlantCollab to learn when new tools become available.

slopBed – Expand the coordinates of features in a BED file by a defined number of bases

fastaFromBed – Extract a multiFASTA file from a reference sequence using a BED file of features

Objective

Go from a BED file of single-base peak summits to a FASTA file containing the 100 bp surrounding those summits that can be used for motif hunting

Workflow

1. Filter summits.bed on pval_0 -> Best Summits BED File (single base pair features)

2. BEDTools slopBed, 50bp equidistant -> 100 bp Region BED File (100 bp centered on peak centers)

3. BEDTools fastaFromBed, Arabidopsis genome -> FASTA file of 100 bp regions (likely to contain consensus motifs)

4. DREME

Also shown in the diagram: Peak regions from PeakRanger and/or MACS -> IntersectBed peak regions -> Peaks found by both codes
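
The slopBed and fastaFromBed steps can be sketched in plain Python to show what happens to the coordinates. This is not BEDTools code: the function names, the toy genome, and the summit positions are all invented for illustration, and the sketch assumes BED-style 0-based, half-open intervals.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), 0-based, end exclusive


def slop(interval: Interval, flank: int, chrom_sizes: Dict[str, int]) -> Interval:
    """Extend an interval by `flank` bases on each side, clipped to the chromosome."""
    chrom, start, end = interval
    return chrom, max(0, start - flank), min(chrom_sizes[chrom], end + flank)


def extract_fasta(intervals: List[Interval], genome: Dict[str, str]) -> str:
    """Build a multi-FASTA string with one record per interval."""
    records = []
    for chrom, start, end in intervals:
        records.append(f">{chrom}:{start}-{end}\n{genome[chrom][start:end]}")
    return "\n".join(records) + "\n"


# Toy 200 bp "genome" and two hypothetical single-base summits.
toy_genome = {"Chr1": "ACGT" * 50}
toy_sizes = {name: len(seq) for name, seq in toy_genome.items()}
summits = [("Chr1", 100, 101), ("Chr1", 10, 11)]

windows = [slop(summit, 50, toy_sizes) for summit in summits]
fasta_text = extract_fasta(windows, toy_genome)
```

A single-base summit extended by 50 bp on each side gives a 101 bp window, and a summit near a chromosome end is clipped, which is why the second window here is shorter.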

DREME

• Run DREME on 100bp windows surrounding LEAFY peaks

• Download results

DREME results

CCANTG[G/T]! Success!
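
One way to check sequences for this consensus is to turn it into a regular expression. In the sketch below the motif is written with IUPAC codes as CCANTGK (K meaning G or T) and searched on both strands; the function names and test sequences are illustrative assumptions and are not part of DREME.

```python
import re
from typing import List, Tuple

# Only the IUPAC codes needed for this motif and its reverse complement.
IUPAC_TO_REGEX = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "[ACGT]", "K": "[GT]", "M": "[AC]",
}
IUPAC_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N", "K": "M", "M": "K"}


def motif_to_regex(motif: str) -> str:
    """Translate an IUPAC motif into a regular expression."""
    return "".join(IUPAC_TO_REGEX[base] for base in motif)


def reverse_complement_motif(motif: str) -> str:
    """Reverse-complement an IUPAC motif."""
    return "".join(IUPAC_COMPLEMENT[base] for base in reversed(motif))


def find_motif(sequence: str, motif: str = "CCANTGK") -> List[Tuple[int, str, str]]:
    """Return sorted (0-based start, strand, matched text) for hits on both strands."""
    hits = []
    for strand, oriented in (("+", motif), ("-", reverse_complement_motif(motif))):
        # A lookahead lets overlapping matches be reported as well.
        pattern = re.compile(f"(?=({motif_to_regex(oriented)}))")
        for match in pattern.finditer(sequence.upper()):
            hits.append((match.start(), strand, match.group(1)))
    return sorted(hits)


plus_hits = find_motif("GGCCATTGTAA")
minus_hits = find_motif("TTACAATGGCC")
```

Minus-strand hits are reported as they read on the forward strand. Running something like this over the 100 bp windows gives a quick count of how many contain the motif.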

Potential Next Steps

• Identify all consensus LEAFY sites in the genome that fall in promoters

• Extract all the promoters where LEAFY has significant binding and associate them with genes.

• Generate a simple gene list and run Ontology Term enrichment analysis to find classes of genes influenced by LEAFY
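
As a sketch of how the promoter step might look, the code below defines a promoter as a fixed window upstream of each gene, taking strand into account, and lists the genes whose promoter contains at least one summit. The 1000 bp window, the gene names, and all coordinates are assumptions made for illustration; real work would use TAIR10 annotations and a tool such as BEDTools.

```python
from typing import List, NamedTuple, Tuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int   # 0-based start of the gene
    end: int     # end of the gene, exclusive
    strand: str  # "+" or "-"


def promoter(gene: Gene, upstream: int = 1000) -> Tuple[str, int, int]:
    """Return the window `upstream` bp before the gene's 5' end."""
    if gene.strand == "+":
        return gene.chrom, max(0, gene.start - upstream), gene.start
    return gene.chrom, gene.end, gene.end + upstream


def genes_with_summit_in_promoter(
    genes: List[Gene], summits: List[Tuple[str, int]], upstream: int = 1000
) -> List[str]:
    """List gene names whose promoter window contains at least one summit."""
    hits = []
    for gene in genes:
        chrom, p_start, p_end = promoter(gene, upstream)
        if any(s_chrom == chrom and p_start <= pos < p_end for s_chrom, pos in summits):
            hits.append(gene.name)
    return hits


# Hypothetical genes and summit positions.
toy_genes = [
    Gene("geneA", "Chr1", 5000, 7000, "+"),
    Gene("geneB", "Chr1", 10000, 12000, "-"),
    Gene("geneC", "Chr1", 20000, 21000, "+"),
]
toy_summits = [("Chr1", 4500), ("Chr1", 12500), ("Chr2", 4500)]
bound_genes = genes_with_summit_in_promoter(toy_genes, toy_summits)
```

The resulting gene list is the kind of input an Ontology Term enrichment analysis would take.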

Cyberinfrastructure Overview

Component: iPlant Data Store
What we did: Imported data from SRA. Stored results of analyses. Downloaded results.
Why we used it: Fast, flexible storage for large bioinformatics data.

Component: Discovery Environment
What we did: Data import. NGS Alignment. Peak Finding. Data organization.
Why we used it: One interface. Multiple bioinformatics applications. Easy to manage work products.

Component: Atmosphere
What we did: Loaded results into desktop client application.
Why we used it: Avoid downloading large files to personal computer. Easy access to powerful desktop environment.

On to the Exercise