Transcription factor binding sites and gene regulatory network Victor Jin Department of Biomedical...

Post on 12-Jan-2016

Transcription factor binding sites and gene regulatory network

Victor Jin
Department of Biomedical Informatics

The Ohio State University

Transcription in higher eukaryotes

Gene Expression

1. Chromatin structure

2. Initiation of transcription

3. Processing of the transcript

4. Transport to the cytoplasm

5. mRNA translation

6. mRNA stability

7. Protein activity and stability

Transcriptional Regulation

[Figure labels: Nuclear membrane; Binding site/motif CCG__CCG; Genome-wide mRNA transcript data (e.g. microarrays)]

Learning problems:

• Understand which regulators control which target genes

• Discover motifs representing regulatory elements

Some common approaches

• Cluster-first motif discovery
– Cluster genes by expression profile, annotation, … to find potentially coregulated genes
– Find overrepresented motifs in promoter sequences of similar genes (algorithms: MEME, Consensus, Gibbs sampler, AlignACE, …)

(Spellman et al. 1998)
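The second step of the cluster-first approach asks whether a motif occurs in a cluster's promoters more often than expected. The sketch below is not MEME, Consensus, a Gibbs sampler, or AlignACE, which fit probabilistic motif models; it only illustrates the underlying idea of over-representation with an exact-word match and a hypergeometric tail probability. The function names, gene names, and promoter sequences are hypothetical toy values.

```python
from math import comb

def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n promoters are drawn from N, of which K contain the motif."""
    upper = min(n, K)
    total = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return total / comb(N, n)

def motif_enrichment(motif, cluster_genes, promoters):
    """Count motif-containing promoters in a cluster and test for enrichment.

    promoters maps gene name -> promoter sequence for all genes (the background);
    cluster_genes is the subset of gene names found by clustering.
    """
    N = len(promoters)
    K = sum(motif in seq for seq in promoters.values())
    n = len(cluster_genes)
    k = sum(motif in promoters[g] for g in cluster_genes)
    return k, hypergeom_tail(k, n, K, N)

# Toy data (hypothetical): three of six promoters contain the word TGACC.
promoters = {
    "gene1": "AAGGTCATTTGACCTA",
    "gene2": "CCTGACCGGA",
    "gene3": "TTTGACCAAA",
    "gene4": "ACGTACGTAC",
    "gene5": "GGGGCCCCAA",
    "gene6": "ATATATATAT",
}
cluster = ["gene1", "gene2", "gene3"]
hits, p_value = motif_enrichment("TGACC", cluster, promoters)
print(hits, p_value)
```

With real data the word match would be replaced by a motif model, and the p-values would need correction for the number of motifs tested.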

Training data – Features

label

promoter sequence

regulator expression

feature vector

What is PWM?

Transcription factor binding sites (TFBSs) are usually slightly variable in their sequences.

A position weight matrix (PWM) specifies, for each position of the motif, the probability of observing each base.

PWM for ERE

Pos   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
A    18   8   5   4   1  29   7   7   7   0   1  39   1   1   6
C     8   3   3   9  33   4  21  15  14   0   0   1  43  39  18
G    13  31  34   9   8  10  11  15  19   4  44   3   0   1   6
T     7   4   4  24   4   3   7   9   6  42   1   3   2   5  16
Con   N   G   G   T   C   A   N   N   N   T   G   A   C   C   N

1. acggcagggTGACCc

2. aGGGCAtcgTGACCc

3. cGGTCGccaGGACCt

4. tGGTCAggcTGGTCt

5. aGGTGGcccTGACCc

6. cTGTCCctcTGACCc

7. aGGCTAcgaTGACGt ...

41. cagggagtgTGACCc

42. gagcatgggTGACCa

43. aGGTCAtaacgattt

44. gGAACAgttTGACCc

45. cGGTGAcctTGACCc

46. gGGGCAaagTGACTg

Position frequency matrix (PFM)
(also known as raw count matrix)

Given N sequence fragments of fixed length, one can assemble a position frequency matrix (PFM), which records the number of times each nucleotide appears at each position. A normalized PFM, in which each column sums to one, is a matrix of probabilities for observing each nucleotide at each position.

Position weight matrix (PWM)
(also known as position-specific scoring matrix)

For efficient computational analysis, the PFM is converted to a log scale. To avoid taking the logarithm of zero counts, and to correct for small samples of binding sites, a sampling correction known as pseudocounts is added to each cell of the PFM.
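As a minimal sketch of the PFM definition, the code below counts bases column by column over aligned sites of equal length and then normalizes each column to probabilities. It uses the first four ERE sites listed on the earlier slide as a small illustration; the function names are hypothetical, and case is ignored when counting.

```python
from collections import Counter

BASES = "ACGT"

def position_frequency_matrix(sites):
    """Return {base: [count at position 0, 1, ...]} for aligned, equal-length sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    columns = [Counter(s[i].upper() for s in sites) for i in range(length)]
    return {b: [col[b] for col in columns] for b in BASES}

def normalize(pfm):
    """Divide each count by its column total so every column sums to one."""
    n_cols = len(pfm["A"])
    totals = [sum(pfm[b][i] for b in BASES) for i in range(n_cols)]
    return {b: [pfm[b][i] / totals[i] for i in range(n_cols)] for b in BASES}

# First four sites from the ERE list (illustration only, not the full set of 46).
sites = ["acggcagggTGACCc", "aGGGCAtcgTGACCc", "cGGTCGccaGGACCt", "tGGTCAggcTGGTCt"]
pfm = position_frequency_matrix(sites)
probs = normalize(pfm)
print(pfm["T"])
print(probs["T"])
```

With only four sites many cells are zero, which is the situation the pseudocount correction is meant to handle.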

Position Weight Matrix for ERE

Converting a PFM into a PWM

$$ w(b,i) = \log_2 \frac{p(b,i)}{p(b)}, \qquad p(b,i) = \frac{f_{b,i} + \sqrt{N}/4}{N + \sqrt{N}} $$

f_{b,i} – raw count (PFM matrix element) of nucleotide b in column i

N – number of sequences used to create PFM (= column sum)

\sqrt{N}/4 and \sqrt{N} – pseudocounts (correction for small sample size)

p(b) – background frequency of nucleotide b

For each matrix element do:

f_{b,i}:

A    18   8   5   4   1  29   7   7   7   0   1  39   1   1   6
C     8   3   3   9  33   4  21  15  14   0   0   1  43  39  18
G    13  31  34   9   8  10  11  15  19   4  44   3   0   1   6
T     7   4   4  24   4   3   7   9   6  42   1   3   2   5  16

A   0.58 -0.44 -0.98 -1.21 -2.29  1.22 -0.60 -0.60 -0.60 -2.96 -2.29  1.62 -2.29 -2.29 -0.72
C  -0.44 -1.49 -1.49 -0.30  1.39 -1.21  0.78  0.34  0.25 -2.96 -2.96 -2.29  1.76  1.62  0.46
G   0.16  1.31  1.44 -0.30 -0.44 -0.17 -0.06  0.34  0.65 -1.21  1.79 -1.49 -2.96 -2.29 -0.64
T  -0.60 -1.21 -1.21  0.96 -1.21 -1.49 -0.60 -0.30 -0.78  1.73 -2.29 -1.49 -1.84 -0.98  0.23
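The conversion can be sketched in a few lines. This is an illustration under stated assumptions, not the author's code: it adds a pseudocount of sqrt(N)/4 to each cell, divides by N + sqrt(N), and takes log2 of the ratio to a uniform background of 0.25. The uniform background is an assumption; it is consistent with weights on the slide such as 0.58, -2.96 and 1.79. The function name `pfm_to_pwm` is hypothetical.

```python
import math

# Raw counts from the slide (46 ERE sites, 15 positions).
PFM = {
    "A": [18, 8, 5, 4, 1, 29, 7, 7, 7, 0, 1, 39, 1, 1, 6],
    "C": [8, 3, 3, 9, 33, 4, 21, 15, 14, 0, 0, 1, 43, 39, 18],
    "G": [13, 31, 34, 9, 8, 10, 11, 15, 19, 4, 44, 3, 0, 1, 6],
    "T": [7, 4, 4, 24, 4, 3, 7, 9, 6, 42, 1, 3, 2, 5, 16],
}

def pfm_to_pwm(pfm, background=0.25):
    """Convert raw counts to log2-odds weights with sqrt(N) pseudocounts.

    background is the assumed frequency p(b), taken to be the same for all bases.
    """
    bases = list(pfm)
    n_cols = len(pfm[bases[0]])
    pwm = {b: [] for b in bases}
    for i in range(n_cols):
        n = sum(pfm[b][i] for b in bases)      # column sum N
        root = math.sqrt(n)
        for b in bases:
            p_bi = (pfm[b][i] + root / 4) / (n + root)
            pwm[b].append(math.log2(p_bi / background))
    return pwm

pwm = pfm_to_pwm(PFM)
for base in "ACGT":
    print(base, " ".join(f"{w:6.2f}" for w in pwm[base]))
```

A count of zero maps to a finite weight (about -2.96 here) instead of minus infinity, which is the purpose of the pseudocount.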

Scoring putative EREs by scanning the promoter with PWM

A   0.58 -0.44 -0.98 -1.21 -2.29  1.22 -0.60 -0.60 -0.60 -2.96 -2.29  1.62 -2.29 -2.29 -0.72
C  -0.44 -1.49 -1.49 -0.30  1.39 -1.21  0.78  0.34  0.25 -2.96 -2.96 -2.29  1.76  1.62  0.46
G   0.16  1.31  1.44 -0.30 -0.44 -0.17 -0.06  0.34  0.65 -1.21  1.79 -1.49 -2.96 -2.29 -0.64
T  -0.60 -1.21 -1.21  0.96 -1.21 -1.49 -0.60 -0.30 -0.78  1.73 -2.29 -1.49 -1.84 -0.98  0.23

G G G T C A G C A T G G C C A

Absolute score of the site

$$ S = \sum_{i=1}^{m} w(b,i) = 11.57 $$

                                                                                          Row Sum
Max   0.58  1.31  1.44  0.96  1.39  1.22  0.78  0.34  0.65  1.73  1.79  1.62  1.76  1.62    17.20
Min  -0.60 -1.49 -1.49 -1.21 -2.29 -1.49 -0.60 -0.60 -0.78 -2.96 -2.96 -2.29 -2.96 -2.29   -24.02

$$ \text{relative score} = \frac{\text{Absolute score} - \text{Minimum score}}{\text{Maximum score} - \text{Minimum score}} = \frac{11.57 - (-24.02)}{17.20 - (-24.02)} = 0.86 $$
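The scoring step can be sketched as follows. The weights are copied from the slide; only the first 14 columns are used, which is an assumption based on the Max and Min rows listing 14 values, so the site is scored on its first 14 bases. The function names, the toy promoter, and the 0.8 threshold are hypothetical. Because the tabulated weights are rounded, the computed bounds can differ from the slide's 17.20 and -24.02 in the last digit.

```python
# PWM weights from the slide, first 14 columns (assumption: these are the scored positions).
PWM = {
    "A": [0.58, -0.44, -0.98, -1.21, -2.29, 1.22, -0.60, -0.60, -0.60, -2.96, -2.29, 1.62, -2.29, -2.29],
    "C": [-0.44, -1.49, -1.49, -0.30, 1.39, -1.21, 0.78, 0.34, 0.25, -2.96, -2.96, -2.29, 1.76, 1.62],
    "G": [0.16, 1.31, 1.44, -0.30, -0.44, -0.17, -0.06, 0.34, 0.65, -1.21, 1.79, -1.49, -2.96, -2.29],
    "T": [-0.60, -1.21, -1.21, 0.96, -1.21, -1.49, -0.60, -0.30, -0.78, 1.73, -2.29, -1.49, -1.84, -0.98],
}
WIDTH = 14

def absolute_score(site):
    """Sum the weight of the observed base at each of the WIDTH positions."""
    if len(site) != WIDTH:
        raise ValueError(f"site must have {WIDTH} bases")
    return sum(PWM[base][i] for i, base in enumerate(site.upper()))

def score_bounds():
    """Best and worst possible absolute scores (column-wise max and min)."""
    best = sum(max(PWM[b][i] for b in PWM) for i in range(WIDTH))
    worst = sum(min(PWM[b][i] for b in PWM) for i in range(WIDTH))
    return best, worst

def relative_score(site):
    """Rescale the absolute score to the range 0..1."""
    best, worst = score_bounds()
    return (absolute_score(site) - worst) / (best - worst)

def scan(promoter, threshold=0.8):
    """Yield (offset, window, relative score) for windows scoring at or above threshold."""
    for start in range(len(promoter) - WIDTH + 1):
        window = promoter[start:start + WIDTH]
        rel = relative_score(window)
        if rel >= threshold:
            yield start, window, rel

site = "GGGTCAGCATGGCCA"[:WIDTH]          # site from the slide, first 14 bases
print(round(absolute_score(site), 2), round(relative_score(site), 2))

promoter = "TTTT" + site + "AAAA"          # toy promoter (hypothetical)
hits = list(scan(promoter))
print(hits)
```

A real scan would also score the reverse-complement strand and handle bases other than A, C, G and T; both are left out here to keep the sketch short.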

Yeast ESR: Biological Validation

STRE element

Universal stress repressor motif

Previous work: “Structure learning”

• Graphical models (and other methods)
– Learn structure of “regulatory network”, “regulatory modules”, etc.
– Fit interpretable model to training data
– Model small number of genes or clusters of genes
– Many computational and statistical challenges; often used for qualitative hypotheses rather than prediction

(Segal et al, 2003, 2004)

(Pe’er et al. 2001)

Signaling networks in a cell

• Regulator-motif associations in nodes can have different meanings: direct binding, indirect effect, or co-occurrence

• Need other data to confirm binding relationship between regulator and target (e.g. ChIP-chip)

• Still, can determine statistically significant regulator-target relationships from regulation program

Network inference

Example: oxygen sensing and regulatory network

• ChIP-chip: genome-wide protein-DNA binding data, i.e. which promoters are bound by a TF?

• Investigate regulatory network model: use ChIP-chip data in place of motifs (no motif discovery)
– Features: (regulator, TF-occupancy) pairs

Binding data for regulatory networks

Inferring regulatory networks from the combination of expression data and binding data

An extended ER regulatory network in MCF7 cells

[Network figure; node labels: CCNL1, BRF1, ER, FOS, MYC, CEBP, XBP1, RXRA, HSF2, PNN, NRIP1, TXNDC, IVNS1ABP, BATF, HES1, CHAF1B, CSDE1, CUTL1, PURB, ADAR, C140RF43, SP3, DDX20, ELF3, TXNIP, PAWR, BRIP1, FOXP4, ZNF394, BAZ1B, STRAP, ASCC3, MKL2, GTF2I, RUVBL1, RFC1, ZNF500, TTF2, RAB18, ZKSCAN1, MSX2, LASS2, HDAC1, ZBTB41, TBX2, THRAP1, VPS72, TLE3, BHLHB2, ZNF38, ZNF239, DNMT1, HIF1A, HEY2]

Signaling molecules -- Networks

• Find all SMs that associate as regulators with a particular TF’s ChIP occupancy in ADT features

• e.g. TF: Hsf1; SMs: Gac1, Gip1, Sds22 (Glc7 phosphatase complex)

• Hypothesis: Glc7 phosphatase complex interacts with Hsf1 in regulation of Hsf1 targets (interaction supported in literature)

[Diagram labels: SM, TF, mRNA]

Input Data

•FASTA file

•Contact Info

•Control data (optional)

Ab initio Motif Discovery Programs

•Weeder

•MaMf

•MEME

Statistical Methods

•Bootstrap re-sampling

•Fisher test

STAMP Matching

•Known or novel motifs

Results

•SeqLogo

•PWM

•P-value

http://motif.bmi.ohio-state.edu/ChIPMotifs/

http://motif.bmi-ohio-state.edu/HRTBLDb

Software Demo

• W-ChIPMotifs

• HRTargetDB