Page 1: Transcription factor binding sites and gene regulatory network Victor Jin Department of Biomedical Informatics The Ohio State University.

Transcription factor binding sites and gene regulatory network

Victor Jin
Department of Biomedical Informatics

The Ohio State University

Page 2:

Transcription in higher eukaryotes

Gene Expression

1. Chromatin structure

2. Initiation of transcription

3. Processing of the transcript

4. Transport to the cytoplasm

5. mRNA translation

6. mRNA stability

7. Protein activity and stability

Page 3:

Transcriptional Regulation

Nuclear membrane

Page 4:

Transcriptional Regulation

Nuclear membrane

Binding site/motif: CCG__CCG

Genome-wide mRNA transcript data (e.g. microarrays)

Page 5:

Transcriptional Regulation

Nuclear membrane

Binding site/motif: CCG__CCG

Learning problems:

• Understand which regulators control which target genes

• Discover motifs representing regulatory elements

Page 6:

Some common approaches

• Cluster-first motif discovery

– Cluster genes by expression profile, annotation, … to find potentially coregulated genes

– Find overrepresented motifs in promoter sequences of similar genes (algorithms: MEME, Consensus, Gibbs sampler, AlignACE, …)

(Spellman et al. 1998)
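
As a rough illustration of the second step, the sketch below ranks k-mers by how much more often they occur in the promoters of one cluster than in a background set. This is a naive enrichment count, not MEME, Consensus, a Gibbs sampler, or AlignACE; the function names, the pseudocount, and the toy sequences are all hypothetical.

```python
from collections import Counter


def kmer_presence(seqs, k):
    """For each k-mer, count how many sequences contain it at least once."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return counts


def enrichment(cluster, background, k=5, pseudo=1.0):
    """Rank k-mers by (fraction of cluster promoters) / (fraction of background promoters).

    `pseudo` is an illustrative pseudocount that avoids division by zero.
    """
    fg = kmer_presence(cluster, k)
    bg = kmer_presence(background, k)
    scores = {}
    for kmer, n in fg.items():
        fg_frac = (n + pseudo) / (len(cluster) + pseudo)
        bg_frac = (bg.get(kmer, 0) + pseudo) / (len(background) + pseudo)
        scores[kmer] = fg_frac / bg_frac
    # Highest ratio first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy promoters: the "cluster" shares one 5-mer, the background does not.
cluster = ["ttAGGGGcat", "gcaAGGGGtt", "AGGGGacgta"]
background = ["ttacgtacga", "gcatgcatgc", "acgtacgtta", "ttgcaatgca"]

ranked = enrichment(cluster, background, k=5)
print(ranked[0])
```

Real motif finders model degenerate positions and assess statistical significance; the ratio here only shows the idea of overrepresentation in coregulated promoters.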

Page 7:

Training data – Features

label

promoter sequence

regulator expression

feature vector

Page 8:

What is PWM?

Transcription factor binding sites (TFBSs) are usually slightly variable in their sequences.

A position weight matrix (PWM) specifies the probability of observing each base at each position of the motif.

Pos   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
A    18   8   5   4   1  29   7   7   7   0   1  39   1   1   6
C     8   3   3   9  33   4  21  15  14   0   0   1  43  39  18
G    13  31  34   9   8  10  11  15  19   4  44   3   0   1   6
T     7   4   4  24   4   3   7   9   6  42   1   3   2   5  16
Con   N   G   G   T   C   A   N   N   N   T   G   A   C   C   N

Page 9:

PWM for ERE

1. acggcagggTGACCc

2. aGGGCAtcgTGACCc

3. cGGTCGccaGGACCt

4. tGGTCAggcTGGTCt

5. aGGTGGcccTGACCc

6. cTGTCCctcTGACCc

7. aGGCTAcgaTGACGt ...

41. cagggagtgTGACCc

42. gagcatgggTGACCa

43. aGGTCAtaacgattt

44. gGAACAgttTGACCc

45. cGGTGAcctTGACCc

46. gGGGCAaagTGACTg

Position frequency matrix (PFM) (also known as raw count matrix)

Given N sequence fragments of fixed length, one can assemble a position frequency matrix (number of times a particular nucleotide appears at a given position). A normalized PFM, in which each column adds up to a total of one, is a matrix of probabilities for observing each nucleotide at each position.

Position weight matrix (PWM) (also known as position-specific scoring matrix)

PFM should be converted to log-scale for efficient computational analysis. To eliminate null values before log-conversion, and to correct for small samples of binding sites, a sampling correction, known as pseudocounts, is added to each cell of the PFM.
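
A minimal sketch of assembling and normalizing a PFM, using only the first seven ERE fragments listed on this slide. Because this is a subset of the 46 sites, the counts will not equal the slide's full PFM; the function names `build_pfm` and `normalize_pfm` are hypothetical.

```python
BASES = "ACGT"

# First seven aligned ERE fragments from the slide (a subset of the 46 sites).
SITES = [
    "acggcagggTGACCc",
    "aGGGCAtcgTGACCc",
    "cGGTCGccaGGACCt",
    "tGGTCAggcTGGTCt",
    "aGGTGGcccTGACCc",
    "cTGTCCctcTGACCc",
    "aGGCTAcgaTGACGt",
]


def build_pfm(sites):
    """Count each nucleotide at each position of equal-length aligned sites."""
    sites = [s.upper() for s in sites]
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pfm = {b: [0] * length for b in BASES}
    for s in sites:
        for i, b in enumerate(s):
            pfm[b][i] += 1
    return pfm


def normalize_pfm(pfm):
    """Divide every count by the column sum N so that each column adds up to one."""
    n = sum(pfm[b][0] for b in BASES)
    return {b: [c / n for c in pfm[b]] for b in BASES}


pfm = build_pfm(SITES)
probs = normalize_pfm(pfm)
for b in BASES:
    print(b, pfm[b])
```

Upper- and lower-case letters are treated alike here; the slide's capitalization appears to mark the conserved half-sites, which this sketch does not use.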

Page 10:

Position Weight Matrix for ERE

Converting a PFM into a PWM

For each matrix element do:

$$w(b,i) = \log_2 \frac{p(b,i)}{p(b)} = \log_2 \frac{f_{b,i} + \sqrt{N}/4}{\left(N + \sqrt{N}\right)\, p(b)}, \qquad p(b,i) = \frac{f_{b,i} + \sqrt{N}/4}{N + \sqrt{N}}$$

$f_{b,i}$ – raw count (PFM matrix element) of nucleotide b in column i

N – number of sequences used to create PFM (= column sum)

$\sqrt{N}/4$ and $\sqrt{N}$ – pseudocounts (correction for small sample size)

p(b) – background frequency of nucleotide b

$f_{b,i}$:

Pos   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
A    18   8   5   4   1  29   7   7   7   0   1  39   1   1   6
C     8   3   3   9  33   4  21  15  14   0   0   1  43  39  18
G    13  31  34   9   8  10  11  15  19   4  44   3   0   1   6
T     7   4   4  24   4   3   7   9   6  42   1   3   2   5  16

$w(b,i)$:

Pos      1      2      3      4      5      6      7      8      9     10     11     12     13     14     15
A     0.58  -0.44  -0.98  -1.21  -2.29   1.22  -0.60  -0.60  -0.60  -2.96  -2.29   1.62  -2.29  -2.29  -0.72
C    -0.44  -1.49  -1.49  -0.30   1.39  -1.21   0.78   0.34   0.25  -2.96  -2.96  -2.29   1.76   1.62   0.46
G     0.16   1.31   1.44  -0.30  -0.44  -0.17  -0.06   0.34   0.65  -1.21   1.79  -1.49  -2.96  -2.29  -0.64
T    -0.60  -1.21  -1.21   0.96  -1.21  -1.49  -0.60  -0.30  -0.78   1.73  -2.29  -1.49  -1.84  -0.98   0.23
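
The PFM-to-PWM conversion on this slide can be sketched as follows. The sketch assumes a pseudocount of sqrt(N)/4 per cell, sqrt(N) per column, and a uniform background p(b) = 0.25; with those assumptions the computed weights for columns 1 to 14 agree with the slide's printed table to two decimals, while the printed column 15 differs slightly. The function name `pfm_to_pwm` is hypothetical.

```python
import math

BASES = "ACGT"

# ERE position frequency matrix copied from the slide (N = 46 sites, 15 positions).
PFM = {
    "A": [18, 8, 5, 4, 1, 29, 7, 7, 7, 0, 1, 39, 1, 1, 6],
    "C": [8, 3, 3, 9, 33, 4, 21, 15, 14, 0, 0, 1, 43, 39, 18],
    "G": [13, 31, 34, 9, 8, 10, 11, 15, 19, 4, 44, 3, 0, 1, 6],
    "T": [7, 4, 4, 24, 4, 3, 7, 9, 6, 42, 1, 3, 2, 5, 16],
}


def pfm_to_pwm(pfm, background=None):
    """Convert raw counts to log2 odds: w(b,i) = log2(p(b,i) / p(b))."""
    if background is None:
        background = {b: 0.25 for b in BASES}  # assumed uniform background
    n = sum(pfm[b][0] for b in BASES)          # column sum = number of sites
    sqrt_n = math.sqrt(n)
    pwm = {}
    for b in BASES:
        pwm[b] = [
            # Corrected probability p(b,i), then log-odds against the background.
            math.log2(((f + sqrt_n / 4) / (n + sqrt_n)) / background[b])
            for f in pfm[b]
        ]
    return pwm


pwm = pfm_to_pwm(PFM)
for b in BASES:
    print(b, [round(w, 2) for w in pwm[b]])
```

Because of the pseudocount, a cell with a raw count of 0 still gets a finite weight (about -2.96 here) instead of minus infinity.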

Page 11:

Scoring putative EREs by scanning the promoter with PWM

Pos      1      2      3      4      5      6      7      8      9     10     11     12     13     14     15
A     0.58  -0.44  -0.98  -1.21  -2.29   1.22  -0.60  -0.60  -0.60  -2.96  -2.29   1.62  -2.29  -2.29  -0.72
C    -0.44  -1.49  -1.49  -0.30   1.39  -1.21   0.78   0.34   0.25  -2.96  -2.96  -2.29   1.76   1.62   0.46
G     0.16   1.31   1.44  -0.30  -0.44  -0.17  -0.06   0.34   0.65  -1.21   1.79  -1.49  -2.96  -2.29  -0.64
T    -0.60  -1.21  -1.21   0.96  -1.21  -1.49  -0.60  -0.30  -0.78   1.73  -2.29  -1.49  -1.84  -0.98   0.23

Site: G G G T C A G C A T G G C C A

Absolute score of the site:

$$S = \sum_{i=1}^{m} w(b,i) = 11.57$$

Pos      1      2      3      4      5      6      7      8      9     10     11     12     13     14   Row Sum
Max   0.58   1.31   1.44   0.96   1.39   1.22   0.78   0.34   0.65   1.73   1.79   1.62   1.76   1.62     17.20
Min  -0.60  -1.49  -1.49  -1.21  -2.29  -1.49  -0.60  -0.60  -0.78  -2.96  -2.96  -2.29  -2.96  -2.29    -24.02

$$\text{relative score} = \frac{\text{Absolute score} - \text{Minimum score}}{\text{Maximum score} - \text{Minimum score}} = \frac{11.57 - (-24.02)}{17.20 - (-24.02)} = 0.86$$
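
The scoring on this slide can be sketched as below, using the PWM values as printed. The slide's totals (17.20, -24.02 and 11.57) are reproduced when only columns 1 to 14 are summed, so the sketch scores 14 columns; that restriction is inferred from the numbers and is not stated on the slide. The function names, the 0.8 threshold, and the flanking promoter sequence are hypothetical.

```python
BASES = "ACGT"

# ERE position weight matrix copied from the slide (15 columns).
PWM = {
    "A": [0.58, -0.44, -0.98, -1.21, -2.29, 1.22, -0.60, -0.60, -0.60,
          -2.96, -2.29, 1.62, -2.29, -2.29, -0.72],
    "C": [-0.44, -1.49, -1.49, -0.30, 1.39, -1.21, 0.78, 0.34, 0.25,
          -2.96, -2.96, -2.29, 1.76, 1.62, 0.46],
    "G": [0.16, 1.31, 1.44, -0.30, -0.44, -0.17, -0.06, 0.34, 0.65,
          -1.21, 1.79, -1.49, -2.96, -2.29, -0.64],
    "T": [-0.60, -1.21, -1.21, 0.96, -1.21, -1.49, -0.60, -0.30, -0.78,
          1.73, -2.29, -1.49, -1.84, -0.98, 0.23],
}
WIDTH = 14  # assumption: number of columns summed, inferred from the slide's totals


def absolute_score(site, pwm=PWM, width=WIDTH):
    """Sum the weight of the observed base at each scored position."""
    return sum(pwm[b][i] for i, b in enumerate(site.upper()[:width]))


def score_bounds(pwm=PWM, width=WIDTH):
    """Lowest and highest absolute scores any site could reach."""
    lo = sum(min(pwm[b][i] for b in BASES) for i in range(width))
    hi = sum(max(pwm[b][i] for b in BASES) for i in range(width))
    return lo, hi


def relative_score(site, pwm=PWM, width=WIDTH):
    """Rescale the absolute score to the range 0..1."""
    lo, hi = score_bounds(pwm, width)
    return (absolute_score(site, pwm, width) - lo) / (hi - lo)


def scan(promoter, threshold=0.8, pwm=PWM, width=WIDTH):
    """Slide a motif-length window along one strand and keep windows above threshold."""
    motif_len = len(pwm["A"])
    hits = []
    for start in range(len(promoter) - motif_len + 1):
        window = promoter[start:start + motif_len]
        if set(window.upper()) <= set(BASES):  # skip windows with N or other symbols
            rel = relative_score(window, pwm, width)
            if rel >= threshold:
                hits.append((start, window, round(rel, 2)))
    return hits


site = "GGGTCAGCATGGCCA"
abs_s = absolute_score(site)
rel_s = relative_score(site)
print(round(abs_s, 2), round(rel_s, 2))

promoter = "ttacg" + site + "cattg"  # hypothetical flanking sequence
hits = scan(promoter)
print(hits)
```

For the slide's site this gives an absolute score of 11.57 and a relative score of 0.86, matching the slide. A practical scanner would also score the reverse-complement strand, which this sketch omits.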

Page 12:

Yeast ESR: Biological Validation

STRE element

Universal stress repressor motif

Page 13:

Previous work: “Structure learning”

• Graphical models (and other methods)

– Learn structure of “regulatory network”, “regulatory modules”, etc.

– Fit interpretable model to training data

– Model small number of genes or clusters of genes

– Many computational and statistical challenges; often used for qualitative hypotheses rather than prediction

(Segal et al. 2003, 2004)

(Pe’er et al. 2001)

Page 14:

Signaling networks in a cell

Page 15:

Network inference

• Regulator-motif associations in nodes can have different meanings: direct binding, indirect effect, co-occurrence

• Need other data to confirm binding relationship between regulator and target (e.g. ChIP-chip)

• Still, can determine statistically significant regulator-target relationships from regulation program

Page 16:

Example: oxygen sensing and regulatory network

Page 17:

Binding data for regulatory networks

• ChIP-chip: genome-wide protein-DNA binding data, i.e. what promoters are bound by TF?

• Investigate regulatory network model: use ChIP-chip data in place of motifs (no motif discovery)

– Features: (regulator, TF-occupancy) pairs

Page 18:

Inferring regulatory networks from the combination of expression data and binding data

Page 19:

An extended ER regulatory network in MCF7 cells

Network nodes: CCNL1, BRF1, ER, FOS, MYC, CEBP, XBP1, RXRA, HSF2, PNN, NRIP1, TXNDC, IVNS1ABP, BATF, HES1, CHAF1B, CSDE1, CUTL1, PURB, ADAR, C140RF43, SP3, DDX20, ELF3, TXNIP, PAWR, BRIP1, FOXP4, ZNF394, BAZ1B, STRAP, ASCC3, MKL2, GTF2I, RUVBL1, RFC1, ZNF500, TTF2, RAB18, ZKSCAN1, MSX2, LASS2, HDAC1, ZBTB41, TBX2, THRAP1, VPS72, TLE3, BHLHB2, ZNF38, ZNF239, DNMT1, HIF1A, HEY2

Page 20:

Signaling molecules -- Networks

• Find all SMs that associate as regulators with a particular TF’s ChIP occupancy in ADT features

• e.g. TF: Hsf1; SMs: Gac1, Gip1, Sds22 (Glc7 phosphatase complex)

• Hypothesis: Glc7 phosphatase complex interacts with Hsf1 in regulation of Hsf1 targets (Interaction supported in literature)

Page 21:

Input Data

• FASTA file

• Contact Info

• Control data (optional)

Ab initio Motif Discovery Programs

• Weeder

• MaMF

• MEME

Statistical Methods

• Bootstrap re-sampling

• Fisher test

STAMP Matching

• Known or novel motifs

Results

• SeqLogo

• PWM

• P-value

http://motif.bmi.ohio-state.edu/ChIPMotifs/

Page 22:

http://motif.bmi-ohio-state.edu/HRTBLDb

Page 23:

Software Demo

• W-ChIPMotifs

• HRTargetDB