Using Mixed Length Training Sequences in Transcription Factor Binding Site Detection Tools Nathan Snyder Carnegie Mellon University BioGrid REU 2009 University of Connecticut

Using Mixed Length Training Sequences in Transcription Factor Binding Site Detection Tools

Nathan Snyder

Carnegie Mellon University

BioGrid REU 2009

University of Connecticut

Transcription Factors

Transcription factors regulate DNA transcription

Transcription Factor Binding Site Detection Algorithms

Training sequences: AGATCGTT, ACATGATT, TGATGGAT

Genetic region to search: ATCGTCGATGCTGAGATGTCTATCGTAGCTAGTC

Highest scoring sequence in that region: AGATGTCT
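The search on this slide can be sketched as a sliding-window scan. This is a minimal illustration, not the author's code: it reads the 24 training letters as three 8-mers (an assumption) and uses a simple position-specific count sum as the score. The function names `position_counts` and `best_window` are hypothetical.

```python
from collections import Counter

def position_counts(training):
    """Count how often each nucleotide occurs at each position."""
    length = len(training[0])
    if any(len(s) != length for s in training):
        raise ValueError("this sketch assumes equal-length training sequences")
    return [Counter(s[i] for s in training) for i in range(length)]

def best_window(training, region):
    """Return (score, start, window) for the highest-scoring window."""
    counts = position_counts(training)
    width = len(counts)
    best = None
    for start in range(len(region) - width + 1):
        window = region[start:start + width]
        # Counter returns 0 for nucleotides never seen at a position.
        score = sum(counts[i][base] for i, base in enumerate(window))
        if best is None or score > best[0]:
            best = (score, start, window)
    return best

# Assumed split of the slide's training letters into three 8-mers.
training = ["AGATCGTT", "ACATGATT", "TGATGGAT"]
region = "ATCGTCGATGCTGAGATGTCTATCGTAGCTAGTC"
score, start, window = best_window(training, region)
print(score, start, window)  # 15 13 AGATGTCT
```

Under these assumptions the scan recovers the same window the slide reports as the highest scoring one.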

Assessment by Osada et al.

Compared various transcription factor binding site detection algorithms

Consensus: builds a consensus sequence based on the training data

PSSM: makes a scoring matrix based on the logs of nucleotide frequencies

Berg and von Hippel: like PSSM, but with nucleotide counts instead of frequencies

Centroid: sum of position-specific frequencies
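The one-line descriptions above can be made concrete with a small sketch. It assumes add-one pseudocounts and a uniform background frequency of 0.25, neither of which comes from the slides, and the frequency-sum score follows only the slide's summary of Centroid, not necessarily the full method evaluated by Osada et al. All function names are hypothetical.

```python
import math

BASES = "ACGT"

def frequency_matrix(training, pseudocount=1.0):
    """Per-position nucleotide frequencies (pseudocount is an assumption)."""
    length = len(training[0])
    matrix = []
    for i in range(length):
        column = [s[i] for s in training]
        total = len(column) + pseudocount * len(BASES)
        matrix.append({b: (column.count(b) + pseudocount) / total for b in BASES})
    return matrix

def pssm_score(matrix, window, background=0.25):
    """Sum of log-odds scores; uniform background is an assumption."""
    return sum(math.log(matrix[i][b] / background) for i, b in enumerate(window))

def frequency_sum_score(matrix, window):
    """Sum of position-specific frequencies, per the Centroid one-liner."""
    return sum(matrix[i][b] for i, b in enumerate(window))

# Illustrative equal-length training set.
training = ["AGATCGTT", "ACATGATT", "TGATGGAT"]
matrix = frequency_matrix(training)
for candidate in ("AGATGTCT", "CCCCCCCC"):
    print(candidate, pssm_score(matrix, candidate), frequency_sum_score(matrix, candidate))
```

Both scores rank a window resembling the training set above an unrelated one; they differ in how strongly a single mismatched position is penalized.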

The Same Length Training Sequence Assumption

Example set of known binding sites from TRANSFAC:

ACATTTAACTGGTTAATTGAATAACCCAATTTAATCCGTTACCGGGTTGCTCGAAGGGATTAGACTGGGTTATTTAACCCGTTTTTAGCGGCATAAAAGGGTTAAACAGGAATGCGCGCCCATAAAAGGGTTAAG

Project Goal

Modify the tools evaluated by Osada et al. to handle training sets with varying sequence lengths while still performing well

Overall Strategy

Step 1: Alignment

AGCTTTCAACCTTTGGACGTAACTTTCA

AGCTTTCA ACCTTTGGACGTAACTTTCA

Step 2: Scoring

ACTGAGTCGATAATTTTGAACTG

AATTTTGA
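The slides do not say how the alignment step works, so the following is only one hypothetical way to turn a mixed-length training set into equal-length sequences that the scoring step can use. It is a greedy, gapless trim: the shortest sequence seeds a profile, and each longer sequence contributes the window that best matches the profile built so far. The input sequences are made-up toy data, not the ones on the slide.

```python
from collections import Counter

def window_score(profile, window):
    """Sum of per-position counts for the nucleotides in the window."""
    return sum(col[base] for col, base in zip(profile, window))

def greedy_gapless_align(seqs):
    """Trim every sequence to the width of the shortest one (hypothetical scheme)."""
    order = sorted(seqs, key=len)          # shortest sequence seeds the profile
    width = len(order[0])
    aligned = [order[0]]
    for seq in order[1:]:
        # Rebuild the profile from everything aligned so far.
        profile = [Counter(s[i] for s in aligned) for i in range(width)]
        windows = [seq[k:k + width] for k in range(len(seq) - width + 1)]
        # Keep the window that agrees best with the current profile.
        aligned.append(max(windows, key=lambda w: window_score(profile, w)))
    return aligned

# Toy mixed-length training set (illustrative only).
toy_sites = ["AGCTTTCA", "CCAGCTTTGAT", "TAGCATTCAGG"]
print(greedy_gapless_align(toy_sites))
# ['AGCTTTCA', 'AGCTTTGA', 'AGCATTCA']
```

The trimmed output can then be passed to any equal-length scorer. A limitation of this sketch is that it cannot place a sequence so that it overhangs the seed, which a real alignment step would likely need to handle.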

MLCentroid

Applies this strategy to the Centroid algorithm

Centroid was chosen for its strong performance, more efficient execution, and ease of implementation

The same techniques could be readily applied to any of the other algorithms

Running Time Issues

First version: O(c * L^numseqs)

Second version: O(c * L * numseqs^2)

Quadratic in the number of sequences is MUCH better than exponential!
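To get a feel for the gap between the two bounds, the step counts can be tabulated directly. The slides do not define the symbols, so this sketch reads L as a sequence length and c as a constant factor, and picks L = 20 and c = 1 purely for illustration.

```python
def exhaustive_steps(L, numseqs, c=1):
    """Step count for the first bound, c * L^numseqs."""
    return c * L ** numseqs

def pairwise_steps(L, numseqs, c=1):
    """Step count for the second bound, c * L * numseqs^2."""
    return c * L * numseqs ** 2

# L = 20 is an arbitrary illustrative value.
for n in (2, 5, 10, 14):
    print(n, exhaustive_steps(20, n), pairwise_steps(20, n))
```

With ten training sequences the first bound is already 20^10, roughly 10^13 steps, while the second is 2000.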

Method of Testing

Leave-one-out testing similar to that used by Osada et al.

Counts the number of sequences that score higher than the desired one

The data sets for Drosophila melanogaster from the paper by Tompa et al. were used
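The testing procedure can be sketched as follows. This is an assumption-laden illustration rather than the author's harness: it uses equal-length sites, a simple position-specific count sum as the score, and made-up toy data in which `regions[i]` is a hypothetical genomic region containing `sites[i]`.

```python
from collections import Counter

def score(profile, window):
    """Sum of per-position counts for the nucleotides in the window."""
    return sum(col[b] for col, b in zip(profile, window))

def leave_one_out_ranks(sites, regions):
    """For each site, count windows in its region that outscore it."""
    ranks = []
    for i, (site, region) in enumerate(zip(sites, regions)):
        training = sites[:i] + sites[i + 1:]   # hold out site i
        width = len(site)
        profile = [Counter(s[k] for s in training) for k in range(width)]
        target = score(profile, site)
        windows = (region[j:j + width] for j in range(len(region) - width + 1))
        # 0 means the held-out site was the top-scoring window.
        ranks.append(sum(1 for w in windows if score(profile, w) > target))
    return ranks

# Toy data (illustrative only).
sites = ["AAAA", "AAAA", "CCCC"]
regions = ["GAAAAG", "AAAA", "TCCCCAAAA"]
print(leave_one_out_ranks(sites, regions))  # [0, 0, 4]
```

A count of 0 means the held-out site would have been the top prediction; larger counts mean more false candidates ranked ahead of it.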

Experimental Results

                  Sequence left out
Training set    1   2   3   4   5   6   7   8   9  10  11  12  13  14
1               0   0   1   0   0   0   1
2              12   9   0  15  10
3               8   0   0   7  24   4   1  34   0
4              20   4   5   1   3   0  32   9  32
5              36  33   2   2   2   2   1  24   0   2   1   2   3   2
6               0   0   0  37   0   0   0

Future Work

Better alignment scoring schemes

Modify and test PSSM, Berg and von Hippel, and Consensus

Incorporate these techniques into de novo motif discovery algorithms

Incorporate sequence structure into the alignment

References

Timothy L. Bailey, Nadya Williams, Chris Misleh and Wilfred W. Li: MEME: discovering and analyzing DNA and protein sequence motifs, Nucleic Acids Research, Vol. 34, Web Server issue, 2006, pages W369–W373

O. Berg and P. von Hippel: Selection of DNA binding sites by Regulatory Proteins. Statistical-Mechanical Theory and Application to Operators and Promoters, Journal of Molecular Biology, Vol. 193, 1987, pages 723–750

W.H. Day and F. McMorris: Critical comparison of consensus methods for molecular sequences, Nucleic Acids Research, Vol. 20, 1992, pages 1093–1099

Charles E. Lawrence and Andrew A. Reilly: An Expectation Maximization (EM) Algorithm for the Identification and Characterization of Common Sites in Unaligned Biopolymer Sequences, PROTEINS: Structure, Function, and Genetics, Vol. 7, 1990, pages 41–51

Robert Osada, Elena Zaslavsky and Mona Singh: Comparative analysis of methods for representing and searching for transcription factor binding sites, Bioinformatics, Vol. 20, no. 18, 2004, pages 3516–3525

Giulio Pavesi, Giancarlo Mauri and Graziano Pesole: An algorithm for finding signals of unknown length in DNA sequences, Bioinformatics, Vol. 17, Suppl. 1, 2001, pages S207–S214

Giulio Pavesi, Paolo Mereghetti, Giancarlo Mauri and Graziano Pesole: Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes, Nucleic Acids Research, Vol. 32, Web Server issue, 2004

Tompa et al.: Assessing computational tools for the discovery of transcription factor binding sites, Nature Biotechnology, Vol. 23, no. 1, January 2005, pages 137–144

http://www.embl-grenoble.fr/groups/dna/t.gif

blogs.venturacountystar.com

www.clas.ufl.edu

Questions?