Team Members: Joshua Wu 11174269 Shuyu (Christine) Xu 11161640.

ADVANCED COMPUTATIONAL BIOLOGY

PROJECT PRESENTATION

Team Members:

Joshua Wu 11174269

Shuyu (Christine) Xu 11161640

OVERVIEW

1. Project Description
2. Introduction
3. Motivation
4. Bioinformatics Application
5. Explicit vs Implicit
6. Problem Analysis
7. Implement Files
8. Experimental Results
9. Conclusion
10. Possible Future Work

Project Description

Explicit suffix trees: suppose that we want to store explicitly all strings that are edge labels of a suffix tree.

The main question of this project is how much space explicit suffix trees require compared to implicit suffix trees.

We implement a suffix tree algorithm and run it on substrings of real data.

Introduction

Any string of length m can be decomposed into m suffixes, and these suffixes can be stored in a suffix tree.

Setup time: O(m), where m is the length of the string

Search time: O(n), where n is the length of the pattern
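As a small illustration of the decomposition, the sketch below lists the m suffixes of a string and checks whether a pattern occurs by testing it against each suffix. The function names `suffixes` and `occurs` are hypothetical, and this naive check is not the O(n) tree search; it only shows the property that a suffix tree exploits: a pattern occurs in the string exactly when it is a prefix of some suffix.

```python
def suffixes(text):
    """Return all suffixes of text, longest first (one per start position)."""
    return [text[i:] for i in range(len(text))]


def occurs(text, pattern):
    """Naive check: pattern is a substring iff it is a prefix of some suffix."""
    return any(s.startswith(pattern) for s in suffixes(text))


print(suffixes("ABC$"))
print(occurs("ABCABC$", "CAB"))
```

A suffix tree stores these same suffixes along shared paths, so the search only has to follow the pattern's characters down from the root.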

Motivation

"Suffix trees are widely used in the computer field... Recent improvements in the method have cut the memory requirement to 17 bytes per letter, which brings the method to the verge of practicality [for bioinformatics applications]" -- Nat Goodman (Genome Technology).

Bioinformatics Application

1. multiple genome alignment (Michael Hohl et al., 2002)

2. selection of signature oligonucleotides for DNA arrays (Kaderali and Schliep, 2002)

3. identification of sequence repeats (Kurtz and Schleiermacher, 1999)

Explicit vs Implicit

String: A B C $ (positions 1 2 3 4)

Explicit edge labels: ABC$, BC$, C$, $

Implicit edge labels: 1,4  2,4  3,4  4,4
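The two labelings of the ABC$ example can be reproduced with a short sketch. It assumes a string whose characters are all different, so that every suffix hangs off the root as a single leaf edge; the function name `leaf_edge_labels` is hypothetical. Explicit labels store the characters themselves, while implicit labels store only a 1-based (start, end) pair of positions in the string.

```python
def leaf_edge_labels(text):
    """Return (explicit, implicit) edge labels for an all-distinct string.

    Assumption: no character repeats, so the suffix tree is a root with one
    leaf edge per suffix. Implicit labels are 1-based, inclusive (start, end).
    """
    assert len(set(text)) == len(text), "characters must all be different"
    n = len(text)
    explicit = [text[i:] for i in range(n)]
    implicit = [(i + 1, n) for i in range(n)]
    return explicit, implicit


explicit, implicit = leaf_edge_labels("ABC$")
print(explicit)
print(implicit)
```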

Problem Analysis

Best case for explicit and implicit suffix trees: all characters different

The best case is not likely with DNA inputs: only 4 distinct characters in total

Worst case: the same character throughout
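To make the "variety of characters" point concrete, the sketch below counts the distinct symbols in two inputs taken from this presentation: a prefix of the Homo sapiens DNA sample and the Chicken sample. The helper name `alphabet_size` is hypothetical.

```python
def alphabet_size(sequence):
    """Number of distinct symbols in the sequence, ignoring letter case."""
    return len(set(sequence.upper()))


# Prefix of the Homo sapiens DNA sample and the full Chicken sample.
dna_prefix = "atgaaggggagccgtgccctcctgctggtggccctcaccctg"
chicken = "RVKRVWPLVIRTVIAGYNLYRAIKKK"

print("DNA:", alphabet_size(dna_prefix))
print("Chicken:", alphabet_size(chicken))
```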

Assumptions

In implicit trees, each number counts as one unit of memory, regardless of its size (for example, the number 10 counts as 1 unit).

The sequence contains only alphabetic characters.

Example: all different characters

String: A B C D $ (positions 1 2 3 4 5)

Implicit edge labels: 1,5  2,5  3,5  4,5  5,5

N: string length, N = 5

Memory = 10 (best case)

Example

String: A B C A B C $ (positions 1 2 3 4 5 6 7)

Implicit edge labels from the root: 1,3  2,3  6,6  7,7

Below each of 1,3, 2,3 and 6,6: 4,7 and 7,7

N: string length, N = 7

Memory = 20

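Example: all same character

String: A A A A $ (positions 1 2 3 4 5)

Implicit edge labels, level by level from the root: 1,1 and 5,5; then 2,2 and 5,5; then 3,3 and 5,5; then 4,5 and 5,5

N: string length

N = 5, 6, 7 gives Memory = 16, 20, 24

Memory = 4n-4 (worst case)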
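The memory figures in these examples can be checked with a minimal sketch. It uses a naive quadratic-time construction rather than a linear-time algorithm, and it is not the project's own program. All names (`Node`, `build_suffix_tree`, `edges`, `implicit_size`, `explicit_size`) are hypothetical. The implicit size follows the counting convention from the Assumptions slide (two numbers per edge, one unit each); the explicit size assumes one unit per stored character, which is our own assumption.

```python
class Node:
    """Suffix tree node; children maps first character -> (start, end, child)."""
    def __init__(self):
        self.children = {}


def build_suffix_tree(text):
    """Naive O(m^2) construction, inserting one suffix at a time.

    Edge labels are kept as 0-based, end-exclusive (start, end) index pairs.
    """
    assert text.count("$") == 1 and text.endswith("$"), "needs a unique '$' end"
    root, n = Node(), len(text)
    for i in range(n):                      # insert suffix text[i:]
        node, pos = root, i
        while pos < n:
            ch = text[pos]
            if ch not in node.children:     # no matching edge: add a leaf
                node.children[ch] = (pos, n, Node())
                break
            start, end, child = node.children[ch]
            k = 0                           # characters matched along the edge
            while (start + k < end and pos + k < n
                   and text[start + k] == text[pos + k]):
                k += 1
            if start + k == end:            # whole edge matched: descend
                node, pos = child, pos + k
                continue
            mid = Node()                    # mismatch inside the edge: split it
            mid.children[text[start + k]] = (start + k, end, child)
            node.children[ch] = (start, start + k, mid)
            node, pos = mid, pos + k
    return root


def edges(node):
    """Yield the (start, end) label of every edge in the tree."""
    for start, end, child in node.children.values():
        yield (start, end)
        yield from edges(child)


def implicit_size(text):
    """Two indices per edge, each counted as one unit."""
    return 2 * sum(1 for _ in edges(build_suffix_tree(text)))


def explicit_size(text):
    """Assumption: one unit per character of every stored edge label."""
    return sum(end - start for start, end in edges(build_suffix_tree(text)))


for s in ["ABCD$", "ABCABC$", "AAAA$", "AAAAA$", "AAAAAA$"]:
    print(s, "implicit:", implicit_size(s), "explicit:", explicit_size(s))
```

Under these assumptions the implicit sizes should match the Memory values on the slides: 10 for ABCD$, 20 for ABCABC$, and 4n-4 for the all-same-character strings.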

Program Input Data

DNA and protein sequences from a variety of organisms:

Homo sapiens, monkeys, chickens, …

Sample input: Homo sapiens

cagctcctgagactgctggcatgaaggggagccgtgccctcctgctggtggccctcaccctgttctgcatctgccggatggccacaggggaggacaacgatgagtttttcatggacttcctgcaaacactactggtggggaccccagaggagctctatgaggggaccttgggcaagtacaatgtcaacgaagatgccaaggcagcaatgactgaactcaagtcctgcagagatggcctgcagccaatgcacaaggcggagctggtcaagctgctggtgcaagtgctgggcagtcaggacggtgcctaagtggacctcagacatggctcagccataggacctgccacacaagcagccgtggacacaacgcccactaccacctcccacatggaaatgtatcctcaaaccgtttaatcaataa

Sample result

Sample input 2: plants

EARPIVVGPPPPLSGGLPGTENSDQARDGTLPYTKDRFYLQPLPPTEAAQRAKVSASEILNVKQFIDRKAWPSLQNDLRLRASYLRYDLKTVISAKPKDEKKSLQELTSKLFSSIDNLDHAAKIKSPTEAEKYYGQTVSNINEVLAKLG

Sample output:

Homo sapiens

Sample Input: Homo Sapiens

atgaaggggagccgtgccctcctgctggtggccctcaccctgttctgcatctgccggatggccacaggggaggacaacgatgagtttttcatggacttcctgcaaacactactggtggggaccccagaggagctctatgaggggaccttgggcaagtacaatgtcaacgaagatgccaaggcagcaatgactgaactcaagtcctgcagagatggcctgcagccaatgcacaaggcggagctggtcaagctgctggtgcaagtgctgggcagtcaggacggtgcctaa

Comparisons: Homo Sapiens

Monkey Virus

Sample Input: Monkey Virus

GGSCFKCGKKGHFAKNCHEHAHNNAEPKVPGLCPRCKRGKHWANECKSKTDNQGNPIPPH

Monkey Virus

Plants

Sample Input: Plants

EARPIVVGPPPPLSGGLPGTENSDQARDGTLPYTKDRFYLQPLPPTEAAQRAKVSASEILNVKQFIDRKAWPSLQNDLRLRASYLRYDLKTVISAKPKDEKKSLQELTSKLFSSIDNLDHAAKIKSPTEAEKYYGQTVSNINEVLAKLG

Plants

Tobacco

Sample input: tobacco

SYSITTPSQFVFLSSAWADPIELINLCTNALGNQFQTQQARTVVQRQFSEVWKPSPQVTVRFPDSDFKVYRYNAVLDPLVTALLGAFDTRNRIIEVENQANPTTAETLDATRRVDDATVAIRSAINNLIVELIRGTGSYNRSSFESSSGLVWTSGPAT

Tobacco

Insects

Sample Input: Insects

DCLSGRYKGPCAVWDNETCRRVCKEEGRSSGHCSPSLKCWCEGC

Insects

Birds

Sample Input: Birds

IDTCRLPSDRGRCKASFERWYFNGRTCAKFIYGGCGGNGNKFPTQEACMKRCAKA

Birds

SARS

Sample Input: SARS

ALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEV

SARS

Fish

Sample Input: Fish

GHHHHHHLEDPSGGTPYIGSKISLISKAEIRYEGILYTIDTENSTVALAKVRSFGTEDRPTDRPIAPRDETFEYIIFRGSDIKDLTVCEPPKPIM

Fish

Chicken

Sample Input: Chicken

RVKRVWPLVIRTVIAGYNLYRAIKKK

Chicken

files

Code

Results

Conclusion

Explicit suffix trees require more space than implicit suffix trees on real data.

Data comparison results: the worst case is DNA input (least variety of characters).

Implicit trees should be used to reduce storage use.

[Figure: variety of string vs tree size. Axis label: # of alphabets.]

Conclusion

Application: it is easier to compare structures for implicit than for explicit suffix trees (number comparisons).

Saves space

Easy to implement

Further improvement?

Possible Future Work

The program runs too slowly; its speed should be improved.

The interface of our program should be improved (Matlab).

A greater variety of inputs should be tested.

THANK YOU!