Contents
-
Upload
clinton-sullivan -
Category
Documents
-
view
36 -
download
2
description
Transcript of Contents
Utilizing Fuzzy Logic for Gene Sequence Construction from Sub Sequences and Characteristic Genome Derivation and Assembly
Contents
Team Bioinformatics
Genome Sequencing Research Problem Fuzzy Logic Ongoing Work Future Work
Team
Advisors: PI: Dr. Gregory Vert (Dept. of Computer Science, University of
Nevada Reno)
Co-PI: Dr. Alison Murray (Desert Research Institute, Reno)
Co-PI: Dr. Monica Nicolescu (Dept. of Computer Science, University of Nevada Reno)
Student : Sara Nasser (Dept. of Computer Science, University of Nevada
Reno)
Bioinformatics- Genome Sequencing Genome sequencing is figuring out the order of DNA
nucleotides, or bases, in a genome—the order of As, Cs, Gs, and Ts that make up an organism's DNA.
Sequencing the genome is an important step towards understanding it.
The whole genome can't be sequenced all at once because available methods of DNA sequencing can only handle short stretches of DNA at a time.
Genome Sequencing
Much of the work involved in sequencing lies in putting together this giant biological jigsaw puzzle.
Various problems occur such as: Errors in reading Flips
Shot-Gun Sequencing
The "whole-genome shotgun" method, involves breaking the genome up into small pieces, sequencing the pieces, and reassembling the pieces into the full genome sequence.
Environmental Genomics
Multiple sequence alignment is an important first step in many bioinformatics applications such as structure prediction, phylogenetic analysis and detection of key functional residues.
The accuracy of these methods relies heavily on the quality of the underlying alignment.
[1]
Multiple Sequence Alignment
The traditional multiple sequence alignment problem is NP-hard, which means that it is impossible to solve for more than a few sequences [1].
In order to align a large number of sequences, many different approaches have been developed.
Tools and Techniques
MUMMER Phrap, Phred, Consed TIGR
The Smith-Waterman Algorithm Tree-Based Algorithms
Meta-genomics
Meta-genomics is the application of modern genomics techniques to the study of communities of microbial organisms directly in their natural environments, bypassing the need for isolation and lab cultivation of individual species.
[2]
Meta-genomics
Bacteria can often have minor variations in their DNA that can result in different metabolic characteristics.
The differences can make it difficult to classify bacteria taxonomically.
What has been needed is a method of creating a characteristic representation (characteristic genome) from the sub sequences of DNA found in several sub variant of a bacteria of the same species.
Such genome could be used for more efficient classification at a molecular level through the process of controlled generalization.
Research Goals
Given a collection of nucleotide sequences from multiple organisms, develop techniques based on fuzzy set theory and other methods for assembly of the sequences into the original full genome for each organism.
Using the above techniques to develop a generalized approach for creating a characteristic genome that represents a generalization of the original organisms that donated sequence data.
The Data
SYM (Original Raw Data): Contains 302K Sequences Average length of 450 base pairs (bp)
It was obtained from a community of bacteria
There is an estimated of 100 organisms
Lets say, for example 75% of data is repeated, we still need to reassemble a sequence of ~ 33 Million bp
Motivation
Current tools could not solve the problem: Complexity of the dataset, since they are from same
species. Sequencing environmental genomes, not a single
organism. Limited tools that sequence environmental genomes.
Algorithm: Underlying algorithm determines the accuracy of match. Performance can be highly improved. Interfaces could be improved.
Problem
Genome assembly is a O(2k) problem.
Using Dynamic Programming it can be reduced.
Example in seconds: Assembly that takes around 1125899906842624
seconds to can be reduced to 2500+ seconds!
A Start
We divide the problem in two steps Acquiring subsets such that each subset
represents an organism Assembling this into a characteristic genome
sequence The above two steps to be obtained by
Clustering Assembly
Steps
Clustering AssemblyRawData
AssembledSequences
Characteristic Genome
The Data
CAVEEG (Cleaned Dataset): Contains 128K Sequences Assembled + Singletons Length ranges from 200bp-1000bp
D2 Cluster
It is a software for clustering genome sequences
The technique is based on distance.
Clustering with D2 Cluster
Clustering was performed on 128K CAVEEG Dataset One dataset with 100K Sequences was obtained Majority of the data falls into one cluster This makes the process of separating organisms
hard The clustering/assembly failed to assemble the
sequences (the number of organisms were estimated manually and compared)
Problem with D2 clustering
Does not look for contigs Ex: A same cluster may have:
AATGCGTATTCGATGCGC
CATACTTAGTCGATC – AG
When we assemble we desire:
AATGCGTATTCGATGCGC
TGCGCATCGTATCG
Problems
Since data is closely related the clustering technique assigns them to same cluster.
Existing tools are unable to assemble the data correctly.
The clustering software can only perform one round of clustering.
Ongoing Work
Genome assembly using dynamic programming
Uses Longest Common Sub-Sequence LCS is commonly used (ex: Mummer) We added restrictions Enforce strict matches
Encoding of data
Results
Performance for a sequence of original length 1000 wrt the number of subsequences
0500
1000150020002500300035004000
200 400 600 800 1000 1200
number of Random subsequences used
Tim
e in
sec
on
ds
average time
median time
Perfect match
worst times
Time(in seconds) chart for sequence length of 500
0200400600800
100012001400160018002000
100 300 500 700 900 1500
Sequence lenght 500
Then…
We added clustering. Instead of comparing each sequence with
each other we can compare them with a group.
Faster, less number of comparisons.
Clustering
[3]
Comparison of Clustering
Performance Comparisons of Sequence assembly time for a sequence of 1000bp
0100020003000400050006000700080009000
300 500 700 900 1100 1300 1500 1700 1900
No of Subsequences
Tim
e LCS
Clustering
Performance Comparison
Best Alignment for a Sequence with 1000 bp
960965
970975
980985
990995
1000
300 500 700 900 1100 1300 1500 1700
No of Subsequences
Lo
ng
est
Ali
gn
ed S
equ
ence
LCS
Clustering
How much does it matter?
Obtaining an exact full length sequence it not essential
A sequence that is very close to the original is desired
Snapshot 1
Snapshot 2
Fuzzy Logic
Fuzzy Logic has been used extensively in approximate string matching using distance measures, etc.
However, very little work has been done in application of building genomes from subsequences of nucleotides.
The concept of similarity and application of fuzzy logic will be defined which is a relatively new area in nucleotide sequencing.
In Future..
Compare technique with Phrap (alignment software, Mummer)
Improve clustering Define Similarity using Fuzzy Logic Define Dissimilarity
Parallelize the process
References
[1] http://bioinformatics.oxfordjournals.org/cgi/content/full/21/8/1408#FIG1
accessed May, 2006.
[2] DeLong EF (2002) Microbial population genomics and ecology. Curr Opin Microbiol 5: 520–524.
[3] http://www.togaware.com/datamining/survivor/kmeans04.png
Questions