Team

Advisors: PI: Dr. Gregory Vert (Dept. of Computer Science, University of

Nevada Reno)

Co-PI: Dr. Alison Murray (Desert Research Institute, Reno)

Co-PI: Dr. Monica Nicolescu (Dept. of Computer Science, University of Nevada Reno)

Student : Sara Nasser (Dept. of Computer Science, University of Nevada

Reno)

Bioinformatics- Genome Sequencing Genome sequencing is figuring out the order of DNA

nucleotides, or bases, in a genome—the order of As, Cs, Gs, and Ts that make up an organism's DNA.

Sequencing the genome is an important step towards understanding it.

The whole genome can't be sequenced all at once because available methods of DNA sequencing can only handle short stretches of DNA at a time.

Genome Sequencing

Much of the work involved in sequencing lies in putting together this giant biological jigsaw puzzle.

Various problems occur such as: Errors in reading Flips

Shot-Gun Sequencing

The "whole-genome shotgun" method, involves breaking the genome up into small pieces, sequencing the pieces, and reassembling the pieces into the full genome sequence.

Environmental Genomics

Multiple sequence alignment is an important first step in many bioinformatics applications such as structure prediction, phylogenetic analysis and detection of key functional residues.

The accuracy of these methods relies heavily on the quality of the underlying alignment.

[1]

Multiple Sequence Alignment

The traditional multiple sequence alignment problem is NP-hard, which means that it is impossible to solve for more than a few sequences [1].

In order to align a large number of sequences, many different approaches have been developed.

Tools and Techniques

MUMMER Phrap, Phred, Consed TIGR

The Smith-Waterman Algorithm Tree-Based Algorithms

Meta-genomics

Meta-genomics is the application of modern genomics techniques to the study of communities of microbial organisms directly in their natural environments, bypassing the need for isolation and lab cultivation of individual species.

[2]

Meta-genomics

Bacteria can often have minor variations in their DNA that can result in different metabolic characteristics.

The differences can make it difficult to classify bacteria taxonomically.

What has been needed is a method of creating a characteristic representation (characteristic genome) from the sub sequences of DNA found in several sub variant of a bacteria of the same species.

Such genome could be used for more efficient classification at a molecular level through the process of controlled generalization.

Research Goals

Given a collection of nucleotide sequences from multiple organisms, develop techniques based on fuzzy set theory and other methods for assembly of the sequences into the original full genome for each organism.

Using the above techniques to develop a generalized approach for creating a characteristic genome that represents a generalization of the original organisms that donated sequence data.

The Data

SYM (Original Raw Data): Contains 302K Sequences Average length of 450 base pairs (bp)

It was obtained from a community of bacteria

There is an estimated of 100 organisms

Lets say, for example 75% of data is repeated, we still need to reassemble a sequence of ~ 33 Million bp

Motivation

Current tools could not solve the problem: Complexity of the dataset, since they are from same

species. Sequencing environmental genomes, not a single

organism. Limited tools that sequence environmental genomes.

Algorithm: Underlying algorithm determines the accuracy of match. Performance can be highly improved. Interfaces could be improved.

Problem

Genome assembly is a O(2k) problem.

Using Dynamic Programming it can be reduced.

Example in seconds: Assembly that takes around 1125899906842624

seconds to can be reduced to 2500+ seconds!

A Start

We divide the problem in two steps Acquiring subsets such that each subset

represents an organism Assembling this into a characteristic genome

sequence The above two steps to be obtained by

Clustering Assembly

Steps

Clustering AssemblyRawData

AssembledSequences

Characteristic Genome

The Data

CAVEEG (Cleaned Dataset): Contains 128K Sequences Assembled + Singletons Length ranges from 200bp-1000bp

D2 Cluster

It is a software for clustering genome sequences

The technique is based on distance.

Clustering with D2 Cluster

Clustering was performed on 128K CAVEEG Dataset One dataset with 100K Sequences was obtained Majority of the data falls into one cluster This makes the process of separating organisms

hard The clustering/assembly failed to assemble the

sequences (the number of organisms were estimated manually and compared)

Problem with D2 clustering

Does not look for contigs Ex: A same cluster may have:

AATGCGTATTCGATGCGC

CATACTTAGTCGATC – AG

When we assemble we desire:

AATGCGTATTCGATGCGC

TGCGCATCGTATCG

Problems

Since data is closely related the clustering technique assigns them to same cluster.

Existing tools are unable to assemble the data correctly.

The clustering software can only perform one round of clustering.

Ongoing Work

Genome assembly using dynamic programming

Uses Longest Common Sub-Sequence LCS is commonly used (ex: Mummer) We added restrictions Enforce strict matches

Encoding of data

Results

Performance for a sequence of original length 1000 wrt the number of subsequences

0500

1000150020002500300035004000

200 400 600 800 1000 1200

number of Random subsequences used

Tim

e in

sec

on

ds

average time

median time

Perfect match

worst times

Time(in seconds) chart for sequence length of 500

0200400600800

100012001400160018002000

100 300 500 700 900 1500

Sequence lenght 500

Then…

We added clustering. Instead of comparing each sequence with

each other we can compare them with a group.

Faster, less number of comparisons.

Clustering

[3]

Comparison of Clustering

Performance Comparisons of Sequence assembly time for a sequence of 1000bp

0100020003000400050006000700080009000

300 500 700 900 1100 1300 1500 1700 1900

No of Subsequences

Tim

e LCS

Clustering

Performance Comparison

Best Alignment for a Sequence with 1000 bp

960965

970975

980985

990995

1000

300 500 700 900 1100 1300 1500 1700

No of Subsequences

Lo

ng

est

Ali

gn

ed S

equ

ence

LCS

Clustering

How much does it matter?

Obtaining an exact full length sequence it not essential

A sequence that is very close to the original is desired

Snapshot 1

Snapshot 2

Fuzzy Logic

Fuzzy Logic has been used extensively in approximate string matching using distance measures, etc.

However, very little work has been done in application of building genomes from subsequences of nucleotides.

The concept of similarity and application of fuzzy logic will be defined which is a relatively new area in nucleotide sequencing.

In Future..

Compare technique with Phrap (alignment software, Mummer)

Improve clustering Define Similarity using Fuzzy Logic Define Dissimilarity

Parallelize the process

References

[1] http://bioinformatics.oxfordjournals.org/cgi/content/full/21/8/1408#FIG1

accessed May, 2006.

[2] DeLong EF (2002) Microbial population genomics and ecology. Curr Opin Microbiol 5: 520–524.

[3] http://www.togaware.com/datamining/survivor/kmeans04.png

Questions

Contents

Documents

Transcript of Contents