Presentation 4

Slide 1

Department of Information Science and Engineering
M S Ramaiah Institute of Technology
Bangalore - 54

OPTIMIZING PHYLOGENETIC ANALYSIS USING MAP REDUCE PROGRAMMING MODEL

Abhinav Anurag Darshil Shah Eklavya Uppal Ishank Mishra

Under the guidance of Mr. Siddesh G M, Assistant Professor, Department of Information Science and Engineering, M S Ramaiah Institute of Technology

Introduction

What Bioinformatics is

Bioinformatics is the research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, analyze, or visualize such data.

By developing techniques for analyzing sequence data and related structures, we can attempt to understand the molecular basis of life.

Phylogeny

Phylogeny is the study of evolutionary relationships among groups of organisms (e.g. species, populations), which are discovered through molecular sequencing data and morphological data matrices.

Computational Phylogenetics is the application of computational algorithms, methods and programs to phylogenetic analyses. The goal is to assemble a phylogenetic tree representing a hypothesis about the evolutionary ancestry of a set of genes, species, or other taxa.

Phylogenetic Tree

A phylogenetic tree is a statement about the evolutionary relationship between a set of homologous characters of one or several organisms.

Homology is the relationship of two characters that have descended, usually with divergence, from a common ancestral character. The characters can be any genic (gene sequence, protein sequence), structural (i.e. morphological) or behavioural feature of an organism.

Scientists build phylogenetic trees in an attempt to understand evolutionary relationships.

An evolutionary tree showing the divergence of raccoons and bears. Despite their difference in size and shape, these two families are closely related.

Building Phylogenetic Tree

Distance Matrix Method

The distance data form a matrix that holds a measure of the evolutionary distance between each pair of sequences in the multiple alignment.

        Seq1   Seq2   Seq3   Seq4
Seq1    1      0.8    0.4    0.6
Seq2    0.8    1      0.5    0.6
Seq3    0.2    0.5    1      0.9
Seq4    0.6    0.6    0.9    1

Needleman-Wunsch Algorithm

The NWA (Needleman-Wunsch algorithm) was proposed by Saul Needleman and Christian Wunsch (1970).

The algorithm performs a Global Alignment on two sequences and is commonly used in bioinformatics to align protein or nucleotide sequences.

To find the alignment with the highest score, a two-dimensional array (or matrix) is allocated. This matrix is often called the F matrix, and its (i,j)th entry is often denoted by Fij. There is one column for each character in sequence A, and one row for each character in sequence B. Thus, if we are aligning sequences of sizes n and m, the amount of memory used by the algorithm is in O(nm).
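The fill and traceback steps can be sketched in Python as follows. This is a minimal illustration, not the project's implementation: the function name `needleman_wunsch` and the scores (match +1, mismatch -1, gap -1) are assumptions chosen for the example. As in the description above, columns index sequence A and rows index sequence B, with one extra row and column for the empty prefix.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return (score, aligned_a, aligned_b).

    The scoring values are illustrative assumptions, not the slide's.
    """
    n, m = len(a), len(b)
    # F[i][j]: best score aligning the first i characters of b (rows)
    # with the first j characters of a (columns).
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        F[0][j] = j * gap
    for i in range(1, m + 1):
        F[i][0] = i * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[j - 1] == b[i - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align a[j-1] with b[i-1]
                          F[i - 1][j] + gap,     # gap in a
                          F[i][j - 1] + gap)     # gap in b

    # Traceback from the bottom-right cell to the top-left cell.
    out_a, out_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[j - 1] == b[i - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                out_a.append(a[j - 1])
                out_b.append(b[i - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append("-")
            out_b.append(b[i - 1])
            i -= 1
        else:
            out_a.append(a[j - 1])
            out_b.append("-")
            j -= 1
    return F[m][n], "".join(reversed(out_a)), "".join(reversed(out_b))
```

When several alignments share the best score, this sketch returns only one of them; which one depends on the order in which the traceback checks the three moves.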

Calculating the F Matrix

Algorithm

The F matrix would look like this

Backtracing

This would produce an alignment like this

G-ATTACA
GCA-TGCU

Calculation of Score

For example, if the similarity matrix was

then the alignment:

with a gap penalty of -5, would have the following score
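The similarity matrix and the resulting score are not reproduced in this transcript, so the sketch below only illustrates the mechanics of scoring an alignment: each column adds either the similarity of the two aligned characters or the gap penalty. The function names and the placeholder similarity values (+1 for a match, -1 for a mismatch) are assumptions, so the number it produces is not the score from the slide.

```python
def alignment_score(top, bottom, similarity, gap_penalty=-5):
    """Sum the per-column scores of two aligned, equal-length strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += gap_penalty        # a gap in either sequence
        else:
            total += similarity(x, y)   # look up the similarity of the pair
    return total


def placeholder_similarity(x, y):
    """Assumed stand-in for the slide's similarity matrix."""
    return 1 if x == y else -1


score = alignment_score("G-ATTACA", "GCA-TGCU", placeholder_similarity)
```

With a real similarity matrix, `similarity` would be a lookup into that matrix instead of the placeholder function.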

MapReduce

MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.

A MapReduce program is composed of a Map() procedure that performs filtering and sorting and a Reduce() procedure that performs a summary operation.

"Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. The worker node processes the smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output, the answer to the problem it was originally trying to solve.

MapReduce example

mapper (filename, file-contents):
    // filename : document name
    // file-contents : document contents
    for each word in file-contents:
        emit (word, 1)

reducer (word, values):
    // values : a list of aggregated partial counts
    sum = 0
    for each value in values:
        sum = sum + value
    emit (word, sum)
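The mapper and reducer above can be simulated in a single Python process. This is a sketch of the programming model only, not of Hadoop itself: the `word_count` driver stands in for the framework's shuffle phase, and the tokenization rule (lowercase, letters, digits and apostrophes) is an assumption.

```python
import re
from collections import defaultdict


def mapper(filename, contents):
    """Emit (word, 1) for every word in the document."""
    for word in re.findall(r"[a-z0-9']+", contents.lower()):
        yield word, 1


def reducer(word, values):
    """Sum the partial counts collected for one word."""
    return word, sum(values)


def word_count(files):
    """Run map, group by key (the 'shuffle'), then reduce."""
    grouped = defaultdict(list)
    for filename, contents in files.items():
        for word, count in mapper(filename, contents):
            grouped[word].append(count)
    return dict(reducer(word, values) for word, values in grouped.items())
```

On a real cluster the grouping step is performed by the framework, and mappers and reducers run on different nodes.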

For example, if we had the files:

foo.txt: Sweet, this is the foo file
bar.txt: This is the bar file

We would expect the output to be:

sweet 1
this 2
is 2
the 2
foo 1
bar 1
file 2

Apache Hadoop

Apache Hadoop is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware.

MapReduce is the heart of Hadoop.

While it can be used on a single machine, its true power lies in its ability to scale to multiple computers, each with several processor cores.

Hadoop is also designed to efficiently distribute large amounts of work across a set of machines.

Hadoop Distributed File System

In a Hadoop cluster, data is distributed to all the nodes of the cluster as it is being loaded in. The Hadoop Distributed File System (HDFS) will split large data files into chunks which are managed by different nodes in the cluster. In addition, each chunk is replicated across several machines, so that a single machine failure does not result in any data being unavailable.

Problem Statement

Distance-matrix methods of phylogenetic analysis explicitly rely on a measure of "genetic distance" between the sequences being classified, and therefore they require multiple sequence alignments as an input.

Distance methods attempt to construct an all-to-all matrix from the sequence query set describing the distance between each sequence pair. Dynamic programming algorithms like the Needleman-Wunsch algorithm (NWA) and the Smith-Waterman algorithm (SWA) produce accurate alignments.
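As a rough illustration of the all-to-all matrix, the sketch below evaluates a distance function once for every unordered pair of sequences, which is n(n-1)/2 evaluations for n sequences. Each pair is independent of the others, which is what makes the task a candidate for parallel execution. The names `distance_matrix` and `mismatch_fraction` are hypothetical, and the mismatch fraction is only a cheap placeholder for an alignment-based distance.

```python
from itertools import combinations


def distance_matrix(sequences, distance):
    """Build a symmetric all-to-all matrix as a dict of dicts.

    sequences: dict mapping a sequence name to its string.
    distance:  function taking two strings and returning a number.
    """
    names = list(sequences)
    matrix = {a: {b: 0.0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        d = distance(sequences[a], sequences[b])
        matrix[a][b] = d
        matrix[b][a] = d
    return matrix


def mismatch_fraction(x, y):
    """Placeholder distance (an assumption): the fraction of differing
    positions between two non-empty strings of equal length."""
    if len(x) != len(y) or not x:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(p != q for p, q in zip(x, y)) / len(x)
```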

However, these algorithms are computationally intensive and are therefore limited to a small number of short sequences.

Goal of the Project

The goal of this project is the design and implementation of a parallel approach to phylogenetic analysis using Hadoop data clusters.

The proposed methodology should improve performance by running the sequence alignments in parallel on the Hadoop framework.

Proposed Methodology

Input format

This project takes its input data in FASTA format. FASTA is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes.

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column.
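A minimal reader for the layout just described might look like the following. It is a sketch that assumes well-formed input and uses the full description line (without the ">") as the record key; the function name `parse_fasta` is hypothetical.

```python
def parse_fasta(text):
    """Return a dict mapping each description line to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()     # description line starts a record
            records[header] = []
        elif header is None:
            raise ValueError("sequence data before the first '>' line")
        else:
            records[header].append(line)  # sequence data may span lines
    return {name: "".join(parts) for name, parts in records.items()}
```

For example, `parse_fasta(">seq1\nACGT\nTTGA\n")` joins the two data lines into a single sequence stored under the key `seq1`.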

Sequence data set in FASTA format

Proposed Algorithm

Hierarchical clustering using UPGMA

Algorithm

UPGMA (Unweighted Pair Group Method with Arithmetic Mean) is a simple agglomerative (bottom-up) hierarchical clustering method. The UPGMA algorithm constructs a rooted tree (dendrogram) that reflects the structure present in a pairwise similarity matrix.
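The agglomerative procedure can be sketched as below. The sketch assumes the input is a set of pairwise distances (a similarity matrix would first have to be converted to dissimilarities), and the function name `upgma` and the nested-tuple tree representation are choices made for this example only. At each step the two closest clusters are merged, and the distance from the new cluster to every other cluster is the size-weighted average of the two old distances.

```python
from itertools import combinations


def upgma(distances):
    """Cluster labels bottom-up.

    distances: dict mapping (label_a, label_b) to a distance, with one
               entry per unordered pair of labels.
    Returns (tree, merges): tree is a nested tuple of labels, merges is a
    list of (left, right, height) in the order the merges happened.
    """
    d = {frozenset(pair): float(v) for pair, v in distances.items()}
    size = {}                                  # cluster -> number of leaves
    for pair in distances:
        for label in pair:
            size[label] = 1

    merges = []
    while len(size) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2      # node height in the dendrogram
        new = (a, b)
        for c in size:
            if c not in (a, b):
                # Arithmetic mean weighted by cluster sizes.
                d[frozenset((new, c))] = (
                    size[a] * d[frozenset((a, c))]
                    + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[new] = size.pop(a) + size.pop(b)
        merges.append((a, b, height))
    return next(iter(size)), merges
```

Ties between equally close pairs are broken by the order in which labels first appear in `distances`.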

Various stages of MapReduce in the proposed system

User Interface

The external interface for this project was created using D3.js (D3 for Data-Driven Documents). D3.js is a JavaScript library that uses digital data to drive the creation and control of dynamic and interactive graphical forms which run in web browsers.

Performance Analysis

Running Time

Running Time or Number of sequences versus time taken for alignments

Time for the three map reduce stages

Comparison of three stages of MapReduce for different Sequence sets

Throughput or number of sequences aligned per second

Conclusion

The project work proposed a time efficient approach to Phylogenetic analyses that produces a phylogram (phylogenetic tree or evolutionary tree).

The proposed method of phylogenetic analysis improves computation time while maintaining accuracy.

The dynamic programming nature of NWA, coupled with the data and computational parallelism of Hadoop data grids, was found to improve the accuracy and speed of sequence alignment.
