Using SWARM service to run a Grid based EST Sequence Assembly
Karthik Narayan
Primary Advisor: Dr. Geoffrey Fox

Outline
- Objective
- EST Sequence Assembly
- The Problem
- SWARM
- Tools
- Results
- Future Work

Objective
- Use the SWARM service to leverage high-performance clusters for EST sequence assembly.

EST Sequence Assembly
- Expressed sequence tags (ESTs) are a collection of random cDNA sequences, sequenced from a cDNA library.
- The ESTs are clustered and assembled to form contigs.
- The contigs are then used to identify potential unknown genes by running BLAST against a known protein database.

The Problem
- The input is typically large, of the order of 1 million sequences.
- Memory intensive
- Time consuming
- Involves multiple programs

SWARM
- A high-level job scheduling Web service framework, developed by the Pervasive Technology Institute, Indiana University.
- Can submit millions of jobs to several high-performance clusters and monitor their status.
- Extensible, lightweight, and easily installable on a desktop or small server.

Tools

| Task | Tool |
|---|---|
| Cleaning sequence reads | Repeat Masker |
| Clustering sequence reads | PaCE |
| Assembling reads | CAP3 |
| Similarity search | BLAST |

Repeat Masker
- Developed by the Institute for Systems Biology.
- Screens sequences for interspersed repeats and low-complexity regions.
- Sequence comparisons are done by cross_match.
- Splitting of input to buckets
- Post-processing step
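
The "splitting of input to buckets" step can be illustrated with a short sketch. The slides say only that the input is split; the function names, the round-robin assignment, and the bucket count below are assumptions made for illustration, not the pipeline's actual code.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_fasta_into_buckets(lines: Iterable[str], n_buckets: int) -> List[List[Record]]:
    """Distribute records round-robin over n_buckets (hypothetical strategy)."""
    if n_buckets < 1:
        raise ValueError("n_buckets must be at least 1")
    buckets: List[List[Record]] = [[] for _ in range(n_buckets)]
    for i, record in enumerate(read_fasta(lines)):
        buckets[i % n_buckets].append(record)
    return buckets
```

Each bucket could then be written to its own file and submitted as one job, so that bucket count equals job count.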

CAP3
- Developed by the Department of Computer Science, Michigan Technological University.
- CAP3 is very memory intensive and cannot be run on small servers.

PaCE
- Developed by the Department of Computer Science, Iowa State University.
- Clusters ESTs on parallel computers
- Post-processing step

CAP3
- Since the clustering step is done, the load for CAP3 is considerably less, but not trivial.

| No. of Sequences | No. of Clusters by PaCE |
|---|---|
| 10000 | 974 |
| 20000 | 2412 |
| 150000 | 12544 |
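
To get a feel for why the CAP3 load drops after clustering, the figures reported here can be turned into a mean cluster size. This is plain arithmetic on the reported numbers; the helper name `mean_cluster_size` is hypothetical.

```python
# (number of sequences, number of PaCE clusters), as reported on this slide
pace_results = [(10_000, 974), (20_000, 2_412), (150_000, 12_544)]


def mean_cluster_size(n_sequences: int, n_clusters: int) -> float:
    """Average number of sequences per cluster."""
    return n_sequences / n_clusters


for n_seq, n_clu in pace_results:
    size = mean_cluster_size(n_seq, n_clu)
    print(f"{n_seq:>7} sequences -> {n_clu:>6} clusters, mean size {size:.1f}")
```

On average, each CAP3 run therefore assembles on the order of ten sequences rather than the whole input, although individual clusters can be far larger than the mean.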

PaCE Clusters
[Figure: PaCE Clusters for 150K ESTs. Axes: No. of Clusters, No. of Sequences.]

CAP3
- Sort the input files, and submit the CAP3 jobs both ways.

CAP3
- Set a threshold, and submit the files with number of sequences less than the threshold to the local machine and the others to the Grid.
[Figure: Grid jobs versus local jobs.]
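
The threshold rule can be sketched as follows. The function names and the example threshold are assumptions; the slides do not give the threshold value or say how sequences were counted, so counting FASTA header lines is only one plausible choice.

```python
from typing import Dict, Iterable, List, Tuple


def count_sequences(fasta_lines: Iterable[str]) -> int:
    """Count records in FASTA text by counting header lines (assumed format)."""
    return sum(1 for line in fasta_lines if line.startswith(">"))


def route_cap3_jobs(seq_counts: Dict[str, int], threshold: int) -> Tuple[List[str], List[str]]:
    """Split cluster files into (local, grid) lists.

    Files with fewer sequences than `threshold` run locally;
    all others are sent to the Grid.
    """
    local: List[str] = []
    grid: List[str] = []
    for name, count in sorted(seq_counts.items()):
        (local if count < threshold else grid).append(name)
    return local, grid
```

For example, `route_cap3_jobs({"c1.fa": 3, "c2.fa": 500}, threshold=50)` puts `c1.fa` in the local list and `c2.fa` in the Grid list. Keeping small clusters local avoids paying queueing overhead for jobs that finish in seconds.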

CAP3
- CAP3 job distribution after clustering of clusters for 2 million sequences
[Figure: number of CAP3 jobs before clustering versus after clustering, for 2,000,000 sequences.]
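
The slides do not say how clusters were grouped into CAP3 jobs. One plausible approach, shown here purely as an assumption, is to pack clusters greedily into jobs with a cap on the number of sequences per job (first-fit decreasing). The function name and the cap are hypothetical.

```python
from typing import List


def batch_clusters(cluster_sizes: List[int], max_sequences_per_job: int) -> List[List[int]]:
    """Group cluster indices into jobs using first-fit decreasing.

    A cluster larger than the cap gets a job of its own.
    """
    order = sorted(range(len(cluster_sizes)),
                   key=lambda i: cluster_sizes[i], reverse=True)
    jobs: List[List[int]] = []
    loads: List[int] = []
    for idx in order:
        size = cluster_sizes[idx]
        for j, load in enumerate(loads):
            if load + size <= max_sequences_per_job:
                jobs[j].append(idx)
                loads[j] += size
                break
        else:
            # No existing job has room, so open a new one.
            jobs.append([idx])
            loads.append(size)
    return jobs
```

Packing many small clusters into one job reduces the job count, which matters when each submission carries scheduling overhead.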

BLAST
- NCBI BLAST for homology search
- Splitting of input to buckets
- If complete, update the status for the pipeline in the database, zip the output files, and email them to the user.
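
The completion step (update status, zip outputs, notify the user) might look like the sketch below. Everything specific in it is an assumption: the `pipeline` table and its columns, the `COMPLETE` status value, the use of SQLite, and the function name. The email is only constructed, not sent, because the slides do not describe the mail setup.

```python
import sqlite3
import zipfile
from email.message import EmailMessage
from pathlib import Path
from typing import Iterable, Union

PathLike = Union[str, Path]


def finish_pipeline(conn: sqlite3.Connection, pipeline_id: int,
                    output_files: Iterable[PathLike], archive_path: PathLike,
                    user_email: str) -> EmailMessage:
    """Zip the outputs, mark the pipeline complete, and build the notification."""
    archive_path = Path(archive_path)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in output_files:
            archive.write(path, arcname=Path(path).name)

    # Hypothetical schema: pipeline(id INTEGER PRIMARY KEY, status TEXT)
    with conn:  # commits on success, rolls back on error
        conn.execute("UPDATE pipeline SET status = ? WHERE id = ?",
                     ("COMPLETE", pipeline_id))

    message = EmailMessage()
    message["To"] = user_email
    message["Subject"] = f"Pipeline {pipeline_id} complete"
    message.set_content("Your results are attached.")
    message.add_attachment(archive_path.read_bytes(), maintype="application",
                           subtype="zip", filename=archive_path.name)
    return message
```

The returned message could be handed to `smtplib.SMTP.send_message` once a mail server is configured.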

Workflow
- Log in and select the programs to run from the list of available programs.

Workflow
- Enter the parameters for the selected programs.

Workflow
- Upload the required files, if any.
- The job is then submitted to the Swarm service and a status message is displayed.
- An email is sent to the user once the job is completed.

Results
Assembly results for 2 million sequences

| No. of Sequences | Runtime for PaCE | No. of Clusters by PaCE | No. of Jobs for CAP3 | Runtime for CAP3 | Total Runtime |
|---|---|---|---|---|---|
| 2000000 | 01:22 hours | 75460 | 4073 | 25:44 hours | 27:06 hours |

Results
Runtime for the entire pipeline for 2 million sequences

| Program | No. of Jobs | Runtime |
|---|---|---|
| Repeat Masker | 1000 | 11:56 |
| PaCE | 1 | 01:22 |
| CAP3 | 4073 | 25:44 |
| BLAST | 893 | 49:00 |
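
The reported runtimes can be cross-checked with a few lines of arithmetic. The sketch assumes the values are in hours:minutes and that the stages run one after another, so their runtimes add; neither assumption is stated on the slides beyond the "hours" label in the assembly results.

```python
def to_minutes(hhmm: str) -> int:
    """Convert an 'HH:MM' duration to minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def to_hhmm(total_minutes: int) -> str:
    """Format a number of minutes as 'HH:MM'."""
    return f"{total_minutes // 60:02d}:{total_minutes % 60:02d}"


runtimes = {
    "Repeat Masker": "11:56",
    "PaCE": "01:22",
    "CAP3": "25:44",
    "BLAST": "49:00",
}

# Assembly only (PaCE + CAP3); matches the 27:06 total in the assembly results.
assembly = to_minutes(runtimes["PaCE"]) + to_minutes(runtimes["CAP3"])
# All four stages, assuming they run sequentially.
pipeline = sum(to_minutes(t) for t in runtimes.values())

print(to_hhmm(assembly))  # 27:06
print(to_hhmm(pipeline))  # 88:02
```

Under these assumptions the full pipeline takes about 88 hours, with BLAST accounting for more than half of it.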

Validation
- The assembly results for Daphnia pulex, assembled using Swarm, were compared to the assembly results of EST Piper.
- A comparison of BLAST results, with hits greater than an e-value of 2, is as follows:

| No. | Name | EST Piper | Swarm |
|---|---|---|---|
| 1 | Number of contigs | 17465 | 20803 |
| 2 | Number of hits | 13216 | 15747 |
| 3 | No. of unique top hit genes | 9221 | 10329 |

Validation
- The number of genes commonly identified was 7045. That is, Swarm predicted 76.4% of the genes predicted by the assembly using EST Piper.
- There were 3284 genes identified by Swarm but not by EST Piper.
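
The validation percentages follow directly from the unique top-hit gene counts reported for the two assemblies. The short check below uses only those reported numbers; the variable names are illustrative, and the EST Piper-only count is derived here rather than stated on the slides.

```python
est_piper_genes = 9221   # unique top-hit genes, EST Piper assembly
swarm_genes = 10329      # unique top-hit genes, Swarm assembly
common_genes = 7045      # genes identified by both

recovered_pct = 100 * common_genes / est_piper_genes
swarm_only = swarm_genes - common_genes
piper_only = est_piper_genes - common_genes  # derived, not reported on the slides

print(f"Swarm recovered {recovered_pct:.1f}% of EST Piper genes")  # 76.4%
print(f"Swarm only: {swarm_only}")       # 3284
print(f"EST Piper only: {piper_only}")   # 2176
```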

Future Work
- Implement assembly programs like MIRA for next-gen sequences.
- Try different job scheduling strategies.
- Use cloud computing resources.