4/6/07 Ph.D defense 1
Service-oriented architecture for integration of bioinformatic data and applications
Xiaorong Xiang
Department of Computer Science and Engineering
University of Notre Dame
Contributions
• Survey of research issues and challenges in service-oriented computing (Chapter 2)
• Built an SOA-based system for supporting bioinformatics research (Chapter 3)
• Explored the deep phylogeny of the plastid with the system (Chapter 4)
• Enhanced the system with semantic web technology and a novel approach to reusing workflows (Chapters 5 & 6)
Outline
• Introduction to SOA
• MoG project and MoGServ
• Ontological data and service representation model
• Knowledge and workflow reuse
SOA – an architectural style of distributed computing
[Figure: service requester, service broker, and service provider connected through the publish, discovery, and invoke operations on a service interface]
• Why SOA
  • Reusability
  • Interoperability
  • Security
  • Maintenance
  • Save cost when integrating applications
• Adoption of SOA
  • e-Business
  • e-Science
  • e-Government
Web services – one realization of SOA
• Business process execution: BPEL4WS, WFML, WSFL, BizTalk, …
• Service publishing & discovery: UDDI (Universal Description, Discovery and Integration)
• Service description: WSDL (Web Services Description Language)
• Service communication: SOAP (Simple Object Access Protocol)
• Meta language: XML
• Network transport protocols: TCP/IP, HTTP, SMTP, FTP, etc.
• Additional WS-* standards: transactions, management, security, …
SOA research orientations
[Figure: the Semantic Web, Grid computing, and service-oriented architecture (Web services) combine, in three numbered steps, into Semantic Web services, the Open Grid Services Architecture (OGSA), the Semantic Grid, and Semantic Grid services]
P2P technology plays an important role in increasing the scalability and reliability of the service discovery and workflow execution processes.
Bioinformatics today
• Rapidly accumulating data: DNA sequences, contigs, expression data, ontologies, annotations, etc.
• Non-standard, independently developed, heterogeneous data sources
• Data sharing, data integration, and security
From the article “Genome Sequencing vs. Moore’s Law: Cyber Challenges for the Next Decade” by Folker Meyer, CTWatch Quarterly, August 2006, volume 2, number 3
SOA in Bioinformatics
• Recent exposure of data & analysis tools as services
  • Large public databases
  • Middleware projects: provide infrastructure to compose, manage, execute, and connect the distributed services
• MORE
  • Community efforts needed to provide more shared and reliable services
  • More demonstration projects needed => best practices, measured utility, feedback to middleware projects, etc.
Mother of Green (MoG) project
• Biological science
  • In collaboration with Prof. Jeanne Romero-Severson, Biological Sciences, University of Notre Dame
  • Study the deep phylogeny of the plastid
• Computer science
  • Provide an environment to support scientists’ investigations
  • A case study of using SOA for data and application integration
  • A prototype for future research in the service-oriented architecture domain
MoG project – one motivation
• Malaria causes 1.5 - 2.7 million deaths every year
• 3,000 children under age five die of malaria every day
• Plasmodium falciparum (P. falciparum) causes human malaria
• Targeted drug design through phylogenomics
  • P. falciparum has three genomes: nuclear, mitochondrial, plastid (apicoplast)
  • Find the ancestors of the apicoplast to better understand the evolution of the plastid
  • Identify genes in the ancestors
  • Determine gene function
[Figure: P. falciparum; apicoplast in P. falciparum]
A typical in-silico investigation
Data-driven research workflow
• A: Query complete genome sequences given a taxon
• B: Query protein coding genes for each genome sequence
• C: Eliminate vector sequences
• D: Sequence alignment
• E: Phylogenetic analysis
Challenges (time-consuming manual web-based operations)
• Data collection and information gathering
  • Rapid accumulation of raw sequence information
  • Rate of accumulation is increasing
  • Information accumulates faster than analyses finish
  • Information in forms not readily accessible
• Analysis tool usage
• Experimental data recording
• Repetitive experiments for scientific discovery
MoGServ System Architecture
[Figure: MoGServ system architecture. Components shown: web interface applications; services access client; MoGServ middle layer with application server, data access services, data analysis services, job manager, job launcher, service/workflow registry, metadata search, local data storage, and workflow/SOAP engines; services from data/service providers (NCBI, DDBJ, EMBL, others)]
Data storage and access services
• Local database
  • Integrating data from multiple data sources according to scientists’ interests
  • Supporting repetitive investigations against several subsets of sequences
  • Avoiding network traffic and service failure when retrieving data on-the-fly from public data sources
• Accessing the data in the local database through services
Service and workflow registry
• A table-based description with necessary properties
  • Text description
  • Service location
  • Input/output
  • Provider
  • Version
  • Algorithm
  • Invocation method
• Not intended for supporting service discovery or composition at the current stage
• A repository of services and workflows used by local application developers
Indexing and querying metadata
• Metadata
  • Service and workflow description
  • Description of sequence data in order to track the origin of data
  • Experimental data: output, input, and intermediate data
• Indexing and querying with keywords
  • Lucene
  • Implemented as services
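The keyword indexing idea can be illustrated with a minimal inverted index. This is only a sketch in Python, not the Lucene-based implementation used in MoGServ; the class name, record ids, and descriptions are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case a description and split it into keyword tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


class KeywordIndex:
    """Tiny in-memory inverted index: keyword -> set of record ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, record_id, description):
        # Index every token of the metadata description under the record id.
        for token in tokenize(description):
            self._postings[token].add(record_id)

    def search(self, query):
        """Return ids of records that contain every keyword in the query."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        result = set(self._postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self._postings.get(token, set())
        return result


# Hypothetical metadata records for a service, a workflow, and a data set.
index = KeywordIndex()
index.add("service:ClustalW", "Multiple sequence alignment service")
index.add("workflow:wf1", "Retrieve gene sequences and run alignment")
index.add("data:set42", "Plastid gene sequences retrieved from a public database")

hits = index.search("gene sequences")
print(sorted(hits))
```

The search is purely syntactic: "sequence" and "sequences" are different keywords here, which is the kind of limitation that motivates the semantic description discussed later.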
Service and workflow enactment
[Figure: enactment flow. The input (task name, parameters, timer) goes to the Job Manager, which finds the service/workflow definition in the Service/Workflow Registry using the task name, forms a job description, and returns a job ID as output. The Job Launcher passes the job information to instances of workflow/service engines.]
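The enactment flow can be sketched as follows. This is a simplified illustration under stated assumptions, not the MoGServ code: the class and function names, the registry contents, and the status values are hypothetical, and the timer input is omitted.

```python
import itertools
from dataclasses import dataclass


@dataclass
class JobDescription:
    job_id: int
    task_name: str
    definition: str      # service/workflow definition found in the registry
    parameters: dict
    status: str = "queued"


class JobManager:
    """Looks up a task in the registry, forms a job description, returns a job ID."""

    def __init__(self, registry, launcher):
        self._registry = registry    # maps task name -> service/workflow definition
        self._launcher = launcher    # callable that hands the job to an engine
        self._ids = itertools.count(1)
        self.jobs = {}

    def submit(self, task_name, parameters):
        if task_name not in self._registry:
            raise KeyError(f"no service or workflow registered for {task_name!r}")
        job = JobDescription(
            job_id=next(self._ids),
            task_name=task_name,
            definition=self._registry[task_name],
            parameters=dict(parameters),
        )
        self.jobs[job.job_id] = job
        self._launcher(job)
        return job.job_id


def launch(job):
    # Stand-in for the Job Launcher: a real launcher would pass the job
    # information to a workflow or service engine instance.
    job.status = "launched"


registry = {"align_sequences": "workflow:clustalw_alignment"}  # hypothetical entry
manager = JobManager(registry, launch)
job_id = manager.submit("align_sequences", {"setid": "42"})
print(job_id, manager.jobs[job_id].status)
```

Returning only the job ID keeps the caller decoupled from the engine that eventually runs the job.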
Implementation
• Development and deployment
  • J2EE, JSP, XSLT
  • Tomcat 5.0.18 / Axis 1_2RC2
• Database
  • PostgreSQL 8.1
• Index and search of metadata
  • Apache Lucene library
• Service implementation
  • Java2WSDL
  • Wrap command line applications with JLaunch library
• Workflow
  • Taverna workbench, part of myGrid project
  • Freefluo workflow engine
Taverna workbench
A more complex workflow
Issues with the first prototype
• Metadata description
  • Solution
    • Index-based (keyword syntactic search)
    • Capture most properties to support the end users’ requirements
    • Support data provenance
  • Limitation
    • Similar to most services in the bioinformatics community
    • Lack of semantic description (goal => semantic search)
• Failure tolerance and recovery
  • Solution
    • Statically encode alternative services in the workflow to prevent service failure
    • Record status of the service and workflow execution into the database for possible recovery strategy
    • Deploy multiple workflow engines to guard against hardware or network failure
  • Limitation
    • No dynamic service selection (needs more semantic description support) during execution time
    • No fine-grained resource management and monitoring
• Security
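The static failover idea from the failure tolerance item can be sketched as below. This is an illustrative Python sketch, not the workflow encoding used in the prototype: the service functions are hypothetical stand-ins, and a plain list stands in for the status records kept in the database.

```python
def run_step(alternatives, payload, status_log):
    """Try statically listed alternative services in order.

    Each attempt is recorded in status_log so that a later recovery
    strategy can see which service failed and which one succeeded.
    """
    for name, invoke in alternatives:
        try:
            result = invoke(payload)
        except Exception as exc:
            status_log.append((name, "failed", str(exc)))
            continue
        status_log.append((name, "succeeded", ""))
        return result
    raise RuntimeError("all statically encoded alternatives failed")


def primary_blast(sequence):
    # Hypothetical primary service that is currently unreachable.
    raise ConnectionError("primary service unreachable")


def mirror_blast(sequence):
    # Hypothetical alternative service offering the same operation.
    return f"alignment report for {sequence}"


status_log = []
alternatives = [("primary_blast", primary_blast), ("mirror_blast", mirror_blast)]
result = run_step(alternatives, "ATGC", status_log)
print(result)
print(status_log)
```

Because the list of alternatives is fixed at design time, this shows the limitation named above: nothing is selected dynamically at execution time.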
Semantic web
• Semantic web vision
  • Giving meaning (semantics) to web-based information
  • Machine-understandable, such that software agents can process it autonomously
• Two standards: OWL & RDF
  • The Web Ontology Language (OWL)
    • Defines common vocabularies for specifying the concepts and relationships among concepts
  • Resource Description Framework (RDF)
    • Formal format for encoding web content using defined vocabularies
• Semantic web for Bioinformatics
  • UniProt RDF project
• Semantic web for SOA
  • Automated service discovery, composition
Resource Description Framework (RDF)
• A graph model of statements, a set of triples: Predicate (Subject, Object)
• Representations: RDF/XML, N-triples, Turtle
• A standard format to connect web information
[Figure: example RDF graph with the statements below. Nodes are either literals or resources; "#" stands for the URI that provides the definition of these vocabularies.]
• #hasCreator (http://www.nd.edu/~mog, #gmadey)
• #hasFullName (#gmadey, Gregory Madey)
• #hasTitle (#gmadey, #professor)
• #hasPersonalSite (#gmadey, http://www.nd.edu/~gmadey)
• #hasTextDescription (http://www.nd.edu/~mog, MoG is a … project)
• #hasResearchTopic (http://www.nd.edu/~mog, #bioinformatics)
• #hasFundedBy (http://www.nd.edu/~mog, #foundation)
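A few of the statements from the example graph can be written as (subject, predicate, object) triples and serialized to N-Triples. This is a minimal sketch in plain Python; the vocabulary namespace `http://example.org/vocab#` is a hypothetical placeholder, since the slide does not give the URI behind "#".

```python
MOG = "http://www.nd.edu/~mog"
VOCAB = "http://example.org/vocab#"   # hypothetical namespace standing in for "#"


class Literal(str):
    """Marks an object as a literal value rather than a resource URI."""


triples = [
    (MOG, VOCAB + "hasCreator", VOCAB + "gmadey"),
    (VOCAB + "gmadey", VOCAB + "hasFullName", Literal("Gregory Madey")),
    (VOCAB + "gmadey", VOCAB + "hasPersonalSite", "http://www.nd.edu/~gmadey"),
    (MOG, VOCAB + "hasResearchTopic", VOCAB + "bioinformatics"),
]


def to_ntriples(statements):
    """Serialize triples as N-Triples: one '<s> <p> o .' statement per line."""
    lines = []
    for subject, predicate, obj in statements:
        if isinstance(obj, Literal):
            # Literals are quoted; backslashes and quotes are escaped.
            escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
            obj_text = f'"{escaped}"'
        else:
            obj_text = f"<{obj}>"
        lines.append(f"<{subject}> <{predicate}> {obj_text} .")
    return "\n".join(lines)


ntriples = to_ntriples(triples)
print(ntriples)
```

Resources appear in angle brackets and literals in quotes, which mirrors the literal/resource distinction in the figure legend.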
Ontological modules used for semantic description of data, services & workflows
[Figure: ontological modules for describing data, services, and workflows: generic service description ontology (myGrid/Feta model), service domain ontology (myGrid), and MoGServ application domain ontology (MoGServ); also shown are the software components for annotation and the RDF store]
MoGServ Application Domain Ontology
• To better track the data origination
• To support the automation of workflow creation
• To better share the data on the web in the future
Example concepts and properties defined in MoGServ
properties | domain | range
hasSetName | Set | XML:String
isInstanceOf | Job | Service
isParentOf | Set | Set
invokedby | Job | User
[Table: number of concepts and number of properties (object, datatype) for each ontological module (myGrid/Feta model, myGrid, MoGServ). Cell boundaries were lost; the values read 171126 (beside myGrid/Feta model), 8419, and 7912.]
Sample data annotation – metadata from MoG local database
Displayed by Rdf-Gravity
Sample service/workflow annotation
Question: Which service has an operation that accepts nucleotide_sequence as a parameter?
Answer:
  Uri: http://www.ebi.ac.uk …/alignment:blastn_ncbi
  OperationName: Run
Displayed by Rdf-Gravity
Implementation of annotation and query components for data, services & workflows
• Sesame 1.2.6 library
  • Supports files, RDBMS, SeRQL
[Figure: annotation components with annotation templates (data) and annotation templates (service), query components with query templates, and the Sesame RDF store]
SeRQL query:
  Select Y, W, X
  from {Y} mg:hasOperation {W} mg:inputParameter {X} rdf:type {mog:set}
  using namespace
    rdf = <http://www.w3.org/1999/02/22-rdf-syntax-ns#>,
    mg = <http://www.mygrid.org.uk/ontology#>,
    mog = <http://almond.cse.nd.edu:10000/mog#>
Result:
  Service: http://almond.cse.nd.edu:10000/axis/services/ClustalW?wsdl
  Operation: runClustalWdf
  inputParameter: setid
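The path expression in the SeRQL query chains three statements: a service has an operation, the operation has an input parameter, and the parameter has a given type. The same pattern can be evaluated over in-memory triples, as in the sketch below. It is written in plain Python rather than against the Sesame API, and the `svc:`, `op:`, and `param:` identifiers are hypothetical shorthand for the annotated resources.

```python
from collections import defaultdict

HAS_OPERATION = "mg:hasOperation"
INPUT_PARAMETER = "mg:inputParameter"
RDF_TYPE = "rdf:type"


def find_operations_accepting(triples, semantic_type):
    """Evaluate {Y} mg:hasOperation {W} mg:inputParameter {X} rdf:type {semantic_type}."""
    # Index objects by (subject, predicate) so each hop is a dictionary lookup.
    objects = defaultdict(set)
    for subject, predicate, obj in triples:
        objects[(subject, predicate)].add(obj)

    matches = []
    for service, predicate, operation in triples:
        if predicate != HAS_OPERATION:
            continue
        for parameter in sorted(objects[(operation, INPUT_PARAMETER)]):
            if semantic_type in objects[(parameter, RDF_TYPE)]:
                matches.append((service, operation, parameter))
    return matches


# Hypothetical annotations for two services.
annotations = [
    ("svc:ClustalW", HAS_OPERATION, "op:runClustalWdf"),
    ("op:runClustalWdf", INPUT_PARAMETER, "param:setid"),
    ("param:setid", RDF_TYPE, "mog:set"),
    ("svc:FormatConversion", HAS_OPERATION, "op:convert"),
    ("op:convert", INPUT_PARAMETER, "param:filen"),
    ("param:filen", RDF_TYPE, "mygrid:sequence_alignment_report"),
]

rows = find_operations_accepting(annotations, "mog:set")
print(rows)
```

Only the ClustalW operation is returned, matching the shape of the result shown above (service, operation, input parameter).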
Limitations
• The MoGServ ontology is not complete
  • Contains a small portion of the necessary concepts used for tracking the data provenance
• Service domain ontology is not complete
  • Needs more concepts as more services are published
• Challenges of using semantic web in general
  • Ontology creation, never complete
  • Data and service annotation accuracy, efficiency
  • Ontology integration
Three user-defined workflows from different views
Question: “Are gene genealogies for ATP subunits α, β, γ different?”
• Workflow A, defined by a less experienced user using the functional definition of services: Retrieving, Aligning
• Workflow B, defined by an intermediate user with executable services: queryGene, clustalW
• Workflow C, defined by an expert user with two extra executable services to ensure the accurate output of the biological process: queryGene, setIds, setFilter, clustalW
Limitations of current workflow management systems
• Existing workflow management systems and bioinformatics middleware
  • Taverna, Kepler, Triana, Pegasus
  • Design, execute, monitor, re-run
  • Support ad-hoc, semi-automated and automated service discovery and composition from scratch
• Our approach: reuse the verified knowledge and workflows
  • Increase the correctness over time
  • Provide more accurate guidelines
Enhanced workflow system
[Figure: enhanced workflow system. Components shown: user, service annotator, ontology, DL reasoner, semantics-enabled service registry, semantics-enabled service discovery, service matchmaking, workflow composer (software agent/experienced users), abstract workflow, concrete workflow, workflow execution engine, data provenance management, knowledge base management, and knowledge discovery. Labelled steps: create abstract workflow using ontology; annotate services using ontology; find appropriate service; collect and manage information about data origination.]
Our hierarchical workflow structure
• Abstract workflow (Task A, Task B)
  • Encode, convert the high-level definition to a low-level executable
• Concrete workflow (Service A, Service B, Service C, Service D)
  • Replace individual services with their optimal alternatives
• Optimal workflow (Service A, Service B, Service C’, Service D)
  • Invoke a workflow with specific input data and record the data provenance and performance of services, workflows
• Workflow instance (input, Service A, Service B, Service C’, Service D, output)
Pegasus workflow structure
• Abstract workflow: FFT filea
• Concrete workflow
  • Data transfer: move filea from host1://home/filea to host2://home/file1
  • /usr/local/bin/fft /home/file1
  • Data registration
Reusable knowledge
• Connectivity
  • Helps to convert from abstract workflow to concrete workflow
• Alternatives and quality-of-service profiles
  • Helps to convert from concrete workflow to optimal workflow
• Mapping of abstract workflow and concrete workflow
  • Helps to choose reusable workflows
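The step from a concrete workflow to an optimal workflow can be sketched as picking, for each service, the best-scoring entry among its known alternatives. The sketch below is illustrative only: the quality-of-service fields (`success_rate`, `mean_seconds`), their values, and the scoring rule are assumptions, not the profiles recorded by the system.

```python
concrete_workflow = ["ServiceA", "ServiceB", "ServiceC", "ServiceD"]

# Known interchangeable services; a service without an entry has no alternative.
alternatives = {"ServiceC": ["ServiceC", "ServiceC_prime"]}

# Hypothetical quality-of-service profiles collected from earlier runs.
qos = {
    "ServiceA": {"success_rate": 0.99, "mean_seconds": 2.0},
    "ServiceB": {"success_rate": 0.98, "mean_seconds": 5.0},
    "ServiceC": {"success_rate": 0.80, "mean_seconds": 9.0},
    "ServiceC_prime": {"success_rate": 0.97, "mean_seconds": 4.0},
    "ServiceD": {"success_rate": 0.95, "mean_seconds": 1.0},
}


def score(profile):
    # Assumed rule: prefer higher success rate, then lower mean run time.
    return (profile["success_rate"], -profile["mean_seconds"])


def optimize(workflow, alternatives, qos):
    """Replace each service with its best-scoring alternative, keeping the order."""
    optimal = []
    for service in workflow:
        candidates = alternatives.get(service, [service])
        optimal.append(max(candidates, key=lambda name: score(qos[name])))
    return optimal


optimal_workflow = optimize(concrete_workflow, alternatives, qos)
print(optimal_workflow)
```

The structure of the workflow is unchanged; only individual services are swapped, as in the Service C to Service C’ replacement in the hierarchical structure.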
Connectivity identification (match detection)
Parameter notation: name(data type, semantic type)
Service: QueryLocal
  Operation: createSet
  performTask: mygrid:retrieving
  inputPara: Settype(String, mog:gene), Queryterm(String, null)
  outputPara: Setid(string, mog:geneset)
  useResource: MoG
Service: ClustalW
  Operation: runClustalWdf
  performTask: mygrid:aligning
  inputPara: Setid(String, mog:set), Sequencetype(String, mog:sequence)
  outputPara: filen(string, mygrid:sequence_alignment_report)
  useResource: EBI
Service: FormatConversion
  Operation: convert
  performTask: mygrid:translating
  inputPara: filen(String, mygrid:sequence_alignment_report)
  outputPara: Out(String, mygrid:nexus_paup_format)
  useResource: MoG
Matching rule: operation_ij → operation_mn if there exists a parameter_k that is an output parameter of operation_ij and a parameter_o that is an input parameter of operation_mn such that data type(parameter_o) = data type(parameter_k) and semantic type(parameter_o) = semantic type(parameter_k)
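The matching rule can be sketched directly from its definition. This is a minimal Python illustration, not the system's implementation. Two choices are assumptions made here: data types are compared case-insensitively (the slide mixes "string" and "String"), and a parameter without a semantic type never matches.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Parameter:
    name: str
    data_type: str
    semantic_type: Optional[str]   # None when no semantic annotation exists


@dataclass(frozen=True)
class Operation:
    service: str
    name: str
    inputs: Tuple[Parameter, ...]
    outputs: Tuple[Parameter, ...]


def matching_pairs(producer, consumer):
    """Return (output, input) pairs that agree on data type and semantic type."""
    return [
        (out, inp)
        for out in producer.outputs
        for inp in consumer.inputs
        if out.semantic_type is not None
        and out.data_type.lower() == inp.data_type.lower()
        and out.semantic_type == inp.semantic_type
    ]


def connects(producer, consumer):
    """True when at least one output of producer can feed an input of consumer."""
    return bool(matching_pairs(producer, consumer))


query_local = Operation(
    "QueryLocal", "createSet",
    inputs=(Parameter("Settype", "String", "mog:gene"),
            Parameter("Queryterm", "String", None)),
    outputs=(Parameter("Setid", "string", "mog:geneset"),),
)
clustalw = Operation(
    "ClustalW", "runClustalWdf",
    inputs=(Parameter("Setid", "String", "mog:set"),
            Parameter("Sequencetype", "String", "mog:sequence")),
    outputs=(Parameter("filen", "string", "mygrid:sequence_alignment_report"),),
)
convert = Operation(
    "FormatConversion", "convert",
    inputs=(Parameter("filen", "String", "mygrid:sequence_alignment_report"),),
    outputs=(Parameter("Out", "String", "mygrid:nexus_paup_format"),),
)

print(connects(clustalw, convert))
print(connects(query_local, clustalw))
```

With strict equality on semantic types, the ClustalW output feeds FormatConversion, while `mog:geneset` and `mog:set` are treated as different types; relating them would need ontology knowledge that this sketch does not model.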
Need for verified service connectivity
The mismatching problem
Match detection output versus real match:
  | Real match: yes | Real match: no
Match detected: yes | TP (accurate annotation) | FP (inaccurate annotation, lack of semantic annotation, inaccurate reasoning)
Match detected: no | FN (inaccurate annotation, lack of semantic annotation, inaccurate reasoning) | TN (accurate annotation)
Examples of connections that do not work (X):
• FP, may be detected by expertise at design time or after a run: DDBJ-XML (Out: sequence data record) → NCBI blast (In: sequence data record); the formats differ (fasta format, self-defined format)
• TN, can be detected automatically: GenBankService (Out: GenBank record) → Blastp (In: protein sequence); needs a mediator, adaptor, or shim
Connectivity graph implementation
[Figure: in the registration process, the connectivity of a service entering the registry is identified automatically and stored in the knowledge base; the workflow translation / service composition process uses it to refine, update, and decompose the workflow]
• connect(service_a, operation_ai, parameter_c, service_b, operation_bi, parameter_d)
• identifyConnect(single service, RDF repository)
• Search at syntactic level
  • Search path between two nodes
  • Search next available service
  • Automatic composition based on input, output
• Implementation: shortest path algorithm (Dijkstra)
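A path search over a connectivity graph can be sketched with Dijkstra's algorithm. The graph below is a hypothetical example: nodes are named "Service.operation", every stored connection has weight 1, and the `Mirror.align` node is invented for illustration.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra's algorithm; returns the node list from source to target, or None."""
    distances = {source: 0}
    previous = {}
    queue = [(0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            break
        if dist > distances.get(node, float("inf")):
            continue  # stale queue entry, a shorter route was already found
        for neighbour, weight in graph.get(node, {}).items():
            candidate = dist + weight
            if candidate < distances.get(neighbour, float("inf")):
                distances[neighbour] = candidate
                previous[neighbour] = node
                heapq.heappush(queue, (candidate, neighbour))
    if target not in distances:
        return None
    # Walk the predecessor links back from the target.
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return path[::-1]


# Hypothetical verified connections: producer -> {consumer: weight}.
connectivity = {
    "QueryLocal.createSet": {"ClustalW.runClustalWdf": 1, "Mirror.align": 1},
    "Mirror.align": {"ClustalW.runClustalWdf": 1},
    "ClustalW.runClustalWdf": {"FormatConversion.convert": 1},
    "FormatConversion.convert": {},
}

path = shortest_path(connectivity, "QueryLocal.createSet", "FormatConversion.convert")
print(path)
```

With unit weights this is equivalent to a breadth-first search; weights leave room for ranking connections, for example by how often they have been verified.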
Experiment
• Used 418 concepts from the domain ontology for semantic type; defined 10 concepts for data type
• Randomly generated service annotations: 1 input, 1 output
• Intel Pentium mobile 1.5 GHz
Match detection
Number of services | Number of matched pairs
200 | 10
400 | 34
600 | 84
800 | 138
1000 | 225
• Load RDF repository (milliseconds): 3325, 3015, 2600, 2346, 1547
• Average time of match detection per single service (milliseconds): 12.51, 12.35, 12.31, 13.01, 12.02
Connectivity graph for 1000 services
• Number of nodes: 724
• Number of arcs: 587
• Connectivity graph load time (milliseconds): 220
• Average path search time (milliseconds): less than 1
• Paths by length: length 0 = 724, length 1 = 587, length 2 = 448, length 3 = 281, length 4 = 114, length 5 = 71, length 6 = 28, length 7 = 16, length 8 = 4, length 9 = 2
Conclusion: feasible solution.
Reuse of workflows
• Reuse of abstract workflows
• Reuse of concrete workflows
• Compare structural similarity of two workflows
• Implementation: SUBDUE algorithm
Graph view
[Figure: workflow graph with nodes input, output, task, task, query_term, retrieving, aligning, and multiple_alignment_report, connected by hasNext, hasInput, hasOutput, performTask, and hasParameter edges]
SUBDUE input format
  v 1 input
  v 2 output
  v 3 task
  v 4 task
  v 5 query_term
  v 6 retrieving
  v 7 aligning
  v 8 multiple_aligning_report
  e 3 4 hasNext
  e 3 1 hasInput
  e 4 2 hasOutput
  e 3 6 performTask
  e 4 7 performTask
  e 1 5 hasParameter
  e 2 8 hasParameter
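Turning a workflow graph into the vertex and edge lines shown above can be sketched as follows. The node keys (`in`, `t1`, `p1`, and so on) are hypothetical internal names; only the labels and the "v" / "e" line layout come from the slide.

```python
def to_subdue(vertices, edges):
    """Serialize a labelled graph as 'v <id> <label>' and 'e <src> <dst> <label>' lines.

    vertices maps a node key to its label; ids are assigned in insertion order,
    starting at 1. edges is a list of (source key, target key, label).
    """
    ids = {key: number for number, key in enumerate(vertices, start=1)}
    lines = [f"v {ids[key]} {label}" for key, label in vertices.items()]
    lines += [f"e {ids[src]} {ids[dst]} {label}" for src, dst, label in edges]
    return "\n".join(lines)


# The abstract workflow from the graph view: a retrieving task followed by an aligning task.
vertices = {
    "in": "input",
    "out": "output",
    "t1": "task",
    "t2": "task",
    "p1": "query_term",
    "retrieving": "retrieving",
    "aligning": "aligning",
    "p2": "multiple_aligning_report",
}
edges = [
    ("t1", "t2", "hasNext"),
    ("t1", "in", "hasInput"),
    ("t2", "out", "hasOutput"),
    ("t1", "retrieving", "performTask"),
    ("t2", "aligning", "performTask"),
    ("in", "p1", "hasParameter"),
    ("out", "p2", "hasParameter"),
]

subdue_text = to_subdue(vertices, edges)
print(subdue_text)
```

Keeping labels separate from node keys lets two tasks share the label "task", which is what a structural comparison of workflows relies on.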
Pro and Con
• Pro
  • Increase the correctness of the formed workflow over time
    • Avoid incorrect, inaccurate semantic annotations
    • Take advantage of verified knowledge
    • Avoid the ontological reasoning process
  • Better support for semi-automated and automated service composition over time
    • Provide more accurate guidelines to users over time
• Con
  • The connectivity graph can be big
    • Number of parameters
    • Number of services
  • Searching the connectivity of a service when it is registered in the system may take a relatively long time
    • More complex matching rule
    • Number of parameters
  • May not have high accuracy at the beginning
Summary
• Described the design and implementation of MoGServ
• Explored the ontological representation of data and services
• Described a new approach for reuse of workflows and connectivity of services
Future work
• Integrate GridSAM into MoGServ for execution, monitoring
• Integrate Grid computing technology for resource allocation
• Refine the MoGServ application domain ontology
• Create interface for end-user workflow creation
• Create interface for individual workspace
• Evaluate the scalability, accuracy of the connectivity graph approach and the graph matching approach with a large number of real workflows and services
Acknowledgements
• Dr. Madey
• Dr. Romero-Severson
• Dr. Flynn
• Dr. Striegel
• Dr. Chaudhary
• Dr. Collins
• Mr. Eric Morgan
• Dr. Jean-Christophe Ducom
Partially supported by the Indiana Center for Insect Genomics (ICIG) with funding from the Indiana 21st Century fund
Publications
• X. Xiang, G. Madey, and J. Romero-Severson, “A Service-oriented Data Integration and Analysis Environment for In-Silico Experiments and Bioinformatics Research”, Proceedings of the 40th Annual Hawaii International Conference on System Sciences (CD-ROM), January 3-6, 2007, Computer Society Press.
• Xiaorong Xiang and Greg Madey, “A Semantic Web Services Enabled Web Portal Architecture”, IEEE International Conference on Web Services (ICWS 2004), San Diego, July 2004.
• Xiaorong Xiang and Greg Madey, “Improving the reuse of scientific workflows and their by-products”, International Conference on Web Services (ICWS 2007). Under review.
• Xiaorong Xiang and Eric Lease Morgan, “Exploiting ‘Light-weight’ Protocols and Open Source Tools to Implement Digital Library Collections and Services”, D-Lib Magazine, October 2005, Volume 11, Number 10.
Publications planned
• One journal paper for BMC Bioinformatics
  • Chapter 3, chapter 4, chapter 5
• Future IEEE ICWS proceedings
  • Chapter 6
• Biology journal – TBD
  • Results from using MoGServ
Thank you