
Service Composition in Biomedical Applications

PhD Thesis Proposal

Pedro Lopes
Research Supervisor: José Luís Oliveira


Abstract

The evolution of information technologies has raised many challenges throughout the years. Driven by the exponential growth of the Internet, the most pressing issues now relate to the tremendous amount of information available online. This immense quantity of information increases heterogeneity and makes it overwhelmingly difficult to find information of certified quality. The main strategic approach to this problem is to develop integration applications that offer access to a multitude of distributed online resources in a single workspace.

Integration tasks are often quite complex and require the manual implementation of software tools that connect distinct applications. Hence, efforts must be devoted to the development of protocols and technologies that enable resource description and promote the design of interoperable software. These advances are materialized in the semantic web. This higher level of intelligence in web applications can only be reached if developers adopt standard ontologies and describe their resources correctly.

The research conducted during this doctorate aims to investigate new software integration frameworks and novel implementation strategies that enhance the development of next-generation web applications in the life sciences context.


Table of Contents

Abstract
Table of Contents
Acronym List
1   Introduction
1.1   Objectives
1.2   Structure
2   Background
2.1   Problems and Requirements
2.1.1   Heterogeneity
2.1.2   Integration
2.1.3   Interoperability
2.1.4   Description
2.2   Technologies
2.2.1   Online resource access
2.2.2   Web Services
2.2.3   GRID
2.2.4   Semantic Web
2.3   Summary
3   Approach
3.1   Solutions
3.1.1   Static applications
3.1.2   Dynamic applications
3.1.3   Meta-applications
3.2   Bioinformatics
3.2.1   Databases
3.2.2   Service Protocols
3.2.3   Integration Applications
3.3   Summary
4   Work Plan
4.1   Objectives
4.2   Calendar
4.3   Publications
5   Implications of Research
References

 


Acronym List

API       Application Programming Interface
BPM       Business Process Management
CSS       Cascading Style Sheet
CSV       Comma-Separated Values
DBMS      Database Management System
ESB       Enterprise Service Bus
EU-ADR    European Adverse Drug Reaction Project
FTP       File Transfer Protocol
GEN2PHEN  Genotype-to-Phenotype: A Holistic Solution Project
GUI       Graphical User Interface
HGP       Human Genome Project
HTML      Hypertext Markup Language
HTTP      Hypertext Transfer Protocol
HVP       Human Variome Project
JSON      JavaScript Object Notation
LSDB      Locus-Specific Database
NAS       Network-Attached Storage
OASIS     Organization for the Advancement of Structured Information Standards
OQL       Object Query Language
OWL       Web Ontology Language
OWL-S     OWL Semantics
RDF       Resource Description Framework
REST      Representational State Transfer
RIA       Rich Internet Applications
SAN       Storage Area Network
SAWSDL    Semantic Annotations for WSDL
SOA       Service Oriented Architecture
SOAP      Simple Object Access Protocol
SPARQL    SPARQL Protocol and RDF Query Language
SQL       Structured Query Language
UDDI      Universal Description, Discovery and Integration
URI       Uniform Resource Identifier
VQL       Visual Query Language
W3C       World Wide Web Consortium
WSDL      Web Service Description Language
WSDL-S    WSDL Semantics
WWW       World Wide Web
XML       Extensible Markup Language
XMPP      Extensible Messaging and Presence Protocol
XSD       XML Schema Definition

 


1 Introduction

Computer science has been a constantly evolving field since the middle of the 20th century. More recently, this evolution has been supported by the massification and growing importance of the Internet. World Wide Web innovations led to the appearance of applications and software frameworks (and their collaboration and communication toolkits) such as Google, YouTube, Facebook or Twitter. Maintaining these novel applications requires high computer science expertise and software engineering skills, as they must support millions of users, millions of database transactions and worldwide deployment in real time. The emergence of these web applications caused a shift in the Internet paradigm. Nowadays, qualified technical staff are no longer the only ones able to publish content online: anyone with minimal experience can create a blog, publish a video or connect with friends in a social network. Along with these Web2.0 applications, it is also important to note the appearance of several remarkable software development toolkits that eased and sped up the process of planning, executing, testing and deploying applications online. The main result of this evolution is an immense set of online resources that have to be dealt with more efficiently.

Despite the increase in the quantity of online information, its general quality has decreased. Even using search engines such as Google or Bing, it is very difficult to find information on a particular topic. Focusing on a single topic, such as entertainment, news or art history, we rapidly find several important databases, warehouses and service providers. Hence, the need for integration applications has risen and, consequently, so has the need for interoperable software. Therefore, a new challenge is posed to computer science experts: describe and give context to any resource available online in order to enhance and ease the development of interoperable software. This requires that software engineers invest effort in ontologies and semantic description techniques, which are crucial for the evolution to the next level of the Internet: the semantic and intelligent web.

The advances of the Human Genome Project revolutionized the life sciences research field. This project generated a tremendous amount of genomic data that required software tools for the correct analysis and exploration of the decoded genome sequences. From this demand, bioinformatics was born. The successful efforts of the Human Genome Project originated several other projects that fostered the chaotic appearance of various online databases and services. Subsequently, this resulted in an exponential increase of heterogeneity in the bioinformatics landscape. The Human Genome Project also promoted a synergy between genomics and medicine, where a new level of challenges demanding computer science expertise has arisen.

Computer science plays a key role in the evolution of life sciences research and, accordingly, the life sciences are a perfect scenario for innovation in computer science. Ongoing efforts have the main purpose of taking the computer science expertise gained in web application development and applying it to the development of next-generation bioinformatics web applications. The research conducted in this doctorate envisages the design of a software framework, combined with various implementation strategies, that can prepare the bioinformatics field for the next step in the evolution of web-based applications. To attain this goal, it is necessary to study semantic resource description techniques and the adequacy of service composition as a strategy for the dynamic integration of interoperable software.

1.1 Objectives

The research conducted in this doctorate should, above all, lead to innovative developments in the fields of work and should represent a valuable addition to general knowledge in our areas of interest, mainly computer science and bioinformatics. The main objectives behind this research are as follows.

- Study, analyse and explore the life sciences research field in order to obtain a deep understanding of the problems, challenges, state-of-the-art applications and ongoing research.

- Perform a system and requirements analysis that provides a comprehensive description of the software complexities, required features, data models and implementation details, together with a careful definition of software-related purposes and software evaluation criteria.

- Develop a consistent software framework that fulfils the initial system requirements and features. This framework must encompass several software tools, ranging from desktop to web applications and from databases to remote APIs.

1.2 Structure

This thesis proposal is divided into four distinct sections. Section 2 contains a comprehensive background analysis and the contextualization of this research in the life sciences field, focused on bioinformatics and its inherent problems and emerging requirements, as well as current technologies and protocols.

Section 3 contains a detailed overview of several projects and software frameworks from both the bioinformatics and generic computer science research fields. These solutions are presented as success cases that represent the state of the art in the area.

Next, Section 4 presents the work plan. This work plan is composed of a calendar estimate for the four years of research and our publication goals.

Finally, Section 5 presents some perspectives on the implications of our research for the computer science and bioinformatics fields.

 


2 Background

Bioinformatics is emerging as one of the fastest growing scientific areas of computer science. This expansion was fostered by the computational requirements raised by the Human Genome Project [1]. HGP efforts resulted in the successful decoding of the human genetic code. The history of the HGP starts in the middle of the 20th century with the involvement of the USA Department of Energy in the first studies analyzing the effects of nuclear radiation on human beings. However, it took the DOE about 30 years, until circa 1986, to propel and initiate the Human Genome Project. The ultimate project goals were as bold, audacious and visionary as the NASA Apollo program. The HGP's main goal was to decode the "Book of Life" in its entirety. Moreover, this knowledge would be the basis of a new generation of tools that could identify and analyze a single character change in the sentences that compose this book. Although the HGP was an ambitious project, results appeared sooner than expected. This was the outcome of a deep collaboration with computer scientists, which leveraged the deployment of novel software and hardware tools that aided biologists' sequence decoding tasks. This joint effort between two large research areas, the life and computer sciences, gave birth to a new discipline denominated bioinformatics.

The Human Genome Project brought about a variety of benefits in several fields. Remarkable discoveries in sequence decoding fostered DNA forensics, genetic expression studies and drug advances, and improved several other fields such as molecular medicine, energy and environment, risk assessment or bioarchaeology/evolution studies. At the fortunately premature ending of the Human Genome Project, the availability of the human genome and other genome sequences had revolutionized all biomedical research fields [2]. Several projects were started, riding on the HGP's success. These new projects use scientific discoveries and technological advances generated by the HGP in heterogeneous scenarios to obtain new relevant information. On one hand, we have smaller projects, focused on specific genomic research [3, 4]. On the other hand, we have larger projects that span several institutions and cross various physical borders. One of these projects is the Human Variome Project [5, 6]. The HVP follows directly in the HGP's footsteps and envisages complementing the latter's discoveries with a new level of knowledge that is both wider (covering more life sciences topics) and deeper (more detail in each topic).

The HVP's main goals reflect the computational advances originated in the HGP. The life sciences goals are tied to software and hardware developments, with particular focus on web-based applications and distributed infrastructures. The general HVP goal is to collect and curate all human variations – changes in our genetic code – associated with human diseases developed from specific mutations. This wide purpose is composed of smaller goals that focus on the development of software tools to aid this process and on a set of guidelines and protocols to promote active developments in this field. These dynamic developments are only possible with new communication and collaboration tools that can help break physical barriers between work groups and logical barriers between scientific areas such as biomedicine and computer science.

At a European scale, there are also major projects with ongoing research in the life sciences. Projects such as EU-ADR or GEN2PHEN have a strong involvement from the computer science community, and their final outcome is intended to be a large collection of software frameworks and applications. With these contemporary projects, we are witnessing a growing intertwining of computer science and life science research. From the information technologies point of view, biology and biomedicine pose several challenges that will require a modernization of software applications and the progressive development of novel application strategies. That is, bioinformatics is the perfect real-world scenario to nurture progress in various computer science research areas, triggering the resolution of problems related to the heterogeneity, integration, interoperability and description of online resources.

The challenge

Our research is directly connected to the Genotype-to-Phenotype: A Holistic Solution project (GEN2PHEN). This project is focused on the development of tools that will connect online life sciences resources containing information spanning from the genotype – the human genetic sequences – to the phenotype – visible human traits such as hair colour or the predisposition to a specific disease. Implicit in this purpose is the improvement of personalized medicine. This research field was born with the Human Genome Project and is sustained by areas like gene sequencing and expression, genotyping, SNP mapping and pharmacogenomics (Figure 1). Personalized medicine is focused on the selection of the most adequate treatment for a patient according to their clinical history, physiology and genetic profile and the molecular biology of the disease [7].

Figure 1 – Personalized Medicine applications aim to integrate data from various distinct life sciences research topics

In the future, personal electronic health records (EHR) may also contain the genetic information required for a fitting treatment. Two research directions will generate the data that will feed EHRs in the future: pharmacogenomics and gene expression. Pharmacogenomics studies variability in drug response, which comprises drug absorption and disposition, drug effects, drug efficacy and adverse drug reactions [8]. Gene expression profiling of diseases provides new insights into the classification and prognostic stratification of diseases based on molecular profiles originated in microarray research [9, 10]. Both these fields will generate a tremendous amount of heterogeneous data that needs to be integrated accurately into diverse systems. This data is made available through various types of online resources. Connecting these online resources, whether public databases, services or simply static files, increases the complexity of the implicit integration tasks. The issues that arise revolve around heterogeneity, integration and interoperability. Solving these problems is not trivial and, despite the fact that there are several ongoing research projects in this area, computer science researchers have not yet discovered an optimal solution.

Current research trends are focused on using semantic resource descriptions to empower autonomous communication between heterogeneous software. This approach mainly adopts service composition strategies to improve integration and interoperability in existing software frameworks. Service composition (and, subsequently, service oriented architectures) has proven to be the ideal scenario to make concrete the benefits obtained from semantic resource description, as it provides a solid foundation for standardized communication between distinct software elements. Figure 2 shows a general overview of this doctorate's research work. The main practical purpose is to enhance the development process of the Rich Internet Applications (RIA) that will accomplish the GEN2PHEN project goals. The GEN2PHEN objectives regarding integration and interoperability between online resources can be encompassed in a more generic computer science category. These problems are not specific to the life sciences area; they are common to several research areas that require, in some manner, involvement from the computer science community. To achieve the desired levels of integration and interoperability, we will focus our research on the study and improvement of service composition scenarios. Service orchestration, service choreography and mashups (particularly workflows) will be studied in detail. As previously mentioned, the inclusion of semantic resource descriptions is crucial to the successful creation of service composition strategies and, therefore, this area will also be covered thoroughly during the conducted research.
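To make the notion of service composition more tangible, the following sketch chains two web service calls in sequence, feeding the output of the first into the second. It is only an illustrative Java sketch: the endpoint URLs, parameters and response handling are hypothetical placeholders, not part of GEN2PHEN or of any existing service.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    // Minimal service composition sketch: the output of one REST service
    // becomes the input of the next one (a two-step "workflow").
    public class CompositionSketch {

        // Generic helper: perform an HTTP GET and return the raw response body.
        static String get(String url) throws Exception {
            HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
            con.setRequestMethod("GET");
            StringBuilder body = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(con.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line).append('\n');
                }
            }
            return body.toString();
        }

        public static void main(String[] args) throws Exception {
            // Step 1: resolve a gene symbol to an identifier (hypothetical endpoint).
            String geneId = get("http://example.org/genes?symbol="
                    + URLEncoder.encode("BRCA1", "UTF-8")).trim();

            // Step 2: use that identifier to query a second, independent service.
            String variants = get("http://example.org/variants?gene="
                    + URLEncoder.encode(geneId, "UTF-8"));

            System.out.println(variants);
        }
    }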

 

Figure 2 – Relations between the requirements arising from the GEN2PHEN project goals and the computer science concepts required to fulfil them


2.1 Problems and Requirements

Research in the life sciences area poses many problems and requirements. Among these, the key set is composed of four distinct topics: heterogeneity, integration, interoperability and description. Heterogeneity is related to the marked differences between the access methods of the various types of online resources. Integration refers to the centralization and publication of distributed resources in a single entry point. Interoperability is seen as the ability of a piece of software to communicate autonomously with external software. To overcome these problems, we need to rely on novel techniques, mostly semantic resource description strategies and ideas. Next, these topics are explained in detail in a generic computer science context.

2.1.1 Heterogeneity

Online resource heterogeneity has been one of the main research problems studied in the last few years. Moreover, its relevance keeps growing, triggered by the constant evolution of the Internet and the increasing ease of publishing content online. We can classify online resource heterogeneity into five distinct groups, varying according to the complexity of solving them and their involvement at the hardware/software levels (Figure 3).

Hardware-related issues arise when dealing with physical data storage (Figure 3 – 1). For instance, in a medical image integration application, it may be required that image backups stored on tapes are integrated into the system, as well as the images on the main facility's web server. In this scenario, the integration setup would be considerably complex: the implementation would have to encompass both tape access methods and web server access methods, which are quite distinct. Another complex integration scenario would involve integrating information that is available on a company's FTP server and on its Storage Area Network (SAN) or Network-Attached Storage (NAS) facility. Once again, the solution to this problem would have to encompass distinct information access methods in a single environment, therefore increasing the overall difficulty of implementing such a system.

When dealing with file access in any storage facility, we may have logical storage problems (Figure 3 – 2). Content can be stored in a relational database, a simple text file or a binary file, among others. Hence, these formats are accessed through entirely different interfaces. For instance, to integrate data stored in a Microsoft SQL Server 2008 database into a Java application, one would need the most recent JDBC connector. If, in addition to this scenario, we required a connection to a MySQL database, we would need to add a new connection driver and implement several distinct methods. We can further grow the complexity of this system by adding content that is stored in a binary file. This would require a new set of access methods, completely different from the relational database ones, resulting in a scenario of great complexity that requires a large collection of methods.
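As a small illustration of this point, the sketch below opens connections to the two database engines mentioned above through JDBC. Even with a common API, each engine needs its own driver and connection URL (the database, table and column names used here are hypothetical), and adding a binary or flat-file source would fall outside JDBC altogether.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Two relational sources, two different JDBC drivers and URL syntaxes.
    public class LogicalStorageSketch {
        public static void main(String[] args) throws Exception {
            // Microsoft SQL Server: requires the vendor's JDBC driver on the classpath.
            Connection sqlServer = DriverManager.getConnection(
                    "jdbc:sqlserver://localhost:1433;databaseName=clinical",
                    "user", "password");

            // MySQL: a second driver, a different URL format.
            Connection mysql = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/genomics", "user", "password");

            // The query API is shared, but schemas and SQL dialects still differ.
            try (Statement st = mysql.createStatement();
                 ResultSet rs = st.executeQuery("SELECT id, symbol FROM gene")) {
                while (rs.next()) {
                    System.out.println(rs.getInt("id") + " " + rs.getString("symbol"));
                }
            }

            sqlServer.close();
            mysql.close();
        }
    }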

The next level where heterogeneity can be a problem is the data format level (Figure 3 – 3). Data stored in the same physical format can be stored with a distinct syntax. Although the evolution of programming languages has improved access to distinct file formats, reading a simple text file or an HTML file are operations that require different strategies and methods. A simple scenario could be the integration of several accounting results offered as CSV files, Excel files and tabular text files. To successfully integrate these files, developers must implement distinct access methods for the three logical formats.
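The following fragment sketches this situation under simple assumptions: two of the hypothetical accounting files are plain text with different delimiters and can be parsed with a few lines each, while the Excel file would require a dedicated library (for example Apache POI), that is, yet another access method.

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;

    // Same logical content, three syntaxes: CSV, tab-separated text and Excel.
    public class DataFormatSketch {

        // Parse a delimited text file into rows of fields.
        static void parseDelimited(String path, String delimiter) throws Exception {
            List<String> lines = Files.readAllLines(Paths.get(path));
            for (String line : lines) {
                String[] fields = line.split(delimiter);
                System.out.println(String.join(" | ", fields));
            }
        }

        public static void main(String[] args) throws Exception {
            parseDelimited("results.csv", ",");    // comma-separated values
            parseDelimited("results.txt", "\t");   // tabular text file

            // An Excel workbook (results.xls) cannot be read this way at all:
            // it is a binary format and needs a library such as Apache POI,
            // which means a third, unrelated set of access methods.
        }
    }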

Moving deeper into the software layer, we reach the data model level (Figure 3 – 4). At this level, heterogeneity issues arise when files are structured differently or do not obey the same ontology. The difficulty of solving this issue was greatly reduced with the appearance of the XML standard, on which most modern applications rely. Despite having normalized the process of reading and storing information, XML allows an infinite number of valid distinct structures, which differ from application to application. The simple scenario of describing a person's name can result in several hierarchical configurations: we can have the element name with two sub-elements, first and last, or we can just define an element fullname. If we also wanted to store the person's initials or nickname, the number of solutions would be even greater for such a small piece of information. Similarly, the same concept may be stored in equally diverse ways in a relational database: despite the fact that the logical storage is the same, the storage model may be different, requiring relation and concept mappings that must be developed and implemented by researchers. Considering that we are integrating data for a well-defined scientific topic, that topic probably has one or more ontologies that define logical structures and relations between the elements of a thesaurus. The issues that arise in this specific scenario are driven by the fact that there is no single ontology for a specific area. Usually, there are several ontologies that define the same content in distinct and non-interoperable manners. Once again, heterogeneity has to be solved with information and relation mappings that can correctly transpose information structured according to ontology A into ontology B. These mappings are quite complex and traditionally require some kind of human effort to succeed.
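The person-name example can be made concrete with a small mapping routine. The sketch below assumes two hypothetical documents that use either a name element with first/last sub-elements or a single fullname element, and normalizes both into one canonical string; real schema and ontology mappings follow the same pattern, only at a much larger scale.

    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import java.io.ByteArrayInputStream;

    // Two valid XML structures for the same concept, mapped to one canonical form.
    public class DataModelSketch {

        static String extractName(String xml) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));

            NodeList full = doc.getElementsByTagName("fullname");
            if (full.getLength() > 0) {                       // structure B
                return full.item(0).getTextContent();
            }
            String first = doc.getElementsByTagName("first").item(0).getTextContent();
            String last = doc.getElementsByTagName("last").item(0).getTextContent();
            return first + " " + last;                        // structure A
        }

        public static void main(String[] args) throws Exception {
            String a = "<name><first>Pedro</first><last>Lopes</last></name>";
            String b = "<person><fullname>Pedro Lopes</fullname></person>";
            System.out.println(extractName(a));  // Pedro Lopes
            System.out.println(extractName(b));  // Pedro Lopes
        }
    }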

Finally, we address access method heterogeneity (Figure 3 – 5). Web services have evolved into the primary method for remote data access, with standard protocols and data exchange formats. Nevertheless, web services may be divided into HTTP web services (REST or SOAP) and XMPP web services (with the IO Data extension). REST web services are much simpler: they can be easily accessed through HTTP requests and may expose data in a customized format (which can differ from application to application). On the other hand, SOAP web services rely on WSDL to perform data exchanges. This means that an application using this strategy must follow the applicable standards, resulting in more entangled underpinnings. XMPP web services are based on the Extensible Messaging and Presence Protocol, a protocol for message exchange widely used by instant messaging applications. These three types of services are explained in detail further in this document. In addition, the remote APIs may be implemented in distinct languages, and the platform may be required to merge content from both local and remote data sources. The resulting scenario involves the development of separate sets of methods and strategies.
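To illustrate the contrast, the fragment below calls a hypothetical REST endpoint with nothing more than the standard HTTP classes, whereas the commented-out SOAP path would typically go through stubs generated from the service's WSDL (for instance with JAX-WS tooling); the endpoint and service names are placeholders.

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Scanner;

    // Access-method heterogeneity: a REST call needs only plain HTTP,
    // while a SOAP call is usually mediated by WSDL-generated classes.
    public class AccessMethodSketch {
        public static void main(String[] args) throws Exception {
            // REST: one HTTP GET against a (hypothetical) resource URL.
            URL url = new URL("http://example.org/api/variants?gene=BRCA1");
            HttpURLConnection con = (HttpURLConnection) url.openConnection();
            try (InputStream in = con.getInputStream();
                 Scanner sc = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
                System.out.println(sc.hasNext() ? sc.next() : "");
            }

            // SOAP (sketch only): the client would be generated from the WSDL,
            // e.g. with the JAX-WS wsimport tool, and then invoked as:
            //   VariantService service = new VariantService();
            //   VariantPort port = service.getVariantPort();
            //   port.getVariants("BRCA1");
        }
    }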


 

Figure 3 – Content heterogeneity organization according to hardware/software dependence and complexity

In summary, resource heterogeneity raises many difficulties in the development of novel information integration platforms. These issues can only be solved with some kind of human effort and require particular resource integration and interoperability strategies, which are detailed further in this document.

2.1.2 Integration

To deal with resource heterogeneity issues, or simply to centralize large amounts of distributed data in a single system, researchers have to develop state-of-the-art resource integration architectures. The main goal of any integration architecture is to support a unified set of features working over disparate and heterogeneous applications. These architectures will always require the implementation of several methods to access the integrated data sources. The heterogeneity may be located at any of the previously presented levels, which include software and hardware platforms, diversity of architectural styles and paradigms, content security issues or geographic location. In addition to these technical restrictions on integration, there are also other hindrances such as enterprise/academic boundaries or political/ethical issues. Whether we are simply dealing with the integration of a set of XML files or with distributed instances of similar databases, the concept of resource integration will generically rely on hard-coded coordination methods to centralize the distributed information or to give the impression that the data is centralized.

Several strategies for data integration can be used (Figure 4). These approaches differ mostly in the amount and kind of data that is merged into the central database. Different architectures will also have a different impact on application performance and efficiency.

Warehouse solutions (Figure 4 – A) consist of the creation of a large database that contains data gathered from several resources. The central, larger database – the warehouse – may consist of a mesh of connected repositories that the data access layer sees as a single database. The Database Management System (DBMS) is responsible for the management and maintenance of the warehouse. In terms of implementation, this model requires that a mapping is made from each data source to the central warehouse data model. Next, the content is moved entirely from its source to the new location. The final result is a new data warehouse in which the content from the integrated data sources is completely replicated.

This model raises several problems in terms of scalability and flexibility: the warehouse's size can grow exponentially and each database requires its own integration schema. This means that, for each distinct database, developers have to create a new set of integration methods, resulting in a very rigid platform. Despite these issues, this technique is very mature and a considerable amount of work has already been done to improve warehouse architectures. Nowadays, the debate is focused on enhancing warehouse integration techniques [11] and solving old problems with state-of-the-art technologies [12, 13].

Another widespread strategy involves the development of mediators – a middleware layer – connecting the application to the data sources. This middleware layer enables a dynamic customization of the user queries performed in the centralized entry point, extending their scope to several databases previously modelled into a new, virtually larger database. Kiani and Shiri [14] describe these solutions, and a good example is DiscoveryLink [15]. Mediator-based solutions are usually constrained by data processing delays: they require real-time data gathering, which can be bottlenecked by the original data source. Additionally, the gathered content also has to be processed to fit the presentation model, hence compromising even more the overall efficiency of the system represented in Figure 4 – B.
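A mediator can be sketched as a thin layer that fans one query out to source-specific wrappers and merges the answers at query time, which is exactly where the real-time bottlenecks mentioned above come from. The interface and the two wrappers below are hypothetical placeholders, not a reference to any concrete system.

    import java.util.ArrayList;
    import java.util.List;

    // Minimal mediator sketch: one query, several source wrappers, merged results.
    public class MediatorSketch {

        // Each integrated source hides its own access method behind this interface.
        interface SourceWrapper {
            List<String> query(String term);
        }

        static class RelationalWrapper implements SourceWrapper {
            public List<String> query(String term) {
                // In a real system: translate 'term' into SQL and run it via JDBC.
                return List.of("db:" + term);
            }
        }

        static class WebServiceWrapper implements SourceWrapper {
            public List<String> query(String term) {
                // In a real system: call a remote API and parse its response.
                return List.of("ws:" + term);
            }
        }

        // The mediator itself: no data is copied, everything is fetched on demand.
        static List<String> mediate(String term, List<SourceWrapper> sources) {
            List<String> merged = new ArrayList<>();
            for (SourceWrapper source : sources) {
                merged.addAll(source.query(term));   // latency adds up per source
            }
            return merged;
        }

        public static void main(String[] args) {
            System.out.println(mediate("BRCA1",
                    List.of(new RelationalWrapper(), new WebServiceWrapper())));
        }
    }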

Finally, link-based resource integration (Figure 4 – C) consists of aggregating, in a single platform, links to several relevant resources throughout the Web. This is the most widely used integration model due to the simplicity of collecting and showing links related to a certain subject. However, inherent in this simplicity are several drawbacks, especially the limitations imposed by the fact that there is no real access to the data, only to its public URLs. Most modern resources are dynamic, which means that content may be generated in real time. Also, in the scientific research area, new data emerges daily. Therefore, the system requires constant maintenance in order to stay up to date with the area's novelties.

The link-based integration strategy has the major drawback of restraining access to the original resources. The integration application will act as a proxy for the integrated resources; therefore, it will hide the original resource. Without this access, the range of features that can be implemented and the scope of the features offered to users are reduced. Entrez [16] is a link-based integration application that does not show the integrated resources directly: to access the original content source, users must analyze the data and follow a hyperlink highlighted in certain identifiers. DiseaseCard's approach [17] is more direct: users can view and navigate the original resources' interfaces, which are part of DiseaseCard's layout.

 

Figure 4 – Data integration models categorized according to their relation with the integration application and the integrated online resources

Despite the fact that these approaches cover almost all possible solutions for data integration, many problems have not yet been solved. Figure 5 shows a comparison between the data integration models, highlighting the main advantages and disadvantages of each. After a careful analysis of these models, we can conclude that the best option is to create a hybrid solution that is capable of coping with the main disadvantages of the three strategies while taking advantage of their main benefits as well. Arrais studied this scenario and used this strategy for the integration of heterogeneous data sources in GeneBrowser [18].

 

Figure 5 – Comparison between the studied resource integration models highlighting the respective advantages and disadvantages

The development of hybrid approaches has gained momentum in recent years, especially with the introduction of novel data access techniques such as remote APIs in the form of web services. This trend consists in making resources available as services that can be executed by any other software system. Service oriented architectures (SOA) rely on a paradigm shift in integration application development, based on "everything-as-a-service" ideals. These ideals state that anything, whether a simple database access or a complex mathematical equation, can be requested or solved through a mere access to a predefined URI. These are the main principles that define service-oriented architectures [19, 20]. In a SOA, any kind of software module can be considered a service and be integrated into any kind of external application through the definition of a standardized communication protocol.
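The "everything-as-a-service" idea can be illustrated with a few lines using the JDK's built-in HTTP server: a trivial computation is exposed behind a fixed URI so that any external application can invoke it with a plain HTTP request. The path and port are arbitrary choices for the example, not a prescribed interface.

    import com.sun.net.httpserver.HttpServer;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;

    // "Everything as a service": a simple computation exposed behind a predefined URI.
    public class EverythingAsAServiceSketch {
        public static void main(String[] args) throws Exception {
            HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);

            // GET http://localhost:8080/reverse?value=BRCA1 -> "1ACRB"
            server.createContext("/reverse", exchange -> {
                String query = exchange.getRequestURI().getQuery();      // e.g. value=BRCA1
                String value = query == null ? "" : query.replaceFirst("^value=", "");
                byte[] body = new StringBuilder(value).reverse().toString().getBytes("UTF-8");

                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });

            server.start();   // any client, in any language, can now call this URI
        }
    }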

Regardless of the strategy chosen to integrate a collection of heterogeneous resources, there are several concerns that should be taken into account [21]: application coupling, intrusiveness, technology selection, data format and remote communication.

Integration ecosystems often require that distinct applications call each other. Despite being disguised as local calls by the integration engine, these remote calls, available in the majority of programming languages, are very different because they resort to network capabilities. The traditional (and erroneous) distributed computing assumptions – zero network latency or secure and reliable communication – must be measured and shunned. Remote communication concerns are reduced with the adoption of asynchronous communication techniques and support for handling communication errors, thus reducing susceptibility to network errors.
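One common way to avoid the "reliable, zero-latency network" assumption is to make the remote call asynchronous and guard it with an explicit timeout, as in the minimal sketch below (the fetchRemote call stands in for any of the remote invocations discussed here).

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    // Asynchronous remote call with an explicit timeout and error handling.
    public class AsyncCallSketch {

        // Placeholder for a real remote invocation (web service, RPC, ...).
        static String fetchRemote(String query) throws Exception {
            Thread.sleep(200);              // simulated network latency
            return "result for " + query;
        }

        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(2);
            Future<String> pending = pool.submit(() -> fetchRemote("BRCA1"));

            try {
                // The caller keeps control and decides how long it is willing to wait.
                System.out.println(pending.get(2, TimeUnit.SECONDS));
            } catch (TimeoutException slowNetwork) {
                pending.cancel(true);       // give up instead of blocking forever
                System.err.println("Remote source did not answer in time");
            } finally {
                pool.shutdown();
            }
        }
    }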

One of the main integration issues is definitely the data format. Integrated applications must adopt a unified data format. Traditionally, this requirement is impossible to fulfil because some of the integrated data sources are closed or considered legacy. In these scenarios, the solution consists in creating a translator that maps the distinct data formats to a single model. In this case, issues may arise when data formats evolve or are extended.

Intrusiveness should be one of the main concerns when developing integration applications. The integration process should not impose any modifications on the constituent applications. That is, the integration strategy should operate without any interference in the existing applications, and both the integrator and the integrated application should be completely independent. Nevertheless, major changes are sometimes necessary to increase integration quality.


Application coupling is directly connected with intrusiveness. A good software development practice is "low coupling, high cohesion" [22], and this ideal is also applicable to integration strategies. High coupling results in applications that depend heavily on each other, reducing the possibility of each application evolving individually without affecting the others. The optimal result would be resource integration interfaces that are specific enough to implement the desired features and generic enough to allow changes to be implemented as needed.

The implicit complexities that arise when dealing with online resource integration require large efforts and expertise to overcome. Like any other issue, integration is firmly tied to the scientific area in question and, with this in mind, the adopted strategy or model must take into account several variables present in the environment where it will be implemented.

2.1.3 Interoperability

Along with integration comes interoperability. Integration deals with the development of a unified system that includes the features of its constituent parts. Interoperability, on the other hand, deals with single software entities that can be easily deployed in future environments. This means that interoperability is a software feature that facilitates integration and collaboration with other applications. ISO/IEC 2382-01, Information Technology Vocabulary, Fundamental Terms, defines interoperability as follows: "The capability to communicate, execute programs, or transfer data among various functional units in a manner that requires the user to have little or no knowledge of the unique characteristics of those units".

Interoperable systems can access and use parts of other systems, exchange content with other systems and communicate using predefined protocols that are common to both systems. This interoperability can be achieved at several distinct levels, as pointed out in Tolk's work [23]. For our research, the essential levels are the ones that encompass syntactic and semantic interoperability.

Syntactic software interoperability can be defined as the degree to which multiple software components can interact regardless of their implementation programming language or software/hardware platform. Syntactic software interoperability may be achieved with data type and specification level interoperability. Data type interoperability consists in distributed and distinct programs supporting structured content exchanges, whether through indirect methods – writing to the same file – or direct methods – an API invoked inside a computer or through a network. Specification level interoperability encapsulates knowledge representation differences when dealing with abstract data types, thus enabling programs to communicate at higher levels of abstraction – at the web service level, for instance.

Semantics is a term that usually refers to the meaning of things. In practice, semantic metadata is used to specify the concrete description of entities. These descriptions and their relevance are detailed further in this document. In short, they intend to provide contextual details about entities: their nature, their purpose or their behaviour, among others. Hence, semantic software interoperability represents the ability of two or more distinct software applications to exchange information and to understand the meaning of that information accurately, automatically and dynamically. Semantic interoperability must be prepared in advance, at design time, with the purpose of predicting the behaviour and structure of the interoperable entities.

According to Tolk, there are seven distinct levels of interoperability, measured in the "Levels of Conceptual Interoperability Model" (Figure 6).


 

Figure 6 – Levels of conceptual interoperability model defined by Tolk

The highest level of interoperability is only attained when access to content and the usage of that content are completely automated. This is only possible when programming and messaging interfaces conform to standards with a consistent syntax and format across all entities in the ecosystem.

Level 0 interoperability defines a stand-alone, independent system with no interoperability. Level 1 defines technical interoperability, characterized by features such as the existence of a communication protocol that enables the exchange of information at the lowest digital level allowed: bits. Level 2 interoperability deals with a common structure – a data format – for information exchange. Level 3 is achieved when a common exchange reference model exists, thus enabling meaningful data sharing. Level 4 can be reached when the independent systems are aware of the methods and procedures that each entity in the environment is using. Level 5 interoperability deals with the comprehension of the state changes that occur in the ecosystem over time and the impact of these changes – at any level – on the system. Level 6 is achieved when the assumptions and constraints of the meaningful abstraction of reality are aligned. Conceptual models must be based on engineering methods, resulting in a "fully specified, but implementation independent model" [24].

2.1.4 Description

After analyzing the problems regarding the integration of heterogeneous online resources, it is crucial to move our study to the solutions tested so far. The most extensively tested and applied solution for dealing with the integration and interoperability issues is resource description: the semantic web [25].


Any scientific research field deals with specific terminology that is associated with that particular area. For instance, researchers working on ancient history have a thesaurus of terms that is completely different from the one used in medicine: on one hand we have symbols, kings, religions and wars, and on the other hand we have diseases, symptoms or diagnostics. Therefore, it is of the utmost importance that researchers are aware of the ontology used in their research area. An ontology [26] defines the collection of terms, and the relations between terms, that are most adequate for a given topic. These relations, often designated axioms, establish connections between terms in the thesaurus that mimic the real world. For instance, in history studies, there could be a definition between the terms King and Prince stating that a Prince is a son of a King. There can be an immense number of axioms relating terms, and by spreading these relations between terms we can define an ontology.

Ontologies are the basis for the enhancements proposed by the semantic web. The semantic web's main goal is to enable autonomous interoperation between machines, based on the description of the content and services that are available on the Internet (Figure 7). Web1.0 established the Internet as a set of Producers generating content for a large number of Consumers. The majority of the resources available online were created by a small group of technical staff entirely dedicated to web development and to adapting existing company strategies to the Internet era. With Web2.0 we have witnessed a shift in the Producer-Consumer relation that dominated the Internet. Nowadays, Internet content is mostly published by end-users who previously were only Consumers. The frontier between the Consumer and Producer roles is blurred, as it is getting easier to publish content on the web thanks to Web2.0 tools such as blogs, micro-blogs, media-sharing applications or social networks. In the future, the semantic web – the intelligent web – will include specific software that will analyze user-generated content and messages between users, searching for contextual information and improving everyday online tasks.

In order to make the semantic web possible, developers and researchers need to cooperate to achieve several goals. It is important to create and disseminate centralized ontologies for several areas of public interest and to promote the adoption of these ontologies by research groups and private companies. Nevertheless, this crucial step can only be taken if the cooperation efforts originate enhanced semantic technologies that ease the complex task of describing content. A description of these technologies is given further in this document. Whatever group of technologies we choose, the critical aspect is that developers must adapt their applications and prepare their research for semantic integration. Research groups working with state-of-the-art technologies must promote this difficult step, which will require deep changes in the developed applications. Only by promoting this usage can we foster the development of a new, cleverer Internet.
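As a flavour of what describing a resource means in practice, the sketch below builds a tiny RDF description with the Apache Jena library (one possible choice, not one prescribed by this proposal); the URIs and the mini-ontology are invented for the example.

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Resource;
    import org.apache.jena.vocabulary.RDF;
    import org.apache.jena.vocabulary.RDFS;

    // A minimal semantic description of an online resource as RDF triples.
    public class ResourceDescriptionSketch {
        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel();
            String ns = "http://example.org/ontology#";   // invented ontology namespace

            // Describe a (hypothetical) locus-specific database as a typed resource.
            Resource lsdb = model.createResource("http://example.org/resources/lsdb-42");
            lsdb.addProperty(RDF.type, model.createResource(ns + "LocusSpecificDatabase"));
            lsdb.addProperty(RDFS.label, "Example LSDB");
            lsdb.addProperty(model.createProperty(ns + "describesGene"), "BRCA1");
            lsdb.addProperty(model.createProperty(ns + "accessProtocol"), "REST");

            // Serialize the description so that other applications can consume it.
            model.write(System.out, "TURTLE");
        }
    }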

 


 

Figure 7 – Evolution of the Internet according to the improvements in the communication paradigms used

2.2 Technologies

We have described the main issues that arise when one wishes to develop centralized user interfaces delivering access to a wide range of online resources in a single environment. Resource heterogeneity, integration, interoperability and description often complicate developers' tasks and delay innovation in this research area.

To overcome these issues, we can resort to several strategies that rely on distinct technologies. The evolution of these technologies has gained focus over the last few years, and their close relationship with the World Wide Web has proven to be extremely profitable. We can organize these technologies into four distinct levels, according to the dependencies among them (Figure 8). Online resource access services are located at the bottom, as they allow direct access to resources and empower developers to create the next level, which comprises web services, an indirect data access method. GRID technologies use web services and/or data access strategies to enhance both computing power and data access capabilities. Semantic technologies promote resource description; hence, they enable the fulfilment of the integration and interoperability requirements.

 


 

Figure 8 – State-of-the-art technologies, concepts and their respective dependencies: online resource access (command line: SQL, XQuery; web based: ASP, JSP, AJAX; interactive GUI), web services (remote APIs/DAS, WSDL + SOAP + UDDI, REST, XMPP), GRID (computing: high processing power; data: large distributed datasets) and Semantic Web (URI + RDF + OWL + SPARQL, microformats, resource description)

2.2.1 Online resource access

Online resource access services are responsible for encapsulating online resources and making them available to other systems. They allow access to local or relational databases, tools, file systems or any other kind of external storage. Services should be encapsulated using wrappers to enhance and ease integration and interoperability. With this in mind, it is important to take some generic concerns into account.

Performance is a crucial concern, especially because the end-users of the system will want fast and responsive applications regardless of the operation they are executing. Performance can be optimized by reducing the amount of data that is sent across the network or by minimizing query interdependence, thus reducing latency.

Usability is also essential in any modern system. Expressiveness should be sufficient to allow users to pose almost any query to the system. Usability and expressiveness depend on metadata. Metadata should be carefully selected and constrained to the minimal information necessary to interpret what the wrapper is encapsulating.

Researchers constantly deal with data located in distinct geographic locations, and most of the time they access this data from different computers. Preparing integration systems for distributed and remote access is therefore very important: access to distributed resources has to be transparent, and remote access to resources must be possible without complications.

Resource access services are exposed to end-users in three distinct flavours: command-line access, web-based access and interactive visual access. Whether we are dealing with a database or an FTP server, there is always a text-based query language allowing access to resources. This language provides users with command-line access to both the resource structure and the resource itself. There are several examples of command-line access languages: SQL for relational


databases, OQL for object-oriented databases, XQuery for XML databases, or shell scripting in Linux. The main problem with these languages is the learning curve they impose on end-users. They were meant to be used by developers: the resource structure is hidden, and the resource organization must be known before a query can be formulated.
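As a minimal illustration of this kind of text-based access, the following Python sketch queries a small relational database with SQL; the table, columns and values are hypothetical placeholders, and the same declarative query could be typed directly at a database prompt.

```python
import sqlite3

# In-memory database standing in for a hypothetical local variant store.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE variant (gene TEXT, dbsnp_id TEXT)")
connection.execute("INSERT INTO variant VALUES ('BRCA2', 'rs0000001')")

# The same declarative, text-based query a user would type at an SQL prompt;
# formulating it requires knowing the table and column names beforehand.
for gene, dbsnp_id in connection.execute(
        "SELECT gene, dbsnp_id FROM variant WHERE gene = ?", ("BRCA2",)):
    print(gene, dbsnp_id)
```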

Accessing online resources through web-based methods is an attempt to overcome the main problems of command-line usage. Advances in the World Wide Web promoted and eased access to online resources and empowered the creation of novel applications. These applications provide query forms that users can fill in, and remote APIs that they can execute, in order to gain access to a resource. For instance, with web access, and if the original application provides the correct set of features, it is possible to access almost all the content available in a database. Nevertheless, this type of access is not perfect. Developers can hide the internal resource organization and offer only limited access to some datasets, according to the system's goals. This, in turn, limits the amount of information users can extract, as they can only access content that has explicit access methods. That is, end-users do not have the freedom to create their own data queries and cannot view the resource structure; they only get in contact with small views of a larger system.

The lesser-known resource access methods are interactive GUIs. These applications rely on a visual construction of queries, using a visual query language (VQL), which enables access to resources. Query by Example [27] was the first VQL proposed to query relational databases. Access queries are created by assembling distinct blocks, like LEGO pieces. These blocks represent tables, methods or constraints that are arranged in a logical visual order to mimic the access to the resource. Once again, this is a very restricted resource access method, mostly because it depends on a small set of blocks that can be arranged in a limited number of ways.

2.2.2 Web Services

Web services [28] are nowadays the most widely used technology for the development of distributed web applications. The World Wide Web Consortium (W3C) defines a web service as "a software system designed to support interoperable machine-to-machine interaction over a network" [29]. This wide definition allows us to consider as a web service any kind of Internet-available service, as long as it enables machine-to-machine interoperability. Despite this all-embracing definition, we can divide existing web services into two main groups: HTTP-based web services, which encompass both generic web services following the W3C's and OASIS's standards and application-specific REST web services, and XMPP-based web services.

REST web services are a minority in the web interoperability world, although they are emerging as a viable alternative to standardised web services. REST web services consist of simple web applications that respond to requests made to an HTTP URL. Developers can configure the response to be HTML, XML, JSON, CSV or, most simply, free text. With REST web services, the response structure and its inner format do not matter; the essential requirement is that the exchanged messages are understood by both parties in the exchange. This makes REST web services a lightweight and highly customizable approach for exchanges between machines [30]. Nevertheless, although this approach is more attractive to developers, it still lacks the robustness of a standards-based strategy.
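A minimal sketch of this kind of exchange is shown below, using only the Python standard library; the endpoint, parameters and JSON response format are hypothetical and stand in for any REST-style service.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical REST endpoint returning JSON; the same pattern applies to XML,
# CSV or free-text responses, as long as both sides agree on the format.
BASE_URL = "http://example.org/api/genes"
params = urllib.parse.urlencode({"symbol": "BRCA2", "format": "json"})

with urllib.request.urlopen(f"{BASE_URL}?{params}") as response:
    record = json.loads(response.read().decode("utf-8"))

print(record)
```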

Standardised web services have the main purpose of providing a unified data access interface

and a constant data model of the data sources. Simple Object Access Protocol (SOAP) [31], Universal Description, Discovery and Integration (UDDI) [32] and the Web Services Description Language (WSDL) [33] are the standards currently in use, and they define machine-to-machine interoperability at all levels, from the data transport protocol to the query languages used. Web service interoperation occurs among three different entities: the service requester, the service broker and the service provider (Figure 9). When a piece of software wants to access a web service, it contacts the service broker in order to search for the service that best fits its needs. The service broker is in constant communication with the service provider and provides the service requester with the data it needs to establish a direct communication with the service provider; this communication with the service broker is carried out by exchanging WSDL configurations. Once the service requester knows which service to reach, it initiates a conversation with the service provider, exchanging the necessary messages in the SOAP format.

 

Figure 9 – Web Service interaction diagram

SOAP is a protocol used on top of the traditional HTTP protocol and specifies the structure of the information exchanges used in the implementation of web services over computer networks. The message formats are defined in XML, and the protocol relies on underlying protocols for message negotiation and transmission. The SOAP standard defines a comprehensive, layered architecture in which all the components required for a basic message exchange framework are defined. These components include the message format, message exchange patterns, message processing models, HTTP transport protocol bindings and protocol extensibility. Nevertheless, SOAP still requires a protocol to define its interface. This protocol is WSDL, which SOAP clients can read dynamically in order to adjust their inner message settings.
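To make the message structure concrete, the sketch below builds a SOAP 1.1 envelope by hand and posts it with the Python standard library; the endpoint, operation and payload are hypothetical, and in practice the exact element names would be taken from the service's WSDL.

```python
import urllib.request

# Hypothetical SOAP endpoint and operation; a real service publishes these in its WSDL.
ENDPOINT = "http://example.org/services/sequence"
envelope = """<?xml version="1.0" encoding="UTF-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <getSequence xmlns="http://example.org/types">
      <accession>NM_000059</accession>
    </getSequence>
  </soap:Body>
</soap:Envelope>"""

request = urllib.request.Request(
    ENDPOINT,
    data=envelope.encode("utf-8"),
    headers={"Content-Type": "text/xml; charset=utf-8",
             "SOAPAction": "getSequence"},
)
with urllib.request.urlopen(request) as response:
    print(response.read().decode("utf-8"))  # raw SOAP response envelope
```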

WSDL standardises the description of web service endpoints. This description enables the automation of communication processes by documenting (with accurate semantic descriptions) every element involved in the interaction, from the entities to the exchanged messages. In WSDL, a service is a collection of network endpoints capable of exchanging data. Its definition is separated into an abstract message definition and concrete network deployment data, and it encompasses several components that are structured so as to facilitate communication with other machines and ease the readability of the web service by humans. Obviously, if we think of a complex web service, we realize that there are numerous data types that need to be described.


WSDL  recognizes  this  need  and  can  use  XML  Schema  Definition  (XSD)  as  its  canonical  type  system.  

Despite  this,  we  cannot  expect  that  this  grammar  will  cover  all  possible  data  types  and  message  formats  in  the  future.  To  overcome  this  issue,  WSDL  is  extensible,  allowing  the  addition  of  novel  protocols,  data  formats  or  structures  to  existing  messages,  operations  or  endpoints.  

A perfect example of WSDL extensibility is the inclusion of semantic annotations in WSDL. SAWSDL used WSDL-S as its main input [34] and is now a W3C recommendation. The purpose of SAWSDL is to define the organization of the new semantic structures that can be added to WSDL documents. These structures are mainly deeper descriptions of traditional WSDL components such as input and output messages, interfaces or operations. The relevance of describing content and services was mentioned previously; however, it is crucial to reinforce that annotating WSDL services improves their categorization in a central registry, thus enhancing service discovery and composition tasks.
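Because a WSDL document is itself XML, a client can inspect it programmatically. The sketch below, assuming a hypothetical WSDL 1.1 document at the given URL, lists the operations exposed by each portType using only the Python standard library.

```python
import urllib.request
import xml.etree.ElementTree as ET

WSDL_NS = "{http://schemas.xmlsoap.org/wsdl/}"  # WSDL 1.1 namespace
# Hypothetical service description; any WSDL 1.1 document would do.
url = "http://example.org/services/sequence?wsdl"

with urllib.request.urlopen(url) as response:
    tree = ET.parse(response)

# Walk the document and print every operation declared in each portType.
for port_type in tree.iter(WSDL_NS + "portType"):
    for operation in port_type.iter(WSDL_NS + "operation"):
        print(port_type.get("name"), "->", operation.get("name"))
```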

UDDI provides web service management on the web. As the name indicates, the purpose of UDDI is to provide an XML/SOAP-based framework for describing, discovering and managing web services. UDDI registries usually offer a central repository with "publish and subscribe" features that allows the storage of service descriptions and detailed technical specifications about the web services. The storage mechanism relies once again on XML to define a metadata schema that can be easily searched by any discovery application. Standardising the web service registry has the main benefit of organizing the otherwise disordered web services world. UDDI promotes uniform patterns for both the internal organization of services and their external presentation. Hence, it enhances the development of integration strategies and the management of access to distributed services in the web environment.

The Extensible Messaging and Presence Protocol (XMPP) is an open and decentralized XML

routing technology that allows any entity to actively send XMPP messages to another [35]. XMPP works as a complete communication protocol, independent from HTTP or FTP, for data transfer. An XMPP network is composed of XMPP servers, clients and services, and XMPP is best known through the Jabber messaging framework. The Jabber ID uniquely identifies each XMPP entity. XMPP services are hosted on XMPP servers and offer remote features to other XMPP entities in the network, for instance XMPP clients. Being a messaging protocol, it has conventionally been used by Jabber and Google Talk. Nevertheless, a collection of XMPP Extension Protocols (XEPs) extends the initial core specification, widening the scope of XMPP into various directions, including remote computing and web services. Both HTTP and XMPP are used for content transfers; the main distinction between them is that XMPP requires an XML environment, while HTTP supports any kind of unstructured information. One XEP, IO Data, was created to enable the dispatch of messages from one computer to another, providing a transport package for remote service invocation [36]. Despite being an experimental XEP, it already solves two primary issues: the unneeded separation between the description (WSDL) and the actual SOAP service, and asynchronous service invocation. The XMPP infrastructure can be used to discover published services [37], and being asynchronous implies that clients do not have to poll repeatedly for the status of the service execution; instead, the service sends the results back to the client upon completion.

Service Composition

Web service composition [38] defines the collection of protocols, messages and strategies that have to be applied in order to coordinate a heterogeneous set of web services and reach a given goal. However, the coordination mechanism is not complete if it does not offer a seamless and transparent integration environment to end-users. The underlying architecture of service composition scenarios requires the development of a composition engine that is able to coordinate the execution of the web service workflow, communicate with the distinct web services and organize the information flow between them. An architecture of such complexity relies on a customized semantic structure to describe the composition [30]. Traditionally, service composition is completely hard-coded: the developers define a static composition to achieve the initial goals. Modern service composition scenarios, however, combine web services with semantic features. This combination enables automation in the web service composition engine: the web service workflow is established dynamically, and web service interoperability and execution are triggered automatically. In these scenarios, end users only need to define the input and the output they desire, and the system will organize itself in order to satisfy the original constraints.

Service composition can be applied in two distinct scenarios: service orchestration and service choreography [39]. These scenarios are very similar, and they can be applied to any kind of interoperable mesh of services. Service orchestration relies on a central web service controller to deal with the information workflow and web service interoperability. This main controller is the maestro of a service collection and organizes the services in order to solve the initial problem. Service choreography consists of an autonomous discovery of the best combination of services to attain a given goal. The benefits and drawbacks of these scenarios mirror those of centralized and distributed architectures. Currently, developers tend to create service orchestration architectures. Although they are more primitive, they are easier to implement and, in most cases, end-users need to have some control over the system, becoming the maestros of a particular web service collection. Web service choreography scenarios are more modern, and their implementation is being eased by the latest developments in artificial intelligence and the semantic web.
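A minimal orchestration sketch is given below: a central controller invokes each service in turn and feeds the output of one step into the next. The service functions and the values they return are hypothetical stand-ins for remote web service calls.

```python
def fetch_gene_record(symbol):
    # Stand-in for a remote gene lookup service (illustrative values only).
    return {"symbol": symbol, "uniprot": "P51587"}

def fetch_protein_features(record):
    # Stand-in for a remote protein annotation service.
    return {**record, "domains": ["BRC repeat", "OB fold"]}

def orchestrate(initial_input, services):
    """Central controller (the 'maestro'): calls each service in order,
    passing the output of one step as the input of the next."""
    data = initial_input
    for service in services:
        data = service(data)
    return data

result = orchestrate("BRCA2", [fetch_gene_record, fetch_protein_features])
print(result)
```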

Service Oriented Architectures

Service Oriented Architectures (SOA) are a modern architectural style for application deployment. The rationale behind these architectures is that what applications connect to are services, and not other applications. Considering that every component an application requires can be regarded as an independent service, one can create an implementation and deployment strategy based on this paradigm. In traditional web application architectures (Figure 10 – A), the deployment can be decomposed into three generic layers: the presentation layer, the application layer and the data access layer. Each of these layers encompasses several programmatic components that permit stable communication with the upper and lower layers of the model. In SOA architectures (Figure 10 – B), the layers are independent. That is, each layer component is independent from the remaining ones, and multiple applications can be composed by combining components belonging to each layer. This empowers two main concepts: reusable software, where software applications are wrapped as services that can be used in a multitude of applications, and application composition, where applications can be built by combining a set of services like LEGO pieces.


 

Figure 10 – Distinguishing traditional architectures from SOA architectures

Nevertheless, having a set of interoperable services is not enough to implement a service oriented architecture. Two additional components are required: the Enterprise Service Bus (ESB) and the Registry. The ESB is a software architecture component that acts as a message centre inside the SOA. Its main feature is message forwarding: in SOAs, the ESB is responsible for managing the message exchanges between the components of the application. At a more abstract level, the ESB operates as a service proxy that interacts with and controls every service that composes the system. The Registry is a central service repository, acting as a service broker. Its main purpose is to store service metadata that can be searched by other services, giving them the ability to find other services autonomously, according to various criteria.
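The sketch below illustrates the Registry idea in a few lines of Python: services publish metadata, and other components discover them by criteria instead of hard-coding endpoints. All service names and endpoints are hypothetical.

```python
# Minimal in-memory service registry acting as a service broker.
registry = []

def publish(name, endpoint, category, data_format):
    """A provider publishes its metadata so that others can find it."""
    registry.append({"name": name, "endpoint": endpoint,
                     "category": category, "format": data_format})

def discover(**criteria):
    """Return every registered service matching all of the given criteria."""
    return [entry for entry in registry
            if all(entry.get(key) == value for key, value in criteria.items())]

publish("GeneLookup", "http://example.org/genes", "data-access", "json")
publish("SequenceAlign", "http://example.org/align", "analysis", "xml")

print(discover(category="data-access"))
```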

Service oriented architectures have gained popularity recently, following the "everything-as-a-service" trend. This has raised their importance in the computer science world, especially regarding web development. Web and distributed applications are a perfect scenario for the implementation of SOA: distributed and heterogeneous resources are common on the World Wide Web, and connecting them is a crucial task that can be aided significantly by implementing a service oriented architecture.

2.2.3 GRID

GRID computing is one of the many breakthroughs made possible by the evolution of Internet technologies. The GRID is a combination of software and hardware infrastructures that provide pervasive, consistent and low-cost access to high-end computational capabilities [40]. Though this is the main idea behind the GRID, it is very basic and somewhat inadequate by current standards. The evolution of this concept leads to a model that unites heterogeneous and distributed data sources to achieve seamless and advanced interoperable functionality. The real problems in the GRID concept derive from resource sharing and problem solving in dynamic, multi-institutional virtual organizations; in these scenarios, resources can be either software or hardware capabilities.


This ability to share resources is essential in modern science, where collaboration and multidisciplinarity are daily topics. If we consider the wider modern projects in any scientific research area, we must be aware that the work developed spans workgroups, institutions, countries and even continents. The possibility of connecting distributed data, computers, sensors and other resources in a single virtual environment therefore represents an excellent opportunity. However, this kind of architecture has to be supported by various protocols, services and software that make controlled, large-scale resource sharing possible. The foundation for a generic GRID architecture must encompass several attributes, such as distributed resource coordination and the usage of standard, open, general-purpose protocols and interfaces, and it must deliver non-trivial qualities of service to end-users. In addition to these attributes, a certain number of features has to be implemented to sustain the GRID operation. These features include: remote storage and/or replication of datasets; logical publication of datasets in a distributed catalogue; security, especially focused on AAAC; uniform access to remote resources; composition of distributed applications; resource discovery methods; aggregation of distributed services with mapping and scheduling; monitoring and steering of job execution; and code and data exchanges between users' personal machines and distributed resources. All of these have to be delivered taking basic quality of service requirements into account.

The GRID architecture may lead to several GRID implementations focused on the development of specific features. We can therefore classify GRID technologies into three categories: computing grids, data grids and knowledge grids. This is an empirical classification based on the main purpose of the GRID technologies used in each category.

Computing grids focus on hardware resources, with particular emphasis on "high throughput computing". Nevertheless, it is important to distinguish "high throughput computing" from "high performance computing": the latter aims at short turnaround times on large-scale computations using parallel processing techniques [41, 42]. Although the main purpose of the computing GRID is parallel and distributed computing, there is a remarkable difference in the relevance of network latency and robustness. This difference stems from the network latency that exists in virtual organizations, which can span geographically distributed locations, in contrast to cluster computing, where machines are physically co-located and latency times are low. This high latency in computing grids must be handled in the system architecture, as the existence of lengthy execution jobs must be supported.
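As a rough local illustration of the high-throughput style of work (many independent jobs rather than one tightly coupled computation), the Python sketch below farms out independent tasks in parallel. An actual computing grid additionally handles scheduling, data staging, security and fault tolerance across institutions; the job function and inputs here are hypothetical placeholders.

```python
from concurrent.futures import ProcessPoolExecutor

def gc_content(sequence):
    # Placeholder for a long-running, independent analysis job.
    return sequence, sum(1 for base in sequence if base in "GC") / len(sequence)

jobs = ["ATGCGC", "TTATAA", "GGGCCC"]  # hypothetical inputs

if __name__ == "__main__":
    # Independent jobs are distributed over local processes; a grid scheduler
    # would distribute them over remote, possibly high-latency resources.
    with ProcessPoolExecutor() as executor:
        for sequence, ratio in executor.map(gc_content, jobs):
            print(sequence, round(ratio, 2))
```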

Data grids can be seen as large data repositories: a resourceome where data should be explicitly characterized and organized [43]. A data grid requires a unified interface that provides access to every integrated database and application. These interfaces must allow secure remote access and should contain ontology and/or metadata information to promote autonomous integration.

Finally, there are knowledge grids, a lesser-known concept. As a starting point, we must acknowledge that the knowledge we can represent on computers is just a part of the knowledge we can create and share within a community. Despite the term being controversial, some researchers [44] define knowledge grids as environments to design and execute geographically distributed, high-performance knowledge discovery applications. The main distinction between knowledge grids and generic grids is the usage of knowledge-based methodologies, such as knowledge engineering tools, discovery and analysis techniques like data mining or machine learning, and artificial intelligence concepts like software agents. The idea of knowledge in the web is merged with the semantic web. With this in mind, we can say that the Semantic Web is a Knowledge GRID that emphasizes distributed knowledge representation and integration, and a Knowledge GRID is a platform for distributed knowledge processing over the Semantic Web [45].

Modern large-scale research projects usually rely on some kind of GRID architecture to support the sharing of resources among the project's peers. The problems and requirements raised in this sharing environment are the ones debated in section 2.1: resource heterogeneity, integration, interoperability and description. Therefore, a deep understanding of GRID technologies and architectures is essential for further developments in service composition and integration in any scientific research area.

2.2.4 Semantic Web

The dramatic increase in content promoted by recent web developments such as Web 2.0 and social tools, combined with the ease of publishing content online, has the major drawback of increasing the complexity of resource description tasks. Standard web technologies cannot support this exponential increase. Researchers are required to perform manual searches over the vast amount of online content, and to interpret and process page content by reading it and interacting with the web pages. Additionally, they have to infer relations between information obtained from distinct pages, integrate resources from multiple sites and consolidate the heterogeneous information while preserving the understanding of its context. These tasks are executed daily by researchers in the most diverse scientific areas, and the web cannot offer simple mechanisms to execute them without relying on some kind of computer science knowledge.

Despite the appearance of specific integration applications, there are no general solutions, and case-by-case applications have to be developed. Moreover, the development of these applications is strained by the lack of modern resource publishing: service providers still assume that users will navigate through the information with a traditional browser and do not offer programmatic interfaces; without them, autonomous processing is difficult and fragile. The Semantic Web aims to enable automated processing and effective reuse of information on the Web, supporting intelligent searches and improved interoperability [46, 47].

promoted   semantic   Web   developments   in   2001   [25].   His   initiative   envisaged   to   smoothly   link  

personal   information  management,   enterprise  application   integration  and  worldwide   sharing  of  knowledge.  Therefore,  tools  and  protocols  were  developed  to  facilitate  the  creation  of  machine-­‐understandable   resources   and   to   publish   this   new   semantically   described   resource   online.   The  

long-­‐term  purpose  is  to  make  the  Web  a  place  where  resources  can  be  shared  and  processed  by  both  humans  and  machines.  The  W3C  Semantic  Web  Activity  group  has  already  launched  a  series  of   protocols   to   promote   the   developments   in   this   area.   Adding   semantic   features   to   existing  

content  involves  the  creation  of  a  new  level  of  metadata  about  the  resource  [48].  This  new  layer  will  allow  an  effective  use  of  described  data  by  machines  based  on  the  semantic  information  that  describes  it.  This  metadata  must   identify,  describe  and  represent  the  original  data  in  a  universal  

and  machine  understandable  way.  To  achieve  this  there  is  a  combination  of  four  web  protocols:  URI,  RDF  [49],  OWL  [50]  and  SPARQL  [51].   In  parallel  to  W3C  efforts,  there  are  developments   in  microformats.  Microformats   are   very   small  HTML  patterns   that   are   used   to   identify   context   on  

web  pages.   The   idea   is   to  use  existing  HTML  capabilities,   such  as   the  attributes   inside  a   tag,   to  make  a  simple  content  description.  For  instance,  if  we  have  a  person  name  inside  a  p  tag,  we  can  

31  

use   the   hCard   (http://www.xfront.com/microformats/hCard.html)   microformat   to   describe   the  

person  and  her  personal  information.  A   URI   is   a   simple   and   generic   identifier   that   is   built   on   a   sequence   of   characters   and   that  

enables the uniform identification of any resource. Promoting uniformity in resource identification allows several URI features, such as using distinct types of identifiers in the same context, unified semantic interpretation of resources, the introduction of new resource types without damaging existing ones, and the reuse of identifiers in distinct situations. The term "resource" is used in a general sense: it can identify any kind of component. URIs can identify electronic documents, services, data sources and other resources that cannot be accessed via the Internet, such as humans or corporations. "Identifier" refers to the operation of unequivocally distinguishing what is being identified from any other element in the scope of identification. This means that we must be able to distinguish one resource from all other resources, regardless of the working area or the resource's purpose.

Metadata is a term defining concrete descriptions of things. These descriptions should provide details about the nature, intent or behaviour of the described entity, being, generically, "data about data". RDF was designed as a protocol to enable the description of web resources in a simple fashion [52]. Its syntax-neutral data model is based on the representation of predicates and their values. A resource can be anything that is correctly referenced by a URI and, like the latter, RDF is currently not limited to describing web resources. In RDF we can represent concepts, relations and taxonomies of concepts. This triple-based structure results in a simple and flexible system. Although these are the main RDF benefits, they are also an issue: in certain scenarios, its generality must be formally confined so that software entities are able to correctly exchange the encoded information. To query RDF files and, on a larger scale, the Semantic Web, the W3C developed the SPARQL syntax [53]. SPARQL is an SQL-like query language that acts as a friendly interface to RDF information, whether it is stored in native RDF triple stores or in traditional databases (using appropriate wrappers).
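A minimal sketch of these two protocols working together is shown below, using the rdflib Python library as one possible toolkit; the vocabulary, the resources and the asserted relation are hypothetical illustrations.

```python
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")  # hypothetical vocabulary and resources

# Build a small RDF graph of subject-predicate-object triples.
graph = Graph()
graph.add((EX.BRCA2, EX.associatedWith, EX.BreastCancer))
graph.add((EX.BRCA2, EX.label, Literal("BRCA2")))

# Query the graph with SPARQL, much like SQL over a relational database.
results = graph.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?gene WHERE { ?gene ex:associatedWith ex:BreastCancer . }
""")
for row in results:
    print(row.gene)
```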

Describing content with metadata is not enough: there has to be an understanding of

what the described data means. This is where ontologies come into play. As mentioned previously, ontologies are used to characterize controlled vocabularies and background knowledge that can be used to build metadata. An ontology [26] consists of a collection of consensual and shared models, in an executable form, of concepts, relations and their constraints, tied to a scaffold of taxonomies [54]. In practical terms, we use ontologies to assert facts about resources described in RDF and referenced by a URI. OWL is the de facto ontology standard and extends the RDF schema with three variants: OWL-Lite, OWL-DL and OWL-Full. These three variants offer different levels of expressiveness and can, in summary, define existential restrictions, cardinality constraints on properties and several property types, such as inverse, transitive or symmetric. The main benefit of OWL is that data represented with it can be reasoned over to infer new information. More recently, OWL-S was built on top of OWL, focusing on the features that describe web services. This new protocol allows developers to describe what a service provides, how the service works and how to access it [55].

Combining these technologies will enable interoperability between heterogeneous data sources, making it possible for data in one source to be connected with data in a distinct source [56]. Semantic interoperability is the ability of two or more computer systems to exchange information and have the meaning of that information accurately and automatically interpreted by both systems. This design intent can only be achieved by adopting these protocols and semantic web concepts from the very beginning of a project's development.


2.3 Summary  

Researchers' daily tasks are getting more complex, as traditionally simple tasks such as locating the necessary information, gathering it and working with tools to process it become more difficult. The growing number of software tools does not help either: despite their quantity, their quality is questionable, and each tool works differently and requires distinct end-user knowledge to be usable. Along with application complexity, there is also the immense number of data formats. Even if we consider only a single scientific area, there are numerous data formats, data models and data types to deal with, and these impose time-consuming tasks such as manual data transformations or the development of custom wrappers and converters, which are generally far beyond scientific researchers' scope.

With this in mind, service composition scenarios gain relevance and represent a research opportunity for software engineers and computer science specialists. The autonomous coordination of processes and tools and the shared integration of resources in scientific environments require deep knowledge and expertise in computer science. This challenging task depends on a deep insight into the existing problems and state-of-the-art solutions, and researchers' contributions and collaboration are essential to provide a solid working basis whose efforts will, hopefully, result in an improved set of tools and working environments for any researcher in the field.

 


3 Approach  

Any researcher working in the bioinformatics field can rapidly obtain a general idea of the existing problems related to the integration of distributed and heterogeneous online resources. It is also true that the growing number of technological solutions to cope with these issues is not, in itself, a major benefit: the number of approaches that can be designed using one or more of the mentioned technologies is vast, and choosing the right path is not trivial. In this section we present a discussion of the most widely used solutions and of practical implementation scenarios of these solutions in the bioinformatics field.

3.1 Solutions

The presented solutions result from the combination of several strategies and technologies to develop new applications that rely on state-of-the-art components to achieve the initial goals and fulfil the initial requirements. This section contains a summary of application concepts that can be implemented to achieve the concrete goal of service composition.

3.1.1 Static applications

The simplest approach to integrating heterogeneous components is to program the entire application workflow by hand. Obviously, this approach has the major drawback of being static and fragile. However, for some specific scenarios, designing an application that solves a small subset of problems is the fastest way to deploy a viable solution in a short amount of time.

To create static applications, developers only need to be aware of the features they want to implement and of the main characteristics of the resources they wish to integrate. These applications combine a collection of methods that must be developed to integrate each resource. As a result, a static integration application is composed of a set of wrappers that encapsulate the access to distributed data resources. These wrappers are developed independently, and each can only exchange data with a single resource. If the main purpose of the application is to offer data obtained from web services and relational databases, there has to be a wrapper to access each service, a wrapper to access each database and a set of methods to coordinate, statically, the access and data exchanges between the various wrappers (Figure 11).
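A minimal sketch of this architecture is shown below: one hand-written wrapper per resource and a hard-coded coordination method. The endpoint, database schema and queries are hypothetical placeholders.

```python
import sqlite3
import urllib.request

def gene_service_wrapper(symbol):
    # Wrapper around a single remote web service (hypothetical endpoint).
    url = f"http://example.org/genes?symbol={symbol}"
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")

def local_db_wrapper(symbol):
    # Wrapper around a single local database (hypothetical schema).
    connection = sqlite3.connect("local_annotations.db")
    return connection.execute(
        "SELECT note FROM annotation WHERE gene = ?", (symbol,)).fetchall()

def integrate(symbol):
    # The coordination logic is fixed at development time: adding a new
    # resource means writing a new wrapper and editing this function.
    return {"remote": gene_service_wrapper(symbol),
            "local": local_db_wrapper(symbol)}
```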

 


Figure 11 – Static applications architecture example, focusing on the static and manually maintained integration engine

At first sight, these applications do not seem to represent a valuable solution for the integration of resources. Nonetheless, this solution is widely used, especially due to its simplicity of development and speed of deployment. With static applications, developers do not need to program dynamic or generic components. Static wrappers are much more easily deployed: they can be developed and tested faster and added to the application engine without any difficulty. On the downside, static applications are not generic, flexible or robust. Any time one wishes to add a new resource, developers must program the access to that particular service and add it to the application. At a small scale, this solution is feasible. However, when we are dealing with complex environments and constantly evolving scenarios – which is the case in the majority of scientific research projects – this solution is not enough.

Another major drawback of static applications is related to control. In this particular case, control refers to the ability to vary application inputs, application outputs and the inner processes executed inside the system. Static solutions only allow the input of a single data type and only provide a single data type as output. For instance, a static system can receive a string as input and reply with a hexadecimal output. Such a system has static data entries that the user cannot customize or change. In addition, the processes used to convert a character string to a hexadecimal value – a web service, for instance – cannot be controlled either: they are fixed and cannot be changed.

3.1.2 Dynamic applications

The design and development of dynamic solutions requires a higher level of computer science knowledge and a solid background in the working area to support the various iterations of the project's execution. Dynamic applications are the expected evolution of static applications [57] and are distinguished by allowing changes in their inputs and outputs, as well as by combining distinct resources to reach a given goal.

Dynamic applications [58] involve the development of a middleware framework that supports specific features related to resource control. In dynamic applications, control is no longer hard-wired into the application code: it changes according to the final result expected from the application. That is, dynamic applications can recognize distinct inputs and outputs and behave accordingly: the application relies on a middleware layer to organize the set of resources used, according to the input data and the desired output. An illustrative scenario could be an application that accepts various input data types identifying a country – for instance, a two-letter


code, a full name or a phone prefix – and retrieves the wiki page for that country. The application identifies the data type of the input and communicates with a service specific to that input, retrieving the information needed to produce a successful reply.
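The sketch below illustrates this country example with a minimal middleware dispatcher in Python: the input is inspected, its data type is recognized, and the request is routed to the matching resolver. The resolver bodies are hypothetical stand-ins for calls to remote services.

```python
import re

def resolve_from_iso_code(value):
    return f"wiki page resolved from ISO code {value}"       # stand-in service

def resolve_from_phone_prefix(value):
    return f"wiki page resolved from phone prefix {value}"   # stand-in service

def resolve_from_name(value):
    return f"wiki page resolved from country name {value}"   # stand-in service

RESOLVERS = [
    (re.compile(r"^[A-Z]{2}$"), resolve_from_iso_code),       # 2-letter code
    (re.compile(r"^\+\d{1,3}$"), resolve_from_phone_prefix),  # phone prefix
    (re.compile(r"^[A-Za-z ]+$"), resolve_from_name),         # full name
]

def resolve_country(value):
    """Middleware layer: recognize the input type and route accordingly."""
    for pattern, resolver in RESOLVERS:
        if pattern.match(value):
            return resolver(value)
    raise ValueError(f"Unrecognized input type: {value!r}")

print(resolve_country("PT"), resolve_country("+351"), resolve_country("Portugal"))
```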

Additionally, dynamic applications allow a higher level of user control. This means that users can select which resources to use from a predefined set of methods, or add their own resources to the system, which will be recognized automatically. Offering this kind of feature increases the platform's complexity dramatically. Dynamic access to services and autonomous semantic composition require the development of several focused, flexible and generic protocols [59]. Designing these protocols implies resorting to many of the technologies debated before and requires the adoption of semantic web strategies to describe the incorporated resources and permit the integration of new ones. Only by resorting to intelligent web mechanisms can we improve existing applications and obtain significant advances in dynamic applications, as opposed to the current semi-dynamic solutions.

3.1.3 Meta-applications

Metadata is, generally, data about data. Applying the same idea to applications, we can conceive the paradigm of meta-applications: applications about applications. Meta-applications are state-of-the-art systems that integrate distributed applications, empowering interoperability among heterogeneous systems. Recent developments have also promoted a concept described as "software-as-a-service": any software solution should be provided as a remote service [60]. If software engineers follow this paradigm, any application can act as a service, easing integration tasks.

Meta-applications are a specific set of dynamic applications aimed at integrating services while offering distinct levels of integration control to end-users. This control can be manipulated: on one hand, users can have full control over which services they want to execute and which answers they want to obtain; on the other hand, users can have zero control over the application, providing only the initial problem and the type of result they want, forcing autonomous interoperability between the integrated services to attain the proposed goals. Developing meta-applications is even more complex than developing dynamic applications: the problems raised by heterogeneity and distribution are very difficult to deal with, and reaching a flexible and generic solution is a cumbersome task.

The mashup term was initially used in the music industry to categorize combinations of several songs in a single track. This term has been ported to the World Wide Web and characterizes hybrid web applications: applications that combine applications. Mashups are the main instance of meta-applications. Their purpose is to combine data gathered from multiple services to offer a broader, centralized service [61, 62]. Mashups allow easy and fast integration, relying on remote APIs and open data sources and services [63]. Mashups and meta-applications share a common basic purpose: to offer a new level of knowledge that would not be attainable by accessing each service separately.

Workflows

According to the Workflow Management Coalition, a workflow is a logical organization of a series of steps that automate business processes, in whole or in part, and where data or tasks are exchanged from one element to another for action [64]. Adapting this concept to software, we can say that a workflow is a particular implementation of a mashup that consists of an ordered information flow that triggers the execution of several activities to deliver an output or achieve a goal (Figure 12) [65-67]. A crucial workflow requirement is that the inputs of each activity must match the outputs of the preceding activity; to maintain this consistency and deal with workflow execution operations, developers must implement a workflow management system [68].

Figure 12 – Workflow example with two initial inputs and one final output

A   workflow   management   system   defines,   manages   and   executes   workflows   through   the  execution   of   software   that   is   driven   by   a   computer   representation   of   the   workflow   logic.  Describing   the   workflow   requires   a   complete   description   of   its   elements:   task   definition,  

interconnection   structure,   dependencies   and   relative   order.   The   most   common   solution   to  describe   the   workflow   is   to   use   a   configuration   file   or   a   database   to   store   the   required  information.  
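A minimal enactment sketch along these lines is given below: the workflow is described declaratively (here as a Python dictionary, though a configuration file or a database record would serve the same purpose), and a small engine executes it step by step, checking that each activity's input type matches the previous output type. The activities themselves are hypothetical stand-ins.

```python
# Hypothetical activities; each declares the input type it expects.
def translate(data):
    return {"type": "protein", "value": "protein derived from " + data["value"]}

def annotate(data):
    return {"type": "report", "value": "annotated " + data["value"]}

ACTIVITIES = {"translate": ("dna", translate), "annotate": ("protein", annotate)}

# Declarative workflow description (could live in a config file or database).
workflow = {"steps": ["translate", "annotate"]}

def run(workflow, initial):
    """Tiny workflow engine: execute steps in order, checking type consistency."""
    data = initial
    for step in workflow["steps"]:
        expected_type, activity = ACTIVITIES[step]
        if data["type"] != expected_type:
            raise TypeError(f"{step} expects {expected_type}, got {data['type']}")
        data = activity(data)
    return data

print(run(workflow, {"type": "dna", "value": "ATGCCGATTGGC"}))
```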

Existing workflow systems can support complex operations and deal with large amounts of data. However, many scientific research fields require much more than that. There are emerging requirements that must be handled by workflow management systems, such as interactive steering, event-driven activities, streaming data and collaboration between personnel in distinct parts of the globe. In novel scientific domains, modern researchers prefer to design and execute workflows in real time; they then want to analyse the results obtained and reorganize the workflow accordingly. This exploratory procedure requires more than what current workflow enactment applications can offer and implies the development of meta-applications and unified working ecosystems that offer a wide range of heterogeneous features to researchers.

To deal efficiently with such complexity, developers must overcome the previously

mentioned  problems   related   to  heterogeneity.   Scientific   communities   from  every   research   field  

should   promote   interoperability   and   semantic   descriptions   to   foster   the   development   of  applications  that  can  accurately  integrate  online  resources  based  on  service  composition.  

 


3.2 Bioinformatics  

To cope with the aforementioned biology and biomedicine challenges, bioinformatics has had to come a long way. In the beginning, bioinformatics applications consisted of small desktop software tools that simplified genotype sequencing and genomic sequence analysis. Nowadays, bioinformatics applications encompass large resource networks and web applications that connect biomedical scientific communities worldwide. Web-based applications are a key factor in the rapid improvement of bioinformatics. Researchers use web-based applications on a daily basis and publish their discoveries online, spreading them faster and reaching more colleagues. However, as in many other areas, the growing number of web applications and resources has increased the level of heterogeneity and, consequently, the difficulty of finding information with certified quality. While a few years ago only some of the best research groups published information online, nowadays anyone can publish anything online. This increase in information quantity has resulted in an overall quality decrease.

A large number of workgroups have been developing software solutions to address the problems and requirements mentioned in section 2. These efforts have resulted in remarkable developments, mostly in resource integration and interoperability, resource description and mashup/workflow applications. As well as being valuable developments for the biomedicine field, these applications also represent innovation and state-of-the-art solutions in computer science, requiring high expertise and knowledge from the information technology community. We organize the existing bioinformatics applications into three logical groups: databases, service protocols and integration applications. Next we present a brief description of some of the most relevant research outcomes that are valuable for resolving our initial challenge.

3.2.1 Databases

There are many databases that contain biological and biomedical information. In most cases, these databases offer their data through web services or flat files that can be easily accessed or parsed. Databases do not follow a single model or notation; therefore, the same biological concept is represented in several distinct manners and with various identifiers. The task of converting an entity from one data type to another is often quite complex due to the multitude of existing data types and models.
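The identifier problem can be illustrated with the tiny cross-reference table below: the same gene carries different identifiers in different resources, and conversion amounts to a lookup. The mapping shown is merely illustrative and, in a real system, would be obtained from the databases themselves or from dedicated mapping services.

```python
# Illustrative cross-reference table: one biological entity, several identifiers.
CROSS_REFERENCES = {
    "BRCA2": {"uniprot": "P51587", "ncbi_gene": "675", "kegg": "hsa:675"},
}

def convert(symbol, target):
    """Map a gene symbol to its identifier in another resource, if known."""
    return CROSS_REFERENCES.get(symbol, {}).get(target)

print(convert("BRCA2", "uniprot"))
```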

The  Kyoto  Encyclopaedia  of  Genes  and  Genomes  (KEGG)  is  a  Japanese  initiative  with  the  main  

goal  of  collecting  genomic  information  relevant  to  metabolic  pathways  and  organism  behaviours  [69].   KEGG   is   composed   of   five   main   databases,   each   with   a   distinct   focus:   Pathways,   Atlas,  Genes,  Ligand  and  BRITE.  Meshing  these  databases,  KEGG  aims  to  obtain  a  digital  representation  

of  the  biological  system  [70].    UniProt   intends   to   be   a   universal   protein   resource   [71].   It   is   maintained   by   a   consortium  

composed of the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR). Each of the consortium members focuses on distinct areas, and the convergence of their efforts is a huge database, one of the best sources of curated protein and functional information.

The association between the EBI and the European Molecular Biology Laboratory (EMBL) has resulted in various ongoing research projects that have already left their footprint in the bioinformatics community. ArrayExpress [72] archives public functional genomics data in two databases. The Experiments Archive stores results from conducted experiments submitted from all over the world. The Gene Expression Atlas is a curated and re-annotated subset of the Experiments Archive directed at gene expression studies. Ensembl [73] is another genome database that contains information on a large number of species and is accessible through a large number of web services. InterPro [74] is another EBI database focused on proteins and the proteome; in some manner, it is a smaller-scale competitor to UniProt. EMBL-EBI has many ongoing projects in the bioinformatics field, and from these projects several new web applications and databases are born. Medline and the European Genome-Phenome Archive are some of these projects that are not as popular as the main ones, although they have a growing importance in the life sciences community.

USA's National Center for Biotechnology Information (NCBI), associated with the National Library of Medicine, is a resource for molecular biology information organized in various categories, each containing several databases. From the extensive NCBI database list we can highlight some major databases. dbSNP [75] stores information about Single Nucleotide Polymorphisms (SNP), particular changes in our genetic sequence that are relevant for the detection of anomalies in our genes. The Mendelian Inheritance in Man (MIM) is a library of known diseases that are mainly caused by genetic disorders; NCBI is responsible for the Online MIM (OMIM) [76], which allows it to act as a key point for other disease-centric and disease-related databases and applications (like DiseaseCard [17], for instance). Medical Subject Headings (MeSH) [77] are also made available online at NCBI facilities and are correlated with other NCBI databases such as the Medical Literature Analysis and Retrieval System (MEDLINE), a huge bibliographic database of published material related to the life sciences and biomedicine, which can be accessed through PubMed, an online search engine providing access to the entire MEDLINE library. GenBank is an open sequence database that contains information from laboratories throughout the world, covering a huge number of distinct species. Navigating the entirety of NCBI databases and applications is not easy. To facilitate this process, NCBI created the Entrez Global Query Cross-Database Search System (Entrez) [16], offering online access to a multitude of NCBI databases through a single user interface. Entrez is also a remarkable project in online resource integration, proving that normalized data formats and coherency across databases and services are the best method to promote interoperability and to achieve dynamic integration.
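Entrez's databases can also be searched programmatically through the NCBI E-utilities. A minimal sketch (in Python; the search term is only an illustrative assumption) of a query against the PubMed database could look like this:

```python
# Minimal sketch: query NCBI Entrez (E-utilities) for PubMed identifiers.
# The search term is an illustrative assumption.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
params = {"db": "pubmed", "term": "service composition bioinformatics", "retmax": "5"}

with urllib.request.urlopen(base + "?" + urllib.parse.urlencode(params)) as response:
    xml_payload = response.read()

# ESearch returns XML; the matching identifiers are listed under IdList/Id.
root = ET.fromstring(xml_payload)
for id_node in root.findall("./IdList/Id"):
    print(id_node.text)
```

The same request pattern applies across the other Entrez databases, which is what makes the cross-database coherency mentioned above so valuable for integration.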

Another hot topic in bioinformatics research is phenotypic information. PhenomicDB [78, 79] is a database for comparative genomics regarding various species and genotype-to-phenotype information. Information is obtained from several public databases and merged into a single database schema, improving database access performance and making several other features possible. PhenoBank started as a complex phenotype study for a single species, evolving into an intelligent solution for heterogeneous resource integration. PhenoGO [80] is a Gene Ontology centric database that intends to support high-throughput mining of phenotypic and experimental data.

In addition to these general-purpose databases, there are a large number of others focusing on specific topics. Locus-specific databases (LSDB) contain gene-centred variation information and are one of the first scenarios of a wide integration effort. The Leiden Open-source Variation Database (LOVD) [81] follows the "LSDB-in-a-box" approach to deliver a customizable and easily deployable LSDB application. This means that any research group can deploy its own LSDB and follow the same data model, thus promoting data integration and resource distribution. There are also various other locus-specific databases, such as Inserm's Bioinformatics Group Universal Mutation Database (UMD) [82], which is directed to clinicians, geneticists and research biologists, or downloadable variation viewers like VariVis [83] from the University of Melbourne's Genomic Disorders Research Centre.

3.2.2 Service Protocols

Data management in the life sciences offers constant challenges to software engineers [84]. Offering this data to end-users and researchers worldwide is an even bigger challenge. Web applications tend to be complex and cluttered with data, resulting in non-usable interfaces and fragile workspaces. The possibility of offering data as a service is a valuable option that is being used more and more often. The greatest benefit of these remote services is that they allow both static programmatic and real-time dynamic integration; that is, developers can merge several distributed services in a single centralized application.

The Distributed Annotation System (DAS) [85] specifies a protocol for requesting and returning annotation data for genomic regions. DAS relies on distributed servers that are integrated in the client for data supply and is expanding to several life sciences areas beyond sequence annotation. The main idea behind DAS is that distributed resources can be integrated in various environments without being aware of the other participants. That is, resources can be replicated and integrated in several distinct systems, not only in a single static combination of resources.
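In practice, a DAS client is simply an HTTP client that asks an annotation server for the features overlapping a genomic segment. The sketch below (Python; the server URL, data-source name and coordinates are illustrative assumptions, not a specific DAS deployment) shows the general shape of such a request under the DAS 1.x convention:

```python
# Minimal sketch of a DAS 1.x "features" request.
# Server, data source and segment are illustrative assumptions.
import urllib.request
import xml.etree.ElementTree as ET

server = "http://example.org/das"   # hypothetical DAS server
source = "reference"                # hypothetical data source name
segment = "1:100000,200000"         # chromosome:start,stop

url = "%s/%s/features?segment=%s" % (server, source, segment)
with urllib.request.urlopen(url) as response:
    document = ET.fromstring(response.read())

# DAS responses are XML; each annotation is reported as a FEATURE element.
for feature in document.iter("FEATURE"):
    print(feature.get("id"), feature.get("label"))
```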

BioMart [86] consists of a generic framework for biological data storage and retrieval using a range of queries that allow users to group and refine data based upon many different criteria. Its main intention is to improve data mining tasks, and it can be downloaded, installed and customized easily. Therefore, it promotes localized and specific integration systems that can merge their data from larger databases.
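Programmatic access to a BioMart instance typically follows one pattern: an XML query document listing the dataset, filters and attributes of interest is posted to the mart's web service, which streams back tabular results. The sketch below (Python; the endpoint URL, dataset, filter and attribute names are assumptions for illustration only) outlines such a query:

```python
# Minimal sketch of a BioMart-style XML query sent to a martservice endpoint.
# The endpoint URL, dataset, filter and attribute names are illustrative assumptions.
import urllib.parse
import urllib.request

query_xml = """<?xml version="1.0" encoding="UTF-8"?>
<Query virtualSchemaName="default" formatter="TSV" header="0" uniqueRows="1">
  <Dataset name="hsapiens_gene_ensembl">
    <Filter name="chromosome_name" value="21"/>
    <Attribute name="ensembl_gene_id"/>
    <Attribute name="external_gene_name"/>
  </Dataset>
</Query>"""

endpoint = "http://www.biomart.org/biomart/martservice"  # assumed endpoint
data = urllib.parse.urlencode({"query": query_xml}).encode("utf-8")

with urllib.request.urlopen(endpoint, data=data) as response:
    for line in response.read().decode("utf-8").splitlines()[:5]:
        print(line)  # tab-separated gene identifier and name
```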

The European Molecular Biology Open Software Suite (EMBOSS) [87, 88] is a software analysis package that unifies a collection of tools related to molecular biology and includes external service access. Applications are catalogued in about 30 groups spanning several areas and operations related to the life sciences.

Soaplab was developed at the EBI and is a set of web services that provide remote programmatic access to several other applications [89]. Included in the framework are a dynamic web service generator and powerful command-line programs, including support for EMBOSS software. The integration efforts conducted in Soaplab make it possible to use a single generic interface when accessing any Soaplab web service, regardless of the interfaces of the underlying software.

BioMOBY is a web-service interoperability initiative that envisages the integration of web-based bioinformatics resources, supported by the annotation of services and tools with terms from well-known ontologies. BioMOBY was initiated at the Model Organism Bring Your own Database Interface Conference (MOBY-DIC), and the proposed integration may be achieved semantically or through web services. The BioMOBY protocol stack defines every layer of the protocol, from the ontology to the service discovery properties [90, 91].

The Web API for Biology (WABI) is an extensive set of SOAP and REST life sciences web APIs, focused on data processing and conversion between multiple formats [92, 93]. WABI mainly defines a set of rules and good practices that should be followed when the outcome of a research project is a set of web services.

Along with these biomedical service protocols, there are the traditional web services protocols discussed previously that give access to several other resources. The growing number of bioinformatics-specific web services protocols is another setback for application interoperability and integration in the life sciences area.

3.2.3 Integration Applications

Goble [94] recently conveyed a "state of the nation" study of integration in bioinformatics, and her main conclusions were that there is still a long path to traverse, especially concerning integration efficiency. Nonetheless, there have been remarkable developments in the last few years. These developments include novelties in data and services integration, semantic web developments and the implementation of mashups/workflows in bioinformatics. Moreover, as stated by Stein, integration is a vital element in the creation of a large bioinformatics ecosystem [95].

The data integration issue can be approached following many different strategies and, worldwide, there are many research groups solving it differently [96]. The adoption of a GRID perspective is one of these approaches [97, 98]. myGRID is a multi-institutional and multi-disciplinary consortium that intends to promote e-Science initiatives and projects. The most well-known outcome of this project is the workflow enactment application Taverna, which we will discuss further in this document. More recently, GRID has been giving way to cloud-computing strategies [99], like Wagener's work using XMPP [100], a field that still attracts little interest in the bioinformatics community, though it will likely gain relevance in the near future.

Regarding the resource integration models presented previously (warehouse, mediator and link), there is a huge collection of applications that implement them. DiseaseCard [17, 101] is a public information system that integrates information from distributed and heterogeneous medical and genomic databases. It provides a single entry point to access relevant medical and genetic information available on the Internet about rare human diseases. Using link discovery strategies, DiseaseCard can update its database and include novel applications. Following this approach, it is easy to design a simple integration mechanism for dispersed variome data that is available from existing LSDBs. With this system, the life sciences community can access the entire biomedical information landscape, transparently, from a single centralized point.

GeneBrowser [18] adopts a hybrid data integration approach, offering a web application focused on gene expression studies that integrates data from several external databases as well as internal data. Integrated data is stored in an in-house warehouse, the Genomic Name Server (GeNS) [102]. From there, it is possible to present content in multiple formats from the replicated data sources, and to obtain data in real time from the link-based integrated resources.

Biozon is "a unified biological database that integrates heterogeneous data types such as proteins, structures, domain families, protein-protein interactions and cellular pathways, and establishes the relations between them" [103]. Biozon is a data warehouse implementation similar to GeNS, holding data from various large online resources like UniProt or KEGG and organized around a hierarchical ontology. Biozon's clever internal organization (graph model, document and relation hierarchy) confers a high degree of versatility to the system, allowing a correct classification of both the global structure of interrelated data and the nature of each data entity.

Bioconductor is an open-source and open-development software package providing tools for the analysis and comprehension of genomic data [104]. The software package is constantly evolving and can be downloaded and installed locally. The software tools that compose the package are made available by service providers, generally in the R language. Integration is performed on the client side through enhanced and coherent access to various distributed life sciences tools and services.

Large databases such as Ensembl [73] and Entrez [16] are also major web service providers, offering access to their entire content through a simple layer of comprehensive tools.

Despite being a novelty in bioinformatics [47], semantic developments have already found their space in several research groups [105]. The complexities inherent to the life sciences field increase the difficulty of creating ontologies and semantics to describe the immense set of biology concepts and terms. Gene Ontology [106] is the most widely accepted ontology, aiming to unify the representation of gene-related terms across all species. This is only possible by providing access to an annotated and very rich controlled vocabulary. Similar efforts are being developed in other related areas [54]. Reactome is a generic database of biology, mostly human biology, describing in detail operations that occur at the molecular level [107]. There are also ongoing efforts to map proteins and their genetic effects in diseases [108]. BioDASH [109], developed at the W3C, is a semantic web initiative envisaging the creation of a platform that enables an association, similar to the one that exists in real-world laboratories, between diseases, drugs and compounds in terms of molecular biology and pathway analysis. RDFScape [110] and Bio2RDF [111] are the best known among several studies in bioinformatics semantics [90, 112, 113]. The main purpose of these projects is to create a platform that offers access to well-known data in RDF triple format. Mapping current database models onto a new ontology and hierarchy is a remarkable accomplishment that will be very useful in the future of semantic bioinformatics.
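The practical appeal of exposing these data as RDF triples is that they become queryable with SPARQL. The sketch below (Python with the SPARQLWrapper library; the endpoint URL and the queried resource URI are assumptions for illustration, not a guaranteed Bio2RDF address) retrieves a handful of triples describing one resource:

```python
# Minimal sketch: SPARQL query against an RDF endpoint exposing biological data.
# The endpoint URL and resource URI are illustrative assumptions.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://example.org/sparql")  # hypothetical Bio2RDF-style endpoint
endpoint.setQuery("""
    SELECT ?predicate ?object
    WHERE { <http://bio2rdf.org/geneid:7157> ?predicate ?object . }
    LIMIT 10
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for binding in results["results"]["bindings"]:
    # Each binding is one (predicate, object) pair of the resource's triples.
    print(binding["predicate"]["value"], binding["object"]["value"])
```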

The integration of services is another area where innovation takes place. Service composition, mashups and workflows are among the hottest trends in bioinformatics application development. Service composition, which encompasses service orchestration and choreography, is already possible in various scenarios like BioMOBY or Bio-jETI [90, 91, 114, 115]. Bio-jETI uses the Java Electronic Tool Integration (jETI) platform, which allows the combination of features from several tools in an interface that is intuitive and easy for new users. jETI enables the integration of heterogeneous services (REST or SOAP) from different providers or even from distinct application domains. This approach is adapted to the life sciences environment, resulting in a platform for multidisciplinary work and cross-domain integration. However, mashups and workflows are at the forefront of dynamic integration in bioinformatics, with desktop applications like Taverna [114, 116-118] or, more recently, with various web applications.

Taverna is the state-of-the-art application for workflow enactment. It is a Java-based desktop application that enables the creation of complex workflows, allowing access to files and complex data manipulation. Additionally, Taverna automatically configures access to BioMOBY, Soaplab, KEGG and other services. Along with these predefined services, users can also dynamically add any web service through its WSDL configuration. These interesting features increase Taverna's value significantly, overcoming its major drawback of being a heavy desktop-based application. BioWMS is an attempt to create a Taverna-like web-based workflow enactor. Its feature set is not as complete as Taverna's and its availability is very limited (unlike Taverna, which is freely available for the major operating systems) [119]. The Workflow Enactment Portal for Bioinformatics (BioWEP) consists of a simple web-based application that is able to execute workflows created in Taverna or in BioWMS [120, 121]. Currently, it does not support workflow creation and the available workflow list is quite restricted.
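The WSDL-driven service access that Taverna offers interactively can also be reproduced programmatically: given a WSDL document, a SOAP client library generates the service stubs dynamically. A minimal sketch (Python with the suds library; the WSDL URL and operation name are hypothetical placeholders, not a real service) illustrates the idea:

```python
# Minimal sketch: dynamic SOAP access driven purely by a WSDL description,
# the same mechanism workflow enactors use when a user registers a new service.
# The WSDL URL and operation name are hypothetical placeholders.
from suds.client import Client

wsdl_url = "http://example.org/services/sequence?wsdl"  # hypothetical WSDL location
client = Client(wsdl_url)

# Printing the client lists the operations and types declared in the WSDL.
print(client)

# Invoke one of the declared operations; name and arguments depend on the WSDL.
result = client.service.getSequence("P04637")
print(result)
```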


The   Bioinformatics   Workflow   Builder   Interface   (BioWBI)   is   another   web-­‐based   workflow  

creator   that   connects   to   a   Workflow   Execution   Engine   (WEE)   through   web-­‐services   to   offer  complete  web-­‐based  workflow  enactment  [122].  

DynamicFlow is also a web-based workflow management application; it relies on Javascript to render a Web2.0 interface that enables the creation of custom workflows based on a predefined set of existing services and operations [123, 124]. Available services are semantically described in an XML configuration file, allowing real-time workflow validation.

An  interesting  perspective  is  used  in  BioFlow  [125].  The  main  idea  behind  this  initiative  is  to  create   a   generic   workflow   execution   language   that   encompasses   a   definition   for   the   various  elements  active   in  a  workflow.  This  new  declarative  query   language  permits   the  exploitation  of  

various  recent  developments  like  wrapped  services  and  databases,  semantic  web  and  ontologies  or  data  integration.  

3.3 Summary  

Bioinformatics is no longer an emerging field that merely requires software engineers' assistance. When the bioinformatics research field gained momentum, it used traditional computational tools and software techniques to evolve. In the last few years, we have been witnessing a shift in the relation between the life sciences and informatics: bioinformatics requirements are fostering computer science evolution, and not the other way around. Bioinformatics is no longer a small information technology research area; it has evolved steadily and can now promote innovation and foster the development of state-of-the-art computer science applications.

With this in mind, it is crucial to keep track of computer science developments and apply them in bioinformatics. Whether we are dealing with the latest web trend or new data integration architectures, it is essential to enhance existing bioinformatics resources and prepare them for an intelligent web where semantic descriptions are the key to dealing with heterogeneity in integration and interoperability.

     


4 Work  Plan  

It is common sense that project planning is a key factor in project success [126]. Carefully planning ahead and pursuing a concise initial vision of the matter at hand are very important to reach the initial project goals.

In the computer science field, the major problem in planning ahead is the constant evolution of existing software and technologies. Planning long-term goals and developments is not failure-proof. Every day, new and innovative applications appear worldwide and new techniques are published. One cannot predict the discovery of a novel technology or application that will completely disrupt everything we have studied so far. In addition, dealing with web applications is even more complex because it is the users who define the Internet. The WWW lives on conceptual trends adopted by general users, and software applications in scientific fields must reflect these user interests.

Next, we present the global doctorate objectives, an estimated calendar and list of activities for the four years that comprise this research work, and a targeted publication list that will define several milestones in that calendar.

4.1 Objectives

Over the course of this research work, there are several strategic objectives that should be accomplished. At the development level, the main purpose is to propose, develop and validate a software framework that deals with service composition and related topics, providing added value to the scientific community, whether through web applications aimed at life sciences researchers or through software toolkits, like remote APIs, that can reach bioinformatics developers.

Along with the software developments, this doctorate must also comprise scientific work. In addition to the final thesis, there have to be various published scientific contributions in both the computer science and bioinformatics research fields. Scientific publications have major relevance in an expanding scientific community like bioinformatics. Moreover, peer reviews and feedback from other researchers represent a valuable foundation for further improvements in any research.

4.2 Calendar

The following Gantt chart (Table 1) contains an estimation of the work to be developed during this doctorate. Each year is divided into quarters and the work is divided into three categories: software, thesis and publications. The thesis is composed of two deliveries: a thesis proposal (this document) at the end of the first year and the final document at the end of the fourth year. One cannot fully plan what the software outcome of the project will be; therefore, it is not reasonable to define static software milestones. Instead, we organize the software being developed in three main cycles. It is expected that at the end of each cycle a software evaluation is conducted, where new software is presented, validated and published to the community. This evaluation will also work as a re-assessment of the project objectives and of the adequacy of the developed work to the initial problem and requirements. Scientific publications are a specific case of planning: one cannot predict when the most adequate conference will be organized or when a paper will be accepted in a journal. With this in mind, the following Gantt chart treats publications as a general term. It is desirable to obtain high impact factor publications (scientific journals or books) and several other medium impact factor publications (conference proceedings or workshops).

Table 1 - Gantt chart calendar comprising the activities being developed during this doctorate

[Gantt chart: each of the four years is divided into quarters (Q1-Q4). Rows are grouped into three categories: Thesis Writing (State of the Art; Domain Analysis; Proposal; Main corpus; Delivery), Software (Preliminary Research; System Analysis; Modelling; Active Development; Deliveries) and Publications (High IF; Medium IF). Cell shading, not reproducible here, marks the phase of each activity per quarter.]

Legend: Initial analysis and preparation for the task in hand | Active development: implementing software or writing | Finalization: final software versions, rewrites and deliveries

4.3 Publications

As previously mentioned, publication organization cannot be planned in advance. There are numerous constraints related to conference dates, open publication calendars and, more importantly, acceptance dates. These constraints limit the planning phase, as one cannot know when or where a scientific article will be accepted. Additionally, conference calendars change throughout the years and new international events are constantly appearing. Therefore, we can only analyse several scientific publications and choose the ones we find most adequate to the work we will develop.

The journal impact factor [127] is a calculated measure used to evaluate the relative importance of a specific journal within its field of research. Knowing a publication's impact factor, we can estimate the visibility our work will gain if published in that journal. As a result of the expected scientific progress, we wish to publish three articles in high impact factor journals. These include Science (http://www.sciencemag.org), Hindawi journals (http://www.hindawi.com), BMC Bioinformatics (http://www.biomedcentral.com/bmcbioinformatics) and various Oxford Journals (http://www.oxfordjournals.org) such as Bioinformatics (http://bioinformatics.oxfordjournals.org), Database or Nucleic Acids Research.
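For reference, the standard two-year impact factor mentioned above is, for a journal in year y, the ratio between the citations received in y by items published in the two preceding years and the number of citable items published in those years:

\mathrm{IF}_{y} = \frac{C_{y}(y-1) + C_{y}(y-2)}{N_{y-1} + N_{y-2}}

where C_{y}(y-k) denotes the citations received in year y by items the journal published in year y-k, and N_{y-k} denotes the number of citable items published in year y-k.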

Journal and magazine publications are the most valuable means of publication in the scientific community. Nevertheless, publishing work in conference proceedings is also a relevant way to publicize developments to our peers. In this scenario, the publication's reach is a direct consequence of the scientific group that indexes the proceedings (if any). IEEE (http://www.ieee.org), ACM (http://www.acm.org), dblp (http://www.informatik.uni-trier.de/~ley/db/) and Springer (http://www.springer.com) are the most relevant indexing groups and participate in the indexing process of numerous conferences, on a diversity of topics, worldwide.

The following table (Table 2) lists the works published so far. The first two [123, 124] are full papers and the others are poster presentations.

Table 2 – List of works already published in this doctorate

Publication | Indexing | Publishing Date
Dynamic Service Integration using Web-based Workflows [123] | ACM | November 2008
DynamicFlow: A Client-side Workflow Management System [124] | Springer | June 2009
Arabella: A Directed Web Crawler | dblp | October 2009
Link Integrator: A Link-based Data Integration Architecture | dblp | October 2009

Table 3 lists interesting conferences that took place in the last couple of years and have a high probability of occurring again during this doctorate. These conferences are focused on relevant computer science and bioinformatics topics, especially online resource integration.

Table 3 – Interesting conference list

Conference | Indexing | Topic
International Conference on Information Integration and Web-based Applications & Services | ACM | Computer Science
International Conference on Bioinformatics and Bioengineering | IEEE | Bioinformatics
International Conference on Bioinformatics and Biomedical Engineering | IEEE | Bioinformatics
International Conference on Bioinformatics and Biomedicine | IEEE | Bioinformatics
Data Integration in the Life Sciences | Springer | Bioinformatics
International Conference on Web Search and Data Mining | ACM | Computer Science
Symposium on Applied Computing: Web Technologies | ACM | Computer Science
Conference on Web Application Development | (none) | Computer Science
International Conference on Enterprise Information Systems | dblp | Computer Science
International Conference on Web Information Systems and Technologies | dblp | Computer Science
International Workshop on Services Integration in Pervasive Environments | ACM | Computer Science
International Workshop on Lightweight Integration on the Web | Springer | Computer Science
International Workshop on Information Integration on the Web | (none) | Computer Science



5 Implications  of  Research  

Any researcher working in computer science must be constantly aware of external innovations and improvements made in the areas related to the matter at hand. This everyday evolution is leveraged by numerous workgroups' endeavours. Nowadays, web application innovations are related to ideals like Web2.0, Web3.0 or the Intelligent/Semantic Web. These WWW innovations must be taken into account when developing new applications in any scientific field.

Bioinformatics has to bet on innovation, and the best way to win this bet is to adopt well-known concepts and trends from the Internet and apply them to the life sciences field. However, while in areas like entertainment or journalism we have easy access to a myriad of resources, this does not happen in the life sciences field. The life sciences research field is so vast that the number of topics a single application can cover is much reduced. This leads to the appearance of an immense number of applications and, consequently, an even bigger number of resources to integrate and an overwhelming heterogeneity. In order to transform the web into the main bioinformatics application platform, it is necessary to design and develop new architectures that promote integration and interoperability.

The work that will be executed during this doctorate aims to create an innovative and comprehensive software framework that can enhance the development of novel web-based applications in the bioinformatics field. We believe that this is an essential step that will improve life sciences research at many levels. New web-based tools will provide easier and faster access to resources (data, services or applications) by providing a set of software tools that foster integration and interoperability; improve the application development cycle, reducing integration complexities and accelerating deployment; ease the everyday tasks of biologists and clinicians by empowering pervasive bioinformatics; open the path to the intelligent web by leveraging resource description; and, most importantly, promote communication and cooperation among the scientific community, which will ultimately result in bold and ambitious scientific discoveries.

 

 


 


References  

 [1]   J.  D.  Watson,  "The  human  genome  project:  past,  present,  and  future,"  Science,  vol.  248,  

pp.  44-­‐49,  April  6,  1990  1990.  [2]   R.  Tupler,  G.  Perini,  and  M.  R.  Green,  "Expressing  the  human  genome,"  Nature,  vol.  409,  

pp.  832-­‐833,  2001.  [3]   D.   Primorac,   "Human   Genome   Project-­‐based   applications   in   forensic   science,  

anthropology,  and  individualized  medicine,"  Croat  Med  J,  vol.  50,  pp.  205-­‐6,  Jun  2009.  [4]   L.  Biesecker,   J.  C.  Mullikin,  F.  Facio,  C.  Turner,  P.  Cherukuri,  R.  Blakesley,  G.  Bouffard,  P.  

Chines,   P.   Cruz,   N.   Hansen,   J.   Teer,   B.   Maskeri,   A.   Young,   N.   Comparative   Sequencing  Program,  T.  Manolio,  A.  Wilson,  T.  Finkel,  P.  Hwang,  A.  Arai,  A.  Remaley,  V.  Sachdev,  R.  Shamburek,  R.  Cannon,  and  E.  D.  Green,  "The  ClinSeq  Project:  Piloting  large-­‐scale  genome  sequencing  for  research  in  genomic  medicine,"  Genome  Res,  Jul  14  2009.  

[5]   R.   G.   H.   Cotton,   "Recommendations   of   the   2006   Human   Variome   Project   meeting,"  Nature  Genetics,  vol.  39,  pp.  433-­‐436,  2007.  

[6]   H.   Z.   Ring,   P.-­‐Y.   Kwok,   and   R.   G.   Cotton,   "Human   Variome   Project:   an   international  collaboration  to  catalogue  human  genetic  variation,"  Pharmacogenomics,  vol.  7,  pp.  969-­‐972,  2006.  

[7]   M.  G.  Aspinall   and  R.  G.  Hamermesh,   "Realizing   the  promise  of  personalized  medicine,"  Harv  Bus  Rev,  vol.  85,  pp.  108-­‐17,  165,  Oct  2007.  

[8]   D.  M.  Roden,  R.  B.  Altman,  N.  L.  Benowitz,  D.  A.  Flockhart,  K.  M.  Giacomini,  J.  A.  Johnson,  R.  M.  Krauss,  H.  L.  McLeod,  M.   J.  Ratain,  M.  V.  Relling,  H.  Z.  Ring,  A.  R.  Shuldiner,  R.  M.  Weinshilboum,   S.   T.   Weiss,   and   for   the   Pharmacogenetics   Research   Network,  "Pharmacogenomics:  Challenges  and  Opportunities,"  Ann   Intern  Med,  vol.  145,  pp.  749-­‐757,  November  21,  2006  2006.  

[9]   Erwin  P.  Bottinger,  "Foundations,  promises  and  uncertainties  of  personalized  medicine,"  Mount   Sinai   Journal   of  Medicine:   A   Journal   of   Translational   and   Personalized  Medicine,  vol.  74,  pp.  15-­‐21,  2007.  

[10]   J.  N.  Hirschhorn  and  M.  J.  Daly,  "Genome-­‐wide  association  studies  for  common  diseases  and  complex  traits,"  Nat  Rev  Genet,  vol.  6,  pp.  95-­‐108,  Feb  2005.  

[11]   S.   S.   S.   Reddy,   L.   S.   S.   Reddy,   V.   Khanaa,   and   A.   Lavanya,   "Advanced   Techniques   for  Scientific  Data  Warehouses,"  in  International  Conference  on  Advanced  Computer  Control,  ICACC,  2009,  pp.  576-­‐580.  

[12]   N.   Polyzotis,   S.   Skiadopoulos,   P.   Vassiliadis,   A.   Simitsis,   and   N.   Frantzell,   "Meshing  Streaming  Updates  with  Persistent  Data   in   an  Active  Data  Warehouse,"  Knowledge  and  Data  Engineering,  IEEE  Transactions  on,  vol.  20,  pp.  976-­‐991,  2008.  

[13]   Y.   Zhu,   L.   An,   and   S.   Liu,   "Data   Updating   and   Query   in   Real-­‐Time   Data   Warehouse  System,"   in  Computer   Science   and   Software   Engineering,   2008   International   Conference  on,  2008,  pp.  1295-­‐1297.  

[14]   A.  Kiani  and  N.  Shiri,  "A  Generalized  Model  for  Mediator  Based  Information  Integration,"  in  11th  International  Database  Engineering  and  Applications  Symposium,  pp.  268-­‐272.  

[15]   L.  M.  Haas,  P.  M.  Schwarz,  P.  Kodali,  E.  Kotlar,  J.  E.  Rice,  and  W.  C.  Swope,  "DiscoveryLink:  A  system  for  integrated  access  to  life  sciences  data  sources,"  IBM  Systems  Journal,  vol.  40,  pp.  489-­‐511,  2001.  

[16]   D.   Maglott,   J.   Ostell,   K.   D.   Pruitt,   and   T.   Tatusova,   "Entrez   Gene:   gene-­‐centered  information  at  NCBI,"  Nucleic  Acids  Research,  vol.  35,  2007.  


[17]   J.  L.  Oliveira,  G.  M.  S.  Dias,  I.  F.  C.  Oliveira,  P.  D.  N.  S.  d.  Rocha,  I.  Hermosilla  ,  J.  Vicente,  I.  Spiteri,  F.  Martin-­‐Sánchez,  and  A.  M.  M.  d.  S.  Pereira  "DiseaseCard:  A  Web-­‐based  Tool  for  the   Collaborative   Integration   of   Genetic   and  Medical   Information,"   in   5th   International  Symposium,  ISBMDA  2004:  Biological  and  Medical  Data  Analysis,  2004,  pp.  409-­‐417.  

[18]   J.   Arrais,   B.   Santos,   J.   Fernandes,   L.   Carreto,   M.   Santos,   A.   S.,   and   J.   L.   Oliveira,  "GeneBrowser:  an  approach  for  integration  and  functional  classification  of  genomic  data,"  2007.  

[19]   I.   Jerstad,   S.   Dustdar,   and   D.   V.   Thanh,   "A   service   oriented   architecture   framework   for  collaborative   services,"   in   Enabling   Technologies:   Infrastructure   for   Collaborative  Enterprise,  2005.  14th  IEEE  International  Workshops  on,  2005,  pp.  121-­‐125.  

[20]   C.   Papagianni,   G.   Karagiannis,   N.   D.   Tselikas,   E.   Sfakianakis,   I.   P.   Chochliouros,   D.  Kabilafkas,   T.   Cinkler,   L.   Westberg,   P.   Sjodin,   M.   Hidell,   S.   H.   de   Groot,   T.   Kontos,   C.  Katsigiannis,   C.   Pappas,   A.   Antonakopoulou,   and   I.   S.   Venieris,   "Supporting   End-­‐to-­‐End  Resource  Virtualization  for  Web  2.0  Applications  Using  Service  Oriented  Architecture,"  in  GLOBECOM  Workshops,  2008  IEEE,  2008,  pp.  1-­‐7.  

[21]   G.   Hohpe   and   B.   Woolf,   Enterprise   Integration   Patterns:   Designing,   Building,   and  Deploying  Messaging  Solutions:  Addison-­‐Wesley,  2004.  

[22]   R.   Kazman,   G.   Abowd,   L.   Bass,   and   P.   Clements,   "Scenario-­‐based   analysis   of   software  architecture,"  Software,  IEEE,  vol.  13,  pp.  47-­‐55,  1996.  

[23]   A.  Tolk  and  J.  A.  Muguira,  "Levels  of  Conceptual  Interoperability  Model,"  in  Fall  Simulation  Interoperability  Workshop,  Orlando,  Florida,  USA,  2003,  pp.  14-­‐19.  

[24]   P.   K.   Davis   and   R.   H.   Anderson,   "Improving   the   Composability   of   DoD   Models   and  Simulations,"   The   Journal   of   Defense   Modeling   and   Simulation:   Applications,  Methodology,  Technology,  vol.  1,  pp.  5-­‐17,  April  1,  2004  2004.  

[25]   T.  Berners-­‐Lee,  J.  Hendler,  and  O.  Lassila,  "The  Semantic  Web,"  Sci  Am,  vol.  284,  pp.  34  -­‐  43,  2001.  

[26]   M.   Uschold   and   M.   Gruninger,   "Ontologies:   Principles,   Methods   and   Applications,"  Knowledge  Engineering  Review,  vol.  11,  pp.  93-­‐155,  1996.  

[27]   M.  Zloof,  "Query  by  example,"  in  Proceedings  of  the  May  19-­‐22,  1975,  national  computer  conference  and  exposition  Anaheim,  California:  ACM,  1975.  

[28]   S.   Staab,   "Web   Services:   Been   there,   Done   That?,"   IEEE   Intelligent   Systems,   pp.   72-­‐85,  2003.  

[29]   W.  W.  W.  C.  W3C,  "Web  Services,"  World  Wide  Web  Consortium,  2002.  [30]   F.  Rosenberg,   F.   Curbera,  M.   J.  Duftler,   and  R.   Khalaf,   "Composing  RESTful   Services   and  

Collaborative  Workflows:  A  Lightweight  Approach,"  Internet  Computing,  IEEE,  vol.  12,  pp.  24-­‐31,  2008.  

[31]   W.  W.  W.  C.  W3C,  "Simple  Object  Access  Protocol,"  World  Wide  Web  Consortium,  2007.  [32]   OASIS,  "Universal  Description,  Discovery  and  Integration,"  OASIS,  2005.  [33]   W.  W.  W.   C.  W3C,   "Web   Service  Description   Language,"  World  Wide  Web   Consortium,  

2001.  [34]   J.  Kopecky,  T.  Vitvar,  C.  Bournez,  and  J.  Farrell,  "SAWSDL:  Semantic  Annotations  for  WSDL  

and  XML  Schema,"  Internet  Computing,  IEEE,  vol.  11,  pp.  60-­‐67,  2007.  [35]   E.   M.   a.   P.   P.   S.   F.   XMPP   Standards   Foundation,   "Extensible   Messaging   and   Presence  

Protocol,"    http://xmpp.org/:  IETF,  Internet  Engineering  Task  Force,  1999.  [36]   E.   M.   a.   P.   P.   S.   F.   XMPP   Standards   Foundation,   "XEP-­‐0244:   IO   Data,"    

http://xmpp.org/extensions/xep-­‐0244.html:   XMPP   Standards   Foundation,   Extensible  Messaging  and  Presense  Protocol  Standards  Foundation,  2008.  

[37]   E.   M.   a.   P.   P.   S.   F.   XMPP   Standards   Foundation,   "XEP-­‐0030:   Service   Discovery,"    http://xmpp.org/extensions/xep-­‐0030.html:   XMPP   Standards   Foundation,   Extensible  Messaging  and  Presense  Protocol  Standards  Foundation,  1999.  

[38]   N.   Milanovic   and  M.  Malek,   "Current   solutions   for  Web   service   composition,"   Internet  Computing,  IEEE,  vol.  8,  pp.  51-­‐59,  2004.  


[39]   C.   Peltz,   "Web   services   orchestration   and   choreography,"  Computer,  vol.   36,   pp.   46-­‐52,  2003.  

[40]   I.   Foster   and   C.   Kesselman,   The   Grid   2:   Blueprint   for   a   New   Computing   Infrastructure:  Morgan  Kaufmann  Publishers  Inc.,  2003.  

[41]   M.   Taiji,   T.   Narumi,   Y.   Ohno,   N.   Futatsugi,   A.   Suenaga,   N.   Takada,   and   A.   Konagaya,  "Protein  Explorer:  A  Petaflops  Special-­‐Purpose  Computer  System  for  Molecular  Dynamics  Simulations,"   in  Proceedings  of   the  2003  ACM/IEEE  conference  on  Supercomputing:   IEEE  Computer  Society,  2003.  

[42]   S.   Masuno,   T.   Maruyama,   Y.   Yamaguchi,   and   A.   Konagaya,   "Multidimensional   Dynamic  Programming   for   Homology   Search   on   Distributed   Systems,"   in   Euro-­‐Par   2006   Parallel  Processing,  2006,  pp.  1127-­‐1137.  

[43]   N.   Cannata,   E.   Merelli,   and   R.   B.   Altman,   "Time   to   Organize   the   Bioinformatics  Resourceome,"  PLoS  Comput  Biol,  vol.  1,  p.  e76,  2005.  

[44]   M.  Cannataro  and  D.  Talia,  "Semantics  and  knowledge  grids:  building  the  next-­‐generation  grid,"  Intelligent  Systems,  IEEE,  vol.  19,  pp.  56-­‐63,  2004.  

[45]   C.  Goble,  R.  Stevens,  and  S.  Bechhofer,  "The  Semantic  Web  and  Knowledge  Grids,"  Drug  Discovery  Today:  Technologies,  vol.  2,  pp.  225-­‐233,  2005.  

[46]   J.  Hendler,  "COMMUNICATION:  Enhanced:  Science  and  the  Semantic  Web,"  Science,  vol.  299,  pp.  520-­‐521,  January  24,  2003  2003.  

[47]   E.  Neumann,  "A  Life  Science  Semantic  Web:  Are  We  There  Yet?,"  Sci.  STKE,  vol.  2005,  pp.  pe22-­‐,  May  10,  2005  2005.  

[48]   M.  Stollberg  and  A.  Haller,  "Semantic  Web  services  tutorial,"  in  Services  Computing,  2005  IEEE  International  Conference  on,  2005,  p.  xv  vol.2.  

[49]   W.   W.   W.   C.   W3C,   "Resource   Description   Framework,"   World   Wide   Web   Consortium,  2004.  

[50]   W.  W.  W.  C.  W3C,  "Web  Ontology  Language,"  World  Wide  Web  Consortium,  2007.  [51]   A.   Ruttenberg,   T.   Clark,  W.   Bug,  M.   Samwald,   O.   Bodenreider,   H.   Chen,   D.   Doherty,   K.  

Forsberg,  Y.  Gao,  V.  Kashyap,  J.  Kinoshita,  J.  Luciano,  M.  S.  Marshall,  C.  Ogbuji,  J.  Rees,  S.  Stephens,  G.  T.  Wong,  E.  Wu,  D.  Zaccagnini,  T.  Hongsermeier,  E.  Neumann,  I.  Herman,  and  K.  H.  Cheung,   "SPARQL:  Advancing   translational   research  with   the  Semantic  Web,"  BMC  Bioinformatics,  vol.  8,  p.  S2,  2007.  

[52]   E.  J.  Miller,  "An  Introduction  to  the  Resource  Description  Framework,"  Journal  of  Library  Administration,  vol.  34,  pp.  245-­‐255,  2001.  

[53]   S.   Harris   and   N.   Shadbolt,   "SPARQL   Query   Processing   with   Conventional   Relational  Database   Systems,"   in  Web   Information   Systems   Engineering   –  WISE   2005  Workshops,  2005,  pp.  235-­‐244.  

[54]   R.  Stevens,  C.  A.  Goble,  and  S.  Bechhofer,  "Ontology-­‐based  knowledge  representation  for  bioinformatics,"  Brief  Bioinform,  vol.  1,  pp.  398-­‐414,  January  1,  2000  2000.  

[55]   D.  Martin,  M.  Paolucci,  S.  McIlraith,  M.  Burstein,  D.  McDermott,  D.  McGuinness,  B.  Parsia,  T.  Payne,  M.  Sabou,  M.  Solanki,  N.  Srinivasan,  and  K.  Sycara,  "Bringing  Semantics  to  Web  Services:  The  OWL-­‐S  Approach,"  in  Semantic  Web  Services  and  Web  Process  Composition,  2005,  pp.  26-­‐42.  

[56]   M.   Hepp,   "Semantic   Web   and   semantic   Web   services:   father   and   son   or   indivisible  twins?,"  Internet  Computing,  IEEE,  vol.  10,  pp.  85-­‐88,  2006.  

[57]   H.   M.   Sneed,   "Software   evolution.   A   road   map,"   in   Software   Maintenance,   2001.  Proceedings.  IEEE  International  Conference  on,  2001,  p.  7.  

[58]   Z.   Zou,   Z.  Duan,   and   J.  Wang,   "A  Comprehensive   Framework   for  Dynamic  Web  Services  Integration,"  in  European  Conference  on  Web  Services  (ECOWS'06),  2006.  

[59]   G.   O.   H.   Chong  Minsk,   L.   E.   E.   Siew   Poh,   H.   E.  Wei,   and   T.   A.   N.   Puay   Siew,   "Web   2.0  Concepts  and  Technologies  for  Dynamic  B2B  Integration,"  IEEE,  pp.  315-­‐321,  2007.  

[60]   M.  Turner,  D.  Budgen,  and  P.  Brereton,  "Turning  software  into  a  service,"  Computer,  vol.  36,  pp.  38-­‐44,  2003.  


[61]   L.  Xuanzhe,  H.  Yi,  S.  Wei,  and  L.  Haiqi,  "Towards  Service  Composition  Based  on  Mashup,"  in  Services,  2007  IEEE  Congress  on,  2007,  pp.  332-­‐339.  

[62]   N.  Yan,  "Build  Your  Mashup  with  Web  Services,"  in  Web  Services,  2007.  ICWS  2007.  IEEE  International  Conference  on,  2007,  pp.  xli-­‐xli.  

[63]   Q.  Zhao,  G.  Huang,  J.  Huang,  X.  Liu,  and  H.  Mei,  "A  Web-­‐Based  Mashup  Environment  for  On-­‐the-­‐Fly  Service  Composition,"  in  Service-­‐Oriented  System  Engineering,  2008.  SOSE  '08.  IEEE  International  Symposium  on,  2008,  pp.  32-­‐37.  

[64]   D.  Hollingsworth,  The  Workflow  Reference  Model,  1995.  [65]   G.  Preuner  and  M.  Schrefl,  "Integration  of  Web  Services  into  Workflows  through  a  Multi-­‐

Level  Schema  Architecture,"  in  4th  IEEE  Int’l  Workshop  on  Advanced  Issues  of  E-­‐Commerce  and  Web-­‐Based  Information  Systems  (WECWIS  2002),  2002.  

[66]   J.   Cardoso   and   A.   Sheth,   "Semantic   E-­‐Workflow   Composition,"   Journal   of   Intelligent  Information  Systems,  2003.  

[67]   P.   C.   K.   Hung   and   D.   K.  W.   Chiu,   "Developing  Workflow-­‐based   Information   Integration  (WII)   with   Exception   Support   in   a   Web   Services   Environment,"   in   37th   Hawaii  International  Conference  on  System  Sciences  -­‐  2004,  2004.  

[68]   S.  Petkov,  E.  Oren,  and  A.  Haller,  "Aspects  in  Workflow  Management,"  2005.  [69]   M.  Kanehisa  and  S.  Goto,  "KEGG:  Kyoto  Encyclopedia  of  Genes  and  Genomes,"  Nucl.  Acids  

Res.,  vol.  28,  pp.  27-­‐30,  January  1,  2000  2000.  [70]   M.  Kanehisa,  S.  Goto,  M.  Hattori,  K.  F.  Aoki-­‐Kinoshita,  M.  Itoh,  S.  Kawashima,  T.  Katayama,  

M.  Araki,  and  M.  Hirakawa,  "From  genomics  to  chemical  genomics:  new  developments  in  KEGG,"  Nucl.  Acids  Res.,  vol.  34,  pp.  D354-­‐357,  January  1,  2006  2006.  

[71]   A.  Bairoch,  R.  Apweiler,  C.  H.  Wu,  W.  C.  Barker,  B.  Boeckmann,  S.  Ferro,  E.  Gasteiger,  H.  Huang,  R.  Lopez,  M.  Magrane,  M.  J.  Martin,  D.  A.  Natale,  C.  O'Donovan,  N.  Redaschi,  and  L.-­‐S.   L.   Yeh,   "The   Universal   Protein   Resource   (UniProt),"   Nucl.   Acids   Res.,   vol.   33,   pp.  D154-­‐159,  January  1,  2005  2005.  

[72]   H.  Parkinson,  U.  Sarkans,  M.  Shojatalab,  N.  Abeygunawardena,  S.  Contrino,  R.  Coulson,  A.  Farne,  G.  Garcia  Lara,  E.  Holloway,  M.  Kapushesky,  P.  Lilja,  G.  Mukherjee,  A.  Oezcimen,  T.  Rayner,   P.   Rocca-­‐Serra,   A.   Sharma,   S.   Sansone,   and   A.   Brazma,   "ArrayExpress-­‐-­‐a   public  repository   for  microarray  gene  expression  data  at   the  EBI,"  Nucl.  Acids  Res.,  vol.  33,  pp.  D553-­‐555,  January  1,  2005  2005.  

[73]   T.   Margaria,   M.   G.   Hinchey,   H.   Raelt,   J.   Rash,   C.   A.   Rou,   and   B.   Steffen,   "Ensembl  Database:  Completing  and  Adapting  Models  of  Biological  Processes,"  Proceedings  of   the  Conference   on   Biologically   Inspired   Cooperative   Computing   (BiCC   IFIP):   20-­‐25   August  2006;  Santiago  (Chile),  pp.  43  -­‐  54,  2006.  

[74]   N.   J.  Mulder,   R.   Apweiler,   T.   K.   Attwood,   A.   Bairoch,   A.   Bateman,   D.   Binns,   P.   Bork,   V.  Buillard,   L.   Cerutti,   R.   Copley,   E.   Courcelle,  U.  Das,   L.   Daugherty,  M.  Dibley,   R.   Finn,  W.  Fleischmann,   J.  Gough,  D.  Haft,  N.  Hulo,  S.  Hunter,  D.  Kahn,  A.  Kanapin,  A.  Kejariwal,  A.  Labarga,   P.   S.   Langendijk-­‐Genevaux,   D.   Lonsdale,   R.   Lopez,   I.   Letunic,   M.   Madera,   J.  Maslen,  C.  McAnulla,   J.  McDowall,   J.  Mistry,  A.  Mitchell,  A.  N.  Nikolskaya,  S.  Orchard,  C.  Orengo,  R.  Petryszak,  J.  D.  Selengut,  C.  J.  A.  Sigrist,  P.  D.  Thomas,  F.  Valentin,  D.  Wilson,  C.  H.  Wu,  and  C.  Yeats,  "New  developments  in  the  InterPro  database,"  Nucl.  Acids  Res.,  vol.  35,  pp.  D224-­‐228,  January  12,  2007  2007.  

[75]   S.  T.  Sherry,  M.-­‐H.  Ward,  M.  Kholodov,  J.  Baker,  L.  Phan,  E.  M.  Smigielski,  and  K.  Sirotkin,  "dbSNP:   the  NCBI  database  of   genetic   variation,"  Nucl.  Acids  Res.,  vol.   29,   pp.   308-­‐311,  January  1,  2001  2001.  

[76]   A.   Hamosh,   A.   F.   Scott,   J.   S.   Amberger,   C.   A.   Bocchini,   and   V.   A.   McKusick,   "Online  Mendelian   Inheritance   in  Man   (OMIM),   a   knowledgebase   of   human   genes   and   genetic  disorders,"  Nucl.  Acids  Res.,  vol.  33,  pp.  D514-­‐517,  January  1,  2005  2005.  

[77]   C.  E.  Lipscomb,  "Medical  Subject  Headings  (MeSH),"  Bull  Med  Libr  Assoc,  vol.  88,  pp.  265-­‐6,  Jul  2000.  


[78]   A.  Kahraman,  A.  Avramov,  L.  G.  Nashev,  D.  Popov,  R.  Ternes,  H.-­‐D.  Pohlenz,  and  B.  Weiss,  "PhenomicDB:   a   multi-­‐species   genotype/phenotype   database   for   comparative  phenomics,"  Bioinformatics,  vol.  21,  pp.  418-­‐420,  February  1,  2005  2005.  

[79]   P.   Groth,   N.   Pavlova,   I.   Kalev,   S.   Tonov,   G.   Georgiev,   H.-­‐D.   Pohlenz,   and   B.   Weiss,  "PhenomicDB:  a  new  cross-­‐species  genotype/phenotype  resource,"  Nucl.  Acids  Res.,  vol.  35,  pp.  D696-­‐699,  January  12,  2007  2007.  

[80]   L.  Sam,  E.  Mendonca,  J.  Li,  J.  Blake,  C.  Friedman,  and  Y.  Lussier,  "PhenoGO:  an  integrated  resource   for   the  multiscale  mining   of   clinical   and   biological   data,"  BMC   Bioinformatics,  vol.  10,  p.  S8,  2009.  

[81]   Ivo F.A.C. Fokkema, Johan T. den Dunnen, and Peter E.M. Taschner, "LOVD: Easy creation of a locus-specific sequence variation database using an 'LSDB-in-a-box' approach," Human Mutation, vol. 26, pp. 63-68, 2005.

[82]   Christophe Béroud, Gwenaëlle Collod-Béroud, Catherine Boileau, Thierry Soussi, and Claudine Junien, "UMD (Universal Mutation Database): A generic software to build and analyze locus-specific databases," Human Mutation, vol. 15, pp. 86-94, 2000.

[83]   T.   Smith   and   R.   Cotton,   "VariVis:   a   visualisation   toolkit   for   variation   databases.,"   BMC  Bioinformatics,  vol.  9,  p.  206,  2008.  

[84]   S.   Haider,   B.   Ballester,   D.   Smedley,   J.   Zhang,   P.   Rice,   and  A.   Kasprzyk,   "BioMart   Central  Portal-­‐-­‐unified   access   to   biological   data,"  Nucl.   Acids   Res.,   vol.   37,   pp.  W23-­‐27,   July   1,  2009  2009.  

[85]   A.   Jenkinson,  M.  Albrecht,  E.  Birney,  H.  Blankenburg,  T.  Down,  R.  Finn,  H.  Hermjakob,  T.  Hubbard,   R.   Jimenez,   P.   Jones,   A.   Kahari,   E.   Kulesha,   J.  Macias,   G.   Reeves,   and  A.   Prlic,  "Integrating   biological   data   -­‐   the   Distributed   Annotation   System,"   BMC   Bioinformatics,  vol.  9,  p.  S3,  2008.  

[86]   D.  Smedley,  S.  Haider,  B.  Ballester,  R.  Holland,  D.  London,  G.  Thorisson,  and  A.  Kasprzyk,  "BioMart  -­‐  biological  queries  made  easy,"  BMC  Genomics,  vol.  10,  p.  22,  2009.  

[87]   M.  Sarachu  and  M.  Colet,  "wEMBOSS:  a  web  interface  for  EMBOSS,"  Bioinformatics,  vol.  21,  pp.  540-­‐1,  Feb  15  2005.  

[88]   P.   Rice,   I.   Longden,   and   A.   Bleasby,   "EMBOSS:   the   European   Molecular   Biology   Open  Software  Suite,"  Trends  Genet,  vol.  16,  pp.  276-­‐7,  Jun  2000.  

[89]   S.  Pillai,  V.  Silventoinen,  K.  Kallio,  M.  Senger,  S.  Sobhany,  J.  Tate,  S.  Velankar,  A.  Golovin,  K.  Henrick,  P.  Rice,  P.  Stoehr,  and  R.  Lopez,  "SOAP-­‐based  services  provided  by  the  European  Bioinformatics  Institute,"  Nucl.  Acids  Res.,  vol.  33,  pp.  W25-­‐28,  July  1,  2005  2005.  

[90]   M.  DiBernardo,  R.  Pottinger,  and  M.  Wilkinson,  "Semi-­‐automatic  web  service  composition  for  the  life  sciences  using  the  BioMoby  semantic  web  framework,"  Journal  of  Biomedical  Informatics,  vol.  41,  pp.  837-­‐847,  2008.  

[91]   M.  Wilkinson  and  M.  Links,  "BioMoby:  An  open  source  biological  web  services  proposal,"  Brief  Bioinform,  vol.  3,  pp.  331  -­‐  341,  2002.  

[92]   Y.   Kwon,   Y.   Shigemoto,   Y.   Kuwana,   and   H.   Sugawara,   "Web   API   for   biology   with   a  workflow  navigation  system,"  Nucl.  Acids  Res.,  vol.  37,  pp.  W11-­‐16,  July  1,  2009  2009.  

[93]   H.  Sugawara  and  S.  Miyazaki,  "Biological  SOAP  servers  and  web  services  provided  by  the  public  sequence  data  bank,"  Nucl.  Acids  Res.,  vol.  31,  pp.  3836-­‐3839,  July  1,  2003  2003.  

[94]   C.   Goble   and   R.   Stevens,   "State   of   the   nation   in   data   integration   for   bioinformatics,"  Journal  of  Biomedical  Informatics,  vol.  41,  pp.  687-­‐693,  2008.  

[95]   L.  Stein,  "Creating  a  bioinformatics  nation,"  Nature,  vol.  417,  pp.  119  -­‐  20,  2002.  [96]   L.  D.  Stein,  "Integrating  biological  databases,"  Nature  Genetics,  vol.  4,  pp.  337-­‐345,  2003.  [97]   R.   Stevens,   A.   Robinson,   and   C.   Goble,   "myGrid:   personalized   bioinformatics   on   the  

information  grid,"  Bioinformatics,  vol.  19,  pp.  I302  -­‐  I304,  2003.  [98]   V.  Bashyam,  W.  Hsu,  E.  Watt,  A.  A.  T.  Bui,  H.  Kangarloo,  and  R.  K.  Taira,   "Informatics   in  

Radiology:  Problem-­‐centric  Organization  and  Visualization  of  Patient  Imaging  and  Clinical  Data,"  Radiographics,  p.  292085098,  January  23,  2009  2009.  


[99]   M.   A.   Vouk,   "Cloud   computing   -­‐   Issues,   research   and   implementations,"   in   Information  Technology  Interfaces,  2008.  ITI  2008.  30th  International  Conference  on,  2008,  pp.  31-­‐40.  

[100]   J.   Wagener,   O.   Spjuth,   E.   Willighagen,   and   J.   Wikberg,   "XMPP   for   cloud   computing   in  bioinformatics  supporting  discovery  and  invocation  of  asynchronous  Web  services,"  BMC  Bioinformatics,  vol.  10,  p.  279,  2009.  

[101]   G. Dias, F.-J. Vicente, J. L. Oliveira, and F. Martin-Sánchez, "Integrating Medical and Genomic Data: a Successful Example For Rare Diseases," in MIE 2006: The 20th International Congress of the European Federation for Medical Informatics, 2006, pp. 125-130.

[102]   J.   Arrais,   J.   Pereira,   and   J.   L.  Oliveira,   "GeNS:  A   biological   data   integration   platform,"   in  ICBB  2009,  International  Conference  on  Bioinformatics  and  Biomedicine,  Venice,  2009.  

[103]   A.  Birkland  and  G.  Yona,  "BIOZON:  a  system  for  unification,  management  and  analysis  of  heterogeneous  biological  data,"  BMC  Bioinformatics,  vol.  7,  2006.  

[104]   T. Margaria, R. Nagel, and B. Steffen, "Bioconductor-jETI: A Tool for Remote Tool Integration," in Proceedings of the 11th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS 2005), Edinburgh, U.K., 2005, pp. 557-562.

[105]   N. Cannata, M. Schroeder, R. Marangoni, and P. Romano, "A Semantic Web for bioinformatics: goals, tools, systems, applications," BMC Bioinformatics, vol. 9, p. S1, 2008.

[106]   M.  Ashburner,   C.   A.   Ball,   J.   A.   Blake,  D.   Botstein,  H.   Butler,   J.  M.   Cherry,   A.   P.  Davis,   K.  Dolinski,   S.   S.  Dwight,   J.   T.   Eppig,  M.  A.  Harris,  D.   P.  Hill,   L.   Issel-­‐Tarver,   A.   Kasarskis,   S.  Lewis,   J.  C.  Matese,   J.   E.  Richardson,  M.  Ringwald,  G.  M.  Rubin,   and  G.   Sherlock,   "Gene  Ontology:  tool  for  the  unification  of  biology,"  Nat  Genet,  vol.  25,  pp.  25-­‐29,  2000.  

[107]   I. Vastrik, P. D'Eustachio, E. Schmidt, G. Joshi-Tope, G. Gopinath, D. Croft, B. de Bono, M. Gillespie, B. Jassal, S. Lewis, L. Matthews, G. Wu, E. Birney, and L. Stein, "Reactome: a knowledge base of biologic pathways and processes," Genome Biology, vol. 8, p. R39, 2007.

[108]   A.  Mottaz,  Y.  Yip,  P.  Ruch,  and  A.  Veuthey,   "Mapping  proteins   to  disease   terminologies:  from  UniProt  to  MeSH.,"  BMC  Bioinformatics,  vol.  9  Suppl  5,  p.  S3,  2008.  

[109]   E.   K.   Neumann   and   D.   Quan,   "BioDASH:   a   Semantic   Web   dashboard   for   drug  development,"  Pac  Symp  Biocomput,  pp.  176  -­‐  187,  2006.  

[110]   A.   Splendiani,   "RDFScape:   Semantic  Web  meets   Systems   Biology,"   BMC   Bioinformatics,  vol.  9,  p.  S6,  2008.  

[111]   F.   Belleau,  M.-­‐A.  Nolin,   N.   Tourigny,   P.   Rigault,   and   J.  Morissette,   "Bio2RDF:   Towards   a  mashup   to   build   bioinformatics   knowledge   systems,"   Journal   of   Biomedical   Informatics,  vol.  41,  pp.  706-­‐716,  2008.  

[112]   M.   Schroeder,   A.   Burger,   P.   Kostkova,   R.   Stevens,   B.   Habermann,   and   R.   Dieng-­‐Kuntz,  "From  a  Services-­‐based  eScience  Infrastructure  to  a  Semantic  Web  for  the  Life  Sciences:  The   Sealife   Project,"   Proceedings   of   the   Sixth   International  Workshop   NETTAB   2006   on  "Distributed   Applications,   Web   Services,   Tools   and   GRID   Infrastructures   for  Bioinformatics",  2006.  

[113]   K.-­‐H.   Cheung,   V.   Kashyap,   J.   S.   Luciano,   H.   Chen,   Y.  Wang,   and   S.   Stephens,   "Semantic  mashup   of   biomedical   data,"   Journal   of   Biomedical   Informatics,   vol.   41,   pp.   683-­‐686,  2008.  

[114]   R.  de  Knikker,  Y.  Guo,  J.-­‐l.  Li,  A.  Kwan,  K.  Yip,  D.  Cheung,  and  K.-­‐H.  Cheung,  "A  web  services  choreography   scenario   for   interoperating   bioinformatics   applications,"   BMC  Bioinformatics,  vol.  5,  p.  25,  2004.  

[115]   T.   Margaria,   C.   Kubczak,   and   B.   Steffen,   "Bio-­‐jETI:   a   service   integration,   design,   and  provisioning   platform   for   orchestrated   bioinformatics   processes,"   BMC   Bioinformatics,  vol.  9,  p.  S12,  2008.  


[116]   T.  Oinn,  M.  Addis,  J.  Ferris,  D.  Marvin,  M.  Senger,  M.  Greenwood,  T.  Carver,  K.  Glover,  M.  R.   Pocock,   A.  Wipat,   and   P.   Li,   "Taverna:   a   tool   for   the   composition   and   enactment   of  bioinformatics  workflows,"  Bioinformatics,  vol.  20,  pp.  3045  -­‐  3054,  2004.  

[117]   B. Ludäscher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger, M. Jones, E. A. Lee, J. Tao, and Y. Zhao, "Scientific Workflow Management and the Kepler System," Concurrency and Computation: Practice & Experience, vol. 18, pp. 1039-1065, 2006.

[118]   C. A. Goble and D. C. De Roure, "myExperiment: social networking for workflow-using e-scientists," in Proceedings of the 2nd Workshop on Workflows in Support of Large-Scale Science, Monterey, California, USA: ACM, 2007.

[119]   E. Bartocci, F. Corradini, E. Merelli, and L. Scortichini, "BioWMS: a web-based Workflow Management System for bioinformatics," BMC Bioinformatics, vol. 8, p. 14, 2007.

[120]   P.   Romano,   E.   Bartocci,   G.   Bertolini,   F.   De   Paoli,   D.  Marra,   G.  Mauri,   E.  Merelli,   and   L.  Milanesi,   "Biowep:   a   workflow   enactment   portal   for   bioinformatic   applications,"   BMC  Bioinformatics,  vol.  8,  2007.  

[121]   P.   Romano,   D.   Marra,   and   L.   Milanesi,   "Web   services   and   workflow   management   for  biological  resources,"  BMC  Bioinformatics,  vol.  6,  p.  S24,  2005.  

[122]   T.  Life  Sciences  Practice,  "BioWBI  and  WEE:  Tools  for  Bioinformatics  Analysis  Workflows,"  2004.  

[123]   P.   Lopes,   J.   Arrais,   and   J.   L.   Oliveira,   "Dynamic   Service   Integration   using   Web-­‐based  Workflows,"   in   10th   International   Conference   on   Information   Integration   and   Web  Applications  &  Services,  Linz,  Austria,  2008,  pp.  622-­‐625.  

[124]   P.   Lopes,   J.   Arrais,   and   J.   Oliveira,   "DynamicFlow:   A   Client-­‐Side  Workflow  Management  System,"  in  Distributed  Computing,  Artificial  Intelligence,  Bioinformatics,  Soft  Computing,  and  Ambient  Assisted  Living,  2009,  pp.  1101-­‐1108.  

[125]   H.   Jamil  and  B.  El-­‐Hajj-­‐Diab,   "BioFlow:  A  Web-­‐Based  Declarative  Workflow  Language   for  Life  Sciences,"  in  Proceedings  of  the  2008  IEEE  Congress  on  Services  -­‐  Part  I  -­‐  Volume  00:  IEEE  Computer  Society,  2008.  

[126]   D.   Dvir,   T.   Raz,   and   A.   J.   Shenhar,   "An   empirical   analysis   of   the   relationship   between  project  planning  and  project  success,"   International  Journal  of  Project  Management,  vol.  21,  pp.  89-­‐95,  2003.  

[127]   E. Garfield, "The History and Meaning of the Journal Impact Factor," JAMA, vol. 295, pp. 90-93, January 4, 2006.