Integration of Bioinformatics Web Services through the Search Computing Technology

Integration of Bioinformatics Web Services through the Search Computing Technology. Davide Chicco, Dipartimento di Elettronica e Informazione. Doctoral Minor Research Project Defense, 19th November 2012.


Page 1: Integration of Bioinformatics Web Services through the Search Computing Technology

Integration of Bioinformatics Web Services

through the Search Computing Technology

Davide Chicco

Dipartimento di Elettronica e Informazione

Doctoral Minor Research

Project Defense

19th November 2012

Page 2: Integration of Bioinformatics Web Services through the Search Computing Technology

SUMMARY

1. The problem

• Multidomain questions

2. The proposed solution

• GPDW Data Warehouse

• Search Computing

• Bio-SeCo

3. Developed and added services

• Exploiting GPDW Data Warehouse

• Semantic Similarity

4. Conclusions

Page 3: Integration of Bioinformatics Web Services through the Search Computing Technology


Data and search service scenario

in the Life Sciences

• In the Life Sciences: large amounts of data, sparsely distributed across many heterogeneous sources

• Many are ranked data (or partially ranked) of various

types, representing different phenomena, e.g.:

– physical ordering, e.g. within a genome

– analytical order through algorithmically assigned

scores, e.g. representing levels of sequence similarity

– experimentally measured values, such as gene

expression levels

• The ordering may represent a range of different notions,

such as quantity, confidence, or location

Page 4: Integration of Bioinformatics Web Services through the Search Computing Technology


Life Science questions and their answering

– Several Life Science questions:

- are complex

- require, in order to be answered, the integration and comprehensive evaluation of different data

– often distributed, and many of them ranked

• Answering complex questions requires integration of vertical

search services to create multi-topic searches

• where the different topic searches either refine or augment previous

search results

• Bioinformatics data integration platforms exist

– Ordered data are poorly supported, or not supported at all, by current data integration platforms

Page 5: Integration of Bioinformatics Web Services through the Search Computing Technology


Life Science multidomain question

Example: “Which genes encode proteins in different

organisms with high sequence similarity to a protein X and

have some biomedical features in common e.g. up/down

significantly co-expressed in the same biological tissue or

condition Y and involved in the biological function Z?”

Information to answer such queries is available on the Internet,

but no available software system is capable of computing the

answer

The user has to search several different, often independent, resources.

Page 6: Integration of Bioinformatics Web Services through the Search Computing Technology


• Several integrated databanks, including:

• Entrez Gene, Ensembl

• Homologene

• IPI, UniProt/Swiss-Prot

• Gene Ontology, GOA

• BioCyc, KEGG, Reactome

• InterPro, Pfam

• OMIM, eVOC, …

• Numerous integrated data, including:

• 8,085,152 genes of 8,410 organisms

• 31,347,655 proteins of 367,853 species

• 33,252 Gene Ontology terms and 61,899 relations (is a, part of)

• 27,667 biochemical pathways

• 14,163 protein domains; 7,215 OMIM genetic disorders; …

GPDW Data Warehouse

Figure: on-line databanks (Homologene, Entrez Gene, IPI, eVOC, KEGG, Reactome, Gene Ontology, GOA, BioCyc) feed the Genomic and Proteomic Data Warehouse database server through automatic updating procedures.

Page 7: Integration of Bioinformatics Web Services through the Search Computing Technology


Search Computing project at PoliMi

Search Computing (SeCo) aims at:

1. Developing the informatics framework required for computing multi-topic searches by combining single-topic search results from search engines, which are often ranked, with other data and computational resources

• directly supporting multi-topic ordered data

• taking into account order when the results of several

requests are combined

• enabling exploration and expansion of search results

2. Applying SeCo technology in different fields, including the Life Sciences => Bio-SeCo: support for answering complex bioinformatics queries

Page 8: Integration of Bioinformatics Web Services through the Search Computing Technology


Bio-SeCo: SeCo technologies to answer Life Science questions

Life Science example query:

“Which genes encode proteins in different organisms with

high sequence similarity to a protein X and have some

biomedical features in common, e.g. up/down significantly

co-expressed in the same biological tissue or condition Y

and involved in a biological function Z?”

This multi-topic case study question can be decomposed into the following four single-topic sub-queries, each of which can be mapped to an available search service.

Page 9: Integration of Bioinformatics Web Services through the Search Computing Technology


Bio-SeCo: SeCo technologies to answer Life Science questions

• “Which proteins in different organisms have high sequence

similarity to a protein X ?”

BLAST, a sequence similarity search program, in one

of its many implementations, e.g. WU-BLAST or

NCBI-Blast

• “Which genes encode which proteins ?”

GPDW (Genomic and Proteomic Data Warehouse), a

query service to a database of genomic and proteomic

data (GPDW_protein2gene)

Page 10: Integration of Bioinformatics Web Services through the Search Computing Technology


Bio-SeCo: SeCo technologies to answer Life Science questions

• “Which genes are up/down significantly co-expressed in the

same biological condition / tissue Y ?”

ArrayExpress Gene Expression Atlas, a search engine for gene expression data

• “Which genes are involved in a biological function Z ?”

GPDW (Genomic and Proteomic Data Warehouse), a

query service to a database of genomic and proteomic

data (GPDW_gene2biologicalFunctionFeature)

Page 11: Integration of Bioinformatics Web Services through the Search Computing Technology


Bio-SeCo: SeCo technologies to answer Life Science questions

The answer to each part of the question is integrated with the others, together with all the ranked results found

BLAST

ArrayExpress

GPDW_gene2biologicalFunctionFeature

GPDW_protein2gene
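The integration of ranked answers can be illustrated with a minimal sketch. It assumes that two services return normalized scores in [0, 1] keyed by gene, and that the combined rank is the product of the two scores; the function name `join_ranked`, the gene symbols and the scores are all hypothetical, and the actual join strategy used by Search Computing is not described in these slides.

```python
from typing import Dict, List, Tuple


def join_ranked(
    left: Dict[str, float],
    right: Dict[str, float],
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Join two score tables on shared keys and rank by the product of the scores."""
    shared = left.keys() & right.keys()
    combined = {key: left[key] * right[key] for key in shared}
    # Sort by descending combined score; ties are broken alphabetically
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical normalized scores returned by two services (invented values)
similarity_scores = {"GENE_A": 0.9, "GENE_B": 0.8, "GENE_C": 0.4}
expression_scores = {"GENE_B": 0.75, "GENE_C": 0.5, "GENE_D": 0.7}

ranking = join_ranked(similarity_scores, expression_scores)
for gene, score in ranking:
    print(gene, round(score, 3))
```

Only genes returned by both services survive the join, and their order reflects both rankings at once. Other aggregation functions (for example a weighted sum) would fit the same structure.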

Page 12: Integration of Bioinformatics Web Services through the Search Computing Technology


What I have done for my Minor Research project

Page 13: Integration of Bioinformatics Web Services through the Search Computing Technology

Semantic network: before

Figure: semantic network with the nodes Gene, Protein, Biological Function Feature and Gene Expression, linked by the relations Is_similar_to, Is_encoded_by, Has and Is_involved_in.

Page 14: Integration of Bioinformatics Web Services through the Search Computing Technology

Semantic network: now

Figure: semantic network with the nodes Gene, Protein, Biological Function Feature, Gene Expression, Genetic Disorder and Pathway, linked by the relations Is_similar_to, Is_encoded_by, Codes, Has, Is_involved_in and Is_functional_similar_to.

Page 15: Integration of Bioinformatics Web Services through the Search Computing Technology


Services I added: GPDW exploitation

A Genetic Disorder is an illness caused by abnormalities in genes

or chromosomes, especially a condition that is present from

before birth.

In biochemistry, Metabolic Pathways are series of chemical

reactions occurring within a cell. In each pathway, a principal

chemical is modified by a series of chemical reactions.

Page 16: Integration of Bioinformatics Web Services through the Search Computing Technology


Services I added: GPDW exploitation

Which Genetic Disorders is the Gene X involved in ?

GPDW (Genomic and Proteomic Data Warehouse), a query

service to a database of genomic and proteomic data

(GPDW_gene2geneticDisorder)

Page 17: Integration of Bioinformatics Web Services through the Search Computing Technology


Services I added: GPDW exploitation

Which Genetic Disorders is the Protein Y involved in ?

GPDW (Genomic and Proteomic Data Warehouse), a query

service to a database of genomic and proteomic data

(GPDW_protein2geneticDisorder)

Page 18: Integration of Bioinformatics Web Services through the Search Computing Technology


Services I added: GPDW exploitation

Which Genes does the Genetic Disorder X involve?

GPDW (Genomic and Proteomic Data Warehouse), a query

service to a database of genomic and proteomic data

(GPDW_geneticDisorder2gene)

Page 19: Integration of Bioinformatics Web Services through the Search Computing Technology


Services I added: GPDW exploitation

Which Proteins does the Genetic Disorder X involve?

GPDW (Genomic and Proteomic Data Warehouse), a query

service to a database of genomic and proteomic data

(GPDW_geneticDisorder2protein)

Page 20: Integration of Bioinformatics Web Services through the Search Computing Technology


Services I added: GPDW exploitation

Same questions and GPDW services for Metabolic Pathways:

• GPDW_gene2pathway

• GPDW_protein2pathway

• GPDW_pathway2gene

• GPDW_pathway2protein

Page 21: Integration of Bioinformatics Web Services through the Search Computing Technology


Services I added: GPDW exploitation

A Biological Function Feature is an item of information about a

gene or a protein. It defines a certain peculiarity of a biomolecular

entity. E.g.: “is involved in lung cancer”

GPDW_protein2biological_function_feature

Page 22: Integration of Bioinformatics Web Services through the Search Computing Technology


Services I added

• These new services (Genetic Disorder and Pathway) are useful, but they do not take advantage of the main novelty provided by the Search Computing technology: the integration of ranked results

• There’s no ranking on “being involved” in a Genetic

Disorder or a Pathway…

Page 23: Integration of Bioinformatics Web Services through the Search Computing Technology


Services I added: Gene Semantic Similarity

• The other service I integrated into Bio-SeCo (SemSim) computes the semantic similarity between a gene and the genes in a list:

• This service provides ranked results (given a gene X, it returns a list of genes ranked from the most semantically similar to X to the least semantically similar)

• SemSim takes advantage of the Search Computing capability of integrating ranked results

Figure: Gene node with the Is_functional_similar_to relation

Page 24: Integration of Bioinformatics Web Services through the Search Computing Technology


Semantic Similarity?!? What does it mean?

• Key point: given gene X and gene Y, how similar are they?

• Semantically similar genes can be involved in similar activities, can be involved in similar pathways, and can have many annotations in common

• To measure this similarity, I chose the Latent Semantic Indexing method, based on a matrix built from gene-related annotations

Page 25: Integration of Bioinformatics Web Services through the Search Computing Technology


Biomolecular annotation

• The concept of annotation: association of nucleotide or amino

acid sequences with useful information describing their features

• This information is expressed through controlled

vocabularies, sometimes structured as ontologies, where

every controlled term of the vocabulary is associated with a

unique alphanumeric code

• The association of such a code with a gene or protein ID

constitutes an annotation

Figure: Gene / Protein linked to Biological function feature by an Annotation (gene2bff)

Page 26: Integration of Bioinformatics Web Services through the Search Computing Technology


Biomolecular annotation

• The association of a piece of information (a feature) with a gene or protein ID constitutes an annotation

• Annotation example:

• gene: GD4

• feature: “is present in the mitochondrial membrane”


Page 27: Integration of Bioinformatics Web Services through the Search Computing Technology


Latent Semantic Indexing:

Singular Value Decomposition – SVD

– Annotation matrix $A \in \{0, 1\}^{m \times n}$

− m rows: genes / proteins

− n columns: annotation terms

$A(i,j) = 1$ if gene / protein i is annotated to term j or to any descendant of j in the considered ontology structure (true path rule)

$A(i,j) = 0$ otherwise (no such annotation is known)

term01 term02 term03 term04 … termN

gene01 0 0 0 0 … 0

gene02 0 1 1 0 … 1

… … … … … … …

geneM 0 0 0 0 … 0
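The construction of the annotation matrix can be sketched as follows. This is a minimal illustration under stated assumptions: the toy ontology (`parents`), the direct annotations (`direct`) and all gene and term names are hypothetical, and the true path rule is applied by marking each annotated term together with all of its ancestors.

```python
# Hypothetical toy ontology: each term maps to the list of its direct parents
parents = {
    "term01": [],
    "term02": ["term01"],
    "term03": ["term02"],
    "term04": ["term01"],
}

# Hypothetical direct annotations of each gene (invented for illustration)
direct = {
    "gene01": [],
    "gene02": ["term03"],
    "gene03": ["term04"],
}


def with_ancestors(term, parents):
    """Return the term itself plus all of its ancestors in the ontology."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


genes = sorted(direct)
terms = sorted(parents)
column = {term: j for j, term in enumerate(terms)}

# A[i][j] = 1 if gene i is annotated to term j or to any descendant of j
A = [[0] * len(terms) for _ in genes]
for i, gene in enumerate(genes):
    for term in direct[gene]:
        for covered in with_ancestors(term, parents):
            A[i][column[covered]] = 1

for gene, row in zip(genes, A):
    print(gene, row)
```

Here gene02 is directly annotated only to term03, yet its row also has ones for term02 and term01, because term03 is a descendant of both.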


Page 29: Integration of Bioinformatics Web Services through the Search Computing Technology


Latent Semantic Indexing: Singular Value Decomposition – SVD

Compute SVD:

$$A = U \Sigma V^T$$

Compute reduced rank approximation:

$$A_k = U_k \Sigma_k V_k^T$$

• An annotation prediction is performed by computing a reduced rank approximation $A_k$ of the annotation matrix $A$ (where $0 < k < r$, with $r$ the number of non-zero singular values of $A$, i.e. the rank of $A$)
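As a minimal sketch of the reduced rank approximation, assuming NumPy is available, the truncation can be computed with `numpy.linalg.svd`. The toy matrix and the choice k = 2 are hypothetical and serve only to illustrate the formula; the slides do not state how k was selected.

```python
import numpy as np

# Hypothetical toy annotation matrix: 4 genes (rows) x 5 terms (columns)
A = np.array(
    [
        [1, 1, 0, 0, 0],
        [1, 1, 1, 0, 0],
        [0, 0, 0, 1, 1],
        [0, 0, 1, 1, 1],
    ],
    dtype=float,
)

# Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # truncation level; must satisfy 0 < k < rank(A), which is 3 here
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Rank-k approximation A_k = U_k Sigma_k V_k^T
A_k = U_k @ np.diag(s_k) @ Vt_k
print(np.round(A_k, 2))
```

Entries of `A_k` are real-valued rather than binary. One possible reading, not confirmed by the slides, is that a sufficiently large value in a position where `A` holds 0 suggests a candidate annotation.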

Page 30: Integration of Bioinformatics Web Services through the Search Computing Technology


Latent Semantic Indexing: Singular Value Decomposition – SVD

Compute reduced rank approximation:

$$A_k = U_k \Sigma_k V_k^T$$

• $A$ : genes – features matrix

• $U_k$ : gene vectors matrix

• $\Sigma_k$ : singular value matrix

• $V_k^T$ : feature vectors matrix

Page 31: Integration of Bioinformatics Web Services through the Search Computing Technology


Latent Semantic Indexing: Singular Value Decomposition – SVD

• $U_k$ : gene vectors matrix

• $\Sigma_k$ : singular value matrix

• $V_k^T$ : feature vectors matrix

• These matrices can be used to measure the distances between objects (genes or features) in the k-dimensional space.

• For example, it is possible to compute the distance between two gene vectors to understand their level of similarity. The same can be done for features.

Page 32: Integration of Bioinformatics Web Services through the Search Computing Technology


Latent Semantic Indexing: Singular Value Decomposition – SVD

• $U_k$ : gene vectors matrix

• $\Sigma_k$ : singular value matrix

• $V_k^T$ : feature vectors matrix

• For our implementation of the LSI, we chose to compute the cosine similarity as the measure of the semantic similarity between genes.
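The cosine-based ranking can be sketched as follows. This is an illustration, not the SemSim code: the gene vectors are hypothetical two-dimensional stand-ins for rows of the gene vectors matrix (whether they are scaled by the singular values is an assumption left open here), and the helper names `cosine` and `rank_by_similarity` are invented.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, vectors):
    """Rank every other gene by decreasing cosine similarity to the query gene."""
    scores = [
        (gene, cosine(vectors[query], vector))
        for gene, vector in vectors.items()
        if gene != query
    ]
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical gene vectors in a k = 2 dimensional space (invented values)
gene_vectors = {
    "geneX": [1.0, 0.0],
    "geneA": [2.0, 0.0],
    "geneB": [1.0, 1.0],
    "geneC": [0.0, 3.0],
}

ranked = rank_by_similarity("geneX", gene_vectors)
for gene, score in ranked:
    print(gene, round(score, 3))
```

Cosine similarity depends only on the angle between the vectors, so geneA, which points in the same direction as geneX but is twice as long, still obtains the maximum score.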

Page 33: Integration of Bioinformatics Web Services through the Search Computing Technology


• A preprocessing program computes the Singular Value Decomposition (SVD)

• It writes the matrices ($U_k$, $\Sigma_k$, $V_k^T$) to three different files

• These files are placed in the \data\ directory of the SemSim REST web application

• SemSim (JSP + Java) computes the Latent Semantic Indexing (LSI) measures and returns the ranked list of genes

Minor Research Project

Page 34: Integration of Bioinformatics Web Services through the Search Computing Technology


• Developed with REST technology

• Integrated into Bio-SeCo as an external service, through a wrapper

• Input: gene (ID, name, taxonomy)

Minor Research Project

Page 35: Integration of Bioinformatics Web Services through the Search Computing Technology


• Output: list of genes ranked by their semantic similarity to the input gene

Minor Research Project

Page 36: Integration of Bioinformatics Web Services through the Search Computing Technology


• It is now possible to answer many other biological questions. For example:

Among the proteins that are encoded by genes, in the chicken organism, with higher functional semantic similarity to gene X, which are those with higher sequence similarity to protein Y ?

Minor Research Project

Figure: chain of services from Input to Output: SemSim, ProteinByGene, Sequence Alignment

Page 37: Integration of Bioinformatics Web Services through the Search Computing Technology


• It is now possible to answer many other biological questions that involve Gene Semantic Similarity computation, Genetic Disorders or Metabolic Pathways. For example:

Among the proteins that are encoded by genes, in the chicken organism, with higher functional semantic similarity to gene X, which are those with higher sequence similarity to protein Y ?

Minor Research Project

Figure: chain of services from Input to Output: SemSim, ProteinByGene, Sequence Alignment

DEMO
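The example question can be read as a chain of three services. The sketch below mimics that chain with in-memory stand-ins: the dictionaries `SEMSIM`, `PROTEIN_BY_GENE` and `ALIGNMENT_TO_Y`, the function `answer` and all identifiers and scores are hypothetical, and the real services are invoked through Bio-SeCo rather than as local lookups.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for SemSim: genes ranked by semantic similarity to the key
SEMSIM: Dict[str, List[Tuple[str, float]]] = {
    "geneX": [("gene1", 0.95), ("gene2", 0.80), ("gene3", 0.10)],
}

# Hypothetical stand-in for ProteinByGene: proteins encoded by each gene
PROTEIN_BY_GENE: Dict[str, List[str]] = {
    "gene1": ["protA"],
    "gene2": ["protB", "protC"],
    "gene3": ["protD"],
}

# Hypothetical stand-in for Sequence Alignment: similarity of each protein to protein Y
ALIGNMENT_TO_Y: Dict[str, float] = {
    "protA": 0.30,
    "protB": 0.90,
    "protC": 0.60,
    "protD": 0.99,
}


def answer(gene: str, top_genes: int) -> List[Tuple[str, float]]:
    """Chain the three stand-in services and rank proteins by alignment score."""
    # Step 1: keep the genes most semantically similar to the input gene
    similar = SEMSIM[gene][:top_genes]
    # Step 2: collect the proteins encoded by those genes
    proteins = [p for g, _ in similar for p in PROTEIN_BY_GENE[g]]
    # Step 3: rank the proteins by their sequence similarity to protein Y
    scored = [(p, ALIGNMENT_TO_Y[p]) for p in proteins]
    return sorted(scored, key=lambda pair: -pair[1])


result = answer("geneX", top_genes=2)
for protein, score in result:
    print(protein, score)
```

In this sketch protD has the best alignment score overall but is excluded, because its gene is not among the two most semantically similar genes. This shows how the first ranked service restricts the candidates seen by the later ones.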

Page 38: Integration of Bioinformatics Web Services through the Search Computing Technology


Thanks for your attention