Research Article Gene-Disease Interaction Retrieval from Multiple...

Research ArticleGene-Disease Interaction Retrieval from Multiple SourcesA Network Based Method

Lan Huang12 Ye Wang1 Yan Wang12 and Tian Bai12

1College of Computer Science and Technology Jilin University Changchun 130012 China2Key Laboratory of Symbolic Computation and Knowledge Engineering Jilin University Ministry of EducationChangchun 130012 China

Correspondence should be addressed to Tian Bai baitianjlueducn

Received 4 February 2016 Revised 10 May 2016 Accepted 14 June 2016

Academic Editor Daniele DrsquoAgostino

Copyright copy 2016 Lan Huang et al This is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

The number of gene-related databases has been growing largely along with the research on genes of bioinformaticsThose databasesare filledwith various gene functions pathways interactions and so forth whilemuch biomedical knowledge about humandiseasesis stored as text in all kinds of literatures Researchers have developed many methods to extract structured biomedical knowledgeSome study and improve text mining algorithms to achieve efficiency in order to cover as many data sources as possible while somebuild open source database to accept individual submissions in order to achieve accuracy This paper combines both efforts andbiomedical ontologies to build an interaction network of multiple biomedical ontologies which guarantees its robustness as wellas its wide coverage of biomedical publications Upon the network we accomplish an algorithm which discovers paths betweenconcept pairs and shows potential relations

1 Introduction

Gene data of many kinds has been increasing sharply sincethe gene sequencing technology has been developed andimproved over decades Each gene database is built to storeandmanage a certain kind of gene featuresWhen researchersneed to get multiple features of some genes they need to gothrough many gene databases which is quite inefficient [1ndash5] Besides those gene databases usually cover genes of manyspecies from bacteria to animals [6] Some gene featuresextracted from those databases are not related to humanor human diseases thus there are a lot of noises fromgene databases while extracting gene features for biomedicalresearch [7] As in this paper we intend to implementa human gene relation network based on similarity withextension of human diseases for biomedical researchers Andas it is pointed out above there are mainly two obstacles toachieve our goal

(i) The distributed storage and management of gene fea-tures in various databases make it inefficient to gatherall gene-related features Researchers need to query

many gene databases to get enough information onrequired genes

(ii) Most of gene databases contain multiple gene infor-mation from various species Most commonly lots ofgene databases include gene information from yeastand mice Some of the gene information from thiskind of databases is difficult in building connectionsto biological process of human or human diseasesSuch kind of nonrelated information may be confus-ing for biomedical researchers

In order to implement a gene-related network we gatherseveral gene features from gene databases Gene symbolsare widely acknowledged and used in most literatures onbiomedical research to indicate genes We adopt gene sym-bols as fundamental frame for gene relation network forit is extensively used and is robust Upon this frame allcollected gene features are put in vectors and are calculatedby similarity algorithm which is defined in next chapter togenerate a similarity matrix of genes The prototype of generelation network is extracted with a threshold value from thematrix

Hindawi Publishing CorporationBioMed Research InternationalVolume 2016 Article ID 3594517 9 pageshttpdxdoiorg10115520163594517

2 BioMed Research International

Genetic association database

Gene relation network Disease ontology

Potential relations of gene-disease

HLA-AHGF

Gene symbols

LeukemiaPterygium

Disease terms

HLA-A | Leukemia | middot middot middot

middot middot middot middot middot middot

Gene symbol | diseaseGene | disease name

middot middot middot | middot middot middot

middot middot middot | middot middot middot | middot middot middot

Gene | disease name | middot middot middot

Figure 1 Extended gene-disease network

There are many gene-disease relations that have beenverified in biomedical literatures [14ndash22] Those literatureshave been digitized over years and researchers have devel-oped ways to extract relation pairs such as gene relationpairs from the literatures The approaches that have beendeveloped so far can be divided into two categories in generalcooccurrence algorithms and expert inspection Those twocategories of approaches to extract relations from literaturesboth have advantages and drawbacks

(i) Cooccurrence algorithms are relatively high in effi-ciency and can process the literature databaseslike Medline with no or little supervision fromresearchers Therefore this category of approachescan process a lot of literatures and newly publishedarticles to get massive and up-to-date results Besidesit can also improve itself by adopting latest cooccur-rence algorithms

(ii) Expert inspection on the other hand surpassescooccurrence algorithms in accuracy Since relationpairs in biomedical demand high precision it ishard to replace expert inspection with cooccurrencealgorithms now Expert inspection is realized by agroup of specialists or a domanial community It isconsiderably low in efficiency coverage of literaturesand self-improvement It takes time for a group ofspecialists or a domanial community to accept newknowledge so the relation pairs extracted this way aremore conservative

In order to extend the gene relation network to humandiseases and not to obtain the drawbacks of either categoryof relation pair extraction approaches we adopt a two-step extension method In the first step we extend generelation network to human diseases with databases createdthrough expert inspection approach in order to ensure theaccuracy of gene-disease relations Second step we map

disease ontology to the disease names of the databases that areused The second step makes it possible to append relationsthrough cooccurrence approaches or to implement gene-disease relation retrieval

So we implement a broader retrieval of human gene-disease relations by building an extended human gene-disease network which is realized by combining gene rela-tions disease ontology and genetic association database Anyinteraction of genes and diseases within limited paths mayindicate an uncovered relation in biomedical research and thenetwork can be a useful method to instruct relation discoveryof human gene-disease

The network in Figure 1 is the main method to overcomethe drawbacks of current databases Genetic associationdatabase (GAD) provides solid relations between humangene and disease while gene relation network and diseaseontology function as extensions to implement broader cov-erage of genes and diseases Those pairs of gene and diseasethat cannot be found in GAD may possibly be discovered inextended gene-disease network The pairs that are found inour network rather than GAD may be potential relations ofgene-disease

In this paper we first build gene relation network andcombine GAD with gene relation network and disease ontol-ogy to form extended gene-disease network Then we applyan extended retrieval algorithm on this network and verifythe validity of this method

2 Materials and Methods

21 Gene Databases Disease and Gene-Disease RelationDatabase The fact that one gene relates to another gene orthat two genes are similarmeans either that one gene interactswith another gene or that both genes share the same pathwaygene function biological process and so forth So multiplegene databases that contain such knowledge are needed tobuild gene similarity We collect those features from dbSNP

BioMed Research International 3

Table 1 Gene relation databases

Database name Relation type URLdbSNP ID conversion httpwwwncbinlmnihgovSNPDrugBank ID conversion httpwwwdrugbankcaEnsembl ID conversion httpasiaensemblorgindexhtmlredirect=noKEGG Pathway httpwwwgenomejpkeggGenBank Gene-protein httpwwwncbinlmnihgovgenbankNCIPID Pathway httpspidncinihgovReactome Pathway httpwwwreactomeorgSTRING Interaction httpstring-dborgUniGene ID conversion httpwwwncbinlmnihgovunigeneGO ID conversion httpgeneontologyorg

[23] DrugBank [24] Ensembl [25] KEGG [26] GenBank[27] NCIPID [28] Reactome [29] STRING [30] UniGene[31] and Gene Ontology [32] as in Table 1 We use bioDBnetto convert all gene-related ID from database above to genesymbol of HUGO gene nomenclature committee

Gene symbols are introduced as frame to present genesand to manage gene relations Since gene symbols are com-monly used in biomedical literatures and databases there isno need to convert relations from databasesmentioned aboveto gene symbol which avoids precision loss and ambiguous-ness Therefore the information of those databases can befully and precisely imported We adopt the whole set of genesymbol to ensure a wide coverage

Disease names have several widely used thesauri inbiomedical literatures We adopt disease ontology as thefundamental thesaurus for disease names andMeSH UMLSSNOMED-CT and ICD10 as supplement in Table 2

Many disease ontology terms have references to the otherthesauri and it is possible to link the other four thesaurito disease ontology In this way we extend the coverage ofdisease names and maintain the tree structure of ontology

We introduce genetic association database (GADTable 3)[33] as gene-disease relation databaseThe GAD is a databasethat displays an archive of the results of genetic associationstudies of complex human diseases and disorders Thisdatabase has been retired in 2014 but there are over a hundredthousand lines in it In this paper we apply this database tobuild links between genes and human diseases

22 Gene Similarity and Extended Gene Network Genesinteract with each other in many ways and there are fewdatabases that contain gene interactions of all kinds We usegene relations to extend existing gene-disease relations so theinteraction types of genes are not important in our networkand all types of interactions are introduced to build generelation network equally

Before building a gene network we need to calculate thesimilarity of genes First we extract gene-related informationfrom the 10 databases mentioned above and create genevector as follows

⟨Gene Symbol (gene features list from dbSNP) (gene features list from GO)⟩

Table 2 Disease name thesauri

Thesaurus name URLDisease ontology [8] httpdisease-ontologyorgMeSH [9] httpswwwnlmnihgovmeshUMLS [10] httpswwwnlmnihgovresearchumls

SNOMED-CT [11] httpswwwnlmnihgovresearchumlsSnomedsnomed mainhtml

ICD10 [12] httpwwwwhointclassificationsicden

Table 3 Genetic association database (part)

Disease Dis class GeneLeukemia CANCER HLA-AAlzheimerrsquos disease NEUROLOGICAL HFEThalassemia HEMATOLOGICAL HBA1Emphysema CARDIOVASCULAR GSTT1PAH metabolites urinary METABOLIC GSTT1

Each gene has a corresponding gene vector and we needto compare each pair of genes and calculate the similarityvalue according to the equation below

119878 (119892119904 119892119905) =

119899

sum119898=1

((1198711cap 1198712) minus (119871

1cup 1198712) 2

(1198711cup 1198712) 2

times 120572119898)

119899

sum119898=1

120572119898

= 1

(1)

where 119878(119892119904 119892119905) is the similarity value of gene

119904and gene

119905 119871

is the list of features from gene relation databases 120572119898is the

adjustment parameter for each database and the default valuefor each 120572

119898is 1119899 and 119899 is the number of databases

In this way we get the similarity value of each gene pairsand the value is from minus1 to 1 The greater the value is themore similar the genes are and according to the feature weselected genes with higher similarity values aremore likely toappear in the same gene functions cellular components andgene pathways We build up the similarity matrix of geneswhere the horizontal axis and vertical axis are both lists of


genes the value in the matrix is the similarity value of thecorresponding genes on axes Consider

Sig (119899) =119890| sum119898 =119899119878(119892119898119892119899)|

119873 minus 1sum119898 =119899

1003816100381610038161003816119878 (119892119898 119892119899)1003816100381610038161003816

119878 (119892119898 119892119899) (2)

where Sig(119899) is the importance of a node and 119873 is thenumber of nodes Since the similarity value is from minus1 to 1we consider that genes with values below 0 are not similarand those whose values above 0 are similar We first sum upthe sign value of one gene with all other genes to determinethe similarity of this gene within all genes As we considerbefore the total sign represents the similarity of this geneIf the absolute value of similarity is close to 1 it meansthat those genes are either very much alike or are not alikeat all Those values describe the character of gene we useexponential function to enhance the valueThose gene with ahigh absolute value has a strong connectionwithin orwithoutall other genes which means this gene is more significantthan others The greater the value is the more important thenode is in the network It helps us to find important nodesand to determine a threshold value for the similarity matrix

Since minus1 means that the gene pair has nothing in com-mon we need to eliminate such gene pairs The thresholdvalue of similarity is 0 by default and can vary from minus1 to 1 tocontrol the credibility of similarity of gene pairs Those genepairswith similarity value greater than threshold are activatedin the gene relation network

23 Broader Human Gene-Disease Relations with Extension ofGenes and Diseases We use genetic association database tobuild links between genes and diseases The gene thesaurusof GAD is gene symbol the same as the thesaurus of generelation network so the GAD can be directly linked to thegene relation network Through the internal links of GADgene relation network is extended to gene-disease relationnetwork

However the disease names of GAD cannot be linked todisease ontology directly So we need to develop method tolink GAD to disease ontology

First we extract disease names from genetic associationdatabase and match them to disease names in disease ontol-ogy As we can see only about one-third of the disease namescan be matched to disease ontology Since GAD is a relativelysmall database of disease-gene relation this matching rateis unacceptable because too much useful information orknowledge is wasted Therefore we introduce several widelyused disease thesauruses to improve the mapping result

Almost every term in disease ontology has one or morex-refs or synonyms Synonyms barely improve the mappingrate because the synonyms are too similar to the diseaseontology name and not every disease in DO has one Sinceall data in GAD is extracted from biomedical literatures andMeSH is a standard thesaurus in this field and is amongthe x-refs in disease ontology it is reasonable to introduceMeSH and some other thesauri to extend the vocabulary ofdisease ontologyThose disease names that cannot bemappeddirectly to disease ontology now can bematched inMeSHandmapped indirectly to disease ontology through x-refs In this

Table 4 Percentage of matched disease names

Method Direct match Half ambiguous With x-ref AmbiguousPercentage 10 45 65 96

Table 5 Examples of four methods

Disease namefrom GAD

Disease name fromthesaurus

Direct match Leukemia Leukemia

Half ambiguous Diabetes Type 2 Type 2 DiabetesMellitus

With x-refs Sleep Disorders Sleep Disorder

Ambiguous TesticularNeoplasms Testicular Disease

Table 6 Nodes and edges of the network

Edge Number of edges Number of nodesGene-gene 207051075 18998Disease-disease 6932 6588Gene-disease 121309 25586

way over half of the disease names from GAD can be linkedto disease ontology but the rate is still too low

In order to find out the reason why so many diseasenames from GAD cannot be mapped to disease ontologywe manually check the disease names by randomly selectingsome of the names and then scanning multiple diseasethesauruses for the selected namesMost of the disease namesdo not come up in those thesauruses However some diseaseswith similar names are in the result list Such kinds of diseasesare not the same by name but are the same type of diseasesSo we consider such kind of disease names as the child nodesof a disease with the largest number of similar names In thisway over 96 of the disease names have at least one link todisease ontology as in Table 4

The rest of the names that cannot be linked to diseaseontology are initials for diseases medical test or otherdisease related factors that are not diseases We build a chartseparately for these names Examples of those 4 methods arein Table 5

24 Gene-Disease Network As in Table 6 there are 3 kindsof edges which represent 3 kinds of relations The gene-gene relations are built based on the similarity of genes Theyare calculated within pathways interaction and biologicalprocess of 10 databases The disease-disease relations aremainly the ldquois-ardquo relationship of disease ontology Otherkinds of disease-disease relations are indirect link fromGADdiseases to disease ontology terms and disease related termsthat are not diseases such as symptoms The gene-diseaserelations are all internal links of GAD and are all reliable

25 Disease Name Similarity Score Function Since a greatnumber of diseases from various sources cannot be linked todisease ontology terms directly we adopt ambiguous methodand x-ref method to map more disease names to disease


ontology terms at the expanse of accuracy of the linkageAs is given in Table 5 ldquoTesticular Neoplasmsrdquo are relatedto ldquoTesticular Diseaserdquo but they are not equal Therefore asimilarity score function is developed for disease names togive a computable confidence of two disease names

We take ldquoDiabetes Type 2rdquo and ldquoType 2 Diabetes Mel-litusrdquo ldquoTesticular Neoplasmsrdquo and ldquoTesticular Diseaserdquo asexamples to demonstrate how the function scores

First disease names are divided into individual words Ifthe disease names have only one word ignore this step Ifthere is a comma in the name put the words after the commain front of the word before comma For example ldquoDiabetesType 2rdquo is divided into three words ldquoDiabetesrdquo ldquoTyperdquo andldquo2rdquo Then switch ldquoDiabetesrdquo and ldquoTyperdquo and ldquo2rdquo Finally weget a list of words ldquoTyperdquo ldquo2rdquo and ldquoDiabetesrdquo

Second each word in the word list gets an initial valueand the total value of the words in a list equals 1 Afterobserving some disease names we consider the word at theend of the list to bemore important than theword at the top ofthe list for ldquoDiabetesrdquo is moremeaningful than ldquoTyperdquoThenif there is only one word in the list its initial value equals 1Otherwise the initial value of the former word equals half ofthe initial value of the latter word in the list unless the formerword is the first word in the list In this case the initial valueof the first word in the list equals the second For word listldquoTyperdquo ldquo2rdquo ldquoDiabetesrdquo and ldquoMellitusrdquo the initial values foreach word are 0125 0125 025 and 05 respectively

Below is the similarity score function

SC (dis119898 dis119899) =

length(119898)

sum119904=1

IV (119904) 120593 times

length(119899)

sum119905=1

IV (119905) 120596 (3)

where SC(dis119898 dis119899) is the score of disease names119898 and 119899 120593

and 120596 are valid value for 119898 and 119899 and IV is the initial valueIf a word in list of disease119898 is found in list of disease 119899 thenits 120593 equals 1 otherwise it equals 0 Similarly if a word in listof disease 119899 is found in list of disease 119898 then its 120596 equals 1otherwise it equals 0

For disease names mapped to disease ontology directlyor through x-ref they have a score of 1 For disease namesmapped with ambiguous method they get value fromsimilarity function For example ldquoDiabetes Type 2rdquo andldquoType 2 Diabetes Mellitusrdquo have a similarity score of 05while ldquoTesticular Neoplasmsrdquo and ldquoTesticular Diseaserdquo havea similarity score of 025

26 Confidence Function for Disease-Gene Interactions Sincewe have defined similarity score for disease-disease and gene-gene each disease-gene interactions now can be valued

The confidence function is

DGI (119889 119892) = prod119904isin119863

SC (119889 119904)prod119905isin119866

119878 (119892 119905) (4)

where DGI(119889 119892) is confidence value of disease 119889 and gene 119892119863 is the set of all diseases and 119866 is the set of all genes

If there is a path between disease 119889 and gene 119892 thenDGI(119889 119892) = 0 should always be true However 119878(119892 119905) canbe zero even there is a path so the value of gene similarity

needs to be optimized A path means a connection betweentwo terms either directly or indirectly

The optimized function is

DGI1015840 (119889 119892) = SC1015840 (119889 119904) 1198781015840 (119892 119905) (5)

SC1015840 (119889) = 10prod119904isin119863(SC(119889119904))

(6)

1198781015840

(119892) = 10prod119905isin119866((119878(119892119905)+1)2)

(7)

DGI1015840(119889 119892) is over 0 as long as there is a path between geneand disease The value of DGI1015840(119889 119892) sub (0 100) where 100means that the interaction exists and has been verified Thisis because when (5) equals 100 it means that the exponentsof (6) and (7) are both 1 whichmeans by definition that thereare direct connections

27 Expanded Retrieval of Human Genes and Diseases Thereare over 100000 edges in the gene-disease relation networkEach pair of gene and disease is theoretically connectedWe develop an algorithm to get pathways between gene anddisease under certain conditions

We consider (term 1 term 2) as a set of two termsand ⟨term 1 term 2⟩ denotes that they are connected by anedge Each term has a value that equals the number of edgesthat connected to it The input term pair from users shouldbe disease and gene and the term pairs generated in thealgorithm can be diseases genes or disease and gene Eachterm is a node in the network Consider

119882(119899) = 119890(119864119899

119900minus119864119899

119894)

times 119864119899

119900times 120576 (8)

119882(119899) is the score for each node 119864 is the edges of a termand those edges that connect to input terms are in-edges (119864

119894)

and others are out-edges (119864119900) 120576 is a random number from 05

to 1 to provide flexibility and if this node is already in the listit equals zero This equation provides the heuristic functionof the following algorithm

Path detection algorithm first we input a pair ofterms (term 1 term 2) and maximum length of the path(maxlength) Then we search from both terms and recordthe path length of the expended layers When the sum ofboth path length exceed maxlength the algorithm stopsOtherwise for one term all of its neighbors are found andsorted by their value and then added to the end of the listof this term If there is a duplicated term in the list changethe value of the node to zero If the intersection of lists ofboth terms is empty get the first nonzero term from bothlists delete the headmost terms and extend them to theirneighbors and check again If the intersection is not emptyreturn the shared term and its path towards the input termsThe procedures of the algorithm are in Figure 2

3 Experiment and Results

31 Experiment Design The experiment is designed as inFigure 3We collect PubMed articles and extract gene-diseasepairs SinceGADceases to update in 2014 those relation pairsextracted from PubMed articles in 2015 are very likely notin our gene-disease network We apply our path detection


Term pairs

Term 1

Expand term 1 andget a list of terms

Rank term list ofterm 1

Term listof term 2

Term listof term 1

Term 2



End

Both lists havecommon term(s)or reach full steps

NoNo

Yes

Figure 2 Workflow of algorithm

Disease

Juvenile polyposis syndrome

Gene

SMAD4Interaction

Autosomal dominant diseaseBeta thalassemiaBeta thalassemia

ITGA8KLF1

GATA1No interaction


Expanded gene-disease network


ITGA8KLF1

GATA1Interaction

Disease ontology Gene relation network

Figure 3 New paths for GAD pairs in expanded gene-disease network [34ndash36]


Table 7 Distribution of test results

Results In GAD Only in expandednetwork All

Percentage of total 32 822 854Count 5120 130918 159259

Table 8 Alternative paths comparison of pairs in GAD

Results Without alternative paths With alternative pathsPercentage 715 284Count 3663 1457

algorithm to find paths of gene-disease pairs from PubMedarticles in 2015 and test how many of them can be found inour network

We collect all articles from October 2015 to November2015 in PubMed and extract gene-disease pairs from thesearticlesWe obtain 172686 pieces of articles and extract 159259pairs of gene-disease We then apply our algorithm on bothGAD and our expanded gene-disease network The resultscan be divided into three categories as shown in Table 7

32 Results Analysis The results can be put in 3 categoriesThe first kind of gene-disease pairs can be found directly inGADThe second kind cannot be found directly in GAD butthere are paths to connect them in our network The thirdkind can be found directly in GAD and there are other pathsto connect them in our network

There is a small number of gene-disease pairs that can befound in GAD directly Since there is no relation attribute ofedges in our network or GAD these gene-disease pairs canrepresent either new relations or existing ones The reasonof the low percentage is that those gene-disease pairs areextracted from most recent literatures in PubMed while theGAD has ceased to update It is reasonable that GAD cannotcover most new pairs Most of these pairs are extracted fromliteratures which explain new relations ormake comparisons

From Table 7 we can see that over 45 pairs of gene-disease can be found in our network alone The expandednetworks of genes and diseases contribute the most in dis-covering this high percentage of gene-disease pairs Diseaseontology terms and other reference medical thesauri unifydisease names and allow ambiguous match of disease namesGene relation network provides similarities between genesIts connectivity is controlled by the threshold value of sim-ilarity In this experiment we set an intermediate thresholdvalue for gene relation network and allow ambiguous matchof disease name So the result is relatively high in numberand the relations between the gene-disease pairs are relativelyweak It is possible to control the parameters of the expandedgene-disease network and get a looser or tighter result Theparameters can help the network to fit in different needs ofretrieval

There are new paths for gene-disease pairs in expandednetwork that can be found inGAD FromTable 8many gene-disease pairs do not have alternative paths because manydiseases and related genes have already formed a small group

Table 9 Results of alpha thalassemia and beta thalassemia in GAD

Disease Related genesAlphathalassemia Hb HP UGT1A1 and ABO

Betathalassemia

HBG2 PROCR NOS1 NOS2 HBG1 F2 F5COL1A1 HAMP VDR TNF SERPINE1 AHSPITGB3 APOE APOB LARGE HBBP1 HLA-CGSTT1 GSTM1 FGB F13A1 ESR1 ACE andCOL1A1

Alpha and betathalassemia

MYB MTHFR HBA HBS1L HBB HBA2HBA1 G6PD HBB BCL11A UGT1A1 and HFE

Table 10 Examples of gene-disease pairs found in GAD

Gene DiseaseTHBS1 Prostate cancerTH Mental disorderTH Mood disorderTH Borderline personality disorderMB Breast cancerTNF LeptospirosisPC Colorectal cancer

in the network and they have reached their high connectivityHowever those that have new paths are more intriguingbecause they may potentially represent new relations thathave yet been proved This indicates that even though GADhas no relations of those pairs our expanded network still cangive paths to connect them Tables 9 and 10 are examples ofintermediate steps of the experiment

Poon et al [37] developed a method to extract medicalrelated knowledge from PubMed and demonstrate it onweb It accomplished a simple reasoning function to showpotential term pairs that may interact through a third termThere are resemblances between their work and ours butthose two works have different goals and different method tofulfill it Comparisons are in Table 11

4 Conclusion

We implement a gene-disease relation network based ongene symbols and disease ontology The network containsthree kinds of relations and all terms are linked to otherswithin unlimited length of path By controlling the lengthpath between two terms we can discover reasonable pathwaybetween terms which have never been brought up togetherWe test latest research outcome in our network and some ofthem are found in the network even before the articles So thepotential relations in the network can be used to inspire otherresearchers

5 Future Work

We accomplished score algorithms and retrieval algorithmin this paper on sample set of genes and diseases frommultiple sources In future we will apply our methods oncomplete sources of genes and diseases and import other


Table 11 Comparisons of Literome DisGeNet and expanded gene-disease network

Literome DisGeNet [13] Expanded gene-disease network

Sources PubMed GAD CTD and other 12 more GAD PubMed and 14 moredatabases

PubMed articles linkage Yes No PartialGene-disease interactions Yes Yes YesInteraction path Partial No YesPath length 3 1 Can be assigned

Demonstration PMID and marked texts List of triplet Disease names and gene names inpath

Database update Yes Yes YesSupported medical termsretrieval

Genic interactionsgenotype-phenotype Disease and gene Gene and disease interactions

medical related terms like phenotype and provide a fullyfunctional platform as service for users Meanwhile theretrieval algorithm will be improved to ensure efficiency onhuge data

Competing Interests

The authors declare that there are no competing interestsregarding the publication of this paper

Acknowledgments

This work is supported in part by the National NaturalScience Foundation of China (nos 61300147 6147215961572227) China Postdoctoral Science Foundation(2014M561293) Science and Technology Planning Projectof Jilin Province China (2014N143 20150520064JH20130101179JC-03) the Science and Technology Program ofChangchun (no 14GH014) and Graduate Innovation Fundof Jilin University

References

[1] M S Cline M Smoot E Cerami et al ldquoIntegration ofbiological networks and gene expression data using CytoscaperdquoNature Protocols vol 2 no 10 pp 2366ndash2382 2007

[2] P D Stenson E V Ball M Mort et al ldquoHuman gene mutationdatabase (HGMD) 2003 updaterdquoHumanMutation vol 21 no6 pp 577ndash581 2003

[3] B Smith M Ashburner C Rosse et al ldquoThe OBO Foundrycoordinated evolution of ontologies to support biomedical dataintegrationrdquo Nature Biotechnology vol 25 no 11 pp 1251ndash12552007

[4] L Y Geer A Marchler-Bauer R C Geer et al ldquoThe NCBIbiosystems databaserdquo Nucleic Acids Research vol 38 supple-ment 1 Article ID gkp858 pp D492ndashD496 2009

[5] DMaglott J Ostell K D Pruitt and T Tatusova ldquoEntrez Genegene-centered information at NCBIrdquo Nucleic Acids Researchvol 35 supplement 1 pp D26ndashD31 2007

[6] B T Sherman D W Huang Q Tan et al ldquoDAVID Knowl-edgebase a gene-centered database integrating heterogeneousgene annotation resources to facilitate high-throughput gene

functional analysisrdquo BMC Bioinformatics vol 8 article 4262007

[7] N Tiffin J F Kelso A R Powell H Pan V B Bajic and WA Hide ldquoIntegration of text- and data-mining using ontologiessuccessfully selects disease gene candidatesrdquo Nucleic AcidsResearch vol 33 no 5 pp 1544ndash1552 2005

[8] L M Schriml C Arze S Nadendla et al ldquoDisease ontologya backbone for disease semantic integrationrdquo Nucleic AcidsResearch vol 40 no 1 pp D940ndashD946 2012

[9] C E Lipscomb ldquoMedical subject headings (MeSH)rdquo Bulletinof the Medical Library Association vol 88 no 3 pp 265ndash2662000

[10] O Bodenreider ldquoThe Unified Medical Language System(UMLS) integrating biomedical terminologyrdquo Nucleic AcidsResearch vol 32 supplement 1 pp D267ndashD270 2004

[11] L Bos and K Donnelly ldquoSNOMED-CT the advanced ter-minology and coding system for eHealthrdquo Studies in HealthTechnology and Informatics vol 121 pp 279ndash290 2006

[12] World Health Organization International statistical classifica-tion of diseases and health related problems (The) ICD-10 [PhDdissertation] World Health Organization Geneva Switzerland2004

[13] N Q Rosinach T Kuhn C Chichester M Dumontier andF Sanz Laura Ines Furlong DisGeNET as NanopublicationsbioRxiv Barcelona Spain 2014

[14] D Hristovski B Peterlin J A Mitchell and S M HumphreyldquoUsing literature-based discovery to identify disease candidategenesrdquo International Journal of Medical Informatics vol 74 no2ndash4 pp 289ndash298 2005

[15] M Parkes A Cortes D A vanHeel andMA Brown ldquoGeneticinsights into common pathways and complex relationshipsamong immune-mediated diseasesrdquo Nature Reviews Geneticsvol 14 no 9 pp 661ndash673 2013

[16] I Lee U M Blom P I Wang J E Shim and E M MarcotteldquoPrioritizing candidate disease genes by network-based boost-ing of genome-wide association datardquoGenome Research vol 21no 7 pp 1109ndash1121 2011

[17] A Bauer-MehrenM BundschusM RautschkaMAMayer FSanz and L I Furlong ldquoGene-disease network analysis revealsfunctional modules in mendelian complex and environmentaldiseasesrdquo PLoS ONE vol 6 no 6 Article ID e20284 2011

[18] X Wang N Gulbahce and H Yu ldquoNetwork-based methodsfor human disease gene predictionrdquo Briefings in FunctionalGenomics vol 10 no 5 pp 280ndash293 2011


[19] A P Davis C G Murphy R Johnson et al ldquoThe comparativetoxicogenomics database update 2013rdquo Nucleic Acids Researchvol 41 no 1 Article ID gks994 pp D1104ndashD1114 2013

[20] T Ideker and N J Krogan ldquoDifferential network biologyrdquoMolecular Systems Biology vol 8 no 1 article 565 2012

[21] Y Liu S Maxwell T Feng et al ldquoGene pathway and net-work frameworks to identify epistatic interactions of singlenucleotide polymorphisms derived from GWAS datardquo BMCSystems Biology vol 6 supplement 1 article S15 2012

[22] P Yang X-L Li J-P Mei C-K Kwoh and S-K Ng ldquoPositive-unlabeled learning for disease gene identificationrdquo Bioinformat-ics vol 28 no 20 pp 2640ndash2647 2012

[23] S T SherryM-HWardM Kholodov et al ldquodbSNP the NCBIdatabase of genetic variationrdquo Nucleic Acids Research vol 29no 1 pp 308ndash311 2001

[24] D S Wishart C Knox A C Guo et al ldquoDrugBank a compre-hensive resource for in silico drug discovery and explorationrdquoNucleic Acids Research vol 34 supplement 1 pp D668ndashD6722006

[25] T Hubbard D Barker E Birney et al ldquoThe Ensembl genomedatabase projectrdquo Nucleic Acids Research vol 30 no 1 pp 38ndash41 2002

[26] M Kanehisa and S Goto ldquoKEGG kyoto encyclopedia of genesand genomesrdquo Nucleic Acids Research vol 28 no 1 pp 27ndash302000

[27] D A Benson I Karsch-Mizrachi D J Lipman J Ostell B ARapp and D L Wheeler ldquoGenBankrdquo Nucleic Acids Researchvol 28 no 1 pp 15ndash18 2000

[28] C F Schaefer K Anthony S Krupa et al ldquoPID the pathwayinteraction databaserdquo Nucleic Acids Research vol 37 supple-ment 1 pp D674ndashD679 2009

[29] G Joshi-Tope M Gillespie I Vastrik et al ldquoReactome aknowledgebase of biological pathwaysrdquo Nucleic Acids Researchvol 33 supplement 1 pp D428ndashD432 2005

[30] C Von Mering L J Jensen B Snel et al ldquoSTRING knownand predicted protein-protein associations integrated andtransferred across organismsrdquo Nucleic Acids Research vol 33supplement 1 pp D433ndashD437 2005

[31] J U Pontius L Wagner and G D Schuler 21 UniGeneA Unified View of the Transcriptome The NCBI HandbookNational Library of Medicine (US) NCBI Bethesda Md USA2003

[32] M Ashburner C A Ball J A Blake et al ldquoGene ontology toolfor the unification of biologyrdquoNature Genetics vol 25 no 1 pp25ndash29 2000

[33] K G Becker K C Barnes T J Bright and S A Wang ldquoThegenetic association databaserdquoNature Genetics vol 36 no 5 pp431ndash432 2004

[34] S Kohl D-Y Hwang G C Dworschak et al ldquoMild recessivemutations in six Fraser syndrome-related genes cause isolatedcongenital anomalies of the kidney and urinary tractrdquo Journalof the American Society of Nephrology vol 25 no 9 pp 1917ndash1922 2014

[35] M E Paglietti S Satta M C Sollaino et al ldquoThe problemof borderline hemoglobin A2 levels in the screening for 120573-thalassemia carriers in sardiniardquo Acta Haematologica vol 135pp 193ndash199 2016

[36] J-B Arlet M Dussiot I C Moura O Hermine and G Cour-tois ldquoNovel players in120573-thalassemia dyserythropoiesis and newtherapeutic strategiesrdquo Current Opinion in Hematology vol 23no 3 pp 181ndash188 2016

[37] H Poon C Quirk C DeZiel and D Heckerman ldquoLiteromePubMed-scale genomic knowledge base in the cloudrdquo Bioinfor-matics vol 30 no 19 pp 2840ndash2842 2014

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of


Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology


Molecular Biology International

GenomicsInternational Journal of


The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014


BioinformaticsAdvances in

Marine BiologyJournal of



Signal TransductionJournal of


BioMed Research International

Evolutionary BiologyInternational Journal of



Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014


Genetics Research International


Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational



Enzyme Research



Microbiology



Gene relation network Disease ontology

Potential relations of gene-disease

HLA-AHGF

Gene symbols

LeukemiaPterygium

Disease terms

HLA-A | Leukemia | middot middot middot

middot middot middot middot middot middot

Gene symbol | diseaseGene | disease name

middot middot middot | middot middot middot

middot middot middot | middot middot middot | middot middot middot

Gene | disease name | middot middot middot

Figure 1 Extended gene-disease network

There are many gene-disease relations that have beenverified in biomedical literatures [14ndash22] Those literatureshave been digitized over years and researchers have devel-oped ways to extract relation pairs such as gene relationpairs from the literatures The approaches that have beendeveloped so far can be divided into two categories in generalcooccurrence algorithms and expert inspection Those twocategories of approaches to extract relations from literaturesboth have advantages and drawbacks

(i) Cooccurrence algorithms are relatively high in effi-ciency and can process the literature databaseslike Medline with no or little supervision fromresearchers Therefore this category of approachescan process a lot of literatures and newly publishedarticles to get massive and up-to-date results Besidesit can also improve itself by adopting latest cooccur-rence algorithms

(ii) Expert inspection on the other hand surpassescooccurrence algorithms in accuracy Since relationpairs in biomedical demand high precision it ishard to replace expert inspection with cooccurrencealgorithms now Expert inspection is realized by agroup of specialists or a domanial community It isconsiderably low in efficiency coverage of literaturesand self-improvement It takes time for a group ofspecialists or a domanial community to accept newknowledge so the relation pairs extracted this way aremore conservative

In order to extend the gene relation network to humandiseases and not to obtain the drawbacks of either categoryof relation pair extraction approaches we adopt a two-step extension method In the first step we extend generelation network to human diseases with databases createdthrough expert inspection approach in order to ensure theaccuracy of gene-disease relations Second step we map

disease ontology to the disease names of the databases that areused The second step makes it possible to append relationsthrough cooccurrence approaches or to implement gene-disease relation retrieval

So we implement a broader retrieval of human gene-disease relations by building an extended human gene-disease network which is realized by combining gene rela-tions disease ontology and genetic association database Anyinteraction of genes and diseases within limited paths mayindicate an uncovered relation in biomedical research and thenetwork can be a useful method to instruct relation discoveryof human gene-disease

The network in Figure 1 is the main method to overcomethe drawbacks of current databases Genetic associationdatabase (GAD) provides solid relations between humangene and disease while gene relation network and diseaseontology function as extensions to implement broader cov-erage of genes and diseases Those pairs of gene and diseasethat cannot be found in GAD may possibly be discovered inextended gene-disease network The pairs that are found inour network rather than GAD may be potential relations ofgene-disease

In this paper we first build gene relation network andcombine GAD with gene relation network and disease ontol-ogy to form extended gene-disease network Then we applyan extended retrieval algorithm on this network and verifythe validity of this method

2 Materials and Methods

21 Gene Databases Disease and Gene-Disease RelationDatabase The fact that one gene relates to another gene orthat two genes are similarmeans either that one gene interactswith another gene or that both genes share the same pathwaygene function biological process and so forth So multiplegene databases that contain such knowledge are needed tobuild gene similarity We collect those features from dbSNP



















119878 (119892119904 119892119905) =

119899

sum119898=1

((1198711cap 1198712) minus (119871

1cup 1198712) 2

(1198711cup 1198712) 2

times 120572119898)

119899

sum119898=1

120572119898

= 1

(1)


119904and gene

119905 119871







Sig (119899) =119890| sum119898 =119899119878(119892119898119892119899)|

119873 minus 1sum119898 =119899

1003816100381610038161003816119878 (119892119898 119892119899)1003816100381610038161003816

119878 (119892119898 119892119899) (2)





























SC (dis119898 dis119899) =

length(119898)

sum119904=1

IV (119904) 120593 times

length(119899)

sum119905=1

IV (119905) 120596 (3)






DGI (119889 119892) = prod119904isin119863

SC (119889 119904)prod119905isin119866

119878 (119892 119905) (4)





DGI1015840 (119889 119892) = SC1015840 (119889 119904) 1198781015840 (119892 119905) (5)

SC1015840 (119889) = 10prod119904isin119863(SC(119889119904))

(6)

1198781015840

(119892) = 10prod119905isin119866((119878(119892119905)+1)2)

(7)




119882(119899) = 119890(119864119899

119900minus119864119899

119894)

times 119864119899

119900times 120576 (8)


119894)







Term pairs

Term 1



Term listof term 2

Term listof term 1

Term 2



End


NoNo

Yes


Disease


Gene

SMAD4Interaction


ITGA8KLF1

GATA1No interaction




ITGA8KLF1

GATA1Interaction

















Betathalassemia








4 Conclusion


5 Future Work











Competing Interests


Acknowledgments


References















































Volume 2014

Zoology






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology



















119878 (119892119904 119892119905) =

119899

sum119898=1

((1198711cap 1198712) minus (119871

1cup 1198712) 2

(1198711cup 1198712) 2

times 120572119898)

119899

sum119898=1

120572119898

= 1

(1)


119904and gene

119905 119871







Sig (119899) =119890| sum119898 =119899119878(119892119898119892119899)|

119873 minus 1sum119898 =119899

1003816100381610038161003816119878 (119892119898 119892119899)1003816100381610038161003816

119878 (119892119898 119892119899) (2)





























SC (dis119898 dis119899) =

length(119898)

sum119904=1

IV (119904) 120593 times

length(119899)

sum119905=1

IV (119905) 120596 (3)






DGI (119889 119892) = prod119904isin119863

SC (119889 119904)prod119905isin119866

119878 (119892 119905) (4)





DGI1015840 (119889 119892) = SC1015840 (119889 119904) 1198781015840 (119892 119905) (5)

SC1015840 (119889) = 10prod119904isin119863(SC(119889119904))

(6)

1198781015840

(119892) = 10prod119905isin119866((119878(119892119905)+1)2)

(7)




119882(119899) = 119890(119864119899

119900minus119864119899

119894)

times 119864119899

119900times 120576 (8)


119894)







Term pairs

Term 1



Term listof term 2

Term listof term 1

Term 2



End


NoNo

Yes


Disease


Gene

SMAD4Interaction


ITGA8KLF1

GATA1No interaction




ITGA8KLF1

GATA1Interaction

















Betathalassemia








4 Conclusion


5 Future Work











Competing Interests


Acknowledgments


References















































Volume 2014

Zoology






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology



Sig (119899) =119890| sum119898 =119899119878(119892119898119892119899)|

119873 minus 1sum119898 =119899

1003816100381610038161003816119878 (119892119898 119892119899)1003816100381610038161003816

119878 (119892119898 119892119899) (2)





























SC (dis119898 dis119899) =

length(119898)

sum119904=1

IV (119904) 120593 times

length(119899)

sum119905=1

IV (119905) 120596 (3)






DGI (119889 119892) = prod119904isin119863

SC (119889 119904)prod119905isin119866

119878 (119892 119905) (4)





DGI1015840 (119889 119892) = SC1015840 (119889 119904) 1198781015840 (119892 119905) (5)

SC1015840 (119889) = 10prod119904isin119863(SC(119889119904))

(6)

1198781015840

(119892) = 10prod119905isin119866((119878(119892119905)+1)2)

(7)




119882(119899) = 119890(119864119899

119900minus119864119899

119894)

times 119864119899

119900times 120576 (8)


119894)







Term pairs

Term 1



Term listof term 2

Term listof term 1

Term 2



End


NoNo

Yes


Disease


Gene

SMAD4Interaction


ITGA8KLF1

GATA1No interaction




ITGA8KLF1

GATA1Interaction

















Betathalassemia








4 Conclusion


5 Future Work











Competing Interests


Acknowledgments


References















































Volume 2014

Zoology






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology







SC (dis119898 dis119899) =

length(119898)

sum119904=1

IV (119904) 120593 times

length(119899)

sum119905=1

IV (119905) 120596 (3)






DGI (119889 119892) = prod119904isin119863

SC (119889 119904)prod119905isin119866

119878 (119892 119905) (4)





DGI1015840 (119889 119892) = SC1015840 (119889 119904) 1198781015840 (119892 119905) (5)

SC1015840 (119889) = 10prod119904isin119863(SC(119889119904))

(6)

1198781015840

(119892) = 10prod119905isin119866((119878(119892119905)+1)2)

(7)




119882(119899) = 119890(119864119899

119900minus119864119899

119894)

times 119864119899

119900times 120576 (8)


119894)







Term pairs

Term 1



Term listof term 2

Term listof term 1

Term 2



End


NoNo

Yes


Disease


Gene

SMAD4Interaction


ITGA8KLF1

GATA1No interaction




ITGA8KLF1

GATA1Interaction

















Betathalassemia








4 Conclusion


5 Future Work











Competing Interests


Acknowledgments


References















































Volume 2014

Zoology






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology


Term pairs

Term 1



Term listof term 2

Term listof term 1

Term 2



End


NoNo

Yes


Disease


Gene

SMAD4Interaction


ITGA8KLF1

GATA1No interaction




ITGA8KLF1

GATA1Interaction

















Betathalassemia








4 Conclusion


5 Future Work











Competing Interests


Acknowledgments


References















































Volume 2014

Zoology






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology















Betathalassemia








4 Conclusion


5 Future Work











Competing Interests


Acknowledgments


References















































Volume 2014

Zoology






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology










Competing Interests


Acknowledgments


References















































Volume 2014

Zoology






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology




























Volume 2014

Zoology






















Advances in

Virolog y



Volume 2014




Enzyme Research



Microbiology

Research Article Gene-Disease Interaction Retrieval from Multiple...

Documents

Transcript of Research Article Gene-Disease Interaction Retrieval from Multiple...