  • The ‘A-DaGO-Fun’ Package

    Tuesday 1st September, 2015

    Type Package
    Title An adaptable Gene Ontology semantic similarity based functional analysis tool
    Version 15.1
    Contributors Gaston K. Mazandu, Emile R. Chimusa, Mamana Mbiyavanga and Nicola J. Mulder
    Maintainer Gaston K. Mazandu <gmazandu@{gmail.com, cbio.uct.ac.za}>

    Description

    The A-DaGO-Fun package is a Python library that provides a comprehensive and customized set of Gene Ontology (GO) based functional analysis tools that exploit the biological knowledge that GO offers in describing genes or groups of genes. This is achieved through the use of GO semantic similarity measures and leads to biological knowledge discovery for a set of genes or to a better understanding of the biological phenomena underlying experimental data. A-DaGO-Fun is an integrated set of four tools, or six if the three functions of item 1 are counted separately:

    1. Computing Information Content (IC), term and protein semantic similarity scores (getTermFeatures, termsim and funcsim [IT-GOM]).

    2. Identifying enriched GO terms accounting for uncertainty in an annotation dataset (gossfeat [GOSS-FEAT]).

    3. Discovering functionally related or similar genes/proteins based on their GO terms (proteinfct [GOSP-FCT]).

    4. Retrieving genes or proteins by their GO annotations for disease gene and target discovery (proteinfit [GOSP-FIT]).

    The initial version of DaGO-Fun retrieves protein GO annotations from the Gene Ontology Annotation (GOA) UniProtKB project and can only analyze proteins found in the GOA dataset. It therefore cannot handle sets of proteins or proteomes that are not yet included in this dataset, in particular sets of newly sequenced genes and their predicted annotations. This limitation, which is common to all existing tools so far, is serious considering ongoing sequencing and genome annotation projects. The new A-DaGO-Fun overcomes this issue by providing a more general tool able to analyze a newly annotated set of proteins that are not yet incorporated in the GOA dataset.

    Dependencies Python (≥ 2.7.3)
    Imports scipy, networkx and matplotlib
    License GPL
    URL https://web.cbio.uct.ac.za/ITGOM/adagofun

    Mazandu et al., 2015

  • Contents

    A-DaGO-Fun tool documented
      A-DaGO-Fun package
      getTermFeatures function
      termsim function
      funcsim function
      gossfeat function
      proteinfct function
      proteinfit function

    Appendix-1 A-DaGO-Fun Installation and Administration
      Installation
      Usage
      Examples
      Important notes

    Appendix-2 Information Content-based GO Semantic Similarity measures
      Computing term information content scores
      Different GO term semantic similarity approaches
        Resnik, Lin and Nunivers approaches
        Improving Annotation-based GO Semantic Similarity Scores
        Wang, Zhang and GO-universal approaches
      Functional similarity measures
        Term semantic-based measures: Avg, Max, BMA, BMM, ABM and HDF
        Direct term-based measures: SimGIC, SimDIC, SimUIC and Cosine

    Appendix-3 Other Existing IC-based GO semantic similarity tools
      Issues related to the choice of functional similarity measures
      Issues related to organism-based information content value
      Issue related to the use of standard protein or gene identifier systems
      Final Note on the A-DaGO-Fun Python Package

    References

  • A-DaGO-Fun package

    Adaptable Gene Ontology Semantic Similarity based Functional Analysis Tool

    Description

    A repository of Python modules for functionally analyzing protein or gene sets based on Gene Ontology annotations using Information Content-based Semantic Similarity Measures.

    Details

    GO semantic similarity measures allow the integration of the biological knowledge contained in the GO structure, and have contributed to the improvement of biological analyses. Several GO semantic similarity measures have been proposed in recent years, accompanied by the development of tools (http://neurolex.org/wiki/Category:Resource:Gene_Ontology_Tools) that facilitate effective exploration of these measures. Existing tools are often context and organism dependent and only implement semantic similarity measures shown to perform well in a specific application. This is the case even for existing GO analysis tools that do not use semantic similarity measures, such as GO term enrichment analysis tools.

    A-DaGO-Fun provides a context- and organism-independent tool for analyzing GO annotations using Information Content-based Semantic Similarity Measures. It implements 101 different functional similarity measures, as shown in Table 1. Each of the eight annotation-based and three topology-based approaches, namely Resnik, XGraSM-Resnik, Nunivers, XGraSM-Nunivers, Lin, XGraSM-Lin, Relevance, Li et al., Wang et al., Zhang et al. and GO-universal, is implemented with seven known IC-based non-direct functional similarity measures (Avg, Max, ABM, BMA, BMM, HDF and VHDF), which gives 77 measures. A-DaGO-Fun also includes the IC-based direct term functional similarity measures SimGIC, SimDIC and SimUIC, and Cosine with its two normalization schemes (SimCOU, SimCOT), for the annotation-based family and for each of the three topology-based approaches (20 measures). Finally, it includes their particular cases SimUI, Dice (SimDB) and Universal (SimUB), in which term IC values are set to 1 [3], as well as the NTO measure, the normalized term overlap [32] implemented by several tools (4 measures). Descriptions and symbols of these measures are shown in Table 1, and we refer the interested reader to Appendix 2, where the complete descriptions and algebraic forms of all these measures are provided. Note that all semantic similarity scores implemented in A-DaGO-Fun are well defined and range between 0 and 1, making different measures more understandable and comparable.

    Table 1: Symbols of different GO IC and Semantic Similarity models. The prefixes r, n, l, li, s, x, a, z, w, and u represent the approaches and stand for Resnik, Nunivers, Lin, Li, Relevance, XGraSM, Annotation-based, Zhang, Wang and GO-universal, respectively. The suffixes gic, uic, dic, cou and cot represent SimGIC, SimUIC, SimDIC, SimCOU and SimCOT measures, respectively, and ui, db, ub and nto are used for SimUI, Dice (SimDB), Universal (SimUB) and NTO measures. In cases where the prefix ‘x’ is used, it is immediately followed by the approach prefix.

    IC Model                  Term Semantic Similarity  Functional Similarity Measures
    Symbol  Description       Symbol  Description       Symbol  Description

    a       Annotation-based  r       Resnik            ravg    Avg based on Resnik
                                                        rmax    Max based on Resnik
                                                        rbma    BMA based on Resnik
                                                        rabm    ABM based on Resnik
                                                        rbmm    BMM based on Resnik
                                                        rhdf    Hausdorff based on Resnik
                                                        rvhdf   Variant Hausdorff based on Resnik
                              xr      XGraSM-Resnik     xravg   Avg based on XGraSM-Resnik
                                                        xrmax   Max based on XGraSM-Resnik
                                                        xrbma   BMA based on XGraSM-Resnik
                                                        xrabm   ABM based on XGraSM-Resnik
                                                        xrbmm   BMM based on XGraSM-Resnik
                                                        xrhdf   Hausdorff based on XGraSM-Resnik
                                                        xrvhdf  Variant Hausdorff based on XGraSM-Resnik
                              n       Nunivers          navg    Avg based on Nunivers
                                                        nmax    Max based on Nunivers
                                                        nbma    BMA based on Nunivers
                                                        nabm    ABM based on Nunivers
                                                        nbmm    BMM based on Nunivers
                                                        nhdf    Hausdorff based on Nunivers
                                                        nvhdf   Variant Hausdorff based on Nunivers
                              xn      XGraSM-Nunivers   xnavg   Avg based on XGraSM-Nunivers
                                                        xnmax   Max based on XGraSM-Nunivers
                                                        xnbma   BMA based on XGraSM-Nunivers
                                                        xnabm   ABM based on XGraSM-Nunivers
                                                        xnbmm   BMM based on XGraSM-Nunivers
                                                        xnhdf   Hausdorff based on XGraSM-Nunivers
                                                        xnvhdf  Variant Hausdorff based on XGraSM-Nunivers
                              l       Lin               lavg    Avg based on Lin
                                                        lmax    Max based on Lin
                                                        lbma    BMA based on Lin
                                                        labm    ABM based on Lin
                                                        lbmm    BMM based on Lin
                                                        lhdf    Hausdorff based on Lin
                                                        lvhdf   Variant Hausdorff based on Lin
                              xl      XGraSM-Lin        xlavg   Avg based on XGraSM-Lin
                                                        xlmax   Max based on XGraSM-Lin
                                                        xlbma   BMA based on XGraSM-Lin
                                                        xlabm   ABM based on XGraSM-Lin
                                                        xlbmm   BMM based on XGraSM-Lin
                                                        xlhdf   Hausdorff based on XGraSM-Lin
                                                        xlvhdf  Variant Hausdorff based on XGraSM-Lin
                              li      Li et al.         liavg   Avg based on Li et al.
                                                        limax   Max based on Li et al.
                                                        libma   BMA based on Li et al.
                                                        liabm   ABM based on Li et al.
                                                        libmm   BMM based on Li et al.
                                                        lihdf   Hausdorff based on Li et al.
                                                        livhdf  Variant Hausdorff based on Li et al.
                              s       Relevance         savg    Avg based on Relevance
                                                        smax    Max based on Relevance
                                                        sbma    BMA based on Relevance
                                                        sabm    ABM based on Relevance
                                                        sbmm    BMM based on Relevance
                                                        shdf    Hausdorff based on Relevance
                                                        svhdf   Variant Hausdorff based on Relevance
                              -       -                 agic    Annotation-based SimGIC
                                                        adic    Annotation-based SimDIC
                                                        auic    Annotation-based SimUIC
                                                        acou    Annotation-based SimCOU
                                                        acot    Annotation-based SimCOT

    u       GO-universal      u       GO-universal      uavg    Avg based on GO-universal
                                                        umax    Max based on GO-universal
                                                        ubma    BMA based on GO-universal
                                                        uabm    ABM based on GO-universal
                                                        ubmm    BMM based on GO-universal
                                                        uhdf    Hausdorff based on GO-universal
                                                        uvhdf   Variant Hausdorff based on GO-universal
                                                        ugic    GO-universal-based SimGIC
                                                        udic    GO-universal-based SimDIC
                                                        uuic    GO-universal-based SimUIC
                                                        ucou    GO-universal-based SimCOU
                                                        ucot    GO-universal-based SimCOT

    w       Wang et al.       w       Wang et al.       wavg    Avg based on Wang et al.
                                                        wmax    Max based on Wang et al.
                                                        wbma    BMA based on Wang et al.
                                                        wabm    ABM based on Wang et al.
                                                        wbmm    BMM based on Wang et al.
                                                        whdf    Hausdorff based on Wang et al.
                                                        wvhdf   Variant Hausdorff based on Wang et al.
                                                        wgic    Wang et al.-based SimGIC
                                                        wdic    Wang et al.-based SimDIC
                                                        wuic    Wang et al.-based SimUIC
                                                        wcou    Wang et al.-based SimCOU
                                                        wcot    Wang et al.-based SimCOT

    z       Zhang et al.      z       Zhang et al.      zavg    Avg based on Zhang et al.
                                                        zmax    Max based on Zhang et al.
                                                        zbma    BMA based on Zhang et al.
                                                        zabm    ABM based on Zhang et al.
                                                        zbmm    BMM based on Zhang et al.
                                                        zhdf    Hausdorff based on Zhang et al.
                                                        zvhdf   Variant Hausdorff based on Zhang et al.
                                                        zgic    Zhang et al.-based SimGIC
                                                        zdic    Zhang et al.-based SimDIC
                                                        zuic    Zhang et al.-based SimUIC
                                                        zcou    Zhang et al.-based SimCOU
                                                        zcot    Zhang et al.-based SimCOT

    -       -                 -       -                 ui      Union-Intersection (SimUI)
                                                        ub      Universal-based (SimUB)
                                                        db      Dice-based (SimDB)
                                                        nto     Normalized Term Overlap (NTO)
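
    The naming convention of Table 1 can be checked mechanically. The short sketch below is illustrative only and is not part of A-DaGO-Fun: it rebuilds the measure symbols from the prefixes and suffixes listed in the table and counts them. The constant and function names are assumptions made for this example.

```python
# Illustrative only: rebuild the Table 1 measure symbols from prefixes and suffixes.
ANNOTATION_APPROACHES = ["r", "xr", "n", "xn", "l", "xl", "li", "s"]
TOPOLOGY_APPROACHES = ["u", "w", "z"]
PAIRWISE_SUFFIXES = ["avg", "max", "bma", "abm", "bmm", "hdf", "vhdf"]
DIRECT_SUFFIXES = ["gic", "dic", "uic", "cou", "cot"]
OTHER_MEASURES = ["ui", "ub", "db", "nto"]


def measure_symbols():
    """Return every functional similarity symbol listed in Table 1."""
    approaches = ANNOTATION_APPROACHES + TOPOLOGY_APPROACHES
    # Eleven term approaches, each combined with seven pairwise measures.
    symbols = [a + s for a in approaches for s in PAIRWISE_SUFFIXES]
    # Direct measures exist once for the annotation-based family ("a")
    # and once for each topology-based approach.
    families = ["a"] + TOPOLOGY_APPROACHES
    symbols += [f + s for f in families for s in DIRECT_SUFFIXES]
    symbols += OTHER_MEASURES
    return symbols


if __name__ == "__main__":
    print(len(measure_symbols()))  # 77 + 20 + 4 = 101
```

    Any of these strings can be used where a measure symbol is expected, for example 'ubma' for BMA based on GO-universal.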

    Contributors

    Gaston K. Mazandu, Emile R. Chimusa, Mamana Mbiyavanga and Nicola J. Mulder
    Maintainer: Mazandu GK

    Main references

    1. Mazandu GK, Mulder NJ (2013) DaGO-Fun: Tool for Gene Ontology-based functional analysis using term information content measures. BMC Bioinformatics 14: 284.

    2. Mazandu GK, Mulder NJ (2013) Information content-based Gene Ontology semantic similarity approaches: Toward a unified framework theory. BioMed Research International 2013: Article ID 292063, 11 pages.

    3. Mazandu GK, Mulder NJ (2014) Information Content-Based Gene Ontology Functional Similarity Measures: Which One to Use for a Given Biological Data Type? PLoS ONE 9(12): e113859.

  • getTermFeatures

    Retrieving Information Content (IC) scores and other GO term features from the GO directed acyclic graph (DAG) structure.

    Description

    Given a GO ID, a list of GO IDs or a file containing a list of GO IDs, this function retrieves the characteristics of these GO terms in the GO DAG, including their IC scores under the input IC model. The default model is ‘u’ (GO-universal model), i.e., if no model is provided the GO-universal model is used.

    Usages

    (a) getTermFeatures(InputData, model = 'u', drop = 0, output = 1)

    Arguments

    InputData   A GO ID, a list of GO IDs or the name of the file containing a list of GO IDs.

    model       One of the IC models: ‘a’ for Annotation-based, ‘u’ for GO-universal, ‘w’ for Wang et al. and ‘z’ for Zhang et al. Refer to Table 1 for the symbol used for each approach.

    drop        A Boolean parameter only useful in the context of the Annotation-based approach. It is set to 1 if annotations with the Inferred from Electronic Annotation (IEA) evidence code should be removed and to 0 if all evidence codes should be considered.

    output      An Enum parameter taking values 0, 1 and 2. It is set to 1 to output results on the screen, to 0 to output results in a file and to 2 to return a Python object containing the output for possible further usage.

    See http://web.cbio.uct.ac.za/ITGOM/adagofun/PKG-INFO (Detailed description and use of main functions, section #1)

    Contributors

    Gaston K. Mazandu, Emile R. Chimusa, Mamana Mbiyavanga and Nicola J. Mulder
    Maintainer: Mazandu GK

  • termsim

    Computing GO term semantic similarity scores for a given list of GO term pairs.

    Description

    Given two GO IDs, two lists of GO IDs or a file containing a list of GO ID pairs, this function computes term semantic similarity scores. GO ID pairs in the file are separated by white space.

    Usages

    (a) termsim('GOID1', 'GOID2', ontology = 'BP', approach = 'u', drop = 0, output = 1)
    (b) termsim(GO1, GO2, ontology = 'BP', approach = 'u', drop = 0, output = 1)
    (c) termsim(GO1, ontology = 'BP', approach = 'u', drop = 0, output = 1)
    (d) termsim('FileName', ontology = 'BP', approach = 'u', drop = 0, output = 1)

    Arguments

    GOID1 The first GO ID.

    GOID2 The second GO ID.

    GO1 A list of GO IDs.

    GO2 Another list of GO IDs of the same length as GO1.

    FileName The name of the file containing the list of the GO ID pairs.

    ontology One of the GO ontologies: BP, MF and CC.

    approach One of the term semantic similarity approaches or a tuple/list of up to 4 term semantic similarity approaches. Refer to Table 1 for the symbol used for each approach.

    drop A Boolean parameter only useful in the context of the Annotation-based approach. It is set to 1 if annotations with the Inferred from Electronic Annotation (IEA) evidence code should be removed and to 0 if all evidence codes should be considered.

    output An Enum parameter taking values 0, 1 and 2. It is set to 1 to output results on the screen, to 0 to output results in a file and to 2 to return a Python object containing the output for possible further usage.

    Note concerning GO terms or lists/tuples of GO terms:

    1. Given two GO terms as arguments, the function computes the semantic similarity between these two terms.

    2. For a list or a tuple of GO IDs, the function computes similarity scores between all GO ID pairs (a, b) for a and b in the list or tuple with a ≠ b.

    3. If two lists A and B are given, then similarity scores are computed between all pairs (a_i, b_i) with a_i ∈ A and b_i ∈ B and 0 ≤ i ≤ min(len(A), len(B)) − 1.

    4. If a file of GO ID pairs is provided, the function computes similarity scores between these GO ID pairs.
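
    The pairing rules in notes 2 and 3 can be pictured with the sketch below. It is not taken from the package: the function names are hypothetical, and it assumes that each pair in note 2 is reported once (unordered), which this manual does not state explicitly.

```python
from itertools import combinations


def pairs_within(go_ids):
    """Note 2: all pairs (a, b) with a != b drawn from one list or tuple."""
    unique_ids = list(dict.fromkeys(go_ids))  # drop repeats, keep input order
    return list(combinations(unique_ids, 2))


def pairs_between(go1, go2):
    """Note 3: position-wise pairs (a_i, b_i) from two lists."""
    # zip stops at the shorter list, i.e. i runs up to min(len(A), len(B)) - 1.
    return list(zip(go1, go2))


if __name__ == "__main__":
    terms = ["GO:0008150", "GO:0009987", "GO:0008152"]
    print(pairs_within(terms))
    print(pairs_between(terms, ["GO:0009987", "GO:0008152"]))
```

    With three GO IDs in a single list, note 2 yields three pairs under this assumption. With lists of length three and two, note 3 yields two pairs.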

    See http://web.cbio.uct.ac.za/ITGOM/adagofun/PKG-INFO (Detailed description and use of main functions, section #2)

    Authors

    Gaston K. Mazandu, Emile R. Chimusa, Mamana Mbiyavanga and Nicola J. Mulder
    Maintainer: Mazandu GK

  • funcsim

    Computing functional similarity scores between proteins/genes or between two sets of GO terms.

    Description

    Given two sets of GO IDs, a dictionary with protein or gene IDs as keys and sets of GO IDs as values, or a file containing a list of protein or gene IDs with their associated GO IDs, this function computes functional similarity scores between them.

    Usages

    (a) funcsim(GO1, GO2, ontology = 'BP', measure = 'ubma', drop = 0, output = 1)
    (b) funcsim(ProtGO, TargetPairs = [ ], ontology = 'BP', measure = 'ubma', drop = 0, output = 1)
    (c) funcsim(FileName, TargetPairs = [ ], ontology = 'BP', measure = 'ubma', drop = 0, output = 1)

    Arguments

    GO1 A set of GO IDs.

    GO2 Another set of GO IDs.

    ProtGO A dictionary with protein or gene IDs as keys and sets of GO IDs as values.

    FileName The name of the file containing the list of protein or gene IDs and their associated GO IDs. This file should contain two columns separated by white space: the first column is the gene or protein ID and the second column holds the GO IDs associated with the protein or gene. These GO IDs are separated by commas.

    TargetPairs List/tuple or the name of the file containing the list/tuple of protein/gene pairs for which functional similarity scores should be computed. If not provided, scores are computed for all pairs (a, b) with a ≠ b of protein/gene IDs in ProtGO or FileName.

    ontology One of the GO ontologies: BP, MF and CC.

    measure One of the functional similarity measures or a tuple/list of up to 4 functional similarity measures. Refer to Table 1 for the symbol used for each measure.

    drop A Boolean parameter only useful in the context of the Annotation-based approach. It is set to 1 if annotations with the Inferred from Electronic Annotation (IEA) evidence code should be removed and to 0 if all evidence codes should be considered.

    output An Enum parameter taking values 0, 1 and 2. It is set to 1 to output results on the screen, to 0 to output results in a file and to 2 to return a Python object containing the output for possible further usage.

    See http://web.cbio.uct.ac.za/ITGOM/adagofun/PKG-INFO (Detailed description and use of main functions, section #3)
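
    To make the two-column file layout described for FileName concrete, here is a minimal sketch of a reader that turns such lines into a dictionary of the ProtGO shape (ID as key, set of GO IDs as value). It is written for this manual as an illustration and is not the package's own parser. The function name and the protein IDs in the example are hypothetical.

```python
def parse_annotation_lines(lines):
    """Map each protein/gene ID to the set of GO IDs listed after it."""
    annotations = {}
    for line in lines:
        # First column: ID; second column: comma-separated GO IDs.
        fields = line.split(None, 1)
        if len(fields) != 2:
            continue  # skip blank or incomplete lines
        protein_id, go_field = fields
        go_ids = set(g.strip() for g in go_field.split(",") if g.strip())
        annotations.setdefault(protein_id, set()).update(go_ids)
    return annotations


if __name__ == "__main__":
    example = [
        "PROT1 GO:0008150,GO:0009987",
        "PROT2 GO:0008152",
    ]
    print(parse_annotation_lines(example))
```

    The same layout is described for the ReferenceFile and AnnotationData arguments of the other functions, so one reader of this kind would cover all of them.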

    Author(s)

    Gaston K. Mazandu, Emile R. Chimusa, Mamana Mbiyavanga and Nicola J. Mulder
    Maintainer: Mazandu GK

  • gossfeat

    Identifying enriched GO terms taking into account uncertainty in the dataset.

    Description

    Given two files, the first (reference or background protein list) containing a list of reference or background protein or gene IDs with their associated GO IDs, and the second (target protein list) containing the list of target proteins or genes, this function retrieves the processes most pertinent to the experiment performed, based on the target set and the background provided.

    Usages

    (a) gossfeat(ReferenceFile, TargetFile, ontology = 'BP', approach = 'u', score = 0.3, pvalue = 0.05, drop = 0, output = 0)

    Arguments

    ReferenceFile The name of the file containing the list of protein or gene IDs and their associated GO IDs. This file should contain two columns separated by white space: the first column is the gene or protein ID and the second column holds the GO IDs associated with the protein or gene. GO IDs are separated by commas.

    TargetFile The name of the file containing the list of target proteins or genes.

    ontology One of the GO ontologies: BP, MF and CC.

    approach One of the term semantic similarity approaches. Refer to Table 1 for the symbols used for each approach.

    score The threshold score providing the semantic similarity degree at which terms are considered to be semantically close in the GO structure, set to 0.3 by default.

    pvalue The significance level cut-off below which an identified term is considered to be statistically significant using the hypergeometric test, set to 0.05 by default.

    drop A Boolean parameter only useful in the context of the Annotation-based approach. It is set to 1 if annotations with the Inferred from Electronic Annotation (IEA) evidence code should be removed and to 0 if all evidence codes should be considered.

    output An Enum parameter taking values 0 and 2. It is set to 0 to output results in a file and to 2 to return a Python object containing the output for possible further usage.

    See http://web.cbio.uct.ac.za/ITGOM/adagofun/PKG-INFO (Detailed description and use of main functions, section #5)
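
    The pvalue argument refers to the hypergeometric test. As a reminder of what that test computes, the sketch below evaluates the usual over-representation tail probability P(X ≥ k) in plain Python. This is the textbook formulation only. How gossfeat counts a term as present, in particular how the score threshold accounts for annotation uncertainty, is not modelled here. The function and parameter names are assumptions made for this example.

```python
def n_choose_k(n, k):
    """Binomial coefficient computed with exact integer arithmetic."""
    if k < 0 or k > n:
        return 0
    k = min(k, n - k)
    result = 1
    for i in range(1, k + 1):
        result = result * (n - k + i) // i
    return result


def hypergeom_pvalue(population, annotated, sample, hits):
    """P(X >= hits) when `sample` items are drawn without replacement.

    population: number of background proteins
    annotated:  background proteins carrying the GO term
    sample:     number of target proteins
    hits:       target proteins carrying the GO term
    """
    total = n_choose_k(population, sample)
    upper = min(annotated, sample)
    tail = sum(
        n_choose_k(annotated, x) * n_choose_k(population - annotated, sample - x)
        for x in range(hits, upper + 1)
    )
    return tail / float(total)


if __name__ == "__main__":
    # 10 background proteins, 5 annotated; all 5 targets carry the term.
    print(hypergeom_pvalue(10, 5, 5, 5))
```

    A term would then be reported when this probability falls below the chosen cut-off, 0.05 by default.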

    Author(s)

    Gaston K. Mazandu, Emile R. Chimusa, Mamana Mbiyavanga and Nicola J. Mulder
    Maintainer: Mazandu GK

  • proteinfct

    Discovering functionally related or similar genes/proteins based on their GO terms.

    Description

    Given two files, the first (protein GO annotation file) containing a list of protein or gene IDs with their associated GO IDs, and the second (target protein list) containing the list of target protein or gene IDs to be clustered, this function partitions the gene or protein set into biologically meaningful sub-classes using their functional closeness based on GO annotations and derived from a selected semantic similarity model.

    Usages

    (a) proteinfct(AnnotationData, TargetIDs = [ ], ontology = 'BP', approach = 'u', score = 0.3, mclust = 1, nclust = 0, drop = 0, output = 1)

    Arguments

    AnnotationData The name of the file containing the list of protein or gene IDs and their associated GO IDs. This file should contain two columns separated by white space: the first column is the gene or protein ID and the second column holds the GO IDs associated with the protein or gene. GO IDs are separated by commas.

    TargetIDs List/tuple or the name of the file containing the list/tuple of proteins/genes to be clustered (classified). If not provided, all protein/gene IDs in AnnotationData are used.

    ontology One of the GO ontologies: BP, MF and CC.

    approach One of the term semantic similarity approaches. Refer to Table 1 for the symbols used for each approach.

    score The threshold score providing the semantic similarity degree at which concepts are considered to be semantically close or similar in the GO structure, set to 0.3 by default.

    mclust The clustering model under consideration. This function implements three different models and mclust is set to 1 (hierarchical clustering) by default. Refer to http://web.cbio.uct.ac.za/ITGOM/adagofun/PKG-INFO (subsection 6.1) for more details.

    nclust The number of clusters. It applies only to the kmeans model and is set to 0 by default.

    drop A Boolean parameter only useful in the context of the Annotation-based approach. It is set to 1 if annotations with the Inferred from Electronic Annotation (IEA) evidence code should be removed and to 0 if all evidence codes should be considered.

    output An Enum parameter taking values 0, 1 and 2. It is set to 1 to output results on the screen, to 0 to output results in a file and to 2 to return a Python object containing the output for possible further usage.

    See http://web.cbio.uct.ac.za/ITGOM/adagofun/PKG-INFO (Detailed description and use of main functions, section #6)

    Author(s)

    Gaston K. Mazandu, Emile R. Chimusa, Mamana Mbiyavanga and Nicola J. Mulder
    Maintainer: Mazandu GK

  • proteinfit

    Retrieving genes or proteins based on their GO annotations.

    Description

    Given two files, or one file and a set of GO IDs, the first (reference or background protein list) containing a list of reference or background protein or gene IDs with their associated GO IDs, and the second containing the target GO IDs, this function identifies genes or proteins contributing to the given processes at a certain threshold or agreement level based on protein or gene annotations.

    Usages

    (a) proteinfit(AnnotationData, TargetGOIDs, ontology = 'BP', approach = 'u', score = 0.3, drop = 0, output = 1)

    Arguments

    AnnotationData The name of the file containing the list of protein or gene IDs and their associated GO IDs. This file should contain two columns separated by white space: the first column is the gene or protein ID and the second column holds the GO IDs associated with the protein or gene. These GO IDs are separated by commas.

    TargetGOIDs A list/tuple or the name of the file containing the list of target GO IDs for which proteins/genes should be identified.

    ontology One of the GO ontologies: BP, MF and CC.

    approach One of the term semantic similarity approaches. Refer to Table 1 for the symbols used for each approach.

    score The threshold score providing the semantic similarity degree at which terms are considered to be semantically close in the GO structure. More specifically, the score at which an ancestor can be considered to occur through its descendant.

    drop A Boolean parameter only useful in the context of the Annotation-based approach. It is set to 1 if annotations with the Inferred from Electronic Annotation (IEA) evidence code should be removed and to 0 if all evidence codes should be considered.

    output An Enum parameter taking values 0, 1 and 2. It is set to 1 to output results on the screen, to 0 to output results in a file and to 2 to return a Python object containing the output for possible further usage.

    See http://web.cbio.uct.ac.za/ITGOM/adagofun/PKG-INFO (Detailed description and use of main functions, section #4)
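
    One way to picture the role of the score threshold is the filter sketched below: a protein is kept when at least one of its GO terms is within the threshold of a target term. This is a simplified reading written for illustration, not the algorithm implemented by proteinfit. The function names, the protein IDs and the similarity values in the toy table are all made up for the example.

```python
def proteins_matching(annotations, target_go_ids, term_similarity, score=0.3):
    """Return IDs having at least one GO term scoring >= `score` against a target."""
    selected = []
    for protein_id in sorted(annotations):
        scores = [
            term_similarity(target, term)
            for target in target_go_ids
            for term in annotations[protein_id]
        ]
        if scores and max(scores) >= score:
            selected.append(protein_id)
    return selected


# Toy similarity table with invented values, standing in for a real term measure.
TOY_SCORES = {frozenset(["GO:0008152", "GO:0009987"]): 0.4}


def toy_similarity(a, b):
    """Hypothetical term similarity: 1 for identical terms, table lookup otherwise."""
    if a == b:
        return 1.0
    return TOY_SCORES.get(frozenset([a, b]), 0.0)


if __name__ == "__main__":
    annotations = {
        "PROT1": {"GO:0008152"},
        "PROT2": {"GO:0009987"},
        "PROT3": {"GO:0005575"},
    }
    print(proteins_matching(annotations, ["GO:0008152"], toy_similarity, score=0.3))
```

    Raising the threshold narrows the result: with score = 0.5 only the protein annotated directly with the target term remains in this toy setting.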

    Author(s)

    Gaston K. Mazandu, Emile R. Chimusa, Mamana Mbiyavanga and Nicola J. Mulder
    Maintainer: Mazandu GK

  • Appendix-1 A-DaGO-Fun Installation and Administration

    The main website for the A-DaGO-Fun package is http://web.cbio.uct.ac.za/ITGOM/adagofun, where users can find essential information about obtaining A-DaGO-Fun. It is freely downloadable under the GNU General Public License (GPL), pre-compiled for Linux and protected by copyright laws. Users are free to copy, modify, merge, publish, distribute and display information contained in the package, provided that it is done with appropriate citation of the package and by including the permission notice in all copies or substantial portions of the modules contained in this package.

    Installing A-DaGO-Fun is straightforward and similar to the installation of other Python packages. The whole package is relatively large (around 96Mb) and contains six modules and sets of files (GO term features and IC scores) available for download. It is currently maintained by one member of the core-development team, Gaston K. Mazandu, who regularly updates the information available in this package and makes every effort to ensure the quality of this information.

    Installation

    Four packages, namely scipy, matplotlib, networkx and cPickle, need to be available prior to the installation and use of A-DaGO-Fun (cPickle is part of the Python 2 standard library). To install A-DaGO-Fun, the user needs to download the ‘tar.gz’ file and extract all files as follows:

    tar -xzf dagofun.tar.gz

    and then install from the ‘package’ menu. To do this, one uses the following command:

    python setup.py install --user

    Note that one module, tabulate.py, which pretty-prints tabular data, is borrowed from other authors. It was written by Sergey Astanin and collaborators.
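
    Before running the installation, it can be convenient to confirm that the third-party dependencies are importable. The helper below is a small sketch written for this purpose and is not shipped with A-DaGO-Fun. The function name is an assumption.

```python
import importlib


def missing_modules(names):
    """Return the subset of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing
```

    Calling missing_modules(["scipy", "matplotlib", "networkx"]) returns an empty list when all three packages are importable, and otherwise lists the ones that still need to be installed.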

    Usage

    This package provides six main modules: TermFeatures.py, TermSimilarity.py, ProteinSimilarity.py, ProteinSearch.py, EnrichmentAnalysis.py and ProteinClustering.py, written independently and containing the functions getTermFeatures, termsim, funcsim, proteinfit, gossfeat and proteinfct, respectively. These functions have been described in the preceding sections.

    To start with the A-DaGO-Fun package, type following commands:

    >>> import dagofun>>> help(dagofun)

    One can start a module GivenModule.py from the A-DaGO-Fun package as follows:>>> import dagofun.GivenModule as gm>>> help(gm)

    After starting the module GivenModule.py as above, one can call or use the special functionnamed gofunc of this module by typing the following command:

    >>> gm.gofunc(arguments)

    The function named gofunc from the module GivenModule.py can also be made available directly by typing the following command:

    >>> from dagofun.GivenModule import gofunc


    After importing the function gofunc, to get help on how to use the function, type the command

    >>> help(gofunc)

    The function can be called directly as follows:

    >>> gofunc(parameters)

    Finally, all the special functions in the package can be made available using the following command:

    >>> from dagofun import *

    and thus, each of the special functions can be called directly as described above.

    Examples

    Go to http://web.cbio.uct.ac.za/ITGOM/adagofun/PKG-INFO (see Detailed description and use of main functions)

    or

    go to your local dagofun folder and type the following command line in a terminal:

    python setup.py --long-description

    As pointed out previously, the A-DaGO-Fun package contains six main modules: TermFeatures.py, TermSimilarity.py, ProteinSimilarity.py, ProteinSearch.py, EnrichmentAnalysis.py and ProteinClustering.py, each module containing a special function which can be run on a Python interface with commands as shown above to produce different outputs. One can also produce these results using direct command lines without making use of a Python interface, as shown below. However, it is important to note that these command lines are only applicable in the case where user terms’ or proteins’ input data are retrieved from a file. In this case,

    (a) Retrieving terms’ features is achieved using the following command:

    python $(python -m site --user-site)/dagofun/TermFeatures.py InputData model drop output

    Note that the arguments should be given in the order shown in the command, which is the same order as for the corresponding function under a Python interface as described previously, and that the InputData file containing the user input list of terms must be provided. Where other parameters are not provided, default parameters are used.

    (b) Computing term semantic similarity scores is performed using the following command line:

    python $(python -m site --user-site)/dagofun/TermSimilarity.py FileName ontology nappr approach drop output

    where FileName is a user input file containing the GO ID pairs and must be provided, and nappr is the number of approaches to be executed, as the module can retrieve term semantic similarity scores using more than one approach, up to four different approaches. These different approaches are then provided just after this number.

    (c) Calculating protein functional similarity is accomplished using the following command line:

    python $(python -m site --user-site)/dagofun/ProteinSimilarity.py AnnotationFile ProteinPairs ontology nmeas measure drop output

    where AnnotationFile is a user input file containing proteins and their GO IDs and ProteinPairs is another user input file containing the protein pairs for which scores should be computed; these two files must be provided. nmeas is the number of functional similarity measures to be executed, as the module can compute protein functional similarity scores using more than one measure, up to three different measures. These different measures are then provided just after this number.

    (d) Discovering functionally related or similar genes/proteins based on their GO terms is performed using the following command line:

    python $(python -m site --user-site)/dagofun/ProteinClustering.py AnnotationFile Targets ontology measure score mclust nclust drop output

    (e) Searching for proteins based on GO annotations is accomplished as follows:

    python $(python -m site --user-site)/dagofun/ProteinSearch.py AnnotationData TargetGOIDs ontology approach score drop output

    (f) Identifying enriched GO terms taking into account uncertainty in the dataset is done using the following command line:

    python $(python -m site --user-site)/dagofun/EnrichmentAnalysis.py ReferenceFile TargetFile ontology approach score pvalue drop output

    The different arguments in these command lines are as explained for the previously defined functions corresponding to each module. Where other parameters are not provided, default parameters are used. ‘$(python -m site --user-site)/dagofun/’ in these commands can be replaced by the path to the module being run, especially if the package has not been installed.
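
    The command lines above can also be assembled from Python. The following is a minimal sketch, not part of the package: the helper name build_command is hypothetical, and it assumes the package was installed with the --user option so that the modules sit in the dagofun folder of the user site directory.

```python
import os
import site
import sys


def build_command(module, *args):
    """Return the argument list that runs one A-DaGO-Fun module as a script.

    `build_command` is a hypothetical helper used for illustration only.
    """
    # site.getusersitepackages() returns the directory that
    # `python -m site --user-site` prints on the command line.
    script = os.path.join(site.getusersitepackages(), "dagofun", module)
    return [sys.executable, script] + [str(a) for a in args]


# Only the mandatory input file is given here, so the module would fall
# back on its default values for the remaining parameters.
cmd = build_command("TermSimilarity.py", "TermSimTest.txt")
print(" ".join(cmd))
```

    The resulting list can be passed to subprocess.call once the package is installed; replacing the user site directory with another path covers the case where the package has only been extracted.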

    Data format samples: To illustrate what user input files should look like, please go to http://web.cbio.uct.ac.za/ITGOM/adagofun/tests

    or similarly, go to the tests sub-folder in your local dagofun folder.

    1. TestTerms.txt contains a list of GO terms for use in the function getTermFeatures.

    2. TermSimTest.txt contains a list of GO term pairs for use in the function termsim.

    3. TestProteins.txt contains protein/gene IDs and their associated GO terms for use in the functions funcsim, proteinfit or proteinfct. This is also the case for ReferenceSetTest.txt, SpecificRefSet1.txt and SpecificRefSet2.txt.

    4. TargetSetTest.txt is a model file containing a list of proteins. It can serve as a target set of proteins for the background or reference set of proteins ReferenceSetTest.txt and can be used in the context of the function gossfeat.

    Important notes

    • To efficiently use the A-DaGO-Fun package and to maximally benefit from its use, make sure that you have carefully read this PDF documentation file, which is provided in the package.

    • In some cases, you may be required to provide the name of the file. Please make sure that the full path to the target file is provided.

    • Make use of the full screen mode when displaying results on screen for an adapted visualization.


    Appendix-2 Information Content-based GO Semantic Similarity Measures

    Computing term information content scores

    From their conception, term information content (IC) approaches can be divided into two families: the annotation-based and topology-based IC families. The topology-based family exploits only the intrinsic topology of the GO DAG, but the annotation-based family requires the addition of annotation data for the corpus under consideration. Except for the topology-based model proposed by Wang et al. [17], all other approaches compute the IC of terms in a similar way (i.e., using the log function) despite their conceptual differences. The IC of the term is given by

    $$ IC(x) = -\ln\left(p(x)\right) \tag{1} $$

    It is worth remembering that in the different GO semantic similarity scores described here and implemented in the A-DaGO-Fun package, it is assumed that the roots of the three separate ontologies, namely Molecular Function (MF), Biological Process (BP) and Cellular Component (CC), with GO IDs GO:0003674, GO:0008150 and GO:0005575, respectively, are located at level 0, the reference level, and are biologically meaningless, i.e., the IC value of a root is 0.

    In the case of annotation-based approaches, p(x) is the relative frequency of the term x in the protein dataset under consideration, obtained from the frequency f(x) representing the number A(x) of proteins annotated with the term x in the dataset considering the ‘true-path rule’ principle of the GO DAG structure. Thus, this frequency f(x) is given by

    $$ f(x) = \begin{cases} A(x) & \text{if } x \text{ is a leaf} \\ A(x) + \sum_{z \in Ch(x)} A(z) & \text{otherwise} \end{cases} $$

    where Ch(x) is the set of GO terms having x as a parent, and a leaf is a term that has no child. In the case of the topology-based approach introduced by Zhang et al. [15], f(x) is called the count of the term x; it depends only on the children of a given GO term and is numerically equal to the sum of the counts of all its children. f(x) is calculated using a recursive formula starting from the leaves in the hierarchical structure, and is given by

    $$ f(x) = \begin{cases} 1 & \text{if } x \text{ is a leaf} \\ \sum_{z \in Ch(x)} f(z) & \text{otherwise} \end{cases} $$

    The relative frequency p(x), called the D-value in the case of the topology-based approach used here, is then computed independently for each ontology and given by

    $$ p(x) = \frac{f(x)}{f(r)} $$

    where f(r) is the frequency (count) of the root term in the ontology under consideration. It is worth mentioning that the Zhang et al. model for computing the IC score follows the Seco et al. approach [16] in its conception and is adapted to the context of the GO-DAG.
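
    The recursive count and the resulting IC of the Zhang et al. model can be sketched in a few lines of Python. This is an illustration only, not the package implementation: the five-term DAG and the names children, count, f, p and ic are hypothetical.

```python
import math

# Hypothetical toy DAG: each term maps to the list of its children.
children = {
    "root": ["a", "b"],
    "a": ["c", "d"],
    "b": ["d"],
    "c": [],
    "d": [],
}


def count(term, memo):
    """Count f(x): 1 for a leaf, otherwise the sum of the children's counts."""
    if term not in memo:
        kids = children[term]
        memo[term] = 1 if not kids else sum(count(k, memo) for k in kids)
    return memo[term]


memo = {}
f = {t: count(t, memo) for t in children}
# Relative frequency (D-value) p(x) = f(x) / f(root), then IC(x) = -ln p(x).
p = {t: f[t] / float(f["root"]) for t in children}
ic = {t: -math.log(p[t]) for t in children}
```

    In this toy example the root has count 3, so its D-value is 1 and its IC is 0, in line with the convention that roots carry no information.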

    In the context of the GO-universal approach [14], p(x) is called the topological position characteristic of x, recursively obtained using its parents gathered in the set $P_x = \{t : (t, x) \in L_{GO}\}$, where $L_{GO}$ expresses the set of links in the GO-DAG and $(t, x) \in L_{GO}$ represents the link or association between a given parent t and its child x. This topological position characteristic, p(x), is given by

    $$ p(x) = \begin{cases} 1 & \text{if } x \text{ is a root} \\ \prod_{t \in P_x} \dfrac{p(t)}{|Ch(t)|} & \text{otherwise} \end{cases} \tag{2} $$

    with |Ch(t)| the number of children with term t as parent.
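
    As an illustration of Equation (2), the sketch below computes the topological position characteristic on a small hypothetical DAG. The DAG and the names children, parents and position are assumptions made for this example and are not taken from the package.

```python
import math

# Hypothetical toy DAG: each term maps to the list of its children.
children = {
    "root": ["a", "b"],
    "a": ["c", "d"],
    "b": ["d"],
    "c": [],
    "d": [],
}

# Invert the child lists to obtain the parent set P_x of every term.
parents = {t: [] for t in children}
for t, kids in children.items():
    for k in kids:
        parents[k].append(t)


def position(term, memo):
    """p(x): 1 for a root, else the product over parents t of p(t) / |Ch(t)|."""
    if term not in memo:
        value = 1.0  # the empty product covers the root case
        for t in parents[term]:
            value *= position(t, memo) / len(children[t])
        memo[term] = value
    return memo[term]


memo = {}
p = {t: position(t, memo) for t in children}
ic = {t: -math.log(p[t]) for t in children}
```

    Here term d has two parents, so p(d) = (p(a)/2) × (p(b)/1) = 1/8, which makes it more informative than its sibling c with p(c) = 1/4.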

    Wang et al. [17] introduced a topology-based semantic similarity measure in which the semantic value of a given term x is computed using an S-value $S_x$ related to the term x, and given by

    $$ S_x(t) = \begin{cases} 1 & \text{if } t = x \\ \max\{\omega_e \times S_x(t') : t' \in Ch(t)\} & \text{otherwise} \end{cases} \tag{3} $$

    with Ch(t) the set of children of the term t, and $\omega_e$ the semantic contribution factor for ‘is a’ and ‘part of’ relations, set to 0.8 and 0.6, respectively. The information content or semantic value of a term x is calculated as follows:

    $$ IC(x) = \sum_{t \in T_x} S_x(t) \tag{4} $$

    where $T_x = T \cup \{x\}$ and T denotes the set of ancestors of the term x. In this case, the IC scores of the three roots are 1, the lowest term IC value.
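
    Equations (3) and (4) can be sketched as an upward propagation from the term x to its ancestors. The code below is a minimal illustration on a hypothetical four-term graph. Only the weights 0.8 (‘is a’) and 0.6 (‘part of’) come from the text; the graph and the names parent_links, s_values and wang_ic are assumptions.

```python
# Hypothetical toy graph: each term maps to its (parent, weight) links,
# with weight 0.8 for 'is a' and 0.6 for 'part of' relations.
parent_links = {
    "x": [("a", 0.8), ("b", 0.6)],
    "a": [("root", 0.8)],
    "b": [("root", 0.8)],
    "root": [],
}


def s_values(term):
    """S-values S_x(t) for the term x itself and all of its ancestors."""
    s = {term: 1.0}
    stack = [term]
    while stack:
        child = stack.pop()
        for parent, weight in parent_links[child]:
            candidate = weight * s[child]
            # Keep the maximum contribution over all paths to this ancestor.
            if candidate > s.get(parent, 0.0):
                s[parent] = candidate
                stack.append(parent)
    return s


def wang_ic(term):
    """Semantic value of a term: the sum of its S-values (Equation 4)."""
    return sum(s_values(term).values())


s_x = s_values("x")
```

    For this graph the root receives max(0.8 × 0.8, 0.8 × 0.6) = 0.64, so the semantic value of x is 1 + 0.8 + 0.6 + 0.64 = 3.04, while the semantic value of the root alone is 1.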

    Different GO term semantic similarity approaches

    Several approaches have been proposed for computing term semantic similarity scores in the context of GO, especially in the case of the annotation-based family. These include the Resnik, Lin, Nunivers and Jiang & Conrath approaches. Several corrections, such as the Graph-based Similarity (Disjunct Common Ancestor by Couto et al. [11], known as GraSM, and eXtended GraSM, denoted XGraSM [1, 2]), the relevance similarity by Schlicker et al. [12] and the information coefficient similarity by Li et al. [13], have been proposed in order to improve existing GO term comparison approaches. Unless explicitly stated, the subscript used in the term semantic similarity formula of each approach corresponds to its symbol under the A-DaGO-Fun tool as described in Table 1.

    Resnik, Lin and Nunivers approaches

    For Resnik, the similarity between two terms is the information content of their most informative common ancestor (MICA), given by the following formula:

    $$ S_r(a, b) = IC(c) = \max\{IC(x) : x \in T_a \cap T_b\} \tag{5} $$

    where c is MICA between terms a and b.

    The Lin semantic similarity approach takes the IC of the MICA of the terms being compared and normalizes it by the average of the IC values of these terms. Thus, the similarity between two terms is given by:

    $$ S_l(a, b) = \frac{2 \times IC(c)}{IC(a) + IC(b)} \tag{6} $$


    Note that the Lin approach produces scores ranging between 0 and 1, and satisfies the property that the semantic similarity score between a term and itself is 1, but that is not the case for the Resnik approach. So, two strategies were suggested to scale these scores between 0 and 1 [2]: one using the possible upper bound of IC values [5], referred to as the Nunif strategy, and another using the highest IC score in the ontology under consideration [4], referred to as the Nmax strategy, given by:

    $$ S_r(a, b) = \begin{cases} \dfrac{IC(c)}{\log_2 N} & \text{for Nunif} \\[2ex] \dfrac{IC(c)}{IC_{\max}} & \text{for Nmax} \end{cases} \tag{7} $$

    where N is the number of annotated proteins in the corpus under consideration and $IC_{\max}$ the highest IC score in the ontology considered. The A-DaGO-Fun tool implements the Nmax model, which showed better performance compared to the Nunif model, for the Resnik approach.

    The Nunivers approach [2] has been proposed to satisfy the requirement that the semantic similarity score between a term and itself should be 1, by normalizing the score by the maximum of the IC values of the terms, and is given by:

    $$ S_n(a, b) = \frac{IC(c)}{\max\{IC(a), IC(b)\}} \tag{8} $$
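
    Equations (5) to (8) differ only in how the IC of the MICA is normalized, which the following sketch makes explicit. The IC values and ancestor sets are hypothetical, and so are the function names. Returning 0 when the denominator is 0 (two roots) is a choice made for this example, not a documented behavior of the package.

```python
# Hypothetical IC values and ancestor sets T_x (each set contains x itself).
ic = {"root": 0.0, "a": 1.0, "b": 1.5, "c": 2.0, "d": 3.0}
ancestors = {
    "c": {"c", "a", "root"},
    "d": {"d", "a", "b", "root"},
}


def mica_ic(a, b):
    """IC of the most informative common ancestor of terms a and b."""
    return max(ic[x] for x in ancestors[a] & ancestors[b])


def resnik_nmax(a, b):
    """Resnik score scaled by the highest IC in the ontology (Nmax)."""
    return mica_ic(a, b) / max(ic.values())


def lin(a, b):
    """Lin score: MICA IC over the average IC of the two terms."""
    denom = ic[a] + ic[b]
    return 2.0 * mica_ic(a, b) / denom if denom else 0.0


def nunivers(a, b):
    """Nunivers score: MICA IC over the larger of the two term ICs."""
    denom = max(ic[a], ic[b])
    return mica_ic(a, b) / denom if denom else 0.0
```

    With these values the MICA of c and d is a, so resnik_nmax("c", "d") and nunivers("c", "d") both give 1/3, lin("c", "d") gives 0.4, and both lin and nunivers return 1 when a term is compared with itself.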

    Note that the Jiang and Conrath approach is not used explicitly since it has been shown to be a particular case of the Lin approach [2]. Thus, in the A-DaGO-Fun tool, the Jiang & Conrath similarity approach [7] is under the Lin approach label, as it is just the non-normalized distance derived from the Lin similarity measure and all normalization schemes that have been proposed have failed to improve the performance of this approach [5].

    Improving Annotation-based GO Semantic Similarity Scores

    Some correction factors were proposed to deal with the issue of score overestimation observed in the annotation-based approaches. These include the Relevance similarity measure introduced by Schlicker et al. [12] and the Information Coefficient suggested by Li et al. [13] in the context of the Lin approach, and the Graph-based similarity measure (GraSM and XGraSM) [1, 2, 11], which can be applied to any annotation-based term semantic similarity approach. In this case, the similarity score S(a, b) between terms a and b is weighted by a correction factor $\alpha$, i.e., the corrected score $S_f(a, b)$ is given by

    $$ S_f(a, b) = \alpha \times S(a, b) \tag{9} $$

    where f refers to the symbol of the term semantic similarity approach as described in Table 1 and the correction factor $\alpha$ is calculated as follows:

    $$ \alpha = \begin{cases} 1 - \exp\left(-IC(c)\right) & \text{for Relevance and } f = s \\[1ex] 1 - \left(1 + IC(c)\right)^{-1} & \text{for Li et al. and } f = li \\[1ex] \dfrac{1}{n}\left(1 + \sum_{j=1}^{n-1} \dfrac{IC(t_j)}{IC(c)}\right) & \text{for Graph-based and } f \in \{xr, xl, xn\} \end{cases} \tag{10} $$

    with n the number of disjunctive (for GraSM) or all informative (for XGraSM) common ancestors between terms a and b, the nth ancestor term being the most informative common ancestor (MICA) between a and b, i.e., the common ancestor with the highest IC value. It is worth mentioning that XGraSM has been shown to outperform the GraSM approach [2] and that finding the disjunctive common ancestors (DCA) between two GO terms makes the original GraSM approach computationally unattractive. Unfortunately, this computational cost is not matched by a proportional improvement in performance, and thus the GraSM approach is not included in the A-DaGO-Fun tool [1].
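
    The three correction factors of Equation (10) can be written as small functions, as sketched below. The function names are hypothetical. The graph-based factor is expressed as the mean IC of the n common ancestors divided by the IC of the MICA, which is an algebraic rearrangement of the third case of Equation (10).

```python
import math


def relevance_factor(ic_c):
    """Relevance correction: 1 - exp(-IC(c)), with c the MICA."""
    return 1.0 - math.exp(-ic_c)


def li_factor(ic_c):
    """Li et al. correction: 1 - 1 / (1 + IC(c))."""
    return 1.0 - 1.0 / (1.0 + ic_c)


def graph_factor(common_ancestor_ics):
    """Graph-based correction from the IC values of the n common ancestors.

    The list must include the MICA; roots (IC 0) are assumed to be excluded.
    """
    n = len(common_ancestor_ics)
    ic_c = max(common_ancestor_ics)  # the MICA has the highest IC
    return sum(common_ancestor_ics) / (n * ic_c)


# Example: correct a Lin score of 0.4 whose MICA has IC 1.0.
corrected = relevance_factor(1.0) * 0.4
```

    All three factors lie between 0 and 1, so a corrected score never exceeds the uncorrected one; for instance, graph_factor([1.0, 2.0]) gives 0.75.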

    Wang, Zhang and GO-universal approaches

    In the case of the topology-based family, each approach was introduced with a specific term semantic similarity approach, except for the Zhang et al. approach, which is a context-dependent method, implemented with a Lin-like term semantic similarity approach under the A-DaGO-Fun tool. Thus, for the Zhang et al. approach, the semantic similarity between two terms is given by

    $$ S_z(a, b) = \frac{2 \times IC(c)}{IC(a) + IC(b)} \tag{11} $$

    The GO-universal approach uses the Nunivers normalization model and calculates the similarity score as follows:

    $$ S_u(a, b) = \frac{IC(c)}{\max\{IC(a), IC(b)\}} \tag{12} $$

    Finally, for the Wang et al. approach, the semantic similarity between two terms is given by:

    $$ S_w(a, b) = \frac{\sum_{t \in T_a \cap T_b} \left(S_a(t) + S_b(t)\right)}{IC(a) + IC(b)} \tag{13} $$

    where, for a term x, $T_x = T \cup \{x\}$ with T the set of ancestors of x, and $S_x$ is the S-value related to the term x as defined previously.

    Functional similarity measures

    A protein can be annotated by a set of terms since it can perform more than one biological function and be involved in several processes. Thus, functional similarity can be measured between proteins, or sets of GO terms, by combining the semantic similarity scores of the GO terms annotating them using basic statistical measures of closeness (mean, max, min, etc.), such as Best-Match Average (BMA) [5, 14], Best Match Maximum (BMM) [12], Average Best-Matches (ABM) [4, 17], Average (Avg) [20] and Maximum (Max) [19]. These measures are often referred to as term semantic-based, pairwise or indirect measures. In this category, special measures derived from term distance scores using the Hausdorff (HDF) distance [35, 36, 37] have been suggested, used [34, 33] and implemented in several semantic similarity tools [25, 27]. In general, these statistical measures of closeness are known to be sensitive to outliers, i.e., scores that lie at abnormal distances from the majority of scores. This means that these measures may produce biases which affect functional similarity scores [14]. Thus, other functional similarity measures, such as SimGIC [5], SimDIC, SimUIC [1, 3] and Cosine [29, 6], which use the IC of terms directly to compute functional similarity scores from their GO annotations, were introduced. SimGIC, SimDIC and SimUIC use the Jaccard index [18], but the Cosine measure uses a normalized dot product to estimate functional similarity scores. The A-DaGO-Fun tool supports all these functional similarity measures for each term semantic similarity approach and IC model.


    Term semantic-based measures: Avg, Max, BMA, BMM, ABM and HDF

    The average and maximum measures are computed as follows:

    $$ Avg(p, q) = \frac{1}{n \times m} \sum_{s \in T_p^X,\; t \in T_q^X} S(s, t) \tag{14} $$

    and

    $$ Max(p, q) = \max\{S(s, t) : s \in T_p^X \text{ and } t \in T_q^X\} \tag{15} $$

    where $T_r^X$ is the set of GO terms in X, representing the molecular function (MF), biological process (BP) or cellular component (CC) ontology, annotating a given protein r; $n = |T_p^X|$ and $m = |T_q^X|$ are the numbers of GO terms in these sets, and S(s, t) is the semantic similarity score between terms s and t.

    The BMA [1, 14] for two annotated proteins p and q is the mean of the following two values: the average of best matches of GO terms annotated to protein p against those annotated to protein q, and the average of best matches of GO terms annotated to protein q against those annotated to protein p, given by the following formula:

    $$ BMA(p, q) = \frac{1}{2}\left(\frac{1}{n} \sum_{s \in T_p^X} S(s, T_q^X) + \frac{1}{m} \sum_{s \in T_q^X} S(s, T_p^X)\right) \tag{16} $$

    with $S(s, T_r^X) = \max\{S(s, t) : t \in T_r^X\}$. It is important to note that the BMM measure, also known as the RCMax (RowScore and ColumnScore Maximum) measure [21] implemented in the GOSemSim R package, takes the maximum of these two values instead of their mean and is given by

    $$ BMM(p, q) = \max\left\{\frac{1}{n} \sum_{s \in T_p^X} S(s, T_q^X),\; \frac{1}{m} \sum_{s \in T_q^X} S(s, T_p^X)\right\} \tag{17} $$

    However, the performance of this measure has never been assessed and it is rarely, if ever, used. This is why this measure is not implemented under the A-DaGO-Fun tool.

    The ABM [1] for two annotated proteins is the mean of best matches of GO terms of each protein against the other, given by the following formula:

    $$ ABM(p, q) = \frac{1}{n + m}\left(\sum_{s \in T_p^X} S(s, T_q^X) + \sum_{s \in T_q^X} S(s, T_p^X)\right) \tag{18} $$

    Note that the ABM and BMA measures generally produce different scores; they coincide when n = m, which is not often the case in a set of annotated genes or proteins.
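
    The difference between the Avg, Max, BMA and ABM measures is easiest to see on a small example. The sketch below uses hypothetical term similarity scores for a protein p with two terms and a protein q with three terms; all names and values are assumptions made for illustration.

```python
# Hypothetical term-term similarity scores S(s, t).
scores = {
    ("s1", "t1"): 0.9, ("s1", "t2"): 0.2, ("s1", "t3"): 0.4,
    ("s2", "t1"): 0.3, ("s2", "t2"): 0.6, ("s2", "t3"): 0.5,
}
terms_p = ["s1", "s2"]        # GO terms annotating protein p (n = 2)
terms_q = ["t1", "t2", "t3"]  # GO terms annotating protein q (m = 3)


def S(s, t):
    """Symmetric lookup of the term similarity score."""
    return scores[(s, t)] if (s, t) in scores else scores[(t, s)]


def best(s, terms):
    """Best match S(s, T): highest score of s against a set of terms."""
    return max(S(s, t) for t in terms)


def avg(tp, tq):
    return sum(S(s, t) for s in tp for t in tq) / float(len(tp) * len(tq))


def maximum(tp, tq):
    return max(S(s, t) for s in tp for t in tq)


def bma(tp, tq):
    row = sum(best(s, tq) for s in tp) / float(len(tp))
    col = sum(best(s, tp) for s in tq) / float(len(tq))
    return 0.5 * (row + col)


def abm(tp, tq):
    total = sum(best(s, tq) for s in tp) + sum(best(s, tp) for s in tq)
    return total / float(len(tp) + len(tq))
```

    Here the best matches sum to 1.5 for p against q and to 2.0 for q against p, so BMA is (0.75 + 0.667)/2 ≈ 0.708 while ABM is 3.5/5 = 0.7. The two differ because n ≠ m.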

    A class of functional similarity measures was derived from the Hausdorff distance and used in the context of GO. The initial Hausdorff distance between proteins p and q is given by:

    $$ HDF(p, q) = \max\left\{\max_{s \in T_p^X} D(s, T_q^X),\; \max_{s \in T_q^X} D(s, T_p^X)\right\} \tag{19} $$

    where $D(s, T_p^X) = \min\{D(s, t) : t \in T_p^X\}$, with D(s, t) the distance between terms s and t. It is clear that if the distance D(s, t) is normalized (ranging between 0 and 1), then the HDF(p, q) score also ranges between 0 and 1, and emphasizes the functional closeness between proteins p and q through shared terms between these two proteins. If the two proteins p and q share very similar terms, in which case similarity scores S(s, t) are high for any $s \in T_p^X$ and $t \in T_q^X$, then the distance scores D(s, t) = 1 − S(s, t) will be low or close to 0 and consequently the distance score HDF(p, q) between p and q will also be low or close to 0. In this case, the functional similarity score between p and q, given by:

    $$ S(p, q) = 1 - HDF(p, q) \tag{20} $$

    is high or close to 1. Because we have semantic similarity scores rather than distances, to ease the computation of the distance score between proteins we need to express $D(s, T_p^X)$ in terms of semantic similarity scores. Thus,

    $$ D(s, T_p^X) = \min\{D(s, t) : t \in T_p^X\} = \min\{1 - S(s, t) : t \in T_p^X\} = 1 - \max\{S(s, t) : t \in T_p^X\} = 1 - S(s, T_p^X) \tag{21} $$

    It follows that

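    $$ HDF(p, q) = \max\left\{\max_{s \in T_p^X} D(s, T_q^X),\; \max_{s \in T_q^X} D(s, T_p^X)\right\} = \max\left\{\max_{s \in T_p^X} \left(1 - S(s, T_q^X)\right),\; \max_{s \in T_q^X} \left(1 - S(s, T_p^X)\right)\right\} \tag{22} $$

    Finally,

    $$ HDF(p, q) = \max\left\{1 - \min_{s \in T_p^X} S(s, T_q^X),\; 1 - \min_{s \in T_q^X} S(s, T_p^X)\right\} \tag{23} $$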

    It was indicated that 24 different measures for object matching can be derived from the Hausdorff distance and, based on their behavior in the presence of noise, the best measure, called the modified Hausdorff distance (MHDF) for object matching and shown to be more robust to outliers [35], is given by:

    $$ MHDF(p, q) = \max\left\{\frac{1}{n} \sum_{s \in T_p^X} D(s, T_q^X),\; \frac{1}{m} \sum_{s \in T_q^X} D(s, T_p^X)\right\} \tag{24} $$

    And in terms of semantic similarity scores, we can write:

    $$ MHDF(p, q) = \max\left\{\frac{1}{n} \sum_{s \in T_p^X} \left[1 - S(s, T_q^X)\right],\; \frac{1}{m} \sum_{s \in T_q^X} \left[1 - S(s, T_p^X)\right]\right\} \tag{25} $$

    and the functional similarity derived from MHDF corresponds to the BMM measure defined above, eliciting the need for further assessment of this measure in the context of GO. Note that the BMA and ABM measures also match some variants of the HDF metric [35].

    Another variant of the HDF distance, denoted VHDF, refers to a measure suggested by Lerman and Shakhnovich [34], and computes scores as follows:

    $$ VHDF(p, q) = \frac{1}{2}\left(\sqrt{\frac{1}{n} \sum_{s \in T_p^X} D^2(s, T_q^X)} + \sqrt{\frac{1}{m} \sum_{s \in T_q^X} D^2(s, T_p^X)}\right) \tag{26} $$

    It is worth mentioning that the MHDF and VHDF measures do not define a metric or distance since they violate the triangle inequality property of a metric.
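
    The Hausdorff-based distances of Equations (19), (24) and (26) can be sketched directly from term similarity scores, using D(s, T) = 1 − S(s, T) as in Equation (21). The scores and all names below are hypothetical and serve only as an illustration.

```python
import math

# Hypothetical term-term similarity scores S(s, t).
scores = {
    ("s1", "t1"): 0.9, ("s1", "t2"): 0.2, ("s1", "t3"): 0.4,
    ("s2", "t1"): 0.3, ("s2", "t2"): 0.6, ("s2", "t3"): 0.5,
}
terms_p = ["s1", "s2"]
terms_q = ["t1", "t2", "t3"]


def S(s, t):
    """Symmetric lookup of the term similarity score."""
    return scores[(s, t)] if (s, t) in scores else scores[(t, s)]


def D(s, terms):
    """Distance from term s to a term set: 1 - best-match similarity."""
    return 1.0 - max(S(s, t) for t in terms)


def hdf(tp, tq):
    """Hausdorff distance: the worst best-match distance in either direction."""
    return max(max(D(s, tq) for s in tp), max(D(s, tp) for s in tq))


def mhdf(tp, tq):
    """Modified Hausdorff distance: the larger of the two mean distances."""
    return max(sum(D(s, tq) for s in tp) / float(len(tp)),
               sum(D(s, tp) for s in tq) / float(len(tq)))


def vhdf(tp, tq):
    """Variant of Equation (26): mean of the two root-mean-square distances."""
    rms_p = math.sqrt(sum(D(s, tq) ** 2 for s in tp) / float(len(tp)))
    rms_q = math.sqrt(sum(D(s, tp) ** 2 for s in tq) / float(len(tq)))
    return 0.5 * (rms_p + rms_q)


hdf_similarity = 1.0 - hdf(terms_p, terms_q)
```

    In this example the worst-matched term is t3 with a best-match score of 0.5, so the HDF distance and the derived similarity of Equation (20) are both 0.5, whereas the averaged variants are less affected by that single term.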

    Direct term-based measures: SimGIC, SimDIC, SimUIC and Cosine

    The SimGIC measure computes the functional similarity score between two proteins p and q as follows:

    $$ SimGIC(p, q) = \frac{\sum_{x \in A_p^X \cap A_q^X} IC(x)}{\sum_{x \in A_p^X \cup A_q^X} IC(x)} \tag{27} $$

    where IC(x) is the information content value of the term x [2] and $A_r^X$ is the set of GO terms, together with their informative ancestors, in X representing the ontology (MF, BP or CC) annotating a given protein r.

    Two other functional similarity measures [1, 14], using the Dice (Czekanowski or Lin-like measure) and universal indexes, referred to as SimDIC and SimUIC, respectively, are given by the following formulae:

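    $$ SimDIC(p, q) = \frac{2 \times \sum_{x \in A_p^X \cap A_q^X} IC(x)}{\sum_{x \in A_p^X} IC(x) + \sum_{x \in A_q^X} IC(x)} \tag{28} $$

    $$ SimUIC(p, q) = \frac{\sum_{x \in A_p^X \cap A_q^X} IC(x)}{\max\left\{\sum_{x \in A_p^X} IC(x),\; \sum_{x \in A_q^X} IC(x)\right\}} \tag{29} $$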

    Finally, the SimUI approach [22], which refers to the union-intersection protein similarity measure, is a particular case of SimGIC assigning an equal IC value to all terms in the GO-DAG [14]. Even though this assumption is not realistic in the context of the GO DAG, the SimUI measure can still be used as an alternative measure in practice as it showed relatively good performance when applied to different biological data [3]. Indeed, this measure does not depend on term IC values and it is given by

    $$ SimUI(p, q) = \frac{|A_p^X \cap A_q^X|}{|A_p^X \cup A_q^X|} \tag{30} $$

    Similarly, one can define particular cases based on SimDIC (Dice) [29] and SimUIC (Universal), denoted by SimDB and SimUB, respectively, and given by

    $$ SimDB(p, q) = \frac{2 \times |A_p^X \cap A_q^X|}{|A_p^X| + |A_q^X|} \quad \text{and} \quad SimUB(p, q) = \frac{|A_p^X \cap A_q^X|}{\max\{|A_p^X|, |A_q^X|\}} \tag{31} $$

    A variant of SimUB, known as normalized term overlap (NTO) [32], was suggested and is defined as follows:

    $$ NTO(p, q) = \frac{|A_p^X \cap A_q^X|}{\min\{|A_p^X|, |A_q^X|\}} \tag{32} $$
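
    Equations (27) to (32) share the same structure: a weighted overlap divided by a normalization term. The sketch below makes this explicit with a weight function that returns either the IC of a term or 1. The IC values, the annotation sets (assumed to already contain informative ancestors and no root) and all function names are hypothetical.

```python
# Hypothetical IC values and annotation sets (terms plus informative ancestors).
ic = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}
a_p = {"a", "b", "c"}
a_q = {"a", "b", "d"}


def total(terms, weight):
    """Sum of term weights over a set of terms."""
    return sum(weight(x) for x in terms)


def jaccard(ap, aq, weight):    # SimGIC with IC weights, SimUI with unit weights
    return total(ap & aq, weight) / float(total(ap | aq, weight))


def dice(ap, aq, weight):       # SimDIC with IC weights, SimDB with unit weights
    return 2.0 * total(ap & aq, weight) / (total(ap, weight) + total(aq, weight))


def universal(ap, aq, weight):  # SimUIC with IC weights, SimUB with unit weights
    return total(ap & aq, weight) / float(max(total(ap, weight), total(aq, weight)))


def overlap(ap, aq, weight):    # NTO with unit weights
    return total(ap & aq, weight) / float(min(total(ap, weight), total(aq, weight)))


def by_ic(x):
    return ic[x]


def unit(x):
    return 1.0


sim_gic = jaccard(a_p, a_q, by_ic)
sim_ui = jaccard(a_p, a_q, unit)
```

    With these sets the shared terms a and b carry a total IC of 3 out of 10, so SimGIC is 0.3, while SimUI, which ignores IC, is 2/4 = 0.5.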


    In the case of the Cosine measure, the functional similarity score between two proteins p and q is calculated using a dot product and normalized using either the usual [29] or the Tanimoto coefficient [6] scheme. Using the usual normalization model, this similarity score is given by:

    $$ SimCOU(p, q) = \frac{\langle \mathbf{p}, \mathbf{q} \rangle}{\|\mathbf{p}\| \, \|\mathbf{q}\|} \tag{33} $$

    where $\langle \mathbf{p}, \mathbf{q} \rangle$ is the dot product between the two feature vectors $\mathbf{p}$ and $\mathbf{q}$ of proteins p and q, respectively. The feature vector of a protein ω = p or q is a vector $\boldsymbol{\omega} = (\omega_1, \ldots, \omega_m)$ of length $m = |A_p^X \cup A_q^X|$ in which each component $\omega_i$, for i = 1, ..., m, is associated with a term $t_i \in A_p^X \cup A_q^X$, indicating the absence (0) or presence (1) of term $t_i$ in the set of terms annotating the protein under consideration and weighted by its IC value. Thus, the component $\omega_i$ is given by:

    $$ \omega_i = \begin{cases} IC(t_i) & \text{if } t_i \in A_\omega^X \\ 0 & \text{otherwise} \end{cases} \tag{34} $$

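    and the norm of $\boldsymbol{\omega}$ is computed as $\|\boldsymbol{\omega}\| = \sqrt{\sum_{i=1}^{m} \omega_i^2}$ and the dot product as $\langle \mathbf{p}, \mathbf{q} \rangle = \sum_{i=1}^{m} (p_i \times q_i)$.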

    Another specialized normalization model is the Tanimoto coefficient calculated as follows:

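    $$ SimCOT(p, q) = \frac{\langle \mathbf{p}, \mathbf{q} \rangle}{\|\mathbf{p}\|^2 + \|\mathbf{q}\|^2 - \langle \mathbf{p}, \mathbf{q} \rangle} \tag{35} $$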
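
    The two Cosine variants of Equations (33) and (35) can be sketched as follows, building the IC-weighted feature vectors of Equation (34) over the union of the two annotation sets. The IC values, the sets and the names feature_vector, sim_cou and sim_cot are hypothetical.

```python
import math

# Hypothetical IC values and annotation sets (terms plus informative ancestors).
ic = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}
a_p = {"a", "b", "c"}
a_q = {"a", "b", "d"}


def feature_vector(annotated, universe):
    """IC-weighted presence/absence vector over the terms of the universe."""
    return [ic[t] if t in annotated else 0.0 for t in universe]


universe = sorted(a_p | a_q)       # one component per term of the union
p = feature_vector(a_p, universe)  # [1, 2, 3, 0]
q = feature_vector(a_q, universe)  # [1, 2, 0, 4]

dot = sum(x * y for x, y in zip(p, q))
sq_norm_p = sum(x * x for x in p)
sq_norm_q = sum(y * y for y in q)

sim_cou = dot / math.sqrt(sq_norm_p * sq_norm_q)  # usual normalization
sim_cot = dot / (sq_norm_p + sq_norm_q - dot)     # Tanimoto coefficient
```

    Only the shared terms a and b contribute to the dot product (1 + 4 = 5), giving SimCOU = 5/√294 ≈ 0.29 and SimCOT = 5/30 ≈ 0.17 for this example.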

    It is clear that the SimUI, SimDB, SimUB and NTO measures share the same numerator and differ only in the normalization scheme used; more importantly, $|A_p^X \cap A_q^X| = \langle \mathbf{p}, \mathbf{q} \rangle$ when every term IC value is set to 1. This indicates that using the usual and Tanimoto coefficient normalization schemes can lead to two other measures corresponding to the Cosine measure and equivalent to SimDB, SimUB and NTO.

    Important Note:

    When computing functional similarity scores using direct IC-based measures (SimGIC, SimDIC, SimUIC, SimUI, SimDB, SimUB and NTO) or term semantic similarity scores using the Graph-based enhancement strategy (XGraSM), we only use informative common ancestors. This means that the roots of ontologies are removed from the intersection and union of ancestors. Mathematically, we know that this tends to slightly decrease similarity values since

    given two real numbers a and b such that 0 < a ≤ b:

    $$ \frac{a}{b} \le \frac{a + \delta}{b + \delta} $$

    for any real δ ≥ 0.
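
    The inequality can be checked directly. A short derivation, using only the stated conditions 0 < a ≤ b and δ ≥ 0, is:

```latex
\[
\frac{a+\delta}{b+\delta} - \frac{a}{b}
  = \frac{b\,(a+\delta) - a\,(b+\delta)}{b\,(b+\delta)}
  = \frac{\delta\,(b-a)}{b\,(b+\delta)} \;\ge\; 0 ,
\]
```

    since δ ≥ 0, b − a ≥ 0 and b(b + δ) > 0. Equality holds when δ = 0 or a = b.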

    It is not yet clear whether removing roots in these computations improves the performance of these approaches, but theoretically it is likely to, as adding roots leads to over-estimating similarity scores although these roots are assumed to be meaningless, i.e., with IC value 0. Thus, considering the root of the ontology would add more bias to the similarity scores produced, but this still needs to be checked on real biological data. In the A-DaGO-Fun package, as well as in the DaGO-Fun on-line tool, the root of the ontology is kept out of the computation of the semantic similarity scores.


    Appendix-3 Other Existing IC-based GO semantic similarity tools

    Several tools have been developed for producing GO term and protein semantic similarity scores [6, 38] to facilitate the use of the different semantic similarity approaches that were suggested for different biological data types. These include on-line web tools and software packages, very often implemented in the R and Java programming languages, as shown in Table 2, each tool with the functional similarity measures it supports. Table 3 provides a mapping between the notations used by different tools and the different functional similarity measures.

    Table 2: IC-based GO semantic similarity tools currently used. All the known IC-based GO tools for computing semantic similarity scores and some of their features are provided. FSM stands for Functional Similarity Measure(s) and Input size provides the acceptable number of protein pairs. A star on a given number indicates the acceptable number of proteins and, in this case, protein functional similarity scores of all protein pairs built from the provided set or list of proteins are computed.

    | Tool | Format | Family | Approach | FSM | Input size | URL | Reference |
    |---|---|---|---|---|---|---|---|
    | KU-GOAL | Web | Annotation-based | Resnik, Lin, Jiang, Relevance | Avg, BMA, HDF, VHDF | Unlimited | http://www.ittc.ku.edu/chenlab/goal/index.php | [27] |
    | GOssTo | Web and Java | Annotation-based | Resnik, Lin, Jiang, Relevance, GraSM | Max, SimGIC, SimUI | Unlimited | http://www.paccanarolab.org/gosstoweb/ | [24] |
    | DaGO-Fun | Web | Topology-based; Annotation-based | Wang et al., Zhang et al. and GO-universal; Resnik, Lin, Nunivers, XGraSM, Relevance, Li et al. | Avg, Max, BMA, ABM, SimGIC, SimDIC, SimUIC, SimUI | 3000 | http://web.cbio.uct.ac.za/ITGOM/ | [1] |
    | G-SESAME | Web | Topology-based; Annotation-based | Wang et al.; classical Resnik, Lin and Jiang & Conrath | ABM | 1 | http://bioinformatics.clemson.edu/G-SESAME/ | [23] |
    | ProteInOn | Web | Annotation-based | Resnik, Lin, Jiang, GraSM | BMA, SimGIC, SimUI | 1000* | http://lasige.di.fc.ul.pt/webtools/proteinon/ | [28] |
    | FuSSiMeg | Web | Annotation-based | Resnik, Lin, Jiang, GraSM | BMA | 1 | http://xldb.di.fc.ul.pt/rebil/ssm/ | [10, 11] |
    | FunSimMat | Web | Annotation-based | Resnik, Lin, Jiang, Relevance | Avg, Max, SimGIC, NTO, SimUI | Unlimited | http://www.funsimmat.de/ | [30, 31] |
    | SML | Java | Topology-based; Annotation-based | Resnik, Lin, Jiang, Relevance | Avg, Max, BMA, BMM, SimGIC, SimUI, NTO | Unlimited | http://www.semantic-measures-library.org | [26] |
    | GOSemSim | R | Topology-based; Annotation-based | Wang et al.; Resnik, Lin, Jiang, Relevance | Avg, Max, ABM, BMM | Unlimited | http://bioconductor.org/packages/2.6/bioc/html/GOSemSim.html | [21] |
    | csbl.go | R | Annotation-based | Resnik, Lin, Jiang, GraSM, Relevance | SimGIC, SimDB, Cosine, BMM | Unlimited | http://csbi.ltdk.helsinki.fi/csbl.go/ | [29] |
    | GOSim | R | Annotation-based | Resnik, Lin, Jiang, GraSM, Relevance | HDF, Cosine, BMA, BMM, Avg, Max | Unlimited | http://www.bioconductor.org/packages/release/bioc/html/GOSim.html | [25] |

    In terms of input size, most of the tools support unlimited input size, except DaGO-Fun, which accepts 3000 GO term or protein pairs, the G-SESAME and FuSSiMeg web tools, which accept only one pair of GO terms or proteins, and the ProteInOn tool, which accepts up to 1000 GO terms or proteins and outputs similarity scores for all pairs. Even though these tools have enabled the exploration of different functional similarity measures, there are still several issues that need to be addressed. These include the non-exhaustive inclusion of functional similarity measures proven relevant in biomedical and bioinformatics applications, and the difficulty in understanding the different approaches, measures and features implemented by each tool, which makes it hard for end users to choose the tool and the measure most suitable to their needs. As a result, there is redundant effort in developing features that already exist and in implementing ideas already proven to be obsolete. Furthermore, considering advances made in high-throughput biology technologies and bioinformatics scanning approaches, which have led to an exponential growth of biological data at genome- and proteome-wide levels, there is a need for an easy-to-use and adapted tool that is able to make GO semantic similarity measures useful and meaningful to protein analyses at the functional level.


    Table 3: Mapping between different GO semantic similarity measures and notations used in the tools identified. All the known IC-based GO functional similarity measures (FSM) with their corresponding notations in the existing tools. ‘x’ indicates that the functional similarity measure is not supported by the tool, ‘O’ indicates that the measure may possibly be supported but the notation is not provided, and ‘–’ means that the tool uses the same notation as indicated. A tuple in the case of GOSim indicates the measure (method) and the normalization model used.

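    | FSM | A-DaGO-Fun | DaGO-Fun | KU-GOAL | GOssTO | FunSimMat | G-SESAME | ProteInOn | FuSSiMeg | SML | GOSemSim | csbl.go | GOSim |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|
    | SimGIC | gic | – | x | – | simGIC | x | – | x | gic | x | WeightedJaccard | x |
    | SimDIC | dic | – | x | x | x | x | x | x | x | x | x | x |
    | SimUIC | uic | – | x | x | x | x | x | x | x | x | x | x |
    | SimUI | ui | – | x | x | UI | x | – | x | ui | x | x | x |
    | SimDB | db | x | x | x | x | x | x | x | x | x | Czekanowski-Dice | x |
    | SimDB | db | x | x | x | x | x | x | x | x | x | x | x |
    | SimNTO | nto | x | x | x | NTO | x | x | x | nto | x | x | x |
    | SimCOU | cou | x | x | x | x | x | x | x | x | x | Cosine | (dot, sqrt) |
    | SimCOT | cot | x | x | x | x | x | x | x | x | x | x | (dot, Tanimoto) |
    | BMA | bma | – | x | x | BM | x | – | x | bma | x | x | funSimAvg |
    | BMM | bmm | x | x | x | x | x | x | x | bmm | rcmax | O | funSimMax |
    | ABM | abm | – | AveMax | x | x | O | x | x | x | bma | x | x |
    | Avg | avg | – | AveMatch | x | avg | x | x | x | avg | avg | x | mean |
    | Max | max | – | x | O | max | x | x | O | max | max | x | max |
    | HDF | hdf | x | HdfDist | x | x | x | x | x | x | x | x | hausdorff |
    | VHDF | vhdf | x | AveNMS | x | x | x | x | x | x | x | x | x |
    | Total | 16 | 8 | 4 | 2 | 6 | 1 | 3 | 1 | 7 | 4 | 4 | 7 |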
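
    As a reading aid, part of Table 3 can be held in a small lookup structure. The sketch below transcribes only three columns (A-DaGO-Fun, SML and GOSim) for a handful of measures. The names `NOTATION`, `tool_notation` and `supported` are hypothetical helpers introduced for illustration and are not part of any of the listed tools.

```python
# Partial transcription of Table 3: measure -> {tool: notation}.
# None stands for 'x' in the table (measure not supported by the tool).
NOTATION = {
    "SimGIC": {"A-DaGO-Fun": "gic", "SML": "gic", "GOSim": None},
    "SimUI": {"A-DaGO-Fun": "ui", "SML": "ui", "GOSim": None},
    "BMA": {"A-DaGO-Fun": "bma", "SML": "bma", "GOSim": "funSimAvg"},
    "BMM": {"A-DaGO-Fun": "bmm", "SML": "bmm", "GOSim": "funSimMax"},
    "Avg": {"A-DaGO-Fun": "avg", "SML": "avg", "GOSim": "mean"},
    "Max": {"A-DaGO-Fun": "max", "SML": "max", "GOSim": "max"},
    "HDF": {"A-DaGO-Fun": "hdf", "SML": None, "GOSim": "hausdorff"},
}


def tool_notation(measure, tool):
    """Return the tool-specific name of a measure, or None if unsupported."""
    return NOTATION[measure][tool]


def supported(tool):
    """List the transcribed measures that the given tool supports."""
    return [m for m, names in NOTATION.items() if names[tool] is not None]
```

    With such a lookup, a score labelled `funSimAvg` in GOSim output can be matched to the `bma` score of another tool before the two are compared.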

    Issues related to the choice of functional similarity measures

    GO semantic similarity approaches have been successfully deployed in many biomedical and biological applications [1, 2, 3]. These include gene clustering and gene expression data analysis [29], prediction and validation of molecular interactions [4, 39], disease gene prioritization [43] and function inference of uncharacterized proteins [6, 44]. It is known that each approach performs differently for different applications. For example, the maximum approach achieves good performance for prediction of protein-protein interactions compared to other approaches [4]. The best-match average approaches perform better in protein function prediction and validation [6] and in protein or gene clustering, while the average approach is good for detecting similar protein sequences from their GO annotations [20]. Recently, an extensive performance analysis of the most widely used functional similarity measures was conducted on different biological data types to elucidate which measure best fits the user’s needs; a summary of the best-performing measures for different approaches and different biological data or applications is provided in [3].
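
    The difference between the average, maximum and best-match average approaches mentioned above can be made concrete with a small sketch. It assumes the usual definitions of these approaches over a matrix of pairwise term similarity scores; the function names and the score values are made up for illustration, and this is not the A-DaGO-Fun implementation.

```python
def avg_sim(matrix):
    """Avg: mean of all pairwise term similarity scores."""
    values = [s for row in matrix for s in row]
    return sum(values) / len(values)


def max_sim(matrix):
    """Max: highest pairwise term similarity score."""
    return max(s for row in matrix for s in row)


def bma_sim(matrix):
    """BMA: mean of the two directional best-match averages."""
    row_best = [max(row) for row in matrix]        # best match for each term of p
    col_best = [max(col) for col in zip(*matrix)]  # best match for each term of q
    return 0.5 * (sum(row_best) / len(row_best) + sum(col_best) / len(col_best))


# Rows: terms annotating protein p; columns: terms annotating protein q.
# The scores are made-up values in [0, 1], for illustration only.
scores = [
    [1.0, 0.2, 0.4],
    [0.5, 0.8, 0.0],
]

results = {"avg": avg_sim(scores), "max": max_sim(scores), "bma": bma_sim(scores)}
```

    On the same matrix the three approaches give different scores: Max is driven by the single best pair, Avg is pulled down by unrelated term pairs, and BMA lies in between. This is one way to see why their performance differs across applications.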

    Most tools still use ideas that are outdated in the context of GO, subjecting end users to an unnecessary plethora of choices. For example, the path-based semantic similarity approaches were shown to be less effective [2, 6, 14], but several tools, such as KU-GOAL and SML, still implement them. In the case of SML, these path-based approaches may only be relevant for other ontologies and not for GO, since the SML tool is not limited to any specific application context [26]. In this case, it is important to make this clear in the tool documentation to help GO users interested in using the tool. Also, in several tools suggested so far, the semantic similarity scores produced are not always normalized (ranging between 0 and 1), which makes comparison between approaches difficult and the mapping from distance to similarity scores inconsistent. For example, for functional similarity scores derived from the Hausdorff-like distance, GOSim and KU-GOAL map HDF and VHDF using the exponential weight and direct substitution models, respectively, given by:

    $$
    S(p, q) = \exp\left(-\mathrm{HDF}(p, q)\right)
    \quad \text{and} \quad
    S(p, q) = \frac{1}{2}\left[\sqrt{\frac{1}{n}\sum_{t \in T^X_p} S^2\left(t, T^X_q\right)} + \sqrt{\frac{1}{m}\sum_{t \in T^X_q} S^2\left(t, T^X_p\right)}\right]
    \qquad (36)
    $$

    It is not clear why GOSim uses the exponential weight model (perhaps to ensure convergence) or why KU-GOAL uses direct substitution (possibly because BMM, BMA and ABM directly match some variants of HDF-based similarity measures, but this is not the case for the specific variant it uses). When dealing with normalized scores, the relation between the distance D(p, q) and similarity S(p, q) scores is simply given by [14]:

    $$
    S(p, q) = 1 - D(p, q) \qquad (37)
    $$

    as used in the A-DaGO-Fun package.
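
    The practical difference between the exponential weight model and the complement relation of Eq. (37) can be checked numerically. The sketch below assumes a distance already normalized to [0, 1]; the function names are hypothetical and the code does not reproduce GOSim or A-DaGO-Fun internals.

```python
import math


def sim_exponential(distance):
    """Exponential weight model: S = exp(-D)."""
    return math.exp(-distance)


def sim_complement(distance):
    """Complement relation of Eq. (37): S = 1 - D, for D in [0, 1]."""
    if not 0.0 <= distance <= 1.0:
        raise ValueError("distance must be normalized to [0, 1]")
    return 1.0 - distance


# Compare both mappings at the two extremes and at the midpoint.
for d in (0.0, 0.5, 1.0):
    print("D=%.1f  exp(-D)=%.4f  1-D=%.4f" % (d, sim_exponential(d), sim_complement(d)))
```

    Under this assumption the exponential model never returns a score below exp(-1), roughly 0.37, even for the most distant pair, whereas the complement relation covers the whole [0, 1] range. Scores from the two mappings are therefore not directly comparable.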

    Finally, despite the wide range of IC-based GO semantic similarity applications and the existence of several approaches to meet the requirements of these applications, there was no tool available that integrates all the known IC-based functional similarity measures implemented in the different existing tools. Thus, researchers had to implement these approaches themselves, use different tools for different approaches, or download the individual software packages, making extraction and comparison of these scores difficult and time-consuming. A-DaGO-Fun solves this by allowing researchers to retrieve the integrated set of all IC-based GO semantic similarity approaches. The similarity scores produced are scaled (normalized) to enable comparison between different approaches, and A-DaGO-Fun allows up to four different semantic similarity measures to be run simultaneously, with a summary or merging of results.

    Issues related to organism-based information content value

    One of the main issues with the existing GO semantic similarity tools is that all of them, except the R package GOSim [25] and the on-line web tool DaGO-Fun [1], are organism-based tools. In these organism-based tools, a given term in the GO DAG may have different information content values depending on the corpus used, whereas a term in the GO DAG is expected to have a unique information content value that does not depend on the corpus under consideration [2]. It was suggested that this issue of the uniqueness of the information content value for a given term can be solved by using the mapping between proteins and the GO annotations provided by the GO annotation (GOA) project [40, 41, 42]. On the other hand, the fact that IC depends on the annotation statistics related to terms may still produce biased IC values, since a term can be rarely used without being very specific considering its position in the GO DAG [2]. Even though the use of the IC makes sense from a probabilistic point of view [6], artifacts arising from shallow annotation will persist when comparing pairs of proteins annotated with few terms [2, 32]. It is important to know that the performance of the annotation-based family will depend on the corpus under consideration because of its dependence on the frequencies of GO term occurrences in the corpus.
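
    The corpus dependence described above can be illustrated with a toy example. The sketch assumes one common formulation of annotation-based IC, IC(t) = -log p(t), where p(t) is the fraction of proteins annotated with t or any of its descendants. The DAG, the term names and both corpora are made up, and the code is not the A-DaGO-Fun implementation.

```python
import math

# Toy DAG: child -> parents. Term names are placeholders, not real GO IDs.
PARENTS = {"root": [], "A": ["root"], "B": ["root"], "C": ["A", "B"]}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def annotation_ic(corpus):
    """IC(t) = -log p(t); annotations propagate from a term to its ancestors."""
    counts = dict.fromkeys(PARENTS, 0)
    for terms in corpus.values():
        propagated = set()
        for term in terms:
            propagated |= ancestors(term)
        for term in propagated:
            counts[term] += 1
    total = float(len(corpus))
    # Terms that never occur in the corpus have no defined IC and are skipped.
    return {t: -math.log(c / total) for t, c in counts.items() if c > 0}


# Two made-up corpora (protein -> directly annotated terms).
corpus_1 = {"p1": ["A"], "p2": ["A"], "p3": ["B"], "p4": ["C"]}
corpus_2 = {"q1": ["B"], "q2": ["B"], "q3": ["B"], "q4": ["A"]}

ic_1 = annotation_ic(corpus_1)
ic_2 = annotation_ic(corpus_2)
```

    In this toy setting the same term "A" receives a low IC in the first corpus, where it is frequently used, and a high IC in the second, where it is rare. The term "C" has no IC at all in the second corpus because it is never used there.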

    Annotations may be unbalanced in their distribution across the GO structure, which compromises these annotation-based approaches and may negatively affect their performance, specifically for organisms with sparse GO annotations [14]. The use of the whole set of annotations provided by the GOA-UniProtKB project may solve this problem, but only at the cost of an increase in the running time and the complexity of these annotation-based approaches. This is expected to worsen as the number of protein annotations increases daily, since processing the annotation file takes a long time before the IC values can be computed [3]. This is why, in using GO data and UniProt proteins with their GO annotations as provided by the GOA-UniProtKB project, A-DaGO-Fun precomputes GO term IC values to enable rapid response to user queries. To efficiently overcome this issue related to annotation-based IC values, A-DaGO-Fun also implements topology-based approaches, which produce a fixed and well-defined information content value for a given GO term independent of the corpus under consideration. The package includes the GO-universal metric [14], the Zhang et al. model [15], which is an adaptation of the Seco et al. approach [16] in the context of GO [3], and the Wang et al. model [17].
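
    To show what a corpus-independent IC looks like, the sketch below uses the intrinsic IC of Seco et al. [16], which depends only on the number of descendants of a term. This is a stand-in chosen for brevity: the GO-universal, Zhang et al. and Wang et al. models included in the package are not reproduced here, and the DAG and term names are made up.

```python
import math

# Toy DAG: parent -> children. Term names are placeholders, not real GO IDs.
CHILDREN = {"root": ["A", "B"], "A": ["C", "D"], "B": ["D"], "C": [], "D": []}


def descendants(term):
    """Return all descendants of a term (the term itself excluded)."""
    found = set()
    stack = list(CHILDREN[term])
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(CHILDREN[node])
    return found


def intrinsic_ic(term):
    """Seco et al. style IC: 1 - log(|desc(t)| + 1) / log(N),
    where N is the total number of terms in the ontology."""
    n_terms = len(CHILDREN)
    return 1.0 - math.log(len(descendants(term)) + 1) / math.log(n_terms)


# No annotation corpus is involved: the values depend on the DAG alone.
ic = {term: intrinsic_ic(term) for term in CHILDREN}
```

    Because no annotation file is read, the values are fixed for a given version of the ontology and can be computed once and stored, in the spirit of the precomputation described above.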

    Finally, regardless of species coverage, these tools, including GOSim and DaGO-Fun, may not support less popular species and are very limited for newly annotated organisms. While recent tools have collectively made substantial progress compared to earlier ones, especially in terms of input size, they are not robust enough for exploring GO semantic similarity measures in the context of high-throughput functional analysis. They need to be updated in the context of GO in order to be practical and to reflect the progress of the field.

    Issue related to the use of standard protein or gene identifier systems

    Each of the previously suggested tools uses a specific gene identifier (ID) system in integrating and constructing its database. In general, on-line web tools and Java packages use UniProt IDs, and the DaGO-Fun tool also uses gene names in addition to UniProt IDs. The R packages use NCBI Entrez Gene IDs, except csbl.go, which uses gene names. This indicates that each tool uses its own gene ID system in the back-end when integrating and constructing its database, and only existing GO-annotated organisms are integrated for semantic similarity score calculations. Understanding the gene ID to annotation content and gene ID to gene ID mappings is the initial and important step for retrieving semantic similarity scores from these tools. However, this can be a very serious issue when the user’s gene IDs cannot be efficiently mapped to the gene ID system used by the tool under consideration, or when the user’s gene identifiers are redundant or originate from different sources. The user can only rely on existing cross-reference mapping systems, such as UniProt (http://www.uniprot.org/uploadlists/), to align their dataset with the tool’s ID requirements or to produce a dataset with conditions similar to those of the dataset in the tool being used when exploring semantic similarity measures. Even though these platforms effectively address the gene ID cross-mapping issue, it is important to know that some referencing problems across platforms remain. For example, UniProt does not reference EBI-ID, each platform uses one system as its major gene identifier, and some gene ID-annotation mappings may not favor the user’s ID system. Thus, some important annotations may be left out of semantic similarity score computations without the user’s awareness, resulting in an incomplete or even failed semantic similarity score retrieval for further analyses. It is worth mentioning that these existing tools do not provide help on how to deal with the gene ID to annotation and gene ID to gene ID mapping issues.
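
    One way to avoid losing annotations silently is to make the outcome of the ID mapping explicit before any score is computed. The sketch below is a minimal illustration under stated assumptions: `map_identifiers` is a hypothetical helper, and the identifiers and the cross-reference table are made up. In practice the table would come from a cross-reference service.

```python
def map_identifiers(user_ids, cross_reference):
    """Map user gene IDs to tool IDs, tracking what cannot be mapped."""
    mapped, unmapped, ambiguous = {}, [], {}
    for gene_id in dict.fromkeys(user_ids):  # drop duplicates, keep order
        targets = cross_reference.get(gene_id, [])
        if not targets:
            unmapped.append(gene_id)         # no counterpart in the tool's ID system
        elif len(targets) > 1:
            ambiguous[gene_id] = list(targets)
        else:
            mapped[gene_id] = targets[0]
    return mapped, unmapped, ambiguous


# Made-up identifiers and cross-references, for illustration only.
cross_reference = {
    "geneA": ["ACC001"],
    "geneB": ["ACC002", "ACC003"],  # one user ID, two candidate tool IDs
}
user_ids = ["geneA", "geneB", "geneC", "geneA"]

mapped, unmapped, ambiguous = map_identifiers(user_ids, cross_reference)
```

    Reporting the unmapped and ambiguous identifiers alongside the mapped ones lets the user see which genes would be dropped from the semantic similarity computation, instead of discovering an incomplete result later.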

    Final Note on the A-DaGO-Fun Python Package

    None of the existing tools efficiently addresses all the issues raised above, which makes them unable to perform high-throughput gene functional similarity score computations. Thus, we have developed the A-DaGO-Fun Python package, a portable application, i.e., a software package which can be run without the need for installation. A-DaGO-Fun is practical, easy to use and able to meet the input requirements of current genome- and proteome-wide applications, producing GO term information content (IC), GO term semantic similarity and protein functional similarity scores, which may assist experimental and computational biologists in several applications involving protein analyses at the functional level. To make semantic similarity measures useful and practical, A-DaGO-Fun also implements several biological applications related to GO semantic similarity scores, including the retrieval (identification) of genes based on their GO annotations, the clustering of functionally related genes within a set, and term enrichment analysis.

    A-DaGO-Fun aims to provide easy retrieval of IC-based GO term semantic similarity and protein functional similarity scores, independent of organism and gene ID system, to the large community of GO users and to a broad computational audience. It helps tool designers, developers and experienced end users, as well as non-programmers, to retrieve semantic similarity scores using easy and straightforward commands, so that GO semantic similarity data and related biological applications are conveniently accessible to researchers and can be used effectively in their protein analyses based on GO annotations. Finally, we will be expanding the DaGO-Fun tool to include other applications of GO semantic similarity in protein analyses, such as protein function prediction, annotation system comparison and integration [45], and categorization of GO terms into groups of interest (improving gene set enrichment analysis and building GO Slim) [46].

    This package will be updated quarterly (every three months) using an automated scheme in order to remain up to date and to meet the requirements of ever-increasing applications in the biomedical field. The A-DaGO-Fun Python package is freely available, meaning that one is free to copy, distribute, display and make unrestricted non-commercial use of it under the GNU General Public License, provided that this is done with appropriate citation of the tool and its components.


    References

    [1] Mazandu GK, Mulder NJ (2013) DaGO-Fun: Tool for Gene Ontology-based functional analysis using term information content measures. BMC Bioinformatics 14: 284.

    [2] Mazandu GK, Mulder NJ (2013) Information content-based Gene Ontology semantic similarity approaches: Toward a unified framework theory. BioMed Research International 2013: Article ID 292063, 11 pages.

    [3] Mazandu GK, Mulder NJ (2014) Information Content-Based Gene Ontology Functional Similarity Measures: Which One to Use for a Given Biological Data Type? PLoS ONE 9(12): e113859.

    [4] Jain S, Bader GD (2010) An improved method for scoring protein-protein interactions using semantic similarity within the gene ontology. BMC Bioinformatics 11: 562.

    [5] Pesquita C, Faria D, Bastos H, Ferreira AEN, Falcão AO, et al. (2008) Metrics for GO based protein semantic similarity: a systematic evaluation. BMC Bioinformatics 9(Suppl 5): S4.

    [6] Pesquita C, Faria D, Falcão AO, Lord P, Couto FM (2009) Semantic Similarity in Biomedical Ontologies. PLoS Comput Biol 5(7): e1000443.

    [7] Jiang JJ, Conrath DW (1997) Semantic similarity based on corpus statistics and lexical taxonomy. In: Proceedings of the 10th International Conference on Research in Computational Linguistics. pp. 19–33.

    [8] Resnik P (1999) Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research 11: 95–130.

    [9] Lin D (1998) An information-theoretic definition of similarity. In: Proceedings of the Fifteenth International Conference on Machine Learning. pp. 296–304.

    [10] Couto F, Silva M, Coutinho P (2007) Measuring semantic similarity between gene ontology terms. Data Knowledge Eng 61(1): 137–152.

    [11] Couto F, Silva M, Coutinho P (2005) Semantic similarity over the gene ontology: Family correlation and selecting disjunctive ancestors. In: CIKM ’05 Proceedings of the 14th ACM international conference on Information and knowledge management. pp. 343–344.

    [12] Schlicker A, Domingues FS, Rahnenfuhrer J, Lengauer T (2006) A new measure for functional similarity of gene products based on gene ontology. BMC Bioinformatics 7: 302.

    [13] Li B, Wang JZ, Feltus FA, Zhou J, Luo F (2010) Effectively integrating information content and structural relationship to improve the GO-based similarity measure between proteins. ArXiv e-prints: 1001.0958.

    [14] Mazandu GK, Mulder NJ (2012) A topology-based metric for measuring term similarity in the Gene Ontology. Adv Bioinformatics 2012: Article ID 975783, 17 pages.

    [15] Zhang P, Jinghui Z, Huitao S, Russo J, Osborne B, Buetow K (2006) Gene functional similarity search tool (GFSST). BMC Bioinformatics 7: 135.

    [16] Seco N, Veale T, Hayes J (2004) An intrinsic information content metric for semantic similarity in WordNet. In: ECAI-04. pp. 1089–1090.

    [17] Wang JZ, Du Z, Payattakool R, Yu PS, Chen CF (2007) A new method to measure the semantic similarity of GO terms. Bioinformatics 23(10): 1274–1281.

    [18] Tversky A (1977) Features of similarity. Psychological Review 84(4): 327–352.

    [19] Sevilla JL, Segura V, Podhorski A, Guruceaga E, Mato JM, et al. (2005) Correlation between gene expression and GO semantic similarity. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 2(4): 330–338.

    [20] Lord PW, Stevens RD, Brass A, Goble CA (2003) Investigating semantic similarity measures across the gene ontology: the relationship between sequence and annotation. Bioinformatics 19(10): 1275–1283.

    [21] Yu G, Li F, Qin Y, Bo X, Wu Y, Wang S (2010) GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics 26(7): 976–978.

    [22] Gentleman R (2005) Visualizing and Distances Using GO. http://bioconductor.org/packages/2.6/bioc/vignettes/GOstats/inst/doc/GOvis.pdf

    [23] Du Z, Li L, Chen CF, Yu PS, Wang JW (2009) G-SESAME: web tools for GO-term-based gene similarity analysis and knowledge discovery. Nucleic Acids Research 37(2): D345–D349.


    [24] Caniza H, Romero AE, Heron S, Yang H, Devoto A, Frasca M, Valentini G, Paccanaro A (2014) GOssTo: a user-friendly stand-alone and web tool for calculating semantic similarities on the Gene Ontology. Bioinformatics. Advance Access published.

    [25] Fröhlich H, Speer N, Poustka A, Beißbarth T (2007) GOSim–an R-package for computation of information theoretic GO similarities between terms and gene products. BMC Bioinformatics 8: 166.

    [26] Harispe S, Ranwez S, Janaqi S, Montmain J (2013) The Semantic Measures library and Toolkit: fast computation of semantic similarity and relatedness using biomedical ontologies. Bioinformatics. Advance Access published.

    [27] Jeong JC, Chen XW (2014) A new semantic functional similarity over gene ontology. IEEE/ACM Transactions on Computational Biology and Bioinformatics 12(2): 322–334.

    [28] Faria D, Pesquita C, Couto FM, Falcão AO (2007) ProteInOn: A Web Tool for Protein Semantic Similarity. URL [http://xldb.fc.ul.pt/xldb/publications/Faria.etal:ProteInOnA%Web:2007_document.pdf].

    [29] Ovaska K, Laakso M, Hautaniemi S (2008) Fast gene ontology based clustering for microarray experiments. BioData Mining 1: 11.

    [30] Schlicker A, Albrecht M (2008) FunSimMat: a comprehensive functional similarity database. Nucleic Acids Research 36: D434–D439.

    [31] Schlicker A, Albrecht M (2010) FunSimMat update: new features for exploring functional similarity. Nucleic Acids Research 38(suppl 1): D244–D248.

    [32] Mistry M, Pavlidis P (2008) Gene Ontology term overlap as a measure of gene functional similarity. BMC Bioinformatics 9: 327.

    [33] del Pozo A, Pazos F, Valencia A (2008) Defining functional distances over gene ontology. BMC Bioinformatics 9: 50.

    [34] Lerman G, Shakhnovich BE (2007) Defining functional distance using manifold embeddings of gene ontology annotations. Proc Natl Acad Sci 104(27): 11334–11339.

    [35] Dubuisson MP, Jain AK (1994) A modified Hausdorff distance for object matching. In: ICPR94. pp. A:566–568.

    [36] Memoli F, Sapiro G (2005) Theoretical and computational framework for isometry invariant recognition of point cloud data. J Foundations Comp Math 5: 313–347.

    [37] Bronstein AM, Bronstein MM, Mahmoudi M, Kimmel R, Sapiro G (2010) A Gromov-Hausdorff framework with diffusion geometry for topologically-robust non-rigid shape matching. Int J Computer Vision 89: 266–286.

    [38] Guzzi PH, Mina M, Guerra C, Cannataro M (2012) Semantic similarity analysis of protein data: assessment with biological features and issues. Brief Bioinform 13(5): 569–585.

    [39] Guo X, Liu R, Shriver CD, Hu H, Liebman MN (2006) Assessing semantic similarity measures for the characterization of human regulatory pathways. Bioinformatics 22(8): 967–973.

    [40] Barrell D, Dimmer E, Huntley RP, Binns D, O’Donovan C, Apweiler R (2009) The GOA database in 2009–an integrated Gene Ontology Annotation resource. Nucleic Acids Research 37(1): D396–D403.

    [41] Dimmer EC, Huntley RP, Alam-Faruque Y, Sawford T, O’Donovan C, Martin MJ (2012) The UniProt-GO Annotation database in 2011. Nucleic Acids Research 40: D565–D570.

    [42] Huntley RP, Sawford T, Mutowo-Muellenet P, Shypitsyna A, Bonilla C, Martin MJ, O’Donovan C (2014) The GOA database: Gene Ontology annotation updates for 2015. Nucleic Acids Research 43: D1057–D1063.

    [43] Schlicker A, Lengauer T, Albrecht M (2010) Improving disease gene prioritization using the semantic similarity of gene ontology terms. Bioinformatics 26(18): i561–i567.

    [44] Mazandu GK, Mulder NJ (2012) Using the underlying biological organization of the Mycobacterium tuberculosis functional network for protein function prediction. Infection, Genetics and Evolution 12(5): 922–932.

    [45] Mazandu GK, Mulder NJ (2014) The use of semantic similarity measures for integrating heterogeneous Gene Ontology annotation pipelines. Frontiers in Genetics 5: 264.

    [46] Na D, Son H, Gsponer J (2014) Categorizer: a tool that categorizes genes into user-defined biological groups based on semantic similarity. BMC Genomics 15: 1091.
