OntoQA: Metric-Based Ontology Quality Analysis
Samir Tartir, I. Budak Arpinar, Michael Moore, Amit P. Sheth,
Boanerges Aleman-Meza
IEEE Workshop on Knowledge Acquisition from Distributed, Autonomous, Semantically Heterogeneous Data and Knowledge Sources
Houston, Texas, November 27, 2005
The Semantic Web
• Current web is intended for human use
• Semantic web is for humans and computers
• Semantic web uses ontologies as a knowledge-sharing vehicle.
• Many ontologies currently exist: GO, OBO, SWETO, TAP, GlycO, PropreO, etc.
Motivation
• Having several ontologies to choose from, users often face the problem of selecting the best ontology that is suitable for their needs.
OntoQA
• Metric-Based Ontology Quality Analysis
• Describes ontology schemas and instancebases (IBs) through different sets of metrics
• OntoQA is implemented as part of the SemDis project.
[Figure: open/proprietary heterogeneous data sources (documents, databases, HTML, XML feeds, emails), an ontology schema, and the populated ontology]
Contributions
• Defining the quality of ontologies in terms of:
  – Schema
  – Instances
    · IB metrics
    · Class-extent metrics
• Providing metrics to quantitatively describe each group
I. Schema Metrics
• Schema metrics address the design of the ontology schema.
• Schema quality could be hard to measure: domain expert consensus, subjectivity etc.
• Three metrics:
  – Relationship richness
  – Attribute richness
  – Inheritance richness
I.1 Relationship Richness
• How close or far is the schema structure to a taxonomy?
• Diversity of relations is a good indication of schema richness.
$RR = \frac{|P|}{|P| + |IsA|}$
|P|: Number of non-IsA relationships
|IsA|: Number of IsA relationships
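As a minimal sketch of this metric, assuming it is the share of non-IsA relationships among all schema relationships, RR = |P| / (|P| + |IsA|). The function name and the example counts are hypothetical and are not taken from the OntoQA implementation.

```python
def relationship_richness(num_non_isa: int, num_isa: int) -> float:
    """RR = |P| / (|P| + |IsA|).

    Returns 0.0 for a schema with no relationships at all
    (an assumption; the slides do not cover this case).
    """
    total = num_non_isa + num_isa
    return num_non_isa / total if total else 0.0


# Hypothetical schema: 30 non-IsA relationships and 10 IsA relationships.
print(relationship_richness(30, 10))  # 0.75
```

A value near 0 indicates a schema that is close to a pure taxonomy; a value near 1 indicates that most relationships are not IsA.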
I.2 Attribute Richness
• How much information do classes contain?
$AR = \frac{|A|}{|C|}$
|A|: Number of literal attributes
|C|: Number of classes
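A short sketch of the computation, assuming attribute richness is the average number of literal attributes per class, AR = |A| / |C|. The function name and the counts are illustrative only.

```python
def attribute_richness(num_attributes: int, num_classes: int) -> float:
    """AR = |A| / |C|, the average number of literal attributes per class."""
    if num_classes <= 0:
        raise ValueError("the schema must define at least one class")
    return num_attributes / num_classes


# Hypothetical schema: 12 literal attributes spread over 4 classes.
print(attribute_richness(12, 4))  # 3.0
```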
I.3 Inheritance Richness (Fan-out)
• General (e.g. spanning various domains) vs. specific
$IR_S = \frac{\sum_{C_i \in C} |H^C(C_j, C_i)|}{|C|}$
|H^C(C_j, C_i)|: Number of subclasses C_j of class C_i
|C|: Number of classes
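The sketch below illustrates one way to compute the schema-level inheritance richness, assuming it is the average number of subclasses per class and that "subclasses" means direct subclasses. The dictionary representation of the hierarchy and the class names are hypothetical.

```python
def inheritance_richness(subclasses: dict[str, list[str]]) -> float:
    """Average number of direct subclasses per class.

    `subclasses` maps a class name to the list of its direct subclasses.
    Classes that appear only as children are counted as classes too.
    """
    classes = set(subclasses)
    for children in subclasses.values():
        classes.update(children)
    if not classes:
        return 0.0
    total_subclass_links = sum(len(children) for children in subclasses.values())
    return total_subclass_links / len(classes)


# Hypothetical hierarchy with 4 classes and 3 subclass links.
hierarchy = {"Thing": ["Person", "Place"], "Person": ["Student"]}
print(inheritance_richness(hierarchy))  # 0.75
```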
II. Instance Metrics
• Deal with the size and distribution of the instance data.
• Instance metrics are grouped into two subcategories:
1. IB metrics: describe the IB as a whole
2. Class metrics: describe the way each class that is defined in the schema is being utilized in the IB
II.1.a Class Richness
• To what extent does the IB utilize the classes defined in the schema?
• How many classes (in the schema) are actually populated?
$CR = \frac{|C'|}{|C|}$
|C’|: Number of used classes
|C|: Number of defined classes
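A minimal sketch, assuming class richness is the fraction of defined classes that have at least one instance, CR = |C'| / |C|. The instance-to-class mapping and all names are hypothetical.

```python
def class_richness(defined_classes: set[str], instance_types: dict[str, str]) -> float:
    """CR = |C'| / |C|: fraction of schema classes that have instances.

    `instance_types` maps an instance identifier to the class it belongs to.
    Instances typed with a class outside the schema are ignored.
    """
    if not defined_classes:
        raise ValueError("the schema must define at least one class")
    used = {cls for cls in instance_types.values() if cls in defined_classes}
    return len(used) / len(defined_classes)


schema = {"Person", "City", "Bank", "Event"}
instances = {"alice": "Person", "bob": "Person", "athens": "City"}
print(class_richness(schema, instances))  # 0.5
```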
II.1.b Average Population
• How well is the IB “filled”?
$P = \frac{|I|}{|C|}$
|I|: Number of instances
|C|: Number of defined classes
II.1.c Cohesion
• Is the IB graph connected or disconnected?
$Coh = |CC|$
|CC|: Number of connected components
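One way to count connected components is a union-find pass over the instance graph, sketched below. It assumes relationships are treated as undirected edges between instances; the slides do not say how OntoQA computes this, and the function and instance names are hypothetical.

```python
def cohesion(instances: list[str], relationships: list[tuple[str, str]]) -> int:
    """Number of connected components of the instance graph (Coh = |CC|)."""
    parent = {inst: inst for inst in instances}

    def find(x: str) -> str:
        # Follow parent links to the component root, halving paths as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in relationships:
        # Endpoints not listed in `instances` are added on the fly.
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        parent[find(a)] = find(b)

    return len({find(inst) for inst in parent})


# Two components: {a, b, c} and {d, e}.
print(cohesion(["a", "b", "c", "d", "e"], [("a", "b"), ("b", "c"), ("d", "e")]))  # 2
```

A result of 1 means the IB graph is connected; larger values mean the IB splits into isolated islands of instances.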
II.2.a Importance
• How much focus was paid to each class during instance population?
$Imp_{C_i} = \frac{|C_i(I)|}{|I|}$
|Ci(I)|: Number of instances defined for class Ci
|I|: Number of instances
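A sketch of the importance computation for every class at once, assuming importance is the share of all instances that belong to the class, |Ci(I)| / |I|. The mapping format and the instance names are hypothetical.

```python
from collections import Counter


def class_importance(instance_types: dict[str, str]) -> dict[str, float]:
    """Share of all instances that belong to each class."""
    total = len(instance_types)
    counts = Counter(instance_types.values())
    # With no instances, `counts` is empty and no division takes place.
    return {cls: n / total for cls, n in counts.items()}


instances = {
    "paper1": "Publication",
    "paper2": "Publication",
    "alice": "Author",
    "www2005": "Conference",
}
print(class_importance(instances))
# {'Publication': 0.5, 'Author': 0.25, 'Conference': 0.25}
```

Multiplying the values by 100 gives percentages.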
II.2.b Connectivity
• What classes are central and what are on the boundary?
$Conn_{C_i} = |\{ I_j : P(I_i, I_j) \wedge I_i \in C_i(I) \wedge I_j \in C_j(I) \wedge C_j \in C \}|$
P(Ii,Ij): Relationships between instances Ii and Ij.
Ci(I): Instances of class Ci.
C: Defined classes.
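The sketch below reads connectivity as the number of distinct instances I_j that instances of class Ci are related to. This is one interpretation under stated assumptions: relationships are given as directed (source, target) pairs, and only targets with a known class are counted. All names are hypothetical.

```python
def class_connectivity(
    cls: str,
    instance_types: dict[str, str],
    relationships: list[tuple[str, str]],
) -> int:
    """Count distinct instances reached by a relationship from an instance of `cls`."""
    targets = {
        dst
        for src, dst in relationships
        if instance_types.get(src) == cls and dst in instance_types
    }
    return len(targets)


types = {"p1": "Publication", "p2": "Publication", "a1": "Author", "a2": "Author", "c1": "Conference"}
rels = [("p1", "a1"), ("p1", "a2"), ("p2", "a1"), ("p2", "c1"), ("a1", "c1")]
print(class_connectivity("Publication", types, rels))  # 3
print(class_connectivity("Conference", types, rels))   # 0
```

Under this reading, classes with high values are central in the IB, and classes with values near zero sit on its boundary.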
II.2.c Fullness
• Is the number of instances close to the expected?
$F = \frac{|C_i(I)|}{|C_i'(I)|}$
|Ci(I)|: Number of instances of class Ci.
|Ci’(I)|: Number of expected instances of class Ci.
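A brief sketch, assuming fullness is the ratio of actual to expected instances of a class and that the expected count is supplied by the user. The function name and the numbers are illustrative.

```python
def fullness(actual_instances: int, expected_instances: int) -> float:
    """Ratio of actual to expected instances for one class."""
    if expected_instances <= 0:
        raise ValueError("the expected number of instances must be positive")
    return actual_instances / expected_instances


# Hypothetical class: 150 instances loaded so far, 200 expected.
print(fullness(150, 200))  # 0.75
```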
II.2.d Relationship Richness
• How well does the IB utilize relationships defined in the schema?
$RR_{C_i} = \frac{|\{ \mathrm{Distinct}(P(I_i, I_j)) : I_i \in C_i(I), I_j \in C_j(I), C_j \in C \}|}{|P(C_i, C_j)|}$
P(Ii,Ij): Relationships between instances Ii and Ij.
Ci(I): Instances of class Ci.
Cj(I): Instances of class Cj.
C: Defined classes.
P(Ci,Cj): Relationships between classes Ci and Cj.
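A sketch of the class-level relationship richness, assuming it is the number of distinct relationship types actually used by instances of class Ci divided by the number of relationship types the schema defines for Ci. The triple format, the schema dictionary, and the predicate names are hypothetical.

```python
def class_relationship_richness(
    cls: str,
    instance_types: dict[str, str],
    triples: list[tuple[str, str, str]],
    schema_relationships: dict[str, set[str]],
) -> float:
    """Fraction of the relationships defined for `cls` that its instances use.

    `triples` holds (source instance, relationship name, target instance).
    """
    defined = schema_relationships.get(cls, set())
    if not defined:
        return 0.0  # assumption: no defined relationships means richness 0
    used = {
        pred
        for src, pred, _dst in triples
        if instance_types.get(src) == cls and pred in defined
    }
    return len(used) / len(defined)


types = {"p1": "Publication", "p2": "Publication", "a1": "Author", "c1": "Conference"}
schema = {"Publication": {"authored_by", "published_in", "cites"}}
triples = [("p1", "authored_by", "a1"), ("p2", "authored_by", "a1"), ("p1", "published_in", "c1")]
print(class_relationship_richness("Publication", types, triples, schema))  # 2 of 3 used
```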
II.2.e Inheritance Richness
• Is the class general or specific?
$IR_{C_i} = \frac{\sum_{C_j \in C'} |H^C(C_k, C_j)|}{|C'|}$
C’: Classes belonging to the subtree rooted at Ci
|H^C(C_k, C_j)|: Number of subclasses C_k of class C_j
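The class-level variant can be sketched as the average number of subclasses per class within the subtree rooted at Ci. The code assumes direct subclasses and a dictionary representation of the hierarchy; both are illustrative choices, not details confirmed by the slides.

```python
def class_inheritance_richness(root: str, subclasses: dict[str, list[str]]) -> float:
    """Average number of direct subclasses per class in the subtree rooted at `root`."""
    subtree: set[str] = set()
    stack = [root]
    while stack:
        cls = stack.pop()
        if cls in subtree:
            continue  # guards against classes reachable through several parents
        subtree.add(cls)
        stack.extend(subclasses.get(cls, []))
    total_subclass_links = sum(len(subclasses.get(cls, [])) for cls in subtree)
    return total_subclass_links / len(subtree)


hierarchy = {
    "Thing": ["Person", "Place"],
    "Person": ["Student", "Researcher"],
    "Place": ["City"],
}
# Subtree of Person: {Person, Student, Researcher}, with 2 subclass links.
print(class_inheritance_richness("Person", hierarchy))
```

In this hypothetical hierarchy, a leaf class such as City yields 0.0, while Person yields 2/3.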
Implementation
• Written in Java
• Processes ontology schema and IB files written in OWL, RDF, or RDFS.
• Uses Sesame to process the ontology schema and IB files.
Testing
• SWETO: LSDIS’ general-purpose ontology that covers domains including publications, affiliations, geography and terrorism.
• TAP: Stanford’s general-purpose ontology. It is divided into 43 domains. Some of these domains are publications, sports and geography.
• GlycO: LSDIS’ ontology for the Glycan Expression
• OBO: Open Biomedical Ontologies
Results – Class Metrics
Ontology   # of Classes   # of Instances   Inheritance Richness   Class Richness   Average Population
SWETO      44             1,003,021        0.9                    56.8%            22,795.9
TAP        3,230          71,487           1.2                    9.4%             22.1
GlycO      356            387              1.3                    18.0%            1.1
PropreO    244            0                1.0                    0.0%             0.0
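The Average Population column can be re-derived from the class and instance counts in the table, assuming average population is the number of instances divided by the number of defined classes. The function name is hypothetical; the counts are the ones reported above.

```python
def average_population(num_instances: int, num_classes: int) -> float:
    """Average number of instances per defined class."""
    if num_classes <= 0:
        raise ValueError("the schema must define at least one class")
    return num_instances / num_classes


# (classes, instances) as reported in the results table.
reported = {
    "SWETO": (44, 1_003_021),
    "TAP": (3_230, 71_487),
    "GlycO": (356, 387),
    "PropreO": (244, 0),
}
for name, (classes, instances) in reported.items():
    print(f"{name}: {average_population(instances, classes):.1f}")
# SWETO: 22795.9
# TAP: 22.1
# GlycO: 1.1
# PropreO: 0.0
```

The computed values match the reported column after rounding to one decimal place.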
Results – Class Importance
[Figure: three bar charts of class importance by class]
• SWETO: Publication, Scientific_Publication, Computer_Science_Researcher, Organization, Company, Conference, Place, City, Bank, Airport, Terrorist_Attack, Event, ACM_Subject_Descriptors
• TAP: Musician, Athlete, Author, Actor, Movie, PersonalComputerGame, Book, ProductType, UnitedStatesCity, University, City, Fortune1000Company, Astronaut, ComicStrip
• GlycO: N-glycan, glycan_moiety, N-glycan_residue, carbohydrate_residue_property, N-glycan_alpha-D-Manp, alpha-D-mannopyranosyl_residue, N-glycan_beta-D-GlcpNAc, N-acetyl-beta-D-glucopyranosaminyl_residue, molecular_fragment, sugar_configuration, beta-D-galactopyranosyl_residue, N-glycan_beta-D-Galp, N-glycan_alpha-Neu5Ac, sugar_structural_variant
Results – Class Connectivity
[Figure: three bar charts of class connectivity by class]
• SWETO: Terrorist_Attack, Bank, Airport, ACM_Second_level_Classification, ACM_Third_level_Classification, City, State, ACM_Subject_Descriptors, ACM_Top_level_Classification, Computer_Science_Researcher, Scientific_Publication, Company, Terrorist_Organization
• TAP: CMUFaculty, Person, ResearchProject, MailingList, CMUGraduateStudent, CMUPublication, CMU_RAD, W3CSpecification, W3CPerson, W3CWorkingDraft, ComputerScientist, CMUCourse, BaseballTeam, W3CNote
• GlycO: N-glycan_beta-D-GalpNAc, N-glycan_beta-D-GlcpNAc, N-glycan_alpha-Neu5Ac, N-glycan_alpha-D-Galp, N-glycan_alpha-L-Fucp, N-glycan_alpha-Neu5Gc, N-glycan_beta-D-Xylp, N-glycan_beta-D-Galp, N-glycan_alpha-D-Glcp, N-glycan_alpha-D-Manp, N-glycan_beta-D-Manp, N-acetyl-glucosaminyl_transferase_V, N-glycan_alpha-D-GlcpNAc, N-glycan_D-GlcNAc-ol
BioMedical Ontologies
Ontology                      No. of Terms (Instances)   Average No. of Subterms   Connectivity
Protein-protein Interaction   195                        4.6                       1.1
MGED                          228                        5.1                       0.3
Biological Imaging Methods    260                        5.2                       1.0
Physico-chemical Process      550                        2.7                       1.3
Cereal Plant Trait            692                        3.7                       1.1
BRENDA                        2,222                      3.3                       1.2
Human Disease                 19,137                     5.5                       1.0
Gene Ontology                 20,002                     4.1                       1.4
Conclusions
• More ontologies are introduced as the semantic web is gaining momentum.
• There is no easy way for users to choose the most suitable ontology for their applications.
• OntoQA offers 3 categories of metrics to describe the quality and nature of an ontology.
Future Work
• Calculation of domain-dependent metrics that make use of a standard ontology for a given domain.
• Making OntoQA a web service where users can enter the paths of their ontology files and use OntoQA to measure the quality of the ontology.
Questions