CleanTAX: An Infrastructure for Reasoning about Biological Taxonomies

Dave Thau and Bertram Ludäscher

Dave Thau, thau@learningsite.com
Stanford Research Institute Artificial Intelligence Center Seminar, 8/16/2007

keywords: knowledge management, automatic reasoning, semantic integration, biological classification

Outline

• Brief Overview of Taxonomies
• Impact of Different Taxonomic Views on Data Analysis
• Taxonomies and Relations Between Them
• Using Logic to Determine Inconsistencies and Discover New Relations
• Initial Results of Large Scale Analysis
• Some Optimizations
• Future Work

Beginnings of Biological Taxonomy

Egypt, 1500 BC: Ebers medical papyrus, classification of medicinal plants

China, 350 BC: Erh-ya dictionary (second century BC) – classifies trees, grasses, herbs, grains, vegetables

Greece, 300 BC: Theophrastus, Historia plantarum and Causae plantarum – 500 plants – trees, herbs, fruiting plants, perennials

Taxonomies are Everywhere: Systematics

kingdom: Plantae
phylum: Tracheophyta
class: Magnoliopsida
order: Ranunculales
family: Ranunculaceae
genus: Ranunculus
species: Ranunculus asiaticus

Taxonomies are Everywhere: The Dewey Decimal System

000 Computers and general reference
100 Philosophy and psychology
200 Religion
300 Social sciences
400 Language
500 Science
600 Technology
700 Arts and Recreation
800 Literature
900 History and geography

Taxonomies are Everywhere: Phylogenies

From Thomas D. Als, Roger Vila, Nikolai P. Kandul, David R. Nash, Shen-Horn Yen, Yu-Feng Hsu, André A. Mignault, Jacobus J. Boomsma and Naomi E. Pierce. Nature 432, 386-390.

Taxonomies are Everywhere: Protein Structure

From Ed Green http://compbio.berkeley.edu/people/ed/SeqCompEval/

Taxonomies are Useful, But Slippery

• In all of these cases, taxonomies
  – Help us organize information
  – Allow us to make inferences at many levels of generality

• However, taxonomies are simply "views" of real data
  – Dewey Decimal or Library of Congress?
  – Benson's view of Ranunculus or Kartesz's view?
  – Conflicting phylogenies are common
  – SCOP versus CATH

Different Taxonomies Can Lead To Different Results

[Figure: two maps. One shows the predicted distribution of Anhinga melanogaster based on Clement's 4th Edition, the other the predicted distribution of Anhinga melanogaster based on Clement's 5th Edition. One taxonomy has Anhinga rufa, Anhinga nova. and Anhinga melanogaster each linked by "is a" to Anhinga; the other has only Anhinga melanogaster linked by "is a" to Anhinga. "Contained in" articulations connect the nodes of the two taxonomies.]

Articulations by Santa Barbara Software Products

photo by David Behrens

Different Taxonomies Complicate Data Analysis

What was the average number of Ranunculus arizonicus seen in transect 1 in 2005?

Reasoning With Taxonomic Concepts

• Peet05 articulates relation between Benson’48 and Kartesz’04 names …

• Is that articulation consistent?

• Can we infer additional information?

Problem Statement

• What are taxonomies, anyway?
• How do you know a taxonomy makes sense?
• Given some articulations meant to translate between taxonomies:
  – do they make sense, or are there internal contradictions?
  – have they left out anything which may be inferred logically?

What are Taxonomies?

A simple definition: A directed acyclic graph of nodes and edges, where the edges represent a "subtype" relation

[Figure: Anhinga rufa, Anhinga nova. and Anhinga melanogaster, each linked by "is a" to Anhinga]

Potential additional constraints:
• children are disjoint (child-disjointness, D)
• children partition their parents (coverage, C)
• nodes are non-empty (non-emptiness, N)

We call these "latent taxonomic assumptions" (LTAs)
• More than one LTA may apply
• 8 combinations: none, C, D, N, CD, CN, DN, CDN
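As a minimal sketch of the definition above, a taxonomy can be held as a list of "is a" edges and checked for cycles, and the eight LTA combinations can be enumerated as subsets of {C, D, N}. The names `ISA_EDGES`, `is_acyclic` and `lta_combinations` are hypothetical and chosen for illustration; they are not taken from the CleanTAX code.

```python
from itertools import combinations

# Hypothetical encoding: each "is a" edge is a (child, parent) pair.
ISA_EDGES = [
    ("Anhinga rufa", "Anhinga"),
    ("Anhinga nova.", "Anhinga"),
    ("Anhinga melanogaster", "Anhinga"),
]


def is_acyclic(edges):
    """Return True if the "is a" edges form a directed acyclic graph."""
    children = {}
    for child, parent in edges:
        children.setdefault(parent, []).append(child)
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return True
        if node in visiting:  # reached a node already on the current path
            return False
        visiting.add(node)
        ok = all(visit(c) for c in children.get(node, []))
        visiting.discard(node)
        done.add(node)
        return ok

    nodes = {n for edge in edges for n in edge}
    return all(visit(n) for n in nodes)


def lta_combinations(ltas=("C", "D", "N")):
    """Enumerate every subset of the three LTAs, from none to CDN."""
    return ["".join(combo) or "none"
            for size in range(len(ltas) + 1)
            for combo in combinations(ltas, size)]


print(is_acyclic(ISA_EDGES))
print(lta_combinations())
```

The enumeration yields the same eight combinations listed on the slide, in the same order.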

Inconsistency in a Taxonomy

Inconsistent under the ND (non-emptiness and disjoint children) LTA.

[Figure: B and C are children of A; D is a child of both B and C]

If B and C are children of A, then they must be disjoint. However, they both contain elements of D.

How do Taxonomies Relate?

Articulations relate nodes between taxonomies

Between any two nodes in the taxonomies, one, and only one, of the following five relations must hold:

(i) congruence: M ≡ N
(ii) proper inclusion: M > N
(iii) proper inverse inclusion: M < N
(iv) partial overlap: M o N
(v) exclusion: M x N

Many Possible Articulation Sets

[Figure: Ranunculus aquatilis and its varieties (R.a. var aquatilis, R.a. var diffusus, R.a. var hispidulus, R.a. var capillaceus, R.a. var calvescens) in two taxonomies, FNA-03, 1997 and Benson, 1948, with two "<" articulations drawn between them]

Five relationships, plus "unknown/unstated relation", and 3 x 4 nodes result in 6^12 (over 2 billion) sets of articulations.
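The count on this slide is six choices (five relations plus "unknown/unstated") for each of the 3 x 4 node pairs, that is, 6 to the power 12. A quick check of the arithmetic, with variable names chosen only for illustration:

```python
# 3 nodes in one taxonomy and 4 in the other give 3 * 4 node pairs.
node_pairs = 3 * 4
# Five basic relations plus "unknown/unstated relation".
choices_per_pair = 5 + 1

articulation_sets = choices_per_pair ** node_pairs
print(f"{articulation_sets:,}")  # 2,176,782,336
```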

Articulations: Some Make Sense

[Figure: Taxonomy 1 has A with children B and C (isa); Taxonomy 2 has D with children E and F (isa)]

Articulations:
A < D
B < F
C ≡ E

Articulations: Some Are Impossible

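[Figure: Taxonomy 1 has A with children B and C (isa); Taxonomy 2 has D with children E and F (isa)]

Articulations:
B < F
C > F

Assuming the non-emptiness and disjoint children LTAs. Together, B < F and C > F place B inside C, so a non-empty B would share elements with its sibling C.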

Articulations: Some Imply other Articulations

[Figure: Taxonomy 1 has A with children B and C (isa); Taxonomy 2 has D with children E and F (isa)]

Articulations:
A ≡ D
C ≡ E

Implies B ≡ F

Assuming the non-emptiness, disjoint children and coverage LTAs
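The three small cases on these slides (articulations that make sense, that are impossible, and that imply another articulation) can be checked mechanically. The sketch below is not the CleanTAX method, which relies on a theorem prover and a model searcher; it is a brute-force stand-in that works for this restricted setting because every formula speaks about one element at a time. Each element has a "type" (the set of taxa it belongs to); universal formulas filter the allowed types, and each existential formula needs one allowed type as a witness. All function names are hypothetical.

```python
from itertools import product

PREDICATES = ["A", "B", "C", "D", "E", "F"]
ISA = [("B", "A"), ("C", "A"), ("E", "D"), ("F", "D")]  # (child, parent)
SIBLINGS = [("A", ["B", "C"]), ("D", ["E", "F"])]       # (parent, children)


def implies(a, b):
    """Universal constraint: every element of a is also in b."""
    return lambda t: (not t[a]) or t[b]


def consistent(universals, existentials):
    """True if some non-empty model satisfies every constraint."""
    types = [dict(zip(PREDICATES, bits))
             for bits in product([False, True], repeat=len(PREDICATES))]
    allowed = [t for t in types if all(u(t) for u in universals)]
    return bool(allowed) and all(any(e(t) for t in allowed) for e in existentials)


def taxonomy_axioms(ltas):
    """Build (universals, existentials) for both taxonomies under the LTAs."""
    universals = [implies(c, p) for c, p in ISA]
    existentials = []
    if "N" in ltas:
        existentials += [lambda t, n=n: t[n] for n in PREDICATES]
    for parent, kids in SIBLINGS:
        if "D" in ltas:
            universals.append(lambda t, k=kids: not (t[k[0]] and t[k[1]]))
        if "C" in ltas:
            universals.append(
                lambda t, p=parent, k=kids: (not t[p]) or any(t[x] for x in k))
    return universals, existentials


def add_congruent(axioms, m, n):
    """Articulation m congruent to n."""
    axioms[0].append(lambda t: t[m] == t[n])


def add_proper_part(axioms, m, n):
    """Articulation m < n: m lies inside n, and n has an element outside m."""
    axioms[0].append(implies(m, n))
    axioms[1].append(lambda t: t[n] and not t[m])


# "Some make sense": A < D, B < F, C congruent to E (checked here under NDC).
ok = taxonomy_axioms("NDC")
add_proper_part(ok, "A", "D")
add_proper_part(ok, "B", "F")
add_congruent(ok, "C", "E")
sensible = consistent(*ok)

# "Some are impossible": B < F and C > F under N and D.
bad = taxonomy_axioms("ND")
add_proper_part(bad, "B", "F")
add_proper_part(bad, "F", "C")
impossible = not consistent(*bad)


def entails_b_congruent_f(ltas):
    """Refutation: B congruent to F follows iff adding its negation is inconsistent."""
    premises = taxonomy_axioms(ltas)
    add_congruent(premises, "A", "D")
    add_congruent(premises, "C", "E")
    premises[1].append(lambda t: t["B"] != t["F"])  # some x in exactly one of B, F
    return not consistent(*premises)


print(sensible, impossible, entails_b_congruent_f("NDC"), entails_b_congruent_f("ND"))
```

Under these assumptions the implied articulation depends on coverage: with only non-emptiness and disjointness, B congruent to F no longer follows.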

The Relation Lattice

• Sometimes, a single relation between two nodes is unknown.
• The relation lattice shows all 32 possible combined relations.
• Each node represents a disjunction of relations.

[Figure: the relation lattice, with the five basic relations on the bottom rank and the disjunction of all five at the top]
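One way to read the figure of 32 is as the number of subsets of the five basic relations (2 to the power 5), grouped into ranks by size. The sketch below enumerates them under that reading; whether the slide counts the empty subset as a node is an assumption here, and "=" is used as an ASCII stand-in for the congruence symbol.

```python
from itertools import combinations

# "=" stands in for congruence; the other symbols follow the slides.
BASIC_RELATIONS = ("=", ">", "<", "o", "x")


def lattice_ranks(relations=BASIC_RELATIONS):
    """Group every subset of the basic relations by its size (its rank)."""
    return {size: [frozenset(c) for c in combinations(relations, size)]
            for size in range(len(relations) + 1)}


ranks = lattice_ranks()
total = sum(len(nodes) for nodes in ranks.values())
print(total)                                    # 32
print([len(ranks[r]) for r in sorted(ranks)])   # [1, 5, 10, 10, 5, 1]
```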

The Complexity of Developing Articulations

The Ranunculus data set
9 Taxonomies
654 Taxa
704 Articulations

visualization by Martin Graham

Example Articulation Set

[Figure: articulations between nodes of Kartesz, 2004 and Benson, 1948, with nodes labelled A to M. Legend: is included in, equals, overlaps (O), disjoint (X)]

A: R. petiolaris
B: R. macranthus
C: R. fascicularis

Goal – To Help Bob Know

• that the taxonomies he's working with are consistent

• when he's introduced an articulation that leads to inconsistency

• when an articulation is implied by others

• about ambiguous articulations

Berendsohn et al., 2003 - MoReTaX

Logic Based Approach

• Devise a language LTax

– First-order logic constraints on single-place predicates, where each predicate is a "taxon"

• Render taxonomies and articulations between them into a set of first-order formulas

• Then we can ask:
  – does a taxonomy follow your definition of taxonomy?
  – is a pair of taxonomies plus articulations between them consistent?
  – are there unstated articulations?

Translating Taxonomy into Logic

Taxonomy and LTA Formulas

isa: for each edge M isa N, add ∀x: M(x) → N(x)
Non-Emptiness (N): for each node N, add ∃x: N(x)
Child Disjointness (D): for each two children N1, N2 of M, add ∀x: N1(x) → ¬N2(x)
Coverage (C): for each node M with children N1, …, NL, add ∀x: M(x) → N1(x) ∨ … ∨ NL(x)

Articulation Formulas

Congruence, M ≡ N: ∀x: M(x) ↔ N(x)
Proper Inclusion, M > N: (∀x: N(x) → M(x)) ∧ (∃a: M(a) ∧ ¬N(a))
Proper Inverse Inclusion, M < N: (∀x: M(x) → N(x)) ∧ (∃a: N(a) ∧ ¬M(a))
Partial Overlap, M o N: ∃a,b,c: M(a) ∧ N(a) ∧ M(b) ∧ ¬N(b) ∧ ¬M(c) ∧ N(c)
Exclusion, M x N: ∀x: M(x) → ¬N(x)
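The translation rules on this slide are mechanical enough to sketch as string templates. The example below renders them in an ASCII notation modelled on Prover9 input (`all`, `exists`, `->`, `<->`, `&`, `|`, `-`). The slides do not show the text CleanTAX actually emits, so the function names, the template table and the exact syntax are assumptions for illustration only.

```python
def taxonomy_formulas(isa_edges, ltas=""):
    """Render isa edges and the selected LTAs (any of "N", "D", "C") as strings."""
    children = {}
    for child, parent in isa_edges:
        children.setdefault(parent, []).append(child)
    nodes = sorted({n for edge in isa_edges for n in edge})

    formulas = [f"all x ({c}(x) -> {p}(x))." for c, p in isa_edges]
    if "N" in ltas:  # non-emptiness: every node has a member
        formulas += [f"exists x ({n}(x))." for n in nodes]
    for parent, kids in children.items():
        if "D" in ltas:  # child disjointness: one formula per pair of siblings
            formulas += [f"all x ({a}(x) -> -{b}(x))."
                         for i, a in enumerate(kids) for b in kids[i + 1:]]
        if "C" in ltas:  # coverage: the children exhaust the parent
            union = " | ".join(f"{k}(x)" for k in kids)
            formulas.append(f"all x ({parent}(x) -> ({union})).")
    return formulas


# "=" stands in for the congruence symbol.
ARTICULATION_TEMPLATES = {
    "=": "all x ({m}(x) <-> {n}(x)).",
    ">": "(all x ({n}(x) -> {m}(x))) & (exists a ({m}(a) & -{n}(a))).",
    "<": "(all x ({m}(x) -> {n}(x))) & (exists a ({n}(a) & -{m}(a))).",
    "o": ("exists a exists b exists c "
          "({m}(a) & {n}(a) & {m}(b) & -{n}(b) & -{m}(c) & {n}(c))."),
    "x": "all x ({m}(x) -> -{n}(x)).",
}


def articulation_formula(m, relation, n):
    """Render one articulation "m relation n" as a formula string."""
    return ARTICULATION_TEMPLATES[relation].format(m=m, n=n)


for line in taxonomy_formulas([("B", "A"), ("C", "A")], ltas="NDC"):
    print(line)
print(articulation_formula("B_Rac", "<", "K_Ra"))
```

Taxon names use underscores (for example `B_Rac`) because a dotted name such as B.Rac would clash with the formula-terminating period in this notation.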

Theorem Proving

Φ = { ∀x: B.Rac(x) → B.Ra(x), ∀x: B.Rat(x) → B.Ra(x), ∀x: B.Ra(x) ↔ K.Ra(x), ∀x: B.Rat(x) → K.Ra(x), ... }

φ = (∀x: B.Rac(x) → K.Ra(x)) ∧ (∃a: K.Ra(a) ∧ ¬B.Rac(a))

Want to show that Φ ╞ φ, that is, that φ holds in Φ

To prove it, show: Φ ∪ {¬φ} ├ ⊥

CleanTax Methodology

Given a set of taxonomies and articulations between them:

1. Check each taxonomy under each LTA set to see if it's consistent

2. Check the articulations under each LTA set to see if they are consistent

3. Check the taxonomies plus the articulations under the LTA sets from above and make sure the combination is consistent

4. If so, for each pair-wise combination of nodes, try to prove each possible relationship under each consistent LTA set.

Implemented in Python. The theorem prover Prover9 and the model searcher Mace4 are used to prove relationships and check consistency.

The CleanTAX Infrastructure

• Features
  – Designed to plug in a variety of reasoners
  – Works with computer clusters (Sun Grid Engine)
  – Can work with whole taxonomies or subsets

• Command line options
  – Specify taxonomies and articulation sets to test
  – Specify relations to test
  – Specify LTAs to test
  – Specify nodes to test
  – Pass parameters to the reasoners

• Inputs
  – Taxonomic Concept Schema (an XML spec)
  – Individual reasoner files
  – Internal representation

• Example Reports
  – Which taxonomies are consistent under which LTAs
  – For each pair of nodes tested, for each relation, under each LTA, whether or not it can be proven true
  – For each set of taxonomies and articulations, under each LTA, a graph showing new inferred relations

Initial results

We ran two Ranunculus taxonomies (Benson 1948, 218 Taxa and Kartesz 2004, 142 Taxa) and 206 Articulations from Peet 2005.

When the taxonomies and the articulations were analyzed as a whole, only two LTA combinations were provably consistent: no LTAs and non-emptiness.

This involved 928,680 judgments and took 46.0 hours.

To get a better sense of the impact of LTAs, the combined taxonomies and articulations were divided into 82 connected subgraphs.

Among these we found 5 inconsistencies and 1946 new articulations.

This involved 166,920 judgments and took 4.8 hours.

Discovered Inconsistent Mapping
under the {coverage, disjointness, non-emptiness} LTA set

Peet, 2005:
B.1948:R.h.stolonifer is congruent to K.2004:R.h.stolonifer
B.1948:R.h.typicus is congruent to K.2004:R.h.typicus
B.1948:R. hydrocharoides is congruent to K.2004:R. hydrocharoides

[Figure: Benson, 1948 has Ranunculus hydrocharoides with children R.h. var natans, R.h. var stolonifer and R.h. var typicus. Kartesz, 2004 has Ranunculus hydrocharoides with children R.h. var stolonifer and R.h. var typicus.]

The most likely fix here is to change the congruence relation between the top two nodes to instead state that Benson's R. hydrocharoides includes Kartesz's.

Formal Proof of Inconsistency

Inferring Additional Knowledge

Does C = E? Or, is C > E?

[Figure: two taxonomies, Benson, 1948 and Kartesz, 2004, over nodes A to K, with several articulated "<" relations between them. Legend: taxonomy-provided isa; articulated proper inverse inclusion (<); articulated congruence (≡)]

A: Ranunculus hispidus
B: R.h. var caricetorum
C: R.h. var hispidus
D: R.h. var nitidus
E: Ranunculus hispidus
F: R.h. var eurylobus
G: R.h. var greenmanii
H: R.h. var marilandicus
I: R.h. var typicus
J: R. septentrionalis
K: R. carolinanis

Most Informative Relation (MIR)

[Figure: the relation lattice of combined relations]

Latent Taxonomic Assumptions vs New Maximally Informative Relations

                  The Basic Five Relations   The Other 28 Relations
No LTAs           245                        304
All Three LTAs    475                        74

Numbers represent novel provably true relations within 75 sub-taxonomies.

Main finding: More constraints lead to more specificity in provably true relations

Optimizations

LTA Optimization

[Figure: lattice of LTA sets, with N, D and C on one rank, ND, NC and DC on the next, and NDC at the top]

If a set of axioms is inconsistent under one node, it will be inconsistent under all the supersets of that node.
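The pruning rule above can be sketched as a walk over LTA sets from smallest to largest that skips any set containing an already inconsistent one. Here `is_consistent` is a hypothetical caller-supplied function standing in for a reasoner run; the toy oracle at the end is only an illustration, chosen so that just the empty set and non-emptiness are consistent.

```python
from itertools import combinations


def lta_sets(ltas=("N", "D", "C")):
    """All LTA subsets, smallest first."""
    return [frozenset(c) for size in range(len(ltas) + 1)
            for c in combinations(ltas, size)]


def consistent_lta_sets(is_consistent, ltas=("N", "D", "C")):
    """Return (consistent sets, number of reasoner calls actually made).

    `is_consistent` is a hypothetical stand-in for one reasoner run on the
    axioms under a given LTA set.
    """
    inconsistent, consistent, calls = [], [], 0
    for candidate in lta_sets(ltas):
        if any(bad <= candidate for bad in inconsistent):
            inconsistent.append(candidate)  # implied by a subset: no call needed
            continue
        calls += 1
        if is_consistent(candidate):
            consistent.append(candidate)
        else:
            inconsistent.append(candidate)
    return consistent, calls


# Toy oracle: anything beyond non-emptiness is inconsistent.
found, calls = consistent_lta_sets(lambda s: s <= frozenset({"N"}))
print(sorted("".join(sorted(s)) or "none" for s in found), calls)
```

With this toy oracle only four of the eight LTA sets need a reasoner call; the other four are supersets of D or C and are pruned.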

Finding the MIR
Algorithm 1: Bottom Up (A↑)

[Figure: the relation lattice]

Try relations on the bottom rank in order, then, if none is true, go to the next rank.

Finding the MIR
Algorithm 2: Top Down (A↓)

[Figure: the relation lattice]

Just check the relations in the penultimate rank

((A B C D) E)

((B C D E) A)

(B C D )
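The two search strategies can be compared with a toy oracle. The reading assumed here is that each penultimate-rank node is a disjunction of four relations, so proving it rules out the one relation it omits, and the MIR is what remains after all five tests; this matches the three-line example above, where two provable four-relation disjunctions leave B, C and D. The function names and the oracle are hypothetical, and "=" stands in for congruence.

```python
from itertools import combinations

BASIC = ("=", ">", "<", "o", "x")


def make_oracle(possible):
    """Toy stand-in for the prover: a disjunction is provable iff it contains
    every relation that cannot be ruled out. Also counts judgments."""
    counter = {"judgments": 0}

    def provable(disjunction):
        counter["judgments"] += 1
        return set(possible) <= set(disjunction)

    return provable, counter


def mir_bottom_up(provable):
    """Try single relations, then pairs, and so on, until one is provable."""
    for size in range(1, len(BASIC) + 1):
        for candidate in combinations(BASIC, size):
            if provable(candidate):
                return frozenset(candidate)
    return None


def mir_top_down(provable):
    """Test only the five four-relation disjunctions; each provable one
    rules out the relation it omits."""
    remaining = set(BASIC)
    for omitted in BASIC:
        if provable([r for r in BASIC if r != omitted]):
            remaining.discard(omitted)
    return frozenset(remaining)


def judgments(search, possible):
    """Run one search against the toy oracle and report (MIR, judgment count)."""
    provable, counter = make_oracle(possible)
    return search(provable), counter["judgments"]


print(judgments(mir_bottom_up, {"<"}), judgments(mir_top_down, {"<"}))
print(judgments(mir_bottom_up, {"<", "o", "x"}), judgments(mir_top_down, {"<", "o", "x"}))
```

In this sketch the top-down search always costs five judgments per node pair, while the bottom-up search is cheap when the answer is a single basic relation and expensive when little can be proven. That is in line with the judgment counts reported for the two full taxonomies: 218 x 142 = 30,956 node pairs, and 30,956 x 5 = 154,780, while 30,956 x 30 = 928,680.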

Relation Lattice Optimization Results 1

                           A0        A↑        A↓
Number of Judgments        928,680   912,779   154,780
Time (hours)               46.0      45.3      7.8 (a 5.8x speedup)
Logical Steps (millions)   2,634     2,589     442

Comparing the two full taxonomies under the non-emptiness LTA shows a strong improvement for the top-down optimization.

Relation Lattice Optimization Results 2

                            A0       A↑                       A↓
Number of Judgments         17,019   2,194                    2,745
Time (seconds)              574.59   83.61 (a 6.9x speedup)   100.47 (a 5.7x speedup)
Logical Steps (thousands)   2,484    384                      394

Under more restrictive constraints, the bottom-up optimization improves. Results are for 75 sub-taxonomies under the NDC LTA.

Summary: Contributions To Date

• Represented taxonomies and articulations between them in logic

• Clarified and represented latent taxonomic assumptions

• Created an infrastructure capable of applying reasoners to large taxonomies and articulation sets
  – discovering inconsistencies
  – discovering interesting new relations
  – elucidating impact of LTAs on reasoning

• Described and tested three optimizations

Future Work: Applications

Paul Craig and Jessie Kennedy (2007), School of Computing, Napier University, Edinburgh

Future Work: Suggesting Fixes

[Figure: Benson, 1948 has Ranunculus hydrocharoides with children R.h. var natans, R.h. var stolonifer and R.h. var typicus. Kartesz, 2004 has Ranunculus hydrocharoides with children R.h. var stolonifer and R.h. var typicus.]

Inconsistency found, suggested fixes:
1. Change relation between Ranunculus hydrocharoides (Benson, 1948) and Ranunculus hydrocharoides (Kartesz, 2004) from ≡ to >.
2. Relax Non-Emptiness constraint, allowing Ranunculus hydrocharoides var. natans to be empty.
3. Relax Coverage constraint, allowing R. hydrocharoides to contain specimens not contained in its children
4. …

Future Work: Other Logics – DL

[Figure: Ranunculus in Benson, 1948 and in Kartesz, 2004, showing Ranunculus petiolaris in both taxonomies and Ranunculus macranthus, with a "<" articulation]

Other Future Work

• Better parallelization

• Better interfaces (GUI, Web Services)

• Applications to other domains

• Enhancing reporting tools to better support data curation

Conclusions

• Taxonomies are more complicated than you may have thought.

• Logic is a useful tool for discovering inconsistencies and new relations in taxonomies and articulations between them.

• This is an interesting interdisciplinary line of research combining elements from systematics, artificial intelligence, and high-performance computing.

Thanks!

Acknowledgements

SEEK is supported by the National Science Foundation under awards 0225676, 0225665, 0225635, and 0533368.

Invaluable Consultation: Bertram Ludäscher and Shawn Bowers

Ranunculus Data Set: Bob Peet

Visualization Tools: Jessie Kennedy, Martin Graham and Paul Craig

Niche Modeling: Kirsten Menger-Anderson

Funding and Context: The SEEK project

References

D. Thau and B. Ludäscher. Reasoning about Taxonomies in First-Order Logic. Journal of Ecological Informatics (accepted for publication in 2007).

D. Thau and B. Ludäscher. Toward Optimizing CleanTAX: An Automated Reasoning Method for Taxonomies and Articulations (submitted to the 2007 IEEE/WIC/ACM International Conference on Web Intelligence).