1 The Refined Semantic Network James Geller Yehoshua Perl New Jersey Institute of Technology.

1

The Refined Semantic Network

James Geller

Yehoshua Perl

New Jersey Institute of Technology

2

Sources

This presentation is based on:

• a pending proposal on Auditing and Extending the UMLS;

• [Gu et al., JAMIA 2000] and [MEDINFO YEARBOOK 2001]

3

Fundamental Observation

• The UMLS requires that there is an assignment of one or several Semantic Types to each concept.

• This assignment provides semantics for concepts.

• We call the set of all concepts to which a Semantic Type S has been assigned the Extent of S.
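
The definition of an extent can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the function name `extents` are assumptions, and the three concepts are borrowed from the EEH example later in this presentation.

```python
from collections import defaultdict

# Assumed layout: concept name -> set of assigned Semantic Types.
assignments = {
    "Acid rain": {"Environmental Effect of Humans",
                  "Hazardous or Poisonous Substance"},
    "Poor Sanitation": {"Environmental Effect of Humans", "Finding"},
    "Industrial waste": {"Environmental Effect of Humans",
                         "Manufactured Object",
                         "Hazardous or Poisonous Substance"},
}

def extents(assignments):
    """Map each Semantic Type S to its extent (all concepts assigned S)."""
    result = defaultdict(set)
    for concept, types in assignments.items():
        for semantic_type in types:
            result[semantic_type].add(concept)
    return dict(result)

ext = extents(assignments)
print(sorted(ext["Hazardous or Poisonous Substance"]))
```

A concept with several Semantic Types appears in several extents, which is what later makes the extents non-uniform.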

4

Problems (1)

• Assigning Semantic Types to new concepts is a complex manual task due to complexity, ambiguity and homonymy of medical concepts.

• Categorization is highly dependent on an editor’s specialty, background and priorities and thus not fully predictable.

5

Problems (2)

• The extents of most Semantic Types are not uniform. If we look at the extent of a Semantic Type, it may contain concepts with “second” assignments that are different from each other.

• A desire was expressed to make the SN deeper [McCray and Nelson 1995].

6

Example (concrete)

• We are looking at all 61 concepts to which the Semantic Type Environmental Effect of Humans (henceforth EEH) has been assigned (i.e., the extent of EEH).

7

Are these concepts similar?

• Are Classroom environment, Sanitation problem, Acid rain, and Industrial waste really similar to each other?

• ONLY EEH is assigned to 54 of these 61 concepts.

• The 7 other concepts are assigned combinations of EEH with 3 additional Semantic Types. That’s why we say that the extent of EEH is not uniform.

8

Concepts of non-uniform semantics

• EEH & Finding: 2 concepts: Poor Sanitation, Sanitation Problem

• EEH & Hazardous or Poisonous Substance: 4 concepts: Acid rain, Radioactive fallout, Radioactive waste, Smoke

• EEH & Manufactured Object & Hazardous or Poisonous Substance: 1 concept: Industrial waste

9

Abstract Example with 4 Semantic Types. Boxes show extents.

W: g, h

X: a, b, d, e, f, g

Y: b, g

Z: c, d, e, g

10

Problem even in simple case

• One has to look into all boxes to see if a concept occurs in them or not.

• Definition: Intersection of two extents: The intersection of two extents contains all and only the concepts that occur in BOTH extents. We will use the symbol & for it.

• Example: Intersection of [a, b, c] & [b, c, d] --> [b, c]
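
As a small sketch, the intersection of extents maps directly onto Python sets. The helper name `intersect` is an assumption made for illustration.

```python
def intersect(*extents):
    """Return all and only the concepts that occur in every given extent."""
    return set.intersection(*map(set, extents))

# The example from the slide.
print(sorted(intersect(["a", "b", "c"], ["b", "c", "d"])))

# Extents X and Y of the abstract example with W, X, Y, Z.
X = {"a", "b", "d", "e", "f", "g"}
Y = {"b", "g"}
print(sorted(intersect(X, Y)))
```

Python's built-in set operator happens to use the same symbol, so `X & Y` gives the same result as `intersect(X, Y)`.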

11

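[Figure: Example of Venn Diagram for X, Y, Z, W. a and f lie only in X; b lies in X and Y; d and e lie in X and Z; c lies only in Z; h lies only in W; g lies in all four.]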

12

Our Solution to all 3 Problems

• Identify all existing intersections.

• Display every concept exactly once, in its pure “original box” or in a new “box” that corresponds to a unique intersection.

13

Original Semantic Types (non-uniform semantics)

W: g, h

X: a, b, d, e, f, g

Y: b, g

Z: c, d, e, g

PURE Semantic Types (simple semantics)

W: h

X: a, f

Y: (no concepts)

Z: c

Intersection Types (compound semantics)

W & X & Y & Z: g

X & Y: b

X & Z: d, e

14

Intersection Types

• Intersection Types are “new” Semantic Types that are constructed by intersection of the extents of their component Semantic Types.

• The “names” of Intersection Types are constructed by chaining the names of their component Semantic Types together with &-signs.
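
A minimal sketch of the naming rule follows. The function names are hypothetical, and the order of the components is simply taken from the caller, since the slides do not state an ordering rule.

```python
def intersection_name(components):
    """Chain the component Semantic Type names together with &-signs."""
    return " & ".join(components)

def components_of(name):
    """Recover the component Semantic Types from an Intersection Type name."""
    return name.split(" & ")

name = intersection_name(["EEH", "Manufactured Object",
                          "Hazardous or Poisonous Substance"])
print(name)
print(components_of(name))
```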

15

Semantic Refinement

• We call the process of constructing all necessary Pure Semantic Types and Intersection Types Semantic Refinement.

• Concepts are reassigned, so that every concept occurs only in one extent.

• After Semantic Refinement, every Semantic Type has a uniform extent.
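
Semantic Refinement can be sketched as grouping concepts by the exact combination of Semantic Types assigned to them. The code below is an illustrative reading of the slides using the abstract W, X, Y, Z example; the function name `refine` and the data layout are assumptions, not the authors' implementation.

```python
from collections import defaultdict

# Extents of the abstract example with Semantic Types W, X, Y, Z.
extents = {
    "W": {"g", "h"},
    "X": {"a", "b", "d", "e", "f", "g"},
    "Y": {"b", "g"},
    "Z": {"c", "d", "e", "g"},
}

def refine(extents):
    """Group concepts by the exact set of Semantic Types assigned to them."""
    types_of = defaultdict(set)
    for semantic_type, concepts in extents.items():
        for concept in concepts:
            types_of[concept].add(semantic_type)
    refined = defaultdict(set)
    for concept, types in types_of.items():
        # One type -> Pure Semantic Type; several -> Intersection Type.
        refined[frozenset(types)].add(concept)
    return dict(refined)

refined = refine(extents)
for types in sorted(refined, key=lambda t: (len(t), sorted(t))):
    print(" & ".join(sorted(types)), sorted(refined[types]))
```

Every concept lands in exactly one group. In this example no concept has Y as its only Semantic Type, so the pure type Y has an empty extent and does not appear in the result.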

16

• Pure Semantic Types have extents of simple concepts. (They are uniform.)

• Intersection Types have extents of compound concepts. (They are also uniform!)

17

Advantages

• Extents of pure and intersection types now have a uniform semantics. That means every extent contains concepts that are highly similar.

• Small sets of concepts are easy to review and are also more suspicious, i.e., more likely to contain errors.

• It is easier to see a concept that “does not belong” to a small set or even that a concept is “missing.”
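
One way to act on this is to queue the smallest extents for review first. The sketch below uses the extent sizes of the EEH example; the threshold `max_size=2` and the function name `review_queue` are assumptions chosen for illustration.

```python
# Extent sizes after refinement of the EEH example (54 + 2 + 4 + 1 = 61).
extent_sizes = {
    "EEH": 54,
    "EEH & Finding": 2,
    "EEH & Hazardous or Poisonous Substance": 4,
    "EEH & Manufactured Object & Hazardous or Poisonous Substance": 1,
}

def review_queue(extent_sizes, max_size=2):
    """Return types whose extent has at most max_size concepts, smallest first."""
    small = [(size, name) for name, size in extent_sizes.items()
             if size <= max_size]
    return [name for size, name in sorted(small)]

for name in review_queue(extent_sizes):
    print(name, extent_sizes[name])
```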

18

What does this have to do with the Semantic Network?

• Every intersection type S of types X, Y, Z,… should be added to the Semantic Network as follows.

• S is made a child of several appropriate Semantic Types. We allow multiple parents.

• This is the Refined Semantic Network: RSN
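
The slides do not spell out which parents are “appropriate.” The sketch below assumes one plausible rule: an intersection type becomes a child of the most specific existing types (pure or intersection) whose component sets are proper subsets of its own. The function name `rsn_parents` is hypothetical.

```python
def rsn_parents(refined_types):
    """Map each intersection type to its most specific proper-subset types.

    Assumed rule, not taken from the slides: a parent is any existing type
    whose components are a proper subset of the child's components and that
    is not itself contained in another such type.
    """
    all_types = set(refined_types)
    parents = {}
    for child in all_types:
        if len(child) < 2:
            continue  # pure types keep their existing parents in the SN
        subsets = [t for t in all_types if t < child]
        parents[child] = {
            t for t in subsets if not any(t < other for other in subsets)
        }
    return parents

types = [frozenset(s) for s in (["W"], ["X"], ["Y"], ["Z"],
                                ["X", "Y"], ["X", "Z"],
                                ["W", "X", "Y", "Z"])]
parents = rsn_parents(types)
for child in sorted(parents, key=lambda t: (len(t), sorted(t))):
    names = sorted(" & ".join(sorted(p)) for p in parents[child])
    print(" & ".join(sorted(child)), "<-", names)
```

Under this assumption, W & X & Y & Z gets the parents W, X & Y and X & Z, so the result is a DAG with multiple parents rather than a tree.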

19

[Figure: The Refined Semantic Network of the Semantic Types W, X, Y, Z. Top level: W, X, Y, Z. Middle level: X & Y, X & Z. Bottom level: W & X & Y & Z.]

20

[Figure: Subnetwork of SN with EEH Intersections and all their ancestors]

Semantic Types shown: Thing; Event; Entity; Phenomenon or Process; Conceptual Entity; Physical Object; Finding; Manufactured Object; Substance; Human-Caused Phenomenon or Process; EEH; Chemical; Chemical Viewed Functionally; Hazardous or Poisonous Substance

Intersection Types shown: EEH & Haz. or Poi. Sub.; EEH & Finding; Manu. Obj. & Haz. or Poi. Sub.; EEH & Manu. Obj. & Haz. or Poi. Sub.

21

• The RSN supports auditing.

• Auditing has helped us find mistakes in the UMLS.

• Removal of mistakes typically leads to simplifications of the UMLS and of the RSN itself, by removing wrong intersections.

22

EEH Auditing Example

• The intersection of the extents of the three Semantic Types EEH, Manufactured Object, and Hazardous or Poisonous Substance contained only one concept: Industrial Waste.

• Industrial smog and Factory smoke are not considered Manufactured Objects, and our audit suggested that Industrial Waste should not be one either.

23

More strange intersections

• We found concepts belonging to both Human-caused phenomenon or process and Manufactured object.

• It is out of the question that something is at the same time a process and an object.

• By creating the RSN we found this.

• It was caused by homonyms. E.g. Video recording as the process and as its result.

24

Wrong Categorizations

• By reviewing the pure semantic types and intersection types we found various errors.

• Drinking water problem and PBC Airborne level are missing a Finding assignment.

• Smoke is assigned Hazardous or Poisonous Substance, but its subconcepts Factory smoke and Second hand smoke are missing such an assignment.

25

• Classroom Environment and College Environment should not be assigned EEH at all.

• These and other errors were exposed by review of the extents, which should be semantically uniform.

• After correcting these errors, the concepts of EEH look very different.

26

Venn Diagrams before/after audit

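[Figure: Venn diagrams of the extent of EEH before and after the audit]

Before the audit: EEH, Finding, Hazardous or Poisonous Substance, Manufactured Object. Counts shown in the diagram: 2, 3, 1, 4, 54.

After the audit: EEH, Manufactured Object, Hazardous or Poisonous Substance, Finding, Substance. Counts shown in the diagram: 40, 105, 3, 4.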

27

[Figure: Revised Subnetwork of SN with EEH Intersections and all their ancestors]

Semantic Types shown: Thing; Event; Entity; Phenomenon or Process; Conceptual Entity; Physical Object; Finding; Manufactured Object; Substance; Human-Caused Phenomenon or Process; EEH; Chemical; Chemical Viewed Functionally; Hazardous or Poisonous Substance

Intersection Types shown: EEH & Substance; EEH & Finding; Manu. Obj. & Haz. or Poi. Sub.; EEH & Haz. or Poi. Sub.

28

Exclusive Semantic Types

• We found 143 concepts that are classified as both Organic Chemical and Inorganic Chemical!!

• Of those, 82 are assigned to additional semantic types.
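
A check for exclusive Semantic Types can be sketched as follows. Which pairs count as exclusive is an auditor's decision; only the Organic Chemical / Inorganic Chemical pair comes from the slide. The concept names and the function name `exclusive_violations` are hypothetical placeholders.

```python
# Pairs of Semantic Types assumed to be mutually exclusive.
EXCLUSIVE_PAIRS = [("Organic Chemical", "Inorganic Chemical")]

def exclusive_violations(assignments, exclusive_pairs=EXCLUSIVE_PAIRS):
    """List (concept, type, type) for concepts assigned both types of a pair."""
    violations = []
    for concept, types in sorted(assignments.items()):
        for first, second in exclusive_pairs:
            if first in types and second in types:
                violations.append((concept, first, second))
    return violations

# Hypothetical concepts, for illustration only.
assignments = {
    "Concept A": {"Organic Chemical", "Inorganic Chemical"},
    "Concept B": {"Organic Chemical", "Pharmacologic Substance"},
    "Concept C": {"Inorganic Chemical"},
}
print(exclusive_violations(assignments))
```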

29

Redundant Categorizations

• Many concepts are assigned to a Semantic Type S and the parent or ancestor T of S. This is a redundant categorization, a no-no. [McCray and Nelson, 1995][Peng et al., AMIA 2002]

• Sample in 1998: Desertification was assigned EEH and also PHENOMENON OR PROCESS, a redundant categorization. It was removed after our report.
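
Detecting a redundant categorization only needs the parent links of the Semantic Network. The sketch below uses the chain above EEH as it appears in the subnetwork figure; the dictionary `PARENT` and the function names are assumptions made for illustration.

```python
# Fragment of the SN hierarchy: child -> parent.
PARENT = {
    "Environmental Effect of Humans": "Human-caused Phenomenon or Process",
    "Human-caused Phenomenon or Process": "Phenomenon or Process",
    "Phenomenon or Process": "Event",
}

def ancestors(semantic_type, parent=PARENT):
    """Return all ancestors of a Semantic Type by following parent links."""
    found = set()
    while semantic_type in parent:
        semantic_type = parent[semantic_type]
        found.add(semantic_type)
    return found

def redundant_types(types, parent=PARENT):
    """Return assigned types that are ancestors of another assigned type."""
    return {t for t in types
            if any(t in ancestors(s, parent) for s in types)}

desertification = {"Environmental Effect of Humans", "Phenomenon or Process"}
print(redundant_types(desertification))
```

For the Desertification sample, the check reports Phenomenon or Process as the redundant assignment.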

30

Auditing simplifies the RSN

• After correcting the assignments of those 143 (“organic”) concepts, 13 invalid Intersection Types disappeared.

• The RSN becomes simpler, as it has fewer Intersection Types.

• In a sample of 100 intersections with only one concept, only 15 were deemed legal. [Gu, JAMIA 2000]

31

Renaming Intersection Types

• Instead of Environmental Effect of Humans & Hazardous or Poisonous Substance, we would rather have a designer rename it to: Environmentally Hazardous or Poisonous Substance.

• An intersection of Body Part & Manufactured Object is a Prosthesis.

32

[Figure: Subnetwork with simplified names]

Semantic Types shown: Thing; Event; Entity; Phenomenon or Process; Conceptual Entity; Physical Object; Finding; Manufactured Object; Substance; Human-Caused Phenomenon or Process; EEH; Chemical; Chemical Viewed Functionally; Hazardous or Poisonous Substance

Renamed Intersection Types shown: Human-produced Environmental Substance; Environmental Finding; Manufactured Hazardous or Poisonous Substance; Environmentally Hazardous or Poisonous Substance

33

Overall Results for UMLS 1998

Level   Number of Pure Semantic Types   Number of Intersection Types
1       1                               0
2       2                               0
3       4                               0
4       20                              0
5       41                              56
6       23                              203
7       23                              163
8       17                              187
9       2                               234
10      0                               212
11      0                               89
12      0                               16
13      0                               3
14      0                               1

34

Concept Distribution UMLS 98

Number of concepts per intersection type   Number of intersection types with that many concepts
1                                          421   }
2                                          147   } Many of these will disappear [Gu, AIM 2004]
3                                          102   }
4                                          65
5                                          35
6                                          41
7                                          32
8                                          15
9                                          13
….
3947                                       1
4582                                       1
6705                                       1
19349                                      1
41564                                      1

35

Streamlining Categorizations

• Currently, each of several UMLS editors may assign any combination of Semantic Types to new concepts, even combinations that don’t make sense.

• We propose that concepts may be assigned only existing pure and intersection types.

• If a new intersection type is desired it has to be “approved” by the NLM.
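
The proposed restriction amounts to a lookup against the list of existing pure and intersection types. The sketch below is hypothetical: the contents of `APPROVED`, the function name `check_assignment` and the returned strings are all assumptions, not part of any UMLS tool.

```python
# Hypothetical excerpt of the approved pure and intersection types.
APPROVED = {
    frozenset({"EEH"}),
    frozenset({"Finding"}),
    frozenset({"Hazardous or Poisonous Substance"}),
    frozenset({"EEH", "Finding"}),
    frozenset({"EEH", "Hazardous or Poisonous Substance"}),
}

def check_assignment(types, approved=APPROVED):
    """Accept a combination only if it is an existing pure or intersection type."""
    if frozenset(types) in approved:
        return "accepted"
    # A new intersection type would first have to be approved by the NLM.
    return "needs approval"

print(check_assignment({"EEH", "Finding"}))
print(check_assignment({"EEH", "Manufactured Object"}))
```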

36

Summary (1)

• We propose to change the SN as follows:

• Allow a DAG structure, to enable intersections (with multiple parents)

• Create the “lower half” of the RSN by our method of Semantic Refinement. That takes care of deepening!

• Use various auditing techniques to eliminate all wrong intersections.

37

Summary (2)

• Rename legitimate intersection types.

• The RSN limits the choices of a UMLS editor to reasonable intersections.

• This will prevent future UMLS mistakes.

• The RSN streamlines categorization, making it more accurate and easier.