Semiautomatic domain model building from text-data
SMAP 2011, Vigo, Spain, December 1-2, 2011
The basic tasks in creating a domain model:
- selection of domain and scope
- consideration of reusability
- finding important terms
- defining classes and class hierarchy
- defining properties of classes and constraints
- creation of instances of classes
Goals:
- designing a method for semiautomatic domain model creation
- different input documents
- different languages
- design and implementation of a tool
- algorithm and tasks work with domain model
- different document formats
- different languages
- domain model: concepts, relations
- domain model creation = time consuming
  - manual creation
  - automatic creation
  - semiautomatic creation
- natural language processing (NLP): Stanford NLP
  - Stanford Parser
  - Stanford POS tagger
  - Stanford Named Entity Recognizer
- multi-language environment: Google Translate
- WordNet (synsets)
- Tool: Java, SWING, XML, jTidy, JAWS, SNLP, JUNG
An/DT integer/NN character/NN constant/NN has/VBZ type/NN int/NN ./.
<html><body><p>An integer character constant has type int.</p></body></html>
- input: TXT, HTML, PDF
- removal of occurrences of special characters using regular expressions
  - numeric designation of chapters and references
  - removal of single-letter prepositions: (\\s+[^Aa\\s\\.]{1})+\\s+
  - parentheses, dashes, and other
- translation into English (the tools work only with English text)
  - Google Translate
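The cleanup step above can be sketched as follows. This is a minimal illustration, not the tool's actual code (the tool itself is written in Java). Only `SINGLE_CHAR` comes from the slide, with the Java string escaping removed. The patterns `SECTION_NUMBER`, `BRACKET_REF`, and `PARENS`, and the function name `clean`, are assumptions made for the example.

```python
import re

# Pattern from the slide, without the Java string escaping: one or more
# single characters (other than A/a, whitespace, or a dot) standing alone.
SINGLE_CHAR = re.compile(r"(\s+[^Aa\s\.]{1})+\s+")

# The patterns below are assumptions for illustration only.
SECTION_NUMBER = re.compile(r"^\s*\d+(\.\d+)*\.?\s+", re.MULTILINE)  # "6.4.4 "
BRACKET_REF = re.compile(r"\[\d+\]")                                 # "[3]"
PARENS = re.compile(r"[()]")


def clean(text):
    """Strip chapter numbers, bracketed references, parentheses,
    and stand-alone single characters from raw text."""
    text = SECTION_NUMBER.sub("", text)
    text = BRACKET_REF.sub("", text)
    text = PARENS.sub(" ", text)
    text = SINGLE_CHAR.sub(" ", text)
    # Collapse the runs of spaces left behind by the substitutions.
    return re.sub(r"[ \t]+", " ", text).strip()


if __name__ == "__main__":
    print(clean("6.4.4 Constants\nAn integer (see [3]) v constant has type int."))
```

The character class excludes `A` and `a`, so the English article survives while single-letter prepositions from other languages are dropped. Single digits and stray dashes standing alone are removed by the same pattern.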
Stanford CoreNLP
- Stanford Parser, Stanford POS tagger, Stanford Named Entity Recognizer
- machine learning over large data, statistical model of maximum entropy
- learned models included
Activities
- tokenization
- sentence splitting
- POS tagging (part-of-speech)
- lemmatization
- NER (Named Entity Recognition)
<html><body><p>An integer character constant has type int.</p></body></html>
An/DT integer/NN character/NN constant/NN has/VBZ type/NN int/NN ./.
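The example above pairs an HTML input with its POS-tagged output. The sketch below handles both ends of that step in plain Python: it strips the markup from the input and splits `word/TAG` output into pairs. The tagging itself is done by the Stanford POS tagger and is not reproduced here. The names `TextExtractor`, `html_to_text`, and `parse_tagged` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string with whitespace normalized."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def parse_tagged(tagged):
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs.

    The split is on the last slash, so a token such as './.' still
    yields the pair ('.', '.').
    """
    pairs = []
    for item in tagged.split():
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs


if __name__ == "__main__":
    print(html_to_text("<html><body><p>An integer character constant has type int.</p></body></html>"))
    print(parse_tagged("An/DT integer/NN character/NN constant/NN has/VBZ type/NN int/NN ./."))
```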
- tokens marked by the POS tagger as nouns are the first concept candidates
  - one-word or multi-word nouns
- identifying a token as a concept by disambiguation from WordNet
  - assigning a synset: automatic, manual
  - using a domain term for searching
  - possible selection of an incorrect synset, one with another meaning
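The candidate selection above can be sketched as follows, assuming the input is a list of `(word, tag)` pairs with Penn Treebank tags. Every noun counts as a one-word candidate, and every run of adjacent nouns counts as a multi-word candidate. The function name `concept_candidates` is hypothetical. Lemmatization and the WordNet synset lookup are left out.

```python
from collections import Counter


def concept_candidates(tagged_tokens):
    """Count one-word and multi-word noun candidates in tagged tokens."""
    counts = Counter()
    run = []  # current run of adjacent nouns

    def flush():
        # A run of two or more nouns is also a multi-word candidate.
        if len(run) > 1:
            counts[" ".join(run)] += 1
        run.clear()

    for word, tag in tagged_tokens:
        if tag.startswith("NN"):  # NN, NNS, NNP, NNPS
            counts[word.lower()] += 1
            run.append(word.lower())
        else:
            flush()
    flush()
    return counts


if __name__ == "__main__":
    tokens = [("An", "DT"), ("integer", "NN"), ("character", "NN"),
              ("constant", "NN"), ("has", "VBZ"), ("type", "NN"),
              ("int", "NN"), (".", ".")]
    for candidate, count in concept_candidates(tokens).most_common():
        print(candidate, count)
```

Sorting the resulting counter by frequency gives a ranked candidate list. This is one way to arrive at a cut-off such as "the first 200 candidates".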
- unoriented / oriented
- unnamed / named
- WordNet: concept must have a synset
  - hyperonyms and hyponyms: IsA relations
  - holonyms and meronyms: partOf relations
  - relation orientation based on concept order
  - only direct relations
- from text
  - lexical-syntactic patterns
  - decomposition of multi-word terms: the right part of the term corresponds to an existing concept
    - assignment expression
    - assignment expression IsA expression
  - sentence syntax analysis: amod (adjectival modifier) from the parser, adjective followed by noun
    - integral type IsA type
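The decomposition rule for multi-word terms can be sketched as follows. If the right part of a term is already a known concept, the sketch proposes `term IsA right part`. The function name `isa_proposals` and the choice to prefer the longest matching right part are assumptions. The slides do not say how ties are resolved.

```python
def isa_proposals(terms, concepts):
    """Propose (term, 'IsA', head) for multi-word terms whose right part
    is an existing concept. Longer right parts are tried first."""
    proposals = []
    for term in terms:
        words = term.split()
        for start in range(1, len(words)):
            head = " ".join(words[start:])
            if head in concepts:
                proposals.append((term, "IsA", head))
                break
    return proposals


if __name__ == "__main__":
    known = {"expression", "type"}
    for proposal in isa_proposals(["assignment expression", "integral type", "pointer"], known):
        print(*proposal)
```

One-word terms produce no proposal because they have no right part to match.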
- ANSI/ISO C language
- comparison with an existing manually created ontology
- 2 experiments
  - all concept candidates
  - only the first 200 candidates
- 3 variants of each experiment
  - only candidates
  - candidates and IsA proposals
  - candidates, IsA proposals, and NER entities
Candidate    Count   Candidate     Count   Candidate        Count
type           645   argument        182   Behavior           149
Value          571   member          180   result             148
Character      529   String          180   Return             135
function       447   Stream          172   Macro              127
Pointer        329   Array           160   Declaration        119
Object         322   Sequence        160   Implementation     118
Expression     304   char            158   Conversion         111
Identifier     220   Operator        155   Integer            105
int            195   Number          155   File               102
operand        184   Description     155   Reference          100
Variant  Added      Items in model  Found concepts  Found / Items  Found / total in ontology  Found / can be found
All      -                    3137             395           13 %                       38 %                  73 %
All      IsA                  4519             450           10 %                       43 %                  84 %
All      IsA + NER            4558             465           10 %                       45 %                  86 %
200      -                     200              98           49 %                        9 %                  18 %
200      IsA                  1802             152            8 %                       15 %                  28 %
200      IsA + NER            1962             318           16 %                       31 %                  59 %
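The "Found / Items" column is the number of found concepts divided by the number of items in the model. The short check below recomputes that ratio from the counts in the table and compares it with the reported percentage. The figures are copied from the table. The names `ROWS`, `found_per_items`, and `check_rows` are hypothetical.

```python
# (added, items_in_model, found_concepts, reported_percent), copied from the table
ROWS = {
    "All": [("-", 3137, 395, 13), ("IsA", 4519, 450, 10), ("IsA + NER", 4558, 465, 10)],
    "200": [("-", 200, 98, 49), ("IsA", 1802, 152, 8), ("IsA + NER", 1962, 318, 16)],
}


def found_per_items(found, items):
    """Share of model items that match an ontology concept, in percent."""
    return 100.0 * found / items


def check_rows(rows=ROWS):
    """Return the (variant, added) rows whose recomputed ratio, rounded to a
    whole percent, differs from the reported value."""
    mismatches = []
    for variant, entries in rows.items():
        for added, items, found, reported in entries:
            if round(found_per_items(found, items)) != reported:
                mismatches.append((variant, added))
    return mismatches


if __name__ == "__main__":
    for variant, entries in ROWS.items():
        for added, items, found, _ in entries:
            print(f"{variant:>3} {added:<9} {found_per_items(found, items):5.1f} %")
    print("mismatches:", check_rows())
```

The table shows a trade-off. The plain 200-candidate variant has the highest share of matching items (49 %) but covers only 9 % of the ontology. Adding IsA proposals and NER entities raises coverage and lowers the share of matching items.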
Variant of the experiment without IsA relations, only with NER entities:
Variant    Items  Found  Concepts / Items  Concepts / total  Concepts / can be found
All + NER   3204    444            13.9 %            42.8 %                   82.4 %
200 + NER    360    265            73.6 %            25.5 %                   49.2 %
- concepts => lightweight ontology
- enables better automatic relation mining
Petr Šaloun, FEECS, VSB–Technical University of Ostrava
Petr Klimánek (was: Faculty of Science, University of Ostrava)
Zdenek Velart, FEECS, VSB–Technical University of Ostrava