Content analysis and CERN Roman Chyla. Artificial intelligence Natural language processing Web of...
Semantic dictionary
• Link between infinite and finite domains
• Must be prepared (or at least revised) by humans
  – Purposeful
  – Incomplete
  – Constantly changing
• Very expensive to create/maintain
  – Solution? Use existing data!
Basic principles
• Keep it simple, stupid (I didn't want to believe it could work; it was too simple!)
• You can't get it 100% right
• Dictionary ~ Universal semantic language
  – Not really a language, but a taxonomy (not even an ontology)
  – Lacks expressiveness
  – Still very vague (but that is a feature, not a bug!)
  – Cannot infer from facts
BUT it is:
  – Simple to maintain
  – Ready to change and evolve, ready to accommodate other resources
  – Language independent
  – Problem of research question
  – Problem of universal and domain-specific taxonomy
Word sense disambiguation
• Homonyms are an obvious problem
• … and Seman can work with many definitions at the same time (think of 3 people and their definitions of one word)
• Possible solutions:
  – Disambiguation by harvested definitions
  – Rules
  – Neural network (supervised learning)
  – If problems are few, humans can decide
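A minimal sketch of the first option, disambiguation by harvested definitions, assuming a simplified gloss-overlap heuristic (in the spirit of the Lesk algorithm, not necessarily what Seman does). The SENSES table, the stop-word list and the function names are hypothetical and only illustrate the idea; when no definition overlaps the context, the function returns None so that a human can decide.

```python
import re

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "in", "or", "and", "is", "to"}

# Hypothetical "harvested" definitions: word -> {sense label: definition}.
SENSES = {
    "beam": {
        "physics": "stream of particles accelerated in a collider",
        "construction": "long piece of timber or steel supporting a roof",
    }
}


def tokenize(text):
    """Lowercase the text and return its set of non-stop-word tokens."""
    return {w for w in re.findall(r"[a-z]+", text.lower())} - STOP_WORDS


def disambiguate(word, context, senses=SENSES):
    """Pick the sense whose definition shares the most tokens with the context.

    Returns (sense_label, scores); sense_label is None when no definition
    overlaps the context at all, i.e. the case left for a human to decide.
    """
    ctx = tokenize(context)
    scores = {
        label: len(ctx & tokenize(definition))
        for label, definition in senses[word].items()
    }
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return None, scores
    return best, scores


if __name__ == "__main__":
    print(disambiguate("beam", "the proton beam of particles circulates in the collider"))
    print(disambiguate("beam", "a steel beam supporting the roof"))
```

Ties are resolved by dictionary order here, which is arbitrary; rules or a supervised classifier, as listed above, would be the natural next step when overlap counts are not decisive.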
So what I want to do…
• Prepare another semantic dictionary for HEP (using whatever I can) and for English in general (UDC + existing Seman)
• Differentiate HEP core and non-core
• Search corrections ("did you mean?")
• Search results categorization/facets
• Identify entities, data elements… make them available (this is mainly an IE task)
• Identification of topics (metrics of similarity between a document and "known characteristics")
• Keywording – identification of statistically significant occurrences of concepts (not words)
• Come up with faster ways to enrich the taxonomy
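The keywording item can be sketched roughly as follows, assuming that "significant" means a concept occurs in a document much more often than a background rate would predict. The CONCEPTS dictionary, the background rates, the z-score test (normal approximation to a binomial) and the threshold of 2.0 are all assumptions made for illustration, not part of the original slides.

```python
import math
from collections import Counter

# Hypothetical semantic dictionary: surface word -> concept label.
CONCEPTS = {
    "quark": "PARTICLE",
    "gluon": "PARTICLE",
    "lepton": "PARTICLE",
    "detector": "INSTRUMENT",
    "calorimeter": "INSTRUMENT",
}


def concept_counts(tokens, concepts=CONCEPTS):
    """Count concepts (not words): several words may map to one concept."""
    return Counter(concepts[t] for t in tokens if t in concepts)


def significant_concepts(doc_tokens, background_rate, threshold=2.0):
    """Return {concept: z-score} for concepts over-represented in the document.

    background_rate maps each concept to its assumed per-token probability
    in a reference corpus. The z-score compares the observed count with the
    count expected under that rate.
    """
    n = len(doc_tokens)
    result = {}
    for concept, observed in concept_counts(doc_tokens).items():
        p = background_rate.get(concept)
        if p is None or not 0.0 < p < 1.0:
            continue  # no usable background estimate for this concept
        expected = n * p
        z = (observed - expected) / math.sqrt(n * p * (1.0 - p))
        if z >= threshold:
            result[concept] = z
    return result


if __name__ == "__main__":
    doc = "the quark and gluon couple while the detector records quark jets".split()
    rates = {"PARTICLE": 0.01, "INSTRUMENT": 0.2}  # hypothetical corpus rates
    print(significant_concepts(doc, rates))
```

Counting at the concept level means that "quark" and "gluon" reinforce each other even though neither word alone is frequent, which is the point of keywording on concepts rather than words.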