Content analysis and CERN Roman Chyla. Artificial intelligence Natural language processing Web of...

Content analysis and CERN

Roman Chyla

Artificial intelligence

Natural language processing

Web of data

Content analysis

Semantic Web

Information extraction

A lot to do…

Semantic dictionary

• Link between infinite and finite domains

• Must be prepared (or at least revised) by humans
  – Purposeful
  – Incomplete
  – Constantly changing

• Very expensive to create/maintain
  – Solution? Use existing data!

Basic principles

• Keep it simple, stupid (I didn't want to believe it could work, it was too simple!)

• You can't get it 100% right
• Dictionary ~ Universal semantic language

  – Not really a language, but a taxonomy (not even an ontology)
  – Lacks expressiveness
  – Still very vague (but that is a feature, not a bug!)
  – Cannot infer from facts

BUT it is:
  – Simple to maintain
  – Ready to change and evolve, ready to accommodate other resources
  – Language independent
  – Problem of research question
  – Problem of universal and domain-specific taxonomy
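A minimal sketch of what such a taxonomy-style dictionary could look like: concepts carry an identifier, a broader (parent) concept and labels in several languages, and any label resolves to a concept ID. Every identifier, label and function name here is hypothetical and only illustrates the idea; none of it is taken from Seman.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Concept:
    """One node of a hypothetical taxonomy: an ID, a parent and labels."""
    concept_id: str
    broader: Optional[str] = None  # parent concept ID, None for a root
    labels: Dict[str, List[str]] = field(default_factory=dict)  # language -> terms


# Invented sample data, for illustration only.
TAXONOMY = {
    "C1": Concept("C1", None, {"en": ["particle"], "cs": ["částice"]}),
    "C2": Concept("C2", "C1", {"en": ["boson"], "cs": ["boson"]}),
    "C3": Concept("C3", "C2", {"en": ["higgs boson", "higgs particle"]}),
}


def build_index(taxonomy):
    """Map every lower-cased label, in any language, to the concept IDs it names."""
    index = {}
    for concept in taxonomy.values():
        for terms in concept.labels.values():
            for term in terms:
                index.setdefault(term.lower(), set()).add(concept.concept_id)
    return index


def ancestors(taxonomy, concept_id):
    """Follow the broader-than links from a concept up to the root."""
    path = []
    current = taxonomy[concept_id].broader
    while current is not None:
        path.append(current)
        current = taxonomy[current].broader
    return path


INDEX = build_index(TAXONOMY)
```

Because a label maps to a set of concept IDs, one word can point to several concepts at once, and the structure stays a plain hierarchy with no inference rules.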

Word sense disambiguation

• Homonyms are an obvious problem
• … and Seman can work with many definitions at the same time (think of 3 people and their definitions of one word)

• Possible solutions:
  – Disambiguation by harvested definitions
  – Rules
  – Neural network (supervised learning)
  – If problems are few, humans can decide

cat

http://en.wikipedia.org/wiki/Cat_(disambiguation)
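Disambiguation by harvested definitions can be sketched as a simple word-overlap score between the context of the ambiguous word and each candidate definition (a simplified Lesk-style approach, used here as an assumption and not necessarily what Seman does). The definitions, the stopword list and the function names are invented for illustration.

```python
import re

# Hypothetical harvested definitions for the homonym "cat"; invented for illustration.
DEFINITIONS = {
    "cat (animal)": "small domesticated carnivorous mammal kept as a pet",
    "cat (Unix)": "command that reads files and prints their contents to standard output",
    "CAT (medicine)": "computed axial tomography scan producing images of the body",
}

STOPWORDS = {"a", "an", "and", "as", "in", "of", "on", "that", "the", "their", "to", "with"}


def tokens(text):
    """Lower-cased set of words with a tiny stopword list removed."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def disambiguate(context, definitions):
    """Rank senses by how many words the context shares with each definition."""
    context_words = tokens(context)
    scored = [(len(context_words & tokens(text)), sense)
              for sense, text in definitions.items()]
    # Highest overlap first; low or tied scores can be passed on to a human.
    return sorted(scored, key=lambda pair: -pair[0])


ranking = disambiguate("use cat to print the contents of several files", DEFINITIONS)
best_score, best_sense = ranking[0]
```

The full ranking is kept rather than only the winner, so that close or zero scores can fall back to rules or to a human decision.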

So what I want to do…

• Prepare another semantic dictionary for HEP (using whatever I can) and for English in general (UDC + existing Seman)

• Differentiate HEP core and non-core
• Search corrections (did you mean?)
• Search results categorization/facets
• Identify entities, data elements… make them available (this is mainly an IE task)
• Identification of topics (metrics of similarity between document and "known characteristics")
• Keywording – identification of statistically significant occurrences of concepts (not words)
• Come up with faster ways to enrich the taxonomy
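The "did you mean?" search correction could be prototyped with fuzzy matching against the terms of the dictionary. The sketch below uses `difflib.get_close_matches` from the Python standard library; the vocabulary, the cutoff value and the function name are assumptions made for illustration.

```python
import difflib

# Hypothetical vocabulary; in practice it would come from the semantic dictionary.
VOCABULARY = ["boson", "fermion", "hadron", "lepton", "neutrino", "quark"]


def did_you_mean(query, vocabulary, cutoff=0.75):
    """Replace each unknown query word by its closest known term, if one is close enough."""
    known = set(vocabulary)
    corrected = []
    for word in query.lower().split():
        if word in known:
            corrected.append(word)
            continue
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        # Words with no sufficiently similar term are left untouched.
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)


suggestion = did_you_mean("neutrno mass", VOCABULARY)
```

A fairly high cutoff keeps ordinary words that are simply missing from the vocabulary (such as "mass" here) from being rewritten into unrelated terms.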

• Semantic dictionary

• Did you mean?

• IE engine

• (Bibclassify)

Thank you for your attention.

Questions?
