SYMPOSIUM ON BIAS AND DIVERSITY IN IR
A TESTBED FOR DIVERSIFICATION IN SEARCH
Koblenz, August 31, 2011
Michael Matthews, Barcelona Media/Yahoo! Research
OVERVIEW
• Introduction to LivingKnowledge Testbed
  – The Diversity Engine
• Getting started
  – Our first application!
• Adding text analysis
• Adding multimedia analysis
• Evaluation
• Indexing and search
• Developing applications
• Future work
DIVERSITY ENGINE
• Provides collections, annotation tools and an evaluation framework to allow for collaborative and comparable research
• Supports indexing and searching on a wide variety of document annotations including entities, bias, trust, polarity, and multimedia features
• Supports development of bias and diversity aware applications
ARCHITECTURE
[Architecture diagram. Components shown: Document Collections (Yahoo! News, ARC Crawls, NYT), Analysis Pipeline, Index/Search, Application Development, Evaluation Framework.]
• Prediction of Community Acceptance
• Sentiment in Comments ↔ Comment Ratings
• Polarizing Videos ↔ Distribution of Ratings
• Topic of Videos ↔ Distribution of Ratings
DESIGN DECISIONS
• Use open source tools when available
• Programming language – Java 1.6
• Data format – LK XML
• Analysis tools operating system – Linux (tools may be written in any language)
• Indexing/search – Solr
• GUI – JSP, HTML, JavaScript, CSS
LK XML FORMAT
DOCUMENT COLLECTIONS
• Supported formats – ARC (Internet Memory crawls), text, HTML, Kyoto, BBN, NYT
• Collections
  – Testing examples included with Diversity Engine
  – Large ARCs available from Internet Memory
  – Converters provided for other collections (MPQA, BBN, NYT) that have licensing restrictions
ANALYSIS MODULES
[Diagram of analysis modules, grouped under Image Annotation Processing, Image Processing, Text Processing and Text Annotation Processing. Modules shown: Face Detection, Naturalness, Colourfulness, SIFT Features, City/Landscape, Tone, Photomontage, Face Tampering, Photo/Cartoon/CG Annotations, SentimentHistogram, Sentence Subjectivity, Syntax & Semantics, POS, OpenNLP Entities, SuperSense Tagger, Vector Quantisation, Dictionary, Phrases, Quotes, Disambiguated Entities, Document Layout, RDFa Injection, Readability4J, TimeML, Statements, Subjective Expressions, URLs, Wikipedia People, Wikipedia Places, EXIF Image Clustering.]
INDEXING/SEARCH
• Solr
  – Enterprise search platform built on top of Lucene
  – XML input and output allows for easy integration with Diversity Engine
  – Plug-in framework allows customization
  – Built-in facet capabilities support indexing and searching on annotations
• Integration
  – Converter from LK XML to Solr XML
  – Plug-in for facet ranking and speed improvements
APPLICATION DEVELOPMENT
• Basis for LivingKnowledge applications
  – Future Predictor
  – Media Content Analysis
• Support development – coding required!
• Real world problems
  – HTML extraction
  – Scaling to large collections
  – Provenance
  – Some pluggable GUI components
  – Examples to ease learning curve
EVALUATION FRAMEWORK
• Framework for the evaluation of analysis tools
• Evaluates any possible annotation pipeline
• Measures correctness and quality
• Outputs precision and recall
• Compares annotation output of pipeline with ground truth data
OUR FIRST APPLICATION
• Download Diversity Engine release from SourceForge
• tar xzvf [release file]
• cd testbed
• ant build
• apps/testbed conf/testbed/tutorial-application.xml
• What happened?
  – 197 text files and 127 image files converted from ARC format to LK XML and stored in devapps/example/data/lkxml
  – 2 annotators were run over the collection
    • OpenNLP for tokenization, sentence splitting, POS tags
    • SST named entity recognizer
    • Results stored in devapps/example/data/lkxml
  – Files were converted to Solr XML format and indexed using Solr
    • Solr XML stored in devapps/example/data/solr
  – HTML visualization files stored in devapps/example/data/html
• ant deploy-testbed
  – Solr running at http://localhost:8983/solr/
  – Example app running at http://localhost:8983/testbed/
EXAMPLE SOLR OUTPUT
http://localhost:8983/solr/select/?q=putin
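The query URL above can also be assembled programmatically. The following is a minimal sketch, not part of the Diversity Engine: the helper name `solr_select_url` is hypothetical, and the facet fields `per` and `loc` are borrowed from the tutorial configuration. It relies only on the standard Solr request parameters `q`, `rows`, `facet` and `facet.field`.

```python
from urllib.parse import urlencode


def solr_select_url(base, query, facet_fields=(), rows=10):
    """Build a Solr /select URL, optionally requesting facet counts.

    `base` is the Solr root URL, e.g. http://localhost:8983/solr/
    (hypothetical helper, for illustration only).
    """
    params = [("q", query), ("rows", str(rows))]
    if facet_fields:
        # facet=true switches faceting on; facet.field may be repeated.
        params.append(("facet", "true"))
        params.extend(("facet.field", field) for field in facet_fields)
    return base.rstrip("/") + "/select/?" + urlencode(params)


if __name__ == "__main__":
    print(solr_select_url("http://localhost:8983/solr/", "putin", ["per", "loc"]))
```

Requesting facets in the same call returns the per-field value counts alongside the matching documents, which is what a faceted results page needs.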
EXAMPLE APPLICATION
http://localhost:8983/testbed/results.jsp?query=putin
EXAMPLE DOCUMENT
CONFIGURATION FILE
<lk-application logDir="log" appDir="devapps/example">
  <corpus dir="corpora/examples/smallarc" format="arc"/>
  <image-pipeline>
    <annotators>
    </annotators>
  </image-pipeline>
  <pipeline>
    <annotators>
      <annotator exec="./opennlp"/>
      <annotator exec="./sst"/>
    </annotators>
  </pipeline>
  <visualize/>
  <indexer solrHomeDir="solr/solr" solrDataDir="solr/solr/data"
           converter="conf/testbed/tutorial-lk2solr.xml"/>
  <searcher appTitle="LivingKnowledge - Example Application"
            appShortTitle="Example Application"
            appUrl="http://localhost:8983/solr/">
    <facets>
      <facet field="per" description="Person"/>
      <facet field="loc" description="Location"/>
    </facets>
  </searcher>
</lk-application>
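To see how the parts of the configuration file relate, it can help to read it back with a generic XML parser. The sketch below is illustrative only and is not how the testbed itself loads the file: the function `summarize` is a hypothetical name, and the embedded XML is a trimmed copy of the tutorial configuration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the tutorial configuration, embedded for illustration.
CONFIG = """
<lk-application logDir="log" appDir="devapps/example">
  <corpus dir="corpora/examples/smallarc" format="arc"/>
  <pipeline>
    <annotators>
      <annotator exec="./opennlp"/>
      <annotator exec="./sst"/>
    </annotators>
  </pipeline>
  <searcher appTitle="LivingKnowledge - Example Application">
    <facets>
      <facet field="per" description="Person"/>
      <facet field="loc" description="Location"/>
    </facets>
  </searcher>
</lk-application>
"""


def summarize(config_xml):
    """Return (corpus format, annotators in run order, facet field -> label)."""
    root = ET.fromstring(config_xml)
    annotators = [a.get("exec")
                  for a in root.findall("./pipeline/annotators/annotator")]
    facets = {f.get("field"): f.get("description") for f in root.iter("facet")}
    return root.find("corpus").get("format"), annotators, facets


if __name__ == "__main__":
    print(summarize(CONFIG))
```

The annotator list is returned in document order, matching the deck's statement that annotators run sequentially in the order listed.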
TEXT ANALYSIS
<pipeline>
  <annotators>
    <annotator exec="./opennlp"/>
    <annotator exec="./sst"/>
  </annotators>
</pipeline>

<pipeline>
  <annotators>
    <annotator exec="./opennlp"/>
    <annotator exec="./sst"/>
    <annotator exec="./facts"/>
    <annotator exec="./unitn_tagger"/>
    <annotator exec="./unitn_subjexpr"/>
  </annotators>
</pipeline>

apps/testbed –run pipeline conf/testbed/tutorial-application.xml
apps/testbed –run visualization conf/testbed/tutorial-application.xml
TEXT ANALYSIS - FACTS
devapps/example/data/lkxml/EA-EUElections2009-euobserver-0729-20090729085530-00000.arc.15521713.facts.xml
TEXT ANALYSIS - FACTS
devapps/example/data/html/EA-EUElections2009-euobserver-0729-20090729085530-00000.arc.15521713.html
IMAGE ANALYSIS
<pipeline>
  <annotators>
    <annotator exec="./opennlp"/>
    <annotator exec="./sst"/>
    <annotator exec="./facts"/>
    <annotator exec="./unitn_tagger"/>
    <annotator exec="./unitn_subjexpr"/>
    <annotator exec="./imageannots"/>
  </annotators>
</pipeline>

<image-pipeline>
  <annotators>
    <annotator exec="./soton_haarfacedetector"/>
  </annotators>
</image-pipeline>

apps/testbed –run pipeline,image-pipeline –pipeline imageannots conf/testbed/tutorial-application.xml
ls devapps/example/data/lkxml/img/*
ANALYSIS API
• Documents in LK XML format
• Annotators are passed a single document directory
  – They should add annotations for each document in the directory
• Files will have a consistent naming convention
  – LkText file = id + ".lktext.xml"
  – LkMedia = id + ".lkmedia.xml"
  – LkAnnotation = id + "." + annotatorId + ".xml"
• Annotators will be processed sequentially in the order listed in the XML file
• Annotators can be written in any language but must run on Linux
  – Helper classes will exist for Java, but there is no obligation to use them.
• Add application calling your new annotator to apps directory
• Add your application to the configuration file as before
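Because annotators only need to honour the file naming convention and accept a document directory, the bookkeeping can be sketched in a few lines of any language. The Python below is an assumption-laden illustration: all function names are hypothetical, it does not read or write real LK XML, and it only shows how document ids and output file names follow from the convention listed above.

```python
import os

TEXT_SUFFIX = ".lktext.xml"


def lk_text_name(doc_id):
    """File holding the document text: id + ".lktext.xml"."""
    return doc_id + TEXT_SUFFIX


def lk_media_name(doc_id):
    """File holding media information: id + ".lkmedia.xml"."""
    return doc_id + ".lkmedia.xml"


def lk_annotation_name(doc_id, annotator_id):
    """File an annotator writes: id + "." + annotatorId + ".xml"."""
    return doc_id + "." + annotator_id + ".xml"


def document_ids(directory):
    """Ids of all documents in a directory, derived from the LkText files."""
    return sorted(name[:-len(TEXT_SUFFIX)]
                  for name in os.listdir(directory)
                  if name.endswith(TEXT_SUFFIX))


def expected_outputs(directory, annotator_id):
    """Paths an annotator with this id would be expected to create."""
    return [os.path.join(directory, lk_annotation_name(doc_id, annotator_id))
            for doc_id in document_ids(directory)]
```

For example, a document id ending in `.arc.15521713` combined with the annotator id `facts` yields a file name ending in `.arc.15521713.facts.xml`, the same shape as the facts file shown in the text analysis slides.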
ANALYSIS API – JAVA
• Extend class org.diversityengine.annotator.AbstractAnnotator
• Implement methods
  – getName()
  – getType() - TEXT or IMAGE
• For image analysis implement
  – LkAnnotation getLkAnnotation(ImageDocument document)
• For text analysis implement
  – LkAnnotation getLkAnnotation(TextDocument document)
• In main, instantiate and call annotator
  – NewAnnotator annotator = new NewAnnotator();
  – annotator.processDirectory(args[0]);
• Add application calling your new annotator to apps directory
• Add your application to the configuration file as before
EVALUATION
Evaluation works with the same configuration file. Simply add an evaluation element.

<lk-application logDir="log" appDir="devapps/evaluation">
  <corpus dir="corpora/evaluation/sst/text/" format="bbn"/>
  <pipeline>
    <annotators>
      <annotator exec="./sst"/>
    </annotators>
  </pipeline>
  <evaluation evalDir="evaluation/sst/">
    <evaluator provides="ENTITIES"
               goldDir="corpora/evaluation/sst/gold/"
               goldAnnotator="sstgold"
               annotator="sst" />
  </evaluation>
</lk-application>

apps/testbed conf/evaluation/sst.xml
EVALUATION RESULTS
cat evaluation/sst/sst.ENTITIES.xml

<evaluation goldDir="/home/mikemat/code/livingknowledge/WP6/testbed/corpora/evaluation/sst/gold/"
            lkDir="/home/mikemat/code/livingknowledge/WP6/testbed/devapps/evaluation/data/lkxml"
            annotation="sst" goldAnnotation="sstgold" provides="ENTITIES">
  <docs>
    <doc id="WSJ0375" N="19" tp="18" fp="1" fn="1" />
    <doc id="WSJ0380" N="19" tp="15" fp="4" fn="1" />
    <doc id="WSJ0376" N="72" tp="61" fp="11" fn="7" />
    <doc id="WSJ0377" N="26" tp="17" fp="9" fn="6" />
    <doc id="WSJ0378" N="10" tp="10" fp="0" fn="0" />
    <doc id="WSJ0379" N="24" tp="19" fp="5" fn="2" />
  </docs>
  <totals N="170" tp="140" fp="30" fn="17" p="0.8235294117647058" r="0.89171974522293" f="0.8562691131498471" />
</evaluation>
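The totals row is consistent with the standard definitions of precision, recall and F1 applied to the summed counts (tp = 140, fp = 30, fn = 17). A small sketch of that arithmetic follows; the function name is hypothetical, and it is assumed rather than confirmed that the testbed computes its totals exactly this way.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall and F1 from true/false positive/negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Summed counts from the <totals> element of the evaluation output.
    p, r, f = precision_recall_f1(tp=140, fp=30, fn=17)
    print(round(p, 4), round(r, 4), round(f, 4))
```

Here precision is 140/170, recall is 140/157 and F1 is 280/327, which agree with the `p`, `r` and `f` attributes reported in the totals.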
INDEXING AND SEARCH
• Search engines – traditional
  – Bag-of-words representation
  – Inverted index (words -> documents) for efficiency
  – 10 docs ranked according to tf-idf similarity with the query
• Search engines – today
  – Much metadata associated with documents
  – Ranking based on 100s of features (date, location, PageRank, click data, personalization, etc.)
  – Richer display
    • Facets for exploratory search
    • Answers when appropriate
    • etc.
  – Many open source options - Lucene/Solr most widely used
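The traditional model in the first bullet can be illustrated with a toy inverted index. This is a deliberately simplified sketch, not how Lucene works internally: the function names are hypothetical, tokenization is plain whitespace splitting, and the score is a raw sum of tf × idf rather than a normalized similarity.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """docs: {doc_id: text}. Returns {term: {doc_id: term frequency}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def search(index, n_docs, query, k=10):
    """Rank documents by summed tf * idf over the query terms; return top k ids."""
    scores = Counter()
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        # Rarer terms (fewer postings) get a higher inverse document frequency.
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return [doc_id for doc_id, _ in scores.most_common(k)]


if __name__ == "__main__":
    docs = {"d1": "putin visits berlin",
            "d2": "putin putin speech",
            "d3": "berlin weather"}
    print(search(build_index(docs), len(docs), "putin"))
```

Only the postings of the query terms are touched at search time, which is the efficiency argument for the inverted index.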
APACHE LUCENE/SOLR
[Diagram: Lucene/Solr]
FACETED SEARCH
[Diagram by Yonik Seeley]
FACETED SEARCH
• Summarize query results by aggregating properties of the returned pages
  – price ranges for a product query
  – related people or locations for a news query
• Exploratory search
  – Show documents that match the query term and a selected facet
  – Make inferences not clear from a simple document list
• LivingKnowledge analysis is modeled very well by facets
  – Topics as determined by entity and fact extraction
  – Location and time diversity dimensions
  – Opinions as determined by opinion extraction
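The two operations described above, aggregating a property over the result set and then narrowing by a selected value, can be sketched on in-memory data. This is an illustration only: the document dictionaries and function names are hypothetical, and Solr computes facet counts inside the index rather than by scanning results like this.

```python
from collections import Counter


def facet_counts(results, field):
    """Count how many result documents carry each value of `field`."""
    counts = Counter()
    for doc in results:
        # Count each value once per document, even if it is listed twice.
        counts.update(sorted(set(doc.get(field, []))))
    return counts.most_common()


def drill_down(results, field, value):
    """Keep only the results that have `value` for `field`."""
    return [doc for doc in results if value in doc.get(field, [])]


if __name__ == "__main__":
    # Hypothetical results for a news query, annotated with person entities.
    results = [{"id": "a", "per": ["Putin", "Merkel"]},
               {"id": "b", "per": ["Putin"]},
               {"id": "c", "loc": ["Moscow"]}]
    print(facet_counts(results, "per"))
    print([d["id"] for d in drill_down(results, "per", "Merkel")])
```

The counts give the summary view (who appears most often in the results), while the drill-down supports the exploratory step of selecting one facet value.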
LK XML TO SOLR
• Solr has a well-defined XML input format for adding new documents
• Diversity Engine provides a simple language to map LK XML to Solr XML
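For orientation, Solr's XML update format wraps each document in `<add><doc>` and carries every value in a `<field name="...">` element, repeating the element for multi-valued fields. The sketch below builds such a message by hand. It is not the Diversity Engine converter: the function name is hypothetical, and the use of `id` as the unique key field is an assumption that depends on the Solr schema.

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(doc_id, fields):
    """Serialize one document as a Solr <add> message.

    fields: {solr field name: list of values}; multi-valued fields
    simply repeat the <field> element.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # Assumption: the schema's unique key field is called "id".
    ET.SubElement(doc, "field", name="id").text = doc_id
    for name, values in fields.items():
        for value in values:
            ET.SubElement(doc, "field", name=name).text = value
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    print(to_solr_add_xml("doc1", {"per": ["Vladimir Putin"], "loc": ["Moscow"]}))
```

A mapping language such as the one described above then only has to say which annotation supplies the values for each Solr field name.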
LK2SOLR CONVERSION
<lktosolr>
  <field solr="per" annotation="ENTITIES_CLEAN" value="$text"
         filter="org.diversityengine.solr.converter.filters.PerValueFilter"/>
  <field solr="loc" annotation="ENTITIES_CLEAN" value="$text"
         filter="org.diversityengine.solr.converter.filters.LocValueFilter"/>
  <field solr="keywords" annotation="TOP_ENTITIES" value="$text" />
  <field solr="pubdate" annotation="metainfo:lktext" value="date" type="date"/>
</lktosolr>

solr – Name of the field in Solr
annotation – Name of the LK XML annotation
value – Value of the annotation
filter – Allows post-processing on the annotation
type – Only date supported currently

<indexer solrHomeDir="solr/solr" solrDataDir="solr/solr/data"
         converter="conf/testbed/tutorial-lk2solr.xml"/>
ADDING FACTS TO INDEX
<lktosolr>
  <field solr="per" annotation="ENTITIES_CLEAN" value="$text"
         filter="org.diversityengine.solr.converter.filters.PerValueFilter"/>
  <field solr="loc" annotation="ENTITIES_CLEAN" value="$text"
         filter="org.diversityengine.solr.converter.filters.LocValueFilter"/>
  <field solr="keywords" annotation="TOP_ENTITIES" value="$text" />
  <field solr="pubdate" annotation="metainfo:lktext" value="date" type="date"/>
  <field solr="yago" annotation="yago-entities" value="$text" />
  <field solr="yago-country" annotation="facts"
         value="xpath:/entity-information[facts/type/text()= 'wordnet_country_108544813']/id/text()" />
</lktosolr>

apps/testbed –run convert-solr conf/testbed/tutorial-application.xml
ls devapps/example/data/solr/*
apps/testbed –run index conf/testbed/tutorial-application.xml
FACTS TO SOLR
<field solr="yago" annotation="yago-entities" value="$text" />
FACTS TO SOLR
<field solr="yago-country" annotation="facts"
       value="xpath:/entity-information[facts/type/text()= 'wordnet_country_108544813']/id/text()" />
ADDING IMAGES TO INDEX
<lktosolr>
  <field solr="per" annotation="ENTITIES_CLEAN" value="$text"
         filter="org.diversityengine.solr.converter.filters.PerValueFilter"/>
  <field solr="loc" annotation="ENTITIES_CLEAN" value="$text"
         filter="org.diversityengine.solr.converter.filters.LocValueFilter"/>
  <field solr="keywords" annotation="TOP_ENTITIES" value="$text" />
  <field solr="yago" annotation="yago-entities" value="$text" />
  <field solr="yago-country" annotation="facts"
         value="xpath:/entity-information[facts/type/text() ='wordnet_country_108544813']/id/text()" />
  <field solr="pubdate" annotation="metainfo:lktext" value="date" type="date"/>
  <field solr="image" annotation="IMAGE_ANNOTS" value="$text" />
  <field solr="bestimage" annotation="BEST_IMAGES" value="$text" />
</lktosolr>

apps/testbed –run convert-solr conf/testbed/tutorial-application.xml
ls devapps/example/data/solr/*
apps/testbed –run index conf/testbed/tutorial-application.xml
APPLICATION DEVELOPMENT
• Examples
• HTML extraction
• Scaling to large collections
• Provenance
• Some pluggable GUI components
FACT/IMAGE APPLICATION
<searcher appTitle="LivingKnowledge - Example Application"
          appShortTitle="Example Application"
          appUrl="http://localhost:8983/solr/">
  <facets>
    <facet field="yago" description="Yago"/>
    <facet field="yago-country" description="Country"/>
    <facet field="per" description="Person"/>
    <facet field="loc" description="Location"/>
    <facet field="image" description="Images"/>
  </facets>
</searcher>

ant deploy-testbed
FACT/IMAGE APPLICATION
http://localhost:8983/testbed/results.jsp?query=putin
OPINION APPLICATION
Opinions are at sentence level, not document level – same analysis, but different indexing

cat conf/testbed/tutorial-lk2solr-sentence.xml

<lktosolr solrDoc="SENTENCES" contextSize="1">
  <field solr="per" annotation="ENTITIES_CLEAN" value="$text"
         filter="org.diversityengine.solr.converter.filters.PerValueFilter" source="solrdoc" />
  <field solr="loc" annotation="ENTITIES_CLEAN" value="$text"
         filter="org.diversityengine.solr.converter.filters.LocValueFilter" source="solrdoc" />
  <field solr="keywords" annotation="TOP_ENTITIES" value="$text" />
  <field solr="yago" annotation="yago-entities" value="$text" source="solrdoc" />
  <field solr="image" annotation="IMAGE_ANNOTS" value="$text" />
  <field solr="bestimage" annotation="BEST_IMAGES" value="$text" />
  <field solr="pubdate" annotation="metainfo:lktext" value="date" type="date"/>
  <field solr="polarity" annotation="MPQA-expressive-subjectivity,MPQA-direct-subjective"
         value="xpath:/node()[@pol]/@pol" source="solrdoc"
         filter="org.diversityengine.solr.converter.filters.PolarityValueFilter"/>
  <field solr="pol-int" annotation="MPQA-expressive-subjectivity,MPQA-direct-subjective"
         value="xpath:concat(/node()[@pol and @int]/@pol,/node()[@int and @pol]/@int)"
         source="solrdoc"/>
</lktosolr>

apps/testbed –run convert-solr,index conf/testbed/tutorial-application-sentence.xml
ls devapps/example/data/solr/*
SOLR XML – SENTENCE
OPINION APPLICATION
modify webapp/WEB-INF/web.xml

<web-app xmlns="http://java.sun.com/xml/ns/javaee"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd"
         version="2.5">
  <description>LivingKnowledge Testbed Example Application</description>
  <display-name>Testbed Examples</display-name>
  <context-param>
    <param-name>applicationDef</param-name>
    <param-value>conf/testbed/tutorial-application-sentence.xml</param-value>
    <description>The Living Knowledge application description XML file</description>
  </context-param>
</web-app>

ant deploy-testbed
OPINION APPLICATION
http://localhost:8983/testbed/results.jsp?query=putin
HTML EXTRACTION
[Annotated web page with regions labeled: Headline, Main Article, Other Stuff]
HTML EXTRACTION
• Boilerplate can lead to false positive results and inaccurate facet aggregation
  – Real example – before extraction was developed, the most common person for most queries was in a top story title (on all pages) the day of the crawl!
• Titles, authors and dates are important for bias and diversity aware search
PROVENANCE
• How an annotation is derived is often as important as the annotation itself
  – Users want to verify results
  – Developers need to validate results
• Open Provenance provides an open source solution
• Testbed annotations can be extended with Open Provenance chains
Provenance Diagram
SCALING TO LARGE COLLECTIONS
• In the real world, even "small" datasets have millions of documents
• NLP/image processing is expensive
  – 1 doc/sec ≈ 11.6 days for 1 million docs!
• Hadoop Mapper allows for scaling
  – scales linearly with number of machines
• ZipCollection writer allows partitioning data into subsets for processing
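The processing-time figure above is simple arithmetic: one million documents at one document per second is one million seconds, and a day has 86,400 seconds. The sketch below makes the calculation explicit; the function name is hypothetical, and the division by the number of machines assumes the ideal linear scaling stated above, ignoring any job overhead.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def processing_days(n_docs, docs_per_second, machines=1):
    """Days needed to process n_docs, assuming ideal linear scaling across machines."""
    return n_docs / (docs_per_second * machines) / SECONDS_PER_DAY


if __name__ == "__main__":
    print(round(processing_days(1_000_000, 1.0), 2))               # one machine
    print(round(processing_days(1_000_000, 1.0, machines=10), 2))  # ten machines
```

Under these assumptions, ten machines bring the same job down from roughly eleven and a half days to a little over one day.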
COMPONENTS - OPINIONS
COMPONENTS - TIME
COMPONENTS - GEO
FUTURE WORK
• More components
• Maven to manage dependencies
• Better integration of Timeline and Geo visualization components
• Integration of ranking algorithms
• Better documentation :)
Thanks!
• LivingKnowledge Partners!
• You for coming!!
• Questions?