Automated evaluation of crowdsourced annotations in the cultural heritage domain
+Automated Evaluation of Crowdsourced Annotations in the Cultural Heritage Domain
Archana Nottamkandath, Jasper Oosterman, Davide Ceolin and Wan Fokkink
VU University Amsterdam and TU Delft, The Netherlands
+Overview
Project Overview
Use case
Research Questions
Experiment
Results
Conclusion
+Context
• COMMIT Project: an ICT project in the Netherlands
– Subprojects: SEALINCMedia and Data2semantics
• Socially Enriched Access to Linked Cultural Media (SEALINCMedia)
– Collaboration with cultural heritage institutions to enrich their collections and make them more accessible
+Use case
CH institutions have large collections which are poorly annotated (Rijksmuseum Amsterdam: over 1 million items)
Lack of sufficient resources: knowledge, cost, labor
Solution: crowdsourcing
+Crowdsourcing Annotation Tasks
[Diagram: an annotator from the crowd provides annotations ("roses", "garden", "car") for an artefact (a painting or object); the annotations are then evaluated]
+Annotation evaluation
Manual evaluation is not feasible:
Institutions have large collections (Rijksmuseum: over 1 million items)
The crowd provides a large number of annotations
Evaluation costs time and money, and museums have limited resources
+Need for automated algorithms
Thus there is a need for algorithms that automatically evaluate annotations with good accuracy
+Previous approach
Building user profile and tracking user reputation based on semantic similarity
Tracking provenance information for users
Insight: a lot of data is provided, and meaningful information can be derived from it
Current approach: can we determine the quality of information based on its features?
+Research questions
Can we evaluate annotations based on properties of the annotator and the annotation?
Can we predict the reputation of an annotator based on annotator properties?
[Diagram: the tag "Roses" with annotation features (no typo, noun, in WordNet) and annotator features (age 25, male, arts degree)]
+Relevant features
Features of the annotation: annotator quality score, length, specificity, …
Features of the annotator: age, gender, education, tagging experience, …
+Semantic Representation
FOAF to represent annotator properties
Open Annotation model to represent annotations
+Experiment: Steve.museum dataset
We performed our evaluations on the Steve.museum dataset, an online dataset of images and annotations

Statistic                      Value
Provided tags                  45,733
Unique tags                    13,949
Tags evaluated as useful       39,931 (87%)
Tags evaluated as not-useful   5,802 (13%)
Annotators (registered)        1,218 / 488 (40%)
+Steve.museum annotation evaluation
The annotations in the Steve.museum project were evaluated into multiple categories; we classified the evaluations as either useful or not-useful
Original categories: usefulness-useful, judgement-positive, judgement-negative, problematic-foreign, problematic-typo, …, usefulness-not useful
+Identify relevant annotation properties
Manually select properties (F_man): is_adjective, is_english, in_wordnet
List of all possible properties (F_all): F_man + [created_day/hour, length, specificity, nrwords, frequency]
Apply a feature selection algorithm on F_all to choose properties (F_ml): feature selection algorithm from the WEKA toolkit
WEKA is a collection of machine learning algorithms for data mining tasks: http://www.weka.net.nz/
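Several of the properties listed above (length, nrwords, frequency) can be computed directly from the tag strings. A minimal sketch in Python (not the authors' implementation; dictionary-based checks such as is_english and in_wordnet would additionally need a lexicon):

```python
from collections import Counter

def annotation_features(tag, tag_counts):
    """Compute simple string-level features for one tag.

    tag_counts maps each tag to how often it was provided, so
    `frequency` corresponds to the F_all feature of the same name.
    """
    words = tag.split()
    return {
        "length": len(tag),            # characters in the tag
        "nrwords": len(words),         # number of words
        "frequency": tag_counts[tag],  # how often the crowd provided this tag
    }

tags = ["roses", "garden", "car", "car", "red roses"]
counts = Counter(tags)
features = [annotation_features(t, counts) for t in tags]
```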
+Build train and test data
Split the Steve dataset annotations into a train set and a test set
The train set has the features and the goal (quality); the test set has only the features
Fairness: the train set had 1,000 useful and 1,000 not-useful annotations

Train data:
Tag    Feature 1   Feature 2   …   Feature n   Quality
Rose   f1          f2          …   fn          Useful
House  f11         f12         …   f1n         Not-useful

Test data:
Tag    Feature 1   Feature 2   …   Feature n
Lily   f1          f2          …   fn
Sky    f11         f12         …   f1n
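The "fairness" step above amounts to class-balanced sampling. A minimal sketch, assuming annotations are (features, quality) pairs and with the per-class size as a parameter (the slides use 1,000 per class):

```python
import random

def balanced_train_test(annotations, per_class=1000, seed=42):
    """Split annotations into a class-balanced train set and a test set.

    `annotations` is a list of (features, quality) pairs with quality
    in {"useful", "not-useful"}.  The train set gets `per_class`
    examples of each class; all remaining examples go to the test set.
    """
    rng = random.Random(seed)
    useful = [a for a in annotations if a[1] == "useful"]
    not_useful = [a for a in annotations if a[1] == "not-useful"]
    rng.shuffle(useful)
    rng.shuffle(not_useful)
    train = useful[:per_class] + not_useful[:per_class]
    rng.shuffle(train)
    test = useful[per_class:] + not_useful[per_class:]
    return train, test
```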
+Machine learning
Apply machine learning techniques:
Learning: learn the relation between features and goal from the training set
Prediction: apply what was learned from the training set to the test set
Used SVM with the default PolyKernel in WEKA to predict the quality of annotations: commonly used, fast and resistant against over-fitting
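Outside WEKA, the same learn-then-predict workflow can be sketched with scikit-learn's SVC, using a polynomial kernel as a stand-in for WEKA's default PolyKernel (the feature vectors below are toy values, not the actual Steve.museum features):

```python
from sklearn.svm import SVC

# Toy annotation feature vectors [length, nrwords, in_wordnet]
# with quality labels from a (hypothetical) evaluated train set.
X_train = [[5, 1, 1], [6, 1, 1], [9, 2, 1],
           [3, 1, 0], [2, 1, 0], [4, 1, 0]]
y_train = ["useful", "useful", "useful",
           "not-useful", "not-useful", "not-useful"]

# Learning: fit the SVM on features and goal from the train set.
clf = SVC(kernel="poly", degree=1)
clf.fit(X_train, y_train)

# Prediction: apply the learned model to unseen feature vectors.
predictions = clf.predict([[7, 1, 1], [2, 1, 0]])
```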
+Results
The method is good at predicting useful tags, but not at predicting not-useful tags

Feature set   Class        Recall   Precision   F-measure
F_man         Useful       0.90     0.90        0.90
F_man         Not useful   0.20     0.21        0.20
F_all         Useful       0.75     0.91        0.83
F_all         Not useful   0.42     0.18        0.25
F_ml          Useful       0.20     0.98        0.34
F_ml          Not useful   0.96     0.13        0.23
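For reference, the per-class scores in tables like this one follow directly from the confusion counts; a minimal sketch of the standard definitions:

```python
def prf(tp, fp, fn):
    """Precision, recall and F-measure from confusion counts
    (tp = true positives, fp = false positives, fn = false negatives)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure
```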
+Identify relevant features of annotator
Are these features helpful to determine annotation quality and to predict annotator reputation?
[Diagram: an annotator with properties age 25, male, arts degree]
+Building annotator reputation
Probabilistic logic called Subjective Logic
Annotator opinion = (belief, disbelief, uncertainty)
(p, n) = (positive, negative) evaluations
belief = p / (p + n + 2)
disbelief = n / (p + n + 2)
uncertainty = 2 / (p + n + 2)
The expectation value E is the reputation:
E = belief + apriori × uncertainty, with apriori = 0.5
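The reputation computation above can be sketched directly in code (apriori fixed at 0.5, as in the slides):

```python
def reputation(p, n, apriori=0.5):
    """Annotator reputation as the Subjective Logic expectation value.

    p, n: numbers of positive / negative evaluations of the
    annotator's tags so far.
    """
    belief = p / (p + n + 2)
    uncertainty = 2 / (p + n + 2)
    return belief + apriori * uncertainty

# With no evaluations yet, the reputation equals the apriori of 0.5;
# positive evaluations push it towards 1, negative ones towards 0.
```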
+Identify relevant annotator properties
Manually identified properties: F_man = [community, age, education, experience, gender, tagging experience, …]
List of all properties: F_all = F_man + [vocabulary_size, vocab_diversity, is_anonymous, # annotations in WordNet]
Feature selection algorithm on F_all: F_ml_a for annotation quality, F_ml_u for annotator reputation
+Results
Trained an SVM on the features to make predictions

Feature set   Class        Recall   Precision   F-measure
F_man         Useful       0.29     0.90        0.44
F_man         Not useful   0.73     0.11        0.20
F_all         Useful       0.69     0.91        0.78
F_all         Not useful   0.43     0.15        0.22
F_ml_a        Useful       0.55     0.91        0.68
F_ml_a        Not useful   0.53     0.13        0.21
+Results
Used regression to predict reputation values based on the features of registered annotators
Since annotator reputation is highly skewed (90% of values > 0.7), we could not predict reputation successfully

Feature set   Correlation   RMS error   Mean abs. error   Rel. abs. error
F_man         -0.02         0.15        0.10              97.8%
F_all          0.22         0.13        0.09              95.1%
F_ml_u         0.29         0.13        0.09              90.4%
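The error measures in this table are the standard regression metrics; a minimal sketch of how RMS error, mean absolute error and relative absolute error (relative to a predict-the-mean baseline, as in WEKA) are computed from actual and predicted reputation values:

```python
import math

def regression_errors(actual, predicted):
    """RMS error, mean absolute error and relative absolute error.

    Relative absolute error compares the model's absolute errors to
    those of a baseline that always predicts the mean of `actual`.
    """
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    rms = math.sqrt(sum((a - p) ** 2
                        for a, p in zip(actual, predicted)) / n)
    mae = sum(abs_errors) / n
    rae = sum(abs_errors) / sum(abs(a - mean_actual) for a in actual)
    return rms, mae, rae
```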
+Evaluation
Possible reasons why the method was not successful for predicting not-useful annotations:
They are a minority (13% of the whole dataset)
A more in-depth analysis of features is needed to determine not-useful annotations
Requires study on different datasets
+Relevance
Our experiments show that the features of the annotator and the annotation correlate with the quality of annotations
With a small set of features we were able to predict 98% of the useful and 13% of the not-useful annotations correctly
Helps to identify which features are relevant to certain tasks
+Conclusions
Machine learning techniques help to predict useful evaluations, but not not-useful ones
Devised a model using SVM to predict annotation evaluations and annotator reputation
Used regression to predict annotator reputation
+Future work
Need to extract more in-depth information from both annotation and annotator
Need to build reputation of the annotator per topic
Apply the model on different use cases
+Thank you!
[email protected]