Automated evaluation of crowdsourced annotations in the cultural heritage domain
+Automated Evaluation of Crowdsourced Annotations in the Cultural Heritage Domain
Archana Nottamkandath, Jasper Oosterman, Davide Ceolin and Wan Fokkink
VU University Amsterdam and TU Delft, The Netherlands
+Overview
Project Overview
Use case
Research Questions
Experiment
Results
Conclusion
+Context
• COMMIT Project: an ICT project in the Netherlands
– Subprojects: SEALINCMedia and Data2semantics
• Socially Enriched Access to Linked Cultural Media (SEALINCMedia)
– Collaboration with cultural heritage institutions to enrich their collections and make them more accessible
+Use case
CH institutions have large collections which are poorly annotated (Rijksmuseum Amsterdam: over 1 million items)
Lack of sufficient resources: knowledge, cost, labor
Solution: crowdsourcing
+Crowdsourcing Annotation Tasks
[Diagram: an annotator from the crowd provides annotations ("roses", "garden", "car") for an artefact (a painting or object); the annotations are then evaluated]
+Annotation evaluation
Manual evaluation is not feasible:
Institutions have large collections (Rijksmuseum: over 1 million items)
The crowd provides a large number of annotations
Evaluation costs time and money, and museums have limited resources
+Need for automated algorithms
Thus there is a need for algorithms that automatically evaluate annotations with good accuracy
+Previous approach
Building user profile and tracking user reputation based on semantic similarity
Tracking provenance information for users
Insight: a lot of data is provided, and meaningful information can be derived from it
Current approach: can we determine the quality of information based on its features?
+Research questions
Can we evaluate annotations based on properties of the annotator and the annotation?
Can we predict the reputation of an annotator based on annotator properties?
[Diagram: the tag "Roses" with annotation features (no typo, noun, in WordNet) and annotator features (age 25, male, arts degree)]
+Relevant features
Features of the annotation: annotator quality score, length, specificity, …
Features of the annotator: age, gender, education, tagging experience, …
+Semantic Representation
FOAF to represent annotator properties
Open Annotation model to represent annotations
+Experiment: Steve.museum dataset
We performed our evaluations on the Steve.museum dataset, an online dataset of images and annotations

Statistic                      Value
Provided tags                  45,733
Unique tags                    13,949
Tags evaluated as useful       39,931 (87%)
Tags evaluated as not-useful   5,802 (13%)
Annotators (registered)        1,218 / 488 (40%)
+Steve.museum annotation evaluation
The annotations in the Steve.museum project were evaluated into multiple categories; we classified the evaluations as either useful or not-useful
Original categories: usefulness-useful, judgement-positive, judgement-negative, problematic-foreign, problematic-typo, …, usefulness-not useful
+Identify relevant annotation properties
Manually select properties (F_man): is_adjective, is_english, in_wordnet
List of all possible properties (F_all): F_man + [created_day/hour, length, specificity, nrwords, frequency]
Apply a feature selection algorithm on F_all to choose properties (F_ml): feature selection algorithm from the WEKA toolkit
WEKA is a collection of machine learning algorithms for data mining tasks: http://www.weka.net.nz/
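Several of the properties listed above (length, nrwords, frequency) can be computed directly from the tag strings. A minimal sketch in Python (not the authors' implementation; dictionary-based checks such as is_english and in_wordnet would additionally need a lexicon):

```python
from collections import Counter

def annotation_features(tag, tag_counts):
    """Compute simple string-level features for one tag.

    tag_counts maps each tag to how often it was provided, so
    `frequency` corresponds to the F_all feature of the same name.
    """
    words = tag.split()
    return {
        "length": len(tag),            # characters in the tag
        "nrwords": len(words),         # number of words
        "frequency": tag_counts[tag],  # how often the crowd provided this tag
    }

tags = ["roses", "garden", "car", "car", "red roses"]
counts = Counter(tags)
features = [annotation_features(t, counts) for t in tags]
```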
+Build train and test data
Split the Steve dataset annotations into a train set and a test set
The train set has the features and the goal (quality); the test set has only the features
Fairness: the train set had 1,000 useful and 1,000 not-useful annotations

Train data:
Tag    Feature 1   Feature 2   …   Feature n   Quality
Rose   f1          f2          …   fn          Useful
House  f11         f12         …   f1n         Not-useful

Test data:
Tag    Feature 1   Feature 2   …   Feature n
Lily   f1          f2          …   fn
Sky    f11         f12         …   f1n
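The "fairness" step above amounts to class-balanced sampling. A minimal sketch, assuming annotations are (features, quality) pairs and with the per-class size as a parameter (the slides use 1,000 per class):

```python
import random

def balanced_train_test(annotations, per_class=1000, seed=42):
    """Split annotations into a class-balanced train set and a test set.

    `annotations` is a list of (features, quality) pairs with quality
    in {"useful", "not-useful"}.  The train set gets `per_class`
    examples of each class; all remaining examples go to the test set.
    """
    rng = random.Random(seed)
    useful = [a for a in annotations if a[1] == "useful"]
    not_useful = [a for a in annotations if a[1] == "not-useful"]
    rng.shuffle(useful)
    rng.shuffle(not_useful)
    train = useful[:per_class] + not_useful[:per_class]
    rng.shuffle(train)
    test = useful[per_class:] + not_useful[per_class:]
    return train, test
```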
+Machine learning
Apply machine learning techniques:
Learning: learn the relation between features and goal from the training set
Prediction: apply what was learned from the training set to the test set
Used SVM with the default PolyKernel in WEKA to predict the quality of annotations: commonly used, fast and resistant against over-fitting
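Outside WEKA, the same learn-then-predict workflow can be sketched with scikit-learn's SVC, using a polynomial kernel as a stand-in for WEKA's default PolyKernel (the feature vectors below are toy values, not the actual Steve.museum features):

```python
from sklearn.svm import SVC

# Toy annotation feature vectors [length, nrwords, in_wordnet]
# with quality labels from a (hypothetical) evaluated train set.
X_train = [[5, 1, 1], [6, 1, 1], [9, 2, 1],
           [3, 1, 0], [2, 1, 0], [4, 1, 0]]
y_train = ["useful", "useful", "useful",
           "not-useful", "not-useful", "not-useful"]

# Learning: fit the SVM on features and goal from the train set.
clf = SVC(kernel="poly", degree=1)
clf.fit(X_train, y_train)

# Prediction: apply the learned model to unseen feature vectors.
predictions = clf.predict([[7, 1, 1], [2, 1, 0]])
```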
+Results
The method is good at predicting useful tags, but not at predicting not-useful tags

Feature set   Class        Recall   Precision   F-measure
F_man         Useful       0.90     0.90        0.90
F_man         Not useful   0.20     0.21        0.20
F_all         Useful       0.75     0.91        0.83
F_all         Not useful   0.42     0.18        0.25
F_ml          Useful       0.20     0.98        0.34
F_ml          Not useful   0.96     0.13        0.23
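For reference, the per-class scores in tables like this one follow directly from the confusion counts; a minimal sketch of the standard definitions:

```python
def prf(tp, fp, fn):
    """Precision, recall and F-measure from confusion counts
    (tp = true positives, fp = false positives, fn = false negatives)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure
```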
+Identify relevant features of annotator
Are these features helpful to determine annotation quality and to predict annotator reputation?
[Diagram: an annotator with properties age 25, male, arts degree]
+Building annotator reputation
Probabilistic logic called Subjective Logic
Annotator opinion = (belief, disbelief, uncertainty)
(p, n) = (positive, negative) evaluations
belief = p / (p + n + 2)
disbelief = n / (p + n + 2)
uncertainty = 2 / (p + n + 2)
The expectation value E is the reputation:
E = belief + apriori × uncertainty, with apriori = 0.5
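The reputation computation above can be sketched directly in code (apriori fixed at 0.5, as in the slides):

```python
def reputation(p, n, apriori=0.5):
    """Annotator reputation as the Subjective Logic expectation value.

    p, n: numbers of positive / negative evaluations of the
    annotator's tags so far.
    """
    belief = p / (p + n + 2)
    uncertainty = 2 / (p + n + 2)
    return belief + apriori * uncertainty

# With no evaluations yet, the reputation equals the apriori of 0.5;
# positive evaluations push it towards 1, negative ones towards 0.
```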
+Identify relevant annotator properties
Manually identified properties: F_man = [community, age, education, experience, gender, tagging experience, …]
List of all properties: F_all = F_man + [vocabulary_size, vocab_diversity, is_anonymous, # annotations in WordNet]
Feature selection algorithm on F_all: F_ml_a for annotation quality, F_ml_u for annotator reputation
+Results
Trained an SVM on the features to make predictions

Feature set   Class        Recall   Precision   F-measure
F_man         Useful       0.29     0.90        0.44
F_man         Not useful   0.73     0.11        0.20
F_all         Useful       0.69     0.91        0.78
F_all         Not useful   0.43     0.15        0.22
F_ml_a        Useful       0.55     0.91        0.68
F_ml_a        Not useful   0.53     0.13        0.21
+Results
Used regression to predict reputation values based on the features of registered annotators
Since annotator reputation is highly skewed (90% of values > 0.7), we could not predict reputation successfully

Feature set   Correlation   RMS error   Mean abs. error   Rel. abs. error
F_man         -0.02         0.15        0.10              97.8%
F_all          0.22         0.13        0.09              95.1%
F_ml_u         0.29         0.13        0.09              90.4%
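The error measures in this table are the standard regression metrics; a minimal sketch of how RMS error, mean absolute error and relative absolute error (relative to a predict-the-mean baseline, as in WEKA) are computed from actual and predicted reputation values:

```python
import math

def regression_errors(actual, predicted):
    """RMS error, mean absolute error and relative absolute error.

    Relative absolute error compares the model's absolute errors to
    those of a baseline that always predicts the mean of `actual`.
    """
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    rms = math.sqrt(sum((a - p) ** 2
                        for a, p in zip(actual, predicted)) / n)
    mae = sum(abs_errors) / n
    rae = sum(abs_errors) / sum(abs(a - mean_actual) for a in actual)
    return rms, mae, rae
```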
+Evaluation
Possible reasons why the method was not successful for predicting not-useful annotations:
They are a minority (13% of the whole dataset)
A more in-depth analysis of features is needed to determine not-useful annotations
Requires study on different datasets
+Relevance
Our experiments show that the features of the annotator and the annotation correlate with the quality of annotations
With a small set of features we were able to predict 98% of the useful and 13% of the not-useful annotations correctly
Helps to identify which features are relevant to certain tasks
+Conclusions
Machine learning techniques help to predict useful evaluations, but not not-useful ones
Devised a model using SVM to predict annotation evaluations and annotator reputation
Used regression to predict annotator reputation
+Future work
Need to extract more in-depth information from both annotation and annotator
Need to build reputation of the annotator per topic
Apply the model on different use cases
+Thank you!
[email protected]