StrepHit IEG Kick-off Seminar
STREPHIT
A WIKIMEDIA FOUNDATION IEG PROJECT
MARCO FOSSATI - HJFOCS - [email protected]
TRENTO, 15TH JANUARY 2016
WHO?
‣ ADVISOR: CLAUDIO GIULIANO
‣ VOLUNTEERS: AUVA87, BOLIOLIANDREA, DANROK, NISPRATEEK, PROJEKT ANA, VLADIMIR ALEXIEV
WHAT?
‣ IS AN NLP PIPELINE
‣ HARVESTS STRUCTURED DATA FROM RAW TEXT
‣ PRODUCES WIKIDATA CONTENT WITH REFERENCES
WHY
THE CRITICAL ISSUE
▸ Reliability of content across Wikimedia projects
▸ Trust needed on the content addition process
▸ Mature in Wikipedia, but what about Wikidata?
▸ StrepHit = novel, automatic process
▸Generates trust and reliability over Wikidata content
▸Alleviates the burden of manual curation
WHY
THE TECHNICAL PROBLEM
▸Content should be validated against third-party resources
▸References to external authoritative sources
▸Ensure at least one reference for each piece of data
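The "at least one reference for each piece of data" requirement can be pictured as a simple coverage check. The following is a minimal sketch only, not StrepHit code: the `Statement` class, its fields, and the sample values are hypothetical and do not reflect the actual Wikidata data model or API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Statement:
    """Hypothetical, simplified stand-in for one piece of Wikidata content."""
    subject: str
    prop: str
    value: str
    references: List[str] = field(default_factory=list)


def unreferenced(statements: List[Statement]) -> List[Statement]:
    """Return the statements that carry no reference at all."""
    return [s for s in statements if not s.references]


def reference_coverage(statements: List[Statement]) -> float:
    """Fraction of statements backed by at least one reference."""
    if not statements:
        return 1.0  # nothing to validate, so nothing is missing
    referenced = sum(1 for s in statements if s.references)
    return referenced / len(statements)


# Made-up sample data, for illustration only.
sample = [
    Statement("EXAMPLE_PERSON", "date of birth", "1900-01-01",
              ["https://example.org/bio/1"]),
    Statement("EXAMPLE_PERSON", "place of birth", "Trento"),
]
missing = unreferenced(sample)
coverage = reference_coverage(sample)
```

In this toy sample, `missing` holds the statement without a source, which is the kind of content the project aims to back with references to authoritative sources.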
HOW?
‣ INPUT = PRIMARY SOURCES CORPUS
‣ OUTPUT = DATASET FOR WIKIDATA
‣ AUTHENTICATE EXISTING CONTENT
‣ PROPOSE NOVEL CONTENT
‣ VIA REFERENCES TO SUCH SOURCES
HOW
MAIN TASKS
1. Sources selection
2. Corpus harvesting
3. Corpus analysis
4. Frame repository selection
5. Training set construction
6. Frame extraction
7. Dataset production
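To make the data flow of the last two tasks concrete, here is a toy sketch that goes from raw text to a referenced statement. It rests on loud assumptions: the corpus, the URL, and the output dictionary layout are invented for illustration, and a single regular expression stands in for real frame extraction, which the task list suggests would rely on a frame repository and a training set.

```python
import re
from typing import Dict, List

# Hypothetical toy corpus: source URL -> raw text (made-up example data).
CORPUS: Dict[str, str] = {
    "https://example.org/bio/ada": "Ada Lovelace was born in London.",
}

# ASSUMPTION: one lexical pattern stands in for a trained frame extractor.
BIRTH_PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*) was born in (?P<place>[A-Z][a-z]+)"
)


def extract_frames(corpus: Dict[str, str]) -> List[Dict[str, str]]:
    """Toy frame extraction: find birth events and remember their source."""
    frames = []
    for url, text in corpus.items():
        for match in BIRTH_PATTERN.finditer(text):
            frames.append({
                "frame": "Being_born",
                "person": match.group("person"),
                "place": match.group("place"),
                "source": url,
            })
    return frames


def produce_dataset(frames: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Toy dataset production: one referenced statement per extracted frame."""
    return [
        {
            "subject": f["person"],
            "property": "place of birth",
            "value": f["place"],
            "reference": f["source"],
        }
        for f in frames
    ]


dataset = produce_dataset(extract_frames(CORPUS))
```

The point of the sketch is only that every produced statement keeps the URL of the primary source it was extracted from, so the reference travels with the data.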
FIRST STEP
WHICH DOMAIN?
A. BIOGRAPHIES
B. COMPANIES
C. BIOMEDICAL
THANKS NEMO FOR OUR PRECIOUS CONVERSATION
FIRST STEP
BIOGRAPHIES
▸ plenty of existing data
▸ broad coverage
▸ potentially easy to find valuable primary sources
LIBRARIANS, WHAT DO YOU THINK?
FIRST STEP
COMPANIES
▸ relatively biased domain
▸ ad-prone content
▸ the company edits the page on the company itself
▸ low-quality data
FIRST STEP
BIOMEDICAL
▸ great primary source
▸ PubMed: scientific papers
▸ proof of usage for an Open Access corpus
OPEN DISCUSSION
DOMAIN + SOURCES SELECTION
THIS WORK IS LICENSED UNDER A CC BY SA 4.0 LICENSE
https://pad.okfn.org/p/strephit