
A CROSS-LINGUAL ANNOTATION PROJECTION-BASED SELF-SUPERVISION APPROACH FOR OPEN INFORMATION EXTRACTION

The 5th International Joint Conference on Natural Language Processing (IJCNLP 2011), November 10th, 2011, Chiang Mai

Seokhwan Kim (POSTECH) Minwoo Jeong (Microsoft Bing)

Jonghoon Lee (POSTECH) Gary Geunbae Lee (POSTECH)

Contents

• Introduction

• Open Information Extraction

• Cross-Lingual Annotation Projection

• Implementation

• Evaluation

• Conclusions



Information Extraction

• Goal

To generate structured information from natural language documents

• Representing semantic relationships among a set of arguments


Barack Obama was born on August 4, 1961, in Honolulu, Hawaii.

[Figure: the record extracted from this sentence]
Person: Barack Obama
Birthday: August 4, 1961
Birthplace: Honolulu

Previous Approaches

• Many supervised machine learning approaches have been successfully applied to the relation detection and characterization (RDC) task (Kambhatla, 2004; Zhou et al., 2005; Zelenko et al., 2003; Culotta and Sorensen, 2004; Bunescu and Mooney, 2005; Zhang et al., 2006)

Large amounts of training data are required

• Weakly-supervised techniques have been sought

(Zhang, 2004; Chen et al., 2006; Zhou et al., 2009)

To learn the IE system without significant annotation effort

• Open Information Extraction

(Banko et al., 2007; Wu and Weld, 2010)



Open Information Extraction

• An alternative weakly-supervised IE paradigm

(Banko et al., 2007)

• Problem Definition

Binary relation extraction between entities eᵢ and eⱼ

Considering relationships explicitly represented by rᵢ,ⱼ

• Goal

Large-scale IE

• Domain-independent

• Relation-independent

Without hand-crafted rules or hand-annotated training examples

f : D → { ⟨eᵢ, rᵢ,ⱼ, eⱼ⟩ | 1 ≤ i, j ≤ N }
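Read operationally, the mapping above takes a document and returns every relation triple that is explicitly expressed in its sentences. The sketch below (hypothetical types and names, not the authors' code) only pins down that interface.

```python
# A minimal sketch (hypothetical names, not the authors' code) of the Open IE
# contract f : D -> { <e_i, r_ij, e_j> }: a document goes in, a set of relation
# triples whose relation phrase is taken from the text itself comes out.
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Extraction:
    arg1: str       # e_i, the first argument entity
    relation: str   # r_ij, the relation phrase as it appears in the sentence
    arg2: str       # e_j, the second argument entity

def open_ie(document: List[str]) -> List[Extraction]:
    """Relation- and domain-independent extraction over a list of sentences;
    a real system would tag entities, pair them, and label the connecting text."""
    triples: List[Extraction] = []
    for sentence in document:
        ...  # placeholder: no hand-crafted rules or annotated examples assumed
    return triples

# open_ie(["Barack Obama was born in Honolulu, Hawaii."]) would ideally yield
# [Extraction("Barack Obama", "was born in", "Honolulu")]
```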

How to Eliminate Human Supervision

• Self-supervised Learning for Open IE

Using automatically obtained training examples

• From external knowledge

• Previous Systems

TextRunner (Banko et al., 2007)

• Penn Treebank

• A small set of heuristics about syntactic structural constraints

WoE (Wu and Weld, 2010)

• Wikipedia articles

• Wikipedia Infoboxes


What’s the Problem?

• Previous approaches mainly depend on language-specific knowledge for English

Heuristic-based Approach

• Syntactic treebank for the target language

• Heuristics designed for the target language

Wikipedia-based Approach

• Wikipedia articles and infoboxes are available not only for English

• Differences among languages in the amount of available resources

English Wikipedia: 3,500,000 articles

Korean Wikipedia: 150,000 articles



Cross-lingual Annotation Projection

• Goal

To obtain training examples for the target language LT

• Method

To leverage parallel corpora to project annotations from the source language LS onto the target language LT

The premise is that parallel corpora between LS and LT are much easier to obtain than a task-specific training dataset for LT

Barack Obama was born in Honolulu, Hawaii.

<e1, r12, e2> = <Barack Obama, was born in, Honolulu>

버락 오바마 (beo-rak-o-ba-ma) 는 (neun) 하와이 (ha-wa-i) 의 (ui) 호놀룰루 (ho-nol-rul-ru) 에서 (e-seo) 태어났다 (tae-eo-nat-da): "Barack Obama was born in Honolulu, Hawaii."

<e1, r13, e3> = <beo-rak-o-ba-ma, e-seo tae-eo-nat-da, ho-nol-rul-ru>

Cross-lingual Annotation Projection

• Previous Work

Part-of-speech tagging (Yarowsky and Ngai, 2001)

Named-entity tagging (Yarowsky et al., 2001)

Verb classification (Merlo et al., 2002)

Dependency parsing (Hwa et al., 2005)

Mention detection (Zitouni and Florian, 2008)

Semantic role labeling (Pado and Lapata, 2009)

• To the best of our knowledge, no work has reported on cross-lingual annotation projection for the Open IE task


Annotation

• To obtain annotations for the sentences in LS

• Procedure

A set of entities in the given sentence is identified

Each instance is composed of a pair of entities

For each instance, extraction is performed


Barack Obama was born in Honolulu, Hawaii.

<e1, r12, e2> = <Barack Obama, was born in, Honolulu>
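The three-step procedure above can be phrased as a few lines of code. The following Python sketch uses assumed helper names (find_entities, extract_relation) as stand-ins for whatever entity tagger and extractor are actually plugged in; it is not the paper's implementation.

```python
# A minimal sketch (assumed helper names) of the source-side annotation step:
# identify entities, form one candidate instance per entity pair, and run the
# extractor on each instance.
from itertools import combinations

def annotate_sentence(tokens, find_entities, extract_relation):
    """find_entities(tokens) -> list of entity token spans
    extract_relation(tokens, e1, e2) -> relation phrase, or None if no relation"""
    annotations = []
    for e1, e2 in combinations(find_entities(tokens), 2):   # one instance per entity pair
        relation = extract_relation(tokens, e1, e2)          # extraction per instance
        if relation is not None:                             # keep positive instances
            annotations.append((e1, relation, e2))
    return annotations
```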

Projection

• To project the annotations from the sentences in LS onto the sentences in LT using word alignment information

• Procedure

A set of entities in the given sentence is identified

Each instance is composed of a pair of entities

For each instance, the existence of a relationship is determined

If the instance is positive, the contextual subtext is projected


Barack Obama was born in Honolulu, Hawaii.

<e1, r12, e2> = <Barack Obama, was born in, Honolulu>

버락 오바마 (beo-rak-o-ba-ma) 는 (neun) 하와이 (ha-wa-i) 의 (ui) 호놀룰루 (ho-nol-rul-ru) 에서 (e-seo) 태어났다 (tae-eo-nat-da): "Barack Obama was born in Honolulu, Hawaii."

<e1, r13, e3> = <beo-rak-o-ba-ma, e-seo tae-eo-nat-da, ho-nol-rul-ru>
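The projection step itself is essentially a lookup through the word alignment. The sketch below is a simplified illustration under assumed data structures (0-based token indices and an alignment dictionary); it is not the paper's implementation.

```python
# A minimal sketch of projecting one positive source-language annotation onto the
# target sentence via word alignments (assumptions: 0-based token indices, a dict
# mapping English token positions to lists of Korean token positions).
def project(annotation, alignment):
    """annotation: (e1_span, rel_span, e2_span), each a list of English token indices
    alignment: dict {english_index: [korean_indices]}
    Returns the Korean-side spans, or None if any part has no aligned tokens."""
    def image(span):
        korean = sorted({k for i in span for k in alignment.get(i, [])})
        return korean or None

    e1_span, rel_span, e2_span = annotation
    e1_t, rel_t, e2_t = image(e1_span), image(rel_span), image(e2_span)
    if e1_t is None or rel_t is None or e2_t is None:
        return None   # drop projections broken by missing alignments
    return (e1_t, rel_t, e2_t)

# Toy run on the sentence pair above (hypothetical alignment):
# English: Barack(0) Obama(1) was(2) born(3) in(4) Honolulu(5) ...
# Korean:  버락(0) 오바마(1) 는(2) 하와이(3) 의(4) 호놀룰루(5) 에서(6) 태어났다(7)
# project(([0, 1], [2, 3, 4], [5]), {0: [0], 1: [1], 3: [7], 4: [6], 5: [5]})
# -> ([0, 1], [6, 7], [5]), i.e. <버락 오바마, 에서 태어났다, 호놀룰루>
```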


Overall Architecture

[Figure: overall architecture. Annotation and projection over the English-Korean parallel corpus (the self-supervision step) produce a Korean annotated corpus; a Korean Open IE model is learned from it and applied to Korean raw text to yield the extracted results.]

Cross-lingual Annotation Projection-based Self-Supervision

[Figure: self-supervision pipeline. The English sentences of the parallel corpus pass through the English preprocessors and the English Open IE system to give an English annotated corpus; the Korean sentences pass through the Korean preprocessors; word alignment links the two sides and projection transfers the annotations, yielding the Korean annotated corpus.]

Cross-lingual Annotation Projection-based Self-Supervision

• Dataset

English-Korean Parallel Corpus

• 266,892 English-Korean sentence pairs

• Preprocessors

English

• OpenNLP toolkit

Korean

• Espresso toolkit


Cross-lingual Annotation Projection-based Self-Supervision

• English Open IE

Our own implementation of Banko's method

• Dataset

The WSJ part of Penn Treebank

By applying a series of heuristics (Banko, 2009)

1,028,361 instances from 49,208 sentences (9.0% were positive)

• Model

Conditional Random Fields (CRF)

• With Lexical and POS tag features

• CRF++ toolkit
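As a concrete illustration of how such CRF training data might be prepared for CRF++ (the toolkit named above), the sketch below serializes one self-supervised instance in CRF++'s column format; the B-REL/I-REL/O label scheme and the file names are assumptions, not necessarily the paper's exact setup.

```python
# A minimal sketch of writing sequence-labeling training data in CRF++'s format:
# one token per line with whitespace-separated feature columns (here word and POS
# tag) and the label in the last column; sequences are separated by a blank line.
def write_crfpp_instance(fh, tokens, pos_tags, labels):
    for word, pos, label in zip(tokens, pos_tags, labels):
        fh.write(f"{word}\t{pos}\t{label}\n")
    fh.write("\n")  # blank line ends the sequence

with open("train.crfpp", "w", encoding="utf-8") as fh:
    write_crfpp_instance(
        fh,
        ["Barack", "Obama", "was", "born", "in", "Honolulu"],
        ["NNP", "NNP", "VBD", "VBN", "IN", "NNP"],
        ["O", "O", "B-REL", "I-REL", "I-REL", "O"],   # assumed tag scheme
    )

# Training and tagging then use the standard CRF++ commands, e.g.
#   crf_learn template train.crfpp model
#   crf_test -m model test.crfpp
```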


Cross-lingual Annotation Projection-based Self-Supervision

• Word Alignment

Aligned by GIZA++ toolkit

• In the standard configuration in both directions

• The bi-directional alignments were joined using the grow-diag-final algorithm

Chunk-based Reorganization

• To reduce the word alignment errors

• Generating alignments between pairs of base phrase chunks

• Using a simple greedy algorithm

Based on the overlap score of aligned words between base phrase chunks
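One way to read the chunk-based reorganization is as a greedy matching of base-phrase chunk pairs scored by how many word-alignment links fall inside them. The sketch below reflects that reading, with assumed data structures rather than the paper's actual code.

```python
# A minimal sketch of greedy chunk alignment: GIZA++ word links are regrouped into
# alignments between base-phrase chunks, chosen greedily by an overlap score.
def chunk_overlap(src_chunk, tgt_chunk, word_align):
    """src_chunk / tgt_chunk: sets of word indices; word_align: set of (src, tgt)
    pairs. Score = number of aligned word pairs falling inside both chunks."""
    return sum(1 for s, t in word_align if s in src_chunk and t in tgt_chunk)

def align_chunks(src_chunks, tgt_chunks, word_align):
    """Greedy 1-to-1 matching: repeatedly take the highest-scoring unused chunk
    pair until no remaining pair shares any aligned words."""
    pairs, used_src, used_tgt = [], set(), set()
    candidates = sorted(
        ((chunk_overlap(sc, tc, word_align), i, j)
         for i, sc in enumerate(src_chunks)
         for j, tc in enumerate(tgt_chunks)),
        reverse=True,
    )
    for score, i, j in candidates:
        if score == 0:
            break
        if i not in used_src and j not in used_tgt:
            pairs.append((i, j))
            used_src.add(i)
            used_tgt.add(j)
    return pairs
```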


Cross-lingual Annotation Projection-based Self-Supervision

• Annotated Dataset

English

598,115 instances

• 169,771 positive instances

• Projected Dataset

Korean

278,730 instances

• 89,743 positive instances


Learning & Extraction

• Extractor for Korean Open IE

Maximum Entropy (ME) model

• To detect whether or not each given instance is positive

• Features

Lexical, POS Tag

On the dependency path

• Maximum Entropy Modeling toolkit

Conditional Random Fields (CRF) model

• To identify the contextual subtext indicating the semantic relationship

• Features

Lexical, POS Tag

On the dependency path

• CRF++ toolkit
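For the detection stage, the slide specifies a maximum entropy classifier over lexical and POS-tag features on the dependency path. The sketch below illustrates the idea with scikit-learn's LogisticRegression standing in for the Maximum Entropy Modeling toolkit; the feature templates and Korean POS tags are illustrative assumptions.

```python
# A minimal sketch of the instance-detection stage: a maximum-entropy-style
# classifier over binary features drawn from the dependency path between the
# two candidate entities (feature names and tags here are assumptions).
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def path_features(path_words, path_pos):
    """Lexical and POS-tag features on the dependency path between the entities."""
    feats = {}
    for i, (word, pos) in enumerate(zip(path_words, path_pos)):
        feats[f"path_word_{i}={word}"] = 1
        feats[f"path_pos_{i}={pos}"] = 1
    return feats

# X: one feature dict per candidate entity pair; y: 1 if a relation holds, else 0
X = [path_features(["에서", "태어났다"], ["JKB", "VV"]),
     path_features(["의"], ["JKG"])]
y = [1, 0]

detector = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
detector.fit(X, y)
print(detector.predict(X))   # classify the toy instances
```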



Evaluation #1

• Dataset

250 sentences from Korean Wikipedia articles

With manually annotated gold standard

• 1,434 instances

• 308 positive instances

• Baseline

Heuristic-based System

• Sejong treebank corpus (Korean)

• The set of heuristics used for the English Open IE system, excluding language-specific rules


Evaluation #1

• Comparison of performances

Model                    P     R     F
Heuristic               47.7  20.1  28.3
Projection              33.6  49.0  39.8
Heuristic + Projection  41.9  46.4  44.1


Evaluation #2

• Datasets

Korean Newswire

• 302,276 documents

• 2,565,487 sentences

Korean Wikipedia

• 123,000 articles

• 1,342,003 sentences

• Manual Evaluation

For four relation types

• BIRTH_PLACE, WON_AWARD, ACQUISITION, INVENT_OF


Evaluation #2

• Evaluation results for four relation types

Type          Newswire precision  # of extractions  Wikipedia precision  # of extractions
Birth Place   65.2                256               69.1                 971
Won Award     57.4                824               63.3                 286
Acquisition   67.0                1,112             50.3                 143
Invent Of     53.1                32                47.6                 103

3,727 extractions with a precision of 63.7% for four relation types

Evaluation #2

• Distribution of the errors

Error Type                 # of errors
Chunking Error             364 (26.9%)
Dependency Parsing Error   461 (34.1%)
Extracting Error           527 (39.0%)


Conclusions

• Summary

A Cross-lingual Annotation Projection Approach for Open IE

Korean Open IE system developed using an English Open IE system and an English-Korean parallel corpus

Our system outperformed the heuristic-based system

Our system achieved 63.7% precision in a large-scale evaluation

• Ongoing Work

Reducing sensitivity to the errors committed by preprocessors

Investigating hybrid approaches considering various external knowledge sources


Q&A