Text Classification & Summarization -...

Text Classification & Summarization Kornél Markó, Florian Schmedding Averbis GmbH Germany FP7-ICT-2013-SME-DCA


Page 1: Text Classification & Summarization - EUCaseseucases.eu/.../documents/...expertwsclassificationsummarization.pdf · Text Classification & Summarization Kornél Markó, ... –Sentence

Text Classification & Summarization

Kornél Markó, Florian Schmedding Averbis GmbH

Germany

FP7-ICT-2013-SME-DCA


Background

• WP 3: Provision of a Natural Language Processing Toolkit for processing legal documents in

– Bulgarian (IICT-BAS)

– English, French, German (Averbis)

– Italian (UNITO)

• Levels of linguistic pre-processing

– Sentence Splitting, Tokenization, POS-Tagging, Decompounding, Named Entity Recognition, Concept Mapping, Link Detection

• Text Classification and Text Summarization


Multilingual EUCases NLP Pipeline

[Figure: pipeline diagram. Example input: "1. Am 17. August 2011 erhob der Beschwerdeführer, ein mit einer …". Components: Sentence detection, Tokenizer, Stopword Tagger, Stemmer, Morpho-Semantic Segmenter, Part-of-Speech Tagger, Chunker, Regular Expression Annotator, Concept Mapper (three instances), Textrank Descriptor Extraction, Text Summary. Terminologies: Syllabus, EuroVoc, Geonames.]


Multilingual Terminologies


Examples


• What are these documents about?

• Are they related to each other?


Keyword Extraction

• What are these documents about?

• Are they related to each other?

• Employer
• Employment
• Labour relations
• Wage earner
• Governance
• Work

• Work
• Work contract
• Labour tribunal
• Employer
• Contract
• Wage earner
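
One rough way to approach the question "Are they related to each other?" is to compare the two keyword lists directly. The sketch below is only an illustration and not part of the EUCases toolkit: it assumes the two lists above belong to two different documents, and the function name `jaccard` and the choice of Jaccard overlap as the measure are assumptions made here.

```python
def jaccard(a: set, b: set) -> float:
    """Share of keywords that two sets have in common (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Keyword lists copied from the slide; treating them as two documents is an assumption.
doc_a = {"Employer", "Employment", "Labour relations", "Wage earner", "Governance", "Work"}
doc_b = {"Work", "Work contract", "Labour tribunal", "Employer", "Contract", "Wage earner"}

shared = sorted(doc_a & doc_b)
score = jaccard(doc_a, doc_b)
print(shared, round(score, 2))
```

The two lists share three of nine distinct keywords, which hints at a common topic (employment law) without saying anything about how the documents relate legally.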


Text summary


His previous contract of employment was not varied, because the second contract was made between different parties. The dispute relates to the interpretation and application of article 3(1) of Council Directive 77/187/EEC on the approximation of the laws of the Member States relating to the safeguarding of employees' rights in the event of transfers of undertakings, businesses or parts of businesses ("the Acquired Rights Directive"), which has now been consolidated with subsequent amendments and repealed by Council Directive 2001/23/EC. The TECs were to plan and deliver training and to promote and support the development of small businesses and self-employment within their area under contracts with the government. And it went further still when it ruled in para 2, drawing on its previous case law, that contracts of employment existing on the date of the transfer between the transferor and the workers assigned to the undertaking transferred are deemed to be handed over on that date from the transferor to the transferee regardless of what has been agreed between the parties in that respect.


NLP Backend


Text Rank (Principles)

• Graph-based algorithm inspired by Google's PageRank

– Nodes are words (concepts)

– Edges represent relations to other words (concepts)

• Co-occurrences within sentences in the document

• Iterate a graph-based ranking algorithm to give each node a weight (in the simplest case, by counting the vertices it is connected to)

• Sort by the final score
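
The principles above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the toolkit's implementation: concepts are assumed to be already extracted per sentence, the graph is undirected and unweighted, and the names `build_graph` and `text_rank`, the damping factor 0.85, and the toy input are all choices made here.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(sentences):
    """Link every pair of concepts that co-occur in the same sentence."""
    graph = defaultdict(set)
    for concepts in sentences:
        for a, b in combinations(sorted(set(concepts)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def text_rank(graph, damping=0.85, iterations=50):
    """PageRank-style iteration on an undirected, unweighted graph."""
    scores = {node: 1.0 for node in graph}
    for _ in range(iterations):
        # Each node collects an equal share of the score of every neighbour.
        scores = {
            node: (1 - damping)
            + damping * sum(scores[n] / len(graph[n]) for n in graph[node])
            for node in graph
        }
    return scores


# Hand-made toy input (hypothetical): one list of concepts per sentence.
sentences = [
    ["Employer", "Employee", "Common Law"],
    ["Contract", "Employment", "Replacement"],
    ["Employee", "Rights", "Employment", "Legislation"],
    ["Relationship", "Conditions", "Employee", "Employer", "Justice", "Judgment"],
]

graph = build_graph(sentences)
ranked = sorted(text_rank(graph).items(), key=lambda item: item[1], reverse=True)
for concept, score in ranked[:5]:
    print(f"{concept}: {score:.2f}")
```

A concept that never shares a sentence with another concept does not become a node in this sketch. Replacing the iteration with a plain neighbour count gives the simpler degree-based weighting.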


"If an employer transferred his business undertaking to another party, the position at common law of an employee who worked for the first employer before the transfer and for the new employer after it was in principle clear."

[Graph nodes: Employer, Employee, Common Law]


"His previous contract of employment was not varied, because the second contract was made between different parties. But the first contract was the subject of an express or implied novation, involving the termination of the first contract and its replacement by a new contract."

[Graph nodes: Employer, Employee, Common Law, Contract, Employment, Replacement]


"But it could work disadvantageously to the employee in any situation where his rights depended on showing that his employment had been continuous for a given period, since a novation necessarily involved a discontinuity. It was this disadvantage which the legislation now under consideration was intended to obviate."

[Graph nodes: Employer, Employee, Common Law, Contract, Employment, Replacement, Rights, Legislation]


"But its effect is, inevitably, to introduce a fictional element into this tripartite relationship, since (where the legislative conditions are satisfied) the employee is treated as having been employed by the new employer all along and ex hypothesi such is not the case. The European Court of Justice [2005] IRLR 647 acknowledges this in para 43 of its judgment."

[Graph nodes: Employer, Employee, Common Law, Contract, Employment, Replacement, Rights, Legislation, Relationship, Conditions, Justice, Judgment]


Text Rank

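[Figure: concept graph with the nodes Employer, Employee, Common Law, Contract, Employment, Replacement, Rights, Legislation, Relationship, Conditions, Justice, Judgment, each annotated with a weight. Weights in extraction order, node assignment not recoverable: 4, 3, 1, 1, 6, 2, 3, 0, 2, 3, 1, 2]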


Text Summaries

• Based on the same principle…

– Recognize sentences

– Detect the most important words (concepts) in the document by using Text Rank

– Select the n sentences containing these top terms and sort them by their position in the document
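
The three steps above can be sketched as follows. This is a simplified illustration, not the toolkit's code: the top terms are passed in rather than computed, sentence recognition is a naive regular expression, terms are matched as lower-cased substrings, and the function name `summarize` and the sample text are invented for the example.

```python
import re


def summarize(text, top_terms, n=2):
    """Pick the n sentences containing the most top terms, in document order."""
    # Naive splitter; a stand-in for the pipeline's sentence detection.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    scored = []
    for position, sentence in enumerate(sentences):
        lowered = sentence.lower()
        hits = sum(1 for term in top_terms if term.lower() in lowered)
        scored.append((hits, position, sentence))
    # Highest hit count first; ties go to the earlier sentence.
    best = sorted(scored, key=lambda item: (-item[0], item[1]))[:n]
    # Restore document order for the final summary.
    return [sentence for _, _, sentence in sorted(best, key=lambda item: item[1])]


# Hypothetical sample text and top terms.
text = (
    "The employer transferred the business. "
    "The weather was fine. "
    "The employee kept his contract of employment. "
    "A new contract replaced the old one."
)
summary = summarize(text, ["employer", "employee", "contract", "employment"], n=2)
print(summary)
```

Sorting the selected sentences back into document order keeps the summary readable, since the sentences are quoted verbatim and not rewritten.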


Evaluation


Preliminary Results: Summaries


Preliminary Results: Keywords


Conclusion & Discussion

• Text Rank is an effective and elegant way to compute the "importance" of terms.

– Language independent

– Unsupervised

• In addition, a machine-learning based approach for text classification is provided by UNITO

• Preliminary results are very encouraging

– Text summaries rated "very useful" or "useful":

• > 80% for de, en, fr

• > 60% for it, bg

– Keyword extraction (ongoing work):

• Further improvement necessary for all languages

• Different coverage for the languages (e.g. EuroVoc)


D3.1 NLP Toolkit

Kornél Markó, Averbis

FP7-ICT-2013-SME-DCA


D3.1 NLP Toolkit

• Selection of Apache Unstructured Information Management Architecture (UIMA) as a framework for processing legal documents in Bulgarian, English, French, German, and Italian

• Provision of …

– language-specific analysis engines (modules) by partners: Sentence Splitting, Tokenization, POS-Tagging, NER, concept mapping, …

– wrappers for non-UIMA components

– a common typesystem and mappings to a universal tagset
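
To illustrate what a mapping to a universal tagset can look like, here is a small sketch. The deck does not say which tagsets the partners use, so the excerpts below (a few STTS tags for German and Penn Treebank tags for English, mapped to coarse universal categories) are assumptions chosen for illustration, as are the names `MAPPINGS` and `to_universal`.

```python
# Hypothetical excerpts; the actual tagsets and mappings of the toolkit are not given.
MAPPINGS = {
    "de": {"NN": "NOUN", "ADJA": "ADJ", "VVFIN": "VERB", "ART": "DET"},  # STTS-style tags
    "en": {"NN": "NOUN", "JJ": "ADJ", "VBD": "VERB", "DT": "DET"},       # Penn-style tags
}


def to_universal(tag: str, language: str) -> str:
    """Map a language-specific POS tag to a coarse universal tag ('X' if unknown)."""
    return MAPPINGS.get(language, {}).get(tag, "X")


print(to_universal("ADJA", "de"), to_universal("JJ", "en"))
```

With such a mapping, downstream components can ask for "all nouns" without knowing which tagger produced the annotation.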


NLP Toolkit: UIMA Framework

CAS Basics

– Input: input text, e.g. "Hello world!"

– Analysis Engine: text analysis task, e.g. sentence detection, POS tagging, etc.

– Annotations: annotations about the text, e.g. Noun: Pos: 6,11 ("world")

– Common Analysis Structure (CAS): contains the text and the annotations, using a unique type system

Pipeline

– Example input: "1. Am 17. August 2011 erhob der Beschwerdeführer, ein mit einer …"

– Sentence detection, Tokenizer, Part-of-Speech Tagger
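
The idea behind the CAS can be modelled in a few lines. The sketch below is a conceptual toy in Python and deliberately not the Apache UIMA API (which is Java-based): the class names `Cas` and `Annotation`, the two regex-based engines, and `run_pipeline` are all invented for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    type_name: str  # e.g. "Sentence" or "Token"
    begin: int      # character offset of the first character
    end: int        # character offset just past the last character


@dataclass
class Cas:
    """Toy stand-in for a Common Analysis Structure: text plus annotations."""
    text: str
    annotations: list = field(default_factory=list)

    def covered_text(self, annotation):
        return self.text[annotation.begin:annotation.end]

    def select(self, type_name):
        return [a for a in self.annotations if a.type_name == type_name]


def sentence_detector(cas):
    for match in re.finditer(r"\S[^.!?]*[.!?]?", cas.text):
        cas.annotations.append(Annotation("Sentence", match.start(), match.end()))


def tokenizer(cas):
    for match in re.finditer(r"\w+|[^\w\s]", cas.text):
        cas.annotations.append(Annotation("Token", match.start(), match.end()))


def run_pipeline(text, engines):
    """Each analysis engine reads the same CAS and adds its annotations to it."""
    cas = Cas(text)
    for engine in engines:
        engine(cas)
    return cas


cas = run_pipeline("Hello world!", [sentence_detector, tokenizer])
tokens = cas.select("Token")
print([(cas.covered_text(t), t.begin, t.end) for t in tokens])
```

The second token covers offsets 6 to 11, matching the "Pos: 6,11" annotation in the slide's example.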


NLP Toolkit: EUCases Pipeline

[Figure: pipeline diagram. Example input: "1. Am 17. August 2011 erhob der Beschwerdeführer, ein mit einer …". Components: Sentence detection, Tokenizer, Stopword Tagger, Stemmer, Morpho-Semantic Segmenter, Part-of-Speech Tagger, Chunker, Regular Expression Annotator, Concept Mapper (three instances), Textrank Descriptor Extraction, Text Summary. Terminologies: LT Syllabus, EuroVoc, Geonames.]


NLP Toolkit: Challenges & Solutions

• AVERBIS supports German, English, and French (UIMA compliant)

• UNITO has superior parsers for Italian

• IICT-BAS supports Bulgarian

Approach:

• Including UNITO's and IICT-BAS' components in the UIMA pipeline…

– … by wrapping them as UIMA analysis engines

– … makes concept mapping, descriptor extraction, and text summarization (language-independent) available for Italian and Bulgarian documents


NLP Toolkit: Wrappers

• Send text to wrapped parser

• Lift custom annotations to common type system

[Figure: Unito Sentence Detection wrapper. A custom annotation is lifted to a CAS annotation with type system.]
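
The two wrapper steps (send text to the wrapped parser, lift its custom annotations to the common type system) can be sketched as below. Everything here is hypothetical: `external_sentence_splitter` merely imitates a non-UIMA component with its own output format, and `CommonAnnotation` stands in for a type from the common type system. The real Unito and IICT-BAS interfaces are not described in the deck.

```python
import re
from dataclasses import dataclass


@dataclass
class CommonAnnotation:
    """Stand-in for an annotation type from the common type system."""
    type_name: str
    begin: int
    end: int


def external_sentence_splitter(text):
    """Hypothetical wrapped parser that returns spans in its own custom format."""
    return [
        {"from": m.start(), "to": m.end(), "label": "S"}
        for m in re.finditer(r"\S[^.!?]*[.!?]?", text)
    ]


def wrapped_sentence_detection(text):
    # Step 1: send the text to the wrapped parser.
    custom = external_sentence_splitter(text)
    # Step 2: lift its custom annotations to the common type system.
    return [
        CommonAnnotation("Sentence", item["from"], item["to"])
        for item in custom
        if item["label"] == "S"
    ]


result = wrapped_sentence_detection("Prima frase. Seconda frase.")
print(result)
```

Keeping the conversion in one place means that later stages only ever see the common annotation type, whichever parser produced the spans.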


NLP Toolkit: Wrappers

[Figure: wrapped components: Unito Sentence Detection; Unito POS tagger + tokenizer; IICT-BAS combined parser. Output: sentences, tokens, POS tags, lemmas.]


NLP Toolkit: EUCases Pipeline

[Figure: pipeline diagram. Example input: "1. Am 17. August 2011 erhob der Beschwerdeführer, ein mit einer …". Language detection is followed by language-specific branches labelled DE, EN; FR; IT; BG. Branch components: Sentence detection, Tokenizer, Stopword Tagger, Stemmer, Morpho-Semantic Segmenter, Part-of-Speech Tagger, Chunker; Unito Sentence Detection, Unito POS tagger + tokenizer; IICT-BAS combined parser (output: sentences, tokens, POS tags, lemmas). Followed by: Regular Expression Annotator, Concept Mapper, Textrank Descriptor Extraction, Text Summary. Terminologies: LT Syllabus, EuroVoc, Geonames.]
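
The routing shown in the diagram can be thought of as a lookup from the detected language to a list of preprocessing steps, followed by stages shared by all languages. The sketch below only plans the step names and runs nothing. The grouping by provider follows the deck (Averbis for de, en, fr; Unito for it; IICT-BAS for bg), while the exact step lists per branch and the names `PREPROCESSING`, `SHARED_STEPS`, and `plan_pipeline` are assumptions.

```python
# Step lists are simplified assumptions; the diagram's exact per-branch layout is not recoverable.
AVERBIS_STEPS = ["sentence detection", "tokenizer", "stopword tagger", "stemmer", "POS tagger"]

PREPROCESSING = {
    "de": AVERBIS_STEPS,
    "en": AVERBIS_STEPS,
    "fr": AVERBIS_STEPS,
    "it": ["Unito sentence detection", "Unito POS tagger + tokenizer"],
    "bg": ["IICT-BAS combined parser"],
}

SHARED_STEPS = [
    "regular expression annotator",
    "concept mapper",
    "Textrank descriptor extraction",
    "text summary",
]


def plan_pipeline(language: str) -> list:
    """Return the ordered step names for a document in the given language."""
    try:
        specific = PREPROCESSING[language]
    except KeyError:
        raise ValueError(f"unsupported language: {language}") from None
    return specific + SHARED_STEPS


print(plan_pipeline("bg"))
```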


NLP Toolkit: Input & Output

• Akoma Ntoso XML documents delivered to pipeline

– Transform XML to plain text

– Not trivial: block elements must be separated, but inline elements must not

• Insert inline annotations into Akoma Ntoso

– Annotations refer to plain text positions

– Map plain text positions to corresponding text node of XML document

Example: <akomaNtoso><p>Entscheidung</p><p><em>Sach</em>verhalt</p></akomaNtoso>

– No linebreaks in document: EntscheidungSachverhalt

– Each element a linebreak: Entscheidung Sach verhalt

– Block elements separated, inline elements not: Entscheidung Sachverhalt
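
A minimal sketch of both directions, using Python's standard `xml.etree.ElementTree`: converting the XML to plain text with line breaks only around block elements, and mapping a plain-text offset back to the XML text node it came from. The assumptions are that only `<p>` counts as a block element and that the document has no XML namespace (real Akoma Ntoso documents are namespaced). The names `xml_to_text` and `locate` are invented here.

```python
import xml.etree.ElementTree as ET

BLOCK_TAGS = {"p"}  # assumption: only <p> is treated as a block element


def xml_to_text(xml_string):
    """Return (plain_text, pieces); pieces map text offsets back to XML text nodes."""
    root = ET.fromstring(xml_string)
    parts = []
    pieces = []  # (start, end, element, "text" or "tail")
    length = 0

    def emit(string, elem=None, slot=None):
        nonlocal length
        parts.append(string)
        if elem is not None:
            pieces.append((length, length + len(string), elem, slot))
        length += len(string)

    def line_break():
        # Separate blocks, but never emit two breaks in a row or a leading break.
        if parts and parts[-1] != "\n":
            emit("\n")

    def visit(elem):
        is_block = elem.tag in BLOCK_TAGS
        if is_block:
            line_break()
        if elem.text:
            emit(elem.text, elem, "text")
        for child in elem:
            visit(child)
            if child.tail:
                emit(child.tail, child, "tail")
        if is_block:
            line_break()

    visit(root)
    return "".join(parts).rstrip("\n"), pieces


def locate(pieces, offset):
    """Map a plain-text offset to (tag, 'text' or 'tail', offset inside that node)."""
    for start, end, elem, slot in pieces:
        if start <= offset < end:
            return elem.tag, slot, offset - start
    return None  # the offset falls on an inserted line break or outside the text


doc = "<akomaNtoso><p>Entscheidung</p><p><em>Sach</em>verhalt</p></akomaNtoso>"
plain, pieces = xml_to_text(doc)
print(repr(plain), locate(pieces, 17))
```

The inline `<em>` does not split "Sachverhalt", while the two `<p>` blocks end up on separate lines. An annotation covering "Sachverhalt" spans two XML text nodes (the text of `<em>` and its tail), which is why inserting inline annotations requires the offset mapping.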