CIG Conference Norwich September 2006

AUTINDEX: Automatic Indexing and Classification of Texts

Catherine Pease & Paul Schmidt

IAI, Saarbrücken

{cath,paul}@iai.uni-sb.de

http://www.iai.uni-sb.de

Automatic Indexing and Classification of Texts

AUTINDEX:

• calculates keywords in texts

• places text in its appropriate classification

Applications

• Information Services for indexing scientific articles

• Document Management Systems for text classification according to content

• Libraries for indexing incoming books and articles

Basic Components

• Morpho-syntactic analysis: tagging and lemmatisation

• Shallow parsing: resolution of grammatical ambiguities and identification of NPs

Linguistic Resources for Pre-processing

• Morphological Analyser & Morpheme dictionaries

• Grammar rules for shallow parsing

Morphological Analyser

“Cost reduction”

cost:

{lu=cost,ls=cost,c=verb,vtype=fiv}

{lu=cost,ls=cost,c=verb,vtype=inf}

{lu=cost,ls=cost,c=noun,nb=sg}

reduction:

{lu=reduction,ls=reduce,c=noun,nb=sg}
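
The readings above can be modelled as plain feature dictionaries. The Python sketch below is illustrative only: the `LEXICON` table and the `analyse` function are hypothetical stand-ins, not the AUTINDEX analyser, and they reproduce just the two entries shown on the slide. The attribute names (`lu`, `ls`, `c`, `vtype`, `nb`) are copied from the slide as-is.

```python
# Hypothetical toy lexicon holding only the readings shown on the slide.
# Attribute names are copied verbatim from the analyser output above.
LEXICON = {
    "cost": [
        {"lu": "cost", "ls": "cost", "c": "verb", "vtype": "fiv"},
        {"lu": "cost", "ls": "cost", "c": "verb", "vtype": "inf"},
        {"lu": "cost", "ls": "cost", "c": "noun", "nb": "sg"},
    ],
    "reduction": [
        {"lu": "reduction", "ls": "reduce", "c": "noun", "nb": "sg"},
    ],
}


def analyse(phrase):
    """Return all readings for each token; unknown tokens get an empty list."""
    return {token: LEXICON.get(token, []) for token in phrase.lower().split()}


readings = analyse("Cost reduction")
for token, alternatives in readings.items():
    categories = sorted({r["c"] for r in alternatives})
    print(token, len(alternatives), categories)
```

The point of the example is that "cost" stays three-ways ambiguous at this stage; choosing between the verb and noun readings is left to the shallow parser.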

Shallow Parsing

The company evaluated the cost reduction

cost: resolved as noun

Chunks: NP (The company) + finite verb (evaluated) + NP (the cost reduction)
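
A minimal sketch of the two steps on this slide, ambiguity resolution followed by NP identification, might look as follows in Python. Everything here is an assumption made for illustration: the `CANDIDATES` table, the single "noun after determiner" rule, and the `chunk` function are not taken from AUTINDEX, whose grammar rules are not shown in the slides.

```python
# Hypothetical candidate tags per token; a real analyser would supply these.
CANDIDATES = {
    "the": {"det"},
    "company": {"noun"},
    "evaluated": {"finite verb"},
    "cost": {"noun", "finite verb", "infinitive"},
    "reduction": {"noun"},
}


def disambiguate(tokens):
    """Pick one tag per token using a single toy rule."""
    tags = []
    for tok in tokens:
        options = CANDIDATES[tok.lower()]
        if len(options) > 1 and tags and tags[-1] == "det" and "noun" in options:
            tags.append("noun")  # directly after a determiner, prefer the noun reading
        else:
            tags.append(sorted(options)[0])  # single reading, or arbitrary fallback
    return tags


def chunk(tokens, tags):
    """Group runs of determiners and nouns into NPs; pass other tokens through."""
    chunks, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag in ("det", "noun"):
            current.append(tok)
            continue
        if current:
            chunks.append(("NP", " ".join(current)))
            current = []
        chunks.append((tag, tok))
    if current:
        chunks.append(("NP", " ".join(current)))
    return chunks


tokens = "The company evaluated the cost reduction".split()
tags = disambiguate(tokens)
chunks = chunk(tokens, tags)
print(chunks)
```

With this toy rule set the sentence comes out as NP, finite verb, NP, matching the analysis on the slide.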

Controlled Indexing

• Identifies multiword terms and their syntactic variants

• Calculates keywords based on frequency and semantic weighting

• Checks thesaurus for relevant entry

• Classifies text

Linguistic Resources for Indexing

• Multiword Terms and Variants

Direct match: cost reduction -> cost reduction

Indirect match (inflectional differences): cost reduction -> cost reductions

Linguistic Resources for Indexing

lexical synonyms: rise – increase

derivational synonyms:

biomagnetic – biomagnetism

air pollutant – air pollution

Linguistic Resources for Indexing

structural variants: costs of reduction – reduction costs

combined (structural plus derivational):

transmitted DC power – DC power transmission

to calculate plane waves – plane wave calculation
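
One simple way to see why these variants can be treated as the same term is to reduce each phrase to an order-free set of base forms. The Python sketch below does this with a hand-written table; the `BASE_FORM` and `FUNCTION_WORDS` tables and the `same_term` function are hypothetical, and the bag-of-base-forms technique is a stand-in chosen for brevity, not the matching method AUTINDEX uses.

```python
# Hypothetical function words to ignore when comparing variants.
FUNCTION_WORDS = {"of", "to", "the"}

# Hypothetical table mapping inflected and derived forms to one base form.
BASE_FORM = {
    "costs": "cost",
    "reductions": "reduction",
    "transmitted": "transmit",
    "transmission": "transmit",
    "calculation": "calculate",
    "waves": "wave",
    "pollutant": "pollute",
    "pollution": "pollute",
    "biomagnetism": "biomagnetic",
}


def normalise(term):
    """Reduce a phrase to an order-free set of base forms."""
    words = [w for w in term.lower().split() if w not in FUNCTION_WORDS]
    return frozenset(BASE_FORM.get(w, w) for w in words)


def same_term(a, b):
    """True if both phrases normalise to the same set of base forms."""
    return normalise(a) == normalise(b)


pairs = [
    ("cost reduction", "cost reductions"),
    ("costs of reduction", "reduction costs"),
    ("transmitted DC power", "DC power transmission"),
    ("to calculate plane waves", "plane wave calculation"),
]
for a, b in pairs:
    print(a, "|", b, "|", same_term(a, b))
```

Ignoring word order keeps the sketch short but can over-match; a grammar-based approach would constrain which reorderings count as variants.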

Semantic Weighting

• 140 semantic types in dictionaries

• Weight assigned to nouns depending on semantic type

• Result of weighting: a set of keywords belonging to the most frequent semantic classes
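
The weighting idea can be sketched as frequency multiplied by a per-type weight, keeping only nouns from the most frequent semantic classes. The Python below is a guess at the general shape, not the AUTINDEX formula: the semantic types, the weights, and the `top_classes` cut-off are all invented for the example.

```python
from collections import Counter

# Hypothetical semantic types and weights; the real dictionaries use about 140 types.
SEMANTIC_TYPE = {
    "pollution": "process",
    "emission": "process",
    "filter": "instrument",
    "year": "time",
    "city": "location",
}
TYPE_WEIGHT = {"process": 2.0, "instrument": 1.5, "location": 1.0, "time": 0.2}


def score_keywords(nouns, top_classes=2):
    """Score nouns by frequency times type weight, keeping only the most frequent classes."""
    freq = Counter(nouns)
    class_freq = Counter()
    for noun, n in freq.items():
        class_freq[SEMANTIC_TYPE[noun]] += n
    kept = {cls for cls, _ in class_freq.most_common(top_classes)}
    scores = {
        noun: n * TYPE_WEIGHT[SEMANTIC_TYPE[noun]]
        for noun, n in freq.items()
        if SEMANTIC_TYPE[noun] in kept
    }
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


nouns = ["pollution", "pollution", "emission", "filter", "filter", "year", "city"]
keywords = score_keywords(nouns)
print(keywords)
```

In this toy run, "year" and "city" drop out because their classes are the least frequent, which is the filtering effect the last bullet describes.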

Classification

• Descriptors annotated with Classification Code

• Hyperonym and Synonym relations used

• Frequency used to calculate Topic Classification
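
As a rough illustration of the frequency step, each descriptor found in a text can vote for its classification code, and the code with the most votes wins. The Python sketch below assumes a hypothetical `DESCRIPTOR_CODE` table with invented codes, and it leaves out the hyperonym and synonym relations mentioned above.

```python
from collections import Counter

# Hypothetical descriptor-to-code annotations; the codes are invented for this example.
DESCRIPTOR_CODE = {
    "air pollution": "ENV",
    "emission": "ENV",
    "cost reduction": "ECO",
}


def classify(descriptor_counts):
    """Return the code with the highest summed descriptor frequency, or None."""
    votes = Counter()
    for descriptor, count in descriptor_counts.items():
        code = DESCRIPTOR_CODE.get(descriptor)
        if code is not None:
            votes[code] += count
    if not votes:
        return None
    # Iterating over sorted codes makes ties resolve to the alphabetically first code.
    return max(sorted(votes), key=lambda code: votes[code])


topic = classify({"air pollution": 3, "emission": 2, "cost reduction": 4})
print(topic)
```

Here the two environmental descriptors together outvote the single economic one, even though "cost reduction" is the most frequent individual descriptor.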

User-Specific Thesauri

• Keywords checked against Thesaurus

• Hierarchical Structure of Thesaurus used to calculate Descriptors:

- hyperonym relations

- synonym relations
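
How a thesaurus check might turn keywords into descriptors can be sketched as follows: a synonym is replaced by its preferred descriptor, a hyperonym receives part of the weight of its narrower term, and anything not found in the thesaurus is kept as a free term. The tables, the `to_descriptors` function, and the 0.5 share passed to the broader term are all assumptions for illustration, not AUTINDEX behaviour.

```python
# Hypothetical miniature thesaurus.
DESCRIPTORS = {"increase", "air pollution", "water pollution", "pollution"}
SYNONYM_OF = {"rise": "increase", "air pollutant": "air pollution"}
BROADER = {"air pollution": "pollution", "water pollution": "pollution"}

BROADER_SHARE = 0.5  # assumed fraction of weight passed up to the hyperonym


def to_descriptors(keywords):
    """Split weighted keywords into thesaurus descriptors and free terms."""
    descriptors, free_terms = {}, {}
    for keyword, weight in keywords.items():
        preferred = SYNONYM_OF.get(keyword, keyword)
        if preferred not in DESCRIPTORS:
            free_terms[keyword] = weight
            continue
        descriptors[preferred] = descriptors.get(preferred, 0.0) + weight
        broader = BROADER.get(preferred)
        if broader is not None:
            descriptors[broader] = descriptors.get(broader, 0.0) + weight * BROADER_SHARE
    return descriptors, free_terms


keywords = {"air pollutant": 3.0, "water pollution": 1.0, "rise": 2.0, "dandelion": 1.0}
descriptors, free_terms = to_descriptors(keywords)
print(descriptors)
print(free_terms)
```

The two return values correspond to the descriptor list and the free-term list, each with a weighting attached.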

Example Output

• Keywords: List of descriptors from thesaurus plus weighting

• List of free terms / free descriptors plus weighting

• Topic Classification with relevant code

Free Indexing

• Free indexing follows the same steps as controlled indexing but without the use of a thesaurus

• The result is a list of free descriptors

Architecture

Bilingual Components

• Automatic language recognition

• Bilingual dictionaries

• Bilingual thesauri

Libraries & the Internet

• Switch of focus from libraries to Internet because of:

- Search engines, e.g. Google

- Poor access to library resources

Reasons for Poor Access

• search tools need full text match

• human indexation too general and inconsistent

• no flexibility in terms of semantic relations

AUTINDEX in Libraries

• A high percentage of all queries return no hit in the electronic library catalogue

• Of the rest, a high percentage is not used

IntelligentCAPTURE

• Complete processing chain for digital content in libraries:

- scanning of contents tables

- treatment with OCR technology

- automatic indexation

- feeding results into library system

- integration of improved retrieval system

Dandelon database

• Supports 16 EU languages for multilingual retrieval

• Running in 4 countries at 9 libraries

Work Flow

Summary

• AUTINDEX provides for controlled and free indexing

• Integrated in a complete processing chain, AUTINDEX can be used to improve access to library resources through efficient methods of indexation