October 2005 CSA3180: Text Processing I

CSA3180: Natural Language Processing

Text Processing 1

• Language Encoding Issues

• Common Corpora

• Handling Large Document Collections

• Applications: Anatomy of a Search Engine

• NLTK


Language Encoding Issues

• Different encoding methods

• Different languages

• Unicode Standard

• Further information:
  – Unicode Consortium
  – Jukka Korpela Tutorial: http://www.cs.tut.fi/~jkorpela/chars.html


Language Encoding Issues

• Character Repertoire – a set of distinct characters

• Character Code – a mapping between the characters of a repertoire and nonnegative integers (code positions)

• Character Encoding – an algorithm for representing the code numbers of characters as sequences of octets
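The three terms can be told apart with a short Python sketch. The sample characters and variable names are illustrative choices, not part of the slides; Python's built-in `ord` stands in for the character code (it returns the Unicode code point) and `str.encode` for the character encoding.

```python
# Character repertoire: the set of distinct characters of interest
# (a tiny illustrative sample, not any standard's actual repertoire).
repertoire = {"a", "é", "ħ"}

# Character code: each character is mapped to an integer.
# ord() gives the Unicode code point of a character.
codes = {ch: ord(ch) for ch in sorted(repertoire)}

# Character encoding: an algorithm that turns code numbers into octets.
# UTF-8 is used here as one example encoding.
utf8_octets = {ch: ch.encode("utf-8") for ch in sorted(repertoire)}

for ch in sorted(repertoire):
    print(ch, codes[ch], utf8_octets[ch].hex())
```

The same code number can be written out as different octet sequences depending on the encoding chosen, which is why the code and the encoding are kept as separate concepts.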


Language Encoding Issues

• Encoding using octets

• Common Encodings:
  – ASCII
  – ISO Latin I (ISO 8859-1)
  – ISO Latin II + III Extensions (Latin 3, ISO 8859-3, covers the Maltese letters)
  – Unicode & UTF-8
  – ANSI (in practice, Windows code pages such as Windows-1252)
  – Cyrillic and Chinese Encodings
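As a rough illustration of how the choice of encoding changes the octets, the sketch below encodes the Maltese letter ħ under several of the encodings listed. The helper name `encode_report` is hypothetical; the codec names are those shipped with Python's standard library.

```python
def encode_report(text, encodings):
    """Map each encoding name to the octets for `text`, or None if the
    encoding cannot represent the text."""
    report = {}
    for encoding in encodings:
        try:
            report[encoding] = text.encode(encoding)
        except UnicodeEncodeError:
            report[encoding] = None
    return report

text = "ħ"  # LATIN SMALL LETTER H WITH STROKE, used in Maltese
report = encode_report(text, ["ascii", "iso8859_1", "iso8859_3", "utf-8"])

for encoding, octets in report.items():
    if octets is None:
        print(f"{encoding:10} cannot represent {text!r}")
    else:
        print(f"{encoding:10} {len(octets)} octet(s): {octets.hex()}")

# Decoding octets with the wrong encoding does not fail here;
# it silently produces a different character.
misread = text.encode("iso8859_3").decode("iso8859_1")
print(repr(misread))
```

ASCII and ISO 8859-1 cannot represent the letter at all, ISO 8859-3 uses a single octet, and UTF-8 uses two.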


Language Encoding Issues

• Text encoding on the Web

• MIME Standard
  – Content-Type: text/html; charset=iso-8859-1
  – Used in Email and Web Servers
  – Problems in implementation: few encodings properly supported
  – UTF-8 recommended
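A minimal sketch of reading the charset from a Content-Type header and using it to decode a message body is given below. It relies on the standard library's `email.message.Message`; the function name `charset_from_content_type`, the UTF-8 fallback, and the sample body are assumptions made for illustration.

```python
from email.message import Message

def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset parameter of a Content-Type header value.

    Falling back to UTF-8 when no charset is declared is an assumption
    of this sketch, not something the MIME standard mandates.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default

charset = charset_from_content_type("text/html; charset=iso-8859-1")

# Octets as they might arrive from a mail or web server (made-up sample).
body = b"caf\xe9"
print(charset, body.decode(charset))
```

If the declared charset does not match the octets actually sent, decoding either fails or produces the wrong characters, which is one reason a single well-supported encoding such as UTF-8 is recommended.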


Common Corpora

• WordNet

• TREC/ACE/TIDES Corpora

• Linguistic Data Consortium (LDC)
  – GigaWord (News)
  – Tree Banks
  – MUC (Message Understanding Conference)
  – TIPSTER (Information Retrieval)


Handling Large Document Collections

• Special issues involved in processing

• Hierarchical directory structures

• File indexes

• Batch processing – start, resume, pause, end

• Job scheduling
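The points above can be sketched as a file index built from a directory tree plus a batch run that can be paused and resumed. This is one possible arrangement under stated assumptions: the function names, the JSON checkpoint file, and the `limit` parameter used to simulate a pause are all hypothetical.

```python
import json
import os

def build_file_index(root):
    """Walk a hierarchical directory and return a sorted list of file paths."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)

def run_batch(index, process, checkpoint_path, limit=None):
    """Process files in index order, skipping those already completed.

    Completed paths are written to a checkpoint file after every document,
    so an interrupted run can resume where it stopped. `limit` caps how many
    files this call handles, which stands in for pausing the job.
    Returns the number of files handled by this call.
    """
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, encoding="utf-8") as fh:
            done = set(json.load(fh))

    handled = 0
    for path in index:
        if path in done:
            continue
        if limit is not None and handled >= limit:
            break
        process(path)
        done.add(path)
        handled += 1
        with open(checkpoint_path, "w", encoding="utf-8") as fh:
            json.dump(sorted(done), fh)
    return handled
```

Calling `run_batch` again with the same checkpoint file resumes the job. Rewriting the checkpoint after every document keeps the sketch simple but would be slow on a very large collection, where appending to a log is a likely alternative.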


Applications

• "The Anatomy of a Large-Scale Hypertextual Web Search Engine" (Sergey Brin and Larry Page, 1998)

• Describes the internals of Google

• NLP in everyday life!
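To give a flavour of what a search engine does with text, here is a toy inverted index, the general data structure behind keyword lookup. It is a drastic simplification for illustration only and not a description of Google's implementation; the function names, whitespace tokenisation, and sample documents are assumptions.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {
    1: "Unicode is a character encoding standard",
    2: "Search engines index large document collections",
    3: "Character encoding matters when you index documents",
}
index = build_inverted_index(docs)
print(search(index, "character encoding"))
print(search(index, "index"))
```

A real engine adds proper tokenisation, ranking, and storage that scales to very large collections, all of which this sketch leaves out.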


Next Sessions…

• Natural Language Toolkit (NLTK)

• http://nltk.sourceforge.net/

• Please download and install!