October 2005 CSA3180: Text Processing I 1
CSA3180: Natural Language Processing
Text Processing 1
• Language Encoding Issues
• Common Corpora
• Handling Large Document Collections
• Applications: Anatomy of a Search Engine
• NLTK
Language Encoding Issues
• Different encoding methods
• Different languages
• Unicode Standard
• Further information:
– Unicode Consortium
– Jukka Korpela Tutorial: http://www.cs.tut.fi/~jkorpela/chars.html
Language Encoding Issues
• Character Repertoire – a set of distinct characters
• Character Code – a mapping between the characters of a repertoire and nonnegative integers (code numbers)
• Character Encoding – an algorithm for representing the code numbers of a character code as sequences of octets
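The three terms above can be made concrete with a short sketch. Python is used here only for illustration (the slides do not prescribe a language), and the sample word and variable names are assumptions.

```python
# Illustrative only: the sample word and variable names are assumptions.
text = "ħobż"  # Maltese for "bread"

# Character repertoire: the set of distinct characters in use.
repertoire = sorted(set(text))

# Character code: each character maps to a nonnegative integer.
# Here the code is Unicode, so ord() returns the Unicode code point.
code_points = {ch: ord(ch) for ch in repertoire}

# Character encoding: the code points are serialized as a sequence of octets.
octets = text.encode("utf-8")

print(repertoire)
print(code_points)
print(list(octets))
```

The same repertoire and character code can be paired with a different encoding (for example UTF-16), which changes only the octets, not the code points.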
Language Encoding Issues
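• Encodings represent characters as sequences of octets (8-bit bytes)
• Common Encodings:
– ASCII
– ISO Latin 1 (ISO 8859-1)
– ISO Latin 2 and Latin 3 (ISO 8859-2 and ISO 8859-3; Latin 3 covers Maltese)
– Unicode & UTF-8
– ANSI (Windows code pages)
– Cyrillic and Chinese Encodings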
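A minimal sketch of how these encodings differ in practice, using Python's built-in codecs. The sample word, the helper name `try_encode`, and the particular selection of encodings are assumptions made for illustration.

```python
# Illustrative only: the sample word, helper name, and encoding list are assumptions.
word = "ħobż"  # Maltese for "bread"

def try_encode(text, encoding):
    """Return the octets for text in the given encoding, or None if the
    encoding's repertoire lacks one of the characters."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

results = {enc: try_encode(word, enc)
           for enc in ("ascii", "iso8859_1", "iso8859_3", "utf-8")}

for enc, data in results.items():
    print(enc, "cannot represent the text" if data is None else list(data))

# Decoding octets with the wrong encoding raises no error here;
# it silently produces the wrong characters.
misread = results["iso8859_3"].decode("iso8859_1")
print(misread)
```

ASCII and ISO 8859-1 cannot represent the Maltese letters at all, ISO 8859-3 uses one octet per character, and UTF-8 uses two octets for each of the non-ASCII letters.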
Language Encoding Issues
• Text encoding on the Web
• MIME Standard
– Content-Type: text/html; charset=iso-8859-1
– Used in Email and Web Servers
– Problems in implementation: few encodings properly supported
– UTF-8 recommended
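As a rough sketch of how a client might use the MIME header, the following reads the charset parameter from a Content-Type value and decodes the body octets with it. It relies on Python's standard `email.message.Message` class; the function name `decode_body` and the UTF-8 fallback are assumptions of this example, not part of the MIME standard.

```python
from email.message import Message

def decode_body(content_type, body, default="utf-8"):
    """Decode body octets using the charset named in a Content-Type header.

    The function name and the UTF-8 fallback are assumptions of this sketch.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # Falls back to `default` when the header carries no charset parameter.
    charset = msg.get_content_charset(default)
    return msg.get_content_type(), body.decode(charset)

media_type, text = decode_body("text/html; charset=iso-8859-1", b"caf\xe9")
print(media_type, text)
```

A charset name that Python does not recognize would raise `LookupError` in this sketch, which mirrors the slide's point that only some encodings are properly supported by real implementations.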
Common Corpora
• WordNet
• TREC/ACE/TIDES Corpora
• Linguistic Data Consortium (LDC)
– GigaWord (News)
– Tree Banks
– MUC (Message Understanding Conference)
– TIPSTER (Information Retrieval)
Handling Large Document Collections
• Special issues involved in processing
• Hierarchical directory structures
• File indexes
• Batch processing – start, resume, pause, end
• Job scheduling
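The points above (directory hierarchies, file indexes, and batch runs that can be paused and resumed) can be sketched as follows. This is one possible arrangement under stated assumptions: the function names, the JSON checkpoint file, and the `max_files` limit used to model a pause are all hypothetical choices for illustration.

```python
import json
import os

def build_index(root):
    """Walk a directory tree and return a sorted list of relative file paths."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(paths)

def run_batch(root, checkpoint_path, process, max_files=None):
    """Process indexed files in order, recording progress after each one.

    Stopping early via max_files models a pause; calling again resumes
    from the checkpoint. Returns True once every indexed file is done.
    """
    index = build_index(root)
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, encoding="utf-8") as fh:
            done = set(json.load(fh))
    handled = 0
    for rel in index:
        if rel in done:
            continue  # already processed in an earlier run
        if max_files is not None and handled >= max_files:
            break  # pause: leave the rest for the next run
        process(os.path.join(root, rel))
        done.add(rel)
        handled += 1
        # Rewrite the checkpoint after every file so a crash loses little work.
        with open(checkpoint_path, "w", encoding="utf-8") as fh:
            json.dump(sorted(done), fh)
    return all(rel in done for rel in index)
```

The checkpoint file should live outside `root`, otherwise it would be indexed as a document. A job scheduler could call `run_batch` repeatedly with a small `max_files` until it returns `True`.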
Applications
• "The Anatomy of a Large-Scale Hypertextual Web Search Engine" (Sergey Brin and Lawrence Page)
• Describes the internals of Google
• NLP in everyday life!
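One data structure the Brin and Page paper describes is the inverted index, which maps each word to the documents that contain it. The following is a toy sketch of that idea only, not Google's implementation: the whitespace tokenization, the function names, and the sample documents are all assumptions.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased, whitespace-separated token to the set of
    document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every token in the query."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

# Hypothetical sample documents.
docs = {
    1: "text processing with unicode",
    2: "large text collections",
    3: "unicode encoding",
}
index = build_inverted_index(docs)
print(search(index, "unicode text"))
```

A real engine adds ranking, positional information, and on-disk storage; this sketch covers only Boolean AND lookup in memory.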
Next Sessions…
• Natural Language Toolkit (NLTK)
• http://nltk.sourceforge.net/
• Please download and install!