LIS6 18 lecture 2 preparing and preprocessing
description
Transcript of LIS6 18 lecture 2 preparing and preprocessing
![Page 1: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/1.jpg)
LIS618 lecture 2
preparing and preprocessing
Thomas Krichel2011-04-21
![Page 2: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/2.jpg)
today
• text files and non-text files• characters• lexical analysis
![Page 3: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/3.jpg)
preliminaries
• Here we discuss some preparatory stages that has the way the IR system is built.
• This may sound unduly technical.• But the reality is that often, when querying IR
systems– we don’t know how they have been built– we can’t find something that is not there– we have to make guesses and try several things
![Page 4: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/4.jpg)
key aid: index
• An index is a list of features, with a list of locations where the feature is to be found.– The feature may be a word– The locations depend on where we need to find
the feature at.
![Page 5: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/5.jpg)
example: book
• In a book the index links terms and locations are page numbers.
• Index terms are selected manually by the author or a specially trained indexer.
• Once indexing terms are chosen the index can be built by computer.
• Note that any index terms may be on several page.
![Page 6: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/6.jpg)
example: computer files
• Here the feature can be a word.• An a location can be the location of the file
with the location of the start term within the file.
• The location of the term can be found by looking at the start and add the length of the term.
![Page 7: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/7.jpg)
working with files
• Usually, when indexing takes place, we study the contents of documents that are stored in files.
• Usually you know something about the files– by the name of the directory– by commonly used file name part, e.g. extension
• JHOVE is a tool that tries to guess file types when you don’t know much about them.
![Page 8: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/8.jpg)
character extraction
• The first thing to do is to extract the characters out of these documents.
• That means translating the file content, a series of 0s and 1s into a sequence of characters.
• For some types the textual information is limited.
![Page 9: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/9.jpg)
image file, music files
• Even if you have image files and music files, there usually is some textual data that is store.
• jpeg files have exif metadata.• mp3 file have id3 metadata.• These data has simple attribute: value data.
![Page 10: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/10.jpg)
example for id3 version, start
• header: "TAG" • title: 30 bytes of the title • artist: 30 bytes of the artist name • album: 30 bytes of the album name • year: 4 A four-digit year • comment: 30 bytes of comment• … more fields
![Page 11: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/11.jpg)
mixed contents files
• If you look at MS word, spreadsheet or PDF file, they contain text surrounded by not textual code that the software reading it can understand.
• Such code contains descriptions of document structure, as well as formatting instructions.
![Page 12: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/12.jpg)
markup?
• Markup is the non-textual contents of a text documents.
• Markup is used to provide structure to a document and to prepare for its formatting.
• An easy example for markup is any page written in HTML
![Page 13: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/13.jpg)
example from a very fine page
<li><a tabindex="400" class="int" href="travel_schedule.html" accesskey="t">travel
schedule</a></li> <li><a tabindex="500" class="int" accesskey="l"
href="links.html">favorite links</a>.</li></ul><div class="hidden"> This text is hidden.</div>
![Page 14: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/14.jpg)
text extracted
• The extracted text is “travel schedule favorite links. This text is hidden.”
• The phrase “This text is hidden” is not normally shown on the page but it is still extracted and indexed by Google.
• Well, I think it is.
![Page 15: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/15.jpg)
tokenization
• Tokenization refers to the splitting of the source to be indexed into tokens that can be used in a query language.
• In the most common case, we think about splitting the text into words.
• Hey that is easy: “travel” “schedule” “favorite” “links” “this” “text” “is” “hidden”
• Spot a problem?
![Page 16: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/16.jpg)
text files
• Some files can either contain text, meant for the end user.
• Some others contain text, plus some textual instruction that are means for processing software. Such instructions are called markup.
• Typical examples for markup are web pages.
![Page 17: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/17.jpg)
key aid: index terms• The index term is a part of the document that
has a meaning on its own.• It is usually a word. Sometimes it is a phrase.• Let us stick with the idea of the indexing term
being a word.
![Page 18: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/18.jpg)
the representation of text
• In principle, every file is just a sequence of on/off electrical signals.
• How can these signals translated into text?• Well text is a sequence of characters and
every character needs to be recognized.• Character recognition leads to text
recognition.
![Page 19: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/19.jpg)
bits and bytes
• A bit is a unit of digital information. It can take the values 0 and 1.
• A byte (respelling of “bite” to avoid confusion) is a unit of storage. It should be hardware dependent but it is de facto 8 bits.
![Page 20: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/20.jpg)
characters and computer• Computers can not deal with characters
directly. They can only deal with numbers.• There we need to associate a number with
every character that we want to use in an information encoding system. • A character set combines characters with
number.
![Page 21: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/21.jpg)
characters
• When we are working with characters, two concepts appear.– The character set.– The character encoding.
• To properly recognize digital data, we need to know both.
![Page 22: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/22.jpg)
The character set
• The character set is an association of characters with numbers. Example– ‘ ‘ 32– bell 7– ‘’ 8594
• Without such a concordance the processing of text by computer would be impossible.
![Page 23: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/23.jpg)
the character encoding
• This regards the way the character number is written out as a sequence of bytes.
• There are several ways to do this. Going into details here would be a bit too technical.
• Let us look at some examples.
![Page 24: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/24.jpg)
ascii
• The most famous, but old character set is the American Standard for Information Interchange ASCII.
• It contains 128 characters.• Its representation in bytes has every byte start
with a 0 and then the seven bits that the character number makes up.
![Page 25: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/25.jpg)
iso-8859-1 (iso-latin-1)
• This is an extension of ASCII that has additional characters that are suitable for the Western European languages.
• There are 256 characters.• Every character occupies a byte.
![Page 26: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/26.jpg)
Windows 1252
• This is an extension of iso-latin-1.• It is a legacy character set used on the
Microsoft Windows operating system. • Some code positions of iso-latin-1 have been
filled with characters that Microsoft liked.
![Page 27: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/27.jpg)
UCS / Unicode
• UCS is a universal character set. • It is maintained by the International Standards
Organization.• Unicode is an industry standard for characters.
It is better documented than UCS.• For what we discuss here, UCS and Unicode
are the same.
![Page 28: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/28.jpg)
Basic multilingual plane
• This is a name for the first 65536 characters in Unicode.• Each of these characters fits into two bytes
and is conveniently represented by four hex numbers.• Even for these characters, there are numerous
complications associated with them.
![Page 29: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/29.jpg)
ascii and unicode
• The first 128 characters of UCS/Unicode are the same as the ones used by ASCII.
• So you can think of UCS/Unicode as an extension of ASCII.
![Page 30: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/30.jpg)
in foreign languages
• Everything becomes difficult. • As an example consider the characters
– o– ő– ö
• The latter two can be considered o with diarcitics or as separate characters.
![Page 31: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/31.jpg)
most problematic: encoding
• One issue is how to map characters to numbers.
• This is complicated for languages other than English.
• But assume UCS/Unicode has solved this.• But this is not the main problem that we
have when working with non-ASCII character data.
![Page 32: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/32.jpg)
encoding
• The encoding determines how the numbers of each character should be put into bytes.
• If you have a character set that is has one byte for each character, you have no encoding issue.
• But then you are limited to 256 characters in your character set.
![Page 33: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/33.jpg)
fixed-length encoding
• If you have a fixed length encoding, all characters take the same number of bytes.
• Say for the basic-multilingual plane of unicode, you need two bytes for each character, and then you are limited to that.
• If you are writing only ASCII, it appears a waste.
![Page 34: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/34.jpg)
variable length encoding
• The most widely used scheme to encode Unicode is a variable length scheme, called UTF-8.
• It is important to understand that the encoding needs to known and correct.
![Page 35: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/35.jpg)
bascis of UTF-8
• Every ASCII character, represented as a byte, starts with a zero.
• Characters that are beyond ASCII require two or three bytes to be complete.
• The first byte will tell you how many bytes are coming to make the character complete.
![Page 36: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/36.jpg)
byte shapes in UTF-8 encoding
• 0?????? ASCII• 110???? first octet of two-byte character • 1110???? first byte of three-byte character • 11110??? first octet of four-byte character• 10??????? byte that is not the first byte• as you can see, there are sequences of bytes
that are not valid
![Page 37: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/37.jpg)
hex range to UTF-8
• 0000 to 007F 0???????• 0080 to 07FF 110????? 10??????• 0800 to FFFF 1110???? 10?????? 10??????
![Page 38: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/38.jpg)
in the simplest case
• In the simplest case, let us assume that we index a bunch of text files as full text.
• These documents contain text written in some sort of a human language.
• They may also contain some markup, but we assume we can strip the markup.
![Page 39: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/39.jpg)
lexical analysis
• Lexical analysis is the process of preparing a sequence or characters into a sequence of words.
• These candidate words can then be used as index term (or what I have called features).
![Page 40: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/40.jpg)
text direction
• Text is not necessarily linear. • When right to left writing is used, you
sometimes have to jump in the opposite direction when a left to right expression appears. Example:
• عام استقاللها الجزائر 132بعد 1962حققتالفرنسي االحتالل من .سنة
![Page 41: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/41.jpg)
spaces
• One important delimiter between words are spaces.
• The blank, the newline, the carriage return and the tab characters are collectively known as whitespace
• We can collapse whitespace and get the words.
• But there are special cases.
![Page 42: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/42.jpg)
text without spaces: Japanese
• Japanese uses 3 scripts, kanji, hiragana and katakana.
• All foreign words are written with katakana. Any conitunous katakana sequence is a word.
• A Japanese word has at most 4 kanji/hiragana. A dictionary has to find backward from 4 to 1 if a word exists.
![Page 43: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/43.jpg)
punctuation
• Normally lexical analysis discards punctuation. But some punctuation can be part of a word, say “333B.C.”
• What we have to understand here is that probably at query time, the same removal of punctuation occurs.
• The user searches for “333B.C”, the query is translated to 333BC and the occurrence is found.
![Page 44: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/44.jpg)
hyphen
• It’s a particularly problematic punctuation mark.
• Breaking them, like, for example, in “state-of-the-art” is good.
• End of line hyphenation breaks at a line end may not be ends of words.
• Some hyphenation is part of a word, e.g. UTF-8. I should use the non-breaking hyphen.
![Page 45: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/45.jpg)
apostrophe
• Example:– Mr. O’Neill thinks that the boys’ stories about
Chile’s capital aren’t amusing.• O Neil … aren t• The first seem ok, but second looks silly.
![Page 46: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/46.jpg)
stop words
• Some words are considered to carry no meaning because they are too frequent.
• Sometimes, they are completely removed, and not removed in the index.
• Then you can’t search for the phrase “to be or not to be”.
• Stop word are very language dependent.
![Page 47: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/47.jpg)
stemming
• Frequently, text contains grammatical forms that a user may not look at. In English, e.g.– plurals– gerund form– past tense suffixes
• The most famous algorithm for English stemming is Martin Porter’s stemming algorithm.
• Evidence on usefulness is mixed.
![Page 48: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/48.jpg)
grammatical variants
• These are much more important in other languages. E.g., in Russian – “My name is Irina.” “Меня зовут Ирина.” – “I love Irina.” “Я люблю Ирину.”
• Text in a query in Russian must be matched against a variety of grammatical forms, so of which are very tricky to derive.
![Page 49: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/49.jpg)
casing
• Normalization usually means that we treat uppercase and lowercase the same, meaning that the uppercase and lowercase variants of a token become the same term.
• However in English– “US” vs “us”– “IT” vs “it”…
![Page 50: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/50.jpg)
treatment of diacritics
• There are rules in Unicode how diacritics can be stripped.
• These involve decomposing diacritic• This is helpful for people who find the input of
diacritics tough.• But there are lots of other tough choices.
![Page 51: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/51.jpg)
historic equivalents
• Many languages that use diacritics or ligatures have developed spelling variants without them. – Jaspersstraße Jasperstrasse– Völklingen Voelklingen
• A mechanical de-diacritic operation would yield “Volklingen” which few people would chose.
• Such operations are very language specific.
![Page 52: LIS6 18 lecture 2 preparing and preprocessing](https://reader036.fdocuments.us/reader036/viewer/2022062316/56816731550346895ddbda6a/html5/thumbnails/52.jpg)
http://openlib.org/home/krichel
Please shutdown the computers whenyou are done.
Thank you for your attention!