Lecture 1: Encoding Language
LING 1330/2330: Introduction to Computational Linguistics
Na-Rae Han
Objectives
Understand the fundamentals of how language is encoded on a computer
Text encoding systems
ASCII
ISO-8859
Unicode
1/12/2017 2
The language of computers:
How is language represented on a computer?
Natural ("Human") languages:
Spoken form
Written form
*Also: sign languages
The language of computers
At the lowest level, computer language is binary:
Information on a computer is stored in bits
A bit is either: ON (=1, =yes) or OFF (=0, =no)
This language essentially contains two alphabetic characters
Next level up: byte
A byte is made up of a sequence of 8 bits
ex. 01001101
Historically, a byte was the number of bits used to encode a single character of text in a computer
The byte is the basic addressable unit in most computer architectures
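A minimal Python sketch of the bit/byte idea, using the example byte from this slide. The variable names are illustrative only.

```python
# One byte = 8 bits. Read the bit pattern as a base-2 integer.
bits = "01001101"
value = int(bits, 2)        # interpret the string as a binary number
patterns = 2 ** len(bits)   # number of distinct 8-bit patterns

print(value)     # 77
print(patterns)  # 256
```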
Encoding a written language
How to represent a text with 0s and 1s?
Hello world!
010010000110010101101100011011000110111100100000011101110110111101110010011011000110010000100001
Each character is mapped to a code point (=character code), i.e., a unique integer.
H -> 72dec
e -> 101dec
Each code point is represented as a binary number, using a fixed number of bits.
8 bits == 1 byte in the example above
H -> 72dec -> 01001000 (2^6 + 2^3 = 64 + 8 = 72)
e -> 101dec -> 01100101 (2^6 + 2^5 + 2^2 + 2^0 = 64 + 32 + 4 + 1 = 101)
One byte can represent 256 (= 2^8) different characters
00000000 = 0dec ... 11111111 = 255dec
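The character -> code point -> binary mapping can be sketched in Python as follows. The helper name to_bits is made up for this illustration; ord() and format() are Python built-ins.

```python
def to_bits(text, width=8):
    """Map each character to its code point, then to a fixed-width binary string."""
    return " ".join(format(ord(ch), f"0{width}b") for ch in text)

print(ord("H"), ord("e"))   # 72 101
print(to_bits("He"))        # 01001000 01100101
```

Calling to_bits("Hello world!") and removing the spaces should reproduce the 96-bit string shown on this slide.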
ASCII encoding for English
How many bits are needed to encode English? 26 lowercase letters: a, b, c, d, e, …
26 uppercase letters: A, B, C, D, E, …
10 Arabic digits: 0, 1, 2, 3, 4, …
Punctuation: . , : ; ? ! ' "
Symbols: ( ) < > & % * $ + -
We are already up to 80
6 bits (2^6 = 64) is not enough; we will need at least 7 (2^7 = 128)
ASCII (the American Standard Code for Information Interchange) did just that
Uses 7-bit code (= 128 characters) for storing English text
Range 0 to 127
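The counting argument above can be checked with a short sketch. It assumes the inventory is exactly the characters listed on this slide (letters, digits, 8 punctuation marks, 10 symbols); the variable names are illustrative.

```python
import math
import string

# Character inventory as listed on the slide.
inventory = (string.ascii_lowercase + string.ascii_uppercase + string.digits
             + ".,:;?!'\"" + "()<>&%*$+-")

n_chars = len(set(inventory))                 # distinct characters to encode
bits_needed = math.ceil(math.log2(n_chars))   # smallest n with 2**n >= n_chars

print(n_chars, bits_needed)   # 80 7
```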
The ASCII chart
https://en.wikipedia.org/wiki/ASCII
http://web.alfredstate.edu/weimandn/miscellaneous/ascii/ASCII%20Conversion%20Chart.pdf
Decimal Binary (7-bit) Character
0 000 0000 (NULL)
… … …
35 010 0011 #
36 010 0100 $
… … …
48 011 0000 0
49 011 0001 1
50 011 0010 2
… … …
65 100 0001 A
66 100 0010 B
67 100 0011 C
… … …
97 110 0001 a
98 110 0010 b
99 110 0011 c
… … …
127 111 1111 (DEL)
ASCII (the American Standard Code for Information Interchange)
The ASCII encoding scheme
First published in 1963
Uses 7-bit code (= 128 characters) for storing English text, ranging from 0 to 127
In an 8-bit (1 byte) representation, the highest bit is always 0
Printable characters
Upper and lower case roman alphabet
Digits
Punctuation marks, symbols, and space
Includes 33 non-printing control characters (codes 0-31 and 127)
Control characters: BELL, ACKNOWLEDGE, BACKSPACE, DELETE, etc.; originally for teletype machines, many obsolete now
WHITESPACE characters: TAB, LINE FEED, CARRIAGE RETURN
Practice
What is this English text?
01001000 01101001 00100001
Note: byte (=8-bit) ASCII representation instead of 7-bit
Space provided for your convenience only!
Answer:
Hi!
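The practice exercise can also be solved mechanically. This is a sketch only; from_bits is a hypothetical helper name, and it assumes the bytes are separated by spaces as on the slide.

```python
def from_bits(bit_string):
    """Decode space-separated 8-bit groups as ASCII characters."""
    return "".join(chr(int(group, 2)) for group in bit_string.split())

print(from_bits("01001000 01101001 00100001"))   # Hi!
```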
Extending ASCII: ISO-8859, etc.
ASCII (=7 bit, 128 characters) was sufficient for encoding English. But what about characters used in other languages?
Solution: Extend ASCII into 8-bit (=256 characters) and use the additional 128 slots for non-English characters
ISO-8859: has 15 different parts (numbered 1 to 16; part 12 was abandoned)!
ISO-8859-1 aka Latin-1: French, German, Spanish, etc.
ISO-8859-7 Greek alphabet
ISO-8859-8 Hebrew alphabet
JIS X 0208: Japanese characters
Problem: overlapping character code space.
224dec means à in Latin-1 but א in ISO-8859-8!
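The overlap can be seen directly in Python by decoding the same byte under two codecs. A small sketch, with illustrative variable names; "iso-8859-1" and "iso-8859-8" are codec names from Python's standard library.

```python
raw = bytes([224])                     # one byte with value 224dec

as_latin1 = raw.decode("iso-8859-1")   # Latin-1 reading
as_hebrew = raw.decode("iso-8859-8")   # Hebrew reading

# Same byte, two different characters.
print(as_latin1, as_hebrew)
```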
The problem with multiple encoding systems
Problem: Multiple coding systems map different characters to the same character code
Solution 1: Provide meta-information on coding system
Ex. MIME (Multipurpose Internet Mail Extensions)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
But what if your message contains characters from multiple coding systems?
Solution 2: Have a single universal code system for all writing systems -> UNICODE
Unicode
A character encoding standard developed by the Unicode Consortium
Provides a single representation for all of the world's writing systems
"Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language."
(http://www.unicode.org)
How big is Unicode?
Version 9.0 (2016) has codes for 128,237 characters
Full Unicode standard uses 32 bits (4 bytes): it can represent 2^32 = 4,294,967,296 characters!
In reality, only 21 bits are needed
Unicode has three encoding forms
UTF-32 (32 bits/4 bytes): direct representation
UTF-16 (16 bits/2 bytes): 2^16 = 65,536 possibilities
UTF-8 (8 bits/1 byte): 2^8 = 256 possibilities
8-bit, 16-bit, 32-bit
UTF-32 (32 bits/4 bytes): direct representation
UTF-16 (16 bits/2 bytes): 2^16 = 65,536 possibilities
UTF-8 (8 bits/1 byte): 2^8 = 256 possibilities
Wait! But how do you represent all of 2^32 (=4 billion) code points with only one byte (UTF-8: 2^8 = 256 slots)? You don't.
In reality, only 21 bits (2^21 = 2,097,152 values) are ever utilized for 128K characters.
UTF-8 and UTF-16 use a variable-width encoding.
Why UTF-16 and UTF-8? They are more compact (more so for certain languages, e.g., English)
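To see the compactness difference, the sketch below counts bytes per character in each encoding form. The helper byte_lengths and the sample characters are illustrative choices, not from the slides (the euro sign in particular is added only to show a 3-byte UTF-8 case). The "-be" codec variants are used so that Python does not prepend a byte-order mark.

```python
def byte_lengths(ch):
    """Return the encoded size in bytes of one character as (UTF-8, UTF-16, UTF-32)."""
    encodings = ("utf-8", "utf-16-be", "utf-32-be")
    return tuple(len(ch.encode(enc)) for enc in encodings)

# H, É, euro sign, and U+1D122 (a higher-number character)
for ch in ["H", "\u00c9", "\u20ac", "\U0001D122"]:
    print(f"U+{ord(ch):04X}", byte_lengths(ch))
# U+0048 (1, 2, 4)
# U+00C9 (2, 2, 4)
# U+20AC (3, 2, 4)
# U+1D122 (4, 4, 4)
```

English text, being all ASCII, takes 1 byte per character in UTF-8 but 4 in UTF-32.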
Variable-width encoding
UTF-8 as a variable-width encoding
ASCII characters get encoded with just 1 byte
ASCII was originally 7-bit, so the highest bit is always 0 in an 8-bit encoding
All other characters are encoded with multiple bytes
How to tell? The highest bit is used as a flag.
Highest bit 0: single-byte character
Highest bit 1: part of a multi-byte character
Advantage for English: 8-bit ASCII is already valid UTF-8!
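Example byte stream: 01001000 11000011 10001001 01101001 01101001
01001000 = H (1 byte), 11000011 10001001 = É (2 bytes), 01101001 = i (1 byte each)
'H' as 1 byte (8 bits): 01001000
cf. 'H' as 2 bytes (16 bits): 0000000001001000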
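The highest-bit flag can be inspected in Python. This is a rough sketch; describe_utf8 is a hypothetical helper and the test string "HÉi" is chosen only for illustration.

```python
def describe_utf8(text):
    """Return (byte as bits, role) pairs for the UTF-8 encoding of text."""
    rows = []
    for b in text.encode("utf-8"):
        # Test the highest of the 8 bits: 0 = ASCII, 1 = part of a multi-byte character.
        role = "multi-byte" if b & 0b10000000 else "single-byte"
        rows.append((format(b, "08b"), role))
    return rows

for bits, role in describe_utf8("H\u00c9i"):
    print(bits, role)
# 01001000 single-byte
# 11000011 multi-byte
# 10001001 multi-byte
# 01101001 single-byte
```

For pure ASCII text the ASCII and UTF-8 byte sequences are identical, e.g. "Hi!".encode("ascii") == "Hi!".encode("utf-8").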
A look at Unicode chart
How to find your Unicode character:
http://www.unicode.org/standard/where/
http://www.unicode.org/charts/
Basic Latin (ASCII)
http://www.unicode.org/charts/PDF/U0000.pdf
Code point for M.
But "004D"?
Another representation: hexadecimal
Hexadecimal (hex) = base-16
Utilizes 16 characters: 0 1 2 3 4 5 6 7 8 9 A B C D E F
Designed for human readability & easy byte conversion
2^4 = 16: 1 hexadecimal digit is equivalent to 4 bits
1 byte (=8 bits) is encoded with just 2 hex chars!
Unicode characters are usually referenced by their hexadecimal code
Lower-number characters go by their 4-digit hex codes (2 bytes), e.g. U+004D ("M", U+ designates Unicode)
Higher-number characters go by 5 or 6 hex digits, e.g. U+1D122 (http://www.unicode.org/charts/PDF/U1D100.pdf)
Letter   Base-10 (decimal)   Base-2 (binary)       Base-16 (hex)
M        77                  0000 0000 0100 1101   004D
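The three representations of a code point can be produced with Python's built-in ord() and format(). A minimal sketch; code_point_forms is a hypothetical helper name.

```python
def code_point_forms(ch):
    """Return a character's code point as (decimal, 16-bit binary string, U+ hex notation)."""
    cp = ord(ch)
    return cp, format(cp, "016b"), f"U+{cp:04X}"

print(code_point_forms("M"))             # (77, '0000000001001101', 'U+004D')
print(code_point_forms("\U0001D122")[2]) # U+1D122
```

Each hex digit lines up with 4 bits: 004D is 0000 0000 0100 1101.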