Lecture 1: Encoding Languages - University of Pittsburghnaraehan/ling1330/Lecture1.pdf ·...

Lecture 1: Encoding Language

LING 1330/2330: Introduction to Computational Linguistics

Na-Rae Han

Objectives

Understand the fundamentals of how language is encoded on a computer

Text encoding systems

ASCII

ISO-8859

Unicode

1/12/2017 2

The language of computers:

How is language represented on a computer?

Natural ("Human") languages:

Spoken form

Written form

*Also: sign languages

The language of computers

At the lowest level, computer language is binary:

Information on a computer is stored in bits

A bit is either: ON (=1, =yes) or OFF (=0, =no)

This language essentially has an alphabet of just two characters: 0 and 1

Next level up: byte

A byte is made up of a sequence of 8 bits

ex. 01001101

Historically, a byte was the number of bits used to encode a single character of text in a computer

A byte is the basic addressable unit in most computer architectures
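To make the bit and byte arithmetic concrete, here is a minimal Python sketch that reads the example byte above as a number. The variable names are chosen for illustration and are not part of the lecture.

```python
# The 8-bit pattern from the slide, written as a string of 0s and 1s.
bits = "01001101"

# Interpret the bit string as a base-2 integer.
value = int(bits, 2)

# Each bit position carries a power of two; collect the powers for the 1-bits.
powers = [2 ** (len(bits) - 1 - i) for i, b in enumerate(bits) if b == "1"]

print(value)        # 77
print(powers)       # [64, 8, 4, 1]
print(sum(powers))  # 77
```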

Encoding a written language

How to represent a text with 0s and 1s? Hello world!

010010000110010101101100011011000110111100100000011101110110111101110010011011000110010000100001

Each character is mapped to a code point (= character code), i.e., a unique integer.

H → 72 (decimal)

e → 101 (decimal)

Each code point is represented as a binary number, using a fixed number of bits. 8 bits == 1 byte in the example above.

H → 72 (decimal) → 01001000 (2^6 + 2^3 = 64 + 8 = 72)

e → 101 (decimal) → 01100101 (2^6 + 2^5 + 2^2 + 2^0 = 64 + 32 + 4 + 1 = 101)

One byte can represent 256 (= 2^8) different characters: 00000000 (0 decimal) through 11111111 (255 decimal)
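The mapping from characters to code points to fixed-width binary can be reproduced in a few lines of Python. This is only a sketch of the idea; it relies on the built-in ord() and format() functions, and the variable names are illustrative.

```python
text = "Hello world!"

# ord() returns the code point of a character as an integer.
code_points = [ord(ch) for ch in text]

# format(n, "08b") renders an integer in binary, zero-padded to 8 bits.
bit_string = "".join(format(n, "08b") for n in code_points)

print(code_points[:2])  # [72, 101]
print(bit_string[:16])  # 0100100001100101
print(len(bit_string))  # 96
```

Twelve characters at one byte each give the 96-bit string shown on the slide.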

ASCII encoding for English

How many bits are needed to encode English?

26 lowercase letters: a, b, c, d, e, …

26 uppercase letters: A, B, C, D, E, …

10 Arabic digits: 0, 1, 2, 3, 4, …

Punctuation: . , : ; ? ! ' "

Symbols: ( ) < > & % * $ + -

We are already up to 80 characters

6 bits (2^6 = 64) are not enough; we will need at least 7 (2^7 = 128)

ASCII (the American Standard Code for Information Interchange) did just that

Uses 7-bit code (= 128 characters) for storing English text

Range 0 to 127
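The bit count above can be checked with a short calculation. The sketch below tallies the inventory listed on this slide (the dictionary keys are labels chosen here for illustration) and finds the smallest number of bits that offers enough distinct codes.

```python
# Character inventory as counted on the slide.
inventory = {
    "lowercase": 26,
    "uppercase": 26,
    "digits": 10,
    "punctuation": 8,   # . , : ; ? ! ' "
    "symbols": 10,      # ( ) < > & % * $ + -
}
total = sum(inventory.values())

# Codes 0 .. total-1 are needed; the bit length of the largest code
# is the smallest n with 2**n >= total.
bits_needed = (total - 1).bit_length()

print(total)              # 80
print(bits_needed)        # 7
print(2 ** 6, 2 ** 7)     # 64 128
```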

The ASCII chart

https://en.wikipedia.org/wiki/ASCII

http://web.alfredstate.edu/weimandn/miscellaneous/ascii/ASCII%20Conversion%20Chart.pdf

Decimal   Binary (7-bit)   Character
0         000 0000         (NULL)
…         …                …
35        010 0011         #
36        010 0100         $
…         …                …
48        011 0000         0
49        011 0001         1
50        011 0010         2
…         …                …
65        100 0001         A
66        100 0010         B
67        100 0011         C
…         …                …
97        110 0001         a
98        110 0010         b
99        110 0011         c
…         …                …
127       111 1111         (DEL)
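Rows of the chart can be regenerated from Python's built-in chr() and format(). The helper name ascii_row below is hypothetical and exists only for this sketch.

```python
def ascii_row(code: int) -> str:
    """Return one chart row: decimal, 7-bit binary split 3+4, and the character."""
    bits = format(code, "07b")
    return f"{code:>3}  {bits[:3]} {bits[3:]}  {chr(code)}"

# Print a few printable rows from the chart.
for code in (35, 36, 48, 65, 97):
    print(ascii_row(code))
```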




ASCII (the American Standard Code for Information Interchange)

The ASCII encoding scheme

First published in 1963

Uses 7-bit code (= 128 characters) for storing English text, ranging from 0 to 127

In an 8-bit (1 byte) representation, the highest bit is always 0

Printable characters

Upper and lower case roman alphabet

Digits

Punctuation marks, symbols, and space

Includes 33 non-printing characters (codes 0 to 31, plus 127)

Control characters: BELL, ACKNOWLEDGE, BACKSPACE, DELETE, etc. Originally for teletypewriters; many are obsolete now

Whitespace characters: TAB, LINE FEED, CARRIAGE RETURN
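The split between printable and non-printing codes can be tallied directly. This sketch assumes the usual convention that codes 0 to 31 and 127 (DEL) are control characters and that space (32) counts as printable.

```python
# Control characters: codes 0-31 plus 127 (DEL).
control = [c for c in range(128) if c < 32 or c == 127]

# Everything else, from space (32) through tilde (126), is printable.
printable = [c for c in range(128) if 32 <= c < 127]

print(len(control), len(printable))  # 33 95

# Python escape sequences for some of the characters named on the slide:
# TAB, LINE FEED, CARRIAGE RETURN, BELL, BACKSPACE.
print(ord("\t"), ord("\n"), ord("\r"), ord("\a"), ord("\b"))  # 9 10 13 7 8
```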

Practice

What is this English text?

01001000 01101001 00100001

Note: byte (= 8-bit) ASCII representation instead of 7-bit

Spaces between bytes are provided for your convenience only!

Answer: Hi!
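The practice exercise can also be solved mechanically. A minimal sketch: split the string on spaces, read each group as a base-2 integer, and decode the resulting bytes as ASCII.

```python
bits = "01001000 01101001 00100001"

# Each space-separated group is one byte written in binary.
byte_values = [int(group, 2) for group in bits.split()]

# Decode the byte values using the ASCII table.
decoded = bytes(byte_values).decode("ascii")

print(byte_values)  # [72, 105, 33]
print(decoded)      # Hi!
```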

Extending ASCII: ISO-8859, etc.

ASCII (=7 bit, 128 characters) was sufficient for encoding English. But what about characters used in other languages?

Solution: Extend ASCII into 8-bit (=256 characters) and use the additional 128 slots for non-English characters

ISO-8859: comes in 16 different variants (parts)!

ISO-8859-1 aka Latin-1: French, German, Spanish, etc.

ISO-8859-7 Greek alphabet

ISO-8859-8 Hebrew alphabet

JIS X 0208: Japanese characters

Problem: overlapping character code space.

Code 224 (decimal) means à in Latin-1 but א in ISO-8859-8!
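The overlap is easy to demonstrate: the same byte decodes to different characters depending on which table is applied. The sketch below uses Python's standard codec names iso8859_1 and iso8859_8.

```python
# One byte with the value 224 (binary 11100000).
raw = bytes([224])

# The same byte, interpreted under two different ISO-8859 parts.
as_latin1 = raw.decode("iso8859_1")
as_hebrew = raw.decode("iso8859_8")

print(as_latin1, hex(ord(as_latin1)))  # à 0xe0
print(as_hebrew, hex(ord(as_hebrew)))  # א 0x5d0
```

Nothing in the byte itself says which reading is intended, which is the problem the next slide takes up.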

The problem with multiple encoding systems

Problem: Multiple coding systems map different characters to the same character code

Solution 1: Provide meta-information on coding system

Ex. MIME (Multipurpose Internet Mail Extensions)

Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

But what if your message contains characters from multiple coding systems?

Solution 2: Have a single universal code system for all writing systems: Unicode
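As an illustration of Solution 1, the MIME header from this slide can be read with Python's standard email package so that the declared charset drives the decoding. The message body here is a made-up placeholder, not part of the lecture.

```python
import email

# Header block from the slide, followed by a hypothetical message body.
raw = (
    "Mime-Version: 1.0\n"
    "Content-Type: text/plain; charset=US-ASCII\n"
    "Content-Transfer-Encoding: 7bit\n"
    "\n"
    "Hello world!"
)

msg = email.message_from_string(raw)

# The charset parameter tells the receiver which code table to apply.
charset = msg.get_content_charset()     # normalized to lower case
payload = msg.get_payload(decode=True)  # message body as raw bytes
text = payload.decode(charset)

print(charset)  # us-ascii
print(text)     # Hello world!
```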

Unicode

A character encoding standard developed by the Unicode Consortium

Provides a single representation for all the world's writing systems

"Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language."

(http://www.unicode.org)



How big is Unicode?

Version 9.0 (2016) has codes for 128,237 characters

Full Unicode standard uses 32 bits (4 bytes): it can represent 2^32 = 4,294,967,296 characters!

In reality, only 21 bits are needed (code points run from U+0000 to U+10FFFF)

Unicode has three encoding forms

UTF-32 (32 bits/4 bytes): direct representation

UTF-16 (16 bits/2 bytes): 2^16 = 65,536 possibilities

UTF-8 (8 bits/1 byte): 2^8 = 256 possibilities
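The "only 21 bits" figure follows from the highest code point Unicode defines, U+10FFFF. A small sketch to check the arithmetic (the constant name MAX_CODE_POINT is chosen here for illustration):

```python
import sys

# Highest code point Unicode defines; Python exposes it as sys.maxunicode.
MAX_CODE_POINT = 0x10FFFF

print(sys.maxunicode == MAX_CODE_POINT)  # True
print(MAX_CODE_POINT.bit_length())       # 21
print(MAX_CODE_POINT + 1)                # 1114112 possible code points
print(2 ** 21, 2 ** 32)                  # 2097152 4294967296
```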

8-bit, 16-bit, 32-bit

UTF-32 (32 bits/4 bytes): direct representation

UTF-16 (16 bits/2 bytes): 2^16 = 65,536 possibilities

UTF-8 (8 bits/1 byte): 2^8 = 256 possibilities

Wait! But how do you represent all of 2^32 (= 4 billion) code points with only one byte (UTF-8: 2^8 = 256 slots)? You don't.

In reality, only 21 bits are ever utilized for 128K characters.

UTF-8 and UTF-16 use a variable-width encoding.

Why UTF-16 and UTF-8? They are more compact (more so for certain languages, e.g., English)
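To see the difference in compactness, the sketch below encodes a few sample characters in all three forms and counts the bytes. The sample characters and the helper name byte_lengths are illustrative choices; the "-le" codec variants are used so that Python does not prepend a byte-order mark to the count.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")

def byte_lengths(ch: str) -> dict:
    """Return the number of bytes one character occupies in each encoding form."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}

# H (ASCII), É (Latin-1 range), a Hangul syllable, and U+1D122 (a musical symbol).
samples = ["H", "\u00c9", "\ud55c", "\U0001D122"]

for ch in samples:
    print(f"U+{ord(ch):04X}", byte_lengths(ch))
```

UTF-32 always takes 4 bytes, while UTF-8 ranges from 1 byte for ASCII up to 4 bytes, and UTF-16 uses 2 or 4.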

Variable-width encoding

UTF-8 as a variable-width encoding

ASCII characters get encoded with just 1 byte

ASCII was originally 7-bit, so the highest bit is always 0 in an 8-bit encoding

All other characters are encoded with multiple bytes

How to tell? The highest bit is used as a flag.

Highest bit 0: single-byte character

Highest bit 1: part of a multi-byte character

Advantage for English: 8-bit ASCII is already valid UTF-8!

Example byte stream for "HÉii" in UTF-8:

01001000   11000011 10001001   01101001   01101001
H          É (2 bytes)         i          i

'H' as 1 byte (8 bits): 01001000

cf. 'H' as 2 bytes (16 bits): 0000000001001000
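The flag-bit idea can be sketched as a small routine that groups a UTF-8 byte stream into characters by looking only at the top bits of each byte. It relies on one detail beyond the slide: in UTF-8, continuation bytes start with 10 while the first byte of a multi-byte character starts with 11. The function name split_utf8 is hypothetical, and the sketch assumes well-formed input.

```python
def split_utf8(data: bytes) -> list:
    """Group a well-formed UTF-8 byte stream into per-character chunks."""
    chunks = []
    for b in data:
        if b & 0b1100_0000 == 0b1000_0000:
            # 10xxxxxx: continuation byte, belongs to the preceding lead byte.
            chunks[-1] += bytes([b])
        else:
            # 0xxxxxxx (single-byte character) or 11xxxxxx (lead byte).
            chunks.append(bytes([b]))
    return chunks

data = "H\u00c9ii".encode("utf-8")
for chunk in split_utf8(data):
    print(" ".join(format(b, "08b") for b in chunk), chunk.decode("utf-8"))
```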

A look at Unicode chart

How to find your Unicode character:

http://www.unicode.org/standard/where/

http://www.unicode.org/charts/

Basic Latin (ASCII)

http://www.unicode.org/charts/PDF/U0000.pdf

Code point for M.

But "004D"?

Another representation: hexadecimal

Hexadecimal (hex) = base-16

Utilizes 16 characters: 0 1 2 3 4 5 6 7 8 9 A B C D E F

Designed for human readability & easy byte conversion

2^4 = 16: 1 hexadecimal digit is equivalent to 4 bits

1 byte (= 8 bits) is encoded with just 2 hex digits!

Unicode characters are usually referenced by their hexadecimal code

Lower-number characters go by their 4-digit hex codes (2 bytes), e.g. U+004D ("M"; U+ designates Unicode)

Higher-number characters go by 5 or 6 hex digits, e.g. U+1D122 (http://www.unicode.org/charts/PDF/U1D100.pdf)

Letter   Base-10 (decimal)   Base-2 (binary)        Base-16 (hex)
M        77                  0000 0000 0100 1101    004D
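The three representations of "M" can be produced with Python's format() function, and the U+ notation follows from the hex form. The helper name u_plus is hypothetical and used only in this sketch.

```python
def u_plus(ch: str) -> str:
    """Format a character's code point in U+XXXX notation (at least 4 hex digits)."""
    return f"U+{ord(ch):04X}"

code = ord("M")

print(code)                  # 77
print(format(code, "016b"))  # 0000000001001101
print(format(code, "04X"))   # 004D
print(int("004D", 16))       # 77

print(u_plus("M"))           # U+004D
print(u_plus("\U0001D122"))  # U+1D122
```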
