What You’ll Learn Today - BU Computer Science


Aaron Stevens, 16 February 2011

CS101 Lecture 12: Text Representation


What You’ll Learn Today

– How do computers store text information?
– Why do some characters show up as �s on my browser?


Binary Representations

Recall: a single bit can be either a 0 or a 1

What if you need to represent more than 2 choices?

n bits can represent 2^n possible combinations
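The rule that n bits give 2 to the nth power combinations can be checked by simply listing the patterns. The following is a minimal Python sketch; the helper names `count_patterns` and `bits_needed` are illustrative and not part of the lecture.

```python
from itertools import product

def count_patterns(n):
    """Count the distinct n-bit patterns by enumerating every one of them."""
    return sum(1 for _ in product("01", repeat=n))

def bits_needed(choices):
    """Smallest number of bits that can give each of `choices` items its own pattern."""
    return max(1, (choices - 1).bit_length())

# Compare the enumerated count with 2 ** n for a few small values of n.
for n in range(1, 5):
    print(n, count_patterns(n), 2 ** n)

# How many bits are needed for 5 choices, and for 128 choices?
print(bits_needed(5), bits_needed(128))
```

Enumerating patterns is only practical for small n; for larger n the formula alone is enough.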


Representing Text

There is a finite number of characters to represent, so list them all and assign each a binary pattern.

Character set: A list of characters and the binary codes used to represent each one.

Computer manufacturers agreed to standardize in the early 1960s.
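To make the idea of a character set concrete, here is a small Python sketch of a made-up one. The four-character table `TOY_CHARSET` is purely hypothetical and is not a real standard; it only shows that a character set is a lookup table from characters to binary codes.

```python
# A hypothetical 2-bit character set: four characters, four binary codes.
TOY_CHARSET = {"A": "00", "B": "01", "C": "10", "D": "11"}

# The reverse table maps each binary code back to its character.
TOY_DECODE = {code: char for char, code in TOY_CHARSET.items()}

def toy_encode(text):
    """Replace each character with its code from the toy table."""
    return " ".join(TOY_CHARSET[ch] for ch in text)

print(toy_encode("CAB"))
print(TOY_DECODE["11"])
```

Two bits allow only four characters, which is why real character sets use more bits per character.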


The ASCII Character Set

ASCII stands for American Standard Code for Information Interchange

ASCII originally used seven bits to represent each character, allowing for 128 unique characters

Later, extended ASCII evolved so that all eight bits were used.
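The seven-bit and eight-bit limits can be checked with a short Python sketch. It relies on Python's built-in `ord`, which returns a Unicode code point; for the codes 0 to 127 this is the same number as the ASCII code. The helper name `fits_in_7_bits` is an illustrative assumption.

```python
print(2 ** 7)  # number of codes available with seven bits
print(2 ** 8)  # number of codes available with eight bits

def fits_in_7_bits(ch):
    """True when the character's code lies in the original ASCII range 0-127."""
    return ord(ch) < 2 ** 7

# Every character of the lecture's example fits in seven bits.
print(all(fits_in_7_bits(ch) for ch in "Hello, world!"))

# An accented letter such as e-acute (code 233) needs the eighth bit.
print(fits_in_7_bits("\u00e9"))
```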


The ASCII Character Set (7 bits)


ASCII Encoding

Example: Hello, world!

H -> 72 -> 01001000
e -> 101 -> 01100101
l -> 108 -> 01101100
l -> 108 -> 01101100
o -> 111 -> 01101111
, -> 44 -> 00101100
(space) -> 32 -> 00100000
w -> 119 -> 01110111
o -> 111 -> 01101111
r -> 114 -> 01110010
l -> 108 -> 01101100
d -> 100 -> 01100100
! -> 33 -> 00100001

Encoding Algorithm:

For each character:
  Find its ASCII code.
  Convert to binary.
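The encoding algorithm can be sketched in a few lines of Python. This is one possible rendering, not the lecture's own code: the function name `ascii_encode` is illustrative, and padding each code to 8 bits follows the example on this slide.

```python
def ascii_encode(text):
    """Return a list of 8-bit binary strings, one per character of `text`."""
    codes = []
    for ch in text:
        code = ord(ch)  # find the character's ASCII code
        if code > 127:
            raise ValueError(f"{ch!r} is not a 7-bit ASCII character")
        codes.append(format(code, "08b"))  # convert to binary, padded to 8 bits
    return codes

# Print each character of the slide's example next to its code and bit pattern.
for ch, bits in zip("Hello, world!", ascii_encode("Hello, world!")):
    print(repr(ch), ord(ch), bits)
```

Raising an error for codes above 127 is a design choice made here to keep the sketch strictly within 7-bit ASCII.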


ASCII Decoding

01000010 01100101 00100000 01110100 01110010 01110101
01100101 00100000 01110100 01101111 00100000 01111001
01101111 01110101 01110010 00100000 01110011 01100011
01101000 01101111 01101111 01101100

01000010 -> 0x42 -> B
01100101 -> 0x65 -> e

Decoding Algorithm:

For each 8 bits:
  Convert to a hex/decimal value.
  Look up the ASCII symbol.
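The decoding algorithm can be sketched the same way in Python. The function name `ascii_decode` is illustrative, and ignoring whitespace between groups of bits is an assumption made so the slide's bit string can be pasted in as is.

```python
def ascii_decode(bits):
    """Decode a string of 0s and 1s (whitespace ignored) into text."""
    bits = "".join(bits.split())  # drop spaces and line breaks
    if len(bits) % 8 != 0:
        raise ValueError("expected a multiple of 8 bits")
    chars = []
    for i in range(0, len(bits), 8):
        value = int(bits[i:i + 8], 2)  # convert 8 bits to a number
        chars.append(chr(value))       # look up the matching symbol
    return "".join(chars)

# The bit string from this slide.
message = """
01000010 01100101 00100000 01110100 01110010 01110101
01100101 00100000 01110100 01101111 00100000 01111001
01101111 01110101 01110010 00100000 01110011 01100011
01101000 01101111 01101111 01101100
"""
print(ascii_decode(message))
```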


ASCII Decoding

01000010 -> 0x42 -> B
01100101 -> 0x65 -> e
00100000 -> 0x20 -> (space)
01110100 -> 0x74 -> t
01110010 -> 0x72 -> r
01110101 -> 0x75 -> u
01100101 -> 0x65 -> e
00100000 -> 0x20 -> (space)
01110100 -> 0x74 -> t
01101111 -> 0x6f -> o
00100000 -> 0x20 -> (space)
01111001 -> 0x79 -> y
01101111 -> 0x6f -> o
01110101 -> 0x75 -> u
01110010 -> 0x72 -> r
00100000 -> 0x20 -> (space)
01110011 -> 0x73 -> s
01100011 -> 0x63 -> c
01101000 -> 0x68 -> h
01101111 -> 0x6f -> o
01101111 -> 0x6f -> o
01101100 -> 0x6c -> l


The Extended ASCII Character Set


ASCII Art


Can't You Take a Joke? :-)

Carnegie Mellon professor Scott E. Fahlman proposed ASCII emoticons, Sept. 19, 1982.
Source: http://www.wired.com/science/discoveries/news/2008/09/dayintech_0919


The Unicode Character Set

Extended ASCII is not enough for international use.

Unicode originally used 16 bits per character (the current standard defines code points up to 0x10FFFF).
How many characters can 16 bits represent?

Unicode is a superset of ASCII.
The first 256 characters correspond exactly to the Latin-1 (ISO 8859-1) extended ASCII character set.
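Both claims on this slide can be checked with a short Python sketch. It assumes that "extended ASCII" here means Latin-1 (ISO 8859-1), which is what Python's built-in `latin-1` codec decodes; other 8-bit extensions of ASCII do not line up with Unicode in the same way.

```python
# Number of distinct patterns available with 16 bits.
print(2 ** 16)

# ASCII characters keep the same code in Unicode; only the width changes.
for ch in "Hi!":
    print(ch, ord(ch), format(ord(ch), "016b"))

# Decode every possible byte value as Latin-1 and compare the result with
# the first 256 Unicode code points.
latin1_text = bytes(range(256)).decode("latin-1")
unicode_text = "".join(chr(i) for i in range(256))
print(latin1_text == unicode_text)
```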


The Unicode Character Set


Unicode Character Distribution


What You Learned Today

More encoding!
– Character Sets
– ASCII
– Unicode


Announcements and To Do List

– HW04 due Wednesday 2/16
– Readings:
  • Reed ch 5, pp 83-87, 89-90 (today)
– Quiz 2 on Friday 2/18
  • Covers lectures 6, 7, 9, 10, 11