Bits and Huffman Encoding


Page 1: Bits and Huffman Encoding

Bits and Huffman Encoding

Please get a piece of paper and a pen and put your name and netid on it. Make sure you can turn it in after class without losing your notes.

Page 2: Bits and Huffman Encoding

How is data stored in a computer?

• Several different ways are possible

• In an ASCII-encoded text file (henceforth known as a “text file”), each character is stored as an 8-bit chunk

• What’s a bit?

Page 3: Bits and Huffman Encoding

Bits Mean What You Want Them To Mean

01001101 01001001 01001011 01000101

M I K E

But why 8-bit chunks? Couldn’t we interpret the same string as 4-bit chunks?

0100 1101 0100 1001 0100 1011 0100 0101

4 13 4 9 4 11 4 5

With 8-bit chunks there are 2^8 (256) possible characters. With 4-bit chunks there are 2^4 (16) possible characters.
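The two readings of the same bit string can be checked with a short Python sketch. The helper name `chunks` is made up for this illustration; it just splits the bits into fixed-size groups and reads each group as an unsigned integer.

```python
def chunks(bits: str, size: int) -> list[int]:
    """Split a bit string into fixed-size chunks and read each as an unsigned integer."""
    bits = bits.replace(" ", "")
    return [int(bits[i:i + size], 2) for i in range(0, len(bits), size)]


bits = "01001101 01001001 01001011 01000101"

as_bytes = chunks(bits, 8)      # 8-bit reading: ASCII codes
as_nibbles = chunks(bits, 4)    # 4-bit reading: values 0..15
text = "".join(chr(b) for b in as_bytes)

print(as_bytes)     # [77, 73, 75, 69]
print(text)         # MIKE
print(as_nibbles)   # [4, 13, 4, 9, 4, 11, 4, 5]
print(2 ** 8, 2 ** 4)  # 256 16
```

The bits never change; only the chunk size and the table used to interpret each chunk do.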

Page 4: Bits and Huffman Encoding

What if we want to save memory?

• Could just use fewer bits per chunk, but that limits the number of characters we can represent

• But my document contains a lot more ‘a’s than ‘%’s. What if the code for ‘a’ was “01” and the code for ‘%’ was “10000101”?

Page 5: Bits and Huffman Encoding

A Problem

Mike decides he wants to compress his vast collection of Harry Potter fanfic (all stored as text files). He notices that the character ‘H’ occurs much more frequently in the text files than does the character ‘%’, but they both use 8 bits. So he changes the encoding so that ‘H’ is represented by ‘01’ (2 bits) and ‘%’ is represented by ‘1010101010101010’ (16 bits). Assuming all other characters stay at 8 bits (a bit unlikely…but just pretend), how much more frequent do ‘H’s need to be than ‘%’s for there to be a space savings?

1. There must be more than 3 Hs for every %

2. There must be more than 3 Hs for every 2 %s

3. There must be more than 4 Hs for every 3 %s

4. As long as there are more Hs than %s you get a savings
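One way to check your answer is to do the arithmetic directly: each ‘H’ shrinks from 8 bits to 2 (saving 6), and each ‘%’ grows from 8 bits to 16 (costing 8). The sketch below assumes exactly the bit widths given in the problem; the function name `size_change_bits` is made up for illustration.

```python
OLD_BITS = 8    # every character in the original text file
NEW_H = 2       # new code length for 'H'
NEW_PCT = 16    # new code length for '%'


def size_change_bits(num_h: int, num_pct: int) -> int:
    """Change in file size in bits after re-encoding (negative means savings)."""
    return num_h * (NEW_H - OLD_BITS) + num_pct * (NEW_PCT - OLD_BITS)


print(size_change_bits(1, 1))   # 2   (file grows)
print(size_change_bits(4, 3))   # 0   (break-even)
print(size_change_bits(5, 3))   # -6  (file shrinks)
```

The break-even point is where 6 × (number of Hs) equals 8 × (number of %s), so a savings needs more than 4 Hs for every 3 %s.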

Page 6: Bits and Huffman Encoding

• We can get savings if we know what characters occur frequently

• So imagine you look at some set of data with character frequencies. How much can you save?

Page 7: Bits and Huffman Encoding

Consider the string:

• AACCCAAABAADAE

Letter Frequency

A 8

C 3

B 1

D 1

E 1

What’s the best encoding?

Page 8: Bits and Huffman Encoding

Why will this encoding not work?

Letter Frequency Encoding

A 8 0

C 3 1

B 1 10

D 1 01

E 1 11

What is the encoding for AAE?

What is the encoding for ADC?

Page 9: Bits and Huffman Encoding

Consider the string:

• AACCCAAABAADAE

The rule: one character’s encoding cannot be the prefix of another. So if A=01, no other encoding can begin with 01

Letter Frequency Encoding

A 8 ?

C 3 ?

B 1 ?

D 1 ?

E 1 ?

Work with your row to find the most efficient possible encoding for each character. I was able to encode the entire string in 25 bits. See if you can do it that efficiently – but be careful with prefixes. Write your encoding, plus your computation of its total length, on your handin sheet.

Page 10: Bits and Huffman Encoding

My Encoding

• AACCCAAABAADAE

The rule: one character’s encoding cannot be the prefix of another. So if A=01, no other encoding can begin with 01

Letter Frequency Encoding

A 8 0

C 3 11

B 1 100

D 1 1010

E 1 1011

There are several possible encodings that can give you 25 bits. I’m pretty sure there’s no more efficient encoding, at least as far as compressing individual characters.
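The 25-bit total and the prefix rule can both be checked mechanically. This is a minimal sketch using the encoding from this slide; the helper names `is_prefix_free` and `encoded_length` are made up for illustration.

```python
from collections import Counter


def is_prefix_free(code: dict[str, str]) -> bool:
    """True if no codeword is a prefix of (or equal to) a different codeword."""
    words = list(code.values())
    return not any(
        i != j and words[j].startswith(words[i])
        for i in range(len(words))
        for j in range(len(words))
    )


def encoded_length(text: str, code: dict[str, str]) -> int:
    """Total bits needed: frequency of each letter times its codeword length."""
    counts = Counter(text)
    return sum(counts[ch] * len(code[ch]) for ch in counts)


text = "AACCCAAABAADAE"
my_code = {"A": "0", "C": "11", "B": "100", "D": "1010", "E": "1011"}

print(is_prefix_free(my_code))          # True
print(encoded_length(text, my_code))    # 25  (8*1 + 3*2 + 1*3 + 1*4 + 1*4)
print(len(text) * 8)                    # 112 bits as plain 8-bit characters
```

You can swap in your own row's encoding for `my_code` to check it the same way.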

Page 11: Bits and Huffman Encoding

Given a set of frequencies, how should you select the character encoding for maximum efficiency?

Huffman Encoding

Page 12: Bits and Huffman Encoding

The basic idea

• We will make a “Huffman tree” by repeatedly combining the nodes with the smallest frequencies into “meta nodes” with larger frequencies

• A letter’s position in the tree will tell us what its encoded form is
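The idea above can be sketched in Python with a priority queue. This is one possible implementation, not necessarily the one the course tutorial uses: the function name `huffman_code`, the use of tuples as meta nodes, and the choice of 0 for left and 1 for right are all assumptions made for this illustration.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code(freqs: dict[str, int]) -> dict[str, str]:
    """Build a Huffman tree from letter frequencies and return letter -> bit string."""
    tiebreak = count()  # keeps heap entries comparable when frequencies are equal
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:
        return {heap[0][2]: "0"}  # a lone letter still needs one bit

    # Repeatedly merge the two lowest-frequency nodes into a meta node.
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))

    # A letter's position in the tree gives its code: left = 0, right = 1.
    code: dict[str, str] = {}

    def walk(node, prefix: str) -> None:
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            code[node] = prefix

    walk(heap[0][2], "")
    return code


freqs = Counter("AACCCAAABAADAE")
code = huffman_code(freqs)
total_bits = sum(freqs[s] * len(code[s]) for s in freqs)
print(code)
print(total_bits)   # 25
```

The exact bit patterns depend on how ties are broken and which branch is labeled 0, so they may differ from a hand-built tree, but the total encoded length comes out the same.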

Page 13: Bits and Huffman Encoding

Generate a Huffman tree and encodings for these frequencies

Follow the tutorial linked off the resources section in Sakai

Letter Frequency Encoding

A 12 ?

C 9 ?

B 8 ?

D 8 ?

E 5 ?

F 5 ?

Write the Huffman tree and the resulting encoding on your handin sheet.

Page 14: Bits and Huffman Encoding

The Header

• The Magic Number

• Counts

• Or a tree for extra credit

Page 15: Bits and Huffman Encoding

The Pseudo-EOF character

• A made-up character to write to your file, marking where the encoded data ends
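Here is a small sketch of why such a character is useful. Files are written in whole bytes, so the encoded bits usually need padding at the end, and the decoder needs a way to know where the real data stops. The code table below is hypothetical (it is not from the assignment), as are the names `PSEUDO_EOF`, `encode`, and `decode`.

```python
PSEUDO_EOF = "<EOF>"  # hypothetical marker symbol, never appears in real text

# Hypothetical prefix-free table that includes a codeword for the marker.
code = {"A": "0", "C": "10", "B": "110", PSEUDO_EOF: "111"}


def encode(text: str) -> str:
    """Encode text, append the pseudo-EOF codeword, then pad with 0s to a whole byte."""
    bits = "".join(code[ch] for ch in text) + code[PSEUDO_EOF]
    padding = (-len(bits)) % 8
    return bits + "0" * padding


def decode(bits: str) -> str:
    """Decode bit by bit and stop as soon as the pseudo-EOF codeword is read."""
    inverse = {word: sym for sym, word in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            if inverse[current] == PSEUDO_EOF:
                return "".join(out)  # ignore any padding after the marker
            out.append(inverse[current])
            current = ""
    raise ValueError("pseudo-EOF not found")


encoded = encode("ACAB")
print(encoded, len(encoded))   # 0100110111000000 16
print(decode(encoded))         # ACAB
```

Without the marker, the six padding zeros at the end would be read as six extra ‘A’s.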

Page 16: Bits and Huffman Encoding

Please turn your sheet in at the back of the room (try for a neat pile)