Organizing Files for Performance


description

6. Organizing Files for Performance. 6.1 Data Compression. Content: Introduction to Compression; Methods in Data Compression (Run-Length Coding, Huffman Coding).

Transcript of Organizing Files for Performance

Page 1: Organizing Files for Performance

6 Organizing Files for Performance

• Data Compression

Page 2: Organizing Files for Performance

6.1 Data Compression

Page 3: Organizing Files for Performance

CIS 256 (File Structures)

Content

► Introduction to Compression
► Methods in Data Compression
  – Run-Length Coding
  – Huffman Coding

Page 4: Organizing Files for Performance

Data compression

► Data compression methods are used to make files smaller by re-encoding the data that goes into a file.
► There are many reasons for making files smaller:
  – They use less storage, resulting in cost savings.
  – They can be transmitted faster, decreasing access time or, alternatively, allowing the same access time with a lower and cheaper bandwidth.
  – They can be processed faster sequentially.

Page 5: Organizing Files for Performance

Page 6: Organizing Files for Performance

Page 7: Organizing Files for Performance

Techniques of compression

► Using a different notation
► Suppressing repeating sequences
► Assigning variable-length codes
► Irreversible compression techniques (lossy)

Page 8: Organizing Files for Performance

Using a different notation

► Fixed-length fields are good candidates.
► Decrease the number of bits by finding a more compact notation.
► Cons:
  – Unreadable by humans.
  – Cost in encoding time.
  – Decoding modules increase the complexity of the software used for a particular application.
► Example: the state field in the person records can use 6 bits (enough for 50 states) instead of 16.
► It is classified as a redundancy reduction technique.
► With so many costs, is this kind of compression worth it?
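The 6-bit state field can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `STATES`, `pack_states` and `unpack_states` are hypothetical, and the codes are simply positions in a fixed list of the 50 two-letter abbreviations.

```python
# Hypothetical lookup table: a state's code is its position in this list.
STATES = (
    "AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD "
    "MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC "
    "SD TN TX UT VT VA WA WV WI WY"
).split()
CODE = {abbr: i for i, abbr in enumerate(STATES)}
BITS = 6  # 2**6 = 64 codes, enough for 50 states


def pack_states(abbrs):
    """Pack state abbreviations into 6 bits each, padded to whole bytes."""
    acc = 0
    for abbr in abbrs:
        acc = (acc << BITS) | CODE[abbr]
    nbits = BITS * len(abbrs)
    return acc.to_bytes((nbits + 7) // 8, "big")


def unpack_states(blob, n):
    """Recover n abbreviations from a blob produced by pack_states."""
    acc = int.from_bytes(blob, "big")
    return [STATES[(acc >> (BITS * (n - 1 - k))) & 0x3F] for k in range(n)]


fields = ["CA", "NY", "TX", "WY"]
packed = pack_states(fields)
print(len("".join(fields)), "bytes as text,", len(packed), "bytes packed")
print(unpack_states(packed, len(fields)))
```

The packed form is no longer readable by a human, and every program that touches the file needs the same lookup table, which is the cost listed under the cons.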

Page 9: Organizing Files for Performance

Using a different notation

► The notation used for representing information can often be made more compact.
► Example: if we are going to write a file that contains information about students, such as name, mark, and major, we can declare the mark as a byte instead of an integer and so save space.

ST_REC =
  Name_Stu  : string[50];
  Mark_Stu  : int;          → byte
  Major_Stu : string[30];   → string[3]   // using a lookup table
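A rough way to see the saving is to lay out both versions of the record with Python's `struct` module. This is a sketch, not the slide's own code: it assumes a 4-byte integer, no padding, and a hypothetical `MAJORS` lookup table with made-up three-letter codes.

```python
import struct

# Hypothetical lookup table from a 3-character code to the full major name.
MAJORS = {b"CIS": "Computer Information Systems", b"MTH": "Mathematics"}

# "<" means little-endian with no padding between fields.
WIDE = struct.Struct("<50s i 30s")     # name, 4-byte int mark, 30-char major
COMPACT = struct.Struct("<50s B 3s")   # name, 1-byte mark, 3-char major code

record = COMPACT.pack(b"Student Name", 95, b"CIS")
name, mark, major_code = COMPACT.unpack(record)

print(WIDE.size, "bytes per record before,", COMPACT.size, "bytes after")
print(name.rstrip(b"\x00").decode(), mark, MAJORS[major_code])
```

A one-byte mark only works if marks fit in the range 0 to 255, and the major code is meaningful only together with the lookup table.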

Page 10: Organizing Files for Performance

Suppressing Repeating Sequences

► Run-length encoding (RLE): encode sequences of repeating values rather than writing all the values in the file.
► Example: suppose we wish to compress an image using run-length encoding, and we find that we can omit the byte 0xff from the representation of the image, so 0xff can serve as the run-length indicator.
  – How would we encode the following sequence of hexadecimal byte values?
    22 23 24 24 24 24 24 24 24 25 26 26 26 26 26 26 25 24
  – The way: the first two pixels are copied in sequence. The runs of 24 and 26 are both run-length encoded as the indicator ff, the repeated value, and the count. The remaining pixels are copied in sequence. The resulting sequence is:
    22 23 ff 24 07 25 ff 26 06 25 24
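The scheme in this example can be sketched in Python as below. The function names and the `min_run` threshold are assumptions made for illustration; the slide only fixes the layout of an encoded run (the indicator 0xff, then the value, then the count).

```python
RUN_MARKER = 0xFF  # assumed never to occur in the pixel data


def rle_encode(data, min_run=4):
    """Replace runs of at least min_run equal bytes by marker, value, count."""
    out = bytearray()
    i = 0
    while i < len(data):
        value = data[i]
        run = 1
        # Measure the run starting at i; the count must fit in one byte.
        while i + run < len(data) and data[i + run] == value and run < 255:
            run += 1
        if run >= min_run:
            out += bytes([RUN_MARKER, value, run])
        else:
            out += bytes([value]) * run  # short runs are copied as they are
        i += run
    return bytes(out)


def rle_decode(encoded):
    """Invert rle_encode."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        if encoded[i] == RUN_MARKER:
            value, run = encoded[i + 1], encoded[i + 2]
            out += bytes([value]) * run
            i += 3
        else:
            out.append(encoded[i])
            i += 1
    return bytes(out)


pixels = bytes.fromhex("22 23 24 24 24 24 24 24 24 25 26 26 26 26 26 26 25 24")
encoded = rle_encode(pixels)
print(encoded.hex(" "))  # 22 23 ff 24 07 25 ff 26 06 25 24
```

Here 18 bytes shrink to 11, and `rle_decode(encoded)` returns the original pixels.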

Page 11: Organizing Files for Performance

Suppressing Repeating Sequences

► Run-length encoding (cont'd)
  – It is an example of redundancy reduction.
  – Cons:
    • It does not guarantee any particular amount of space savings.
    • Under some circumstances, the compressed image is larger than the original image.
      – Why? Can you prevent this?
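One possible illustration of the expansion problem, as a sketch only: the hypothetical `naive_rle` below writes every run, even a run of length 1, as a (count, value) pair. Data with no repeats then doubles in size.

```python
def naive_rle(data):
    """Encode every run, even of length 1, as a (count, value) pair."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes([run, data[i]])
        i += run
    return bytes(out)


no_repeats = bytes(range(16))   # 16 different values, no runs at all
long_run = bytes([7]) * 16      # a single run of 16 equal values

print(len(no_repeats), "->", len(naive_rle(no_repeats)))  # 16 -> 32
print(len(long_run), "->", len(naive_rle(long_run)))      # 16 -> 2
```

A common remedy is to copy short runs literally and reserve a marker byte for the runs that are long enough to pay for their own encoding.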

Page 12: Organizing Files for Performance

RLE

Here we have a series of blue x 6, magenta x 7, red x 3, yellow x 3 and green x 4, that is:

Ex 1:

Ex 2:

Page 13: Organizing Files for Performance

Assigning Variable-Length Codes

► Morse code: the oldest and most common scheme of variable-length codes.
► Some values occur more frequently than others:
  – those values should take the least amount of space.
► Huffman coding:
  – based on the probability of occurrence
    • determine the probability of each value occurring
    • build a binary tree with a search path for each value
    • more frequently occurring values are given shorter search paths in the tree

Page 14: Organizing Files for Performance

Variable-length encoding

► Any encoding scheme in which the codes are of different lengths. More frequently occurring values are given shorter codes than less frequently occurring values. Huffman encoding is an example of variable-length encoding.
► Huffman code: a code that determines the probability of each value occurring in the data set and then builds a binary tree in which the search path for each value represents the code for that value.

Page 15: Organizing Files for Performance

Page 16: Organizing Files for Performance

Huffman Encoding

► Compression
  – Typically, in files and messages,
    • each character requires 1 byte, or 8 bits
    • this already wastes 1 bit for most purposes!
► Question
  – What is the smallest number of bits that can be used to store an arbitrary piece of text?
► Idea
  – Find the frequency of occurrence of each character.
  – Encode frequent characters as short bit strings.
  – Encode rarer characters as longer bit strings.

Page 17: Organizing Files for Performance

Huffman Encoding

► Encoding
  – Use a tree
  – Encode by following the tree to a leaf
  – e.g.
    • E is 00
    • S is 011
  – Frequent characters (E, T): 2-bit encodings
  – Others (A, S, N, O): 3-bit encodings

Page 18: Organizing Files for Performance

Huffman Encoding

► Encoding
  – Use a tree
    • Inefficient in practice
  – Use a direct-addressed lookup table
► Finding the optimal encoding
  – Smallest number of bits to represent arbitrary text

[Figure: lookup table mapping each character to its code, with A → 010 and E → 00; the code column also lists 110, 001 and 10 for later rows (B … N, S, T).]
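Encoding through a direct-addressed table, and decoding the resulting bit string, might look like the sketch below. The codes for E (00), S (011) and A (010) come from the slides; the codes for T, N and O are assumptions, chosen only so that T has 2 bits, N and O have 3 bits, and no code is a prefix of another.

```python
# E, S and A are taken from the slides; T, N and O are assumed values.
CODES = {"E": "00", "T": "10", "A": "010", "S": "011", "N": "110", "O": "111"}

# Direct-addressed table: the character's ASCII code is the index.
TABLE = [None] * 128
for ch, code in CODES.items():
    TABLE[ord(ch)] = code

DECODE = {code: ch for ch, code in CODES.items()}


def encode(text):
    """Look each character up by its ASCII code and concatenate the codes."""
    return "".join(TABLE[ord(ch)] for ch in text)


def decode(bits):
    """Read bits until they match a code; works because no code is a prefix of another."""
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in DECODE:
            out.append(DECODE[current])
            current = ""
    return "".join(out)


bits = encode("NOTES")
print(bits, len(bits), "bits instead of", 8 * len("NOTES"))
print(decode(bits))
```

The table lookup costs one array access per character, whereas finding a character's code in the tree would mean searching for its leaf each time.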

Page 19: Organizing Files for Performance

Huffman Encoding

► Divide and conquer
  – Decide on a root: n choices
  – Decide on roots for sub-trees: n choices
  – Repeat n times
  – O(n!)
► Greedy approach
  – Sort characters by frequency
  – Form the two lowest-weight nodes into a sub-tree
    • Sub-tree weight = sum of weights of nodes
  – Move the new tree to its correct place
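The greedy steps can be sketched in Python with a sorted list, using `bisect.insort` to move each new sub-tree to its correct place. The frequencies are invented for illustration, `huffman_codes` is a hypothetical name, and symbols are assumed to be single-character strings from a non-empty table.

```python
import bisect
from itertools import count


def huffman_codes(freq):
    """Return {symbol: bit string} built greedily from {symbol: weight}."""
    tiebreak = count()  # unique number so equal weights never compare trees
    # Each entry is (weight, tiebreak, tree); a tree is a symbol or a (left, right) pair.
    forest = sorted((w, next(tiebreak), sym) for sym, w in freq.items())
    if len(forest) == 1:
        return {forest[0][2]: "0"}
    while len(forest) > 1:
        w1, _, t1 = forest.pop(0)  # lowest weight
        w2, _, t2 = forest.pop(0)  # second lowest
        # The new sub-tree's weight is the sum; insort keeps the list sorted.
        bisect.insort(forest, (w1 + w2, next(tiebreak), (t1, t2)))

    codes = {}

    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix

    walk(forest[0][2], "")
    return codes


# Invented frequencies, not taken from the slides.
freq = {"E": 29, "T": 25, "A": 20, "D": 12, "M": 8, "L": 6}
codes = huffman_codes(freq)
total_bits = sum(freq[s] * len(codes[s]) for s in freq)
print(codes)
print(total_bits, "bits in total")
```

`list.pop(0)` is linear in the list length, which is acceptable for a sketch; a priority queue would be the usual choice for large alphabets.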

Page 20: Organizing Files for Performance

Huffman Encoding - Operation

Initial sequence
Sorted by frequency
Combine lowest two into sub-tree
Move it to correct place

Page 21: Organizing Files for Performance

Huffman Encoding - Operation

After shifting sub-tree to its correct place ...
Combine next lowest pair
Move sub-tree to correct place

Page 22: Organizing Files for Performance

Huffman Encoding - Operation

Move the new tree to the correct place ...
Now the lowest two are the "14" sub-tree and D
Combine and move to correct place

Page 23: Organizing Files for Performance

Huffman Encoding - Operation

Move the new tree to the correct place ...
Now the lowest two are the "25" and "30" trees
Combine and move to correct place

Page 24: Organizing Files for Performance

Huffman Encoding - Operation

Combine last two trees

Page 25: Organizing Files for Performance

Huffman Encoding - Decoding

Page 26: Organizing Files for Performance

Huffman Encoding - Time Complexity

► Sort keys: O(n log n)
► Repeat n times
  – Form new sub-tree: O(1)
  – Move sub-tree: O(log n) (binary search)
  – Total: O(n log n)
► Overall: O(n log n)

Page 27: Organizing Files for Performance

Irreversible Compression Techniques

► Compression in which information is lost.
  – Example: shrinking a raster image from 400 by 400 pixels to 100 by 100 pixels.
► There is no way to determine what the original pixels were from the one new pixel.
► Irreversible compression is less common in data files than reversible compression, but there are times when the information that is lost is of little or no value.
  – Example: speech compression.

Page 28: Organizing Files for Performance

Lossy Compression Techniques

► Some information can be sacrificed
► Less common in data files
► Shrinking raster image
  – 400-by-400 pixels to 100-by-100 pixels
  – 1 pixel for every 16 pixels
► Speech compression
  – voice coding (the lost information is of little or no value)
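The raster example can be sketched as follows. The slides do not say how the one new pixel is computed from the 16 old ones; averaging each 4-by-4 block is an assumption made here, and `shrink` is a hypothetical helper that expects image dimensions divisible by the factor.

```python
def shrink(image, factor=4):
    """Replace each factor-by-factor block of pixels by its average value."""
    height, width = len(image), len(image[0])
    small = []
    for r in range(0, height, factor):
        row = []
        for c in range(0, width, factor):
            block = [image[r + dr][c + dc]
                     for dr in range(factor) for dc in range(factor)]
            row.append(sum(block) // len(block))
        small.append(row)
    return small


# Synthetic 400-by-400 grayscale image, values 0..255.
image = [[(r * c) % 256 for c in range(400)] for r in range(400)]
small = shrink(image)

old_pixels = len(image) * len(image[0])
new_pixels = len(small) * len(small[0])
print(len(small), "by", len(small[0]), "pixels,",
      old_pixels // new_pixels, "old pixels per new pixel")
```

Many different 4-by-4 blocks have the same average, so the 16 original values cannot be recovered from the single stored value.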