The impossible patent: an introduction to lossless data compression
![Page 1: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/1.jpg)
The impossible patent: an introduction to
lossless data compression
Carlo Mazza
![Page 2: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/2.jpg)
Plan
● Introduction
● Formalization
● Theorem
● A couple of good ideas
![Page 3: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/3.jpg)
Introduction
![Page 4: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/4.jpg)
What is data compression?
Data compression is the procedure that reduces the size of information. It is used today in many applications, especially for digital data:
● generic file compression (ZIP, RAR, etc.)
● audio compression (MP3, AAC, FLAC, etc.)
● image compression (JPG, GIF, PNG, etc.)
● video compression (AVI, MP4, WMV, etc.)
![Page 5: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/5.jpg)
(Very) Brief historical overview
● 1838: Morse code
● 1940s: Information theory (Shannon, Fano, Huffman)
● 1970s: LZ77 and LZ78 (Lempel and Ziv), extended to LZW by Welch in 1984; Microsoft and Apple, email
● 1980s: ARJ, PKZIP, LHarc; BBS and newsgroups
● 1990s: JPG, MP3; “The web”, browsers, Yahoo and Google
● 2000s: H.264, AAC, MP4, M4V; dot-com bubble, Facebook
![Page 6: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/6.jpg)
Screenshot of PKZIP 2.04g, created on February 15, 2007 using DOSBox
![Page 7: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/7.jpg)
Different kinds of compression
● Lossless compression: ZIP, RAR, FLAC, PNG
● Lossy compression: MP3, JPG, MP4, AAC
![Page 8: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/8.jpg)
Formalization
![Page 9: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/9.jpg)
Lossless compression
Lossless compression is compression that does not lose information, i.e., there is another operation, decompression, such that compressing and then decompressing a file gives back exactly the same file.
![Page 10: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/10.jpg)
● SMS messages (one in English, one in Italian): "hi m8, r u k? sry i 4gt 2 cal u lst nite. why dnt we go c movie 2nite? c u l8r" "c 6? xke nn ho bekkato ness1 in 3no? cmq c vdm + trd nel pom"
● Aoccdrnig to a rscheearch at an Elingsh uinervtisy, it deosn’t mttaer in waht oredr the ltteers in a wrod are, the olny iprmoatnt tihng is taht frist and lsat ltteer is at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae we do not raed ervey lteter by itslef but the wrod as a wlohe.
No loss of information
![Page 11: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/11.jpg)
Jane S., a chief sub editor and editor, can always be found
hard at work in her cubicle. Jane works independently, without
wasting company time talking to colleagues. She never
thinks twice about assisting fellow employees, and she always
finishes given assignments on time. Often Jane takes extended
measures to complete her work, sometimes skipping
coffee breaks. She is a dedicated individual who has absolutely no
vanity in spite of her high accomplishments and profound
knowledge in her field. I firmly believe that Jane can be
classed as a high-caliber employee, the type which cannot be
dispensed with. Consequently, I duly recommend that Jane be
promoted to executive management, and a proposal will be
sent away as soon as possible.
Loss of information
![Page 12: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/12.jpg)
Formalization
We try to formalize the situation:
● let F be a file, a sequence of ones and zeros
● let L(F) be the length of the file F
● we want to find a procedure that from F yields another file G in such a way that L(G)≤L(F)
How many files of length N are there? And how many of length at most N?
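The two counting questions can be checked by brute force for small N. This is only a minimal sketch: the helper names `files_of_length` and `count_up_to` are made up for illustration, and the empty file (length 0) is left out of the count by assumption.

```python
from itertools import product

def files_of_length(n):
    """Return all binary strings (files) of length exactly n."""
    return ["".join(bits) for bits in product("01", repeat=n)]

def count_up_to(n):
    """Count files of length 1..n (the empty file is excluded by assumption)."""
    return sum(len(files_of_length(k)) for k in range(1, n + 1))

for n in range(1, 6):
    # There are 2**n files of length n and 2**(n+1) - 2 files of length 1..n.
    print(n, len(files_of_length(n)), count_up_to(n))
```

If the empty file is also counted, the second number becomes 2^(N+1)-1.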
![Page 13: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/13.jpg)
Compression as a function
We think of compression as a function f from the set of files into the same set of files such that L(f(F))≤L(F). What properties do we need from this function for the compression to be lossless?
● the function f(F)=0 surely compresses but loses information
● the function f(F)=F surely does not lose information but does not compress either
What is the property that distinguishes lossless from lossy compression?
![Page 14: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/14.jpg)
Compression as a function
As we said before, we say that the compression is lossless if there is another operation which recovers the original file.
The function f models lossless compression if there is another function g such that for every file F we have
g(f(F))=F, that is, (g o f)(F)=F, or (g o f)(F)=(id)(F)
We say that f has a left inverse.
![Page 15: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/15.jpg)
Left inverses and injective maps
Theorem: A function f admits a left inverse if and only if it is injective.
Proof: Say f is a map from X to Y. Suppose f is injective. Then every y in Y is the image of at most one x in X. We define the map g by sending every y which is hit by f back to the unique x with f(x)=y; every other y can go wherever it wants. It is clear that for every x in X, g(f(x))=x.
![Page 16: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/16.jpg)
Left inverses and injective maps
Proof (cont’d): Suppose now that f admits a left inverse, call it g. Suppose that f(x)=f(x’). Then g(f(x))=g(f(x’)), and since x=g(f(x)) and g(f(x’))=x’, we get x=x’, that is, f is injective.
We managed to translate an intuitive property (“losslessness”) into a precise mathematical concept (injectivity).
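On a finite set of files, the construction in the proof can be carried out literally. The sketch below is an illustration under stated assumptions: `is_injective`, `left_inverse` and the toy map `reverse` are hypothetical names, and the values not hit by f are simply left undefined rather than sent somewhere arbitrary.

```python
def is_injective(f, domain):
    """True if f sends distinct elements of the finite domain to distinct values."""
    images = [f(x) for x in domain]
    return len(set(images)) == len(images)

def left_inverse(f, domain):
    """Build g with g(f(x)) == x for every x in domain; f must be injective."""
    if not is_injective(f, domain):
        raise ValueError("f is not injective, so it has no left inverse")
    table = {f(x): x for x in domain}
    return lambda y: table[y]  # values not hit by f are left undefined here

def reverse(s):
    """Toy injective map on files: write the bits in reverse order."""
    return s[::-1]

files = ["0", "1", "00", "01", "10", "11"]
g = left_inverse(reverse, files)
print(all(g(reverse(x)) == x for x in files))
```

A map such as "keep only the first bit" is not injective on the same set, and `left_inverse` refuses to build an inverse for it.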
![Page 17: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/17.jpg)
Theorem
![Page 18: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/18.jpg)
Limits of lossless compression
• WEB Technologies
• Premier Research Corporation (MINC)
• Hyper Space method
• Matthew Burch
• Pegasus Web Services Inc. (patent 7,096,360)
Actually...
• Theorem: There is no “perfect” lossless compression.
![Page 19: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/19.jpg)
Proof by contradiction
Theorem: There is no lossless (injective) function f such that L(f(F))≤L(F) for every file F and L(f(F))<L(F) for at least one file F.
Proof: Let’s suppose such a function exists.
● Let F be a file which is actually compressed and let G=f(F), so that L(G)<L(F). Consider L(f(G)).
● If L(f(G))=L(G) then let H=f(G)=f(f(F)) and consider L(f(H)), and so on.
● Since f is injective, we cannot hit the same file twice.
![Page 20: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/20.jpg)
Proof (continued)
● There are only finitely many files of each length, so the length will have to decrease eventually.
● But then we will eventually reach files of length one, from where we cannot go any further, which leads to a contradiction.
![Page 21: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/21.jpg)
Schubfachprinzip
Dirichlet’s Principle (1834), pigeonhole principle
● Let f be a function from a finite set A to a finite set B. If the number of elements of B is strictly less than that of A, then f is not injective.
![Page 22: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/22.jpg)
Let’s count
Theorem: There is no lossless function f which never expands a file and compresses at least one (i.e., L(f(F))≤L(F) for all F, and L(f(F))<L(F) for at least one F).
Proof: Let N be the minimal length of a file which is compressed. There are 2^(N-1) files of length N-1, and so the files of length at most N-1 number 2^(N-1)+2^(N-2)+...+2^1=2^N-2. None of them is compressed, so f sends them, together with the compressed file of length N, to files of length at most N-1. Then f sends a set of size (2^N-2)+1 to a set of size 2^N-2. But because of the pigeonhole principle, it cannot be injective.
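The pigeonhole argument can be watched in action on a small scale. The sketch below searches all nonempty files up to a given length for two that a candidate "compressor" sends to the same output. The names `all_files`, `find_collision` and `drop_last_bit` are hypothetical, and `drop_last_bit` is just one deliberately naive candidate.

```python
from itertools import product

def all_files(max_len):
    """All nonempty binary strings of length at most max_len, shortest first."""
    return ["".join(b) for n in range(1, max_len + 1) for b in product("01", repeat=n)]

def find_collision(f, max_len):
    """Return two distinct files with the same image under f, or None."""
    seen = {}
    for x in all_files(max_len):
        y = f(x)
        if y in seen:
            return seen[y], x
        seen[y] = x
    return None

def drop_last_bit(x):
    """Hypothetical 'compressor': shortens every file of length at least 2."""
    return x[:-1] if len(x) > 1 else x

print(find_collision(drop_last_bit, 3))
print(find_collision(lambda x: x, 3))
```

The first call reports a pair of files that can no longer be told apart after compression; the identity map has no collision, but it compresses nothing.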
![Page 23: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/23.jpg)
Impossible compression
So there is no universal compression function. Actually, looking at the proof, it’s clear that if a lossless function makes some file shorter, it must make some other file longer. So, if we have no good ideas, we had better leave everything as is.
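A real compressor shows the same trade-off. The sketch below uses Python's standard `zlib` module (an implementation of DEFLATE, the method used by ZIP) as a stand-in; this choice and the two test inputs are assumptions made for illustration, not part of the talk.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable
noise = bytes(rng.getrandbits(8) for _ in range(1000))  # no structure to exploit
runs = b"a" * 1000                                       # one long repetition

for name, data in [("noise", noise), ("runs", runs)]:
    packed = zlib.compress(data)
    assert zlib.decompress(packed) == data  # lossless round trip
    print(name, len(data), "->", len(packed))
```

The repetitive input shrinks to a handful of bytes, while the random input comes out slightly longer than it went in.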
![Page 24: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/24.jpg)
A couple of good ideas
RLE and prefix codes
![Page 25: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/25.jpg)
Run-Length Encoding
Run-Length Encoding (RLE) is one of the oldest compression algorithms: when a symbol repeats, we replace the run with the number of repetitions followed by the symbol.
• “aaaabbbcccdd” -> “4a3b3c2d”
• “mathematics”->“1m1a1t1h1e1m1a1t1i1c1s”
It works badly for messages with few repetitions and very well for messages with a lot of repetitions (fax).
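The two examples above can be reproduced with a few lines of code. This is a minimal sketch: the function names are illustrative, and the decoder assumes that the symbols themselves are never digits, since otherwise counts and symbols could not be told apart.

```python
from itertools import groupby

def rle_encode(text):
    """Replace each run of a repeated symbol by its length followed by the symbol."""
    return "".join(f"{len(list(run))}{symbol}" for symbol, run in groupby(text))

def rle_decode(encoded):
    """Invert rle_encode, assuming the symbols themselves are never digits."""
    out, count = [], ""
    for ch in encoded:
        if ch.isdigit():
            count += ch  # a count may have several digits
        else:
            out.append(ch * int(count))
            count = ""
    return "".join(out)

for word in ["aaaabbbcccdd", "mathematics"]:
    packed = rle_encode(word)
    print(word, "->", packed, "->", rle_decode(packed))
```

For “mathematics” the encoded form is twice as long as the input, which is the bad case described above.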
![Page 26: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/26.jpg)
ASCII encoding
But we still need to encode the letters and the repetition counts in binary. In general, let’s say we have a text message that we want to compress. The output will be a binary string, so we need to convert letters to binary numbers. One of the standards is ASCII, which assigns to each letter a 7-bit number (a string of 7 ones or zeros, so it encodes 2^7=128 symbols).
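As an illustration, the conversion between text and fixed-width bit strings might look like this. The helper names `to_bits` and `from_bits` and the `width` parameter are assumptions of this sketch; `width=8` gives the byte-aligned form that files normally use.

```python
def to_bits(text, width=7):
    """Encode ASCII text as a bit string, using width bits per character."""
    return "".join(format(ord(ch), f"0{width}b") for ch in text)

def from_bits(bits, width=7):
    """Split the bit string into fixed-width chunks and map each back to a character."""
    return "".join(chr(int(bits[i:i + width], 2)) for i in range(0, len(bits), width))

bits = to_bits("math")
print(bits, len(bits))  # 4 characters at 7 bits each
print(from_bits(bits))
```

Because every character uses the same number of bits, the decoder knows where each one ends without any separator.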
![Page 27: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/27.jpg)
Dictionary Encoding
We can choose a dictionary whose entries need not be single letters, but may be longer strings. We still need some kind of fixed length, however, to be able to separate the repetition counts from the symbols.
![Page 28: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/28.jpg)
Exercise
011000010110000101100010011000110110000101100001011000100110001101100001011000010110001001100011
011000010110000101100010011000110110000101100001011000100110001101100001011000010110001001100011
01100001011000010110001001100011
01100001011000010110001001100011
01100001011000010110001001100011
![Page 29: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/29.jpg)
Reducing number of bits
• encoding “mathematics” in ASCII requires 7 bits * 11 letters = 77 bits
• “mathematics” only has 8 different letters, so only 3 bits per letter are needed, so in total 33 bits
• but we could use fewer bits for the more frequent letters, e.g., a=0 m=1 t=10 h=11 e=100 i=101 c=110 s=111, so “mathematics” becomes “1010111001010101110111” (22 bits)
• but that also encodes “iasaattihas”
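The ambiguity is easy to confirm by machine. The sketch below uses the code table from the slide; the function names `encode` and `decodings` are illustrative, and `decodings` tries every split, so it is only meant for very short inputs.

```python
CODE = {"a": "0", "m": "1", "t": "10", "h": "11",
        "e": "100", "i": "101", "c": "110", "s": "111"}

def encode(text, code=CODE):
    """Concatenate the codeword of each letter."""
    return "".join(code[ch] for ch in text)

def decodings(bits, code=CODE):
    """Return every way of splitting bits into codewords (short inputs only)."""
    if not bits:
        return [""]
    results = []
    for letter, word in code.items():
        if bits.startswith(word):
            results += [letter + rest for rest in decodings(bits[len(word):], code)]
    return results

print(encode("mathematics"))
print(encode("mathematics") == encode("iasaattihas"))
print(decodings("10"))  # already two bits can be read in more than one way
```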
![Page 30: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/30.jpg)
Prefix codes
We need to make sure that no codeword is a prefix of another codeword.
• a=0 b=1 c=10 doesn’t work
• a=0 b=10 c=11 works
Examples:
• international prefix (+1 USA, +39 Italy)
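The prefix condition can be tested directly, and when it holds a message can be decoded by reading bits from left to right. A minimal sketch, with illustrative function names:

```python
def is_prefix_free(code):
    """True if no codeword is a prefix of a different codeword."""
    words = list(code.values())
    return not any(
        i != j and w2.startswith(w1)
        for i, w1 in enumerate(words)
        for j, w2 in enumerate(words)
    )

def decode(bits, code):
    """Decode left to right; unambiguous when the code is prefix-free."""
    lookup = {word: letter for letter, word in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:  # a complete codeword has been read
            out.append(lookup[current])
            current = ""
    return "".join(out)

print(is_prefix_free({"a": "0", "b": "1", "c": "10"}))
print(is_prefix_free({"a": "0", "b": "10", "c": "11"}))
print(decode("010110", {"a": "0", "b": "10", "c": "11"}))
```

With a prefix code the decoder never has to look ahead: as soon as the bits read so far form a codeword, that codeword is the only possible reading.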
![Page 31: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/31.jpg)
Huffman coding
We start with a frequency table of the letters. We produce a tree following the rules:
• create a tree for every letter with weight equal to its frequency
• create a new tree by joining the two trees with the two smallest weights (and give it as weight the sum of the two weights)
• go on until there is only one tree
To read off the code of a letter, we follow the path from the top of the tree down to that letter, writing 0 or 1 for each branch taken.
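The rules above can be sketched with a priority queue. This is one possible implementation, not necessarily the one used in the talk: the name `huffman_code`, the tie-breaking counter and the choice of which branch gets 0 are assumptions, so the individual codewords may differ from a hand-built tree while the codeword lengths stay the same. The input is assumed to be nonempty.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_code(text):
    """Return a dict letter -> codeword, joining the two lightest trees repeatedly."""
    tiebreak = count()  # keeps heap entries comparable when weights are equal
    # A tree is either a single letter (str) or a pair (left, right).
    heap = [(w, next(tiebreak), letter) for letter, w in Counter(text).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (t1, t2)))
    code = {}

    def walk(tree, prefix):
        if isinstance(tree, str):
            code[tree] = prefix or "0"  # a one-letter message still needs one bit
        else:
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")

    walk(heap[0][2], "")
    return code

code = huffman_code("assassins")
encoded = "".join(code[ch] for ch in "assassins")
print(code, len(encoded))
```

For “assassins” the most frequent letter s gets a one-bit codeword and the rare letters i and n get three bits each.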
![Page 32: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/32.jpg)
Examples
1. “aaaabbbccdd”
a. RLE: “4a3b2c2d”, “10000111010111011” (17 bits)
b. Huffman:
2. “mathematics”
a. RLE: “1m1a1t1h1e1m1a1t1i1c1s” (3*11=33 bits)
b. Huffman:
![Page 33: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/33.jpg)
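assassins: (5,s) (2,a) (1,i) (1,n)
[Figure: building the tree. Start with four trees: s (5), a (2), i (1), n (1). Join i and n into a tree of weight 2, leaving s (5), a (2), (i n) (2). Join a and (i n) into a tree of weight 4, leaving s (5) and (a (i n)) (4).]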
![Page 34: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/34.jpg)
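[Figure: the final tree with labelled branches. From the top, s is on branch 0 and the subtree (a (i n)) on branch 1; inside that subtree, a is on branch 0 and (i n) on branch 1; finally i is on branch 0 and n on branch 1.]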
So, in the end: s=0, a=10, i=110, n=111
“assassins” = 100010001101110 (15 bits)
Try “sessions”, “sassafrasses”, “mummy”, “beekeeper”, but not “mathematics”
![Page 35: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/35.jpg)
Advantages and disadvantages
• RLE: one can start compressing at once (there is no need to read the whole message to construct a frequency table)
• RLE: works especially well when there are few symbols and lots of repetitions
• Huffman: works well when the frequencies are not close to each other (natural language)
• Huffman: works especially well when the relative frequencies are powers of two (1/2, 1/4, 1/8, ...)
![Page 36: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/36.jpg)
That’s all folks!
![Page 37: The impossible patent: an introduction to lossless data compression](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812d14550346895d91f6b6/html5/thumbnails/37.jpg)
(Very) Brief history of data compression
● 1838: Morse code
● 1940s: Information theory (Shannon, Fano, Huffman)
● 1970s: LZ77 and LZ78 (Lempel and Ziv), extended to LZW by Welch in 1984; Microsoft, Apple
● 1980s: ARJ, PKZIP, LHarc (BBS and newsgroups)
● 1990s: JPG, MP3 (“The web” and browsers), 1994: Yahoo, 1998: Google
● 2001: dot-com bubble
● 2004: Facebook