MATH 1020: Mathematics For Non-science Chapter 3.1: Information in a networked age


Instructor: Dr. Ken Tsang

Room E409-R9

Email: [email protected]


Transmitting Information

– Binary codes
– Encoding with parity-check sums
– Data compression
– Cryptography
– Model the genetic code

The Challenges

Mathematical Challenges in the Digital Revolution
– How to correct errors in data transmission
– How to electronically send and store information economically
– How to ensure security of transmitted data
– How to improve Web search efficiency

Binary Codes

A binary code is a system for encoding data made up of 0’s and 1’s

Examples
– Postnet (tall = 1, short = 0)
– UPC (universal product code, dark = 1, light = 0)
– Morse code (dash = 1, dot = 0)
– Braille (raised bump = 1, flat surface = 0)
– Yi-jing 易经 (yin = 0, yang = 1)

Binary Codes are Everywhere

CD, MP3, and DVD players, digital TV, cell phones, the Internet, GPS systems, etc. all represent data as strings of 0’s and 1’s rather than digits 0-9 and letters A-Z

Whenever information needs to be digitally transmitted from one location to another, a binary code is used

Transmission Problems

What are some problems that can occur when data is transmitted from one place to another?

The two main problems are
– transmission errors: the message sent is not the same as the message received
– security: someone other than the intended recipient receives the message

Transmission Error Example

Suppose you were looking at a newspaper ad for a job, and you see the sentence “must have bive years experience”

We detect the error since we know that “bive” is not a word

Can we correct the error?
– Why is “five” a more likely correction than “three”?
– Why is “five” a more likely correction than “nine”?

Another Example

Suppose NASA is directing one of the Mars rovers by telling it which crater to investigate

There are 16 possible signals that NASA could send, and each signal represents a different command

NASA uses a 4-digit binary code to represent this information

0000 0100 1000 1100

0001 0101 1001 1101

0010 0110 1010 1110

0011 0111 1011 1111

Lost in Transmission

The problem with this method is that if there is a single digit error, there is no way that the rover could detect or correct the error

If the message sent was “0100” but the rover receives “1100”, the rover will never know a mistake has occurred

This kind of error – called “noise” – occurs all the time

Basic Idea

The details of the techniques used in practice to protect information against noise are sometimes rather complicated, but the basic principles are easily understood.

The key idea is that in order to protect a message against noise, we should encode the message by adding some redundant information to it.

Then, even if the message is corrupted by noise, there will be enough redundancy in the encoded message to recover, or decode, the original message completely.

Adding Redundancy to our Messages

To decrease the effects of noise, we add redundancy to our messages.

First method: repeat each digit multiple times, for example five times.

The computer is then programmed to take any five-digit message received and decode it by majority rule.

Majority Rule

So, if we sent 00000, and the computer receives any of the following, it will still be decoded as 0.

00000 11000
10000 10100
01000 10010
00010 10001
00001 etc.

Notice that for the computer to decode incorrectly, at least three errors must be made.

Independent Errors

Using five-fold repetition, and assuming that errors happen independently and that each digit is more likely to arrive correctly than not, three or more errors are less likely than two or fewer.

Decoding each block as the most likely message sent is called maximum likelihood decoding; here it coincides with majority rule.
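The majority rule and the "three or more errors" argument can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the per-digit error probability of 0.1 is an assumed value, not one given in the slides.

```python
from math import comb

def encode_repetition(bit: str, n: int = 5) -> str:
    """Repeat a single message digit n times."""
    return bit * n

def decode_majority(received: str) -> str:
    """Decode a block of repeated digits by majority rule."""
    ones = received.count("1")
    return "1" if ones > len(received) / 2 else "0"

def failure_probability(p: float, n: int = 5) -> float:
    """Chance that more than half of n independent digits are flipped,
    when each digit is flipped with probability p (an assumed value)."""
    threshold = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(threshold, n + 1))

sent = encode_repetition("0")          # "00000"
print(decode_majority("10100"))        # two errors, still decoded as 0
print(failure_probability(0.1))        # a bit under 1%
```

With the assumed error rate of 0.1 per digit, a wrong decoding needs at least three flips out of five, which happens less than 1% of the time.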

Why don’t we use this?

Repetition codes have the advantage of simplicity, both for encoding and decoding

But, they are too inefficient! In a five-fold repetition code, 80% of all transmitted information is redundant.

Can we do better? Yes!

More Redundancy

Another way to try to avoid errors is to send the same message twice

This would allow the rover to detect the error, but not correct it (since it has no way of knowing if the error occurs in the first copy of the message or the second)

Parity-Check Sums

Sums of digits whose parities determine the check digits.
– Even parity: even integers are said to have even parity.
– Odd parity: odd integers are said to have odd parity.

Decoding

The process of translating received data into code words.

Example: Say the parity-check sums detect an error. The received message is compared to each of the possible correct messages. The comparison uses the distance between two strings of equal length, that is, the number of positions in which the strings differ.

The one that differs in the fewest positions is chosen to replace the message in error.

In other words, the computer is programmed to automatically correct the error or choose the “closest” permissible answer.

Error Correction

Over the past 40 years, mathematicians and engineers have developed sophisticated schemes to build redundancy into binary strings to correct errors in transmission!

One example can be illustrated with Venn diagrams!

Claude Shannon (1916-2001), “Father of Information Theory”

Computing the Check Digits

The original message is four digits long

We will call these digits I, II, III, and IV

We will add three new digits, V, VI, and VII

Draw three intersecting circles as shown here

Digits V, VI, and VII should be chosen so that each circle contains an even number of ones

[Figure: Venn diagram of three intersecting circles whose seven regions are labeled I through VII]

A Hamming (7,4) code

An (n,k) Hamming code encodes a message of k digits into a code word of n digits.

The 16 possible messages:

0000 1010 0011 1111
0001 1100 1110
0010 1001 1101
0100 0110 1011
1000 0101 0111

Binary Linear Codes

The error correcting scheme we just saw is a special case of a Hamming code.

These codes were first proposed in 1948 by Richard Hamming (1915-1998), a mathematician working at Bell Laboratories.

Hamming was frustrated with losing a week’s worth of work due to an error that a computer could detect, but not correct.

Appending Digits to the Message

The message we want to send is “0100”

Digit V should be 1 so that the first circle has two ones

Digit VI should be 0 so that the second circle has zero ones (zero is even!)

Digit VII should be 1 so that the last circle has two ones

Our message is now 0100101

[Figure: the Venn diagram filled in with the digits of 0100101]

Encoding those messages

Message Codeword Message Codeword

0000 0000000 0110 0110010

0001 0001011 0101 0101110

0010 0010111 0011 0011100

0100 0100101 1110 1110100

1000 1000110 1101 1101000

1010 1010001 1011 1011010

1100 1100011 0111 0111001

1001 1001101 1111 1111111

Detecting and Correcting Errors

Now watch what happens when there is a single digit error

We transmit the message 0100101 and the rover receives 0101101

The rover can tell that the second and third circles have odd numbers of ones, but the first circle is correct

So the error must be in the digit that is in the second and third circles, but not the first: that’s digit IV

Since we know digit IV is wrong, there is only one way to fix it: change it from 1 to 0

[Figure: the Venn diagram filled in with the received digits 0101101]
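The circle argument can be traced in a short Python sketch. The circle memberships below are an assumption read off the deck's check-digit rules (V with I, II, III; VI with I, III, IV; VII with II, III, IV), and the names `CIRCLES`, `odd_circles` and `correct_single_error` are invented for this illustration.

```python
# Each circle lists the digit positions it contains (I..VII -> 1..7).
CIRCLES = {
    "first": (1, 2, 3, 5),
    "second": (1, 3, 4, 6),
    "third": (2, 3, 4, 7),
}

def odd_circles(word: str) -> set[str]:
    """Names of the circles that contain an odd number of ones."""
    return {name for name, positions in CIRCLES.items()
            if sum(int(word[p - 1]) for p in positions) % 2 == 1}

def correct_single_error(word: str) -> str:
    """Flip the one digit that lies in exactly the odd circles."""
    bad = odd_circles(word)
    if not bad:
        return word  # every circle is even: nothing to fix
    for position in range(1, 8):
        containing = {name for name, positions in CIRCLES.items()
                      if position in positions}
        if containing == bad:
            flipped = "1" if word[position - 1] == "0" else "0"
            return word[:position - 1] + flipped + word[position:]
    return word

print(sorted(odd_circles("0101101")))    # ['second', 'third']
print(correct_single_error("0101101"))   # 0100101
```

Every one of the seven digits sits in a different combination of circles, which is why the set of odd circles pins down a single digit.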

Try It!

Encode the message 1110 using this method

You have received the message 0011101. Find and correct the error in this message.

Extending This Idea

This method only allows us to encode 4-bit messages (16 possibilities), which isn’t even enough to represent the alphabet!

However, if we use more digits, we won’t be able to use the circle method to detect and correct errors

We’ll have to come up with a different method that allows for more digits

Parity Check Sums

The circle method is a specific example of a “parity check sum”

The “parity” of a number is 1 if the number is odd and 0 if the number is even

For example, digit V is 0 if I + II + III is even, and 1 if I + II + III is odd

Conventional Notation

Instead of using Roman numerals, we’ll use a1 to represent the first digit of the message, a2 to represent the second digit, and so on

We’ll use c1 to represent the first check digit, c2 to represent the second, etc.

Old Rules in the New Notation

Using this notation, our rules for our check digits become
– c1 = 0 if a1 + a2 + a3 is even
– c1 = 1 if a1 + a2 + a3 is odd
– c2 = 0 if a1 + a3 + a4 is even
– c2 = 1 if a1 + a3 + a4 is odd
– c3 = 0 if a2 + a3 + a4 is even
– c3 = 1 if a2 + a3 + a4 is odd

[Figure: the Venn diagram relabeled with a1, a2, a3, a4 and c1, c2, c3]
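The three rules translate directly into code. The following is a minimal sketch; the function name `encode_7_4` is invented here, and "parity" is computed as the remainder after dividing the sum by 2.

```python
def encode_7_4(message: str) -> str:
    """Append check digits c1, c2, c3 to a 4-digit binary message."""
    a1, a2, a3, a4 = (int(d) for d in message)
    c1 = (a1 + a2 + a3) % 2   # parity of a1 + a2 + a3
    c2 = (a1 + a3 + a4) % 2   # parity of a1 + a3 + a4
    c3 = (a2 + a3 + a4) % 2   # parity of a2 + a3 + a4
    return message + f"{c1}{c2}{c3}"

print(encode_7_4("0100"))   # 0100101
```

Running it over all 16 messages is one way to check a hand-built code table.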

An Alternative System

If we want to have a system that has enough code words for the entire alphabet, we need to have 5 message digits: a1, a2, a3, a4, a5

We will also need more check digits to help us decode our message: c1, c2, c3, c4

Rules for the New System

We can’t use the circles to determine the check digits for our new system, so we use the parity notation from before

c1 is the parity of a1 + a2 + a3 + a4




Making the Code

Using 5 digits in our message gives us 32 possible messages; we’ll use the first 26 to represent letters of the alphabet

On the next slide you’ll see the code itself: each letter together with the 9-digit code representing it

The Code

Letter Code Letter Code

A 000000000 N 011010101

B 000010111 O 011101100

C 000101110 P 011111011

D 000111001 Q 100001011

E 001001101 R 100011100

F 001011010 S 100100101

G 001100011 T 100110010

H 001110100 U 101000110

I 010001111 V 101010001

J 010011000 W 101101000

K 010100001 X 101111111

L 010110110 Y 110000100

M 011000010 Z 110010011
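Only the rule for c1 is stated explicitly for this 9-digit code. The sketch below tests a guess for all four rules against the table. The rules for c2, c3 and c4 in `RULES` are an inference from the 26 code words, not something the slides state, and all names are made up for this example.

```python
import string

# The 26 code words from the table, in order A..Z.
CODEWORDS = (
    "000000000 000010111 000101110 000111001 001001101 001011010 "
    "001100011 001110100 010001111 010011000 010100001 010110110 "
    "011000010 011010101 011101100 011111011 100001011 100011100 "
    "100100101 100110010 101000110 101010001 101101000 101111111 "
    "110000100 110010011"
).split()
CODE = dict(zip(string.ascii_uppercase, CODEWORDS))

# Message digits summed for c1..c4. The first entry is the stated rule;
# the other three are inferred from the table (an assumption).
RULES = [(1, 2, 3, 4), (2, 3, 4, 5), (1, 2, 4, 5), (1, 2, 3, 5)]

def encode_9_5(message: str) -> str:
    """Append four parity check digits to a 5-digit binary message."""
    bits = [int(d) for d in message]
    checks = [sum(bits[i - 1] for i in rule) % 2 for rule in RULES]
    return message + "".join(str(c) for c in checks)

def rules_match_table() -> bool:
    """True if the guessed rules reproduce every code word in the table."""
    return all(encode_9_5(word[:5]) == word for word in CODEWORDS)

print(rules_match_table())   # True
```

The check only shows that these rules are consistent with the table; the original slide may have presented them in a different form.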

Using the Code

Now that we have our code, using it is simple

When we receive a message, we simply look it up on the table

But what happens when the message we receive isn’t on the list?

Then we know an error has occurred, but how do we fix it? We can’t use the circle method anymore

Beyond Circles

Using this new system, how do we decode messages?

Simply compare the (incorrect) message with the list of possible correct messages and pick the “closest” one

What should “closest” mean?

The distance between the two messages is the number of digits in which they differ

The Distance Between Messages

What is the distance between 1100101 and 1010101?
– The messages differ in the 2nd and 3rd digits, so the distance is 2

What is the distance between 1110010 and 0001100?
– The messages differ in all but the 7th digit, so the distance is 6

Hamming Distance

Def: The Hamming distance between two vectors of a vector space is the number of components in which they differ, denoted d(u,v).

Ex. 1: The Hamming distance between

v = [ 1 0 1 1 0 1 0 ]

u = [ 0 1 1 1 1 0 0 ]

d(u, v) = 4

Notice: d(u,v) = d(v,u)
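Counting the positions where two strings differ takes one line of Python. This is a sketch with binary words written as strings; the function name is chosen for this example.

```python
def hamming_distance(u: str, v: str) -> int:
    """Number of positions in which two equal-length strings differ."""
    if len(u) != len(v):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(u, v))

v = "1011010"
u = "0111100"
print(hamming_distance(u, v))   # 4
print(hamming_distance(v, u))   # 4, the order does not matter
```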

Hamming weight of a vector

Def: The Hamming weight of a vector is the number of nonzero components of the vector, denoted wt(u).

Hamming weight of a code

Def: The Hamming weight of a linear code is the minimum weight of any nonzero vector in the code.

Hamming Weight

The Hamming weights of v = [ 1 0 1 1 0 1 0 ], u = [ 0 1 1 1 1 0 0 ], and w = [ 0 1 0 0 1 0 1 ] are:

wt(v) = 4 wt(u) = 4 wt(w) = 3
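Both definitions can be checked by machine. The sketch below counts nonzero digits, then finds the weight of the (7,4) code by listing all 16 code words. It assumes the check-digit rules c1 = a1+a2+a3, c2 = a1+a3+a4, c3 = a2+a3+a4 used for that code, and its function names are invented for this example.

```python
from itertools import product

def hamming_weight(v: str) -> int:
    """Number of nonzero components of a word written as a string."""
    return sum(ch != "0" for ch in v)

def encode_7_4(message: str) -> str:
    """The (7,4) code: four message digits plus three parity digits."""
    a1, a2, a3, a4 = (int(d) for d in message)
    checks = ((a1 + a2 + a3) % 2, (a1 + a3 + a4) % 2, (a2 + a3 + a4) % 2)
    return message + "".join(str(c) for c in checks)

codewords = [encode_7_4("".join(bits)) for bits in product("01", repeat=4)]
code_weight = min(hamming_weight(c) for c in codewords if hamming_weight(c) > 0)

print(hamming_weight("1011010"))   # 4
print(code_weight)                 # 3
```

A code weight of 3 means any two distinct code words differ in at least three places, which is what lets a single error be corrected.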

Nearest-Neighbor Decoding

The nearest neighbor decoding method decodes a received message as the code word that agrees with the message in the most positions

Trying it Out

Suppose that, using our alphabet code, we receive the message 010100011

We can check and see that this message is not on our list

How far away is it from the messages on our list?

Distances From 010100011

Code Distance Code Distance

000000000 4 011010101 5

000010111 4 011101100 5

000101110 4 011111011 3

000111001 4 100001011 4

001001101 6 100011100 8

001011010 6 100100101 4

001100011 2 100110010 4

001110100 6 101000110 6

010001111 3 101010001 6

010011000 5 101101000 6

010100001 1 101111111 6

010110110 3 110000100 5

011000010 3 110010011 3

Fixing the Error

Since 010100001 was closest to the message that we received, we know that this is the most likely actual transmission

We can look this corrected message up in our table and see that the transmitted message was (probably) “K”

This might still be incorrect, but other errors can be corrected using context clues or check digits
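The table lookup above can be automated. This is a minimal sketch of nearest-neighbor decoding for the 9-digit alphabet code; the names are invented for the example, and breaking ties by taking the first closest code word is an assumption the slides do not address.

```python
import string

# The 26 code words of the alphabet code, in order A..Z.
CODEWORDS = (
    "000000000 000010111 000101110 000111001 001001101 001011010 "
    "001100011 001110100 010001111 010011000 010100001 010110110 "
    "011000010 011010101 011101100 011111011 100001011 100011100 "
    "100100101 100110010 101000110 101010001 101101000 101111111 "
    "110000100 110010011"
).split()
LETTER_FOR = dict(zip(CODEWORDS, string.ascii_uppercase))

def distance(u: str, v: str) -> int:
    """Hamming distance between two equal-length strings."""
    return sum(a != b for a, b in zip(u, v))

def nearest_neighbor(received: str) -> tuple[str, int]:
    """Return the letter of the closest code word and its distance."""
    best = min(CODEWORDS, key=lambda word: distance(received, word))
    return LETTER_FOR[best], distance(received, best)

print(nearest_neighbor("010100011"))   # ('K', 1)
```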

Distances From 1010 110

The distances between message “1010 110” and all possible code words:

code word  distance    code word  distance
0000 000   4           0110 010   3
0001 011   5           0101 110   4
0010 111   2           0011 100   3
0100 101   5           1110 100   2
1000 110   1           1101 000   5
1100 011   4           1011 010   2
1010 001   3           0111 001   6
1001 101   4           1111 111   3

Data compression

Data compression is important to storage systems because it allows more bytes to be packed into a given storage medium than when the data is uncompressed.

Some storage devices (notably tape) compress data automatically as it is written, resulting in less tape consumption and significantly faster backup operations.

Compression also reduces file transfer time, saving time and communications bandwidth.

Compression

There are two main categories
– Lossless
– Lossy

Compression factor

A good metric for compression is the compression factor (or compression ratio) given by:

compression factor = (1 – (compressed size / uncompressed size)) * 100%

If we have a 100KB file that we compress to 40KB, we have a compression factor of:

(1 – (40/100)) * 100% = 60%
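As a quick check of the arithmetic, here is a small sketch. It assumes the compression factor is defined as one minus the ratio of compressed to original size, expressed as a percentage, which is the form the run-length example in this chapter uses; the function name is made up.

```python
def compression_factor(original_size: float, compressed_size: float) -> float:
    """Percentage of the original size saved by compression
    (assumed definition: (1 - compressed/original) * 100)."""
    return (1 - compressed_size / original_size) * 100

print(compression_factor(100, 40))   # about 60 (percent) for 100KB -> 40KB
print(compression_factor(32, 16))    # 50.0
```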

Information Theory

Shannon, C.E. (1948). A mathematical theory of communication. Bell System Technical Journal 27, 379-423, 623-656.

Very precise definition of information as a message made up of symbols from some finite alphabet.

Shannon’s definition of information ignores the meaning conveyed by the message

Information Theory (cont.)

Information content is a quantifiable amount

The information content of some message is inversely related to the probability that that message will be received from the set of all possible messages.

The message with the lowest probability of being received contains the highest information content.

Information content

Compression is achieved by removing data redundancy while preserving information content.

The information content of a group of bytes (a message) is its entropy.
– Data with low entropy permit a larger compression ratio than data with high entropy.

Entropy 熵, H, is a function of symbol frequency. It is the weighted average of the number of bits required to encode the symbols of a message. For a single symbol x:

H = -P(x) log2 P(x)

Entropy of a message

The entropy of the entire message is the sum of the individual symbol entropies:

H = Σi -P(xi) log2 P(xi)

where xi is the i-th symbol
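The entropy sum can be computed directly from symbol counts. This sketch assumes the sum runs over the distinct symbols of the message and that P(xi) is estimated as the symbol's share of the message, so the result is in bits per symbol; the function name is invented for the example.

```python
from collections import Counter
from math import log2

def entropy_bits_per_symbol(message: str) -> float:
    """Sum of -P(x) * log2(P(x)) over the distinct symbols of a message,
    with P(x) estimated from how often x occurs in the message."""
    counts = Counter(message)
    total = len(message)
    return -sum((n / total) * log2(n / total) for n in counts.values())

print(entropy_bits_per_symbol("0101"))          # 1.0
print(entropy_bits_per_symbol("ABRACADABRA"))   # roughly 2.04
```

A message made of one repeated symbol has entropy 0 under this estimate: it is completely predictable.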

Information and entropy are measures of unexpectedness. Entropy effectively limits the strongest lossless compression possible.

Entropy

Entropy is a measure of information content: the minimum number of bits required to store data without any loss of information.

Entropy is sometimes called a measure of surprise, the uncertainty associated with the message

– A highly predictable sequence contains little actual information Example: 11011011011011011011011011 (what’s next?)

– A completely unpredictable sequence of n bits contains n bits of information Example: 01000001110110011010010000 (what’s next?)

– Note that nothing says the information has to have any “meaning” (whatever that is)

A fair coin flip has an entropy of one bit. If the coin is not fair, then the uncertainty is lower and the entropy is also lower.

Entropy of a coin flip

Entropy H(X) of a coin flip, measured in bits; graphed versus the fairness of the coin Pr(X=1).

Note the maximum of the graph depends on the distribution: Here, at most 1 bit is required to communicate the outcome of a fair coin flip; but the result of a fair die would require at most log2(6) bits.
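The curve described above can be reproduced numerically. A small sketch follows; the function name `coin_entropy` and the sample probabilities are choices made for this example.

```python
from math import log2

def coin_entropy(p: float) -> float:
    """Entropy in bits of a coin that lands heads with probability p."""
    if p in (0, 1):
        return 0.0            # a certain outcome carries no information
    return -p * log2(p) - (1 - p) * log2(1 - p)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(coin_entropy(p), 3))

print(log2(6))   # bits for one roll of a fair die, about 2.585
```

The values rise from 0 at p = 0 to a peak of 1 bit at p = 0.5 and fall back to 0 at p = 1.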

Inefficiency of ASCII

Realization: In many natural (English) files, we are much more likely to see the letter ‘e’ than the character ‘&’, yet they are both encoded using 7 bits!

Solution: Use variable length encoding! The encoding for ‘e’ should be shorter than the encoding for ‘&’.

ASCII (cont.)

Here are the ASCII bit strings for the capital letters in our alphabet:

Letter ASCII Letter ASCII

A 0100 0001 N 0100 1110

B 0100 0010 O 0100 1111

C 0100 0011 P 0101 0000

D 0100 0100 Q 0101 0001

E 0100 0101 R 0101 0010

F 0100 0110 S 0101 0011

G 0100 0111 T 0101 0100

H 0100 1000 U 0101 0101

I 0100 1001 V 0101 0110

J 0100 1010 W 0101 0111

K 0100 1011 X 0101 1000

L 0100 1100 Y 0101 1001

M 0100 1101 Z 0101 1010

Variable Length Coding

Assume we know the distribution of characters (‘e’ appears 1000 times, ‘&’ appears 1 time)

Each character will be encoded using a number of bits that decreases as its frequency increases (made precise later).

Need a ‘prefix free’ encoding: if ‘e’ = 001 then we cannot assign ‘&’ to be 0011. Since the encoding is variable length, the decoder needs to know when each code word stops.

Example: Morse code

Morse code is a method of transmitting textual information as a series of on-off tones, lights, or clicks that can be directly understood by a skilled listener or observer without special equipment.

Each character is a sequence of dots and dashes, with the shorter sequences assigned to the more frequently used letters in English: the letter 'E' is represented by a single dot, and the letter 'T' by a single dash.

Invented in the early 1840s, it was extensively used in the 1890s for early radio communication before it was possible to transmit voice.

A U.S. Navy seaman sends Morse code signals in 2005.

Vibroplex semiautomatic key. The paddle, when pressed to the right by the thumb, generates a series of dits. When pressed to the left by the knuckle of the index finger, the paddle generates a dah.

International Morse Code

Relative Frequency of Letters in English Text

Encoding Trees

Think of encoding as an (unbalanced) tree. Data is in leaf nodes only (prefix free).

‘e’ = 0, ‘a’ = 10, ‘b’ = 11

How to decode ‘01110’?

[Figure: encoding tree with branches labeled 0 and 1; leaf e sits at depth 1, leaves a and b at depth 2]
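One way to answer the question is to read the bits left to right and emit a character whenever the bits read so far match a code word. The sketch below does this with a dictionary instead of an explicit tree; the function name is invented for the example.

```python
CODE = {"e": "0", "a": "10", "b": "11"}

def decode_prefix_free(bits: str, code: dict[str, str]) -> str:
    """Decode a bit string produced by a prefix-free code."""
    inverse = {word: symbol for symbol, word in code.items()}
    decoded, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:         # reached a leaf of the tree
            decoded.append(inverse[current])
            current = ""               # start again from the root
    if current:
        raise ValueError("bit string ends in the middle of a code word")
    return "".join(decoded)

print(decode_prefix_free("01110", CODE))   # eba
```

This only works because no code word is a prefix of another, so the first match is always the right one.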

Cost of a Tree

For each character ci let fi be its frequency in the file.

Given an encoding tree T, let di be the depth of ci in the tree (number of bits needed to encode the character).

The length of the file after encoding it with the coding scheme defined by T will be C(T)= Σdi fi

Example Huffman encoding

A = 0
B = 100
C = 1010
D = 1011
R = 11

ABRACADABRA = 01001101010010110100110

This is eleven letters in 23 bits

A fixed-width encoding would require 3 bits for 5 different letters, or 33 bits for 11 letters

Notice that the encoded bit string can be decoded!

Why it works

In this example, A was the most common letter

In ABRACADABRA:
– 5 As: code for A is 1 bit long
– 2 Rs: code for R is 2 bits long
– 2 Bs: code for B is 3 bits long
– 1 C: code for C is 4 bits long
– 1 D: code for D is 4 bits long
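The count above is the tree cost C(T) = Σ di fi applied to ABRACADABRA, where the depth di is the length of each letter's code word. A short sketch, with variable names chosen for this example:

```python
from collections import Counter

CODE = {"A": "0", "B": "100", "C": "1010", "D": "1011", "R": "11"}
text = "ABRACADABRA"

# Encode letter by letter.
encoded = "".join(CODE[letter] for letter in text)

# Cost of the tree: depth (code length) times frequency, summed over letters.
frequencies = Counter(text)
cost = sum(len(CODE[letter]) * count for letter, count in frequencies.items())

print(encoded)        # 01001101010010110100110
print(cost)           # 23
print(3 * len(text))  # 33 bits with a fixed 3-bit code
```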

Creating a Huffman encoding

For each encoding unit (letter, in this example), associate a frequency (number of times it occurs)
– Use a percentage or a probability

Create a binary tree whose children are the encoding units with the smallest frequencies
– The frequency of the root is the sum of the frequencies of the leaves

Repeat this procedure until all the encoding units are in the binary tree

Example, step I

Assume that relative frequencies are:
– A: 40
– B: 20
– C: 10
– D: 10
– R: 20

(I chose simpler numbers than the real frequencies)

Smallest numbers are 10 and 10 (C and D), so connect those

Example, step II

C and D have already been used, and the new node above them (call it C+D) has value 20

The smallest values are B, C+D, and R, all of which have value 20
– Connect any two of these; it doesn’t matter which two

Example, step III

The smallest value is R (20), while A and B+C+D both have value 40

Connect R to either of the others

[Figure: the partial tree, with the root and a leaf labeled]

Example, step IV

Connect the final two nodes

Example, step V

Assign 0 to left branches, 1 to right branches

Each encoding is a path from the root

A = 0
B = 100
C = 1010
D = 1011
R = 11

Each path terminates at a leaf

Do you see why encoded strings are decodable?
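The five steps can be sketched with a priority queue. This is one possible implementation, not the slides' own: the function name is invented, and ties are broken by insertion order, so the individual code words may differ from the ones above while the total encoded length comes out the same.

```python
import heapq
from itertools import count

def huffman_code(frequencies: dict[str, int]) -> dict[str, str]:
    """Build a prefix-free code by repeatedly joining the two
    lowest-frequency nodes (symbols must be strings)."""
    tiebreak = count()   # keeps heap entries comparable when frequencies tie
    heap = [(freq, next(tiebreak), symbol) for symbol, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    _, _, tree = heap[0]

    codes: dict[str, str] = {}
    def walk(node, path: str) -> None:
        if isinstance(node, tuple):        # internal node: 0 left, 1 right
            walk(node[0], path + "0")
            walk(node[1], path + "1")
        else:                              # leaf: record the path from the root
            codes[node] = path or "0"
    walk(tree, "")
    return codes

frequencies = {"A": 40, "B": 20, "C": 10, "D": 10, "R": 20}
codes = huffman_code(frequencies)
total = sum(len(codes[s]) * f for s, f in frequencies.items())
print(codes)
print(total)   # 220, the same total as the code on the slide
```

The slide's code gives 40·1 + 20·3 + 10·4 + 10·4 + 20·2 = 220 as well, which illustrates the remark that it doesn't matter which tied nodes are connected.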

Unique prefix property

A = 0
B = 100
C = 1010
D = 1011
R = 11

No bit string is a prefix of any other bit string

For example, if we added E=01, then A (0) would be a prefix of E

Similarly, if we added F=10, then it would be a prefix of three other encodings (B=100, C=1010, and D=1011)

The unique prefix property holds because, in a binary tree, a leaf is not on a path to any other node
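The property is easy to test mechanically. The sketch below lists every pair in which one code word is a prefix of another; the function name `prefix_conflicts` is invented for this example.

```python
def prefix_conflicts(code: dict[str, str]) -> list[tuple[str, str]]:
    """Pairs (s1, s2) where the code word of s1 is a prefix of that of s2."""
    items = list(code.items())
    return [(s1, s2)
            for s1, w1 in items
            for s2, w2 in items
            if s1 != s2 and w2.startswith(w1)]

CODE = {"A": "0", "B": "100", "C": "1010", "D": "1011", "R": "11"}

print(prefix_conflicts(CODE))                  # []
print(prefix_conflicts({**CODE, "E": "01"}))   # [('A', 'E')]
print(prefix_conflicts({**CODE, "F": "10"}))   # [('F', 'B'), ('F', 'C'), ('F', 'D')]
```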

Practical considerations

It is not practical to create a Huffman encoding for a single short string, such as ABRACADABRA
– To decode it, you would need the code table
– If you include the code table in the entire message, the whole thing is bigger than just the ASCII message

Huffman encoding is practical if:
– The encoded string is large relative to the code table, OR
– We agree on the code table beforehand

For example, it’s easy to find a table of letter frequencies for English (or any other alphabet-based language)

Data compression

Huffman encoding is a simple example of data compression: representing data in fewer bits than it would otherwise need

A more sophisticated method is GIF (Graphics Interchange Format) compression, for .gif files

Another is JPEG (Joint Photographic Experts Group), for .jpg files
– Unlike the others, JPEG is lossy: it loses information
– Generally OK for photographs (if you don’t compress them too much) because decompression adds “fake” data very similar to the original

JPEG Compression

Photographic images incorporate a great deal of information. However, much of that information can be lost without objectionable deterioration in image quality.

With this in mind, JPEG allows user-selectable image quality, but even at the “best” quality levels, JPEG makes an image file smaller owing to its multiple-step compression algorithm.

It’s important to remember that JPEG is lossy, even at the highest quality setting. It should be used only when the loss can be tolerated.

2. Run Length Encoding (RLE)

RLE: When data contain strings of repeated symbols (such as bits or characters), the strings can be replaced by a special marker, followed by the repeated symbol, followed by the number of occurrences. In general, the number of occurrences (length) is shown by a two digit number.

If the special marker itself occurs in the data, it is duplicated (as in character stuffing).

RLE can be used in audio (silence is a run of 0s) and video (run of a picture element having the same brightness and color).

An Example of Run-Length Encoding

Example
– # is chosen as the special marker.
– A two-digit number is chosen for the repetition count.
– Consider the following string of decimal digits:

15000000000045678111111111111118

– Using the RLE algorithm, the above digit string would be encoded as:

15#01045678#1148

– The compression ratio would be

(1 – (16/32)) * 100% = 50%
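The example can be reproduced with a short encoder. This is a sketch under stated assumptions: only runs of at least four symbols are replaced (shorter runs would not get shorter), runs longer than 99 are split so the count fits in two digits, and a literal marker in the data is doubled. The threshold and the function name are choices made for this illustration, not details from the slides.

```python
def rle_encode(data: str, marker: str = "#", min_run: int = 4) -> str:
    """Replace long runs with marker + symbol + two-digit count."""
    out = []
    i = 0
    while i < len(data):
        symbol = data[i]
        j = i
        # Extend the run, but never past 99 symbols (two-digit count).
        while j < len(data) and data[j] == symbol and j - i < 99:
            j += 1
        run = j - i
        if symbol == marker:
            out.append(marker * 2 * run)      # a literal marker is doubled
        elif run >= min_run:
            out.append(f"{marker}{symbol}{run:02d}")
        else:
            out.append(symbol * run)          # short runs are copied as-is
        i = j
    return "".join(out)

# The string from the example: 15, ten 0s, 45678, fourteen 1s, 8.
data = "15" + "0" * 10 + "45678" + "1" * 14 + "8"
encoded = rle_encode(data)
print(encoded)                                 # 15#01045678#1148
print((1 - len(encoded) / len(data)) * 100)    # 50.0
```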



Model the genetic code

The genome 基因組 is the instruction manual for life, an information system that specifies the biological body.

In its simplest form, it consists of a linear sequence of four extremely small molecules, called nucleotides.

These nucleotides make up the “steps” of the spiral-staircase structure of the DNA and are the letters of the genetic code.

The structure of part of a DNA double helix

DNA is a nucleic acid that contains the genetic instructions used in the development and functioning of all known living organisms.

A DNA double helix

The main role of DNA (Deoxyribonucleic acid 脱氧核糖核酸) molecules is the long-term storage of information.

Four bases found in DNA

The DNA double helix is stabilized by hydrogen bonds between the bases attached to the two strands. The four bases found in DNA are adenine (abbreviated A), cytosine (C), guanine (G) and thymine (T). Each base is attached to a sugar/phosphate to form a complete nucleotide.

Escherichia coli genome

>gb|U00096|U00096 Escherichia coli 大腸桿菌 K-12 MG1655 complete genome 基因組 AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTG

TGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA

Hierarchies of symbols

English               computer        genetics
letter (26)           bit (2)         nucleotide 核苷酸 (4)
word (1-28 letters)   byte (8 bits)   codon (3 nucleotides)
sentence              line            gene
book                  program         genome

Information Theory

[Figure: a typical communication system (Shannon, 1948). An information source passes a message to a transmitter, which sends a signal; a noise source acts on the signal; the receiver turns the received signal back into a message for the destination.]

DNA from an Information Theory Perspective

[Figure labels: Parents, DNA, Child, Mutation]

The “alphabet” for DNA is {A,C,G,T}. Each DNA strand is a sequence of symbols from this alphabet.

These sequences are replicated and translated in processes reminiscent of Shannon’s communication model.

There is redundancy in the genetic code that enhances its error tolerance.

The Central Dogma of Molecular Biology

[Figure: DNA to DNA (replication), DNA to RNA (transcription), RNA to DNA (reverse transcription), RNA to protein (translation). RNA: ribonucleic acid 核糖核酸]

What Information Theory Contributes to Genetic Biology

A useful model for how genetic information is stored and transmitted in the cell

A theoretical justification for the observed redundancy of the genetic code

Data Compression in gene sequences

As an illustration of data compression, let’s use the idea of gene sequences.

Biologists are able to describe genes by specifying sequences composed of the four letters A, T, G, and C, which stand for the four nucleotides adenine, thymine, guanine, and cytosine, respectively.

Suppose we wish to encode the sequence AAACAGTAAC.

Data Compression (cont.)

One way is to use the (fixed-length) code: A→00, C→01, T→10, and G→11.

Then AAACAGTAAC is encoded as: 00000001001110000001.

From experience, biologists know that the frequency of occurrence from most frequent to least frequent is A, C, T, G.

Thus, it would be more efficient to choose the following binary code: A→0, C→10, T→110, and G→111.

With this new code, AAACAGTAAC is encoded as: 0001001111100010.

Notice that this new binary code word has 16 digits versus 20 digits for the fixed-length code, a decrease of 20%.

This new code is an example of data compression!

Data Compression (cont.)

Suppose we wish to decode a sequence encoded with the new data compression scheme, such as 0001001111100010.

Looking at groups of at most three digits at a time, we can decode this message!

Since 0 only occurs at the end of a code word, and the code words that end in 0 are 0, 10, and 110, we can put a mark after every 0, as this will be the end of a code word.

The only time a sequence of 111 occurs is for the code word 111, so we can put a mark after every triple of 1’s.

Thus, we have: 0,0,0,10,0,111,110,0,0,10, which is AAACAGTAAC.
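Both codes and the marking rule can be checked with a few lines of Python. This is a sketch: the dictionary and function names are invented for the example, and the decoder implements the rule "a code word ends at every 0 or after three 1's".

```python
FIXED = {"A": "00", "C": "01", "T": "10", "G": "11"}
VARIABLE = {"A": "0", "C": "10", "T": "110", "G": "111"}

def encode(sequence: str, code: dict[str, str]) -> str:
    """Replace each nucleotide letter with its code word."""
    return "".join(code[base] for base in sequence)

def decode_variable(bits: str) -> str:
    """Decode the variable-length code: a code word ends at each 0
    or once three 1's have been read."""
    inverse = {word: base for base, word in VARIABLE.items()}
    decoded, word = [], ""
    for bit in bits:
        word += bit
        if bit == "0" or word == "111":
            decoded.append(inverse[word])
            word = ""
    if word:
        raise ValueError("bit string ends in the middle of a code word")
    return "".join(decoded)

sequence = "AAACAGTAAC"
print(encode(sequence, FIXED))               # 00000001001110000001
print(encode(sequence, VARIABLE))            # 0001001111100010
print(decode_variable("0001001111100010"))   # AAACAGTAAC
```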
