
Computer Science 210 Computer Systems 2006 Semester 2

Lecture Notes

Data Representation

Bruce Hutton Department of Computer Science

University of Auckland Tuesday, October 17, 2006


Contents

1. Number Representation
    §1.1 Asian Numbers
    §1.2 Roman Numerals
    §1.3 Computer Based Numbers
2. Base conversion
    §2.1 Conversion from internal form into a base
    §2.2 Conversion from a base into internal form
    §2.3 Conversion between binary and hexadecimal
3. Performing Integer Arithmetic in a Base
    §3.1 Addition
    §3.2 Subtraction
    §3.3 Comparison
    §3.4 Shifting left and right
    §3.5 Multiplication
    §3.6 Division
4. Representation of Unsigned and Signed Integers
    §4.1 Unsigned representation
    §4.2 Sign/magnitude representation
    §4.3 Excess (biased) representation
    §4.4 Two’s complement
    §4.5 Zero and Sign extension
    §4.6 Carry and Overflow
    §4.7 Representation of signed numbers in other bases
5. Bits as Sets, and Extracting Fields
6. The ASCII Character Set
7. Unicode Characters
8. Floating Point Numbers
    §8.1 Internal Representation of Floating Point Numbers
    §8.2 Textual Representation of Floating Point Numbers
9. Storage of Data in Computer Memory
    §9.1 Bits, bytes, words, longwords and quadwords
    §9.2 Big endian and little endian storage of data
10. Storage of structured objects
    §10.1 Arrays
    §10.2 Descriptors
    §10.3 Multi-dimensional arrays
    §10.4 Records
    §10.5 Pointers
    §10.6 Class instances
11. Appendices
    §11.1 Base Conversion Table and Powers of two
    §11.2 Hexadecimal Addition Table
    §11.3 Hexadecimal Multiplication Table
    §11.4 The ASCII character set
    §11.5 Names for powers of 10

1. Number Representation

Numbers have some meaning, independent of how they are represented. For example, if we have a (New Zealand) carton of eggs, the number of eggs is the same, no matter whether we write “12”, “twelve”, “a dozen”, “XII” or “十二”.

Because we have ten fingers (including thumbs), people tend to represent integers in a decimal notation. A digit sequence $a_{n-1} a_{n-2} \ldots a_2 a_1 a_0$ represents the integer

$$a_{n-1} \cdot \text{ten}^{n-1} + a_{n-2} \cdot \text{ten}^{n-2} + \cdots + a_2 \cdot \text{ten}^2 + a_1 \cdot \text{ten}^1 + a_0 \cdot \text{ten}^0.$$

For example, 365 represents $3 \cdot \text{ten}^2 + 6 \cdot \text{ten}^1 + 5 \cdot \text{ten}^0$.

§1.1 Asian Numbers

There are alternative ways of representing numbers. For example, we can represent 1, 2, 3, ... 9 by 一, 二, 三, 四, 五, 六, 七, 八, 九. We can represent 10, 100, 1000 by 十, 百, 千. We can represent $10^4$ by 万, $10^8$ by 億, $10^{12}$ by 兆, etc. We can then build up numbers out of these, grouping digits in lots of 4, rather than 3. For example, 1,234,567 becomes

In Chinese: 一百二十三万四千五百六十七
In Japanese: 百二十三万四千五百六十七
In Korean: 백이십삼만사천오백육십칠

In Japanese and Korean, if a digit is 0, we miss out both the digit, and the power of ten marker, so 2003 is represented by 二千三 in Japanese. In Chinese, we mark internal sequences of zeros by a single 零, so 2003 is represented by 二千零三 (although years tend to be represented differently - we also tend to say years in a different manner from numbers in English). (We do much the same when saying numbers in English, when we insert “and” before the tens or units, and say “two thousand and three” for 2003.) To give a more complex example, 200304 is represented as 二十万零三百零四 in Chinese. A suffix representing the class of object being counted is usually also appended, for example 年 is appended for years, 月 for months, 日 for days, 円 for yen, etc. In Japanese, a 1 is omitted in front of 10, 100, optional in front of 1000, and compulsory in front of 10000. In Chinese, a leading 1 is usually explicitly indicated by a 一, except for numbers 10 to 19. In Korean, the 1 in front of a power of ten is always omitted.



Value   Character   Japanese      Chinese   Korean
0       零          rei           líng      영 yeong
1       一          ichi          yī        일 il
2       二          ni            èr        이 i
3       三          san           sān       삼 sam
4       四          shi/yon       sì        사 sa
5       五          go            wǔ        오 o
6       六          roku          liù       육 yuk
7       七          shichi/nana   qī        칠 ch’il
8       八          hachi         bā        팔 p’al
9       九          kyuu          jiǔ       구 gu
10      十          juu           shí       십 sip
100     百          hyaku         bǎi       백 baek
1000    千          sen           qiān      천 ch’eon
10000   万          man           wàn       만 man

Exercise Write the base 10 number 1024 in Chinese, Japanese and Korean. Exercise What is the algorithm for saying numbers in New Zealand English? Write a computer program to translate numbers into New Zealand English.

§1.2 Roman Numerals

We can represent numbers in Roman Numerals, using I (1), V (5), X (10), L (50), C (100), D (500), M (1000). Generally, numbers are represented by writing down these symbols multiple times, so that their sum represents the number. The representation is rather like building up a value out of a minimal number of coins. For example, 632 is represented by DCXXXII.

However, digits corresponding to 4 and 9 are represented by preceding the representation of 5 or 10 by the representation of 1. For example, 40 is represented by XL, and 900 is represented by CM. This is rather like tendering a higher denomination coin, then receiving change.

Thus we represent the numbers 1 to 10 by I, II, III, IV, V, VI, VII, VIII, IX, X. We can represent multiples of 10 up to 100 by X, XX, XXX, XL, L, LX, LXX, LXXX, XC, C. We can represent multiples of 100 up to 1000 by C, CC, CCC, CD, D, DC, DCC, DCCC, CM, M. Multiples of 1000 can be represented by sequences of Ms. After this, the system starts to collapse, although there is a system where thousands can be written by putting a bar over the top of the number, and this is probably related to the way Europeans group digits in lots of three.

We can build up numbers by writing the above patterns for the individual decimal digits, starting with the most significant digit. For example, 398 is CCCXCVIII.
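The “coins and change” description of Roman numerals can be sketched in Java as follows. This is only an illustration: the class name RomanNumerals and the method toRoman are invented for this example, and the sketch handles plain sequences of Ms for thousands rather than the overbar notation.

```java
class RomanNumerals {
    // Values and symbols in decreasing order, including the "change" pairs
    // (CM, CD, XC, XL, IX, IV) used for digits corresponding to 4 and 9.
    private static final int[] VALUES =
        { 1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1 };
    private static final String[] SYMBOLS =
        { "M", "CM", "D", "CD", "C", "XC", "L", "XL", "X", "IX", "V", "IV", "I" };

    // Repeatedly take the largest value that still fits, like paying with
    // the fewest possible coins.
    static String toRoman(int n) {
        if (n <= 0)
            throw new IllegalArgumentException("Roman numerals need a positive number");
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < VALUES.length; i++) {
            while (n >= VALUES[i]) {
                result.append(SYMBOLS[i]);
                n -= VALUES[i];
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(toRoman(632)); // DCXXXII
        System.out.println(toRoman(398)); // CCCXCVIII
    }
}
```

Because the subtractive pairs are listed as if they were coins of their own, the loop never needs to look ahead at the next digit.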



Exercise Write the base ten numbers 1024 and 1995 in Roman numerals.

§1.3 Computer Based Numbers

Computers are built of electronic circuitry. Information is represented by currents, voltage differences, or (for permanent storage) orientation of magnetic material. A single electronic circuit or piece of magnetic material is often regarded as being in one of two states, so can be used to represent a binary digit. Thus numbers are usually represented within a computer in base 2.

It is of course possible to represent an integer in any base. A digit sequence $a_{n-1} a_{n-2} \ldots a_2 a_1 a_0$ represents the integer

$$a_{n-1} \cdot \text{base}^{n-1} + a_{n-2} \cdot \text{base}^{n-2} + \cdots + a_2 \cdot \text{base}^2 + a_1 \cdot \text{base}^1 + a_0 \cdot \text{base}^0.$$

Although data is stored within the computer in binary, it is often convenient to display the information in hexadecimal (base 16), because it takes up less space on the page, and is easier to remember. Displaying a binary number in hexadecimal really just amounts to grouping the bits together in lots of 4, and using a single symbol for the four bits. We use the symbols 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f to represent the hexadecimal digits.

n (decimal)   n (hex)   n (binary)   $2^n$ (hex)   $2^n$ (decimal)   $2^{-n}$ (decimal)
0             0         0000         1             1                 1
1             1         0001         2             2                 0.5
2             2         0010         4             4                 0.25
3             3         0011         8             8                 0.125
4             4         0100         10            16                0.0625
5             5         0101         20            32                0.03125
6             6         0110         40            64                0.015625
7             7         0111         80            128               0.0078125
8             8         1000         100           256               0.00390625
9             9         1001         200           512               0.001953125
10            a         1010         400           1024              0.0009765625
11            b         1011         800           2048              0.00048828125
12            c         1100         1000          4096              0.000244140625
13            d         1101         2000          8192              0.0001220703125
14            e         1110         4000          16384             0.00006103515625
15            f         1111         8000          32768             0.000030517578125


2. Base conversion

§2.1 Conversion from internal form into a base

To convert an integer into a digit string in a base, we repeatedly divide the number by the base. The digits correspond to the remainders of these divisions, and are generated from right to left.

    // Convert a digit value (0 to 35) into its character: '0'-'9', then 'a'-'z'.
    static char toDigit( int n ) {
        if ( n < 10 )
            return ( char ) ( n + '0' );
        else
            return ( char ) ( n + 'a' - 10 );
    }

    // Convert n into a digit string in the given base.
    static String toBase( int n, int base ) {
        if ( n == 0 )
            return "0";
        else if ( n < 0 )
            return "-" + toBase( - n, base );
        else {
            String result = "";
            while ( n > 0 ) {
                // Each remainder is the next digit, generated right to left.
                result = toDigit( n % base ) + result;
                n = n / base;
            }
            return result;
        }
    }

For example, to convert the decimal number 13 into binary, we get

    13 / 2 = quotient 6 remainder 1
     6 / 2 = quotient 3 remainder 0
     3 / 2 = quotient 1 remainder 1
     1 / 2 = quotient 0 remainder 1

So writing the remainders from right to left, we get 1101.

§2.2 Conversion from a base into internal form

To convert from a digit string in a base to an integer is like evaluating a polynomial. We just process the digits from left to right, multiply what we have so far by the base, and add the next digit.

    // Convert a digit character into its value.
    static int fromDigit( char c ) {
        if ( '0' <= c && c <= '9' )
            return c - '0';
        else if ( 'a' <= c && c <= 'z' )    // Allow up to base 36
            return c - 'a' + 10;
        else if ( 'A' <= c && c <= 'Z' )
            return c - 'A' + 10;
        else
            throw new Error( "Invalid digit" );
    }

    // Convert a digit string in the given base into an integer.
    static int fromBase( String s, int base ) {
        int sign = 1;
        if ( s.charAt( 0 ) == '-' ) {
            sign = -1;
            s = s.substring( 1 );
        }
        int result = 0;
        for ( int i = 0; i < s.length(); i++ )
            result = result * base + fromDigit( s.charAt( i ) );
        return sign * result;
    }

For example, to convert the binary representation 1101 into decimal, we get

    0 * 2 + 1 = 1
    1 * 2 + 1 = 3
    3 * 2 + 0 = 6
    6 * 2 + 1 = 13

giving the number 13 (in decimal).

The arithmetic in the above computations can be performed in any base you like. Obviously people prefer to perform the arithmetic in decimal, because we remember our decimal tables. If the arithmetic is performed by a computer, it could be considered to be performed in binary.

Numbers can be represented as text strings, in other words, as a string of ASCII characters. Numbers also have a representation in “internal form”, which on modern computers happens to be a sequence of bits, typically stored in 4 or 8 bytes. Since each byte is composed of 8 bits, an integer is typically stored in 32 or 64 bits.

However, note that the representation of a number in binary as a text string is completely different from its representation as bits internally. In a textual representation, the number of bytes needed is equal to the number of digits. Each digit is represented by the ASCII character ‘0’ or ‘1’, which are themselves encoded in “internal form”, as bit patterns 00110000 and 00110001, that can be interpreted as numbers (hexadecimal 30 and 31, decimal 48 and 49).
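The difference between the textual and internal forms can be seen with a small Java sketch. The class name TextVersusInternal and the helper asciiHex are invented for this illustration; Integer.toString and Integer.parseInt are the standard library counterparts of the conversions in §2.1 and §2.2.

```java
class TextVersusInternal {
    // Hex dump of the bytes used to store a digit string as ASCII text.
    static String asciiHex(String digits) {
        StringBuilder sb = new StringBuilder();
        for (byte b : digits.getBytes(java.nio.charset.StandardCharsets.US_ASCII)) {
            if (sb.length() > 0)
                sb.append(' ');
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Internal form to text, and text back to internal form.
        System.out.println(Integer.toString(13, 2));     // 1101
        System.out.println(Integer.parseInt("1101", 2)); // 13
        // The text "1101" occupies four bytes, one ASCII character per digit.
        System.out.println(asciiHex("1101"));            // 31 31 30 31
    }
}
```

The four text bytes 31 31 30 31 bear no resemblance to the internal value 13, which fits in the low four bits of a single byte.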

§2.3 Conversion between binary and hexadecimal

Conversion between binary and hexadecimal is almost trivial, since it amounts to grouping or ungrouping bits into sets of four. For example, using our conversion table, hexadecimal 3fc is 0011 1111 1100.

We often need to multiply or divide by 2. If the number is in binary, this amounts to shifting the digits left or right by one place. If we want to represent our numbers in hexadecimal, we can decode the hexadecimal digits into binary, perform the shift, then encode them again. For example, to compute 3fc / 2, we decode 3fc as 0011 1111 1100, shift the bits right by 1, to get 0001 1111 1110, and encode it again to get 1fe.

In Java, multiplication by 2 corresponds to the << operator, and division by 2 corresponds to the >> operator (for signed values) or >>> (for unsigned values).

Exercise
Convert 1013 decimal into binary, and hexadecimal.
Convert 1f3 hexadecimal into binary and decimal.
Consider the signed byte x with hexadecimal value 0xc4. What is the hexadecimal value of x >> 1?


3. Performing Integer Arithmetic in a Base

Suppose we represent an integer in a base as an array of digits, with the coefficient of $\text{base}^i$ stored in the i-th element. Suppose we know our addition, subtraction and multiplication tables for individual digits.

§3.1 Addition

We need a notion of a “carry” of 0 or 1, initially 0, representing the excess from the computations of the previous digits. Process the digits from “right” (least significant digit) to “left” (most significant digit).
• Add the digits and carry.
• If the sum is >= base, subtract the base from the sum, and set the carry to 1.
• The sum is the new digit.

    static int[] add( int[] a, int[] b ) {
        int[] result = new int[ MAXDIGIT ];
        int carry = 0;
        // Work from the least significant digit (index 0) upwards.
        for ( int i = 0; i < MAXDIGIT; i++ ) {
            int sum = a[ i ] + b[ i ] + carry;
            if ( sum >= base ) {
                carry = 1;
                sum -= base;
            }
            else {
                carry = 0;
            }
            result[ i ] = sum;
        }
        return result;    // the final carry out is discarded
    }

Example
In decimal

              3 5 2 7 2
            + 0 2 3 6 1
    carry   0 0 0 1 0 0
    sum       3 7 6 3 3

If we have a limited number of digits we can represent, we can lose the top digit (the carry out from the addition of the most significant digits). Computers have a limited amount of memory, so computer based arithmetic is not quite the same as natural arithmetic for large numbers. All numbers are essentially trimmed modulo $\text{base}^n$, where n is the number of digits stored.

Example
With 5 decimal digits, we have

              3 5 2 7 2
            + 8 2 3 6 1
    carry   1 0 0 1 0 0
    sum       1 7 6 3 3

With 5 binary digits, we have

              1 0 1 0 1
            + 0 1 1 0 1
    carry   1 1 1 0 1 0
    sum       0 0 0 1 0
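The trimming of a sum to a fixed number of digits can be cross-checked with ordinary Java arithmetic. The following is a minimal sketch, assuming non-negative operands small enough to fit in a long; the names TrimmedAddition and addTrimmed are invented for this illustration.

```java
class TrimmedAddition {
    // Add two non-negative values and keep only the low `digits` digits
    // in the given base, i.e. reduce the sum modulo base^digits.
    static long addTrimmed(long a, long b, int base, int digits) {
        long modulus = 1;
        for (int i = 0; i < digits; i++)
            modulus *= base;
        return (a + b) % modulus;
    }

    public static void main(String[] args) {
        // 35272 + 82361 = 117633, trimmed to 5 decimal digits.
        System.out.println(addTrimmed(35272, 82361, 10, 5));   // 17633
        // 10101 + 01101 = 100010, trimmed to 5 binary digits.
        System.out.println(Long.toBinaryString(
            addTrimmed(0b10101, 0b01101, 2, 5)));              // 10
    }
}
```

Long.toBinaryString does not print leading zeros, so the binary result 00010 appears as 10.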

Exercise What is the sum of the binary numbers 0110 1010 + 0110 1100

What is the sum of the binary numbers 0110 1010 + 1110 1100

Assume you are limited to 8 binary digits.

§3.2 Subtraction

Subtraction is similar, except we have a “borrow” instead of the carry. We need a notion of a “borrow” of 0 or 1, initially 0, representing the deficit from the computations of the previous digits. Process the digits from “right” (least significant digit) to “left” (most significant digit).
• Subtract the digits and borrow.
• If the difference is < 0, add the base to the difference, and set the borrow to 1.
• The difference is the new digit.

    static int[] subtract( int[] a, int[] b ) {
        int[] result = new int[ MAXDIGIT ];
        int borrow = 0;
        // Work from the least significant digit (index 0) upwards.
        for ( int i = 0; i < MAXDIGIT; i++ ) {
            int diff = a[ i ] - b[ i ] - borrow;
            if ( diff < 0 ) {
                borrow = 1;
                diff += base;
            }
            else {
                borrow = 0;
            }
            result[ i ] = diff;
        }
        return result;    // the final borrow out is discarded
    }

Example
In decimal

               3 5 2 7 2
             - 0 2 3 6 1
    borrow   0 0 1 0 0 0
    diff       3 2 9 1 1

If we have a limited number of digits we can represent, we can lose the top digit (the borrow out from the subtraction of the most significant digits), and the number appears as a positive number (as if the absolute value of the result was subtracted from 100000). This is a base 10 equivalent to the 2’s complement representation of signed numbers.

Example
With 5 decimal digits, we have

               3 5 2 7 2
             - 8 2 3 6 1
    borrow   1 0 1 0 0 0
    diff       5 2 9 1 1

For example, if we mindlessly subtract 1 from 0 in base 10 using the above algorithm, we get 99999.

Example
With 5 binary digits, we have

               1 0 0 1 1
             - 1 1 0 1 0
    borrow   1 1 0 0 0 0
    diff       1 1 0 0 1
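The wrap-around of a negative difference can be cross-checked in Java with Math.floorMod, which always returns a non-negative result for a positive modulus. This is a sketch only; the names TrimmedSubtraction and subtractTrimmed are invented for this illustration.

```java
class TrimmedSubtraction {
    // Subtract b from a and reduce the result modulo base^digits, so that a
    // negative difference wraps round to a positive value, as it does when
    // the final borrow is discarded.
    static long subtractTrimmed(long a, long b, int base, int digits) {
        long modulus = 1;
        for (int i = 0; i < digits; i++)
            modulus *= base;
        return Math.floorMod(a - b, modulus);
    }

    public static void main(String[] args) {
        System.out.println(subtractTrimmed(35272, 82361, 10, 5)); // 52911
        System.out.println(subtractTrimmed(0, 1, 10, 5));         // 99999
        System.out.println(Long.toBinaryString(
            subtractTrimmed(0b10011, 0b11010, 2, 5)));            // 11001
    }
}
```

The % operator would not do here, because in Java it returns a negative remainder when the left operand is negative.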

Exercise What is the difference of the binary numbers 1110 1010 - 0110 0100

What is the difference of the binary numbers 0110 1010 - 1110 1100

Assume you are limited to 8 bits.

§3.3 Comparison

We often want to know when one number is <, ==, or > another. This is just a matter of scanning and comparing the digits from left to right.

    // Returns a negative, zero or positive value for a < b, a == b, a > b.
    static int compareUnsigned( int[] a, int[] b ) {
        for ( int i = MAXDIGIT - 1; i >= 0; --i )
            if ( a[ i ] != b[ i ] )
                return a[ i ] - b[ i ];
        return 0;
    }

§3.4 Shifting left and right

When performing multiplication and division in a base, it is often necessary to shift the digits left or right by one digit. There is a fill digit to insert, usually 0, and a digit is lost. Shifting left amounts to multiplying by the base, shifting right to dividing by the base.

    // Shift towards the most significant end; the top digit is lost.
    static int[] shiftLeft( int[] a, int fill ) {
        int[] result = new int[ MAXDIGIT ];
        for ( int i = MAXDIGIT - 1; i > 0; --i ) {
            result[ i ] = a[ i - 1 ];
        }
        result[ 0 ] = fill;
        return result;
    }

    // Shift towards the least significant end; the bottom digit is lost.
    static int[] shiftRight( int[] a, int fill ) {
        int[] result = new int[ MAXDIGIT ];
        for ( int i = 0; i < MAXDIGIT - 1; i++ ) {
            result[ i ] = a[ i + 1 ];
        }
        result[ MAXDIGIT - 1 ] = fill;
        return result;
    }

Example
In decimal

                  2 3 4 6 8
    shift left    3 4 6 8 0
    shift right   0 2 3 4 6

Exercise What is the value of 1110 1010

if we shift it left or right by 1 digit, and lose the digit shifted off the end?

§3.5 Multiplication

When doing “long” multiplication, so long as the base is small, it is often best to generate a “times table” representing multiplication of the first operand by a digit.

    // multTable[ i ] holds a * i, built up by repeated addition.
    static int[][] timesTable( int[] a ) {
        int[][] multTable = new int[ base ][];
        multTable[ 0 ] = zero();
        for ( int i = 1; i < base; i++ )
            multTable[ i ] = add( multTable[ i - 1 ], a );
        return multTable;
    }

For example, a times table for 13643, assuming 5 decimal digits, is

    Times table

    Digit   13643 * digit   Trimmed result
    0       0               00000
    1       13643           13643
    2       27286           27286
    3       40929           40929
    4       54572           54572
    5       68215           68215
    6       81858           81858
    7       95501           95501
    8       109144          09144
    9       122787          22787

We can then perform the multiplication by going through the digits of the second operand, adding in the product of first operand by the digit, with a suitable shift in the digits. Base 2 is so simple we do not need a table. When performing it on paper, we usually perform the algorithm right to left, but it is slightly more convenient to perform it left to right, if our shift algorithm only shifts a digit at a time.



    static int[] multiply( int[] a, int[] b ) {
        int[] product = new int[ MAXDIGIT ];
        int[][] multTable = timesTable( a );
        // Process the digits of b from most significant to least significant.
        for ( int i = MAXDIGIT - 1; i >= 0; --i ) {
            product = shiftLeft( product, 0 );
            product = add( product, multTable[ b[ i ] ] );
        }
        return product;
    }

If multiplying n-digit numbers, we get a 2n-digit result. On a computer, we lose the top n digits.

Example
13643 * 59226 in 5 digit decimal arithmetic is

    Digit   13643 * digit   Previous result * 10   New result
    5       68215           00000                  68215
    9       22787           82150                  04937
    2       27286           49370                  76656
    2       27286           66560                  93846
    6       81858           38460                  20318

Example
1010 * 1110 in 4 digit binary arithmetic is

    Digit   1010 * digit   Previous result * 10   New result
    1       1010           0000                   1010
    1       1010           0100                   1110
    1       1010           1100                   0110
    0       0000           1100                   1100

Exercise What is the product of the binary numbers 0110 1010 * 0110 1100

Assume the answer is trimmed to 8 bits.

§3.6 Division

“Long” division can be done in a similar manner. Suppose we wish to divide one number (the dividend) by another (the divisor) to get a quotient and residue. We process the digits of the dividend from left to right. Each time through the loop, we shift the next digit of the dividend into the right of the residue, and divide the residue by the divisor to get the next digit of the quotient, and a new residue, which is less than the divisor.

Dividing the residue by the divisor can be done by table lookup or successive subtraction. However, there are problems with generating the table, because the product may be too large to represent.

    // Divide residue by divisor by successive subtraction.
    // Returns the quotient digit; residue is updated in place.
    static int divideDigitUnsigned( int[] residue, int[] divisor ) {
        int digit = 0;
        while ( compareUnsigned( residue, divisor ) >= 0 ) {
            copy( subtract( residue, divisor ), residue );
            digit++;
        }
        return digit;
    }

    static int[] divideUnsigned( int[] dividend, int[] divisor ) {
        int[] residue = new int[ MAXDIGIT ];
        int[] quotient = new int[ MAXDIGIT ];
        if ( isZero( divisor ) )
            throw new Error( "Divide by zero" );
        // Process the digits of the dividend from most significant to least significant.
        for ( int i = MAXDIGIT - 1; i >= 0; --i ) {
            residue = shiftLeft( residue, dividend[ i ] );
            int digit = divideDigitUnsigned( residue, divisor );
            quotient = shiftLeft( quotient, digit );
        }
        return quotient;
    }
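The routines in this chapter use MAXDIGIT, base, zero(), isZero() and copy() without showing their definitions. The following is one possible sketch of them; the values chosen for MAXDIGIT and base, the class name DigitHelpers, and the extra helper fromInt are assumptions made for this illustration, and the argument order of copy is inferred from the call copy( subtract( residue, divisor ), residue ).

```java
class DigitHelpers {
    // Assumed settings: five decimal digits, as in the worked examples.
    static final int MAXDIGIT = 5;
    static final int base = 10;

    // A number whose digits are all zero.
    static int[] zero() {
        return new int[MAXDIGIT]; // Java initialises int arrays to 0
    }

    // True if every digit of a is zero.
    static boolean isZero(int[] a) {
        for (int digit : a)
            if (digit != 0)
                return false;
        return true;
    }

    // Copy the digits of src into dest, overwriting dest in place.
    static void copy(int[] src, int[] dest) {
        System.arraycopy(src, 0, dest, 0, MAXDIGIT);
    }

    // Hypothetical helper: build a digit array from a non-negative int,
    // with the least significant digit in element 0.
    static int[] fromInt(int n) {
        int[] digits = zero();
        for (int i = 0; i < MAXDIGIT && n > 0; i++) {
            digits[i] = n % base;
            n /= base;
        }
        return digits;
    }

    public static void main(String[] args) {
        int[] a = fromInt(52834);
        int[] b = zero();
        copy(a, b);
        System.out.println(java.util.Arrays.toString(b)); // [4, 3, 8, 2, 5]
        System.out.println(isZero(zero()));               // true
    }
}
```

Because copy overwrites its second argument in place, divideDigitUnsigned can update the caller's residue array without returning it.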

Example
In decimal 52834/12 = 04402, with residue 10.

    Dividend digit   Residue   Residue*10 + dividend[i]   Quotient digit
    5                0         5                          0
    2                5         52                         4
    8                4         48                         4
    3                0         3                          0
    4                3         34                         2
                     10

Example
In binary 1100 1010/101 = 0010 1000, with residue 10.

    Dividend digit   Residue   Residue*10 + dividend[i]   Quotient digit
    1                0         1                          0
    1                1         11                         0
    0                11        110                        1
    0                1         10                         0
    1                10        101                        1
    0                0         0                          0
    1                0         1                          0
    0                1         10                         0
                     10

Exercise What is the quotient and residue of the binary division 1010 1100 / 0000 0011


4. Representation of Unsigned and Signed Integers

All computer data ends up being stored as bits. Computer memory is general purpose, and can be used to store any information - instructions, characters, unsigned and signed integers, floating point numbers, etc. The bits can be interpreted in whatever way we wish. Because we have a limited number of bits, there is a limit on the range of different values we can represent in a given sized memory. For example, if we have n bits, we can store at most $2^n$ different values.

§4.1 Unsigned representation

Given 8 bits, we can interpret the bit patterns 00000000, 00000001, 00000010, 00000011, ... 11111110, 11111111 as unsigned integers 0, 1, 2, 3, ... 254, 255 (decimal). With this interpretation, we cannot represent integers outside the range 0 ... 255. We can have a similar representation if we have more bits.

    Unsigned Representation

    Bit pattern   Value
    00000000         0
    00000001        +1
    00000010        +2
    00000011        +3
    00000100        +4
    ...
    01111100      +124
    01111101      +125
    01111110      +126
    01111111      +127
    10000000      +128
    10000001      +129
    10000010      +130
    10000011      +131
    10000100      +132
    ...
    11111100      +252
    11111101      +253
    11111110      +254
    11111111      +255

Suppose we have n bits. We can represent numbers between 0 and $2^n - 1$. For example if n is 8, we represent numbers between 0 and 255. For example, decimal 30 is 16 + 8 + 4 + 2, and so is represented as the bit pattern 00011110.

Note that Java does not permit the programmer to declare variables as unsigned integers. To a large extent this does not matter, because +, -, and * produce the same bit pattern no matter whether we interpret the values as unsigned or two’s complement numbers. But unsigned division is different for very large unsigned numbers. Unsigned division by a power of 2 can be performed by using >>>. Comparisons (<, >, etc) are also different. We can get the equivalent of an unsigned comparison by subtracting 1 << (n-1) from both of the values to be compared, and then comparing the results as signed values.
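The comparison trick can be sketched for 32 bit int values as follows. The class name UnsignedDemo and the method lessThanUnsigned are invented for this illustration. The sketch also calls Integer.compareUnsigned, Integer.divideUnsigned and Integer.toUnsignedString, which were added to the standard library in Java 8, after these notes were written.

```java
class UnsignedDemo {
    // Unsigned "less than" using the bias trick: subtracting 1 << 31 flips
    // the top bit of each operand, so a signed comparison of the results
    // gives the unsigned ordering of the originals.
    static boolean lessThanUnsigned(int a, int b) {
        return (a - (1 << 31)) < (b - (1 << 31));
    }

    public static void main(String[] args) {
        int big = 0xfffffff0;  // 4294967280 if read as unsigned, -16 as signed
        int small = 16;

        System.out.println(big < small);                             // true (signed)
        System.out.println(lessThanUnsigned(big, small));            // false (unsigned)
        System.out.println(Integer.compareUnsigned(big, small) < 0); // false

        System.out.println(big / 4);                                 // -4 (signed)
        System.out.println(Integer.toUnsignedString(big >>> 2));     // 1073741820
        System.out.println(Integer.toUnsignedString(
            Integer.divideUnsigned(big, 4)));                        // 1073741820
    }
}
```

The subtraction in lessThanUnsigned overflows for some inputs, but the wrapped result still has the right ordering, because only the top bit changes.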



Exercise What is the bit pattern for decimal 42 as an 8 bit unsigned number? Write down bit patterns of the minimum and maximum values for 32 bit unsigned values as unsigned hexadecimal numbers.

§4.2 Sign/magnitude representation

By changing our interpretation, we can interpret some of these bit patterns as negative numbers. One way is to consider the top bit as representing the sign of the number, while the remaining bits represent the magnitude. There are two representations of zero, +0 and -0. Sign/magnitude representation is often used in the representation of floating point numbers.

    Sign/magnitude Representation

    Bit pattern   Value
    00000000        +0
    00000001        +1
    00000010        +2
    00000011        +3
    00000100        +4
    ...
    01111100      +124
    01111101      +125
    01111110      +126
    01111111      +127
    10000000        -0
    10000001        -1
    10000010        -2
    10000011        -3
    10000100        -4
    ...
    11111100      -124
    11111101      -125
    11111110      -126
    11111111      -127

Suppose we have n bits. We can represent numbers between $-(2^{n-1} - 1)$ and $+(2^{n-1} - 1)$, with two representations of 0. For example if n is 8, we represent numbers between -127 and +127.
• A positive number x is represented as itself.
• +0 is represented as 0.
• -0 is represented as 1 << (n-1) (i.e., $2^{n-1}$).
• A negative number -x is represented as (1 << (n-1)) + x (i.e., $2^{n-1} + x$).

For example, decimal -30 is represented as the bit pattern 10000000 + 00011110 = 10011110.

Exercise
What is the bit pattern for decimal 42 and -42 as 8 bit sign/magnitude numbers?
Write down bit patterns of the minimum and maximum values for 32 bit sign/magnitude values as unsigned hexadecimal numbers.
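The sign/magnitude rules of §4.2 can be sketched for 8 bits in Java as follows. The class SignMagnitude and its methods encode, decode and toBits8 are invented for this illustration; bit patterns are held in the low 8 bits of an int.

```java
class SignMagnitude {
    // Encode v (in the range -127..127) as an 8 bit sign/magnitude pattern.
    static int encode(int v) {
        if (v < -127 || v > 127)
            throw new IllegalArgumentException("out of range");
        return v < 0 ? (1 << 7) + (-v) : v;
    }

    // Decode an 8 bit sign/magnitude pattern; both +0 and -0 decode to 0.
    static int decode(int bits) {
        int magnitude = bits & 0x7f;
        return (bits & 0x80) != 0 ? -magnitude : magnitude;
    }

    // Format the low 8 bits as a binary string with leading zeros.
    static String toBits8(int bits) {
        String s = Integer.toBinaryString(bits & 0xff);
        return "00000000".substring(s.length()) + s;
    }

    public static void main(String[] args) {
        System.out.println(toBits8(encode(30)));  // 00011110
        System.out.println(toBits8(encode(-30))); // 10011110
        System.out.println(decode(0b10000000));   // 0
    }
}
```

Note the parentheses in (1 << 7) + (-v): in Java, + binds more tightly than <<, so 1 << 7 + x would shift by 7 + x places.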

§4.3 Excess (biased) representation

We can represent a signed number by adding the absolute value of the most negative number to the value, to get an unsigned number. It has the advantage that the ordering is the same as for unsigned numbers, so unsigned comparisons work. Excess representation is often used in the representation of the exponent for floating point numbers.

    Excess Representation

    Bit pattern   Value
    00000000      -128
    00000001      -127
    00000010      -126
    00000011      -125
    00000100      -124
    ...
    01111100        -4
    01111101        -3
    01111110        -2
    01111111        -1
    10000000         0
    10000001        +1
    10000010        +2
    10000011        +3
    10000100        +4
    ...
    11111100      +124
    11111101      +125
    11111110      +126
    11111111      +127

Usually for n bit numbers, the value $2^{n-1}$ is added to the number to generate an unsigned number, and our number is excess $2^{n-1}$.

Suppose we have n bits. Suppose numbers are represented in excess $2^{n-1}$. We can represent numbers between $-2^{n-1}$ and $+(2^{n-1} - 1)$. For example if n is 8, we represent numbers between -128 and +127.
• A positive number x is represented as (1 << (n-1)) + x (i.e. $2^{n-1} + x$).
• 0 is represented as 1 << (n-1) (i.e. $2^{n-1}$).
• A negative number -x is represented as ((1 << (n-1)) - 1) - x + 1 (i.e. $(2^{n-1} - 1) - x + 1$).

Now (1 << (n-1)) - 1 or $2^{n-1} - 1$ is the bit pattern 0111...11, and it is easy to subtract a number from this value in binary, because it is just a matter of interchanging 0’s and 1’s (except for the most significant bit, which should be left as 0). In fact the excess representation is the same as the two’s complement representation, but with the opposite value for the top digit.

For example 30 is represented as 10000000 + 00011110 = 10011110, and -30 is represented as 01111111 - 00011110 + 1 = 01100001 + 1 = 01100010.

Exercise
What is the bit pattern for decimal 42 and -42 as 8 bit excess-128 numbers?
Write down bit patterns of the minimum and maximum values for 32 bit excess representation values as unsigned hexadecimal numbers.
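The excess-128 rules of §4.3 reduce to adding and subtracting the bias, as the following Java sketch shows. The class Excess128 and its methods encode, decode and toBits8 are invented for this illustration.

```java
class Excess128 {
    // Encode v (in the range -128..127) by adding the bias 2^7 = 128.
    static int encode(int v) {
        if (v < -128 || v > 127)
            throw new IllegalArgumentException("out of range");
        return v + (1 << 7);
    }

    // Decode an 8 bit excess-128 pattern by subtracting the bias.
    static int decode(int bits) {
        return (bits & 0xff) - (1 << 7);
    }

    // Format the low 8 bits as a binary string with leading zeros.
    static String toBits8(int bits) {
        String s = Integer.toBinaryString(bits & 0xff);
        return "00000000".substring(s.length()) + s;
    }

    public static void main(String[] args) {
        System.out.println(toBits8(encode(30)));  // 10011110
        System.out.println(toBits8(encode(-30))); // 01100010
        // Two's complement pattern of -30 with the top bit flipped.
        System.out.println(toBits8((-30 & 0xff) ^ 0x80)); // 01100010
    }
}
```

The last line illustrates the remark that excess representation is two’s complement with the opposite top bit: Java ints are two’s complement, so -30 & 0xff is the 8 bit two’s complement pattern.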

§4.4 Two’s complement

We can represent a signed number as the value modulo $2^n$. In other words, non-negative numbers are represented as themselves, and negative numbers as the value + $2^n$. Two’s complement representation is used to represent signed integers on most machines. It has the advantage that addition, subtraction, and multiplication are the same for both unsigned and two’s complement numbers, because the mapping $a \to a \bmod 2^n$ for integers preserves these operations. Also, the representation of the non-negative numbers that can be represented is the same for both unsigned and two’s complement. The only difference between the representation of two’s complement and unsigned representation is the choice of inverse mapping.

    Two’s Complement Representation

    Bit pattern   Value
    00000000         0
    00000001        +1
    00000010        +2
    00000011        +3
    00000100        +4
    ...
    01111100      +124
    01111101      +125
    01111110      +126
    01111111      +127
    10000000      -128
    10000001      -127
    10000010      -126
    10000011      -125
    10000100      -124
    ...
    11111100        -4
    11111101        -3
    11111110        -2
    11111111        -1

To add, subtract, or multiply two’s complement numbers, we just perform the operation as if the numbers are unsigned, and throw away any additional bits generated. To perform negation, form the one’s complement (subtract from all ones, the representation of -1), then add 1.

However, signed and unsigned division are different. Dividing by $2^n$ can be done using the >> operator for signed numbers, and >>> operator for unsigned numbers, to shift the value right by n bits. For >>, the sign bit is used to fill the space, while for >>>, 0 is used to fill the space. (For negative values, >> rounds towards negative infinity, whereas Java’s / operator rounds towards zero.)

Suppose we have n bits. We can represent numbers between $-2^{n-1}$ and $+(2^{n-1} - 1)$. For example if n is 8, we represent numbers between -128 and +127.
• A positive number x is represented as itself, x.
• 0 is represented as itself, 0.
• A negative number -x is represented as ((1 << n) - 1) - x + 1 (i.e. $(2^n - 1) - x + 1$, or ~x + 1).

Now (1 << n) - 1 or $2^n - 1$ is the bit pattern 1111...11, and it is easy to subtract a number from this value in binary, because it is just a matter of interchanging 0’s and 1’s (what is called the 1’s complement). For example -30 is represented as 11111111 - 00011110 + 1 = 11100001 + 1 = 11100010.

Exercise
What is the bit pattern for decimal 42 and -42 as 8 bit two’s complement numbers?
What is the signed decimal value represented by the 8 bit two’s complement bit pattern 10000101?
Write down bit patterns of the minimum and maximum values for 32 bit two’s complement values as unsigned hexadecimal numbers.

Exercise
Suppose we are performing 8 bit two’s complement arithmetic, involving variables with bit patterns x = 1100 1000, y = 1100 1011, z = 1110 1100.
What is the 8-bit bit pattern corresponding to “-x” (the negation of x)?
What is the 8-bit bit pattern corresponding to y + z?
What is the 8-bit bit pattern corresponding to y - z?
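The negation rule of §4.4 (form the one’s complement, then add 1) can be sketched for 8 bits in Java as follows. The class TwosComplement8 and its methods negate, toSigned and toBits8 are invented for this illustration; bit patterns are held in the low 8 bits of an int.

```java
class TwosComplement8 {
    // Negate an 8 bit two's complement pattern: invert the bits, add 1,
    // and throw away anything above the low 8 bits.
    static int negate(int bits) {
        return (~bits + 1) & 0xff;
    }

    // Interpret the low 8 bits as a signed value. The cast to byte keeps
    // the low 8 bits, and widening back to int sign extends them.
    static int toSigned(int bits) {
        return (byte) bits;
    }

    // Format the low 8 bits as a binary string with leading zeros.
    static String toBits8(int bits) {
        String s = Integer.toBinaryString(bits & 0xff);
        return "00000000".substring(s.length()) + s;
    }

    public static void main(String[] args) {
        System.out.println(toBits8(negate(0b00011110))); // 11100010
        System.out.println(toSigned(0b11100010));        // -30
        // -128 has no positive counterpart in 8 bits, so it negates to itself.
        System.out.println(toBits8(negate(0b10000000))); // 10000000
    }
}
```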

§4.5 Zero and Sign extension

Suppose we want to extend a value by adding more digits. This is often done when considering a character as an integer, etc.

Unsigned numbers are “zero extended” by filling the additional space on the left with 0. For example, when extending an 8 bit unsigned value to 16 bits we get something like

    1010 0110 -> 0000 0000 1010 0110

Two’s complement numbers are “sign extended” by filling the additional space on the left with the sign bit. For example, when extending an 8 bit two’s complement value to 16 bits we get something like

    0010 0110 -> 0000 0000 0010 0110
    1010 0110 -> 1111 1111 1010 0110

(In general, small magnitude negative numbers start with a long run of 1’s.)

Exercise
Zero extend and sign extend the 8 bit binary number represented by the hexadecimal number 0xe4, to 16 bits, and write the answer back in hexadecimal.
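In Java, the two kinds of extension described in this section can be sketched as follows, using the bit pattern 1010 0110 (0xa6) from the text. The class and method names are illustrative assumptions.

```java
class ExtendDemo {
    // Zero extension: mask off everything except the low 8 bits.
    static int zeroExtend(byte b) {
        return b & 0xff;
    }

    // Sign extension: Java's widening conversion from byte to int
    // copies the sign bit into the additional bits.
    static int signExtend(byte b) {
        return b;
    }

    // Show the low 16 bits of a value as 4 hexadecimal digits.
    static String hex16(int v) {
        return String.format("%04x", v & 0xffff);
    }

    public static void main(String[] args) {
        byte b = (byte) 0xa6;                       // 1010 0110
        System.out.println(hex16(zeroExtend(b)));   // prints 00a6
        System.out.println(hex16(signExtend(b)));   // prints ffa6
    }
}
```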



§4.6 Carry and Overflow

Because we have a limited number of bits, we cannot always represent the result of an arithmetic operation correctly. If the final carry bit of an unsigned addition is 1, we lose the top bit, and the result is incorrect as an unsigned number. For example, if adding 250 + 12 as unsigned numbers, we get

    1111 1010 + 0000 1100 = 0000 0110

which is not correct as an unsigned number. However as two’s complement addition of -6 + 12 = +6, it is correct.

We can also get an incorrect result of a two’s complement operation, where the sign bit gets corrupted by the carry out from the most significant bit of the magnitude. This is called overflow. For addition, this occurs when the carry into the sign bit is different from the carry out. For example, if adding integers 125 + 12 as 8 bit two’s complement numbers, we get

    0111 1101 + 0000 1100 = 1000 1001

which is not correct as a signed number, and represents a negative two’s complement number (-119). However, as unsigned addition it is correct.

Exercise
Which of the following additions generate a carry or overflow?

    0011 1010 + 0101 1100
    1111 0101 + 1100 1100
    0111 1111 + 0111 1111
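The carry and overflow rules of this section can be checked with a short Java sketch for 8 bit addition. It is a direct transcription of the rule "carry into the sign bit differs from carry out"; the class and method names are assumptions made for illustration.

```java
class CarryOverflow {
    // Carry out of the top bit of an 8 bit addition.
    static int carryOut(int a, int b) {
        return ((a & 0xff) + (b & 0xff)) >> 8;
    }

    // Carry from the 7 low bits into the sign bit.
    static int carryIntoSign(int a, int b) {
        return ((a & 0x7f) + (b & 0x7f)) >> 7;
    }

    // Unsigned result is wrong when there is a carry out.
    static boolean carry(int a, int b) {
        return carryOut(a, b) == 1;
    }

    // Signed result is wrong when carry in and carry out of the sign bit differ.
    static boolean overflow(int a, int b) {
        return carryIntoSign(a, b) != carryOut(a, b);
    }

    public static void main(String[] args) {
        // 250 + 12 as bit patterns 1111 1010 + 0000 1100
        System.out.println(carry(0xfa, 0x0c) + " " + overflow(0xfa, 0x0c)); // prints true false
        // 125 + 12 as bit patterns 0111 1101 + 0000 1100
        System.out.println(carry(0x7d, 0x0c) + " " + overflow(0x7d, 0x0c)); // prints false true
    }
}
```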

§4.7 Representation of signed numbers in other bases

Most ways of representing signed numbers in binary generalise to other bases.

The sign/magnitude representation has a generalisation, where we represent signed numbers by their absolute value in a base, preceded by a + or - to represent the sign.

The excess representation can be formed by adding (base^n)/2 to the value to be represented to get an unsigned number. For base 10, with 5 digits, this is 50000.

The two’s complement representation has a generalisation, where we represent signed numbers modulo base^n. In other words, add base^n to a negative number to get the unsigned number that represents it. In general, a digit sequence in which the leading digit is >= base/2 represents a negative number. With 5 digits and base 10, -1 comes out as 99999, -2 as 99998, etc. Sign extension of a negative number amounts to filling the additional digits on the left with 9 (and with 0 for a non-negative number). There used to be mechanical calculators that represented signed numbers in this way.

When performing a right shift on a digit sequence representing a signed number (effectively dividing by the base), we have to shift in either 0 or 9 for the top digit, depending on the sign of the number. For a general base, this is

    static int[] shiftRightSigned( int[] a ) {
        if ( a[ MAXDIGIT - 1 ] >= base / 2 )
            return shiftRight( a, base - 1 );
        else
            return shiftRight( a, 0 );
    }

We can compute the base 10 representation of -x by taking the 9’s complement, and adding 1. For a general base, the complement is

    static int[] complement( int[] a ) {
        int[] result = new int[ MAXDIGIT ];
        for ( int i = 0; i < MAXDIGIT; i++ ) {
            result[ i ] = ( base - 1 ) - a[ i ];
        }
        return result;
    }
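Combining the complement with the "add 1" step gives negation. The following is a minimal sketch, assuming 5 digits in base 10 and assuming the digits are stored least significant first (so that a[ MAXDIGIT - 1 ] is the top digit, as in shiftRightSigned above). The negate and show methods and the class name are hypothetical additions.

```java
class TensComplement {
    static final int base = 10;
    static final int MAXDIGIT = 5;

    // Digit-wise complement: 9's complement when base is 10.
    static int[] complement(int[] a) {
        int[] result = new int[MAXDIGIT];
        for (int i = 0; i < MAXDIGIT; i++) {
            result[i] = (base - 1) - a[i];
        }
        return result;
    }

    // Representation of -x: complement, then add 1, discarding any
    // carry out of the top digit.
    static int[] negate(int[] a) {
        int[] result = complement(a);
        int carry = 1;
        for (int i = 0; i < MAXDIGIT && carry != 0; i++) {
            int sum = result[i] + carry;
            result[i] = sum % base;
            carry = sum / base;
        }
        return result;
    }

    // Write the digits most significant first.
    static String show(int[] a) {
        StringBuilder sb = new StringBuilder();
        for (int i = MAXDIGIT - 1; i >= 0; i--) {
            sb.append(a[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] two = { 2, 0, 0, 0, 0 };          // 00002, least significant digit first
        System.out.println(show(negate(two)));  // prints 99998
    }
}
```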



5. Bits as Sets, and Extracting Fields

Sometimes bits are used to represent sets. The i-th bit indicates whether i is in the set. The Java operators &, | and ~ can be used to represent intersection, union and complement. For example, suppose we are performing 8 bit two’s complement arithmetic, involving variables with bit patterns x = 0011 1100 (0x3c), y = 1101 1011 (0xdb), then

    ~x    = 1100 0011 (0xc3)
    x & y = 0001 1000 (0x18)
    x | y = 1111 1111 (0xff)

Sometimes data is packed together in a bit pattern (for example, because it is important to decrease the size of the data, because it is to be transmitted over a slow network, or there are large amounts of data). How can we extract the data? We can make use of shift operators.

To extract a field out of a 64 bit value, assuming two’s complement representation, shift the field to the very left, then to the very right. Use >> to sign extend the value,

    value = source << ( 64 - position - size );
    value = value >> ( 64 - size );

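[Figure: extracting a signed field. The field of size bits at position is shifted left by 64 - position - size, so that its sign bit is at the top of the value, then shifted right by 64 - size, leaving the field sign extended.]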



Alternatively, use >>> to extract an unsigned value.

    value = source << ( 64 - position - size );
    value = value >>> ( 64 - size );

[Figure: extracting an unsigned field. The field of size bits at position is shifted left by 64 - position - size, then shifted right by 64 - size, leaving the field zero extended.]

A “mask” is a bit pattern with 1’s for the bits that make up a field, and 0’s elsewhere. We can create a mask of size bits, based at a position, by writing

    mask = ( ( 1L << size ) - 1 ) << position;

We can then clear the field by writing

    source = source & ~ mask;

and assign a new value by writing

    source = source | ( value << position ) & mask;
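The extraction and mask operations of this section can be put together in a small Java sketch. The class and method names are illustrative, and the sketch assumes 0 < size < 64 and position + size <= 64.

```java
class BitField {
    // Extract a field and sign extend it.
    static long extractSigned(long source, int position, int size) {
        long value = source << (64 - position - size);
        return value >> (64 - size);
    }

    // Extract a field and zero extend it.
    static long extractUnsigned(long source, int position, int size) {
        long value = source << (64 - position - size);
        return value >>> (64 - size);
    }

    // Clear the field, then store a new value into it.
    static long insert(long source, long value, int position, int size) {
        long mask = ((1L << size) - 1) << position;
        source = source & ~mask;
        return source | ((value << position) & mask);
    }

    public static void main(String[] args) {
        long word = 0xf50L; // the 4 bit field at position 8 holds 1111
        System.out.println(extractSigned(word, 8, 4));                   // prints -1
        System.out.println(extractUnsigned(word, 8, 4));                 // prints 15
        System.out.println(Long.toHexString(insert(word, 0x3L, 8, 4)));  // prints 350
    }
}
```

The same 4 bits read as -1 or as 15 depending only on which right shift is used.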

Exercise Suppose we are performing 8 bit two’s complement arithmetic, involving variables with bit patterns x = 1100 1000, y = 1100 1011, z = 1110 1100. What is the 8-bit bit pattern corresponding to “~x” (the complement of x)? What is the 8-bit bit pattern corresponding to “x >> 3” (the result of an arithmetic shift right by 3 bits)? What is the 8-bit bit pattern corresponding to y | z? What is the 8-bit bit pattern corresponding to y & z?



6. The ASCII Character Set

The encoding of characters most commonly used in computers for communicating with peripheral devices is the ASCII character code. The ASCII character code is a convention for how bytes are to be interpreted when sent to a terminal, printer, or similar device. (ASCII stands for American Standard Code for Information Interchange.) Other codes have been developed to allow for a much bigger variety of characters, including Arabic and Chinese characters, etc.

Devices often have a table stored in ROM (read only memory) indicating the way characters are to be displayed, and it is this table that means that printing ASCII character 41 (hexadecimal) results in the letter ‘A’ appearing on the screen or printer. They may have several such tables, representing different fonts. The processor itself has no notion of character code - it just manipulates bytes as bit patterns, integers, etc.

ASCII characters from hexadecimal 21 to 7E represent printable characters. The effect of “writing” the character is to change the colour of the dots on the screen or paper indicated by the table stored in ROM.

Other characters are control characters, used to control such things as cursor position (where the text is printed), cause scrolling, indicate the start and end of packets of information transmitted over the network, etc. A simple protocol that made use of these was the IBM BISYNC protocol. Their use is as follows:

Transmission Control Characters used when sending packets of data:

00 NUL (Null): No character, used to fill in time or space, when there is no data.

01 SOH (Start of Heading): Indicates the start of the header of a packet of data.

02 STX (Start of Text): Indicates the end of the header and start of the text of a packet of data.

03 ETX (End of Text): Indicates the end of the text in a packet of data.

04 EOT (End of Transmission): Indicates the end of a complete transmission, composed of a sequence of packets. For example, it is used to indicate end of file in Unix.

05 ENQ (Enquiry): A request for a response from a device (for example, a “Who are you?” request).

06 ACK (Acknowledge): Acknowledgment by the receiving device, indicating that the data was received properly.

Bell Control:

07 BEL (Bell): Used to request that the bell/buzzer be rung on a device.

Format Control Characters used to control the print head on a printer (nowadays, to control the cursor):

08 BS (Backspace): Indicates movement of the printing mechanism or display cursor backwards one position.

09 HT (Horizontal Tab): Indicates movement of the printing mechanism or display cursor forward to the next preassigned “tabulation” or stopping position on the same line.



0A LF (Line Feed): Indicates movement of the printing mechanism or display cursor down to the same column on the next line.

0B VT (Vertical Tab): Indicates movement of the printing mechanism or display cursor down to the next preassigned vertical “tabulation” or stopping position.

0C FF (Form Feed): Indicates movement of the printing mechanism or display cursor to the starting position of the next page, form, or screen.

0D CR (Carriage Return): Indicates movement of the printing mechanism or display cursor backward to the beginning of the current line.

Shift Characters:

0E SO (Shift Out): Indicates the code combinations which follow shall be interpreted as outside the standard character set until a Shift In character is reached.

0F SI (Shift In): Indicates the end of a special character sequence started by a Shift Out character.

Data Link Escape Character:

10 DLE (Data Link Escape): Used to give a different meaning to the following character in a packet (for example, to indicate that the following character is to be treated as an ordinary character, not as an ETX character, etc.).

Device Control:

11 DC1 (Device Control 1):
12 DC2 (Device Control 2):
13 DC3 (Device Control 3):
14 DC4 (Device Control 4): Characters to control the action of a device. For example DC1 is ctrl-Q (request for more data), DC3 is ctrl-S (request for no more data, since the device’s buffer is overflowing, the screen is full, etc.).

Transmission Control Characters used when sending packets of data:

15 NAK (Negative Acknowledgment): Negative Acknowledgment by the receiving device, indicating that the data was not received properly.

16 SYN (Synchronous/Idle): Used as a filler character in a transmission protocol that sends characters continuously to achieve synchronisation.

17 ETB (End of Transmission Block): Indicates the end of a block of data (split over several packets).

18 CAN (Cancel): Used to indicate that the previous information should be cancelled (due to error, etc.).

19 EM (End of Medium): Indicates the physical end of a tape, card, etc.

1A SUB (Substitute): Substituted for a character that is found to be erroneous or invalid.



Escape Character:

1B ESC (Escape): A character intended to provide code extensions in that it gives the following sequence of characters a special meaning. Escape sequences are typically used for additional control of terminals and printers. For example, escape sequences can be used for setting the cursor position to a specific place on the screen, clearing a line, the screen, or part of the screen, inserting or deleting lines, changing font or character size, changing to or from reverse video mode or graphics mode, etc. Some keys send escape sequences when typed. For example, the arrow keys and function keys send escape sequences.

Information Separation Characters:

1C FS (File Separator):
1D GS (Group Separator):
1E RS (Record Separator):
1F US (Unit Separator): Characters used to separate the components of compound data objects.

Space and Delete Characters:

20 SP (Space): Used to print a blank character, or move the cursor forward by one space.

7F DEL (Delete): Used to obliterate characters (for example, on a paper tape, used to punch out every hole on the tape). Now used as a way of deleting a previous character typed.

Textual Characters: The characters with ASCII value 0x21 to 0x7e represent textual characters.



7. Unicode Characters

Computers also need to represent text for Asian and other non-European languages. Not so long ago, every culture developed its own way of coding characters for its own writing system for use on computers. For example, China developed the GB (Guo Biao) character set, Taiwan developed the Big Five character set, Japan developed the JIS (Japanese Industrial Standard) character set, and Korea developed the KS (Korean Standard) character set. Having multiple systems meant the numeric code for the same character was different in different systems.

These were unified into a new universal encoding, called Unicode, that is meant to contain all characters in all modern languages, with a single code for corresponding characters in different languages. So 漢字 has the same coding, no matter whether used in Chinese or Japanese text.

Of course there are lots of characters, but two bytes (16 bits) is sufficient to represent most of the common characters from the world’s languages, and 4 bytes definitely suffices for everything conceivable. The plain encoding of Unicode characters using 2 bytes is called UCS-2 (2 byte Universal Character Set encoding). The plain 4 byte encoding is called UCS-4.

Now a problem is how to combine Unicode with ASCII, and how to store text in such a way that it works for ASCII as well as Unicode. ASCII characters have the same encoding in Unicode, so the only problem is that we need to use 16 bits instead of 7 or 8. But that would mean that all text files would need to be padded out to double the size with an extra zero byte for each ASCII character, and all applications would need to understand the new representation.

There is an ingenious encoding system, called UTF-8 (8 bit UCS Transformation Format), that means that Unicode text can be encoded so that ordinary ASCII text is encoded without modification. UTF-8 can encode up to 31 bit values in such a way that small values take up only a small amount of space.

ASCII characters, with values 0x00-0x7f, are encoded as themselves, with the most significant bit 0.

Values from 0x80 to 0x7fff ffff are encoded using multiple bytes. The bits are broken up into 6-bit groups, from the least significant bit. The groups of up to 6 bits are encoded as a sequence of bytes, with the most significant bits first. Each group except the most significant is prefixed by the bits “10” and generated as a byte. The first byte starts with as many 1’s as there are bytes in the code, followed by a 0.

    Bits    From        To          Encode as
    0-7     0000 0000   0000 007f   0xxxxxxx
    8-11    0000 0080   0000 07ff   110xxxxx 10xxxxxx
    12-16   0000 0800   0000 ffff   1110xxxx 10xxxxxx 10xxxxxx
    17-21   0001 0000   001f ffff   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    22-26   0020 0000   03ff ffff   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    27-31   0400 0000   7fff ffff   1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

It is easy to determine the start of a character, namely anything that does not start with 10. It is easy to determine the number of bytes, by looking at the length of the run of 1’s in the leading byte. This is important when dealing with potentially corrupted data. The data can easily be unpacked, by removing the run of 1’s and the following 0 at the left of each byte, and packing the remaining data into 2 (UCS-2) or 4 (UCS-4) bytes.
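The encoding rule can be sketched in Java as follows. This is an illustration that follows the six-row table in these notes for values up to 31 bits (the 5 and 6 byte forms are not part of current standard UTF-8); the class and method names are assumptions.

```java
class Utf8Sketch {
    // Encode a value in the range 0 .. 0x7fffffff as a sequence of bytes.
    static byte[] encode(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("value must be non-negative");
        }
        if (value <= 0x7f) {
            return new byte[] { (byte) value }; // ASCII is encoded as itself
        }
        int length; // number of bytes, chosen from the value ranges
        if (value <= 0x7ff) length = 2;
        else if (value <= 0xffff) length = 3;
        else if (value <= 0x1fffff) length = 4;
        else if (value <= 0x3ffffff) length = 5;
        else length = 6;

        byte[] result = new byte[length];
        // Peel off 6-bit groups from the least significant end,
        // prefixing each with the bits 10.
        for (int i = length - 1; i > 0; i--) {
            result[i] = (byte) (0x80 | (value & 0x3f));
            value >>>= 6;
        }
        // Leading byte: 'length' 1's, then a 0, then the remaining bits.
        int prefix = (0xff << (8 - length)) & 0xff;
        result[0] = (byte) (prefix | value);
        return result;
    }

    // Show the bytes as hexadecimal digits.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode(0x42)));    // prints 42
        System.out.println(hex(encode(0x4eca)));  // prints e4bb8a
    }
}
```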



Example Consider the text

今日はBruce!

The UTF-8 encoding of the Unicode text is

    UTF-8                     UCS-2
    Hex  Binary      Char     Hex   Binary
    e4   1110 0100   今       4eca  0100 1110 1100 1010
    bb   1011 1011
    8a   1000 1010
    e6   1110 0110   日       65e5  0110 0101 1110 0101
    97   1001 0111
    a5   1010 0101
    e3   1110 0011   は       306f  0011 0000 0110 1111
    81   1000 0001
    af   1010 1111
    42   0100 0010   B        42    0100 0010
    72   0111 0010   r        72    0111 0010
    75   0111 0101   u        75    0111 0101
    63   0110 0011   c        63    0110 0011
    65   0110 0101   e        65    0110 0101
    21   0010 0001   !        21    0010 0001
    0a   0000 1010   \n       0a    0000 1010

Exercise
Convert the following UTF-8 Unicode text into UCS-2 (look in http://www.unicode.org or view in a web browser with UTF-8 text encoding to determine the representation of the characters, if you want to know).

    UTF-8                     UCS-2
    Hex  Binary      Char     Hex   Binary
    e4   1110 0100
    bd   1011 1101
    a0   1010 0000
    e5   1110 0101
    a5   1010 0101
    bd   1011 1101
    e5   1110 0101
    90   1001 0000
    97   1001 0111
    3f   0011 1111
    0a   0000 1010



8. Floating Point Numbers

§8.1 Internal Representation of Floating Point Numbers

Physics, Chemistry, and Computer Science all require the representation of very large and small numbers. People usually represent very large and small numbers in “scientific notation”, as a fixed point number times a power of 10. For example, 2.99792458 * 10^8 metre/sec, for the speed of light. In computer languages, “10 to the power of” is usually written as “e”, so we would write 2.99792458e8.

Floating point numbers are represented internally in the computer in a similar manner, but using base 2 rather than base 10. The IEEE standard for floating point numbers includes 32 bit single precision and 64 bit double precision formats. The formats are as follows:

    Single Precision   Bits    Size
    Sign s             31:31   1
    Exponent e         30:23   8
    Fraction f         22:0    23

    Double Precision   Bits    Size
    Sign s             63:63   1
    Exponent e         62:52   11
    Fraction f         51:0    52

[Figure: layout of a floating point number, from most significant bit to least significant bit: Sign s, Exponent e, Fraction f.]

The exponent is represented in an excess or bias of 127 for the single precision format, and 1023 for the double precision format. (You can consider the bias as 128 and 1024 if you adjust where you consider the “.” to be in the fraction.) If we treat e as an unsigned number, ranging between 0 and 255 (single precision) or 2047 (double precision), then the exponent E is given by E = e - bias, and

E ranges between Emin - 1 and Emax + 1, where

    For Single Precision   Emin = -126  (e = 1)   Emax = +127  (e = 254)
    For Double Precision   Emin = -1022 (e = 1)   Emax = +1023 (e = 2046)



The bit patterns corresponding to E = Emin - 1 (e = 0) and E = Emax + 1 (e = all 1’s) are interpreted in a special manner.

• Not a Number
When E = Emax + 1 and f ≠ 0, the bit pattern is interpreted as an invalid value (NaN, Not a Number), used to indicate the result of an invalid operation, such as 0.0 / 0.0.

• Infinity
When E = Emax + 1 and f = 0, the bit pattern is interpreted as

    Value = (-1)^s * infinity

• Finite Value
When Emin ≤ E ≤ Emax (the normal interpretation), the bit pattern is interpreted as

    Value = (-1)^s * 1.f * 2^E

• Denormal Value
When E = Emin - 1 and f ≠ 0, the bit pattern is interpreted as

    Value = (-1)^s * 0.f * 2^(E+1)

(We can interpret this as either effectively representing a value with exponent E = Emin, but making the integer bit (the bit to the left of the point) 0, instead of 1, or effectively representing a value with exponent E = Emin - 1, but shifting the bits of f left by 1, and using the most significant bit as the integer bit.)

• Zero
When E = Emin - 1 and f = 0, the bit pattern is interpreted as

    Value = (-1)^s * 0

Examples (Single precision)

    Value            Sign  Exponent  Fraction
    NaN              0     11111111  11111111111111111111111
    infinity         0     11111111  00000000000000000000000
    3.4028235e38     0     11111110  11111111111111111111111
    100.0            0     10000101  10010000000000000000000
    2.0              0     10000000  00000000000000000000000
    1.0000001        0     01111111  00000000000000000000001
    1.0              0     01111111  00000000000000000000000
    0.99999994       0     01111110  11111111111111111111111
    0.8125           0     01111110  10100000000000000000000
    -0.8125          1     01111110  10100000000000000000000
    0.5              0     01111110  00000000000000000000000
    1.17549435e-38   0     00000001  00000000000000000000000
    1.1754942e-38    0     00000000  11111111111111111111111
    2.8e-45          0     00000000  00000000000000000000010
    1.4e-45          0     00000000  00000000000000000000001
    0.0              0     00000000  00000000000000000000000
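The single precision fields can be inspected from Java with the standard Float.floatToIntBits method, using the bit positions 31:31, 30:23 and 22:0 given in the format table. The class and method names in this sketch are illustrative assumptions.

```java
class FloatFields {
    // Sign s: bit 31.
    static int sign(float x) {
        return Float.floatToIntBits(x) >>> 31;
    }

    // Stored exponent e: bits 30:23, an unsigned value with a bias of 127.
    static int exponentField(float x) {
        return (Float.floatToIntBits(x) >>> 23) & 0xff;
    }

    // Fraction f: bits 22:0.
    static int fraction(float x) {
        return Float.floatToIntBits(x) & 0x7fffff;
    }

    public static void main(String[] args) {
        float x = 100.0f;
        System.out.println(sign(x));                          // prints 0
        System.out.println(exponentField(x));                 // prints 133
        System.out.println(exponentField(x) - 127);           // prints 6
        System.out.println(Integer.toHexString(fraction(x))); // prints 480000
    }
}
```

For 100.0 this gives E = 6 and fraction bits 1001 followed by zeros, i.e. 1.1001 (binary) * 2^6 = 1.5625 * 64.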



Examples (Double precision)

    Value                     Sign  Exponent     Fraction
    NaN                       0     11111111111  1111111111111111111111111111111111111111111111111111
    infinity                  0     11111111111  0000000000000000000000000000000000000000000000000000
    1.7976931348623157e308    0     11111111110  1111111111111111111111111111111111111111111111111111
    2.0                       0     10000000000  0000000000000000000000000000000000000000000000000000
    1.0000000000000002        0     01111111111  0000000000000000000000000000000000000000000000000001
    1.0                       0     01111111111  0000000000000000000000000000000000000000000000000000
    0.9999999999999999        0     01111111110  1111111111111111111111111111111111111111111111111111
    2.2250738585072014e-308   0     00000000001  0000000000000000000000000000000000000000000000000000
    2.225073858507201e-308    0     00000000000  1111111111111111111111111111111111111111111111111111
    4.9e-324                  0     00000000000  0000000000000000000000000000000000000000000000000001
    0.0                       0     00000000000  0000000000000000000000000000000000000000000000000000

Exercise Determine the largest and smallest positive finite and denormal values for float and double, and represent their values as decimal values and their bit patterns as hexadecimal integers. What is the accuracy to which values can be represented for float and double (i.e., how many significant digits can be represented)? Indicate the bit pattern as a hexadecimal integer for the float values 0.125, 0.5, 1.0, 2.0, 4.0, 1.5, 1.825.

§8.2 Textual Representation of Floating Point Numbers

The following Java methods convert between a decimal string and a double.

    //_______________________________________________________________________
    private static double stringToDouble( String s ) {
        // Parse the string to generate a double.
    //_______________________________________________________________________
        int sign = +1;
        double value = 0;
        double scaleFactor = 1.0;
        int exponentSign = +1;
        int exponentValue = 0;
        // Get the sign
        if ( s.length() == 0 )
            return 0.0;
        if ( s.charAt( 0 ) == '-' ) {
            sign = -1;
            s = s.substring( 1 );
        }
        else if ( s.charAt( 0 ) == '+' ) {
            s = s.substring( 1 );
        }
        // Get integral part
        while ( true ) {
            if ( s.length() == 0 )
                break;
            char c = s.charAt( 0 );
            if ( c < '0' || c > '9' )
                break;
            value = value * 10 + ( c - '0' );
            s = s.substring( 1 );
        }
        // Get fractional part
        if ( s.length() != 0 && s.charAt( 0 ) == '.' ) {
            s = s.substring( 1 );
            while ( true ) {
                if ( s.length() == 0 )
                    break;
                char c = s.charAt( 0 );
                if ( c < '0' || c > '9' )
                    break;
                value = value * 10 + ( c - '0' );
                scaleFactor = scaleFactor / 10;
                s = s.substring( 1 );
            }
        }
        value = value * sign * scaleFactor;
        // Get the exponent
        if ( s.length() > 0
                && ( s.charAt( 0 ) == 'e' || s.charAt( 0 ) == 'E' ) ) {
            s = s.substring( 1 );
            if ( s.length() > 0 && s.charAt( 0 ) == '-' ) {
                exponentSign = -1;
                s = s.substring( 1 );
            }
            else if ( s.length() > 0 && s.charAt( 0 ) == '+' ) {
                s = s.substring( 1 );
            }
            while ( true ) {
                if ( s.length() == 0 )
                    break;
                char c = s.charAt( 0 );
                if ( c < '0' || c > '9' )
                    break;
                exponentValue = exponentValue * 10 + ( c - '0' );
                s = s.substring( 1 );
            }
        }
        // Compute the value
        if ( exponentSign < 0 )
            for ( int i = 0; i < exponentValue; i++ )
                value = value / 10;
        else
            for ( int i = 0; i < exponentValue; i++ )
                value = value * 10;
        return value;
    }

    //_______________________________________________________________________
    private static String doubleToString( double value, int accuracy ) {
        // Convert a double value to a decimal representation, with the
        // specified number of places after the decimal point.
    //_______________________________________________________________________
        int exponent = 0;
        String sign = "";
        String digitString = "";
        double rounding = 5.0;
        // Compute the sign
        if ( value < 0.0 ) {
            sign = "-";
            value = - value;
        }
        else if ( value > 0.0 )
            sign = "+";
        // Scale to 1 <= value < 10
        if ( value != 0 ) {
            while ( value < 1.0 ) {
                value = value * 10;
                --exponent;
            }
            while ( value >= 10.0 ) {
                value = value / 10;
                ++exponent;
            }
            // Round to make sure the last digit is as accurate as possible.
            for ( int i = 0; i <= accuracy; i++ )
                rounding = rounding / 10;
            value = value + rounding;
            if ( value >= 10.0 ) {
                value = value / 10;
                ++exponent;
            }
        }
        // Compute the digits.
        for ( int i = 0; i <= accuracy; i++ ) {
            int digit = ( int ) Math.floor( value );
            value = ( value - digit ) * 10;
            digitString = digitString + ( char ) ( digit + '0' );
        }
        // Return the result.
        return sign + digitString.charAt( 0 ) + "."
            + digitString.substring( 1 ) + exponentToString( exponent );
    }

    //_______________________________________________________________________
    private static String exponentToString( int exponent ) {
    //_______________________________________________________________________
        if ( exponent < 0 ) {
            return "e-" + ( - exponent );
        }
        else if ( exponent > 0 ) {
            return "e+" + exponent;
        }
        else
            return "";
    }



9. Storage of Data in Computer Memory

§9.1 Bits, bytes, words, longwords and quadwords

A bit is a “binary digit” - a 0 or a 1. A bit is represented in the computer in some physical form, such as a voltage difference, or a magnet pointing in one direction or the other.

A byte is composed of 8 bits. Because each bit has 2 possible states, a byte can represent at most 2^8 = 256 different values. One interpretation of these values is as characters. A keyboard lets the user type 52 upper and lower case letters, 10 digits, a space, delete character, 32 special symbols, and 32 control characters (e.g., line break or tab characters), giving a total of 128 different alternatives. Seven bits are required to represent these alternatives. The eighth bit was historically used for parity checking, but is now not usually used.

On modern machines, computer memory is composed of bytes. Normally every byte of memory can be addressed individually, by an integer. Successive bytes of memory are numbered 0, 1, 2, 3, ..., so we can think of computer memory as an array of bytes.

A pair of adjacent bytes can be grouped together into a 16 bit value called a word. A word can represent 2^16 (65536) different values. Because there are many more than 256 different Chinese characters, we cannot represent Chinese characters in a byte. However, there are fewer than 65536 different Chinese characters, so it is possible to use a word to represent a Chinese character.

Four adjacent bytes can be grouped together into a 32 bit value called a longword. A longword is often used to store the value of an integer variable. Older machines use a longword to store an address. Because a longword is composed of 32 bits, there can be at most 2^32 different addresses on such machines, which limits the amount of memory to 4 Gbytes.

Large modern machines require more memory, and hence need more bits to represent an address. They use what is called a quadword to store an address. A quadword is composed of eight adjacent bytes, grouped together into a 64 bit value. The Alpha computer uses a quadword to store an address.

The information in a byte, word, longword or quadword can be considered to represent an unsigned integer. We can number the bits within the data as bit 0, 1, 2, ... 7 for a byte, bit 0, 1, 2, ... 15 for a word, etc. The unsigned integer represented by a bit pattern is Σ bit_i * 2^i. Because people write the more significant digits of a number to the left of the less significant digits, we usually do the same when drawing a diagram of the contents of a byte, word, longword or quadword.



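[Figure: bit numbering, with the most significant bit on the left. byte: bits 7 to 0; word: bits 15 to 0; longword: bits 31 to 0; quadword: bits 63 to 0.]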

§9.2 Big endian and little endian storage of data

On modern machines, it is possible to refer to individual bytes, by an address. Values that take up several consecutive bytes can be referred to by the address of the low byte.

Several bytes are needed to store a number. There are two obvious choices for storage of numbers using a sequence of bytes - what is called big endian (most significant end first) and little endian (least significant end first).

A similar choice is made when people write numbers and dates. Everyone writes decimal numbers in big endian form. 365 means three hundred and sixty five, not five hundred and sixty three. In New Zealand, we write dates in little endian form. For example 27th February 2003 would be written as 27/2/2003. In Japan, they write dates in big endian form as 2003年2月27日, or 二千三年二月二十七日. In the USA, they are inconsistent, writing the date in the order month, day, year as 2/27/2003.

For example, suppose we want to store the quadword 0x0123456789abcdef in memory starting at address 0x1000000. We get one of the following representations:

    Address     Big endian   Little endian
    1000000:    01           ef
    1000001:    23           cd
    1000002:    45           ab
    1000003:    67           89
    1000004:    89           67
    1000005:    ab           45
    1000006:    cd           23
    1000007:    ef           01



The value of a longword stored at address base is computed as

    ( byte[ base ] << 24 ) + ( byte[ base + 1 ] << 16 )
        + ( byte[ base + 2 ] << 8 ) + byte[ base + 3 ]

for big endian data. The value of a longword stored at address base is computed as

    ( byte[ base + 3 ] << 24 ) + ( byte[ base + 2 ] << 16 )
        + ( byte[ base + 1 ] << 8 ) + byte[ base ]

for little endian data. (The parentheses are needed if these are written as Java or C expressions, because + has higher precedence than <<. Each byte is treated as an unsigned value.)

There has not been much consistency among different computer architectures, with regard to whether they store numbers in big endian or little endian format. IBM machines use the big endian format for numbers, while DEC (now COMPAQ) machines tend to use the little endian format. It is actually possible to choose the format on the Alpha on startup, but the choice tends to be little endian, and the Alpha simulator only supports this alternative.

Strings are stored as a sequence of bytes, with one character per byte. The end of the string is often indicated by a null (zero) byte. Strings are always stored in big endian format (in other words, the first character at the low address end).

    68    65    6c    6c    6f    21    0a    00
    'h'   'e'   'l'   'l'   'o'   '!'   '\n'  '\0'

Storage of the string “hello!\n”

When data is transmitted over the network, numbers tend to be transmitted big endian, with the most significant byte first.

When displaying information down the page, most people display low addresses at the top of the page, and high addresses at the bottom. When displaying multiple bytes across the page, there is again the choice of order. Low address first is better for big endian data, high address first is better for little endian data.

All computers contain registers, which tend to contain 32 bits on older machines, and 64 bits on newer machines. Are numbers stored in registers in big endian or little endian format? The answer is that in a sense the question has no meaning, because in most computer architectures we do not have instructions that address individual bytes within the register.

On most computers, including the Alpha, data composed of more than one byte has to be aligned, according to its size. Word data has to start at an address divisible by 2, longword data has to start at an address divisible by 4, and quadword data has to start at an address divisible by 8. This is for ease of implementation. If the data is not aligned, then part of the data might be stored in one memory chip, and part in another. Fetching the data would have to be broken up into two requests.

Exercise
When you know more about assembly language: Write Alpha assembly language to load a big-endian value from memory into a register, so that it can be operated on by arithmetic instructions.
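The big endian and little endian longword computations given earlier in this section can be sketched in Java as follows. Memory is modelled as a byte array; the class and method names are assumptions. Because Java bytes are signed, each byte is masked with 0xff to treat it as unsigned, and | is used in place of + to combine the parts.

```java
class Endian {
    // Longword whose most significant byte is at the lowest address.
    static int bigEndian(byte[] memory, int base) {
        return ((memory[base] & 0xff) << 24)
             | ((memory[base + 1] & 0xff) << 16)
             | ((memory[base + 2] & 0xff) << 8)
             | (memory[base + 3] & 0xff);
    }

    // Longword whose least significant byte is at the lowest address.
    static int littleEndian(byte[] memory, int base) {
        return ((memory[base + 3] & 0xff) << 24)
             | ((memory[base + 2] & 0xff) << 16)
             | ((memory[base + 1] & 0xff) << 8)
             | (memory[base] & 0xff);
    }

    public static void main(String[] args) {
        byte[] memory = { 0x01, 0x23, 0x45, 0x67 };
        System.out.println(Integer.toHexString(bigEndian(memory, 0)));     // prints 1234567
        System.out.println(Integer.toHexString(littleEndian(memory, 0)));  // prints 67452301
    }
}
```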



10. Storage of structured objects Compound objects are composed of a sequence of values.

§10.1 Arrays

An array is a compound object composed of elements of the same type. In many modern languages, the number of elements might be determined dynamically, at run-time. All the elements have the same size, and are stored at successive memory addresses.

    int array[] = { 111, 222, 333, 444, 555 };

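    Address      Value   Element
    0x1000000    111     array[ 0 ] (offset 0)
    0x1000008    222     array[ 1 ] (offset 8)
    0x1000010    333     array[ 2 ] (offset 16)
    0x1000018    444     array[ 3 ] (offset 24)
    0x1000020    555     array[ 4 ] (offset 32)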

We use indexing to access the elements. The ith element, written as “array[ i ]” in C, is stored at address

    array + elementSize * i

Thus the address of an element can be computed by scaling the index by the element size, and adding it to the base address of the array. For example, for the Alpha computer architecture, which uses 8 bytes for an integer, we might write

    ldiq $t0, array;        // Load array base address into $t0
    mulq $i, 8, $t1;        // Compute scaled index into $t1
    addq $t0, $t1, $t0;     // Compute element address into $t0

to obtain the address of the ith element, followed by

    ldq $t0, ($t0);         // Load value of element into $t0.

to obtain the contents of the ith element.

§10.2 Descriptors

In many languages (but not C, which only allows fixed size arrays), information is stored on the size of the array. This might be stored separately from the array itself, in what is called a “descriptor”.



    Address      Value
    0x1000000    111
    0x1000008    222
    0x1000010    333
    0x1000018    444
    0x1000020    555

    descriptor
        size     5
        array    0x1000000






In such languages, it is possible to perform bounds checking, to ensure the array index is in range.

§10.3 Multi-dimensional arrays

Multi-dimensional arrays usually follow the same conventions, except that the elements are themselves subarrays.

    int array[ 3 ][ 4 ];

    array[ 0 ]    array[ 0 ][ 0 ]
                  array[ 0 ][ 1 ]
                  array[ 0 ][ 2 ]
                  array[ 0 ][ 3 ]
    array[ 1 ]    array[ 1 ][ 0 ]
                  array[ 1 ][ 1 ]
                  array[ 1 ][ 2 ]
                  array[ 1 ][ 3 ]
    array[ 2 ]    array[ 2 ][ 0 ]
                  array[ 2 ][ 1 ]
                  array[ 2 ][ 2 ]
                  array[ 2 ][ 3 ]

The subarray “array[ i ]” is stored at address

    array + numberElementsInSubarray * elementSize * i

For example, for the Alpha computer architecture, we might write

    ldiq $t0, array;        // Load array base address into $t0
    mulq $i, 4*8, $t1;      // Compute scaled first index into $t1
    addq $t0, $t1, $t0;     // Compute element address into $t0

to obtain the address of array[ i ]. The element “array[ i ][ j ]” is stored at address

    array + numberElementsInSubarray * elementSize * i + elementSize * j

For example, for the Alpha computer architecture, we might write

    ldiq $t0, array;        // Load array base address into $t0
    mulq $i, 4*8, $t1;      // Compute scaled first index into $t1
    addq $t0, $t1, $t0;     // Compute address of array[ i ] into $t0
    mulq $j, 8, $t1;        // Compute scaled second index into $t1
    addq $t0, $t1, $t0;     // Compute address of array[ i ][ j ] into $t0

to obtain the address of array[ i ][ j ].

§10.4 Records

A record is a compound object composed of fields of possibly different types. The number and types of the fields are specified at compile-time. In C, a record corresponds to a “struct” (structure) declaration. The Java equivalent is a class with only instance fields.

    struct BinaryNode {
        struct BinaryNode *left;
        char *name;
        struct BinaryNode *right;
    };

    Address      Field    Offset
    0x1000028    left     0x0      (the address named by “record”)
    0x1000030    name     0x8
    0x1000038    right    0x10

Fields may be of different sizes. They are stored at successive addresses. Sometimes the computer architecture may require some fields to be aligned (at an address divisible by the natural size of the data). This might imply that a gap is left in the record to ensure alignment of the next field.

Because the number and type of the fields is fixed at compile time, the offset of a field from the start of the record is a constant that can be determined at compile-time. A field, written as “record.fieldName” in C, is stored at address

    record + fieldOffset

Thus the address of a field can be computed by adding a constant to the base address. Many computer architectures have instructions that refer to a memory address as a displacement from a base address stored in a register. For example, for the Alpha computer architecture, we might write

    ldiq $t0, record;               // Load record base address into $t0
    ldq $t0, fieldOffset($t0);      // Load value of field into $t0

where fieldOffset is a number, such as 0x0 for left, 0x8 for name, 0x10 for right.

§10.5 Pointers

We often want a reference (pointer) to another object. At the architecture level, a reference to another object is usually represented by the address of the object. A null reference is represented by a zero value. In C, to declare a variable as a pointer variable, we write

    TypeName *variableName;

For example, the data structure corresponding to the expression a * b + c



          +
         / \
        *   c
       / \
      a   b

could be represented as

    root points to the record at 0x1000028

    Address      Field    Value
    0x1000028    left     0x1000040
    0x1000030    name     0x10000a0
    0x1000038    right    0x1000088

    0x1000040    left     0x1000058
    0x1000048    name     0x10000a2
    0x1000050    right    0x1000070

    0x1000058    left     0x0
    0x1000060    name     0x10000a4
    0x1000068    right    0x0

    0x1000070    left     0x0
    0x1000078    name     0x10000a6
    0x1000080    right    0x0

    0x1000088    left     0x0
    0x1000090    name     0x10000a8
    0x1000098    right    0x0

    Address      Characters
    0x10000a0    + \0
    0x10000a2    * \0
    0x10000a4    a \0
    0x10000a6    b \0
    0x10000a8    c \0

In C it is possible to declare array and structure variables. The variable corresponds to the space for the array or struct itself, and is allocated as a part of the declaration.

    int array[ 4 ];
    struct BinaryNode a;

It is also possible to declare pointer variables, which can point to memory containing a primitive type, array, or structure.

    char *s = "hello";              // A pointer to the start of memory containing
                                    // the text “hello”.
    int *p = array;                 // A pointer to the start of the array.
    struct BinaryNode *q = &a;      // A pointer to the variable “a”.

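[Figure: the variables array and a hold their own storage. s points to the characters ‘h’ ‘e’ ‘l’ ‘l’ ‘o’ ‘\0’, p points to the start of array, and q points to a.]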

It is also possible to dynamically allocate memory, and make a pointer variable point to the memory. This lets us create dynamic data structures.

In C, given the address of a memory location, it is possible to refer to adjacent memory locations. That is why we can step through the characters in a string if we have a pointer to the first character of the string.

Pointers can be “dereferenced” to access the data they point to. For example, *s refers to the character 'h', *p refers to array[ 0 ], and *q refers to a. Pointers can also be indexed, to refer to adjacent data. For example, s[ 1 ] refers to the character 'e', p[ i ] refers to array[ i ], etc.

Selecting a field of an object pointed to by a pointer is a very common action. Rather than having to write “(*ptr).fieldName”, we can write “ptr->fieldName”.

Assigning pointer variables does not copy the object pointed to. You just get two pointers pointing to the same place.

    struct BinaryNode a;
    struct BinaryNode *p, *q;
    p = &a;
    q = p;

[Figure: p and q both point to the variable a.]

In Java, all variables are either of primitive type, or effectively pointers. Reference variables have a default null value. All objects have to be explicitly created, using “new”. Field selection in Java, written as “ptr.fieldName”, actually corresponds to “ptr->fieldName” in C.



§10.6 Class instances

In the C language, records are composed of instance fields. There is no such notion as static fields, and there is no way of associating functions with a record. However C++ and Java do have such notions.

How could a class instance be represented, so that we can support extending of classes, with overriding of functions? We could add a field at the beginning of each record, which points to a “function table”, a table of the instance functions of the class.

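[Figure: ptr points to a record whose first field is a pointer to the function table, followed by the fields field0, field1, field2 and field3. The function table holds the entries function0, function1, function2 and function3.]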

To implement the Java equivalent of

    ptr.functionName( param0, param1, ... )

in C, we would have to write

    ptr->functionTable->functionName( ptr, param0, param1, ... )

The function address is obtained from the function table, and all instance functions take the instance as an additional parameter. The function table never changes. So objects of the same type can share the function table.



[Figure: variable1 and variable2 each point to a separate record with its own fields field0 to field3. The first field of both records points to the same function table, which holds function0 to function3.]

When a class extends another class, we ensure that the layout of the first part of the fields and of the function table of the subclass is the same as that of the superclass.


    class A extends Object
    class B extends A
    class C extends B

    B b = new C()

    Fields                  Function table for class C
    fields for Object       functions for Object
    fields for A            functions for A
    fields for B            functions for B
    fields for C            functions for C

This means that even though a variable might point to an object belonging to a subclass, the portion of the data structure accessed as an object of the declared type has the layout of the declared type.



While the first portion of the function table for the subclass has the same layout as the superclass, the entries are not necessarily the same. If a function is overridden, it replaces the entry for the function of the same name and signature in the function table.

    class A extends Object
    class B extends A
    class C extends B
    class B declares functions f and g
    class C declares functions f and h

    Function table for class B
        functions for A
        functions for B     f -> code for f in B
                            g -> code for g in B

    Function table for class C
        functions for A
        functions for B     f -> code for f in C
                            g -> code for g in B
        functions for C     h -> code for h in C

Static fields and functions do not appear as part of the object. Static variables are stored separately, and occur only once, no matter how many instances of the class exist.

Example
Consider the Java program

    public class A extends Object {
        int field1, field2, field5;
        void method1() { }
        void method3() { }
        void method4() { }
        void method5() { }
    }

    public class B extends A {
        int field2, field4;
        void method2() { }
        void method3() { }
        void method1() { }
    }

An instance of class B will look something like




    Fields          Function table
    A.field1        Object.toString()
    A.field2        B.method1()
    A.field5        B.method3()
    B.field2        A.method4()
    B.field4        A.method5()
                    B.method2()

Example
Consider the following Java program.

    class A {
        public int x1, x2;
        public A() {
            System.out.println( "Invoke A_init()" );
        }
        public String toString() {
            return "A";
        }
        public void f( int x ) {
            System.out.println( "Invoke A.f( " + x + " )" );
        }
        public void g( int x ) {
            System.out.println( "Invoke A.g( " + x + " )" );
        }
    }

    class B extends A {
        public int y1, y2;
        public B() {
            System.out.println( "Invoke B_init()" );
        }
        public String toString() {
            return "B";
        }
        public void f( int x ) {
            System.out.println( "Invoke B.f( " + x + " )" );
        }
        public void h( int x ) {
            System.out.println( "Invoke B.h( " + x + " )" );
        }
    }

    public class Overriding {
        public final static void main( String[] args ) {
            A a = new B();
            System.out.println( "a = " + a );
            a.f( 4 );
            a.g( 5 );
        }
    }

This program generates output

    Invoke A_init()
    Invoke B_init()
    a = B
    Invoke B.f( 4 )
    Invoke A.g( 5 )

We can translate this program into the following C program.

    #include <stdio.h>
    #include <stdlib.h>

    struct Object {
        struct Object_FunctionTable *functionTable;
    };
    struct Object_FunctionTable {
        char *(*toString)( struct Object *a );
    };
    char *Object_toString( struct Object *a ) {
        return "Object";
    }

    struct A {
        struct A_FunctionTable *functionTable;
        int x1, x2;
    };
    struct A_FunctionTable {
        char *(*toString)( struct A *a );
        void (*f)( struct A *a, int x );
        void (*g)( struct A *a, int x );
    };
    void A_init( struct A *b ) {
        printf( "Invoke A_init()\n" );
    }
    char *A_toString( struct A *a ) {
        return "A";
    }
    void A_f( struct A *a, int x ) {
        printf( "Invoke A.f( %d )\n", x );
    }
    void A_g( struct A *a, int x ) {
        printf( "Invoke A.g( %d )\n", x );
    }
    // The function table shared by all instances of A.
    struct A_FunctionTable A_FunctionTableInstance = {
        &A_toString, &A_f, &A_g
    };

    struct B {
        struct B_FunctionTable *functionTable;
        int x1, x2;     // Fields inherited from A, in the same positions.
        int y1, y2;
    };
    struct B_FunctionTable {
        char *(*toString)( struct B *b );
        void (*f)( struct B *b, int x );
        void (*g)( struct B *b, int x );
        void (*h)( struct B *b, int x );
    };
    void B_init( struct B *b ) {
        A_init( ( struct A * ) b );
        printf( "Invoke B_init()\n" );
    }
    char *B_toString( struct B *b ) {
        return "B";
    }
    void B_f( struct B *b, int x ) {
        printf( "Invoke B.f( %d )\n", x );
    }
    void B_h( struct B *b, int x ) {
        printf( "Invoke B.h( %d )\n", x );
    }
    // toString and f are overridden; g is inherited, so its entry is A_g.
    struct B_FunctionTable B_FunctionTableInstance = {
        B_toString, B_f, ( void (*)( struct B *b, int x ) ) A_g, B_h
    };

    struct Object *createObject( int size,
            struct Object_FunctionTable *functionTable ) {
        struct Object *object = ( struct Object * ) malloc( size );
        object->functionTable = ( struct Object_FunctionTable * ) functionTable;
        return object;
    }

    int main( int argc, char *argv[], char *arge[] ) {
        // Equivalent of: A a = new B();
        struct A *a = ( struct A * ) createObject( sizeof( struct B ),
            ( struct Object_FunctionTable * ) &B_FunctionTableInstance );
        B_init( ( struct B * ) a );
        printf( "a = %s\n", ( a->functionTable->toString )( a ) );
        ( a->functionTable->f )( a, 4 );
        ( a->functionTable->g )( a, 5 );
    }

Exercise Write a simple Java program that illustrates the use of overriding. Translate this program into C (not C++), by implementing the method tables explicitly.


11. Appendices



§11.1 Base Conversion Table and Powers of two

    Decimal (n)   Hexadecimal   Binary   2^n Hex   2^n Decimal   2^-n Decimal
    0             0             0000     1         1             1
    1             1             0001     2         2             0.5
    2             2             0010     4         4             0.25
    3             3             0011     8         8             0.125
    4             4             0100     10        16            0.0625
    5             5             0101     20        32            0.03125
    6             6             0110     40        64            0.015625
    7             7             0111     80        128           0.0078125
    8             8             1000     100       256           0.00390625
    9             9             1001     200       512           0.001953125
    10            a             1010     400       1024          0.0009765625
    11            b             1011     800       2048          0.00048828125
    12            c             1100     1000      4096          0.000244140625
    13            d             1101     2000      8192          0.0001220703125
    14            e             1110     4000      16384         0.00006103515625
    15            f             1111     8000      32768         0.000030517578125



§11.2 Hexadecimal Addition Table

+ 1 2 3 4 5 6 7 8 9 a b c d e f

1 2 3 4 5 6 7 8 9 a b c d e f 10

2 3 4 5 6 7 8 9 a b c d e f 10 11

3 4 5 6 7 8 9 a b c d e f 10 11 12

4 5 6 7 8 9 a b c d e f 10 11 12 13

5 6 7 8 9 a b c d e f 10 11 12 13 14

6 7 8 9 a b c d e f 10 11 12 13 14 15

7 8 9 a b c d e f 10 11 12 13 14 15 16

8 9 a b c d e f 10 11 12 13 14 15 16 17

9 a b c d e f 10 11 12 13 14 15 16 17 18

a b c d e f 10 11 12 13 14 15 16 17 18 19

b c d e f 10 11 12 13 14 15 16 17 18 19 1a

c d e f 10 11 12 13 14 15 16 17 18 19 1a 1b

d e f 10 11 12 13 14 15 16 17 18 19 1a 1b 1c

e f 10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d

f 10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e

§11.3 Hexadecimal Multiplication Table

* 2 3 4 5 6 7 8 9 a b c d e f

2 4 6 8 a c e 10 12 14 16 18 1a 1c 1e

3 6 9 c f 12 15 18 1b 1e 21 24 27 2a 2d

4 8 c 10 14 18 1c 20 24 28 2c 30 34 38 3c

5 a f 14 19 1e 23 28 2d 32 37 3c 41 46 4b

6 c 12 18 1e 24 2a 30 36 3c 42 48 4e 54 5a

7 e 15 1c 23 2a 31 38 3f 46 4d 54 5b 62 69

8 10 18 20 28 30 38 40 48 50 58 60 68 70 78

9 12 1b 24 2d 36 3f 48 51 5a 63 6c 75 7e 87

a 14 1e 28 32 3c 46 50 5a 64 6e 78 82 8c 96

b 16 21 2c 37 42 4d 58 63 6e 79 84 8f 9a a5

c 18 24 30 3c 48 54 60 6c 78 84 90 9c a8 b4

d 1a 27 34 41 4e 5b 68 75 82 8f 9c a9 b6 c3

e 1c 2a 38 46 54 62 70 7e 8c 9a a8 b6 c4 d2

f 1e 2d 3c 4b 5a 69 78 87 96 a5 b4 c3 d2 e1



§11.4 The ASCII character set

    00 NUL       01 SOH       02 STX       03 ETX       04 EOT       05 ENQ       06 ACK       07 BEL (\a)
    08 BS (\b)   09 HT (\t)   0A LF (\n)   0B VT (\v)   0C FF (\f)   0D CR (\r)   0E SO        0F SI
    10 DLE       11 DC1       12 DC2       13 DC3       14 DC4       15 NAK       16 SYN       17 ETB
    18 CAN       19 EM        1A SUB       1B ESC       1C FS        1D GS        1E RS        1F US
    20 SP        21 !         22 "         23 #         24 $         25 %         26 &         27 '
    28 (         29 )         2A *         2B +         2C ,         2D -         2E .         2F /
    30 0         31 1         32 2         33 3         34 4         35 5         36 6         37 7
    38 8         39 9         3A :         3B ;         3C <         3D =         3E >         3F ?
    40 @         41 A         42 B         43 C         44 D         45 E         46 F         47 G
    48 H         49 I         4A J         4B K         4C L         4D M         4E N         4F O
    50 P         51 Q         52 R         53 S         54 T         55 U         56 V         57 W
    58 X         59 Y         5A Z         5B [         5C \         5D ]         5E ^         5F _
    60 `         61 a         62 b         63 c         64 d         65 e         66 f         67 g
    68 h         69 i         6A j         6B k         6C l         6D m         6E n         6F o
    70 p         71 q         72 r         73 s         74 t         75 u         76 v         77 w
    78 x         79 y         7A z         7B {         7C |         7D }         7E ~         7F DEL



§11.5 Names for powers of 10

In computing, as in physics and chemistry, we need to have a notation for big and small numbers in order to describe them. Because Europeans group digits in lots of three, the names are powers of 10^3.

    Exa      10^18
    Peta     10^15
    Tera     10^12
    Giga     10^9
    Mega     10^6
    Kilo     10^3
    Milli    10^-3
    Micro    10^-6
    Nano     10^-9
    Pico     10^-12
    Femto    10^-15
