LING 408/508: Programming for Linguists Lecture 2 August 26 th.


LING 408/508: Programming for Linguists

Lecture 2, August 26th


Today’s Topics

• continuing on from last time …
• Homework 1


Administrivia

• No class on
– Monday September 7th (Labor Day)
– Wednesday November 11th (Veterans Day)
– Week after September 11th (out of town), plus Monday 21st
– Monday October 12th


Introduction: data types

• What if you want to store even larger numbers than 32 bits?
– Binary Coded Decimal (BCD)
– 1 byte can code two digits (0-9 requires 4 bits)
– 1 nibble (4 bits) codes the sign (+/-), e.g. hex C/D

Nibble bits (weights 2^3 2^2 2^1 2^0):
0 0 0 0 = 0
0 0 0 1 = 1
1 0 0 1 = 9
1 1 0 0 = C, credit (+)
1 1 0 1 = D, debit (-)

2 0 1 4: 2 bytes (= 4 nibbles)
+ 2 0 1 4: 2.5 bytes (= 5 nibbles)
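As a rough sketch of the scheme above (not part of the original slides), the following Python turns a signed integer into BCD nibbles. The names `to_bcd_nibbles` and `nibbles_to_bits` are made up for illustration, and putting the sign nibble first simply follows the "+ 2 0 1 4" layout on this slide; real packed-decimal formats often put it last.

```python
def to_bcd_nibbles(n):
    """Return a list of 4-bit values: sign nibble, then one nibble per digit."""
    sign = 0xC if n >= 0 else 0xD          # C = credit (+), D = debit (-)
    digits = [int(d) for d in str(abs(n))]  # each decimal digit fits in 4 bits
    return [sign] + digits                  # assumption: sign nibble comes first


def nibbles_to_bits(nibbles):
    """Show each nibble as a 4-bit binary string."""
    return " ".join(format(x, "04b") for x in nibbles)


nibbles = to_bcd_nibbles(2014)
print(nibbles_to_bits(nibbles))   # five nibbles = 2.5 bytes
```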


Introduction: data types

• Typically, 64 bits (8 bytes) are used to represent floating point numbers (double precision)
– c = 2.99792458 × 10^8 (m/s)
– coefficient: 52 bits (implied leading 1, therefore treat as 53)
– exponent: 11 bits (usually not 2's complement: unsigned with bias 2^(11-1) - 1 = 1023)
– sign: 1 bit (+/-)

C: float, double

(Wikipedia)

x86 CPUs have a built-in floating point coprocessor (x87) with 80-bit registers

Used for, e.g., probabilities
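To make the three fields concrete, here is a small Python sketch (an illustration added here, not from the slides) that pulls the sign, exponent, and coefficient bits out of a 64-bit double using the standard `struct` module. The helper name `double_fields` is hypothetical.

```python
import struct


def double_fields(x):
    """Return (sign, biased exponent, 52-bit fraction) of a 64-bit double."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))  # reinterpret as integer
    sign = bits >> 63                    # top bit
    exponent = (bits >> 52) & 0x7FF      # next 11 bits, stored with a bias
    fraction = bits & ((1 << 52) - 1)    # low 52 bits (implicit leading 1 not stored)
    return sign, exponent, fraction


sign, exponent, fraction = double_fields(2.99792458e8)
print(sign, exponent - 1023, format(fraction, "052b"))
```

Subtracting the bias from the stored exponent gives the actual power of two.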


Introduction: data types

• Next time, we'll talk about the representation of characters (letters, symbols, etc.)


Example 1

• Recall the speed of light:
• c = 2.99792458 × 10^8 (m/s)

1. Can a 4 byte integer be used to represent c exactly?
– 4 bytes = 32 bits
– 32 bits in 2's complement format
– Largest positive number is 2^31 - 1 = 2,147,483,647
– c = 299,792,458, which is smaller, so yes
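The comparison can be checked with a few lines of Python. This is only a sketch: the variable names are invented here, and `struct` with the `"<i"` format is used as a stand-in for a 4 byte signed integer.

```python
import struct

C = 299_792_458            # speed of light in m/s
INT32_MAX = 2**31 - 1      # largest 32-bit 2's complement value

fits = -2**31 <= C <= INT32_MAX

packed = struct.pack("<i", C)                # 4 bytes, signed, little-endian
roundtrip = struct.unpack("<i", packed)[0]   # read the same 4 bytes back

print(fits, len(packed), roundtrip == C)
```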


Example 2

• Recall the speed of light:
• c = 2.99792458 × 10^8 (m/s)

2. How much memory would you need to encode c using BCD notation?
– 9 digits
– each digit requires 4 bits (a nibble)
– BCD notation includes a sign nibble
– total is 10 nibbles = 5 bytes
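The same count can be written as a tiny Python function. This is an illustrative sketch; `bcd_size_in_bytes` is a name made up here, and it assumes one nibble per digit plus one sign nibble, as described above.

```python
def bcd_size_in_bytes(n):
    """Bytes needed for n in BCD: one nibble per digit plus a sign nibble."""
    digit_nibbles = len(str(abs(n)))
    total_nibbles = digit_nibbles + 1   # + sign nibble
    return total_nibbles / 2            # 2 nibbles per byte


print(bcd_size_in_bytes(299_792_458))   # speed of light
print(bcd_size_in_bytes(2014))          # the earlier "+ 2 0 1 4" example
```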


Example 3

• Recall the speed of light:
• c = 2.99792458 × 10^8 (m/s)

3. Can the 64 bit floating point representation (double) encode c without loss of precision?
– Recall significand precision: 53 bits (52 explicitly stored)
– 2^53 - 1 = 9,007,199,254,740,991 (almost 16 digits)
– c = 299,792,458 has only 9 digits, so yes
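A quick way to see this in Python, whose `float` is a 64 bit double on the usual platforms. The snippet is a sketch added for illustration; the variable names are not from the slides.

```python
C = 299_792_458

as_double = float(C)
exact = int(as_double) == C          # did the value survive the conversion?

limit = 2**53
# Past 2**53, neighbouring integers start to collapse onto the same double.
collides = float(limit) == float(limit + 1)

print(exact, limit - 1, collides)
```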


Example 4

• Recall the speed of light:
• c = 2.99792458 × 10^8 (m/s)

• The 32 bit floating point representation (float), sometimes called single precision, is composed of 1 bit sign, 8 bits exponent (unsigned with bias 2^(8-1) - 1 = 127), and 23 bits coefficient (24 bits effective).

• Can it represent c without loss of precision?
– 2^24 - 1 = 16,777,215
– Nope: c = 299,792,458 needs more than 24 significant bits
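Python has no built-in 32 bit float, but `struct` can round a value through one. The sketch below (helper name `to_float32` is invented here) shows what actually gets stored when c is squeezed into single precision.

```python
import struct


def to_float32(x):
    """Round x to the nearest 32-bit float and return it as a Python float."""
    return struct.unpack("<f", struct.pack("<f", float(x)))[0]


C = 299_792_458
stored = to_float32(C)

print(stored, stored == C)
```

Values up to 2^24 - 1 pass through unchanged; c does not.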


Homework 1

• For both solutions, show your work, i.e. how you derived your answer

• Pi (π) is an irrational number
– can't be represented precisely!

(Wikipedia)


Homework 1

1. Encode Pi as accurately as possible using both the 64 and 32 bit floating point representations
Instruction: draw the diagram and fill in the 1's and 0's

2. How many decimal places of precision is provided by each of the 64 and 32 bit floating point representations?


Homework 1 Hints

• How to encode 1: (bias: 01111 + 0 = 2^0, frac: 1000…; remember: there is an implicit leading 1) = 1.000… in binary


Homework 1 Hints

• How to encode 2: (exp: 10000 = bias 01111 + 1 = 2^1, frac: 1000…) = 10.00… in binary


Homework 1 Hints

• How to encode 3: (exp: 10000 = bias 01111 + 1 = 2^1, frac: 1100…) = 11.000… in binary


Homework 1 Hints

• How to encode 4: (exp: 10001 = bias 01111 + 10 = 2^2, frac: 1000…) = 100.0… in binary


Homework 1 Hints

• How to encode 5: (exp: 10001 = bias 01111 + 10 = 2^2, frac: 1010…) = 101.0… in binary


Homework 1 Hints

• How to encode 6: (exp: 10001 = bias 01111 + 10 = 2^2, frac: 1100…) = 110.0… in binary


Homework 1 Hints

• How to encode 7: (exp: 10001 = bias 01111 + 10 = 2^2, frac: 1110…) = 111.0… in binary


Homework 1 Hints

• How to encode 8: (exp: 10010 = bias 01111 + 11 = 2^3, frac: 1000…) = 1000.0… in binary


Homework 1 Hints

• Decimal 3.5 is 1.11 × 2^1 = 11.1 in binary


Homework 1 Hints

• Decimal 3.25 is 1.101 × 2^1 = 11.01 in binary


Homework 1 Hints

• Decimal 3.125 is 1.1001 × 2^1 = 11.001 in binary
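One possible way to check hand-drawn diagrams like the ones in these hints is to let Python print the three fields of a 32 bit float. This is a sketch only: `float32_fields` is a hypothetical helper, and note that it prints the full 8 bit exponent field of a real 32 bit float (bias 01111111) and the stored 23 bit fraction, which leaves out the implicit leading 1 that the hints write explicitly.

```python
import struct


def float32_fields(x):
    """Split a 32-bit float into (sign, exponent, fraction) bit strings."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))  # reinterpret as integer
    s = format(bits, "032b")
    return s[0], s[1:9], s[9:]   # 1 sign bit, 8 exponent bits, 23 fraction bits


for value in (1.0, 2.0, 3.0, 3.5):
    sign, exponent, fraction = float32_fields(value)
    print(value, sign, exponent, fraction)
```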


Homework 1

• Due Friday night
– (by midnight in my email inbox)

• Required format (for all homeworks unless otherwise specified):
– Plain text or PDF formats only
• (no .doc, .docx etc.)
– Single file only: cut and paste into one document
• (no multiple attachments)
– Subject line: 408/508 Homework 1
– First line: your full name


Introduction: data types

• How about letters, punctuation, etc.?
• ASCII
– American Standard Code for Information Interchange
– Based on English alphabet (upper and lower case) + space + digits + punctuation + control (Teletype Model 33)
– Question: how many bits do we need?
– 7 bits + 1 bit parity
– Remember everything is in binary …

C: char

Teletype Model 33 ASR Teleprinter (Wikipedia)
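The "how many bits" question can be worked out by counting the characters listed above. The Python below is a sketch added for illustration; it uses the standard `string` module, and the count of 33 control codes (0-31 plus 127) is stated as a known property of ASCII rather than taken from the slide.

```python
import string

printable = (
    len(string.ascii_letters)    # upper and lower case
    + len(string.digits)
    + len(string.punctuation)
    + 1                          # space
)
control = 33                     # codes 0-31 plus 127 (DEL)
total = printable + control

bits_needed = (total - 1).bit_length()   # bits to number codes 0 .. total-1
print(printable, total, bits_needed)
```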


Introduction: data types

Order is important in sorting!

0-9: there’s a connection with BCD. Notice: code 30 (hex) through 39 (hex)
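Both remarks can be tried out in Python. This is an illustrative sketch with made-up variable names: the low nibble of each ASCII digit code is the digit's BCD value, and plain string sorting follows code order, so upper case sorts before lower case.

```python
digits = "0123456789"

codes = [format(ord(d), "02X") for d in digits]   # ASCII codes in hex
low_nibbles = [ord(d) & 0x0F for d in digits]     # keep only the low 4 bits

words = ["banana", "Apple", "apple", "Banana"]
by_code = sorted(words)                           # compares character codes

print(codes)
print(low_nibbles)
print(by_code)
```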


Introduction: data types

• Parity bit:
– transmission can be noisy
– parity bit can be added to ASCII code
– can spot single bit transmission errors
– even/odd parity:
• receiver understands each byte should be even/odd
– Example:
• 0 (zero) is ASCII 30 (hex) = 0110000
• even parity: 01100000, odd parity: 01100001
– Checking parity:
• Exclusive or (XOR): basic machine instruction
– A xor B true if either A or B true but not both
– Example:
• (even parity 0) 01100000 xor bit by bit
• 0 xor 1 = 1, xor 1 = 0, xor 0 = 0, xor 0 = 0, xor 0 = 0, xor 0 = 0, xor 0 = 0

x86 assembly language:
1. PF: even parity flag set by arithmetic ops.
2. TEST: AND (don't store result), sets PF
3. JP: jump if PF set

Example:
MOV al, <char>
TEST al, al
JP <location if even>
<go here if odd>
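The bit-by-bit XOR check can also be sketched in Python. The function names are invented for this example, and appending the parity bit after the 7 ASCII bits is an assumption chosen to match the slide's example.

```python
def parity(bits):
    """XOR all bits of an integer together: 0 if the number of 1s is even."""
    result = 0
    while bits:
        result ^= bits & 1   # fold in the lowest bit
        bits >>= 1
    return result


def add_even_parity(code7):
    """Append a parity bit to a 7-bit code so the total number of 1s is even."""
    return (code7 << 1) | parity(code7)   # assumption: parity bit goes last


sent = add_even_parity(ord("0"))      # ASCII 30 (hex)
received = sent ^ 0b00000100          # simulate one flipped bit in transit

print(format(sent, "08b"), parity(sent), parity(received))
```

A result of 1 on the receiving side signals that a single bit was corrupted.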


Introduction: data types

• UTF-8
– standard in the post-ASCII world
– backwards compatible with ASCII
– (previously, different languages had multi-byte character sets that clashed)
– Universal Character Set (UCS) Transformation Format 8-bits

(Wikipedia)


Introduction: data types

• Example:
– あ Hiragana letter A: UTF-8: E38182
– Byte 1: E = 1110, 3 = 0011
– Byte 2: 8 = 1000, 1 = 0001
– Byte 3: 8 = 1000, 2 = 0010
– い Hiragana letter I: UTF-8: E38184

Shift-JIS (Hex):
あ: 82A0
い: 82A2
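These byte values can be reproduced with Python's built-in codecs. The snippet is a sketch for illustration; the manual decoding step relies on the standard three-byte UTF-8 layout 1110xxxx 10xxxxxx 10xxxxxx, which is not spelled out on the slide.

```python
a = "あ"

utf8 = a.encode("utf-8")         # three bytes
sjis = a.encode("shift_jis")     # two bytes, a different encoding entirely

# Strip the UTF-8 marker bits and glue the payload bits back together.
b1, b2, b3 = utf8
code_point = ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)

print(utf8.hex(), sjis.hex(), hex(code_point))
```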


Introduction: data types

• How can you tell what encoding your file is using?
• Detecting UTF-8
– Microsoft:
• the first three bytes in the file are EF BB BF
• (not all software understands this; not everybody uses it)
– HTML:
• <meta http-equiv="Content-Type" content="text/html;charset=UTF-8" >
• (not always present)
– Analyze the file:
• Find non-valid UTF-8 sequences: if found, not UTF-8…
• Interesting paper:
– http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html
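Two of these checks are easy to sketch in Python: looking for the EF BB BF marker and testing whether the bytes form valid UTF-8. The function names are hypothetical, and passing the validity test only means the data could be UTF-8, not that it definitely is.

```python
import codecs


def has_utf8_bom(data):
    """True if the bytes start with EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)


def is_valid_utf8(data):
    """True if the bytes contain no invalid UTF-8 sequences."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(has_utf8_bom(b"\xef\xbb\xbfhello"))
print(is_valid_utf8("あ".encode("utf-8")))
print(is_valid_utf8("あ".encode("shift_jis")))   # Shift-JIS bytes are not valid UTF-8 here
```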


Introduction: data types

• Filesystem:
– different on different computers: sometimes a problem if you mount filesystems across different systems
• Examples:
– FAT32 (File Allocation Table): DOS, Windows, memory cards (limited to 4GB max file size)
– ExFAT (Extended FAT): SD cards (> 4GB files)
– NTFS (New Technology File System): Windows
– ext4 (Fourth Extended Filesystem): Linux
– HFS+ (Hierarchical File System Plus): Macs


Introduction: data types

• Filesystem:
– different on different computers: sometimes a problem if you mount filesystems across different systems
• Files:
– Name (Path from / root)
– Type (e.g. .docx, .pptx, .pdf, .html, .txt)
– Owner (usually the Creator)
– Permissions (for the Owner, Group, or Everyone)
– need to be opened (to read from or write to)
– Mode: read/write/append
– Binary/Text

In all programming languages: open command
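As one concrete instance of the open command, here is a Python sketch covering the write, append, read, and binary modes. The file name `example.txt` and the temporary directory are placeholders chosen for this example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "example.txt")   # placeholder file

with open(path, "w", encoding="utf-8") as f:   # write: creates or truncates
    f.write("first line\n")

with open(path, "a", encoding="utf-8") as f:   # append: adds to the end
    f.write("second line\n")

with open(path, "r", encoding="utf-8") as f:   # read as text (str)
    text = f.read()

with open(path, "rb") as f:                    # read as binary (bytes)
    raw = f.read()

print(text.splitlines(), type(raw).__name__)
```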


Introduction: data types

• Text files:
– text files have lines: how do we mark the end of a line?
– End of line (EOL) control character(s):
• LF 0x0A (Mac/Linux)
• CR 0x0D (Old Macs)
• CR+LF 0x0D0A (Windows)
– End of file (EOF) control character:
• (EOT) 0x04 (aka Control-D)

(binaryvision.nl)

Programming languages: NUL used to mark the end of a string
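The three EOL conventions can be compared in Python, where `bytes.splitlines()` accepts all of them. The sample data below is invented for illustration.

```python
LF = b"\n"     # 0x0A
CR = b"\r"     # 0x0D

samples = {
    "mac_linux": b"one" + LF + b"two" + LF,
    "old_mac": b"one" + CR + b"two" + CR,
    "windows": b"one" + CR + LF + b"two" + CR + LF,
}

# splitlines() treats LF, CR, and CR+LF each as a single line ending.
lines = {name: data.splitlines() for name, data in samples.items()}

for name, data in samples.items():
    print(name, data.hex(), lines[name])
```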