Microcomputer Course Koby Gottlieb, 3/03, p-1 Microcomputer Course: x86 Architecture Basics Koby...

Microcomputer CourseKoby Gottlieb, 3/03, p-1

Microcomputer Course:x86 Architecture Basics

Koby Gottlieb, Intel IsraelSpring Semester 2003

Administration

Participants

Grading 65% exam, 35% exercises All exercises are mandatory! No exercise => no final grade

Web page: http://www.ee.technion.ac.il/~ee044800/ Latest version of slides Pointers to additional info

Material

Contacts Koby Gottlieb koby.gottlieb@intel.com 04-8655718 George Leifman [gleifman@techunix.technion.ac.il]

Objectives

Get acquainted with the X86 architecture and system components Historical perspective

Major features

SW/HW aspects

What you will not get here Expertise in X86 programming

Full details of everything

Expect Varying level of details

Hopefully: taste of more

Lecture 1: Introduction

History, Trends, & Terminology

Major Milestones in Microprocessors 1971 - 1975

Intel: 4004, 4 bit, 8008, 8 bit, 8080, 8 bit Motorola & other Still as a logic substitution

1975: Altair, 8080 based, 1st kit. Basic interpreter 1977: TRS80 of RadioShack

Apple, Apple-II 1980: VISICALC - 1st killer application

12 active companies 1981: IBM-PC, w/8088. MS-DOS, IBM PC/XT later Open standard! 1982: 80286 CPU 1984: IBM PC/AT, 80286 based 1985: i386 CPU 1987: i386 based PC 1989: i486 CPU, Windows! 1991: i486 PC, x86 competitors 1993: Pentium® processor 1994: Pentium based PC 1995: Pentium Pro processor 1996: Pentium Pro based PC Windows 95, Windows NT 1996: Pentium w/MMXTM Technology + PC 1997: Pentium-II® processor + PC, Cyrix MediaGX Did not mention buses, networking, etc...

Trends Phase-I (-1970): “Enter the game”

Technology progress

Phase-II (1970-1990): “Imitate the big guys”Use technology and copy ideas from mainframes Enlarge word size FP hardware Paging, Protection, Caching High integration

Phase-III (1990-): “Lead the game”Innovation unique to microcomputers: Super-scalar, out of order execution Branch prediction High frequency networking Graphics

Diverging again?Net-PC, thin client, connected PC, High integration, etc...

Representative Data

ScreenYear Xistors MIPS Memory Disk Screen

Memory___________________________________________________

1975 10K 0.2 <64K - Mono 2K

1985 275K 5.5 1M <10MB CGA 14” 64K

1995 5.5M 300 16M >500MB SVGA 17” 1MB___________________________________________________

Factor 20 40 16 50 - 16

1998 7.5M 600 32M 4GB “ 4MB

Performance Observations

Performance is doubled every 18-24 months

The “Spiral” principle (A. Grove):

New software is written to the fastest available processor. Actually, it takes the next generation of processors to run this software (reasonably).

There will always be applications that will use all available computing power.

<well, some people question that…>

"My nightmare is to wake up some day and not need any more computing power." Gordon Moore, Feb 1998

Computer System

Class focus CPU:

General architectureMajor components

System:Buses (CPU, ISA, PCI, USB) InterruptsMemory hierarchy

PCI BUS

Host/PCICacheBridge

Memory

ScsiAdap

ScsiBus

LanAdap

USBHub

KeyBoard,Monitor,Mouse,Speakers

GraphAdapt

VideoBuffer

Architecture & Microarchitecture

Architecture:“The science, art, or profession of designing and constructing building, bridges, etc...” Focus on how it looks externally Consider how it can be built and how good it performs!

Construction:“The way in which something is built, formed or devised by fitting parts or elements together systematically” Focus on how it is built/operated Make sure it works and works correctly!

Definition applicable also for buildings, cars, computers, etc...In computers: construction == architecture

Architecture & Microarchitecture (cont.)

Architecture:The collection of features of a processor (or a system) as they are seen by the “user”

User: a binary executable running on that processor, or

assembly level programmer

MicroArchitecture:The collection of features or way of implementation of a processor (or a system) do not affect the user.

Features which change Timing/performance are considered microarchitecture.

Architecture & Microarchitecture Elements

Architecture: Registers data width (8/16/32) Instruction set Addressing modes Addressing methods (Segmentation, Paging, etc...) Protection etc…

Architecture Bus width Physical memory size Caches & Cache size Number of execution units Execution Pipelines, Number & type execution units Branch prediction TLB etc...

Timing is considered Architecture (though it is user visible!)

Compatibility

Processor A (or system A) is compatible to processor B (or system B) if it preserves the same architecture features as seen by a user of B

SW compatible - most important for us! The ability to run existing SW!

HW compatible Upward compatible “Exactly” compatible (sometimes - “Downward” compatible)

(Required more for application, less for OS) Architecture Extension: addition of data-types, instructions,

etc… to enable better support of new/existing applications.e.g. MMX

Trends: 1985-1955: no/minimal architecture changes 1995- : we see them coming (MMX, SSE etc.)

Basic Architecture

Overview Data types Registers Stack Flags Segmentation Instructions:

Format Memory operands Instruction Set

IA (X86) evolution - Detailed

Year Proce-ssor

bits bus PhMem

MHZ MIPS FP Cache Prot Page Tran Tech P W Feature

1978 8086 16 16 1M 5/10 .3/.8 8087 - - - 29K 3u 40 2.5/1.0 New, 16 bit

1979 8088 16 8 1M 5/8 -"- 8087 - - - -"- 3u 40 2.5/1.0 8 bit bus

1982 80186 16 16 1M 80187 - - - 56K More integration

1982 80286 16 16 16M 8/12.5 1.2/2.7 80287 - + - 134K 68 3.3/1.1 Protection, 16M

1985 80386 32 16/32 4G 16/33 5.0/11.4 80387 - + + 275K 132 2.7 32 bit, paging

1989 i486 32 32 4G 25/100 20/80 on/487 8K + + 1.2M 1.2/.8 168 4.5 FP on chip,Cache, “RISC”

1993 Pentium 32 64 4/64G 66/200 112/400 on 2*8K + + 3.1M .8/.6 273 10/15 Dual Pipe, FPpipe, DP support

1995 P6 32 64 4/64G 150/200 400 on 2*8K+256K + + 5.5M 0.6 387 14-18 Dynamicexecution

1996 P55C 32 64 4/64G 200/233 466 on 2*16K + + >5M 0.6/0.35 273 8-10 MMX Technology

1997 PentiumII

32 64 4/64G 233/300 600 on 2*16K+512Kon SECC

+ + ~7.5M 0.35 242(SEC)

Each major generation is 2-3X faster than its predecessor!

Operation Modes

Architecture reflects this evolution!

Evolution => several modes: 16 bit:

Real Mode (8086 and up) Protected mode (286 and up) Virtual 86 mode (386 and up)

32 bit: Protected mode (386 and up) Paging enabled/disabled

Proce-ssor

RealMode

Prot-16 Prot-32

286 + +

386+ + + +

Philosophical Basis of First Wave Architecture

HW / Instruction level Support for high level languages

Instruction Set encoding for memory efficiency Two-Operand Instructions

reg-reg, reg-imm, reg-memory, read-modify-write

Specialized and Implicit Register Usage

Complex Instructions to maximize functionality / instruction fetch

Memory Segmentation Relocatable programs / modules well supported

Addressing modes suitable for high-level languages Stack and Data spaces in instruction set architecture

Push / Pop Addressing in Stack Space

Structure and Array element addressing syllables

Lecture 2: x86 Architecture Basics

Basic Data Types

WORDHIGHBYTE

LOWBYTE

15 7 0

Address N+1

Address N

DOUBLEWORD

QUADWORD

HIGH WORD LOW WORD

Address N+3

Address N+2

Address N+1

Address N

31 15 0

LOW DOUBLEWORD

63 47 31 15 0

AddressN+7

AddressN+6

AddressN+5

AddressN+4

Address N+3

Address N+2

AddressN+1

Address N

Figure 3-2 Fundamental Date Types

AddressN

HIGH DOUBLEWORD

Memory Layout Little Endian - LSB in lower addresses

Figure 3-3 Bytes, Words, Doublewords and Quadwords in Memory

QUADWORD AT ADDRESS 6CONTAINS 7AFE06361FA4230BH

DOUBLEWORD AT ADDRESS OAHCONTAINS 7AFE0636H

WORD AT ADDRESS 0BHCONTAINS FE06H

BYTE AT ADDRESS 9CONTAINS 1FH

WORD AT ADDRESS 6CONTAINS 230BH

WORD AT ADDRESS 2CONTAINS 74CBH

WORD AT ADDRESS 1CONTAINS CB31H

Extended Data Types7 0

Figure 3-4. Data Types

BYTE INTEGER7 BIT MAGNITUDE1 BIT SIGN

WORD INTEGER15 BIT MAGNITUDE1 BIT SIGN

DOUBLEWORD INTEGER31 BIT MAGNITUDE1 BIT SIGN

BYTE ORDINAL8 BIT MAGNITUDE

WORD ORDINAL16 BIT MAGNITUDE

DOUBLEWORD ORDINAL32 BIT MAGNITUDE

BCD INTEGER4 BIT DIGIT PER BYTE

PACKED BCD INTEGER4 BIT DIGIT PER HALF-BYTE

NEAR POINTER32/16 BIT OFFSET

FAR POINTER32/16 BIT OFFSET16 BIT SELECTOR

BIT FIELDUP TO 32 BITS

BIT STRINGUP TO 4 GIGABITS

BYTE STRINGUP TO 4 GIGABYTES

31 15 0Pointers w/ 16 Bit Offsets

Registers

8 “General” 32 bit registers: EAX, EBX, ECX, EDX, ESI, EDI, EBP, ESP

Lowest 16 bits can be specified as AX, BX, ... Bits 0-7, 8-15 of first four registers can be addressed as bytes:

AL, AH, BL, BH, CL, CH, DL, DH

Special registers: eFLAGS, eIP Segment registers: CS, DS, ES, SS, FS, GS 8 Floating Point registers + FP control & status 8 MMX registers Various OS registers (trace, debug, control, status).

Not discussed here

* Throughout this presentation: eREG = REG / EREG, according to 16/32 bit mode

Basic Registers

Figure 3-5. Application Register Set

GENERAL REGISTERS

31 23 15 7 0

16-BIT

32-BIT

15 0SEGMENT REGISTERS

STATUS AND CONTROL REGISTERS31 0

EFLAGS

Floating Point

FPU DATA REGISTERSTAGFIELD

SIGNIFICANDEXPONENTSIGN

79 78 64 63 0 1 0FP7

CONTROL REGISTER

STATUS REGISTER

TAG WORD

15 0 47

INSTRUCTION POINTER

DATA POINTER

Figure 6-1. Floating-Point Unit Register Set

MMX (1)

Packed bytes (8x8 bits)

63 56 55 48 47 40 39 32 31 24 23 16 15 8 7 0

Packed word (4x16 bits)

63 48 47 32 31 16 15 0

Packed doublewords (2x32 bits)63 32 31 0

Quadword (64 bits)

Figure 2-3. Packed Data Types

MMX (2)

63 56 55 48 47 40 39 32 31 24 23 16 15 8 7 0

Memory address 1008h Memory address 1000h

Figure 2-4. Eight Packed Bytes in Memory (at address 1000H)

Figure 2-2. MMX Register Set

Streaming SIMD Extensions

Packed, single-precision FP (4x32 bits)

127 96 95 64 63 32 31 0

Streaming SIMD Extensions Register Set

Stack Related instructions: Push/Pop, Call/Ret Used in interrupt handling

TOP OF STACK

STACK SEGMENT31 0

SUBROUTINEPASSED

VARIABLES

BOTTOM OF STACK(INITIAL ESP VALUE)

PUSHES PUT THETOP OF STACK ATLOWER ADDRESSES

POPS PUT THE TOP OF STACK ATHIGHER ADDRESSES

Figure 3-8. Stacks

Affected by ALU operations, e.g.,: ADD, CMP

Used mainly for control flow: Jcc

But also by other instructions ADC, RLC, etc...

Table 3-2. Status FlagsNAME PURPOSE CONDITIONS REPORTED

OF OVERFLOW Result exceeds positive or negative limit of numbers range

SF Sign Result is negative (less then zero)

ZF Zero Result is zero

AF Auxiliary carry Carry out of bit position 3 (used for BCD)

PF Parity Low byte of resutl had even parity (even number of set bits)

CF Carry flag Carry out of most significant bit of result

FIGURE 3-9. EFLAGS Register

0000000000

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

X ID FLAG (ID)X VIRTUAL INTERRUPT PENDING (VIP)X VIRTUAL INTERRUPT FLAG (VIF)X ALIGNMENT CHECK (AC)X VIRTUAL 8086 MODE (VM)X RESUME FLAG (RF)X NESTED TASK (NT)X I/O PRIVILEGE LEVEL (IOPL)S OVERFLOW FLAG (OF)C DIRECTION FLAG (DF)X INTERRUPT ENABLE FLAG (IF)X TRAP FLAG (TF)S SIGN FLAG(SF)S ZERO FLAG (ZF)S AUXILIARY CARRY FLAG (AF)S PARITY FLAG (PF)S CARRY FLAG (CF)

S INDICATES A STATUS FLAGC INDICATES A CONTROL FLAGX INDICATES A SYSTEM FLAG

BIT POSITIONS SHOWN AS O OR 1 ARE INTEL RESERVED.DO NOT USE. ALWAYS SET THEM TO THE VALUE PREVIOUSLY READ.

Segmentation (1) Problem: use “large” memory with short word width

16 bits can cover only 64KB

Solution: address space divided into “segments” Logical address is segment:offset Linear address: address in the “flat” address range Segmentation: mapping from Logical to Linear

Segment base = f(segment register) offset < segment size: 64K (8086/286), 4G (386+) linear address = segment_base + offset

At any time. there are 4 (8086) or 6 (386 and on) active segment registers CS: code DS: data SS: stack ES: extra FS, GS: new extra

Rough analogy - phone directory: region and a number within

Segmentation (2)

Figure 3-7. A Segmented Memory

DIFFERENT LOGICAL SEGMENTSDIFFERENT ADDRESS SPACE

IN PHYSICAL MEMORY

CODESEGMENT

STACKSEGMENT

DATASEGMENT

Figure 3-6. An Unsegmented Memory

DIFFERENT LOGICAL SEGMENTS

GSFSESDSCSSS

ONE PHYSICAL ADDRESS SPACE

Instruction Set Architecture

Data transfer Common ALU operations (add, sub, cmp...) Control flow (jmp, call, jcc) String (movs, cmps, sets - using eSI, eDI, eCX) FP instructions (also SSE variant) MMX instructions OS support (task switch, call gates, table handling, etc.) I/O instructions

“Exotic” instructions (DAA, XLAT, etc.)

Data Transfer Instructions MOV Byte / Word / Dword, register to / from register / memory

Also Zero/Sign extended moves

STACK OPERATIONS PUSH [SOURCE]

SP = SP - 2 ; ( SS : SP ) = SOURCE ; (16-bit)

ESP = ESP - 4 ; ( SS : ESP ) = SOURCE ; (32-bit)

POP [DEST]

DEST = ( SS : SP ) ; SP = SP + 2; (16-bit)

DEST = ( SS : ESP ) ; ESP = ESP + 4; (32-bit)

I/O instructions IN (I/O port specified by 8-bit immediate or DX)

EAX / AX / AL = Dword / Word / Byte from I/O port

OUT (I/O port specified by 8-bit immediate or DX)

I/O Port = Dword / Word / Byte from EAX / AX / AL

ALU and Shift Instructions

ALU Usual ALU ops on Byte / Word / Dword operands Integer Multiply and Divide Instructions

Multiply and Divide use implicit register pairs Chained add / subtract for multi-precision arithmetic Conversion ops for packed and unpacked decimal math

SHIFTS and ROTATES Logical and arithmetic shifts Rotates through / not through C bit By constant (1) or by count in CL

Later extended to immediate operand in 80386

Control Flow

Assortment of conditional jumps Jcc Byte/Word/Dword displacement CS not modified, eIP = eIP + signed displacement *

JMP SHORT - one byte displacement NEAR - two/four byte displacement FAR - Across Segments (Loads CS : eIP)

CALL Like JMP Pushes next eIP in SHORT and NEAR forms Pushes CS : eIP in FAR forms

RET (Return) Pops into eIP in NEAR form Pops into CS : eIP in FAR form

* eIP = IP / EIP, according to 16/32 bit mode

Add examples

String Instructions String operation components

Opcode specifying ES:eDI op DS:eSI Typically preceded by REP (Repeat) Prefix

for ( ; Repeat Condition; eCX -- )

ES : eDI op DS : eSI

eDI = eDI + d*oprnd_size ; eSI = eSI + d*oprnd_size

d = 1 or -1 Value selected by Direction Flag

OP can be move, compare, find … oprnd_size is one of 1/2/4

Repeat Condition is test of Zero Flag Can be test if set or test if clear Zero Flag set by CX == 0 or by op (CMPS)

* eREG = REG / EREG, according to 16/32 bit mode

Small ExamplesRepeat Mov

struct str {char c[1000];}src,dst;void move()

{ dst=src;}

COMM_src:BYTE:03e8HCOMM_dst:BYTE:03e8H...; 2 : {

push ebpmov ebp, esppush esipush edi

; 3 : dst=src;mov ecx, 250mov esi, OFFSET FLAT:_srcmov edi, OFFSET FLAT:_dstrep movsd

; 4 : }pop edipop esipop ebpret 0

Mov, ALU, Jmp

int array[100];int sum=0;

void sum_array(){ register int i;

for(i=0;i<100;i++)sum+=array[i];

COMM_array:DWORD:064H_sum DD 01H DUP (?); 5 : register int i;; 6 : for(i=0;i<100;i++)

mov eax, OFFSET FLAT:_array$L28:; 7 : sum+=array[i];

mov ecx, DWORD PTR [eax]add eax, 4add DWORD PTR _sum, ecxcmp eax, OFFSET FLAT:_array+400jl SHORT $L28

; 8 : }

Instruction Format (1) Variable length instructions! Remember - only 2 explicit operands - source and destination

may be the same Machine language

Zero or more prefixes Opcode Mod/Rm Displacement Immediate

8/16/32 Handling Instruction encoding differs for 8-bit and 16/32 bit operands Same instruction encoding is used for 16 and 32 bit operands Current code segment defines if processor runs 16 or 32 bit code In 16 bit code “word” operands indicate 16 bit operand,

In 32 bit code “word” operands indicate 32 bit operand. This can be toggled by the operand-size prefix (0x66)

Instruction Format (2)

Figure F-1. Core Instruction Format

d32|16|8| none d32|16|8|none{ {TTTTTTTT TTTTTTTT mod TTT r/m ss index base

7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 7-6 5-3 2-0 7-6 5-3 2-0

{ { {{

opcode(one or two bytes)(T represents an

opcode bit.)

mod r/mbyte

s-i-bbyte

register and addressmode specifier

addressdisplacement(4, 2, 1 bytes

or none)

immediatedata

(4, 2, 1 bytesor none)

Figure 25-1. Instruction Format

1 OR 2 0 OR 1 0 OR 1 0,1,2 OR 4

NUMBER OF BYTES

0,1,2 OR 4

INSTRUCTIONPREFIX

ADDRESS-SIZE PREFIX

OPERAND-SIZE PREFIX

SEGMENTOVERRIDE

0 OR 1 0 OR 1 0 OR 1 0 OR 1

NUMBER OF BYTES

REP PREFIX

LOCK PREFIX

0 OR 1 0 OR 1

OPCODE MOD R/M SIB DISPLACMENT IMMEDIATE

Assembly Version

MS/Intel “programming tool” Case insensitive Operand size implicit Destination on left

Unix “compiler back-end” Case sensitive Operand size explicit Destination on right

Assembler should be smart enough to handle prefix when 16 bit operand is used in 32 bit code...

Example of ASM FormatIntel/Microsoft tools:

- op dest, src- implied width => requires adding BYTE PTR if not clear- many other goodies (structures style)

MOV EAX, ECX ; EAX:=ECXMOV BYTE PTR GS:[EAX]+4, 0 ; GS:[EAX+4]:=0

REP MOVSB REP MOVS BYTE PTR ES:[EDI], BYTE PTR DS:[ESI]

ADD BX,[EAX+EDX*2]+4 ; 16 bit opsize ADD EBX, [BX+SI] ; 16 bit addr STRING:DB ’A’,’B’,’C’, 0

Unix tools:- op src, dest- explicit width- case sensitive- less flexible

movl %ecx,%eaxmovb 0, %gs:4(%eax)

rep movsbaddw 4(%eax,%edx,2),%bxaddl (%bx,%si),%ebx

L001: .4byte “abc”,’0’

Instruction Format Example strncpy code listing

section .textstrncpy() 0: 55 pushl %ebp 1: 8b ec movl %esp,%ebp 3: 83 ec 0c subl $0xc,%esp 6: c7 45 fc 00 00 00 00 movl $0x0,-4(%ebp) d: 2b d2 subl %edx,%edx f: 3b 55 10 cmpl 16(%ebp),%edx 12: 73 26 jae 0x26 <3a> 14: 8d 45 fc leal -4(%ebp),%eax 17: 89 04 24 movl %eax,(%esp) 1a: e8 fc ff ff ff call 0xfffffffc <1b> 1f: 8b 45 0c movl 12(%ebp),%eax 22: 8b 55 fc movl -4(%ebp),%edx 25: 8a 04 10 movb (%eax,%edx),%al 28: 8b 4d 08 movl 8(%ebp),%ecx 2b: 88 04 11 movb %al,(%ecx,%edx) 2e: 8b 45 fc movl -4(%ebp),%eax 31: 40 incl %eax 32: 89 45 fc movl %eax,-4(%ebp) 35: 3b 45 10 cmpl 16(%ebp),%eax 38: 72 da jb 0xda <14> 3a: 8b 45 0c movl 12(%ebp),%eax 3d: 8b e5 movl %ebp,%esp 3f: 5d popl %ebp 40: c3 ret main() 41: 83 ec 0c subl $0xc,%esp 44: c7 04 24 00 00 00 00 movl $0x0,(%esp) 4b: c7 44 24 04 00 00 00 00 movl $0x0,4(%esp) 53: c7 44 24 08 04 00 00 00 movl $0x4,8(%esp) 5b: e8 a0 ff ff ff call 0xffffffa0 <0> 60: a3 00 00 00 00 movl %eax,0x0 65: 83 c4 0c addl $0xc,%esp 68: c3 ret

Encoding and Prefix example

disassembly for 6.o

section .text 0: a3 00 00 00 00 movl %eax,0x0 5: 66 a3 00 00 00 00 movw %eax,0x0 b: a2 00 00 00 00 movb %al,0x010: 89 1d 00 00 00 00 movl %ebx,0x016: 66 89 1d 00 00 00 00 movw %bx,0x01d: 88 1d 00 00 00 00 movb %bl,0x023: 89 19 movl %ebx,(%ecx)25: 66 89 19 movw %bx,(%ecx)28: 88 19 movb %bl,(%ecx)2a: 67 89 19 movl %ebx,(%bx,%di)2d: 67 66 89 19 movw %bx,(%bx,%di)31: 67 88 19 movb %bl,(%bx,%di)34: a4 movsb (%esi),(%edi)35: f3 a4 repz movsb (%esi),(%edi)37: ff 01 incl (%ecx)39: f0 ff 01 lock incl (%ecx)3c: 64 ff 01 incl %fs:(%ecx)3f: 66 ff 01 incw (%ecx)42: 67 ff 01 incl (%bx,%di)45: f3 ff 01 repz incl (%ecx)48: f3 f0 64 66 67 ff 01 repz lock incw %fs:(%bx,%di)4f: c3 ret

Lecture 3: Floating Point & MMX

Floating Point

Supports single, double and extended memory format (32, 64 and 80 bits), as well as other exotic formats (BCD, integers)

8 registers arranged as a stack

A stack entry (register) is always 80 bits; conversion done during load/store only

Each operation works on top of stack and optionally on another element (another stack entry of memory). Stack top moves!

Supports all IEEE-754 operations (fadd, fsub, fmul, fdiv, fsqrt, fcmp...) and many others (trig, transcendentals, etc...)

Separate unit (“co-processor”) for 8086, 286, 386. On-chip since i486

Exception model reflects this history

Big performance boost in i486, large in Pentium, huge in Pentium-II

Floating Point Formats General Format: (-1)S 2E (b0b1b2b3..bp-1)

Where:S = 0 or 1E = any integer between Emin and Emax, inclusivebi = 0 or 1p = number of bits of precision

SpecificsParameter Single Double ExtendedFormat width in bits 32 64 80p (bits of precision) 24 53 64 2’s

complementExponent width in bits 8 11 15 Biased formatEmax +127 +1023 +16383Emin -126 -1022 -16382Exponent bias +127 +1023 +16383

b0 is implied in single/double, explicit in extended Example of single precision number:

Significand (Mantissa)S Exponent

Floating Point Formats (cont.) Example

Ordinary Decimal 178.125

Scientific Decimal 1.78125E2

Scientific Binary 1 0110010001 E111

Scientific Binary (Biased Exp +127) 1 0110010001 E10000110

Single Format (Normalized)

Sign 0

Biased Exponent 10000110

Significand 01100100010000000000000

1 (implicit)

Special numbers:Exponent is all 0’s or all 1’s: Zero - the entire number is 0

There’s a negative zero, too! Denormals - (+/-) very small numbers, implicit 1 is not assumed Infinite - (+/-) Special rules for 0*inf, etc... NaN - Not a Number - Signaling NaNs and Quite NaNs

Used to propagate errors in computations when exceptions are masked

Floating Point - Control

Rounding modes Precision control Exception Masks

Rounding Control00 - Round to Nearest or Even01 - Round Down (Toward -)10 - Round Up (Toward +)11 - Chop (Truncate Toward 0)

Precision control00 - 24 Bits (Single Precision)01 - (Reserved)10 - 53 Bits (Double Precision)11 - 64 Bits (Extended Precision)

Reserved(Infinity Control)*Rounding ControlPrecision Control

ReservedException Mask

PrecisionUnderflowOverflowZero DivideDenormalized OperandInvalid Operation

XPCRCX

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

* Infinity Control has no effect on 386 and on. Used for compatibility with 8086 & 286

Figure 6-3. FPU Control Word Format

Floating Point - Status

Exception Flags Top Of Stack

position Complex exception

model and handling.Not discussed.

Figure 6-3. FPU Status Word Format

ES is set if any unmasked exception bit is set; Cleared otherwise.

Top values:000 = Register 0 is top of stack001 = Register 1 is top of stack....111 = Register 7 is top of stack

FPU BusyTop of Stack PointerCondition Code

Error Summary StatusStack FaultException Flags

PrecisionUnderflowOverflowZero DivideDenormalized OperandInvalid Operation

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Floating Point - Status

Flag each stack value as Valid, Empty, Zero, Special

Tag Values00 = Valid01 = Zero10 = Special: Invalid (NaN, Unsupported, Infinity, or Denormal) 11 = Empty

Tag(7) Tag(6) Tag(5) Tag(4) Tag(3) Tag(2) Tag(1) Tag(0)

Figure 6-4. Tag Word Format

Floating Point ExampleCOMM _x1:QWORD ; S E P

_a DD 040b00000r ; 5.5 0 81 (1.)6000_b DD 040800000r ; 4 0 81 (1.)0_c DD 03e800000r ; 0.25 0 7B (1.)0_i2 DD 02H_$T32 DD 040800000r ; 4

; 8 : {

push ebpmov ebp, esp tos=0

; 9 : x1=(-b+sqrt(b*b-4*a*c))/(i2*a);

fld DWORD PTR _b ; b tos=7 fchs ;-bfld DWORD PTR _b ; b tos=6 fmul DWORD PTR _b ; b2

fld DWORD PTR $T32 ; 4 tos=5 fmul DWORD PTR _a ; 4afmul DWORD PTR _c ; 4acfsubp ST(1), ST(0) ; b2-4ac tos=6 fsqrt ; =sqrt(b2-4ac)faddp ST(1), ST(0) ;-b+ tos=7 fild DWORD PTR _i2 ; 2 tos=6 fmul DWORD PTR _a ; 2*afdivp ST(1), ST(0) ; (-b+)/2a tos=7 fstp QWORD PTR _x1 ; store tos=0

; 10 : }

pop ebpret 0

// solving a quadratic equationfloat a=5.5;float b=4;float c=.25;int i2=2;double x1;

void main(){ x1=(-b+sqrt(b*b-4*a*c))/(i2*a);}

Note: Stack movement Data type used Number

representation

An extension to the X86 Architecture Useful mainly for multimedia applications Technique: Single Instruction, Multiple Data (SIMD) with

existing registers New data types - packed 64 bit consists of either

8 bytes 4 words 2 doublewords

Instructions to operate on this data Software model to allow easy migration with existing OS

MMX Instructions and Data Types

Operations (57 instructions) Arithmetic (saturating & wrap-around)

– Add– Subtract– Multiply– Multiply-add (for fast multiply

accumulate) Conversions

– Pack from large to smaller words– Unpack from small to large words

Data Types Packed (fixed point) integers Signed, Unsigned Bytes, Words, D-Words

8 Packed Bytes

4 Packed Words

2 Packed D-words

Logical– And– Or– Shifts– Compares (including arithmetic)

Transfers – Register to register– Load from and store to memory

Integration into Intel Architecture 8 new registers - MM0- MM7 Uses previously reserved opcodes Uses same addressing modes

Uses same 2 operand encoding scheme– One is src/dest– Other src may come from memory

Example of Operations

Paddusw: Add Unsigned Packed word with Saturate

a3 a2 a1

b3 b2 b1+ + + +

a3+b3 a2+b2 a1+b1

Packed Add

Paddw: Add Packed word

a3 a2 a1

b3 b2 b1+ + + +

a3+b3 a2+b2 a1+b1

ShiftPsllq: Shift left logical all 64-bit

a3 a2 a1

a1 a0a2 0000h

Multiply-AddPmaddwd: multiply and add

a3 a2 a1 a0

b3 b2 b1 b0** **

a3*b3+a2*b2a1*b1+a0*b0

Example Operations

Multiply-AddPmaddwd: multiply and add

a3 a2 a1 a0

b3 b2 b1 b0** **

a3*b3+a2*b2a1*b1+a0*b0

Multiply-HighPmulhw: multiply high

a3 a2 a1 a0

b3 b2 b1 b0

c3 c2 c1 c0

a0*b0a1*b1a2*b2a3*b3

High order 16 bits

Multiply-LowPmullw: multiply low

a3 a2 a1 a0

b3 b2 b1 b0

c3 c2 c1 c0

a0*b0a1*b1a2*b2a3*b3

low order 16 bits

FP/MMX view of registers

TOSStatusWord

MM Tag

FPU Tag063

MM0 TOS=0

MMX registers MM0-MM7 overlap FP registers All tags are valid w/ MMX TOS is 0

Pros No change of context Easy OS migration

Cons Difficult to Intermix MMX

& FP Need to signal MMX to

FP move(EMMS: clear tags)

Example: Conditional Select / Branch Removal

PCMPEQW

X4 !=greenX2 != greenX1=green X3=green

greengreengreen green

0x0000 0xFFFF 0xFFFF 0x0000

X1 X2 X4X3 PAND

0x0000 0xFFFF 0xFFFF 0x0000

Y1 Y2 Y4Y3

POR 0x0000 0x0000

finished pixels

0x0000 0x0000

if (X[i] != green) then new_image[i] = X[i] else new_image[i] = Y[i]

Packed Comparison (Pcmpcc) and the logical instructions enable conditional select operations in parallel and without data dependent branches.

X Y new_image

MOVQ MM1, XPCMPEQW MM1, GREENMOVQ MM2, MM1PANDN MM1, XPAND MM2, YPOR MM1, MM2MOVQ New, MM1

Streaming SIMD Extensions

New technology to exploit parallelism in FP and INT applications Key capabilities

Packed Operations Branch Removal/Compression Data Movement/Hints FP/INT Type Conversion

Application Domains… 3D Graphics: geometry, lighting signal processing high precision simulation & modeling video encoding/decoding other apps requiring streaming input and output

SIMD FP instructions types

arithmetic square root approximation instructions min & max loads & stores

move mask compare & set mask logical compare & set eflags conversion

Instruction CategoriesSIMD FP - 4 wide, single-precision, 2 wide double precision SIMD INT - extensions to MMX™ technology capabilities, Extending all the MMX instruction to 128 bit using the new XMM registersCacheability controlState management

New SIMD FP Registers

eight 128-bit, 4x32bit single-precision FP2x64bit single-precision FP, 128 bit MMX

•New set of registers!•Direct access•Used for data only•Extended processor state

Packed SP Data Type Each register holds 4 single-precision FP values or 2 double precision IEEE-754 compatible Scalar operates on least-significant number

1-bit sign8-bit exponent23-bit mantissa

22233031

1-bit sign10-bit exponent53-bit mantissa

Packed Operations

a0a1a2a3

b0b1b2b3

a0 op b0a1 op b1a2 op b2a3 op b3

op is one of•addps addpd •subps subpd•mulps mulpd•divps divpd

Scalar Operations

a0a1a2a3

b0b1b2b3

a0 op b0a1a2a3

op is one of•Addss addsd•subss subsd•nulss mulsd•divss divsd

SIMD Data Organization

Exploit vertical parallelism SOA versus AOS

Array of Structures

Structure of Arrays

X0, X1, X2, X3, … Y0, Y1, Y2, Y3, … Z0, Z1, Z2, Z3, ...

X0, Y0, Z0, X1, Y1, Z1, X2, Y2, Z2, …

Better cacheability

Better SIM

D calculations

Matrix Vector Multiply example

Typical 3D operation Load values in SOA format

xxxx…, yyyy…, zzzz…

Follow with multiply and add operations

movaps xmm0, [list+X+ecx] ;load X components

movaps xmm2, [list+Y+ecx] ;load Y components

movaps xmm3, [list+Z+ecx] ;load Z components

movaps xmm1, [esi+m00] ;m00 m00 m00 m00

movaps xmm4, [esi+m01] ;m01 m01 m01 m01

Loop back and pick up next 4 vertices…mulps xmm1, xmm0 ;x*m00 x*m00 x*m00 x*m00

mulps xmm4, xmm2 ;y*m01 y*m01 y*m01 y*m01

addps xmm4, xmm1 ;add the 2 results

movaps xmm1, [esi+m02] ;load matrix element m02 (x4)

mulps xmm1, xmm3 ;z*m02 z*m02 z*m02 z*m02

addps xmm4, xmm1 ;add results

addps xmm4, [esi+m03] ;add last element of matrix

SIMD Integer Instructions

Extensions to MMXTM technology instructions Operate on same 64-bit registers as previous MMX technology

instructions, Or on the new 128 bit XMM registers Instructions: extract, insert, min/max, byte mask integer,

multiply high unsigned, shuffle multiply unsigned double, add and sub qword, average

Computing Model

Processinginput output

If you can’t bring data in fast enough or spit it out fast enough… SIMD is of little or no use

The “streaming” part of the Streaming SIMD Extensions is critical to overall performance

Prefetch

Hides latency by bringing in data before you need it Provide cache hints to fetch data to different cache levels

prefetchnta - prefetch non-temporal data to non-temporal cache (L1)

prefetcht0 - prefetch temporal data into both L1 and L2 caches prefetcht1, prefetcht2 - prefetch temporal data into L2 cache

Prefetch Illustrated

memoryL1

prefetchnta [esi]

Prefetch Illustrated (2)

memoryL1

prefetcht0 [esi]

Prefetch Illustrated (3)

memoryL1

prefetcht1 [esi]prefetcht2 [esi]

Warning about prefetch

Proper placing of prefetch is critical to insure that there’s enough time between prefetch and actual load limited resources to load data

Excessive use of prefetch can actually hurt performance Spread out prefetches and insure sufficient computation

before load

Streaming Stores

Avoids polluting cache with data that you don’t real soon (or ever)

Supports 128 and 64-bit versions movntps - from xmm reg to memory movntq - from mm reg to memory

If store is a cache hit, then cache is updated and not sent to memory

Weakly ordered Use sfence instruction to insure order

Streaming Store Illustrated

memory

movntps [esi]

No write allocation*

* If it was a cache hit,then data goes to cacheand is not writtendirectly to memory

AMD 64 bit extension

Starting with Opron (released April – 2003)

Lecture 4: Addressing and Instruction Encoding

Addressing Modes Compute the offset part of a memory reference

Direct Register (by name) Memory: 16 bit addressing modes

(Segment) + base + index + displacement disp(base,index) 8 combinations of base + index

BX, SI, DI, BX+SI, BX+DI, BP+SI, BP+DI, disp only Disp 0, 8, or 16 bit. Occupies 0, 1 or 2 bytes All combinations are encoded in the ModRM byte Examples: 8(bp); (bp,bx);

Memory: 32 bit addressing modes (Segment) + base + index*scale + disp

disp(base,index,scale) Base: each of the 8 general registers or none Index: each of 7 general registers (no ESP!) or none Scale: 1,2,4,8 Index/scale requires an additional SIB bytes.

Base only can use ModRM byte only Disp: 0, 8 or 32 bit. Occupies 0, 1 or 4 bytes Examples: (eax); array(ecx,4); based_array(eax,ebx,4)

Addressing mode determined by code mode (16/32) Can be toggled using the address prefix (0x67)

Addressing Modes

Figure 25-2. ModR/M and SIB byte Formats

MOD REG/OPCODE R/M

SIB (SCALE INDEX BASE) BYTE

7 6 5 4 3 2 1 0

SS INDEX BASE

MODR/M BYTE

7 6 5 4 3 2 1 0

Figure 3-10. Effective Address Computation (32-bit vesrion)

{CSSSDSESFSGS

EAXECXEDXEBXESPEBPESIEDI

EAXECXEDXEBXEBPESIEDI

No Displacement8-bit Displacement32-bit Displacement

+ + * +{} }{}{}{ }

Segment + Base + (Index * Scale) + Displacement

Encoding Example

ModRM byte: MOD REG/OP Reg/Mem 7 6 5 4 3 2 1 0

SIB byte: SS INDEX BASE 83 ec 0c subl $0xc,%esp

89 03 movl %eax,(%ebx) / base onlyc7 45 fc 07 00 00 00 movl $0x7,-4(%ebp) / base, disp ; imm8b 55 fc movl -4(%ebp),%edx / base, disp ; reg89 45 fc movl %eax,-4(%ebp)3b 55 10 cmpl 16(%ebp),%edx89 04 24 movl %eax,(%esp) / base only espc7 04 24 08 00 00 00 movl $0x8,(%esp)c7 44 24 04 09 00 00 00 movl $0x9,4(%esp) / base+disp esp8a 04 10 movb (%eax,%edx),%al / base + index8a 04 50 movb (%eax,%edx,2),%al / base+scale*index8a 04 42 movb (%edx,%eax,2),%al / 88 04 11 movb %al,(%ecx,%edx)40 incl %eax72 da jb 0xda <14>5d popl %ebpc3 ret

MOD/RM-SIB Table: 16-bit addressing table

Table 11-1. 16-Bit Addressing Forms with the ModR/M Byter8(/r)r16(/r)r32(/r)/digit (Opcode)REG =

ALAXEAX0000

CLCXECX1001

DLDXEDX2010

BLBXEBX3011

AHSPESP4100

DHSIESI6110

BHDIEDI7111

EffectiveAddress Mod R/M ModR/M Values in Hexadecimal

[BX+SI][BX+DI][BP+SI][BP+DI][SI][DI]disp162

00 000001010011100101110111

0001020304050607

08090A0B0C0D0E0F

1011121314151617

18191A1B1C1D1E1F

2021222324252627

28292A2B2C2D2E2F

3031323334353637

38393A3B3C3D3E3F

[BX+SI]+disp83

[BX+DI]+disp8[BP+SI]+disp8[BP+DI]+disp8[SI]+disp8[DI]+disp8[BP]+disp8[BX]+disp8

01 000001010011100101110111

4041424344454647

48494A4B4C4D4E4F

5051525354555657

58595A5B5C5D5E5F

6061626364656667

68696A6B6C6D6E6F

7071727374757677

78797A7B7C7D7E7F

[BX+SI]+disp16[BX+DI]+disp16[BP+SI]+disp16[BP+DI]+disp16[SI]+disp16[DI]+disp16[BP]+disp16[BX]+disp16

10 000001010011100101110111

8081828384858687

88898A8B8C8D8E8F

9091929394959697

98999A9B9C9D9E9F

A0A1A2A3A4A5A6A7

A8A9AAABACADAEAF

B0B1B2B3B4B5B6B7

B8B9BABBBCBDBEBF

EAX/AX/ALECX/CX/CLEDX/DX/DLEBX/BX/BLESP/SP/AHEBP/BP/CHESI/SI/DHEDI/DI/BH

11 000001010011100101110111

C0C1C2C3C4C5C6C7

C8C9CACBCCCDCECF

D0D1D2D3D4D5D6D7

D8D9DADBDCDDDEDF

E0EQE2E3E4E5E6E7

E8E9EAEBECEDEEEF

F0F1F2F3F4F5F6F7

F8F9FAFBFCFDFEFF

1. The default segment register is SS for the effective addresses containing a BP index, DS for other effective addresses.

2. The “disp16” nomenclature denotes a 16-bit displacement following the ModR/M byte, to be added to the index.

3. The “disp8” nomenclature denotes an 8-bit displacement following the ModR/M byte, to be sign-extended and added to the index.

MOD/RM-SIB Table: 32-bit addressing table

Table 11-2. 32-Bit Addressing Forms with the ModR/M Byter8(/r)r16(/r)r32(/r)/digit (Opcode)REG =

ALAXEAX0000

CLCXECX1001

DLDXEDX2010

BLBXEBX3011

AHSPESP4100

CHBPEBP5101

DHSIESI6110

BHDIEDI7111

EffectiveAddress Mod R/M ModR/M Values in Hexadecimal

[EAX][ECX][EDX][EBX][--][--]1

disp32[ESI][EDI]

00 000001010011100101110111

0001020304050607

08090A0B0C0D0E0F

1011121314151617

18191A1B1C1D1E1F

2021222324252627

28292A2B2C2D2E2F

3031323334353637

38393A3B3C3D3E3F

disp8[EAX]3

disp8[ECX]disp8[EDX]disp8[EBX]disp8[--][--]disp8[EBP]disp8[ESI]disp8[EDI]

01 000001010011100101110111

4041424344454647

48494A4B4C4D4E4F

5051525354555657

58595A5B5C5D5E5F

6061626364656667

68696A6B6C6D6E6F

7071727374757677

78797A7B7C7D7E7F

disp32[EAX]disp32[ECX]disp32[EDX]disp32[EBX]disp32[--][--]disp32[EBP]disp32[ESI]disp32[EDI]

10 000001010011100101110111

8081828384858687

88898A8B8C8D8E8F

9091929394959697

98999A9B9C9D9E9F

A0A1A2A3A4A5A6A7

A8A9AAABACADAEAF

B0B1B2B3B4B5B6B7

B8B9BABBBCBDBEBF

EAX/AX/ALECX/CX/CLEDX/DX/DLEBX/BX/BLESP/SP/AHEBP/BP/CHESI/SI/DHEDI/DI/BH

11 000001010011100101110111

C0C1C2C3C4C5C6C7

C8C9CACBCCCDCECF

D0D1D2D3D4D5D6D7

D8D9DADBDCDDDEDF

E0E1E2E3E4E5E6E7

E8E9EAEBECEDEEEF

F0F1F2F3F4F5F6F7

F8F9FAFBFCFDFEFF

1. The [--][--] nomenclature means a SIB follows the ModR/M byte.

2. The disp32 nomenclature denotes a 32-bit displacement following the SIB byte, to be added to the index.

3. The disp8 nomenclature denotes an 8-bit displacement following the SIB byte, to be sign-extended and added to the index.

Table 11-3. 32-Bit Addressing Forms with the SIB Byte

r32Base =Base =

Scaled Index SS Index SIB Values in Hexadecimal

[EAX][ECX][EDX][EBX]none[EBP][ESI][EDI]

00 000001010011100101110111

0008101820283038

0109111921293139

020A121A222A323A

030B131B232B333B

040C141C242C343C

050D151D252D353D

060E161E262E363E

070F171F272F373F

[EAX*2][ECX*2][EDX*2][EBX*2]none[EBP*2][ESI*2][EDI*2]

01 000001010011100101110111

4048505860687078

4149515961697179

424A525A626A727A

434B535B636B737B

444C545C646C747C

454D555D656D757D

464E565E666E767E

474F575F676F777F

10 000001010011100101110111

80889098A0A8B0B8

81899189A1A9B1B9

828A929AA2AAB2BA

838B939BA3ABB3BB

848C949CA4ACB4BC

858D959DA5ADB5BD

868E969EA6AEB6BE

878F979FA7AFB7BF

11 000001010011100101110111

C0C8D0D8E0E8F0F8

C1C9D1D9E1E9F1F9

C2CAD2DAE2EAF2FA

C3CBD3DBE3EBF3FB

C4CCD4DCE4ECF4FC

C5CDD5DDE5EDF5FD

C6CED6DEE6EEF6FE

C7CFD7DFE7EFF7FF

1. The [*] nomenclature means a disp32 with no base if MOD is 00,[EBP] otherwise. This provides the following addressing modes:disp32[index] (MOD=00).disp8[EBP][index] (MOD=01).disp32[EBP][index] (MOD=10).

Addressing Mode Examples_s$ = -4_t$ = -8; 16 : // typedef struct {char mem1, mem2, prev,next;} str_type_t;; 17 : // int i,j; <Globals>; 18 : // char a,b,c,d,e,f,g,h; <Globals>; 19 : // int s,t;; <Stack>

; 20 : /* int var; */ a = var;mov al, BYTE PTR _var; DISP only

; 21 : /* str_type_t struc; */ b = struc.mem2;mov cl, BYTE PTR _struc+1; DISP only

; 22 : /* int array[100]; */ c = array[i]; mov edx, DWORD PTR _imov al, BYTE PTR _array[edx*4]; INDEX*SCALE+DISP

; 23 : /* str_type_t arr_str[100]; */ d = arr_str[i].next;mov ecx, DWORD PTR _imov dl, BYTE PTR _arr_str[ecx*4+3]; INDEX*SCALE+DISP

; 24 : /* int *p; */ e = *p; mov eax, DWORD PTR _pmov cl, BYTE PTR [eax]; BASE

; 25 : /* str_type_t *str_p; */ f = str_p->prev; mov edx, DWORD PTR _str_pmov al, BYTE PTR [edx+2]; BASE+DISP

; 26 : /* int *arr_p; */ g = arr_p[j]; mov ecx, DWORD PTR _jmov edx, DWORD PTR _arr_pmov al, BYTE PTR [edx+ecx*4]; BASE+INDEX*SCALE

; 27 : /* str_type_t *arr_str_p; */ h = arr_str_p[j]->prev;mov ecx, DWORD PTR _jmov edx, DWORD PTR _arr_str_pmov al, BYTE PTR [edx+ecx*4+2]; BASE+INDEX*SCALE+DISP

; 28 : /* int s,t (stack) */ t = s;mov ecx, DWORD PTR _s$[ebp]mov DWORD PTR _t$[ebp], ecx BASE+DISP

Software Conventions

API - Application Programming Interface Interface among procedures modules: param types, order, etc.. Machine dependent; Defined in HLL.

Calling conventions - machine level calling sequence Parameter passing location and order (stack/regs) Register preserved across calls Machine dependent, may be also language dependent

ABI - Application Binary Interface Interface among application and DLLs/OS machine dependent; OS dependent

C conventions for x86: Parameters are pushed last to 1st EAX, EBX, EDX are scratch, all other preserved Return results - Int: EAX (EDX also), FP: top of FP stack All parameters (without prototypes) are widened to int/double

Compatibility Issues

Goal: enable modules compiled by different compilers to interoperate Possibly also different languages

Calling conventions Parameter location/order Static link (e.g. for Pascal)

External variable names - for the linker “Name decoration” - for C++

External name encodes the variable’s type Exception handling

How to locate the user-registered exception handler when an exception occurs

Debug information format This is only nice-to-have

All of the above are broken in practice!

Administration

Please send me your Email address, so that we can have a mailing list My address is koby.gottlieb@intel.com Subject line: Microcomputer course mailing list

Contacts koby Koby.gottlieb@intel.com 04-8655718 Ofri ofri.wechsler@intel.com 04-8655851

Example for Calling Conventions Example of calling stackUsing 32 bit mode, near pointer

#include <string.h>

char in[]="abc";char out[10];char *p;

char *strncpy(char *d, const char *s, size_t n){ int i=0; for (i=0;i<n;i++) { monitor(&i); d[i]=s[i]; } return (char*)s;}

main(){ p = strncpy(out,in,sizeof(in));}==============================================

.file "strncpy.c"

.globl inin: .size in,4

.string "abc" // char in[]="abc"

.comm out,10,4 // char out[10];

.comm p,4,4 // char *p

/====================/ strncpy/--------------------

.globl strncpystrncpy:

pushl %ebp // put old frame ptrmovl %esp,%ebp // set new frame ptrsubl $12,%esp // Space for automat movl $0,-4(%ebp) // i = 0subl %edx,%edx // %edx = 0cmpl 16(%ebp),%edx // if (i<n)jae .L1

.L2:lea -4(%ebp),%eax // %eax = &imovl %eax,(%esp) // push &icall monitor // movl 12(%ebp),%eax // %eax = smovl -4(%ebp),%edx // %edx = imovb (%eax,%edx),%al // %al = s[i]movl 8(%ebp),%ecx // %ecx = dmovb %al,(%ecx,%edx) // d[i] = %almovl -4(%ebp),%eax // \incl %eax // | i++movl %eax,-4(%ebp) // /cmpl 16(%ebp),%eax // if (i<n)jb .L2

.L1:movl 12(%ebp),%eax // return smovl %ebp,%esp // clear stackpopl %ebp // return frameret

/====================/ main/--------------------

.globl mainmain:

subl $12,%espmovl $out,(%esp) // outmovl $in,4(%esp) // inmovl $4,8(%esp) // pcall strncpy // callmovl %eax,p // store return valueaddl $12,%esp // clean stackret

esp frame itemebp+16 sizeof(in)/nebp+12 in / s

before call strncpy ebp+8 out / dafter the call ebp+4 ret-addrafter push ebp ebp old ebp

ebp-4 var iebp-8 unused

before call monitor ebp-12 param I

Code example - Unix style

; unsigned int fib(int x) {; if(x>2); return(fib(x-1)+fib(x-2));; else; return(1);; } ; ; Stack frame:; %esp + 8: x ; %esp +4: return address ; note - no ebp!

.file "0f.c"

.align 16fib: / x= 8(%esp)

pushl %eax / save %eax is it needed?movl 8(%esp),%edx / edx=xcmpl $2,%edx / x>2?pushl %ebx / (save %ebx)jle .L66decl %edx / x<2, x--pushl %edx / \call fib / } %ebx = fib(x-1)movl %eax,%ebx / /movl 16(%esp),%eax / \subl $2,%eax / } %eax = x-2pushl %eax / /call fib / %eax = fib(x-2)addl %ebx,%eax / %eax = fib(x-1) + fib(x-2)addl $8,%esp / reset stack

.L67:popl %ebx / restore %ebxaddl $4,%esp / reset stack (contained %eax) ret / return.align 16,7,4

.L66:popl %ebxmovl $1,%eaxaddl $4,%espret / ret 1

Code example - Microsot/Intel style; unsigned int fib(int x) {; if(x>2); return(fib(x-1)+fib(x-2));; else; return(1); }; Stack frame:; %esp + 12: x ; %esp + 8: return address; %esp + 4,+8: esi, edi; %eax contains return value ; note - no ebp! TITLE fib.c .386P.model FLAT..._x$ = 8_fib PROC NEAR; 39 : { push esi ; push temps push edi; 40 : if( x > 2 ) mov edi, DWORD PTR _x$[esp+4] ; edi = x cmp edi, 2 ; x > 2? jle SHORT $L370; 41 : return (fib(x-1) + fib(x - 2)); lea eax, DWORD PTR [edi-2] ; eax = x-2! dec edi ; edi = x-1 push eax call _fib ; eax = fib(x-2) add esp, 4 ; pop param mov esi, eax push edi call _fib ; eax = fib(x-1) add esp, 4 add eax, esi; eax = fib(x-1)+fib(x-2) pop edi pop esi ; restore temps ret 0$L370:; 42 : else; 43 : return (1); mov eax, 1 ; return 1 pop edi; 44 : } pop esi ret 0_fib ENDP_TEXT ENDSEND

CISC vs RISC

CISC (Complex Instruction Set Computer) Simple as well as exotic instructions Small number of registers Variety of addressing modes Operation on registers and memory “Goal”: reduced instruction count

RISC (Reduced Instruction Set Computer) Simple instructions only Large number of registers Small set of addressing modes Memory access in load/stores only “Goal”: fast execution (through simple design)

Lecture 5: Segmentation

Segmentation and Protected Mode

Segmentation Already mentioned:

Address space divided into “segments” Logical address: segment:offset Linear address: address in the “flat” address range Segmentation: mapping from Logical to Linear

Segment base = f(segment register)offset < segment size: <64K (86/286), <4G (386+) linear address = segment_base + offset

At any time, there are 4 (8086) or 6 (286 and on) active segment registersCS: code, DS: data, SS: stack, ES: extra. FS, GS: new extra

Segment register can be modified by: A move from general register (e.g., mov ds,ax) LxS inst (e.g., LDS): loads full ptr to both seg reg and general reg.

All memory reference instructions assume default seg register CS for branches, SS for Stack, DS for data ES for string instruction destination operands

Segment prefix can be used to modify the default segment register for the current instruction

Address Computation - Real Mode Segment Base = segment*16 Linear address = segment*16 + offset

0x1234:0x5678 = 0x12340+0x5678 = 0x179B8 Implications:

Application may consume many segments Address space = 64K*16 = 1M Linear addresses have many (4096) logical representations

Seg+1:offset = seg:offset+16 But: do not assume that!

Last address: 0xFFFF*16+0xFFFF = 0x10FFEF (1M+64K-16)

19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

16-Bit Segment Selector 0 0 0 0

19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

0 0 0 0 16-Bit Effective Address0 0 0 0

21-Bit Linear Address

Offset

Linear Address

= 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Figure 9-1. 8086 Address Translation

0 0 0 0

Segmentation Models (16-bit) Terminology:

Code (“text”): code + optional read-only variables Data: global variables, static variable. Read/Write and Read Only Stack: parameters, automatic variables Heap: dynamically allocated memory

Model Segment Size Pointer Type

Tiny Code+Data+Stack+Heap <64K Offset

Small Code<64K, Data+Stack+Heap<64K Offset

Compact Code<64K, Data<64K, Stack<64K, Heap>64K Seg:Offset

Large Code>64K, Data>64K, Stack<64K, Heap>64K Seg:Offset

Huge Single Element > 64K Seg:Offset Trade-off between size and performance: smaller is faster

Tiny/Small: no need to switch segment registers Compact: uses only one data segment, but can address the entire memory

(e.g., heap) , if needed, not too many segment switches

32-bit mode typically uses Flat Model: One segment, starts at 0, spans 4G, all seg registers point to it

Segmentation Pros/Cons

Great solution for large memory w/ small word Overhead depends on memory requirements

Pay per use…

Allows reasonable code/data sharing

- More complex software development and tools

- Real mode segmentation => 1MB only

Did x86 succeed because of segmentation?… or in spite of it?...

Protected Mode - MotivationTwo Problems: 1MB is too small

More sophisticated usage: Multiprogramming, Multiusers.

Addresses are interpreted differently by different clients:

entry *next_entry (entry *p)

{ return (p->next); }

crnt_p = next_entry(crnt_p);

crnt_p->value = 7;

Increased need for protection among apps and OS In real mode, each application can crash the OS!

Protection - Solution

Solution - store information on segments in a table Segment register becomes an index to a “Descriptor Table”

contains address (base) and protection information

Address info Segment base

Linear address is still segment_base + offset, but…Segment_base = Descriptor_Table[f(segment)].base

Protection info Segment size (“limit”) <64K for 286, <4G for 386+ Segment type R, RW, ER, EO (Read, R/W, Execute/Read, Exec-Only) Segment privilege (0..3: 0 is higher, 3 is lower) Restrictions are checked for every access Descriptor table should be protected!

… And more colorful protection mechanisms…

Small step to the segment, big leap to mankind!

Address Computation Linear address = Descriptor_Table[f(segment)].base + offset Coverage

286: 16MB address space. Requires 512 segments to cover! 386: 4GB address space. 1 segment is sufficient

Descriptor Table

Selector Offset

SegmentDescriptor

LinearAddress

LogicalAddress

BaseAddress

Figure 11-5. Segment Translation

A General Segmentation Model The OS allocates each application a number of segments

Access LimitBase Address

SegmentRegisters

SegmentDescriptors

Access LimitBase Address

PhysicalMemory

Figure 11-3. Multisegmented Model

Descriptor Tables 2 tables

GDT (Global)Pointed by GDTR

LDT (Local)Per taskPointed indirectly by

Table pointers GDTR: Base + Limit LDTR: GDT Index only

Entries (descriptors) of various types: Segments Call/interrupt gates Tasks LDT - in the GDT

Figure 11-4. TI Bit Selects Descriptor Table

Base AddressLimit

GDTRBase Address

Limit LDTR

Selector

SegmentSelector

GlobalDescriptor

LocalDescriptor

TI = 0 TI = 1

13 bit index (8K entries) 1 bit table index - GDT (0) or LDT (1) 2 bit “requested” privilege level - see below

Segment Selector Format

INDEXTI

5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 01 1

Table Indicator 0 = GDT 1 = LDTRequested Privilege Level 00 = Most Privileged 11 = Least

Figure 11-7. Segment Selector Format

Segment Descriptor Format Note

286 => 386 evolution (high-order 16 bits) Limit >4M achieved by the granularity bit - scale by 212

D/B: 16/32 bit

Information moves fromdescriptor (in memory)to CPU when segmentregister is loaded

Figure 11-8. Segment Descriptors Format

AVL Available for use by system softwareBASE Segment base addressD/B Default operation size

(0 = 16-Bit segment; 1 = 32-bit segment)DPL Descriptor privilege level G GranularityLIMIT Segment limitP Segment presentS Descriptor type

(0 = System; 1 = application)TYPE Segment type

Reserved

5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 01

1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 63 23

0151616

Base 31:24 P Base 23:16

Segment Limit 15:0

TYPESGD/B

SegLimit19:16

Base Address 15:0 +0

Privilege Levels Lower number => higher privilege Code can access data of equal/lower privilege levels only Code can call more privileged data via “call gates” only

Each level has its own stack!

Some instructions and I/O operations are restricted to certain privilege levels

Also referred to as:

“Rings”

LEVEL 3

LEVEL 2

LEVEL 1

LEVEL 0

Applications

Operating SystemServices (DeviceDrivers, etc..)

Operating SystemKernel

Protection Rings

Figure 12-2: Protections Rings

For Data: DPL >= max(RPL,CPL)Otherwise: exception!

RPL: used by OSto protect against userTrojan Horses

Trojan Horse:User code calls OS codeand passes a pointer toa privileged data area

Privilege Level Check

Figure 12-3. Privilege Check for Data Access

CPL Current Privilege LevelDPL Descriptor Privilege LevelRPL Requested Privilege Level

5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 01

1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 63 23

0151616

PrivilegeCheck

Operand Segment Selector

Current Code Segment Register

Operand Segment Descriptor

Call Gate A mechanism for the OS to restrict lower privileged code to

call only pre-defined entry points Restricts the caller privilege level

Not Used

Procedure EntryPoint

DPL Count

Selector Offset 15:0

BASE DPL BASE

Offset 31:16

Descriptor Table

GateDescriptor

Code SegmentDescriptor

Selector Offset Within Segment

Destination Address

015 031

Figure 12-6. Call Gate Mechanism

From which PrivilegeLevel the gate may be used.

The called code Privilege Level.

# of parameters

Inter Level Call - Stack Switching On inter-level call

CPL changes Control transferred Stack switched

Stack Switch New (privileged) SS:eSP is taken from the TSS Old SS:eSP is stored on new stack Parameters are copied to new stack

Figure 12-9. Stack Frame During Inter Level Call

PARM 1

PARM 2

PARM 3

OLD SS

OLD ESP

PARM 1

PARM 2

PARM 3

OLD CS

OLD EIP

Old Stack Before Call

New StackAfter Call, Before Return

Old Stack After RETURN(far retN N=3)

SS0:ESP0

From TSS

Assume call from 3 to 0 Unique to interlevel calls

Task State Segment (TSS) HW mechanism to allow multi-tasking Contains task state:

Dynamic: Integer registers, segment registers, flags...

Static: CR3, per level stack pointer, etc... Task switch occurs on

JMP/Call to a TSS descriptor task gate Interrupt indexed to a task gate Certain IRET

On switch: Current TSS content is updated

For example, general registers are saved by the CPU!

New task’s values are retrieved from TSS

Contains initial stack values for level 0-2

Figure 13-1. 32-bit Task State Segment

EIPEFLAGS

EAXECXEDXEBXESPEBPESIEDI

IO map base addr

Sel for task LDTGSFSDSSSCSES

Link (old TSS sel)

048C1014181C2024282C3034383C4044484C5054585C6064

Protected ModeImplications and Observations (1)

Software conversion Straightforward if software does not assume linear = seg*16+offset

Real mode “tricks” do not work. Usually for the better: Cannot write on code

Cannot access segment 0

Cannot jump/write into the OS

New issues How to write on code (self-modifying code, SMC) - use aliasing

How to use OS services - via call gates

How to access a known linear address -Map a segment to a known base. e.g., use 4G at base 0.

How to protect between processes (applications) -Use LDT per process. Reload LDTR at context switch

Additional related mechanisms exist - not discussed

Protected ModeImplications and Observations (2)

Segmentation performance impactIssues: Indirect vs direct addressing Protection check

Solution: information is cached for each active segment Almost no overhead on memory reference... But Change of a segment register is costly!

Segment based virtual memory Possible - through use of “segment not present” bit Unattractive - due to variable length of segments

Protected Mode in Practice

Value of segmentation/protected mode Only option for a 16-bit OS 32-bit OS prefers flat model, page based protection

Call gates used for system calls by Unix Not by Windows NT

Task switch mechanism rarely if ever used OS/2 uses 32-bit segmentation Exotic uses for segment registers

FS used by NT for multi-threading

Segmentation - Summary

Mapping of logical address (selector:offset) to linear address Initially introduced to support addressability larger than the

machine word Since 286, descriptor tables enable

Protection Even larger addressability Potentially swapping to disk (but not very useful)

Starting with 386, 4GB addressability But 32-bit operating systems almost all use flat memory

Protection is the separation of Several different segments with different attributes Processes from one another Processes from the OS (protection “rings”)

All memory accesses are verified Control transfers are limited: call gates

Lecture 6: Paging

Paging

Paging - Motivation The problem:

Effective address can cover 4GB, but actual memory is smaller Application should be independent of actual memory location

Avoid memory size and location concernsAllow using fixed, pre-defined, addressesAllow easier code/data sharingInsensitive to gaps within address space

e.g., because of unused, but allocated, array space

Solution: Paging Idea: memory is divided to equal sized chunks - pages (4KB) A layer on top of segmentation. Performs linear-to-physical mapping Address “processing”

Logical (Seg:offset) ==> Linear ==> Physical Page tables map linear address to physical. Keeping track of:

Valid pagesPaged out pages (to secondary memory)Non-existent pages

Analogy: Big library, good catalog, reading room, storage room

Page Translation Mechanism 2-level hierarchical mapping

Using page directory and page tables All pages and page tables are 4K

Linear address divided to: Directory 10 bits Table 10 bits Offset 12 bits Dir/Table serves as index

into a page table Offset serves as pointer

into a data page

Page entry points to apage table or page

Performance issues: TLB!

Advanced mechanism 4MB pages(not discussed here)

Figure 11-16. Combined Segment and Page Address Translation

Descriptor Table

Selector Offset

SegmentDescriptor

LogicalAddress

BaseAddress

DIR TABLE OFFSET

Page Table

PG Tbl Entry

Page Directory

4K Dir Entry

4K Page Frame

Operand

Linear Address Space (4K Page)1121

Page Translation Mechanism (Cont.) CR3 points to current page directory

May be changed per process

Usually, a page directory entry (covers 4MB) will point to a page table that covers data of the same type/usage

Sharing. Can alias Different linear on

same physical(e.g., SMC)

Pages from differentprocesses to samephysical (e.g., OS)

DIR TABLE OFFSET

Page Dir Code

StackOS

Phys Mem

4K page

Page Tables

Page Entry Format 20 bit pointer to a 4K

Aligned address 12 bits flags Virtual memory

Present Accessed, Dirty

Protection Writable (R#/W) User (U/S#)

2 levels only!

0, 1, 2 Supervisor

Caching Page Write-Through Page Cache

Disabled

3 bits for OS usage

Figure 11-14. Format of Page Directory and Page Table Entries for 4K Pages

Page Frame Address 31:12 AVAIL 0 0 APCD

PresentWritableUserWrite-ThroughCache DisableAccessedPage Size (0: 4 Kbyte)Available for OS Use

Page DirEntry

04 12357911 681231

Page Frame Address 31:12 AVAIL D APCD

PresentWritableUserWrite-ThroughCache DisableAccessedDirtyAvailable for OS Use

Page TableEntry

04 12357911 681231

Reserved by Intel for future use (should be zero)

Paging - Virtual Memory

A page can be New, not yet loaded Loaded Paged out to disk

A loaded page can be Clean Dirty

When a page is not loaded (P bit clear) => a page fault occurs It may require removing a loaded page to insert the new one

OS prioritizes removal by LRU and dirty/clean/avail bitsDirty page should be written to disk. Clean ones need not

New page is either loaded from disk or “initialized” CPU sets page “access” flag when accessed, “dirty” when written

Copy-on-write strategy for forked Unix processes

Paging Pros and Cons

Independence of physical memory size No need to preserve continuous memory ranges Physical memory is allocated only if accessed Can map same linear address to different physical address Can share data/code among processes Pages have predefined size - easy memory management Simple protection mechanism

Overhead of tables (insignificant) Internal fragmentation (insignificant) Translation time - reduced by TLB

Entering Protected Mode, Paging “Easy”: set CR0.PE to enable protection, CR0.PG to enable Paging…

But one better make sure all tables are ready and valid...

Figure 10-3. Control Registers

Reserved ReservedNE

Page Directory Base PCD

0 ... 0 0MCE

Page Fault Linear Address

NW CR0

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Mapping 4M pages, 36 bit address.

PSE (page file extension) flag bit 4 of CR4 The PSE enables 4MByte pages (or 2Mbyte pages when

PAE flag is set). Those pages are mapped directly from the page directory, without using page tables.

PAE (physical address extension) flag bit 5 of CR4 The PAE flag enables 36 bit physical addresses (Not

covered here)

Paging/Segmentation Example All left indentation are data to the example, right are conclusions/results. All numbers are in HEX.GDTR: 0000 2000CR3: 0000 3000Logical address DS: 23, Offset: 3456

0010 0011

DS: index = 4, TI=GDT, Priv level = 3GDT(4) is in address 2000+4*8=2020

Assume protection validation is successful...GDT(4): base 00C03000 (linear!)

Linear address = 00C03000+3456 = 00C064560000 0000 1100 0000 0110 0100 0101 0110

PDE = 003 (0000 0000 11)PTE = 006 (00 0000 0110)Off = 456 (0100 0101 0110)

address in PDE = 3000+3*4 = 300C

M(300C) = 4000

address in PTE = 4000+6*4 = 4018

M(4018) = 5000

full address in page = 5000+456 = 5456

M(5456) = 12345678Finished? Not exactly....

Paging/Segmentation Example (Cont.)

But: where do we find the GDT?

GDTR points to 2000 linear, need to look at the physical address to see GDT(4)!

Linear address = 00002020 = 0000 0000 0000 0000 0010 0000 0010 0000

PDE = 000 (0000 0000 00)

PTE = 002 (00 0000 0020)

Off = 020 (0000 0010 0000)

address in PDE = 3000+0*4 = 3000

M(3000) = 6000

address in PTE = 6000+2*4 = 6008

M(6008) = 5000

full address in page = 5000+020 = 5020

M(5020) = 0010 f0c0 3000 1234

(G=0, 16bit, avail, present, DPL=3, appl, data/RO, limit=1234, base = 00c03000)

Addressing Methods - Summary

16 bits real and virtual mode: seg_reg*16+offset

16/32 bit protected mode:Use global/local descriptor table (GDT/LDT) xDT[seg_reg].base + offset (xDT = GDT/LDT) Check protection! Note: 32-bit is always protected

32 bit paging enabled: First translate as above, then Convert virtual (linear) to physical

pdi=vaddr[31:23]; pti=vaddr[22:13]; offset=vaddr[11:0]paddr = CR3->pdt[pdi]->pt[pti]+offset

Check protection!

Lecture 7: Interrupts and Exceptions

Interrupts and Exceptions

Some of the material in this section was taken from:The advanced Intel microprocessors - 80286/80386/80486.By: Barry B. Brey, DeVry institute of technology. Pub. by MacMillan publishing company

What is it?

“A signal from from hardware or software, such as a keystroke, that demands immediate attention and takes priority over other operations ”

Analogy: Alarm clock Pain

What Causes It? Interrupts

External to the CPU

Peripheral: I/O: mouse move/click, keyboard, timer

Exceptions Internal to the CPU

Exceptions: Processor detected:

Fault: restart from the causing instruction (e.g., page fault)

Trap: restart from the next instruction (e.g., overflow)

Abort: cannot restart (double fault) Programmed Exceptions (“software interrupts”):

e.g.: INTO, INT 3, INT n, BOUND

The terms “interrupts” & “exceptions” are frequently intermixed

Why Is It Important?

Few categories

Inevitable: only way to implement a feature Cannot implement timers, page faults, ^break etc. without it

Difficult without: alternatives imply huge program overhead e.g., polling to identify keyboard stroke

Convenience: cheaper way to implement a feature e.g., zero divide, INTO, BOUND, etc..

Conventions: Application/OS communication System call. Allow using procedures without linkage

How Does It Work?

Internal interrupts: handled completely by CPU + SW

External interrupt involves CPU + SW + peripherals External device delivers “INTR” signal (or NMI, SMI)

CPU acknowledges with “INTA” bus cycle

Device sends interrupt# on Data bus

Usually requires a PIC (Programmable Interrupt Controller) in system

External interrupts (except NMI/SMI) can be masked (IF=0)

SignalIntr

Nmi,..

DEVICE PIC CPU

Interrupt Vector 256 entry array Each entry points to an interrupt/exception handler Located in address 0 (8086) or as pointed by IDTR register First 32 entries reserved for “Intel” for CPU internal exceptions Entry contains

Real mode: 4 bytes = Segment:offset Protected mode: 8 bytes = Interrupt/Trap/Task Gate

In both cases, control transferbased on the entry info

Figure 14-1. IDTR Locates IDT in Memory

IDTR Register

IDT Base Address IDT Limit

0151647

+ Gate forInterrupt #N

Gate forInterrupt #1

Gate forInterrupt #0

Real Mode Interrupt Vector

CS #NIP #N

CS #1IP #1CS #0IP #0

Reserved entriesVector Number Description

0 Divide Error 86

1 Debug Exception 86

2 NMI Interrupt 86

3 Breakpoint 86

4 INTO-detected Overflow 86

5 BOUND Range Exceeded 186

6 Invalid Opcode 86

7 Device Not Available 86

8 Double Fault 286

9 CoProcessor Segment Overrun (reserved) 286

10 Invalid Task State Segment 286

11 Segment Not Present 286

12 Stack Fault 286

13 General Protection 86

14 Page Fault 386

15 (Intel reserved. Do not use.)

16 Floating-Point Error 386

17 Alignment Check 486

18 Machine Check* Pentium

19-31 (Intel reserved. Do not use.)

32-255 Maskable Interrupts

Interrupt Vector Content (DOS)# Interrupt Name Address Owner

00 Divide by Zero 1756:00D2 NC01 Single Step 0070:06F4 Unknown02 Nonmaskable 1993:26D2 DOS03 Breakpoint 0070:06F4 Unknown04 Overflow 0070:06F4 Unknown05 Print Screen F000:FF54 BIOS06 Invalid Opcode F000:EB43 BIOS07 Reserved F000:EAEB BIOS08 IRQ0 - System Timer 15BE:0000 FAXTSR09 IRQ1 - Keyboard 15BE:001D FAXTSR0A IRQ2 - Reserved 0A08:0057 DOS0B IRQ3 - COM2 0A08:006F DOS0C IRQ4 - COM1 D60B:0095 BIOS0D IRQ5 - Reserved 0A08:009F DOS0E IRQ6 - Diskette 0A08:00B7 DOS0F IRQ7 - Printer 0070:06F4 Unknown10 Video 15BE:0100 FAXTSR11 Equipment Determination F000:F84D BIOS12 Memory Size Determination F000:F841 BIOS13 Fixed Disk/Diskette 15BE:0174 FAXTSR14 Asynchronous Communication F000:E739 BIOS15 System Services 0C4C:19A0 SMARTDRV16 Keyboard F000:E82E BIOS17 Printer F000:EFD2 BIOS18 Resident BASIC F000:E000 BIOS19 Bootstrap Loader 0C4C:1990 SMARTDRV1A Real-Time Clock Services F000:FE6E BIOS1B Keyboard Break 0070:06EE Unknown1C User Timer Tick 1465:00D6 CASMGR1D Video Parameters F000:F0A4 BIOS1E Diskette Parameters 0000:0522 Device Driver1F Video Graphics Characters C000:5A52 BIOS20 Program Terminate 0255:003A DOS System Area21 General DOS Functions 15BE:01A1 FAXTSR22 Terminate Address 18B9:02B1 COMMAND23 Ctrl+Break Handler Address 1993:26BA DOS24 Critical Error Handler 18B9:0155 COMMAND25 Absolute Disk Read 0C4C:19DE SMARTDRV26 Absolute Disk Write 0C4C:1A27 SMARTDRV27 Terminate and Stay Resident 0255:004C DOS System Area

# Interrupt Name Address Owner

28 DOS Idle 15BE:0222 FAXTSR29 DOS Internal - FAST PUTCHAR 0255:0052 DOS System Area2A Microsoft Networks 15BE:073E FAXTSR2B Reserved for DOS 0116:10DA DOS System Area2C Reserved for DOS 0116:10DA DOS System Area2D Reserved for DOS 0116:10DA DOS System Area2E DOS - Execute Command 0ACD:013F COMMAND2F Multiplex (Process Interface) 15BE:079E FAXTSR....40 Diskette BIOS Revector F000:EC59 BIOS41 Fixed Disk Parameters F000:E13D BIOS42 Relocated Video Handler F000:F065 BIOS43 EGA/VGA User Font Table C000:5652 BIOS44 Novell Netware API F000:EA97 BIOS45 Reserved F000:EA97 BIOS46 Fixed Disk Parameters F000:E401 BIOS....70 IRQ8 - Real-Time Clock 0A08:0052 DOS71 IRQ9 - Reserved F000:EED2 BIOS72 IRQ10 - Reserved 0A08:00CF DOS73 IRQ11 - Reserved 0A08:00E7 DOS74 IRQ12 - Reserved 0A08:00FF DOS75 IRQ13 - Redirect to NMI F000:EEDB BIOS76 IRQ14 - Fixed Disk 0A08:0117 DOS77 IRQ15 - Reserved F000:8E8D BIOS....F8 Reserved for User Programs 0040:0000 Device DriverF9 Reserved for User Programs 7A0E:7E00 UnknownFA Reserved for User Programs 0080:7C00 SSCDROMFB Reserved for User Programs 972F:0001 UnknownFC Reserved for User Programs 8304:F81C UnknownFD Reserved for User Programs EF40:7C00 BIOSFE Reserved for User Programs F729:F80B SSCDROMFF Reserved for User Programs 0002:F000 SMARTDRV

CPU Interrupts

External Interrupts

Additional Players Flags

IF flag: masks external interruptsSet/Clear by STI/CLI

TF flag: is the CPU in TRAP mode (single-step)Set/Clear by eFLAGS manipulation

Instruction IRET: “return from interrupt” = RET + few more things Software interrupts: INT n, INT3, INTO (4), BOUND (int 5)

Signals (for external interrupts) INTR: interrupt request INTA: interrupt acknowledge (bus cycle) = IO#*C*R# Data Bus: pass interrupt # NMI, SMI: additional interrupt lines

NMI - non maskable (timer, power fail)SMI - system management (OS independent power management)

When an interrupt/exception occurs, the interrupt# is decoded and then the story begins...

Interrupt Processing On Call (italic: Protected mode only)

Stack (SS:ESP) is set to level 0 stack (If not there) Store on stack

Old SS, ESP (on privilege level change only)Old FlagsOld CS, EIP (the offending instruction, or the next one!)Error code (for certain exceptions only. Contains erred selector)

Clear the IF/RF bits of EFLAGs (masks interrupts, delay single-step) Jump to the interrupt/exception handler

Handler code <Do the right things right…> Issue IRET

On IRET: Similar to regular RET + Restore EFLAGs

Figure 14-4. Stack Frame After Exception or Interruptwith and without privilege level change

Unused

Old SS

Old ESP

Old EIP

Old EFLAGs

Old CS

Error Code

ESP fromTSS

NewESP

OldESP

Interrupt Handling

Hardware side Ensure transparency to the affected task

Will return to the point of interruption w/ the right flagsDoes the minimum - only what software cannot do!

Interrupt handlers Transparent: preserve all used registers! Short as possible. If cannot - enable interrupts during handling

Do not disable interrupts for a long time! Avoid nesting of same interrupts, or make sure the software can

handle that! Some of the information can/should be extracted from the Stack

(old CS:eIP, old SS:eSP, eFLAGS, error code)

Interrupt handler writers must fully understand the relevant HW interaction (PIC + actual peripheral)

Simple Interrupt Routine

<From Barry B. Bray: page 412 CHAPTER 10: INTERRUPTS>EXAMPLE 10-3

;Procedure that displays all the registers and their contents;on the video screen in response to a trap interrupt

0000 TRACE PROC FAR

0000 50 PUSH AX ;save registers0001 55 PUSH BP0002 53 PUSH BX

0003 BE 0054 R MOV BX, OFFSET NAMES ;address register names

;display registers

0006 E8 009A R CALL CRLF ;get new display line0009 E8 00A9 R CALL DISP ;display AX000C 88 C3 MOV AX,BX000E E8 00A9 R CALL DISP ;display BX0011 88C1 MOV AX,CX0013 E8 00A9 R CALL DISP ;display CX0016 88 C2 MOV AX,DX0018 EB 00A9 R CALL DISP ;display DX0018 88 C4 MOV AX,SP001D 05 000C ADD AX,120020 E8 00A9 R CALL DISP ;display SP0023 88 CS MOV AX,BP0025 E8 00A9 R CALL DISP ;display BP0028 8B C6 MOV AX,SI002A EB 00A9 R CALL DISP ;display SI002D 8B EC MOV BP,SP ;address stack with BP002F 88 46 06 MOV AX,[BP+6]0032 E8 00A9 R CALL DISP ;display IP0035 8B 46 0A MOV AX,[BP+10]0038 E8 00A9 R CALL DISP ;display flags0038 8B 46 08 MOV AX,[BP+8]003E EB 00A9 R CALL DISP ;display CS0041 8CDB MOV AX, DS

0043 E8 00A9 R CALL DISP ;display DS0046 8C C0 MOV AX,ES0048 E8 00A9 R CALL DISP ;display ES0048 8C D0 MOV AX,SS004D E8 00A9 R CALL DISP ;display SS

0050 5B POP BX ;restore registers0051 5D POP BP0052 58 POP AX0053 CF IRET0054 TRACE ENDP

Note:; Call CRLF: print new line; Call DISP: print reg name & value

Push registers

Manipulates stack

Restore registers

Setting/Clear TF Flag There are no special instructions that set or clear the trap flag. Flag is on the stack at [BP+8]

; Procedure that sets TF to enable trap0000 TRON PROC FAR0000 50 PUSH AX ; save registers0001 55 PUSH BP0002 8B EC MOV BP, SP ; get SP0004 88 46 08 MOV AX, [BP+8] ; get flags0007 80 CC 01 OR AH,1 ; set TF000A 89 46 08 MOV [BP+8],AX ; save flags000D 5D POP BP ; restore registers000E 58 POP AX000F CF IRET0010 TRON ENDP

;Procedure that clears TF to disable trap0010 TROFF PROC FAR0010 50 PUSH AX ; save registers0011 55 PUSH BP0012 8B EC MOV BP, SP ; get SP0014 88 46 08 MOV AX, [BP+8] ; get flags0017 80 ES FE AND AH, 0FEH ; clear TF001A 89 46 08 MOV [BP+8],AX ; save flags001D 5D POP BP ; restore registers001E 58 POP AX001F CF IRET0020 TROFF ENDP

Order of interrupts Last instruction traps (breakpoint) External interrupts Fetch related (Code page fault) Decode related (Illegal opcode) Execution related (Data page fault)

Hardware Example

D0D1D2D3D4D5D6D7

RD#WR#A0A1RESETCS#

PA0PA1PA2PA3PA4PA5PA6PA7

PB0PB1PB2PB3PB4PB5PB6PB7

PC0PC1PC2PC3PC4PC5PC6PC7

D0D1D2D3D4D5D6D7

Keyboard

8255A-5

1 1 1 1 2 2 2 2 Y Y Y Y Y Y Y Y1 2 3 4 1 2 3 4

1 1 1 1 2 2 2 2 A A A A A A A A 1 21 2 3 4 1 2 3 4 G G

D0D1D2D3D4D5D6D7

WAIT2#

IOWC#A0A3A4A5A6A7A8A9

A11A12A13A14A15

I1I2I3I4I5I6I7I8I9I10

O1O2O3O4O5O6O7O8

74ALS244

An 8255A-5 Interface to a KB from 80286 using interrupt vector 40H

123456789

1918171615141312

3433323130292827

432140393837

1819202122232425

1415161713121110

D0-D7: 00000010 = 40H

HW Intr Routine <From Barry B. Bray: page 422 CHAPTER 10: INTERRUPTS>

EXAMPLE 10-5; Interrupt service procedure that reads a

key; from the keyboard.

= 0500 PORTA EQU 500H ; port A= 0506 CNTR EQU 506H ; control

register0000 0100 FIFO DB 256 DUP (?) ; FIFO buffer0100 0000 INP DW ? ; input pointer0102 0000 OUTP DW ? ; output pointer

0104 KEY PROC FAR

0104 50 PUSH AX ; save registers0105 53 PUSH EX0106 57 PUSH DI0107 52 PUSH DX

0108 2E: 8B 1E 0100 R MOV BX, INP ; address FIFO010D 2E: 88 3E 0102 R MOV DI, OUTP0112 FE C3 INC BL ; test for full0114 38 DF CMP BX, DI0116 7411 JE FULL ; if full

0118 FE CB DEC BL011A BA 0500 MOV DX, PORTA011D EC IN AL, DX ; get data011E 2E: 88 07 MOV CS:[BX],AL ; save data0121 2E: FE 06 0100 R INC BYTE PTR INP0126 EB 07 90 JMP DONE

0129 FULL:0129 BO 08 MOV A/8 ; disable

interrupts0128 BA 0506 MOV DX, CNTR012E EE OUT DX, AL

012F DONE:012F 5A POP DX ; restore

register0130 5F POP DI0131 58 POP EX0132 58 POP AX0133 CF IRET

0134 KEY ENDP

EXAMPLE 10-6; procedure that reads a key from the FIFO and; returns with it in AH

0104 READ PROC FAR0104 53 PUSH EX ; save registers0105 57 PUSH DI0106 52 PUSH DX

0107 EMPTY:

0107 2E: 8B 1E 0100 R MOV BX, INP ; address FIFO010C 2E: 8B 3E 0102 R MOV DI, OUTP0111 38 DF CMP BX, DI ; test for empty0113 74 F2 JE EMPTY ; if empty

0115 2E: 8A 25 MOV AH, CS:[DI] ; get data0118 8009 MOV AL9 ; enable 8255A-5 intr011A BA0506 MOV DX, CNTR011D EE OUT DX, AL

; increment pointerO11E 2E: FE 06 01M R INC BYTE PTR OUTP0123 5A POP DX : restore registers0124 5F POP DI0125 58 POP EX0126 CB RET

READ ENDP

INP (after KB press)

OUTP (after “READ”)

The Programmable Interrupt Controller (PIC)

Heavy-weight legacy, the 8259A component - PC/AT Today part of the “chip-set”

But with full software compatibility

Newer systems use the Advanced PIC (APIC) Supports multiprocessor systems Not discussed here

Functions: React to an external interrupt, according to priorities Convert the interrupt request (IRQ) into a vector Inform the peripheral that the interrupt was acknowledged

Vector addresses can be programmed Using a set of “commands”

Up to 8 components can be cascaded. In practice, only 2

8259A connection 10K

D0D1D2D3D4D5D6D7

A0CS#RD#WR#SP/EN#INTINTA

IR0IR1IR2IR3IR4IR5IR6IR7

CAS0CAS1CAS2

D0D1D2D3D4D5D6D7

INTRINTA#

WAIT2#

IOWC#A0A2A3A4A5A6A7A8A9

A10A11A12A13A14A15

I1I2I3I4I5I6I7I8I9I10

O1O2O3O4O5O6O7O8

An 8259A Interface to the 80286

123456789

1918171615141312

1110 987654

1819202122232425

121315

Cascading 8259 & IRQs

Number Address Name Owner

IRQ 00 15BE:0000 Timer Output 0 FAXTSR

IRQ 01 15BE:001D Keyboard FAXTSR

IRQ 02 0A08:0057 [Cascade] DOS

IRQ 03 0A08:006F COM2 DOS

IRQ 04 D60B:0095 Mouse BIOS

IRQ 05 0A08:009F LPT2 DOS

IRQ 06 0A08:00B7 Floppy Disk DOS

IRQ 07 0070:06F4 LPT1 Unknown

IRQ 08 0A08:0052 Realtime-Clock DOS

IRQ 09 F000:EED2 Reserved BIOS

IRQ 10 0A08:00CF Reserved DOS

IRQ 11 0A08:00E7 Reserved DOS

IRQ 12 0A08:00FF Reserved DOS

IRQ 13 F000:EEDB Coprocessor BIOS

IRQ 14 0A08:0117 Fixed Disk DOS

IRQ 15 F000:8E8D Reserved BIOS

8259 #1

8259 #2

INTRto CPU

INTRto next8259

Lecture 8: The Software-Hardware Interface

The Software/Hardware Interaction

Background Hal/BIOS Reset/Boot Memory management More...

Old Computers...

Applications talk directly to hardware OK for embedded systems…

Program is burned in ROM

Problematic for a personal computer Who loads programs? What if underlying HW changes? Managing shared resources: memory,

files, protection? No reuse of software Programmer must be expert in HW?

Reprogrammable computer must have an OPERATING SYSTEM

Application

Hardware

OS Role Ideal model

Applications talk with the OS OS talks with the HW

This seems to solve all problem: OS loads programs HW changes require OS changes, but

not application change OS manages shared resources OS contains reused (shared) SW Programmers need not be aware of HW

In practice, however, applications continued to interact with HW

No OS support for new devices Performance requirement (e.g., screen)

Applications

Hardware

Ideal Model

Applications

Hardware

Practical Model

General OS models Monolithic

All procedures interact All can talk to the HW

Hardware

System Services

ApplicationProgram. . . . . .

User Mode

Kernel Mode

Monolithic Operating System

OperatingSystem

Procedures

ApplicationProgram

File System

Memory and I/ODevice mgmt

ProcessorScheduling

Hardware

System Services

ApplicationProgram. . . . . .

User Mode

Kernel Mode

Layered Operating System

OperatingSystem

Procedures

ApplicationProgram

Layered A procedure can talk to lower

layers only May be enforced by HW

General OS Models (Cont.) Client/Server Ideally - kernel consists of message passing only

Clean, modular, secure... but slow

In practice - kernel contains more. e.g. Scheduler, memory mgmt, device drivers, etc...

Hardware

MicroKernel

User Mode

Kernel Mode

Client/Server Operating System

ServiceProcedures

ProcessServer

FileServer

MemoryServer

NetworkServer

DisplayServer

ClientApplication

. . . . . .

Windows NT Basic Layers

Hardware

System ServicesSystem Services

ProcessManager

VirtualMemoryManager

LPCFacility

Object Manager

SecurityReference

Monitor

SecurityReference

Monitor I/OSystem

I/OSystem

KernelKernel

Hardware Abstraction LayerHardware Abstraction Layer

Hardware

UserKernel

Win32Client

Win32sub

system

Win32sub

system

PosixClient

Posixsub

system

Posixsub

system

. . . . . . LogonProcess

LogonProcess

Securitysub

system

Securitysub

system

. . . . . .

Hardware Independence

Separate between the OS and the actual hardware Allows

Same OS can BOOT on different platforms as-is Same OS can run on different platforms as-is

At boot time: hardware independence achieved by BIOS (basic I/O system)

At run time: BIOS (DOS, Windows 3.1/95) HAL (NT) Device drivers

Applications

BIOS/HAL

BIOS Model

Hardware

Ideal model Reality

BIOS Contains basic functionality for HW handling

Disk I/O Screen I/O Mouse, keyboard

All that is needed to boot the system - and more

Resides in a non-volatile memory (EPROM, Flash) Supplied with the computer

BIOS suppliers: Phoenix, AMI (American Megatrend)Usually in cooperation with chip and system makers

OS supplier: Microsoft, Microsoft, ...

Interfaces to devices are done also by: Applications (DOS only) Device drivers (Windows)

BIOS additional tasks Boot Setup (HW parameters)

Inter-Layer Interface

API (application programming interface) Defines the function, parameters, return values. etc...

Level of interface: Programmer level: C-like, machine independent Code level: machine dependent (stack layout, etc..) Transfer from programmer code to library code

Regular call, standard API

Transfer from user code to OS code (still in user mode!) Call/Jmp. May use non-standard API

Transfer from user mode to system mode Software interrupt Inter-level call (using call gate)

extern int read(int, void *, unsigned); main (){ int file=0,buf[100],icount=10,count; count = read(file,buf,icount);}

disassembly for read.out:

.._cerror() 80481c8: a3 70 94 04 08 movl %eax,0x8049470 / errno <- code 80481cd: b8 ff ff ff ff movl $0xffffffff,%eax / ret val <- -1 80481d2: c3 ret 80481d3: 90 nop 80481d4: 90 nop 80481d5: 90 nop 80481d6: 90 nop 80481d7: 90 nop

_read()read() 80481d8: b8 03 00 00 00 movl $0x3,%eax / set read code 80481dd: 9a 00 00 00 00 07 00 lcall 0x7,0x0 / enter kernel 80481e4: 72 02 jb 0x2 <80481e8> / check error 80481e6: c3 ret / OK: return 80481e7: 90 nop 80481e8: 3c 5b cmpb $0x5b,%al 80481ea: 74 ec je 0xec <80481d8> 80481ec: e9 d7 ff ff ff jmp 0xffffffd7 <80481c8> / error routine 80481f1: 90 nop 80481f2: 90 nop 80481f3: 90 nop

main()[3] 80481f4: 55 pushl %ebp 80481f5: 8b ec movl %esp,%ebp 80481f7: 81 ec 90 01 00 00 subl $0x190,%esp 80481fd: 57 pushl %edi 80481fe: 56 pushl %esi[4] 80481ff: 2b ff subl %edi,%edi 8048201: be 0a 00 00 00 movl $0xa,%esi[6] 8048206: 56 pushl %esi 8048207: 8d 85 70 fe ff ff leal -400(%ebp),%eax 804820d: 50 pushl %eax 804820e: 57 pushl %edi 804820f: e8 c4 ff ff ff call 0xffffffc4 <80481d8> / call read lib 8048214: 83 c4 0c addl $0xc,%esp[7] 8048217: 5e popl %esi 8048218: 5f popl %edi 8048219: 8b e5 movl %ebp,%esp 804821b: 5d popl %ebp 804821c: c3 ret

Unix system code sequence

#include <time.h>main (){ time_t l1,l2; l2=time(&l1); // time receives pointer to time_t (long) and returns time_t }--------------------------------------------------------------- User code4: {00401010 55 push ebp00401011 8bec mov ebp,esp00401013 83ec08 sub esp,0000000800401016 53 push ebx00401017 56 push esi00401018 57 push edi5: time_t l1,l2;6:7: l2=time(&l1);00401019 8d45fc lea eax,dword ptr [l1]0040101c 50 push eax0040101d e812000000 call time (00401034) / go to library00401022 83c404 add esp,0000000400401025 8945f8 mov dword ptr [l2],eax8: }00401028 5f pop edi00401029 5e pop esi0040102a 5b pop ebx0040102b c9 leave0040102c c3 ret---------------------------------------------------------------- Library code_time:00401034 55 push ebp00401035 8bec mov ebp,esp00401037 83ec10 sub esp,000000100040103a 8d45f0 lea eax,dword ptr [ebp-10]0040103d 50 push eax / enter DLL0040103e ff15bc804000 call dword ptr [__imp__GetLocalTime@4 (004080bc)]00401044 0fb74dfc movzx ecx,word ptr [ebp-04]00401048 0fb755fa movzx edx,word ptr [ebp-06]0040104c 51 push ecx0040104d 52 push edx0040104e 0fb745f8 movzx eax,word ptr [ebp-08]00401052 0fb74df6 movzx ecx,word ptr [ebp-0a]00401056 50 push eax00401057 51 push ecx00401058 0fb755f2 movzx edx,word ptr [ebp-0e]0040105c 0fb745f0 movzx eax,word ptr [ebp-10]00401060 52 push edx00401061 50 push eax00401062 e87f010000 call ___loctotime_t (004011e6)00401067 83c418 add esp,000000180040106a 8b4d08 mov ecx,dword ptr [ebp+08]0040106d 85c9 test ecx,ecx0040106f 7402 jz _time+0000003f (00401073)00401071 8901 mov dword ptr [ecx],eax00401073 8be5 mov esp,ebp00401075 5d pop ebp00401076 c3 ret---------------------------------------------------------------- Dll code82929af0 683970f7bf push bff7703982929af5 e923aa663d jmp bff93c1d / enter kernel---------------------------------------------------------------- Kernel codebff93c1d ...... ......... ....... jmp far .... / enter ring 0003b:03d8 cd30 int 30 ;#0028:c0236288 --------Priveleged level

WIN95 system code sequence

DOS/BIOS Interface Very low level interface definition

Parameters are passed in registersAH contains function code

Return values are passed in registersCF contains error indication (Simple JC to check!)

Actual call is done by software interruptDOS: int 0x21BIOS: various 0x10-0x1A, e.g., 0x13 for disk, 0x16 for KBD

Low level interface pros/cons+ Faster code

+ Denser code size

- Very machine dependent: not portable, less extensible

Was OK for its time. Inappropriate for today

DOS ExamplesDOS: Delete file: INT 21H / DOS call

Input Parameters AH: 41HDS:DX asciiz pointer (“myfile.dat”)

Return values AX: 0: succeed

2: file not found5: fail to delete

example: LDS DX, file_name / set file nameMOV AH, 41H / delete codeINT 21H / dos callCMP AX, 0 / succeeded?JNZ fail

....file_name DB “myfile.dat”,0

----------------------------------------------------------------------------------------------

DOS: Read from file/ Device: INT 21H / DOS call

Input Parameters AH: 3FHBX file handleCX bytes to readDS:DX pointer to data buffer

Return values AX: CF=0 Number of bytes read

AX: CF=1 error code (5-access denied, 6-invalid handle)

BIOS Example

BIOS ReadDiskSector: INT 13H / Diskette/Fixed disk BIOS serviceAH: 2H / Read Disk sector

Input Parameters: AL: Number of sectors to readCH: cylinder # (low 8 bits)CL: Cylinder/Sector #DH: Head#DL: Drive#ES:BX Pointer to buffer

Return Values: AH: 0 = no errorFixed disk error code

CF: 0 = no error, 1 = error

Normal operation Application calls DOS to bring bytes from file DOS examines which device is associated with the file For disk:

DOS file system translates logical file address into physical disk address DOS Calls BIOS to read the actual bytes

BIOS Example (Cont.)

BIOS ReadKeyBoardInput: INT 16H / keyboard BIOS service

Input Parameter: AH: 0H / Read keyboard

Return Values: AH: Scan code

AL: Ascii code

Normal operation Application calls DOS to bring bytes from file DOS examines which device is associated with the file For keyboard: Calls BIOS to read the actual bytes

In practice, applications may access the keyboard directly+ Faster access+ Can get more info (extended code)- Some capabilities are lost: e.g., cannot redirect input!

Boot-Strap Process Reset

CS:IP is 0xf000:0xfff0 (within BIOS), CR0: Cache disabled, write-thru Perform optional BIST (built-in self test) Enter Real mode, most general/segment registers are 0

POST = Power on Self Test Start OS load (OS independent)

Call ROM loader Identify boot disk Load master boot record - pass control Find boot sector - load and pass control

DOS Specific (also WIN3.1) Find IO.SYS - load and pass control (BIOS extensions) Find MSDOS.SYS - load and pass control (DOS code) Load drivers Execute config.sys (more Device drivers) Find command.com - load and pass control (“shell”) Find autoexec.bat - run (initial apps) Prompt user

POST – Power-On Self Test

Check CPU integrity (regs, math, protected mode, etc..) BIOS checksum CMOS checksum (modifiable values for setup) Check/initialize DMA, keyboard controller Check lower 64KB of memory Check/initialize PIC, Cache, display controller Check all memory Check/initialize Serial & Parallel ports Check/initialize Disk/Diskette controller Look for BIOS extensions in segments A000:F000

Identified by AA55 on a 2K memory boundary. Code follows

Other operations: create shadow ROM BIOS, etc... Start Boot Strap

Boot-Strap - Comparison

NTNTInt 19

NT Loader

NT Min Kernel

NT Kernel

Device Driver

Login Process

BIOSBIOSReset

BIOS extensions

DOSDOSDOS Device Driver

DOS Applications

WIN 3.1WIN 3.1Windows VxD

Windows Drivers (DLLs)

Windows apps and DLL

WIN 95WIN 95Windows VxD

Windows Drivers (DLLs)

Windows apps and DLL

Win95option 2

Win95option 1

Schematic onlySome details ignored

DOS, Win 3.1 & Win 95 Memory Map

* EW Jan/29/980

Extended Memory

High Mem

Rom BIOS

Empty RAM

Video, SCSI, Network,Expanded Memory, etc..

Empty RAM

DOS Free Memory

Applications

DOS Device Drivers

DOS COMMAND.COM

Interrupt Vector

1M+64K: 10FFEF

1M: 0FFFFF

1M-128K: 0E0000

640K: 9FFFF

1K, 3FF

Upper memory

Win 95 memory map

Shared

Ring 0

Shared

Ring 0

Shared

Ring 3

System Tables (IDT,GDT ..)

System DLLs

Memory Mapped Files

Top portion ofWin-16 global heap

Per Process area

Lower portion ofwin-16 global heap

MS-DOS

Per Process

Ring 3

NT Virtual Memory Map

Non Paged Pool(Cannot be swapped out)

Direct MappingAddresses

User & SystemDLLs

Per Process

Ring 3

System

Shared

Ring 0

Kernel, Drivers, IDT, GDT, TSS..

Memory Model in Win/NT and Win95

Minimize usage of segments Use paging for address and access check

Address error:Memory access beyond program limits is checked by the OS following page fault exception

Access error:Memory Access is checked for read/write & privilege level in the page table and the address segment attribute

There are different privilege levels for the OS and the user context (need to remember this during interrupt time)

The FS segment register is used as a Task pointer to the OS database

Win/NT Still supports 16-bit and real mode segmentation applications with Virtual X86 (need to remember during Interrupt service)

Win/NT / Win ‘95 Page Tables

PageDirectory

Process 1’s CR3

PageDirectory

Process 2’s CR3

PageTable

System PageTable

PhysicalMemorypages

Background: Processes and Threads

Different for different operating system, we’ll talk Win32 (Windows ‘95 and Windows NT)

Processes are the primary running objects One executable and several DLLs Address space Object (e.g. file) handles Security attributes Execution priority

A process can have multiple threads For concurrency Minimal state: just registers and a private stack

Threads are known to the OS Preemptive thread switch, as well as preemptive process switch

Threads need to synchronize access to shared resources

Process and Thread Memory Management

Each Process has its own page table Separate CR3

All user processes use the same segments

Process switch includes switching to its pages

Linear mapping address are the same as the logical address “Flat” memory model

Each thread has different stack allocation obtained from the its process memory space

No paging magic, simply a different value in ESP

How memory is shared between Processes Same physical memory page is mapped by two or more

processes’ page tables

=> Two linear addresses calculate to the same physical address

Each process switch causes flushing of the TLB

Mode Switch Enter protected mode:

Prepare at least minimal GDT/IDT/TR. Set GDTR CR0.PE <= 1. Segment register still points to old base/limit Load segment registers as needed Perform board initialization - as needed Avoid interrupts while switching. Now update IDTR. Update TSS

Enter Paging: Prepare minimal page tables (at least 1directory, 1 table) CR3 <= page directory, CR0.PG <=1 The above code in page which phys=linear.

Need jump before leaving the page.

Exit paging: The same thing, with CR0.PG <= 0

Exit protected (in 286 - just reset) Set all segments so limit=0xffff. Disable interrupts CR0.PE <= 0 IDTR = 0, enable interrupts

Get_vector: ...XOR EAX, EAX; MOV ES, AX / seg=0MOV AX, ES:x*4MOV old_vector_p, AX / get offsetMOV AX, ES:x*4+2MOV old_vector_p+2, AX / get selector

...old_vector_p: DD ? / apart from code! ....Set_vector: ...

XOR EAX, EAX; MOV ES, AX / seg=0MOV AX, new_vector_pMOV ES:x*4, AX / set offsetMOV AX, new_vector_p+2MOV ES:x*4+2, AX / set selector

...new_vector_p: DD new_service_proc / new proc addr ...int_service: IN AL, port# / read status of the keyboard

CMP AL, hot_key / check if special valueJZ hot_key_service / if OK, get requestJMP [old_service_p] / jmp far to saved address...

hot_key_service: ...

/** C example */void chain_intr(unsigned int vector,

unsigned char mask){

void (_cdecl _interrupt_far * _cdecl np)(unsigned);

_disable();np = _dos_getvect( vector);_dos_setvect(vector, process_intr);_enable();/* program PIC to enable

interrupts */outp( PIC_e, inp( PIC_e ) & mask);outp( PIC_b, EOI);

Interrupt Vector Chaining (simplified)

Running DOS under Windows - Virtual X86

Allow execution of real mode apps under protected mode OS

Use Real-Mode logical-to-linear address translation Linear = seg*16+offset

Paging is used for address translation Can have several VM X86 tasks simultaneously Lower 1MB can be mapped to anywhere

DOS/BIOS code is copied for each task Emulate (virtualize) privileged operations

Invokes a general protection handler Handler decides how to continue

In protected mode (level 0) - e.g., I/OBack in VM mode (level 3) - e.g., divide by 0

VM X86 Address Translation

Segment Unit Page Unit32

Effective Address (offset)

BaseLinearaddress

Physical address

DOS BOXVM1

DOS BoxVM2

Page table 1

Page table 2

20bit, 0-1M

Linear space

Memory

Virtualizing Privileged I/O and Instructions

Windows has APIs (at the VxD level) to tie a handler to a port

Upon an exception:1. Check interrupt vector

2. Enter protected

mode handler

3. Optionally,

Process in VM X86

TASK A

IO Bit Map

Level 0Handler

Level 3DOS/Windows

Application

IOPL = 0CPL=3

OUT DX, AL

DIV AX,0

Interruptvector

Other things

Debug/Test features. I/O addressing Initialization MultiProcessing (Locking) Event Monitoring (Pentium)

X86 SW Challenge

SW architects - define models of computations 16 bit: small/compact/medium/large 32 bit: flat is enough...

Compilers - Generates fastest code for: Family of processors Individual processor

Application writers Write faster code.

Lecture 9: Performance

General Performance Questions

How efficiently is the application running on the platform? How are resources being used? CPU, memory, IO, bus? Where CPU time is being spent? Can program logic and or algorithms be improved in time

consuming code

Section 1: Overview 181Microcomputer CourseKoby Gottlieb, 3/03, p-181

Order of Optimizations

Time everythingAccess memory efficientlyMaximize branch predictabilityAvoid stallsUse a good blend of instructions

Component profiling:

flat profiling Collecting time based and event based information while running

the application Break down of active executables (exe’s, dll’s, drivers etc.)

call graph profiling Display of the call graph Collecting the information based on hierarchical profiling

Adding the time spent in functions that were called to the time of each function

Cache utilization profiling Tradeoff between code and data size.

Example: Processing data in phase A and B. Running phase A to until a data buffer between phases is full.

Dynamic Execution

Dynamic Data Flow Analysis Determines data and register dependencies to detect opportunities

for out-of-order execution.

Deep Branch Prediction Processor can predict many branches ahead of the instruction

pointer

Speculative Execution Allows the processor to execute instructions ahead of the program

counter but still retire the results in the order of the original instruction stream.

Architecture Diagram

Fetch & Decode Unit(In order unit)•Fetches instructions•Decodes instructions to µOPs•Performs branch prediction

Retirement Unit(In order unit)•Retires instructions in order•Writes results to registers/memory

Dispatch / Execute Unit(out of order unit)•Schedules and executes µOPs•Contains 5 execution ports

L2 cache

Bus Interface Unit

L1 data cacheL1 instruction

Instruction Pool/reorder buffer•Buffer of µOPs waiting for execution

System bus

Fetch Load Store

Section 2: Optimizations: Fetch Unit

Fetch Unit

Reads cache lines (32 bytes) from L1 instruction cache

Sends instructions to the decode unit

Fetches the next instructions based on branch prediction

Fetch & Decode Unit

Retirement Unit

Dispatch & Execute Unit

Bus Interface Unit

data cacheIcache

System bus

Instruction Pool(ROB)

Branch Prediction

Static Branch Prediction: Forward conditional branches will be predicted as not taken

jz @forward

Backward conditional branches will be predicted as takenjnz @backward

Unconditional direct branches are always takenjmp label

Unconditional indirect branches fall throughjmp [eax]

Dynamic Branch Prediction: Based on prediction and history in Branch Target Buffer (BTB)

If you can guess from the history what the branch will be, so will the processor (0101 -- taken, not taken, taken, not taken)

Static Branch Prediction Example

What does the static branch prediction algorithm predict for the following branches?

add eax, [edi]

dec ecx

add edi, 4

jnz loop

cmp eax, ebx

je start

mov eax, ecx

start:

mul ebx

test ebx, ebx

jnz not_zero

mov ebx, 50

not_zero:

add ebx, 10

mov ebx, [esi]

push ebx

push eax

call swap

lea edx, [ecx *4]

jmp [edx]

Begin:

shr eax, 1

jnc Begin

mov ebx, eax

Answers: 1-3 taken, 4-6 not taken

Return Stack Buffer

Return Stack Buffer is used to predict RETs Every time a CALL is executed, the return address is pushed on

the Return Stack Buffer Every RET pops the Return Stack Buffer and uses the address for

prediction

Make sure for each CALL there is a RET or the addresses will be out of sequence and all return address predictions will be wrong

Section 2: Optimizations: Decode Unit

Decode Unit

Converts Intel Architecture instructions into micro-ops (µOps)

Stores µOps in the Instruction Pool

To reduce false register dependencies, register references are converted to an internal register set Register alias table Register renaming

Fetch & Decode Unit

Retirement Unit

Bus Interface Unit

data cacheIcache

System bus

The Decoders

3 parallel decoders Decode instructions which require 4 or fewer µOps

Microcode instruction sequencer Decodes instructions which are 5 or more µOps

Do not reorder code, minimal performance gain

microcode instruction sequencer

decoder 01-4 µOps

decoder 21 µOp

decoder 11 µOp

register allocation table

ROB/Instruction Pool

Register Renaming

40 internal registers used to reduce data dependencies No user access and transparent to software

What the processor sees:Asm code internal to processor

mov al, mem

add al, 6

mov [edi], al

mov eax, 5

mov dest1, eax

mov ax, mem2

shr ax, 3

mov dest2, ax

mov <r0>, mem

add <r0>, 6 =>> <r3>

mov [edi], <r3>

mov <r1>, 5

mov dest1, <r1>

mov <r2>, mem2

shr <r2>, 3 =>> <r4>

mov dest2, <r4>

Register Stalls

Register stalls occur when a large register is read after one of its partial registers is written

The instructions do not have to be adjacent because of out-of-order execution

Caused by register renaming The large register might be split among multiple internal registers

Penalty: Time it takes to retire the small register write

mov al, byte ptr [mem]

inc eax ; stalled waiting for al to retire

The Clean Bit

Used to indicate that a register is zero Means the partial register equals the larger register Only eax, ebx, ecx, and edx have clean bits

Partial register stalls will NOT happen if the clean bit is on No direct user access to this bit, completely internal to the

processor

What turns on the clean bit?

Zero the register with XOR or SUB ONLY!

The clean bit can stay on for multiple instructions The clean bit is not copied between registers

xor eax, eax ; eax’s clean bit is on

sub ebx, ebx ; ebx’s clean bit is on

xor eax, eax ; eax’s clean bit is on

mov ebx, eax ; does not copy the clean bit

What turns off the clean bit?

Clean bit is turned off when the operation has the possibility of changing the larger register

Use of the ‘h’ register turns off the clean bit Keep the clean bit on when using partial registers

xor eax, eax ; clean bit on

mov al, 5 ; clean bit still on

dec al ; clean bit still on

inc eax ; clean bit off, could spill into ah

mov al, 6 ; partial load

inc eax ; stall

Register Stall Examples

Identify code segments with register stalls

mov ebx, eax

sub eax, eax

mov ax, bx

inc eax

xor eax, eax

mov ax, bx

inc eax

xor eax, eaxmov ah, 5push eax

mov eax, 0

mov ax, bx

inc eax

movzx eax, bx

push eax1 2 3

pop ax

mov [mem], eax

xor eax, eax

mov al, bl

mov ah, cl

inc eax

pop eax

mov [mem], al7 8 9

Answers: odd boxes stall

Mixing of Code and Data

Don’t put data in the instruction stream put data:

in the data segment or on the stackat the end of the code segmentat the end of a function

Don’t make the processor decode data wastes decode time

usually happens with indirect jumps and their tables reduces cache efficiency

Section 2: Optimizations: Execution Unit

Only out-of-order unit Schedules and executes µOPs

based on data-flow constraints

Fetch & Decode Unit

Retirement Unit

Bus Interface Unit

data cacheIcache

System bus

Execution Units Diagram

ROBReservation

Station

Port 1

Port 3Port 2 Port 4

IntegerUnit

Load Unit

Store Address

Calculation Unit

Store DataUnit

Port 0

MOV EAX,[Memref]

MOV [Memref], EAX

PMULHW PackedMultiply

PackedALU

Address Generation

FP Unit

IntegerUnit

PackedShift

PackedALU

ADDJMP

PSRLQPACKSSWB

Out-of-Order Execution and Data Dependencies

Out-of-Order execution is limited by data dependencies

for (i=0; i<SIZE/2; i+=2){

accum += array[i];accum += array[i+1];

for (i=0; i<SIZE/2; i+=2){

accum1 += array[i];accum2 += array[i+1];

}accum = accum1 + accum2;

Good: Unrolled, but... Better: Dependencies reduced

Serializing Instructions

Serializing instructions cause execution to be delayed until all previous instructions execute, retire, and results are written back to memory. read-modify-write instructions with lock prefix

Typically used in multiprocessor/ shared memory environments XCHG with a memory operand mov to control or debug registers CPUID FLDCW (used in _ftol)

ftolftol

Casting floats to Integer formats cause serious penalties on the Pentium® II processor ~50 clks using ftol

~25clks in exe / ~30 clks in stalls!

3 Problems: Serialization Partial stall with fnstcw Memory stall

Solutions: In most cases you can use round

Store Address Unknown Blocking

If the destination address for a store is unknown, all subsequent loads will stall. e.g.

mov eax, [esi] ;stalls waiting for add edi,4 to retire

add esi, 4

and eax, 0ffh

add edi, 4

mov [edi-4], eax

dec ecx

jnz loop

mov eax, [table+ebx*4]

mov [eax], ecx

mov ecx, [esi] ; stalls

Self Modifying Code

Do not modify code that might already be in the processor’s execution pipeline Both the in-order and out-of-order units will be flushed May cause branch prediction problems Processor stalls the in-order unit until the ROB empties Cache problems and penalties

Not the same as self generating code Code that is created once and executed, but never modified suffers

no penalty

Section 2: Optimizations: Retirement

Retirement Unit

Commits µOps to permanent machine state in original order

Removes µOps from reorder buffer

Fetch & Decode Unit

Retirement Unit

Bus Interface Unit

data cacheIcache

Instruction Pool

System bus

Section 2: Optimizations: Retirement

Store Forwarding

Improves the performance of load operations of recently stored data Must be same data size and alignment Saves the time it takes the cache to combine the data

GOOD BAD

32 bit load

8 bit store

16 bit store

16 bit load

32 bit store

32 bit load

8 bit load

16 bit load

32 bit store

32 bit load

16 bit store

Store Forward Example

Avoid mixing MMx memory LD’s with scalar integer memory ST’s (i.e., 8-, 16-, 32-bit ST’s) to the same SIMD data

Code sequences such as the following will cause MOB blocks:

MOV mem, eax # dword store to “mem”, low-half of SIMD variable

MOV mem+4, ebx # dword store to “mem+4”, high-half of SIMD variable

MOVQ mm0, mem # qword load of entire SIMD variable “mem”

# MOB blocks this load because the data requested is

# not a subset of the data of either ST in the MOB

Store Forwarding C code example

float f;

unsigned int I;

f = I;

Why is it a problem?

Store Forwarding code example

Assigning unsigned int to floatmov eax, DWORD PTR _imov DWORD PTR -16+[ebp], eaxmov DWORD PTR -16+[ebp+4], 0fild QWORD PTR -16+[ebp]

A better code could be:int uint_fix[2] = { 0, 0x4f800000};f1 = ((int) i) + *(((float *) &uint_fix[i>>31]));

And in assembly:mov eax, DWORD PTR _ifild DWORD PTR _ishr eax, 31fadd DWORD PTR _uint_fix[eax*4]

Section 2: Optimizations: Memory & Cache

Memory Interface

16 Kb data cache 2 way, set associative

Write back cache by default

Fetch & Decode Unit

Retirement Unit

Bus Interface Unit

data cacheIcache

Instruction Pool

System bus

The Memory Types

Write back (WB) Default for system memory All reads and writes are cached Write misses allocate and fill cache lines Read and write sequentially for maximum performance

Uncacheable memory (UC) Accesses are executed in order without reordering Accesses to other memory types can pass accesses to UC No speculative accesses

Write-Combining (WC)

Writes are buffered and burst across the bus improving bus utilization Writes are stored in an internal buffer (32 bytes), delayed,

reordered, and combined Speculative reads ARE allowed Reads are NOT cached Buffer is flushed on:

UC access, different address, return from an interrupt, serializing instruction, locked, or I/O instruction XCHG with memory (fastest locking instruction)

Lecture 10: Hyper-Threading

Intel's Hyper-Threading TechnologyOverview

Intel's Hyper-Threading Technology brings the concept of simultaneous multi-threading to the Intel Architecture.

Hyper-Threading Technology makes a single physical processor appear as two logical processors. the physical execution resources are shared and the

architecture state is duplicated for the two logical processors.

From a software or architecture perspective, this means operating systems and user programs can schedule processes or threads to logical processors as they would on multiple physical processors.

From a microarchitecture perspective, this means that instructions from both logical processors will persist and execute simultaneously on shared execution resources.

Thread Level - Amdahl’s Law

Maximum Efficiency

Fraction parallel limits scalability

Key: Parallelize everything significant

0 20 40 60 80 100

Percent Parallel

2 Threads

4 Threads

8 Threads

16 Threads

32 Threads

64 Threads

Multithreading + Multiple Processors= Improved Performance

Run threads using multiple processors

CPU 1CPU 1

CPU 2CPU 2

CPU 3CPU 3

Brief Introduction to ThreadsBrief Introduction to Threads

Multiprocessing

Open DB’s Address Book Open DB’s Address Book

InBox CalendarInBox Calendar

Functional Parallelism

Concurrent Tasks

Apply different operations to different data elements

Data Parallelism Apply the same operation to different data elements

function SpellCheck {

loop (word = 1, words_in_file)

compare_to_dictionary (word);

function SpellCheck {

loop (word = 1, words_in_file)

compare_to_dictionary (word);

Open FileOpen File Edit Spell Check Edit Spell Check

Thread Libraries – Win32* API

C language interfacesThreads exist within a single processGood for asynchronous concurrencyAll threads are peers

No explicit parent-child model Exception: main() thread

Creating Win32* ThreadsHANDLE CreateThread(

LPSECURITY_ATTRIBUTES ThreadAttributes,DWORD StackSize,LPTHREAD_START_ROUTINE StartAddress,LPVOID Parameter,DWORD CreationFlags,LPDWORD ThreadId );

*Other names and brands may be claimed as the property of others

Thread handle is a synchronization object

Functions are explicitly mapped to threads

Pentium 4 Block diagram

Execution pipeline

A high-level view of the microarchitecture pipeline. buffering queues separate major pipeline logic blocks. The buffering queues are either partitioned or duplicated to

ensure independent forward progress through each logic block.

Fetch and deliver

• Alternate between logical processors

• Execution trace cache/ Microcode ROM

• Fetch and decode instructions

• Register rename and allocation

Execution The out-of-order execution engine consists of

the allocation, register renaming, scheduling, and execution functions

Logical processors execute simultaneously Compete for schedulers, ALUs Schedulers map independent instructions to available

execution resources

The memory subsystem The memory subsystem includes:

The DTLB: translates addresses to physical addresses. It has 64 fully associative entries; each entry can map either a 4K or a 4MB page. Although the DTLB is a shared structure between the two

logical processors, each entry includes a logical processor ID tag.

Each logical processor also has a reservation register to ensure fairness and forward progress in processing DTLB misses.

the low-latency Level 1 (L1) data cache the Level 2 (L2) unified cache, and the Level 3 unified cache (the Level 3 cache is only available

on the Intel® XeonTM processor MP). Access to the memory subsystem is also largely oblivious to

logical processors. The schedulers send load or store uops without regard to logical

processors and the memory subsystem handles them as they come.

Instruction retirement

Alternate between logical processorsCommit state in program order

Hyper Threading implementation

Two logical processors for very small additional die area

Alternate between logical processorsFetch and deliverReorder and retire

Competitive sharing between logical processorsRapid execution engineCaches

OS support From the OS point of view HT is just like multi processing

Needs BIOS support for initialization

HLT instruction (for idle) The HLT instruction stops instruction execution and places the

processor in HALT stat. An enabled interrupt, NMI or Reset resume execution. The return instruction from HLT is the next instruction

There are two modes of operation referred to as single-task (ST) or multi-task (MT).

On a processor with Hyper-Threading Technology, executing HALT transitions the processor from MT-mode to ST0- or ST1-mode, depending on which logical processor executed the HALTSpin loops:

Use new pause instructionFor long wait use the OS call

Wait on object

Lecture 11: x86 and 3D graphics

Quick Intro to 3D Graphics

Glossary:Vertex – point in 3D spaceTriangle – 3 connected verticesObject – list of triangles that have the

same material properties (AKA Mesh)Texture – a 2D image that is wrapped

on the surface of a Mesh

For Each Triangle...

Geometry transform each vertex (FP)

Lighting compute lighting from light sources and surface

properties (FP)

Rasterization setup triangle for rasterization (FP) perform shading & texture mapping during

triangle fill (INT)

Rendering Processlight

source

objects

viewing planeX

incident light

perspective projection

Mathematical modelsfor light, objects & viewercreate a 2D image viaa 3D process

Coordinate Systems

world xo

object

zv viewer

Transformation

Transformation change the coordinate system a point lies in

ExamplesViewing Transformation - translate

objects to viewer coordinate before projection to viewing plane

Shadows - translate objects to light source coordinates to calculate shadows

Lighting Process

Surface

Diffuse Light scatters in all directionsSpecular Light reflects in direction of reflection vector R

Rasterization

Now that we have transformed and lit polygons… actually, the vertices...

And, we know where they appear on the screen…

We have to fill their interiors!

Textures

Image maps to apply surfaces.Gives impression of complex surface properties.

Can substitute for lots of polys (eg, tree bitmap on a single rectangle, vs. thousands of leaf polys)

Rasterization (again)

Flat Fill

Some lighting

And textures

SkinnedMesh.exe

History of 3D Pipeline Partitioning

Geometry

Lighting

Edge Setup

Rasterization

Application

Software 3DProcessor Does All

First Generation3D HW Accelerators

Rasterization

Geometry

Lighting

Edge Setup

Application

2nd Generation HW (1998-99)

Geometry

Lighting

Edge Setup

Rasterization

Application

3rd Generation HW (1999-2000)

Geometry

Lighting

Edge Setup

Rasterization

Application

Memory BW: The problem

Frame buffer - ~3MB Z buffer - ~3MB For each object:

Mesh (vertices and triangle connectivity) – 1KB – 1MB Texture(s) - ~256KB

Typical game frame – 8MB – 20MB

AGP: the Solution AGP (Accelerated Graphics Port)

Larger BW than PCI Lower HW cost (less local RAM needed)

Frame/Z buffers stored in graphics local memory Texturing from system memory

Memory traffic when using AGP

How does it work

The AGP aperture is mapped as one chunk (1:1 mapping in CPU’s paging HW), both the gfx chip and the CPU reference the same addresses

The GART is mapping AGP memory address to the system memory address (like paging HW in the CPU)

Current generation

Multi-texturing: More than one texture for surface, used for details maps, reflections, refractions, lighting tricks, etc.

Programmable HW

Helps the developer in customizing its transform, lighting and texture operations

Programmable vertex machine

Programmable HW demos

Vertex Shader Processing

map vertex data to VVM input registers

execute vertex shader

collect data from VVM output registers, format and write to output

clip and render

Vertex shader declaration

Vertex shader code

Vertex Virtual Machine (DX8)

Constant Memory

96 entries

128 bits4 floats

16 entries

13 entries

12 entries

Vertex Input

Vertex Output

RegistersVertex

ShaderA0.X

128 instructions

128 bits4 floats

Vertex Virtual Machine (DX9)

19 entries

12 entries

Vertex Output

RegistersVertex

ShaderA0.XYZW

256 instructions

Constant Memory

256 entries

128 bits4 floats

16 entries

Vertex Input128 bits4 floats

128 bits4 floats

16 entries

Loop control128 bits

4 ints

16 entries

Branch control16 BOOLs

Changes from DX8 in red

Vs.2.0 (DX9) shader language

DX8 Mov Add Mul Mad Dp3 Dp4 Slt Sge Min Max Rcp Rsq Expp Logp Dst* Lit*

DX9 additions Loop, Endloop Rep, Endrep Jnz Call Callnz Ret Exp Log M3x3, M4x3, M3x4, M3x2,M2x3** Sgn** Abs** Frc** Lrp** Pow** Nrm** Sincos**

*Removed from DX9

**Macro instructions

Vertex Shader Example

Vs.1.1

Dcl_position v0

Dcl_color0 v1

Dcl_texcoord0 v2

Dp4 oPos.x, v0, c0

Dp4 oPos.y, v0, c1

Dp4 oPos.z, v0, c2

Dp4 oPos.w, v0, c3

Mov oD0.rgb, v1

Mov oT0.xy, v2

SW Vertex shader

backup

Shading Techniques

Scan Line

Polygon fill is done across the

scanline

Flat shading Color the whole polygon with one color Does not show highlights inside polygons

Gouraud shading If the object surfaces are curved, we can

approximate it by polygons Interpolating vertex intensity values along

scanline Still does not show highlights inside polygons

Texture Mapping]

1. Texture coordinates (s,t,q) for each vertex. These are interpolated along polygon edges.

2. Texture coordinate for each point on scanline is interpolated. Linear interpolation, or a quadratic approximation (for perspective correction) is used.

3. The (s,t) value maps into the source texture map. It usually does not fall on the center of a texel.

4. A texture mapping algorithm is applied to the nearest texel, and possibly surrounding texels, to determine the result.

Texturing eats CPU MIPS & bandwidth. 20-70% slower than flat shading.

(s0, t0 , q0)

(s1, t1, q1)

(s2, t2 , q2)

(si, ti)

s = 0, t = 0 s = 1, t = 0

s = 1, t = 1

Credits

This foil set was developed by Ronny Ronen

Microcomputer Course Koby Gottlieb, 3/03, p-1 Microcomputer Course: x86 Architecture Basics Koby...

Documents

Transcript of Microcomputer Course Koby Gottlieb, 3/03, p-1 Microcomputer Course: x86 Architecture Basics Koby...

Microcomputer Components

Fasciculo 3 MicroComputer

Gottlieb (1991) Experiential canalization of behavioral ...wexler.free.fr/library/files/gottlieb (1991) experiential... · ducklings (Gottlieb, 1987a), but those were tested with

Software Performance Tuning Project – Final Presentation Prepared By: Eyal Segal Koren Shoval Advisors: Liat Atsmon Koby Gottlieb.

Adaptive Regularization of Weight Vectorswebee.technion.ac.il/people/koby/publications/arow_nips...Adaptive Regularization of Weight Vectors Koby Crammer Department of Electrical Enginering

Microcomputer Block Diagram

UNIT I Microcomputer: microcomputer microprocessor µPscadec.ac.in/upload/file/Microprocessor & Microcontroller Notes.pdf · Microcomputer: The term microcomputer is generally synonymous

Submitters:Vitaly Panor Tal Joffe Instructors:Zvika Guz Koby Gottlieb Software Laboratory Electrical Engineering Faculty Technion, Israel.

BELGIUM - Cleary Gottlieb

Slide show koby

Koby and Kylie Co. Retail Catalog

BBC Microcomputer service manual - The Centre for ...chrisacorns.computinghistory.org.uk/docs/Acorn/Manuals/...BBC Microcomputer service manual SECTION 1 BBC Microcomputer Models A

Gottlieb v Echevarria

Microcomputer Interfacing

Microcomputer-BasedData Acquisition System for Crop …psasir.upm.edu.my/3540/1/Microcomputer-Based_Data_Acquisition.pdf · Microcomputer-BasedData Acquisition System for Crop ...

Introduction to Microcomputer Systems Outline –What is a Microcomputer? –Types of Microcomputers –Modular Microcomputer –Clock and CPU Control Circuits.

Microcomputer Ppt

By Bryan Gottlieb*

Will Technology Save Us? - Bethany Koby

Microcomputer Systems 1