Post on 20-Dec-2015
Microcomputer CourseKoby Gottlieb, 3/03, p-1
Microcomputer Course:x86 Architecture Basics
Koby Gottlieb, Intel IsraelSpring Semester 2003
Microcomputer CourseKoby Gottlieb, 3/03, p-2
Administration
Participants
Grading 65% exam, 35% exercises All exercises are mandatory! No exercise => no final grade
Web page: http://www.ee.technion.ac.il/~ee044800/ Latest version of slides Pointers to additional info
Material
Contacts Koby Gottlieb koby.gottlieb@intel.com 04-8655718 George Leifman [gleifman@techunix.technion.ac.il]
Microcomputer CourseKoby Gottlieb, 3/03, p-3
Objectives
Get acquainted with the X86 architecture and system components Historical perspective
Major features
SW/HW aspects
What you will not get here Expertise in X86 programming
Full details of everything
Expect Varying level of details
Hopefully: taste of more
Microcomputer CourseKoby Gottlieb, 3/03, p-4
Lecture 1: Introduction
Microcomputer CourseKoby Gottlieb, 3/03, p-5
History, Trends, & Terminology
Microcomputer CourseKoby Gottlieb, 3/03, p-6
Major Milestones in Microprocessors 1971 - 1975
Intel: 4004, 4 bit, 8008, 8 bit, 8080, 8 bit Motorola & other Still as a logic substitution
1975: Altair, 8080 based, 1st kit. Basic interpreter 1977: TRS80 of RadioShack
Apple, Apple-II 1980: VISICALC - 1st killer application
12 active companies 1981: IBM-PC, w/8088. MS-DOS, IBM PC/XT later Open standard! 1982: 80286 CPU 1984: IBM PC/AT, 80286 based 1985: i386 CPU 1987: i386 based PC 1989: i486 CPU, Windows! 1991: i486 PC, x86 competitors 1993: Pentium® processor 1994: Pentium based PC 1995: Pentium Pro processor 1996: Pentium Pro based PC Windows 95, Windows NT 1996: Pentium w/MMXTM Technology + PC 1997: Pentium-II® processor + PC, Cyrix MediaGX Did not mention buses, networking, etc...
Microcomputer CourseKoby Gottlieb, 3/03, p-7
Trends Phase-I (-1970): “Enter the game”
Technology progress
Phase-II (1970-1990): “Imitate the big guys”Use technology and copy ideas from mainframes Enlarge word size FP hardware Paging, Protection, Caching High integration
Phase-III (1990-): “Lead the game”Innovation unique to microcomputers: Super-scalar, out of order execution Branch prediction High frequency networking Graphics
Diverging again?Net-PC, thin client, connected PC, High integration, etc...
Microcomputer CourseKoby Gottlieb, 3/03, p-8
Representative Data
ScreenYear Xistors MIPS Memory Disk Screen
Memory___________________________________________________
1975 10K 0.2 <64K - Mono 2K
1985 275K 5.5 1M <10MB CGA 14” 64K
1995 5.5M 300 16M >500MB SVGA 17” 1MB___________________________________________________
Factor 20 40 16 50 - 16
1998 7.5M 600 32M 4GB “ 4MB
Microcomputer CourseKoby Gottlieb, 3/03, p-9
Performance Observations
Performance is doubled every 18-24 months
The “Spiral” principle (A. Grove):
New software is written to the fastest available processor. Actually, it takes the next generation of processors to run this software (reasonably).
There will always be applications that will use all available computing power.
<well, some people question that…>
"My nightmare is to wake up some day and not need any more computing power." Gordon Moore, Feb 1998
Microcomputer CourseKoby Gottlieb, 3/03, p-10
Computer System
Class focus CPU:
General architectureMajor components
System:Buses (CPU, ISA, PCI, USB) InterruptsMemory hierarchy
CPU
PCI BUS
Host/PCICacheBridge
Memory
ScsiAdap
ScsiBus
LanAdap
LAN
USBHub
KeyBoard,Monitor,Mouse,Speakers
etc..
GraphAdapt
VideoBuffer
Microcomputer CourseKoby Gottlieb, 3/03, p-11
Architecture & Microarchitecture
Architecture:“The science, art, or profession of designing and constructing building, bridges, etc...” Focus on how it looks externally Consider how it can be built and how good it performs!
Construction:“The way in which something is built, formed or devised by fitting parts or elements together systematically” Focus on how it is built/operated Make sure it works and works correctly!
Definition applicable also for buildings, cars, computers, etc...In computers: construction == architecture
Microcomputer CourseKoby Gottlieb, 3/03, p-12
Architecture & Microarchitecture (cont.)
Architecture:The collection of features of a processor (or a system) as they are seen by the “user”
User: a binary executable running on that processor, or
assembly level programmer
MicroArchitecture:The collection of features or way of implementation of a processor (or a system) do not affect the user.
Features which change Timing/performance are considered microarchitecture.
Microcomputer CourseKoby Gottlieb, 3/03, p-13
Architecture & Microarchitecture Elements
Architecture: Registers data width (8/16/32) Instruction set Addressing modes Addressing methods (Segmentation, Paging, etc...) Protection etc…
Architecture Bus width Physical memory size Caches & Cache size Number of execution units Execution Pipelines, Number & type execution units Branch prediction TLB etc...
Timing is considered Architecture (though it is user visible!)
Microcomputer CourseKoby Gottlieb, 3/03, p-14
Compatibility
Processor A (or system A) is compatible to processor B (or system B) if it preserves the same architecture features as seen by a user of B
SW compatible - most important for us! The ability to run existing SW!
HW compatible Upward compatible “Exactly” compatible (sometimes - “Downward” compatible)
(Required more for application, less for OS) Architecture Extension: addition of data-types, instructions,
etc… to enable better support of new/existing applications.e.g. MMX
Trends: 1985-1955: no/minimal architecture changes 1995- : we see them coming (MMX, SSE etc.)
Microcomputer CourseKoby Gottlieb, 3/03, p-15
Basic Architecture
Overview Data types Registers Stack Flags Segmentation Instructions:
Format Memory operands Instruction Set
Microcomputer CourseKoby Gottlieb, 3/03, p-16
IA (X86) evolution - Detailed
Year Proce-ssor
bits bus PhMem
MHZ MIPS FP Cache Prot Page Tran Tech P W Feature
1978 8086 16 16 1M 5/10 .3/.8 8087 - - - 29K 3u 40 2.5/1.0 New, 16 bit
1979 8088 16 8 1M 5/8 -"- 8087 - - - -"- 3u 40 2.5/1.0 8 bit bus
1982 80186 16 16 1M 80187 - - - 56K More integration
1982 80286 16 16 16M 8/12.5 1.2/2.7 80287 - + - 134K 68 3.3/1.1 Protection, 16M
1985 80386 32 16/32 4G 16/33 5.0/11.4 80387 - + + 275K 132 2.7 32 bit, paging
1989 i486 32 32 4G 25/100 20/80 on/487 8K + + 1.2M 1.2/.8 168 4.5 FP on chip,Cache, “RISC”
1993 Pentium 32 64 4/64G 66/200 112/400 on 2*8K + + 3.1M .8/.6 273 10/15 Dual Pipe, FPpipe, DP support
1995 P6 32 64 4/64G 150/200 400 on 2*8K+256K + + 5.5M 0.6 387 14-18 Dynamicexecution
1996 P55C 32 64 4/64G 200/233 466 on 2*16K + + >5M 0.6/0.35 273 8-10 MMX Technology
1997 PentiumII
32 64 4/64G 233/300 600 on 2*16K+512Kon SECC
+ + ~7.5M 0.35 242(SEC)
34/43
Each major generation is 2-3X faster than its predecessor!
Microcomputer CourseKoby Gottlieb, 3/03, p-17
Operation Modes
Architecture reflects this evolution!
Evolution => several modes: 16 bit:
Real Mode (8086 and up) Protected mode (286 and up) Virtual 86 mode (386 and up)
32 bit: Protected mode (386 and up) Paging enabled/disabled
Proce-ssor
RealMode
Prot-16 Prot-32
86 +
286 + +
386+ + + +
Microcomputer CourseKoby Gottlieb, 3/03, p-18
Philosophical Basis of First Wave Architecture
HW / Instruction level Support for high level languages
Instruction Set encoding for memory efficiency Two-Operand Instructions
reg-reg, reg-imm, reg-memory, read-modify-write
Specialized and Implicit Register Usage
Complex Instructions to maximize functionality / instruction fetch
Memory Segmentation Relocatable programs / modules well supported
Addressing modes suitable for high-level languages Stack and Data spaces in instruction set architecture
Push / Pop Addressing in Stack Space
Structure and Array element addressing syllables
Microcomputer CourseKoby Gottlieb, 3/03, p-19
Lecture 2: x86 Architecture Basics
Microcomputer CourseKoby Gottlieb, 3/03, p-20
Basic Data Types
BYTE
WORDHIGHBYTE
LOWBYTE
15 7 0
Address N+1
Address N
DOUBLEWORD
QUADWORD
HIGH WORD LOW WORD
Address N+3
Address N+2
Address N+1
Address N
31 15 0
LOW DOUBLEWORD
63 47 31 15 0
AddressN+7
AddressN+6
AddressN+5
AddressN+4
Address N+3
Address N+2
AddressN+1
Address N
Figure 3-2 Fundamental Date Types
BYTE
AddressN
7 0
HIGH DOUBLEWORD
Microcomputer CourseKoby Gottlieb, 3/03, p-21
Memory Layout Little Endian - LSB in lower addresses
Figure 3-3 Bytes, Words, Doublewords and Quadwords in Memory
7A
FE
06
36
1F
A4
23
0B
74
CB
31
0EH
0DH
0CH
0BH
0AH
9H
8H
7H
6H
5H
4H
3H
2H
1H
0H
QUADWORD AT ADDRESS 6CONTAINS 7AFE06361FA4230BH
DOUBLEWORD AT ADDRESS OAHCONTAINS 7AFE0636H
WORD AT ADDRESS 0BHCONTAINS FE06H
BYTE AT ADDRESS 9CONTAINS 1FH
WORD AT ADDRESS 6CONTAINS 230BH
WORD AT ADDRESS 2CONTAINS 74CBH
WORD AT ADDRESS 1CONTAINS CB31H
Microcomputer CourseKoby Gottlieb, 3/03, p-22
Extended Data Types7 0
15 0
031
07
15 0
...
...
N 0
N
47 31
0
0
0
...
...
Figure 3-4. Data Types
BYTE INTEGER7 BIT MAGNITUDE1 BIT SIGN
WORD INTEGER15 BIT MAGNITUDE1 BIT SIGN
DOUBLEWORD INTEGER31 BIT MAGNITUDE1 BIT SIGN
BYTE ORDINAL8 BIT MAGNITUDE
WORD ORDINAL16 BIT MAGNITUDE
DOUBLEWORD ORDINAL32 BIT MAGNITUDE
BCD INTEGER4 BIT DIGIT PER BYTE
PACKED BCD INTEGER4 BIT DIGIT PER HALF-BYTE
NEAR POINTER32/16 BIT OFFSET
FAR POINTER32/16 BIT OFFSET16 BIT SELECTOR
BIT FIELDUP TO 32 BITS
BIT STRINGUP TO 4 GIGABITS
BYTE STRINGUP TO 4 GIGABYTES
31 0
31 15 0Pointers w/ 16 Bit Offsets
031
Microcomputer CourseKoby Gottlieb, 3/03, p-23
Registers
8 “General” 32 bit registers: EAX, EBX, ECX, EDX, ESI, EDI, EBP, ESP
Lowest 16 bits can be specified as AX, BX, ... Bits 0-7, 8-15 of first four registers can be addressed as bytes:
AL, AH, BL, BH, CL, CH, DL, DH
Special registers: eFLAGS, eIP Segment registers: CS, DS, ES, SS, FS, GS 8 Floating Point registers + FP control & status 8 MMX registers Various OS registers (trace, debug, control, status).
Not discussed here
* Throughout this presentation: eREG = REG / EREG, according to 16/32 bit mode
Microcomputer CourseKoby Gottlieb, 3/03, p-24
Basic Registers
Figure 3-5. Application Register Set
GENERAL REGISTERS
31 23 15 7 0
AH
DH
CH
BH
AL
DL
CL
BL
16-BIT
AX
DX
CX
BX
32-BIT
EAX
EDX
ECX
EBX
EBP
ESI
EDI
ESP
BP
SI
DI
SP
15 0SEGMENT REGISTERS
CS
SS
DS
ES
FS
GS
STATUS AND CONTROL REGISTERS31 0
EFLAGS
EIP
Microcomputer CourseKoby Gottlieb, 3/03, p-25
Floating Point
FPU DATA REGISTERSTAGFIELD
SIGNIFICANDEXPONENTSIGN
79 78 64 63 0 1 0FP7
FP6
FP5
FP4
FP3
FP2
FP1
FP0
CONTROL REGISTER
STATUS REGISTER
TAG WORD
15 0 47
INSTRUCTION POINTER
DATA POINTER
Figure 6-1. Floating-Point Unit Register Set
Microcomputer CourseKoby Gottlieb, 3/03, p-26
MMX (1)
Packed bytes (8x8 bits)
63 56 55 48 47 40 39 32 31 24 23 16 15 8 7 0
Packed word (4x16 bits)
63 48 47 32 31 16 15 0
Packed doublewords (2x32 bits)63 32 31 0
63 0
Quadword (64 bits)
Figure 2-3. Packed Data Types
Microcomputer CourseKoby Gottlieb, 3/03, p-27
MMX (2)
63 56 55 48 47 40 39 32 31 24 23 16 15 8 7 0
Memory address 1008h Memory address 1000h
Figure 2-4. Eight Packed Bytes in Memory (at address 1000H)
Figure 2-2. MMX Register Set
MM7
MM6
MM5
MM4
MM3
MM2
MM1
MM0
63 0
Microcomputer CourseKoby Gottlieb, 3/03, p-28
Streaming SIMD Extensions
Packed, single-precision FP (4x32 bits)
127 96 95 64 63 32 31 0
Streaming SIMD Extensions Register Set
XMM7
XMM6
XMM5
XMM4
XMM3
XMM2
XMM1
XMM0
127 0
Microcomputer CourseKoby Gottlieb, 3/03, p-29
Stack Related instructions: Push/Pop, Call/Ret Used in interrupt handling
TOP OF STACK
STACK SEGMENT31 0
SUBROUTINEPASSED
VARIABLES
EBP
ESP
BOTTOM OF STACK(INITIAL ESP VALUE)
PUSHES PUT THETOP OF STACK ATLOWER ADDRESSES
POPS PUT THE TOP OF STACK ATHIGHER ADDRESSES
Figure 3-8. Stacks
Microcomputer CourseKoby Gottlieb, 3/03, p-30
Flags
Affected by ALU operations, e.g.,: ADD, CMP
Used mainly for control flow: Jcc
But also by other instructions ADC, RLC, etc...
Table 3-2. Status FlagsNAME PURPOSE CONDITIONS REPORTED
OF OVERFLOW Result exceeds positive or negative limit of numbers range
SF Sign Result is negative (less then zero)
ZF Zero Result is zero
AF Auxiliary carry Carry out of bit position 3 (used for BCD)
PF Parity Low byte of resutl had even parity (even number of set bits)
CF Carry flag Carry out of most significant bit of result
FIGURE 3-9. EFLAGS Register
CF
1PF
0AF
0ZF
SF
TF
IF
DF
OF
IOPL
NT
0RF
VM
AC
VIF
VIP
ID
0000000000
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
X ID FLAG (ID)X VIRTUAL INTERRUPT PENDING (VIP)X VIRTUAL INTERRUPT FLAG (VIF)X ALIGNMENT CHECK (AC)X VIRTUAL 8086 MODE (VM)X RESUME FLAG (RF)X NESTED TASK (NT)X I/O PRIVILEGE LEVEL (IOPL)S OVERFLOW FLAG (OF)C DIRECTION FLAG (DF)X INTERRUPT ENABLE FLAG (IF)X TRAP FLAG (TF)S SIGN FLAG(SF)S ZERO FLAG (ZF)S AUXILIARY CARRY FLAG (AF)S PARITY FLAG (PF)S CARRY FLAG (CF)
S INDICATES A STATUS FLAGC INDICATES A CONTROL FLAGX INDICATES A SYSTEM FLAG
BIT POSITIONS SHOWN AS O OR 1 ARE INTEL RESERVED.DO NOT USE. ALWAYS SET THEM TO THE VALUE PREVIOUSLY READ.
Microcomputer CourseKoby Gottlieb, 3/03, p-31
Segmentation (1) Problem: use “large” memory with short word width
16 bits can cover only 64KB
Solution: address space divided into “segments” Logical address is segment:offset Linear address: address in the “flat” address range Segmentation: mapping from Logical to Linear
Segment base = f(segment register) offset < segment size: 64K (8086/286), 4G (386+) linear address = segment_base + offset
At any time. there are 4 (8086) or 6 (386 and on) active segment registers CS: code DS: data SS: stack ES: extra FS, GS: new extra
Rough analogy - phone directory: region and a number within
Microcomputer CourseKoby Gottlieb, 3/03, p-32
Segmentation (2)
Figure 3-7. A Segmented Memory
CSSS
DSES
FSGS
DIFFERENT LOGICAL SEGMENTSDIFFERENT ADDRESS SPACE
IN PHYSICAL MEMORY
CODESEGMENT
STACKSEGMENT
DATASEGMENT
DATASEGMENT
DATASEGMENT
DATASEGMENT
Figure 3-6. An Unsegmented Memory
DIFFERENT LOGICAL SEGMENTS
GSFSESDSCSSS
ONE PHYSICAL ADDRESS SPACE
Microcomputer CourseKoby Gottlieb, 3/03, p-33
Instruction Set Architecture
Data transfer Common ALU operations (add, sub, cmp...) Control flow (jmp, call, jcc) String (movs, cmps, sets - using eSI, eDI, eCX) FP instructions (also SSE variant) MMX instructions OS support (task switch, call gates, table handling, etc.) I/O instructions
“Exotic” instructions (DAA, XLAT, etc.)
Microcomputer CourseKoby Gottlieb, 3/03, p-34
Data Transfer Instructions MOV Byte / Word / Dword, register to / from register / memory
Also Zero/Sign extended moves
STACK OPERATIONS PUSH [SOURCE]
SP = SP - 2 ; ( SS : SP ) = SOURCE ; (16-bit)
ESP = ESP - 4 ; ( SS : ESP ) = SOURCE ; (32-bit)
POP [DEST]
DEST = ( SS : SP ) ; SP = SP + 2; (16-bit)
DEST = ( SS : ESP ) ; ESP = ESP + 4; (32-bit)
I/O instructions IN (I/O port specified by 8-bit immediate or DX)
EAX / AX / AL = Dword / Word / Byte from I/O port
OUT (I/O port specified by 8-bit immediate or DX)
I/O Port = Dword / Word / Byte from EAX / AX / AL
Microcomputer CourseKoby Gottlieb, 3/03, p-35
ALU and Shift Instructions
ALU Usual ALU ops on Byte / Word / Dword operands Integer Multiply and Divide Instructions
Multiply and Divide use implicit register pairs Chained add / subtract for multi-precision arithmetic Conversion ops for packed and unpacked decimal math
SHIFTS and ROTATES Logical and arithmetic shifts Rotates through / not through C bit By constant (1) or by count in CL
Later extended to immediate operand in 80386
Microcomputer CourseKoby Gottlieb, 3/03, p-36
Control Flow
Assortment of conditional jumps Jcc Byte/Word/Dword displacement CS not modified, eIP = eIP + signed displacement *
JMP SHORT - one byte displacement NEAR - two/four byte displacement FAR - Across Segments (Loads CS : eIP)
CALL Like JMP Pushes next eIP in SHORT and NEAR forms Pushes CS : eIP in FAR forms
RET (Return) Pops into eIP in NEAR form Pops into CS : eIP in FAR form
* eIP = IP / EIP, according to 16/32 bit mode
Add examples
Microcomputer CourseKoby Gottlieb, 3/03, p-37
String Instructions String operation components
Opcode specifying ES:eDI op DS:eSI Typically preceded by REP (Repeat) Prefix
for ( ; Repeat Condition; eCX -- )
{
ES : eDI op DS : eSI
eDI = eDI + d*oprnd_size ; eSI = eSI + d*oprnd_size
}
d = 1 or -1 Value selected by Direction Flag
OP can be move, compare, find … oprnd_size is one of 1/2/4
Repeat Condition is test of Zero Flag Can be test if set or test if clear Zero Flag set by CX == 0 or by op (CMPS)
* eREG = REG / EREG, according to 16/32 bit mode
Microcomputer CourseKoby Gottlieb, 3/03, p-38
Small ExamplesRepeat Mov
struct str {char c[1000];}src,dst;void move()
{ dst=src;}
COMM_src:BYTE:03e8HCOMM_dst:BYTE:03e8H...; 2 : {
push ebpmov ebp, esppush esipush edi
; 3 : dst=src;mov ecx, 250mov esi, OFFSET FLAT:_srcmov edi, OFFSET FLAT:_dstrep movsd
; 4 : }pop edipop esipop ebpret 0
Mov, ALU, Jmp
int array[100];int sum=0;
void sum_array(){ register int i;
for(i=0;i<100;i++)sum+=array[i];
}
COMM_array:DWORD:064H_sum DD 01H DUP (?); 5 : register int i;; 6 : for(i=0;i<100;i++)
mov eax, OFFSET FLAT:_array$L28:; 7 : sum+=array[i];
mov ecx, DWORD PTR [eax]add eax, 4add DWORD PTR _sum, ecxcmp eax, OFFSET FLAT:_array+400jl SHORT $L28
; 8 : }
Microcomputer CourseKoby Gottlieb, 3/03, p-39
Instruction Format (1) Variable length instructions! Remember - only 2 explicit operands - source and destination
may be the same Machine language
Zero or more prefixes Opcode Mod/Rm Displacement Immediate
8/16/32 Handling Instruction encoding differs for 8-bit and 16/32 bit operands Same instruction encoding is used for 16 and 32 bit operands Current code segment defines if processor runs 16 or 32 bit code In 16 bit code “word” operands indicate 16 bit operand,
In 32 bit code “word” operands indicate 32 bit operand. This can be toggled by the operand-size prefix (0x66)
Microcomputer CourseKoby Gottlieb, 3/03, p-40
Instruction Format (2)
Figure F-1. Core Instruction Format
d32|16|8| none d32|16|8|none{ {TTTTTTTT TTTTTTTT mod TTT r/m ss index base
7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 7-6 5-3 2-0 7-6 5-3 2-0
{ { {{
opcode(one or two bytes)(T represents an
opcode bit.)
mod r/mbyte
s-i-bbyte
register and addressmode specifier
addressdisplacement(4, 2, 1 bytes
or none)
immediatedata
(4, 2, 1 bytesor none)
Figure 25-1. Instruction Format
1 OR 2 0 OR 1 0 OR 1 0,1,2 OR 4
NUMBER OF BYTES
0,1,2 OR 4
INSTRUCTIONPREFIX
ADDRESS-SIZE PREFIX
OPERAND-SIZE PREFIX
SEGMENTOVERRIDE
0 OR 1 0 OR 1 0 OR 1 0 OR 1
NUMBER OF BYTES
REP PREFIX
LOCK PREFIX
0 OR 1 0 OR 1
OPCODE MOD R/M SIB DISPLACMENT IMMEDIATE
Microcomputer CourseKoby Gottlieb, 3/03, p-41
Assembly Version
MS/Intel “programming tool” Case insensitive Operand size implicit Destination on left
Unix “compiler back-end” Case sensitive Operand size explicit Destination on right
Assembler should be smart enough to handle prefix when 16 bit operand is used in 32 bit code...
Microcomputer CourseKoby Gottlieb, 3/03, p-42
Example of ASM FormatIntel/Microsoft tools:
- op dest, src- implied width => requires adding BYTE PTR if not clear- many other goodies (structures style)
MOV EAX, ECX ; EAX:=ECXMOV BYTE PTR GS:[EAX]+4, 0 ; GS:[EAX+4]:=0
REP MOVSB REP MOVS BYTE PTR ES:[EDI], BYTE PTR DS:[ESI]
ADD BX,[EAX+EDX*2]+4 ; 16 bit opsize ADD EBX, [BX+SI] ; 16 bit addr STRING:DB ’A’,’B’,’C’, 0
Unix tools:- op src, dest- explicit width- case sensitive- less flexible
movl %ecx,%eaxmovb 0, %gs:4(%eax)
rep movsbaddw 4(%eax,%edx,2),%bxaddl (%bx,%si),%ebx
L001: .4byte “abc”,’0’
Microcomputer CourseKoby Gottlieb, 3/03, p-43
Instruction Format Example strncpy code listing
section .textstrncpy() 0: 55 pushl %ebp 1: 8b ec movl %esp,%ebp 3: 83 ec 0c subl $0xc,%esp 6: c7 45 fc 00 00 00 00 movl $0x0,-4(%ebp) d: 2b d2 subl %edx,%edx f: 3b 55 10 cmpl 16(%ebp),%edx 12: 73 26 jae 0x26 <3a> 14: 8d 45 fc leal -4(%ebp),%eax 17: 89 04 24 movl %eax,(%esp) 1a: e8 fc ff ff ff call 0xfffffffc <1b> 1f: 8b 45 0c movl 12(%ebp),%eax 22: 8b 55 fc movl -4(%ebp),%edx 25: 8a 04 10 movb (%eax,%edx),%al 28: 8b 4d 08 movl 8(%ebp),%ecx 2b: 88 04 11 movb %al,(%ecx,%edx) 2e: 8b 45 fc movl -4(%ebp),%eax 31: 40 incl %eax 32: 89 45 fc movl %eax,-4(%ebp) 35: 3b 45 10 cmpl 16(%ebp),%eax 38: 72 da jb 0xda <14> 3a: 8b 45 0c movl 12(%ebp),%eax 3d: 8b e5 movl %ebp,%esp 3f: 5d popl %ebp 40: c3 ret main() 41: 83 ec 0c subl $0xc,%esp 44: c7 04 24 00 00 00 00 movl $0x0,(%esp) 4b: c7 44 24 04 00 00 00 00 movl $0x0,4(%esp) 53: c7 44 24 08 04 00 00 00 movl $0x4,8(%esp) 5b: e8 a0 ff ff ff call 0xffffffa0 <0> 60: a3 00 00 00 00 movl %eax,0x0 65: 83 c4 0c addl $0xc,%esp 68: c3 ret
Encoding and Prefix example
disassembly for 6.o
section .text 0: a3 00 00 00 00 movl %eax,0x0 5: 66 a3 00 00 00 00 movw %eax,0x0 b: a2 00 00 00 00 movb %al,0x010: 89 1d 00 00 00 00 movl %ebx,0x016: 66 89 1d 00 00 00 00 movw %bx,0x01d: 88 1d 00 00 00 00 movb %bl,0x023: 89 19 movl %ebx,(%ecx)25: 66 89 19 movw %bx,(%ecx)28: 88 19 movb %bl,(%ecx)2a: 67 89 19 movl %ebx,(%bx,%di)2d: 67 66 89 19 movw %bx,(%bx,%di)31: 67 88 19 movb %bl,(%bx,%di)34: a4 movsb (%esi),(%edi)35: f3 a4 repz movsb (%esi),(%edi)37: ff 01 incl (%ecx)39: f0 ff 01 lock incl (%ecx)3c: 64 ff 01 incl %fs:(%ecx)3f: 66 ff 01 incw (%ecx)42: 67 ff 01 incl (%bx,%di)45: f3 ff 01 repz incl (%ecx)48: f3 f0 64 66 67 ff 01 repz lock incw %fs:(%bx,%di)4f: c3 ret
Microcomputer CourseKoby Gottlieb, 3/03, p-44
Lecture 3: Floating Point & MMX
Microcomputer CourseKoby Gottlieb, 3/03, p-45
Floating Point
Supports single, double and extended memory format (32, 64 and 80 bits), as well as other exotic formats (BCD, integers)
8 registers arranged as a stack
A stack entry (register) is always 80 bits; conversion done during load/store only
Each operation works on top of stack and optionally on another element (another stack entry of memory). Stack top moves!
Supports all IEEE-754 operations (fadd, fsub, fmul, fdiv, fsqrt, fcmp...) and many others (trig, transcendentals, etc...)
Separate unit (“co-processor”) for 8086, 286, 386. On-chip since i486
Exception model reflects this history
Big performance boost in i486, large in Pentium, huge in Pentium-II
Microcomputer CourseKoby Gottlieb, 3/03, p-46
Floating Point Formats General Format: (-1)S 2E (b0b1b2b3..bp-1)
Where:S = 0 or 1E = any integer between Emin and Emax, inclusivebi = 0 or 1p = number of bits of precision
SpecificsParameter Single Double ExtendedFormat width in bits 32 64 80p (bits of precision) 24 53 64 2’s
complementExponent width in bits 8 11 15 Biased formatEmax +127 +1023 +16383Emin -126 -1022 -16382Exponent bias +127 +1023 +16383
b0 is implied in single/double, explicit in extended Example of single precision number:
Significand (Mantissa)S Exponent
02331
Microcomputer CourseKoby Gottlieb, 3/03, p-47
Floating Point Formats (cont.) Example
Ordinary Decimal 178.125
Scientific Decimal 1.78125E2
Scientific Binary 1 0110010001 E111
Scientific Binary (Biased Exp +127) 1 0110010001 E10000110
Single Format (Normalized)
Sign 0
Biased Exponent 10000110
Significand 01100100010000000000000
1 (implicit)
Special numbers:Exponent is all 0’s or all 1’s: Zero - the entire number is 0
There’s a negative zero, too! Denormals - (+/-) very small numbers, implicit 1 is not assumed Infinite - (+/-) Special rules for 0*inf, etc... NaN - Not a Number - Signaling NaNs and Quite NaNs
Used to propagate errors in computations when exceptions are masked
Microcomputer CourseKoby Gottlieb, 3/03, p-48
Floating Point - Control
Rounding modes Precision control Exception Masks
Rounding Control00 - Round to Nearest or Even01 - Round Down (Toward -)10 - Round Up (Toward +)11 - Chop (Truncate Toward 0)
Precision control00 - 24 Bits (Single Precision)01 - (Reserved)10 - 53 Bits (Double Precision)11 - 64 Bits (Extended Precision)
Reserved(Infinity Control)*Rounding ControlPrecision Control
ReservedException Mask
PrecisionUnderflowOverflowZero DivideDenormalized OperandInvalid Operation
IM
DM
ZM
OM
UM
PM
XPCRCX
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
X
* Infinity Control has no effect on 386 and on. Used for compatibility with 8086 & 286
Figure 6-3. FPU Control Word Format
Microcomputer CourseKoby Gottlieb, 3/03, p-49
Floating Point - Status
Exception Flags Top Of Stack
position Complex exception
model and handling.Not discussed.
Figure 6-3. FPU Status Word Format
ES is set if any unmasked exception bit is set; Cleared otherwise.
Top values:000 = Register 0 is top of stack001 = Register 1 is top of stack....111 = Register 7 is top of stack
FPU BusyTop of Stack PointerCondition Code
Error Summary StatusStack FaultException Flags
PrecisionUnderflowOverflowZero DivideDenormalized OperandInvalid Operation
IE
DE
ZE
OE
UE
PE
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
TOPSF
C0
C1
C2
C3
BES
Microcomputer CourseKoby Gottlieb, 3/03, p-50
Floating Point - Status
Flag each stack value as Valid, Empty, Zero, Special
Tag Values00 = Valid01 = Zero10 = Special: Invalid (NaN, Unsupported, Infinity, or Denormal) 11 = Empty
Tag(7) Tag(6) Tag(5) Tag(4) Tag(3) Tag(2) Tag(1) Tag(0)
015
Figure 6-4. Tag Word Format
Microcomputer CourseKoby Gottlieb, 3/03, p-51
Floating Point ExampleCOMM _x1:QWORD ; S E P
_a DD 040b00000r ; 5.5 0 81 (1.)6000_b DD 040800000r ; 4 0 81 (1.)0_c DD 03e800000r ; 0.25 0 7B (1.)0_i2 DD 02H_$T32 DD 040800000r ; 4
; 8 : {
push ebpmov ebp, esp tos=0
; 9 : x1=(-b+sqrt(b*b-4*a*c))/(i2*a);
fld DWORD PTR _b ; b tos=7 fchs ;-bfld DWORD PTR _b ; b tos=6 fmul DWORD PTR _b ; b2
fld DWORD PTR $T32 ; 4 tos=5 fmul DWORD PTR _a ; 4afmul DWORD PTR _c ; 4acfsubp ST(1), ST(0) ; b2-4ac tos=6 fsqrt ; =sqrt(b2-4ac)faddp ST(1), ST(0) ;-b+ tos=7 fild DWORD PTR _i2 ; 2 tos=6 fmul DWORD PTR _a ; 2*afdivp ST(1), ST(0) ; (-b+)/2a tos=7 fstp QWORD PTR _x1 ; store tos=0
; 10 : }
pop ebpret 0
_
// solving a quadratic equationfloat a=5.5;float b=4;float c=.25;int i2=2;double x1;
void main(){ x1=(-b+sqrt(b*b-4*a*c))/(i2*a);}
Note: Stack movement Data type used Number
representation
Microcomputer CourseKoby Gottlieb, 3/03, p-52
MMX
Microcomputer CourseKoby Gottlieb, 3/03, p-53
MMX
An extension to the X86 Architecture Useful mainly for multimedia applications Technique: Single Instruction, Multiple Data (SIMD) with
existing registers New data types - packed 64 bit consists of either
8 bytes 4 words 2 doublewords
Instructions to operate on this data Software model to allow easy migration with existing OS
Microcomputer CourseKoby Gottlieb, 3/03, p-54
MMX Instructions and Data Types
Operations (57 instructions) Arithmetic (saturating & wrap-around)
– Add– Subtract– Multiply– Multiply-add (for fast multiply
accumulate) Conversions
– Pack from large to smaller words– Unpack from small to large words
Data Types Packed (fixed point) integers Signed, Unsigned Bytes, Words, D-Words
8 Packed Bytes
4 Packed Words
2 Packed D-words
Logical– And– Or– Shifts– Compares (including arithmetic)
Transfers – Register to register– Load from and store to memory
Integration into Intel Architecture 8 new registers - MM0- MM7 Uses previously reserved opcodes Uses same addressing modes
Uses same 2 operand encoding scheme– One is src/dest– Other src may come from memory
Microcomputer CourseKoby Gottlieb, 3/03, p-55
Example of Operations
Paddusw: Add Unsigned Packed word with Saturate
a3 a2 a1
b3 b2 b1+ + + +
a3+b3 a2+b2 a1+b1
FFFFh
8000h
FFFFh
Packed Add
Paddw: Add Packed word
a3 a2 a1
b3 b2 b1+ + + +
a3+b3 a2+b2 a1+b1
FFFFh
8000h
7FFFh
ShiftPsllq: Shift left logical all 64-bit
a3 a2 a1
0010h
a0
a1 a0a2 0000h
Multiply-AddPmaddwd: multiply and add
a3 a2 a1 a0
b3 b2 b1 b0** **
a3*b3+a2*b2a1*b1+a0*b0
Microcomputer CourseKoby Gottlieb, 3/03, p-56
Example Operations
Multiply-AddPmaddwd: multiply and add
a3 a2 a1 a0
b3 b2 b1 b0** **
a3*b3+a2*b2a1*b1+a0*b0
Multiply-HighPmulhw: multiply high
a3 a2 a1 a0
b3 b2 b1 b0
c3 c2 c1 c0
a0*b0a1*b1a2*b2a3*b3
High order 16 bits
Multiply-LowPmullw: multiply low
a3 a2 a1 a0
b3 b2 b1 b0
c3 c2 c1 c0
a0*b0a1*b1a2*b2a3*b3
low order 16 bits
Microcomputer CourseKoby Gottlieb, 3/03, p-57
FP/MMX view of registers
ST7
ST0
TOSStatusWord
MM Tag
FPU Tag063
63
00
00
00
00
00
00
MM7
MM0 TOS=0
00
00
1113
MMX registers MM0-MM7 overlap FP registers All tags are valid w/ MMX TOS is 0
Pros No change of context Easy OS migration
Cons Difficult to Intermix MMX
& FP Need to signal MMX to
FP move(EMMS: clear tags)
Example: Conditional Select / Branch Removal
PANDN
PCMPEQW
X4 !=greenX2 != greenX1=green X3=green
greengreengreen green
0x0000 0xFFFF 0xFFFF 0x0000
X1 X2 X4X3 PAND
0x0000 0xFFFF 0xFFFF 0x0000
Y1 Y2 Y4Y3
POR 0x0000 0x0000
finished pixels
0x0000 0x0000
Y1
X2
Y3
X4
Y1
X2
X4
Y3
if (X[i] != green) then new_image[i] = X[i] else new_image[i] = Y[i]
+ =
Packed Comparison (Pcmpcc) and the logical instructions enable conditional select operations in parallel and without data dependent branches.
X Y new_image
MOVQ MM1, XPCMPEQW MM1, GREENMOVQ MM2, MM1PANDN MM1, XPAND MM2, YPOR MM1, MM2MOVQ New, MM1
Microcomputer CourseKoby Gottlieb, 3/03, p-59
Streaming SIMD Extensions
New technology to exploit parallelism in FP and INT applications Key capabilities
Packed Operations Branch Removal/Compression Data Movement/Hints FP/INT Type Conversion
Application Domains… 3D Graphics: geometry, lighting signal processing high precision simulation & modeling video encoding/decoding other apps requiring streaming input and output
Microcomputer CourseKoby Gottlieb, 3/03, p-60
SIMD FP instructions types
arithmetic square root approximation instructions min & max loads & stores
move mask compare & set mask logical compare & set eflags conversion
Instruction CategoriesSIMD FP - 4 wide, single-precision, 2 wide double precision SIMD INT - extensions to MMX™ technology capabilities, Extending all the MMX instruction to 128 bit using the new XMM registersCacheability controlState management
Microcomputer CourseKoby Gottlieb, 3/03, p-61
New SIMD FP Registers
xmm0
xmm1
xmm2
xmm3
xmm4
xmm5
xmm6
xmm7
eight 128-bit, 4x32bit single-precision FP2x64bit single-precision FP, 128 bit MMX
•New set of registers!•Direct access•Used for data only•Extended processor state
Microcomputer CourseKoby Gottlieb, 3/03, p-62
Packed SP Data Type Each register holds 4 single-precision FP values or 2 double precision IEEE-754 compatible Scalar operates on least-significant number
xmm0
1-bit sign8-bit exponent23-bit mantissa
031
0
32127
22233031
1-bit sign10-bit exponent53-bit mantissa
Microcomputer CourseKoby Gottlieb, 3/03, p-63
Packed Operations
xmm0
xmm1
xmm0
a0a1a2a3
b0b1b2b3
a0 op b0a1 op b1a2 op b2a3 op b3
op is one of•addps addpd •subps subpd•mulps mulpd•divps divpd
Microcomputer CourseKoby Gottlieb, 3/03, p-64
Scalar Operations
xmm0
xmm1
xmm0
a0a1a2a3
b0b1b2b3
a0 op b0a1a2a3
op is one of•Addss addsd•subss subsd•nulss mulsd•divss divsd
Microcomputer CourseKoby Gottlieb, 3/03, p-65
SIMD Data Organization
Exploit vertical parallelism SOA versus AOS
Array of Structures
Structure of Arrays
X0, X1, X2, X3, … Y0, Y1, Y2, Y3, … Z0, Z1, Z2, Z3, ...
X0, Y0, Z0, X1, Y1, Z1, X2, Y2, Z2, …
Better cacheability
Better SIM
D calculations
Microcomputer CourseKoby Gottlieb, 3/03, p-66
Matrix Vector Multiply example
Typical 3D operation Load values in SOA format
xxxx…, yyyy…, zzzz…
Follow with multiply and add operations
movaps xmm0, [list+X+ecx] ;load X components
movaps xmm2, [list+Y+ecx] ;load Y components
movaps xmm3, [list+Z+ecx] ;load Z components
movaps xmm1, [esi+m00] ;m00 m00 m00 m00
movaps xmm4, [esi+m01] ;m01 m01 m01 m01
Loop back and pick up next 4 vertices…mulps xmm1, xmm0 ;x*m00 x*m00 x*m00 x*m00
mulps xmm4, xmm2 ;y*m01 y*m01 y*m01 y*m01
addps xmm4, xmm1 ;add the 2 results
movaps xmm1, [esi+m02] ;load matrix element m02 (x4)
mulps xmm1, xmm3 ;z*m02 z*m02 z*m02 z*m02
addps xmm4, xmm1 ;add results
addps xmm4, [esi+m03] ;add last element of matrix
Microcomputer CourseKoby Gottlieb, 3/03, p-67
SIMD Integer Instructions
Extensions to MMXTM technology instructions Operate on same 64-bit registers as previous MMX technology
instructions, Or on the new 128 bit XMM registers Instructions: extract, insert, min/max, byte mask integer,
multiply high unsigned, shuffle multiply unsigned double, add and sub qword, average
Microcomputer CourseKoby Gottlieb, 3/03, p-68
Computing Model
Processinginput output
If you can’t bring data in fast enough or spit it out fast enough… SIMD is of little or no use
The “streaming” part of the Streaming SIMD Extensions is critical to overall performance
Microcomputer CourseKoby Gottlieb, 3/03, p-69
Prefetch
Hides latency by bringing in data before you need it Provide cache hints to fetch data to different cache levels
prefetchnta - prefetch non-temporal data to non-temporal cache (L1)
prefetcht0 - prefetch temporal data into both L1 and L2 caches prefetcht1, prefetcht2 - prefetch temporal data into L2 cache
Microcomputer CourseKoby Gottlieb, 3/03, p-70
Prefetch Illustrated
memoryL1
L2
prefetchnta [esi]
Microcomputer CourseKoby Gottlieb, 3/03, p-71
Prefetch Illustrated (2)
memoryL1
L2
prefetcht0 [esi]
Microcomputer CourseKoby Gottlieb, 3/03, p-72
Prefetch Illustrated (3)
memoryL1
L2
prefetcht1 [esi]prefetcht2 [esi]
Microcomputer CourseKoby Gottlieb, 3/03, p-73
Warning about prefetch
Proper placing of prefetch is critical to insure that there’s enough time between prefetch and actual load limited resources to load data
Excessive use of prefetch can actually hurt performance Spread out prefetches and insure sufficient computation
before load
Microcomputer CourseKoby Gottlieb, 3/03, p-74
Streaming Stores
Avoids polluting cache with data that you don’t real soon (or ever)
Supports 128 and 64-bit versions movntps - from xmm reg to memory movntq - from mm reg to memory
If store is a cache hit, then cache is updated and not sent to memory
Weakly ordered Use sfence instruction to insure order
Microcomputer CourseKoby Gottlieb, 3/03, p-75
Streaming Store Illustrated
memory
L1
L2
movntps [esi]
xmm0
No write allocation*
* If it was a cache hit,then data goes to cacheand is not writtendirectly to memory
Microcomputer CourseKoby Gottlieb, 3/03, p-76
AMD 64 bit extension
Starting with Opron (released April – 2003)
Microcomputer CourseKoby Gottlieb, 3/03, p-77
Lecture 4: Addressing and Instruction Encoding
Microcomputer CourseKoby Gottlieb, 3/03, p-78
Addressing Modes Compute the offset part of a memory reference
Direct Register (by name) Memory: 16 bit addressing modes
(Segment) + base + index + displacement disp(base,index) 8 combinations of base + index
BX, SI, DI, BX+SI, BX+DI, BP+SI, BP+DI, disp only Disp 0, 8, or 16 bit. Occupies 0, 1 or 2 bytes All combinations are encoded in the ModRM byte Examples: 8(bp); (bp,bx);
Memory: 32 bit addressing modes (Segment) + base + index*scale + disp
disp(base,index,scale) Base: each of the 8 general registers or none Index: each of 7 general registers (no ESP!) or none Scale: 1,2,4,8 Index/scale requires an additional SIB bytes.
Base only can use ModRM byte only Disp: 0, 8 or 32 bit. Occupies 0, 1 or 4 bytes Examples: (eax); array(ecx,4); based_array(eax,ebx,4)
Addressing mode determined by code mode (16/32) Can be toggled using the address prefix (0x67)
Microcomputer CourseKoby Gottlieb, 3/03, p-79
Addressing Modes
Figure 25-2. ModR/M and SIB byte Formats
MOD REG/OPCODE R/M
SIB (SCALE INDEX BASE) BYTE
7 6 5 4 3 2 1 0
SS INDEX BASE
MODR/M BYTE
7 6 5 4 3 2 1 0
Figure 3-10. Effective Address Computation (32-bit vesrion)
{CSSSDSESFSGS
EAXECXEDXEBXESPEBPESIEDI
EAXECXEDXEBXEBPESIEDI
1
2
4
8
No Displacement8-bit Displacement32-bit Displacement
+ + * +{} }{}{}{ }
Segment + Base + (Index * Scale) + Displacement
Microcomputer CourseKoby Gottlieb, 3/03, p-80
Encoding Example
ModRM byte: MOD REG/OP Reg/Mem 7 6 5 4 3 2 1 0
SIB byte: SS INDEX BASE 83 ec 0c subl $0xc,%esp
89 03 movl %eax,(%ebx) / base onlyc7 45 fc 07 00 00 00 movl $0x7,-4(%ebp) / base, disp ; imm8b 55 fc movl -4(%ebp),%edx / base, disp ; reg89 45 fc movl %eax,-4(%ebp)3b 55 10 cmpl 16(%ebp),%edx89 04 24 movl %eax,(%esp) / base only espc7 04 24 08 00 00 00 movl $0x8,(%esp)c7 44 24 04 09 00 00 00 movl $0x9,4(%esp) / base+disp esp8a 04 10 movb (%eax,%edx),%al / base + index8a 04 50 movb (%eax,%edx,2),%al / base+scale*index8a 04 42 movb (%edx,%eax,2),%al / 88 04 11 movb %al,(%ecx,%edx)40 incl %eax72 da jb 0xda <14>5d popl %ebpc3 ret
Microcomputer CourseKoby Gottlieb, 3/03, p-81
MOD/RM-SIB Table: 16-bit addressing table
Table 11-1. 16-Bit Addressing Forms with the ModR/M Byter8(/r)r16(/r)r32(/r)/digit (Opcode)REG =
ALAXEAX0000
CLCXECX1001
DLDXEDX2010
BLBXEBX3011
AHSPESP4100
CHBP1
EBP
5
101
DHSIESI6110
BHDIEDI7111
EffectiveAddress Mod R/M ModR/M Values in Hexadecimal
[BX+SI][BX+DI][BP+SI][BP+DI][SI][DI]disp162
[BX]
00 000001010011100101110111
0001020304050607
08090A0B0C0D0E0F
1011121314151617
18191A1B1C1D1E1F
2021222324252627
28292A2B2C2D2E2F
3031323334353637
38393A3B3C3D3E3F
[BX+SI]+disp83
[BX+DI]+disp8[BP+SI]+disp8[BP+DI]+disp8[SI]+disp8[DI]+disp8[BP]+disp8[BX]+disp8
01 000001010011100101110111
4041424344454647
48494A4B4C4D4E4F
5051525354555657
58595A5B5C5D5E5F
6061626364656667
68696A6B6C6D6E6F
7071727374757677
78797A7B7C7D7E7F
[BX+SI]+disp16[BX+DI]+disp16[BP+SI]+disp16[BP+DI]+disp16[SI]+disp16[DI]+disp16[BP]+disp16[BX]+disp16
10 000001010011100101110111
8081828384858687
88898A8B8C8D8E8F
9091929394959697
98999A9B9C9D9E9F
A0A1A2A3A4A5A6A7
A8A9AAABACADAEAF
B0B1B2B3B4B5B6B7
B8B9BABBBCBDBEBF
EAX/AX/ALECX/CX/CLEDX/DX/DLEBX/BX/BLESP/SP/AHEBP/BP/CHESI/SI/DHEDI/DI/BH
11 000001010011100101110111
C0C1C2C3C4C5C6C7
C8C9CACBCCCDCECF
D0D1D2D3D4D5D6D7
D8D9DADBDCDDDEDF
E0EQE2E3E4E5E6E7
E8E9EAEBECEDEEEF
F0F1F2F3F4F5F6F7
F8F9FAFBFCFDFEFF
Notes
1. The default segment register is SS for the effective addresses containing a BP index, DS for other effective addresses.
2. The “disp16” nomenclature denotes a 16-bit displacement following the ModR/M byte, to be added to the index.
3. The “disp8” nomenclature denotes an 8-bit displacement following the ModR/M byte, to be sign-extended and added to the index.
Microcomputer CourseKoby Gottlieb, 3/03, p-82
MOD/RM-SIB Table: 32-bit addressing table
Table 11-2. 32-Bit Addressing Forms with the ModR/M Byter8(/r)r16(/r)r32(/r)/digit (Opcode)REG =
ALAXEAX0000
CLCXECX1001
DLDXEDX2010
BLBXEBX3011
AHSPESP4100
CHBPEBP5101
DHSIESI6110
BHDIEDI7111
EffectiveAddress Mod R/M ModR/M Values in Hexadecimal
[EAX][ECX][EDX][EBX][--][--]1
disp32[ESI][EDI]
00 000001010011100101110111
0001020304050607
08090A0B0C0D0E0F
1011121314151617
18191A1B1C1D1E1F
2021222324252627
28292A2B2C2D2E2F
3031323334353637
38393A3B3C3D3E3F
disp8[EAX]3
disp8[ECX]disp8[EDX]disp8[EBX]disp8[--][--]disp8[EBP]disp8[ESI]disp8[EDI]
01 000001010011100101110111
4041424344454647
48494A4B4C4D4E4F
5051525354555657
58595A5B5C5D5E5F
6061626364656667
68696A6B6C6D6E6F
7071727374757677
78797A7B7C7D7E7F
disp32[EAX]disp32[ECX]disp32[EDX]disp32[EBX]disp32[--][--]disp32[EBP]disp32[ESI]disp32[EDI]
10 000001010011100101110111
8081828384858687
88898A8B8C8D8E8F
9091929394959697
98999A9B9C9D9E9F
A0A1A2A3A4A5A6A7
A8A9AAABACADAEAF
B0B1B2B3B4B5B6B7
B8B9BABBBCBDBEBF
EAX/AX/ALECX/CX/CLEDX/DX/DLEBX/BX/BLESP/SP/AHEBP/BP/CHESI/SI/DHEDI/DI/BH
11 000001010011100101110111
C0C1C2C3C4C5C6C7
C8C9CACBCCCDCECF
D0D1D2D3D4D5D6D7
D8D9DADBDCDDDEDF
E0E1E2E3E4E5E6E7
E8E9EAEBECEDEEEF
F0F1F2F3F4F5F6F7
F8F9FAFBFCFDFEFF
Notes
1. The [--][--] nomenclature means a SIB follows the ModR/M byte.
2. The disp32 nomenclature denotes a 32-bit displacement following the SIB byte, to be added to the index.
3. The disp8 nomenclature denotes an 8-bit displacement following the SIB byte, to be sign-extended and added to the index.
Table 11-3. 32-Bit Addressing Forms with the SIB Byte
r32Base =Base =
EAX0
000
ECX1
001
EDX2
010
EBX3
011
ESP4
100
[*]5
101
ESI6
110
EDI7
111
Scaled Index SS Index SIB Values in Hexadecimal
[EAX][ECX][EDX][EBX]none[EBP][ESI][EDI]
00 000001010011100101110111
0008101820283038
0109111921293139
020A121A222A323A
030B131B232B333B
040C141C242C343C
050D151D252D353D
060E161E262E363E
070F171F272F373F
[EAX*2][ECX*2][EDX*2][EBX*2]none[EBP*2][ESI*2][EDI*2]
01 000001010011100101110111
4048505860687078
4149515961697179
424A525A626A727A
434B535B636B737B
444C545C646C747C
454D555D656D757D
464E565E666E767E
474F575F676F777F
[EAX*4][ECX*4][EDX*4][EBX*4]none[EBP*4][ESI*4][EDI*4]
10 000001010011100101110111
80889098A0A8B0B8
81899189A1A9B1B9
828A929AA2AAB2BA
838B939BA3ABB3BB
848C949CA4ACB4BC
858D959DA5ADB5BD
868E969EA6AEB6BE
878F979FA7AFB7BF
[EAX*8][ECX*8][EDX*8][EBX*8]none[EBP*8][ESI*8][EDI*8]
11 000001010011100101110111
C0C8D0D8E0E8F0F8
C1C9D1D9E1E9F1F9
C2CAD2DAE2EAF2FA
C3CBD3DBE3EBF3FB
C4CCD4DCE4ECF4FC
C5CDD5DDE5EDF5FD
C6CED6DEE6EEF6FE
C7CFD7DFE7EFF7FF
Notes
1. The [*] nomenclature means a disp32 with no base if MOD is 00,[EBP] otherwise. This provides the following addressing modes:disp32[index] (MOD=00).disp8[EBP][index] (MOD=01).disp32[EBP][index] (MOD=10).
Microcomputer CourseKoby Gottlieb, 3/03, p-83
Addressing Mode Examples_s$ = -4_t$ = -8; 16 : // typedef struct {char mem1, mem2, prev,next;} str_type_t;; 17 : // int i,j; <Globals>; 18 : // char a,b,c,d,e,f,g,h; <Globals>; 19 : // int s,t;; <Stack>
; 20 : /* int var; */ a = var;mov al, BYTE PTR _var; DISP only
; 21 : /* str_type_t struc; */ b = struc.mem2;mov cl, BYTE PTR _struc+1; DISP only
; 22 : /* int array[100]; */ c = array[i]; mov edx, DWORD PTR _imov al, BYTE PTR _array[edx*4]; INDEX*SCALE+DISP
; 23 : /* str_type_t arr_str[100]; */ d = arr_str[i].next;mov ecx, DWORD PTR _imov dl, BYTE PTR _arr_str[ecx*4+3]; INDEX*SCALE+DISP
; 24 : /* int *p; */ e = *p; mov eax, DWORD PTR _pmov cl, BYTE PTR [eax]; BASE
; 25 : /* str_type_t *str_p; */ f = str_p->prev; mov edx, DWORD PTR _str_pmov al, BYTE PTR [edx+2]; BASE+DISP
; 26 : /* int *arr_p; */ g = arr_p[j]; mov ecx, DWORD PTR _jmov edx, DWORD PTR _arr_pmov al, BYTE PTR [edx+ecx*4]; BASE+INDEX*SCALE
; 27 : /* str_type_t *arr_str_p; */ h = arr_str_p[j]->prev;mov ecx, DWORD PTR _jmov edx, DWORD PTR _arr_str_pmov al, BYTE PTR [edx+ecx*4+2]; BASE+INDEX*SCALE+DISP
; 28 : /* int s,t (stack) */ t = s;mov ecx, DWORD PTR _s$[ebp]mov DWORD PTR _t$[ebp], ecx BASE+DISP
Microcomputer CourseKoby Gottlieb, 3/03, p-84
Software Conventions
API - Application Programming Interface Interface among procedures modules: param types, order, etc.. Machine dependent; Defined in HLL.
Calling conventions - machine level calling sequence Parameter passing location and order (stack/regs) Register preserved across calls Machine dependent, may be also language dependent
ABI - Application Binary Interface Interface among application and DLLs/OS machine dependent; OS dependent
C conventions for x86: Parameters are pushed last to 1st EAX, EBX, EDX are scratch, all other preserved Return results - Int: EAX (EDX also), FP: top of FP stack All parameters (without prototypes) are widened to int/double
Microcomputer CourseKoby Gottlieb, 3/03, p-85
Compatibility Issues
Goal: enable modules compiled by different compilers to interoperate Possibly also different languages
Calling conventions Parameter location/order Static link (e.g. for Pascal)
External variable names - for the linker “Name decoration” - for C++
External name encodes the variable’s type Exception handling
How to locate the user-registered exception handler when an exception occurs
Debug information format This is only nice-to-have
All of the above are broken in practice!
Microcomputer CourseKoby Gottlieb, 3/03, p-86
Administration
Please send me your Email address, so that we can have a mailing list My address is koby.gottlieb@intel.com Subject line: Microcomputer course mailing list
Contacts koby Koby.gottlieb@intel.com 04-8655718 Ofri ofri.wechsler@intel.com 04-8655851
Microcomputer CourseKoby Gottlieb, 3/03, p-87
Example for Calling Conventions Example of calling stackUsing 32 bit mode, near pointer
#include <string.h>
char in[]="abc";char out[10];char *p;
char *strncpy(char *d, const char *s, size_t n){ int i=0; for (i=0;i<n;i++) { monitor(&i); d[i]=s[i]; } return (char*)s;}
main(){ p = strncpy(out,in,sizeof(in));}==============================================
.file "strncpy.c"
.data
.globl inin: .size in,4
.string "abc" // char in[]="abc"
.comm out,10,4 // char out[10];
.comm p,4,4 // char *p
.text
/====================/ strncpy/--------------------
.globl strncpystrncpy:
pushl %ebp // put old frame ptrmovl %esp,%ebp // set new frame ptrsubl $12,%esp // Space for automat movl $0,-4(%ebp) // i = 0subl %edx,%edx // %edx = 0cmpl 16(%ebp),%edx // if (i<n)jae .L1
.L2:lea -4(%ebp),%eax // %eax = &imovl %eax,(%esp) // push &icall monitor // movl 12(%ebp),%eax // %eax = smovl -4(%ebp),%edx // %edx = imovb (%eax,%edx),%al // %al = s[i]movl 8(%ebp),%ecx // %ecx = dmovb %al,(%ecx,%edx) // d[i] = %almovl -4(%ebp),%eax // \incl %eax // | i++movl %eax,-4(%ebp) // /cmpl 16(%ebp),%eax // if (i<n)jb .L2
.L1:movl 12(%ebp),%eax // return smovl %ebp,%esp // clear stackpopl %ebp // return frameret
/====================/ main/--------------------
.globl mainmain:
subl $12,%espmovl $out,(%esp) // outmovl $in,4(%esp) // inmovl $4,8(%esp) // pcall strncpy // callmovl %eax,p // store return valueaddl $12,%esp // clean stackret
esp frame itemebp+16 sizeof(in)/nebp+12 in / s
before call strncpy ebp+8 out / dafter the call ebp+4 ret-addrafter push ebp ebp old ebp
ebp-4 var iebp-8 unused
before call monitor ebp-12 param I
Microcomputer CourseKoby Gottlieb, 3/03, p-88
Code example - Unix style
; unsigned int fib(int x) {; if(x>2); return(fib(x-1)+fib(x-2));; else; return(1);; } ; ; Stack frame:; %esp + 8: x ; %esp +4: return address ; note - no ebp!
.file "0f.c"
.align 16fib: / x= 8(%esp)
pushl %eax / save %eax is it needed?movl 8(%esp),%edx / edx=xcmpl $2,%edx / x>2?pushl %ebx / (save %ebx)jle .L66decl %edx / x<2, x--pushl %edx / \call fib / } %ebx = fib(x-1)movl %eax,%ebx / /movl 16(%esp),%eax / \subl $2,%eax / } %eax = x-2pushl %eax / /call fib / %eax = fib(x-2)addl %ebx,%eax / %eax = fib(x-1) + fib(x-2)addl $8,%esp / reset stack
.L67:popl %ebx / restore %ebxaddl $4,%esp / reset stack (contained %eax) ret / return.align 16,7,4
.L66:popl %ebxmovl $1,%eaxaddl $4,%espret / ret 1
Code example - Microsot/Intel style; unsigned int fib(int x) {; if(x>2); return(fib(x-1)+fib(x-2));; else; return(1); }; Stack frame:; %esp + 12: x ; %esp + 8: return address; %esp + 4,+8: esi, edi; %eax contains return value ; note - no ebp! TITLE fib.c .386P.model FLAT..._x$ = 8_fib PROC NEAR; 39 : { push esi ; push temps push edi; 40 : if( x > 2 ) mov edi, DWORD PTR _x$[esp+4] ; edi = x cmp edi, 2 ; x > 2? jle SHORT $L370; 41 : return (fib(x-1) + fib(x - 2)); lea eax, DWORD PTR [edi-2] ; eax = x-2! dec edi ; edi = x-1 push eax call _fib ; eax = fib(x-2) add esp, 4 ; pop param mov esi, eax push edi call _fib ; eax = fib(x-1) add esp, 4 add eax, esi; eax = fib(x-1)+fib(x-2) pop edi pop esi ; restore temps ret 0$L370:; 42 : else; 43 : return (1); mov eax, 1 ; return 1 pop edi; 44 : } pop esi ret 0_fib ENDP_TEXT ENDSEND
Microcomputer CourseKoby Gottlieb, 3/03, p-89
CISC vs RISC
CISC (Complex Instruction Set Computer) Simple as well as exotic instructions Small number of registers Variety of addressing modes Operation on registers and memory “Goal”: reduced instruction count
RISC (Reduced Instruction Set Computer) Simple instructions only Large number of registers Small set of addressing modes Memory access in load/stores only “Goal”: fast execution (through simple design)
Microcomputer CourseKoby Gottlieb, 3/03, p-90
Lecture 5: Segmentation
Microcomputer CourseKoby Gottlieb, 3/03, p-91
Segmentation and Protected Mode
Microcomputer CourseKoby Gottlieb, 3/03, p-92
Segmentation Already mentioned:
Address space divided into “segments” Logical address: segment:offset Linear address: address in the “flat” address range Segmentation: mapping from Logical to Linear
Segment base = f(segment register)offset < segment size: <64K (86/286), <4G (386+) linear address = segment_base + offset
At any time, there are 4 (8086) or 6 (286 and on) active segment registersCS: code, DS: data, SS: stack, ES: extra. FS, GS: new extra
Segment register can be modified by: A move from general register (e.g., mov ds,ax) LxS inst (e.g., LDS): loads full ptr to both seg reg and general reg.
All memory reference instructions assume default seg register CS for branches, SS for Stack, DS for data ES for string instruction destination operands
Segment prefix can be used to modify the default segment register for the current instruction
Microcomputer CourseKoby Gottlieb, 3/03, p-93
Address Computation - Real Mode Segment Base = segment*16 Linear address = segment*16 + offset
0x1234:0x5678 = 0x12340+0x5678 = 0x179B8 Implications:
Application may consume many segments Address space = 64K*16 = 1M Linear addresses have many (4096) logical representations
Seg+1:offset = seg:offset+16 But: do not assume that!
Last address: 0xFFFF*16+0xFFFF = 0x10FFEF (1M+64K-16)
19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
16-Bit Segment Selector 0 0 0 0
19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 0 16-Bit Effective Address0 0 0 0
21-Bit Linear Address
Base
Offset
Linear Address
+
= 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
Figure 9-1. 8086 Address Translation
A20
0 0 0 0
Microcomputer CourseKoby Gottlieb, 3/03, p-94
Segmentation Models (16-bit) Terminology:
Code (“text”): code + optional read-only variables Data: global variables, static variable. Read/Write and Read Only Stack: parameters, automatic variables Heap: dynamically allocated memory
Model Segment Size Pointer Type
Tiny Code+Data+Stack+Heap <64K Offset
Small Code<64K, Data+Stack+Heap<64K Offset
Compact Code<64K, Data<64K, Stack<64K, Heap>64K Seg:Offset
Large Code>64K, Data>64K, Stack<64K, Heap>64K Seg:Offset
Huge Single Element > 64K Seg:Offset Trade-off between size and performance: smaller is faster
Tiny/Small: no need to switch segment registers Compact: uses only one data segment, but can address the entire memory
(e.g., heap) , if needed, not too many segment switches
32-bit mode typically uses Flat Model: One segment, starts at 0, spans 4G, all seg registers point to it
Microcomputer CourseKoby Gottlieb, 3/03, p-95
Segmentation Pros/Cons
Great solution for large memory w/ small word Overhead depends on memory requirements
Pay per use…
Allows reasonable code/data sharing
- More complex software development and tools
- Real mode segmentation => 1MB only
Did x86 succeed because of segmentation?… or in spite of it?...
Microcomputer CourseKoby Gottlieb, 3/03, p-96
Protected Mode - MotivationTwo Problems: 1MB is too small
More sophisticated usage: Multiprogramming, Multiusers.
Addresses are interpreted differently by different clients:
entry *next_entry (entry *p)
{ return (p->next); }
...
crnt_p = next_entry(crnt_p);
crnt_p->value = 7;
...
Increased need for protection among apps and OS In real mode, each application can crash the OS!
Microcomputer CourseKoby Gottlieb, 3/03, p-97
Protection - Solution
Solution - store information on segments in a table Segment register becomes an index to a “Descriptor Table”
contains address (base) and protection information
Address info Segment base
Linear address is still segment_base + offset, but…Segment_base = Descriptor_Table[f(segment)].base
Protection info Segment size (“limit”) <64K for 286, <4G for 386+ Segment type R, RW, ER, EO (Read, R/W, Execute/Read, Exec-Only) Segment privilege (0..3: 0 is higher, 3 is lower) Restrictions are checked for every access Descriptor table should be protected!
… And more colorful protection mechanisms…
Small step to the segment, big leap to mankind!
Microcomputer CourseKoby Gottlieb, 3/03, p-98
Address Computation Linear address = Descriptor_Table[f(segment)].base + offset Coverage
286: 16MB address space. Requires 512 segments to cover! 386: 4GB address space. 1 segment is sufficient
Descriptor Table
Selector Offset
SegmentDescriptor
LinearAddress
LogicalAddress
BaseAddress
015
031
031
+
Figure 11-5. Segment Translation
Microcomputer CourseKoby Gottlieb, 3/03, p-99
A General Segmentation Model The OS allocates each application a number of segments
CS
SS
GS
DS
ES
FS
Access LimitBase Address
SegmentRegisters
SegmentDescriptors
Access LimitBase Address
Access LimitBase Address
Access LimitBase Address
Access LimitBase Address
Access LimitBase Address
Access LimitBase Address
Access LimitBase Address
Access LimitBase Address
PhysicalMemory
Figure 11-3. Multisegmented Model
Microcomputer CourseKoby Gottlieb, 3/03, p-100
Descriptor Tables 2 tables
GDT (Global)Pointed by GDTR
LDT (Local)Per taskPointed indirectly by
LDTR
Table pointers GDTR: Base + Limit LDTR: GDT Index only
Entries (descriptors) of various types: Segments Call/interrupt gates Tasks LDT - in the GDT
Figure 11-4. TI Bit Selects Descriptor Table
TI
Base AddressLimit
GDTRBase Address
Limit LDTR
Selector
SegmentSelector
GlobalDescriptor
Table
LocalDescriptor
Table
TI = 0 TI = 1
Microcomputer CourseKoby Gottlieb, 3/03, p-101
13 bit index (8K entries) 1 bit table index - GDT (0) or LDT (1) 2 bit “requested” privilege level - see below
Segment Selector Format
INDEXTI
RPL
5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 01 1
Table Indicator 0 = GDT 1 = LDTRequested Privilege Level 00 = Most Privileged 11 = Least
Figure 11-7. Segment Selector Format
Microcomputer CourseKoby Gottlieb, 3/03, p-102
Segment Descriptor Format Note
286 => 386 evolution (high-order 16 bits) Limit >4M achieved by the granularity bit - scale by 212
D/B: 16/32 bit
Information moves fromdescriptor (in memory)to CPU when segmentregister is loaded
Figure 11-8. Segment Descriptors Format
AVL Available for use by system softwareBASE Segment base addressD/B Default operation size
(0 = 16-Bit segment; 1 = 32-bit segment)DPL Descriptor privilege level G GranularityLIMIT Segment limitP Segment presentS Descriptor type
(0 = System; 1 = application)TYPE Segment type
Reserved
5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 01
1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 63 23
0151616
Base 31:24 P Base 23:16
Segment Limit 15:0
DPL
TYPESGD/B
0AVL
SegLimit19:16
Base Address 15:0 +0
+4
Microcomputer CourseKoby Gottlieb, 3/03, p-103
Privilege Levels Lower number => higher privilege Code can access data of equal/lower privilege levels only Code can call more privileged data via “call gates” only
Each level has its own stack!
Some instructions and I/O operations are restricted to certain privilege levels
Also referred to as:
“Rings”
LEVEL 3
LEVEL 2
LEVEL 1
LEVEL 0
Applications
Operating SystemServices (DeviceDrivers, etc..)
Operating SystemKernel
Protection Rings
Figure 12-2: Protections Rings
Microcomputer CourseKoby Gottlieb, 3/03, p-104
For Data: DPL >= max(RPL,CPL)Otherwise: exception!
RPL: used by OSto protect against userTrojan Horses
Trojan Horse:User code calls OS codeand passes a pointer toa privileged data area
Privilege Level Check
Figure 12-3. Privilege Check for Data Access
CPL Current Privilege LevelDPL Descriptor Privilege LevelRPL Requested Privilege Level
5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 01
1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 63 23
0151616
DPL
+0
+4
CPL
RPL
PrivilegeCheck
Operand Segment Selector
Current Code Segment Register
Operand Segment Descriptor
Microcomputer CourseKoby Gottlieb, 3/03, p-105
Call Gate A mechanism for the OS to restrict lower privileged code to
call only pre-defined entry points Restricts the caller privilege level
Not Used
Procedure EntryPoint
+
DPL Count
Selector Offset 15:0
BASE DPL BASE
BASE
Offset 31:16
Descriptor Table
GateDescriptor
Code SegmentDescriptor
Selector Offset Within Segment
Destination Address
015 031
Figure 12-6. Call Gate Mechanism
From which PrivilegeLevel the gate may be used.
The called code Privilege Level.
# of parameters
Microcomputer CourseKoby Gottlieb, 3/03, p-106
Inter Level Call - Stack Switching On inter-level call
CPL changes Control transferred Stack switched
Stack Switch New (privileged) SS:eSP is taken from the TSS Old SS:eSP is stored on new stack Parameters are copied to new stack
Figure 12-9. Stack Frame During Inter Level Call
xyz
PARM 1
PARM 2
PARM 3
OLD SS
OLD ESP
PARM 1
PARM 2
PARM 3
xyz
OLD CS
OLD EIP
Old Stack Before Call
New StackAfter Call, Before Return
Old Stack After RETURN(far retN N=3)
ESP3
ESP0
ESP3
SS0:ESP0
From TSS
Assume call from 3 to 0 Unique to interlevel calls
Microcomputer CourseKoby Gottlieb, 3/03, p-107
Task State Segment (TSS) HW mechanism to allow multi-tasking Contains task state:
Dynamic: Integer registers, segment registers, flags...
Static: CR3, per level stack pointer, etc... Task switch occurs on
JMP/Call to a TSS descriptor task gate Interrupt indexed to a task gate Certain IRET
On switch: Current TSS content is updated
For example, general registers are saved by the CPU!
New task’s values are retrieved from TSS
Contains initial stack values for level 0-2
Figure 13-1. 32-bit Task State Segment
CR3
ESP1
ESP0
EIPEFLAGS
EAXECXEDXEBXESPEBPESIEDI
ESP2
IO map base addr
Sel for task LDTGSFSDSSSCSES
SS2
SS1
SS0
Link (old TSS sel)
T
048C1014181C2024282C3034383C4044484C5054585C6064
01531
Microcomputer CourseKoby Gottlieb, 3/03, p-108
Protected ModeImplications and Observations (1)
Software conversion Straightforward if software does not assume linear = seg*16+offset
Real mode “tricks” do not work. Usually for the better: Cannot write on code
Cannot access segment 0
Cannot jump/write into the OS
New issues How to write on code (self-modifying code, SMC) - use aliasing
How to use OS services - via call gates
How to access a known linear address -Map a segment to a known base. e.g., use 4G at base 0.
How to protect between processes (applications) -Use LDT per process. Reload LDTR at context switch
Additional related mechanisms exist - not discussed
Microcomputer CourseKoby Gottlieb, 3/03, p-109
Protected ModeImplications and Observations (2)
Segmentation performance impactIssues: Indirect vs direct addressing Protection check
Solution: information is cached for each active segment Almost no overhead on memory reference... But Change of a segment register is costly!
Segment based virtual memory Possible - through use of “segment not present” bit Unattractive - due to variable length of segments
Microcomputer CourseKoby Gottlieb, 3/03, p-110
Protected Mode in Practice
Value of segmentation/protected mode Only option for a 16-bit OS 32-bit OS prefers flat model, page based protection
Call gates used for system calls by Unix Not by Windows NT
Task switch mechanism rarely if ever used OS/2 uses 32-bit segmentation Exotic uses for segment registers
FS used by NT for multi-threading
Microcomputer CourseKoby Gottlieb, 3/03, p-111
Segmentation - Summary
Mapping of logical address (selector:offset) to linear address Initially introduced to support addressability larger than the
machine word Since 286, descriptor tables enable
Protection Even larger addressability Potentially swapping to disk (but not very useful)
Starting with 386, 4GB addressability But 32-bit operating systems almost all use flat memory
Protection is the separation of Several different segments with different attributes Processes from one another Processes from the OS (protection “rings”)
All memory accesses are verified Control transfers are limited: call gates
Microcomputer CourseKoby Gottlieb, 3/03, p-112
Lecture 6: Paging
Microcomputer CourseKoby Gottlieb, 3/03, p-113
Paging
Microcomputer CourseKoby Gottlieb, 3/03, p-114
Paging - Motivation The problem:
Effective address can cover 4GB, but actual memory is smaller Application should be independent of actual memory location
Avoid memory size and location concernsAllow using fixed, pre-defined, addressesAllow easier code/data sharingInsensitive to gaps within address space
e.g., because of unused, but allocated, array space
Solution: Paging Idea: memory is divided to equal sized chunks - pages (4KB) A layer on top of segmentation. Performs linear-to-physical mapping Address “processing”
Logical (Seg:offset) ==> Linear ==> Physical Page tables map linear address to physical. Keeping track of:
Valid pagesPaged out pages (to secondary memory)Non-existent pages
Analogy: Big library, good catalog, reading room, storage room
Microcomputer CourseKoby Gottlieb, 3/03, p-115
Page Translation Mechanism 2-level hierarchical mapping
Using page directory and page tables All pages and page tables are 4K
Linear address divided to: Directory 10 bits Table 10 bits Offset 12 bits Dir/Table serves as index
into a page table Offset serves as pointer
into a data page
Page entry points to apage table or page
Performance issues: TLB!
Advanced mechanism 4MB pages(not discussed here)
Figure 11-16. Combined Segment and Page Address Translation
Descriptor Table
Selector Offset
SegmentDescriptor
LogicalAddress
BaseAddress
015
031
031
+
DIR TABLE OFFSET
Page Table
PG Tbl Entry
Page Directory
4K Dir Entry
4K Page Frame
Operand
Linear Address Space (4K Page)1121
CR3
Microcomputer CourseKoby Gottlieb, 3/03, p-116
Page Translation Mechanism (Cont.) CR3 points to current page directory
May be changed per process
Usually, a page directory entry (covers 4MB) will point to a page table that covers data of the same type/usage
Sharing. Can alias Different linear on
same physical(e.g., SMC)
Pages from differentprocesses to samephysical (e.g., OS)
DIR TABLE OFFSET
Page Dir Code
Data
StackOS
Phys Mem
4K page
CR3
Page Tables
Microcomputer CourseKoby Gottlieb, 3/03, p-117
Page Entry Format 20 bit pointer to a 4K
Aligned address 12 bits flags Virtual memory
Present Accessed, Dirty
Protection Writable (R#/W) User (U/S#)
2 levels only!
0, 1, 2 Supervisor
Caching Page Write-Through Page Cache
Disabled
3 bits for OS usage
Figure 11-14. Format of Page Directory and Page Table Entries for 4K Pages
0
0 0
Page Frame Address 31:12 AVAIL 0 0 APCD
PWT
U W P
PresentWritableUserWrite-ThroughCache DisableAccessedPage Size (0: 4 Kbyte)Available for OS Use
Page DirEntry
04 12357911 681231
Page Frame Address 31:12 AVAIL D APCD
PWT
U W P
PresentWritableUserWrite-ThroughCache DisableAccessedDirtyAvailable for OS Use
Page TableEntry
04 12357911 681231
Reserved by Intel for future use (should be zero)
-
-
Microcomputer CourseKoby Gottlieb, 3/03, p-118
Paging - Virtual Memory
A page can be New, not yet loaded Loaded Paged out to disk
A loaded page can be Clean Dirty
When a page is not loaded (P bit clear) => a page fault occurs It may require removing a loaded page to insert the new one
OS prioritizes removal by LRU and dirty/clean/avail bitsDirty page should be written to disk. Clean ones need not
New page is either loaded from disk or “initialized” CPU sets page “access” flag when accessed, “dirty” when written
Copy-on-write strategy for forked Unix processes
Microcomputer CourseKoby Gottlieb, 3/03, p-119
Paging Pros and Cons
Independence of physical memory size No need to preserve continuous memory ranges Physical memory is allocated only if accessed Can map same linear address to different physical address Can share data/code among processes Pages have predefined size - easy memory management Simple protection mechanism
Overhead of tables (insignificant) Internal fragmentation (insignificant) Translation time - reduced by TLB
Microcomputer CourseKoby Gottlieb, 3/03, p-120
Entering Protected Mode, Paging “Easy”: set CR0.PE to enable protection, CR0.PG to enable Paging…
But one better make sure all tables are ready and valid...
Figure 10-3. Control Registers
Reserved ReservedNE
ET
TS
EM
MP
PE
Page Directory Base PCD
PWT
0 ... 0 0MCE
PSE
DE
TSD
PVI
VME
Page Fault Linear Address
WP
AM
PG
CD
NW CR0
CR1
CR2
CR3
CR4
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
Microcomputer CourseKoby Gottlieb, 3/03, p-121
Mapping 4M pages, 36 bit address.
PSE (page file extension) flag bit 4 of CR4 The PSE enables 4MByte pages (or 2Mbyte pages when
PAE flag is set). Those pages are mapped directly from the page directory, without using page tables.
PAE (physical address extension) flag bit 5 of CR4 The PAE flag enables 36 bit physical addresses (Not
covered here)
Microcomputer CourseKoby Gottlieb, 3/03, p-122
Paging/Segmentation Example All left indentation are data to the example, right are conclusions/results. All numbers are in HEX.GDTR: 0000 2000CR3: 0000 3000Logical address DS: 23, Offset: 3456
0010 0011
DS: index = 4, TI=GDT, Priv level = 3GDT(4) is in address 2000+4*8=2020
Assume protection validation is successful...GDT(4): base 00C03000 (linear!)
Linear address = 00C03000+3456 = 00C064560000 0000 1100 0000 0110 0100 0101 0110
PDE = 003 (0000 0000 11)PTE = 006 (00 0000 0110)Off = 456 (0100 0101 0110)
address in PDE = 3000+3*4 = 300C
M(300C) = 4000
address in PTE = 4000+6*4 = 4018
M(4018) = 5000
full address in page = 5000+456 = 5456
M(5456) = 12345678Finished? Not exactly....
Microcomputer CourseKoby Gottlieb, 3/03, p-123
Paging/Segmentation Example (Cont.)
But: where do we find the GDT?
GDTR points to 2000 linear, need to look at the physical address to see GDT(4)!
Linear address = 00002020 = 0000 0000 0000 0000 0010 0000 0010 0000
PDE = 000 (0000 0000 00)
PTE = 002 (00 0000 0020)
Off = 020 (0000 0010 0000)
address in PDE = 3000+0*4 = 3000
M(3000) = 6000
address in PTE = 6000+2*4 = 6008
M(6008) = 5000
full address in page = 5000+020 = 5020
M(5020) = 0010 f0c0 3000 1234
(G=0, 16bit, avail, present, DPL=3, appl, data/RO, limit=1234, base = 00c03000)
Microcomputer CourseKoby Gottlieb, 3/03, p-124
Addressing Methods - Summary
16 bits real and virtual mode: seg_reg*16+offset
16/32 bit protected mode:Use global/local descriptor table (GDT/LDT) xDT[seg_reg].base + offset (xDT = GDT/LDT) Check protection! Note: 32-bit is always protected
32 bit paging enabled: First translate as above, then Convert virtual (linear) to physical
pdi=vaddr[31:23]; pti=vaddr[22:13]; offset=vaddr[11:0]paddr = CR3->pdt[pdi]->pt[pti]+offset
Check protection!
Microcomputer CourseKoby Gottlieb, 3/03, p-125
Lecture 7: Interrupts and Exceptions
Microcomputer CourseKoby Gottlieb, 3/03, p-126
Interrupts and Exceptions
Some of the material in this section was taken from:The advanced Intel microprocessors - 80286/80386/80486.By: Barry B. Brey, DeVry institute of technology. Pub. by MacMillan publishing company
Microcomputer CourseKoby Gottlieb, 3/03, p-127
What is it?
“A signal from from hardware or software, such as a keystroke, that demands immediate attention and takes priority over other operations ”
Analogy: Alarm clock Pain
Microcomputer CourseKoby Gottlieb, 3/03, p-128
What Causes It? Interrupts
External to the CPU
Peripheral: I/O: mouse move/click, keyboard, timer
Exceptions Internal to the CPU
Exceptions: Processor detected:
Fault: restart from the causing instruction (e.g., page fault)
Trap: restart from the next instruction (e.g., overflow)
Abort: cannot restart (double fault) Programmed Exceptions (“software interrupts”):
e.g.: INTO, INT 3, INT n, BOUND
The terms “interrupts” & “exceptions” are frequently intermixed
Microcomputer CourseKoby Gottlieb, 3/03, p-129
Why Is It Important?
Few categories
Inevitable: only way to implement a feature Cannot implement timers, page faults, ^break etc. without it
Difficult without: alternatives imply huge program overhead e.g., polling to identify keyboard stroke
Convenience: cheaper way to implement a feature e.g., zero divide, INTO, BOUND, etc..
Conventions: Application/OS communication System call. Allow using procedures without linkage
Microcomputer CourseKoby Gottlieb, 3/03, p-130
How Does It Work?
Internal interrupts: handled completely by CPU + SW
External interrupt involves CPU + SW + peripherals External device delivers “INTR” signal (or NMI, SMI)
CPU acknowledges with “INTA” bus cycle
Device sends interrupt# on Data bus
Usually requires a PIC (Programmable Interrupt Controller) in system
External interrupts (except NMI/SMI) can be masked (IF=0)
SignalIntr
Data
Nmi,..
DEVICE PIC CPU
IntA
Microcomputer CourseKoby Gottlieb, 3/03, p-131
Interrupt Vector 256 entry array Each entry points to an interrupt/exception handler Located in address 0 (8086) or as pointed by IDTR register First 32 entries reserved for “Intel” for CPU internal exceptions Entry contains
Real mode: 4 bytes = Segment:offset Protected mode: 8 bytes = Interrupt/Trap/Task Gate
In both cases, control transferbased on the entry info
Figure 14-1. IDTR Locates IDT in Memory
IDTR Register
IDT Base Address IDT Limit
0151647
+ Gate forInterrupt #N
Gate forInterrupt #1
Gate forInterrupt #0
+8
+8*N
+16
+0
Real Mode Interrupt Vector
CS #NIP #N
CS #1IP #1CS #0IP #0
4
8
0
4*N
Microcomputer CourseKoby Gottlieb, 3/03, p-132
Reserved entriesVector Number Description
0 Divide Error 86
1 Debug Exception 86
2 NMI Interrupt 86
3 Breakpoint 86
4 INTO-detected Overflow 86
5 BOUND Range Exceeded 186
6 Invalid Opcode 86
7 Device Not Available 86
8 Double Fault 286
9 CoProcessor Segment Overrun (reserved) 286
10 Invalid Task State Segment 286
11 Segment Not Present 286
12 Stack Fault 286
13 General Protection 86
14 Page Fault 386
15 (Intel reserved. Do not use.)
16 Floating-Point Error 386
17 Alignment Check 486
18 Machine Check* Pentium
19-31 (Intel reserved. Do not use.)
32-255 Maskable Interrupts
Microcomputer CourseKoby Gottlieb, 3/03, p-133
Interrupt Vector Content (DOS)# Interrupt Name Address Owner
00 Divide by Zero 1756:00D2 NC01 Single Step 0070:06F4 Unknown02 Nonmaskable 1993:26D2 DOS03 Breakpoint 0070:06F4 Unknown04 Overflow 0070:06F4 Unknown05 Print Screen F000:FF54 BIOS06 Invalid Opcode F000:EB43 BIOS07 Reserved F000:EAEB BIOS08 IRQ0 - System Timer 15BE:0000 FAXTSR09 IRQ1 - Keyboard 15BE:001D FAXTSR0A IRQ2 - Reserved 0A08:0057 DOS0B IRQ3 - COM2 0A08:006F DOS0C IRQ4 - COM1 D60B:0095 BIOS0D IRQ5 - Reserved 0A08:009F DOS0E IRQ6 - Diskette 0A08:00B7 DOS0F IRQ7 - Printer 0070:06F4 Unknown10 Video 15BE:0100 FAXTSR11 Equipment Determination F000:F84D BIOS12 Memory Size Determination F000:F841 BIOS13 Fixed Disk/Diskette 15BE:0174 FAXTSR14 Asynchronous Communication F000:E739 BIOS15 System Services 0C4C:19A0 SMARTDRV16 Keyboard F000:E82E BIOS17 Printer F000:EFD2 BIOS18 Resident BASIC F000:E000 BIOS19 Bootstrap Loader 0C4C:1990 SMARTDRV1A Real-Time Clock Services F000:FE6E BIOS1B Keyboard Break 0070:06EE Unknown1C User Timer Tick 1465:00D6 CASMGR1D Video Parameters F000:F0A4 BIOS1E Diskette Parameters 0000:0522 Device Driver1F Video Graphics Characters C000:5A52 BIOS20 Program Terminate 0255:003A DOS System Area21 General DOS Functions 15BE:01A1 FAXTSR22 Terminate Address 18B9:02B1 COMMAND23 Ctrl+Break Handler Address 1993:26BA DOS24 Critical Error Handler 18B9:0155 COMMAND25 Absolute Disk Read 0C4C:19DE SMARTDRV26 Absolute Disk Write 0C4C:1A27 SMARTDRV27 Terminate and Stay Resident 0255:004C DOS System Area
# Interrupt Name Address Owner
28 DOS Idle 15BE:0222 FAXTSR29 DOS Internal - FAST PUTCHAR 0255:0052 DOS System Area2A Microsoft Networks 15BE:073E FAXTSR2B Reserved for DOS 0116:10DA DOS System Area2C Reserved for DOS 0116:10DA DOS System Area2D Reserved for DOS 0116:10DA DOS System Area2E DOS - Execute Command 0ACD:013F COMMAND2F Multiplex (Process Interface) 15BE:079E FAXTSR....40 Diskette BIOS Revector F000:EC59 BIOS41 Fixed Disk Parameters F000:E13D BIOS42 Relocated Video Handler F000:F065 BIOS43 EGA/VGA User Font Table C000:5652 BIOS44 Novell Netware API F000:EA97 BIOS45 Reserved F000:EA97 BIOS46 Fixed Disk Parameters F000:E401 BIOS....70 IRQ8 - Real-Time Clock 0A08:0052 DOS71 IRQ9 - Reserved F000:EED2 BIOS72 IRQ10 - Reserved 0A08:00CF DOS73 IRQ11 - Reserved 0A08:00E7 DOS74 IRQ12 - Reserved 0A08:00FF DOS75 IRQ13 - Redirect to NMI F000:EEDB BIOS76 IRQ14 - Fixed Disk 0A08:0117 DOS77 IRQ15 - Reserved F000:8E8D BIOS....F8 Reserved for User Programs 0040:0000 Device DriverF9 Reserved for User Programs 7A0E:7E00 UnknownFA Reserved for User Programs 0080:7C00 SSCDROMFB Reserved for User Programs 972F:0001 UnknownFC Reserved for User Programs 8304:F81C UnknownFD Reserved for User Programs EF40:7C00 BIOSFE Reserved for User Programs F729:F80B SSCDROMFF Reserved for User Programs 0002:F000 SMARTDRV
CPU Interrupts
External Interrupts
Microcomputer CourseKoby Gottlieb, 3/03, p-134
Additional Players Flags
IF flag: masks external interruptsSet/Clear by STI/CLI
TF flag: is the CPU in TRAP mode (single-step)Set/Clear by eFLAGS manipulation
Instruction IRET: “return from interrupt” = RET + few more things Software interrupts: INT n, INT3, INTO (4), BOUND (int 5)
Signals (for external interrupts) INTR: interrupt request INTA: interrupt acknowledge (bus cycle) = IO#*C*R# Data Bus: pass interrupt # NMI, SMI: additional interrupt lines
NMI - non maskable (timer, power fail)SMI - system management (OS independent power management)
When an interrupt/exception occurs, the interrupt# is decoded and then the story begins...
Microcomputer CourseKoby Gottlieb, 3/03, p-135
Interrupt Processing On Call (italic: Protected mode only)
Stack (SS:ESP) is set to level 0 stack (If not there) Store on stack
Old SS, ESP (on privilege level change only)Old FlagsOld CS, EIP (the offending instruction, or the next one!)Error code (for certain exceptions only. Contains erred selector)
Clear the IF/RF bits of EFLAGs (masks interrupts, delay single-step) Jump to the interrupt/exception handler
Handler code <Do the right things right…> Issue IRET
On IRET: Similar to regular RET + Restore EFLAGs
Figure 14-4. Stack Frame After Exception or Interruptwith and without privilege level change
Unused
Old SS
Old ESP
Old EIP
Old EFLAGs
Old CS
Error Code
ESP fromTSS
NewESP
OldESP
Microcomputer CourseKoby Gottlieb, 3/03, p-136
Interrupt Handling
Hardware side Ensure transparency to the affected task
Will return to the point of interruption w/ the right flagsDoes the minimum - only what software cannot do!
Interrupt handlers Transparent: preserve all used registers! Short as possible. If cannot - enable interrupts during handling
Do not disable interrupts for a long time! Avoid nesting of same interrupts, or make sure the software can
handle that! Some of the information can/should be extracted from the Stack
(old CS:eIP, old SS:eSP, eFLAGS, error code)
Interrupt handler writers must fully understand the relevant HW interaction (PIC + actual peripheral)
Microcomputer CourseKoby Gottlieb, 3/03, p-137
Simple Interrupt Routine
<From Barry B. Bray: page 412 CHAPTER 10: INTERRUPTS>EXAMPLE 10-3
;Procedure that displays all the registers and their contents;on the video screen in response to a trap interrupt
0000 TRACE PROC FAR
0000 50 PUSH AX ;save registers0001 55 PUSH BP0002 53 PUSH BX
0003 BE 0054 R MOV BX, OFFSET NAMES ;address register names
;display registers
0006 E8 009A R CALL CRLF ;get new display line0009 E8 00A9 R CALL DISP ;display AX000C 88 C3 MOV AX,BX000E E8 00A9 R CALL DISP ;display BX0011 88C1 MOV AX,CX0013 E8 00A9 R CALL DISP ;display CX0016 88 C2 MOV AX,DX0018 EB 00A9 R CALL DISP ;display DX0018 88 C4 MOV AX,SP001D 05 000C ADD AX,120020 E8 00A9 R CALL DISP ;display SP0023 88 CS MOV AX,BP0025 E8 00A9 R CALL DISP ;display BP0028 8B C6 MOV AX,SI002A EB 00A9 R CALL DISP ;display SI002D 8B EC MOV BP,SP ;address stack with BP002F 88 46 06 MOV AX,[BP+6]0032 E8 00A9 R CALL DISP ;display IP0035 8B 46 0A MOV AX,[BP+10]0038 E8 00A9 R CALL DISP ;display flags0038 8B 46 08 MOV AX,[BP+8]003E EB 00A9 R CALL DISP ;display CS0041 8CDB MOV AX, DS
0043 E8 00A9 R CALL DISP ;display DS0046 8C C0 MOV AX,ES0048 E8 00A9 R CALL DISP ;display ES0048 8C D0 MOV AX,SS004D E8 00A9 R CALL DISP ;display SS
0050 5B POP BX ;restore registers0051 5D POP BP0052 58 POP AX0053 CF IRET0054 TRACE ENDP
Note:; Call CRLF: print new line; Call DISP: print reg name & value
Push registers
Manipulates stack
Restore registers
Microcomputer CourseKoby Gottlieb, 3/03, p-138
Setting/Clear TF Flag There are no special instructions that set or clear the trap flag. Flag is on the stack at [BP+8]
; Procedure that sets TF to enable trap0000 TRON PROC FAR0000 50 PUSH AX ; save registers0001 55 PUSH BP0002 8B EC MOV BP, SP ; get SP0004 88 46 08 MOV AX, [BP+8] ; get flags0007 80 CC 01 OR AH,1 ; set TF000A 89 46 08 MOV [BP+8],AX ; save flags000D 5D POP BP ; restore registers000E 58 POP AX000F CF IRET0010 TRON ENDP
;Procedure that clears TF to disable trap0010 TROFF PROC FAR0010 50 PUSH AX ; save registers0011 55 PUSH BP0012 8B EC MOV BP, SP ; get SP0014 88 46 08 MOV AX, [BP+8] ; get flags0017 80 ES FE AND AH, 0FEH ; clear TF001A 89 46 08 MOV [BP+8],AX ; save flags001D 5D POP BP ; restore registers001E 58 POP AX001F CF IRET0020 TROFF ENDP
Microcomputer CourseKoby Gottlieb, 3/03, p-139
Order of interrupts Last instruction traps (breakpoint) External interrupts Fetch related (Code page fault) Decode related (Illegal opcode) Execution related (Data page fault)
Microcomputer CourseKoby Gottlieb, 3/03, p-140
Hardware Example
10K
D0D1D2D3D4D5D6D7
RD#WR#A0A1RESETCS#
PA0PA1PA2PA3PA4PA5PA6PA7
PB0PB1PB2PB3PB4PB5PB6PB7
PC0PC1PC2PC3PC4PC5PC6PC7
D0D1D2D3D4D5D6D7
DAV#
Keyboard
8255A-5
1 1 1 1 2 2 2 2 Y Y Y Y Y Y Y Y1 2 3 4 1 2 3 4
1 1 1 1 2 2 2 2 A A A A A A A A 1 21 2 3 4 1 2 3 4 G G
D0D1D2D3D4D5D6D7
IORC#
A1A2
RESET
WAIT2#
IOWC#A0A3A4A5A6A7A8A9
A10
A11A12A13A14A15
INTR
INTA#
I1I2I3I4I5I6I7I8I9I10
O1O2O3O4O5O6O7O8
16L8
vcc
U1
STB#
U3
U2
74ALS244
An 8255A-5 Interface to a KB from 80286 using interrupt vector 40H
123456789
11
1918171615141312
3433323130292827
53698
35
6
432140393837
1819202122232425
1415161713121110
18
16
14
12
9
7
5
3
2
4
6
8
11
13
15
17
1
19
D0-D7: 00000010 = 40H
Microcomputer CourseKoby Gottlieb, 3/03, p-141
HW Intr Routine <From Barry B. Bray: page 422 CHAPTER 10: INTERRUPTS>
EXAMPLE 10-5; Interrupt service procedure that reads a
key; from the keyboard.
= 0500 PORTA EQU 500H ; port A= 0506 CNTR EQU 506H ; control
register0000 0100 FIFO DB 256 DUP (?) ; FIFO buffer0100 0000 INP DW ? ; input pointer0102 0000 OUTP DW ? ; output pointer
0104 KEY PROC FAR
0104 50 PUSH AX ; save registers0105 53 PUSH EX0106 57 PUSH DI0107 52 PUSH DX
0108 2E: 8B 1E 0100 R MOV BX, INP ; address FIFO010D 2E: 88 3E 0102 R MOV DI, OUTP0112 FE C3 INC BL ; test for full0114 38 DF CMP BX, DI0116 7411 JE FULL ; if full
0118 FE CB DEC BL011A BA 0500 MOV DX, PORTA011D EC IN AL, DX ; get data011E 2E: 88 07 MOV CS:[BX],AL ; save data0121 2E: FE 06 0100 R INC BYTE PTR INP0126 EB 07 90 JMP DONE
0129 FULL:0129 BO 08 MOV A/8 ; disable
interrupts0128 BA 0506 MOV DX, CNTR012E EE OUT DX, AL
012F DONE:012F 5A POP DX ; restore
register0130 5F POP DI0131 58 POP EX0132 58 POP AX0133 CF IRET
0134 KEY ENDP
<From Barry B. Bray: page 423 CHAPTER 10: INTERRUPTS>
EXAMPLE 10-6; procedure that reads a key from the FIFO and; returns with it in AH
0104 READ PROC FAR0104 53 PUSH EX ; save registers0105 57 PUSH DI0106 52 PUSH DX
0107 EMPTY:
0107 2E: 8B 1E 0100 R MOV BX, INP ; address FIFO010C 2E: 8B 3E 0102 R MOV DI, OUTP0111 38 DF CMP BX, DI ; test for empty0113 74 F2 JE EMPTY ; if empty
0115 2E: 8A 25 MOV AH, CS:[DI] ; get data0118 8009 MOV AL9 ; enable 8255A-5 intr011A BA0506 MOV DX, CNTR011D EE OUT DX, AL
; increment pointerO11E 2E: FE 06 01M R INC BYTE PTR OUTP0123 5A POP DX : restore registers0124 5F POP DI0125 58 POP EX0126 CB RET
READ ENDP
0
256
INP
OUTP
INP (after KB press)
OUTP (after “READ”)
FIFO
Microcomputer CourseKoby Gottlieb, 3/03, p-142
The Programmable Interrupt Controller (PIC)
Heavy-weight legacy, the 8259A component - PC/AT Today part of the “chip-set”
But with full software compatibility
Newer systems use the Advanced PIC (APIC) Supports multiprocessor systems Not discussed here
Functions: React to an external interrupt, according to priorities Convert the interrupt request (IRQ) into a vector Inform the peripheral that the interrupt was acknowledged
Vector addresses can be programmed Using a set of “commands”
Up to 8 components can be cascaded. In practice, only 2
Microcomputer CourseKoby Gottlieb, 3/03, p-143
8259A connection 10K
D0D1D2D3D4D5D6D7
A0CS#RD#WR#SP/EN#INTINTA
IR0IR1IR2IR3IR4IR5IR6IR7
CAS0CAS1CAS2
8259A
D0D1D2D3D4D5D6D7
A1
IORC#
INTRINTA#
WAIT2#
IOWC#A0A2A3A4A5A6A7A8A9
A10A11A12A13A14A15
I1I2I3I4I5I6I7I8I9I10
O1O2O3O4O5O6O7O8
16L8
U2
U1
An 8259A Interface to the 80286
123456789
11
1918171615141312
1110 987654
27132
1617
26
1819202122232425
121315
vcc
Microcomputer CourseKoby Gottlieb, 3/03, p-144
Cascading 8259 & IRQs
Number Address Name Owner
IRQ 00 15BE:0000 Timer Output 0 FAXTSR
IRQ 01 15BE:001D Keyboard FAXTSR
IRQ 02 0A08:0057 [Cascade] DOS
IRQ 03 0A08:006F COM2 DOS
IRQ 04 D60B:0095 Mouse BIOS
IRQ 05 0A08:009F LPT2 DOS
IRQ 06 0A08:00B7 Floppy Disk DOS
IRQ 07 0070:06F4 LPT1 Unknown
IRQ 08 0A08:0052 Realtime-Clock DOS
IRQ 09 F000:EED2 Reserved BIOS
IRQ 10 0A08:00CF Reserved DOS
IRQ 11 0A08:00E7 Reserved DOS
IRQ 12 0A08:00FF Reserved DOS
IRQ 13 F000:EEDB Coprocessor BIOS
IRQ 14 0A08:0117 Fixed Disk DOS
IRQ 15 F000:8E8D Reserved BIOS
8259 #1
8259 #2
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
INTRto CPU
INTRto next8259
Microcomputer CourseKoby Gottlieb, 3/03, p-145
Lecture 8: The Software-Hardware Interface
Microcomputer CourseKoby Gottlieb, 3/03, p-146
The Software/Hardware Interaction
Background Hal/BIOS Reset/Boot Memory management More...
Microcomputer CourseKoby Gottlieb, 3/03, p-147
Old Computers...
Applications talk directly to hardware OK for embedded systems…
Program is burned in ROM
Problematic for a personal computer Who loads programs? What if underlying HW changes? Managing shared resources: memory,
files, protection? No reuse of software Programmer must be expert in HW?
Reprogrammable computer must have an OPERATING SYSTEM
Application
Hardware
Microcomputer CourseKoby Gottlieb, 3/03, p-148
OS Role Ideal model
Applications talk with the OS OS talks with the HW
This seems to solve all problem: OS loads programs HW changes require OS changes, but
not application change OS manages shared resources OS contains reused (shared) SW Programmers need not be aware of HW
In practice, however, applications continued to interact with HW
No OS support for new devices Performance requirement (e.g., screen)
Applications
OS
Hardware
Ideal Model
Applications
OS
Hardware
Practical Model
Microcomputer CourseKoby Gottlieb, 3/03, p-149
General OS models Monolithic
All procedures interact All can talk to the HW
Hardware
System Services
ApplicationProgram. . . . . .
User Mode
Kernel Mode
Monolithic Operating System
OperatingSystem
Procedures
ApplicationProgram
File System
Memory and I/ODevice mgmt
ProcessorScheduling
Hardware
System Services
ApplicationProgram. . . . . .
User Mode
Kernel Mode
Layered Operating System
OperatingSystem
Procedures
ApplicationProgram
Layered A procedure can talk to lower
layers only May be enforced by HW
Microcomputer CourseKoby Gottlieb, 3/03, p-150
General OS Models (Cont.) Client/Server Ideally - kernel consists of message passing only
Clean, modular, secure... but slow
In practice - kernel contains more. e.g. Scheduler, memory mgmt, device drivers, etc...
Hardware
MicroKernel
User Mode
Kernel Mode
Client/Server Operating System
ServiceProcedures
ProcessServer
FileServer
MemoryServer
NetworkServer
DisplayServer
ClientApplication
ClientApplication
. . . . . .
Send
Reply
Microcomputer CourseKoby Gottlieb, 3/03, p-151
Windows NT Basic Layers
Hardware
System ServicesSystem Services
ProcessManager
ProcessManager
VirtualMemoryManager
VirtualMemoryManager
LPCFacility
LPCFacility
Object Manager
Object Manager
SecurityReference
Monitor
SecurityReference
Monitor I/OSystem
I/OSystem
KernelKernel
Hardware Abstraction LayerHardware Abstraction Layer
Hardware
Hardware
UserKernel
Win32Client
Win32Client
Win32sub
system
Win32sub
system
PosixClient
PosixClient
Posixsub
system
Posixsub
system
. . . . . . LogonProcess
LogonProcess
Securitysub
system
Securitysub
system
. . . . . .
Microcomputer CourseKoby Gottlieb, 3/03, p-152
Hardware Independence
Separate between the OS and the actual hardware Allows
Same OS can BOOT on different platforms as-is Same OS can run on different platforms as-is
At boot time: hardware independence achieved by BIOS (basic I/O system)
At run time: BIOS (DOS, Windows 3.1/95) HAL (NT) Device drivers
Applications
OS
BIOS/HAL
BIOS Model
Hardware
Ideal model Reality
Microcomputer CourseKoby Gottlieb, 3/03, p-153
BIOS Contains basic functionality for HW handling
Disk I/O Screen I/O Mouse, keyboard
All that is needed to boot the system - and more
Resides in a non-volatile memory (EPROM, Flash) Supplied with the computer
BIOS suppliers: Phoenix, AMI (American Megatrend)Usually in cooperation with chip and system makers
OS supplier: Microsoft, Microsoft, ...
Interfaces to devices are done also by: Applications (DOS only) Device drivers (Windows)
BIOS additional tasks Boot Setup (HW parameters)
Microcomputer CourseKoby Gottlieb, 3/03, p-154
Inter-Layer Interface
API (application programming interface) Defines the function, parameters, return values. etc...
Level of interface: Programmer level: C-like, machine independent Code level: machine dependent (stack layout, etc..) Transfer from programmer code to library code
Regular call, standard API
Transfer from user code to OS code (still in user mode!) Call/Jmp. May use non-standard API
Transfer from user mode to system mode Software interrupt Inter-level call (using call gate)
extern int read(int, void *, unsigned); main (){ int file=0,buf[100],icount=10,count; count = read(file,buf,icount);}
disassembly for read.out:
.._cerror() 80481c8: a3 70 94 04 08 movl %eax,0x8049470 / errno <- code 80481cd: b8 ff ff ff ff movl $0xffffffff,%eax / ret val <- -1 80481d2: c3 ret 80481d3: 90 nop 80481d4: 90 nop 80481d5: 90 nop 80481d6: 90 nop 80481d7: 90 nop
_read()read() 80481d8: b8 03 00 00 00 movl $0x3,%eax / set read code 80481dd: 9a 00 00 00 00 07 00 lcall 0x7,0x0 / enter kernel 80481e4: 72 02 jb 0x2 <80481e8> / check error 80481e6: c3 ret / OK: return 80481e7: 90 nop 80481e8: 3c 5b cmpb $0x5b,%al 80481ea: 74 ec je 0xec <80481d8> 80481ec: e9 d7 ff ff ff jmp 0xffffffd7 <80481c8> / error routine 80481f1: 90 nop 80481f2: 90 nop 80481f3: 90 nop
main()[3] 80481f4: 55 pushl %ebp 80481f5: 8b ec movl %esp,%ebp 80481f7: 81 ec 90 01 00 00 subl $0x190,%esp 80481fd: 57 pushl %edi 80481fe: 56 pushl %esi[4] 80481ff: 2b ff subl %edi,%edi 8048201: be 0a 00 00 00 movl $0xa,%esi[6] 8048206: 56 pushl %esi 8048207: 8d 85 70 fe ff ff leal -400(%ebp),%eax 804820d: 50 pushl %eax 804820e: 57 pushl %edi 804820f: e8 c4 ff ff ff call 0xffffffc4 <80481d8> / call read lib 8048214: 83 c4 0c addl $0xc,%esp[7] 8048217: 5e popl %esi 8048218: 5f popl %edi 8048219: 8b e5 movl %ebp,%esp 804821b: 5d popl %ebp 804821c: c3 ret
Unix system code sequence
#include <time.h>main (){ time_t l1,l2; l2=time(&l1); // time receives pointer to time_t (long) and returns time_t }--------------------------------------------------------------- User code4: {00401010 55 push ebp00401011 8bec mov ebp,esp00401013 83ec08 sub esp,0000000800401016 53 push ebx00401017 56 push esi00401018 57 push edi5: time_t l1,l2;6:7: l2=time(&l1);00401019 8d45fc lea eax,dword ptr [l1]0040101c 50 push eax0040101d e812000000 call time (00401034) / go to library00401022 83c404 add esp,0000000400401025 8945f8 mov dword ptr [l2],eax8: }00401028 5f pop edi00401029 5e pop esi0040102a 5b pop ebx0040102b c9 leave0040102c c3 ret---------------------------------------------------------------- Library code_time:00401034 55 push ebp00401035 8bec mov ebp,esp00401037 83ec10 sub esp,000000100040103a 8d45f0 lea eax,dword ptr [ebp-10]0040103d 50 push eax / enter DLL0040103e ff15bc804000 call dword ptr [__imp__GetLocalTime@4 (004080bc)]00401044 0fb74dfc movzx ecx,word ptr [ebp-04]00401048 0fb755fa movzx edx,word ptr [ebp-06]0040104c 51 push ecx0040104d 52 push edx0040104e 0fb745f8 movzx eax,word ptr [ebp-08]00401052 0fb74df6 movzx ecx,word ptr [ebp-0a]00401056 50 push eax00401057 51 push ecx00401058 0fb755f2 movzx edx,word ptr [ebp-0e]0040105c 0fb745f0 movzx eax,word ptr [ebp-10]00401060 52 push edx00401061 50 push eax00401062 e87f010000 call ___loctotime_t (004011e6)00401067 83c418 add esp,000000180040106a 8b4d08 mov ecx,dword ptr [ebp+08]0040106d 85c9 test ecx,ecx0040106f 7402 jz _time+0000003f (00401073)00401071 8901 mov dword ptr [ecx],eax00401073 8be5 mov esp,ebp00401075 5d pop ebp00401076 c3 ret---------------------------------------------------------------- Dll code82929af0 683970f7bf push bff7703982929af5 e923aa663d jmp bff93c1d / enter kernel---------------------------------------------------------------- Kernel codebff93c1d ...... ......... ....... jmp far .... / enter ring 0003b:03d8 cd30 int 30 ;#0028:c0236288 --------Priveleged level
WIN95 system code sequence
Microcomputer CourseKoby Gottlieb, 3/03, p-157
DOS/BIOS Interface Very low level interface definition
Parameters are passed in registersAH contains function code
Return values are passed in registersCF contains error indication (Simple JC to check!)
Actual call is done by software interruptDOS: int 0x21BIOS: various 0x10-0x1A, e.g., 0x13 for disk, 0x16 for KBD
Low level interface pros/cons+ Faster code
+ Denser code size
- Very machine dependent: not portable, less extensible
Was OK for its time. Inappropriate for today
Microcomputer CourseKoby Gottlieb, 3/03, p-158
DOS ExamplesDOS: Delete file: INT 21H / DOS call
Input Parameters AH: 41HDS:DX asciiz pointer (“myfile.dat”)
Return values AX: 0: succeed
2: file not found5: fail to delete
example: LDS DX, file_name / set file nameMOV AH, 41H / delete codeINT 21H / dos callCMP AX, 0 / succeeded?JNZ fail
....file_name DB “myfile.dat”,0
----------------------------------------------------------------------------------------------
DOS: Read from file/ Device: INT 21H / DOS call
Input Parameters AH: 3FHBX file handleCX bytes to readDS:DX pointer to data buffer
Return values AX: CF=0 Number of bytes read
AX: CF=1 error code (5-access denied, 6-invalid handle)
Microcomputer CourseKoby Gottlieb, 3/03, p-159
BIOS Example
BIOS ReadDiskSector: INT 13H / Diskette/Fixed disk BIOS serviceAH: 2H / Read Disk sector
Input Parameters: AL: Number of sectors to readCH: cylinder # (low 8 bits)CL: Cylinder/Sector #DH: Head#DL: Drive#ES:BX Pointer to buffer
Return Values: AH: 0 = no errorFixed disk error code
CF: 0 = no error, 1 = error
Normal operation Application calls DOS to bring bytes from file DOS examines which device is associated with the file For disk:
DOS file system translates logical file address into physical disk address DOS Calls BIOS to read the actual bytes
Microcomputer CourseKoby Gottlieb, 3/03, p-160
BIOS Example (Cont.)
BIOS ReadKeyBoardInput: INT 16H / keyboard BIOS service
Input Parameter: AH: 0H / Read keyboard
Return Values: AH: Scan code
AL: Ascii code
Normal operation Application calls DOS to bring bytes from file DOS examines which device is associated with the file For keyboard: Calls BIOS to read the actual bytes
In practice, applications may access the keyboard directly+ Faster access+ Can get more info (extended code)- Some capabilities are lost: e.g., cannot redirect input!
Microcomputer CourseKoby Gottlieb, 3/03, p-161
Boot-Strap Process Reset
CS:IP is 0xf000:0xfff0 (within BIOS), CR0: Cache disabled, write-thru Perform optional BIST (built-in self test) Enter Real mode, most general/segment registers are 0
POST = Power on Self Test Start OS load (OS independent)
Call ROM loader Identify boot disk Load master boot record - pass control Find boot sector - load and pass control
DOS Specific (also WIN3.1) Find IO.SYS - load and pass control (BIOS extensions) Find MSDOS.SYS - load and pass control (DOS code) Load drivers Execute config.sys (more Device drivers) Find command.com - load and pass control (“shell”) Find autoexec.bat - run (initial apps) Prompt user
Microcomputer CourseKoby Gottlieb, 3/03, p-162
POST – Power-On Self Test
Check CPU integrity (regs, math, protected mode, etc..) BIOS checksum CMOS checksum (modifiable values for setup) Check/initialize DMA, keyboard controller Check lower 64KB of memory Check/initialize PIC, Cache, display controller Check all memory Check/initialize Serial & Parallel ports Check/initialize Disk/Diskette controller Look for BIOS extensions in segments A000:F000
Identified by AA55 on a 2K memory boundary. Code follows
Other operations: create shadow ROM BIOS, etc... Start Boot Strap
Microcomputer CourseKoby Gottlieb, 3/03, p-163
Boot-Strap - Comparison
NTNTInt 19
NT Loader
NT Min Kernel
NT Kernel
Device Driver
Login Process
BIOSBIOSReset
POST
BIOS
BIOS extensions
DOSDOSDOS Device Driver
DOS
TSRs
DOS Applications
WIN 3.1WIN 3.1Windows VxD
Windows Drivers (DLLs)
Windows apps and DLL
WIN 95WIN 95Windows VxD
Windows Drivers (DLLs)
Windows apps and DLL
Win95option 2
Win95option 1
Schematic onlySome details ignored
Microcomputer CourseKoby Gottlieb, 3/03, p-164
DOS, Win 3.1 & Win 95 Memory Map
* EW Jan/29/980
Extended Memory
High Mem
Rom BIOS
Empty RAM
Video, SCSI, Network,Expanded Memory, etc..
Empty RAM
DOS Free Memory
Applications
TSRs
DOS Device Drivers
DOS COMMAND.COM
BIOS
Interrupt Vector
4GB
1M+64K: 10FFEF
1M: 0FFFFF
1M-128K: 0E0000
640K: 9FFFF
1K, 3FF
Upper memory
Microcomputer CourseKoby Gottlieb, 3/03, p-165
Win 95 memory map
Shared
Ring 0
Shared
Ring 0
Shared
Ring 3
System Tables (IDT,GDT ..)
VxD
System DLLs
Memory Mapped Files
Top portion ofWin-16 global heap
Per Process area
Win32
Lower portion ofwin-16 global heap
MS-DOS
Per Process
Ring 3
4GB
3GB
2GB
4MB
0
Microcomputer CourseKoby Gottlieb, 3/03, p-166
NT Virtual Memory Map
Non Paged Pool(Cannot be swapped out)
Paged
Direct MappingAddresses
Paged
4GB
2GB
0
User & SystemDLLs
Per Process
Ring 3
System
Shared
Ring 0
Kernel, Drivers, IDT, GDT, TSS..
Microcomputer CourseKoby Gottlieb, 3/03, p-167
Memory Model in Win/NT and Win95
Minimize usage of segments Use paging for address and access check
Address error:Memory access beyond program limits is checked by the OS following page fault exception
Access error:Memory Access is checked for read/write & privilege level in the page table and the address segment attribute
There are different privilege levels for the OS and the user context (need to remember this during interrupt time)
The FS segment register is used as a Task pointer to the OS database
Win/NT Still supports 16-bit and real mode segmentation applications with Virtual X86 (need to remember during Interrupt service)
Microcomputer CourseKoby Gottlieb, 3/03, p-168
Win/NT / Win ‘95 Page Tables
PageDirectory
Process 1’s CR3
PageDirectory
Process 2’s CR3
PageTable
PageTable
4K
PageTable
PageTable
PageTable
PageTable
System PageTable
4K
4K
4K
PhysicalMemorypages
4K
Microcomputer CourseKoby Gottlieb, 3/03, p-169
Background: Processes and Threads
Different for different operating system, we’ll talk Win32 (Windows ‘95 and Windows NT)
Processes are the primary running objects One executable and several DLLs Address space Object (e.g. file) handles Security attributes Execution priority
A process can have multiple threads For concurrency Minimal state: just registers and a private stack
Threads are known to the OS Preemptive thread switch, as well as preemptive process switch
Threads need to synchronize access to shared resources
Microcomputer CourseKoby Gottlieb, 3/03, p-170
Process and Thread Memory Management
Each Process has its own page table Separate CR3
All user processes use the same segments
Process switch includes switching to its pages
Linear mapping address are the same as the logical address “Flat” memory model
Each thread has different stack allocation obtained from the its process memory space
No paging magic, simply a different value in ESP
How memory is shared between Processes Same physical memory page is mapped by two or more
processes’ page tables
=> Two linear addresses calculate to the same physical address
Each process switch causes flushing of the TLB
Microcomputer CourseKoby Gottlieb, 3/03, p-171
Mode Switch Enter protected mode:
Prepare at least minimal GDT/IDT/TR. Set GDTR CR0.PE <= 1. Segment register still points to old base/limit Load segment registers as needed Perform board initialization - as needed Avoid interrupts while switching. Now update IDTR. Update TSS
Enter Paging: Prepare minimal page tables (at least 1directory, 1 table) CR3 <= page directory, CR0.PG <=1 The above code in page which phys=linear.
Need jump before leaving the page.
Exit paging: The same thing, with CR0.PG <= 0
Exit protected (in 286 - just reset) Set all segments so limit=0xffff. Disable interrupts CR0.PE <= 0 IDTR = 0, enable interrupts
Microcomputer CourseKoby Gottlieb, 3/03, p-172
Get_vector: ...XOR EAX, EAX; MOV ES, AX / seg=0MOV AX, ES:x*4MOV old_vector_p, AX / get offsetMOV AX, ES:x*4+2MOV old_vector_p+2, AX / get selector
...old_vector_p: DD ? / apart from code! ....Set_vector: ...
XOR EAX, EAX; MOV ES, AX / seg=0MOV AX, new_vector_pMOV ES:x*4, AX / set offsetMOV AX, new_vector_p+2MOV ES:x*4+2, AX / set selector
...new_vector_p: DD new_service_proc / new proc addr ...int_service: IN AL, port# / read status of the keyboard
CMP AL, hot_key / check if special valueJZ hot_key_service / if OK, get requestJMP [old_service_p] / jmp far to saved address...
hot_key_service: ...
/** C example */void chain_intr(unsigned int vector,
unsigned char mask){
void (_cdecl _interrupt_far * _cdecl np)(unsigned);
_disable();np = _dos_getvect( vector);_dos_setvect(vector, process_intr);_enable();/* program PIC to enable
interrupts */outp( PIC_e, inp( PIC_e ) & mask);outp( PIC_b, EOI);
}
Interrupt Vector Chaining (simplified)
Microcomputer CourseKoby Gottlieb, 3/03, p-173
Running DOS under Windows - Virtual X86
Allow execution of real mode apps under protected mode OS
Use Real-Mode logical-to-linear address translation Linear = seg*16+offset
Paging is used for address translation Can have several VM X86 tasks simultaneously Lower 1MB can be mapped to anywhere
DOS/BIOS code is copied for each task Emulate (virtualize) privileged operations
Invokes a general protection handler Handler decides how to continue
In protected mode (level 0) - e.g., I/OBack in VM mode (level 3) - e.g., divide by 0
Microcomputer CourseKoby Gottlieb, 3/03, p-174
VM X86 Address Translation
Segment Unit Page Unit32
16/3
2
Effective Address (offset)
BaseLinearaddress
Physical address
DOS BOXVM1
DOS BoxVM2
Page table 1
Page table 2
20bit, 0-1M
20bit, 0-1M
Linear space
Linear space
Memory
4GB
Microcomputer CourseKoby Gottlieb, 3/03, p-175
Virtualizing Privileged I/O and Instructions
Windows has APIs (at the VxD level) to tie a handler to a port
Upon an exception:1. Check interrupt vector
2. Enter protected
mode handler
3. Optionally,
Process in VM X86
TASK A
TSS A
IO Bit Map
Level 0Handler
Level 3DOS/Windows
Application
IOPL = 0CPL=3
GP13
OUT DX, AL
LGDT
DIV AX,0
Interruptvector
1
2
3
Microcomputer CourseKoby Gottlieb, 3/03, p-176
Other
Microcomputer CourseKoby Gottlieb, 3/03, p-177
Other things
Debug/Test features. I/O addressing Initialization MultiProcessing (Locking) Event Monitoring (Pentium)
Microcomputer CourseKoby Gottlieb, 3/03, p-178
X86 SW Challenge
SW architects - define models of computations 16 bit: small/compact/medium/large 32 bit: flat is enough...
Compilers - Generates fastest code for: Family of processors Individual processor
Application writers Write faster code.
Microcomputer CourseKoby Gottlieb, 3/03, p-179
Lecture 9: Performance
Microcomputer CourseKoby Gottlieb, 3/03, p-180
General Performance Questions
How efficiently is the application running on the platform? How are resources being used? CPU, memory, IO, bus? Where CPU time is being spent? Can program logic and or algorithms be improved in time
consuming code
Section 1: Overview 181Microcomputer CourseKoby Gottlieb, 3/03, p-181
Order of Optimizations
Time everythingAccess memory efficientlyMaximize branch predictabilityAvoid stallsUse a good blend of instructions
Microcomputer CourseKoby Gottlieb, 3/03, p-182
Component profiling:
flat profiling Collecting time based and event based information while running
the application Break down of active executables (exe’s, dll’s, drivers etc.)
call graph profiling Display of the call graph Collecting the information based on hierarchical profiling
Adding the time spent in functions that were called to the time of each function
Cache utilization profiling Tradeoff between code and data size.
Example: Processing data in phase A and B. Running phase A to until a data buffer between phases is full.
Section 1: Overview 183Microcomputer CourseKoby Gottlieb, 3/03, p-183
Dynamic Execution
Dynamic Data Flow Analysis Determines data and register dependencies to detect opportunities
for out-of-order execution.
Deep Branch Prediction Processor can predict many branches ahead of the instruction
pointer
Speculative Execution Allows the processor to execute instructions ahead of the program
counter but still retire the results in the order of the original instruction stream.
Section 1: Overview 184Microcomputer CourseKoby Gottlieb, 3/03, p-184
Architecture Diagram
Fetch & Decode Unit(In order unit)•Fetches instructions•Decodes instructions to µOPs•Performs branch prediction
Retirement Unit(In order unit)•Retires instructions in order•Writes results to registers/memory
Dispatch / Execute Unit(out of order unit)•Schedules and executes µOPs•Contains 5 execution ports
L2 cache
Bus Interface Unit
L1 data cacheL1 instruction
cache
Instruction Pool/reorder buffer•Buffer of µOPs waiting for execution
System bus
Fetch Load Store
Section 2: Optimizations: Fetch Unit
Section 1: Overview 185Microcomputer CourseKoby Gottlieb, 3/03, p-185
Fetch Unit
Reads cache lines (32 bytes) from L1 instruction cache
Sends instructions to the decode unit
Fetches the next instructions based on branch prediction
Fetch & Decode Unit
Retirement Unit
Dispatch & Execute Unit
L2
Bus Interface Unit
data cacheIcache
System bus
Instruction Pool(ROB)
Section 2: Optimizations: Fetch Unit
Section 1: Overview 186Microcomputer CourseKoby Gottlieb, 3/03, p-186
Branch Prediction
Static Branch Prediction: Forward conditional branches will be predicted as not taken
jz @forward
Backward conditional branches will be predicted as takenjnz @backward
Unconditional direct branches are always takenjmp label
Unconditional indirect branches fall throughjmp [eax]
Dynamic Branch Prediction: Based on prediction and history in Branch Target Buffer (BTB)
If you can guess from the history what the branch will be, so will the processor (0101 -- taken, not taken, taken, not taken)
Section 2: Optimizations: Fetch Unit
Section 1: Overview 187Microcomputer CourseKoby Gottlieb, 3/03, p-187
Static Branch Prediction Example
What does the static branch prediction algorithm predict for the following branches?
loop:
add eax, [edi]
dec ecx
add edi, 4
jnz loop
cmp eax, ebx
je start
mov eax, ecx
start:
mul ebx
test ebx, ebx
jnz not_zero
mov ebx, 50
not_zero:
add ebx, 10
mov ebx, [esi]
push ebx
push eax
call swap
lea edx, [ecx *4]
jmp [edx]
1
Begin:
shr eax, 1
jnc Begin
mov ebx, eax
2 3
4 5 6
Answers: 1-3 taken, 4-6 not taken
Section 2: Optimizations: Fetch Unit
Section 1: Overview 188Microcomputer CourseKoby Gottlieb, 3/03, p-188
Return Stack Buffer
Return Stack Buffer is used to predict RETs Every time a CALL is executed, the return address is pushed on
the Return Stack Buffer Every RET pops the Return Stack Buffer and uses the address for
prediction
Make sure for each CALL there is a RET or the addresses will be out of sequence and all return address predictions will be wrong
Section 2: Optimizations: Decode Unit
Section 1: Overview 189Microcomputer CourseKoby Gottlieb, 3/03, p-189
Decode Unit
Converts Intel Architecture instructions into micro-ops (µOps)
Stores µOps in the Instruction Pool
To reduce false register dependencies, register references are converted to an internal register set Register alias table Register renaming
Fetch & Decode Unit
Retirement Unit
Dispatch & Execute Unit
L2
Bus Interface Unit
data cacheIcache
System bus
Instruction Pool(ROB)
Section 2: Optimizations: Decode Unit
Section 1: Overview 190Microcomputer CourseKoby Gottlieb, 3/03, p-190
The Decoders
3 parallel decoders Decode instructions which require 4 or fewer µOps
Microcode instruction sequencer Decodes instructions which are 5 or more µOps
Do not reorder code, minimal performance gain
microcode instruction sequencer
decoder 01-4 µOps
decoder 21 µOp
decoder 11 µOp
register allocation table
ROB/Instruction Pool
Section 2: Optimizations: Decode Unit
Section 1: Overview 191Microcomputer CourseKoby Gottlieb, 3/03, p-191
Register Renaming
40 internal registers used to reduce data dependencies No user access and transparent to software
What the processor sees:Asm code internal to processor
mov al, mem
add al, 6
mov [edi], al
mov eax, 5
mov dest1, eax
mov ax, mem2
shr ax, 3
mov dest2, ax
mov <r0>, mem
add <r0>, 6 =>> <r3>
mov [edi], <r3>
mov <r1>, 5
mov dest1, <r1>
mov <r2>, mem2
shr <r2>, 3 =>> <r4>
mov dest2, <r4>
Section 2: Optimizations: Decode Unit
Section 1: Overview 192Microcomputer CourseKoby Gottlieb, 3/03, p-192
Register Stalls
Register stalls occur when a large register is read after one of its partial registers is written
The instructions do not have to be adjacent because of out-of-order execution
Caused by register renaming The large register might be split among multiple internal registers
Penalty: Time it takes to retire the small register write
mov al, byte ptr [mem]
inc eax ; stalled waiting for al to retire
Section 2: Optimizations: Decode Unit
Section 1: Overview 193Microcomputer CourseKoby Gottlieb, 3/03, p-193
The Clean Bit
Used to indicate that a register is zero Means the partial register equals the larger register Only eax, ebx, ecx, and edx have clean bits
Partial register stalls will NOT happen if the clean bit is on No direct user access to this bit, completely internal to the
processor
Section 2: Optimizations: Decode Unit
Section 1: Overview 194Microcomputer CourseKoby Gottlieb, 3/03, p-194
What turns on the clean bit?
Zero the register with XOR or SUB ONLY!
The clean bit can stay on for multiple instructions The clean bit is not copied between registers
xor eax, eax ; eax’s clean bit is on
sub ebx, ebx ; ebx’s clean bit is on
xor eax, eax ; eax’s clean bit is on
mov ebx, eax ; does not copy the clean bit
Section 2: Optimizations: Decode Unit
Section 1: Overview 195Microcomputer CourseKoby Gottlieb, 3/03, p-195
What turns off the clean bit?
Clean bit is turned off when the operation has the possibility of changing the larger register
Use of the ‘h’ register turns off the clean bit Keep the clean bit on when using partial registers
xor eax, eax ; clean bit on
mov al, 5 ; clean bit still on
dec al ; clean bit still on
inc eax ; clean bit off, could spill into ah
mov al, 6 ; partial load
inc eax ; stall
Section 2: Optimizations: Decode Unit
Section 1: Overview 196Microcomputer CourseKoby Gottlieb, 3/03, p-196
Register Stall Examples
Identify code segments with register stalls
lahf
mov ebx, eax
sub eax, eax
mov ax, bx
inc eax
xor eax, eax
mov ax, bx
inc eax
xor eax, eaxmov ah, 5push eax
mov eax, 0
mov ax, bx
inc eax
movzx eax, bx
push eax1 2 3
4 5 6
pop ax
mov [mem], eax
xor eax, eax
mov al, bl
mov ah, cl
inc eax
pop eax
mov [mem], al7 8 9
Answers: odd boxes stall
Section 2: Optimizations: Decode Unit
Section 1: Overview 197Microcomputer CourseKoby Gottlieb, 3/03, p-197
Mixing of Code and Data
Don’t put data in the instruction stream put data:
in the data segment or on the stackat the end of the code segmentat the end of a function
Don’t make the processor decode data wastes decode time
usually happens with indirect jumps and their tables reduces cache efficiency
Section 2: Optimizations: Execution Unit
Section 1: Overview 198Microcomputer CourseKoby Gottlieb, 3/03, p-198
Dispatch & Execute Unit
Only out-of-order unit Schedules and executes µOPs
based on data-flow constraints
Fetch & Decode Unit
Retirement Unit
Dispatch & Execute Unit
L2
Bus Interface Unit
data cacheIcache
Instruction Pool(ROB)
System bus
Section 2: Optimizations: Execution Unit
Section 1: Overview 199Microcomputer CourseKoby Gottlieb, 3/03, p-199
Execution Units Diagram
ROBReservation
Station
Port 1
Port 3Port 2 Port 4
IntegerUnit
Load Unit
Store Address
Calculation Unit
Store DataUnit
Port 0
MOV EAX,[Memref]
MOV [Memref], EAX
INC
FMUL
FDIV
FADD
LEA
PAND
PMULHW PackedMultiply
PackedALU
Address Generation
Unit
FP Unit
IntegerUnit
PackedShift
PackedALU
ADDJMP
PAND
PSRLQPACKSSWB
Section 2: Optimizations: Execution Unit
Section 1: Overview 200Microcomputer CourseKoby Gottlieb, 3/03, p-200
Out-of-Order Execution and Data Dependencies
Out-of-Order execution is limited by data dependencies
for (i=0; i<SIZE/2; i+=2){
accum += array[i];accum += array[i+1];
}
for (i=0; i<SIZE/2; i+=2){
accum1 += array[i];accum2 += array[i+1];
}accum = accum1 + accum2;
Good: Unrolled, but... Better: Dependencies reduced
Section 2: Optimizations: Execution Unit
Section 1: Overview 201Microcomputer CourseKoby Gottlieb, 3/03, p-201
Serializing Instructions
Serializing instructions cause execution to be delayed until all previous instructions execute, retire, and results are written back to memory. read-modify-write instructions with lock prefix
Typically used in multiprocessor/ shared memory environments XCHG with a memory operand mov to control or debug registers CPUID FLDCW (used in _ftol)
Section 2: Optimizations: Fetch Unit
Section 1: Overview 202Microcomputer CourseKoby Gottlieb, 3/03, p-202
ftolftol
Casting floats to Integer formats cause serious penalties on the Pentium® II processor ~50 clks using ftol
~25clks in exe / ~30 clks in stalls!
3 Problems: Serialization Partial stall with fnstcw Memory stall
Solutions: In most cases you can use round
Section 2: Optimizations: Execution Unit
Section 1: Overview 203Microcomputer CourseKoby Gottlieb, 3/03, p-203
Store Address Unknown Blocking
If the destination address for a store is unknown, all subsequent loads will stall. e.g.
e.g.
loop:
mov eax, [esi] ;stalls waiting for add edi,4 to retire
add esi, 4
and eax, 0ffh
add edi, 4
mov [edi-4], eax
dec ecx
jnz loop
mov eax, [table+ebx*4]
mov [eax], ecx
mov ecx, [esi] ; stalls
Section 2: Optimizations: Execution Unit
Section 1: Overview 204Microcomputer CourseKoby Gottlieb, 3/03, p-204
Self Modifying Code
Do not modify code that might already be in the processor’s execution pipeline Both the in-order and out-of-order units will be flushed May cause branch prediction problems Processor stalls the in-order unit until the ROB empties Cache problems and penalties
Not the same as self generating code Code that is created once and executed, but never modified suffers
no penalty
Section 2: Optimizations: Retirement
Section 1: Overview 205Microcomputer CourseKoby Gottlieb, 3/03, p-205
Retirement Unit
Commits µOps to permanent machine state in original order
Removes µOps from reorder buffer
Fetch & Decode Unit
Retirement Unit
Dispatch & Execute Unit
L2
Bus Interface Unit
data cacheIcache
Instruction Pool
System bus
Section 2: Optimizations: Retirement
Section 1: Overview 206Microcomputer CourseKoby Gottlieb, 3/03, p-206
Store Forwarding
Improves the performance of load operations of recently stored data Must be same data size and alignment Saves the time it takes the cache to combine the data
GOOD BAD
32 bit load
8 bit store
16 bit store
16 bit load
32 bit store
32 bit store
32 bit load
8 bit load
16 bit load
32 bit store
32 bit load
16 bit store
Microcomputer CourseKoby Gottlieb, 3/03, p-207
Store Forward Example
Avoid mixing MMx memory LD’s with scalar integer memory ST’s (i.e., 8-, 16-, 32-bit ST’s) to the same SIMD data
Code sequences such as the following will cause MOB blocks:
MOV mem, eax # dword store to “mem”, low-half of SIMD variable
MOV mem+4, ebx # dword store to “mem+4”, high-half of SIMD variable
:
MOVQ mm0, mem # qword load of entire SIMD variable “mem”
# MOB blocks this load because the data requested is
# not a subset of the data of either ST in the MOB
Section 2: Optimizations: Fetch Unit
Section 1: Overview 208Microcomputer CourseKoby Gottlieb, 3/03, p-208
Store Forwarding C code example
float f;
unsigned int I;
.
.
.
f = I;
Why is it a problem?
Microcomputer CourseKoby Gottlieb, 3/03, p-209
Store Forwarding code example
Assigning unsigned int to floatmov eax, DWORD PTR _imov DWORD PTR -16+[ebp], eaxmov DWORD PTR -16+[ebp+4], 0fild QWORD PTR -16+[ebp]
A better code could be:int uint_fix[2] = { 0, 0x4f800000};f1 = ((int) i) + *(((float *) &uint_fix[i>>31]));
And in assembly:mov eax, DWORD PTR _ifild DWORD PTR _ishr eax, 31fadd DWORD PTR _uint_fix[eax*4]
Section 2: Optimizations: Memory & Cache
Section 1: Overview 210Microcomputer CourseKoby Gottlieb, 3/03, p-210
Memory Interface
16 Kb data cache 2 way, set associative
Write back cache by default
Fetch & Decode Unit
Retirement Unit
Dispatch & Execute Unit
L2
Bus Interface Unit
data cacheIcache
Instruction Pool
System bus
Section 2: Optimizations: Memory & Cache
Section 1: Overview 211Microcomputer CourseKoby Gottlieb, 3/03, p-211
The Memory Types
Write back (WB) Default for system memory All reads and writes are cached Write misses allocate and fill cache lines Read and write sequentially for maximum performance
Uncacheable memory (UC) Accesses are executed in order without reordering Accesses to other memory types can pass accesses to UC No speculative accesses
Section 2: Optimizations: Memory & Cache
Section 1: Overview 212Microcomputer CourseKoby Gottlieb, 3/03, p-212
Write-Combining (WC)
Writes are buffered and burst across the bus improving bus utilization Writes are stored in an internal buffer (32 bytes), delayed,
reordered, and combined Speculative reads ARE allowed Reads are NOT cached Buffer is flushed on:
UC access, different address, return from an interrupt, serializing instruction, locked, or I/O instruction XCHG with memory (fastest locking instruction)
Microcomputer CourseKoby Gottlieb, 3/03, p-213
Lecture 10: Hyper-Threading
Microcomputer CourseKoby Gottlieb, 3/03, p-214
Intel's Hyper-Threading TechnologyOverview
Intel's Hyper-Threading Technology brings the concept of simultaneous multi-threading to the Intel Architecture.
Hyper-Threading Technology makes a single physical processor appear as two logical processors. the physical execution resources are shared and the
architecture state is duplicated for the two logical processors.
From a software or architecture perspective, this means operating systems and user programs can schedule processes or threads to logical processors as they would on multiple physical processors.
From a microarchitecture perspective, this means that instructions from both logical processors will persist and execute simultaneously on shared execution resources.
Microcomputer CourseKoby Gottlieb, 3/03, p-215
Thread Level - Amdahl’s Law
Maximum Efficiency
Fraction parallel limits scalability
Key: Parallelize everything significant
0.00
0.10
0.20
0.30
0.40
0.50
0.60
0.70
0.80
0.90
1.00
0 20 40 60 80 100
Percent Parallel
Ma
xim
um
Pa
rall
el
Eff
icie
nc
y
2 Threads
4 Threads
8 Threads
16 Threads
32 Threads
64 Threads
)1(
1
pNp
Microcomputer CourseKoby Gottlieb, 3/03, p-216
Multithreading + Multiple Processors= Improved Performance
Multithreading + Multiple Processors= Improved Performance
Run threads using multiple processors
CPU 1CPU 1
CPU 2CPU 2
CPU 3CPU 3
Brief Introduction to ThreadsBrief Introduction to Threads
Multiprocessing
Microcomputer CourseKoby Gottlieb, 3/03, p-217
Open DB’s Address Book Open DB’s Address Book
InBox CalendarInBox Calendar
Brief Introduction to ThreadsBrief Introduction to Threads
Functional Parallelism
Concurrent Tasks
Apply different operations to different data elements
Microcomputer CourseKoby Gottlieb, 3/03, p-218
Data Parallelism Apply the same operation to different data elements
Brief Introduction to ThreadsBrief Introduction to Threads
function SpellCheck {
loop (word = 1, words_in_file)
compare_to_dictionary (word);
}
function SpellCheck {
loop (word = 1, words_in_file)
compare_to_dictionary (word);
}
Open FileOpen File Edit Spell Check Edit Spell Check
Microcomputer CourseKoby Gottlieb, 3/03, p-219
Thread Libraries – Win32* API
C language interfacesThreads exist within a single processGood for asynchronous concurrencyAll threads are peers
No explicit parent-child model Exception: main() thread
Creating Win32* ThreadsHANDLE CreateThread(
LPSECURITY_ATTRIBUTES ThreadAttributes,DWORD StackSize,LPTHREAD_START_ROUTINE StartAddress,LPVOID Parameter,DWORD CreationFlags,LPDWORD ThreadId );
Brief Introduction to ThreadsBrief Introduction to Threads
*Other names and brands may be claimed as the property of others
Thread handle is a synchronization object
Functions are explicitly mapped to threads
Microcomputer CourseKoby Gottlieb, 3/03, p-220
Microcomputer CourseKoby Gottlieb, 3/03, p-221
Pentium 4 Block diagram
Microcomputer CourseKoby Gottlieb, 3/03, p-222
Execution pipeline
A high-level view of the microarchitecture pipeline. buffering queues separate major pipeline logic blocks. The buffering queues are either partitioned or duplicated to
ensure independent forward progress through each logic block.
Microcomputer CourseKoby Gottlieb, 3/03, p-223
Fetch and deliver
• Alternate between logical processors
• Execution trace cache/ Microcode ROM
• Fetch and decode instructions
• Register rename and allocation
Microcomputer CourseKoby Gottlieb, 3/03, p-224
Execution The out-of-order execution engine consists of
the allocation, register renaming, scheduling, and execution functions
Logical processors execute simultaneously Compete for schedulers, ALUs Schedulers map independent instructions to available
execution resources
Microcomputer CourseKoby Gottlieb, 3/03, p-225
The memory subsystem The memory subsystem includes:
The DTLB: translates addresses to physical addresses. It has 64 fully associative entries; each entry can map either a 4K or a 4MB page. Although the DTLB is a shared structure between the two
logical processors, each entry includes a logical processor ID tag.
Each logical processor also has a reservation register to ensure fairness and forward progress in processing DTLB misses.
the low-latency Level 1 (L1) data cache the Level 2 (L2) unified cache, and the Level 3 unified cache (the Level 3 cache is only available
on the Intel® XeonTM processor MP). Access to the memory subsystem is also largely oblivious to
logical processors. The schedulers send load or store uops without regard to logical
processors and the memory subsystem handles them as they come.
Microcomputer CourseKoby Gottlieb, 3/03, p-226
Instruction retirement
Alternate between logical processorsCommit state in program order
Microcomputer CourseKoby Gottlieb, 3/03, p-227
Hyper Threading implementation
Two logical processors for very small additional die area
Alternate between logical processorsFetch and deliverReorder and retire
Competitive sharing between logical processorsRapid execution engineCaches
Microcomputer CourseKoby Gottlieb, 3/03, p-228
Microcomputer CourseKoby Gottlieb, 3/03, p-229
OS support From the OS point of view HT is just like multi processing
Needs BIOS support for initialization
HLT instruction (for idle) The HLT instruction stops instruction execution and places the
processor in HALT stat. An enabled interrupt, NMI or Reset resume execution. The return instruction from HLT is the next instruction
There are two modes of operation referred to as single-task (ST) or multi-task (MT).
On a processor with Hyper-Threading Technology, executing HALT transitions the processor from MT-mode to ST0- or ST1-mode, depending on which logical processor executed the HALTSpin loops:
Use new pause instructionFor long wait use the OS call
Wait on object
Microcomputer CourseKoby Gottlieb, 3/03, p-230
Lecture 11: x86 and 3D graphics
Microcomputer CourseKoby Gottlieb, 3/03, p-231
Quick Intro to 3D Graphics
Glossary:Vertex – point in 3D spaceTriangle – 3 connected verticesObject – list of triangles that have the
same material properties (AKA Mesh)Texture – a 2D image that is wrapped
on the surface of a Mesh
Microcomputer CourseKoby Gottlieb, 3/03, p-232
For Each Triangle...
Geometry transform each vertex (FP)
Lighting compute lighting from light sources and surface
properties (FP)
Rasterization setup triangle for rasterization (FP) perform shading & texture mapping during
triangle fill (INT)
Microcomputer CourseKoby Gottlieb, 3/03, p-233
Rendering Processlight
source
objects
viewing planeX
Y
Z
incident light
perspective projection
Mathematical modelsfor light, objects & viewercreate a 2D image viaa 3D process
Microcomputer CourseKoby Gottlieb, 3/03, p-234
Coordinate Systems
xw
yw
zw
world xo
yo
zo
object
xv
yv
zv viewer
xl
yl
zl
light
Microcomputer CourseKoby Gottlieb, 3/03, p-235
Transformation
Transformation change the coordinate system a point lies in
ExamplesViewing Transformation - translate
objects to viewer coordinate before projection to viewing plane
Shadows - translate objects to light source coordinates to calculate shadows
Microcomputer CourseKoby Gottlieb, 3/03, p-236
NL R
S
Lighting Process
Surface
Diffuse Light scatters in all directionsSpecular Light reflects in direction of reflection vector R
V
Microcomputer CourseKoby Gottlieb, 3/03, p-237
Rasterization
Now that we have transformed and lit polygons… actually, the vertices...
And, we know where they appear on the screen…
We have to fill their interiors!
Microcomputer CourseKoby Gottlieb, 3/03, p-238
Textures
Image maps to apply surfaces.Gives impression of complex surface properties.
Can substitute for lots of polys (eg, tree bitmap on a single rectangle, vs. thousands of leaf polys)
Microcomputer CourseKoby Gottlieb, 3/03, p-239
Rasterization (again)
Microcomputer CourseKoby Gottlieb, 3/03, p-240
Flat Fill
Microcomputer CourseKoby Gottlieb, 3/03, p-241
Some lighting
Microcomputer CourseKoby Gottlieb, 3/03, p-242
And textures
Microcomputer CourseKoby Gottlieb, 3/03, p-243
Demo
SkinnedMesh.exe
Microcomputer CourseKoby Gottlieb, 3/03, p-244
History of 3D Pipeline Partitioning
Geometry
Lighting
Edge Setup
Rasterization
Application
Software 3DProcessor Does All
First Generation3D HW Accelerators
Rasterization
HW
Geometry
Lighting
Edge Setup
Application
2nd Generation HW (1998-99)
HW
Geometry
Lighting
Edge Setup
Rasterization
Application
3rd Generation HW (1999-2000)
HW
Geometry
Lighting
Edge Setup
Rasterization
Application
Microcomputer CourseKoby Gottlieb, 3/03, p-245
Memory BW: The problem
Frame buffer - ~3MB Z buffer - ~3MB For each object:
Mesh (vertices and triangle connectivity) – 1KB – 1MB Texture(s) - ~256KB
Typical game frame – 8MB – 20MB
Microcomputer CourseKoby Gottlieb, 3/03, p-246
AGP: the Solution AGP (Accelerated Graphics Port)
Larger BW than PCI Lower HW cost (less local RAM needed)
Frame/Z buffers stored in graphics local memory Texturing from system memory
Microcomputer CourseKoby Gottlieb, 3/03, p-247
Memory traffic when using AGP
Microcomputer CourseKoby Gottlieb, 3/03, p-248
How does it work
The AGP aperture is mapped as one chunk (1:1 mapping in CPU’s paging HW), both the gfx chip and the CPU reference the same addresses
The GART is mapping AGP memory address to the system memory address (like paging HW in the CPU)
Microcomputer CourseKoby Gottlieb, 3/03, p-249
Current generation
Multi-texturing: More than one texture for surface, used for details maps, reflections, refractions, lighting tricks, etc.
Microcomputer CourseKoby Gottlieb, 3/03, p-250
Programmable HW
Helps the developer in customizing its transform, lighting and texture operations
Microcomputer CourseKoby Gottlieb, 3/03, p-251
Programmable vertex machine
Microcomputer CourseKoby Gottlieb, 3/03, p-252
Programmable HW demos
Microcomputer CourseKoby Gottlieb, 3/03, p-253
Vertex Shader Processing
Setup
map vertex data to VVM input registers
execute vertex shader
collect data from VVM output registers, format and write to output
clip and render
Vertex shader declaration
Vertex shader code
Microcomputer CourseKoby Gottlieb, 3/03, p-254
Vertex Virtual Machine (DX8)
Constant Memory
96 entries
128 bits4 floats
16 entries
13 entries
12 entries
Vertex Input
Vertex Output
RegistersVertex
ShaderA0.X
128 instructions
addr
data
128 bits4 floats
128 bits4 floats
128 bits4 floats
Microcomputer CourseKoby Gottlieb, 3/03, p-255
Vertex Virtual Machine (DX9)
19 entries
12 entries
Vertex Output
RegistersVertex
ShaderA0.XYZW
256 instructions
Constant Memory
256 entries
128 bits4 floats
addr
data
128 bits4 floats
16 entries
Vertex Input128 bits4 floats
128 bits4 floats
CLC
16 entries
Loop control128 bits
4 ints
16 entries
Branch control16 BOOLs
Changes from DX8 in red
Microcomputer CourseKoby Gottlieb, 3/03, p-256
Vs.2.0 (DX9) shader language
DX8 Mov Add Mul Mad Dp3 Dp4 Slt Sge Min Max Rcp Rsq Expp Logp Dst* Lit*
DX9 additions Loop, Endloop Rep, Endrep Jnz Call Callnz Ret Exp Log M3x3, M4x3, M3x4, M3x2,M2x3** Sgn** Abs** Frc** Lrp** Pow** Nrm** Sincos**
*Removed from DX9
**Macro instructions
Microcomputer CourseKoby Gottlieb, 3/03, p-257
Vertex Shader Example
Vs.1.1
Dcl_position v0
Dcl_color0 v1
Dcl_texcoord0 v2
Dp4 oPos.x, v0, c0
Dp4 oPos.y, v0, c1
Dp4 oPos.z, v0, c2
Dp4 oPos.w, v0, c3
Mov oD0.rgb, v1
Mov oT0.xy, v2
Microcomputer CourseKoby Gottlieb, 3/03, p-258
SW Vertex shader
Microcomputer CourseKoby Gottlieb, 3/03, p-259
backup
Microcomputer CourseKoby Gottlieb, 3/03, p-260
Shading Techniques
Scan Line
A
B
C
L R
P
Polygon fill is done across the
scanline
Flat shading Color the whole polygon with one color Does not show highlights inside polygons
Gouraud shading If the object surfaces are curved, we can
approximate it by polygons Interpolating vertex intensity values along
scanline Still does not show highlights inside polygons
Microcomputer CourseKoby Gottlieb, 3/03, p-261
Texture Mapping]
1. Texture coordinates (s,t,q) for each vertex. These are interpolated along polygon edges.
2. Texture coordinate for each point on scanline is interpolated. Linear interpolation, or a quadratic approximation (for perspective correction) is used.
3. The (s,t) value maps into the source texture map. It usually does not fall on the center of a texel.
4. A texture mapping algorithm is applied to the nearest texel, and possibly surrounding texels, to determine the result.
Texturing eats CPU MIPS & bandwidth. 20-70% slower than flat shading.
(s0, t0 , q0)
(s1, t1, q1)
(s2, t2 , q2)
(si, ti)
1
2
3
4
s = 0, t = 0 s = 1, t = 0
s = 1, t = 1
Microcomputer CourseKoby Gottlieb, 3/03, p-262
Credits
This foil set was developed by Ronny Ronen