Optimizing Sharing Patterns and Locality via Thread Migration


Page 1: Optimizing Sharing Patterns and Locality via Thread Migration

Optimizing Sharing Patterns and Locality via Thread Migration

Vadim Gleizer
Supervisor: Prof. Assaf Schuster

Page 2: Optimizing Sharing Patterns and Locality via Thread Migration

2

Contributions of this research:
- Internal Distributed Shared Memory (DSM) Mechanisms
- Thread Migration (TM) in DSM Systems
- Load Balancing in DSM Systems

Page 3: Optimizing Sharing Patterns and Locality via Thread Migration

3

Internal DSM Mechanisms

An internal DSM mechanism, or DSM handler, is responsible for guaranteeing a consistent memory view on each workstation, as follows:
- When a DSM region becomes invalid, it is protected
- Each access to the protected area will cause an exception
- The internal DSM mechanism catches and handles these exceptions

Page 4: Optimizing Sharing Patterns and Locality via Thread Migration

4

Implementation of DSM Handlers

An exception handling service provided by the operating system significantly simplifies this task.
Consider the Win32 Structured Exception Handling (SEH) service of Windows NT:
- A block of code that is allowed to use DSM is wrapped in an exception block using the Win32 __try/__except keywords, similarly to try/catch blocks in C++:

__try {
    user_main();
} __except (DSM_handler(GetExceptionInformation())) {
    /* never reached when the filter resolves the fault and returns
       EXCEPTION_CONTINUE_EXECUTION */
}
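For concreteness, here is a minimal sketch of what such a filter might look like; dsm_is_dsm_address() and dsm_fetch_page() are hypothetical stand-ins for the DSM protocol, not Millipede's actual API:

#include <windows.h>

extern int  dsm_is_dsm_address(void *addr);   /* hypothetical DSM helpers */
extern void dsm_fetch_page(void *addr);

/* SEH filter: resolve faults on DSM pages, pass everything else on */
int DSM_handler(EXCEPTION_POINTERS *ep)
{
    EXCEPTION_RECORD *rec = ep->ExceptionRecord;
    void *addr = (void *)rec->ExceptionInformation[1];   /* faulting address */

    if (rec->ExceptionCode != EXCEPTION_ACCESS_VIOLATION || !dsm_is_dsm_address(addr))
        return EXCEPTION_CONTINUE_SEARCH;      /* not ours: keep searching handlers */

    dsm_fetch_page(addr);                      /* bring the page and fix its protection */
    return EXCEPTION_CONTINUE_EXECUTION;       /* retry the faulting instruction */
}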

Let us see how such services work and the drawbacks of using them

Page 5: Optimizing Sharing Patterns and Locality via Thread Migration

5

Inside the SEH Service

For each type of exception the CPU generates a code, e.g., division by zero has code 0, a page fault has code E, and a GPF (General Protection Fault) exception has code D.
In the case of a page fault exception, _KiTrap0E is called.

Page 6: Optimizing Sharing Patterns and Locality via Thread Migration

6

Inside the SEH Service (cont.)

The following sequence of calls occurs before control is passed to the DSM_handler:
_KiTrap0E → KiUserExceptionDispatcher → RtlDispatchException → RtlpExecuteHandlerForException → ExecuteHandler → __except_handler3 → DSM_handler

Page 7: Optimizing Sharing Patterns and Locality via Thread Migration

7

Drawbacks of using SEH in DSM Systems

1. Performance
- The SEH service is highly time-consuming, while most of its functionality is unnecessary for the DSM handler
- The user's exception handlers are called before the DSM handler

2. The programmer may accidentally intercept a DSM exception
- The internal DSM handler should work transparently to the programmer
- Thus, if the programmer does not know that the DSM handler uses SEH, he/she may accidentally intercept a DSM exception

Page 8: Optimizing Sharing Patterns and Locality via Thread Migration

8

User-Mode First-Chance Exception Handling (UMFC-EH)

- Only the kernel-level part of SEH is used, i.e., the DSM_handler is called directly by _KiTrap0E
- Thus, exceptions are intercepted before any of the SEH user-mode functions is called:
  • _KiTrap0E → DSM_handler
  • instead of _KiTrap0E → KiUserExceptionDispatcher → RtlDispatchException → RtlpExecuteHandlerForException → ExecuteHandler → __except_handler3 → DSM_handler
- To implement this scheme the Detours library may be used (a rough illustration follows)
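The thesis implements this scheme by hooking the user-mode entry of SEH with the Detours library. Purely as a rough illustration of the same idea – handling the fault at first chance, before any __except filter runs – here is a sketch that uses vectored exception handling, a documented facility of later Windows versions (not the mechanism used in this work); the dsm_* helpers are the same hypothetical ones as before:

#include <windows.h>

extern int  dsm_is_dsm_address(void *addr);   /* hypothetical DSM helpers */
extern void dsm_fetch_page(void *addr);

static LONG CALLBACK dsm_first_chance(EXCEPTION_POINTERS *ep)
{
    EXCEPTION_RECORD *rec = ep->ExceptionRecord;
    void *addr = (void *)rec->ExceptionInformation[1];

    if (rec->ExceptionCode == EXCEPTION_ACCESS_VIOLATION && dsm_is_dsm_address(addr)) {
        dsm_fetch_page(addr);                  /* resolve the DSM fault */
        return EXCEPTION_CONTINUE_EXECUTION;   /* retry; no __except filter is involved */
    }
    return EXCEPTION_CONTINUE_SEARCH;          /* unrelated exception: normal SEH path */
}

void install_first_chance_handler(void)
{
    /* first argument 1: call this handler before other handlers */
    AddVectoredExceptionHandler(1, dsm_first_chance);
}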

Page 9: Optimizing Sharing Patterns and Locality via Thread Migration

9

UMFC-EH (cont.)

Advantages:
- Solves both drawbacks of the SEH service
- No __try/__except blocks are needed

Drawbacks:
- The kernel-level part of SEH is still used
- All exceptions are intercepted, e.g., division by zero

Page 10: Optimizing Sharing Patterns and Locality via Thread Migration

10

Kernel-Mode First-Chance Exception Handling (KMFC-EH)

- Exceptions are intercepted in kernel mode by a special supervisor-level device driver, which we call DSM_filter
- The DSM_filter informs the DSM_handler about DSM exceptions
- Thus, the SEH service is not used

Page 11: Optimizing Sharing Patterns and Locality via Thread Migration

11

KMFC-EH (cont.)

Advantages:
- preserves all the advantages of the UMFC-EH scheme
- SEH is not used, i.e., the CPU directly informs the DSM_filter about page fault exceptions
- only page fault exceptions are intercepted

Drawbacks:
- all page fault exceptions are intercepted by the DSM_filter, including those of other processes
  • fortunately the overhead of this drawback is low

Page 12: Optimizing Sharing Patterns and Locality via Thread Migration

12

Performance Evaluation

Our experimental environment consists of the Millipede 4.0 DSM system:
- a cluster of eight uniprocessor workstations interconnected by a switched Myrinet LAN
- each workstation equipped with:
  • Pentium-II 300MHz
  • 128MB of RAM
  • 512KB of L2 cache
  • Windows NT 4.0 SP6 operating system

We have tested our DSM handlers on several benchmarks and microbenchmarks commonly used for DSM.

Page 13: Optimizing Sharing Patterns and Locality via Thread Migration

13

Performance Evaluation (cont.)

Microbenchmarks (100,000 page faults):
  SEH        14 sec   100%
  UMFC-EH     8 sec    57%
  KMFC-EH     3 sec    21%

Related results (Brazos):
  SEH            20 sec   200 MHz Pentium Pro with 192 MB of RAM
  Segv handler   47 sec   Solaris 2.5.1 running on the same hardware

Page 14: Optimizing Sharing Patterns and Locality via Thread Migration

14

Performance Evaluation (cont.)

Page 15: Optimizing Sharing Patterns and Locality via Thread Migration

15

Performance Evaluation (cont.)

Page 16: Optimizing Sharing Patterns and Locality via Thread Migration

16

Thread Migration (TM) in DSM Systems

Introduction: a thread can be stopped at almost any moment of its execution and launched on another machine from the same point where it was stopped.

Applications of this facility:
- load balancing
- communication reduction
- fault tolerance
- cluster management
- a powerful programming primitive

Page 17: Optimizing Sharing Patterns and Locality via Thread Migration

17

Designing a TM Mechanism

Restrictions on TM – there are some situations in which migration makes no sense:
- the thread owns some local operating system resource, e.g., a synchronization object
- the thread executes a locally dependent operation, e.g., prints a message

Therefore the programmer should be aware of thread migration and explicitly mark situations in which a thread cannot migrate (see the sketch below).
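For illustration, such marking could be exposed to the programmer as a pair of bracketing calls around non-migratable sections; dsm_disable_migration()/dsm_enable_migration() are hypothetical names, not Millipede's actual API:

#include <stdio.h>

/* Hypothetical primitives, e.g., a per-thread pin counter that the TM
   mechanism consults before migrating the thread */
extern void dsm_disable_migration(void);
extern void dsm_enable_migration(void);

void report_progress(int step)
{
    dsm_disable_migration();     /* printing is host-dependent: pin the thread here */
    printf("step %d finished on this host\n", step);
    dsm_enable_migration();      /* the thread may migrate again */
}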

Page 18: Optimizing Sharing Patterns and Locality via Thread Migration

18

Designing a TM Mechanism (cont.)

The state of a thread consists of:
- code
- global data
- heap data
- stack data
- the processor's register set
- other thread-specific data

Page 19: Optimizing Sharing Patterns and Locality via Thread Migration

19

Designing a TM Mechanism (cont.)

[Diagram: thread A's stack is copied from Host 1 to Host 2, where it occupies a different address range; a pointer stored on the stack still refers to the old range, motivating stack address translation.]

Page 20: Optimizing Sharing Patterns and Locality via Thread Migration

20

Designing a TM Mechanism (cont.)

Stack address translation

Drawbacks:
• register values and stack values have to be investigated and possibly updated (very inefficient for large stacks)
• identification of pointers (a correctness issue: a value may merely resemble a pointer); possible solutions:
  • special compiler or hardware support – more complex compiled code, often prevents compiler optimizations
  • special programming primitives that register all pointers – harm efficiency and simplicity of programming, and limit free usage of pointers
• the whole stack has to be copied at migration time

Page 21: Optimizing Sharing Patterns and Locality via Thread Migration

21

Designing a TM Mechanism (cont.)

Creating all mobile threads at DSM initialization time

Advantages:
• no pointer investigation and modification

Drawbacks:
• lack of scalability – the maximum number of threads is created on each host
• lack of portability – may not work in future versions of the same operating system and cannot be used for heterogeneous systems
• the whole stack has to be copied at migration

Page 22: Optimizing Sharing Patterns and Locality via Thread Migration

22

Designing a TM Mechanism (cont.)

Placement of stacks in a predefined memory region

Advantages:
• no pointer investigation and modification
• scalability – threads are created on application demand or at migration time
• portability

Drawbacks:
• the whole stack has to be copied at migration

Page 23: Optimizing Sharing Patterns and Locality via Thread Migration

23

Designing a TM Mechanism (cont.)

Placement of stacks in a DSM region

Advantages:
• preserves all the advantages of the previous approach
• the stack does not have to be copied at migration

Page 24: Optimizing Sharing Patterns and Locality via Thread Migration

24

Implementation of TM

Placement of stacks in a predefined memory region (the default stack approach):
- the same address region is reserved on each host at DSM initialization time
- at creation, each thread receives a slot for its stack according to its id
- UNIX-like operating systems provide, in their thread creation API, an option to control the stack location (see the sketch below)
- this approach is difficult to implement in Windows NT, since there is no conventional way to control the stack location
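A minimal sketch of the UNIX-like case mentioned above, using POSIX threads (slot_base and slot_size are assumed to come from the region reserved by the DSM system):

#include <stddef.h>
#include <pthread.h>

pthread_t create_thread_in_slot(void *slot_base, size_t slot_size,
                                void *(*entry)(void *), void *arg)
{
    pthread_t tid;
    pthread_attr_t attr;

    pthread_attr_init(&attr);
    /* place the new thread's stack in the slot reserved by the DSM system */
    pthread_attr_setstack(&attr, slot_base, slot_size);
    pthread_create(&tid, &attr, entry, arg);
    pthread_attr_destroy(&attr);
    return tid;
}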

Page 25: Optimizing Sharing Patterns and Locality via Thread Migration

25

Implementation of TM (cont.)

Stack location control in Windows NT (see the sketch below):
- the application asks the DSM system to create a thread
- the thread is created in the suspended state (the initial stack is empty)
- the address of the initial stack is obtained through the thread's ESP register, and that stack is freed
- the value of the ESP register is changed to the new stack location
- a pointer to the Win32 data structure – the Thread Information Block (TIB) – is obtained through the FS register
- two fields inside the TIB are modified accordingly: pvStackUserTop and pvStackUserBase
- the thread is resumed
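A hedged sketch of these steps for 32-bit x86 Windows (error handling and the release of the original stack are omitted; the reserved slot addresses are assumptions of this sketch):

#include <windows.h>

DWORD WINAPI user_thread(LPVOID arg);            /* the application's thread body, defined elsewhere */

HANDLE create_thread_on_reserved_stack(void *slot_bottom, void *slot_top, LPVOID arg)
{
    CONTEXT ctx;
    LDT_ENTRY desc;
    NT_TIB *tib;
    HANDLE h = CreateThread(NULL, 0, user_thread, arg, CREATE_SUSPENDED, NULL);

    ctx.ContextFlags = CONTEXT_FULL;             /* includes ESP and the segment registers */
    GetThreadContext(h, &ctx);                   /* ctx.Esp points into the default stack */
    /* ...the default stack located via ctx.Esp would be freed here... */

    /* locate the thread's TIB through its FS selector */
    GetThreadSelectorEntry(h, ctx.SegFs, &desc);
    tib = (NT_TIB *)(DWORD_PTR)((DWORD)desc.BaseLow |
                                ((DWORD)desc.HighWord.Bytes.BaseMid << 16) |
                                ((DWORD)desc.HighWord.Bytes.BaseHi  << 24));

    ctx.Esp = (DWORD)(DWORD_PTR)slot_top;        /* switch ESP to the reserved slot */
    SetThreadContext(h, &ctx);
    tib->StackBase  = slot_top;                  /* pvStackUserTop */
    tib->StackLimit = slot_bottom;               /* pvStackUserBase */

    ResumeThread(h);
    return h;
}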

Page 26: Optimizing Sharing Patterns and Locality via Thread Migration

26

Implementation of TM (cont.)

Placement of stacks in a DSM region:
- a separate region is added to the DSM
- the stack location of a thread is changed to be a slot inside the new DSM region, similarly to the previous approach
- however, the stack cannot be handled as a regular DSM region

Page 27: Optimizing Sharing Patterns and Locality via Thread Migration

27

Implementation of TM (cont.)

Why can't a thread's stack be handled as a regular DSM region? Consider an example:
- thread A migrates from host 1 to host 2
- the stack of thread A remains on host 1, since it is placed in DSM; therefore the first access to the stack will cause a page fault exception
- DSM_handler should be called in order to bring the missing part of the stack
- however, the stack itself is protected, so DSM_handler cannot be called in the regular way ...

[Diagram: thread A migrates from host 1 to host 2 while its stack pages remain on host 1.]

Page 28: Optimizing Sharing Patterns and Locality via Thread Migration

28

Implementation of TM (cont.)

The auxiliary stack approach:
- this approach is based on the KMFC-EH technique
- a memory region, called the auxiliary stacks region, is allocated on each host at DSM initialization time
- page fault exceptions are intercepted by the DSM_filter (driver) at kernel level
- when an exception occurs on a stack, the DSM_filter changes the stack location of the thread to be a slot inside the auxiliary stacks region and calls DSM_handler
- DSM_handler brings the page of the original stack, sets the appropriate protection, switches the stack back, and transfers control to the thread

Page 29: Optimizing Sharing Patterns and Locality via Thread Migration

29

TM in the Millipede 4.0 DSM System

In sum, our TM mechanism has the following powerful features:
- two TM approaches
- kernel-level threads are migrated
- SEH support
- the FastMessages service is used to transfer migrating threads efficiently
- thread suspension and resumption are location-independent and may be recursive
- safety support for all API functions provided by Millipede 4.0
- a statistics tool

Page 30: Optimizing Sharing Patterns and Locality via Thread Migration

30

Performance Evaluation

[Chart: cost of communication in Myrinet – one-way message latency in µsecs as a function of message size (0–4608 bytes); the measured points are 40.1, 44, 69, 92, 112, and 126.5 µsecs, from the smallest to the largest message.]

Page 31: Optimizing Sharing Patterns and Locality via Thread Migration

31

Performance Evaluation (cont.)

The cost of the Win32 calls used in TM (averaged over 1,000,000 instances of each call):
  GetThreadContext   8.86 µsec
  SetThreadContext   9.57 µsec
  SuspendThread      5.42 µsec
  ResumeThread       6.35 µsec

Performance of TM in Millipede 4.0 (averaged over 1,000,000 TMs with a stack size of 176B):
  Win32 calls       30.2 µsec
  Network/copy     149.9 µsec
  Total TM time    180.1 µsec

Page 32: Optimizing Sharing Patterns and Locality via Thread Migration

32

Performance Evaluation (cont.)

Migration time on various systems as a function of stack size (µsec):

  Ariadne         1100, 1400                     SPARC, Ethernet
  PM2             210                            450MHz Pentium-II
  Active Threads  630, 1100                      4*50MHz HyperSPARC
  CVM             1597 (1704B stack)             66.7MHz Power2, 128MB of RAM, 64KB of cache, 40 MB switch
  Brazos          1010                           4*200MHz Pentium Pro, 256MB of RAM, 256KB of cache, Gigabit Ethernet
  Millipede       70000                          100-MBs Ethernet
  Millipede 4.0   202 (1K), 219 (2K), 256 (4K)   Pentium-II 300MHz, 128MB of RAM, 512KB of cache

Page 33: Optimizing Sharing Patterns and Locality via Thread Migration

33

Load Balancing (LB) in DSM Systems

Introduction – definition of load in DSM systems:
- the CPU time that a computational thread consumes
- the amount of communication that the thread causes during its work

Dynamic load sharing computes a less precise thread location scheme, but due to the relaxed requirements it can often be as efficient as dynamic load balancing.

Page 34: Optimizing Sharing Patterns and Locality via Thread Migration

34

Introduction (cont.)

[Diagram: three snapshots of threads numbered 1–15 distributed among the hosts of the cluster.]

Page 35: Optimizing Sharing Patterns and Locality via Thread Migration

35

Designing an LS Mechanism

The goals of load sharing:
- A uniform distribution of threads among the stations
- Minimization of communication overheads:
  • improving the locality of accesses
  • avoiding page ping-pong situations, in which a page is transferred frequently among several hosts

Page 36: Optimizing Sharing Patterns and Locality via Thread Migration

36

Designing an LS Mechanism (cont.)

We propose a load sharing mechanism that works as a separate module, called the Load Sharing Module (LS-Module).

The LS-Module performs the following tasks:
- load imbalance detection
- load imbalance treatment
- ping-pong detection
- ping-pong treatment

Page 37: Optimizing Sharing Patterns and Locality via Thread Migration

37

Designing an LS Mechanism (cont.)

The Load Imbalance Detection protocol has a centralized entity called the Load Sharing Server (LS-Server), which:
- knows the power parameter of each host
- is notified by an external module on each change in the load
- for each change in the load, calculates two threshold values l and h for the host, thereby determining whether the host is normally loaded (an illustration follows)
- begins the load imbalance treatment protocol when a load imbalance is detected
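The slide does not spell out the threshold formulas; purely as an illustration, thresholds proportional to each host's share of the total power might look as follows (the fair-share formula and the slack factor are assumptions of this sketch):

typedef struct {
    double load;     /* current load reported for the host */
    double power;    /* relative computing power of the host */
} host_info_t;

/* illustrative thresholds: a band around the host's fair share of the load */
void compute_thresholds(const host_info_t *host, double total_load, double total_power,
                        double slack, double *l, double *h)
{
    double fair_share = total_load * (host->power / total_power);
    *l = fair_share * (1.0 - slack);     /* below l: under-loaded */
    *h = fair_share * (1.0 + slack);     /* above h: over-loaded  */
}

int normally_loaded(const host_info_t *host, double l, double h)
{
    return host->load >= l && host->load <= h;
}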

Page 38: Optimizing Sharing Patterns and Locality via Thread Migration

38

Designing an LS Mechanism (cont.)

The Load Imbalance Treatment protocol is performed by the LS-Server, which decides how many threads, say n, should be migrated from an overloaded host, say H1, to balance its load.
An entity called the Load Sharing Client (LS-Client), which runs on each host, is responsible for selecting the n threads whose migration will best minimize future communication.

Page 39: Optimizing Sharing Patterns and Locality via Thread Migration

39

Designing an LS Mechanism (cont.)

The Ping-Pong Detection protocol is performed by the Ping-Pong Client (PP-Client) entity.
Each time there is an access to a remote page, the PP-Client (one per host) is invoked.
A ping-pong situation exists when the following two conditions are met:
1. local threads attempt to access a page a short time after it leaves the host
2. a page leaves the host a short time after it has arrived

Page 40: Optimizing Sharing Patterns and Locality via Thread Migration

40

Designing an LS Mechanism (cont.)

The Ping-Pong Treatment protocol is performed by a centralized Ping-Pong Server (PP-Server) entity.
The PP-Server determines which group of threads participates in a ping-pong, then chooses a destination host and migrates the threads to this host.
If too many threads participate in a ping-pong, or a ping-pong is detected a short time after it has been resolved, the PP-Server decides to treat the ping-pong using delays.

Page 41: Optimizing Sharing Patterns and Locality via Thread Migration

41

LS in the Millipede 4.0 DSM System

We have implemented the load sharing mechanism in the Millipede 4.0 DSM system.

Millipede 4.0 architecture:
- the Thread-Server module
- the TM module
- the LS module:
  • one centralized LS-Server
  • LS-Clients (one per host)
  • PP-Clients (one per host)

Page 42: Optimizing Sharing Patterns and Locality via Thread Migration

42

LS in the Millipede 4.0 DSM System (cont.)

Access History
- In order to select the threads for migration, we keep an access history for each thread
- The access history contains at most one entry for each page that was referenced by the local threads in the last T_epoch time units
- Obviously, the access history should be updated as time passes
- The access history also keeps an old history, or prehistory:
  • it summarizes the old access history of a thread

Page 43: Optimizing Sharing Patterns and Locality via Thread Migration

43

LS in the Millipede 4.0 DSM System (cont.)

Access History Structure

[Diagram: each thread (e.g., Thread 0, ..., Thread 7) has a list of entries, one per referenced page (e.g., Page 0x0DCC, Page 0xACDC), holding recent access times (e.g., 0:12:00, 0:12:01, 0:12:13), plus a prehistory entry summarizing older accesses.]
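A rough sketch of such a per-thread record in C; the field names and the bound on tracked pages are hypothetical, following the structure in the diagram:

#include <time.h>

#define MAX_TRACKED_PAGES 64                    /* arbitrary bound for the sketch */

typedef struct {
    void  *page;                                /* page referenced during the last T_epoch */
    time_t last_access;                         /* time of the most recent reference */
} page_entry_t;

typedef struct {
    page_entry_t pages[MAX_TRACKED_PAGES];      /* at most one entry per page */
    int          n_pages;
    double       prehistory;                    /* summary of accesses older than T_epoch */
} access_history_t;                             /* one such record is kept per thread */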

Page 44: Optimizing Sharing Patterns and Locality via Thread Migration

44

LS in the Millipede 4.0 DSM System (cont.)

Thread Selection Algorithm

A heuristic value h(j) is calculated for each thread j on the local host L. It takes into account the following characteristics:
• maximal frequency of remote references to pages on R
• minimal access frequency of the threads remaining on L to the pages used by the selected threads
• minimal access frequency to local pages
• maximal frequency of any remote references

Until enough threads are selected, the following procedure is performed (see the sketch below):
• the thread j having the maximal value h(j) is chosen
• the heuristic value of each thread i that has not yet been selected is revised, taking the migration of j into account
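A hedged sketch of this greedy loop; thread_info_t, heuristic_value(), and revise_after_migration() are hypothetical stand-ins for the per-thread bookkeeping, the h(j) computation, and its revision:

typedef struct thread_info thread_info_t;

extern double heuristic_value(const thread_info_t *t);                        /* h(j) */
extern void   revise_after_migration(thread_info_t *t, const thread_info_t *migrated);
extern int    is_selected(const thread_info_t *t);
extern void   mark_selected(thread_info_t *t);

/* pick n_to_migrate threads, always taking the one with the largest h(j) */
void select_threads(thread_info_t **threads, int n_threads, int n_to_migrate)
{
    for (int picked = 0; picked < n_to_migrate; picked++) {
        int best = -1;
        double best_h = 0.0;
        for (int j = 0; j < n_threads; j++) {
            if (is_selected(threads[j]))
                continue;
            double h = heuristic_value(threads[j]);
            if (best < 0 || h > best_h) { best_h = h; best = j; }
        }
        if (best < 0)
            break;                               /* fewer candidates than requested */
        mark_selected(threads[best]);
        /* account for the chosen thread's departure when scoring the rest */
        for (int j = 0; j < n_threads; j++)
            if (!is_selected(threads[j]))
                revise_after_migration(threads[j], threads[best]);
    }
}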

Page 45: Optimizing Sharing Patterns and Locality via Thread Migration

45

LS in the Millipede 4.0 DSM System (cont.)

Ping-Pong Detection

[Timeline for page P on a host: the page is sent to H_i; a local thread then accesses it, requesting it from H_j (T_unused elapses between the send and this access); the page is received from H_j after waiting T_waiting; it is used for T_useful and then sent on to H_k.]

The page ping-pong condition is:

    PPRatio = (T_unused + T_useful) / T_waiting < S

where S is called the sensitivity of the ping-pong.

Page 46: Optimizing Sharing Patterns and Locality via Thread Migration

46

LS in the Millipede 4.0 DSM System (cont.)

Dynamic calculation of the sensitivity S_P for page P

The value of S_P depends on the number of threads that are using the page and on their behavior:

    S_P = c · Nth_pp / f(Nth)

where:
- c is a constant
- Nth_pp is the number of threads involved in the ping-pong residing on the local host
- Nth is the total number of threads residing on the local host
- f(Nth) is a function of that number
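Putting the two formulas together, a minimal sketch of the per-page test the PP-Client might run (all names are hypothetical; f() stands for the unspecified f(Nth)):

extern double f(int n_threads);                 /* the unspecified f(Nth) */

/* S_P = c * Nth_pp / f(Nth), as above */
double page_sensitivity(double c, int n_pp_threads, int n_threads)
{
    return c * n_pp_threads / f(n_threads);
}

/* PPRatio = (T_unused + T_useful) / T_waiting; a ping-pong exists if PPRatio < S_P */
int is_ping_pong(double t_unused, double t_useful, double t_waiting, double sensitivity)
{
    double pp_ratio = (t_unused + t_useful) / t_waiting;
    return pp_ratio < sensitivity;
}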

Page 47: Optimizing Sharing Patterns and Locality via Thread Migration

47

Performance Evaluation

We have tested the LS module on several benchmarks that are common in DSM systems, as well as on synthetic microbenchmarks specially designed for this purpose.
We refer to the version of Millipede 4.0 with the LS module as the LS version, and to the version without it as the no-LS version.

Page 48: Optimizing Sharing Patterns and Locality via Thread Migration

48

Performance Evaluation (cont.)

The microbenchmark applications were designed to simulate various load imbalance situations.
Using these applications, we have measured the individual performance of each part of the load sharing protocol:
- load imbalance treatment
- ping-pong treatment:
  • locality optimization part
  • stabilization part

Page 49: Optimizing Sharing Patterns and Locality via Thread Migration

49

Performance Evaluation (cont.)

Locality optimization protocol

Page 50: Optimizing Sharing Patterns and Locality via Thread Migration

50

Page 51: Optimizing Sharing Patterns and Locality via Thread Migration

51

Performance Evaluation (cont.)

Stabilization protocol

Page 52: Optimizing Sharing Patterns and Locality via Thread Migration

52

Performance Evaluation (cont.) Stabilization protocol

Page 53: Optimizing Sharing Patterns and Locality via Thread Migration

53

Performance Evaluation (cont.) Stabilization protocol

Page 54: Optimizing Sharing Patterns and Locality via Thread Migration

54

Performance Evaluation (cont.) Stabilization protocol

Page 55: Optimizing Sharing Patterns and Locality via Thread Migration

55

Performance Evaluation (cont.) Stabilization protocol

Page 56: Optimizing Sharing Patterns and Locality via Thread Migration

56

Performance Evaluation (cont.)

Load imbalance protocol

Page 57: Optimizing Sharing Patterns and Locality via Thread Migration

57

Performance Evaluation (cont.)

Overhead for benchmark applications

Page 58: Optimizing Sharing Patterns and Locality via Thread Migration

58

Conclusions

In this thesis we have researched and contributed to three different aspects of DSM systems.

1. Internal DSM handling mechanisms
- We researched how these mechanisms work
- We studied in depth the functionality of the SEH service
- We detailed two major drawbacks of using SEH in DSM systems
- We presented two new techniques for internal DSM mechanisms that make these mechanisms more efficient and reliable
- We analyzed the performance implications of using these techniques in the Millipede 4.0 DSM system

Page 59: Optimizing Sharing Patterns and Locality via Thread Migration

59

Conclusions (cont.)

2. Thread migration facilities in DSM systems
- We observed how the correct application of this facility can significantly increase the efficiency and reliability of the underlying DSM system
- We presented our design of a TM facility
- We discussed some correctness problems that a developer of a TM facility has to consider
- We investigated different approaches for implementing this facility in DSM systems
- We developed two new approaches: the stack-on-DSM approach and the default stack approach
- We presented two new techniques to overcome several technical difficulties in implementing these approaches
- We implemented our TM facility in the Millipede 4.0 DSM system, analyzed its performance, and compared it with other systems

Page 60: Optimizing Sharing Patterns and Locality via Thread Migration

60

Conclusions (cont.)

3. Load balancing in DSM systems
- We described several problems that occur as a result of a poor distribution of application threads
- We surveyed well-known strategies for load balancing in distributed systems
- We presented a design of a load sharing mechanism for DSM systems
- This mechanism efficiently obtains thread communication patterns
- It tries to avoid load imbalance by optimizing sharing patterns and locality via thread migration
- We implemented the load sharing mechanism as a separate module in the Millipede 4.0 DSM system
- Finally, we analyzed the performance implications of this mechanism on several benchmark and microbenchmark applications