Comparative Performance Evaluation of Hot Spot Contention Between MIN-Based and Ring-Based Shared-Memory Architectures

Xiaodong Zhang, Senior Member, IEEE, Yong Yan, and Robert Castañeda

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 6, NO. 8, AUGUST 1995, p. 812

Abstract—Hot spot contention on a network-based shared-memory architecture occurs when a large number of processors try to access a globally shared variable across the network. While Multistage Interconnection Network (MIN) and Hierarchical Ring (HR) structures are two important bases on which to build large-scale shared-memory multiprocessors, the different interconnection networks and cache/memory systems of the two architectures respond very differently to network bottleneck situations. In this paper, we present a comparative performance evaluation of hot spot effects on MIN-based and HR-based shared-memory architectures. Both nonblocking MIN-based and HR-based architectures are classified, and analytical models are described for understanding network differences and for evaluating hot spot performance on both architectures. The analytical comparisons indicate that HR-based architectures have the potential to handle various contentions caused by hot spots more efficiently than MIN-based architectures. Intensive performance measurements on hot spots have been conducted on the BBN TC2000 (MIN-based) and the KSR1 (HR-based) machines. Performance experiments were also conducted on the practical experience of hot spots with respect to synchronization lock algorithms. The experimental results are consistent with the analytical models, and present practical observations and an evaluation of hot spots on the two types of architectures.

Index Terms—Hierarchical Rings (HR), hot spot, Multistage Interconnection Network (MIN), performance modeling and measurements, slotted rings, shared memory, the BBN TC2000, the KSR1.
I. INTRODUCTION

Hot spot contention on a network-based shared-memory architecture occurs when a large number of processors try to access a globally shared variable across the network. This topic has been studied extensively on MIN-based architectures. Pfister and Norton [11] generalize the problem as a type of network traffic non-uniformity. Their analytical models and simulation results for blocking MIN-based architectures show that the hot spot effects may severely degrade all network traffic, not just the traffic to shared variables. The effect is defined as tree saturation, where traffic to the hot memories backs up at the switch and interferes with other traffic, including that to non-hot memories. In addition, they indicate that the combining of messages within a MIN-based architecture is an effective technique for dealing with the hot spot problem. Thomas [13] presents a set of experiments designed to measure the behavior of the Butterfly I system in the presence of memory hot spots. The experimental results reported in the paper show that the tree saturation effects do not generalize to the Butterfly I machine because nonblocking networks are used in the system.

The work we describe here differs from the studies cited above in several important respects. First, we describe analytical models for evaluating hot spot effects on nonblocking MIN-based architectures. We show that significant delay may also happen in the presence of memory hot spots, when the network contention inherent in a MIN-based architecture is potentially high. The analytical results are confirmed by our experiments on the TC2000.

Manuscript received July 9, 1993; revised Dec. 3, 1994. Xiaodong Zhang, Yong Yan, and Robert Castañeda are with the High-Performance Computing and Software Laboratory at the University of Texas at San Antonio, San Antonio, TX 78249; e-mail: [email protected], [email protected], and [email protected], respectively. IEEECS Log Number D95015.
We also present analytical models for evaluating hot spots on HR-based architectures. Comprehensive performance evaluation results provide deeper understanding and comparisons of hot spots on both architectures. Second, we conduct a series of experiments for evaluating hot spot effects on the BBN TC2000, a MIN-based architecture, and the KSR1, an HR-based architecture. The experiments on the TC2000, which has heavier network contention than the Butterfly I used by Thomas [13], present much stronger hot spot effects. Finally, comparative and quantitative performance analysis and evaluation of hot spot effects on both MIN-based and HR-based architectures are presented based on modeling and experimental results.

The organization of this paper is as follows. Section II presents performance models for evaluating hot spot effects in terms of remote access delay and network contention on nonblocking MIN-based and HR-based architectures. Based on the analytical models, a comparative hot spot performance evaluation of the two architectures is presented. In Section III, the network and system structures of the TC2000 and the KSR1 are briefly overviewed, experimental results are reported, and comparative behavior and analysis are addressed. Performance experiments were also conducted on the practical experience of hot spots with respect to synchronization lock algorithms. Summaries and conclusions are given in Section IV.

II. ANALYTICAL PERFORMANCE EVALUATION

A. Analytical Models for Nonblocking MIN Architectures

In a MIN architecture, network contention is defined as a conflict where two messages need to access the same portion of a path at the same time. The network can be designed either in blocking form or nonblocking form.


In the blocking network, a protocol organizes a message queue while all the conflicting traffic comes to a standstill (each of the other conflicting messages sits and holds its path). When the path is cleared, the next selected message proceeds. The major problem of the blocking network is the so-called cascade effect, where each new message tends to run into other blocked messages and gets blocked itself. This ties up more resources in the switch and increases the chance that subsequent messages will also block. This is the main reason why the tree saturation described in [11] may easily appear in a blocking MIN architecture. The nonblocking network may reduce the traffic conflicts: when conflicts happen, the switch has all but the "first" message retreat back to its source and free up its path. A retreating message then selects an alternative route and, after a random delay, tries again. Zhang and Qin [16] present a remote access delay model for a nonblocking MIN architecture where the behavior of a remote memory access is described by a state transition diagram called the drop approach [6]. Here a processor makes a remote memory access by formulating requests for access to the set of switches along that path. If it cannot obtain a switch, it abandons its request at that point and will try again at some later time. In Fig. 1, state 0 represents a processor in the quiescent state, while state n + 1 represents an ongoing successful access. State b represents the processor when it has dropped its request because of switch contention.

Fig. 1. State transition diagram for remote-memory access through an n-stage interconnection network.

Based on the general access delay model described in [16], we present a remote access delay model in the presence of memory hot spots. Here we briefly give some of the major results of the model. For detailed analyses and proofs of the model, the interested reader may refer to [14].
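The retry behavior of the drop approach can be sketched with a small Monte Carlo simulation. The acquisition probability, switch delay, and backoff value below are illustrative assumptions, not parameters taken from [16]; the sketch only shows how repeated drops through an n-stage network inflate the average access time.

```python
import random

def remote_access_time(n_stages, p_acquire, switch_delay, backoff, rng):
    """Simulate one remote access under the drop approach: the request
    must acquire all n_stages switches in sequence; on any failure it
    drops back to state b, waits a backoff period, and retries."""
    t = 0.0
    while True:
        failed = False
        for _ in range(n_stages):
            t += switch_delay                 # time spent at this switch
            if rng.random() >= p_acquire:     # switch busy: drop the request
                failed = True
                break
        if not failed:
            return t                          # state n+1: access succeeded
        t += backoff                          # state b: random-delay retry

rng = random.Random(42)
times = [remote_access_time(6, 0.9, 1.0, 4.0, rng) for _ in range(10_000)]
avg = sum(times) / len(times)
```

With `p_acquire = 1.0` the access takes exactly `n_stages * switch_delay`; as the per-switch acquisition probability drops, the average grows smoothly, mirroring the latency growth without global saturation discussed below.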

Assume that the size of each switch in the MIN network is k × k. The probability for a hot spot access request to acquire the switch at state i + 1, denoted q_i^hot, is given by (2.1), where X_i (i = 0, ..., n + 1) are the steady-state probabilities for a request at state i, T is the switch delay, and λ_h is the average hot spot request rate.

Including a hot spot in the remote memory delay model described in [16], we present an average memory latency model in the presence of a hot spot on a nonblocking MIN-based network:

    T_hot = nT + t_s + ((1 - P_s^h)/P_s^h) T_u^h + T_b    (2.2)

where n is the number of MIN stages, T is the switch delay time, t_s is an average memory access time, P_s^h is the probability of a successful memory access through all stages in the presence of a hot spot, T_u^h is an average delay of an unsuccessful memory access in the presence of a hot spot, and T_b is an average delay time spent in state b in Fig. 1. For the detailed mathematical derivation of (2.2), the interested reader may refer to [14] and [16].

Fig. 2 presents a comparative view of a group of memory delay curves. The access delays with different hot spot fractions (vertical axis) grow as the access request rate λ grows (horizontal axis). The network has six stages connected by a group of 4 × 4 switches. The memory delay in Fig. 2 is represented as factors of a normal memory access (delay factor = 1). A normal access has a request rate of 0.175 without the presence of hot spots. Examining Fig. 2, one can notice that memory latency is increased significantly but no tree saturation is observed. For example, the memory latency would be eight times higher in the presence of a 32% hot spot fraction. In contrast, with a hot spot in a blocking MIN-based network, the latency climbs to an asymptote at the point where the traffic and hot spot percentage combine to saturate the entire network. The model and the figure on page 945 in [11] address the reasons why cool memory requests are dramatically delayed, and describe the fact of tree saturation.

Fig. 2. Average memory latency in the presence of a hot spot on a 6-stage, 4 × 4-switch nonblocking interconnection network.

Another reason why a nonblocking MIN architecture may not suffer tree saturation is that a cool memory access request in the presence of the hot spot may eventually escape the contention switches at some stage. Let's define this as stage i. Then the request will reach the target cool memory through noncontention switches. The probability for a successful access to a cool memory module at state j, denoted q_j^cool, is given by (2.3), for i < j ≤ n.


The probability model (2.3) indicates that

    q_j^cool = q_j^hot,  1 ≤ j < i,

and

    q_j^cool > q_j^hot,  i ≤ j ≤ n.

Comparing (2.1) and (2.3), we have q_i^hot < q_i^cool, where the factor Q relating the two probabilities is given by (2.4).

This means that the successful access probability of a hot memory request at state i is Q times less than the probability of a cool memory access at the same state. Therefore, the delay of an access request to a cool memory in the presence of memory hot spots may be significantly reduced in a nonblocking MIN architecture. The experimental results on the Butterfly I reported in [13] confirm this. However, as the network contention inherent in a MIN architecture increases, an access request to a cool memory in the presence of memory hot spots may be significantly delayed. This is because the stage number i at which a cool memory request leaves the hot spot path may increase as the network contention increases. Zhang and Qin [16] have studied the potential network contention in a computation predicted by the existing MIN speed (the network bandwidth) and the processor speed (CPU clock rate) in the system. According to their analytical model and experiments, if the MIN speed remains the same in two systems, and the speed of processors is doubled in one system, the potential network contention in a computation on that system is at least doubled. The major difference in architecture between the Butterfly I and the TC2000 we used is the type of processors. The Butterfly I is the first generation of BBN MIN-based multiprocessors; each processor node contains an 8-MHz MC68000. The TC2000 uses a slightly faster Butterfly network but replaces the MC68000 processors with much faster 20-MHz 88100 processors. Based on the performance evaluation in [16], the relative amount of overhead caused by network contention for the same computation on the TC2000 is at least two times higher than the overhead on the Butterfly I. In Section III, we will report the experimental results of hot spots on the TC2000, which give different performance results than the ones presented by Thomas [13] on the Butterfly I.

B. Analytical Models for HRs

For generality and simplicity of modeling, we define the target HR-based architecture model as a two-level hierarchical ring with the following structures, functions, and parameters:

1) The basic structure of the HR-based architecture is described in Fig. 3, where each of the m local rings (LR) is connected to a global ring (GR) through a link with a pair of ports on its two ends. The global ring has m equally sized slots connecting to m local rings. Each local ring has n equally sized slots, each of which connects to a processor node. A local memory is associated with each processor node and can be accessed globally by other processor nodes through the ring network.

Fig. 3. The hierarchical ring architecture in the analytical model.

2) There are two buffers in each pair of ports of the link connecting the global ring and a local ring: an input buffer (IB) and an output buffer (OB). The IB is used to buffer incoming messages, and the OB is used to buffer outgoing messages.

3) Both the global ring and the local rings rotate constantly. A processor node in a local ring that is ready to transmit a message waits until an empty slot is available. The rotation period is defined as t_r, which also reflects the size of the slot.

4) A message will be unloaded from the slot at the destination processor, and an acknowledging message will be loaded into the slot from that processor to be transmitted back to the source processor. These two operations are conducted within the rotation period.

5) There are two different schemes for access/storage of variables in an HR-based architecture. A fixed type variable defines a variable in a fixed physical memory location which has a permanent ownership during a program execution unless it is invalidated due to writing. The variable can be accessed through remote accesses by other processors. The fixed type operations are performed on a cache coherent nonuniform memory access architecture (CC-NUMA) [12]. Examples of CC-NUMA machines are the Stanford DASH multiprocessor [9] and the MIT Alewife machine [1]. In such a machine, a home node for the corresponding physical address is required for operations of data access and cache miss. A movable type variable may be moved around among memory modules in the whole HR system, and its ownership is dynamically assigned to the processor node where it is currently located. In addition, copies of the variables may be duplicated among the memory modules. The movable type operations are performed on a cache only memory architecture (COMA). Examples of COMA machines are the Swedish Institute of Computer Science's Data Diffusion Machine (DDM) [7] and the KSR1 machine. In this type of machine, the location of a cache block is totally decoupled from its physical address. Data are dynamically migrated and replicated in the entire cache system.

6) Cache coherence is maintained by a directory-based, write-invalidate cache coherence protocol. An analytical model and experimental case studies for data migration, including cache coherence effects on CC-NUMA and CC-COMA ring architectures, will be published in [17].

7) The frequency of memory accesses is defined as the request rate λ (the number of requests per time unit). The request pattern sent from a processor is assumed to follow a Poisson process.

Furthermore, we assume:

1) The request miss rate of each local cache follows a Poisson process.

2) One message packet can be completely carried by one slot, which conveys only this message packet, so the successive slots behave independently.

3) When a station receives a message packet from a slot, it will produce a reply into the same slot without any delay.

B.1. Remote Access Delay in the Presence of Fixed Hot Spots

Since the targeted HR system, such as the KSR1, only allows one outstanding memory access request per processor, it is a closed system. Using a closed queueing model to analyze the network contention would be a reasonable choice, because the closed model is relatively precise. However, the closed model for a complex system requires solving a large nonlinear system of equations. The solutions of the system may not be easy to obtain because the initial points may not be close enough to the solutions, and the results of the model solved by iterative methods are often approximate. In contrast, an open model is usually simple, which allows us to take more parameters and more performance issues into consideration. However, it may not be as precise as a closed model. In a closed system, if we can identify and distinguish the effects among multiple servers, it is still possible to use an open model to reasonably capture the performance characteristics of the system. Examples of using an open model to study a closed ring system are reported in [5] for network analysis and in [4] for cache coherence protocols. The modeling results in [4] are well supported by the simulation.

Besides its practical applications in performance analysis, there are three reasons why using an open model to study the ring hot spot effects is acceptable and sufficient. First, the open model can be reasonably precise for this study because we have special information about the architecture and the system, such as memory access distribution patterns. Second, the contention points in the ring system exist at the interface ports between the global ring and the local ring, and between a processor and the local ring. The communication distribution in the steady state can be completely determined by the given conditions. Thus, for each contention point, its contention degree can be reasonably modeled by the gated M/G/1 queue of an open model. Finally, the analytical results of the models have been consistent with the experimental measurements on the KSR1.

We use an open model to study network contention in the HR system with the following considerations of memory access distributions:

1) Only the performance in the steady state is considered. In the steady state, assume the following two performance parameters are given:

a) λ_h: the fraction of λ contributed to a hot-spot memory module.

b) α_l: the fraction of (1 - λ_h)λ directed to memory modules in its local ring. Therefore the nonlocal request rate is (1 - α_l)(1 - λ_h)λ. In addition, the local request rate and nonlocal request rate are assumed to be uniformly distributed.

2) The numbers of request packets and acknowledgement packets traveling in the ring at any time are the same.

In modeling hot spot effects, the hierarchical rings are divided into three parts in terms of network activities: the local ring in the presence of the hot spot, called the hot local ring; the global ring; and the rest of the local rings without the presence of hot spots, called cool local rings. A comprehensive access delay model for the entire HR system in the presence of memory hot spots is presented based on modeling the non-hot spot local rings, the hot spot local ring, and the global ring.

A. Modeling the Non-Hot Spot Local Rings

Due to the similarity among the access patterns of the non-hot local rings, the analysis can focus on a single non-hot local ring without losing generality. Based on the ring architecture described in the previous section, accesses to the interface ports between the global ring and a local ring can be modeled by a gated M/G/1 queue with a mean packet arrival rate λ_p, which may be formed by a steady-state formula,

    λ_p = nλλ_h + 2n(1 - λ_h)(1 - α_l)λ.    (2.5)

This arrival rate is the sum of the probe packet arrival rate and the data packet arrival rate.

A general network utilization is defined by

    U = lim_{t→∞} C/t    (2.6)

where C is the total number of bytes transmitted to the network from all processor nodes during the time period t. Based on (2.5) and (2.6), a local ring utilization is formulated as

    U_l = t_r(λ_p + nλ) = nλt_r(1 + λ_h + 2(1 - λ_h)(1 - α_l)).    (2.7)

Because the successive slots have been assumed to behave independently, the probability that a slot on a cool local ring is full is U_l. The time needed for a packet in each station to find an empty slot can be approximated by a geometric distribution; the analyses in [5] show that the error caused by the independence assumption is trivial. The probability of waiting exactly i slots is U_l^i (1 - U_l). Thus, the average time of finding an empty slot is

    d_l = Σ_{i=0}^{∞} i t_r U_l^i (1 - U_l) = U_l t_r / (1 - U_l)
        = nλt_r²(1 + λ_h + 2(1 - λ_h)(1 - α_l)) / (1 - nλt_r(1 + λ_h + 2(1 - λ_h)(1 - α_l))).    (2.8)

The average delay for an access from an interface port to a processor in this ring, denoted d_pnl, consists of a delay W in the input buffer and the average destination searching time nt_r/2:

    d_pnl = W + nt_r/2.
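The geometric-wait argument behind (2.8) can be checked numerically: if each passing slot is full with independent probability U, the expected wait before an empty slot appears is U·t_r/(1 - U). A minimal Monte Carlo sketch with illustrative values:

```python
import random

def slot_wait(U, t_r, rng):
    """Count full slots passing by (each full with probability U,
    independently) until an empty one appears; each full slot costs
    one rotation period t_r of waiting."""
    wait = 0.0
    while rng.random() < U:   # slot is full; wait one rotation period
        wait += t_r
    return wait

U, t_r = 0.4, 1.0
rng = random.Random(1)
n_trials = 200_000
mean_wait = sum(slot_wait(U, t_r, rng) for _ in range(n_trials)) / n_trials
analytic = U * t_r / (1 - U)   # closed form of (2.8) for a given utilization
```

The simulated mean converges to the closed form, supporting the approximation used for both the cool and hot local rings.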

The average request queue length in the input buffer may be calculated by Little's law,

    Q = λ_p W    (2.9)

where W, the average waiting time in the queue, may be calculated by

    W = (d_l + t_r) + Q(d_l + t_r).    (2.10)

Combining (2.9) and (2.10), W becomes

    W = (d_l + t_r) / (1 - λ_p(d_l + t_r)).

Finally, the average delay for a nonlocal ring access gives

    d_pnl = (d_l + t_r) / (1 - λ_p(d_l + t_r)) + nt_r/2.    (2.11)
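Equations (2.5)–(2.11) can be evaluated directly for concrete parameter values. The values below (ring sizes, rotation period, request rates) are illustrative assumptions, not measured KSR1 parameters; the sketch only shows the mechanics of the cool-local-ring model.

```python
# Illustrative evaluation of the cool-local-ring delay model (2.5)-(2.11).
m, n = 8, 8          # number of local rings, nodes per local ring (assumed)
t_r = 1.0            # rotation period, in arbitrary time units (assumed)
lam = 0.01           # per-processor request rate (assumed)
lam_h = 0.1          # hot-spot fraction of requests (assumed)
alpha_l = 0.5        # fraction of non-hot requests staying local (assumed)

# (2.5) mean packet arrival rate at a cool interface port
lam_p = n * lam * lam_h + 2 * n * (1 - lam_h) * (1 - alpha_l) * lam
# (2.7) cool local ring utilization
U_l = t_r * (lam_p + n * lam)
# (2.8) mean time to find an empty slot
d_l = U_l * t_r / (1 - U_l)
# input-buffer waiting time W from (2.9)-(2.10)
W = (d_l + t_r) / (1 - lam_p * (d_l + t_r))
# (2.11) average delay from interface port to a processor
d_pnl = W + n * t_r / 2
```

For these values the ring is lightly loaded (U_l well below 1), and the port-to-processor delay is dominated by the destination search term nt_r/2, as the model predicts for low request rates.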

B. Modeling the Hot Spot Local Ring

The hot spot local ring can be modeled similarly to a cool local ring in A. Let λ_hp be the packet arrival rate at the hot interface port from the global ring to the hot local ring. It has

    λ_hp = (m - 1)nλλ_h + 2n(1 - λ_h)(1 - α_l)λ.    (2.12)

Based on (2.12) and (2.6), the utilization of the hot local ring is

    U_h = (λ_hp + nλ)t_r = nλt_r(1 + (m - 1)λ_h + 2(1 - λ_h)(1 - α_l)).

So the mean time to find an empty slot on the hot ring is

    d_h = U_h t_r / (1 - U_h) = (λ_hp + nλ)t_r² / (1 - t_r(λ_hp + nλ)).    (2.13)

Let d_phl be the average delay for an access from the hot port to a processor in the hot ring. Using (2.9), (2.10), and (2.11), d_phl can be expressed as

    d_phl = (d_h + t_r) / (1 - λ_hp(d_h + t_r)) + nt_r/2.    (2.14)

Moreover, the average length of the input queue at the hot interface port is

    L_hp = ((m - 1)nλλ_h + 2n(1 - λ_h)(1 - α_l)λ)t_r / (1 - nλt_r(1 + 2(m - 1)λ_h + 4(1 - λ_h)(1 - α_l))).    (2.15)

C. Modeling the Global Ring

Access to a slot in the global ring from a local ring via the interface port can also be modeled as a gated M/G/1 queue. Let λ_gh and λ_gn be the packet arrival rates of the hot port and a non-hot port; then we have

    λ_gh = (m - 1)nλλ_h + 2n(1 - λ_h)(1 - α_l)λ,    (2.16)

    λ_gn = nλλ_h + 2n(1 - λ_h)(1 - α_l)λ.    (2.17)

Based on (2.16), (2.17), and (2.6), the utilization of the global ring can be expressed as

    U_g = (λ_gh + (m - 1)λ_gn)t_r = (2(m - 1)nλλ_h + 2mn(1 - λ_h)(1 - α_l)λ)t_r.    (2.18)

The average delay for a processor node to find an empty slot in the global ring, denoted d_g, can also be determined by (2.18) and (2.8),

    d_g = 2nλt_r²((m - 1)λ_h + m(1 - λ_h)(1 - α_l)) / (1 - 2nλt_r((m - 1)λ_h + m(1 - λ_h)(1 - α_l))).    (2.19)
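The global-ring quantities (2.16)–(2.19) can be evaluated the same way. As before, all parameter values are illustrative assumptions; the identity checked at the end is the algebraic expansion in (2.18).

```python
# Illustrative evaluation of the global-ring model (2.16)-(2.19).
m, n = 8, 8          # rings, nodes per ring (assumed)
t_r = 1.0            # rotation period (assumed)
lam = 0.01           # per-processor request rate (assumed)
lam_h = 0.1          # hot-spot fraction (assumed)
alpha_l = 0.5        # local fraction of non-hot requests (assumed)

# (2.16)/(2.17) arrival rates at the hot and non-hot global-ring ports
lam_gh = (m - 1) * n * lam * lam_h + 2 * n * (1 - lam_h) * (1 - alpha_l) * lam
lam_gn = n * lam * lam_h + 2 * n * (1 - lam_h) * (1 - alpha_l) * lam
# (2.18) global ring utilization
U_g = (lam_gh + (m - 1) * lam_gn) * t_r
# (2.19) mean time to find an empty global-ring slot
d_g = U_g * t_r / (1 - U_g)
```

Note that even at these modest request rates the global ring is far busier than a local ring (its utilization sums traffic from all m local rings), which is where the hot spot's system-wide effect concentrates.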

Moreover, by (2.9), (2.10), and (2.17), the queue length on the hot port, denoted L_gh, can be expressed as

    L_gh = λ_gh(t_r + d_g) / (1 - λ_gh(t_r + d_g)).    (2.20)

The average queueing delay on the hot port, denoted W_h, is

    W_h = (d_g + t_r) / (1 - λ_gh(d_g + t_r)) = (d_g + t_r) / (1 - nλt_r(3(m - 1)λ_h + (m + 1)(1 - λ_h)(1 - α_l))).    (2.21)

The average queueing delay on the non-hot port, denoted W_n, is

    W_n = (d_g + t_r) / (1 - λ_gn(d_g + t_r)).    (2.22)

D. Access Delay Analysis in the Hierarchical Rings

Here we summarize various average access delays in the entire HR network, where each access takes a round trip from the source back to the source again.

1) Average local request delay in a non-hot spot ring, denoted d_lcool, consists of the time of finding an empty slot on a cool local ring and the request travel time on the local ring. By (2.8) we have

    d_lcool = d_l + nt_r.    (2.23)

2) Average local request delay in the hot local ring, denoted d_lhot, consists of the time of finding an empty slot on the hot local ring and the request travel time. By (2.13) we have

    d_lhot = d_h + nt_r.    (2.24)

3) Average request delay from a cool ring to another cool ring, denoted d_cc, consists of four travel timing parts: the time from the source cool ring to the global ring, the time from the global ring to the destination local ring, the time for searching the destination node in the destination ring, and the time for the data packet to come back to the source node. By (2.8), (2.11), and (2.22), we have

    d_cc = d_l + 2W_n + 2d_pnl + t_r(m + n).    (2.25)

4) Average request delay from a cool ring to the hot ring, denoted as dchr consists of four travel timing parts: the time from the source cool ring to the global ring, the time from the global ring to the hot ring, the time for searching the destination node in the hot ring and the time for the data packet to come back to the source node. By (2.8), (2.11). (2.14), (2.21), and (2.22), we have

dch=d,+~“+dphl+~h+;lpnl+t,(m+n). (2.26)

5) Average request delay from the hot ring to a cool ring, denoted as dk, consists of four travel timing parts: the time from the hot ring to the global ring, the time from the global ring to the destination cool ring, the time for searching the destination node in the destination cool ring, and the time for the data packet to come back to the source node. By (2.13), (2.11), (2.14), (2.21), and (2.22), we have

dhe =i.?,, +Fa +c&, +Fh +i&, +t,(m+n). (2.27)

6) Average delay to access a cool ring in the presence of hot spots, denoted as D_cool, comes from (2.23), (2.25), and (2.26):

D_cool = α_h d_ch + (1 - α_h)(α_l d_lcool + (1 - α_l) d_cc).   (2.28)

7) Average delay to access the hot ring, denoted as D_hot, comes from (2.24) and (2.27):

D_hot = (α_h + (1 - α_h)α_l) d_lhot + (1 - α_h)(1 - α_l) d_hc.   (2.29)

Combining (2.28) and (2.29), the average delay D of a request in an HR in the presence of a hot spot is

D = ((m - 1)/m) D_cool + (1/m) D_hot.   (2.30)
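The combination (2.28)–(2.30) can be sanity-checked numerically. The sketch below is ours, not from the paper; it evaluates the formulas as reconstructed here, with the component delays supplied as inputs.

```python
def avg_request_delay(m, a_h, a_l, d_lcool, d_lhot, d_cc, d_ch, d_hc):
    """Average request delay D per (2.28)-(2.30).

    m: number of local rings; a_h: hot-spot access fraction;
    a_l: local access fraction; d_*: component delays (same time unit).
    """
    # (2.28): average delay to access a cool ring
    d_cool = a_h * d_ch + (1 - a_h) * (a_l * d_lcool + (1 - a_l) * d_cc)
    # (2.29): average delay to access the hot ring
    d_hot = (a_h + (1 - a_h) * a_l) * d_lhot + (1 - a_h) * (1 - a_l) * d_hc
    # (2.30): weight the m-1 cool rings against the one hot ring
    return ((m - 1) / m) * d_cool + (1 / m) * d_hot
```

Because each of (2.28)–(2.30) is a convex combination, supplying equal component delays must return that same value, which is a quick consistency check on the reconstruction.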

E. Hot Spot Effects on the Hierarchical Ring Network

The effects of the hot spot in the hierarchical ring network may be determined by the average access request rate λ, which gives a quantitative view of the network contention caused by the hot spot. We assume that each access request is complete when the corresponding message from the target processor node comes back to the source node. The average access request rate λ is bounded by the average time of completing a request, T. In a real system, the speed of network transactions will be stable when the access time is stable. Let λ_s and T_s be the request rate and the average access request time, respectively, in a steady state; then we have

λ_s = 1/T_s.   (2.31)

Network transactions can be determined by the utilizations of the three classes of rings. The network utilization in a cool local ring is

U_cool = nλt_r[1 + α_h + 2(1 - α_h)(1 - α_l)],   (2.32)

in the hot local ring it is

U_hot = nλt_r[1 + (m - 1)α_h + 2(1 - α_h)(1 - α_l)],   (2.33)

and in the global ring it is

U_g = 2nλt_r[(m - 1)α_h + m(1 - α_h)(1 - α_l)].   (2.34)

An upper bound on the average access request rate in the presence of the hot spot, denoted as λ_up, can be calculated by setting the utilization in (2.33) to its maximum value of 1:

λ_up = 1 / (nt_r[1 + (m - 1)α_h + 2(1 - α_h)(1 - α_l)]).   (2.35)

Formula (2.35) indicates that network transactions will be slowed down (λ_up is decreased) when the numbers of slots in the rings (m and n) are increased, and when the frequency of access to the hot spot (α_h) is increased.
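The monotonic behavior stated for (2.35) is easy to check numerically; the following sketch (ours, using the formula as reconstructed above) evaluates the bound for a two-level 32 × 32 configuration with illustrative parameter values:

```python
def lambda_up(m, n, t_r, a_h, a_l):
    # Upper bound on the per-processor request rate, per (2.35):
    # obtained by setting the hot-ring utilization (2.33) to 1.
    return 1.0 / (n * t_r * (1 + (m - 1) * a_h + 2 * (1 - a_h) * (1 - a_l)))

# The bound shrinks as the hot-spot fraction a_h grows,
# and as the ring sizes m and n grow.
bounds = [lambda_up(32, 32, 1.0, a_h, 0.5) for a_h in (0.0, 0.1, 0.2, 0.3)]
```

With m = n = 32 and α_l = 0.5, the computed bound falls steadily as α_h rises, matching the qualitative claim in the text.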

Quantitative evaluation of the effects of a given buffer length on the average access request rate in the presence of the hot spot is another important factor to be considered. Let L_hb be the average length of the access request queue buffered in the port entering the hot local ring. It can be calculated by Little's law and the Utilization law, based on the different types of request rates (λ, λ_h, and λ_l), the numbers of slots in the global ring (m) and in a local ring (n), and the ring rotation period (t_r):

L_hb = nλt_r[3(m - 1)α_h + 2(m + 1)(1 - α_h)(1 - α_l)] / (1 - nλt_r[3(m - 1)α_h + 2(m + 1)(1 - α_h)(1 - α_l)]).   (2.36)

For a given buffer length in the design, denoted as L_d, the buffer in the port of the hot ring is full or overflows when L_d ≤ L_hb. Another upper bound on the access request rate, based on the buffer length and denoted as λ_buf, can be calculated from (2.36):

λ_buf = A_1 A_2 / t_r,   (2.37)

where

A_1 = L_d / (2n(n + mL_d + L_d)(1 - α_l))


and

A_2 = 2(n + mL_d + L_d)(1 - α_l) / [(m - 1)(3L_d + n)α_h + 2(n + mL_d + L_d)(1 - α_h)(1 - α_l)].

Formula (2.37) indicates that the upper bound on the access request rate of the system, λ_buf, is increased when the queue length L_d is increased, and is decreased when the ring rotation period t_r is increased. Furthermore, Formula (2.37) presents a clear view of hot spot effects on the average access request rate λ_buf. Substituting α_h = 0 into (2.37), we have λ_buf ≤ A_1/t_r (A_2 = 1), which gives the average access request rate without the presence of the hot spot. In other words, A_2 is the reduction factor for the access request rate in the presence of the hot spot. Assume that non-hot-spot requests are evenly distributed in the global ring. Then, in the worst case over the hot-spot fraction, A_2 reduces to

A_2 = 2 / (2 + (mL_d + m - 2 - 2L_d)/(mL_d + L_d + 1)),

where

(mL_d + m - 2 - 2L_d)/(mL_d + L_d + 1) < 2.   (2.38)

Then the lower bound of A_2 is

A_2 > 1/2.   (2.39)

This gives an important performance result on hot spot effects in the ring network. It indicates that, in the hierarchical ring architecture, network transactions will be reduced by no more than 50% in the presence of the hot spot, compared with the network transactions without the presence of the hot spot. The rotating ring orders and delays remote data access requests; this structure may naturally reduce network contention for programs with hot spots. We will confirm our comparative analytical results by experiments on the TC2000 and the KSRl in Section III.
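The ½ bound can be spot-checked numerically. The sketch below is ours; it evaluates the reduced form of A_2 from (2.38) over a range of ring sizes m and buffer lengths L_d:

```python
def a2_reduced(m, l_d):
    # A_2 under the even-distribution assumption, per (2.38).
    # The fraction simplifies to (m - 2)(l_d + 1) / (m*l_d + l_d + 1),
    # which is nonnegative and strictly less than 2 for m >= 2, l_d >= 1.
    x = (m * l_d + m - 2 - 2 * l_d) / (m * l_d + l_d + 1)
    return 2.0 / (2.0 + x)

# Every value lies in (1/2, 1]: the hot spot at most halves throughput.
values = [a2_reduced(m, l_d) for m in range(2, 65) for l_d in range(1, 33)]
```

Since 0 ≤ x < 2 on this range, A_2 = 2/(2 + x) stays strictly above 1/2, which is exactly the (2.39) bound.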

Fig. 4. Request delays in the presence of a hot spot on a two-level, 32-ring network (curves for 32%, 16%, and 8% hot spot fractions; vertical axis: delay factor; horizontal axis: request rate).

As our analytical study indicates, the reason an HR-based architecture can significantly reduce network contention for programs with hot spots is that the rotating rings order and delay remote data access requests. On the other hand, the memory access request rate is limited by the system in the presence of a hot spot. Based on the analytical models, we can observe various performance effects in the presence of a hot spot. The modeled HR-based architecture has 32 rings in two levels, and each ring has 32 processor nodes.

Since the request rate is controlled by the ring structure, request delays are not significantly increased in the presence of hot spots. In Fig. 4, a group of memory delay curves with different hot spot fractions (vertical axis) grows as the access request rate λ grows (horizontal axis), where the ring network has 32 local rings, each of which has 32 processors, and the request rate is bounded by 0.5 × 10⁻³. Examining Fig. 4, one can observe three features of memory latency changes in an HR-based network in the presence of hot spots. First, hot spots have no significant effect on request access delays; memory latency is only slightly increased as the hot spot fraction is increased. Second, the memory latency is mainly determined by the access request rate. In particular, when the request rate reaches the upper bound, the memory latency increases sharply. Finally, compared with a nonblocking MIN-based network (see Fig. 2), a high access request rate is the major factor in memory latency on an HR-based network, while the presence of hot spots is the major factor on a MIN-based network.

The analytical models can also help us examine traffic distributions among the cool rings and the hot ring in the presence of a hot spot. Fig. 5 illustrates the average delay factors for the three types of accesses: from a cool ring to the hot ring, from a cool ring to another cool ring, and from the hot ring to a cool ring. The delay factors were calculated in the presence of a 32% hot spot fraction. An access from the hot ring to a cool ring has the longest delay, while an access from a cool ring to another cool ring has the shortest delay. However, only when the request rate approaches the upper bound can the delay factor differences

Fig. 5. Request delay distributions in the presence of a 32% hot spot fraction (curves: from the hot ring to a cool ring, from a cool ring to the hot ring, and from a cool ring to another cool ring; vertical axis: delay factor; horizontal axis: request rate × 10⁻³).


among the three types of accesses be observed. For example, the difference could be at most 37% between a cool-to-cool access and a hot-to-cool access when the request rate is 0.43 × 10⁻³. Fig. 5 indicates that the traffic in the ring network is reasonably balanced even when a hot spot appears in the system. This is because the rotating ring orders and delays remote accesses. The experimental measurements on the KSRl presented in Section III confirm this analytical result.

B.2. Remote Access Delay in the Presence of Movable Hot Spots

Fixed hot spots described in the previous sections appear in non-CC-NUMA and CC-NUMA architectures. A non-CC-NUMA architecture either supports no local caches (e.g., the BBN GP1000) or provides local caches but disallows caching of shared data to avoid the cache coherence problem (e.g., the BBN TC2000). In a CC-NUMA architecture, each processor node consists of a high-performance processor, the associated cache, and a portion of the global shared memory. In a COMA architecture, like CC-NUMA, each processor node has a high-performance processor, a cache, and a portion of the global shared memory. The difference, however, is that the memory associated with each node is augmented to act as a large cache. Consistency among cache blocks in the system is maintained using a write-invalidate protocol. A COMA system allows transparent migration and replication of data items to the nodes where they are referenced. In this case, many processors access the same variable, which is not fixed in one physical location. Therefore, the name of the variable is hot, but not a single memory location.

In general, the location of a movable hot spot is randomly distributed among the processors. Assume that each processor has the same rate of memory accesses; then the distribution of a hot spot over a period of time can be considered a uniform distribution. Therefore, the modeling of the network contention in the presence of a movable hot spot is a special case of the models presented in the previous section. Considering the hot spot accesses as a portion of uniform memory accesses, and using formulas (2.5) to (2.25) with λ_h = 0, we can derive the average local request delay d_lmovable and the average remote request delay d_rmovable as follows:

d lmovoble = ‘&ml + % 1 (2.40)

d rmovable = qfind + 2dgWoi, + 2qWai, + (m + 2n)t, . (2.41)

By (2.8), the average time of finding an empty slot in a local ring is

&find = nI.$(3-21,) (2.42)

l-n&,(3-2ill) ’

By (2.10) and (2.5), the average queueing delay in the inter- face port on a local ring is

&twit = 4)w + tr (2.43)

1 - 2nil( 1 - l,)(&nd + t,) ’

By (2.22), the average queuing delay in an interface port on the global ring can be expressed as

dgwair = I-2nAI,(i+l)(l-A,) ’ (2.44)

Our analysis shows that a movable hot spot reduces network contention in general. This analytical result has been verified by experimental results on the KSRl in Section III.
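The movable-hot-spot delay components can be evaluated directly from (2.40)–(2.44). The sketch below is ours: it uses the formulas as reconstructed above, takes the global-ring slot-finding delay d_gfind from (2.19) with α_h = 0, and plugs in illustrative parameter values only.

```python
def movable_hot_spot_delays(m, n, t_r, lam, a_l):
    """Return (local, remote) average request delays under lambda_h = 0."""
    # (2.42): time to find an empty slot on a local ring
    rho_local = n * lam * t_r * (3 - 2 * a_l)
    d_lfind = rho_local * t_r / (1 - rho_local)
    # (2.19) with a_h = 0: time to find an empty slot on the global ring
    rho_global = 2 * m * n * lam * t_r * (1 - a_l)
    d_gfind = rho_global * t_r / (1 - rho_global)
    # (2.43): queueing delay at a local-ring interface port
    d_lwait = (d_lfind + t_r) / (1 - 2 * n * lam * (1 - a_l) * (d_lfind + t_r))
    # (2.44): queueing delay at a global-ring interface port
    d_gwait = (d_gfind + t_r) / (1 - 2 * n * lam * t_r * (m + 1) * (1 - a_l))
    d_local = d_lfind + n * t_r                                         # (2.40)
    d_remote = d_lfind + 2 * d_gwait + 2 * d_lwait + (m + 2 * n) * t_r  # (2.41)
    return d_local, d_remote
```

For a stable operating point (all utilization terms below 1), a remote access costs noticeably more than a local one, dominated by the (m + 2n)t_r round-trip travel time.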

B.3. Some HR-Based Architecture Factors

Based on the analytical models of HRs described in the previous sections, we can also derive several important factors for HR-based architecture design. Fig. 6 illustrates the effects of the number of processors of a ring on the limit of the memory access rate (the upper bound of the request rate). It indicates that increasing the number of processors and slowing down the rotation speed (increasing the rotation period) of the ring will limit the request rate. Fig. 7 illustrates the effects of the queue length of a port buffer between a local ring and the global ring on the memory access request rate. It indicates that the memory access request rate is almost independent of the queue length. This fact can be explained as follows. If the queue length in the buffer is increased, traffic contention may be reduced by increasing the waiting time for a request in the buffer. Therefore, the memory request rate limit remains nearly constant when the queue length is adjusted.

Fig. 8 shows that the access delay also depends on the number of processors in each ring, besides the request rate. For a given rotation rate and a given request rate without the presence of hot spots, increasing the number of processors in a ring will increase the request delay. Another factor the access delay depends on is the ring rotation period. Fig. 9 illustrates the request delay curves obtained by varying the request rate for different given rotation periods. The faster the ring rotates, the lower the delay.

III. EXPERIMENTAL PERFORMANCE EVALUATION

A. The Experiment Testbeds

A. 1. The BBN TC2000

The MIN-based execution testbed in our study is the BBN TC2000 [3], which is the latest and most powerful member of the BBN family, supporting up to 512 Motorola 88100 processor nodes. A butterfly switch composed of 8 × 8 switches is used for the network. The processors operate at a clock speed of 20 MHz. The bandwidth of each path of a switch in the TC2000 is 38 Mb per second. However, the network speed is still not fast enough to catch up to the fast processor speed, causing greater network contention. Each processor node includes 16 MBytes of memory that can be accessed from any processor in the system via the network. Each processor in the TC2000 has a Motorola 88200 paged memory management unit for virtual memory processing. Each processor provides a 32K-byte code cache and a 16K-byte data cache. The TC2000 avoids the cache coherence problem by disallowing caching of shared data. This scheme caches private data, shared data that is read-only, and instructions, while references to modifiable shared data bypass the cache. In addition, an interleaved shared-memory scheme is supported as an option in order to reduce memory contention in applications.


Fig. 6. Effects of the numbers of processors of a ring (16, 32, and 64 nodes) on the limit of the memory access rate (horizontal axis: rotation period).

Fig. 7. Effects of the queue length of a port buffer between a local ring and the global ring on the limit of the memory access rate (curves for 16, 32, and 64 nodes).

A.2. The KSRl System

The KSRl [8], introduced by Kendall Square Research, is an HR-based, cache-coherent shared-memory multiprocessor system with up to 1,088 64-bit custom superscalar RISC processors (20 MHz). A basic ring unit in the KSRl has 32 processors. The system uses a two-level hierarchy to interconnect 34 rings (1,088 processors). Each processor has a 32MB cache.

The basic structure of the KSRl is the slotted ring, where the ring bandwidth is divided into a number of slots circulating continuously through the ring. The number of slots in the ring is equal to the number of processors plus the number of routers connecting to the upper ring. A standard KSRl ring has 34 message slots, where 32 are designed for the 32 processors and the remaining two slots are used by the directory cell connecting to the level-1 ring. Each slot can be loaded with a packet, made up of a 16-byte header and 128 bytes of data, which is the

Fig. 8. Effects of the number of processors in each ring on average memory access request delay (1 delay time unit = 1 rotation period unit; horizontal axis: request rate × 10⁻³).

Fig. 9. Effects of the ring rotation period on average memory access request delay (1 delay time unit = 1 rotation period unit; curves for rotation periods 13, 9, 5, and 1; horizontal axis: request rate × 10⁻³).

basic data unit in the KSRl, called a subpage. A processor in the ring that is ready to transmit a message waits until an empty slot is available. A single bit in the header of the slot identifies an empty slot as the slot rotates through the ring interface of the processor.
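The empty-slot search described above can be illustrated with a toy simulation (ours; slot occupancy is drawn at random rather than from the analytical model): a node watches the slots rotating past its ring interface and claims the first one whose empty bit is set.

```python
import random

def slots_until_send(num_slots=34, busy_prob=0.5, seed=1):
    """Number of slot times a node waits before an empty slot rotates past."""
    rng = random.Random(seed)
    ring = [rng.random() < busy_prob for _ in range(num_slots)]  # True = full
    for waited, slot_full in enumerate(ring):
        if not slot_full:          # empty bit set: claim this slot
            return waited
    return num_slots               # no empty slot in a whole rotation
```

With busy_prob = 0 the node transmits immediately; with busy_prob = 1 it waits a full rotation, mirroring how the slot-finding delays in Section II grow with ring utilization.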

B. The Structure of the Experiments

Two sets of experiments were conducted to measure and compare the performance of the KSRl and the TC2000 in the presence of hot spots. The first set of experiments measured the effects of hot spots on different remote cache/memory access and block transfer operations. This was done by comparing access times in an environment without any hot spots and an environment where some processor nodes were used to generate a hot spot. The second set of experiments measured and compared the performance of simple spin-lock and distributed lock algorithms.


In the first set of experiments, the hot spots were generated in two different ways:

1) Via read and write references: a set of processors were used to make a target cache/memory module hot by reading and writing the same location in that cache/memory module.

2) Via block transfer: a set of processors were used to make a target cache/memory hot by using the block transfer operation to copy data from that cache/memory module to their local cache/memory modules.

In the second set of experiments, two types of synchronization locks are used: the simple spin-lock and the distributed lock. The simple spin-lock consists of a globally shared lock allocated on a single processor, to which all access is serialized. When a processor gets the lock, it sets the lock busy immediately and starts the atomic operations. An unlock is performed by the processor immediately after the critical section work is finished. The rest of the processors have to busy-wait for the lock using remote memory accesses through the interconnection network. We construct the simple lock using the atomic test-and-set operation.
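The simple spin-lock can be sketched as follows. This is our Python emulation, not the machines' implementation: Python has no atomic test-and-set, so a guard lock stands in for the hardware primitive.

```python
import threading
import time

class TestAndSetLock:
    """Simple spin-lock: every waiter spins on one shared flag."""

    def __init__(self):
        self._held = False
        self._guard = threading.Lock()   # stands in for the atomic primitive

    def _test_and_set(self):
        with self._guard:                # atomically fetch old value, set busy
            old, self._held = self._held, True
            return old

    def acquire(self):
        while self._test_and_set():      # spin until the flag was observed free
            time.sleep(0)                # yield; on real hardware each retry is
                                         # a remote reference to the lock module

    def release(self):
        self._held = False
```

Every busy-waiting processor repeatedly references the single lock flag, which is precisely the hot-spot access pattern analyzed above.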

In order to reduce contention in the network and in the memory module holding the lock, it is more efficient to have each processor busy-wait only on a locally accessible variable for the lock. A distributed algorithm decentralizes locks throughout the memory modules. The concept of the distributed algorithm has been implemented and tested on various shared-memory systems [2], [10], [15]. In fact, the execution behavior of spin-locks is significantly different between MIN-based and HR-based architectures. For example, while the best algorithms for the KSRl are simple locks with delay options, the best algorithms for the TC2000 are distributed locks.
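A distributed lock can be sketched in the same spirit. The sketch below is ours, loosely modeled on array-based queue locks (in the spirit of [2] and [10]) rather than on the exact TC2000 code: each waiter spins on its own slot, so busy-wait traffic stays local instead of converging on one hot module.

```python
import itertools
import threading
import time

class ArrayQueueLock:
    """Array-based queue lock: the i-th waiter spins only on flags[i]."""

    def __init__(self, capacity):
        # capacity must be >= the number of threads that may contend at once
        self.flags = [True] + [False] * (capacity - 1)  # slot 0 starts free
        self.capacity = capacity
        self._ticket = itertools.count()
        self._guard = threading.Lock()    # emulates atomic fetch-and-increment
        self._slot = threading.local()

    def acquire(self):
        with self._guard:
            slot = next(self._ticket) % self.capacity
        self._slot.value = slot
        while not self.flags[slot]:       # spin on a private slot
            time.sleep(0)
        self.flags[slot] = False

    def release(self):                    # hand the lock to the next waiter
        self.flags[(self._slot.value + 1) % self.capacity] = True
```

On release, only the successor's flag is written, so lock hand-off touches one location rather than broadcasting to all waiters.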

C. Comparative Hot Spot Effects on Cache/Memory Access

Four types of cache/memory accesses were measured:

1) Single word read (4 bytes) accesses;
2) Single word write (4 bytes) accesses;
3) Block transfer of data from the remote cache/memory;
4) Block transfer of data to the remote cache/memory;

where the block length is defined as 128 bytes. Different numbers of blocks are applied for block transfer accesses.

To clarify the terminology that will be used in this section pertaining to hot spots, we describe the important definitions in the experiments:

• Remote access refers to when a processor makes a memory/cache reference to another processor. In the KSRl, this access could be a remote cache access in the local ring or a remote access in a remote ring. In the TC2000, all remote memory accesses are of the same distance.

• Hot spot is a shared variable that is accessed simultaneously by a large number of processors, which are called hot processors.

• Hot memory/cache is the memory/cache module where the hot spot variable resides.

• Cool processors are the processors in the system which do not access the hot spot.

• Cool variables in hot memory/cache are non-hot spots in the hot memory/cache, and will be accessed by the cool processors remotely.

• Cool variables in cool memory/cache are non-hot spots in non-hot memory/cache, and will be accessed by cool processors remotely.

Fig. 10 illustrates the above definitions for the hot spot experiments, where variable X is the hot spot, variables a and b are cool variables in the hot memory/cache, and variables c and d are cool variables in a cool memory/cache. The system in Fig. 10 assumes p processors are used, where h nodes are used for hot processors and p - h nodes are used for cool processors, with h >> p - h.

C.1. Performance on the KSRl

The hot spot on the KSRl is allocated either in a fixed location, called the fixed hot spot, or in movable locations, called the movable hot spot. The fixed hot spot remains physically on one processor as other processors try to read it as a single variable or a block of data. The movable hot spot migrates around the ring on the demand of any processor that does a read of a single variable or a block of data. This data migration is a feature of the KSRl intended to enhance data locality.

In the fixed hot spot experiment on the KSRl, the hot spot must be directly and intentionally placed in a fixed memory location. This is necessary because when a memory location is referenced normally, the system migrates that data to the cache of the requesting processor. Therefore, a scheme is required to access a variable or a block of data and have it remain in place during the reference. It is accomplished in the following way. The hot variable was represented as a long vector, and each hot processor read a different element of the hot variable vector. This ensures that only that element will be read and that the entire hot variable vector will not be migrated over to that hot processor. The hot variable was read 1,000 times by each hot processor (1,000 distinct elements per hot processor). The type of the hot variable is also important because types differ in size, and the unit of transfer in the KSRl is 128 bytes. If the hot variable vector is of type "integer," then each element is 4 bytes long. Therefore, when a hot processor reads the next element, the index must be increased by 32, since when one element is read, an entire subpage (128 bytes) is actually read from the hot variable. If the index were not increased by 32, the next element would already reside in the hot processor's cache, making that reading a local read within its own cache. If the hot variable vector is of type "block," then each element is 128 bytes long, so when a hot processor reads the next element, the index must be increased by only one (because each element in the hot variable vector is equal to a subpage). The reading of the cool variables is handled in the same manner. Since the cool variables can be either a single variable or a "block," the increase to access the next element will be 32, or 1, 2, or 3, respectively. The access of the cool variables was also done 1,000 times per cool processor. The average


timing results are used as the final results. Since a remote write which updates a variable in a remote cache module is not supported on the KSRl, we only used read operations for the hot spot experiments.

Fig. 10. The hot spot experiment environment: a large number of hot processors access the hot spot variable X intensively, while a small number of cool processors randomly access the cool variables "a" and "b" in the hot memory/cache and the cool variables "c" and "d" in a cool memory/cache.

The movable hot spot experiments were implemented using normal cache accesses on the KSRl, where the hot spot migrates across the ring from processor to processor on reference demand. In this implementation, any cache among the hot processors could be "hot" at any given time. Since we cannot know which cache is hot, measurements of accessing cool variables at the hot cache were not conducted.

For both the fixed and movable hot spot experiments on the KSRl system, 57 processors were used to generate the hot spot. Since only 64 processors are available for computing in the two-ring system, there was one remote hot cache module and six remote cool cache modules. Before presenting our experimental results of hot spot effects, we list the standard access times in Table I, published by KSR [8], for comparison and reference.

TABLE I
STANDARD CACHE ACCESS TIMES ON THE KSRl

Considering the different level accesses and block transfers in the experiments, the remote cache access measurements (first data row in Table II) in a non-hot-spot environment are quite consistent with the standard results in Table I. When there was a hot spot present, remote accesses to cool variables in the cool cache modules were not affected. However, there is a degradation in performance when referencing the cool variables from the hot cache module. According to the information available on the KSRl network structure (HR-based), a claim is made that the rotating rings that make up the network rotate in a balanced, efficient manner. This means that a hot spot should have little effect on other processors that need to make references through the ring.

Table II reports the average timing results of the fixed hot spot experiments. The first data column lists the results of remote reads of one word; the second, third, and fourth data columns list the results of remote reads of one block, two blocks, and three blocks, respectively. These data transactions are under the environment without any hot spots (the 1st data row), the environment with the hot spot generated by cache references in a word unit (the 2nd data row), and the environment with the hot spot generated by cache references in a block unit (the 3rd data row).

TABLE II
VARIOUS REMOTE CACHE ACCESS TIME MEASUREMENTS (μs) IN THE PRESENCE OF A FIXED HOT SPOT ON THE KSRl

In practice, the claim of balanced network ring rotation is indeed true. The rotating ring orders and delays remote accesses and makes the access path to each processor of the system almost equally busy in the presence of hot spots. The cause of the degradation is actually at the processor location itself. When a processor is receiving many requests to process, a queue is made of these requests so that it is able to eventually process them. Since the job of the CIUs (Cell Interconnect Units) of a processor is to extract the incoming requests and pass them to the CCUs (Cache Control Units) to act upon them, the queue will be located at the CCU. This obviously slows down the time for the cool processors that are trying to reference the cool variables located on the hot cache. Further, if a processor is receiving an overwhelming number of requests, that processor will eventually begin to ignore additional incoming requests (because the queue has become full, since the processor speed is not fast enough to satisfy all the requests in time). This forces the requesting processor(s) to reattempt, either retrieving the data item it needs or being inserted into the hot processor's queue. Either way causes a considerable delay in access time.

Table III lists the average timing results of the movable hot spot experiments. In general, movable hot spots reduce the access delay time for the different types of remote access in the presence of the hot spot. However, the overhead of moving data and the corresponding data item invalidations in the ring network is also significant. In comparison with the fixed hot spot experiments, Table III indicates that a movable hot spot slightly increases the access delay to cool variables in the cool cache modules due to heavier traffic caused by more data movement. Although we were not able to measure the access delay to cool variables in the hot cache module for the reasons stated above, we believe the delay would be about the same as the access delay to cool variables in the cool cache modules, based on our analytical and fixed hot spot experimental results.

TABLE III
VARIOUS REMOTE CACHE ACCESS TIME MEASUREMENTS (μs) IN THE PRESENCE OF A MOVABLE HOT SPOT ON THE KSRl

Another experiment to verify our modeling work is to see if a hot spot can affect remote readings among processors that are not involved in the process of generating the hot spot. Again, two rings were used for the experiment. The difference between this experiment and the previous one is that all of the processors contributing to generating the hot spot are on the same ring (the hot ring), while the processors on the other ring (the cool ring) are not involved in generating the hot spot. The hot spot is fixed to one processor within the hot ring. We varied the experiment by increasing the number of processors involved in the hot ring to generate the hot spot. We chose to use 50% and 97% of the available processors on the hot ring for these variations. Thus, for 50% usage of the processors on the hot ring there were also 16 processors that were not involved in generating the hot spot. At the same time, there were also 16 processors in use on the cool ring. These two sets of 16 processors were used to do remote readings of their counterpart processors, respectively. These remote readings were also unidirectional between the two rings during any run of the experiment. Thus, the 16 processors on the cool ring read the 16 processors on the hot ring during the hot spot activity in one run of the experiment. Then we reversed the remote reading and had the 16 processors on the hot ring read remotely the 16 processors on the cool ring during the hot spot activity in another run. The hot spot was generated either by reading a single variable or by reading a block of data. Thus, there were four different timings from the hot spot activity to be compared with the remote readings when there was no hot spot. For 97% usage of the processors on the hot ring, there was one available processor on the hot ring and one available processor on the cool ring. As shown in Tables IV and V for the different variations, the hot spot has very little effect on all the remote reads in this experiment. This additional experiment on the KSRl further strengthens our analysis that an HR-based architecture, such as the KSR machine, handles the activity of the hot spot efficiently.

TABLE IV
READING MEASUREMENTS (μs) WHEN THE HOT SPOT IS GENERATED BY 50% OF PROCESSORS ON THE HOT RING ON THE KSRl

TABLE V
READING MEASUREMENTS (μs) WHEN THE HOT SPOT IS GENERATED BY 97% OF PROCESSORS ON THE HOT RING ON THE KSRl
(columns: one-word read; one-, two-, and three-block reads)

From HotR to CoolR:  27.69   32.42   64.03   95.07
From CoolR to HotR:  27.47   32.74   65.46   96.23

C.2. Performance on the TC2000

We conducted experiments similar to those described in the previous section on the TC2000, where remote write operations are included. The hot spot experiments on the TC2000 have the following three features that differ from the ones run on the KSRl in terms of architecture support and memory systems. First, the hot spot can only be allocated in a fixed physical location. All memory accesses to the hot spot and cool memory modules are conducted through the Butterfly interconnection switching network. Hot spot experiments are performed by using remote reads and remote writes. Second, no global cache coherence protocols are supported in the system. Finally, network contention happens when two messages need to access the same portion of a path at the same time. The TC2000 supports a nonblocking switching network. When




TABLE VI
VARIOUS REMOTE MEMORY ACCESS TIME MEASUREMENTS (μs) IN THE PRESENCE OF THE HOT SPOT MEMORY ON THE TC2000

From Hot Memory: 106.87, 111.79, 510.96, 568.72, 1159.79, 1331.49
From Hot Memory: 115.28, 118.49, 3429.60, 3615.32, 6972.84, 7132.77
(partial rows; remaining row and column headers lost in extraction)

conflicts happen, the switch has all but the "first" message retreat to their sources and free up their paths. Each retreating message then selects an alternative route and, after a random delay, tries again.

In our experiments on the TC2000, 100 processors were used to generate the hot spot. There was one remote hot memory module and 11 remote cool memory modules. Table VI gives the measured access times on the TC2000. When there is a hot spot memory, remote accesses to cool memory were slowed down by a factor of three to four for all types of hot spot experiments. These results differ considerably from the results of similar experiments on the Butterfly I conducted by Thomas [13], where accesses to cool memory were slowed down only slightly. In addition, accesses to cool memory are slowed down by about the same amount as block transfer accesses to cool memory. As we briefly discussed in the previous section, the TC2000 has heavier inherent network contention than the Butterfly I because its faster processors generate more network traffic. Therefore, memory accesses are significantly delayed in the presence of the hot spot.

Accesses to the hot spot memory are substantially slower on the TC2000. These results are consistent with the results on the Butterfly I reported in [13]. This is because an access to the hot memory must compete with all the other accesses to the same memory at every switch stage from its source processor to the target processor. Our experiments show that a nonblocking network suffers large access time delays in the presence of the hot spot memory on the TC2000. The experiments also show that block transfer operations are substantially delayed on the TC2000 in the presence of memory hot spots. This is because a block transfer operation on the TC2000 will not release the sequence of switches along its path until the data transfer is done.

D. Comparative Hot Spot Effects on Synchronization Algorithms Between MIN and HR

D.1. Synchronization Algorithms

As we analyzed in Section II, an HR-based architecture takes advantage of higher bandwidth links, efficient broadcast, and the ordering properties of ring networks. Unlike in a MIN-based architecture, each processor in the KSR1 does not have direct

communication links to other processors except to one neighbor. All messages sent by a processor are loaded onto the slotted ring to be delivered to a target processor. Since rotating rings delay remote data access requests in order, this structure can significantly reduce network contention for programs with hot spots. To further confirm this, we compare the hot spot effects of the TC2000 and KSR1 architectures by measuring, comparing, and evaluating synchronization algorithms on the two machines.

• The Simple Lock and Its Variations. The simple lock consists of a globally shared lock allocated on a single processor with all accesses to it serialized. When a processor gets the lock, it sets the lock busy immediately and starts the atomic operations. An unlock is set by the processor immediately after the critical section work is finished. The rest of the processors have to busy-wait for the lock using remote memory accesses through an interconnection network. Using the atomic test-and-set operation, the simple lock and its variations can be constructed. Variations of the simple lock may be implemented by applying time delays between attempts to access the lock variable for each processor.

• Distributed Locks. In order to reduce contention in the network and in the memory module holding the lock, it would be more efficient to have each processor busy-wait only on a locally-accessible variable for the lock. A distributed algorithm decentralizes locks throughout the memory modules.

D.2. Performance on the TC2000 and the KSR1

Recently, new experimental results of spin-locks on both the TC2000 and the KSR1 were reported in [15]. While the simple spin lock with no delay maximized the number of remote accesses to a single data location (the lock) through the MIN network, the distributed lock algorithm minimized the number of remote accesses to that single location through the MIN network. However, on the KSR1 we have chosen the simple spin lock algorithms (using the gspnwt instruction) for comparison, instead of comparing the best overall timing to the worst overall timing. The reason for this choice is how the KSR1 implementation handled the ticket lock algorithm (the worst overall timing). As described in [15], the "ping pong" effect, which leads to the cache coherence problem, coupled with the "skipping over" phenomenon causes its poor performance. More important, the architecture of the KSR1 did not truly exploit the algorithm as it should have. We feel that using the timing of the ticket lock algorithm could be misleading in our comparison. In the algorithms using the simple lock with delay options, the KSR1 truly exploits the algorithms in its architecture. Therefore, for a fair comparison we have chosen to use the worst and the best timings among these simple lock algorithms, namely s-lock (no delay) and s-lock (exponential delay). Using a linear least squares approximation, the slope of the timing curves from the experiments presented in [15] can be calculated. Each slope gives an average time increment in μs or ms for the addition of each processor for each synchronization algorithm. Table VII



shows the slopes for the simple lock (with no delay and exponential delay) and the distributed lock for the TC2000 (where applicable) and the KSR1 (where applicable). The ratios are calculated for the hot spot effect comparison between the TC2000 and the KSR1. The ratio is between the slope of the worst algorithm timing and the best algorithm timing for each machine. The ratios also quantitatively represent the increase of network contention generated by the hot spot of the simple spin lock: the higher the ratio, the higher the network contention. Comparing the ratios for the two architectures clearly indicates that the KSR1 architecture is significantly less sensitive to the hot spot in synchronization programs than the TC2000. Comparing the ratios of the two architectures, the KSR1 would have about 14 times (ratio(TC2000) / ratio(KSR1)) less network contention generated in its architecture for the spin-lock synchronization hot spot.
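The slope extraction is an ordinary least-squares fit of measured execution time against processor count; a minimal sketch (the sample data used below is made up for illustration, not the published measurements):

```c
#include <stddef.h>

/* Ordinary least-squares slope of y against x:
   slope = (n*sum(x*y) - sum(x)*sum(y)) / (n*sum(x*x) - sum(x)^2).
   Here x is the number of processors and y the measured lock time;
   the slope is the average time increment per added processor. */
double ls_slope(const double *x, const double *y, size_t n)
{
    double sx = 0.0, sy = 0.0, sxy = 0.0, sxx = 0.0;
    for (size_t i = 0; i < n; i++) {
        sx  += x[i];
        sy  += y[i];
        sxy += x[i] * y[i];
        sxx += x[i] * x[i];
    }
    return ((double)n * sxy - sx * sy) / ((double)n * sxx - sx * sx);
}
```

The per-machine ratio is then simply the slope of the worst algorithm's timing curve divided by the slope of the best algorithm's curve.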

TABLE VII
COMPARISONS OF THE LOCK EXECUTION TIME SLOPES ON THE TC2000 AND THE KSR1
(table body lost in extraction)

IV. SUMMARY

We have conducted analytical and experimental studies on remote memory accesses and network contention in the presence of memory hot spots. Experiments were performed on the TC2000, a MIN-based system, and on the KSR1, an HR-based system. In both MIN-based and HR-based systems, a processor can access shared cache/memory space at different distances with different timing costs. For example, the TC2000 provides a local/remote memory access model, while the KSR1 provides a local/level-0-ring/level-1-ring cache access model. This type of shared-memory architecture is defined as a Non-Uniform Memory Access (NUMA) multiprocessor system. A potential problem of parallel processing with non-uniform memory access, defined as the NUMA problem, is that a user has to explicitly manage processor locality, the best use of the processor resources, and the reduction of different contentions in the execution of a program. A "hot spot" in a parallel program is an important source of degraded execution performance on NUMA architectures. In this study we investigated the following hot spot performance problems and their evaluation results based on analytical models and experiments:

• Analytical models of remote access delay in the presence of memory hot spots on a nonblocking MIN-based architecture are presented. Our analytical study indicates that an access delay to cool memory modules may be considerably smaller than an access delay to the hot memory. However, accesses to cool memory modules can be significantly delayed on a system with fast processors connected by a relatively slow MIN, where the chances that messages used to access cool memory modules collide with the messages used to access the hot spot are potentially high. The tree saturation described in [11] showed that very slight non-uniformities in memory access patterns can lead to severely degraded performance for the entire system, including processor nodes that avoid accessing the hot spot. This may not apply to a nonblocking MIN-based architecture.

• In an HR-based architecture, a slotted ring orders and delays remote data access requests. This structure may naturally reduce network contention for programs with hot spots. Our analytical models indicate that ring network transactions will be reduced by no more than 50% in the presence of memory hot spots. This analytical performance result has not been previously reported. Analysis also indicates that in the presence of hot spots, overall ring traffic will be heavier but it will be distributed evenly in the ring network. Both analytical results have been verified by the experiments on the KSR1.

• Although there is no evidence that the tree saturation phenomenon occurs in the experiments on the TC2000, remote accesses to both hot and cool memory modules are considerably slowed down, and performance may be significantly degraded. Our experimental results on the TC2000 show that the experimental results on the Butterfly I in [13] do not generalize to all nonblocking MIN-based architectures.

• In an HR-based architecture, a hot spot variable can be either fixed in a physical memory module or movable among the processor nodes. A movable hot spot may reduce network contention in general because accesses to the shared variable are distributed among the memory modules in the whole system. However, the overhead of moving data and cache coherence operations, such as data item invalidations, may bring extra costs. Our experiments on the KSR1 show that a movable hot spot incurs higher memory latency than a fixed hot spot. We also show that block transfer operations on the KSR1 are much more efficient than those on the TC2000 in the presence of memory hot spots. This is because a block transfer operation on the TC2000 does not release the sequence of switches along its path until the data transfer is done, while a block transfer operation on the KSR1 breaks the block into a set of units and sends them one by one whenever a slot is available.

• Two major comparisons between the two network architectures are as follows. (1) A high access request rate is a major factor in memory latency on an HR-based network, while the presence of hot spots is a major factor on a MIN-based network. (2) In the presence of hot spots, network traffic distribution can be much more balanced in an HR-based architecture than in a MIN-based architecture.

Our current work includes proposing and examining efficient hardware modifications and system software on both nonblocking MIN-based and HR-based architectures to reduce hot spot effects and to support cache coherence and processor locality management.



ACKNOWLEDGMENTS

We appreciate G. Butchee’s careful reading of the manuscript and constructive comments. We wish to thank the anonymous referees for their helpful comments and suggestions.

This work is supported in part by the National Science Foundation under grants CCR-9102854 and CCR-9400719, by the U.S. Air Force under research agreement FD-204092-64157, by the Air Force Office of Scientific Research under grant AFOSR-95-01-0215, and by a fellowship from the Southwestern Bell Foundation. Part of the experiments were conducted on the BBN TC2000 at Lawrence Livermore National Laboratory, and on the KSR1 machines at Cornell University and at the University of Washington.

REFERENCES

[1] A. Agarwal et al., "APRIL: A processor architecture for multiprocessing," Proc. 17th Int'l Symp. on Computer Architecture, 1990, pp. 104-114.
[2] T.E. Anderson, "The performance of spin-lock alternatives for shared-memory multiprocessors," IEEE Trans. on Parallel and Distributed Systems, vol. 1, no. 1, 1990, pp. 6-16.
[3] BBN Advanced Computer Inc., Inside the TC2000.
[4] L.A. Barroso and M. Dubois, "The performance of cache-coherent ring-based multiprocessors," Proc. 20th Int'l Symp. on Computer Architecture, IEEE Computer Society Press, Apr. 1993, pp. 268-277.
[5] L.N. Bhuyan, D. Ghosal, and Q. Yang, "Approximate analysis of single and multiple ring networks," IEEE Trans. on Computers, vol. 38, no. 7, July 1989, pp. 1,027-1,040.
[6] E. Gelenbe, Multiprocessor Performance, John Wiley and Sons, 1989.
[7] E. Hagersten, A. Landin, and S. Haridi, "DDM - a cache-only memory architecture," IEEE Computer, Sept. 1992, pp. 44-54.
[8] Kendall Square Research, KSR1 Technology Background.
[9] D. Lenoski et al., "The DASH prototype: Logic overhead and performance," IEEE Trans. on Parallel and Distributed Systems, vol. 4, no. 1, 1993, pp. 41-61.

[10] J.M. Mellor-Crummey and M.L. Scott, "Algorithms for scalable synchronization on shared-memory multiprocessors," ACM Trans. on Computer Systems, vol. 9, no. 1, 1991, pp. 21-65.

[11] G.F. Pfister and V.A. Norton, "'Hot spot' contention and combining in multistage interconnection networks," IEEE Trans. on Computers, vol. 34, no. 10, pp. 943-948, 1985.

[12] P. Stenström, T. Joe, and A. Gupta, "Comparative performance of cache-coherent NUMA and COMA architectures," Proc. 19th Int'l Symp. on Computer Architecture, 1992, pp. 80-91.

[13] R.H. Thomas, "Behavior of the Butterfly parallel processor in the presence of memory hot spots," Proc. 1986 Int'l Conf. on Parallel Processing, pp. 51-58, 1986.

[14] Y. Yan and X. Zhang, "Performance modeling and analysis of MIN-based and HR-based networks," Technical Report, High Performance Computing and Software Laboratory, University of Texas at San Antonio, Dec. 1993.

[15] X. Zhang, R. Castañeda, and W.E. Chan, "Spin-lock synchronization on the Butterfly and KSR-1," IEEE Parallel and Distributed Technology, vol. 2, Spring issue, 1994, pp. 51-63.

[16] X. Zhang and X. Qin, "Performance prediction and evaluation of parallel processing on a NUMA multiprocessor," IEEE Trans. on Software Engineering, vol. 17, no. 10, 1991, pp. 1,059-1,068.

[17] X. Zhang and Y. Yan, "Comparative modeling and evaluation of CC-NUMA and COMA systems in hierarchical rings," to appear in IEEE Trans. on Parallel and Distributed Systems.

Xiaodong Zhang received the BS degree in electrical engineering from Beijing Polytechnic University, China, in 1982, and the MS and PhD degrees in computer science from the University of Colorado at Boulder in 1985 and 1989, respectively.

He is an associate professor of computer science and director of the High Performance Computing and Software Laboratory at the University of Texas at San Antonio. He has held research and visiting faculty positions at Rice University and Texas A&M University. His research interests are

parallel and distributed computation, parallel architecture and system per- formance evaluation, and scientific computing.

Zhang has served on the program committees of several conferences, and is the program chair of the Fourth International Workshop on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS’96). He also currently serves on the editorial board of Parallel Computing, and is an ACM National Lecturer.

Yong Yan is a PhD student of computer science at the University of Texas at San Antonio. He received the BS and MS degrees in computer science from Huazhong University of Science and Technology, Wuhan, China, in 1984 and 1987, respectively. He has been a faculty member there since 1987. He was a visiting scholar in the High Performance Computing and Software Laboratory at UTSA from 1993 to 1995. Since 1987, he has published extensively in the areas of parallel and distributed computing, performance evaluation, operating systems, and algorithm analysis.

Robert Castañeda is a PhD student of computer science at the University of Texas at San Antonio. He received the BS and MS degrees in computer science from the same university in 1990 and 1994, respectively. He was a recipient of Southwestern Bell Foundation Graduate Fellowships from 1993 to 1995 and won the 1994 University Life Award for academic performance in his graduate study. He was a research associate at the High Performance Computing and Software Laboratory at UTSA from 1994 to 1995. His research interest is in the areas of performance

evaluation of parallel and distributed architectures and systems.