IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 19, NO. 10, OCTOBER 2011 1861
On the Use of Simple Electrical Circuit Techniques for Performance Modeling and Optimization in VLSI Systems
G. Hazari and H. Narayanan
Abstract: Leading-edge VLSI systems, essentially multi-processor systems-on-chip, have a wide range of components integrated together and operating in unison. They can be analyzed as flow networks in which the system performance depends on the bandwidth, transmission time, and queueing delay characteristics of the individual components, their connectivity and interactions, as well as the traffic patterns they encounter. The flow in various parts of the system must ideally be distributed so as to extract the maximum throughput possible with minimum end-to-end delays. Such an ideal distribution for flow networks has previously been obtained using simple electrical circuits. We demonstrate a similar methodology for typical VLSI systems and provide the necessary extensions of the theory. We empirically validate the methodology using a cycle-accurate simulation model as the reference. We find this methodology to supply better distributions in the average case and comparable distributions in the worst case as compared to standard search procedures such as random sampling and simulated annealing. The real strength is that it provides a speedup of several orders of magnitude, i.e., 3–5 orders in our experiments. Thus it is an elegant means for analyzing and optimizing the flow in VLSI systems, which can easily be incorporated into design procedures, compilers and on-chip modules for real-time allocations.
Index Terms: Analytical models, circuits, system performance.
I. INTRODUCTION
VLSI systems for high performance applications such as
networking, communications, servers, multimedia, and
gaming, among others are nowadays being built as multipro-
cessor systems-on-chip (SoC) [2]–[5]. Even general purpose
computers are being built with multiple processor cores [6] and
have similar levels of complexity. These systems integrate a
large number of processors, memories, input/output interfaces,
interconnects, and application specific hardware onto the same
chip. In order to meet the performance demands, the architecture contains multiple instantiations of each of the components which are made to operate in a highly parallel and pipelined
manner.
Such systems can be visualized as shown in Fig. 1. The
number of processors is expected to increase manifold in the
near future [6]. The memories include register files, caches,
Manuscript received January 14, 2010; revised April 23, 2010; accepted July 07, 2010. Date of publication August 23, 2010; date of current version August 10, 2011. This work was supported by the sponsored project “VLSI Consortium at I.I.T. Bombay”.
The authors are with the Department of Electrical Engineering, I.I.T. Bombay, Bombay 400076, India (e-mail: [email protected]; [email protected]).
Digital Object Identifier 10.1109/TVLSI.2010.2060502
Fig. 1. Visualization of multiprocessor SoC.
scratch pads, SRAMs, and DRAMs. As the number of compo-
nents is increasing, the interconnect sub-systems are evolving
towards elaborate networks with their own protocols [7]–[10].
Reconfigurable topologies are also available [8]–[10].
There is a tremendous performance demand on the memory
and interconnect sub-systems [4]–[6], [8]–[10]. They are known
to be performance bottlenecks in terms of both their bandwidth,
i.e., maximum service rate supported, and the access or trans-
mission delays [5], [6], [8]–[11]. The situation is expected to
worsen since memory performance is not scaling as fast as that of the processors [5], [6], [11]. The memory and interconnect
sub-systems are often designed together while trading off per-
formance against chip area and energy consumption [12]–[15].
The design process requires performance evaluation tools and
optimization procedures. The most popular practice is to use
cycle accurate simulators for the evaluation and exploratory pro-
cedures for the optimization [9], [10], [12], [13].
We propose an alternative strategy for certain steps in the de-
sign process, wherein we view the system as a flow network.
We first separate out the components that generate activity and
those that simply respond to the activity generated by others.
The network is composed of the second category, and we refer
to the first category as its generators. The memories and interconnects go into the network. The hardware units typically go
into the network. In some cases they may also generate activity,
but we neglect such cases for the present discussion. The gen-
erators are primarily the processors and input/output interfaces.
Thus we visualize the system as shown in Fig. 2.
The traffic is of various types, which include memory accesses, application data that is transferred between processors
or between a processor and an interface, as well as informa-
tion pertaining to the management protocols in the system. Each
traffic unit follows a particular path through the network which
is defined by its source and destination. In the case of a memory
access, the destination is a particular memory and either the data
1063-8210/$26.00 © 2010 IEEE
Fig. 2. Visualizing the system as a flow network.
Fig. 3. Example of a flow network.
or control information is returned to the source. The choice of
memory depends on the data object accessed and the memory al-
location which assigns each data object to a physical memory.
If there are multiple paths between a source-destination pair, the
path allocation determines which one is taken. It may either be
fixed before entry or decided along the way by considering the
system state. Both the memory and path allocations have an im-
pact on the traffic rate and delays through the network.
In general, the throughput of the network, i.e., traffic rate supported, as well as transmission delays depend on the network architecture, traffic patterns and the allocations. The architecture places an upper bound on the throughput and a lower bound on
the delay for every path. For example, the total bandwidth across
the memories is an upper bound for the rate at which memory
accesses can be serviced, and the total bandwidth of the inter-
connects is an upper bound for the rate across all traffic. The
connectivity constraints reduce the bounds further. For example
when many memories are connected to a single interconnect
the access rate may get limited by the interconnect. The actual
throughput may still be lower than the upper bound depending
on the application design, allocations [14], [15], and manage-
ment protocols. This typically happens when the traffic flow gets
concentrated only in certain parts of the network. Similarly the
latency or minimum delay along any path depends only on the
architecture parameters. The actual delays also include the time
spent waiting in queues for busy resources to become available.
Once again the application design and allocations have a signifi-
cant impact on the traffic patterns and thus the actual delays [8],
[9].
In order to estimate the throughput and delays for a partic-
ular system, it can be modeled as a flow network [16]. An ex-
ample is shown in Fig. 3. Flow networks are built up of nodes
and branches. Each branch connects two nodes and has the following characteristics: the maximum flow permitted and the cost incurred as a function of the flow. Two types of optimization problems are generally solved for flow networks: 1) the maximum flow problem which determines the maximum flow
possible through a given network and 2) the minimum cost flow
problem which determines the flow distribution that minimizes
the total cost in the network when a specified amount of flow is
injected into it. When the cost is a linear or quadratic function of the flow,
the minimum cost flow problem can be solved using linear or
quadratic programming, respectively [1].
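To make the first of these two problems concrete, the following minimal Python sketch computes a maximum flow with the standard Edmonds-Karp augmenting-path algorithm. This is a conventional graph algorithm and not the circuit-based technique of [1]; the function name `max_flow`, the node labels, and the capacities are hypothetical and chosen only for illustration.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp maximum flow on a dict-of-dicts capacity graph."""
    # Build residual capacities, adding zero-capacity reverse edges.
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            residual.setdefault(v, {}).setdefault(u, 0)
    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total
        # Recover the path, find its bottleneck, and push that much flow.
        path = []
        v = sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        total += bottleneck


# Hypothetical system: one processor "p", a bus "b", two memories, a sink "t".
example = {
    "p": {"b": 5},
    "b": {"m1": 3, "m2": 4},
    "m1": {"t": 3},
    "m2": {"t": 4},
}
print(max_flow(example, "p", "t"))  # 5
```

In this illustrative network the bus (capacity 5) limits the flow even though the two memories together could accept 7, which mirrors the connectivity constraint described earlier.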
In [1], flow networks are converted to electrical circuits to solve these two problems. The power of this approach is that
electrical devices can cover arbitrary cost functions as well.
Thus we propose that the same technique be applied to VLSI
systems, with data transfer rates being equated to flows and the
delays being equated to the cost. We argue that the queueing
delay functions are unlikely to be either linear or quadratic since
they are not for the M/M/1, M/G/1 and M/D/1 queues [17]. This
transformation is convenient since a number of software packages are available, both commercially and freely, for solving electrical circuits. We use a dc circuit solver developed in-house [35]
for our purposes. Other solvers such as SPICE may be used
just as well. However with commercial solvers, the range of
queueing delay functions that can be modeled is restricted by the electrical devices available. If the source code is accessible,
additional devices having the required characteristics may be
added. Since the networks are being solved only in software,
hypothetical devices having arbitrary characteristics can also be
used. It is not essential for a physical device with the same char-
acteristics to exist.
On mapping the flows to currents and the cost functions to
voltage drops [1], an electrical circuit can be constructed to
find the maximum flow or the minimum cost flow distribution
in such networks. The branch currents automatically distribute
themselves so as to attain the throughput upper bound and minimize the mean delay. This distribution indicates how the traffic in the original system should ideally be distributed. We shall
refer to the circuit that models the flow network for a particular
VLSI system as the electrical flow-delay model or the flow-delay
model in short.
An interesting aspect of this approach is that a simpler elec-
trical circuit models a more complicated one. The currents and
voltages in the model have no relation to those in the actual cir-
cuit, rather they are related to the flow rates and delays. Since
electrical network analysis is a well established field, there are
optimized procedures for solving the circuits. In principle, we
can directly apply the required procedures to the design of VLSI
systems. The visualization as an electrical circuit is not abso-
lutely essential, but there is an advantage in linking the two
domains. First, we get a better understanding of the analysis
and procedures, and second, as the techniques and software tools
continue to evolve they can be applied directly. Thus if em-
pirical studies can confirm a correspondence between the two
domains, then there are significant benefits in modeling VLSI
system performance using simple electrical circuits. The elec-
trical flow-delay model can then be integrated into the design
flow for VLSI systems.
The flow-delay model can replace the simulation models and
exploratory procedures in certain steps of the design flow. It can
be solved much faster and can potentially lead to better designs
within shorter time budgets. However, we need to understand the approximations it makes, which are essentially as follows.
1) The flow rate and delays are modeled as continuous quan-
tities, whereas both of them are discrete in VLSI systems.
For example, each memory access or data packet is a
single unit that travels together. The circuit currents do not
capture this effect. Second, VLSI chips are synchronous
systems in which all delays are multiples of the system
clock. Simple electrical devices do not have discrete current-voltage characteristics.
2) Non-deterministic delays such as those encountered in
caches and DRAMs with row-buffering must be modeled
either as their average values or separately for all cases.
Such approximations can potentially introduce large inac-
curacies with the underlying system.
In addition, the approach has two fundamental limitations: 1)
it requires that the queueing delay functions be characterized in
advance and 2) it can only give the ideal distribution. Additional
procedures are required while optimizing the memory and path
allocations. Constraints arise because each data object must reside in a unique physical memory, otherwise the overheads incurred to maintain consistency must also be taken into account. For some applications, the ideal distribution may not actually be
realizable. The simplest example of such a situation is when the
proportion of accesses to one data object is larger than the pro-
portion of flow through each of the memories. Further, the ac-
cess patterns may be bursty due to which the allocation which
comes closest to the ideal distribution may not always be the
best one.
The above issues must be kept in mind when integrating the
flow-delay model into the design methodology. It is well suited
for the following steps in the design process. First, it can evaluate the maximum throughput possible during architecture exploration and application design. Since the flow-delay model offers a substantial speedup over simulations, a larger number of
candidates can be evaluated within similar time budgets. How-
ever, due to the approximations one should not solely rely on
this model when the differences in performance predictions are
small. Second, it provides the ideal distribution for the memory
and path allocations. When these are done at compile time there
is sufficient time to employ intensive search procedures such as
simulated annealing and others. These procedures will almost
always find better solutions but the flow-delay model can poten-
tially improve their speed by providing a good initial allocation.
The speedup provided by the flow-delay model can make a sig-
nificant contribution when the allocations are done in real-time,
which becomes relevant when there are reconfigurable inter-
connects [18]–[20] or when the traffic patterns change signif-
icantly with the inputs. We note that when used in real-time, the
flow-delay model would have to be solved on the chip itself, but
the systems in use today do have processors that are capable of
doing so. The experiments in this study are designed to gauge
such contributions in the design process.
The objective of this paper is to develop the modeling
methodology and conduct empirical studies to validate it.
The organization is as follows. We compare this methodology
against existing analytical modeling strategies in Section II.
We develop the methodology in Section III and also extend
the ideas in [1] to cover arbitrary queueing delay functions. In Section IV, we model the Intel IXP2800 architecture [3] and
conduct a preliminary validation. In Section V, we conduct a
more thorough empirical validation including a comparison
with random sampling and simulated annealing. We find that
the flow-delay model indeed provides good distributions along
with a substantial speedup.
II. POPULAR ANALYTICAL MODELING TECHNIQUES
The flow-delay model provides an analytical technique for
modeling and optimizing performance in large distributed sys-
tems. In this section, we look at other analytical modeling tech-
niques and discuss how they compare with it.
The most popular technique is queueing theory, which was
conceived for manufacturing processes [21]–[24] and later ap-
plied to VLSI systems [25]. Queueing theory is a good means for
modeling service time and waiting time distributions for com-
ponents with queues, and then determining the sojourn time (or
transmission delay) distributions for a network of components
and queues. It is well suited for analyzing a given flow distribution, and for comparing dynamic allocation or scheduling policies when the path taken by each job is not fixed a priori.
However when it comes to finding the optimal flow distribu-
tion for a static allocation, additional algorithms and heuristics
are required that are not well suited for highly distributed sys-
tems [22], [23], [25]. The optimization is far simpler for systems
with up to two components or when all components have iden-
tical service time distributions [22]. The optimal distribution can
be obtained by analytical means for certain networks [26]; however, the number of such cases is very limited. The advantage
of the flow-delay model is that it has a naturally associated pro-
cedure for finding the optimal static flow distribution. Further,
queueing theory often assumes Poisson arrivals to enable the analyses [22]. The analysis becomes difficult when the arrival
traffic stresses the network and further becomes infeasible for
general networks. In contrast the flow-delay model easily cap-
tures stressed arrivals and its solution procedure is not limited
by the complexity of the network.
Queueing theory has wider application than the flow-delay
model in terms of first modeling the distributions for through-
puts and delays rather than just the mean values, second being
able to analyze any given flow distribution and a variety of allo-
cation or scheduling policies, and third providing more informa-
tion such as expected queue lengths. The flow-delay model can
capture only a limited set of performance aspects and is well
suited for exactly two purposes: obtaining the maximum possible network throughput and a good flow distribution. However, it has inherent advantages for each of these. The two techniques
can also be used in conjunction since the flow-delay model re-
quires the delay versus flow characteristics of individual com-
ponents, which can be obtained conveniently through queueing
theory.
Another pair of similar techniques is real-time calculus [27],
[28] and network calculus [29]. They provide an elegant means
for modeling the throughput and delays, in terms of lower and
upper bounds for individual components, and then determining
the same for a network. However, once again there is no elegant
procedure for optimizing the flow distribution. The advantages and disadvantages are similar to those with queueing theory.
Fig. 4. Current limiting circuit.
Recent research has also used concepts from thermodynamics
and statistical physics to model complex VLSI systems [30].
These techniques are well suited for evaluating a particular de-
sign with a specified flow distribution, and then determining the
queue capacities required at various points in the system. Opti-
mizing the flow distribution remains a difficult task.
Large amounts of research have been directed towards sim-
plifying the performance evaluation for specific systems into an-
alytical models [30], [31]. The subsequent optimization is done
either using intensive search algorithms or heuristics, such as
greedy ones, simulated annealing and genetic algorithms [30], or using linear programming, integer linear programming, or
quadratic programming wherever applicable [30], [32]. It is in-
feasible for search algorithms or heuristics to cover the entire de-
sign space, which is typically very large due to its combinatorial
nature [29]. In this regard, the second class of models is better,
however they are likely to be less accurate. Also the possibility
of expressing the performance metric as a linear or quadratic
function of the design variables exists in a limited number of
cases. Note that when all queueing delays are linear or quadratic
functions of the flow rate, the flow-delay model can also be
solved using linear or quadratic programming. Thus under these
conditions it can be considered to be in that class of analytical techniques.
To summarize this comparison, the flow-delay model can capture a very limited set of performance features as against other techniques; however, it has an optimization procedure associated
with it which makes it a stronger candidate for exactly two steps
in the design process. In the next section we proceed with de-
veloping the flow-delay modeling technique.
III. FLOW-DELAY MODELING METHODOLOGY
The basic premise is that we are going to convert data flows
in VLSI systems to electrical currents, and delays to voltage drops. We use simple electrical circuits to model the system
components and then connect them together as in the architec-
ture. We organize this section as follows. In Section III-A, we
present the circuit models for individual components along with
the formulation for generalized delay versus flow functions. In
Section III-B, we describe the network construction.
A. Component Models
The models must essentially capture two features: the max-
imum flow rate and the relationship between delay and flow.
We model the maximum rate using a current source of value $I_{max}$ in parallel with an ideal diode (which has characteristics such that it carries no current when reverse biased and has no voltage drop when forward biased), as shown in Fig. 4. A diode also ensures that the current flows only in the forward direction. Thus the current $i$ satisfies $0 \le i \le I_{max}$.
Fig. 5. Circuit model for a general component.
Fig. 6. Component with no queueing delay.
Fig. 7. Component with a linear queueing delay function.
We model the delay function by adding an appropriate device
as indicated in Fig. 5. The simplest function is a constant one,
which can be modeled as a voltage source. The next in the pro-
gression is a linear function which can be modeled as a voltage
source in series with a resistor. The device characteristics must be selected keeping the following point in mind. The current
distribution in the resulting network is governed by Kirchhoff’s
laws, and satisfies the following property [1]: It minimizes the
sum of the power in the voltage sources and half the power in
the resistors.
Since we intend to minimize the mean delay through the net-
work, we select the values of the voltage sources to be equal
to the constant component and the values of the resistors to be
twice the slope of the delay function. We elaborate on these two
cases next.
Consider a component with a throughput or flow limit, a latency which is the minimum delay through it, and no additional queueing delay. This component is modeled as shown in Fig. 6. The value of the current source $I_{max}$ is selected to be equal to the throughput limit and that of the voltage source $V$ equal to the minimum delay. Note that the actual current through this circuit can be less than $I_{max}$ and the voltage drop can be greater than $V$.
Next consider a component having queueing delay that in-
creases linearly with the flow. This is modeled as in Fig. 7. The
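resistor value $R$ corresponds to a minimum queueing delay of 0 and a maximum of $R I_{max}/2$, where $I_{max}$ is the flow limit, since the resistor is twice the slope of the delay function.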
Now let us illustrate a system with two components operating
in parallel. If both of them have constant delay functions, the
network is as shown in Fig. 8. Let the current entering be $I$, which splits up as $i_1$ and $i_2$. Let the voltage source values be $V_1$ and $V_2$ and the current limits be $I_{max,1}$ and $I_{max,2}$. Let us assume $V_1 < V_2$; then the current distribution is as follows.
• If $I \le I_{max,1}$ then $i_1 = I$, $i_2 = 0$ and a diode in the second branch is reverse biased, which carries the voltage drop required for Kirchhoff’s voltage law to be satisfied.
• If $I_{max,1} < I < I_{max,1} + I_{max,2}$ then $i_1 = I_{max,1}$, $i_2 = I - I_{max,1}$ and a diode in the first branch is reverse biased.
• If $I = I_{max,1} + I_{max,2}$ then $i_1 = I_{max,1}$ and $i_2 = I_{max,2}$ while diodes in both branches are reverse biased.
Fig. 8. Current distribution: Example-1.
Fig. 9. Current distribution: Example-2.
It is straightforward to see that this distribution minimizes the
power in the voltage sources.
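The constant-delay case amounts to filling the branches in order of increasing delay until each reaches its limit. The following Python sketch illustrates this for any number of parallel components; the function name `distribute_constant_delay` and the numerical values are hypothetical, and the sketch reproduces the expected outcome directly rather than solving a circuit.

```python
def distribute_constant_delay(total_flow, branches):
    """Fill parallel branches in order of increasing constant delay.

    branches: list of (max_flow, delay) pairs, one per component.
    Returns the per-branch flows in the original branch order.
    """
    capacity = sum(limit for limit, _ in branches)
    if total_flow < 0 or total_flow > capacity:
        raise ValueError("injected flow must lie between 0 and the total capacity")
    flows = [0.0] * len(branches)
    remaining = float(total_flow)
    # Visit the cheapest (lowest-delay) branch first.
    for index in sorted(range(len(branches)), key=lambda k: branches[k][1]):
        limit, _ = branches[index]
        flows[index] = float(min(remaining, limit))
        remaining -= flows[index]
    return flows


# Two components: limits 3 and 4, constant delays 1.0 and 2.0 (illustrative values).
print(distribute_constant_delay(5, [(3, 1.0), (4, 2.0)]))  # [3.0, 2.0]
```

The low-delay branch saturates first and only the excess flow is routed through the slower one, matching the three cases listed above.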
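If both components have linear delay functions, the network is as shown in Fig. 9. Let the entering current $I$ split up as $i_1$ and $i_2$, and let the two branches have voltage sources $V_1 < V_2$, resistors $R_1$ and $R_2$, and current limits $I_{max,1}$ and $I_{max,2}$. When $I$ is small, $i_1 = I$ and $i_2 = 0$, which holds until $V_1 + R_1 I = V_2$. As $I$ increases beyond this value, there is a range of values where $0 < i_1 < I_{max,1}$ and $0 < i_2 < I_{max,2}$. Then comes a point beyond which either $i_1 = I_{max,1}$ or $i_2 = I_{max,2}$. If the first component saturates first, then $i_1 = I_{max,1}$ will be reached and $i_2 = I - I_{max,1}$. The situation in the converse condition is similar.
Let us look at the intermediate range more closely. Here, Kirchhoff’s voltage law gives $V_1 + R_1 i_1 = V_2 + R_2 i_2$ with $i_2 = I - i_1$. Let $P$ denote the sum of the power in the voltage sources and half the power in the resistors. Then
$$P = V_1 i_1 + V_2 (I - i_1) + \tfrac{1}{2} R_1 i_1^2 + \tfrac{1}{2} R_2 (I - i_1)^2 \tag{1}$$
$$\frac{dP}{di_1} = V_1 - V_2 + R_1 i_1 - R_2 (I - i_1) \tag{2}$$
$$\frac{d^2 P}{di_1^2} = R_1 + R_2 \tag{3}$$
Thus $dP/di_1 = 0$ is equivalent to solving Kirchhoff’s voltage law and since $d^2P/di_1^2 > 0$ it is a local minimum. This example clearly demonstrates the minimization property of Kirchhoff’s laws, for both voltage sources and resistors.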
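The minimization property in the intermediate range can be checked numerically. The sketch below solves Kirchhoff’s voltage law for two parallel branches, each a voltage source in series with a resistor, and confirms that the resulting split has a lower value of the source power plus half the resistor power than nearby splits. The function names and the numerical values are hypothetical, and the current limits are assumed not to be binding.

```python
def split_linear(I, V1, R1, V2, R2):
    """Interior split: solve V1 + R1*i1 = V2 + R2*(I - i1) for i1."""
    i1 = (V2 - V1 + R2 * I) / (R1 + R2)
    return i1, I - i1


def half_power(i1, I, V1, R1, V2, R2):
    """Source power plus half the resistor power for a given split."""
    i2 = I - i1
    return V1 * i1 + V2 * i2 + 0.5 * R1 * i1 ** 2 + 0.5 * R2 * i2 ** 2


# Illustrative values: V1 = 1, R1 = 2, V2 = 2, R2 = 4, entering current I = 2.
params = (1.0, 2.0, 2.0, 4.0)
i1, i2 = split_linear(2.0, *params)
print(i1, i2)  # 1.5 0.5

# Both branches show the same voltage drop, and perturbing the split raises P.
best = half_power(i1, 2.0, *params)
print(best < half_power(i1 + 0.1, 2.0, *params))  # True
print(best < half_power(i1 - 0.1, 2.0, *params))  # True
```

With these values both branches drop 4 units of voltage, so the operating point satisfies Kirchhoff’s voltage law and is also the minimizer of the quadratic in (1).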
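The focus of the rest of this section is to extend the property to general devices so that we can model any arbitrary delay-flow function. Consider an electrical network with $n$ nodes and $b$ branches. Let $\mathbf{i} = (i_1, \ldots, i_b)^T$ be the vector of branch currents, $\mathbf{v} = (v_1, \ldots, v_b)^T$ the vector of branch voltages, and $\mathbf{v}_n$ the node voltages. Let $A$ be the incidence matrix and $B$ be the fundamental circuit matrix (see footnote 1). Then Kirchhoff’s laws can be expressed as
$$\text{Kirchhoff's Current Law:} \quad A \mathbf{i} = \mathbf{0} \tag{4}$$
$$\text{Kirchhoff's Voltage Law:} \quad B \mathbf{v} = \mathbf{0} \tag{5}$$
The currents in the network can also be expressed as the current circulating in each loop of the fundamental circuit matrix. Let this vector be $\mathbf{i}_l$. The following equation relates the branch and loop currents:
$$\mathbf{i} = B^T \mathbf{i}_l \tag{6}$$
where $B^T$ is the transpose of the fundamental circuit matrix (see footnote 2). Thus the rows of $B$ form a basis for the vector space of branch currents that satisfies Kirchhoff’s current law.
Let us separate the branches containing current sources from those that do not. Let the number of branches without current sources be $m$. Let us split the incidence matrix vertically into $A_s$ and $A_r$ to correspond to the current sources and the other branches, respectively. Similarly let us split the current vector into $\mathbf{i}_s$ and $\mathbf{i}_r$. Then Kirchhoff’s current law gives
$$A_s \mathbf{i}_s + A_r \mathbf{i}_r = \mathbf{0} \tag{7}$$
Let the current and voltage vectors that satisfy Kirchhoff’s laws be $\mathbf{i}^*$ and $\mathbf{v}^*$. Let $\mathbf{i}^*$ be split as $\mathbf{i}_s$ and $\mathbf{i}_r^*$. Let us assume that Kirchhoff’s laws minimize a quantity that can be expressed as
$$G(\mathbf{i}_r) = \sum_{k=1}^{m} g_k(\mathbf{i}_r) \tag{8}$$
where $\mathbf{g} = (g_1, \ldots, g_m)^T$ and $g_k$ is a property of the device along the respective branch. In general $g_k$ can depend on the entire current vector, however in our case it depends only on $i_k$. Thus $g_k(\mathbf{i}_r) = g_k(i_k)$. Our objective for the rest of this section is to derive $g_k$ in terms of the current-voltage characteristics of the devices.
Since $\mathbf{i}_r^*$ minimizes $G$, $G(\mathbf{i}_r^* + \Delta\mathbf{i}_r) \ge G(\mathbf{i}_r^*)$ for every small perturbation $\Delta\mathbf{i}_r$ that keeps Kirchhoff’s current law satisfied, which gives
$$\mathbf{1}^T \mathbf{g}(\mathbf{i}_r^* + \Delta\mathbf{i}_r) \ge \mathbf{1}^T \mathbf{g}(\mathbf{i}_r^*) \tag{9}$$
where $\mathbf{1}$ is the vector of all ones. $\mathbf{g}$ can be approximated using Taylor’s expansion to give the following:
$$\mathbf{g}(\mathbf{i}_r^* + \Delta\mathbf{i}_r) = \mathbf{g}(\mathbf{i}_r^*) + J \, \Delta\mathbf{i}_r + O(\|\Delta\mathbf{i}_r\|^2) \tag{10}$$
where $J = \partial\mathbf{g}/\partial\mathbf{i}_r$, evaluated at $\mathbf{i}_r^*$, is the Jacobian matrix. It is a square matrix of size $m$ and the element in row $j$ column $k$ is $\partial g_j / \partial i_k$. Since $g_k$ depends only on $i_k$ it becomes a diagonal matrix.
In the above equation we can neglect the $O(\|\Delta\mathbf{i}_r\|^2)$ term, substitute into (9), subtract the right hand side and reorganize the transposed vectors, to get
$$\Delta\mathbf{i}_r^T J^T \mathbf{1} \ge 0 \tag{11}$$
Let us now apply Kirchhoff’s laws to the above expression. Since $\mathbf{i}_r^* + \Delta\mathbf{i}_r$ satisfies Kirchhoff’s current law with the source currents unchanged, (7) gives
$$A_r \, \Delta\mathbf{i}_r = \mathbf{0} \tag{12}$$
Let the loop current vector be $\Delta\mathbf{i}_l$, then
$$\Delta\mathbf{i}_r = B_r^T \, \Delta\mathbf{i}_l \tag{13}$$
where $B_r$ denotes the fundamental circuit matrix for the loops formed by the branches without current sources. Note that the branch currents satisfy Kirchhoff’s current law for all choices of $\Delta\mathbf{i}_l$. We can replace the above relationship in (11) and reorganize the transposed matrices to get
$$\Delta\mathbf{i}_l^T B_r J^T \mathbf{1} \ge 0 \tag{14}$$
Since $\Delta\mathbf{i}_l$ can be chosen arbitrarily, the following must hold:
$$B_r J^T \mathbf{1} = \mathbf{0} \tag{15}$$
The vector $J^T \mathbf{1} = (\partial g_1/\partial i_1, \ldots, \partial g_m/\partial i_m)^T$ therefore satisfies Kirchhoff’s voltage law, as given in (5). Thus we get the following relationship:
$$\mathbf{v}_r^* = J^T \mathbf{1} \tag{16}$$
where $\mathbf{v}_r^*$ collects the voltages across the branches without current sources. Now the elements of $\mathbf{v}_r^*$ are $v_k$ and the elements of $J^T \mathbf{1}$ are $\partial g_k / \partial i_k$. $v_k$ and $g_k$ are functions of $i_k$ which is the corresponding element of $\mathbf{i}_r$. They can thus be related as follows:
$$v_k(i_k) = \frac{d g_k(i_k)}{d i_k} \tag{17}$$
This expression specifies how the circuit models should be developed since $g_k$ corresponds to the quantity to be minimized in the original system. Thus if $g_k = V_k i_k$ then $v_k$ corresponds to a constant $V_k$, and if $g_k = V_k i_k + \tfrac{1}{2} R_k i_k^2$ then $v_k$ is a linear function $V_k + R_k i_k$. If all components have linear delay versus flow functions the $d v_k / d i_k$ terms are constant and the current distribution in the resulting network can be obtained by solving a system of linear equations. If there are nonlinear functions, iterative techniques are required. In the next section, we describe how these component models are to be connected together to build the network.
1 An introduction to electrical network theory, if required, can be found in [33].
2 We use the notation $^T$ to denote the transpose of vectors or matrices throughout this section.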
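The relation between a device's voltage and the derivative of the minimized quantity can be exercised on a nonlinear delay function. The sketch below assumes, as an interpretation consistent with the resistor being twice the slope of a linear delay function, that the quantity minimized for a component is the flow-weighted delay $g(i) = i \, d(i)$. It takes the mean sojourn time of an M/M/1 queue, $d(i) = 1/(\mu - i)$, as an illustrative delay function; the function names and the service rate are hypothetical.

```python
def mm1_delay(flow, mu):
    """Mean time in an M/M/1 system with service rate mu (requires flow < mu)."""
    return 1.0 / (mu - flow)


def device_voltage(flow, mu):
    """Closed form of d/di [i * d(i)] for the M/M/1 delay: mu / (mu - i)^2."""
    return mu / (mu - flow) ** 2


def numeric_voltage(flow, mu, h=1e-6):
    """Central-difference estimate of the same derivative, as a cross-check."""
    g = lambda x: x * mm1_delay(x, mu)
    return (g(flow + h) - g(flow - h)) / (2 * h)


# At zero flow the device voltage equals the latency 1/mu; it grows sharply
# as the flow approaches the service rate.
for flow in (0.0, 0.5, 0.9):
    print(flow, device_voltage(flow, 1.0), numeric_voltage(flow, 1.0))
```

A hypothetical device with this current-voltage characteristic does not correspond to a standard voltage source or resistor, which is why a solver that accepts user-defined devices is convenient for such delay functions.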
B. Network Construction
The architecture of the VLSI system defines the connections
between the various components. The circuits for the compo-
nent models must be connected in exactly the same way. For
example if the architecture is as shown in Fig. 10, the corresponding flow-delay model is as shown in Fig. 11. The currents
Fig. 10. Network construction example: System architecture.
Fig. 11. Network construction example: Flow-delay model.
Fig. 12. Network for maximum flow.
Fig. 13. Network for minimum cost flow.
at points where multiple paths merge together automatically add
up, thus adjusting the queueing delays. The network is then to
be solved in the following two steps.
• First solve the maximum flow problem by modeling the
generators as voltage sources and removing the components that model delays in the network, as indicated in Fig. 12. The currents flowing through each of the generator models indicate the maximum throughput supported
for each of them.
• Next solve the minimum cost flow problem by modeling
the generators as current sources with values obtained from
the maximum flow, as shown in Fig. 13.
A few details must be kept in mind while solving these two flow
problems. We discuss them in the following paragraphs.
The selection of voltage source values for the first step can
play a role. If all generators see a symmetric view of the net-
work, sources of identical values should be used. We can expect
the resulting current flow through them to be equal. All generators that are connected to the network at the same points can also be merged together and modeled as a single voltage source. If
Fig. 14. Example of current distributions guiding the application design and allocations.
the generators see an asymmetric view of the network, the rela-
tive values of the voltage sources can affect the current distribu-
tion across them. We have not studied such situations as yet. Our
empirical studies cover only symmetric systems. We note that
a majority of systems are likely to be symmetric, as is the case
with Intel’s IXP Network Processor [3], Sun Microsystems’ Niagara Processor [4], and IBM’s Cell Multiprocessor [5].
The currents that flow through the voltage sources give the
maximum flow rate through the corresponding generator or set
of generators. In the second step, the generator voltage sources
must be replaced by current sources having values exactly equal
to the current values found in the first step. The delay modeling
components inside the network, i.e., the voltage sources and re-
sistors, must be brought back. Then the current through the re-
spective branches of the flow-delay model gives the desired flow
distribution for the original system.
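The two-step procedure can be sketched for a deliberately simplified topology: several memories in parallel behind a single bus, each memory modeled as a current limit, a voltage source, and a resistor. This is not a general circuit solver and is not the in-house solver used in this work; the function names, the bisection approach, and the numerical values are assumptions made for illustration.

```python
def branch_current(level, V, R, limit):
    """Current a branch carries when the shared voltage drop is `level`."""
    return min(max((level - V) / R, 0.0), limit)


def solve_parallel(bus_limit, branches, iterations=100):
    """branches: list of (limit, V, R) with R > 0. Returns (max_flow, flows)."""
    # Step 1: maximum flow, bounded by the bus and by the memories together.
    total = min(bus_limit, sum(limit for limit, _, _ in branches))
    # Step 2: bisect on the shared voltage until the currents sum to `total`.
    low = min(V for _, V, _ in branches)
    high = max(V + R * limit for limit, V, R in branches)
    for _ in range(iterations):
        mid = 0.5 * (low + high)
        carried = sum(branch_current(mid, V, R, limit) for limit, V, R in branches)
        if carried < total:
            low = mid
        else:
            high = mid
    flows = [branch_current(high, V, R, limit) for limit, V, R in branches]
    return total, flows


# Two memories with limit 3 each; the bus allows only 2 units of flow.
total, flows = solve_parallel(2, [(3, 1.0, 2.0), (3, 2.0, 4.0)])
print(total, flows)
```

In the first step the delay elements play no role and only the limits matter; in the second step the injected flow is fixed at the value from the first step and the shared voltage settles where the branch currents add up to it.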
As an example consider a system with the processors connected to memories through point-to-point interconnects. If
the system has two processors and two memories as shown in
Fig. 14, let us denote the currents flowing out from the proces-
sors as and . Then the application should be parallelized
such that the number of memory accesses generated by the
processors are in the ratio . Further, let split up as
and in the branches going towards the two memories.
Then the memory allocation should be such that the number of
accesses flowing through the respective connections are also in
the ratio .
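As a small illustration of how such current ratios translate into allocation targets, the sketch below normalizes a set of branch currents into fractions of accesses. The current values are made-up numbers in arbitrary units, and the helper name `normalize` is hypothetical; neither comes from the paper.

```python
def normalize(currents):
    """Scale a list of non-negative currents so that they sum to 1."""
    total = sum(currents)
    if total <= 0:
        raise ValueError("at least one current must be positive")
    return [c / total for c in currents]


# Assumed example values (arbitrary units), not taken from the paper.
processor_currents = [3.0, 1.0]   # currents leaving processors 1 and 2
branch_currents_p1 = [2.0, 1.0]   # processor 1's current split towards memories 1 and 2

# Fraction of all memory accesses that each processor should generate.
processor_shares = normalize(processor_currents)
# Fraction of processor 1's accesses that should be allocated to each memory.
memory_shares_p1 = normalize(branch_currents_p1)
```

With these assumed currents, the first processor would be given three quarters of the accesses, and two thirds of its accesses would be mapped to the first memory.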
Note that multiple distributions are also possible in such networks. In the above scenario, let the processors and memories
be connected through a common bus which offers a throughput
bottleneck. Then if the memories are identical and are modeled
as having constant delays, the maximum flow is governed by
the bus and can be distributed across the memories in any ratio.
All such distributions will give the same mean delay in the flow
network and satisfy Kirchhoff’s laws in the flow-delay model.
However if the memory models have resistive components then
the solution is unique. In general if a majority of the components
do not have constant delay functions we can expect a unique so-
lution whereas if a majority have constant delay functions mul-
tiple solutions are likely to exist.
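The difference between the two cases can be checked numerically on a toy example. The sketch below assumes two identical memories in parallel, each modeled as a constant latency plus an optional resistive term proportional to its flow, and lists the candidate splits of a fixed total flow for which both branches show the same delay (the equal voltage drop that Kirchhoff's voltage law requires). All parameter values and function names are illustrative assumptions.

```python
def branch_delays(split, total_flow, latency, resistance):
    """Delays of two identical parallel memories for a given split of the flow."""
    flow_1 = split * total_flow
    flow_2 = total_flow - flow_1
    return latency + resistance * flow_1, latency + resistance * flow_2


def balanced_splits(total_flow, latency, resistance, steps=10, tol=1e-9):
    """Candidate splits for which both branches show the same delay."""
    result = []
    for i in range(steps + 1):
        split = i / steps
        delay_1, delay_2 = branch_delays(split, total_flow, latency, resistance)
        if abs(delay_1 - delay_2) <= tol:
            result.append(split)
    return result


# Constant delays (no resistive component): every split is consistent.
constant_case = balanced_splits(total_flow=1.0, latency=4.0, resistance=0.0)
# Resistive component present: only the even split is consistent.
resistive_case = balanced_splits(total_flow=1.0, latency=4.0, resistance=0.5)
```

In this toy setting every one of the sampled splits is admissible when the delays are constant, whereas the resistive case leaves only the even split.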
This completes the construction of the flow-delay model and
an understanding of how the current distribution can help in designing the system. In the next two sections we validate the
Fig. 15. Intel’s IXP 2800 network processor block diagram [3].
methodology first on the Intel IXP2800 system and then through
a more detailed study on a cycle accurate simulator.
IV. PRELIMINARY DEMONSTRATION
In this section, we conduct a preliminary study to demonstrate
that the results from the flow-delay model are consistent with results obtained through intuitive analyses of simple systems. We
use the Intel IXP architecture [3] as the basis for this exercise.
The IXP architecture has a single bottleneck that can be identi-
fied intuitively.
We show the architecture in Fig. 15. The salient features are
as follows [3].
• There are 16 micro-engines arranged as 2 clusters of 8
each. Let us assume that the processors generate memory
accesses faster than they can be serviced.
• There are three kinds of memories: SRAMs, DRAMs, and
a scratchpad. The SRAMs are 4 in number, each running
at 200 MHz and providing 800 MB/s read as well as write
bandwidth. The DRAMs are 3 in number, each running at 533 MHz DDR and providing approximately 2.12 GB/s of
bandwidth shared between the reads and writes. There is a
single scratchpad running at 700 MHz, providing 2.8 GB/s
bandwidth.
• The interconnect sub-system consists of four buses. The
first two are connected to the SRAMs and scratchpad, and the
third and fourth to the DRAMs. The first and third are connected to the first cluster of processors, and the second and
fourth to the second cluster. Each bus has
two units, one for each direction, both of which operate at
700 MHz. Thus a bandwidth of 2.8 GB/s is provided in
each direction. All connections have identical latency.
We show the flow-delay model in Fig. 16. Since all the processors within a cluster are connected to the rest of the system
Fig. 16. Flow-delay model for IXP 2800.
in an identical manner, we model each cluster as a single gen-
erator. We neglect the queueing delays for this study. Thus we
model the component delays using only a voltage source as dis-
cussed earlier through Fig. 6.
We analyze the following situation. The processors gen-
erate an equitable mix of read and write accesses very fast,
and we need to determine the maximum rate supported.
Now the total bandwidth or throughput provided by the respective sub-systems is as follows. The memories provide
approximately 15.7 GB/s (see footnote 3), while the buses provide 11.2 GB/s
both to and from the memories. Hence the buses are expected
to be the bottleneck. The analysis must also factor in the fact
that the buses connected to the different memories are separate.
The SRAMs and scratchpad provide a total bandwidth of
9.2 GB/s while the DRAMs provide approximately 6.5 GB/s (see footnote 3).
The respective buses provide 5.6 GB/s. Thus the buses are
expected to limit the throughput towards both the SRAMs and
DRAMs.
We find that our methodology predicts a system throughput
of exactly 11.2 GB/s. The flow distribution for all processors is
identical, the flow in all SRAMs is the same and the flow in all DRAMs is the same. We refer to such a distribution as being
balanced for the rest of this section.
We then experiment with the following modifications in the
architecture. Note that we start with the original architecture for
each of the modifications listed as follows.
• Mod-1: We increase the bandwidth of each bus by a factor
of 2. Now we expect the memories to become the bottle-
neck.
Footnote 3: While implementing the flow-delay model we go through an intermediate format wherein we model the throughput of each component as a cycle time parameter having an integer value. This introduces a rounding-off error which
gives us a net DRAM bandwidth of 6.5 GB/s rather than the actual 6.4 GB/s. This is a limitation of our implementation and not the methodology. We continue with the modeled value for the rest of this section.
TABLE I
COMPARISON OF RESULTS FROM THE INTUITIVE ANALYSIS AND THE FLOW-DELAY MODEL
• Mod-2: We increase the bandwidth of each bus by a factor
of 2 and remove the scratchpad. We again expect the mem-
ories to be the bottleneck but the system throughput to
change.
• Mod-3: We decrease the bandwidth of the DRAMs by a
factor of 2. Now we expect the DRAMs to be the bottle-
neck in their part of the system and the buses to be the bot-
tleneck in the part of the system containing the SRAMs and
scratchpad.
• Mod-4: We consider the architecture with the memories as the bottleneck, i.e., Mod-1. We assign unequal connection latencies to the SRAMs as follows: We assume that the
first processor cluster is close to the first two SRAMs and
far from the third and fourth. The second cluster is con-
versely close to the third and fourth SRAMs and far from
the first two. We assume that the bus latencies for a close
and far connection are in the ratio . Now we expect
the accesses from the first cluster to be biased towards the
first two SRAMs and vice versa for the second cluster.
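The intuitive bottleneck analysis for the original architecture and for Mod-1 to Mod-3 amounts to simple arithmetic on the bandwidths quoted above. The sketch below is only a back-of-the-envelope reconstruction: it takes the smaller of the memory bandwidth and the bus bandwidth on each side of the system and adds the two sides. It uses the nominal DRAM figure of 2.12 GB/s rather than the rounded value used in the implementation, so the totals can differ slightly from the modeled ones. The function and parameter names are hypothetical.

```python
def intuitive_throughput(bus_scale=1.0, dram_scale=1.0, with_scratchpad=True):
    """Upper bound on system throughput in GB/s from the quoted bandwidths."""
    sram = 4 * (0.8 + 0.8)                     # four SRAMs, 800 MB/s read plus 800 MB/s write
    scratchpad = 2.8 if with_scratchpad else 0.0
    dram = 3 * 2.12 * dram_scale               # three DRAMs, about 2.12 GB/s each
    bus_pair = 2 * 2.8 * bus_scale             # two buses per memory group, 2.8 GB/s each
    sram_side = min(sram + scratchpad, bus_pair)
    dram_side = min(dram, bus_pair)
    return sram_side + dram_side


baseline = intuitive_throughput()                                # buses limit both sides
mod_1 = intuitive_throughput(bus_scale=2.0)                      # memories become the limit
mod_2 = intuitive_throughput(bus_scale=2.0, with_scratchpad=False)
mod_3 = intuitive_throughput(dram_scale=0.5)                     # DRAMs limit their side only
```

Mod-4 is not covered by this sketch, since it changes latencies rather than bandwidths and therefore affects the distribution rather than the throughput bound.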
We present all the results in Table I (see footnote 4). We find that the flow-
delay model predicts the expected throughput very closely in all
scenarios. When the connection latencies are equal, the flow distribution is also balanced. For Mod-4, in which the latencies are
unequal, the flow between the processors and SRAMs is biased
in the following manner: The entire flow from the first cluster
goes to the first and second SRAMs while the entire flow from
the second cluster goes to the third and fourth SRAMs. This
is the expected distribution in the absence of queueing delays.
With queueing delay functions, it is not possible to estimate the
optimal distribution in a simple manner.
Through this exercise we have demonstrated that the flow-
delay model gives correct throughput predictions in simple situ-
ations. It also gives the flow distribution that achieves the system
throughput and minimizes mean latencies. In the next section
we consider a wider range of system configurations and com-
pare this methodology against standard search procedures for
finding the optimal distributions.
V. EMPIRICAL STUDIES
The purpose of this section is a thorough validation of the
flow-delay model. We use a cycle accurate simulation framework called MemSim [36]–[38] to model the reference system.
We start with simple regular configurations and then move on
to randomized ones. We compare the proposed methodology
with the following two procedures: 1) where we generate a large
number of random distributions and select the best one and 2)
where we use an intensive search procedure, namely simulated
annealing, to find a good distribution. We simulate the MemSim
model both for evaluating the random distributions and for evaluating the objective function during simulated annealing. We
compare the procedures based on two performance parameters,
namely system throughput and mean delay, with throughput
being the primary one.
Footnote 4: We explain the biased distribution later in the paragraph that introduces Table I.
We organize this section as follows. In Section V-A, we
present relevant details for the construction of the flow-delay
model and the procedures involving random sampling and
simulated annealing. In Section V-B, we illustrate the type of
systems considered by giving a brief overview of MemSim.
Then in Section V-C we present details of the configurations
used and study the distribution given by the flow-delay model
for a few of the simpler ones. Finally, in Section V-D, we
present the consolidated results across all configurations and
procedures.
A. Details of Flow Optimization Procedures
Let us start with the flow-delay model. Since we do not have
techniques to characterize the queueing delay functions as yet,
we use the following three strategies to construct the model.
• FDM-L: We consider only the component latencies and
neglect the queueing delays.
• FDM-Q: We assume a linear dependence between the flow
through a component and its queueing delay. We assume
the queueing delay to be 0 when the flow rate is negli-
gibly small and the maximum to be the product of its queue
capacity and cycle time, when the flow rate reaches its
throughput.
• FDM-I: We iteratively determine the queueing delay
at which each component is operating in the following manner. We initially assume all queueing delay functions
to be linear and identical to FDM-Q. We consider the slope
to be a variable parameter. Once we get the distribution,
we simulate it on the MemSim model and note the actual
delays. We compute the difference between the delays
predicted by the flow-delay model and the observed delays
for each component. We adjust each slope by an amount
proportional to the difference and repeat the process. We
stop when the difference is within a specified tolerance for
all components or the number of iterations exceeds a spec-
ified limit. We use the following practical considerations
as the stopping criteria: either the difference in delay is within 5% for all components or the number of iterations
has exceeded 1000. This method simply ensures that the
queueing delays modeled are approximately equal to the
actual delays for the distribution at which the system is
operating. A possible limitation is that the flow distribution
may converge to one that is suboptimal for the underlying
system. We still use this method as a practical means in the
absence of an accurate characterization of the queueing
delay functions.
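The FDM-Q and FDM-I strategies can be sketched for a single component as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the FDM-Q function interpolates linearly between zero delay at zero flow and the product of queue capacity and cycle time at full throughput, and the FDM-I loop adjusts one slope at a fixed flow against a given observed delay. In the procedure described above the flow distribution is re-solved and re-simulated after every update, which this toy loop omits. The `gain` parameter and all function names are hypothetical.

```python
def fdm_q_delay(flow, cycle_time, queue_capacity):
    """Linear queueing delay: 0 at zero flow, queue_capacity * cycle_time at full throughput."""
    throughput = 1.0 / cycle_time
    if not 0.0 <= flow <= throughput:
        raise ValueError("flow must lie between 0 and the component throughput")
    return queue_capacity * cycle_time * (flow / throughput)


def fit_slope(observed_delay, flow, slope, gain=0.5, tolerance=0.05, max_iterations=1000):
    """Adjust a linear delay slope until the predicted delay is within tolerance.

    `observed_delay` stands in for the delay measured on the reference simulator;
    it must be positive, as must `flow`.
    """
    for iteration in range(max_iterations):
        predicted = slope * flow
        error = observed_delay - predicted
        if abs(error) <= tolerance * observed_delay:
            return slope, iteration
        slope += gain * error / flow   # correction proportional to the mismatch
    return slope, max_iterations


full_load_delay = fdm_q_delay(flow=0.25, cycle_time=4, queue_capacity=8)
fitted_slope, iterations_used = fit_slope(observed_delay=10.0, flow=0.5, slope=4.0)
```

The 5% tolerance and the limit of 1000 iterations mirror the stopping criteria stated in the text; the starting slope and the observed delay are arbitrary example values.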
Fig. 17. MemSim model overview.
For the random samples, we generate a random probability
for each path through the MemSim model. Each access path
starts at a processor, goes to a memory, and returns to the same
processor. Each path also involves an input port and an output
port. The processor-port connectivity and memory-port connectivity are specified as a part of the configuration. They determine
which of the possible paths are present. We then normalize
the path probabilities to add up to 1 for each processor and simulate each distribution on the MemSim model. While presenting
the results, we refer to the random search procedures as RND-$N$,
where $N$ is the number of distributions generated. We consider
$N = 1000$ and $N = 10\,000$ in our study.
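Generating one random distribution of this kind can be sketched as below. The representation of a path as an (input port, memory, output port) tuple and the name `random_distribution` are assumptions made for illustration; the actual MemSim configuration format is not shown in this paper.

```python
import random


def random_distribution(valid_paths, rng):
    """Draw a random probability for every valid path and normalize per processor.

    `valid_paths` maps each processor to the list of paths it may use.
    """
    distribution = {}
    for processor, paths in valid_paths.items():
        # 1.0 - random() lies in (0, 1], so the sum of weights is never zero.
        weights = [1.0 - rng.random() for _ in paths]
        total = sum(weights)
        distribution[processor] = {
            path: weight / total for path, weight in zip(paths, weights)
        }
    return distribution


# Hypothetical configuration: two processors, two ports each way, two memories.
example_paths = {
    "P0": [(0, "M0", 0), (0, "M1", 0), (1, "M0", 1), (1, "M1", 1)],
    "P1": [(0, "M0", 0), (1, "M1", 1)],
}
sample = random_distribution(example_paths, random.Random(0))
```

Repeating the draw the chosen number of times and simulating each sample would give the random search described above.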
For simulated annealing we use the library available at [34].
We start with a random distribution and compute the objective
function as follows. We simulate the MemSim model and define
the objective that is to be maximized as
$$\text{objective} = \text{system throughput} - w \times \text{mean delay}. \tag{18}$$
In this expression $w$ is a weight constant which we take to be
0.01. We simulate a short access trace for evaluating the objec-
tive, wherein each processor generates 1000 accesses. We refer
to this procedure as SA while presenting the results. In the next
section we quickly introduce MemSim, the modeling frame-
work within which we have conducted this validation.
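For concreteness, the objective evaluated at each annealing step can be written as a one-line function. The sketch assumes that the objective rewards throughput and penalizes mean delay through the weight constant of 0.01 mentioned above; the function name and the example numbers are illustrative only.

```python
def sa_objective(system_throughput, mean_delay, weight=0.01):
    """Objective to be maximized: throughput minus a weighted mean delay (assumed form)."""
    return system_throughput - weight * mean_delay


# Two hypothetical distributions: equal throughput, different delays.
score_low_delay = sa_objective(system_throughput=0.8, mean_delay=20.0)
score_high_delay = sa_objective(system_throughput=0.8, mean_delay=40.0)
```

With such a small weight, throughput dominates the comparison and the delay term mainly separates distributions of similar throughput, in line with throughput being the primary parameter.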
B. Reference Simulation Model
MemSim provides a cycle accurate modeling framework for
the memory accesses in multiprocessor systems having a shared
and distributed memory sub-system. The system is modeled as
shown in Fig. 17. It is assumed to be synchronous with a global
clock. The processors generate memory accesses each of which
is allocated to a path through the memory and interconnect sub-
systems. For this study we assume that the processors randomly
allocate the accesses based on a probability distribution.
The memory and interconnect sub-systems are modeled using
the template shown in Fig. 18. The components with performance parameters are the memory modules, arbiters, distributors (which direct the accesses at points where multiple paths
diverge) and interconnect wires. The memories, arbiters, and
distributors are characterized by a cycle time which is the recip-
rocal of the throughput. Their latency is also assumed to be equal
to this cycle time. The wires are pipelined and are characterized by the number of stages, which we refer to as the length. An
access proceeds one stage in each clock cycle; thus a wire has a throughput of 1
and a latency equal to its length. There are queues at a number of
positions in the template. They are characterized by a capacity
in terms of the number of accesses. All components follow a
first-in-first-out policy with blocking.
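The component parameters described above can be captured in a few lines. The sketch below is a simplified stand-in for the MemSim template, with hypothetical class and function names: it computes the unloaded latency of a path as the sum of component latencies and its throughput as that of the slowest component, and it ignores queues and blocking entirely.

```python
from dataclasses import dataclass


@dataclass
class Server:
    """Memory, arbiter or distributor, characterized by its cycle time in clocks."""
    cycle_time: int

    @property
    def throughput(self):
        return 1.0 / self.cycle_time   # accesses per clock

    @property
    def latency(self):
        return self.cycle_time         # latency assumed equal to the cycle time


@dataclass
class Wire:
    """Pipelined wire, characterized by its number of stages (its length)."""
    length: int

    @property
    def throughput(self):
        return 1.0                     # one access enters per clock

    @property
    def latency(self):
        return self.length             # one stage per clock


def path_latency(components):
    """Unloaded latency of a path: sum of component latencies (no queueing)."""
    return sum(c.latency for c in components)


def path_throughput(components):
    """Throughput of a path: limited by its slowest component."""
    return min(c.throughput for c in components)


# Hypothetical path: wire, arbiter, memory, return wire.
example_path = [Wire(3), Server(2), Server(4), Wire(5)]
```

For the example path the memory with a cycle time of 4 is the bottleneck, while the two wires contribute most of the unloaded latency.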
We have developed the infrastructure to automatically convert a MemSim description to a flow-delay model. The components are modeled as described in the previous section and then
Fig. 18. Model for the memory and interconnect sub-systems.
Fig. 19. Memory and interconnect parameters for Configuration-1.
TABLE II
DEFAULT PARAMETERS FOR CONFIGURATION-1
connected exactly as in Fig. 18. The description also contains
the valid path connections for the design, which gets reflected
in the flow-delay network constructed. In the next section, we
describe the configurations we have used for this study in terms
of the MemSim template.
C. Configurations Used
The simplest configuration we consider has four processors,
four memories, one input port, and one output port. Since there
is a single input port and a single output port, the connectivity
is trivial. We show the memory cycle time and wire length parameters in Fig. 19. For the rest of the parameters, we use the
defaults given in Table II. We label this configuration as C-1.
Fig. 20. Memory and interconnect parameters for Configuration-2.
In this configuration, the memories clearly offer a throughput
bottleneck. Since their cycle times are in the ratio 1:4 we can ex-
pect the ideal flow distribution to be in the same ratio, favouring
the faster memories. We do indeed observe that the distribution
in the flow-delay model directs 40% of the flow towards each of
the faster memories and 10% towards each of the slower ones.
The distribution is identical for all processors. We observe the
same for all three strategies: FDM-L, FDM-Q, and FDM-I.
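The 40%/10% split follows directly from sharing the flow in proportion to each memory's throughput, i.e., the reciprocal of its cycle time. The sketch below assumes two fast and two slow memories with cycle times of 1 and 4 clocks; only the 1:4 ratio is given in the text, so the absolute values are illustrative.

```python
def flow_shares(cycle_times):
    """Share of the total flow per memory, proportional to 1 / cycle time."""
    rates = [1.0 / t for t in cycle_times]
    total = sum(rates)
    return [r / total for r in rates]


# Assumed cycle times: two faster memories and two slower ones, in the ratio 1:4.
shares = flow_shares([1, 1, 4, 4])
```

Each fast memory receives four times the flow of a slow one, which gives the 40/40/10/10 split reported above.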
The remaining configurations consist of four processors, four
memories, two input ports, and two output ports. In the first
such configuration we consider uniform memory cycle times
and nonuniform wire lengths as shown in Fig. 20. We use the
same defaults as in Table II for the remaining parameters. We
assume that all processors are connected to all ports and simi-
larly for the memories. We label this configuration C-2.
The memories still offer a throughput bottleneck but the first
input and output ports are close to the first two memories and
far from the other two, and vice versa for the second input and
output ports. This is reflected in the wire lengths. All three forms
of the flow-delay model result in the flows being distributed
equally across the memories but using only the short wires.
In the next configuration labeled C-3, we interchange the wire
lengths as follows. We assume that the first input and output
ports are close to the memories while the second input and
output ports are far. We find that FDM-L and FDM-I use only the
short wires whereas FDM-Q divides the flow, placing a larger
share on the short wires.
The subsequent configurations cover the following system
features.
• Nonuniform memory cycle times, as in Fig. 19 along with
the two previous wire length combinations. We label these
C-4 and C-5.
• An arbiter as the throughput bottleneck with the memory cycle times and wire lengths being selected randomly, as
indicated in Table III. We use the connectivity scheme
shown in Fig. 21 for such configurations. We consider 2
of them labeled C-6 and C-7.
• Partially randomized configurations in which each of the
memory cycle times and wire lengths is selected indepen-
dently from the values given in Table IV. The rest of the
parameters are taken to be the defaults in Table II. We con-
sider two connectivity schemes: 1) complete connectivity
as before and 2) all processors are connected to both ports
but the memories are connected as shown in Fig. 21. We
study four such configurations labeled C-8–C-11.
• Completely randomized configurations in which all parameters are selected independently from the ranges
Fig. 21. Alternate port-memory connectivity scheme.
TABLE III
PARAMETERS FOR CONFIGURATIONS WITH AN ARBITER AS A THROUGHPUT BOTTLENECK
TABLE IV
PARAMETERS FOR PARTIALLY RANDOMIZED CONFIGURATIONS
TABLE V
PARAMETERS FOR COMPLETELY RANDOMIZED CONFIGURATIONS
shown in Table V. We assume complete connectivity for
these configurations. We consider five of them labeled
C-12–C-16.
We assume that the processors stress the memory and interconnect sub-systems. We generate 10 000 accesses per processor
and observe the time taken to service them. We also observe
the mean access delay. We present a summary of the results
across all configurations in the next section.
D. Results
We have six flow distribution procedures: FDM-L, FDM-Q,
FDM-I, RND-1000, RND-10000, and SA. We say that one
procedure is better than another either if it results in a higher
throughput or if it results in approximately the same throughput
and a lower mean delay. We use the following two ratios to
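compare two procedures, say $A$ and $B$, against each other:
$$R_T = \frac{\text{throughput}(A)}{\text{throughput}(B)} \tag{19}$$
$$R_D = \frac{\text{mean delay}(A)}{\text{mean delay}(B)} \tag{20}$$
Thus $A$ is better than $B$ if $R_T > 1$, or $R_T \approx 1$ and $R_D < 1$.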
If we can expect since we are stressing the
network. In general, if a procedure supports a higher throughput we expect
it to be accompanied by larger delays.
We first compare the three forms of the flow-delay model
against each other. We present the detailed results in Table VI.
We find the differences in resulting system performance to be small throughout and no single strategy to be consistently
TABLE VI
DETAILED RESULTS FOR COMPARISON BETWEEN THE VARIOUS FORMS OF THE FLOW-DELAY MODEL
TABLE VII
SUMMARY OF THE COMPARISON BETWEEN THE FLOW-DELAY MODELS
better than the others. A more in-depth look at the results
gives the following observations: In terms of only throughput,
FDM-Q is better than the other two by 5% for exactly two
configurations, FDM-I is better than FDM-L by 4% for one
configuration and all other differences are within 2%. The larger
throughput gains are accompanied by a significant increase in delay. In terms of only delay, the differences between the
strategies are larger for certain configurations. These cases also
show a difference in throughput which accounts for the effect
on mean delay. The configurations for which one strategy is
better than another in terms of both throughput and mean delay
are restricted to the following: FDM-L is better than FDM-Q
for C-3 and C-12 while FDM-Q is better than FDM-L for C-14.
We present a summary of the above results in Table VII. This
information is sufficient to conclude that: 1) the differences in
throughput are small; 2) the differences in mean delay are larger
and show greater variations; and 3) no particular strategy is consistently better than the other two across all configurations.
Although FDM-I is consistently better than FDM-L, the average
throughput gain is negligible and it is not consistently better than
FDM-Q. Thus we continue with all three forms for the rest of
this section.
Next we compare the flow-delay model with the random
search and simulated annealing procedures. Let us start by
studying the speedup provided. In our experiment setup, a
single simulation of the MemSim model is 4–5 times slower
than constructing and solving the flow-delay model. Thus the
speedup is almost and over the two random
procedures we use. The simulated annealing procedure is much
slower and the average speedup is . Thus the proposed
methodology offers a substantial speedup over the other two procedures.
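As a rough consistency check on the speedup over random search, the cost ratio quoted above can simply be multiplied by the number of simulated samples. The sketch assumes that the flow-delay model is constructed and solved once and that each random sample costs one MemSim simulation; it does not cover FDM-I, which itself invokes the simulator, nor simulated annealing. The function name is hypothetical.

```python
def random_search_speedup(num_samples, simulation_cost_ratio):
    """Speedup of one flow-delay solve over simulating `num_samples` distributions."""
    return num_samples * simulation_cost_ratio


# One simulation is stated to be 4 to 5 times slower than one flow-delay solve.
speedup_ranges = {
    n: (random_search_speedup(n, 4), random_search_speedup(n, 5))
    for n in (1000, 10000)
}
```

Under these assumptions the speedup lies between 4000 and 5000 for 1000 samples and between 40 000 and 50 000 for 10 000 samples, i.e., between three and five orders of magnitude.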
TABLE VIII
COMPARISON OF THE FLOW-DELAY MODEL WITH RANDOM SAMPLING AND SIMULATED ANNEALING
To compare the procedures in terms of system performance,
we modify the method followed so far in the following two
ways.
1) When computing the throughput ratio for the random sampling we take the
distribution giving the highest throughput. When computing the delay ratio
for the random sampling we take the distribution giving the
minimum mean delay with a throughput at least as high as that given by the flow-delay model it is being compared
against.
2) Then, for computing the maximum, minimum and average
statistics of the delay ratio we consider only those configurations for
which the random sampling or simulated annealing gives a
throughput higher than or equal to that of the flow-delay model. We
present the results in Table VIII.
In terms of system throughput, we find cases where the flow-
delay models give better distributions as well as cases where
they give worse distributions. In the worst case, the flow-delay
models give a distribution that comes within 7% of the best
distribution the other two procedures can find. For the configurations in which the flow-delay model finds a better distribution,
the differences are much larger. In the average case they
give better distributions than the two random procedures by ap-
proximately 20% and 15%, respectively, and equivalent ones as
compared to simulated annealing. In terms of delays, we once
again find cases where the flow-delay models give better as well
as worse distributions. The differences are large when they are
better and small when they are worse. In the average case the
mean delays given by the proposed methodology are lower. Thus
we conclude that in general the flow-delay models do indeed
give flow distributions that achieve high throughput along with
low delays.
Our experiments clearly show that the proposed methodology can find comparable if not better distributions within much
smaller time budgets as compared to standard search proce-
dures. This result establishes it as a promising approach towards
designing the flow distributions in VLSI systems.
VI. CONCLUSION AND FURTHER DIRECTIONS
Through this paper we have demonstrated a methodology for
modeling VLSI system performance using simple electrical circuits.
We have shown how the network solution provides a natural
means for optimizing the flow distributions from the perspective
of obtaining high throughputs with low end-to-end delays.
Although this approach can model only a limited set of features and makes a number of approximations, the empirical
results are promising. Our present studies show that the approx-
imation of discrete VLSI quantities, such as data transfer units
and synchronous time, as continuous currents and voltage drops
is valid.
We have built on the methodology proposed in [1]. The ad-
vancement from linear queueing delay functions to arbitrary
ones is important for VLSI systems. The next requirement is to be able to characterize the queueing delay functions. Although
our results show promise when we assume simple linear func-
tions or even neglect queueing delays, this may not be the case in
all systems. Further research on this characterization can make
the approach more robust and efficient.
We have conducted a substantial empirical study which
clearly shows that the flow-delay model gives good distribu-
tions as compared to standard search procedures. Its strength
is that it offers a substantial speedup, which makes it a good
candidate to be incorporated into design procedures, compilers,
and for on-chip allocations. Further studies can validate the
methodology on a wider range of systems. Nondeterministic
behavior such as caching and row-buffering can also be covered; however, we do not expect the flow-delay model to capture these
effects well. A final research direction is to develop algorithms
to construct good allocations using the ideal flow distribution
as a reference. This would complete the requirements for a
complete design flow.
ACKNOWLEDGMENT
The authors would like to thank Prof. M. P. Desai and his stu-
dents for simulation and optimization infrastructure. They are
grateful to Y. Save for initiating them with the circuit simulator
and helping to revise the drafts. They would also like to thank
the reviewers for suggesting significant improvements over the initial submission.
REFERENCES
[1] J. B. Dennis, Mathematical Programming and Electrical Networks. New York, London: MIT, Wiley, Chapman & Hall, 1959.
[2] W. Wolf, A. A. Jerraya, and G. Martin, “Multiprocessor system-on-chip (MPSoC) technology,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 27, no. 10, pp. 1701–1713, Oct. 2008.
[3] M. Adiletta, M. Rosenbluth, D. Bernstein, G. Wolrich, and H. Wilkinson, “The next generation of Intel IXP network processors,” Intel Technol. J., vol. 6, no. 3, pp. 6–18, Aug. 2002.
[4] P. Kongetira, K. Aingaran, and K. Olukotun, “Niagara: A 32-way Multithreaded SPARC processor,” IEEE Micro, vol. 25, no. 2, pp. 21–29, Feb. 2005.
[5] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy, “Introduction to the cell multiprocessor,” IBM J. Res. Developm., vol. 49, no. 4/5, pp. 589–604, Jul./Sep. 2005.
[6] S. Borkar, “Thousand core chips: A technology perspective,” in Proc. 44th ACM IEEE Des. Autom. Conf., San Diego, CA, Jun. 2007, pp. 746–749.
[7] G. Micheli and L. Benini, “Networks on chip: A new paradigm for systems on chip design,” in Proc. Des., Autom., Test Eur. Conf. Exhibition, Paris, France, Mar. 2002, pp. 418–419.
[8] G. Micheli and L. Benini, Networks on Chip. San Francisco, CA: Morgan Kaufmann, 2006.
[9] J. Nurmi, “Network-on-chip: A new paradigm for system-on-chip design,” in Proc. Int. Symp. Syst.-on-Chip, Tampere, Finland, 2005, pp. 2–6.
[10] T. Bjerregaard and S. Mahadevan, “A survey of research and practices of network-on-chip,” ACM Comput. Surveys, vol. 38, no. 1, pp. 1–51, 2006, Article 1.
[11] S. A. McKee, “Reflections on the memory wall,” in Proc. 1st ACM Conf. Comput. Frontiers, Ischia, Italy, Apr. 2004, p. 162.
[12] S. Medardoni, M. Ruggiero, D. Bertozzi, L. Benini, G. Strano, andC. Pistritto, “Capturing the interaction of the communication, memoryand I/O subsystems in memory-centric industrial MPSoC platforms,”in Proc. Des., Autom. Test Eur., Nice, France, 2007, pp. 660–665.
[13] M. Monchiero, G. Palermo, C. Silvano, and O. Villa, “Exploration of distributed shared memory architectures for NoC-based multiproces-sors,” J. Syst. Arch., vol. 53, no. 10, pp. 719–732, Oct. 2007.
[14] B. H. Meyer and D. E. Thomas, “Simultaneous synthesis of buses, data
mapping and memory allocation for MPSoC,” in Proc. 5th IEEE/ACM Int. Conf. Hardw./Softw. Codes. Syst. Synth., Salzburg, Austria, Sep./ Oct. 2007, pp. 3–8.
[15] S. Pasricha and N. Dutt, “COSMECA: Application specific co-syn-thesis of memory and communication architectures for MPSoC,” inProc. Des., Autom. Test Eur., Munich, Germany, Mar. 2006, pp. 1–6.
[16] R. Ahuja, T. Magnanti, and J. Orlin , Network Flows: Theory, Algo-rithms, and Applications. Englewood Cliffs, NJ: Prentice-Hall, 1993.
[17] L. Kleinrock , Queueing Systems. New York: Wiley, 1975.[18] P. K. F. Holzenspies, J. L. Hurink, J. Kuper, and G. J. M. Smit,
“Run-time spatial mapping of streaming applications to a hetero-geneous multi-processor system-on-chip (MPSOC),” in Proc. Des.,
Autom. Test Eur., Munich, Germany, 2008, pp. 212–217.[19] K. Goossens, J. Dielissen, O. P. Gangwal, S. G. Pestana, A. Radulescu,
and E. Rijpkema, “A design flow for application-specific networks onchip with guaranteed performance to accelerate SOC design and veri-fication,” in Proc. Des., Autom. Test Eur., 2005, pp. 1182–1187.
[20] C. Chou and R. Marculescu, “Incremental run-time application map-ping for homogeneous NoCs with multiple voltage levels,” in Proc.
IEEE/ACM Int. Conf. Hardw./Softw. Codes. Syst. Synth., Salzburg,Austria, Sep.–Oct. 2007, pp. 161–166.
[21] J. A. Buzacott and D. D. Yao, “Flexible manufacturing systems: Areview of analytical models,” Management Sci., vol. 32, no. 7, pp.890–905, Jul. 1986.
[22] A. Federgruen and H. Groenevelt, “Characterization and optimizationof achievable performance in general queueing systems,” Oper. Res.,vol. 36, no. 5, pp. 733–741, Sep.–Oct. 1988.
[23] J. A. Buzacott and J. G. Shanthikumar, “Design of manufacturing sys-tems using queueing models,” Queueing Syst., vol. 12, no. 1–2, pp.135–213, Mar. 1992.
[24] M. K. Govil and M. C. Fu, “Queueing theory in manufacturing: Asurvey,” J. Manuf. Syst., vol. 18, no. 3, pp. 214–240, 1999.
[25] O. Boxma, G. Koole, and Z. Liu, “Queueing-theoretic solutionmethods for models of parallel and distributed systems,” in Proc. 3rd QMIPS Workshop Perform. Evaluation Parallel Distrib. Syst.—Solu-tion Methods, Torino, Italy, 1993, pp. 1–24.
[26] B. Gaujal and E. Hyon, “Optimal routing in several deterministicqueues with two service times,” J. Eur. Syst. Automat., vol. 36, no. 2,pp. 945–957, 2002.
[27] F. E. B. Ophelders, S. Chakraborty, and H. Corporaal, “Intra- andinter-processor hybrid performance modeling for MPSoC architec-tures,” in Proc. 6th IEEE/ACM/IFIP Int. Conf. Hardw./Softw. Codes.Syst. Synth., Atlanta, GA, 2008, pp. 91–96.
[28] S. Schliecker, A. Hamann, R. Racu, and R. Ernst, “Formal methodsfor system level performance analysis and optimization,” Des. Autom.
Embed. Syst., vol. 13, no. 1–2, pp. 27–49, Jun. 2009.[29] S. Chakraborty, S. Kunzli, L. Thiele, and P. Sagmeister, “Performance
evaluation of network processor architectures: Combining simulationwith analytical estimation,” Comput. Netw., vol. 41, no. 5, pp.641–665,Apr. 2003.
[30] R. Marculescu and P. Bogdan, “The chip is the network: Toward a sci-ence of network-on-chip design,” Foundations Trends Electron. Des.
Autom., vol. 2, no. 4, pp. 371–461, 2009.[31] A. L. Varbanescu, H. Sips, and A. van Gemund, “PAM-SoC: A
toolchain for predicting MPSoC performance,” in Euro-Par 2006 Parallel Processing. Berlin, Germany: Springer, 2006, pp. 111–123.
[32] B. Ristau, T. Limberg, and G. Fettweis, “A mapping framework basedon packing for design space exploration of heterogeneous MPSoCs,”
J. Signal Process. Syst., vol. 57, no. 1, pp. 45–56, Oct. 2009.[33] N. Balabanian andT. Bickart , Electrical Network Theory. New York:Wiley, 1969.
[34] G. Kliewer and S. Tschoke, “Parallel simulated annealing library,”1998. [Online]. Available: http://wwwcs.uni-paderborn.de/fach-bereich/AG/monien/SOFTWARE/PARSA/
[35] S. H. Batterywala and H. Narayanan, “Efficient DC analysis of RVJcircuits for moment and derivative computations of interconnect net-works,” in Proc. IEEE Int. Conf. VLSI Des., Goa, India, Jan. 1999, pp.169–174.
[36] G. Hazari, M. P. Desai, and H. Kasture, “On the impact of address space assignment on performance in systems-on-chip,” in Proc. IEEE Int. Conf. VLSI Des., Bangalore, India, 2007, pp. 540–545.
[37] G. Hazari, M. P. Desai, and G. Srinivas, “Bottleneck identification techniques leading to simplified performance models for efficient design space exploration in VLSI memory systems,” presented at the IEEE Int. Conf. VLSI Des., Bangalore, India, 2010.
[38] H. Kasture, “A memory subsystem simulator for SoC applications,” B.Tech. and M.Tech. dissertation, Dept. Elect. Eng., I.I.T. Bombay, India, Jul. 2006.
Gautam Hazari received the Dual Degree, which includes a B.Tech. in electrical engineering and an M.Tech. in microelectronics, from I.I.T. Bombay, India, in 2002. He completed the Ph.D. degree at the same institute in 2010.
His current research interests are centered around performance analysis and modeling at the system level. He has previously enjoyed working with digital circuits, especially asynchronous ones.
H. Narayanan received the B.Tech. and Ph.D. degrees from I.I.T. Bombay, India, in 1969 and 1974, respectively.
He has been a faculty member with the Department of Electrical Engineering, I.I.T. Bombay, since 1974. He was also a visiting faculty member with the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, from 1983 to 1985. From 2000 to 2003, he was the Head-of-Department of Electrical Engineering, I.I.T. Bombay. His primary research interests are in the area of electrical network analysis, particularly in the use of topological methods for efficient analysis. He has supervised the building of the general purpose circuit simulator BITSIM at I.I.T. Bombay, which uses such methods. He has participated in the building of VLSI circuit partitioners for realization through FPGAs, in collaboration with industry partners from the United States and Japan. He is the author of a monograph titled Submodular Functions and Electrical Networks (North Holland, 1997), a revised edition of which is available online at http://www.ee.iitb.ac.in/~hn/book/.