Quality-of-Service for Network-on-Chip-based Smartphone/Tablet Systems-on-Chip
by
Kai Feng
A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
Electrical and Computer Engineering University of Toronto
© Copyright by Kai Feng 2012
Abstract
Smartphone/tablet Systems-on-Chip (SoCs) integrate an increasing number of components to offer
more functionality. The capacity and efficiency of data communication between memory and other
hardware blocks have become a major concern in SoC design. To address this concern, we
propose to use Network-on-Chip (NoC) architectures to meet high-bandwidth, low-power
and low-area demands. We propose a Quality-of-Service (QoS) scheme to differentially provision
network resources to cater to the different performance requirements of different hardware blocks.
Implementation and evaluation are performed on a simulation infrastructure we construct
specifically for this type of SoC. We demonstrate, via simulation results, that the proposed
Dynamic QoS schemes can achieve better bandwidth provisioning, with good area and power
efficiency.
Acknowledgments
This thesis may mark the end of my life in school, but not the end of my journey in pursuing
knowledge. I am very grateful that I have been blessed with support and encouragement from
numerous people.
First I would like to express my sincere thanks to my supervisor, Prof. Natalie Enright Jerger for
her guidance and patience for the past two years. I also want to extend my gratitude to Dr. Serag
Gadelrab and Prof. Andreas Moshovos for their great support in my research.
I want to thank my fellow graduate students in Natalie's research group, for their faithful
comments and suggestions on this project. Especially I owe thanks (as well as apologies) to
Sheng Ma for my infinite consultations.
Last but not least, I am extremely indebted to my parents and Emma for their love and constant
encouragement, which got me through many tough moments. I thank them for always being there
for me.
Table of Contents
Acknowledgments ........................................................................................................................... iii
Table of Contents ........................................................................................................................... iv
List of Tables ................................................................................................................................. vi
List of Figures ............................................................................................................................... vii
List of Acronyms ........................................................................................................................... ix
Chapter 1 Introduction ................................................................................................................. 1
1.1 Motivations ......................................................................................................................... 1
1.2 Research Goals .................................................................................................................... 2
1.3 Thesis Organization ............................................................................................................ 3
Chapter 2 Related Work .............................................................................................................. 4
Chapter 3 Simulation Infrastructure .......................................................................................... 9
3.1 Interconnect ......................................................................................................................... 9
3.2 DRAM ............................................................................................................................... 10
3.3 Workloads ......................................................................................................................... 11
3.3.1 CPU ....................................................................................................................... 11
3.3.2 Traffic Generator (TG) ......................................................................................... 11
3.3.3 Video-Conferencing Workload (VCW) ................................................................ 17
Chapter 4 Quality-of-Service Schemes ..................................................................................... 19
4.1 Hierarchical-Multiplexers Baseline .................................................................................. 19
4.2 Dynamic QoS .................................................................................................................... 22
Chapter 5 Experimental Evaluation ......................................................................................... 32
5.1 Experiment Setup .............................................................................................................. 32
5.2 Experiment Results ........................................................................................................... 35
5.2.1 Latencies ............................................................................................................... 35
5.2.2 Case Study: a micro-experiment ........................................................................... 41
5.2.3 Throughput ............................................................................................................ 44
5.2.4 Area and Power ..................................................................................................... 46
Chapter 6 Conclusions ................................................................................................................ 49
6.1 Future Work ...................................................................................................................... 49
Bibliography ................................................................................................................................. 51
List of Tables
Table A: Main configurations of each hardware block ................................................................ 34
List of Figures
Figure 1: Exynos 4212 .................................................................................................................... 2
Figure 2: Microarchitecture of a generic credit-based NoC router ................................................. 4
Figure 3: Simulation infrastructure for smartphone/tablet SoCs .................................................... 9
Figure 4: Markov chain model for parameter collection and request generation. ........................ 14
Figure 5: Markov chain TG verification results of address model with different configurations 15
Figure 6: verification of self-similar timing model ....................................................................... 17
Figure 7: VCW implementation .................................................................................................... 18
Figure 8: 16-node Hierarchical-multiplexers baseline network .................................................... 20
Figure 9: 16-node Dynamic QoS network .................................................................................... 24
Figure 10: Zoom-in view of satellite router's outputs and backbone router's inputs .................... 27
Figure 11: Two examples of step-by-step procedures of token handshakes ................................ 28
Figure 12: Comparison of average latencies of packets in Dynamic QoS with different lengths of
buffer queues ................................................................................................................................. 34
Figure 13: Average latencies of packets from hardware blocks associated with VCW ............... 36
Figure 14: Latency distributions of packets from camera ............................................................ 37
Figure 15: Latency distributions of packets from display ............................................................ 37
Figure 16: Latency distributions of packets from encoder ........................................................... 37
Figure 17: Latency distributions of packets from decoder ........................................................... 38
Figure 18: Latency distributions of packets from modem ............................................................ 38
Figure 19: Average latencies of packets from non-VCW hardware blocks ................................. 39
Figure 20: Average total round-trip latencies of packets from VCW hardware blocks ............... 40
Figure 21: Average round-trip latencies for every 1000 cycles .................................................... 42
Figure 22: Average round-trip latencies for every 200 cycles ...................................................... 43
Figure 23: Network throughputs comparison ............................................................................... 44
Figure 24: Router areas and channel areas ................................................................................... 46
Figure 25: Router power consumptions ........................................................................................ 46
List of Acronyms
NoC Network-on-Chip
SoC System-on-Chip
QoS Quality-of-Service
WRR weighted round robin
TG traffic generator
VCW video-conferencing workload
ACK acknowledgement
BE best-effort
GS guaranteed services
UI user interface
ISA instruction set architecture
Chapter 1 Introduction
Smartphones differ from simple voice-communication devices in that they offer far
superior capabilities, such as navigation and video chatting. In fact, today's smart handheld
devices, e.g. smartphones and tablets, enable increasingly sophisticated tasks that
were either only provided by the large machines of the past, or simply never possible for lack
of sensors. To support these functionalities, the number of components, such as processing cores,
specialized accelerators and various sensors, integrated onto a system-on-chip (SoC)
continues to increase.
1.1 Motivations
Figure 1 shows the scale of a smartphone/tablet SoC as of early 2012. The chip is the Exynos 4212
by Samsung Electronics, which is marketed as suitable for either a smartphone or a tablet [1].
Each component requires that data be communicated between it and other parts of the system.
More specifically, the majority of this communication traffic arises because various processing
cores, accelerators and sensors all require access to the same memory module, which forms a
unique N-to-1 traffic pattern, in contrast to other types of SoCs whose network heterogeneity and
traffic patterns are rather different. The interconnection network that supports this data supply
determines memory latency and memory bandwidth, two key performance factors in a system [2].
Therefore the interconnection network, especially its ability to meet low latency requirements
and constraints on both size and power, has become a major concern in today's smartphone/tablet
SoC design.
Figure 1: Exynos 4212 SoC by Samsung, combining a 32nm dual-core ARM Cortex-A9 CPU
1.2 Research Goals
In this research, we investigate network designs for these specific SoCs, to facilitate the data-
communication requirements between the DRAM controller and other components. We propose to
use many concepts from Network-on-Chip (NoC) architectures, driven by high-bandwidth, low-
power and low-area demands. In particular, we would like to avoid distributing network resources
fairly to all applications or hardware blocks, as they have different performance
requirements of the network. We therefore focus on differentially provisioning
network resources to cater to the different requirements of different hardware blocks, using
Quality-of-Service (QoS) schemes.
To implement and evaluate our NoC-based interconnection network and QoS framework design,
we also construct a simulation infrastructure, which is composed of simulators and workloads.
The workloads are implemented by first characterizing a traffic pattern and then abstracting it
into a specific model.
1.3 Thesis Organization
The rest of the thesis is organized as follows. In Chapter 2, we provide an overview of basic NoC
and QoS concepts, and an overview of related work. In Chapter 3, we introduce our simulation
infrastructure, with descriptions of each element, including simulators and workloads. Then in
Chapter 4, we present two QoS schemes, weighted round robin (WRR) and Dynamic QoS, as
well as their corresponding network designs. In Chapter 5, we evaluate these QoS designs
through experiments in our simulation infrastructure. Lastly in Chapter 6, we summarize our
contributions and discuss potential plans for the next step.
Chapter 2 Related Work
With the trend of increasing core counts on a single chip, Network-on-Chip (NoC) architectures
have been proposed [3] [4] and are employed in homogeneous, general-purpose chips [5] [6] [7]
[8] to provide high-bandwidth, scalable on-chip interconnection networks. In a NoC
structure, one or several cores or memory controllers are bound to a router [9]. Figure 2 shows
the microarchitecture of a generic credit-based NoC router. Traffic injected into the network by
these cores or controllers through their router is first packetized, and each packet is then
further divided into a head flit, a tail flit and several body flits. Within the network, data
communication between routers is in units of flits. The flits are serialized and reassembled into
packets at their destination. A more comprehensive description of NoC concepts can be found in
Principles and Practices of Interconnection Networks, by Dally and Towles [2].
Figure 2: Microarchitecture of a generic credit-based NoC router
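The packetization described above can be sketched in a few lines of Python. This is an illustrative model only; the flit payload size and field names are our own assumptions, not those of any particular router implementation:

```python
from dataclasses import dataclass

FLIT_PAYLOAD_BYTES = 16  # hypothetical flit payload width


@dataclass
class Flit:
    kind: str        # "head", "body", or "tail"
    packet_id: int
    dest: int        # routing information carried by the head flit
    payload: bytes


def packetize(packet_id: int, dest: int, data: bytes) -> list:
    """Split a packet's data into a head flit, body flits, and a tail flit.
    (A packet that fits in one chunk degenerates to a single head flit.)"""
    chunks = [data[i:i + FLIT_PAYLOAD_BYTES]
              for i in range(0, len(data), FLIT_PAYLOAD_BYTES)] or [b""]
    flits = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            kind = "head"
        elif i == len(chunks) - 1:
            kind = "tail"
        else:
            kind = "body"
        flits.append(Flit(kind, packet_id, dest, chunk))
    return flits


def reassemble(flits: list) -> bytes:
    """Destination side: concatenate the serialized flit payloads."""
    return b"".join(f.payload for f in flits)
```

For a 40-byte packet this yields one head, one body, and one tail flit, and `reassemble` recovers the original data.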
There has been an increasing trend in industry to have heterogeneous cores, i.e. different types of
cores, on a chip. AMD's Fusion multi-core processors [10] and the Cell chips [7] developed by
IBM, Sony and Toshiba serve as good examples. There is also a fair amount of research on NoC
that targets heterogeneous networks. Lee et al. [11] hierarchically design and implement a
heterogeneous NoC based on a topology named hierarchical star. They focus on low-power
communication in design levels such as circuits, signaling, channel coding, protocol and
topology, using various power-efficient techniques. Lambrechts et al. [12] provide a power
breakdown analysis for heterogeneous NoCs, and identify the power bottlenecks considering the
platform as a whole. They carefully map an MPEG2 video chain as well as other applications
onto a heterogeneous NoC-based platform, and point out that the global interconnect is not that
critical for a well-optimized mapping. Kreutz et al. [13] employ a mix of 3 types of routers to
optimize heterogeneous NoCs for latency and energy consumption, along with an optimization
algorithm to find optimal placements for application cores.
Another important concept in NoC is Quality-of-Service (QoS). It is defined as service
quantification that is provided by the NoC to the demanding core [14]. In other words, it refers to
reserving and provisioning different resources for applications or traffic streams with different
priorities. Goossens et al. [15] identify two basic QoS classes: best-effort (BE) services, which
improve average resource utilization but offer no commitment, and guaranteed services (GS),
which do. The general shortcoming of BE is its proneness to network congestion [16]. Avasare et
al. [17] present a centralized OS communication management scheme that addresses congestion
of a BE NoC. In the work, control data is immune to congestion due to its own separate NoC. On
the other hand, a good example of GS is Preemptive Virtual Clock [18], which uses packet
preemption and a dedicated ACK network to provide GS, by allocating network bandwidth to
threads or applications. Many actual implementations of NoCs choose a combination of both
basic QoS classes. Bjerregaard et al. [19] propose MANGO, which utilizes allocated virtual
channels to provide connection-oriented GS and connection-less BE routing. Similarly in
Æthereal NoC [20], routers provide both GS and BE services. GS are obtained by means of
TDMA slot reservations. BE traffic makes use of non-reserved slots and of any slots reserved but
not used.
Nevertheless, the functionality and connectivity of most modern NoC QoS designs do not
match the communication demands of our target SoC architecture well. Our system is
heterogeneous, containing a large variety of processing units, accelerators and sensors, while
most previous NoC research focuses only on homogeneous cores. QoS
for heterogeneous networks has recently become a more attractive topic to NoC researchers. Murali et al.
[21] exploit the heterogeneity of applications, based on their different communication
requirements and traffic patterns, and map them onto reconfigurable NoCs. Cheng et al. [22]
leverage a heterogeneous interconnect to map different coherence protocol messages onto wires
of different widths and thicknesses. Grot et al. [23] propose a heterogeneous network to support
a thousand connected components with high area and energy efficiency, and strong QoS
guarantees. They reduce router complexity by isolating shared resources in dedicated QoS-
equipped regions of the chip. Bolotin et al. [24] present QNoC, a low-cost customized NoC to
meet QoS requirements. Services are categorized into 4 levels, where signaling has the highest
priority, followed by real time, read/write and block transfer. However, the typical NoC
application investigated in those papers is limited to supporting cache-coherence protocols in
shared-memory multi-core systems. The networks are usually of N×N sizes, in either mesh or torus
topologies, whereas the SoC network we investigate here is a unique N-to-1 communication
structure with rather different expected traffic patterns.
Regarding traffic characterization and generation, related work is as follows. Soteriou et al. [25]
propose an empirically derived statistical traffic model for NoCs. The model exposes both
spatial and temporal dimensions of traffic via 3 statistical parameters: hop count, burstiness, and
packet-injection distribution. Hestness et al. [26] propose to collect application traces while
preserving dependencies between network messages; they introduce Netrace, a trace-based
simulation platform with high fidelity due to the enforced dependencies. Again, the work mentioned
above, including the benchmarks used to generate traces, targets homogeneous general-purpose
interconnection networks. Although not focusing on NoC, Gutierrez et al. [27] present an
interesting analysis of smartphone applications. They measure a variety of mobile applications
for audio, video, and interactive gaming, and conclude that the characteristics of these
applications markedly differ from those of general-purpose benchmarks. We adopt their BBench,
a web-browser benchmark, as the CPU workload in this research.
Chapter 3 Simulation Infrastructure
As the first step of the project, we have constructed a simulation infrastructure for
smartphones/tablets, as shown in Figure 3. This infrastructure allows us to simulate and evaluate
modern Android platform workloads, as well as the interactions between the on-chip network
and the DRAM controller. Each component of this infrastructure is described as follows.
Figure 3: Simulation infrastructure for smartphone/tablet SoCs
3.1 Interconnect
We implement our QoS schemes and their corresponding interconnect topologies in BookSim
2.0, a cycle-accurate interconnection network simulator [2]. We have made proper modifications
to channels, routers and routing functions to suit our designs. In addition, instead of using the
built-in synthetic traffic patterns, we have implemented an interface between BookSim and all
the other simulators or workloads (to be described in the following sections) to provide real-life
traffic patterns. At the interface, each hardware block can either receive or inject available
memory requests/responses from or to the network at each clock cycle. We adopt open-loop
measurement configuration [2], which incorporates an injection queue of infinite length at each
network interface. These queues isolate the traffic processes from the network itself so that the
traffic patterns are kept as originally specified. Since in Chapter 5 we evaluate our network
designs on traffic patterns that we specify in various workloads, we will use open-loop
measurements for both latency and throughput.
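The open-loop source described above can be sketched as follows. This is a simplification with names of our own choosing, not the actual BookSim interface; the key property is that packet generation never stalls, while injection into the network may:

```python
from collections import deque


class OpenLoopSource:
    """Traffic source with an unbounded injection queue (open-loop
    measurement): the traffic process enqueues on its own schedule,
    regardless of whether the network can accept a new packet."""

    def __init__(self, traffic_process):
        # traffic_process() returns a packet for this cycle, or None
        self.traffic_process = traffic_process
        self.queue = deque()

    def cycle(self, network_ready: bool):
        pkt = self.traffic_process()       # generation never stalls
        if pkt is not None:
            self.queue.append(pkt)
        if network_ready and self.queue:   # injection may stall
            return self.queue.popleft()
        return None
```

If the source generates a packet every cycle but the network accepts one only every other cycle, the queue backlog grows while the offered traffic pattern stays exactly as specified.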
3.2 DRAM
We choose DRAMSim2 to model the DRAM memory controller of the system. It also models
memory channels, DRAM ranks and banks, and provides timing for memory accesses based on
its configurations [28]. It keeps monitoring the network for memory read/write requests, and
transforms them to memory transactions. When a transaction is returned, the generated response
would be packetized and sent back to the request owner.
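A toy stand-in for this request/transaction/response flow might look as follows. The fixed latency is a placeholder of ours; DRAMSim2 derives real timing from channel, rank and bank state:

```python
from collections import deque


class ToyDRAMController:
    """Accepts read/write requests from the network and returns responses
    after a fixed latency (a placeholder for DRAMSim2's detailed timing)."""

    def __init__(self, latency=50):
        self.latency = latency
        # entries: (completion_cycle, requester, address, is_write)
        self.in_flight = deque()

    def accept(self, cycle, requester, addr, is_write):
        """Transform a network request into a pending memory transaction."""
        self.in_flight.append((cycle + self.latency, requester, addr, is_write))

    def responses(self, cycle):
        """Return completed transactions, to be packetized and sent back
        to their request owners."""
        done = []
        while self.in_flight and self.in_flight[0][0] <= cycle:
            _, req, addr, w = self.in_flight.popleft()
            done.append((req, addr, w))
        return done
```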
3.3 Workloads
3.3.1 CPU
GEM5, which can run in full-system mode, is chosen to simulate the CPU [29]. It provides CPU
models with various ISAs; ARM best suits our case since most smartphone/tablet SoCs
adopt ARM processors. We set it to boot the Android operating system and run BBench, a web-
page rendering benchmark specifically for Android [27]. Due to some technical difficulties,
instead of directly connecting GEM5 with BookSim, we have collected memory-access traces of
it running BBench, and use them as the CPU traffic model. Although we lose some fidelity due
to the lack of dependency information, the traces still provide a usable CPU traffic pattern. In
addition, dependencies make the CPU self-throttling, a stabilizing behavior: when the
network becomes congested, the CPU becomes idle because few or no memory requests are fulfilled,
which in turn reduces the congestion [2]. However, QoS is generally most useful
during network congestion, which self-throttling tends to avoid.
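Trace replay then amounts to reading timestamped records and injecting them at the recorded cycles. A minimal sketch follows; the `cycle,address,R|W` trace format is our own convention for illustration, not the format GEM5 emits:

```python
import csv
import io


def load_trace(trace_file):
    """Parse a (cycle, hex address, R/W) trace into replayable records.
    Dependencies between requests are lost; only the recorded timing
    and addresses are preserved."""
    return [(int(c), int(a, 16), op == "W")
            for c, a, op in csv.reader(trace_file)]


def requests_at(trace, cycle):
    """Requests the replayed CPU injects on a given cycle,
    as (address, is_write) pairs."""
    return [(a, w) for c, a, w in trace if c == cycle]
```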
3.3.2 Traffic Generator (TG)
Smartphone SoCs are heterogeneous networks which, in addition to the “traditional” on-chip
elements mentioned above, also contain many specialized hardware elements such as a video
encoder or camera. When simulating the entire network, each element needs to be properly
modeled. The most direct way is to perform complete hardware simulation, but this is usually slow
and heavy on system resources. A comparatively simpler solution is trace replay; however, trace
files tend to be very large. In addition, both full simulation and trace replay share the
problem that, for most hardware blocks in a smartphone SoC, no specific simulator exists that
we could directly use or collect traces from.
Let’s revisit the reason we need to simulate the entire SoC. The most important goal of this
research is to investigate Quality-of-Service schemes for this particular type of network. We use
the hardware blocks as sources that inject traffic into the network. In this case, it is the traffic
patterns that matter most, while byte-level precision in addressing or cycle-level precision in
timing is not as important. Therefore, we have implemented a series of probabilistic traffic
generators that mimic the traffic patterns of the specialized hardware blocks in a lightweight way,
rather than reproduce exact memory traces from them. Memory requests are generated based on a
selection of input parameters, e.g. injection frequency and correlation probabilities, which are
either trained from real traffic traces or based on technical white papers.
One major kind of traffic generator we have implemented uses a history-based Markov-chain
address model as the basis for collecting parameters (e.g. probabilities) and reproducing the traffic
pattern. A Markov chain is a finite-state system with probabilities of transitions from one state
to another, depending on the current state [30]. As shown in Figure 4, each node in the chain
represents a particular sequence of loads and stores, with probabilistic transitions (edges)
representing the type (load/store) of the next memory request. A queue of the latest addresses
is maintained, and is updated after each request learned/generated. The address of the next memory
request is determined by its historical correlation with previous requests in the queue: the
probability of re-accessing a recent memory address is high, especially in photographic
applications, where dependencies on previous pixels/frames are common. In addition, the
correlation with the next adjacent row of the current row is also evaluated, because accesses to the
next row are expected to be fairly regular as well. Therefore, when generating new requests, the
predicted address will either match the address (or row+1) of one of the requests in the
history queue, or be random.
Figure 4: Markov chain model for parameter collection and request generation. (a) shows a
sample history-of-two Markov chain, with transitions representing whether the next request is a
read or a write. (b) is the queue of probabilities that the next request’s address equals the
address of one of the latest memory requests. These probabilities, together with the probabilities
of the next request’s address being random or row+1, sum to 1.
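The address side of this generator can be sketched as follows. The row size, probabilities and structure here are illustrative assumptions of ours; in the real generator they are trained from traces:

```python
import random
from collections import deque

ROW_BYTES = 1024  # hypothetical DRAM row size for the row+1 correlation


class AddressModel:
    """History-based address generator: the next address repeats one of
    the last `depth` addresses, steps to the next DRAM row, or is random,
    according to trained probabilities."""

    def __init__(self, depth, p_history, p_next_row, mem_size, seed=0):
        self.history = deque(maxlen=depth)  # most recent address first
        self.p_history = p_history          # one probability per history slot
        self.p_next_row = p_next_row        # remaining mass -> random address
        self.mem_size = mem_size
        self.rng = random.Random(seed)

    def next_address(self, current):
        r = self.rng.random()
        acc = 0.0
        for p, old in zip(self.p_history, self.history):
            acc += p
            if r < acc:
                addr = old                   # re-access a recent address
                break
        else:
            if r < acc + self.p_next_row:    # next adjacent row
                addr = (current // ROW_BYTES + 1) * ROW_BYTES
            else:                            # otherwise random
                addr = self.rng.randrange(self.mem_size)
        self.history.appendleft(addr)
        return addr
```

With all probability mass on the most recent history slot, the model keeps re-issuing the same address, matching the high re-access correlation observed in photographic workloads.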
This Markov-chain address model is validated by recollecting parameters (the second collection)
from the traffic pattern reproduced by the traffic generator, whose own parameters were collected
(the first collection) from a benchmark’s traffic. The idea is to compare the two sets of parameters
and examine their similarity. The example hardware block we select is an h.264 video encoder,
modeled by the 464.h264ref benchmark in SPEC CPU2006 [31] with the foreman_ref_encoder_baseline
input. Parameters, such as read ratio and address correlations, are collected with the PIN binary
instrumentation tool [32], which captures memory accesses (loads/stores) from applications and
feeds them into the Markov-chain model. The model determines the parameters on-the-fly, then
saves them to a file.
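The comparison between the two collections reduces to a mean-square error over parameter vectors; a minimal sketch follows. The normalized variant is our reading of the "Average Deviation" metric plotted in Figure 5, stated here as an assumption rather than the exact formula used:

```python
def mean_square_error(first, second):
    """MSE between the parameter vector trained from the original trace
    (first collection) and the vector re-trained from the generated
    traffic (second collection)."""
    assert len(first) == len(second)
    return sum((a - b) ** 2 for a, b in zip(first, second)) / len(first)


def average_deviation(first, second):
    """MSE normalized by the mean parameter value across both collections
    (our assumed reading of Figure 5's y-axis)."""
    mean = (sum(first) + sum(second)) / (len(first) + len(second))
    return mean_square_error(first, second) / mean if mean else 0.0
```

For example, read-ratio vectors [0.6, 0.4] and [0.5, 0.5] give an MSE of 0.01; identical vectors give 0.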
Figure 5: Markov chain TG verification results of address model with different configurations
Experimental results for the Markov chain address model are obtained for 6 different request
sequence history lengths (from 4 to 9) and address history lengths either equal to or twice as
large as request sequence lengths. For each pair of history values, test is run for 3 times and the
results are averaged. Mean square error is also calculated across runs for same history pairs given
same input parameters, to examine whether the traffic generated is unique to the benchmark
input. The results show that the mean square error is close or equal to 0 in almost all cases,
meaning the output parameters are consistent. The comparison of parameters is shown in
Figure 5, where (a) shows the difference between the first and second collections regarding
read/write speculation at each node. The y-axis is the mean-square error normalized by the
average of the read-ratio parameters in the first and second collections, averaged across all
nodes. It shows that the error stays roughly constant until the request-sequence history length
reaches 8, when it starts to grow. This is most likely due to the increasing number of possible
histories, which amplifies the impact of noise and randomness. Figure 5(b) shows a similar metric
for the address-correlation percentages, with a trend that accuracy improves with both longer
request-sequence and address histories. One particular detail to note is that while a longer
request-type history lowers the error rate, a longer address history has a much stronger impact,
as can be seen by comparing combinations (4, 8) and (8, 8): the address history length is the same
in both, but (4, 8) yields a lower error rate due to the smaller number of overall options, which
provides better robustness and less sensitivity to noise.
In addition to the address model, we have also modeled injection behavior in the time domain.
The timing model varies with the type of hardware block. For example, some blocks tend to be
of a streaming type, where packets are sent intensely and periodically, while others show
self-similarity: zooming in on the pattern reveals a pattern similar to the overall one [33].
Therefore, the timing parameters generally need more manual tuning than the address-model
parameters. Figure 6 shows an example of the h.264 video encoder’s streaming, self-similar
pattern, along with the output of a tuned traffic generator. The y-coordinate shows the time at
which the corresponding x-coordinate’s request is issued.
Figure 6: Verification of the self-similar timing model
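One standard way to produce self-similar injection timing is an ON/OFF process with heavy-tailed (Pareto-distributed) period lengths. The sketch below illustrates that idea; it is not necessarily the exact method used by our generator, and all parameter values are illustrative:

```python
import random


def pareto_on_off_times(n_requests, alpha=1.5,
                        on_scale=5.0, off_scale=50.0, seed=0):
    """Issue times from an ON/OFF source whose burst and idle periods are
    Pareto-distributed; heavy-tailed periods are a common way to obtain
    self-similar (bursty-at-every-scale) traffic. Returns the cycle at
    which each request is issued."""
    rng = random.Random(seed)
    times, t = [], 0.0
    while len(times) < n_requests:
        on = on_scale * rng.paretovariate(alpha)    # burst length, cycles
        off = off_scale * rng.paretovariate(alpha)  # idle gap, cycles
        end = t + on
        while t < end and len(times) < n_requests:
            times.append(int(t))
            t += 1.0                                # back-to-back in a burst
        t = end + off
    return times
```

Plotting request index against issue time for such a source yields the staircase-like, multi-scale bursty shape seen in Figure 6.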
3.3.3 Video-Conferencing Workload (VCW)1
This workload simulates traffic injected to the network by a combination of hardware blocks
involved in a video-conferencing application. It runs through a series of 1080p HD video frames.
In the workload, there are two sections: outgoing video and incoming video. Outgoing video
section simulates the following procedures. Camera writes a frame to DRAM. Then h.264
1 This work was done by Goran Narancic, a fellow M.A.Sc candidate supervised by Prof.
Andreas Moshovos.
encoder reads this frame, encodes it, and writes the encoded frame to DRAM. Finally, the modem
reads the encoded frame (to send out via the antenna). The incoming video section simulates
almost the opposite order: the modem writes an encoded frame (received via the antenna) to
DRAM; the decoder then reads and decodes this frame and writes it back to DRAM; finally, the
display reads the decoded frame. The sections are modeled by the h.264 reference implementation
encoding or decoding frames, while the PIN tool captures the memory requests it produces.
Figure 7 shows the implementation of memory-request capturing. The captured traffic is then
used to create memory-request streams from the camera, display and modem.
Figure 7: VCW implementation
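The per-frame producer/consumer order of the two sections can be sketched as a simple schedule. The stage names and interleaving below are our own simplification of the workload's structure:

```python
# Per-frame DRAM-access order in the two VCW sections (a sketch).
OUTGOING = [("camera", "write raw frame"),
            ("encoder", "read raw frame"),
            ("encoder", "write encoded frame"),
            ("modem", "read encoded frame")]
INCOMING = [("modem", "write encoded frame"),
            ("decoder", "read encoded frame"),
            ("decoder", "write decoded frame"),
            ("display", "read decoded frame")]


def frame_schedule(n_frames):
    """Enumerate the DRAM accesses of both sections frame by frame,
    preserving the producer/consumer order within each frame."""
    steps = []
    for f in range(n_frames):
        for block, action in OUTGOING + INCOMING:
            steps.append((f, block, action))
    return steps
```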
Chapter 4 Quality-of-Service Schemes
To facilitate different performance requirements by different hardware modules during resource
contention, we design networks that can perform two different QoS schemes: a baseline scheme,
and our proposed dynamic scheme.
4.1 Hierarchical-Multiplexers Baseline
We build our baseline topology on a straightforward mapping of the N-to-1 communication
structure, together with the idea of decentralized arbitration. If we were to use a
centralized arbiter to arbitrate packets from all the hardware blocks, both the size and complexity
of the arbiter would become unacceptably large, and the arbitration process would be very slow.
Instead we use 3 levels of routers to fit the current scale (approximately 16 nodes) of smartphone
SoC networks. Routing is simple, since paths are fixed. Inside each router, switch (a.k.a. crossbar)
arbitration is performed independently and locally among a relatively small number of traffic
streams. In this design, since switches have 2 or 3 inputs and only 1 output, they function the
same as multiplexers. This hierarchical-multiplexers network design is shown in
Figure 8.
Figure 8: 16-node Hierarchical-multiplexers baseline network
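The depth of such a multiplexer tree follows directly from the switch radix; a small sketch (the radix-3 assumption reflects the 2- or 3-input switches described above):

```python
def mux_tree_levels(n_inputs, radix=3):
    """Number of 2- or 3-input multiplexer levels needed to funnel
    n_inputs traffic streams down to a single output (paths are fixed,
    so each level only arbitrates among its few local inputs)."""
    levels = 0
    while n_inputs > 1:
        n_inputs = -(-n_inputs // radix)  # ceiling division
        levels += 1
    return levels
```

For the approximately 16-node scale considered here, radix-3 switches give 16 → 6 → 2 → 1, i.e. the 3 router levels of the baseline network.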
We choose weighted round robin (WRR) [33] as the baseline QoS scheme against which to
compare our proposed QoS scheme. Each traffic stream injected from a hardware block is called
a service in this context. WRR schedules services with different pre-assigned weights and is an
effective and relatively lightweight method. A simple starvation-avoidance mechanism monitors
how long packets have been waiting. In general, WRR is easy to implement and tune.
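As an illustration, the following is a minimal software sketch of WRR scheduling over packet queues. The queue contents and weights are hypothetical, and the starvation-avoidance mechanism described above is omitted:

```python
from collections import deque

def wrr_arbiter(queues, weights):
    """Weighted round robin: in each round, grant each non-empty input
    up to weights[i] packets before moving on to the next input.
    `queues` is a list of deques of packets (a simplified model)."""
    schedule = []
    while any(queues):
        for q, w in zip(queues, weights):
            for _ in range(w):
                if not q:
                    break
                schedule.append(q.popleft())
    return schedule

# Example: a high-priority service (weight 3) vs. a background one (weight 1).
display = deque(["d1", "d2", "d3", "d4"])
stream  = deque(["s1", "s2"])
print(wrr_arbiter([display, stream], [3, 1]))
# → ['d1', 'd2', 'd3', 's1', 'd4', 's2']
```

With equal injection rates, the higher-weighted service drains its queue faster, which is exactly the differentiated provisioning the baseline relies on.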
Priorities, or weights in WRR, of packets from different hardware blocks are assigned based on
both industry insights and observations from our experiments. Since we assume a VCW
application is running, we intentionally prioritize the services of the hardware blocks involved.
Among these services, as will be shown in Section 5.2.1, we find that the camera's packet latency
is the most vulnerable to network congestion, so we assign its packets top priority. The encoder
is the bottleneck of the VCW workload because it is computationally heavy, so we also assign its
packets a high priority. Beyond VCW, several other services are also crucial to the performance
of the entire system. For example, user interface (UI) traffic from the GPU is obviously
important to the user experience, so this service should be prioritized. More details of the
priority assignment can be found in Table A in Chapter 5. We note that even if our assumptions
about a hardware block's priority were wrong, the correctness of our design would not be
affected. As long as the experimental results justify the QoS functionality we expect, even under
faulty priority assumptions, our design can easily be adjusted to any service priority order.
To further ease the burden of arbitration at each subsequent level, we group services with
similar priorities together and bind them to a specific level-1 router, as shown in Figure 8. For
instance, the camera, encoder and decoder have the highest priorities, so they all inject their
packets through R3. From R3 to R7, the hardware blocks are bound in decreasing order of their
packets' priorities. For this reason, R1 can always assume that services from R3 take precedence
over services from R4, and those from R4 over those from R5. The same rule applies in R2, and
further in R0.
The following example illustrates the benefit of applying QoS. Suppose the display and the
streaming TG inject traffic at the same rate. In the hierarchical-multiplexer network without
QoS, each router adopts round-robin arbitration. This absolute fairness provisions 1/18 of
network resources to the display traffic but 1/12 to the less-critical streaming TG traffic. With
such a small portion of network resources, the data supply from DRAM to the display may be
delayed and the user experience jeopardized. By being assigned a weight in WRR much higher
than that of the streaming TG, the display can reserve enough network resources to meet its
tight performance requirement.
4.2 Dynamic QoS
The baseline topology, i.e. hierarchical multiplexers with static priority assignment, is a simple
and effective solution to the quality-of-service requirements of smartphone SoCs. However, it
suffers from a major limitation, illustrated in Section 5.2.2: because each packet's priority and
path are fixed, the flow control cannot guarantee that router and channel resources are evenly
distributed. For example, in VCW, when the user at one end stops filming, the camera and the
encoder are turned off. The resources in R5, e.g. buffers, become underutilized, while the active
services still contend fiercely for resources at the other routers. The situation is even worse
when the entire video-conference workload is shut down. Therefore, the system needs more
flexibility to adapt to different task combinations.
The scheme we propose in this project is called Dynamic QoS. It is based on the observation
that in most cases not all services are active, so network resources should be allocated
dynamically to each active service. In the topology, as shown in Figure 9, one original
input-queue router R0 is connected to DRAM at one end and, at the other end, to three slightly
modified routers that we call backbone routers. Inside these four routers, weighted-round-robin
arbitration is adopted, with the weights assigned to each port decreasing from the uppermost
port to the bottom port. Instead of the level-1 routers of Figure 8, each with a dedicated channel
to its unique level-2 router, Dynamic QoS adopts an intermediate network between the hardware
blocks' inputs and the backbone routers. This intermediate network could be fully connected,
which would allow packets from each hardware block to travel to different backbone routers via
different paths. Each green arrow in Figure 9 represents a bundle of such paths, including a data
channel from each network input to the backbone router.
Figure 9: 16-node Dynamic QoS network
At each hardware block's network input, as shown in Figure 9, control logic directs packets to
the appropriate output port. The number of output-port options depends on the number of
downstream backbone routers. For convenience of explanation, we regard this control logic and
its ports together as satellite routers, even though they are not typical routers: they neither
buffer request packets nor contain a large crossbar. The buffer-less design is possible because
there is no direct contention between requests from different network inputs. In fact, it is more
appropriate to treat a satellite router as an extension of the injection channel from a hardware
block to a backbone router. Every three hardware blocks are grouped by priority and bound to a
satellite router. Packets from R4 therefore have the highest priority, and packets from R8 the
lowest. Similar to the baseline, the backbone routers are also assigned different priorities, with
R1 the highest and R3 the lowest. Note that it does not make sense to send high-priority packets
to low-priority backbone routers, so R4 is connected only to R1. It is also nearly impossible for
packets with the lowest priorities to reach R1 or R2, as will be explained later. After eliminating
the redundant channels from R7 to R1 and from R8 to R1 and R2, the intermediate network
becomes partially connected (as opposed to fully connected) but remains fully functional, as
shown in Figure 9.
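The resulting connectivity can be written out explicitly. The following map is derived from the channel eliminations described above (router names follow Figure 9; R6's full fan-out is inferred, since no channel of R6 is said to be pruned):

```python
# Pruned intermediate network: satellite router -> reachable backbone
# routers, listed in decreasing priority order.
allowed_backbones = {
    "R4": ("R1",),              # highest-priority group: R1 only
    "R5": ("R1", "R2"),
    "R6": ("R1", "R2", "R3"),   # inferred: no R6 channel is pruned
    "R7": ("R2", "R3"),         # R7 -> R1 eliminated
    "R8": ("R3",),              # R8 -> R1 and R8 -> R2 eliminated
}
```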
Each backbone router keeps a number of tokens. A token simply represents the resources
available to accommodate one service, i.e. one traffic stream from one hardware block. The
maximum is preset based on the maximum number of services the router is expected to
accommodate at one time. Each satellite router keeps a record of the current token count at each
downstream backbone router, and also monitors the injection activity of each local service, i.e.
the input from each connected hardware block. For example, if, after a certain time interval, a
satellite router detects that a service has "woken up" from silence and started to inject packets
regularly, it redirects this service to the highest-priority downstream backbone router that has at
least one token, and informs that backbone router that it must consume one token for the new
service. The backbone router then decreases its token count by one and broadcasts this change
to all its subscribing satellite routers, which update their local token records. Similarly, if a
satellite router determines that a service has changed from active to inactive, it informs the
service's backbone router to increase its token count by one, meaning it can now accommodate
one more active service. The backbone router again informs all its subscribers, i.e. its connected
satellite routers, of this change, though not immediately, as will be explained in the following
examples. The token handshaking signals travel through a different kind of signaling channel,
shown as dashed lines in Figure 9 beside the corresponding data channels.
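The token bookkeeping just described can be sketched in a few lines. This is a simplified software model, not the hardware implementation: class and method names are illustrative, and the promotion/demotion cascade and deactivation path are omitted:

```python
class BackboneRouter:
    """Simplified token bookkeeping for one backbone router."""
    def __init__(self, name, max_tokens):
        self.name = name
        self.tokens = max_tokens
        self.subscribers = []            # satellite routers to notify

    def consume_token(self):
        self.tokens -= 1
        self.broadcast()

    def release_token(self):
        self.tokens += 1
        self.broadcast()

    def broadcast(self):
        # Inform every subscribing satellite router of the new count.
        for sat in self.subscribers:
            sat.token_view[self.name] = self.tokens


class SatelliteRouter:
    """Directs a newly active service to the highest-priority backbone
    router that still has a free token."""
    def __init__(self, downstream):
        # `downstream`: backbone routers in decreasing priority order.
        self.downstream = downstream
        self.token_view = {bb.name: bb.tokens for bb in downstream}
        for bb in downstream:
            bb.subscribers.append(self)

    def on_service_activated(self, service):
        for bb in self.downstream:       # try highest priority first
            if self.token_view[bb.name] > 0:
                bb.consume_token()
                return bb.name           # new destination for `service`
        return None                      # every downstream router is full


r1 = BackboneRouter("R1", max_tokens=1)
r2 = BackboneRouter("R2", max_tokens=5)
r5 = SatelliteRouter([r1, r2])
print(r5.on_service_activated("display"))  # → R1 (consumes R1's last token)
print(r5.on_service_activated("audio"))    # → R2 (R1 is now full)
```

The broadcast keeps every satellite router's local token view consistent, so routing decisions never need a round trip to the backbone router before choosing a destination.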
To present a clearer picture of the structure, Figure 10 magnifies the output ports of R5, a
satellite router, and the input ports of R1, a backbone router. R5 has two downstream backbone
routers, so each hardware block bound to R5 has two optional destinations for its packets. On
the other hand, R1 has reserved input ports and buffer queues for all possible services from its
upstream satellite routers. Between each pair of connected backbone and satellite routers there
is an independent signaling channel for token handshakes. For every group of services bound to
a satellite router, there is only one data channel for DRAM response packets, which is sufficient
because these response packets are usually scattered in time due to differences in DRAM access
latency. As shown by the yellow arrows in Figure 10, R1 sends DRAM response packets to R4,
and R5's DRAM response packets come from R2.
Figure 10: Zoom-in view of satellite router's outputs and backbone router's inputs
The following example illustrates the dynamic service rearrangement and token handshaking
procedures. Suppose a smartphone user switches off only the display during a video-conference
call, as shown in the first step of Figure 11(a), while the other hardware blocks keep working.
After a certain time interval, R5 detects this change and marks the "display" service inactive.
Suppose the "display" service originally went through R1. R5 therefore signals R1 to increase
its token count by one. If R1's token count was already greater than zero before the adjustment
(though this is very unlikely), nothing needs to be done except broadcasting the change to
R4-R7, because no other service needs to use R1 anyway. If R1 had zero tokens before the
adjustment, then with this newly available token R1 queries R5 through R7 to find a service that
can be promoted from R2 or even R3. It does not query R4 because the services at R4 all have
higher priorities than "display", so they should either already be using R1 or be inactive. R1
now informs R5 of the available token. Suppose Audio is currently using R2; it is then
redirected to R1, and R2 gains one token. A similar procedure is followed to find a new owner
for R2's token. In the end, if no service needs the token, R2 broadcasts this to all its subscribers,
and R5-R8 increase their local record of R2's tokens by one.
Figure 11: Two examples of step-by-step procedures of token handshakes
On the other hand, suppose the user now switches the display back on, as shown in Figure 11(b),
and R5 detects that the display has started to inject packets regularly again, represented by the
green arrow. If R5 finds R1's token count greater than zero, it assigns R1 as the new
downstream router for the "display" traffic; R1 then broadcasts this change to R4-R7, and the
procedure is complete. If R1 was already fully reserved by traffic streams from R4, the "display"
traffic is redirected to R2. It is possible that R2 was also fully reserved, in which case R2 is
temporarily overloaded. R2 then finds the service from the lowest-priority satellite router and
sends two pulses via the signaling channel to deactivate and reactivate this service. The satellite
router recognizes this service as newly activated and finds an appropriate downstream backbone
router for it.
Things to note:
- Initially, all services are carefully distributed to backbone routers so that each backbone router
uses up its tokens to accommodate the highest-priority available services. For example, R2 has 5
tokens in total, so initially it accommodates all 3 services from R5 and 2 of the services from R6.
- When a satellite router receives more than one status-change notification, whether from newly
activated or deactivated services, it handles them in order of priority. Similarly, when more than
one newly activated service arrives at a backbone router in the same clock cycle, the router
satisfies the services in order of priority.
- Whenever a service is promoted or demoted to another backbone router, the satellite router
must wait until the tail flit of the current packet has been sent. Otherwise, flits may be reordered,
causing a serialization problem when packets exit the network.
- A threshold classifies services as active or inactive based on the number of packets they send
within a specific period of time. An inactive service may therefore still send a limited number of
packets to DRAM. While inactive, its packets are routed to the backbone router that was
assigned to it the last time it was active, until the service becomes active again and a new
destination backbone router is chosen. During this period, the service is assigned a lower
priority than the currently active services within the same router.
- The worst case of this token-handshake protocol occurs when R1 and R2 are fully subscribed
and R1 receives a newly active service from R4, which may force R1 and then R2 to demote a
service to a lower backbone router. This takes 3 rounds of handshakes, plus the cycles spent
waiting for the tail flit for each service rearrangement. However, no packet actually needs to
wait this long before being assigned a new backbone router: each newly active service waits for
at most one round of handshakes until its new destination backbone router is decided.
Dynamic QoS addresses the limitation of the baseline, i.e. the weighted-round-robin
hierarchical-multiplexer design discussed previously. As its name suggests, this new QoS
scheme dynamically allocates the best resources to packets from different input sources, and
therefore improves the throughput of the whole network. In addition, satellite routers are much
smaller than the baseline's level-1 routers in terms of buffer area; consequently, as the
experimental results will show, to achieve similar performance Dynamic QoS reduces router
buffer area by 35.2% and router buffer power consumption by 34.5%.
Chapter 5 Experimental Evaluation
We use the simulation infrastructure described in Chapter 3 to run simulations and collect
results. This chapter first describes the parameters used to set up each component in the
experiments, then presents and analyzes the results.
5.1 Experiment Setup
We set the network and the TGs to run at 3.2GHz, while the other hardware blocks have
different clock frequencies, shown in Table A. The table also lists the main configurations of the
different simulators and workloads. We include 9 TGs in this network. Two of them model GPU
traffic: one for the user interface (UI) and the other for 3D graphics (3D). Though generated by
the same hardware unit, we model these two services separately because they have different
traffic patterns and, more importantly, different priorities. Another TG models streaming audio
traffic, specifically 320kbps MP3 streaming. The remaining TGs are not assigned specific
services. This does not mean they are unimportant; on the contrary, they play critical roles in
stressing the network, since the network in a real smartphone system is stressed by a variety of
less-known traffic streams.
Network packets are uniformly 64 bytes long, and each is broken into 16 4-byte flits. Each data
channel is 4 bytes wide, allowing one flit to traverse it per cycle. At each level of routers in the
baseline network, memory request flits incur a 1-cycle switch-allocation delay, while memory
response flits incur a 1-cycle routing delay. In the Dynamic QoS network, the delay composition
is the same except for a 1-cycle routing delay in place of the switch-allocation delay at satellite
routers. Accounting also for the 15-cycle flit serialization delay, the zero-load round-trip latency
of both the baseline and Dynamic QoS networks is 40 cycles. In Dynamic QoS, the signaling
channels for token handshakes are 1 byte wide, which is enough to transfer a signal containing
block-ID bits and signal-type bits within 1 cycle.
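The packet and flit parameters above determine the serialization delay directly:

```python
PACKET_BYTES = 64
FLIT_BYTES = 4                                  # channel width: one flit per cycle

flits_per_packet = PACKET_BYTES // FLIT_BYTES   # 16 flits per packet
serialization_delay = flits_per_packet - 1      # 15 cycles behind the head flit
print(flits_per_packet, serialization_delay)    # → 16 15
```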
Hardware Block | Priority | Modeling   | Main Configurations
DRAM           | N/A      | DRAMSim2   | Memory size: 4GB; Frequency: 800MHz DDR3; Bus width: 64 bits; Controller policy: FCFRFS; Row buffer policy: open page
Camera         | 4        | VCW        | Frequency: 160MHz
Display        | 3        | VCW        | Frequency: 160MHz
Encoder        | 4        | VCW        | Frequency: 3.2GHz
Decoder        | 4        | VCW        | Frequency: 3.2GHz
Modem          | 2        | VCW        | Frequency: 800MHz
CPU            | 2        | GEM5 trace | Frequency: 1GHz; Caches: L1i 32KB, L1d 64KB; ISA: ARM; Mode: full-system; Benchmark: BBench
GPU(UI)        | 3        | TG         | Address: Markov-chain; Timing: self-similar
GPU(3D)        | 2        | TG         | Address: Markov-chain; Timing: self-similar
Audio          | 3        | TG         | Address: linear; Timing: streaming
unspecified    | 1        | TG         | Address: linear; Timing: streaming
unspecified    | 1        | TG x5      | Address: Markov-chain; Timing: random
Table A: Main configurations of each hardware block
Figure 12: Comparison of average latencies of packets in Dynamic QoS with different lengths
of buffer queues
Routers have a buffer queue for each input port. In the baseline, and in R0 of Dynamic QoS,
each buffer queue has 24 slots; in the backbone routers of Dynamic QoS, each queue has 10
slots. All of the above parameters were chosen by running experiments with different
configurations; as the example in Figure 12 shows, fewer buffers per queue would cause a
performance loss, while more would bring no obvious benefit. A similar method was used to
select the interval for active/inactive service detection, set to 1000 cycles: a smaller interval
would mean finer-grained control but would yield only negligible performance gains.
We use the number of frames to represent the length of a simulation. Typically we run
simulations for 10 frames, which gives a good balance between corner-case coverage and
simulation time; this corresponds to just over 1 billion network cycles and more than 40 million
memory requests.
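The frame count translates into cycles as follows, a quick sanity check assuming frames are processed at the 30 fps target rate on the 3.2 GHz network clock:

```python
FRAMES = 10
FPS = 30              # target frame rate
CLOCK_HZ = 3.2e9      # network clock

sim_cycles = FRAMES / FPS * CLOCK_HZ
print(f"{sim_cycles:.2e}")  # → 1.07e+09, i.e. just over 1 billion network cycles
```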
5.2 Experiment Results
5.2.1 Latencies
To evaluate network performance with respect to the quality of service each hardware block
receives, we adopt average round-trip latency as the main metric. Lower average latency means
smaller memory access delay, since the network functions as part of the data-supply
architecture. We compare the average round-trip in-network latencies across all three network
configurations: hierarchical multiplexers with simple round-robin arbiters, hierarchical
multiplexers with weighted-round-robin arbiters, and Dynamic QoS.
Figure 13: Average latencies of packets from hardware blocks associated with VCW
Figure 13 shows the results for the hardware blocks involved in VCW. We can see performance
gains, i.e. reductions in average latency, over the non-QoS baseline, since we intentionally
prioritize the VCW services. After being assigned top priority over all other services, the camera
shows the most significant latency reduction, at 42.9%. This supports our claim in Section 4.1
that the camera is relatively more vulnerable to resource contention.
Figures 14 to 18 show the latency distributions of packets from the VCW services as
histograms. In general, latencies under QoS concentrate more heavily in the bins close to the
zero-load latency, while at the other end the number of long-latency packets is reduced by the
QoS schemes.
Figure 14: Latency distributions of packets from camera
Figure 15: Latency distributions of packets from display
Figure 16: Latency distributions of packets from encoder
Figure 17: Latency distributions of packets from decoder
Figure 18: Latency distributions of packets from modem
Similarly, the other services prioritized by the QoS schemes, e.g. GPU(UI), streaming audio and
CPU, also perform better than in the non-QoS network. All of these performance gains come at
the expense of a performance penalty to the low-priority services. In particular, as Figure 19
shows, the average latencies of the streaming TG and the lowest-priority TG5 are severely
affected. In general, the results of HMux-WRR and Dynamic QoS
shown in both Figure 13 and Figure 19 are fairly close. The reason is that both schemes use
weighted-round-robin arbitration in their routers, and the weights assigned to each service's
packets are also the same.
Figure 19: Average latencies of packets from non-VCW hardware blocks
In Figure 20, we include memory access delay in the round-trip delay. Comparing with Figure
13, we find that in-network round-trip delay is a significant portion of the total delay; however,
the performance improvements are slightly overshadowed by the, so far unpredictable, memory
access delays. We also show a comparison against the average deadline that guarantees 30
frames/sec performance for each VCW hardware block. The deadlines are calculated from the
different numbers of read/write requests per frame for the different hardware blocks. For
example, the encoder
reads 8MB/frame from memory and writes back 0.16MB/frame, according to the typical 50:1
compression ratio of H.264 video [34]. Since each memory read/write is 64 bytes and the
system runs at 3.2GHz, achieving a minimum of 30 frames/sec requires an average packet
latency of at most 837 cycles. The modem only reads or writes encoded frames, which are very
small, so we do not show its expected average latency for 30fps. The comparisons show that the
average latencies we obtain are all well below their corresponding 30fps deadlines, indicating
that 30fps can be achieved. This is also confirmed by calculating the frame rate from the total
number of cycles (slightly over 1 billion) spent to finish a 10-frame simulation.
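The encoder deadline quoted above can be reproduced as follows. This is a sketch; we assume decimal megabytes, which matches the 837-cycle figure:

```python
# Reproducing the 30 fps latency deadline for the encoder.
CLOCK_HZ = 3.2e9          # system clock
FPS = 30                  # target frame rate
REQUEST_BYTES = 64        # one memory request per 64-byte packet

read_bytes = 8e6          # encoder reads 8 MB per frame
write_bytes = 0.16e6      # writes back 0.16 MB (~50:1 H.264 compression)

requests_per_frame = (read_bytes + write_bytes) / REQUEST_BYTES
cycles_per_frame = CLOCK_HZ / FPS
deadline = cycles_per_frame / requests_per_frame  # avg cycles per request
print(round(deadline))    # → 837
```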
Figure 20: Average total round-trip latencies of packets from VCW hardware blocks
5.2.2 Case Study: a micro-experiment
In real-life scenarios, a stream of high-intensity but low-priority traffic may suddenly appear
and overwhelm network resources. How well and how quickly a QoS scheme responds to such a
change is a fundamental consideration in its design. We therefore designed the following
micro-experiment to demonstrate that our Dynamic QoS scheme handles such cases well.
In this experiment, R4 has 3 active streaming TGs, each injecting a packet every 100 cycles; R6
and R7 each have another 3 active streaming TGs injecting a packet every 20-30 cycles; the
other satellite routers have no active services. The active streaming TGs are given different
initial waiting periods and different silence intervals to provide stable, high pressure on DRAM.
The silence intervals are set small enough not to trigger inactive detection. TG1-TG3 start at
cycle 0; they have a 200-cycle silence interval after every 1500-2500 cycles of continuous
packet injection. TG4-TG9 start 300 cycles later; they have an 80-cycle silence interval after
every 1000-1500 cycles.
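The TG on/off pattern just described can be sketched as a schedule generator. Function and parameter names are illustrative, not part of the actual infrastructure:

```python
import random

def injection_schedule(start, burst_range, silence_len, horizon, seed=0):
    """Yield (begin, end) cycle intervals during which a streaming TG
    injects packets: bursts of random length drawn from burst_range,
    separated by fixed silence_len gaps (kept too short to be flagged
    inactive by the 1000-cycle detection interval)."""
    rng = random.Random(seed)
    t = start
    while t < horizon:
        burst = rng.randint(*burst_range)
        yield (t, min(t + burst, horizon))
        t += burst + silence_len

# TG1-TG3: start at cycle 0, 1500-2500-cycle bursts, 200-cycle silences.
tg1 = list(injection_schedule(0, (1500, 2500), 200, 45_000))
# TG4-TG9: start 300 cycles later, 1000-1500-cycle bursts, 80-cycle silences.
tg4 = list(injection_schedule(300, (1000, 1500), 80, 45_000))
```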
R1 has 3 tokens and R2 has 6 tokens. According to the routing algorithm, all of R4's services are
served by R1, and all of R6's and R7's go to R2. Every 1,000 cycles, each satellite router scans
locally to detect newly activated or deactivated services and outputs the previous period's
average round-trip latency for each service's packets. We start collecting results after 5,000
cycles to allow for initial stabilization, and we set the total simulation time to 45,000 cycles.
Figure 21: Average round-trip latencies for every 1000 cycles
The results are shown in Figure 21. At cycle 20,000, we deactivate TG1 at R4. The token thus
freed at R1 is consumed by TG4 from R6. As can be observed, the average latencies of TG2's
and TG3's packets are not hurt by the newly joined TG4 but instead decrease slightly, due to the
absence of TG1, which used to have the highest arbitration weight; the same holds for the
average latencies of packets from all the other services. One might not expect TG4 to show such
a performance improvement at cycle 21,000, since its promotion to R1 is not performed until
then. In fact, this performance gain between cycles 20,000 and 21,000 is simply due to having
fewer competitors in arbitration, for the same reason as the other active services. After we set
TG1 active again at cycle 35,000, R1 takes back the token from TG4 at cycle 36,000 and
assigns it to TG1, and we can observe that all average latencies return to their original states.
Figure 22: Average round-trip latencies for every 200 cycles
It becomes more interesting when we zoom in to observe what exactly happens to each service's
packets around cycle 35,000. As shown in Figure 22, when TG1 resumes injecting traffic, the
protocol routes its packets to R1 with lower priority than TG2-TG4 in R1. Even so, the
performance of TG2-TG4 is slightly affected by this additional competitor for router resources.
When the now-active TG1 is finally detected at cycle 36,000, it is assigned top priority in R1,
while TG4 is rerouted back to R2. All packets'
latencies return to their original states gradually, given the delay of waiting for tail flits in order
to perform the service rerouting and priority reassignment.
As a comparison, we run the same experiment on the same network but without dynamic
routing or dynamic priority assignment. As in the WRR baseline, priorities are pre-assigned to
TG1's-TG9's packets from high to low. We focus on the latency change for packets from TG4,
shown as TG4* in Figure 21. Since TG1-TG3 still have higher priorities and are "isolated" in
R1, their performance does not change. The results show that when TG1 is inactive, TG4* does
not improve as much as TG4 does, because TG4* still competes with 5 other TGs in R2 even
though R1 now has available resources. This is analogous to the WRR baseline, which also
lacks such adaptivity.
5.2.3 Throughput
Figure 23: Network throughputs comparison
Network throughput is an important metric for evaluating a NoC design: it represents the
maximum communication capacity the network can support. In typical N-to-N NoC research,
throughput is evaluated by increasing the traffic injection rate of each node and measuring the
average number of flits received at each node. Since we aim to improve the efficiency of the
data supply from DRAM, what interests us more is the maximum number of memory
requests/responses the system can serve within a given time interval. We therefore place a
counter at the output port to DRAM, which counts the number of packets arriving at DRAM
within 1 million network cycles. At the other end, all hardware blocks are replaced by streaming
TGs, and we gradually increase the injection rate of each TG. As shown in Figure 23, Dynamic
QoS saturates at a higher packet count than the WRR baseline regardless of further increases in
injection rate, demonstrating a 5.2% higher communication capacity between DRAM and its
requesters. The difference lies mainly in packets with mid-level priorities. In the WRR baseline,
these packets contend for resources in a limited number of mid-priority routers, even when the
high-priority routers are nearly idle. In the same scenarios under Dynamic QoS, some of these
packets are rerouted to high-priority routers and thus prioritized, while the contention in the
mid-priority routers is relieved.
5.2.4 Area and Power
Figure 24: Router areas and channel areas
Figure 25: Router power consumptions
We measure the area and power costs of both QoS schemes using the power module in BookSim
2.0 with a 32nm CMOS process. Area and static power consumption are calculated from the
configuration parameters of each network; dynamic power consumption is based on activity
factors for each router component, recorded over the entire simulation. Figure 24 shows a
saving of 6.7% in total router area, particularly in buffer area, which is reduced by 35.2% in
total. The reason is that the satellite routers of Dynamic QoS lack the request-flit buffers present
in the level-1 routers of the WRR baseline. The same reason explains the 40.9% reduction in
input (buffer) power consumption for Dynamic QoS, visible in Figure 25. The other side of the
coin is the increased channel area in Dynamic QoS, which provides a more abundant set of
channels to give each service more options.
It should be noted that routers and data channels reside in different layers. Router logic is in
silicon, while channels incur overhead in the metal layers and in silicon through the insertion of
repeaters to meet cycle-time constraints. Given the large number of metal layers in modern
ASICs, the increased channel requirements should not be problematic. The repeater insertion
may impact logic density, but that exploration requires detailed layout beyond the scope of this
research. The other cost of the adaptivity of the intermediate network in Dynamic QoS is the
increased number of switch inputs in the backbone routers: as Figure 24 and Figure 25 show,
the switch area and power consumption of Dynamic QoS increase by 36.4% and 3.9x
respectively over the WRR baseline.
Chapter 6 Conclusions
In summary, in this research we have made the following contributions:
- We investigated traffic patterns and analyzed the prioritization of data-communication streams
between the DRAM controller and the other components.
- We implemented WRR and a newly designed Dynamic QoS scheme specifically for
smartphone/tablet SoC networks, and described their protocols and the corresponding network
topologies.
- We constructed a simulation infrastructure for smartphone/tablet SoCs and used it to evaluate
our QoS designs. The results show performance gains over the non-QoS baseline in terms of
average latency. Dynamic QoS outperforms the WRR baseline in network throughput as well as
router area and power consumption; as a tradeoff, it incurs more channel and in-router switch
cost.
6.1 Future Work
As future work, we plan to integrate a CPU simulator directly into the simulation infrastructure. This would restore data-dependency information, which can affect traffic patterns. Similarly, we plan to integrate a GPU simulator to provide more realistic traffic patterns than those from the current dummy model. Once such a credit-feedback mechanism is introduced for all network injections, we could switch to finite injection queues, i.e. closed-loop measurement [2]. In that case, we could run benchmarks directly on the simulators and measure the sensitivity of their run time to network parameters as another evaluation metric. Moreover, a more adaptive scheme would be required once injection queues are finite: for instance, when a finite injection queue fills beyond a certain level, the corresponding priority should be increased.
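The occupancy-triggered priority boost suggested above could take roughly the following shape. This is a sketch of the idea only, not an implementation from the thesis; the class name, threshold, and boost amount are all assumptions for illustration.

```python
# Illustrative sketch: a finite injection queue whose priority rises when
# occupancy crosses a threshold, so backlogged sources drain sooner.
# All names and parameter values here are assumptions, not thesis code.
from collections import deque

class InjectionQueue:
    def __init__(self, capacity, base_priority, boost=2, threshold=0.75):
        self.q = deque()
        self.capacity = capacity
        self.base_priority = base_priority
        self.boost = boost          # priority increment applied when congested
        self.threshold = threshold  # occupancy fraction that triggers the boost

    def enqueue(self, packet):
        if len(self.q) >= self.capacity:
            return False            # queue full: the source stalls (closed loop)
        self.q.append(packet)
        return True

    @property
    def priority(self):
        occupancy = len(self.q) / self.capacity
        return self.base_priority + (self.boost if occupancy > self.threshold else 0)
```

The key design point is that priority becomes a function of queue state rather than a static per-flow assignment, which is what makes the scheme adaptive under closed-loop injection.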
Regarding our proposed Dynamic QoS, we have demonstrated its capability to reduce the average latency of packets from high-priority traffic. We still need to watch the jitter, i.e. the variance of packet latencies, since it is another important factor affecting overall performance. In addition, we may be able to demonstrate its scalability by building scaled-up versions and comparing them experimentally with the baseline. To evaluate area and power cost, RTL implementations of the routers would be more accurate, though the current method should be sufficient for comparisons. Lastly and most importantly, we will target a QoS co-design with the DRAM controller. From a system perspective, this may lead to a more effective yet possibly simpler design.
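The jitter measurement mentioned above amounts to reporting a spread statistic alongside the mean latency. A minimal sketch, with made-up latency values in cycles, could look like this:

```python
# Sketch of measuring jitter as the sample standard deviation of
# per-packet latency, alongside the mean already reported.
# The latency values below are invented for illustration.
import statistics

def latency_stats(latencies):
    """Return (mean, jitter), with jitter as the sample standard deviation."""
    return statistics.mean(latencies), statistics.stdev(latencies)

high_priority = [12, 14, 13, 15, 12, 40]  # one outlier inflates the jitter
mean, jitter = latency_stats(high_priority)
```

As the example suggests, a single late packet barely moves the mean but dominates the jitter, which is why a QoS scheme that looks good on average latency can still deliver poor worst-case behavior to latency-sensitive blocks.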
Bibliography
[1] J. Hruska, "Exynos 4212," ExtremeTech, 4 Jan. 2012. [Online]. Available: http://www.extremetech.com/computing/111315-blood-in-the-water-nvidia-qualcomm-samsung-and-ti-prepare-for-arm-war.
[2] W. J. Dally and B. Towles, Principles and Practices of Interconnection Networks, Elsevier,
Inc., 2004.
[3] W. J. Dally and B. Towles, "Route packets, not wires: On-chip interconnection networks,"
in Design Automation Conference, 2001.
[4] N. Enright Jerger and L.-S. Peh, On-Chip Networks, M. Hill, Ed. Morgan and Claypool
Publishers, 2009.
[5] Y. Hoskote, "A 5-GHz mesh interconnect for a Teraflops processor," IEEE MICRO, vol. 27,
no. 5, pp. 51-61, 2007.
[6] J. Howard et al., "A 48-core IA-32 message-passing processor with DVFS in 45nm CMOS,"
International Solid State Circuit Conference, 2010.
[7] J. A. Kahle et al., "Introduction to the Cell multiprocessor," IBM Journal of Research and Development, vol. 49, no. 4, 2005.
[8] D. Wentzlaff et al., "On-chip interconnection architecture of the tile processor," IEEE
MICRO, vol. 28, 2007.
[9] J. Kim, J. Balfour and W. J. Dally, "Flattened butterfly topology for on-chip networks,"
IEEE MICRO, 2007.
[10] N. Brookwood, "AMD Fusion family of APUs: Enabling a superior, immersive PC experience," AMD white paper, 2010.
[11] K. Lee, S.-J. Lee and H.-J. Yoo, "Low-power network-on-chip for high-performance SoC design," IEEE Trans. VLSI Syst., 2006.
[12] A. Lambrechts et al., "Power breakdown analysis for a heter. NoC," in ASAP, 2005.
[13] M. Kreutz et al., "Design space exploration comparing homogenous and heterogeneous
network-on-chip architectures," in SBCCI, 2005.
[14] T. Bjerregaard and S. Mahadevan, "A survey of research and practices of network-on-chip,"
ACM Comput. Surv., 2006.
[15] K. Goossens, "Networks on silicon: Combining best effort and guaranteed services," in IEEE DATE.
[16] J. W. van den Brand, C. Ciordas, K. Goossens and T. Basten, "Congestion-controlled best-
effort communication for networks-on-chip," in Des., Autom. Test Eur. Conf., 2007.
[17] P. Avasare et al., "Centralized end-to-end flow control in a besteffort network-on-chip," in
EMSOFT, 2005.
[18] B. Grot et al., "Preemptive virtual clock: A flexible, efficient, and cost-effective QoS
scheme for networks-on-a-chip," IEEE MICRO, 2009.
[19] T. Bjerregaard and J. Sparso, "A router architecture for connection-oriented service
guarantees in the MANGO clockless network-on-chip," IEEE DATE, vol. 2, pp. 1226-1231,
2005.
[20] K. Goossens et al., "The Æthereal network on chip: concepts, architectures, and
implementations," in IEEE Design and Test of Computers, 2005.
[21] S. Murali et al., "A methodology for mapping multiple use-cases onto networks on chips," in Des. Autom. Test Eur. Conf., 2006.
[22] L. Cheng et al., "Interconnect-aware coherence protocols for chip multiprocessors,"
IEEE/ACM ISCA, pp. 339-351, 2006.
[23] B. Grot et al., "Kilo-NOC: a heterogeneous network-on-chip architecture for scalability and
service guarantees," IEEE/ACM ISCA, vol. 38, 2011.
[24] E. Bolotin et al., "QNoC: QoS architecture and design process for network on chip," J. Syst.
Architecture: EUROMICRO J., vol. 50, no. 2/3, pp. 105-128, 2004.
[25] V. Soteriou, H. Wang and L.-S. Peh, "A statistical traffic model for on-chip interconnection
networks," Int. Symp. Model., Anal., Simul. Comput. Telecommun. Syst., pp. 104-116, 2006.
[26] J. Hestness, B. Grot and S. W. Keckler, "Netrace: dependency-driven trace-based network-on-chip simulation," in the Third International Workshop on NoC Architectures, 2010.
[27] A. Gutierrez et al., "Full-system analysis and characterization of interactive smartphone applications," IEEE Intl. Symp. on Workload Characterization, 2011.
[28] P. Rosenfeld, E. Cooper-Balis and B. Jacob, "DRAMSim2: A cycle accurate memory
system simulator," Computer Architecture Letters, vol. 10, no. 1, pp. 16-19, 2011.
[29] N. Binkert et al., "The gem5 simulator," SIGARCH Comput. Archit. News, vol. 39, 2011.
[30] S. Meyn, R. L. Tweedie and P. W. Glynn, Markov Chains and Stochastic Stability, 2 ed.,
Cambridge University Press, 2008.
[31] J. L. Henning, "SPEC CPU2006 benchmark descriptions," ACM SIGARCH Computer
Architecture News, 2005.
[32] V. J. Reddi et al., "Pin: a binary instrumentation tool for computer architecture research and
education," WCAE, 2004.
[33] "Self-similarity," Wikipedia, [Online]. Available: http://en.wikipedia.org/wiki/Self-
similarity.
[34] A. Lewin and T. K. Zvi, "Configurable Weighted Round Robin Arbiter," United States Patent 6,032,218, 29 Feb. 2000.
[35] "Compression Ratio Rules of Thumb," [Online]. Available:
http://www.kanecomputing.co.uk/pdfs/compression_ratio_rules_of_thumb.pdf.
[36] D. Pham et al., "The design and implementation of a first-generation cell processor," in
IEEE International Solid-State Circuits Conference, 2005.