
Combining Transaction-level Simulations and Model Checking for MPSoC Verification and Performance Evaluation

GABOR MADL 1, SUDEEP PASRICHA 2, QIANG ZHU 3, LUIS ANGEL D. BATHEN 1, NIKIL DUTT 1

1 Donald Bren School of Information and Computer Sciences, University of California, Irvine, CA 92697
2 Department of Electrical and Computer Engineering, Colorado State University, Fort Collins, CO 80523
3 Cadence Design Systems Japan, 2-100-45 Shin-Yokohama Kouhoku-ku, Yokohama Kanagawa 222.0033 Japan

The ARM Advanced Microcontroller Bus Architecture (AMBA) is a widely used interconnection standard for MPSoC design. This paper describes a method for the functional verification and performance evaluation of AMBA-based MPSoC designs by combining transaction-level simulations and model checking. The application of formal methods provides a means of functional verification and allows us to obtain end-to-end execution bounds for AMBA-based MPSoC designs. Using our formal models we were able to uncover an undocumented ambiguous case in the AMBA specification that can lead to deadlocks. We demonstrate how the combination of transaction-level simulations and model checking can be used to evaluate design alternatives for a digital camera case study and to guarantee the correctness of the design.

Categories and Subject Descriptors: I.6.4 [Simulation and Modeling]: Model Validation and Analysis

General Terms: Design, Performance, Verification

Additional Key Words and Phrases: System-on-Chip, Model Checking, Performance Evaluation

ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Month 20YY, Pages 1–0??.

1. INTRODUCTION

Multi-Processor System-on-Chips (MPSoCs) are deeply embedded electronic systems operating in resource-constrained environments. The complexity and functionality of these systems often rival those of high-performance processors from a decade ago, at a fraction of the price and energy cost. Bus interconnect standards such as the ARM Advanced Microcontroller Bus Architecture (AMBA) Advanced High-performance Bus (AHB) [ARM 1999] and CoreConnect [IBM 2001] are commonly used to integrate heterogeneous components into MPSoC designs. Bus protocols provide reliable communication in MPSoC systems by specifying standard methods for interaction between components connected to the bus. Key issues that bus protocols must address include synchronization, concurrent requests, transmission errors, deadlock prevention, and QoS support for MPSoC designs.

Despite the critical role bus protocols play in providing a reliable platform in MPSoC systems, their specifications are typically written as a combination of natural language and timing diagrams. Although this approach is effective in explaining basic use cases to developers, it cannot cover every possible use case, and it introduces ambiguity into the specification. This ambiguity is especially troublesome when heterogeneous IP blocks have to be integrated on the bus, as different vendors might implement the ambiguous parts of the specification differently. Thus the interoperability of such components can be at risk.

Although most vendors provide test vectors to validate and certify whether components work with a bus protocol, there is no well-defined methodology to check whether the system as a whole satisfies high-level design constraints. Simulation-based approaches have been found useful in designing large-scale embedded systems; however, they can only show the presence of errors, not their absence. Moreover, simulations are time-consuming, limiting designers to a few test cases.

This paper extends our earlier work [Madl et al. 2006] by showing how the combination of transaction-level simulations and model checking may be utilized to evaluate MPSoC design alternatives built on the AMBA AHB [ARM 1999] bus. We describe in detail how we modeled AMBA AHB masters, slaves, and a round-robin arbiter, and describe how we used this model to uncover an ambiguity in the final version of the AMBA AHB protocol.

Section 3 gives a high-level overview of the proposed analysis framework. AMBA AHB is a highly successful bus protocol widely used in embedded MPSoC systems worldwide. The final AMBA 2.0 specification is a 230-page document freely available for download at ARM’s website¹. We evaluate the proposed approach on a set of design alternatives described in Section 4. We utilize the finite state machine model of computation as a means for the formal modeling of the AMBA protocol as described in Section 5. The AMBA specification describes several use cases that may result in bus deadlocks, showing the need for the functional verification of AMBA-based MPSoC designs. Section 6 describes our approach for the formal verification of bus interactions using NuSMV [A. Cimatti and E. Clarke and E. Giunchiglia and F. Giunchiglia and M. Pistore and M. Roveri and R. Sebastiani and A. Tacchella 2002]. We build on simulation results to obtain parameters for functional blocks that we use to annotate the formal models for design space exploration. This approach allows us to combine transaction-level simulations and model checking for performance evaluation as described in Section 7.

Our results show that formal verification complements the simulation results, providing a systematic method for the transaction-level validation of MPSoC designs.

2. RELATED WORK

2.1 Static Analysis Methods

SymTA/S [Rafik Henia and Arne Hamann and Marek Jersak and Razvan Racu and Kai Richter and Rolf Ernst 2005] is a formal analysis tool that applies methods from scheduling theory for the performance analysis of complex heterogeneous MPSoCs. A generic, component-based formal framework for the scheduling analysis and formal performance evaluation of platform-based embedded systems was proposed in [Richter et al. 2003]. Modular Performance Analysis [Ernesto Wandeler and Lothar Thiele and Marcel Verhoef and Paul Lieverse 2006] is an approach based on real-time calculus that models dependencies as arrival curves.

¹ http://www.arm.com

Although static analysis methods provide scalable solutions for performance evaluation, they cannot model dynamic effects such as varying delays and race conditions, because they do not capture the flow of data, and they are less accurate than dynamic estimation methods. Communication in embedded systems is often non-deterministic, data-dependent, and hard to model as well-formed event streams. In contrast, this paper proposes a more detailed formal analysis method that captures the AMBA AHB bus commonly used in MPSoC designs.

2.2 Dynamic Analysis Methods

Simulations are the preferred and widely accepted way to evaluate MPSoC designs in the industry today. A simulation-based design space exploration method, however, has several disadvantages. Developing the models for a design alternative may take weeks or months; therefore only a handful of alternatives can practically be analyzed given short product development cycles. Moreover, designers typically notice performance issues late in the design cycle, after the simulation model is complete, so addressing changes can be time-consuming and costly.

Register-transfer level (RTL) languages such as VHDL [IEEE 2000] and Verilog [IEEE 2001] are classic hardware description languages that target hardware specification at low-level abstractions, providing a high-precision, synthesizable platform for hardware development. The low-level abstraction, however, results in slow simulation speeds unsuitable for the analysis of complex MPSoCs.

Due to the increase in System-on-Chip design complexity as well as the decrease in the time-to-market window, today’s designers are turning to transaction-level modeling languages such as SystemC [OSCI 2005] and SystemVerilog [IEEE 2005] to perform early design exploration and hardware-software co-design in order to shorten the design cycle. Transaction-level modeling focuses on the interactions between system components, such as bus transfers, interrupts, or signals, rather than on gates or registers. Transaction-level languages employ higher-level abstractions than RTL languages and are often not synthesizable.

In this paper we describe a case study that has been implemented using SystemC. SystemC is a library implemented in the C++ language and has a set of features useful for hardware modeling, such as threads, ports, channels, modules, events, and processes. SystemC allows the use of cycle-accurate, cycle-approximate, and mixed-accuracy modeling abstractions. SystemC employs a logical notion of time, and computations are synchronized with respect to a global clock.

An approach for formal model-based performance evaluation is demonstrated in Ptolemy II [Lee et al. 2001]. Ptolemy II is a general modeling framework that composes heterogeneous models of computation (MoCs) to evaluate embedded systems, and also provides a way for performance evaluation through symbolic simulations of the formal MoCs. Ptolemy II focuses on deterministic models [Lee 2006] and does not provide a systematic method for the measurement of state space coverage in non-deterministic models. A semi-formal simulation-based performance evaluation method for MPSoCs was proposed in [Lahiri et al. 2001]. The authors represent execution traces as symbolic graphs for performance analysis, annotated with execution times obtained by simulating individual components of the system.


Although the approaches described in [Lahiri et al. 2001; Lee et al. 2001] improve simulation speed by utilizing symbolic representations of execution traces, the quality of results depends on the ad-hoc selection of test vectors. In contrast, the method described in this paper captures all possible execution traces of the system, not just one execution trace, and also provides a way for the functional verification of AMBA-based MPSoC designs, as described in Section 6.

2.3 Model Checking Methods

A generic method for protocol verification using synchronous protocol automata is presented in [D’silva et al. 2005]. The method proposed in this paper is specific to AMBA-based MPSoC designs as it formally models the AMBA protocol, but it could be represented using synchronous protocol automata as well. We have decided to use the NuSMV representation instead of synchronous protocol automata as it also provides a practical approach for model checking.

A method for the functional verification of the PCI protocol is described in [Chauhan et al. 1999] using the Cadence SMV [K.L. McMillan 1992] tool. A similar approach is used to verify the IBM CoreConnect arbiter in [Goel and Lee 2000]. An early work on applying model checking methods to the AMBA protocol was presented in [Roychoudhury et al. 2003], where the authors used finite state machine models and the SMV tool to uncover an unspecified condition in the AMBA specification. The problem described in that case study is due to a flawed implementation rather than the protocol itself. A verification platform for AMBA-ARM7 is presented in [Susanto and Melham 2003]. The authors use the SMV tool to prove the functional correctness of the AMBA protocol by checking various properties. The authors do not describe any ambiguities; rather, they focus on properties that have turned out to be valid. A verification platform for AMBA using a combination of model checking and theorem proving is described in [Amjad 2006]. The author extends earlier approaches by considering both control and data properties, and describes properties that have proven to be true.

We have presented a method for the functional verification and formal performance evaluation of AMBA-based MPSoC designs in [Madl et al. 2006], and showed how model checking may improve the test coverage of transaction-level simulations for performance evaluation. In this paper we extend our earlier work and describe how the proposed method may be utilized to find the right MPSoC design from a set of design alternatives. This paper also describes an ambiguous case in the AMBA AHB specification that might lead to flawed implementations. Our results do not imply that the AMBA protocol is incorrect, nor do they imply that the works described in [Susanto and Melham 2003] [Amjad 2006] are invalid. Rather, they show that ambiguities in protocol specifications are often manually resolved on a case-by-case basis when implementations or formal models are created, and only the correctness of such models can be shown rather than the correctness of the specification itself. The main reason for this is the ambiguity of natural languages, which future designers should resolve by providing a formal specification for their protocol.


Fig. 1. Evaluation of MPSoCs using a Combination of Simulations and Model Checking

3. PROBLEM FORMULATION

This paper considers two problems in the domain of analyzing AMBA-based MPSoC designs: functional verification and performance evaluation. The functional verification of MPSoCs is often performed using extensive testing and/or formal methods. Despite the successful application of formal methods in MPSoC designs for functional verification, the performance of embedded systems is predominantly evaluated using simulations.

The novelty of our approach lies in the combination of the formal methods used for functional verification with transaction-level simulations used for performance evaluation. In our proposed method, simulation and formal verification are complementary techniques for performance analysis, as they address each other's weaknesses. Simulation models inherently capture MPSoCs with greater accuracy than formal models tailored for verification. However, simulations can typically cover only a few test cases. Formal verification is essentially an exhaustive state space search, so it can cover the whole design space, but the models used for the analysis are more abstract and typically cannot capture the same complexity as a single execution trace using simulations.
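The idea of an exhaustive state space search can be illustrated with a toy example. The sketch below is a minimal Python illustration and not the NuSMV models used in this paper; the five-state transition relation and the names `TRANSITIONS`, `reachable_states`, and `deadlocks` are hypothetical. A single simulation run follows one path through such a graph, whereas a breadth-first search visits every reachable state and therefore reports every state without a successor.

```python
from collections import deque

# Hypothetical transition relation of one bus master: state -> successor states.
# "split_wait" stands for a master that was split and is never unmasked again.
TRANSITIONS = {
    "idle": ["request"],
    "request": ["granted", "request"],
    "granted": ["transfer", "split_wait"],
    "transfer": ["idle"],
    "split_wait": [],  # no outgoing transition: a deadlock state
}


def reachable_states(initial, transitions):
    """Breadth-first search over every state reachable from `initial`."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        for nxt in transitions[state]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def deadlocks(initial, transitions):
    """Reachable states that have no successor at all."""
    reachable = reachable_states(initial, transitions)
    return sorted(s for s in reachable if not transitions[s])
```

Symbolic model checkers avoid enumerating states one by one, so this explicit search only conveys the coverage argument, not the algorithm a tool such as NuSMV applies.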

We propose a model-based analysis for the performance evaluation of MPSoC designs. Figure 1 shows a high-level view of the proposed design flow. The design flow starts with the domain-specific model (DSM), a high-level specification that captures key properties of the design, such as its structure, behavior, environment, and the key constraints that it has to satisfy. The domain-specific model can be expressed in several ways, using textual specification, timing diagrams, meta-modeling, or other visual methods.

The DSM is then mapped into a formal analysis model. This step is required for any formal analysis unless the DSM is already specified using a formal language (e.g., finite state machines or Petri nets). The translation abstracts out the necessary details from the DSM; the formal models usually capture key properties of the system at higher-level abstractions than the DSM itself. As with simulations, the abstraction influences the complexity of the analysis as well as its precision. If the analysis model is too abstract, results may become inaccurate; if it is too complex, the analysis will likely hit the state space explosion problem. Finding the right abstraction is the key to the successful model-based analysis of MPSoC designs.

Figure 1 illustrates how the combination of simulations and formal verification can be used for the evaluation of MPSoCs. We use the simulation and profiling information to express the execution times for each master as intervals: the best case execution time (BCET) is the smallest execution time and the worst case execution time (WCET) is the largest execution time found during the simulation and profiling phase. These execution parameters are obtained for each component independently. They do not include the time spent waiting for the bus or the time spent during communication; rather, they refer to the computation time of each component. The size of each transaction corresponds to the size of messages written to and read from the slave. Again, we build on the simulation results to approximate the expected size of transactions. Using this information, a formal analysis model can be created that captures key parameters of the system but abstracts out computations, resulting in faster design space exploration.
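Deriving the execution-time intervals from profiling data amounts to taking the minimum and maximum observed computation time per component. The Python sketch below illustrates this under stated assumptions: the component names and cycle counts in `profile` are invented placeholders, not measurements from our case study, and the helper `execution_interval` is hypothetical.

```python
def execution_interval(samples):
    """Return (BCET, WCET): the smallest and largest observed computation time."""
    if not samples:
        raise ValueError("at least one profiled sample is required")
    return min(samples), max(samples)


# Placeholder profiling data: computation cycles per component, with bus
# waiting time and communication time already excluded.
profile = {
    "DWT": [120, 135, 128],
    "EBCOT": [410, 395, 450],
    "CPU": [60, 75],
}

# One [BCET, WCET] interval per component, used to annotate the formal model.
intervals = {name: execution_interval(times) for name, times in profile.items()}
```

Intervals obtained this way are only as good as the test cases behind them, so an observed WCET is an estimate rather than a proven bound on the component.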

Since a single simulation usually consists of several hundred or several thousand transactions on the bus (e.g., memory accesses), parameters can usually be estimated with good accuracy even when a small number of test cases is used. The major source of uncertainty arises from the concurrent processing and non-deterministic execution times, which the formal models inherently capture. This method allows us to obtain worst-case end-to-end deadlines with high accuracy and formal guarantees by proving that the end-to-end computation time of the system is below a predefined constant bound.

4. DIGITAL CAMERA MPSOC DESIGN ALTERNATIVES

In this section we describe three alternative MPSoC designs for a digital camera using the AMBA AHB bus. The digital camera used for the case study implements the new JPEG2000 [JPEG Committee] still image compression standard developed by the JPEG Committee. The advantages of JPEG2000 over its predecessor JPEG include lossy to lossless compression, region of interest (ROI) coding, multiple resolution representation, and error resiliency. The JPEG2000 encoder is divided into three main parts: image transformation, quantization, and entropy coding. Unlike JPEG, which relies on the more commonly used discrete cosine transform (DCT), JPEG2000 uses the Discrete Wavelet Transform (DWT) as it facilitates the notion of progressive image transmission. JPEG2000’s entropy coding is based on Embedded Block Coding with Optimal Truncation (EBCOT) [Taubman 2000].

4.1 The AMBA AHB Protocol

Over the past decade and a half, several bus-based on-chip communication architecture standards have been proposed to handle the communication needs of emerging SoC designs. Of these, the ARM Advanced Microcontroller Bus Architecture (AMBA) version 2.0 [ARM 1999] is one of the most widely used on-chip communication standards to interconnect components in SoC designs. The goal of this standard is to provide a flexible, high-performance bus architecture specification that is technology-independent, takes up minimal silicon area, and encourages IP reuse across designs. The AMBA 2.0 bus architecture standard defines three buses: AHB (Advanced High-performance Bus), APB (Advanced Peripheral Bus), and ASB (Advanced System Bus). The AHB bus is used for high-bandwidth and low-latency communication, primarily between CPU cores, high-performance peripherals, DMA controllers, on-chip memories, and interfaces such as bridges to the slower APB bus. The APB is used to connect slower peripherals such as timers and UARTs, and uses a bridge to interface with the AHB. It is a simple bus that does not support the advanced features of the AHB bus. The ASB bus is an earlier version of the high-performance bus that has been superseded by AHB in current designs. Since we use the AHB bus standard in this paper, we present a brief overview of its features next.

Fig. 2. AMBA AHB Write Transaction Initiation

The Advanced High-Performance Bus (AHB) is a high-speed, high-bandwidth bus that supports multiple masters. AHB supports pipelined data transfers for high-speed memory and peripheral access without wasting precious bus cycles. Burst transfers allow optimal usage of memory interfaces by giving advance information on the nature of the transfers. AHB also allows split transactions, which maximize the use of the system bus bandwidth by enabling high-latency slaves to release the system bus during the dead time while the slave is completing its transaction. A slave can split a master, essentially masking it from getting access to the bus until the slave is ready to continue the split transaction. The slave then acts on behalf of the master to signal the arbiter that the master should be unmasked. The master must then retry the transfer when it is next granted access to the bus.
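The masking effect of a split transaction can be sketched with a small arbiter model. The Python code below is an illustrative sketch only: it assumes a round-robin policy (the arbitration policy is a design choice, and this class is not the arbiter model used in our formal analysis), and the names `RoundRobinArbiter`, `split`, `unsplit`, and `grant` are hypothetical. A split master is skipped by the arbiter until the slave asks for it to be unmasked.

```python
class RoundRobinArbiter:
    """Toy round-robin arbiter that never grants the bus to a split master."""

    def __init__(self, num_masters):
        self.num_masters = num_masters
        # Start so that master 0 is the first candidate considered.
        self.last_granted = num_masters - 1
        self.split_mask = set()

    def split(self, master):
        """A slave splits `master`: mask it from arbitration."""
        self.split_mask.add(master)

    def unsplit(self, master):
        """The slave is ready again: unmask `master`."""
        self.split_mask.discard(master)

    def grant(self, requests):
        """Return the next requesting, unmasked master, or None if there is none."""
        for offset in range(1, self.num_masters + 1):
            candidate = (self.last_granted + offset) % self.num_masters
            if candidate in requests and candidate not in self.split_mask:
                self.last_granted = candidate
                return candidate
        return None
```

If the only requesting master is masked, `grant` returns `None`; a slave that never unmasks a master it has split therefore starves that master.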

AHB architectures can have various topologies such as single shared bus, multi-layer (or hierarchical) shared bus, and bus matrix. A designer can select any of these topologies based on SoC communication requirements and customize it to optimize overall bandwidth and improve performance. An AHB bus consists of an address bus (typically 32 bits wide) and separate (or shared) read and write data buses that have a minimum recommended width of 32 bits, but can be 8, 16, 32, 64, 128, 256, 512, or 1024 bits wide, depending on application bandwidth requirements, component interface pin constraints, and the bit width of words accessed from memory modules (e.g., embedded DRAM).

Figure 2 shows the timing diagram of AMBA AHB signals used to establish a transaction. When a master needs to send or receive data in AHB, it requests access to the bus from the arbiter by raising the HBUSREQx signal. The arbiter, in turn, responds to the master via the HGRANTx signal. Depending on which master gains access to the bus, the arbiter drives the HMASTER signals to indicate which master has access to the bus (this information is used by certain slaves). When a slave is ready to be accessed by a master, it drives the HREADY signal high. Only when a master has received a bus grant from the arbiter via HGRANTx and detects a high HREADY signal from the destination slave will it initiate the transaction. The transaction consists of the master driving the HTRANS signal, which describes the type of transaction (sequential or non-sequential), the HADDR signals, which are used to specify the slave addresses, and HWDATA if there is write data to be sent to the slave. Any data to be read from the slave appears on the HRDATA signal lines. The master also drives control information about the data transaction on other signal lines: HSIZE (size of the data item being sent), HBURST (number of data items being transferred in a burst transaction), HWRITE (whether the transfer is a read or a write), and HPROT (protection information for slaves that might require it).
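The initiation handshake can be summarized as a small per-cycle state machine on the master side. The following Python sketch is a deliberate simplification and not our formal model: it assumes a single-beat transfer, ignores HTRANS, HBURST, and the other control signals, and the state names and the function `master_step` are hypothetical. It only captures the rule that a transfer starts when both HGRANTx and HREADY are high.

```python
def master_step(state, hgrant, hready, wants_bus):
    """Advance a simplified master by one bus cycle.

    Returns (next_state, hbusreq), where hbusreq is the value the master
    drives on its HBUSREQx line in the next cycle.
    """
    if state == "IDLE":
        # Raise the request line as soon as there is data to move.
        return ("REQUEST", True) if wants_bus else ("IDLE", False)
    if state == "REQUEST":
        # Start only when the arbiter granted the bus AND the slave is ready.
        if hgrant and hready:
            return ("TRANSFER", False)
        return ("REQUEST", True)
    if state == "TRANSFER":
        # Simplification: one data beat, finished when the slave signals ready.
        return ("IDLE", False) if hready else ("TRANSFER", False)
    raise ValueError(f"unknown state: {state}")
```

A grant without a ready slave keeps the master in `REQUEST`, which is exactly the waiting condition described above.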

4.2 JPEG2000 Encoder Description

Figure 3 shows the block diagram for the JPEG2000 encoder. Designers have the option of implementing a distributed compression method, where the image is broken up into tiles and the compression is carried out for each tile separately. Although this feature is not required by the JPEG2000 specification, we have decided to implement it to improve the concurrent processing in the system and thus the overall performance of the SoC. Tile size varies from smaller sizes (64x64 pixels for memory-constrained designs) to 512x512 pixels for better compression quality. Designers need to choose these parameters according to the requirements of their specific designs.
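Tiling itself is a simple partition of the image into rectangles that can be processed independently. The Python sketch below shows one way to enumerate them; the function `tile_grid` and its clipping of edge tiles are illustrative assumptions rather than a description of our hardware.

```python
def tile_grid(width, height, tile_size):
    """Return (x, y, w, h) rectangles that cover a width x height image.

    Tiles on the right and bottom edges are clipped to the image boundary.
    """
    tiles = []
    for y in range(0, height, tile_size):
        for x in range(0, width, tile_size):
            tiles.append(
                (x, y, min(tile_size, width - x), min(tile_size, height - y))
            )
    return tiles
```

For example, `tile_grid(1024, 768, 512)` yields four tiles, the bottom two being 512x256 pixels; a smaller tile size produces more, smaller units of concurrent work.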

After the image is tiled, each tile is passed through the DC Level Shifting step, which converts the tile pixels from unsigned integers to two’s complement. In the next step the tile is passed through the Multi-Component Transform (MCT), which transforms the input tile from RGB color format either to YUV by using the reversible color transform (RCT), or to YCbCr by using the irreversible color transform (ICT). RCT can be used in both lossless and lossy compression, whereas ICT can only be used for lossy compression. After the tile has been transformed, it is processed by the discrete wavelet transform (DWT), which further decomposes the tile into different levels of decomposition. For every pass DWT makes on a tile, depending on the number of decomposition levels needed, DWT generates four sub-bands, denoted as LL, HL, LH, and HH, where LL represents the downsampled tile (half the width and height of the previous tile), and the other three sub-bands represent a residual version of the tile, which is used for the image reconstruction process. Once DWT has processed the tile, it is passed through the quantization step only when lossy compression is used. The user has the option to declare Regions Of Interest (ROI), which are encoded independently from the rest of the image based on user specifications. This allows lossless compression to be used for some (interesting) parts of the image, while lossy compression is used for the rest of the image. Finally, the image (or tile) is processed with EBCOT, which produces the final bitstream for the image.

Fig. 3. JPEG2000 Encoder Block Diagram
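The first two steps of the pipeline are short enough to write out. The Python sketch below shows DC level shifting and the reversible color transform using the integer formulas commonly given for JPEG2000; it is a software illustration under that assumption, not a description of our DWT hardware unit, and the function names are hypothetical. The inverse transform is included to show why the RCT is suitable for lossless compression.

```python
def dc_level_shift(sample, bit_depth=8):
    """Map an unsigned sample in [0, 2**bit_depth) to a signed value around 0."""
    return sample - (1 << (bit_depth - 1))


def rct_forward(r, g, b):
    """Reversible color transform on integer samples."""
    y = (r + 2 * g + b) >> 2  # floor((R + 2G + B) / 4), also for negatives
    u = b - g
    v = r - g
    return y, u, v


def rct_inverse(y, u, v):
    """Exact inverse of rct_forward: no information is lost."""
    g = y - ((u + v) >> 2)
    r = v + g
    b = u + g
    return r, g, b
```

Because every operation is an integer addition, subtraction, or floor shift, `rct_inverse(*rct_forward(r, g, b))` returns the original triple, which a floating-point transform such as the ICT cannot guarantee.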

4.3 Description of MPSoC Design Alternatives

We consider three design alternatives for the implementation of the digital camera as shown in Figure 4. In design 1 the DWT and EBCOT functional blocks are combined into a single bus component. Design 2 uses two AMBA AHB buses and two JPEG2000 encoders to reduce congestion on the buses and increase the throughput of the digital camera MPSoC. In design 3 all communication between IP blocks uses the AMBA AHB bus.

Figure 3 shows the major steps used within the JPEG2000 encoding algorithm. Not all steps are implemented in hardware. The DWT hardware unit implements the DC level shifting, Multi-Component Transform, and Discrete Wavelet Transform steps. The Data Dispatcher unit implements quantization. EBCOT is subdivided into two parts, commonly known as Tier-1 and Tier-2. Tier-1 is the most computation-intensive part of JPEG2000 and is implemented in hardware. Tier-2, however, is very control-intensive; therefore it is implemented in software on the main CPU. None of the designs use the ROI feature of the JPEG2000 specification.

In all of the architectures shown in Figure 4, the DWT module has an internal DMA engine (iDMA) that fetches the tiles from main memory to either DWT’s local memory or bank A of the tile memory, depending on whether DWT is currently processing a tile or not. The DWT module is capable of lossless and lossy compression and implements DC Level Shifting and MCT. In designs 1 and 2 shown in Figure 4, the DWT unit writes the transformed image to bank B in tile memory. The DWT unit can only write new information if the Data Dispatcher has finished fetching all codeblocks for the previous tile from bank B. Otherwise, the DWT module may be blocked by the slower Data Dispatcher. In design 3 the AMBA AHB bus is used to write directly to the (slave) memory; therefore in this design the Data Dispatcher never blocks the DWT unit.


Fig. 4. Design Alternatives of the Digital Camera Case Study using the JPEG2000 Encoder

The Data Dispatcher module reads the codeblocks in the tile memory and performs the quantization step on them. Its main job is to feed bitplanes to each Bit Plane Coder (BPC) so that at any given time up to N different bitplanes can be processed by the BPC modules. A BPC is the module in charge of performing Tier-1 coding on incoming DWT coefficients, and it is subdivided into two parts, the Context Formatter (CF) and the Arithmetic Coder (MQ-Coder). These blocks are denoted as CF and MQ in Figure 4. Finally, the processed data is collected by the Data Collector, from which it is written to the main memory through the AMBA AHB bus.

5. FORMAL MODELING OF THE AMBA AHB PROTOCOL

In this section we formalize our model for the AMBA AHB protocol. The notion of time used in the protocol specification is discrete (bus cycle), therefore the protocol can be represented as a discrete event system. A discrete event system (DES) is a 5-tuple G = (Q, Σ, δ, q0, Qm), where Q is a finite set of states, Σ is a finite alphabet of symbols that we refer to as event labels, δ : Q × Σ → Q is the transition function, q0 is the initial state, and Qm is the set of marker states (exiting states). A transition or event in G is a triple (q, σ, q′) where δ(q, σ) = q′, q, q′ ∈ Q are the exit and entrance states, respectively, and σ ∈ Σ is the event label. The event set of G is the set of all such triples. Events in the DES are untimed and transitions depend only on the current state and the event label.
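As an illustration of the 5-tuple definition above, the following minimal Python sketch stores a DES and derives its event set. The class name, field names, and the three-state handshake are hypothetical and chosen only for this example; they are not part of the NuSMV models. The transition function is stored as a dictionary, so it may be partial, which is an assumption of the sketch rather than of the definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DES:
    """Discrete event system G = (Q, Sigma, delta, q0, Qm)."""
    states: frozenset    # Q: finite set of states
    alphabet: frozenset  # Sigma: event labels
    delta: dict          # transition function, keyed by (state, label)
    initial: str         # q0: initial state
    marked: frozenset    # Qm: marker states

    def events(self):
        """Return the event set: all triples (q, sigma, q')."""
        return {(q, sigma, q_next) for (q, sigma), q_next in self.delta.items()}

    def run(self, labels):
        """Fire a sequence of event labels from q0 and return the final state."""
        q = self.initial
        for sigma in labels:
            q = self.delta[(q, sigma)]  # KeyError if the transition is undefined
        return q


# Hypothetical three-state bus handshake, used only as an illustration.
toy = DES(
    states=frozenset({"idle", "busreq", "transfer"}),
    alphabet=frozenset({"request", "grant", "done"}),
    delta={
        ("idle", "request"): "busreq",
        ("busreq", "grant"): "transfer",
        ("transfer", "done"): "idle",
    },
    initial="idle",
    marked=frozenset({"idle"}),
)
final_state = toy.run(["request", "grant", "done"])
```

Because the events are untimed, the run depends only on the current state and the label sequence, which is the property the definition relies on.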

There are several models of computation that can express discrete event systems; well-known examples include finite state machines, Petri-nets, and data-flow networks. Timed automata and hybrid automata are extensions to finite state machines in order to express the continuous evaluation of system variables, and are therefore too heavyweight to represent cycle-based bus protocols. The discrete event model is simpler than timed automata or hybrid automata models, and offers a more scalable approach for verification by utilizing Binary Decision Diagrams [Bryant 1992]. We propose the use of the finite state machine (FSM) model of computation for the representation of the AMBA AHB bus protocol as a discrete event system. We chose finite state machines mainly because they are supported by several model checkers [A. Cimatti and E. Clarke and E. Giunchiglia and F. Giunchiglia and M. Pistore and M. Roveri and R. Sebastiani and A. Tacchella 2002] [K.L. McMillan 1992] [Holzmann 2004].

We have created a cycle-accurate model of the AMBA AHB bus in order to model bus transactions accurately. Creating the NuSMV models took a few weeks. The original models were developed relatively quickly within a matter of days, but more effort was needed to ensure that they correspond to the AMBA AHB specification, and that no bugs are present in the models.

To ensure that a single split transfer or RETRY response does not deadlock the bus as described in [Roychoudhury et al. 2003], we assume that the arbiter only grants access to a new master when the HRESP signal is OK and the HTRANS signal is IDLE. This introduces an extra cycle of arbitration delay in the model. We model arbitration delays, pipelining, and busy slaves (HREADY is 0) in the bus as well. We have also modeled the two-cycle response times for RETRY and SPLIT responses according to the AMBA specification. We implemented a round-robin arbiter, mainly to avoid the starvation that might arise when a fixed-priority arbiter is used. We consider RETRY responses from the slave (HRESP = RETRY), as well as split transfers.

In this paper we do not model HLOCK signals, which are set by a master that needs uninterrupted access to the bus during a transaction. When HLOCK is set by the master, the arbiter simply holds its state as it is forced to grant access to the locking master as long as the signal remains active. It is easy to see that a master that asserts HLOCK and does not deassert it essentially causes a deadlock. Unless the master is faulty, this condition should not occur in practice. The HLOCK signal set by a master overrides the arbiter, and therefore it is the responsibility of the master to ensure that it eventually deasserts this signal. As long as the master deasserts HLOCK, no deadlocks occur, as the arbiter continues where it left off before the HLOCK signal was set.

5.1 Modeling AMBA AHB Masters

We modeled a generic AMBA AHB master as a finite state machine (FSM) with six states (idle, busreq, haddr, read, write, error) as shown in Figure 5. This finite state machine provides a “black box” model for AMBA masters as seen from the bus. The master requests access to the bus, then reads, and finally writes to the bus. The error state is used to check for inconsistent replies from the slave/arbiter, and turns protocol checking into a reachability problem; in correct protocols, the error state should not be reachable. By specifying how much time the FSM spends in each state we can capture performance analysis in a formal setting.

Fig. 5. Finite State Machine Model of an AMBA AHB Master

Figure 6 shows the NuSMV syntax for the FSM shown in Figure 5. Transitions are specified within the case ... esac; block. Transitions are ordered deterministically; the next value of state will be specified by the first guard that evaluates to true. The figure is only partial; the HADDR, HTRANS, HRDATA, HWDATA, and BUSREQ signals depend on the state of the master. The MASTER_STATE variable is used for the performance evaluation described in Section 7 to provide a fast and simple method to track the master’s current state from the arbiter.

The BCET and WCET parameters are given as inputs to the AMBA AHB master. The READSIZE and WRITESIZE parameters specify the size of the data read from and written to the bus. The BCET, WCET, READSIZE, and WRITESIZE parameters are provided by the simulations.

5.2 Modeling AMBA AHB Slaves

We modeled a generic AMBA AHB slave using four states (idle, write, read, error) as shown in Figure 7. The transitions of the slave have to be synchronized with the master – i.e. the slave has to be in the read state when the master is in the write state – otherwise the slave (and the master) will enter the error state. Thus by verifying that the error state is unreachable from both the master and the slave we can prove that the master and slave communicate with each other as expected.

Figure 8 shows the NuSMV syntax for the FSM shown in Figure 7. Transitions are specified within the case block. Figure 8 is only partial; the HREADY and HRESP signals are assigned values non-deterministically for the functional verification. The slave records split transactions by storing the master’s address in the MASK_MASTER1, MASK_MASTER2, and MASK_MASTER3 flags. These flags are managed by the slave (the arbiter also maintains its own flags for which master is masked) and are cleared when the slave issues an HSPLITx signal. The extended variable is used to extend the duration of RETRY and SPLIT responses for two clock cycles according to the AMBA specification.

MODULE master_read_write (BUSREQ, HGRANT, MASTER_STATE, MASK_MASTER, BCET, WCET, READSIZE,
                          WRITESIZE, START, FINISH)
VAR
  state      : {idle, busreq, haddr, read, write, busy, error};
  prev_state : {idle, busreq, haddr, read, write, busy, error};
  io         : {read, write};
  ET         : 0..MAXET;
  SIZE       : 1..MAXSIZE;
  HADDR      : boolean;
  HTRANS     : {IDLE, NONSEQ, SEQ, BUSY};
  HWDATA     : boolean;
ASSIGN
  init (state) := idle;
  init (io) := read;
  init (SIZE) := 1;
  init (prev_state) := idle;
  next (prev_state) := idle;
  next (state) :=
    case
      HRESP = ERROR : error;
      MASK_MASTER & HGRANT : error;
      HRESP = SPLIT & HGRANT : state;
      !HREADY : state;
      MASK_MASTER : state;
      HRESP = RETRY & HGRANT : prev_state;
      state = idle & START & READSIZE = 0 : busy;
      state = idle & START : busreq;
      state = idle : idle;
      state = busreq & HGRANT : haddr;
      state = busreq & !HGRANT : state;
      state = haddr & HGRANT : read;
      state = read & HGRANT : write;
      state = write & HGRANT & SIZE = READSIZE & io = read : busy;
      state = write & HGRANT & SIZE = WRITESIZE & io = write : idle;
      state = write & HGRANT & SIZE < READSIZE & io = read : haddr;
      state = write & HGRANT & SIZE < WRITESIZE & io = write : haddr;
      state = busy & ET < BCET : busy;
      state = busy & ET = WCET : busreq;
      state = busy & BCET <= ET : {busy, busreq};
      1 : error;
    esac;
  ...

Fig. 6. Partial NuSMV Finite State Machine Model for an AMBA AHB Master
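The first-match evaluation of guards in a NuSMV case block, described in Section 5.1, can be sketched in Python as follows. This is a hedged illustration only: the helper first_match and the rule table MASTER_RULES are hypothetical names, the rules are a small subset of the guards in Figure 6, and the non-deterministic choice between busy and busreq is not covered.

```python
def first_match(rules, env):
    """Return the value of the first rule whose guard holds (case ... esac order)."""
    for guard, value in rules:
        if guard(env):
            return value(env)
    raise ValueError("no guard matched")


# Subset of the master guards from Figure 6, in the same relative order.
MASTER_RULES = [
    (lambda e: e["HRESP"] == "ERROR", lambda e: "error"),
    (lambda e: not e["HREADY"], lambda e: e["state"]),  # busy slave: hold state
    (lambda e: e["state"] == "idle" and e["START"], lambda e: "busreq"),
    (lambda e: e["state"] == "idle", lambda e: "idle"),
    (lambda e: e["state"] == "busreq" and e["HGRANT"], lambda e: "haddr"),
    (lambda e: e["state"] == "busreq" and not e["HGRANT"], lambda e: e["state"]),
    (lambda e: True, lambda e: "error"),  # default branch, written "1 : error"
]


def next_master_state(env):
    """Compute the next master state for one bus cycle from the current signals."""
    return first_match(MASTER_RULES, env)


# A granted master still holds its state while HREADY is low, because the
# !HREADY guard appears before the busreq guards.
held = next_master_state(
    {"state": "busreq", "HRESP": "OK", "HREADY": False, "HGRANT": True, "START": False}
)
```

The ordering is what makes the model deterministic: swapping the HREADY guard below the busreq guards would let a granted master advance while the slave is busy.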

5.3 Modeling an AMBA AHB Round-robin Arbiter

Fig. 7. Finite State Machine Model of an AMBA AHB Slave

MODULE slave (HADDR, HTRANS, HWDATA, HRDATA, HREADY, HRESP, HMASTER, HSPLIT, MASK_MASTER1,
              MASK_MASTER2, MASK_MASTER3, SLAVE_STATE)
VAR
  state      : {idle, write, read, error};
  prev_state : {idle, write, read, error};
  extended   : boolean;
ASSIGN
  init (state) := idle;
  init (prev_state) := state;
  init (extended) := 0;
  next (prev_state) := state;
  next (state) :=
    case
      SLAVE_STATE != x : SLAVE_STATE;
      HRESP = SPLIT : idle;
      !HREADY : state;
      HTRANS = BUSY : state;
      HRESP = RETRY : prev_state;
      state = idle & HTRANS = NONSEQ & HADDR : write;
      state = idle : state;
      state = write & HTRANS = NONSEQ : read;
      state = read & HTRANS = NONSEQ & HWDATA : idle;
      1 : error;
    esac;
  ...

Fig. 8. Partial NuSMV Finite State Machine Model for an AMBA AHB Slave

We modeled a round-robin arbiter to evaluate the case studies described in Section 6 and Section 7. The arbiter is specific to the AMBA AHB protocol, and captures most bus signals used by the master and the slave. The design is too complex to show in a figure; therefore we use the NuSMV syntax to describe the arbiter’s functionality. For simplicity we describe a partial implementation of the arbiter for two masters and one slave; for more details and case studies please visit http://alderis.ics.uci.edu/amba2. We also define a dummy default master for the bus that expresses the case when neither of the two masters has access to the bus. The arbiter keeps track of both masters’ current and previous states. An alternative approach would be to check the masters’ state directly; however, that would require a dedicated wire from the masters to the arbiter, which is unnecessary hardware overhead. Therefore, the arbiter keeps track of the masters’ state using the following simple rules. Whenever the slave sets the RETRY signal, repeat the previous transaction:

next (master1_state) :=
  case -- Roll back for retries
    HRESP = RETRY & !lasterror : master1_prev_state;

Follow the transitions in the FSM shown in Figure 5:

    master1_state = idle & HREADY & HRESP = OK & HGRANT1 : grant;
    master1_state = grant & HTRANS != IDLE & HREADY & HRESP = OK & HGRANT1 : transmit;
    master1_state = transmit & HTRANS = IDLE & HREADY & HRESP = OK & HGRANT1 : idle;

Mask masters (prevent them from acquiring access to the bus) whenever the slave sets the SPLIT response:

    HREADY & HRESP = SPLIT & HGRANT1 & !lasterror : mask;
    master1_state = mask & HSPLIT = master1 : idle;

The lasterror variable is introduced to hold the master’s state if the previous response was either RETRY or SPLIT, according to the AMBA specification. The default behavior is to hold the master’s state.

    lasterror : master1_state;
    1 : master1_state;
  esac;

The arbitration policy determines which master will be given preference in the next bus cycle when it requests access to the bus by setting its BUSREQ signal. We express the arbitration policy with the help of the preferred variable. Given that we implemented a round-robin arbiter, the value of the preferred variable alternates between master1 and master2. The first set of rules expresses that if a master requests access to the bus, it is eventually granted access to the bus. When both masters request access, one of them will be granted access non-deterministically:

next (preferred) :=
  case
    -- Master starts the transmission
    !HGRANT1 & !HGRANT2 & master1_state != mask & master2_state != mask & HREADY & HRESP = OK &
      BUSREQ1 & BUSREQ2 : {master1, master2};
    !HGRANT1 & !HGRANT2 & master1_state != mask & HREADY & HRESP = OK & BUSREQ1 & !BUSREQ2 :
      master1;
    !HGRANT1 & !HGRANT2 & master2_state != mask & HREADY & HRESP = OK & !BUSREQ1 & BUSREQ2 :
      master2;


When one master (e.g. master1) is split and the other one (master2) is transmitting to the slave, the slave may decide to split (mask) the currently transmitting master (master2), and unsplit the masked master (master1). If the masked master (master1) requires access to the bus, grant access immediately; otherwise grant access to the default (dummy) master.

    -- Cross-splitting
    HGRANT1 & master1_state = transmit & master2_state = mask & HRESP = SPLIT & !BUSREQ2 &
      HSPLIT = master2 : default;
    HGRANT2 & master2_state = transmit & master1_state = mask & HRESP = SPLIT & !BUSREQ1 &
      HSPLIT = master1 : default;
    HGRANT1 & master1_state = transmit & master2_state = mask & HRESP = SPLIT & BUSREQ2 &
      HSPLIT = master2 : master2;
    HGRANT2 & master2_state = transmit & master1_state = mask & HRESP = SPLIT & BUSREQ1 &
      HSPLIT = master1 : master1;

When one master is masked and the slave sends a SPLIT response, mask the other master as well. If none of the masters are masked and the slave sends a SPLIT response, mask the active master, and grant access to the other master if it requests access to the bus. If not, move on to the default master.

    -- Split masters
    HGRANT1 & master2_state = mask & HRESP = SPLIT : default;
    HGRANT2 & master1_state = mask & HRESP = SPLIT : default;
    HGRANT1 & master2_state != mask & HRESP = SPLIT & BUSREQ2 : master2;
    HGRANT2 & master1_state != mask & HRESP = SPLIT & BUSREQ1 : master1;
    HGRANT1 & HRESP = SPLIT & !BUSREQ1 & !BUSREQ2 : default;
    HGRANT2 & HRESP = SPLIT & !BUSREQ1 & !BUSREQ2 : default;

If a master finishes the transaction, grant access to the other master if it requires access to the bus. If not, move on to the default master.

    -- Master finishes the transaction - round-robin
    HGRANT1 & master1_state = transmit & HTRANS = IDLE & HREADY & HRESP = OK & !BUSREQ1 &
      !BUSREQ2 : default;
    HGRANT2 & master2_state = transmit & HTRANS = IDLE & HREADY & HRESP = OK & !BUSREQ1 &
      !BUSREQ2 : default;
    HGRANT1 & master1_state = transmit & master2_state != mask & HTRANS = IDLE & HREADY &
      HRESP = OK & BUSREQ2 : master2;
    HGRANT2 & master2_state = transmit & master1_state != mask & HTRANS = IDLE & HREADY &
      HRESP = OK & BUSREQ1 : master1;

If a master cancels its BUSREQ signal, either grant access to the other master (if it has its BUSREQ signal set), or move to the default master. When a slave requests a masked master to be unsplit, then unsplit that master. By default, leave the current preferred master.

    -- Master gives up its BUSREQ
    HGRANT1 & master1_state = idle & master2_state != mask & HTRANS = IDLE & HREADY &
      HRESP = OK & !BUSREQ1 & BUSREQ2 : master2;
    HGRANT2 & master2_state = idle & master1_state != mask & HTRANS = IDLE & HREADY &
      HRESP = OK & !BUSREQ2 & BUSREQ1 : master1;
    HGRANT1 & master1_state = idle & HTRANS = IDLE & HREADY & HRESP = OK & !BUSREQ1 &
      !BUSREQ2 : default;
    HGRANT2 & master2_state = idle & HTRANS = IDLE & HREADY & HRESP = OK & !BUSREQ1 &
      !BUSREQ2 : default;
    -- Unmasking masters
    !HGRANT1 & !HGRANT2 & master1_state = mask & HSPLIT = master1 : master1;
    !HGRANT1 & !HGRANT2 & master2_state = mask & HSPLIT = master2 : master2;
    HGRANT1 & master2_state = mask & HRESP = SPLIT & BUSREQ2 & HSPLIT = master2 : master2;
    HGRANT2 & master1_state = mask & HRESP = SPLIT & BUSREQ1 & HSPLIT = master1 : master1;
    HGRANT1 & master1_state = transmit & master2_state = mask & HTRANS = IDLE & HREADY &
      HRESP = OK & BUSREQ2 & HSPLIT = master2 : master2;
    HGRANT2 & master2_state = transmit & master1_state = mask & HTRANS = IDLE & HREADY &
      HRESP = OK & BUSREQ1 & HSPLIT = master1 : master1;
    HGRANT1 & master1_state = idle & master2_state = mask & HTRANS = IDLE & HREADY &
      HRESP = OK & BUSREQ2 & !BUSREQ1 & HSPLIT = master2 : master2;
    HGRANT2 & master2_state = idle & master1_state = mask & HTRANS = IDLE & HREADY &
      HRESP = OK & BUSREQ1 & !BUSREQ2 & HSPLIT = master1 : master1;
    1 : preferred;
  esac;

As seen from the formalism above, the arbiter design is rather complex even in the case of two masters. If designers want to support the advanced features of the AMBA AHB protocol, they need to verify the functionality and performance of the design. In Section 6 we describe how we verified the correctness of the MPSoC designs shown in Figure 4 using the formal models described in this section.
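The core of the round-robin policy in Section 5.3 can be summarized by a much smaller Python sketch. It is a simplification under stated assumptions: the function next_grant and its arguments are hypothetical, the HTRANS, HREADY, and HRESP conditions of the NuSMV guards are ignored, and the non-deterministic choice between two simultaneously requesting masters is replaced by a fixed rotation order.

```python
MASTERS = ["master1", "master2"]
DEFAULT = "default"  # dummy default master that owns the bus when nobody else can


def next_grant(current, requesting, masked, masters=MASTERS):
    """Pick the next bus owner in round-robin order, skipping masked masters."""
    # Start the search at the master following the current owner.
    start = (masters.index(current) + 1) if current in masters else 0
    for offset in range(len(masters)):
        candidate = masters[(start + offset) % len(masters)]
        if candidate in requesting and candidate not in masked:
            return candidate
    return DEFAULT


# master1 finishes a transfer while both masters request the bus:
# the rotation hands the bus to master2.
owner = next_grant("master1", {"master1", "master2"}, set())
```

A masked master is never selected, even if it is the only one requesting, which mirrors the rules that move to the default master when the other master is split.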

6. FUNCTIONAL VERIFICATION OF AMBA-BASED SOC DESIGNS

This section describes how we utilized the NuSMV models introduced in Section 5 for the functional verification of design alternatives discussed in Section 4. The proposed approach focuses on system-level verification, and aims to verify deadlock-freedom and liveness properties in a single bus system with two, three, and four masters and one slave, using round-robin arbitration. In this paper we assume that each master and slave is functional, and is compliant with the AMBA AHB protocol.

In the discrete event model a deadlock can be observed as a state with no transitions enabled. A livelock, on the other hand, refers to a state from which only a subset of all the states is reachable. One exception to this rule is the set of states used for initialization, as they need not be reachable after they are visited once.
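The two definitions above translate directly into reachability checks on an explicit state graph. The sketch below is only an illustration of the definitions, not the symbolic (BDD-based) procedure used by NuSMV; the function names and the toy graph are hypothetical, and a state counts as reachable from itself in zero steps.

```python
from collections import deque


def reachable(graph, start):
    """Return all states reachable from `start` in zero or more transitions."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in graph[state]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def deadlocks(graph, init):
    """Reachable states with no enabled transition."""
    return {s for s in reachable(graph, init) if not graph[s]}


def livelocks(graph, init, init_only=frozenset()):
    """Non-deadlocked states from which some non-initialization state is unreachable."""
    required = reachable(graph, init) - set(init_only)
    return {s for s in required if graph[s] and not required <= reachable(graph, s)}


# Hypothetical graph: "init" is visited once, "c" is a trap, "d" has no successors.
TOY = {
    "init": {"a"},
    "a": {"b"},
    "b": {"a", "c", "d"},
    "c": {"c"},
    "d": set(),
}
dead = deadlocks(TOY, "init")
live = livelocks(TOY, "init", init_only={"init"})
```

Passing the initialization states through init_only implements the exception in the text: without it, every state after "init" would be flagged because "init" itself is never revisited.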

For the functional verification we have considered the case when all the masters are allowed to concurrently request access to the bus and carry out read/write transactions in an arbitrary (non-deterministic) manner, and the slave can arbitrarily split/unsplit masters and issue RETRY responses. This covers all the valid uses of the bus and therefore can be applied to prove the correctness of the designs. The proposed functional verification does not take the internal computation of components into consideration; rather it treats them as “black boxes” that use the bus according to the specification. The results described in this section are therefore applicable to any MPSoC that uses any of the architectures shown in Figure 4 with a round-robin scheduler.


Design 1 shown in Figure 4 employs three masters on the same bus; therefore we verify the functional correctness of this design using three masters and a slave, and we verify design 3 using four masters and a slave. Design 2 shown in Figure 4 has two buses, both of which can access the main memory. In our analysis we assume that the memory can be accessed from both buses with no risk of deadlocks. This requirement is provided by the use of a memory unit with two interfaces for data access, and is guaranteed by hardware. Therefore, we can verify design 2 by independently verifying the two AMBA buses with two masters.

We have used the open-source NuSMV model checker to verify Computation Tree Logic (CTL) [Clarke and Emerson 1981] properties on the finite state machines. CTL is a branching-time logic, where the passing of time is represented as a tree. Each path from the root to a leaf represents a possible passage of time. Timed analysis is then concerned with proving either that some property holds along all paths, or that a property is reachable on some path.

During the verification process we have discovered several trivial deadlock cases that are covered by the AMBA specification. For example, we were able to show that a SPLIT response followed immediately by a RETRY response deadlocks the system, as the master receiving the RETRY response has not started transmitting on the bus yet. The AMBA specification, however, requires the slave to issue an OK response following the SPLIT response. Similarly, we found that the combination of a RETRY response and a low HREADY signal may deadlock the bus because the master is required to keep its state when the HREADY signal is low, but is also required to repeat the last transmission since the response is RETRY. These ambiguities, however, do not have a high practical value as their use does not seem to be logical in real-life MPSoCs.

For consistency, all the formulas described in this section apply to the 4-master design 2 and design 3, which we adapted to design 1 by simply removing any assumptions/constraints on (the non-existent) master4. We have assumed that the following formulas evaluate to true infinitely often (using the JUSTICE NuSMV keyword) in all MPSoC designs for the analysis: HREADY, HRESP = OK, HSPLIT = master1, HSPLIT = master2, HSPLIT = master3, HSPLIT = master4. This was necessary to avoid trivial erroneous cases, such as when the slave is never ready to receive data, or when it continuously sends RETRY responses. Using these assumptions we were able to prove several properties in all design alternatives shown in Figure 4. First we showed that the error state is unreachable in all the masters and the slave by using the CTL formulas (x refers to the index used for all masters):

AG (masterx.state != error),
AG (slave.state != error).

The AMBA protocol permits a simple form of livelock by allowing the slave to arbitrarily split masters. If the slave splits a master and does not unsplit it, we end up in a livelock condition as the split master never gets a chance to serve requests. Moreover, if the slave splits all the masters and does not unsplit them the system deadlocks. We showed these conditions by checking the following CTL formula:

EF (MASK_MASTER1 & MASK_MASTER2 & MASK_MASTER3 & MASK_MASTER4).

Fig. 9. Ambiguity in the AMBA AHB Specification

We had to specify rules within the slave to enforce that whenever two masters are split one of them will eventually be unsplit. Then we tried to verify whether the system can always recover from a deadlock that the slave causes by splitting all the masters on the bus, using the following CTL formula:

AG ((MASK_MASTER1 & MASK_MASTER2 & MASK_MASTER3 & MASK_MASTER4) ->
    AF (!MASK_MASTER1 | !MASK_MASTER2 | !MASK_MASTER3 | !MASK_MASTER4)).

However, to our surprise we found that this property is not necessarily true in all designs. The proposed model checking method uncovered the undocumented ambiguity in the AMBA specification described below.
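To make the shape of the EF, AG, and AG (p -> AF q) properties used in this section concrete, the following Python sketch evaluates them on a small explicit state graph. It is an assumption-laden illustration: NuSMV works symbolically and honors the JUSTICE fairness constraints, neither of which is modelled here, and the state names ("none", "one", "all" masters masked) and function names are hypothetical.

```python
from collections import deque


def reachable(graph, init):
    """States reachable from `init` in zero or more transitions."""
    seen, queue = {init}, deque([init])
    while queue:
        for nxt in graph[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def check_ef(graph, init, pred):
    """EF pred: some reachable state satisfies pred."""
    return any(pred(s) for s in reachable(graph, init))


def check_ag(graph, init, pred):
    """AG pred: every reachable state satisfies pred."""
    return all(pred(s) for s in reachable(graph, init))


def af_states(graph, pred):
    """Least fixpoint for AF pred: states from which all paths reach pred."""
    sat = {s for s in graph if pred(s)}
    changed = True
    while changed:
        changed = False
        for state, succs in graph.items():
            if state not in sat and succs and succs <= sat:
                sat.add(state)
                changed = True
    return sat


def check_ag_implies_af(graph, init, p, q):
    """AG (p -> AF q) on the states reachable from init."""
    good = af_states(graph, q)
    return all(s in good for s in reachable(graph, init) if p(s))


def all_masked(state):
    return state == "all"


def some_unmasked(state):
    return state != "all"


# Hypothetical abstractions: in RECOVERING the slave always unsplits a master,
# in STUCK it may keep every master masked forever.
RECOVERING = {"none": {"none", "one"}, "one": {"none", "all"}, "all": {"one"}}
STUCK = {"none": {"one"}, "one": {"all"}, "all": {"all"}}

can_mask_all = check_ef(RECOVERING, "none", all_masked)
recovers = check_ag_implies_af(RECOVERING, "none", all_masked, some_unmasked)
stuck_recovers = check_ag_implies_af(STUCK, "none", all_masked, some_unmasked)
```

In the STUCK graph the self-loop on "all" is exactly the kind of path that a fairness assumption or an additional slave rule has to exclude before the recovery property can hold.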

6.1 Ambiguity in the AMBA AHB Specification

Despite the fact that the functional verification of the AMBA AHB protocol has been addressed before by various researchers [Roychoudhury et al. 2003] [Susanto and Melham 2003] [Amjad 2006], we were able to uncover an ambiguity in the protocol that has not yet been documented by other authors. Consider an MPSoC system based on the AMBA AHB protocol, using two masters (master 1, master 2) and a slave. The arbiter has to keep track of the masters’ state in order to manage the split transfers. This could be implemented by providing dedicated wires between the masters and the arbiter; however, this is impractical in most cases as it requires extra computation and hardware. An alternative method is to monitor the bus traffic to obtain the master and slave states. The arbiter may use the HTRANS signal to check whether the master is idle or transmitting (NONSEQ, SEQ), the HBURST signal to predict the remaining cycles of the transfer, and the HRESP signal to monitor whether the active master and slave have to step back to repeat a transaction.

The AMBA protocol allows three types of responses by the slave: OK signals that the transaction in the previous clock cycle has been successfully completed, RETRY signals that the slave wants the master to repeat the transaction from the previous clock cycle, and SPLIT is a signal to the arbiter to mask the master. A slave issues the SPLIT response when it predicts that it will be unable to receive data – a rather ambiguous definition in the AMBA specification. Later, a slave can signal the arbiter using the HSPLITx signal that it is now ready to process data and requests that the arbiter unmasks a previously masked master.

Let’s assume that the slave has previously split master 1 (master 1 is masked by the arbiter), and is in a transaction with master 2 as shown in Figure 9. The slave can unmask master 1 by issuing an HSPLIT1 signal using the masked master’s address to the arbiter. Consider that the slave tries to unmask master 1 by setting HSPLIT1 when it issues a RETRY response. The AMBA specification is ambiguous on what the arbiter should do in this case. The specification says that a master has to repeat the last transaction when it receives the RETRY response. If the arbiter monitors the bus signals to keep track of the masters’ states it will try to go back to its previous state to keep synchronized with the master and the slave. However, if the arbiter implements this behavior it will not unmask master 1 as the slave requests. Since there is no acknowledgement for HSPLITx signals the slave assumes that master 1 is already unmasked, and won’t request that the arbiter unmasks it again. This may result in deadlock as master 1 never gets access to the bus again. The AMBA specification states that “A slave which issues RETRY responses must only be accessed by one master at a time.” The authors of this paper could not reach an agreement whether the “access” refers to access through the bus or access by being split by the slave - which would cover this deadlock, but would also imply that a slave cannot issue a RETRY response if it may split a master.
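The scenario can be replayed with a deliberately tiny Python model of the arbiter’s mask bookkeeping. This is a toy abstraction written for this explanation, not the NuSMV arbiter: the function names and the rollback_on_retry flag are hypothetical, and the flag simply selects between the two readings of the specification discussed above.

```python
def arbiter_masks(masked, hresp, hsplit, rollback_on_retry):
    """Return the set of masked masters after one bus cycle."""
    if hresp == "RETRY" and rollback_on_retry:
        # The arbiter restores its previous state to stay in step with the
        # master and slave, so an HSPLITx request in the same cycle is lost.
        return set(masked)
    updated = set(masked)
    if hsplit is not None:
        updated.discard(hsplit)  # unmask the master named by HSPLITx
    return updated


def simulate(rollback_on_retry, later_cycles=5):
    """Slave raises HSPLIT1 together with RETRY once, then never again."""
    masked = {"master1"}  # master 1 was split earlier
    masked = arbiter_masks(masked, "RETRY", "master1", rollback_on_retry)
    for _ in range(later_cycles):
        # No acknowledgement exists for HSPLITx, so the slave does not repeat it.
        masked = arbiter_masks(masked, "OK", None, rollback_on_retry)
    return masked


still_masked = simulate(rollback_on_retry=True)
unmasked = simulate(rollback_on_retry=False)
```

Under the rollback reading master 1 stays masked no matter how many further cycles are simulated, which corresponds to the deadlock described in the text; under the other reading the mask is cleared in the first cycle.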

AMBA AHB is one of the most successful and widely used interconnect protocols in the world. While the described ambiguity should be relatively rare in practice, our goal was to highlight that designers cannot simply rely on industry standard interconnect protocols to ensure deadlock- and livelock-freedom in their design. When new or less commonly used bus protocols are used by designers, the need for functional verification is even stronger.

6.2 Resolving the Ambiguity

Once we recognized the possibility for deadlock we tried to resolve the ambiguity and show the correctness of the design. Using a simple constraint, we disallowed the simultaneous use of the HRESP = RETRY and HSPLITx signals, which caused the deadlock described above. We then tried to verify whether the system can always recover from a deadlock that the slave causes by splitting all the masters on the bus, using the CTL formulas below:

AG ((MASK_MASTERx & MASK_MASTERy) -> AF (!MASK_MASTERx | !MASK_MASTERy)),

AG ((MASK_MASTER1 & MASK_MASTER2 & MASK_MASTER3) ->
    AF (!MASK_MASTER1 | !MASK_MASTER2 | !MASK_MASTER3)),

AG ((MASK_MASTER1 & MASK_MASTER2 & MASK_MASTER3 & MASK_MASTER4) ->
    AF (!MASK_MASTER1 | !MASK_MASTER2 | !MASK_MASTER3 | !MASK_MASTER4)).

Using the constraint that disallows the simultaneous use of the HRESP = RETRY and the HSPLITx signal we were able to show that the MPSoC design works correctly. By checking the following CTL formulas, we were able to show that all bus requests by the masters are eventually served by the arbiter in the constrained MPSoC designs:

SPEC AG (masterx.state = busreq -> AF HGRANTx),
SPEC AG (masterx.state = busreq -> AF masterx.state = write).

These formulas ensure that the system does not deadlock or livelock and show the correctness of our designs.

7. PERFORMANCE EVALUATION OF AMBA-BASED MPSOC DESIGNS

This section describes the proposed method for performance evaluation that combines the transaction-level simulation approach with model checking.

7.1 Simulation-based Evaluation

Our simulation abstraction is cycle accurate at the transaction level, and the functional behavior of each module is “cycle-count-accurate”. Each one of the blocks in Figure 4 is implemented as an SC_MODULE, which is a special class in SystemC used to declare modules. Communication between modules is implemented through SC_PORTS using SC_SIGNALS.

Within each SC_MODULE there may be several concurrently executing threads, declared as SC_THREAD in SystemC. For instance, DWT has a tiling engine thread that emulates iDMA and fetches the tiles from main memory, a compute thread that emulates the DWT lifting kernel and wakes up when the controller signals that there is a tile ready to be processed, a read thread that fetches tiles from tile memory, and a write thread that writes DWT coefficients to tile memory. The Data Dispatcher has two threads, one that reads DWT coefficients from tile memory and the main data dispatcher thread that distributes the bitplanes among all of the bit plane coders in round robin fashion.

The Context Formatter and the MQ-Coder both have three separate threads, a read (from input FIFO) thread, a write (to output FIFO) thread and a compute thread. The Data Collector module also has two threads, one for reading from the bit plane coder output FIFOs in round robin fashion, and one for writing the encoded data back to main memory. The exhaustive verification of the SystemC model is practically infeasible due to the large number of threads and the degree of non-determinism present in the simulation models. A recent paper summarizes the problems arising from the complexity caused by the inherent non-determinism of multi-threaded embedded systems [Lee 2006].

The SystemC model is configured using a configuration script that sets up its parameters based on the input image that the model will process. The parameters include tile width, tile height, image width, image height, DWT decomposition levels, etc. The script configures and runs the model for a given number of test images. From each simulation run we obtain the execution intervals for processing tiles, and the size of compressed tiles sent over the AMBA AHB bus.

Tables I–VI show the parameters that we have obtained by the SystemC simulations. The SystemC simulation is deterministic, and depends on the following factors: (1) the input image size, as well as content, (2) the tile size, (3) the design of the JPEG2000 encoder. We have run simulations on five different pictures using 64 × 64 and 128 × 128 pixel tiles as input for the compression. Each image is of size 1024 × 1024 pixels in bmp format and was encoded from bmp to j2k format.


Table I. JPEG2000 Encoding SystemC Simulation Results for Design 1 Shown in Figure 4 using 64×64 pixel Tiles (for a single tile). Scale: cycles

Image      DWT     Tier-1 BC  Tier-1 WC  Tier-2   Input  Output  End-to-end
baboon     194188  517005     741519     9122240  12288  11099   10335043
boat       194188  165141     737046     8750875  12288  10046   10044857
goddesses  194188  513846     772461     8663630  12288  11456   9996487
goldhill   194188  242055     747954     8672436  12288  10376   9978464
lena       194188  461601     769239     8689815  12288  11979   10024198

Table II. JPEG2000 Encoding SystemC Simulation Results for Design 2 Shown in Figure 4 using 64×64 pixel Tiles (for a single tile). Scale: cycles

Image      DWT     Tier-1 BC  Tier-1 WC  Tier-2   Input  Output  End-to-end
baboon     194188  612189     741951     9122240  12288  10546   10385552
boat       194188  362484     741915     8750875  12288  9482    9999909
goddesses  194188  373950     743544     8663630  12288  11811   9936290
goldhill   194188  483885     742743     8672436  12288  10481   9936927
lena       194188  206181     741753     8689815  12288  9689    9950482

Table III. JPEG2000 Encoding SystemC Simulation Results for Design 3 Shown in Figure 4 using 64×64 pixel Tiles (for a single tile). Scale: cycles

Image      DWT     Tier-1 BC  Tier-1 WC  Tier-2   Input  Output  End-to-end
baboon     194188  513675     731124     9122240  12288  8622    10351259
boat       194188  390141     721467     8750875  12288  7842    9963563
goddesses  194188  505197     766737     8663630  12288  8157    9926747
goldhill   194188  411192     736254     8672436  12288  8592    9907612
lena       194188  416304     756117     8689815  12288  8031    9943368

Table IV. JPEG2000 Encoding SystemC Simulation Results for Design 1 Shown in Figure 4 using 128×128 pixel Tiles (for a single tile). Scale: cycles

Image      DWT     Tier-1 BC  Tier-1 WC  Tier-2   Input  Output  End-to-end
baboon     751393  2315254    3151948    9010373  49152  36537   14290609
boat       751393  1764568    3086892    8758372  49152  41719   13990027
goddesses  751393  1843190    3219664    9451990  49152  42391   14823509
goldhill   751393  2325098    3173076    8768459  49152  41645   14090307
lena       751393  2364360    3241400    8793070  49152  37578   14172351

Table V. JPEG2000 Encoding SystemC Simulation Results for Design 2 Shown in Figure 4 using 128×128 pixel Tiles (for a single tile). Scale: cycles

Image      DWT     Tier-1 BC  Tier-1 WC  Tier-2   Input  Output  End-to-end
baboon     751393  2530709    2872989    9010373  49152  37755   13921490
boat       751393  1693281    2851434    8758372  49152  33136   13578227
goddesses  751393  1897617    2998701    9451990  49152  43176   14502501
goldhill   751393  1935384    2901609    8768459  49152  32449   13682391
lena       751393  1759751    2904408    8793070  49152  32471   13690690

Table VI. JPEG2000 Encoding SystemC Simulation Results for Design 3 Shown in Figure 4 using 128×128 pixel Tiles (for a single tile). Scale: cycles

Image      DWT     Tier-1 BC  Tier-1 WC  Tier-2   Input  Output  End-to-end
baboon     751393  1778247    2863026    9010373  49152  32298   13819444
boat       751393  1723257    2790918    8758372  49152  36460   13479349
goddesses  751393  1778346    2939823    9451990  49152  29430   14315158
goldhill   751393  1758681    2884545    8768459  49152  31797   13603176
lena       751393  1763469    2947986    8793070  49152  29820   13674139


The number of tiles for each image is $\frac{3 \times 1024 \times 1024}{128 \times 128} = 192$ tiles when 128 × 128 tile sizes are used, and $\frac{3 \times 1024 \times 1024}{64 \times 64} = 768$ tiles when 64 × 64 tile sizes are used (3 refers to color components).

The DWT component implements the DC-level shifting, Multi-component Transform, and Discrete Wavelet Transform steps shown in Figure 3. Its execution time is constant as it only depends on the size of the tiles. The Tier-1 columns describe the measured execution time of the Tier-1 JPEG2000 compression in execution cycles. Since Tier-1 operates on tiles, we encounter different execution times depending on the content of the tile. For example, a tile consisting of a single color only will likely be encoded faster than a tile with many colors and objects. Simulating the encoding of a single image yields a substantial amount of data for the performance estimation of the Tier-1 HW unit. During simulation, we keep track of the lowest and highest execution time for the block, and thus we can estimate its execution interval. The Tier-2 column corresponds to the software implementation of Tier-2 on the main CPU. During this step the encoded image is assembled from the tiles, and therefore we get a single execution time. However, the execution time of Tier-2 varies between different images, as the images need to be constructed from different tiles. The Input column shows the size of a tile (3 × 128 × 128 and 3 × 64 × 64) as input to the DWT and Tier-1; the Output column specifies the worst case size of the tile after the compression in Tier-1.
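The tile counts and the values in the Input column can be checked with a few lines of arithmetic. The following is a minimal sketch; the function names are ours and are not part of the SystemC model or its configuration script.

```python
def tiles_per_image(width, height, tile, components=3):
    """Number of tiles for a width x height image with `components` color planes."""
    tiles_x, rem_x = divmod(width, tile)
    tiles_y, rem_y = divmod(height, tile)
    if rem_x or rem_y:
        raise ValueError("image dimensions must be multiples of the tile size")
    return components * tiles_x * tiles_y


def tile_input_size(tile, components=3):
    """Size of one tile as listed in the Input column (3 x tile x tile)."""
    return components * tile * tile


if __name__ == "__main__":
    for tile in (64, 128):
        print(tile, tiles_per_image(1024, 1024, tile), tile_input_size(tile))
```

For the 1024 × 1024 test images this yields 768 and 192 tiles, and input sizes of 12288 and 49152, matching the Input columns of Tables I–VI.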

7.2 Evaluating the Simulation-based Performance Estimate Results

Each table in Tables I–VI contains simulation data for a particular JPEG2000 encoder design using a particular tile size, and each row corresponds to the simulation of encoding a single image on the design with a particular tile size. We have simulated the encoding of 5 images for each design. Within a row, we encounter variable execution time for Tier-1, as execution times depend on actual tiles, and there are 192 or 768 tiles per image depending on the tile size as discussed above. The results shown are deterministic; they only depend on (1) the input image size, as well as content, (2) the tile size, (3) the design of the JPEG2000 encoder.

To increase the coverage of simulations, more “representative” images have to be used for analysis. Whenever a simulation-based performance estimation method is used, the coverage of the analysis plays an important role. Designers need to ensure that they choose representative test vectors for the analysis, but there are often no clear guidelines to determine whether the test vectors represent the actual data.

Although increasing the number of simulations (i.e. by encoding more images) clearly improves the accuracy of the performance estimation, in practice simulations are more useful for estimating average execution times than for estimating worst case execution times. In this section we describe two problems as key motivations for combining simulations with formal analysis for performance estimation.

Tables I–VI show simulation results encoding 5 different images on 3 different designs and 2 different tile sizes. We calculated the difference between the performance estimates for each design/tile size combination as shown in the first row of Table VII. For example, in Table I, the best case end-to-end estimate is 9978464, and the worst case estimate is 10335043, giving us a difference of 356579. The second row shows the difference as a percentage of the WCET encountered during the simulation (i.e. $\frac{356579}{10335043}$).

Table VII. Difference in Simulation-based End-to-end Execution Time Estimates (columns T1–T6 correspond to Tables I–VI)

             T1      T2      T3      T4      T5      T6
WCET - BCET  356579  449262  443647  833482  924274  835809
Diff %       3.45%   4.33%   4.29%   5.62%   6.37%   5.83%
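The two rows of Table VII follow directly from the End-to-end columns of Tables I–VI. The sketch below recomputes them; the dictionary keys and the function name are illustrative only, and the printed percentages may differ from the table in the last digit depending on how values are rounded.

```python
# End-to-end cycle counts copied from Tables I-VI
# (row order: baboon, boat, goddesses, goldhill, lena).
END_TO_END = {
    "T1": [10335043, 10044857, 9996487, 9978464, 10024198],
    "T2": [10385552, 9999909, 9936290, 9936927, 9950482],
    "T3": [10351259, 9963563, 9926747, 9907612, 9943368],
    "T4": [14290609, 13990027, 14823509, 14090307, 14172351],
    "T5": [13921490, 13578227, 14502501, 13682391, 13690690],
    "T6": [13819444, 13479349, 14315158, 13603176, 13674139],
}


def spread(cycles):
    """Return (WCET - BCET, difference as a percentage of the WCET)."""
    wcet, bcet = max(cycles), min(cycles)
    diff = wcet - bcet
    return diff, 100.0 * diff / wcet


if __name__ == "__main__":
    for name, cycles in END_TO_END.items():
        diff, pct = spread(cycles)
        print(f"{name}: WCET - BCET = {diff}, Diff % = {pct:.2f}%")
```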

Table VII shows that the difference between estimates for 64 × 64 tiles is under 5%, and the difference between estimates for 128 × 128 tiles is under 7% for these sets of experiments, despite the different images. Running more than 5 simulations would therefore likely result in estimates that are generally close to these results, unless designers stumble upon images that really push the architecture to its limit. The problem is that the global worst case time is not necessarily the product of local worst case times, and it is hard to find by simulations.

The other key problem is that simulation-based performance estimation is a rather time-consuming process. Simulating the encoding of a single image takes between 2 and 3 hours on a 2.8 GHz Intel Pentium 4 machine with 1 GB of memory. Therefore, it takes 3 × 5 × 2 × (2 to 3 hours) = 60 to 90 hours of pure simulation time to evaluate 3 designs on 5 images with two different tile sizes, and potentially more time to set up the experiments. This is a key motivation to combine simulations with formal analysis; the goal is to increase analysis coverage to better estimate the worst case end-to-end performance based on the existing simulation data, without the need for additional time-consuming simulations.

When analyzing the results of the performance estimation, we see that both designs 2 and 3 improve upon the performance of design 1 slightly on average. For the baboon image with 64 × 64 tiles, however, design 1 has the lowest worst case end-to-end computation time. The main source of performance gain in design 2 is the reduced congestion on the AMBA bus. In design 3 the performance gain is obtained because of the non-blocking communication between the DWT and the EBCOT unit. Design 2 does not seem to benefit from using 2 JPEG2000 encoders as the performance bottleneck is the CPU. However, to measure the expected performance gain in case a faster CPU is used, we ran simulations to obtain the average throughput of the JPEG2000 Encoder(s) in all designs as shown in Tables VIII–IX. These tests confirm our expectation that the two JPEG2000 Encoders should have nearly twice as much throughput as a single one.

7.3 Model Checking-based Performance Evaluation

This section describes how we utilized model checking to evaluate the worst case behavior of the digital camera design alternatives shown in Figure 4 based on the simulation results described in Subsection 7.1 above. The formal models used for the evaluation are cycle-accurate on the bus transaction-level.

During the formal analysis, we do not focus on improving the selection of initial test vectors. As the selection is influenced by a myriad of factors, most embedded design flows already have methods in place to guide simulation-based performance estimation with the help of experienced designers. Rather, our focus is to improve coverage based on the existing simulation data. The simulations give us very accurate results for the end-to-end processing of image tiles; however, they can only cover a few execution traces of the system. The proposed model checking approach for performance estimation builds on execution parameters obtained by simulations to prove the end-to-end performance bound. The formal model checking approach provides the means to evaluate larger design spaces to obtain the worst case end-to-end execution time of the overall MPSoC designs. As a result, this approach has the potential to increase the accuracy of performance estimations in most practical design flows.

Table VIII. Average Throughput of the JPEG2000 Encoders SystemC Simulation Results using 64×64 pixel Tiles. Scale: tile/sec

Image      Design 1  Design 2  Design 3  Design 2 vs 1  Design 3 vs 1
baboon     186.71    365.65    187.70    1.9583         1.0052
boat       198.68    371.32    198.84    1.8690         1.0008
goddesses  190.59    364.43    187.55    1.9121         0.9841
goldhill   188.70    365.23    188.36    1.9355         0.9982
lena       190.72    368.43    187.80    1.9318         0.9847

Table IX. Average Throughput of the JPEG2000 Encoders SystemC Simulation Results using 128×128 pixel Tiles. Scale: tile/sec

Image      Design 1  Design 2  Design 3  Design 2 vs 1  Design 3 vs 1
baboon     48.47     93.11     49.86     1.9211         1.0287
boat       50.89     94.18     52.39     1.8506         1.0295
goddesses  49.03     94.03     49.99     1.9179         1.0196
goldhill   49.03     93.16     50.02     1.9003         1.0203
lena       48.84     93.80     49.94     1.9204         1.0224

We have shown in Section 6 how we proved the overall functionality of the system. The FSM models used for the performance evaluation are more lightweight than for the functional verification described in Section 6; we do not consider split transactions, RETRY responses, or blocking slaves (HREADY is assumed to be set), as these functionalities are not used in the digital camera design alternatives shown in Figure 4. Transforming the models used for functional verification for performance analysis took us less than an hour. Although the transformation is not required for the performance analysis, it increases model checking scalability by removing parts of the model that are not utilized, but contribute to analysis complexity.

We have used simple Boolean variables to model the interrupts and signals in the digital camera, thus enforcing the dependencies between components. Although a finite state machine is inherently an untimed model of computation, it can capture time on a discrete time scale as transitions can be ordered. In our analysis we have declared a global time variable that is increased at every cycle.

Since most model checkers (including NuSMV) are optimized for property checking by giving yes/no answers, we had to conform to some restrictions. First, the performance analysis has to be expressed as a reachability problem: when Master 3 has written its data to the memory, is the execution time always smaller than some value x? Formally, as a CTL formula: AG (finish -> TIME < x), where finish is the signal generated by the CPU (master 1) when it is in the write state and it has written all the required data to the memory.

Table X. Parameters used for Performance Evaluation by Model Checking. Scale: $10^4$ cycles

Case study         M1  M2 BC  M2 WC  M3 BC  M3 WC  M4 BC  M4 WC
Design 1, 64×64    1   35     97     866    913    N/A    N/A
Design 1, 128×128  1   251    400    875    946    N/A    N/A
Design 2, 64×64    1   40     94     40     94     866    913
Design 2, 128×128  1   245    376    245    376    875    946
Design 3, 64×64    1   19     19     39     77     866    913
Design 3, 128×128  1   75     75     172    295    875    946

Table XI. Worst Case Bounds on the End-to-end Computation Times of the Designs Shown in Figure 4 obtained using Model Checking. Scale: cycles

Tile size              Design 1 WCET  Design 2 WCET  Design 3 WCET
64×64 pixel tiles      10670000       10580000       10540000
128×128 pixel tiles    17000000       16800000       16600000

Second, the state space used by NuSMV is influenced by the range of variables used in the model. As seen in Tables I–VI the execution intervals are in the order of millions of cycles. The execution intervals for system components introduce a large degree of non-determinism in the model checking-based analysis. The practical limit on the state space size of analyzable systems using state-of-the-art methods today is in the order of $10^{20}$ states, and this limit is quickly reached when model checking is performed at the cycle-accurate level.
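To illustrate what the query AG (finish -> TIME < x) asks, the following sketch performs an explicit-state search over a deliberately tiny, hypothetical system: a chain of stages that run one after another, each with a non-deterministic execution time in a [BC, WC] interval, and a global time counter that advances on every tick. This is not the NuSMV model used in the paper (it has no bus, arbitration, or interrupts, and NuSMV works symbolically); the stage intervals and all names are assumptions chosen for illustration.

```python
from collections import deque


def check_bound(stages, x):
    """Explicit-state check of AG (finish -> TIME < x) on a toy sequential model.

    stages: list of (bc, wc) execution intervals in abstract time units.
    Returns (holds, violating_time); violating_time is None when the bound holds.
    """
    initial = (0, 0, 0)  # (stage index, time spent in stage, global TIME)
    seen = {initial}
    queue = deque([initial])
    while queue:
        stage, elapsed, time = queue.popleft()
        if stage == len(stages):  # all stages done: `finish` holds in this state
            if not time < x:
                return False, time
            continue
        bc, wc = stages[stage]
        successors = []
        if elapsed < wc:   # the stage may keep running for one more tick
            successors.append((stage, elapsed + 1, time + 1))
        if elapsed >= bc:  # the stage may complete (non-deterministic choice)
            successors.append((stage + 1, 0, time))
        for nxt in successors:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True, None


if __name__ == "__main__":
    toy_stages = [(1, 1), (3, 5), (8, 9)]  # hypothetical intervals
    print(check_bound(toy_stages, 16))
    print(check_bound(toy_stages, 15))
```

Even in this toy model the number of reachable states grows with the width of the intervals and the range of the time counter, which is the effect described above for cycle-accurate intervals in the order of millions of cycles.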

To overcome this problem we have increased the timescale of the simulation (from single cycles to $10^4$ cycles, the scale used in Table X), thereby creating an abstraction of the system that is cycle-approximate with the highest precision available without hitting the state explosion problem. We used this precision to find the smallest value for x, and thus the tightest worst case execution time that we were able to prove. Table X shows the execution parameters obtained by this reduction technique, and Table XI summarizes the results of formal model checking on the worst case execution bounds of the digital camera design alternatives.
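One plausible way to carry out this reduction is to round each interval outward on the $10^4$-cycle scale stated in the caption of Table X, so that the abstract interval always contains the measured one. The sketch below makes two assumptions that are ours and are not stated in the text: that best case values are rounded down and worst case values are rounded up, and that the M2 interval for Design 1 combines the DWT time with the Tier-1 interval. Under these assumptions the Design 1, 64×64 row of Table X is reproduced; other rows do not necessarily follow the same rule.

```python
SCALE = 10_000  # abstract time unit in cycles (scale given in the caption of Table X)


def abstract_interval(bc_cycles, wc_cycles, scale=SCALE):
    """Round a cycle-level interval outward to the coarser time scale."""
    lower = bc_cycles // scale        # floor for the best case
    upper = -(-wc_cycles // scale)    # ceiling for the worst case
    return lower, upper


# Values taken from Table I (Design 1, 64x64 tiles).
DWT = 194188
TIER1_BC_MIN = 165141    # smallest Tier-1 BC over the five images
TIER1_WC_MAX = 772461    # largest Tier-1 WC over the five images
TIER2_MIN, TIER2_MAX = 8663630, 9122240

if __name__ == "__main__":
    print("M2:", abstract_interval(DWT + TIER1_BC_MIN, DWT + TIER1_WC_MAX))
    print("M3:", abstract_interval(TIER2_MIN, TIER2_MAX))
```

Rounding outward is the conservative choice: the abstract model can only over-approximate the measured timing behavior, at the cost of some precision in the resulting bound.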

Designers need to provide the bound x on the end-to-end execution time to see whether it holds. This may involve multiple model checker runs, which is a time-consuming process. For the work described in this paper, we have performed multiple model checker runs to find a tight bound that we were able to prove. In future work we plan to combine this approach with methods that aim to provide performance estimates at a higher level of abstraction [Madl et al. 2009] [Madl et al. 2007]. Given that designers have a good estimate of the worst case end-to-end performance, the number of model checker runs necessary could be greatly reduced.
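Because the property TIME < x becomes easier to satisfy as x grows, the search for the smallest provable x can be organized as a binary search over candidate bounds. The text does not say which search strategy was used, so the sketch below is only one possible approach; `holds` stands in for a full model checker run, and the oracle in the usage example is hypothetical.

```python
def tightest_bound(holds, lo, hi):
    """Find the smallest x in (lo, hi] for which holds(x) is True.

    Assumes holds is monotone (False up to some point, True afterwards),
    holds(lo) is False and holds(hi) is True.
    Returns (x, number of calls to holds).
    """
    runs = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        runs += 1
        if holds(mid):
            hi = mid   # bound proven: try a tighter one
        else:
            lo = mid   # bound violated: relax it
    return hi, runs


if __name__ == "__main__":
    hidden_worst_case = 700  # hypothetical worst case time, unknown to the search

    def oracle(x):
        # Stand-in for one model checker run on AG (finish -> TIME < x).
        return hidden_worst_case < x

    print(tightest_bound(oracle, 0, 2048))
```

The number of runs grows only logarithmically with the width of the initial range, which is why a good initial estimate of the worst case can reduce the total analysis time considerably.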

Evaluating the worst case performance of each of the digital camera design alternatives shown in Figure 4 using 64×64 tiles and a resolution of $10^4$ bus cycles takes around 6 hours on a dual AMD Opteron 240 (1.4 GHz) computer with 1 GB of main memory running Linux, and 4 days using 128×128 tile sizes. While this process is not particularly fast, it is an exhaustive analysis at the given abstraction, and guarantees complete state space coverage. When relying only on simulations, one could only perform up to a few dozen simulations within the same time.

Fig. 10. Performance Estimation of MPSoC Design Alternatives Shown in Figure 4

The proposed method is computationally intensive, and its performance degrades exponentially with respect to the state space size of the analyzed system. This might present scalability issues when trying to apply the method to large-scale MPSoCs. In this case, a combination of several techniques may be used. First, we might increase the scalability by increasing the timescale of the analysis, at the loss of some precision. Second, the method can be hierarchically composed. End-to-end execution times for MPSoCs can also be represented as intervals, thus providing a way to encapsulate larger MPSoC designs as a single component. Third, we can limit how many execution traces we want to capture in the models, e.g. by using constant execution times. As with any method for performance estimation, there is a tradeoff between analysis accuracy and performance.

7.4 Evaluating the Performance Estimation Results

Figure 10 summarizes the experiments performed for the performance estimation of the digital camera MPSoC design alternatives shown in Figure 4. Each group of 3 bars shows the cycle estimates we obtained for a design alternative with a given tile size.

The first bar (D 1 64x64 Block) illustrates the analytical block performance estimates for design 1 with 64×64 pixel tiles given in Table I. The analytical approach simply adds together the worst case execution times for the Tier-1 and Tier-2 blocks, without considering communication on the interconnect. The bar shows the best case cycle estimate for the Tier-1 block, the difference between Tier-1 WCET and Tier-1 BCET (so that the height of the two blocks corresponds to Tier-1 WCET), the Tier-2 BCET estimate, and the difference between Tier-2 WCET and Tier-2 BCET. Naturally, Tier-1 and Tier-2 of the EBCOT algorithm are the most computationally intensive blocks of the camera design, and therefore provide a lower bound on the expected performance of the digital camera design.
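The analytical block estimate can be reproduced from Table I by summing the Tier-1 and Tier-2 extremes. In the sketch below we assume that the block BCET and WCET are taken as the minimum and maximum over the five images; the text does not spell out this choice, and the names are illustrative.

```python
# (Tier-1 BC, Tier-1 WC, Tier-2) in cycles, copied from Table I.
TABLE_I = {
    "baboon":    (517005, 741519, 9122240),
    "boat":      (165141, 737046, 8750875),
    "goddesses": (513846, 772461, 8663630),
    "goldhill":  (242055, 747954, 8672436),
    "lena":      (461601, 769239, 8689815),
}


def block_estimate(rows):
    """Analytical estimate: add block times, ignoring bus communication."""
    tier1_bcet = min(r[0] for r in rows.values())
    tier1_wcet = max(r[1] for r in rows.values())
    tier2_bcet = min(r[2] for r in rows.values())
    tier2_wcet = max(r[2] for r in rows.values())
    return {
        "bcet": tier1_bcet + tier2_bcet,
        "wcet": tier1_wcet + tier2_wcet,
    }


if __name__ == "__main__":
    print(block_estimate(TABLE_I))
```

Under this assumption the analytical worst case for design 1 with 64×64 tiles is 9894701 cycles, below the 10335043 cycles observed end-to-end in simulation, which is consistent with the estimate ignoring the interconnect.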

The second bar (D 1 64x64 Sim) illustrates the end-to-end performance of design 1 obtained by simulations as shown in Table I. These results include all blocks for the performance estimation, as well as the overhead of the communication, such as reading/writing tiles from/to the memory through the AMBA AHB bus.

Fig. 11. Communication Overhead Estimates by Simulations and Model Checking

To obtain the true worst case performance estimates, we build on model checking to explore the finite state machine models augmented with execution parameters for each block as described in Section 7.3. The third bar (D 1 64x64 MC) shows the worst case performance estimate that we were able to prove assuming that the best and worst case execution time parameters used for masters/slaves were correct. This approach does not provide a “hard” bound on the worst case performance estimate, since we build on simulation results to estimate the performance of the individual blocks. In realistic design problems, however, designers are less concerned about the performance of individual blocks than choosing the right design alternative that provides the best performance based on their assumptions.

7.5 The Impact of Transaction-level Simulations and Model Checking on the Accuracy of the Performance Estimates

In Figure 11, we quantified the difference between the worst case estimates obtained by analytic results, simulations, and model checking shown in Figure 10. The communication overhead and bus congestion are responsible for the difference in the worst case performance estimates obtained by analytic calculations and simulations. The first bar of each group of 3 bars shows the communication overhead estimated by simulations as opposed to not considering the communication subsystem. For 64x64 tiles the difference is less than 5%, showing that the AMBA AHB bus rarely encounters any congestion, and the overhead is nearly negligible. Once we consider 128x128 tiles, we see that simulations estimate that the communication through the AMBA AHB bus is responsible for ∼15-20% overhead in the worst case. Since larger tiles are used, the number of memory accesses while processing a tile increases significantly, simply because we deal with larger data sets. The first bars in each group in Figure 11 show that simulations are essential to accurately predict the impact of the communication overhead.

Let us now consider the practical impact of applying model checking to obtain end-to-end performance estimates. The primary difference between simulation results and model checking results is due to the fact that model checking considers the end-to-end worst case execution time by performing an exhaustive state space search. During simulations, we can only cover a few execution traces of the MPSoC designs, and therefore we cannot estimate the impact of non-deterministic delays and congestion. Moreover, we cannot quantify the coverage of the performance estimates either.

The second bar of each group of 3 bars shows the communication overhead estimated by model checking as opposed to simulations. This basically shows that our simulations are not “pessimistic” enough, and cannot find the actual worst case end-to-end performance of the design. This is mainly due to the fact that not all functional blocks experience their worst case behavior at the same time, and the simulations do not encounter the maximum possible congestion on the AMBA AHB bus. As with simulations, the impact of considering model checking for performance estimation grows as the complexity of the design grows, and amounts to nearly 15%. This practically means that the actual performance of the design may be 15% worse than the worst case performance estimate obtained by simulations.

Finally, the third bar of each group of 3 bars shows the communication overhead estimated by model checking as opposed to the analytical method, where we did not consider the AMBA AHB bus. In this case, the difference may be more than 35%, showing that analytical methods simply cannot estimate the worst case performance of the digital camera MPSoC designs with acceptable accuracy.

By considering both simulations and model checking, we obtained performance estimates for the worst case end-to-end performance of the digital camera design early in the design flow, and improved the accuracy of the performance estimate compared to simulations by around 3% when 64x64 pixel tiles are used, and nearly 15% when 128x128 tiles are used.
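These percentages can be related to the tabulated data for design 1. The sketch below compares the worst end-to-end time seen in simulation (Tables I and IV) with the bound proven by model checking (Table XI); expressing the gap relative to the simulated worst case is our assumption about how the figures were derived.

```python
# Design 1 only: worst simulated end-to-end time (Tables I and IV)
# and the model checking bound (Table XI), both in cycles.
SIM_WCET = {"64x64": 10335043, "128x128": 14823509}
MC_WCET = {"64x64": 10670000, "128x128": 17000000}


def underestimate_pct(sim, mc):
    """How far the simulated worst case lies below the proven bound, in % of sim."""
    return 100.0 * (mc - sim) / sim


if __name__ == "__main__":
    for tile in SIM_WCET:
        gap = underestimate_pct(SIM_WCET[tile], MC_WCET[tile])
        print(f"{tile}: {gap:.1f}%")
```

For design 1 this gives a gap slightly above 3% for 64x64 tiles and slightly below 15% for 128x128 tiles.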

The performance analysis shows that design 3 offers the best worst case end-to-end processing, while design 2 is second (as the bottleneck is the CPU, not the encoder block), and design 1 is the slowest alternative. Our results show that the formal performance analysis is able to provide tight worst case execution numbers for the end-to-end processing of the digital camera MPSoC design alternatives.

The proposed formal performance evaluation method is unique compared to simulation-based evaluations as it covers orders of magnitude larger design spaces. The application of the method allows designers to avoid the common mistake of underestimating the worst case performance of MPSoCs as a result of inadequate coverage by simulations.

8. CONCLUDING REMARKS

We have presented a method that combines transaction-level simulations and model checking for MPSoC verification and performance evaluation, providing a way for high accuracy design space exploration. We have described an ambiguity in the AMBA specification that might lead to flawed designs. The formal models inherently capture bus communication at the transaction-level, thereby creating an abstraction with a practical balance between analysis accuracy and scalability. The proposed formal performance analysis can be used to obtain the worst-case bounds on the end-to-end execution time of MPSoCs and, unlike simulations, guarantees the correctness of the results. Our experiments with a digital camera MPSoC demonstrate the applicability and high accuracy of the method. The formal evaluation described in this paper could be fully automated using existing model checkers. Comparisons with state-of-the-art simulation techniques show that the proposed method can be efficiently used for the systematic transaction-level validation of MPSoC designs.

9. ACKNOWLEDGEMENTS

We would like to thank Professor Fadi J. Kurdahi for his help with the JPEG 2000 architecture design and his help in developing the SystemC simulation models. This research was partially supported by a CPCC Fellowship.

REFERENCES

A. Cimatti and E. Clarke and E. Giunchiglia and F. Giunchiglia and M. Pistore and M. Roveri and R. Sebastiani and A. Tacchella. 2002. NuSMV 2: An OpenSource Tool for Symbolic Model Checking. In Proceedings of the 14th International Conference on Computer-Aided Verification (CAV).

Amjad, H. 2006. Verification of AMBA Using a Combination of Model Checking and Theorem Proving. Electronic Notes in Theoretical Computer Science, Proceedings of the 5th International Workshop on Automated Verification of Critical Systems (AVoCS 2005) 145, 45–61.

ARM. 1999. AMBA Specification rev 2.0, IHI-0011A.

Bryant, R. E. 1992. Symbolic Boolean Manipulation with Ordered Binary-Decision Diagrams. ACM Computing Surveys 24, 3, 293–318.

Chauhan, P., Clarke, E. M., Lu, Y., and Wang, D. 1999. Verifying IP-Core based System-On-Chip Designs. In Proceedings of IEEE ASIC SOC Conference. 27–31.

Clarke, E. and Emerson, E. 1981. Design and synthesis of synchronisation skeletons using branching time temporal logic. Logic of Programs, Lecture Notes in Computer Science 131, 52–71.

D’silva, V., Ramesh, S., and Sowmya, A. 2005. Synchronous protocol automata: a framework for modelling and verification of SoC communication architectures. In IEEE Proceedings of Computers and Digital Techniques. Vol. 152. 20–27.

Ernesto Wandeler and Lothar Thiele and Marcel Verhoef and Paul Lieverse. 2006. System architecture evaluation using modular performance analysis - a case study. Software Tools for Technology Transfer (STTT) 8, 6 (Oct.), 649–667.

Goel, A. and Lee, W. R. 2000. Formal Verification of an IBM CoreConnect Processor Local Bus Arbiter Core. In Proceedings of the 37th Design Automation Conference (DAC). 196–200.

Holzmann, G. J. 2004. The SPIN model checker: Primer and reference manual. Addison Wesley.

IBM. 2001. 32-bit Processor Local Bus Architecture Specifications ver 2.9, SA-14-2531-01.

IEEE. 2000. VHDL (IEEE 1076 Standard).

IEEE. 2001. Verilog (IEEE 1364 Standard).

IEEE. 2005. SystemVerilog (IEEE 1800 Standard).

JPEG Committee. ISO/IEC JTC1/SC29/WG1 N1855, JPEG 2000 Part I: Final Draft International Standard (ISO/IEC FDIS15444-1). 8.2000.

K.L. McMillan. 1992. The SMV system. Tech. Rep. CMU-CS-92-131, Carnegie Mellon University.

Lahiri, K., Raghunathan, A., and Dey, S. 2001. System-Level Performance Analysis for Designing On-Chip Communication Architectures. IEEE Transactions on Computer Aided-Design of Integrated Circuits and Systems 20, 768–783.

Lee, E. A. 2006. The Problem with Threads. IEEE Computer 39, 5 (May).

Lee, E. A., Hylands, C., Janneck, J., II, J. D., Liu, J., Liu, X., Neuendorffer, S., Stewart, S. S. M., Vissers, K., and Whitaker, P. 2001. Overview of the Ptolemy Project. Tech. Rep. UCB/ERL M01/11, EECS Department, University of California, Berkeley.

Madl, G., Dutt, N., and Abdelwahed, S. 2007. Performance Estimation of Distributed Real-time Embedded Systems by Discrete Event Simulations. In Proceedings of EMSOFT. 183–192.


Madl, G., Pasricha, S., Dutt, N., and Abdelwahed, S. 2009. Cross-abstraction Functional Verification and Performance Analysis of Chip Multiprocessor Designs. IEEE Transactions on Industrial Informatics, Special Section on Real-time and (Networked) Embedded Systems (accepted for publication).

Madl, G., Pasricha, S., Zhu, Q., Bathen, L. A. D., and Dutt, N. 2006. Formal Performance Evaluation of AMBA-based System-on-Chip Designs. In Proceedings of EMSOFT. 311–320.

OSCI. 2005. SystemC ver 2.1 (IEEE 1666 Standard).

Rafik Henia and Arne Hamann and Marek Jersak and Razvan Racu and Kai Richter and Rolf Ernst. 2005. System Level Performance Analysis - the SymTA/S Approach. IEE Proceedings on Computers and Digital Techniques 152, 148–166.

Richter, K., Jersak, M., and Ernst, R. 2003. A Formal Approach to MpSoC Performance Verification. IEEE Computer 36, 60–67.

Roychoudhury, A., Mitra, T., and Karri, S. R. 2003. Using Formal Techniques to Debug the AMBA System-on-Chip Bus Protocol. In Design, Automation and Test in Europe (DATE). 828–833.

Susanto, K. W. and Melham, T. F. 2003. An AMBA-ARM7 Formal Verification Platform. In International Conference of Formal Engineering Methods (ICFEM). 48–67.

Taubman, D. 2000. High performance scalable image compression with EBCOT. IEEE Transactions on Image Processing 9, 1158–1170.
