Design Exploration using Multiple ARM Instruction Set ...

36
Center for Embedded Computer Systems University of California, Irvine Design Exploration using Multiple ARM Instruction Set Simulators - A Case Study on a JPEG Encoder Kyoung Park Kim, Rainer D¨ omer Technical Report CECS-09-08 May 28, 2009 Center for Embedded Computer Systems University of California, Irvine Irvine, CA 92697-3425, USA (949) 824-8059 {kpkim, doemer}@uci.edu http://www.cecs.uci.edu/

Transcript of Design Exploration using Multiple ARM Instruction Set ...

Page 1: Design Exploration using Multiple ARM Instruction Set ...

Center for Embedded Computer SystemsUniversity of California, Irvine

Design Explorationusing Multiple ARM Instruction Set Simulators

- A Case Study on a JPEG Encoder

Kyoung Park Kim, Rainer Domer

Technical Report CECS-09-08May 28, 2009

Center for Embedded Computer SystemsUniversity of California, IrvineIrvine, CA 92697-3425, USA

(949) 824-8059

{kpkim, doemer}@uci.eduhttp://www.cecs.uci.edu/

Page 2: Design Exploration using Multiple ARM Instruction Set ...

Design Explorationusing Multiple ARM Instruction Set Simulators

- A Case Study on a JPEG Encoder

Kyoung Park Kim, Rainer Domer

Technical Report CECS-09-08May 28, 2009

Center for Embedded Computer SystemsUniversity of California, IrvineIrvine, CA 92697-3425, USA

(949) 824-8059

{kpkim, doemer}@uci.eduhttp://www.cecs.uci.edu

Abstract

Embedded software synthesis is an integral step in the embedded system design flow. Tovalidate the generated software in the context of the entire system-on-chip, instruction set sim-ulation is a critical task. In previous work [8], an instruction set simulator (ISS) for an ARMprocessor core has been integrated into the System-on-Chip Environment. Unfortunately, theexisting integration is limited to a single processor unit in the entire system.

In this work, we extend the previous implementation so that multiple ARM ISS instancescan be used concurrently within the system model. As a result, platform architectures withmultiple ARM CPUs can now be accurately simulated at ISS level. This report demonstratesour multi-ARM ISS support by use of a case study on a JPEG encoder application.

Page 3: Design Exploration using Multiple ARM Instruction Set ...

Contents1 Introduction 1

1.1 Top-Down System Design with SCE . . . . . . . . . . . . . . . . . . . . . . . . . 21.2 Multiple Instruction Set Simulator . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Exploration of JPEG Encoder Application 32.1 JPEG Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32.2 JPEG Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.3 JPEG Design Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.4 Single CPU Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3 Multi-ARM Instruction Set Simulation 123.1 Problem definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123.2 Approach and Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133.4 Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

4 Multiple CPU Implementation 154.1 JPEG Architecture using 2 ARM CPUs . . . . . . . . . . . . . . . . . . . . . . . 15

4.1.1 Communication via Hardware Block . . . . . . . . . . . . . . . . . . . . . 154.1.2 Communication via Bridge . . . . . . . . . . . . . . . . . . . . . . . . . . 16

4.2 JPEG Architecture using 3 ARM CPUs . . . . . . . . . . . . . . . . . . . . . . . 164.2.1 Communication via Hardware Block . . . . . . . . . . . . . . . . . . . . . 164.2.2 Communication via Bridge . . . . . . . . . . . . . . . . . . . . . . . . . . 17

5 Experimental Results 195.1 Platform Execution Times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195.2 Multi-ARM Simulation Times . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

6 Conclusions and Future Work 20

References 21

A Appendix 22A.1 SCE Design Steps for Single-CPU Implementation . . . . . . . . . . . . . . . . . 22A.2 SCE Design Steps for Two-CPU Implementation . . . . . . . . . . . . . . . . . . 24A.3 SCE Design Steps for Three-CPU Implementation . . . . . . . . . . . . . . . . . . 27

i

Page 4: Design Exploration using Multiple ARM Instruction Set ...

List of Figures1 Embedded system design flow using SCE . . . . . . . . . . . . . . . . . . . . . . 32 Block diagram of JPEG encoder application[1] . . . . . . . . . . . . . . . . . . . 43 JPEG encoder test bench . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 Exploration model of JPEG encoder in SCE . . . . . . . . . . . . . . . . . . . . . 65 Computation profile of JPEG encoding . . . . . . . . . . . . . . . . . . . . . . . . 76 Cost/Speed trade off graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 Hierarchy chart of bus functional model . . . . . . . . . . . . . . . . . . . . . . . 98 Platform model for JPEG encoder . . . . . . . . . . . . . . . . . . . . . . . . . . 109 Extern variables and classes used in funtions in SWARM ISS . . . . . . . . . . . . 1210 System architecture for JPEG encoder using 2 CPUs . . . . . . . . . . . . . . . . 1511 System architecture for JPEG encoder using 2 CPUs with bridge . . . . . . . . . . 1612 System architecture for JPEG encoder using 3 CPUs . . . . . . . . . . . . . . . . 1713 System architecture for JPEG encoder using 3 CPUs with bridge . . . . . . . . . . 18

ii

Page 5: Design Exploration using Multiple ARM Instruction Set ...

List of Tables1 Architecture refinement from perfect model . . . . . . . . . . . . . . . . . . . . . 82 Execution time depending on the number of CPUs . . . . . . . . . . . . . . . . . . 193 Simulation time depending on the number of ARM ISS . . . . . . . . . . . . . . . 19

iii

Page 6: Design Exploration using Multiple ARM Instruction Set ...

Design Explorationusing Multiple ARM Instruction Set Simulators

- A Case Study on a JPEG Encoder

Kyoung Park Kim, Rainer Domer

Center for Embedded Computer SystemsUniversity of California, IrvineIrvine, CA 92697-3425, USA

{kpkim, doemer}@uci.eduhttp://www.cecs.uci.edu

AbstractEmbedded software synthesis is an integral step in the embedded system design flow. To validate thegenerated software in the context of the entire system-on-chip, instruction set simulation is a criticaltask. In previous work [8], an instruction set simulator (ISS) for an ARM processor core has beenintegrated into the System-on-Chip Environment. Unfortunately, the existing integration is limitedto a single processor unit in the entire system.

In this work, we extend the previous implementation so that multiple ARM ISS instances can beused concurrently within the system model. As a result, platform architectures with multiple ARMCPUs can now be accurately simulated at ISS level. This report demonstrates our multi-ARM ISSsupport by use of a case study on a JPEG encoder application.

1 IntroductionHardware/software co-design is a set of methodologies and techniques specifically created to

support the concurrent design of both systems, reducing the development time. Time to market isvery important in chip business. There are so many companies to try to survive in SOC industry.The first company to release the chip to market takes most of profits. In addition to its criticalrole in the development of embedded systems, many experts believe that co-design will be a core

1

Page 7: Design Exploration using Multiple ARM Instruction Set ...

design methodology for Systems-on-a-Chip. Also, concurrent design, or co-design of hardware andsoftware is extremely important for meeting design goals, such as high performance, that are the keyto commercial competitiveness because designers can trade-off in the way hardware and softwarecomponents work together to exhibit a specified behavior.

1.1 Top-Down System Design with SCEIn this project, top-down design methodology using SpecC[5][6][7] will be dealt with. Software

should be developed in the early stage of chip design because software development time forembedded system takes more time than hardware development time. In Figure 1, top-down designmethodology is briefly described by using SCE. It is composed of 6 steps. It starts from productspecification. The specification model is generated from product specification. This specificationmodel is untimed and has only the functional description of the design. Architecture refinementtransforms this specification to an architecture model. It involves partitioning the design and map-ping the partitions onto the selected components. The architecture model thus reflects the intendedarchitecture for the design. The next scheduling refinement steps add RTOS to architecture model.Dynamic scheduling like Round-robin and priority-based scheduling and static scheduling areavailable for scheduling for each behavior in Spec-C model. Networking refinement is performedby adding buses to DUT and mapping all components under DUT to slaves and masters for buses.Communication refinement generates a timing accurate BFM. The final step is HW/SW synthesiswhich produces clock cycle accurate RTL model for hardware components and instruction specificassembly code for processors.

1.2 Multiple Instruction Set SimulatorThis report contributes the support of multiple ARM instruction set simulators to the SCE de-

sign flow[2]. Previously there is a limitation in SCE to support at most one ARM instruction setsimulator due to some global variables used in SWARM[3]. Now System-on Chip using multipleARM7TDMI processors can be simulated with our extension of the SWARM. The multiple ARMsimulation could be achieved by making those global variables local. The multiple ARM instructionset simulator is being developed, based on SpecC ARM ISS[8] currently integrated in SCE. Moredetailed information on this will be available later in this report.

2

Page 8: Design Exploration using Multiple ARM Instruction Set ...

SCE Top-d ow n D e s i g n M e t h od ol og y�����������

�� ������������������� � � �"!��#� �%$

&'�(��)+* ,�* )��+��* -/.102-�3��'4

5 6)+7(* �"��)����(8�28�%,�*9.'�'02�:.;�

5 6)+7(* �"��)��<�(8�=02-�3�'4

&>)%7'��3?�4 *@.'AB��%,�*9.'�'02�:.;�

&�)C7���3?�(4@* .�AD0E-F3��'4

G�H�I J8KLH�I MN O6P�Q R8M#SUT T S V P#Q M6I W H�XUMN#O8P#Q9R�M�SUT T SN6Y6W Q KLH�I MN#O8P#Q9R6M�SUT T S

Z 0[�F4 �'0E�'.>�"�+��* -?.\02-�3��'4

] �+�8^_-?�`E��%,"* .(�:02�'.%�

] �+�a^b-?�`c0E-F3��'4

�d-?0=02�.(* )��+��* -?.\6�+,e* .��'0c�'.;�

�/-?0=0[�.(* )��%�<* -?.\02-�3��'4

Figure 1: Embedded system design flow using SCE

2 Exploration of JPEG Encoder ApplicationTo get through SCE top down design methodology, we start from JPEG encoder application[9].

Hardware synthesis is not covered at this time. Software synthesis is going to be dealt with toachieve the performance goals for JPEG encoder. In computing, JPEG is a commonly used methodof compression for photographic images. The degree of compression can be adjusted, allowinga selectable tradeoff between storage size and image quality. JPEG typically achieves 10:1 com-pression with little perceptible loss in image quality.To be familiar with JPEG Encoder application,the structure of JPEG encoder is investigated. In Figure 2, the block diagram for JPEG encoderapplication[1] is shown. It consists of 8 blocks. chendct1, chendct2, quantize, zigzagand huffencode are the main parts of JPEG encoder. Readbmpheader, InitGlobals andReadBmpBlock exist to generate inputs and initialize. Above JPEG encoder C model will beconverted to Spec C model to do software synthesis. The interesting part of this JPEG applicationis chendct1 and chendct2 to support multi-processing. chendct1 is covering odd block andchendct2 is executed on even block.

2.1 JPEG SpecificationTo specify jpegencoder application for system design using SCE[4], we went through many steps.

In version 0, JPEG encoder can be compiled through SpecC compiler. In version 1, chendct1,chendct2, quantize, zigzag and huffencode are integrated under DUT. And other blocks

3

Page 9: Design Exploration using Multiple ARM Instruction Set ...

f�g+h%i'j/kml:n�g+h+i'g>o p9q:r s�tvu w�x%y>u z {F|+}+~��/�m���/� ���>� �>�;�>�>�;�����

�:���>�C�8� ����C� �;������;�+� �8�+�;�%�; :�

¡ ¢>£6¤"¥#¦?§ ¨�© ª« ¬>­6®"¯a°�¯a± ®�²6³

´<µ'¶8· ¸º¹ »�¼e½6¾8¿�½"À Á Â#ÃÄ<Å'Æ8Ç ÈºÉ Ê�ËeÌ6Í8Î�Ì"Ï Ð Ñ6Ò

Ó%Ô�Õ"Ö�× Ø"Ù

Ú�Û:Ü6Ý<Þ6ß�à"áeâ�ã ä�å6æ

ç%è�é�ê�ë6ì�í�î8ï ð"ñºò

ó�ô:õºö ÷8ø ù�úeû6üºýaûeþ ÿ ������������ ��� ��� �

������������ � ��! "

#%$�&(' ) *(+,%-�.�/�0�13254�6 758�9:%;�<�=�>�?�@ A ?�B C

D�EGFIH�JIK�L(M3N�O P3QRS�TGUIV�WIX�YZ\[ ]�^ _

`�aGbdc e�f g�h5ijdk�i(l m ndop q r(s

t�uGvdw x5v�yz {�|G}d~ �(}���� ���� �(��� �

�%������������

�(�I� �3�� � ���� �(��� ��5�¡  ¢3£d  ¤¥(¦�§�¨�© ª «¡¬�­(¦�©®5¯�°d±�² ³ ´¡µ�¶(¯�²·¸ ¹�·¡º�¹�»3¼½¾d¿ À�¾¡Á�À�Â3ÃdÄ

ÅÇÆÉÈËÊÍÌÏÎÑÐÉÒÔÓÖÕ�ÒØ×ÙÈÛÚØÜÞÝàßâáäãæåÍÊÛÈÔÎËãØÕçÒØèËèÔÆ¡ÐéÊêÒÛëìÐíÈæå

Figure 2: Block diagram of JPEG encoder application[1]

are divided into stimulus and monitor as it is seen in Figure 3. In version 1.1, printing statementfor timing is added. In version 2.0, double handshake channels are added to SpecC model to sup-port parallel execution. In version 2.1, typed queue channel are introduced to support pipelinedexecution. This step was the most tricky part in this project. A lot of efforts are made to make itworking[4].

2.2 JPEG ExplorationJPEG Encoder is created by improving version 2.1. It has zero warnings and clean hierarchy

shown in Figure 4. There are no global variables and no global functions. They are merged intobehaviors. It shows detailed timings for each encoded block. The reference picture, test.jpg, ismoved into Monitor. For our convenience, this model is called perfect model.

More timing analysis is performed with newly created JPEG Encoder SpecC model. Computa-tion profile for each block is shown on Figure 5. Timing for chendct1 and chendct2 are allsame as 10.41 ms. quantize is consuming 7.84 ms of 180 block encoding time. zigzag isspending 2.32 ms of CPU time. huffman does use 8.88 ms. So total encoding time for 180 blockis 39.86 ms. chendct1 and chendct2 are spending half of CPU time. And huffman is third.quantize is fourth. zigzag is the last.

4

Page 10: Design Exploration using Multiple ARM Instruction Set ...

î�ïìðòñôóöõ÷óöî

ñùøûú\ð ï�øûü

ý�þ ÿ ú � ý ï�� ý%þ ÿ ú � ý�ï�� � ó�� ú�ï ð �ÿ �ð�� �� þ�ó�����ÿ ú ý ø � ÿ

Jpeg encoder Test bench

���

�����

���

� � � � ��� ���

�! #"%$

Figure 3: JPEG encoder test bench

5

Page 11: Design Exploration using Multiple ARM Instruction Set ...

Figure 4: Exploration model of JPEG encoder in SCE

6

Page 12: Design Exploration using Multiple ARM Instruction Set ...

Figure 5: Computation profile of JPEG encoding

From the computation profile, dct1, dct2 and huff are consuming most part of computationtime. They are possible candidates for improving performance.

2.3 JPEG Design AnalysisTo do architecture refinement, timing calculated from perfect model is back annotated. Estimated

computation delays are inserted manually. Timing delays are added by using waitfor statement. Thetimings from perfect model are divided by 180 which means the number of input stimulus. Thattiming is added to each block.

Architecture refinements are performed and various system architecture is tested to get best ar-chitecture. From many tests and experiments, results show that performance is increased accordingto the increase of number of CPUs. But there is some saturation of performance at more than 3CPUs. So 3 CPUs are the optimal number in terms of cost and performance. Table 1 shows sometest results depending on number of CPUs and scheduling method of RTOS. Also blocks in mod-els are randomly allocated to each CPU. Some RTOS is doing scheduling statically, and others arescheduling dynamically. Eventually, a graph for cost and speed trade off is drawn in Figure 6.

In reality, our JPEG encoder is still too slow to be used. It takes about 40 ms for encoding a116X96 pixel image in black and white. 116X96 pixel is only 0.011136 mega-pixels. It needs about1000 times performance improvements to cover 11.1 mega pixels.

7

Page 13: Design Exploration using Multiple ARM Instruction Set ...

Number of CPU DCT1 DCT2 Quantize Zigzag Huffman RTOS Scheduling Execution time(us)1 CPU1 CPU1 CPU1 CPU1 CPU1 CPU1:Priority 398602 CPU1 CPU1 CPU2 CPU2 CPU2 CPU1:No OS 19153

CPU2:Priority3 CPU1 CPU1 CPU2 CPU3 CPU3 CPU1:No OS 11358

CPU2:No OSCPU3:Round Robin

4 CPU1 CPU2 CPU3 CPU4 CPU4 CPU1:No OS 11358CPU2:No OSCPU3:No OS

CPU4:Round Robin5 CPU1 CPU2 CPU3 CPU4 CPU5 CPU1:Round Robin 10574

CPU2:Round RobinCPU3:Round RobinCPU4:Round RobinCPU5:Round Robin

Table 1: Architecture refinement from perfect model

2.4 Single CPU ImplementationTo support bus functional model for CPU, platform model for JPEG encoder is created in Fig-

ure 8. Communications in jpegencoder can be refined to actual CPU bus. IO units datain and dataoutare added to platform model to support BFM communication via CPU bus. Channels from c1 to c4are going to be real system AMBA bus. They are using totally different protocols compared to c1and c5 which means channels for c1 and c2 are TLM bus model and more abstract bus model thanBFM bus model. So there should be some kind of wrapper to convert channel to real hardware bus.IO Unit1 is inserted into JPEG platform and mapped to datain to transfer the inputs from stimulusto jpegencoder. Also IO Unit2 is added to JPEG Platform and mapped to dataout to send outputof JPE encoder to monitor which compares the results to check if those results are matched to thegolden data of test.jpg.

Network refinement is performed to map all of PEs under JPEG Plaform to slaves and mastersof buses.Bus0 are renamed to ”CPU BUS”. Double handshake bus is added as HW Bus. Addi-tional Port1 is created for HW. Port1 is eventually connected to stimulus through IO Unit1. H2 isconnected to CPU Bus as slave4 through port1. Hierarchy chart of BFM is drawn in Figure 7.

Communication refinement is performed to generate timing accurate BFM model of JPEG en-coder. Addresses are assigned to HW and IO Unit2 for memory mapped IO. C codes for executionfor ARM7TDMI are generated from communication model.

8

Page 14: Design Exploration using Multiple ARM Instruction Set ...

&'(&�&)&*+&)&�&)&*('(&�&)&,)&)&�&)&,-'(&�&)&.(&)&�&)&.)'(&�&)&/0&)&�&)&/1'(&�&)&

* , . / '

2�35476(809;:-<>=�?)@ACB%8(D+3(EGF :IHKJ0F 4�8)L�3)MON

Figure 6: Cost/Speed trade off graph

PRQTS U>VCW V�P

XZYC[ \)]

^ _�`a[�b S Q�c U;d b S QTd%e

fhgig+e X Q�W jkVmlmn foe pCS QTqCe g%qOradigiqoe g�qms fhVCW Q�tmW fvumq w fOQTqOg X W x yie qhfhgoz X d b�b qhraQTd%e

eT{�fGQTfojkVml ] {(fOQTfmjkVhl

^ _�`G[�b S QR|

SYSTEM CHART OF JPEGENCODER WITH ONE CPU

}�~������k���

\�] `v�%[ t

X%�

XZ�~�� �1�m�C���C�

��������������� ��k���;�5�>� �>�5�C�0���I�1�>���;�1���� �;�1�>�1�>�5���-���;�1�;���;�1�a����;�5�>� �>�5�C�0���I�1�>���;�1���� �;�1�>�1�>�5���-���;�1�;���;�1�a�¡£¢�¤¡£¢�¤¡£¢�¤¡£¢�¤¥�� ��k�§¦�¨1©;�5�Oª «0��� « ª ¬I«0© ¬k���I¨5­G­h®#©;��§¦k¨1©>�5�Gª «0�>��« ª ¬;«-©�¬k����¨5­G­m®#©I��§¦�¨1©;�5�Oª «0��� « ª ¬I«0© ¬k���I¨5­G­h®#©;��§¦k¨1©>�5�Gª «0�>��« ª ¬;«-©�¬k����¨5­G­m®#©I�

Figure 7: Hierarchy chart of bus functional model

9

Page 15: Design Exploration using Multiple ARM Instruction Set ...

Figure 8: Platform model for JPEG encoder

10

Page 16: Design Exploration using Multiple ARM Instruction Set ...

To run C codes generated from communication model, instruction set simulator model is created.System architecture consists of 4 processing elements. Those PEs are including main ARM CPU forquantize, zigzag, and Huffman whose using a priority based scheduling using microC-OS2 RTOS,a hardware accellarator for the DCTs, 2 I/O units for data in and output. Those PEs are connectedthrough AMBA-AHB bus which is the built-in CPU bus and a custom double-handshake bus. Butsome problems exists when C code generator inserts unwanted TaskDelay(). Bus Functional modelsget stuck after encoding for block 177. Encoding is still too slow. It takes 142 milliseconds for 177blocks. Our goal is to achieve 0.013 ms to support 1.1 mega pixels for color photograph. The targetspeed is far away from the result regardless of some limitation. The detailed steps in SCE for thissingle CPU implementation are listed in Appendix A.1.

11

Page 17: Design Exploration using Multiple ARM Instruction Set ...

3 Multi-ARM Instruction Set SimulationWe will now describe out work in extending the SWARM ISS in SCE to support multiple ARMISSs instances.

3.1 Problem definitionThe reason why the original ARM instruction simulation does not support the multiple instances

is that there are nine global variables to be referenced in some local and global functions in in-struction simulator. When JPEG simulation is run with multiple ARM ISSs, those extern variablesshould not be shared and commonly used in multiple ARM ISSs. Eventually those extern variablescause simulation to stop and prevent multiple ARM ISSs from running concurrently. To instancethe multiple ARM ISSs properly, all global functions should not touch global variables and classes.Moreover, global variables and classes should be localized to support multiple ARM ISSs. All ex-tern variables that need to be localized are shown in Figure 9 as an example to show what variablesare directly used in the source code of ARM ISS.

¯�°%± ²%³ ´± ²+± µ�³Z´¶v·�¶o¸ ¹%º�·�¶v·�¶o¸ ¹�³(´º(»%¼a¶C½i¹�¶o¾o³+´°�¿(¿)À ¶m½%¹�¶o¾m³�´Á�À�± µÂ¹�¿(¹%º(»iÃ(¿�»Z¯ÅÄZ³(´À ¹C°�¿%Á�ÀT± µR¹i³�´Á�À�± µÂ¹%¯�¹+¯�³Z´À ¹C°�¿)¯�¹%¯�³(´

ÆaÇCÈ É�Ê�È ËaÌ Í+ÎÂÇCÈ É ÏiÊRÐaÑ(ËvÎRÒ Ó Ê�Ô ÕoÏiÖvÐ0Î�× ØRËvÙRÒÌ�ÚÂÛÂÈ Í%ÎvÜ�ÝÂÉ+ËGÈ Þ

ÄCß�À�¯�à)Ä�áŹ%¯�â�À ·Gà â�Ä�µÂ¼Ä�ß�À�¯�à ¶hâ�²Cµa± ²Z»i¹iãhÁ;°%ÀT¯

Ä(± ²%â�»oµ�àÄZ± ²iâ)»CµTà

Ä(á�¹%¯�â)À ·và�ÄZ± ²iâ�»CµTàÄ(á�¹%¯�â)À ·và�ÄZ± ²iâ�»CµTàÄZ± ²iâ)»CµTà

Ó Ò ÛTÒ × Ì(ÖG× Ø�Ò äRå�æ ÒÎÂÌ ÐRÈ Û�Ì Ý�ç ÊÂÆ�æ ÐvèÂÇOÆOé æ Ü(ÇOêvëÓ Ò ÛTÒ × ÌZÙaØRÓ × ìGØRÝ�í�× Ø�ÒÎÂÌ ÐÂÈ Û�Ì Ý�ÊTËvÓ Ó Ò ÛTÒ × Ì+ÙvØÂÓ × ìGØÂÝ�í)× ØRÒÎÂÌ ÆOÚ�Û�ØRìaÝRÆvËvÙ�Ø�Ò Ý�È Ó Ò ÛTÒ × Ì+ÙÂ× ØRÒ äRå�æ ÒÌ ËGØRÒ × ØÂÙÂÝÂÑTîiÛÂÈ É Ó Ò ÛTÒ × ÌZÙ�Ø�Ó × ìGØRÝRí�ï ËGØ�ì0ï ËGØÂìÝTð�Ò ñaÙ�Ó Æ�ÞRÌTï ÝÂÓ

ÄCßIÀT¯

ò�óCôiõ�ó+öÂ÷ùø�úZöRû úZü)ý ó%þ�û ÷ùÿ���� û ÷Zþvõvö��(÷��Oõaû �)÷Kþhû �(ý úiõ���ö

Figure 9: Extern variables and classes used in funtions in SWARM ISS

3.2 Approach and SolutionAlso some global functions to use those external variables are described. To protect the functions

in the source code of ISS from referencing the global variables directly, they are passed to thefunctions as one of the parameters. CArmProc pArm is first selected to be localized because it ismost frequently used in ARM ISS. SpecC is ANSI-C based system level description language anddoes not recognize class object. So to make the extern variables localized, structure object will beused instead of class object. All extern variables are packed into one structure called SArmProc.

12

Page 18: Design Exploration using Multiple ARM Instruction Set ...

CArmProc pArm is hidden as an member of struct SArmProc to make the SpecC wrapperof ARM ISS compiled.

struct SarmProc is declared as an empty struct in swarm sim.h which means that it doesnot have any members in it because ANSI-C compiler does not understand the class declaration andcreates compiler error. So the real struct SArmProc is defined again in swarm sim.cpp likebelow.struct SArmProc{CArmProc* pArm;};

And then it is declared as struct SArmProc* pArm. To hide all of detailed informationrelated to Class, pArm will be declared as a pointer of SArmProc. So &pArm is passed to initfunction like result = init(argc,argv,&pArm); And then all of objects like pArm andpMemory necessary to run simulation will be created in init() function.

Now most of function calls in SWARM source code have CArmProc* pArm as a func-tion parameter. While modifying source code, atexit() is updated different from the way tomodify other functions. atexit( PrintISSCycles ) is executed when simulation ended.atexit() has function pointer as a parameter which points to the function. That function shouldhave void pointer as a parameter. But PrintISSCycles does have pArm as a parameter due tolocalization of pArm. Now PrintISSCylces() is inserted into class CArmProc as the memberfunction of class CArmProc.

3.3 ImplementationAll extern variables are removed from the original source code for ARM instruction simulator

one by one. Finally all of extern variables are merged into structure like below.

struct SArmProc{CArmProc* pArm;char* pMemory;OPTS opts;PINOUT pinout;uint32 t pcTrace[PC TRACE MAX];unsigned int pcTracePos;unsigned int pcChangeCounter;uint32 t continueSwarm;};

To verify all updates for new ARM instruction simulator, JPEG simulation with one ARM ISSis run and the results match the original ARM ISS.

13

Page 19: Design Exploration using Multiple ARM Instruction Set ...

3.4 InstallationSource codes for multiple SWARM instruction set simulators and SpecC wrappers for multi-

ple SWARM ISSs are imported into CVS repository under /home/lecs/cvs/multi swarm.They are also checked out under /home/lecs/chkout/multi swarm. multi swarm di-rectory consists of two directories. One is arm7tdmiwrapper to include SpecC wrapper formultiple SWARM ISS and the other is swarm which contains source codes for multiple SWARMISSs. Actually src directory under swarm contains all source codes for multiple SWARM ISSs.

14

Page 20: Design Exploration using Multiple ARM Instruction Set ...

4 Multiple CPU ImplementationWe will now describe two implementations of the JPEG encoder with multiple ARM CPUs.

4.1 JPEG Architecture using 2 ARM CPUsTo run JPEG simulation with 2 CPUs, new architecture platform with 2 CPUs needs to be devised.

4.1.1 Communication via Hardware Block

HW2 is used as an intermediary processor element to connect CPU1 BUS1 and CPU2 BUS eachother because there is no bridge currently available in SCE library. chendct1 and chendct2 areassigned to HW1 like JPEG platform with one CPU. CPU1 handles quantize behavior. And HW2is taking care of zigzag. CPU2 performs huffman. The detailed JPEG platform with 2 CPUsis shown in Figure 10. The detailed steps in SCE for this implementations are listed in AppendixA.2.

� ��� ����� ���

������� ��� �

� �� !�#" � �%$ ��& " � ��&('

� �� )��" � � �

SYSTEM CHART OF JPEGENCODER WITH TWO CPUs

*,+.-0/�1�23-54

��� )67�98

�(:

��;

+�< =?>A@�B3C�D

EGFIHEGFJHEGFIHEGFJHLKK KKNMPO�Q?R�SUTVO?WXHZY.O�Q[R�S?T�OZW!\M]O�Q[R�SUT�O?W^H[Y.OVQ?R�S_T�OZW`\MPO�Q?R�SUTVO?WXHZY.O�Q[R�S?T�OZW!\M]O�Q[R�SUT�O?W^H[Y.OVQ?R�S_T�OZW`\a0b3cdHa0bNcdHa0b3cdHa0bNcdHLKK KKNMfeNgZh�SZWji klRMfe3gZh�S[Wmi klRMfeNgZh�SZWji klRMfe3gZh�S[Wmi klREGFn\EGFo\EGFn\EGFo\fKKKKNMokUi p�k�hUpMnk_i p�k�h_pMokUi p�k�hUpMnk_i p�k�h_pa0b3cq\a0bNcq\a0b3cq\a0bNcq\fKK KKNM]Q_g?r)rAsth�SM]QUg[r)r^sth�SM]Q_g?r)rAsth�SM]QUg[r)r^sth�S

��� $*,+.-5u[1�23-54

����� $

v�w�8jx7yjz 8�{|w7}�y

Figure 10: System architecture for JPEG encoder using 2 CPUs

15

Page 21: Design Exploration using Multiple ARM Instruction Set ...

4.1.2 Communication via Bridge

If the bridge HW is available, JPEG platform will be drawn like in Figure 11. HW2 is replaced withbridge HW and then HW2 is now the slave of CPU2 BUS. There are some advantages in this JPEGplatform because CPU1 can access CPU2 BUS through bridge. When bridge is used to connect thebuses each other, it is safe to disallow the bidirectional access because it might cause the deadlockand then the whole system might get stuck. The JPEG platform with bridge allows more flexibilityin the architecture of Platform than the architecture without bridge. Also it will be better to supportthe multiple masters in each bus because the bridge hardware usually is the slave of one bus andalso it is the master of the other bus. Eventually there exist two masters on the bus. Also it is veryimportant for CPU to be able to access all of slaves on the platform due to the memory mappedI/0 to control slaves. In Figure 11, bridge is the slave of CPU1 BUS and it is also the master ofCPU2 BUS.

~ ��� ����� ��~

������� ��� �

� ���!�#� � �%� ��� � � ���(�

� ���)��� � � �

SYSTEM CHART OF JPEGENCODER WITH TWO CPUs

�,�.�0�����3�5�

��� �)�7�9�

�(�

���

��� �?�A�� 3¡�¢

£G¤I¥£G¤J¥£G¤I¥£G¤J¥L¦¦ ¦¦N§P¨�©?ª�«U¬V¨?­X¥Z®.¨�©[ª�«?¬�¨Z­!¯§]¨�©[ª�«U¬�¨?­^¥[®.¨V©?ª�«_¬�¨Z­`¯§P¨�©?ª�«U¬V¨?­X¥Z®.¨�©[ª�«?¬�¨Z­!¯§]¨�©[ª�«U¬�¨?­^¥[®.¨V©?ª�«_¬�¨Z­`¯°0±3²d¥°0±N²d¥°0±3²d¥°0±N²d¥L¦¦ ¦¦N§f³N´Zµ�«Z­j¶ ·lª§f³3´Zµ�«[­m¶ ·lª§f³N´Zµ�«Z­j¶ ·lª§f³3´Zµ�«[­m¶ ·lª£G¤n¯£G¤o¯£G¤n¯£G¤o¯f¦¦¦¦N§o·U¶ ¸�·�µU¸§n·_¶ ¸�·�µ_¸§o·U¶ ¸�·�µU¸§n·_¶ ¸�·�µ_¸°0±3²q¯°0±N²q¯°0±3²q¯°0±N²q¯f¦¦ ¦¦N§]©_´?¹)¹Aºtµ�«§]©U´[¹)¹^ºtµ�«§]©_´?¹)¹Aºtµ�«§]©U´[¹)¹^ºtµ�«

� � � »7¼�½

�,�.�5¾[���3�5�

����� � ��� �

¿�À��jÁ7Âjà ��Ä|À7Å�Â

Figure 11: System architecture for JPEG encoder using 2 CPUs with bridge

4.2 JPEG Architecture using 3 ARM CPUs4.2.1 Communication via Hardware Block

To run JPEG simulation with 3 CPUs, new architecture platform with 3 CPUs needs to be devised.HW1 still needs to be used as intermediary processor elements to connect CPU BUS1 and CPU BUS2

16

Page 22: Design Exploration using Multiple ARM Instruction Set ...

together because there is no bridge currently available in SCE library. HW2 is used to connectCPU BUS2 and CPU BUS3. chendct1 is assigned to CPU1. HW1 handles chendct2 differentfor JPEG platform with one CPU and 2 CPUs. And CPU2 is taking care of quantize. HW2executes zigzag. CPU3 performs huffman. The detailed JPEG platform with 3 CPUs is shownin Figure 12.

Æ|Ç%È ÉËÊ(Ì Ê(Æ

Í�Î7Ï#Ð Ñ#ÒdÓ

Ô ÕZÖjÏ�× È Ç Ð É�Ø × È Ç%Ø9Ù

Ô ÕZÖjÏ�× È Ç Ó

SYSTEM CHART OF JPEGENCODER WITH THREE CPUs

ÚÛ3Ü0Ý_ÞZßqÜáà

Í�â

Í�ãÛËä åUæ^ç�è,é�ê

ëíìGîíïëíìGîðïëíìGîíïëíìGîðïòññ ññqóõô.öV÷.øUùNô?ú7ïóõô.öU÷.øVùËô_ú�ïóõô.öV÷.øUùNô?ú7ïóõô.öU÷.øVùËô_ú�ïû5üýïû5üþïû5üýïû5üþïòññññqóÿô.ö_÷.ø�ù.ôUú��ó ô.öU÷.øVùËô_ú��óÿô.ö_÷.ø�ù.ôUú��ó ô.öU÷.øVùËô_ú��ëíìGî��ëíìGî��ëíìGî��ëíìGî��Pññññqó������ø_ú� �_÷ó�� ��.ø?ú� �?÷ó������ø_ú� �_÷ó�� ��.ø?ú� �?÷û5ü��û5ü��û5ü��û5ü��Pññññqó��� ������ó��� �������ó��� ������ó��� �������ëíìGî��ëíìGî��ëíìGî��ëíìGî��Ëññ ññqóJö�����������øóJö����������.øóJö�����������øóJö����������.ø

Ñ�Ò ÐÚÛ3Ü��?ÞZßqÜáà

Í�Î7Ï��

Í�Î(Ï ÓÚÛ3Ü! ZÞZßqÜáà

"$#�%'&)('* %,+-#)./(

Figure 12: System architecture for JPEG encoder using 3 CPUs

4.2.2 Communication via Bridge

If the bridge HW is available, JPEG platform will be drawn like in Figure 13. HW1 is replacedwith bridge HW and then HW1 is now the slave of CPU2 BUS.

17

Page 23: Design Exploration using Multiple ARM Instruction Set ...

0-132 4$576 570

8:9,;=<

>:?!@ A BC ;ED 2 1 < 4�F D 2 13FHG

A BC ;ED 2 1 @

SYSTEM CHART OF JPEGENCODER WITH THREE CPUs

I�J K�L�MONPK�Q

8)R

8:S

J$T U�VXWZY\[^]

_a`�bac_a`�bdc_a`�bac_a`�bdc�ee eePfhgjikjl�mng�o,cfhgji�kjlm$g�oZcfhgjikjl�mng�o,cfhgji�kjlm$g�oZcp�qrcp�qscp�qrcp�qsc�eeeePftgji�kjl�mjg�o�ufvgji�kjlm$g�o�uftgji�kjl�mjg�o�ufvgji�kjlm$g�o�u_a`�b�u_a`�b�u_a`�b�u_a`�b�uweeeePf�x�yz�l�o�{ |�kf�x yzjl�o�{ |�kf�x�yz�l�o�{ |�kf�x yzjl�o�{ |�kp�q�up�q�up�q�up�q�uweeeePf�|�{ }�|�z}f�|�{ }�|�z�}f�|�{ }�|�z}f�|�{ }�|�z�}_a`�b�~_a`�b�~_a`�b�~_a`�b�~$ee eePf�i�y�������z�lf�i�y�������zjlf�i�y�������z�lf�i�y�������zjl

>=?�<

I�J K���MONPK�Q

8:9,;��

8:97;E@I�J K!�OMONPK�Q

�)� A��=�/�

�)� A��=�O�

�$���'� � � �,�-�)� �

Figure 13: System architecture for JPEG encoder using 3 CPUs with bridge

18

Page 24: Design Exploration using Multiple ARM Instruction Set ...

5 Experimental Results5.1 Platform Execution Times

JPEG encoder software synthesis is explored. It starts from JPEG encoder application. To obtainour performance goal, many architecture variations are analyzed and revealed. If there is a bridgeavailable in SCE library, JPEG platform having more than 3 ISS will be able to be tested. Due tothe current limitation of SCE library, the results for simulation could be obtained up to 3 ARM ISS.The results are shown in Table 2. This result is reasonable to take more time in 3 CPUs than 2CPUs because DCT1 is processed in ARM ISS in 3 CPU case differently from 2 CPU case. Usuallyhardware performs faster than software.

Number of CPU PE Mapping Execution time(us)1 DCT1,DCT2 7→ HW 142043

Quantize,Zigzag,Huffman 7→ CPU2 DCT1,DCT2 7→ HW1 92979

Quantize 7→ CPU1Zigzag 7→ HW2

Huffman 7→ CPU23 DCT1 7→ CPU1 93467

DCT2 7→ HW1Quantize 7→ CPU2

Zigzag 7→ HW2Huffman 7→ CPU3

Table 2: Execution time depending on the number of CPUs

5.2 Multi-ARM Simulation TimesSimulation times are obtained by using /usr/bin/time command. Simulation times are

shown in Table 3 according to the number of CPU. As assumed, simulation time is increasingproportional to the number of CPU.

Simulation TimeNumber of CPU User(sec) System(sec) Total(User+System)

1 554.59 91.06 651.962 837.70 164.33 1002.333 1250.63 232.01 1482.62

Table 3: Simulation time depending on the number of ARM ISS

19

Page 25: Design Exploration using Multiple ARM Instruction Set ...

6 Conclusions and Future WorkJPEG encoder software synthesis is explored. It starts from JPEG encoder application. SWARM

ISS library in SCE has been successfully patched such that multiple ARM ISS instances of thesimulator can run independently within the same platform model. Our JPEG example runs now finewith 1, 2, or 3 ARM CPUs.

Now SCE does not support the multiple C code generation concurrently. So C codes for multipleARM ISS need to be generated separately until the next SCE is released. Regarding to HW IPs, ifthe bridge HW is available, JPEG platform might be able to have better architecture than now. Whenbridge is used to connect the buses each other, it is safe to disallow the bidirectional access becauseit might cause the deadlock and then the whole system might get stuck. The JPEG platform withbridge allows more flexibility in the architecture of Platform than the architecture without bridge.Due to unavailability of bridge, JPEG ISS model having more than 3 ARM ISSs is impossible totest. Also it will be better to support the multiple masters in each bus because the bridge hardwareusually is the slave of one bus and also it is the master of the other bus. Eventually there existtwo masters on the bus. Also it is very important for CPU to be able to access all of slaves on theplatform due to the memory mapped I/0 to control slaves.

20

Page 26: Design Exploration using Multiple ARM Instruction Set ...

References[1] Samar Abdi. JPEG Encoder Software Synthesis,Talks in Class, Fall 2008.

[2] Samar Abdi, Junyu Peng, Haobo Yu, Dongwan Shin, Andreas Gerstlauer, Rainer Domer, andDaniel Gajski. System-on-Chip Environment (SCE Version 2.2.0 Beta): Tutorial. TechnicalReport CECS-TR-03-41, Center for Embedded Computer Systems, University of California,Irvine, July 2003.

[3] Michael Dales. SWARM, ARM instruction simulator,GNU software , 2000.

[4] Rainer Domer. SOC Software Synthesis, Lecture Notes, Fall 2008.

[5] Rainer Domer, Andreas Gerstlauer, and Daniel Gajski. SpecC Language Reference Manual,Version 2.0. SpecC Technology Open Consortium, http://www.specc.org, December 2002.

[6] Daniel D. Gajski, Jianwen Zhu, Rainer Domer, Andreas Gerstlauer, and Shuqing Zhao. SpecC:Specification Language and Design Methodology. Kluwer Academic Publishers, 2000.

[7] Andreas Gerstlauer, Rainer Domer, Junyu Peng, and Daniel D. Gajski. System Design: APractical Guide with SpecC. Kluwer Academic Publishers, 2001.

[8] G.Schirner, G.Sachdeva, A. Gerstlauer, and R. Domer. Modeling, Simulation and Synthesis inan Embedded Software Design Flow for an ARM Processor. Technical Report CECS-TR-06-06,Center for Embedded Computer Systems, University of California, Irvine, May 2006.

[9] International Telecommunication Union (ITU). Digital Compression and Coding of Continous-Tone Still Images, September 1992. ITU Recommendation T.81.

21

Page 27: Design Exploration using Multiple ARM Instruction Set ...

A AppendixA.1 SCE Design Steps for Single-CPU Implementation

Here a r e some d e t a i l e d i n s t r u c t i o n s t o run JPEG s i m u l a t i o n wi th one ARM ISS .∗ S p e c i f i c a t i o n s t e p− new p r o j e c t− i m p o r t J p e g P l a t f o r m . sc− add t o p r o j e c t a s ” P l a t f o r m S p e c . s i r ”− view t h e new h i e r a r c h y c h a r t ( e n t i r e l y , i n c l . c o n n e c t i v i t y )− compi l e and s i m u l a t e

∗ A r c h i t e c t u r e Ref inemen t s t e p− choose ” p l a t f o r m ” as top− l e v e l− a l l o c a t e two ARM 7TDMI as ”CPU”− a l l o c a t e two HW Standard as ”HW”− a l l o c a t e two HW Virtual a s ” IO Uni t1 ” and ” IO Uni t2 ”− map d a t a i n t o IO Uni t1− map d a t a o u t t o IO Uni t2− map c i n t o IO Uni t1− map c o u t t o IO Uni t2− map d c t 1 and d c t 2 t o HW− map q u a n t i z e , z i g z a g and huffman t o CPU− pe r fo rm a r c h i t e c t u r e r e f i n e m e n t ( no t i m i n g back a n n o t a t i o n )− rename g e n e r a t e d model a s ” P l a t f o r m A r c h ”− compi l e and s i m u l a t e

∗ S c h e d u l i n g Ref inemen t− use p r i o r i t y −based s c h e d u l i n g f o r CPU

( p r i o r i t i e s 1 , 2 , 3 f o r quan , z igz , hu f f , r e s p e c t i v e l y )− l e a v e I O U n i t s a l o n e− l e a v e HW a l o n e− pe r fo rm s c h e d u l i n g r e f i n e m e n t ( bo th s t a t i c and dynamic )− rename g e n e r a t e d model a s ” P l a t f o r m S c h e d ”− compi l e and s i m u l a t e

∗ Network Ref inemen t− rename ” Bus0 ” t o ”CPU Bus”− add DblHndShkBus as ”HW Bus”− c r e a t e a d d i t i o n a l ” P o r t 1 ” f o r HW− c o n n e c t HW ( P o r t 0 ) t o CPU Bus as ” s l a v e 4 ”− c o n n e c t HW ( P o r t 1 ) t o HW Bus as ” Mas te r ”− c o n n e c t IO Uni t2 ( P o r t 0 ) t o CPU Bus as ” s l a v e 5 ”− c o n n e c t IO Uni t1 ( P o r t 0 ) t o HW Bus as ” S l a v e ”− pe r fo rm network r e f i n e m e n t− rename g e n e r a t e d model a s ” P l a t f o r m N e t ”− compi l e and s i m u l a t e

∗ Communicat ion Ref inemen t− p r e s s ”CPU Bus” t a b− s e l e c t s t a r t a d d r e s s 0 x50000000 f o r c l i n k C P U I O U n i t 2− s e l e c t s t a r t a d d r e s s 0 x40000000 f o r c link HW CPU− p r e s s ” Bus0 ” t a b− s e l e c t s t a r t a d d r e s s 0 x0000 f o r c l i n k I O U n i t 1 H W− pe r fo rm communica t ion r e f i n e m e n t ( pin−a c c u r a t e model )− rename g e n e r a t e d model a s ” PlatformComm ”− compi l e and s i m u l a t e− S i m u l a t i o n g e t s s t u c k a f t e r m o n i t o r r e c e i v e s b l o c k 1 7 7 .− p r e s s CTRL−C t o s t o p s i m u l a t i o n

22

Page 28: Design Exploration using Multiple ARM Instruction Set ...

∗ Code g e n e r a t i o n f o r CPU− c l o s e PlatformComm . s i r− s i r n o t e PlatformComm ARM 7TDMI Core 20000 0 CPU ’ PE HAL MODEL=”

ARM 7TDMI HAL 20000 0 CPU ” ’− re−open PlatformComm . s i r− pe r fo rm code g e n e r a t i o n f o r CPU1 ( f o r ARM 7TDMI OS 20000 0 CPU1 NET )

( s t o r e i n f i l e s ”CPU / CPU . c ” and ”CPU /CPU . h ” )− rename g e n e r a t e d model a s ” PlatformCommC ”− compi l e and s i m u l a t e− S i m u l a t i o n g e t s s t u c k a f t e r m o n i t o r r e c e i v e s b l o c k 1 7 7 .− p r e s s CTRL−C t o s t o p s i m u l a t i o n

∗ Cross−c o m p i l a t i o n f o r CPU− we w i l l c r o s s−compi l e t h e g e n e r a t e d C code i n t h e CPU d i r e c t o r y

cd CPU− i n s p e c t t h e g e n e r a t e d ANSI−C code i n f i l e s ”CPU . c ” and ”CPU . h ”

I t a p p e a r s t h a t t h e g e n e r a t e d C code c o n v e r t s our back−a n n o t a t e d t i m i n ge s t i m a t e s i n t o TaskDelay s t a t e m e n t s t h a t suspend a t a s k f o r t h e g i v e np e r i o d o f t ime . Thus , t h e s o f t w a r e t a s k s a r e a c t u a l l y p u t t o s l e e p f o rour back−a n n o t a t e d w a i t f o r ( ) s t a t e m e n t s ! !

A f t e r t h e C code g e n e r a t i o n s t e p , t a k e a look a t t h e f i l e CPU1 . h .At t h e t o p of t h e f i l e , you w i l l f i n d t h e f o l l o w i n g d e f i n i t i o n :# d e f i n e WAITFOR(X) TaskDelay ( ( u n s i g n e d long ) ( (X) / 1 0 0 0 ) )P l e a s e change t h i s t o t h e f o l l o w i n g :# d e f i n e WAITFOR(X) / / n o t h i n g !Th i s w i l l make t h e bogus w a i t i n g d i s a p p e a r . Th i s changes s h o u l d be done a g a i n i n CPU2 . h− t o compi l e t h e g e n e r a t e d code t o g e t h e r wi th t h e microC−OS−I I

( and some o t h e r f i l e s ) , we ’ l l use a p r e p a r e d M a k e f i l e− u p d a t e M a k e f i l e f o r r e f e r e n c e p a t h f o r SWARM ISS and u s e r s o u r c e code l i k e below .

# r e f e r e n c e t o SWARM ISS# ISS DIR = / o p t / pkg / sw / swarm

ISS DIR = / home / l e c s / c h k o u t / p r o j e c t / mu l t i swarm / swarm

# USER s p e c i f i c s o u r c e f i l eUSR SRC := CPU . cUSR HDR := CPU . h

− t y p e ” make ” a f t e r m o d i f i c a t i o n o f M a k e f i l e− t h e g e n e r a t e d ARM−e x e c u t a b l e i s found i n f i l e ” userCode ”

l s− copy t h e g e n e r a t e d e x e c u t a b l e userCode i n t o your SCE working d i r e c t o r y

( so t h a t t h e SCE s i m u l a t i o n can f i n d i t )cp userCode . .

∗ I n s e r t i o n o f ISS model f o r CPU− s e l e c t Comm. s i r ( i n SCE P r o j e c t window )− Edi t−>I m p o r t D e s i g n

” / home / l e c s / c h k o u t / p r o j e c t / mu l t i swarm / arm7tdmiwrapper / a r m 7 t d m i i s s . s i r ”( t h i s w i l l show up i n t h e l i s t o f unused b e h a v i o r s i n t h e working windowas b e h a v i o r ”ARM 7TDMI ISS ” )

− l o c a t e t h e i n s t a n c e ”CPU” of b e h a v i o r ” ARM 7TDMI Core 20000 0 CPU ”i n t h e h i e r a r c h y browse r

− r e p l a c e t h i s a b s t r a c t model w i th t h e ISS model( r i g h t −c l i c k ” ARM 7TDMI Core 20000 0 CPU ” , ChangeType t o ”ARM 7TDMI ISS ” )− go t o P r o j e c t menu −> S e t t i n g s− change I m p o r t p a t h t o ” . : / home / l e c s / c h k o u t / p r o j e c t / mu l t i swarm / arm7tdmiwrapper ”− change L i b r a r y p a t h t o ” / home / l e c s / c h k o u t / p r o j e c t / mu l t i swarm / swarm ”− s e t V e r b o s i t y l e v e l and Warning l e v e l t o 2 .− save t h i s d e s i g n model ( a s a new model i n t h e p r o j e c t )

23

Page 29: Design Exploration using Multiple ARM Instruction Set ...

F i l e−>SaveAs ” P l a t f o r m I S S . s i r ” ( i n your working d i r e c t o r y ! )− compi l e and s i m u l a t e .− S i m u l a t i o n g e t s s t u c k a f t e r m o n i t o r r e c e i v e s b l o c k 1 7 7 .

139632: Moni to r r e c e i v e d b l o c k 174 ( wi th 3 b y t e s ) .139632: Encoding f o r b l o c k 174 took 21154 micro s e c o n d s .140444: Moni to r r e c e i v e d b l o c k 175 ( wi th 15 b y t e s ) .140444: Encoding f o r b l o c k 175 took 21179 micro s e c o n d s .141251: Moni to r r e c e i v e d b l o c k 176 ( wi th 9 b y t e s ) .141251: Encoding f o r b l o c k 176 took 21199 micro s e c o n d s .142043: Moni to r r e c e i v e d b l o c k 177 ( wi th 6 b y t e s ) .142043: Encoding f o r b l o c k 177 took 21188 micro s e c o n d s .

A.2 SCE Design Steps for Two-CPU Implementation

Here a r e some d e t a i l e d i n s t r u c t i o n s t o run JPEG s i m u l a t i o n wi th 2 ARM ISSs which a r e q u i t es i m i l a r t o t h e ones wi th one CPU .

∗ S p e c i f i c a t i o n s t e p− new p r o j e c t− i m p o r t J p e g P l a t f o r m . sc− add t o p r o j e c t a s ” P l a t f o r m S p e c . s i r ”− view t h e new h i e r a r c h y c h a r t ( e n t i r e l y , i n c l . c o n n e c t i v i t y )− compi l e and s i m u l a t e

∗ A r c h i t e c t u r e Ref inemen t s t e p− choose ” p l a t f o r m ” as top− l e v e l− a l l o c a t e two ARM 7TDMI as ”CPU1” and ”CPU2” r e s p e c t i v e l y− a l l o c a t e two HW Standard as ”HW1” and ”HW2”− a l l o c a t e two HW Virtual a s ” IO Uni t1 ” and ” IO Uni t2 ”− map d a t a i n t o IO Uni t1− map d a t a o u t t o IO Uni t2− map c i n t o IO Uni t1− map c o u t t o IO Uni t2− map d c t 1 and d c t 2 t o HW1− map q u a n t i z e t o CPU1− map z i g z a g t o HW2− map Huffman t o CPU2− pe r fo rm a r c h i t e c t u r e r e f i n e m e n t ( no t i m i n g back a n n o t a t i o n )− rename g e n e r a t e d model a s ” P l a t f o r m A r c h ”− compi l e and s i m u l a t e

∗ S c h e d u l i n g Ref inemen t− use p r i o r i t y −based s c h e d u l i n g f o r CPU1 and CPU2

( no p r i o r i t i e s needed f o r CPU1 and CPU2 b e c a u s e on ly one b e h a v i o r a s s i g n e d )− l e a v e I O U n i t s a l o n e− l e a v e HW1 and HW2 a l o n e− pe r fo rm s c h e d u l i n g r e f i n e m e n t ( bo th s t a t i c and dynamic )− rename g e n e r a t e d model a s ” P l a t f o r m S c h e d ”− compi l e and s i m u l a t e

∗ Network Ref inemen t− rename ” Bus0 ” t o ” CPU1 Bus ”− rename ” Bus1 ” t o ” CPU2 Bus ”− add DblHndShkBus as ”HW Bus”− c r e a t e a d d i t i o n a l ” P o r t 1 ” f o r HW1− c r e a t e a d d i t i o n a l ” P o r t 1 ” f o r HW2− c o n n e c t HW1 ( P o r t 0 ) t o CPU1 Bus as ” s l a v e 4 ”− c o n n e c t HW1 ( P o r t 1 ) t o HW Bus as ” Mas te r ”− c o n n e c t HW2 ( P o r t 0 ) t o CPU1 Bus as ” s l a v e 5 ”− c o n n e c t HW2 ( P o r t 1 ) t o CPU2 Bus as ” s l a v e 4 ”− c o n n e c t IO Uni t2 ( P o r t 0 ) t o CPU2 Bus as ” s l a v e 5 ”

24

Page 30: Design Exploration using Multiple ARM Instruction Set ...

− c o n n e c t IO Uni t1 ( P o r t 0 ) t o HW Bus as ” S l a v e ”− pe r fo rm network r e f i n e m e n t− rename g e n e r a t e d model a s ” P l a t f o r m N e t ”− compi l e and s i m u l a t e

∗ Communicat ion Ref inemen t− p r e s s ” CPU1 Bus ” t a b− s e l e c t s t a r t a d d r e s s 0 x50000000 f o r c link CPU1 HW2− s e l e c t s t a r t a d d r e s s 0 x40000000 f o r c link HW1 CPU1− p r e s s ” CPU2 Bus ” t a b− s e l e c t s t a r t a d d r e s s 0 x50000000 f o r c l i n k C P U 2 I O U n i t 2− s e l e c t s t a r t a d d r e s s 0 x40000000 f o r c link HW2 CPU2− p r e s s ” Bus0 ” t a b− s e l e c t s t a r t a d d r e s s 0 x0000 f o r c l i n k I O U n i t 1 H W− pe r fo rm communica t ion r e f i n e m e n t ( pin−a c c u r a t e model )− rename g e n e r a t e d model a s ” PlatformComm ”− compi l e and s i m u l a t e− S i m u l a t i o n s h o u l d be working f i n e . I t does n o t g e t s t u c k d i f f e r e n t from one CPU c a s e .

∗ Code g e n e r a t i o n f o r CPU1 and CPU2Now SCE does n o t s u p p o r t t h e m u l t i p l e C Code g e n e r a t i o n on t h e f l y .So C Codes f o r bo th CPU1 and CPU2 needs t o be g e n e r a t e d s e p a r a t e l y u n t i l t h e n e x t v e r s i o n

o f SCE i s r e l e a s e d .− c l o s e PlatformComm . s i r− s i r n o t e PlatformComm ARM 7TDMI Core 20000 0 CPU1 ’ PE HAL MODEL=”

ARM 7TDMI HAL 20000 0 CPU1 ” ’− s i r n o t e PlatformComm ARM 7TDMI Core 20000 0 CPU2 ’ PE HAL MODEL=”

ARM 7TDMI HAL 20000 0 CPU2 ” ’− re−open PlatformComm . s i r− pe r fo rm code g e n e r a t i o n f o r CPU1 ( f o r ARM 7TDMI OS 20000 0 CPU1 NET )

( s t o r e i n f i l e s ”CPU1 / CPU1 . c ” and ”CPU1 / CPU1 . h ” )− rename g e n e r a t e d model a s ” PlatformCommC1 ”− compi l e and s i m u l a t e− pe r fo rm code g e n e r a t i o n f o r CPU2 ( f o r ARM 7TDMI OS 20000 0 CPU2 NET )

( s t o r e i n f i l e s ”CPU2 / CPU2 . c ” and ”CPU2 / CPU2 . h ” )− rename g e n e r a t e d model a s ” PlatformCommC2 ”− compi l e and s i m u l a t e− S i m u l a t i o n s h o u l d be working f i n e . I t does n o t g e t s t u c k d i f f e r e n t from one CPU c a s e .

∗ Cross−c o m p i l a t i o n f o r CPU1Now SCE does n o t s u p p o r t t h e m u l t i p l e ARM ISS . Only one ARM ISS s i m u l a t i o n i s a l l o w e d .− we w i l l c r o s s−compi l e t h e g e n e r a t e d C code i n t h e CPU1 d i r e c t o r y

cd CPU1− i n s p e c t t h e g e n e r a t e d ANSI−C code i n f i l e s ”CPU1 . c ” and ”CPU1 . h ”

I t a p p e a r s t h a t t h e g e n e r a t e d C code c o n v e r t s our back−a n n o t a t e d t i m i n ge s t i m a t e s i n t o TaskDelay s t a t e m e n t s t h a t suspend a t a s k f o r t h e g i v e np e r i o d o f t ime . Thus , t h e s o f t w a r e t a s k s a r e a c t u a l l y p u t t o s l e e p f o rour back−a n n o t a t e d w a i t f o r ( ) s t a t e m e n t s ! !

A f t e r t h e C code g e n e r a t i o n s t e p , t a k e a look a t t h e f i l e CPU1 . h .At t h e t o p of t h e f i l e , you w i l l f i n d t h e f o l l o w i n g d e f i n i t i o n :# d e f i n e WAITFOR(X) TaskDelay ( ( u n s i g n e d long ) ( (X) / 1 0 0 0 ) )P l e a s e change t h i s t o t h e f o l l o w i n g :# d e f i n e WAITFOR(X) / / n o t h i n g !Th i s w i l l make t h e bogus w a i t i n g d i s a p p e a r . Th i s changes s h o u l d be done a g a i n i n CPU2 . h− t o compi l e t h e g e n e r a t e d code t o g e t h e r wi th t h e microC−OS−I I

( and some o t h e r f i l e s ) , we ’ l l use a p r e p a r e d M a k e f i l e− u p d a t e M a k e f i l e f o r r e f e r e n c e p a t h f o r SWARM ISS and u s e r s o u r c e code l i k e below .

# r e f e r e n c e t o SWARM ISS# ISS DIR = / o p t / pkg / sw / swarm

25

Page 31: Design Exploration using Multiple ARM Instruction Set ...

ISS DIR = / home / l e c s / c h k o u t / p r o j e c t / mu l t i swarm / swarm

# USER s p e c i f i c s o u r c e f i l eUSR SRC := CPU1 . cUSR HDR := CPU1 . h

− t y p e ” make ” a f t e r m o d i f i c a t i o n o f M a k e f i l e− t h e g e n e r a t e d ARM−e x e c u t a b l e i s found i n f i l e ” userCode ”

∗ Cross−c o m p i l a t i o n f o r CPU2Now SCE does n o t s u p p o r t t h e m u l t i p l e ARM ISS . Only one ARM ISS s i m u l a t i o n i s a l l o w e d .

− we w i l l c r o s s−compi l e t h e g e n e r a t e d C code i n t h e CPU1 d i r e c t o r ycd CPU2

− i n s p e c t t h e g e n e r a t e d ANSI−C code i n f i l e s ”CPU2 . c ” and ”CPU2 . h ”more CPU2 . cmore CPU2 . h

− t o compi l e t h e g e n e r a t e d code t o g e t h e r wi th t h e microC−OS−I I( and some o t h e r f i l e s ) , we ’ l l use a p r e p a r e d M a k e f i l e

− u p d a t e M a k e f i l e f o r r e f e r e n c e p a t h f o r SWARM ISS and u s e r s o u r c e code l i k e below .# r e f e r e n c e t o SWARM ISS# ISS DIR = / o p t / pkg / sw / swarm

ISS DIR = / home / l e c s / c h k o u t / p r o j e c t / mu l t i swarm / swarm

# USER s p e c i f i c s o u r c e f i l eUSR SRC := CPU2 . cUSR HDR := CPU2 . h

− t y p e ” make ” a f t e r m o d i f i c a t i o n o f M a k e f i l e− t h e g e n e r a t e d ARM−e x e c u t a b l e i s found i n f i l e ” userCode ”∗ I n s e r t i o n o f ISS model f o r CPU1 and CPU2− s e l e c t Comm. s i r ( i n SCE P r o j e c t window )− Edi t−>I m p o r t D e s i g n

” / home / l e c s / c h k o u t / p r o j e c t / mu l t i swarm / arm7tdmiwrapper / a r m 7 t d m i i s s 1 . s i r ”( t h i s w i l l show up i n t h e l i s t o f unused b e h a v i o r s i n t h e working windowas b e h a v i o r ”ARM 7TDMI ISS1 ” )

− Edi t−>I m p o r t D e s i g n” / home / l e c s / c h k o u t / p r o j e c t / mu l t i swarm / arm7tdmiwrapper / a r m 7 t d m i i s s 2 . s i r ”( t h i s w i l l show up i n t h e l i s t o f unused b e h a v i o r s i n t h e working windowas b e h a v i o r ”ARM 7TDMI ISS2 ” )

− l o c a t e t h e i n s t a n c e ”CPU1” of b e h a v i o r ” ARM 7TDMI Core 20000 0 CPU1 ”i n t h e h i e r a r c h y browse r

− r e p l a c e t h i s a b s t r a c t model w i th t h e ISS model( r i g h t −c l i c k ” ARM 7TDMI Core 20000 0 CPU1 ” , ChangeType t o ”ARM 7TDMI ISS1 ” )− l o c a t e t h e i n s t a n c e ”CPU2” of b e h a v i o r ” ARM 7TDMI Core 20000 0 CPU2 ”

i n t h e h i e r a r c h y browse r− r e p l a c e t h i s a b s t r a c t model w i th t h e ISS model( r i g h t −c l i c k ” ARM 7TDMI Core 20000 0 CPU2 ” , ChangeType t o ”ARM 7TDMI ISS2 ” )− go t o P r o j e c t menu −> S e t t i n g s− change I m p o r t p a t h t o ” . : / home / l e c s / c h k o u t / p r o j e c t / mu l t i swarm / arm7tdmiwrapper ”− change L i b r a r y p a t h t o ” / home / l e c s / c h k o u t / p r o j e c t / mu l t i swarm / swarm ”− s e t V e r b o s i t y l e v e l and Warning l e v e l t o 2 .− save t h i s d e s i g n model ( a s a new model i n t h e p r o j e c t )F i l e−>SaveAs ” P l a t f o r m I S S . s i r ” ( i n your working d i r e c t o r y ! )− compi l e and s i m u l a t e .− S i m u l a t i o n s h o u l d end l i k e below .

0 : S t i m u l u s s e n d s b l o c k 0 .0 : S t i m u l u s s e n d s b l o c k 1 .0 : S t i m u l u s s e n d s b l o c k 2 .0 : S t i m u l u s s e n d s b l o c k 3 .0 : S t i m u l u s s e n d s b l o c k 4 .0 : S t i m u l u s s e n d s b l o c k 5 .

26

Page 32: Design Exploration using Multiple ARM Instruction Set ...

0 : S t i m u l u s s e n d s b l o c k 6 .0 : S t i m u l u s s e n d s b l o c k 7 .0 : S t i m u l u s s e n d s b l o c k 8 .0 : S t i m u l u s s e n d s b l o c k 9 .0 : S t i m u l u s s e n d s b l o c k 1 0 .

UARTCTRL: S e r i a l d e v i c e S l a v e p t s [ / dev / p t s / 5 ]Note : Uploaded t h e Program−B i n a r y : . / CPU1 / userCodeUARTCTRL: S e r i a l d e v i c e S l a v e p t s [ / dev / p t s / 6 ]Note : Uploaded t h e Program−B i n a r y : . / CPU2 / userCode

0 : S t i m u l u s s e n d s b l o c k 1 1 .2 : S t i m u l u s s e n d s b l o c k 1 2 .

6 2 : S t i m u l u s s e n d s b l o c k 1 3 .123 : S t i m u l u s s e n d s b l o c k 1 4 .

R e g i s t e r IRQ R e g i s t e r IRQ 00 pFunc pFunc 183 : S t i m u l u s s e n d s b l o c k 1 5 .0 x0x5e05e0 pArg pArg 0 x0x00243 : S t i m u l u s s e n d s b l o c k 1 6 ....... . . . .

89338 : Moni to r r e c e i v e d b l o c k 172 ( wi th 4 b y t e s ) .89338 : Encoding f o r b l o c k 172 took 13623 micro s e c o n d s .89834 : Moni to r r e c e i v e d b l o c k 173 ( wi th 4 b y t e s ) .89834 : Encoding f o r b l o c k 173 took 13607 micro s e c o n d s .90329 : Moni to r r e c e i v e d b l o c k 174 ( wi th 3 b y t e s ) .90329 : Encoding f o r b l o c k 174 took 13599 micro s e c o n d s .90863 : Moni to r r e c e i v e d b l o c k 175 ( wi th 15 b y t e s ) .90863 : Encoding f o r b l o c k 175 took 13629 micro s e c o n d s .91392 : Moni to r r e c e i v e d b l o c k 176 ( wi th 9 b y t e s ) .91392 : Encoding f o r b l o c k 176 took 13653 micro s e c o n d s .91904 : Moni to r r e c e i v e d b l o c k 177 ( wi th 6 b y t e s ) .91904 : Encoding f o r b l o c k 177 took 13640 micro s e c o n d s .92406 : Moni to r r e c e i v e d b l o c k 178 ( wi th 6 b y t e s ) .92406 : Encoding f o r b l o c k 178 took 13604 micro s e c o n d s .92979 : Moni to r r e c e i v e d b l o c k 179 ( wi th 305 b y t e s ) .92979 : Encoding f o r b l o c k 179 took 13686 micro s e c o n d s .92979 : Moni to r e x i t s s i m u l a t i o n .

A.3 SCE Design Steps for Three-CPU Implementation

Here a r e some d e t a i l e d i n s t r u c t i o n s t o run JPEG s i m u l a t i o n wi th t h r e e ARM ISSs which a r eq u i t e s i m i l a r t o t h e ones wi th one CPU .

∗ S p e c i f i c a t i o n s t e p− new p r o j e c t− i m p o r t J p e g P l a t f o r m . sc− add t o p r o j e c t a s ” P l a t f o r m S p e c . s i r ”− view t h e new h i e r a r c h y c h a r t ( e n t i r e l y , i n c l . c o n n e c t i v i t y )− compi l e and s i m u l a t e

∗ A r c h i t e c t u r e Ref inemen t s t e p− choose ” p l a t f o r m ” as top− l e v e l− a l l o c a t e t h r e e ARM 7TDMI as ”CPU1” , ”CPU2” and ”CPU3” r e s p e c t i v e l y− a l l o c a t e two HW Standard as ”HW1” and ”HW2”− a l l o c a t e two HW Virtual a s ” IO Uni t1 ” and ” IO Uni t2 ”− map d a t a i n t o IO Uni t1− map d a t a o u t t o IO Uni t2− map c i n t o IO Uni t1− map c o u t t o IO Uni t2

27

Page 33: Design Exploration using Multiple ARM Instruction Set ...

− map d c t 1 t o CPU1− map d c t 2 t o HW1− map q u a n t i z e t o CPU2− map z i g z a g t o HW2− map Huffman t o CPU3− pe r fo rm a r c h i t e c t u r e r e f i n e m e n t ( no t i m i n g back a n n o t a t i o n )− rename g e n e r a t e d model a s ” P l a t f o r m A r c h ”− compi l e and s i m u l a t e

∗ S c h e d u l i n g Ref inemen t− use p r i o r i t y −based s c h e d u l i n g f o r CPU1 , CPU2 and CPU3

( no p r i o r i t i e s needed f o r CPU1 , CPU2 and CPU3 b e c a u s e on ly one b e h a v i o r a s s i g n e d )− l e a v e I O U n i t s a l o n e− l e a v e HW1 and HW2 a l o n e− pe r fo rm s c h e d u l i n g r e f i n e m e n t ( bo th s t a t i c and dynamic )− rename g e n e r a t e d model a s ” P l a t f o r m S c h e d ”− compi l e and s i m u l a t e

∗ Network Ref inemen t− rename ” Bus0 ” t o ” CPU1 Bus ”− rename ” Bus1 ” t o ” CPU2 Bus ”− rename ” Bus2 ” t o ” CPU3 Bus ”− c r e a t e a d d i t i o n a l ” P o r t 1 ” f o r HW1− c r e a t e a d d i t i o n a l ” P o r t 1 ” f o r HW2− c o n n e c t HW1 ( P o r t 0 ) t o CPU1 Bus as ” s l a v e 4 ”− c o n n e c t HW1 ( P o r t 1 ) t o CPU2 Bus as ” s l a v e 4 ”− c o n n e c t HW2 ( P o r t 0 ) t o CPU2 Bus as ” s l a v e 5 ”− c o n n e c t HW2 ( P o r t 1 ) t o CPU3 Bus as ” s l a v e 4 ”− c o n n e c t IO Uni t2 ( P o r t 0 ) t o CPU3 Bus as ” s l a v e 5 ”− c o n n e c t IO Uni t1 ( P o r t 0 ) t o CPU1 Bus as ” s l a v e 5 ”− pe r fo rm network r e f i n e m e n t− rename g e n e r a t e d model a s ” P l a t f o r m N e t ”− compi l e and s i m u l a t e

∗ Communicat ion Ref inemen t− p r e s s ” CPU1 Bus ” t a b− s e l e c t s t a r t a d d r e s s 0 x40000000 f o r c link CPU1 HW1− s e l e c t s t a r t a d d r e s s 0 x50000000 f o r c l i n k I O U n i t 1 H W 1− p r e s s ” CPU2 Bus ” t a b− s e l e c t s t a r t a d d r e s s 0 x50000000 f o r c link CPU2 HW2− s e l e c t s t a r t a d d r e s s 0 x40000000 f o r c link HW1 CPU2− p r e s s ” CPU3 Bus ” t a b− s e l e c t s t a r t a d d r e s s 0 x40000000 f o r c link HW2 CPU3− s e l e c t s t a r t a d d r e s s 0 x50000000 f o r c l i n k C P U 3 I O U n i t 2− pe r fo rm communica t ion r e f i n e m e n t ( pin−a c c u r a t e model )− rename g e n e r a t e d model a s ” PlatformComm ”− compi l e and s i m u l a t e− S i m u l a t i o n s h o u l d be working f i n e . I t does n o t g e t s t u c k d i f f e r e n t from one CPU c a s e .

∗ Code g e n e r a t i o n f o r CPU1 , CPU2 and CPU3Now SCE does n o t s u p p o r t t h e m u l t i p l e C Code g e n e r a t i o n on t h e f l y .So C Codes f o r bo th CPU1 and CPU2 needs t o g e n e r a t e d s e p a r a t e l y u n t i l t h e n e x t v e r s i o n o f

SCE i s r e l e a s e d .− c l o s e PlatformComm . s i r− s i r n o t e PlatformComm ARM 7TDMI Core 20000 0 CPU1 ’ PE HAL MODEL=”

ARM 7TDMI HAL 20000 0 CPU1 ” ’− s i r n o t e PlatformComm ARM 7TDMI Core 20000 0 CPU2 ’ PE HAL MODEL=”

ARM 7TDMI HAL 20000 0 CPU2 ” ’− re−open PlatformComm . s i r

28

Page 34: Design Exploration using Multiple ARM Instruction Set ...

− s i r n o t e PlatformComm ARM 7TDMI Core 20000 0 CPU3 ’ PE HAL MODEL=”ARM 7TDMI HAL 20000 0 CPU3 ” ’

− pe r fo rm code g e n e r a t i o n f o r CPU1 ( f o r ARM 7TDMI OS 20000 0 CPU1 NET )( s t o r e i n f i l e s ”CPU1 / CPU1 . c ” and ”CPU1 / CPU1 . h ” )

− rename g e n e r a t e d model a s ” PlatformCommC1 ”− compi l e and s i m u l a t e− pe r fo rm code g e n e r a t i o n f o r CPU2 ( f o r ARM 7TDMI OS 20000 0 CPU2 NET )

( s t o r e i n f i l e s ”CPU2 / CPU2 . c ” and ”CPU2 / CPU2 . h ” )− rename g e n e r a t e d model a s ” PlatformCommC2 ”− compi l e and s i m u l a t e− pe r fo rm code g e n e r a t i o n f o r CPU3 ( f o r ARM 7TDMI OS 20000 0 CPU3 NET )

( s t o r e i n f i l e s ”CPU3 / CPU3 . c ” and ”CPU3 / CPU3 . h ” )− rename g e n e r a t e d model a s ” PlatformCommC3 ”− compi l e and s i m u l a t e− S i m u l a t i o n s h o u l d be working f i n e . I t does n o t g e t s t u c k d i f f e r e n t from one CPU c a s e .

∗ Cross−c o m p i l a t i o n f o r CPU1Now SCE does n o t s u p p o r t t h e m u l t i p l e ARM ISS . Only one ARM ISS s i m u l a t i o n i s a l l o w e d .

− we w i l l c r o s s−compi l e t h e g e n e r a t e d C code i n t h e CPU1 d i r e c t o r ycd CPU1

− i n s p e c t t h e g e n e r a t e d ANSI−C code i n f i l e s ”CPU1 . c ” and ”CPU1 . h ”I t a p p e a r s t h a t t h e g e n e r a t e d C code c o n v e r t s our back−a n n o t a t e d t i m i n ge s t i m a t e s i n t o TaskDelay s t a t e m e n t s t h a t suspend a t a s k f o r t h e g i v e np e r i o d o f t ime . Thus , t h e s o f t w a r e t a s k s a r e a c t u a l l y p u t t o s l e e p f o rour back−a n n o t a t e d w a i t f o r ( ) s t a t e m e n t s ! !

A f t e r t h e C code g e n e r a t i o n s t e p , t a k e a look a t t h e f i l e CPU1 . h .At t h e t o p of t h e f i l e , you w i l l f i n d t h e f o l l o w i n g d e f i n i t i o n :# d e f i n e WAITFOR(X) TaskDelay ( ( u n s i g n e d long ) ( (X) / 1 0 0 0 ) )P l e a s e change t h i s t o t h e f o l l o w i n g :# d e f i n e WAITFOR(X) / / n o t h i n g !Th i s w i l l make t h e bogus w a i t i n g d i s a p p e a r .− t o compi l e t h e g e n e r a t e d code t o g e t h e r wi th t h e microC−OS−I I

( and some o t h e r f i l e s ) , we ’ l l use a p r e p a r e d M a k e f i l ecp / home / doemer / EECS222C F08 / l e c t u r e 8 / M a k e f i l e .− u p d a t e M a k e f i l e f o r r e f e r e n c e p a t h f o r SWARM ISS and u s e r s o u r c e code l i k e below .

# r e f e r e n c e t o SWARM ISS# ISS DIR = / o p t / pkg / sw / swarm

ISS DIR = / home / l e c s / c h k o u t / p r o j e c t / mu l t i swarm / swarm

# USER s p e c i f i c s o u r c e f i l eUSR SRC := CPU1 . cUSR HDR := CPU1 . h

− t y p e ” make ” a f t e r m o d i f i c a t i o n o f M a k e f i l e− t h e g e n e r a t e d ARM−e x e c u t a b l e i s found i n f i l e ” userCode ”

∗ Cross−c o m p i l a t i o n f o r CPU2Now SCE does n o t s u p p o r t t h e m u l t i p l e ARM ISS . Only one ARM ISS s i m u l a t i o n i s a l l o w e d .− we w i l l c r o s s−compi l e t h e g e n e r a t e d C code i n t h e CPU2 d i r e c t o r y

cd CPU2− i n s p e c t t h e g e n e r a t e d ANSI−C code i n f i l e s ”CPU1 . c ” and ”CPU1 . h ”

more CPU2 . cmore CPU2 . h

− t o compi l e t h e g e n e r a t e d code t o g e t h e r wi th t h e microC−OS−I I( and some o t h e r f i l e s ) , we ’ l l use a p r e p a r e d M a k e f i l ecp / home / doemer / EECS222C F08 / l e c t u r e 8 / M a k e f i l e .

− u p d a t e M a k e f i l e f o r r e f e r e n c e p a t h f o r SWARM ISS and u s e r s o u r c e code l i k e below .# r e f e r e n c e t o SWARM ISS# ISS DIR = / o p t / pkg / sw / swarm

29

Page 35: Design Exploration using Multiple ARM Instruction Set ...

ISS DIR = / home / l e c s / c h k o u t / p r o j e c t / mu l t i swarm / swarm

# USER s p e c i f i c s o u r c e f i l eUSR SRC := CPU2 . cUSR HDR := CPU2 . h

− t y p e ”mae” a f t e r m o d i f i c a t i o n o f M a k e f i l e− t h e g e n e r a t e d ARM−e x e c u t a b l e i s found i n f i l e ” userCode ”∗ Cross−c o m p i l a t i o n f o r CPU3Now SCE does n o t s u p p o r t t h e m u l t i p l e ARM ISS . Only one ARM ISS s i m u l a t i o n i s a l l o w e d .

− we w i l l c r o s s−compi l e t h e g e n e r a t e d C code i n t h e CPU3 d i r e c t o r ycd CPU3

− i n s p e c t t h e g e n e r a t e d ANSI−C code i n f i l e s ”CPU3 . c ” and ”CPU3 . h ”more CPU3 . cmore CPU3 . h

− t o compi l e t h e g e n e r a t e d code t o g e t h e r wi th t h e microC−OS−I I( and some o t h e r f i l e s ) , we ’ l l use a p r e p a r e d M a k e f i l e

− cp / home / doemer / EECS222C F08 / l e c t u r e 8 / M a k e f i l e .− u p d a t e M a k e f i l e f o r r e f e r e n c e p a t h f o r SWARM ISS and u s e r s o u r c e code l i k e below .

# r e f e r e n c e t o SWARM ISS# ISS DIR = / o p t / pkg / sw / swarm

ISS DIR = / home / l e c s / c h k o u t / p r o j e c t / mu l t i swarm / swarm

# USER s p e c i f i c s o u r c e f i l eUSR SRC := CPU3 . cUSR HDR := CPU3 . h

− t y p e ” make ” a f t e r m o d i f i c a t i o n o f M a k e f i l e− t h e g e n e r a t e d ARM−e x e c u t a b l e i s found i n f i l e ” userCode ”

∗ I n s e r t i o n o f ISS model f o r CPU1 , CPU2 and CPU3− s e l e c t Comm. s i r ( i n SCE P r o j e c t window )− Edi t−>I m p o r t D e s i g n

” / home / l e c s / c h k o u t / p r o j e c t / mu l t i swarm / arm7tdmiwrapper / a r m 7 t d m i i s s 1 . s i r ”( t h i s w i l l show up i n t h e l i s t o f unused b e h a v i o r s i n t h e working windowas b e h a v i o r ”ARM 7TDMI ISS1 ” )

− Edi t−>I m p o r t D e s i g n” / home / l e c s / c h k o u t / p r o j e c t / mu l t i swarm / arm7tdmiwrapper / a r m 7 t d m i i s s 2 . s i r ”( t h i s w i l l show up i n t h e l i s t o f unused b e h a v i o r s i n t h e working windowas b e h a v i o r ”ARM 7TDMI ISS2 ” )

− Edi t−>I m p o r t D e s i g n” / home / l e c s / c h k o u t / p r o j e c t / mu l t i swarm / arm7tdmiwrapper / a r m 7 t d m i i s s 3 . s i r ”( t h i s w i l l show up i n t h e l i s t o f unused b e h a v i o r s i n t h e working windowas b e h a v i o r ”ARM 7TDMI ISS3 ” )

− l o c a t e t h e i n s t a n c e ”CPU1” of b e h a v i o r ” ARM 7TDMI Core 20000 0 CPU1 ”i n t h e h i e r a r c h y browse r

− r e p l a c e t h i s a b s t r a c t model w i th t h e ISS model( r i g h t −c l i c k ” ARM 7TDMI Core 20000 0 CPU1 ” , ChangeType t o ”ARM 7TDMI ISS1 ” )− l o c a t e t h e i n s t a n c e ”CPU2” of b e h a v i o r ” ARM 7TDMI Core 20000 0 CPU2 ”

i n t h e h i e r a r c h y browse r− r e p l a c e t h i s a b s t r a c t model w i th t h e ISS model( r i g h t −c l i c k ” ARM 7TDMI Core 20000 0 CPU2 ” , ChangeType t o ”ARM 7TDMI ISS2 ” )− l o c a t e t h e i n s t a n c e ”CPU3” of b e h a v i o r ” ARM 7TDMI Core 20000 0 CPU3 ”

i n t h e h i e r a r c h y browse r− r e p l a c e t h i s a b s t r a c t model w i th t h e ISS model( r i g h t −c l i c k ” ARM 7TDMI Core 20000 0 CPU3 ” , ChangeType t o ”ARM 7TDMI ISS3 ” )

− go t o P r o j e c t menu −> S e t t i n g s− change I m p o r t p a t h t o ” . : / home / l e c s / c h k o u t / p r o j e c t / mu l t i swarm / arm7tdmiwrapper ”− change L i b r a r y p a t h t o ” / home / l e c s / c h k o u t / p r o j e c t / mu l t i swarm / swarm ”

30

Page 36: Design Exploration using Multiple ARM Instruction Set ...

− s e t V e r b o s i t y l e v e l and Warning l e v e l t o 2 .− save t h i s d e s i g n model ( a s a new model i n t h e p r o j e c t )

F i l e−>SaveAs ” P l a t f o r m I S S . s i r ” ( i n your working d i r e c t o r y ! )− compi l e and s i m u l a t e( S i m u l a t i o n s h o u l d be ended l i k e below . )

0 : S t i m u l u s s e n d s b l o c k 9 .0 : S t i m u l u s s e n d s b l o c k 1 0 .

UARTCTRL: S e r i a l d e v i c e S l a v e p t s [ / dev / p t s / 2 ]Note : Uploaded t h e Program−B i n a r y : . / CPU3 / userCodeUARTCTRL: S e r i a l d e v i c e S l a v e p t s [ / dev / p t s / 3 ]Note : Uploaded t h e Program−B i n a r y : . / CPU2 / userCodeUARTCTRL: S e r i a l d e v i c e S l a v e p t s [ / dev / p t s / 4 ]Note : Uploaded t h e Program−B i n a r y : . / CPU1 / userCode

0 : S t i m u l u s s e n d s b l o c k 1 1 .R e g i s t e r IRQ R e g i s t e r IRQ R e g i s t e r IRQ 000 pFunc pFunc pFunc0 x0x0x5e05e05e0 pArg pArg pArg 0 x0x0x000

402 : S t i m u l u s s e n d s b l o c k 1 2 .821 : S t i m u l u s s e n d s b l o c k 1 3 .

1197 : S t i m u l u s s e n d s b l o c k 1 4 .1597 : S t i m u l u s s e n d s b l o c k 1 5 .1817 : Moni to r r e c e i v e d b l o c k 0 ( wi th 337 b y t e s ) .1817 : Encoding f o r b l o c k 0 took 1817 micro s e c o n d s .....

. . . . . .90817 : Encoding f o r b l o c k 174 took 8157 micro s e c o n d s .91350 : Moni to r r e c e i v e d b l o c k 175 ( wi th 15 b y t e s ) .91350 : Encoding f o r b l o c k 175 took 8189 micro s e c o n d s .91879 : Moni to r r e c e i v e d b l o c k 176 ( wi th 9 b y t e s ) .91879 : Encoding f o r b l o c k 176 took 8213 micro s e c o n d s .92391 : Moni to r r e c e i v e d b l o c k 177 ( wi th 6 b y t e s ) .92391 : Encoding f o r b l o c k 177 took 8196 micro s e c o n d s .92894 : Moni to r r e c e i v e d b l o c k 178 ( wi th 6 b y t e s ) .92894 : Encoding f o r b l o c k 178 took 8175 micro s e c o n d s .93467 : Moni to r r e c e i v e d b l o c k 179 ( wi th 305 b y t

e s ) .93467 : Encoding f o r b l o c k 179 took 8236 micro s e c o n d s .93467 : Moni to r e x i t s s i m u l a t i o n .

31