Energy-Efficient Communication Processors

Robert Fasthuber • Francky Catthoor • Praveen Raghavan • Frederik Naessens

Energy-Efficient Communication Processors

Design and Implementation for Emerging Wireless Systems

Robert Fasthuber, IMEC, Leuven, Belgium

Francky Catthoor, IMEC, Heverlee, Belgium

Praveen Raghavan, IMEC, Leuven, Belgium

Frederik Naessens, IMEC, Leuven, Belgium

ISBN 978-1-4614-4991-1
ISBN 978-1-4614-4992-8 (eBook)
DOI 10.1007/978-1-4614-4992-8
Springer New York Heidelberg Dordrecht London

Library of Congress Control Number: 2013938037

© Springer Science+Business Media New York 2013

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

Preface

With great advances in technology over the last decades, the computational performance and energy efficiency of mobile devices have been significantly improved. This improvement has led to modern mobile devices, such as smartphones or tablets, which offer a multitude of interesting applications. For this reason, they have become an essential part of our lifestyle and can be found everywhere around us. It is expected that in the future even hundreds of mobile devices will be in personal use. All of these devices will have to communicate with each other, mostly by using a wide variety of wireless communication standards.

To support this huge variety of standards in a cost-effective way, i.e. to avoid the use of many underutilized hardware resources, (1) high programming flexibility is required. Because of the joint demand for (2) high performance and long battery lifetime, these devices will have to be (3) extremely energy efficient. To achieve high performance and high energy efficiency, future designs need to be able to exploit the full potential offered by the latest process technologies, i.e. they need to be (4) technology scaling-friendly. Since the Non-Recurring Engineering (NRE) costs for products that leverage future process technologies will become dramatically high, (5) high reusability of designs and design effort across a large number of applications will be essential. Considering these joint requirements, it will become very challenging to continue the trend towards ever more computationally powerful mobile devices. The purpose of this book is to contribute to solving this challenge.

A review of the state-of-the-art literature clearly shows the drawbacks of existing solutions and motivates the need for a new holistic design approach that can cope with the joint set of requirements more effectively. This book proposes such a design approach and demonstrates its feasibility in three case studies.

The core of this book is the energy-efficient, technology-friendly Domain-Specific Instruction set Processor (DSIP) architecture template, which enables high reusability. This architecture template specifically targets the baseband functionality of emerging high-performance wireless communication systems, functionality that is very difficult to implement under the given design constraints. To achieve high energy efficiency, innovative architecture concepts, such as software Single Instruction Multiple Data (SIMD), Distributed Loop Buffer (DLB) and Very-Wide Register (VWR), have been combined with consistent co-design flows. Technology friendliness, i.e. the capability of a design to profit significantly from technology scaling, is ensured by proactively coping with the negative side effects of future Deep-Deep Sub-Micron (DDSM) technologies. In this book we focus mainly on the increasing influence of wires relative to transistors. We handle this issue by explicitly keeping the most important wires in the architecture template (the most active ones and those on the critical path) short, and by proposing a compatible back-end semi-custom design flow that leads to the desired layout.
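Software SIMD, named above as one of the combined concepts, can be illustrated with a minimal sketch of the general idea of subword parallelism in software (often called SIMD within a register): several narrow values are packed into one wide word with spare guard bits per lane, so that a single ordinary addition, or a shift-and-add for a constant multiplication, acts on all lanes at once. The lane width, guard-bit count, lane count and the function names `pack`, `unpack`, `swar_add` and `swar_times3` are hypothetical choices for this illustration; the sketch does not reproduce the software SIMD scheme of the DSIP template itself.

```python
# Minimal illustration of software SIMD (subword parallelism in an ordinary
# wide datapath). All parameters below are hypothetical choices for this sketch.
LANE_BITS = 8                      # payload bits per subword
GUARD_BITS = 2                     # spare bits per lane that absorb carries
FIELD_BITS = LANE_BITS + GUARD_BITS
LANES = 4
WORD_BITS = LANES * FIELD_BITS     # width of the modeled datapath (40 bits)
FIELD_MASK = (1 << FIELD_BITS) - 1


def pack(values):
    """Pack LANES unsigned LANE_BITS-bit values into one word (lane 0 in the LSBs)."""
    if len(values) != LANES:
        raise ValueError(f"expected {LANES} values")
    word = 0
    for i, v in enumerate(values):
        if not 0 <= v < (1 << LANE_BITS):
            raise ValueError(f"lane value {v} does not fit in {LANE_BITS} bits")
        word |= v << (i * FIELD_BITS)
    return word


def unpack(word):
    """Split a packed word back into its LANES fields (guard bits included)."""
    return [(word >> (i * FIELD_BITS)) & FIELD_MASK for i in range(LANES)]


def swar_add(a, b):
    """Add all lanes with one ordinary wide addition.

    Exact only while every lane sum stays below 2**FIELD_BITS, so that no
    carry crosses into the neighboring lane; the guard bits provide that headroom.
    """
    result = a + b
    assert result < (1 << WORD_BITS), "result overflows the modeled datapath"
    return result


def swar_times3(a):
    """Multiply every lane by the constant 3 with one shift and one add (3x = 2x + x)."""
    result = (a << 1) + a
    assert result < (1 << WORD_BITS), "result overflows the modeled datapath"
    return result


if __name__ == "__main__":
    x = pack([200, 100, 255, 0])
    y = pack([100, 200, 255, 7])
    print(unpack(swar_add(x, y)))                       # [300, 300, 510, 7]
    print(unpack(swar_times3(pack([1, 2, 100, 255]))))  # [3, 6, 300, 765]
```

The guard bits trade some datapath width for lane isolation: with 2 guard bits per 8-bit lane, up to four full-scale values can be summed in a lane before a carry could leak into its neighbor.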

This book includes three relevant case studies that demonstrate the application and feasibility of the proposed design approach. All three DSIP architecture template instances, i.e. the advanced Multiple-Input Multiple-Output (MIMO) detector for future LTE/WLAN standards, the high-speed Finite Impulse Response (FIR) filter for emerging 60 GHz systems and the high-throughput Fast Fourier Transformation (FFT) processor for 60 GHz and WLAN systems, have been designed and implemented in TSMC 40 nm technology. While also sufficiently fulfilling all other requirements, these designs are at least a factor of 2–3 more energy/area efficient than state-of-the-art programmable solutions. This result motivates the content of this book.

We expect this book to be of interest to academia in two ways: for describing the overall design approach to efficient architecture implementations, and for describing the proposed innovative architecture concepts in more detail. The goal of all projects that have driven this research was to obtain results that are relevant to industry. Since this book reflects this goal, we believe that its content is also of interest to senior architecture design engineers and their managers in industry, specifically to those who want to make use of the proposed concepts in their own research and development, or who wish to anticipate the evolution of commercially available design concepts over the next few years.

The material of this book is based on research that was carried out at IMEC in the period 2005–2012, partly in the context of European and national research projects. It has been a pleasure for us to work in this research domain and to cooperate with our project partners and with our colleagues from the analog and digital SSET-CSI group.

We would like to use this opportunity to thank all the people who have provided contributions and feedback directly related to this book, both at IMEC and at other locations. In particular we want to mention David Novo, Min Li, Wim Van Thillo, Sofie Pollin, Ubaid Ahmad, Prashant Agrawal, Halil Kukner, Matthias Hartmann, André Bourdoux, Claude Desset, Hans Cappelle, Peter Debacker, Raf Appeltans, Tom Vander Aa, Veerle Derudder, Liesbet Van der Perre, Antoine Dejonghe, Bruno Bougard, Jos Huisken, Wim Dehaene and Paolo Ienne. Furthermore, we want to thank all the Master’s students who have helped us in this research, in particular Alaa Medra, Imran Ali, Kostas Samaras and Vagelis Bebelis.

Finally, we hope that the reader will find this book useful and enjoyable and that the proposed concepts and results will contribute to the continued progress in this field.

Leuven, Belgium
Robert Fasthuber
Francky Catthoor
Praveen Raghavan
Frederik Naessens

Contents

1 Introduction and Motivation
1.1 The Smartphone of the Future
1.1.1 Trends and Consequences
1.1.2 The Solution: A Highly Energy-Efficient SDR Platform
1.2 Research Challenges to Enable Highly Energy-Efficient SDR Platforms
1.2.1 The Energy-Efficiency Gap
1.2.2 The Architecture Gap for Ultimately Scaled Technologies
1.2.3 The Productivity Gap
1.2.4 The Culture Gap for Design Paradigms
1.3 Key Concepts to Tackle Research Challenges and Related Gaps in the State of the Art
1.3.1 Employ an Architecture With a Well-Chosen Flexibility
1.3.2 Leverage on an Energy and Cost-Effective Architecture Template
1.3.3 Consider DDSM Technology Constraints During the Design of the Architecture
1.3.4 Employ Consistent, Predictable and Systematic Design Flows Around the Architecture Template
1.3.5 Ensure that Algorithm and Architecture are Well-Matched
1.3.6 Effectively Adapt to the Actual Requirements at Run-Time
1.4 Proposed DSIP Architecture Template Design Approach
1.5 Main Focus and Contributions
1.5.1 Main Focus of this Book
1.5.2 Overview of the Main and Side Contributions
1.6 Structure of this Book
References

2 Background and Related Work
2.1 Background on Wireless Communication Systems
2.1.1 The General Digital Wireless Communication System
2.1.2 The Increasing Complexity of Wireless Communication Systems
2.1.3 Physical Layer Signal Processing in an Advanced Receiver
2.2 Background on Architecture Styles for Wireless Communication Systems
2.2.1 ASIC/rASIC
2.2.2 ASIP (Application Processor)
2.2.3 ASIP (Baseband Processor)
2.2.4 DSIP
2.2.5 DSP
2.2.6 Other Styles
2.3 Background on the Physical Layer System Design
2.3.1 Functionality Design
2.3.2 Algorithm Design
2.3.3 Partitioning
2.3.4 Architecture Design
2.3.5 Software Mapping and Compilation
2.3.6 Hardware Implementation
2.4 Related Work on Functionality/Algorithm Design and Partitioning
2.4.1 Functionality Design
2.4.2 Algorithm Design and Optimization
2.4.3 Partitioning
2.5 Related Work on Architecture Design
2.5.1 Design of the Platform
2.5.2 Design of an ASIC
2.5.3 Design of an Overall Processor Platform
2.5.4 Design of the Processing Elements of a Processor
2.5.5 Design of the Data Storage Hierarchy of a Processor
2.5.6 Design of the Instruction C./S. Hierarchy of a Processor
2.6 Related Work on Software Mapping/Compilation
2.6.1 Software Mapping
2.6.2 Compilation
2.7 Related Work on Hardware Implementation
2.7.1 Implementation of the Platform
2.7.2 Implementation of a Block
2.8 Related Work on Wireless Architectures and Templates
2.8.1 ASIC/rASIC
2.8.2 ASIP (Application Processor)
2.8.3 ASIP (Baseband Processor)
2.8.4 DSIP
2.8.5 DSP
2.9 Summary and Conclusions
References

3 The Proposed DSIP Architecture Template for the Wireless Communication Domain
3.1 An Effective Architecture Template for the Wireless Communication Domain
3.1.1 Considered Domain
3.1.2 Proposed Design Approach
3.2 Applied Design Approach to Define Architecture Template
3.2.1 Analysis and Definition of System and Algorithm Requirements
3.2.2 Analysis and Definition of Technology Constraints
3.2.3 Evaluation and Selection of Architectural Concepts
3.2.4 Definition of the Architecture Template
3.3 Requirements from Algorithm Perspective
3.3.1 Data Representation
3.3.2 Arithmetic and Logic Operations
3.3.3 Parallelization
3.3.4 Data Transfer Operations
3.3.5 Data Storage
3.3.6 Instruction Control
3.3.7 Summary
3.4 Employed Architectural Concepts
3.4.1 Data Representation
3.4.2 Arithmetic and Logic Operations
3.4.3 Parallelization
3.4.4 Data Transfer Operations
3.4.5 Data Storage
3.4.6 Instruction Control
3.5 Proposed Architecture Template
3.5.1 Top Level
3.5.2 Cluster Level
3.5.3 Engine Level
3.5.4 Slice Level
3.6 Scalability of the Architecture Template
3.6.1 Technology Scalability
3.6.2 Hardware/Instance Scalability: Design Space and Architecture Instantiation Design Flow
3.6.3 Hardware/Instance Scalability: Model to Define Flexibility in a Quantitative Manner
3.6.4 Hardware/Instance Scalability: Flexibility Evaluation of the Proposed Architecture Template
3.6.5 Software/Run-Time Scalability
3.7 Summary of Combined Innovative Concepts
3.7.1 Main Template-Specific Concepts
3.8 Conclusions
References

4 Case Study 1: DSIP Architecture Instance for MIMO Detection
4.1 Motivation, Related Work and Contributions
4.1.1 Context and Motivation
4.1.2 Summary of Related Work
4.1.3 Main Contributions
4.2 Background on Driver
4.2.1 MIMO Detection
4.2.2 Motivation for a Flexible Implementation
4.3 Algorithm Optimizations and Characteristics
4.3.1 Algorithm Choice and Applied Optimizations
4.3.2 Algorithm Characteristics
4.4 Proposed DSIP Architecture Instance
4.4.1 Top Level
4.4.2 Cluster Level
4.4.3 Engine Level
4.4.4 Slice Level
4.5 Software Mapping and Hardware Implementation
4.5.1 Software Mapping and Scheduling
4.5.2 Hardware Implementation and Results
4.5.3 Instance Scaling
4.5.4 Run-Time Scaling
4.6 Comparison and Discussion
4.6.1 Implemented ASIC References
4.6.2 Comparison to ASIC References
4.6.3 Flexible Implementations from Literature
4.6.4 Comparison to Flexible Implementations
4.7 Conclusions
References

5 Case Study 2: DSIP Architecture Instances for FIR Filtering
5.1 Motivation, Related Work and Contributions
5.1.1 Context and Motivation
5.1.2 Summary of Related Work
5.1.3 Main Contributions
5.2 Background on Driver
5.2.1 Matched Filter for the 60 GHz System
5.2.2 Motivation for a Flexible Implementation
5.3 Algorithm Optimizations and Characteristics
5.3.1 Algorithm Choice and Applied Optimizations
5.3.2 Algorithm Characteristics
5.4 Proposed DSIP Architecture Instances
5.4.1 Top Level
5.4.2 Cluster Level
5.4.3 Engine Level
5.4.4 Slice Level of the HW-SIMDi
5.4.5 Slice Level of the SW-SIMDi
5.5 Software Mapping and Hardware Implementation
5.5.1 Software Mapping and Scheduling
5.5.2 Hardware Implementation and Results
5.5.3 Throughput and Scalability
5.6 Comparison and Discussion
5.6.1 Implemented ASIC References
5.6.2 Processor References from Literature
5.6.3 Comparison (Normalized, Pessimistic)
5.6.4 Discussion
5.7 Conclusions
References

6 Case Study 3: DSIP Architecture Instance for FFT Computation
6.1 Motivation, Related Work and Contributions
6.1.1 Context and Motivation
6.1.2 Summary of Related Work
6.1.3 Main Contributions
6.2 Background on Driver
6.2.1 (I)FFT for High-Data Rate Standards
6.2.2 Motivation for a Flexible Implementation
6.3 Algorithm Optimizations and Characteristics
6.3.1 Algorithm Choice and Applied Optimizations
6.3.2 Algorithm Characteristics
6.4 Proposed DSIP Architecture Instance
6.4.1 Top Level
6.4.2 Cluster Level
6.4.3 Engine Level
6.4.4 Slice Level
6.5 Software Mapping and Hardware Implementation
6.5.1 Utilized Processor Design Tool Suite
6.5.2 Software Mapping and Scheduling
6.5.3 Hardware Implementation and Results
6.5.4 Throughput and Scalability
6.6 Comparison and Discussion
6.6.1 ASIC References
6.6.2 Flexible References from Literature
6.6.3 Comparison (Normalized, Pessimistic)
6.6.4 Discussion
6.7 Conclusions
References

7 Front-End Design Flow: Bridging the Algorithm-Architecture Gap
7.1 Motivation and Issues, Overview of Proposal and Contributions
7.1.1 Context and Motivation
7.1.2 Algorithm-Architecture Co-Design for Traditional Architecture Styles
7.1.3 Proposed Measures to Enable an Effective Algorithm-Architecture Co-Design Flow
7.1.4 Main Contributions
7.2 Proposed Architecture Template Instantiation Design Flow
7.2.1 Algorithm Transformations
7.2.2 Decisions on Data Representation
7.2.3 Support of Arithmetic and Logic Operations
7.2.4 Decisions on Parallelization
7.2.5 Support of Data Transfers
7.2.6 Data Storage Dimensioning
7.2.7 Instruction Control Dimensioning
7.3 Application on Case Study 1: MIMO Detector
7.3.1 Algorithm Transformations
7.3.2 Decisions on Data Representation
7.3.3 Support of Arithmetic and Logic Operations
7.3.4 Decisions on Parallelization
7.3.5 Support of Data Transfers
7.3.6 Data Storage Dimensioning
7.3.7 Instruction Control Dimensioning
7.4 Application on Case Study 2: FIR Filter
7.4.1 Algorithm Transformations
7.4.2 Decisions on Data Representation
7.4.3 Support of Arithmetic and Logic Operations
7.4.4 Decisions on Parallelization
7.4.5 Support of Data Transfers
7.4.6 Data Storage Dimensioning
7.4.7 Instruction Control Dimensioning
7.5 Conclusions
References

8 Conclusions and Future Work
8.1 Conclusions
8.1.1 A Clear Need for New Design Approaches for Wireless Application Platforms
8.1.2 Proposed Design Approach for Wireless Application Platforms
8.1.3 Results are Very Promising
8.1.4 Main Messages
8.1.5 Main Contributions
8.2 Future Work
8.2.1 Remaining Challenges and Tasks
8.2.2 Planned Continuation of Work
References

Index

Acronyms

Wireless Communication Standards

2/3/4G Communication standards of the 2nd/3rd/4th generation
3GPP 3rd Generation Partnership Project
ATSC Advanced Television Systems Committee
DMB Digital Multimedia Broadcasting
DVB Digital Video Broadcasting
ECMA European Computer Manufacturers Association
EV-DO Evolution-Data Optimized
GPRS General Packet Radio Service
GSM Global System for Mobile communications
HSDPA High-Speed Downlink Packet Access
HSUPA High-Speed Uplink Packet Access
IrDA Infrared Data Association
LTE(-A) Long-Term Evolution (-Advanced)
NFC Near-Field Communication
P2P Peer-to-Peer
UMTS Universal Mobile Telecommunications System
WiMAX Worldwide Interoperability for Microwave Access
WLAN Wireless Local Area Network
WMAN Wireless Metropolitan Area Network
WPAN Wireless Personal Area Network
WWAN Wireless Wide Area Network

System/Functionality/Algorithm/Software

AFE Analog Front-End
BAO Basic Arithmetic Operation
BER Bit-Error-Rate
BLCO Boolean Logic/Comparison Operation
CDFG Control and Data Flow Graph
CDMA Code Division Multiple Access


CFO Carrier-Frequency Offset
CORDIC COordinate Rotation DIgital Computer algorithm
CP ComParison
CR Cognitive Radio
CSD Canonical Signed Digit
DCD Dichotomous Coordinate Descent algorithm
DFE Digital Front-End
DFG Data Flow Graph
DIF Decimation in Frequency
DIT Decimation in Time
DLP Data Level Parallelism
DTSE Data Transfer and Storage Exploration
EQ EQualizer
FDM Frequency-Division Multiplexing
FEC Forward Error Correction
FFT Fast Fourier Transformation
FIR Finite Impulse Response
HO Hard-Output
ILP Instruction Level Parallelism
ITSE Instruction Transfer and Storage Exploration
LDPC Low Density Parity Check
LLR Log-Likelihood-Ratio
LORD Layered ORthogonal Lattice Detection
LOS Line Of Sight
LR Lattice Reduction
LRA-MMSE Lattice-Reduction-Aided MMSE
LSB Least Significant Bit
MAC Multiply-and-ACcumulate operation
MCO Multiplication with a Constant Operator
MCM Multiple Constant Multiplication
MDF Multi-path Delay Feedback
MFCSO Modified Fixed-Complexity Soft-Output
MIMO Multiple Input Multiple Output
ML Maximum Likelihood
MMSE Minimum Mean-Squared Error
MPSoC Multi-Processor SoC
MSB Most Significant Bit
MTT Multi-pass Trellis Traversal
MVON Multiplication with a Variable Operator and a multiplier which can adopt only a Narrow value range
MVOW Multiplication with a Variable Operator and a multiplier which can adopt a Wide value range
NOP No Operation
OFDM Orthogonal Frequency-Division Multiplexing
Op Operation


OSI Open Systems Interconnection layers
PER Packet Error Rate
PHY PHYsical layer (lowest OSI layer)
QAM Quadrature Amplitude Modulation
QPSK Quadrature Phase-Shift Keying
QRD QR orthogonal-triangular Decomposition
RD ReaD
RF Radio Frequency
RRDML Reconfigurable Reduced Dimension Maximum Likelihood
SBP-LR Scalable Block-based Parallel Lattice Reduction
SC Single Carrier
SCO Sampling Clock Offset
SDR Software Defined Radio
SHF SHift Factor
SIC Successive Interference Cancellation
SIFS Short InterFrame Space latency/timing
SNR Signal-to-Noise Ratio
SO Soft-Output
SoC System on a Chip
SRRC Square-Root-Raised-Cosine filter
SSFE Selective Spanning with Fast Enumeration
SWL Subword Length/Size
TDM Time-Division Multiplexing
TLP Task Level Parallelism
TSD Tuple Search Detector
TSO Trigonometric/Special Operation
WR WRite
ZF Zero-Forcing

Architectures/Architecture Styles

ADRES Architecture for Dynamically Reconfigurable Embedded Systems
ASIC Application Specific Integrated Circuit
ASIP Application Specific Instruction set Processor
ASIP-AP ASIP-Application Processor
ASIP-BP ASIP-Baseband Processor
CG(R)A Coarse Grained Reconfigurable Array (processor)
DSP Digital Signal Processor
DSIP Domain Specific Instruction set Processor
EVP Embedded Vector Processor
FG(R)A Fine Grained Reconfigurable Array (processor)
FPGA Field-Programmable Gate Array
GPP General Purpose Processor
GPU Graphics Processing Unit


rASIC reconfigurable ASIC
RISC Reduced Instruction Set Computer
SDR Software Defined Radio (processor)
SIMD Single Instruction Multiple Data (processor)
SIMT Single Instruction Multiple Threads (processor)
SODA Signal-processing On-Demand Architecture
STA Synchronous Transfer Architecture
TTA Transport Triggered Architecture
VLIW Very Long Instruction Word (processor)
VDSP Vector DSP

Architecture Components/Elements

ADC Analog-to-Digital Converter
ALU Arithmetic and Logic Unit
AMBA Advanced Microcontroller Bus Architecture
AOI And-Or-Inverter
ASI Application/Algorithm Specific Instruction
BAU Basic Arithmetic Unit
BPE Block Processing Engine
CLB Centralized Loop Buffer
CLC Cluster Level Control
RCU Reusable Custom Unit
DEMUX DE-MultipleXer
DMA Direct Memory Access
DMEM Data Memory
DSH Data Storage Hierarchy
DSI Domain-Specific Instruction
FF Flip-Flop
FIFO First-In First-Out
FU Functional Unit
HA Hardware Accelerator
HardSIMD Hardware SIMD
HC Hold Counter
ICache Instruction Cache
ICH Instruction Control Hierarchy
IMEM Instruction Memory
ISH Instruction Storage Hierarchy
LB Loop Buffer
LC Loop Controller
LU Logic Unit
LUT(U) Look-Up Table (Unit)
MRAM Magnetoresistive Random Access Memory
MU Multiplier Unit


MUX Multiplexer
NoC Network on Chip
NVM Non-Volatile Memory
OICU Online Instruction Computation Unit
PC Program Counter
PE Processing Element
RF Register File
RRAM Resistive Random Access Memory
SELC Shared Engine Level Control
SHU Shuffler
SIMD Single Instruction Multiple Data
SoftSIMD Software SIMD
SRAM Static Random Access Memory
SSU Subword Shuffler Unit
STLC Shared Top Level Control
STR STRide
STT-RAM Spin Torque Transfer Random Access Memory
TW TWiddle
VWR Very Wide Register
VWR_SI VWR Slice Interface
WSU Word Shuffler Unit

Design/Tools/Automation/Technology

ADL Architecture Description Language
BC Best Case
CTS Clock Tree Synthesis
DDSM Deep Deep Sub-Micron technologies (65 nm and below)
DPG Data-Path Generator
DRC Design Rule Check
DVFS Dynamic-Voltage and Frequency Scaling
EDI Cadence Encounter Digital Implementation
EPS Cadence Encounter Power System
ETS Cadence Encounter Timing System
GP General Purpose
HLE High-Level Estimation
HDL Hardware Description Language, e.g. VHDL and Verilog
HLS High-Level Synthesis
HT High-Throughput
HW HardWare
IP Intellectual Property
LA Low-Area
LP Low Power
NRE Non-Recurring Engineering cost


PCB Printed Circuit Board
RC Resistance–Capacitance
RTL Register Transfer Level
SDF Standard Delay Format
SDP Structured DataPath
SPEF Standard Parasitic Exchange Format
SW SoftWare
TC Typical Case
TCL Tool Command Language
TSMC Taiwan Semiconductor Manufacturing Company, Limited
VCD Value Change Dump
VHDL Very high-speed integrated circuit Hardware Description Language
WC Worst Case

Institutions/Divisions/Departments

ESAT Department of Electrical Engineering, KUL
IMEC Interuniversity MicroElectronics Centre
KUL University of Leuven
NTNU Norwegian University of Science and Technology
SSET Smart Systems and Energy Technology, a division of IMEC

Metrics

bits (Mega/kilo) bits
bytes (Mega/kilo) bytes
bps (Giga/Mega/kilo) bits per second
OPS (Giga/Million) Operations Per Second
GE Gate Equivalent
IPS (Giga/Million) Instructions Per Second
J (pico/femto) Joule
W (milli) Watt
