Powermeter for HPC Systems
André Filipe Gonçalves Duarte
Thesis to obtain the Master of Science Degree in
Electrical and Computer Engineering
Supervisors: Dr. Pedro Filipe Zeferino Tomás and Dr. Nuno Filipe Valentim Roma
Examination Committee
Chairperson: Dr. Nuno Cavaco Gomes Horta
Supervisor: Dr. Pedro Filipe Zeferino Tomás
Member of the Committee: Dr. Francisco André Corrêa Alegria
May 2015
Courage and perseverance have a magical talisman, before which difficulties disappear and obstacles
vanish into air.
John Quincy Adams
Acknowledgments
I would like to thank Professors Pedro Filipe Zeferino Tomás and Nuno Filipe Valentim Roma for all the support, advice and patience they showed over the past months. Without their guidance and belief in me, I am sure it would not have been possible to conclude this important stage of my life. I would also like to thank Professor José Germano for all the useful advice he gave me, and my friends for always being there, even when one thinks they are not. I would like to show my pride and joy for being part of a group of people who I know will follow me for the rest of my life. Thus, thanks to João Pedro Costa e Castro, Ricardo Filipe Tomás Pires, Gonçalo Diogo Gomes Mendes, Gonçalo Gouveia Velez Bidarra Saraiva, Guilherme Costa e Castro and, last but not least, the great Ortonimo (Flávio Jorge dos Santos Lopes).
I cannot end this text without showing my highest esteem for my parents and my sister, for loving me so hard and for doing all they could to help me, even when it seemed I did not show appreciation for it.
Thank you, I love you
Abstract
The fast pace at which technology has been evolving has led to a significant increase in the amount of energy consumed by today's High Performance Computing (HPC) systems. Consequently, it has become highly important to understand how the energy consumption of any given application changes over time, envisaging the possibility of implementing real-time power profiling and resource optimization. The work developed in the scope of this thesis describes the design and prototyping of an acquisition board (and related software API) composed of several Hall sensors and a microcontroller. This board is capable of measuring the amount of power demanded by an HPC system, by monitoring the current that passes through the several rails of the main Power Supply Unit (PSU) of a personal computer. For that purpose, a broad set of conditioning modules was studied and implemented, in order to ensure accurate and precise measurements over an ample dynamic range of the measured signals. In particular, an Automatic Gain Controller (AGC) module was implemented in the acquisition board, embracing both the analog and digital domains of the measurement procedure. The results obtained from the experimental evaluation showed that the conceived device is highly suitable for real-time power profiling of HPC systems under complex workloads, by providing fine-grained measurements of the power consumption over time, hardly attained by other state-of-the-art devices or systems.
Keywords: High Performance Computing Systems, Energy Consumption, Real-time Power
Profiling, In-situ Measurements, Automatic Gain Control, PIC
Resumo
O ritmo acelerado a que as tecnologias se têm desenvolvido levou a um aumento significativo da energia consumida pelos sistemas de computação de alto desempenho (HPC). Consequentemente, é de extrema importância perceber como é que o consumo energético duma aplicação varia ao longo do tempo, visando a caracterização em tempo real da potência consumida pelo sistema e a consequente otimização de recursos. O trabalho que foi desenvolvido no âmbito desta tese descreve o projeto e a prototipagem duma placa de aquisição (e a aplicação de software associada) composta por vários sensores de Hall e um microcontrolador. Esta placa é passível de medir a potência requerida por um sistema HPC, monitorizando a corrente em várias linhas de alimentação provenientes da fonte de alimentação (PSU) de um computador pessoal. Assim, foram estudados e implementados diversos módulos de acondicionamento, com o intuito de garantir medições exatas e precisas sob uma larga gama dinâmica dos sinais medidos. Em particular, foi implementado um módulo de Controlo Automático do Ganho (AGC), fazendo a ligação entre os domínios analógico e digital da placa de aquisição. Os resultados experimentais obtidos revelaram que a placa concebida é particularmente adequada para a caracterização em tempo real da potência consumida por aplicações de elevada complexidade em sistemas HPC, obtendo-se uma precisão de medida ao longo do tempo que dificilmente é alcançada por outros dispositivos modernos e sistemas do mesmo género.
Palavras-chave: Sistemas de Computação de Alto Desempenho, Consumo Energético, Caracterização da Potência em Tempo Real, Medições In situ, Controlo Automático do Ganho, PIC
Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Resumo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvi
List of Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Nomenclature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xx
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Main Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 State-of-the-Art 9
2.1 Real-time Power Measurement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.1 Software-based Power Measurement . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.1.A System Profile-based Power Model . . . . . . . . . . . . . . . . . . . . . 10
2.1.1.B PMC-based Power Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.2 Hardware-based Power Measurement . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3 Powermeter - Architecture Definition and Specifications 19
3.1 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Signal Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.1 DC conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.2 AC conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3 AGC - Automatic Gain Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.1 MatLab Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.2 Analog Domain Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3.2.A Band-pass Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3.2.B Summing Amplifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3.2.C Programmable Gain Amplifier . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.3 Dynamic Range Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3.3.A SNR, THD and SFDR Analysis . . . . . . . . . . . . . . . . . . . . . 41
3.3.3.B Combining the PGA with Oversampling . . . . . . . . . . . . . . . . . . . 45
3.3.4 System Stability Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3.4.A Absolute Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3.4.B Popov Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3.4.C Circle Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.3.4.D Application of the Theorems to the System in Study . . . . . . . . . . . . 50
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4 Powermeter - Software/Firmware 55
4.1 Communication System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.1.1 Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.1.2 Sampling Frequency Choice of the System . . . . . . . . . . . . . . . . . . . . . . 58
4.1.3 Types of Data Transferred . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.1.3.A Synchronous - Clocks Synchronization and Sampling Process Initialization . . 60
4.1.3.B Asynchronous - Time Synchronization Data . . . . . . . . . . . . . . 60
4.1.3.C Asynchronous - Time Stamps and Sampled Data . . . . . . . . . . . 62
4.2 Firmware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2.1 Buffering Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2.2 Oversampling and Maximum Search Algorithm . . . . . . . . . . . . . . . . . . . . 64
4.3 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3.1 Powermeter Application Programming Interface . . . . . . . . . . . . . . . . . . . . 65
4.3.2 Energy Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.3.2.A Time Stamp Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5 Results 69
5.1 Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.1.1 Sensors and ADC Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2 Non-Linearity Thresholds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.3 Power Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.3.1 SPLASH-2 Benchmark Tests . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.3.2 NAS Parallel Benchmark Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.4 PC Energy Consumption Characterization . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6 Conclusions 81
6.1 Summary and Overall Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Bibliography 89
A Appendix A 91
A.1 Bandpass Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
A.2 Analog Implementation - Full Circuit Diagram . . . . . . . . . . . . . . . . . . . . . . . . . 93
A.3 Dynamic Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
A.4 Benchmark Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
List of Tables
2.1 List of available RAPL sensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 PowerPack power meter profile API. (Source [1]) . . . . . . . . . . . . . . . . . . . . . . . 16
3.1 Component Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Gain at the Central Frequency to the Various Tests . . . . . . . . . . . . . . . . . . . . . . 32
3.3 Parameters Variation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4 AD5113 Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5 Digital Word Vs Gain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.6 ADC Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.7 SNR values for each case test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.1 Types of Data Transferred . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2 Synchronous Data Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.3 Asynchronous Data Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.4 Sampling Process Initialization Command Example . . . . . . . . . . . . . . . . . . . . . . 60
4.5 Host and PIC’s Clock Synchronization Data Structure . . . . . . . . . . . . . . . . . . . . 60
4.6 Sampling Process Data Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.7 Main API Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.1 AGC Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.2 Powermeter Time Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.3 RAPL Time Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.4 Machine’s Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
List of Figures
1.1 Motherboard and Component’s Connectors Pin-out . . . . . . . . . . . . . . . . . . . . . . 4
2.1 Propagation of Performance Events. (Source [2]) . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 An example of using multimeter to measure the power. [3] . . . . . . . . . . . . . . . . . . 14
2.3 PowerPack (Source [1]) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Block Diagram of the Hardware Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3 Current Sensor. Source: Allegro MicroSystems . . . . . . . . . . . . . . . . . . . . . . . . 22
3.4 DC Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.5 Subtractor Circuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.6 AGC Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.7 On-off Controller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.8 Loop Response . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.9 Reconstruction of the Input signal with and without AGC . . . . . . . . . . . . . . . . . . . 28
3.10 Loop Response . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.11 Reconstruction of the Input signal with and without AGC . . . . . . . . . . . . . . . . . . . 29
3.12 Full Circuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.13 Band Pass Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.14 Real Vs Theoretical Bode Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.15 Pspice Simulation Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.16 Voltage Converter Schematic (LMC7660 datasheet - Texas Instruments) . . . . . . . . . . 34
3.17 Summing Amplifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.18 Adder Output - Voffset (purple) = 2.5 V; Vsub (yellow) = 1.175 mV amplitude; Vout (cyan)
= 3.662 V . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.19 PGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.20 AD5113’s Pin Configuration and Block Diagram - Analog Devices . . . . . . . . . . . . . . 36
3.21 Averaged Conversion Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.22 Averaging Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.23 PIC18f4550 10-bit ADC’s SNR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.24 FFT of Sinusoidal Signal sampled at Fs = 3.3(3) kHz . . . . . . . . . . . . . . . . . . . . . 44
3.25 FFT of Sinusoidal Signal with samples averaging (Fs = 3.3(3) kHz) . . . . . . . . . . . . . 45
3.26 Non-Linearities Analysis [4] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.27 System for Stability Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.28 Non-Linearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.29 Nyquist Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.1 Round-Trip Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2 Communication Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3 Synchronization Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4 Offset Change Along the Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.5 System Diagram - Microcontroller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.6 Dual-Buffer Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.7 Oversampling with Rolling Buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.8 System Diagram - Host . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.9 Trapezoidal Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.10 Energy Computing (where Bn refers to energy batch of the nth packet) . . . . . . . . . . . 67
4.11 Case Scenario when time[index] = Tn > end time (where Tn = T0 + (n - 1) × TSampling) 68
5.1 ADC and Sensor Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2 Stability Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.3 Power Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.4 Profiling of NPB Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.5 EDP Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.6 AC Power Consumption and DC Power Distribution in the System . . . . . . . . . . . . . 78
5.7 Power Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
A.1 Filter Response . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
A.2 R2 Parameter Variation with 5% Tolerance . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
A.3 Circuit Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
A.4 Sinusoid FFT with Oscilloscope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
A.5 THD and SFDR for N=8. Fs = 3.333 kHz . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
A.6 EP CLASS=B Power Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
A.7 BT CLASS=B Power Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
List of Algorithms
4.1 Maximum Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2 Energy Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.3 Time Stamp Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Glossary
AC Alternating Current
AGC Automatic Gain Control
API Application Programming Interface
CPU Central Processing Unit
DAC Digital-to-Analog Converter
DC Direct Current
DMA Direct Memory Access
DVFS Dynamic Voltage-Frequency Scaling
EDP Energy-Delay Product
FPGA Field-Programmable Gate Array
GPU Graphics Processing Unit
HDD Hard Disk Drive
HPC High-Performance Computing
LLC Last Level Cache
MCU Microcontroller Unit
MSR Model-Specific Registers
NTP Network Time Protocol
OS Operating System
PCB Printed-Circuit Board
PCI Peripheral Component Interconnect
PGA Programmable Gain Amplifier
PLL Phase-Locked Loop
PMC Performance-Monitoring Counters
PSU Power Supply Unit
RAPL Running Average Power Limit
RMS Root Mean Square
SFDR Spurious-Free Dynamic Range
SIE Serial Interface Engine
SMD Surface-Mount Device
SNR Signal-to-Noise Ratio
TCP Transmission Control Protocol
THD Total Harmonic Distortion
TLB Translation Lookaside Buffer
UDP User Datagram Protocol
USB Universal Serial Bus
p.f. power factor
1 Introduction
Contents
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Main Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Energy consumption is an issue in every consumer electronic equipment, even more so since fossil resources are being consumed at a rate far faster than nature's regenerative rhythm, and because of the power consumption constraints imposed on all sorts of equipment to reduce environmental impact. However, this general interest in saving energy has only become a focus of attention for the computing science community in the last few years. This happens mostly due to the tremendous pace at which technology has evolved in the last decade, translating into systems with computing capabilities far more powerful than ever, verifying, until now, Moore's law projections. This, in turn, leads to more and more power being required to feed those systems, and special attention is now being paid to High Performance Computing (HPC) systems by the science community, qualifying energy consumption as a primary concern in the design of electronic systems, and of computer systems in particular. As a result, several research efforts have been made to optimize and carefully manage energy consumption at multiple levels - starting from individual components like wireless radios [5], storage devices [6], and processors [7, 8], up to PCs and servers [9, 10, 11].
For the computer engineer, it is important to figure out where and why an application consumes more power and, with that information, make decisions that can improve the application's energy efficiency. Thus, it is essential to adequately measure the energy spent in computer systems. One of many approaches is to create energy models that allow extrapolating the future energy consumption of the system. Those models are used to schedule applications and resources [12], to adapt application behavior to externally specified energy constraints [13], and to attribute energy usage to the respective software components [14]. Those energy models are usually based on information retrieved from specific power indicators, which are correlated with power measurements taken, a priori, while running some stressful workload (for instance, CPU utilization is a power indicator of the workload of the unit). Instead of developing those models based on statistical information, hardware performance counters, also known as Performance-Monitoring Counters (PMCs), can be used. PMCs are special registers associated with on-board energy sensors for measuring the energy consumption of on-core hardware components. Intel introduced these sensors - calling them "Running Average Power Limit" (RAPL) - with their Sandy Bridge microarchitecture [15]. Although this is a better approach than the one using power indicators, since it takes direct measurements of the system's power into account, it is still based on system events, which do not accurately reflect power consumption [16]. Although that issue can be attenuated with fine-grained instrumentation of single processing units [17], a study by McCullough et al. [18] revealed that affine-based models often perform poorly when it comes to modeling more complex computer systems and workloads, arguing for the need of increased direct measurement by physical instrumentation of power consumption.
In-situ power measurement with meters and special devices is an alternative method to determine power consumption. Digital voltmeters and clamp meters are among the choices [13, 19], but there are also special devices built for this purpose, such as PowerPack [1] and PLEB [20]. In general, all those systems are inadequate, since they do not provide high sampling rates [13, 19], do not permit sampling more than one sensor at a time, or their scalability to other systems is not feasible [13, 19, 1, 20]. Other systems rely on measuring the total system power at the wall socket (AC power), but their applicability is limited in the case of any fine-grained adaptation. Furthermore, power measurements at the system level cannot distinguish between the actual power used and the power wasted due to inefficiencies in the power supply. For all the above reasons, an accurate, reliable, fast and easily integrated energy/power measurement system is necessary, so that it is appropriate for real-time power characterization of any algorithm.
In this dissertation, a hardware system called Powermeter is presented. The Powermeter is a prototype developed within the INESC-ID SiPS group, which is capable of directly measuring the 12 V, 5 V and 3.3 V power rails from the motherboard's power connector, as well as the power of individual components (e.g., Central Processing Unit (CPU), Graphics Processing Unit (GPU) and Hard Disk Drive (HDD)). It can also measure the AC power requested by the system, using an Automatic Gain Controller (AGC), which dynamically amplifies the input signal, improving the Analog-to-Digital Converter (ADC) dynamic range and allowing small variations in the system's power consumption to be distinguished. The design and requirements of the full system, as well as details about the theory and design of the AGC, are presented in later chapters of this dissertation.
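To make the role of the AGC concrete, the following toy sketch illustrates the on-off control idea developed in Chapter 3: the controller raises the Programmable Gain Amplifier (PGA) gain while the sampled peak underutilizes the ADC input range, and backs it off near saturation. The gain set, voltage thresholds and function names below are assumptions for illustration only, not the actual firmware:

```python
# Illustrative on-off AGC sketch (hypothetical thresholds and gain set;
# the real controller is described in Chapter 3).

GAINS = [1, 2, 4, 8, 16]        # assumed programmable gain steps
V_FULL_SCALE = 5.0              # assumed ADC full-scale input [V]
HIGH = 0.9 * V_FULL_SCALE       # near-saturation threshold
LOW = 0.4 * V_FULL_SCALE        # under-utilization threshold

def agc_step(peak_voltage, gain_index):
    """One on-off control step: return the new index into GAINS."""
    if peak_voltage > HIGH and gain_index > 0:
        return gain_index - 1               # back off: avoid clipping
    if peak_voltage < LOW and gain_index < len(GAINS) - 1:
        return gain_index + 1               # boost: use more of the range
    return gain_index                       # inside the dead band: hold

# A weak 0.2 V peak input is progressively amplified until the
# conditioned signal sits comfortably inside the ADC range:
idx = 0
for _ in range(6):
    idx = agc_step(0.2 * GAINS[idx], idx)
print(GAINS[idx])  # 16
```

The dead band between LOW and HIGH prevents the controller from oscillating between two adjacent gain steps when the signal amplitude sits near a threshold.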
1.1 Motivation
Energy (in Joule), which is the physical quantity used for electricity bills (usually in kWh), is defined as the integral of the instantaneous power (in Watt) drained over time:
\[ \text{Energy} = \int_{t_1}^{t_2} P(t)\,dt \tag{1.1} \]
Thus, the average power consumed can be obtained as the ratio between the energy and the period of integration.
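As a concrete illustration of equation (1.1), the energy of a sampled power trace can be approximated numerically with the trapezoidal rule (the scheme later adopted by the Powermeter host software in Section 4.3.2); the sketch below, with made-up sample values, is only illustrative:

```python
# Illustrative sketch: numerical approximation of equation (1.1).
# Given power samples P[k] taken at a fixed period Ts, the energy over
# the interval is approximated with the trapezoidal rule, and the
# average power is the energy divided by the integration period.

def energy_trapezoid(power_samples, ts):
    """Approximate Energy = integral of P(t) dt with the trapezoidal rule."""
    energy = 0.0
    for p_prev, p_next in zip(power_samples, power_samples[1:]):
        energy += 0.5 * (p_prev + p_next) * ts  # area of one trapezoid [J]
    return energy

def average_power(power_samples, ts):
    """Average power = energy / integration period [W]."""
    duration = (len(power_samples) - 1) * ts
    return energy_trapezoid(power_samples, ts) / duration

# Example with a constant 60 W load sampled at 1 kHz for 5 samples:
samples = [60.0, 60.0, 60.0, 60.0, 60.0]
print(energy_trapezoid(samples, 1e-3))  # ~0.24 J over 4 ms
print(average_power(samples, 1e-3))     # ~60.0 W
```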
In desktop computers, power is supplied by a Power Supply Unit (PSU), which converts AC power to DC power and splits the latter among several rails (12 V, 5 V and 3.3 V). Power is drained from these rails to supply several components, such as Random Access Memory (RAM) modules, the Network Interface Chip (NIC), fans and others, and it is important to know whether there are specific rails that mainly power certain components. Thus, the ATX/EPS specifications [21] were analysed.
Every motherboard complying with the ATX12V or EPS12V design directives has a 24-pin main power connector (see figure 1.1(a)), from where power is drained to supply all the basic functions of the motherboard. However, some elements are powered from specific connectors coming from a standard PSU. The connectors that will be described are of the Molex type [22]: Molex connector is the term that defines a two-piece pin and socket interconnection, widely used for connecting power in desktop PCs because of the simplicity, reliability, flexibility and low cost of the Molex design (see figure 1.1).
The connector that powers the CPU is represented in figure 1.1(b). This is usually named the EPS12V connector and provides the necessary power for modern multicore processor packages through four 12 V rails. Power to supply GPUs or Field-Programmable Gate Arrays (FPGAs) is drained from a PCI-Express connector, as illustrated in figure 1.1(c). Older connectors had 6 pins, while the modern ones have 8 pins,
(a) Motherboard Connector Pin-out (b) EPS12V Molex Pin-out
(c) 8 pin PCI Express Pin-out (d) SATA Cable Pin-out
(e) 4 Pin Peripheral Pin-out
Figure 1.1: Motherboard and Component’s Connectors Pin-out
providing all the power to graphic cards. This connector provides power through three 12V power rails.
The HDD is powered by a SATA power connector, accompanied by a data cable. This connector normally has four major rails: 12 V, 5 V, 3.3 V and GND (see figure 1.1(d)); the 3.3 V rail is an extra one that most of the time is not needed. Other peripherals are powered by the 4-pin peripheral power cable, identified in figure 1.1(e). This connector provides power through the 12 V and 5 V rails.
The CPU and PCI Express cables are each connected to a single current sensor, which has specific input connectors for the Molex type. On the other hand, the SATA cable connecting the HDD carries both 12 V and 5 V rails, so measurements have to be done through two different current sensors (one for the 12 V rail and the other for the 5 V one). The other rails connecting directly to the motherboard also have their own current sensors (this includes the 12 V, 5 V and 3.3 V rails).
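Since each monitored rail has its own current sensor, the total DC power drawn from the PSU is simply the sum of voltage times current over the rails. A minimal sketch (the rail voltages come from the ATX/EPS specification; the current readings are hypothetical examples, not real measurements):

```python
# Minimal sketch: total DC power from per-rail current measurements.
# The current readings in the example are hypothetical.

RAIL_VOLTAGES = {"12V": 12.0, "5V": 5.0, "3.3V": 3.3}  # ATX/EPS rails [V]

def total_dc_power(rail_currents):
    """P_total = sum over rails of V_rail * I_rail, in Watt."""
    return sum(RAIL_VOLTAGES[rail] * current
               for rail, current in rail_currents.items())

# Example: 4 A on the 12 V rail, 2 A on 5 V, 1 A on 3.3 V:
print(total_dc_power({"12V": 4.0, "5V": 2.0, "3.3V": 1.0}))  # ~61.3 W
```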
The computing science community has attained remarkable results regarding algorithms' time performance by resorting to multi-core processors (using CPUs, GPUs and FPGAs). However, this kind of parallelism usually implies higher power demands, due to the minimum level of power required by each active core. In addition, different configurations and systems consume different amounts of power under the stress of the same workload. And, even with the same system and configuration, an algorithm has different power demands over time, due to the various tasks running on the processor, which may require distinct resources.
Thus, it would be convenient to somehow understand how and where power is being consumed by a scientific computational application under a given configuration (i.e., running with one or more threads on one or more cores). For instance, with that kind of information in hand, the engineer could choose the best combination of the number of nodes and threads for which the algorithm attains good performance in terms of both time and energy consumption. The engineer could also decide to exploit parallelism for specific jobs with GPUs instead of CPUs, or vice-versa, depending on the performance gain and energy cost. In addition, the real-time power information can be combined with power-aware strategies, such as Dynamic Voltage-Frequency Scaling (DVFS), decreasing the CPU's power consumption.
This problem can be addressed by resorting to energy models [23, 24, 13], which rely on specific power
indicators (such as CPU utilization) to deliver power consumption information. Other approaches are
based on direct power measurements, through the usage of voltmeters and ammeters. All those methods
(which will be addressed in greater detail in further chapters) are inaccurate and/or do not provide enough
time resolution to characterize, in detail, the power consumption of a workload.
Therefore, the main motivation for this thesis is the design of hardware and software components
that cooperate to achieve on-board sensory processing with real-time computation capabilities, so that
the aforementioned problems can be solved.
1.2 Related Work
Measuring power consumption in an accurate and easy way, so as to allow power profiling, is a challenge.
Bircher and John [2] proposed the use of PMCs to find the power of the processor and other devices,
such as memory and disk. Chang et al. [25] proposed energy models to estimate the components’
power consumption, alleging that PMCs are not suitable to build fine-grained energy models and that
the power profiling thread may interfere with other applications using PMCs.
Other researchers [3] defended the use of in-situ measurements, since they provide more accurate
readings and can cover both the total system power and that of individual devices. Data collection can
be done with digital voltmeters and ammeters (PowerScope [13]) or inbuilt sense resistors (Watch-
Dog.com PowerEgg [26], WattsUp? Pro [27]). These approaches usually have low time resolutions (2
or 4 Hz), which leads to a major loss of power information, if we bear in mind that a multi-core processor
can issue billions of instructions per second.
The aforementioned approaches thus have drawbacks, such as not being fast or accurate
enough, failing to fulfill the computer engineer's need for a device that can successfully
characterize, in real time, power consumption variations.
1.3 Objectives
The main purpose of this thesis is to develop an electronic device, to be incorporated in a standard
desktop system, which allows an accurate and precise way of profiling power consumption in real time.
Such an accomplishment would make it possible to analyze and optimize applications and also to compare different
implementation strategies. Therefore, a list of requirements with which the device must
comply, in order to successfully achieve these goals, is defined:
• It must be able to sense all (or almost all) the rails that supply power to a desktop PC with low
power losses;
• It must provide accurate and precise measurements;
• It must not introduce much (ideally any) time overhead and must have a user-friendly interface;
• It must feature a great time resolution, so fast variations in power demand are distinguishable
(Fs > 1 kHz);
• It must exchange data at high rates (> 64 KB/s);
• It should be powered by the host system and composed of the smallest possible number of
components, in order to reduce costs and guarantee a low form factor.
Accordingly, it will be necessary to understand how a common desktop PC is powered, by analysing
the ATX/EPS12V design guide [21], and to design the electronic circuits to acquire the different signals, as
well as an easy-access Application Program Interface (API). Furthermore, the system stability and performance (in terms of
added noise, attained sampling frequency and latency) have to be evaluated. Finally, the device has to be
tested, using several benchmarks and comparing the readings with internal counters
(RAPL). In addition, as proof of concept, this framework will be used to profile the power and energy
consumption of the NAS parallel benchmarks, demonstrating how useful the device can be to
the computer engineer.
1.4 Main Contributions
This dissertation resulted in the following contributions: the development of an
easy-to-integrate board, supplied by the measured system itself and with low power losses; accurate
readings on both AC and DC rails, while introducing small time overheads (in the worst case of the
tests performed, an extra 1.27% of execution time); AC readings supported by an AGC system, which
dynamically amplifies the sensed signal, improving the ADC's dynamic range; sampling at rates hardly
attained by previous devices (Fs = 3.3(3) kHz); and data transfers at high rates, through the Universal
Serial Bus (USB) protocol (more than 64 KB/s).
All these characteristics make the proposed device a specialized tool for real-time power profiling
of complex scientific workloads, making it suitable to be used with common power-aware strategies,
such as DVFS. Moreover, the tests conducted revealed that, with the device, it is possible to distinguish
different power patterns associated with distinct stages of a computing algorithm. It was also used
to classify diverse applications according to their degree of both time and energy performance, and to
understand in which configurations (running with one or more threads per processor) an algorithm attains
better performance.
1.5 Dissertation Outline
This dissertation is organized in six chapters:
• In Chapter 2 the State of the Art is introduced, presenting the common solutions to the
problem, including software-based and hardware-based measurements;
• In Chapter 3 a global view of the system is given and the full hardware characteristics of the Powermeter
device are revealed. The chapter covers the devices used for signal conditioning and sampling,
the structure and analysis of the Automatic Gain Control (AGC), the number of acquisition channels
used, the adopted sampling frequency, among others;
• In Chapter 4 the software and firmware structures are introduced, revealing how the host and the
microcontroller communicate, what information is exchanged
between the two units and in what cases, and the procedures/algorithms necessary to guarantee
system synchronism and reliability;
• In Chapter 5 an evaluation of certain parameters of the device (such as system stability) is made;
the device performance is analysed by testing it with well-known benchmarks (NPB [28],
STREAM [29], Bonnie [30], SPLASH2 [31], ...) intended to stress specific components (CPU, GPU,
HDD, memory I/O, ...); and a characterization of the PC energy consumption is performed. The results
and conclusions of those tests are also included in this chapter;
• In Chapter 6 concluding remarks are presented, based on the results obtained in the earlier chapters.
Furthermore, future work is proposed.
2 State-of-the-Art
Contents
2.1 Real-time Power Measurement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
High power consumption is a problem that the scientific community has been trying to solve, whether
by changing the CPU architecture or by monitoring power and properly adapting the kernel scheduler. This
actually raises two distinct problems: accurately measuring energy/power consumption, and implementing an
algorithm that actually saves energy while not significantly compromising the running time of the application.
In the following sections, only the energy measurement methods will be reviewed, since that is the topic that belongs
to the scope of this dissertation. The advantages and drawbacks of each system will also be pointed
out along the various sections.
2.1 Real-time Power Measurement
Real-time power measurement in computing systems can be accomplished by hardware-based or
software-based methods. Hardware-based power measurement mainly uses different kinds of instruments
to measure the power of a device directly. The result is much more accurate than that of software-based
methods, and it is usually used to evaluate the effectiveness of power saving techniques. However,
hardware-based methods are limited to measuring component-level power profiles. Hence, even though
not as accurate as hardware-based methods, software-based power profiling tries to estimate the power
at different levels by designing a group of power models.
2.1.1 Software-based Power Measurement
Some authors argue that hardware-based methods are hard to integrate and expensive [32, 16].
Thus, a lot of research has been conducted in the area of software measurement and profiling.
Software-based approaches usually build power models to estimate the power dissipation at different
levels: instruction level, program block level, process level, hardware component level, system level and
so forth. These methods first try to find the power indicators that could reflect the power of these software
or hardware units. Then they build the power model with these power indicators and fine-tune its
parameters. It is possible to split power models into two categories, based
on the power indicators they use: system profile-based methods and hardware performance
counter (PMC) based methods.
2.1.1.A System Profile-based Power Model
System profiles, or system events, are a set of statistical performance information supplied by the operating
system (for instance, Linux saves that statistical information under the "/proc" directory). These
events reflect the current state of the hardware and software, including the operating system.
Li and John [33] estimated the power dissipation of the operating system using those kinds of events.
They found that some operating system routines spend constant power, while others, such as
process scheduling and I/O operations, have power dissipation with a linear relationship to the Instructions Per
Cycle (IPC). They built the power model based on IPC, as shown in equation 2.1, in which k1
and k0 are constants obtained from a linear regression step. Finally, they defined the routine-level
operating system power model by adding all the individual routine energies, as shown in equation 2.2.
P = k1 × IPC + k0 (2.1)
E_OS = Σ_i (P_OS_routine,i × T_OS_routine,i) (2.2)
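As an illustration of this kind of model, the following sketch fits the constants of equation 2.1 from training samples and then applies equation 2.2 to two OS routines; all the sample data, and therefore the resulting constants, are made up for the example and are not values from [33].

```python
# Sketch of the routine-level OS power model (Eqs. 2.1 and 2.2).
# Training samples and routine figures below are illustrative only.

def fit_linear(ipc, power):
    """Least-squares fit of P = k1*IPC + k0 (the 'linear regression step')."""
    n = len(ipc)
    mean_x = sum(ipc) / n
    mean_y = sum(power) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(ipc, power))
    den = sum((x - mean_x) ** 2 for x in ipc)
    k1 = num / den
    k0 = mean_y - k1 * mean_x
    return k1, k0

def os_energy(routines, k1, k0):
    """Eq. 2.2: E_OS = sum_i P_routine,i * T_routine,i."""
    return sum((k1 * ipc + k0) * t for ipc, t in routines)

# Hypothetical training samples: (IPC, measured power in watts).
ipc = [0.5, 1.0, 1.5, 2.0]
power = [12.0, 17.0, 22.0, 27.0]          # linear by construction
k1, k0 = fit_linear(ipc, power)
# Two hypothetical OS routines: (IPC, time spent in seconds).
energy = os_energy([(1.2, 0.5), (0.8, 2.0)], k1, k0)
```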
Kansal et al. [34] introduced a virtual machine level power model. They first built power models
to estimate the power of the CPU, memory and disk, and then distributed that power to each virtual
machine based on its utilization of each component. The CPU energy model they proposed is based on
CPU utilization; the memory energy model uses the number of Last Level Cache (LLC) misses; and the
disk energy model relies on the number of bytes that the disk reads and writes. These models are formalized
as follows:
E_cpu = α_cpu × µ_cpu + γ_cpu (2.3)
E_mem(T) = α_mem × N_LLCM(T) + γ_mem (2.4)
E_disk = α_rb × b_R + α_wb × b_W + γ_disk (2.5)
In these three equations, µ_cpu refers to the CPU utilization, N_LLCM to the number of LLC misses in time T,
and b_R and b_W to the amount of bytes read from and written to the disk, respectively.
The parameters α and γ are constants obtained when training the energy model.
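A minimal sketch of how these three models could be evaluated, assuming already trained α and γ constants; the constant values below are entirely made up for the example, not the ones Kansal et al. obtain.

```python
# Illustrative evaluation of the per-component models (Eqs. 2.3-2.5).
# All alpha/gamma constants are made-up placeholders for trained values.

ALPHA_CPU, GAMMA_CPU = 0.3, 5.0      # watts per % utilization, idle watts
ALPHA_MEM, GAMMA_MEM = 2e-6, 1.0     # joules per LLC miss, baseline
ALPHA_RB, ALPHA_WB, GAMMA_DISK = 1e-9, 2e-9, 0.5  # joules per byte

def e_cpu(utilization):                  # Eq. 2.3
    return ALPHA_CPU * utilization + GAMMA_CPU

def e_mem(llc_misses):                   # Eq. 2.4
    return ALPHA_MEM * llc_misses + GAMMA_MEM

def e_disk(bytes_read, bytes_written):   # Eq. 2.5
    return ALPHA_RB * bytes_read + ALPHA_WB * bytes_written + GAMMA_DISK

# Hypothetical interval: 50% CPU, one million LLC misses, 100 MB read/written.
total = e_cpu(50.0) + e_mem(1_000_000) + e_disk(10**8, 10**8)
```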
Chen et al. [32] also used a similar method to compute the power of a process. However, they targeted
laptop computers, so the wireless network card consumption was also considered.
2.1.1.B PMC-based Power Model
As mentioned earlier, hardware performance counters, or PMCs, are a group of registers that are used
to count hardware, software and operating system events. Thus, a large group of proposals use PMCs
to build power models. Bellosa [35] was probably the first to propose the usage of PMCs to
estimate power information, having found that four PMCs (integer operations, floating-point
operations, second-level address strobes and memory transactions) are tightly related to the power dissipation
of a processor. In [36], Contreras et al. construct power models for the Intel PXA255 processor
and memory. For the processor power model, they used the following PMCs: instructions executed,
instruction cache misses and TLB (Translation Lookaside Buffer) misses. They built the memory power
model with two counters: instruction cache misses and data cache misses.
The previous works show that performance events can be directly used to build power models for the CPU
and memory; however, current performance counters do not directly supply useful events that reflect the
activity of other devices. Bircher and John [2] show that processor-related performance events are highly
correlated with the power of other devices, such as the memory, chipset, I/O and disk. To be able to establish
the power model for these subsystems, they first need to understand how
these events are propagated in each subsystem. Figure 2.1 shows the propagation of these performance
events in all the subsystems they defined.
They used in-built resistors to measure the power consumption of those subsystems, acquiring data
with a separate workstation at a rate of ten thousand samples per second. Then, they related the power
information with performance counter samples taken at a much slower rate of one per second. Based on
these measurements, they selected the nine performance events that were the best indicators of the power spent
on those subsystems: cycles, halted cycles, fetched uops, level 3 cache misses, TLB
misses, Direct Memory Access (DMA) accesses, processor memory bus transactions, uncacheable
accesses and interrupts. Equation 2.6 shows the CPU power model they proposed. More details about
the models used can be found in [2].
P_CPU = Σ_{i=1}^{NumCPUs} [ 9.25 + (35.7 − 9.25) × PercentActive_i + 4.31 × FetchedUops_i / Cycle ] (2.6)
In this equation, PercentActive_i is the percentage of CPU utilization, FetchedUops_i is the number of micro-
operations fetched by the processor and Cycle is the core frequency time. The values 35.7 and 9.25
reflect the maximum and minimum power dissipation of one CPU, respectively, and 4.31 is a constant
that gives the relationship between the performance events and the real power.
Figure 2.1: Propagation of Performance Events. (Source [2])
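As a numeric sanity check of equation 2.6, the sketch below evaluates the model for two per-CPU readings; the 9.25, 35.7 and 4.31 constants come from the equation itself, while the utilization and uops figures are invented.

```python
# Sketch of the CPU power model of Eq. 2.6 (Bircher and John [2]):
# P_CPU = sum_i [ 9.25 + (35.7 - 9.25)*PercentActive_i
#                 + 4.31 * FetchedUops_i / Cycle ].

def cpu_power(samples):
    """samples: list of (percent_active, fetched_uops_per_cycle) per CPU."""
    return sum(9.25 + (35.7 - 9.25) * active + 4.31 * uops_per_cycle
               for active, uops_per_cycle in samples)

# Hypothetical two-CPU reading: 80% and 20% active, 1.5 and 0.4 uops/cycle.
p = cpu_power([(0.80, 1.5), (0.20, 0.4)])
```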
In the following, the RAPL system will be addressed in some detail, due to its relevance for the
evaluation of the results of the work conducted in this dissertation.
RAPL
Intel model-specific (or machine-specific) registers (MSRs) are implemented within the x86 and x64
instruction sets as a means for processes to access and modify parameters related to CPU execution [37].
A handful of MSRs are allocated for platform-specific power management within the Sandy Bridge and
successor microarchitectures, and allow access to energy measurements and to the enforcement of power
limits. In particular, Intel refers to these registers as the Running Average Power Limit (RAPL) interfaces. RAPL
provides sensors that allow measuring the power consumption of the CPU-level components listed in
table 2.1. The available counters limit measurements to CPU and memory controller power consumption;
it is, for instance, impossible to measure the energy consumption of I/O devices.
RAPL PKG    Whole CPU package
RAPL PP0    Processor cores only
RAPL PP1    "A specific device in the uncore"
RAPL DRAM   Memory controller

Table 2.1: List of available RAPL sensors
The Intel documentation [37] states that client platforms have access to PKG, PP0 and PP1, while server
platforms (code name Jaketown) may access PKG, PP0 and DRAM. From the above domain definitions,
one expects that energy(PP0) + energy(PP1) = energy(PKG). However, on client Sandy Bridge platforms
PP1 measures the energy of the on-chip graphics processor, as opposed to the entire uncore. On
these machines, energy(PP0) + energy(PP1) ≤ energy(PKG) and energy(PKG) − (energy(PP0) +
energy(PP1)) = energy(uncore).
RAPL has its limitations. Individual cores cannot be measured, and PP0 represents the sum of all
core energies. Similarly, the DRAM and uncore energy data do not distinguish between the various memory
channels or uncore devices. Moreover, since RAPL is based on energy models, it is not as accurate
as direct measurement methods, and the minimum acquisition period of the counters, according to Intel
[37], is about 1 ms. For a four-core processor running at 3.5 GHz, which fetches seven instructions per
clock cycle, that period of time translates into a resolution of 98 million instructions.
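For reference, on a Linux machine with a supported CPU these counters can be sampled without extra hardware through the powercap sysfs interface. The sketch below is a hedged example: the sysfs path is the standard Linux one, but its availability depends on the platform, and the wraparound constant is a placeholder that on a real system should be read from max_energy_range_uj.

```python
# Hedged sketch of sampling the RAPL package-energy counter through the
# Linux powercap interface (/sys/class/powercap/intel-rapl:0/energy_uj).

import time

MAX_ENERGY_UJ = 262_143_328_850   # placeholder for max_energy_range_uj

def average_power(e0_uj, e1_uj, dt_s, max_uj=MAX_ENERGY_UJ):
    """Average watts between two energy_uj readings, handling counter wrap."""
    delta = e1_uj - e0_uj
    if delta < 0:                 # the energy counter wrapped around
        delta += max_uj
    return delta / dt_s / 1e6     # microjoules per second -> watts

def read_energy_uj(path="/sys/class/powercap/intel-rapl:0/energy_uj"):
    with open(path) as f:
        return int(f.read())

if __name__ == "__main__":
    try:
        e0 = read_energy_uj()
        time.sleep(0.1)
        e1 = read_energy_uj()
        print("package power: %.2f W" % average_power(e0, e1, 0.1))
    except OSError:
        print("RAPL powercap interface not available on this machine")
```

Note that this only exposes the modeled CPU/DRAM energy discussed above; it says nothing about the other rails the Powermeter targets.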
2.1.2 Hardware-based Power Measurement
Hardware-based methods use instruments to measure the current or voltage of the hardware devices.
Those measurements are later used to compute the power spent by the measured object. The
instruments used include different types of meters, special hardware devices
that can be embedded into the hardware platforms, and power sensors designed within the hardware itself.
Normally, these methods can only measure component-level power, because the high integration
of the hardware circuits makes the lower-level functional units difficult to measure. Some researchers
[38, 19] rely on micro-benchmarks that stress one or more specific functional units, to isolate
lower-level power.
Power Measurement with Meters
The usage of meters for direct measurement is a straightforward method to understand the power
dissipation of individual devices and of the full system. Some authors [39, 38] make use of power meters to
measure the real power and use it to analyse and validate their research work. Other researchers [1]
measure the hardware components' power and break it down into lower levels, based on indicators
that reflect the activity of these lower-level units. The difference between these methods lies in the type
of meter that is used to do the measurements and in the place where the measurement is done.
One commonly used type of meter is the digital voltmeter. These meters can generally sample the measured
object once each second, and the result can be collected through a serial port connected
to the data collection system. To use this method, we need to disconnect the wire that we want to
measure and insert a small resistance in series (usually less than 0.5 Ω). Finally, we measure the voltage across
the resistor and compute the power on this wire. Figure 2.2 shows an example of this method. Joseph
et al. [19] use voltmeters to indirectly measure the power while executing different benchmarks and to make
power/performance tradeoffs.
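A worked example of the computation this method implies, with illustrative values for the shunt resistance and rail voltage:

```python
# Sense-resistor method: insert a small resistor in series with the rail,
# measure the voltage drop across it, and recover current and power.
# The component values below are illustrative, not from any cited setup.

R_SENSE = 0.1     # ohms, within the "usually less than 0.5 ohm" range
V_RAIL = 12.0     # nominal rail voltage in volts

def rail_power(v_drop, r_sense=R_SENSE, v_rail=V_RAIL):
    """P = I * V, with I = v_drop / r_sense (Ohm's law on the shunt)."""
    current = v_drop / r_sense
    return current * v_rail

p = rail_power(0.05)   # a 50 mV drop means 0.5 A on the 12 V rail
```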
A second kind of meter is the clamp ammeter, which can measure the current without
disconnecting the wire. Normally, a clamp ammeter has a larger measurement range than a voltmeter, so
it can be used to measure the power of systems that draw a much higher current. Kamil et al. [40]
use the direct power measurement method with clamp meters to measure the power of a Cray XT4
supercomputer under several HPC (High Performance Computing) workloads.
Figure 2.2: An example of using a multimeter to measure the power. [3]
Voltmeters, ammeters and clamp meters allow researchers direct access to the DC motherboard
rails. However, this approach does not scale well to large clusters, as it requires a lot of customization for each
machine. Further, the low timing resolution (usually 4 Hz) is not adequate to perform dynamic profiling.
In addition, there is also the possibility of using a power meter, such as the WattsUp? Pro [27]
or the WatchDog.com PowerEgg [26], to measure the AC power. This kind of meter can only measure
the system-level power, because only the power supply is powered by AC. These external power meters
allow the measurement of power consumption at a maximum frequency on the order of twice per second.
Since DC output of a computer power supply is buffered and filtered for stability, fast variations in the DC
load often do not translate to corresponding variations on the AC side of the supply. For these reasons,
these legacy power meter devices are not suitable to monitor system power usage with sufficient detail
to perform dynamic profiling and make in situ decisions for power / performance optimization during
application execution.
Integrating Sensors into Hardware
While direct measurement with meters is simple, it does not supply ways to control the measurement
process, for example, to synchronize the measured power with the monitoring of the
performance metrics. To circumvent this, there are devices that offer such a possibility while producing
more accurate power measurements, such as PLEB 2 [20] and PowerPack [1].
(a) PowerPack Architecture (b) PowerPack Overview
Figure 2.3: PowerPack (Source [1])
In the course of the research following their previous work [3], Ge et al. [1] proposed a power analysis
framework called PowerPack, whose architecture is shown in Figure 2.3(a). They run the
profiled application, the system status profiler, a thread that controls the meters and a group of power
reader threads on the same platform. PowerPack uses a special method to synchronize the measured
power with the process information. First, they implement a set of functions, shown in table 2.2, that are
called by the profiled applications before and after some critical code blocks. The execution of these
functions then triggers the system status profiler and the meter control thread to sample the data. After
that, the power analyzer can simultaneously inspect the collected data. Finally, they propose a method
to map the measured power onto the application code and analyze the energy efficiency in a multi-core
system.
Function                Description
pmeter init             Connect to meter control thread
pmeter log              Set power profile log file and options
pmeter start session    Start a new profile session and label it
pmeter end session      Stop current profile session
pmeter finalize         Disconnect from meter control thread

Table 2.2: PowerPack power meter profile API. (Source [1])
In spite of the accuracy granted by this framework, it still has major drawbacks. As can be seen
in figure 2.3(b), and according to [1], the framework uses external meters such as the WattsUp? Pro [27] to
measure AC power, resistor sensors, and an external computer to save the sampled data. This makes it
impossible to easily scale this framework to other computing systems, and also to use the data
for power-aware strategies.
Another hardware-based power measuring device is PLEB 2 [20]. PLEB 2 is a single-board computer
based on the Intel XScale PXA255. It was custom-designed primarily as a reference to be used in embedded
systems research, and secondarily as a platform for application implementation. The PXA255
was chosen as representative of high-performance CPUs designed for embedded systems. It consists
of a 400 MHz ARMv5TE-compatible core combined with a set of on-chip peripheral units, including
memory, interrupt, DMA and LCD controllers. The main processing core consists of the CPU, SRAM
and flash memory. Three switching power supplies provide power to the core, memory and IO.
The device was designed with power-measurement hardware on board. Each of the three power
supplies (nominally for the CPU core, memory and IO) is instrumented with a current sensor. Each
power supply is well regulated to its designated voltage; therefore, the voltage is assumed to be constant
and the current is proportional to the power (P = IV). The on-board microcontroller has an integrated ADC
and can read the sensors at up to 15 kHz. Since it can only measure one of the sensors at
a time, this equates to a maximum of 5 kHz per individual sensor when all are measured at equal
rates. Samples are transferred from the microcontroller to the PXA255 as they are taken.
This device solves many disadvantages of the other mentioned platforms (PowerScope [13] and PowerPack [1]),
but it still has significant limitations. Communication between the microcontroller and
the PXA255 is via I2C, which transfers data more slowly (400 kbps) than other protocols such as SPI or USB.
Moreover, instead of measuring the current supplied by each power supply, a current sensor could be installed per
device, allowing the system to measure the current consumed by each one. This would mean each IO device would
be individually monitored, allowing users to understand how and why each device consumes power. In
addition, PLEB 2 was a system designed particularly for an ARM platform, so it cannot be used for other kinds
of systems.
2.2 Summary
In this chapter, different ways that researchers came up with to solve the problems related to
energy consumption in computing systems were presented. Some of those methods include
designing new power-efficient components and implementing power-aware strategies such as DVFS.
However, it is crucial to understand, for instance, which components spend more power, which parts of the
source code are demanding more power, and under what conditions that happens. Thus, measuring energy
consumption in an accurate, easy and efficient way is fundamental. Towards achieving that goal, two
major methods were introduced: hardware-based and software-based measurements.
In sum, software methods have the advantage of being easy to integrate in every modern system, of
being able to give consumption at the process level, and of permitting the data to be used at run time for
power-aware policies. Contrastingly, as we saw, those methods lack accuracy when compared to direct
measurements, since they are based on estimations. Hardware methods are more reliable, but can be
very hard to scale to other systems, due to the customization needed, and may be an expensive way
of measuring consumption. Moreover, some of them do not allow the use of power-aware strategies
because of their low sampling frequency.
In the next chapter, details about the framework of the measuring system developed in this dissertation
are provided.
3 Powermeter - Architecture Definition and Specifications
Contents
3.1 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Signal Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3 AGC - Automatic Gain Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
In this chapter, the system architecture of the proposed Powermeter device is presented, together
with a detailed description of its components. Specific aspects of the device are addressed, such as the AC and
DC signal conditioning, including the design of an Automatic Gain Controller used to improve the ADC's
dynamic range. Thereby, an analysis of the ADC's SNR, THD and SFDR is carried out, and some
design criteria are defined. In the end, the reader will have a clear idea of how the device samples
and treats data, so that accurate measurements are attained.
3.1 System Architecture
Figure 3.1: System Architecture
Powermeter is capable of measuring all the available rails coming from the personal computer PSU,
including the CPU EPS12V 6/8-pin, PCI-E 6/8-pin and SATA 4/5-pin connectors. Also, the 12 V, 5 V and 3.3
V rails from the motherboard connector and the input of the PSU connector (the one directly connected
to the AC power socket) are sensed (see figure 3.1). The Powermeter sensor devices are inserted
directly on the power connectors coming from the PSU, by plugging the supply-side power cables into the input
connectors of the Powermeter and connecting the output to wherever the initial connector would plug in
(motherboard, CPU, peripherals and others).
Regarding the sensors, precise low-offset, linear Hall-effect sensors were used, which convert the
magnetic field generated by a current into a proportional voltage. The Hall IC has a copper conduction
path with an internal resistance of 1.2 mΩ, providing low power losses. The device uses the USB
protocol both to power the device and to support communications between it and the host, since USB is a fast,
versatile and reliable interface. Powermeter uses a Microchip PIC18F4550 Microcontroller Unit (MCU),
inserted in a common demonstration board (PICDEM FS USB), to establish communication between
the host system's USB and the current sensors on the device. The system communication is interrupt-
driven, simplifying the microcontroller code and decreasing the time overhead, since there is no need for
the host to poll the device to find out whether there is data to be collected. The demonstration board has a 20
MHz oscillator as input, which is then used to generate (with PLLs) the 48 MHz MCU clock (the USB
peripheral also runs with a 48 MHz clock - Full-Speed USB Mode). This microcontroller also includes
timer modules (which were used to generate the necessary time stamps and the sampling frequency) and
a 10-bit ADC.
3.2 Signal Acquisition
Figure 3.2: Block Diagram of the Hardware Architecture
Figure 3.2 exposes a general view of the proposed acquisition system. The system properly acquires
the output of the DC and AC sensors. The AC acquisition introduces an AGC system, which dynamically
amplifies the input signal, allowing small variations of the signal to be distinguished. The system comprises
an active band-pass filter, which reduces noise and only passes the 50 Hz component of the spectrum,
eliminating in the process the offset imposed by the current sensor, and a controller that dictates whether
the signal gets amplified or attenuated, using a Programmable Gain Amplifier (PGA). Consequently, by
the usage of this novel approach, the ADC's dynamic range gets improved. Low power-loss (1.2 mΩ)
sensors (ACS712/14) were used to sense the AC and DC currents of the several power rails. These sensors
operate based on the Hall-effect principle and output a voltage which is proportional to the magnetic field
generated by the sensed current. The sensors have different sensitivities, depending on the range of
current they are able to sense. For instance, for a 20 A range we have a 100 mV/A sensitivity, whilst for
a 30 A range the sensor gives a 66 mV/A output. However, these sensors introduce an offset approximately
equal to Vcc/2. As a result, and also because we want a good precision in the ADC conversions (i.e., as
many filled bits as possible), the signal coming from the sensor has to be conditioned. AC and DC signal
conditioning require different approaches, as will be described in the next sections.
Figure 3.3: Current Sensor: (a) Hall-Effect Sensor; (b) Allegro's Current Sensor. Source: Allegro MicroSystems
3.2.1 DC conditioning
Figure 3.4: DC Conditioning (sensing and amplification stages)
Figure 3.4 illustrates the DC signal acquisition. In sum, the output generated by the ACS712/14 sensors
passes through an amplifying stage before being acquired by the ADC. For the DC power rails, the
conditioning procedure consists of cancelling the DC offset introduced by the ACS712/14 sensor (which
can vary between 2.330 V and 2.5 V) and applying some gain to the remaining signal, so it can be
successfully acquired by the ADC stage. Thus, the proposed conditioning circuit consists of a subtractor
amplifier stage (figure 3.5).
Figure 3.5: Subtractor Circuit
Equation 3.1 corresponds to the output signal, obtained after analysing the circuit presented in
figure 3.5 for DC input voltages (implying ω = 0, so the capacitor can be 'seen' as an open circuit).

V_Out = −(V_In − V_+) × (R4/R3) + V_+ (3.1)
Hence, for DC voltages the gain is constant and equal to G = R4/R3. However, for higher
frequencies the signal is attenuated at a rate of 20 dB per decade, since the capacitor introduces a pole
in the circuit. The corresponding cut-off frequency can be computed with equation 3.2.

ω_H = 1 / (R4 × C1) (3.2)
This acts as an anti-aliasing filter, which is required when acquiring samples. The V+ voltage can be 2741 or 2848 mV and the gain 4.254 or 4.299, always yielding an output voltage lower than +5 V, but higher than 3.5 V for input voltages greater than 2216 mV. This is always true for input currents greater than zero. These voltages and respective gains were obtained after a study of the best configurations, so as to maximise the amplification of the measured signal without saturating the ADC's range.
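As a numerical check of equation 3.1, the sketch below evaluates the subtractor stage for one of the configurations quoted above (V+ = 2.741 V, G = R4/R3 = 4.254; the 2.216 V test input is taken from the text):

```python
V_PLUS = 2.741   # reference voltage at the non-inverting input [V]
GAIN = 4.254     # R4 / R3

def condition_dc(v_in):
    """Equation 3.1 with the capacitor open (DC): VOut = -(VIn - V+)*G + V+."""
    return -(v_in - V_PLUS) * GAIN + V_PLUS

# A sensor output of 2.216 V lands just below the 5 V ADC ceiling.
print(condition_dc(2.216))
```

Note that a sensor voltage equal to V+ simply passes through unchanged, since the differential term vanishes.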
3.2.2 AC conditioning
By definition, instantaneous electric power is obtained by multiplying the voltage and current:

P(t) = V(t) I(t) (3.3)

For sinusoidal waveforms, this translates into:
V(t) = VM sin(ωt)
I(t) = IM sin(ωt + φ)
P(t) = (VM IM / 2) [cos(φ) − cos(2ωt + φ)]
(3.4)
Thus, the power waveform comprises a time-varying component and a constant component. The constant component, which can be obtained by averaging P(t), corresponds to the effective power delivered to the load, also called Active Power. The active power can also be obtained via complex-amplitude analysis, using the complex power. By definition, the complex power is given by:
S = (1/2) VM IM e^(jφ) = (VM IM / 2) cos(φ) + j (VM IM / 2) sin(φ) (3.5)
where φ is the angle between voltage and current. The real part of equation 3.5 is the Active Power, whilst the imaginary part is the Reactive Power, which corresponds to the maximum value of the power component that oscillates between the mains and the load, resulting from the energy stored in capacitors and/or inductors. Using RMS (Root-Mean-Square) amplitudes, the active power is obtained with equation 3.6.

PActive = VRMS IRMS cos(φ) (3.6)
The ratio between the active power and the Apparent Power (VRMS IRMS) is known as the power factor (p.f.) and, for sinusoidal waves, it corresponds to cos(φ). Equation 3.6 is an important result, since it states that if the RMS amplitudes of voltage and current are known and the p.f. is also known, then the Active Power can be computed. Fortunately, modern PSUs have power factor correction and the CORSAIR TX750 PSU's manufacturer guarantees a steady power factor of 0.99 [41]. Thus, it is only necessary to find the amplitude of the current sine wave at each cycle to compute the active power, since the power factor and voltage RMS value are known (p.f. = 0.99 and VRMS = 230 V for European electric power). This, in fact, means that the power of the AC signal is proportional to the current.
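The computation described above amounts to a one-line formula; a minimal sketch follows (the 0.7 A current amplitude is an arbitrary illustrative value, not a measurement from this work):

```python
import math

V_RMS = 230.0   # European mains RMS voltage [V]
PF = 0.99       # power factor guaranteed by the TX750 PSU

def active_power(i_amplitude):
    """Equation 3.6, with IRMS = IM / sqrt(2) for a sinusoidal current."""
    return V_RMS * (i_amplitude / math.sqrt(2)) * PF

print(active_power(0.7))  # roughly 113 W
```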
3.3 AGC - Automatic Gain Control
The need to measure signals with a wide dynamic range is quite common in the electronics industry, but current technology often has difficulty meeting actual system requirements. Weigh-scale systems typically use load-cell bridge sensors with maximum full-scale outputs of 1 mV to 2 mV. Such systems may require resolutions on the order of 1,000,000 to 1, which, when referred to a 2 mV input, call for a high-performance, low-noise, high-gain amplifier and a sigma-delta modulator. While the actual sensor data typically takes up only a small portion of the input signal range, the system must often be designed to handle fault conditions. This is exactly the problem with the current sensors used here, which output a very low-amplitude voltage signal. Thus, a wide dynamic range, high performance with small inputs, and quick response to fast-changing signals are key requirements. These requirements call for a flexible signal-conditioning block, with low-noise inputs, relatively high gains, and the ability to dynamically change the gain in response to input level changes without affecting performance, while still maintaining a wide dynamic range. Existing sigma-delta technology can provide the dynamic range needed for many applications, but only at the expense of an increased operation rate.
This section presents an alternative approach that uses a successive-approximation sampling 10-bit ADC, combined with an autoranging PGA front end, forming an AGC system (figure 3.6). With a gain that changes automatically based on the analog input value, it uses oversampling to increase the dynamic range of the system to more than 80 dB.
Figure 3.6: AGC Structure
The band-pass filter, introduced earlier, is useful to reduce the input noise of the system in comparison to the input signal and to eliminate the DC component of the Hall sensor. However, the current sensor output can still be a low-amplitude signal, and small variations of the amplitude are eliminated during the quantization process. Therefore, an automatic gain controller is proposed. This stage is supposed to distinguish, over time, small variations in the input signal and amplify them, guaranteeing no loss of the input signal and an improvement of the dynamic range. For the design, it is important to know the minimum and maximum amplitude that the sensor output signal can reach over time; after some experiments, it was determined that it varies between 30 mV and 50 mV. Figure 3.6 shows a diagram with the major blocks that constitute this system. Part of this scheme (the analog part) was implemented with physical elements, like operational amplifiers and digitally controlled potentiometers, whilst the digital part was performed by programming the PIC18F4550.
Starting from the left, the input signal (which is actually the output of the AC current sensor) gets filtered and amplified by the band-pass filter. Then, ignoring for now the subtraction node, the signal is dynamically attenuated or amplified, depending on the range where the signal lies. Since the goal is to always get a signal whose amplitude spans almost the full range of the ADC, if, for instance, the signal has a low amplitude (say 1 V), the AGC amplifies it. However, if its amplitude is very close to or above the ADC's full range (e.g., 4.9 V), then the signal gets attenuated, ensuring at all times that the signal lies within the full range of the ADC. It is the PGA block that amplifies or attenuates the input signal, in conjunction with the controller, which decides whether to amplify, attenuate or keep the current gain. Because the ADC embedded in the PIC18F4550 microcontroller is unipolar (range between 0 and 5 V), the PGA introduces an offset of Vcc/2, shifting the input signal to the middle of the ADC's range. This guarantees that the input signal can be properly amplified without upper or lower saturation. In addition, due to the non-linear characteristic of the ADC, this also reduces quantization errors. Afterwards, in the digital domain, the initial input signal gets recovered, by subtracting the DC offset and cancelling the gain through arithmetic shifts 1. Then, the signal gets averaged with the last 32 samples, thus resulting in a 32-point moving average filter, whose mathematical formulation is given in equation 3.7.
y(n) = (1/32) Σ_{k=0}^{31} x(n − k) (3.7)
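As a sketch, the filter of equation 3.7 can be expressed with a fixed 32-sample window (written here in Python rather than the PIC's C18 firmware; seeding the window with zeros is an assumption about the start-up state):

```python
from collections import deque

class MovingAverage32:
    """32-point moving average filter of equation 3.7, seeded with zeros."""

    def __init__(self):
        # Fixed-length window: appending a new sample drops the oldest one.
        self.window = deque([0.0] * 32, maxlen=32)

    def update(self, x):
        """Push a new sample and return the average of the last 32 samples."""
        self.window.append(x)
        return sum(self.window) / 32.0
```

On the microcontroller, the division by 32 reduces to a 5-bit right shift, in line with the power-of-2 constraint discussed below.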
By using this procedure, it is possible to optimally amplify, even more, the small variations in the current signal which occur over time. The 5-bit DAC is used to convert to the analog domain the digital value obtained after the average computation. This value has a length of 10 bits, thus only the 5 MSBs are used. The digital value is recorded in a variable, so it can be added to the acquired signal to successfully regenerate the original signal.
The controller system could be accomplished in several ways: one alternative would be to design a PI (Proportional-Integral) controller, which increases or decreases the gain according to an error signal (the reference being the maximum voltage we want at the input of the ADC). However,
1 The C18 compiler of the PIC18F4550 implements multiplication and division operations of any length that are not supported by the hardware by calling library functions. Hence, these operations are very time consuming and are not appropriate for a real-time processing project. Consequently, an amplifier that only assumes gains that are powers of 2 was designed. This eases the process of recovering the initial signal, by performing right or left shifts according to the gain imposed by the PGA.
in this particular case, the controller parameters would change over time, since they depend on the amplitude of the input signal, which keeps varying. This would result in different time responses of the loop. The parameters are also dependent on the sampling frequency, thus for every change in the sampling frequency (during the design of the project), new parameters might have to be computed. Another issue is that the required computations are not adequate for the PIC18F4550's 8-bit architecture and would consume too much time: a multiplication and a division of floats can reach 336 and 2712 clock cycles, respectively, while arithmetic shifts of integers require approximately 20 clock cycles.
Figure 3.7: On-off Controller: (a) On-Off Characteristic; (b) On-off Controller Block
In order to meet the requirements, another solution was engineered: the bang-bang controller. The bang-bang controller (also denoted as on-off controller) is a simple and effective solution to this problem. In this project, an 'on-off' non-linearity with hysteresis and a dead-zone was implemented (see figure 3.7).

The controller (figure 3.7(b)) was implemented in the digital domain and changes the gain over time by controlling a PGA (the schematic of the electronic devices which form the PGA will be addressed in section 3.3.2). The controller comprises three states: one where it increases the gain, another where it decreases it, and the dead-zone, where it keeps the current gain unchanged. The controller also includes a maximum-search algorithm and a low-pass filter: the search for the maximum is necessary not only to test if the signal amplitude is within the allowable range, but also to compute the power of the AC signal (remember that the power is proportional to the amplitude of the current signal); the low-pass filter is used to avoid false gain changes due to instantaneous fluctuations in the measured signal. The output of the controller feeds the non-linearity, whose margins were computed taking into account the variations of the current sensor output (the computation of those margins is detailed in chapter 5).
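A minimal sketch of the decision rule follows (in Python; the thresholds are the simulation margins given in section 3.3.1 and the gain list mirrors the power-of-2 PGA settings of section 3.3.2.C, while the actual firmware additionally performs the maximum search, low-pass filtering and hysteresis):

```python
T0, T1 = 0.488, 1.95                  # lower/upper margins of the on-off characteristic [V]
GAINS = [0.25, 0.5, 1, 2, 4, 8, 16]  # power-of-2 PGA gain settings

def next_gain_index(peak, idx):
    """On-off rule with dead-zone: below T0 raise the gain one step,
    above T1 lower it one step, otherwise keep the current gain."""
    if peak < T0 and idx < len(GAINS) - 1:
        return idx + 1
    if peak > T1 and idx > 0:
        return idx - 1
    return idx  # dead-zone: gain unchanged
```

In the real loop, `peak` would be the output of the maximum search after low-pass filtering, and the hysteresis of the non-linearity keeps the gain from chattering near the thresholds.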
3.3.1 MatLab Simulation
Matlab was used to test the controller. The schematic, built in Simulink, is based on the diagram in figure 3.1. For simulation purposes, the lower and upper margins of the on-off characteristic were T0 = 488 mV and T1 = 1.95 V, respectively.
There are, fundamentally, at least two interesting simulation cases:

1. The first case is the behaviour of the circuit for a sinusoidal input with amplitude ranging between 120 mV and 200 mV;

2. The second case starts with a very low-amplitude signal (100 mV) and jumps suddenly to a very high-amplitude signal (5 V).

These tests allow us to understand the behaviour of the loop and observe whether it can properly adjust the amplitude to fit within the margins of the non-linearity, whether the input signal is a very low-voltage signal (first case) or a very high-voltage one (second case), while providing the highest gain possible. It is also interesting to compare the results obtained with an ADC with and without the interaction of the proposed controller.
Sinusoidal Input: 120-200 mV
With the help of a voltmeter and by running some workloads on the PC, it was determined that the output amplitude of the current sensor varies roughly between 30 mV and 50 mV. For the first test, the controller was subjected to a sinusoidal wave whose amplitude suddenly increases from 120 mV to 200 mV. This signal simulates the output signal coming from the current sensor after being filtered by the bandpass filter, which introduces a gain of 4.
Figure 3.8: Loop Response: (a) Input Sinewave; (b) Output Sinewave
Figure 3.8 demonstrates how the loop reacts to the input signal, by amplifying it to a value which guarantees that the limits of the on-off controller are respected. It must be said that the output plot does not include the offset component of 2.5 V. Even when the amplitude changes from 120 mV to 200 mV, the loop still amplifies the signal, but with a lower gain, since the signal is higher. Figure 3.9 compares the output signal, in volts, after being discretized by a common non-ideal 10-bit ADC without any sort of correction (in red), with the result obtained when using the AGC (in blue).
Figure 3.9: Reconstruction of the Input signal with and without AGC
Inspecting the figure, we can point out the differences in amplitude of the resulting signal: in the case of the conversion with no correction, the maximum amplitude of the signal reaches a value lower than it should (for an input of 200 mV we only get approximately 170 mV, whereas by using the AGC we get 197 mV). Ideally, we should get back 200 mV; however, besides the non-linearity of the ADC, we also have the quantization error of 1/2 LSB (approximately 2.44 mV). It is also obvious that, with no correction, the signal gets clipped, due to the unipolarity of the ADC. This test shows the obvious benefit of using such an approach to sample data, since with it we get more precision and no signal loss.
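The resolution benefit quantified above can be reproduced with a short numerical sketch. It isolates the quantization effect only (an idealized round-to-nearest 10-bit ADC; the fixed gain of 8 is an assumption standing in for the setting the AGC loop would reach for a 0.2 V signal, being the largest power of 2 that keeps 8 × 0.2 V + 2.5 V below the 5 V ceiling):

```python
import math

LSB = 5.0 / 1024   # 10-bit unipolar ADC step over the 0-5 V range (~4.88 mV)
OFFSET = 2.5       # mid-range DC offset added before conversion

def adc(v):
    """Idealized 10-bit ADC: clip to [0, 5] V and round to the nearest level."""
    v = min(max(v, 0.0), 5.0)
    return round(v / LSB) * LSB

def reconstruct(x, gain):
    """Amplify and offset, quantize, then undo the offset and the gain."""
    return (adc(gain * x + OFFSET) - OFFSET) / gain

# One period of a 0.2 V, 50 Hz sine sampled at 4 kHz-equivalent resolution.
signal = [0.2 * math.sin(2 * math.pi * 50 * n / 4000.0) for n in range(400)]
err_plain = max(abs(reconstruct(x, 1.0) - x) for x in signal)  # unity gain
err_agc = max(abs(reconstruct(x, 8.0) - x) for x in signal)    # AGC gain of 8
print(err_plain, err_agc)
```

The maximum reconstruction error shrinks by the gain factor, matching the intuition that pre-amplification spreads the small signal over more quantization levels.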
Sinusoidal Input: 50 mV - 5 V
For the second test, the controller was again subjected to a sinusoidal wave, whose amplitude lies between 50 mV and 5 V. The goal of this test is to conclude whether the controller is also able to attenuate the input signal, if needed, so that the output signal meets the pre-conceived specifications.
As the reader can observe in figure 3.10, the loop can successfully correct the signal's amplitude to the desired value. A transitory period is also noticeable at the instant of the instantaneous change in the amplitude of the input signal, from a very low voltage (50 mV) to a very high voltage (5 V). As before, the graphic of figure 3.11 compares the output signal, in volts, after being discretized by a common non-ideal 10-bit ADC without any sort of correction, with the result obtained when using the AGC method.

Inspecting the figure, the reader can observe that the low-voltage signal was lost when performing the conversion without the AGC, since the input signal presents a very low amplitude. Although there are some errors in the resulting signal, due to the transition period from a very low-voltage signal to a very high one, only with the AGC is the input signal successfully recovered. All the other conclusions pointed out in the previous test also apply to this case.
Figure 3.10: Loop Response: (a) Input Sinewave; (b) Output Sinewave
Figure 3.11: Reconstruction of the Input signal with and without AGC
3.3.2 Analog Domain Implementation
It is necessary to materialize the blocks introduced earlier, such as the band-pass filter, the PGA and the subtraction node, by designing electronic circuits that implement their respective functions. Every circuit analysed in the following sections belongs to the full schematic of the proposed system, which is provided in figure 3.12. The layout of the final circuit can also be found in the appendix of this document (please refer to figure A.3 if needed).
Figure 3.12: Full Circuit
3.3.2.A Band-pass Filter
In Europe, the electric power is provided as a voltage signal with 230 VRMS of amplitude and f = 50 Hz of frequency. However, the offset introduced by the Hall sensor must be eliminated, while preserving the 50 Hz component of the spectrum. For that reason, a band-pass filter was chosen to fulfil this need, since it not only rejects the DC component, but also attenuates all frequencies other than the 50 Hz component. Moreover, it also acts as an anti-aliasing filter. The filter specifications are presented below:

• Central Frequency f0 = 50 Hz;
• Bandwidth b = 10 Hz;
• Filter Gain at the central frequency G(f0) = 4 ≈ 12 dB.
The chosen values guarantee a filter with a narrow bandwidth (good selectivity, Q = 5). The value of the gain at this stage influences the rest of the circuitry, namely the PGA gain range: the higher the gain of the filter, the lower the maximum gain of the PGA has to be, in order to guarantee a high dynamic range. Initially, this filter was therefore designed to have a gain of 10. However, after the implementation of the PGA (which is presented in section 3.3.2.C), it was realized that the designed gain was too high and, consequently, it was reduced to 4. A lower gain increases the available bandwidth, resulting in a device with a faster time response.

To implement this filter, a multiple-feedback structure was used, which allows the implementation of a simple and reliable 2nd-order band-pass filter for low quality factors.
VOUT/VIN = − H ω0 s / (s^2 + (ω0/Q) s + ω0^2) (3.8)
Figure 3.13: Band Pass Filter
The transfer function of a second-order bandpass filter is given by equation 3.8, where H is a gain factor, ω0 is the central frequency of the filter and Q is the quality factor. The transfer function of the above circuit is presented in equation 3.9.

VOUT/VIN = − (1/(R1 C4)) s / [s^2 + (1/R5)(1/C3 + 1/C4) s + (1/(R5 C3 C4))(1/R1 + 1/R2)] (3.9)
To get the components' values, it is just a matter of solving a system of equations involving the numerator and the denominator of both transfer functions. The process to get the values is the following:

1. Choose the C3 value;

2. Let k = ω0 C3 and C4 = C3;

3. R1 is then obtained from

1/(R1 C4) = H ω0 ⇔ R1 = 1/(H k); (3.10)

4. Resistor R5 is obtained from

2/(C3 R5) = ω0/Q ⇔ R5 = 2Q/k; (3.11)

5. And, finally, resistor R2 is

R2 = 1/(R5 C3 C4 ω0^2 − 1/R1) = 1/(k (2Q − H)). (3.12)
Before going any further, we must guarantee that the current requested by the input load of the amplifier is not higher than the one the current sensor can provide. The ACS714 current sensor can provide a maximum current of 3 mA, which means the band-pass filter has to present a sufficiently high input impedance, or else the requested current will be too high:

Vmax/Zinput ≤ 3 mA ⇔ Zinput ≥ 3.2 V / 3 mA ≈ 1.07 kΩ (3.13)

where Vmax is the maximum voltage the current sensor will output and Zinput is the load seen by the current sensor, which is equal to Zinput = R1 + R2. Accordingly, we need an input impedance of at least 1.07 kΩ. The Quality Factor is given by Q = f0/b = 50/10 = 5. The gain at the central frequency is influenced by the H factor, but also by the filter quality factor. Thus, to obtain the desired gain, the H factor must be H = G/Q = 4/5 = 0.8.
Thus, choosing H = 0.8, C3 = 0.22 µF (it has to be low, to avoid the use of an electrolytic capacitor) and performing the remaining computations for the resistance values, the nominal values listed in table 3.1 were obtained. Note that the resistances whose values are shown as a sum are realized by connecting two resistors in series.
Component Nominal 5% Tolerance 1% Tolerance
R1 36.172 kΩ 36000 + 180 = 36180 Ω 23.2 + 13 kΩ
R2 738.1955 Ω 680 + 56 = 736 Ω 698 + 40.2 Ω
R5 289.37 kΩ 130 + 160 = 290 kΩ 243 + 46.4 kΩ
C3 = C4 0.22 µF - -
Table 3.1: Component Values
Resorting to Matlab [42] functionalities, an analysis of the filter was performed considering these resistor values. All the components were subjected to deviations with respect to their nominal values, considering 1% and 5% component tolerances. The goal was to understand how the tolerance of the components can influence the response of the filter. The results are illustrated in figures 3.14 and A.1 (in appendix A) and are also summarized in table 3.2.
Figure 3.14: Real Vs Theoretical Bode Diagram (data cursors: REAL at 48.4 Hz, 12 dB; Theoretical at 50 Hz, 12 dB)
Theoretical | Nominal 5% | +5% Deviation | -5% Deviation | Nominal 1% | +1% Deviation | -1% Deviation
Gain (dB): 12 | 12.1 | 9.25 | 8.96 | 12 | 11.9 | 11.9

Table 3.2: Gain at the Central Frequency for the Various Tests
Table 3.2 reflects the change in gain at the filter central frequency, for the components' nominal values and their respective deviations (±1% and ±5%). As can be seen, the worst results are obtained when the resistor tolerances are 5%, where the filter achieves, in the worst case, 8.96 dB, resulting in a relative error of

Error = 100 × (4 − 10^(8.96/20)) / 4 = 29.9% (3.14)
Figure 3.14 displays a comparison between the magnitude response of the theoretical filter and the real filter, realized with real non-ideal components (in this case, the ones indicated in table 3.1 for 1% tolerance). The plot shows that the central frequency deviates from 50 Hz to 48.4 Hz. This suggests the need for a potentiometer to tune the central frequency.
The filter was also tested in PSPICE, where the 5% components were deviated from their nominal values: the idea was to understand whether some components influence the central frequency or the quality factor more than others. The final circuit diagram is presented in figure 3.15(a). The test consisted in injecting a sine wave with 50 mV of amplitude and performing an AC sweep analysis, varying the frequency between 0 and 1 kHz, while also performing a parametric analysis by varying each resistor individually. The results provided in table 3.3 were obtained. The table reflects a change in the central frequency when R2, alone, was deviated, and changes of both the central frequency and the quality factor when R5, alone, was deviated from its nominal value. The quality factor did not change with the deviation of R2.
Figure 3.15: Pspice Simulation Circuits: (a) Circuit Diagram; (b) Offset Voltage Analysis
R2 | f0 [Hz] | R5 | f0 [Hz] | Q
+5% | 49 | +5% | 48.9 | 10.73
-5% | 51.2 | -5% | 51.4 | 9.29

Table 3.3: Parameters Variation
In sum, the results show that the filter specifications do not change significantly, whether 5% or 1% tolerance values are used. However, it is recommended that 1% values be used, since with them the filter response practically did not change. It was also noticeable that changing the R2 value can be used to tune the central frequency, as can R5, but the latter also changes the filter quality factor. As a result, 1% component values were used in this project. In addition, it was necessary to tune the R2 value with a potentiometer, by connecting it in parallel with the resistor.
The operational amplifier used to implement the filter was the MCP6022 IC, from Microchip. This IC has a gain-bandwidth product of 10 MHz, a pole at fc = 10 Hz, an open-loop gain of 120 dB and an offset of VOS = 500 µV. Observing the circuit, and resorting to the superposition theorem, we can calculate the contribution of this offset voltage to the output voltage. By grounding the input signal, imposing a voltage source of 500 µV at V+ to simulate the voltage offset (see figure 3.15(b)) and performing this analysis in DC mode, we have a voltage follower (i.e., the output voltage will be equal to 500 µV). Therefore, our signal will have a 500 µV offset, which is not significant. For this application, where a 12 dB gain was designed, the amplifier attains a bandwidth of BW = 10 MHz / 10^(12/20) = 2.512 MHz, which is more than enough to meet the filter requirements, since we are working with much lower frequencies.
The PIC18F4550 only supplies a positive voltage of about 5 V, which comes from the USB. However, the band-pass filter requires a negative voltage to operate correctly. Consequently, it is necessary to generate a negative voltage from a positive one. The different alternatives to accomplish this require additional ICs: one could use a transformer-coupled split-rail design or an isolated fly-buck converter. Although both alternatives would generate a very steady negative voltage, they rely on inductors, which occupy a great deal of space on a circuit board. Therefore, the LMC7660 switched-capacitor voltage converter was used. The LMC7660 is capable of converting an input voltage between +1.5 V and +10 V to the corresponding -1.5 V to -10 V, requires a low supply current of 200 µA (maximum value), has more than 90% efficiency and only needs two external components (Cp, the pump capacitor, and Cr, the reservoir capacitor). To calculate the reservoir capacitor value, we have to take into account that the operational amplifier will need a typical 1 mA current to drive all its transistors, but the converter also has to provide enough current to the output circuit. Thus, the formula (given in the LMC7660 datasheet) to calculate this capacitor is:

IL = Cr dv/dt ≈ Cr × Vripple(p-p) / (4/FOSC) ⇒ Cr = (4/FOSC) × IL/Vripple(p-p) (3.15)
where IL is the load current, Vripple(p-p) is the accepted peak-to-peak output voltage ripple and FOSC is the oscillator frequency. Thus, operating at the nominal frequency (FOSC = 10 kHz), and choosing IL = 10 mA and Vripple(p-p) = 40 mV, yields:

Cr = 100 µF (3.16)
The value of the pump capacitor is usually the same as that of the reservoir one, so Cp = 100 µF. Having a large pump capacitor is also beneficial to increase the conversion efficiency. A circuit diagram of this converter is shown in figure 3.16.
Figure 3.16: Voltage Converter Schematic (LMC7660 datasheet - Texas Instruments)
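Equation 3.15 is simple enough to verify directly; the sketch below uses the values chosen above:

```python
F_OSC = 10e3       # nominal oscillator frequency [Hz]
I_L = 10e-3        # chosen load current [A]
V_RIPPLE = 40e-3   # accepted peak-to-peak output ripple [V]

# Equation 3.15 solved for the reservoir capacitor.
C_R = (4 / F_OSC) * (I_L / V_RIPPLE)
print(C_R)  # 1e-4 F, i.e. 100 uF
```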
3.3.2.B Summing Amplifier
As mentioned earlier, before the signal can be supplied to the ADC, it is convenient to introduce an offset (equal to Vref+ADC/2 = 5/2 = 2.5 V). A summing amplifier configuration was used,
Figure 3.17: Summing Amplifier
where one of the inputs is the output of the band-pass circuit and the other is the offset itself. This summation only works as it should if all resistors have the same value. The offset is generated by a voltage divider where both resistances are the same, so the division ratio is 2. Furthermore, this resistor divider is connected to a buffer, so that it does not load the next circuit. The values of the resistors are in the order of kΩ, because we do not want high currents being drained from the amplifier. 1% resistors were used and their nominal values are indicated in figure 3.17. Figure 3.18 illustrates the result of the simulation for a case where the output of the band-pass filter is fed into this adder, yielding a sinusoidal wave with a DC component of Vdd/2:
Figure 3.18: Adder Output - Voffset (purple) = 2.5 V; Vsub (yellow) = 1.175 mV amplitude; Vout (cyan)= 3.662 V
3.3.2.C Programmable Gain Amplifier
The PGA is implemented using a subtractor amplifier, whose inputs are the output of the summing amplifier and a DAC. The PGA is referenced to the Vcc/2 common voltage and it can be shown, by resorting to the superposition theorem, that its output voltage is given by equation 3.17.

Vout = (RF/RE)(V2 − V1) + Vcc/2 (3.17)
Figure 3.19: PGA
The equation reveals that, to change the gain of the amplifier, one can simply modify the ratio between resistances RF and RE, varying it over time. This can be accomplished if the RF resistor is replaced by a digitally controlled potentiometer, as illustrated in figure 3.19. The potentiometer used was the AD5113, from Analog Devices: it contains a fixed resistor with a wiper contact that taps the fixed resistor value at a point determined by a digitally controlled UP/DOWN counter.
Figure 3.20: AD5113’s Pin Configuration and Block Diagram - Analog Devices
Figure 3.20 presents the AD5113's block diagram. This device has a three-wire serial input interface: CLK, the serial clock input (negative-edge triggered); CS, the chip select input (active low); and UP/DOWN (U/D), the UP/DOWN direction increment control. Table 3.4 summarizes the operation of the circuit with respect to all combinations of states of those three signals. When CS is taken active low, the clock begins to increment or decrement the internal UP/DOWN counter, depending on the state of the U/D control pin. The UP/DOWN counter value (D) starts at 0x20 at system power-on. Each new clock pulse increments the value of the internal counter by one LSB, until the full scale of 0x40 is reached, as long as the U/D pin is logic high.
The resistance between the wiper and either end point of the fixed resistor provides a constant step size, equal to the end-to-end resistance divided by the number of positions (e.g., Rstep = 80 kΩ/64 = 1.25 kΩ). Just like a common potentiometer, it is possible to use the resistance between the A terminal and the wiper, or between the wiper and the B terminal, of the full resistance. Analog
CS | CLK | U/D | Operation
L | ↓ | H | Wiper Increment Toward Terminal A
L | ↓ | L | Wiper Decrement Toward Terminal B
H | X | X | Wiper Position Fixed

Table 3.4: AD5113 Operation
Devices provides digital potentiometers of this series with 5 kΩ, 10 kΩ and 80 kΩ nominal resistances. As discussed earlier, one of the requirements is that the resulting gain must be a power of 2, so that the computational effort is reduced. The resistance between the wiper and the B terminal is obtained according to the following formula:

RWB = D × RWN/64 + RW (3.18)

where RWN is the nominal resistance, D is the digital word recorded in the UP/DOWN counter and RW is the wiper resistance, which typically assumes the value of 70 Ω. Hence, the gain can be obtained as a function of the digital word (D), the nominal resistance and the wiper resistance, by combining equations 3.17 and 3.18:

Gain = RF/RE = (D/64 × RWN + RW)/RE (3.19)
So, if RE = 5 kΩ and RWN = 80 kΩ (resistances expressed in kΩ), it yields:

Gain = (D/64 × 80 + 0.07)/5 (3.20)
With those values, the gain is a function of the variable D and assumes power-of-2 values (up to the small wiper-resistance error) for D equal to 1, 2, 4, 8, 16, 32 and 64. This is, in turn, very easy to attain in the digital domain.
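The gains and worst-case errors of table 3.5 can be reproduced from equation 3.19. A sketch follows (resistances in kΩ; taking the ideal gain as D/4, i.e. the zero-wiper case, is an interpretive assumption consistent with the table):

```python
R_WN = 80.0   # nominal end-to-end resistance [kOhm]
R_W = 0.07    # wiper resistance [kOhm]

def pga_gain(d, r_e=5.0):
    """Equation 3.19: Gain = (D/64 * RWN + RW) / RE, resistances in kOhm."""
    return (d / 64 * R_WN + R_W) / r_e

for d in (1, 2, 4, 8, 16, 32, 64):
    ideal = d / 4.0  # D/64 * 80 / 5 with an ideal (zero-ohm) wiper
    err = 100 * abs(pga_gain(d, 5.05) - ideal) / ideal
    print(d, round(pga_gain(d, 5.05), 2), round(err, 2))
```

Running it reproduces the worst-case error column for RE = 5.05 kΩ, confirming that the relative error shrinks as the gain setting grows.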
On the other hand, according to the PSU CORSAIR TX750 specifications [41], the AC input can request a maximum current of 10 A. This means that the current sensor output lies between 2.5 V and 3.2 V, corresponding to input currents of 0 A and 10 A, respectively. However, after running some workloads, a minimum current of about 0.45 A and a maximum current of 0.53 A were measured, corresponding to 30 mV and 50 mV at the output of the sensor, respectively. Thus, for the low voltage of 30 mV, the combined gain of the band-pass filter and the PGA outputs a voltage of 4 × 16 × 30 mV = 1.920 V which, when summed with the common voltage of Vcc/2, yields 4.420 V. Therefore, the ADC's input full range is almost attained with the projected gains. Values higher than 4.7 V are not recommended, to prevent the loss of information during the quantization process.
Ideally, there would be no error in the resistor values, nor would a wiper resistance exist. Unfortunately, that is not the case, and an effort must be made to reduce the introduced error. Regarding the wiper resistance, little can be done, since it is inherent to the technology used by Analog Devices, but it is possible to reduce the error regarding the resistor values. First, the resistors to be used as RE will be of 1% precision, since those were the ones available. Moreover, by making a parallel association of two resistors of value 2 × RE, in the best case the error can be reduced from 1% to 0.01%; in the worst case, the error remains 1%. With the values of the resistors set, we can easily control the gain by changing the value of the digital word. The values the digital word can assume, and the respective gains for the worst-case RE resistance values, are represented in the following table:
D | G (Ideal) | G (RE = 5.05 kΩ) | Error | G (RE = 4.9995 kΩ) | Error
1 | 0.25 | 0.26 | 4.55% | 0.26 | 5.61%
2 | 0.50 | 0.51 | 1.78% | 0.51 | 2.81%
4 | 1.00 | 1.00 | 0.40% | 1.01 | 1.41%
8 | 2.00 | 1.99 | 0.30% | 2.01 | 0.71%
16 | 4.00 | 3.97 | 0.64% | 4.01 | 0.36%
32 | 8.00 | 7.93 | 0.82% | 8.01 | 0.19%
64 | 16.00 | 15.86 | 0.90% | 16.02 | 0.10%

Table 3.5: Digital Word Vs Gain
This table reflects the importance of always operating at the highest possible gain, since the smallest
errors (relative to the ideal gain) are attained there for both cases. For every gain setting, the PGA introduces an
offset. Therefore, these offsets were measured and saved in a look-up table in the microcontroller, to
correct such deviations. The results of this test are provided in chapter 5.
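The correction can be pictured as a small look-up step; the offsets below are placeholders, not the values actually measured and stored in the microcontroller:

```python
# Hypothetical per-gain offsets (mV), indexed by the PGA gain setting.
# The dictionary values are illustrative placeholders only.
PGA_OFFSET_MV = {0.25: 0.8, 0.5: 1.0, 1: 1.2, 2: 1.5, 4: 2.1, 8: 3.0, 16: 4.4}

def correct_offset(reading_mv, gain):
    # subtract the offset measured for the currently active gain setting
    return reading_mv - PGA_OFFSET_MV[gain]
```

In the actual firmware the table would be indexed by the digital word rather than by the gain value, but the principle is the same.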
Finally, it is important to explain how the value provided by the ADC gets converted back to an
electrical magnitude. As was shown, an amplifier stage was needed before the signal could be
acquired by the ADC. Hence, in order to obtain the real value of the sampled voltage, it is only a matter
of reverting all the steps performed until the digital value was generated by the ADC:
1. Convert the Value in Binary Format to a Voltage

V_Value = Digital Value × (V_Ref+ − V_Ref−) / ADC Resolution   (3.21)

where Digital Value is the value given by the ADC, V_Ref+ is the positive ADC reference voltage
(= 5 V), V_Ref− is the negative ADC reference voltage (= 0 V) and ADC Resolution is the number of
quantization levels (1024).
2. Calculate the Voltage Value Before Amplification

By analysing the circuit provided in figure 3.5, the input voltage is given by the following expression:

V_in = −(V_Value − V+) × R3/R4 + V+   (3.22)

i.e., it is the ratio between resistors R3 and R4 that dictates the gain. The gain values for each
sensor are stored in an array within the application. It is then necessary to subtract the offset
imposed by the sensor (V_Ref Sensor) to obtain the real voltage value. The offset values of
each sensor were measured with a voltmeter and are stored in a look-up table at the host.

V_Real = V_in − V_Ref Sensor   (3.23)
3. Calculate the Current Value

Finally, the voltage value must be converted back to current, using the current sensor's
sensitivity (mV/A):

I = V_Real / Sensitivity   (3.24)

4. AC Sensor

The above steps are valid for all sensors except the AC sensor. For this sensor, the signal conditioning
differs from the one used with the DC rails. Therefore, only the first and third steps apply, with an
intermediate step in which the gain introduced by the band-pass filter is cancelled:

V_Real = V_Value / Filter Gain   (3.25)
The reader is invited again to observe the full schematic of the system in figure 3.12, if necessary.
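The DC-rail conversion steps above can be sketched in a few lines; the function name and the calibration arguments are illustrative, not the actual host-application code:

```python
# Hedged sketch of the conversion chain in equations (3.21)-(3.24).
VREF_POS, VREF_NEG, ADC_LEVELS = 5.0, 0.0, 1024  # 10-bit ADC, 0-5 V range

def adc_to_current(digital_value, v_plus, r3_over_r4, v_ref_sensor, sensitivity):
    # Step 1: binary code -> voltage at the ADC input (eq. 3.21)
    v_value = digital_value * (VREF_POS - VREF_NEG) / ADC_LEVELS
    # Step 2: undo the inverting amplifier stage (eq. 3.22)
    v_in = -(v_value - v_plus) * r3_over_r4 + v_plus
    # ...and remove the sensor's own offset (eq. 3.23)
    v_real = v_in - v_ref_sensor
    # Step 3: voltage -> current via the sensor sensitivity in V/A (eq. 3.24)
    return v_real / sensitivity
```

With a mid-scale code and a zeroed offset the reconstructed current is 0 A, as expected for an idle rail.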
3.3.3 Dynamic Range Analysis

In ADC applications, the dynamic range is the ratio of the rms value of the full scale to the rms noise,
which is generally measured with the analog inputs shorted together. Commonly expressed in decibels,
it indicates the range of signal amplitudes that the ADC can resolve. For an N-bit ADC, the ideal DR
(Dynamic Range), also called SNR (Signal-to-Noise Ratio) [43], can be calculated as:

DR = 6.021 × N + 1.763 (dB)   (3.26)
One method to improve this parameter is to perform oversampling. As the name implies, oversampling
gathers additional conversion data from the input signal. The standard convention for sampling an
analog signal indicates that the sampling frequency Fs should be at least twice the maximum frequency
(FM) of the input signal. This is known as the Nyquist Theorem:

Fs ≥ 2 × FM   (3.27)
Hence, by using a higher sampling frequency (oversampling), combined with averaging techniques,
the Effective Number Of Bits (ENOB) can be improved. In fact, averaging the oversampled
results also averages the quantization noise, thus improving the SNR and, consequently, the ENOB. The
ENOB is a function of the Signal-to-Noise ratio plus Distortion (SINAD), which is measured with a
sinusoidal input near full-scale applied to the A/D converter. The SINAD is found by computing the ratio
of the RMS level of the input signal to the RMS value of the root-sum-square (RSS) of all noise and
distortion components in the FFT analysis, except for the DC component. The ENOB is calculated by
substituting the ADC's measured SINAD for DR in equation 3.26 and solving the equation for N:
ENOB = (SINAD − 1.763) / 6.021   (3.28)
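These two relations are easy to evaluate numerically; a short sketch (the helper names are illustrative):

```python
def ideal_dr(n_bits):
    # ideal dynamic range / SNR of an N-bit converter (eq. 3.26), in dB
    return 6.021 * n_bits + 1.763

def enob(sinad_db):
    # effective number of bits from a measured SINAD in dB (eq. 3.28)
    return (sinad_db - 1.763) / 6.021
```

For the 10-bit converter used here, `ideal_dr(10)` gives the 61.973 dB theoretical limit quoted later in this section.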
For each additional bit of accuracy, the signal must be oversampled by a factor of four, meaning
that the relationship between the oversampling frequency F_OS and the sampling frequency F_S is:

F_OS = 4^Nb × F_S   (3.29)

where Nb is the desired improvement in the ENOB (for instance, for two bits of improvement, Nb =
2). Figure 3.21 shows how oversampling improves the accuracy of the conversion result. In this diagram,
the input signal is oversampled by four (sample groups are shown in green and purple) and averaged.
The shown sample points illustrate the difference between the raw, noisy signal and the average, the
noise in this example affecting ±3 bits of accuracy on an individual sample. Note that the averaged
values (orange dots) are much closer to the ideal value than most of the single samples.
[Figure: conversion values around the ideal code of 500 over time, showing the input waveform, the individual noisy samples, the sample groups and their averages.]
Figure 3.21: Averaged Conversion Results
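The effect illustrated in figure 3.21 can be reproduced with a toy simulation, assuming uniform noise of ±3 codes on each raw sample (the noise model and values are purely illustrative):

```python
import random
import statistics

random.seed(0)            # deterministic toy run
TRUE_CODE = 500.0         # ideal conversion value

def sample():
    # one noisy reading: the true code plus +/-3 codes of uniform noise
    return TRUE_CODE + random.uniform(-3.0, 3.0)

def averaged_sample(nb):
    # average 4**nb raw readings for nb extra bits of accuracy (eq. 3.29)
    n = 4 ** nb
    return sum(sample() for _ in range(n)) / n

raw_err = statistics.pstdev(sample() - TRUE_CODE for _ in range(2000))
avg_err = statistics.pstdev(averaged_sample(2) - TRUE_CODE for _ in range(2000))
```

Averaging 4² = 16 samples shrinks the error roughly fourfold (√16), i.e. about two extra effective bits.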
As a general rule, doubling the sampling frequency yields, approximately, a 3-dB improvement in
noise performance. Noise improvement can be attained by using post-processing techniques (averag-
ing). When averaging conversion results, there are two approaches that can be taken into account:
normal average or rolling average.
Normal averaging simply consists of acquiring N samples, adding them, and dividing the result by N. When
using normal averaging in an oversampling scenario, the sample data used in the calculation is discarded
after the technique is applied. This process is repeated every time the application needs a new conversion
result. When using averaging techniques, there is a slight delay associated with the calculated
conversion result, since it corresponds to the average of the last N samples. The delay can be calculated
using the formula shown in Equation 3.30.
t_delay = t_sn − t_s0 + t_process   (3.30)
where ts0 is the time at which the first sample of the average occurs, and tsn is when the last sample
occurs. The tprocess time, required to process the sampled data and calculate the average to supply
to the application, is also factored into the equation. Unfortunately, when using this method the ADC
sampling frequency is reduced by the same oversampling factor (i.e., if we are using an oversampling
factor of N = 8 and we are sampling at a rate of 10 kHz, then the effective sampling frequency will be
10/8 = 1.25 kHz).
The rolling average technique consists of using a sample buffer of the N most recent samples in
the averaging calculation, allowing the ADC to sample at its maximum rate (the ADC sample rate is not
reduced by N as in normal averaging), making it ideally suited for applications requiring oversampling
and higher sample rates. However, there is still a delay resulting from populating the sample buffer
and processing the average calculation (this can, however, be done efficiently by using a number of samples
equal to a power of two). Furthermore, in order to always have useful data being processed by the
application, while the buffer is being filled, the samples coming directly from the ADC are used.
Figure 3.22 shows the differences between the two aforementioned techniques for an oversampling ratio
of 4. It can be seen that the rolling average permits operation at the maximum frequency, whilst
the normal average does not, since it has to wait for the buffer to be filled with 4 samples before averaging.
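A minimal rolling-average sketch follows (illustrative, not the actual firmware; a power-of-two window lets the division become a shift in fixed-point code):

```python
from collections import deque

class RollingAverage:
    def __init__(self, n):
        self.n = n
        self.buf = deque(maxlen=n)  # keeps only the N most recent samples

    def update(self, sample):
        self.buf.append(sample)
        # while the buffer is still filling, pass raw ADC samples through
        if len(self.buf) < self.n:
            return sample
        # otherwise average the N most recent samples
        return sum(self.buf) / self.n
```

Every call to `update()` returns a usable value immediately, so the output rate equals the ADC sample rate, matching the behaviour described above.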
[Figure: timing diagrams of both techniques, showing when the first averaged sample becomes available and the associated delay t_delay.]
(a) Normal Average  (b) Rolling Average
Figure 3.22: Averaging Techniques
3.3.3.A SNR, THD and SFDR Analysis

For the following analysis, it is important to know precisely how the ADC performs when
stressed with a common 50 Hz sinusoidal input. Thus, in this section, an analysis of the SNR (already
mentioned), the Total Harmonic Distortion (THD) and the Spurious Free Dynamic Range (SFDR) will
be carried out, and some choices are made, such as the sampling frequency of the system and the post-
processing technique to be used. Therefore, definitions for THD and SFDR need to be provided.
Total Harmonic Distortion is the ratio of the rms value of the fundamental signal to the mean value
of the root-sum-square of its harmonics (generally, only the first 5 harmonics are significant). THD of an
ADC is also generally specified with the input signal close to full-scale, although it can be specified at
any level.
The Spurious Free Dynamic Range is the ratio of the rms value of the signal to the rms value of the
worst spurious signal, regardless of where it falls in the frequency spectrum. The worst spur may or may
not be a harmonic of the original signal. SFDR can be specified with respect to full-scale (dBFS) or with
respect to the actual signal amplitude, also called carrier (dBc).
There is no information available about the dynamic characteristics of the ADC embedded in the PIC18F4550
microcontroller. However, the information about a similar 10-bit ADC from Microchip can be used as a
baseline and compared with the values obtained through measurements. Table 3.6 presents the main
characteristics from the datasheet of the Microchip 10-bit ADC MCP3004/3008, for Vref = 5 V.
Sample Rate    200 kSPS max
DNL and INL    ±1 LSB
SINAD          61 dB
SFDR           78 dB
THD            −76 dB

Table 3.6: ADC Characteristics
[Figure: single-sided amplitude spectra; (a) Vdd/2 input: RMS quantization noise level, FFT processing gain 10 log(4096/2) = 33 dB, FFT noise floor = 94 dB, SNR = 60.887 dB; (b) fundamental, noise, DC and harmonics (excluded).]
(a) Vdd/2  (b) Fs = 1 kHz; VDC = 2.45 V and VAC = 1.715 Vrms
Figure 3.23: PIC18F4550 10-bit ADC's SNR
The SNR of the MCP3004/3008 can be calculated using the SINAD and the THD values, according to the
following formula [44]:

SNR = −10 log[10^(−SINAD/10) − 10^(THD/10)] ≈ 61.45 dB   (3.31)
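Equation 3.31 can be evaluated directly; a sketch (the function name is illustrative):

```python
import math

def snr_from_sinad_thd(sinad_db, thd_db):
    # eq. 3.31: remove the distortion contribution from the SINAD figure
    # (THD is entered as a negative number in dB)
    return -10 * math.log10(10 ** (-sinad_db / 10) - 10 ** (thd_db / 10))
```

Because the distortion power is subtracted out, the result always sits above the SINAD value, as observed in the text.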
As it should be, the SNR is greater than the SINAD value. For the ADC of the PIC18F4550, in order to calculate
some of these parameters, a DC signal with magnitude Vdd/2 was first fed into the ADC and sampled at 1
kHz. Figure 3.23(a) shows the Fast Fourier Transform (FFT) plot of that test, taken with 4096 points. The
scale is referred to the full-scale input voltage (Vref = +5 V). The plot evidences the noise floor of the
system lying around 94 dB. The FFT process reduces the noise floor, so the actual SNR referred to the
RMS quantization noise level equals [43]:

SNR = FFT Noise Floor − 10 log(4096/2) ≈ 60.887 dB   (3.32)
This is, in fact, the SNR of the system. The ENOB is calculated via equation 3.28, resulting in:

ENOB ≈ 9.8   (3.33)

This is a typical value for the ENOB of a 10-bit ADC. However, it was obtained by injecting a constant
value. Accordingly, to have a better representation of the SNR, a sinusoidal signal with f = 50 Hz and an
amplitude close to the full scale (DC offset of 2.45 V and a maximum of 1.715 VRMS), sampled at Fs =
1 kHz, was injected. The respective FFT is presented in figure 3.23(b).
As can be concluded by observing the plot, the ADC behaves poorly and the SNR drops
to approximately 38 dB. In fact, it was observed that the signal was corrupted by noise with the
same amplitude as the harmonics. There can be several reasons for this: extra (non-quantization) noise
introduced by the mains; the voltage reference used by the ADC, which comes from the USB bus and powers the
full system, not being very stable, thus affecting the conversion; or the sampling rate not being high enough.
Hence, the combined effect of the aforementioned causes results in the quantization process being strongly
disturbed during the ADC's conversion. The first cause was, at first, discarded after obtaining the FFT of
the same sinusoid with a digital oscilloscope, where the fundamental, its harmonics and
the noise can be seen, but no other signal (please refer to figure A.4 in the appendix). However, the DSO-X 2024A
digital oscilloscope performs oversampling and sample averaging, so this method is inconclusive regarding
the source of the noise.
The hypothesis that the voltage of the USB bus is unstable is, in fact, true, since it varies with the load of
the PC. A solution would be to supply the board from an external power source, providing a much more
stable voltage reference. Nevertheless, this would require modifications to the PIC's demonstration
board, and it would be hard to come up with an easy-to-integrate meter board if it demanded the use
of an external power source.
Therefore, the sampling frequency was increased to 3.3(3) kHz and new measurements were
taken. Furthermore, post-processing of the samples was performed: averaging N = 8 and N = 16 samples,
using a rolling average buffer technique. The choice of operating at Fs = 3.3(3) kHz is discussed in
chapter 4.

Inspecting the resulting measurements in figures 3.25(a) and 3.25(b), an increase in the SNR value
can be observed (the noise floor gets reduced) and the noise interference previously seen
is now eradicated, even in figure 3.24, where no averaging is performed. Hence, sampling at Fs =
3.3(3) kHz is enough to raise the SNR to approximately 60.4 dB, which is much closer to the value
obtained when injecting a constant DC signal.
As previously referred, averaging the samples can improve the SNR value, and that is effectively
the case when averaging N = 8 and N = 16 samples: figure 3.25(a) shows an improvement
of approximately 9 dB when compared with the SNR obtained with no averaging. This, in
fact, agrees with the rule of thumb which states that for every doubling of averaged samples there is a
noise improvement of roughly 3 dB. This, in turn, yields a final SNR of 69.187 dB, thus surpassing
the theoretical limit of 61.973 dB for a 10-bit ADC. Similar conclusions are obtained analysing figure
[Figure: single-sided amplitude spectrum of the sinusoidal input, with the fundamental, the harmonics and the noise marked; FFT noise floor = 93.51 dBFS, FFT processing gain 10 log(4096/2) = 33 dB, SNR = 60.396 dBFS.]
Figure 3.24: FFT of Sinusoidal Signal sampled at Fs = 3.3(3) kHz
3.25(b), where, in this case, an improvement of 11 dB is observed, instead of the expected 12 dB. This
results in a final SNR of approximately 72 dB.
The THD, SFDR and SINAD were also computed for every case. The ENOB value was calculated
from the values obtained for THD and SNR, using once again equation 3.28. Since the analysis
is similar for every case, only the THD and SFDR plots for one of the cases are included
(please refer to figure A.5 in the appendix of this dissertation). Table 3.7 summarizes the results.
Signal   Vdd/2    Fs = 1 kHz   Fs = 3.3(3) kHz   Fs = 3.3(3) kHz, N = 8   Fs = 3.3(3) kHz, N = 16
SNR      61 dB    38 dB        60.4 dB           69.187 dB                72 dB
THD      N/A      54.22 dB     57.72 dB          61.291 dB                62.64 dB
SFDR     N/A      52 dB        59.82 dB          61.842 dB                63.33 dB
ENOB     N/A      6            8.982             9.778                    10.03

Table 3.7: SNR values for each test case
For this project, it was chosen to operate at FS = 3.3(3) kHz, averaging N = 8 samples with a rolling
buffer technique. Averaging a higher number of samples would introduce a major overhead in
the main loop of the application, due to the buffer and accumulator manipulation, which would force the
usage of a lower sampling frequency. In any case, this option introduces enough noise improvement to exceed
the SNR theoretical limit for a 10-bit ADC. According to the results, it is safe to state that the mains
introduces some undesired spurs, significantly affecting the ADC's conversion. However, it is possible to
bypass this issue by sampling at higher frequencies and performing sample averaging. It is now also
possible to compare these results with the ones indicated in table 3.6, referring to a typical 10-bit ADC.
Thus, the PIC18F4550's ADC has a lower performance than the MCP3004/3008 ADC, although it is important
to bear in mind that the MCP3004/3008 has a maximum sample rate of 200 kSPS, whilst the PIC18F4550's ADC has only 58 kSPS,
[Figure: single-sided amplitude spectra of the averaged sinusoidal signal; (a) N = 8: FFT noise floor = 102.3 dBFS, SNR = 69.2 dBFS; (b) N = 16: FFT noise floor = 104.5 dB, SNR = 72 dB.]
(a) N = 8  (b) N = 16
Figure 3.25: FFT of Sinusoidal Signal with samples averaging (Fs = 3.3(3) kHz)
which affects dynamic performance.
3.3.3.B Combining the PGA with Oversampling

As referred earlier, to maximise the dynamic range of the ADC, a front-end PGA stage can be
added to increase the effective SNR for very small input signals.

Consider a system dynamic range requirement of > 80 dB:
1. First, the minimum rms noise to achieve this dynamic range must be calculated, considering
Vp−p = 5 V (unipolar ADC). The maximum allowable system noise is calculated as:

DR = 20 log(V_FS RMS / V_RMS Noise)   (3.34a)

where V_FS RMS and V_RMS Noise are the full-scale RMS voltage allowed by the ADC and the RMS
noise, respectively. Substituting the values yields:

80 dB = 20 log[(5 / (2√2)) / V_RMS Noise] ⇔ V_RMS Noise = 176.776 µV RMS   (3.34b)
2. The 10-bit, 58 kSPS ADC of the PIC18F4550 works at approximately 3 kSPS. The total rms noise is
simply:

176.776 µV RMS = ND × √BW_max   (3.35a)

where ND refers to the noise density and BW_max to the Nyquist band. Thus, for BW_max = 1.5
kHz:

ND = 176.776 µV / √1500 ≈ 4564.35 nV/√Hz   (3.35b)

This is the amount of noise density, referred to the input (RTI), that can be tolerated by the system.
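Steps 1 and 2 can be reproduced numerically; a sketch of equations 3.34b and 3.35b:

```python
import math

DR_TARGET_DB = 80.0                             # required dynamic range
BW_MAX_HZ = 1500.0                              # Nyquist band at 3 kSPS

v_fs_rms = 5.0 / (2.0 * math.sqrt(2.0))         # full-scale rms of a 5 Vpp sine
v_noise_rms = v_fs_rms / 10 ** (DR_TARGET_DB / 20)   # eq. 3.34b, in volts
nd_limit = v_noise_rms / math.sqrt(BW_MAX_HZ)        # eq. 3.35b, in V/sqrt(Hz)
```

The computed values match the 176.776 µV RMS and 4564.35 nV/√Hz figures above.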
3. It is now possible to verify whether the AGC amplifier chosen a priori is appropriate to
provide sufficient analog front-end gain to achieve the required 80 dB. If it is not suitable, then
another device with better performance must be chosen, or more conservative specifications
regarding the DR must be made.
The input 50 Hz signal is acquired at a frequency of 3.3(3) kHz and subsequently averaged
with N = 8 samples, thus reducing the amount of noise in the system and providing a total SNR of 69.2
dB. Hence, achieving 80 dB of DR requires at least 10.8 dB of improvement, which can
come from the gain provided by the PGA stage. Therefore, this block has to provide a gain of at
least 3.5 without exceeding the ND limit (4564.35 nV/√Hz).

The AD8031 amplifier, combined with variable resistors and mounted as a differential gain amplifier
with RE = 5 kΩ (figure 3.19), is capable of providing a maximum gain of 16, as seen
before, which is higher than required. This amplifier has a noise density of 15 nV/√Hz at f = 1 kHz for unity gain
(G = 1); for higher gains, the noise is usually lower, and it is higher for operation at low frequencies
(< 1 kHz). So, the 15 nV/√Hz value can be roughly used for the following calculations:

ND_PGA ≈ 15 nV/√Hz   (3.36)

Another AD8031 amplifier lies between the band-pass filter and the PGA, providing the
necessary common-mode voltage through a summing configuration (figure 3.17):

ND_Sum = 15 nV/√Hz   (3.37)

The MCP6002 is positioned before the summing amplifier and is used to implement the band-pass
filter (figure 3.13):

ND_Bpass = 15 nV/√Hz   (3.38)
4. Since the complete system's noise budget is 4564 nV/√Hz (RTI), it is useful to calculate the
dominant noise sources to ensure that the limit is not exceeded. The noise densities are referred
to the input of the full circuit, which is the band-pass filter (the reader may recall the full circuit
diagram in figure 3.12).

The AD8031 summing amplifier and the PGA both have an input-referred noise of 15 nV/√Hz,
which, when referred back to the input of the band-pass filter (which can provide a gain of 4), yields
3.75 nV/√Hz of noise each. The band-pass filter has an input-referred noise of 15 nV/√Hz.

The ADC has an SNR of 60.4 dB using 5 V as reference, yielding:

N = (5 / (2√2)) / 10^(60.4/20) ≈ 1688.2 µV RMS   (3.39)

Considering the Nyquist BW (1.5 kHz):

ND_ADC ≈ 1688.2 µV RMS / √1500 ≈ 43.6 µV/√Hz   (3.40)
When referred back to the input, ND_ADC RTI = 681.25 nV/√Hz.

The total RTI noise of the full system is given by the root-sum-square (RSS) of those noise
densities:

Noise Total = √(ND²_PGA RTI + ND²_Sum RTI + ND²_Bpass RTI + ND²_ADC RTI)   (3.41a)

Substituting values yields:

Noise Total = √(2 × (3.75 nV/√Hz)² + (15 nV/√Hz)² + (681.25 nV/√Hz)²)   (3.41b)

⇒ Noise Total ≈ 681.44 nV/√Hz   (3.41c)

This is lower than the maximum allowable noise (4564 nV/√Hz). The total noise contribution suggests
that the initial specification (80 dB) was too conservative and, in fact, the DR can be greater.
Considering this value and using equation 3.34a, the system may achieve a DR of 96.5 dB.
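This last step can be checked in a few lines (densities taken from the text; the helper names are illustrative):

```python
import math

def rss(*densities):
    # root-sum-square combination of uncorrelated noise densities (eq. 3.41a)
    return math.sqrt(sum(d * d for d in densities))

# all densities in nV/sqrt(Hz), referred to the band-pass filter input:
# PGA and summing amp (15 each, referred through the filter gain of 4),
# the band-pass filter itself, and the ADC referred back to the input.
nd_total = rss(15 / 4, 15 / 4, 15, 681.25)

def dr_db(v_fs_rms, v_noise_rms):
    # eq. 3.34a
    return 20 * math.log10(v_fs_rms / v_noise_rms)

# total rms noise over the 1.5 kHz Nyquist band, in volts:
v_noise = nd_total * 1e-9 * math.sqrt(1500)
dr = dr_db(5 / (2 * math.sqrt(2)), v_noise)
```

The ADC term clearly dominates the sum, and the resulting DR matches the 96.5 dB quoted above.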
3.3.4 System Stability Analysis

Since the proposed AGC is non-linear, the theory commonly used for linear feedback systems is not
appropriate. However, under some assumptions, it is possible to analyse the system in the frequency
domain, in a way similar to the one used with linear systems. This section provides some background
on non-linear system problems and some analysis techniques based on sector non-linearities,
important for this type of case, and then applies this theory to the practical case.
3.3.4.A Absolute Stability
A large class of non-linear systems can be represented as a conventional linear, time-invariant system
with a non-linear block in the feedback path (see figure 3.26(a)). In fact, this type of system appears
with some frequency in practical engineering cases, and the stability analysis problem for this kind of
system is often identified as the Lur'e Problem or Absolute Stability Problem [45, 4, 46].
(a) Lur’e Problem (b) Definition of a Sector
Figure 3.26: Non-Linearities Analysis [4]
The process for representing a system in this form depends on the particular system that is involved.
For instance, in the case where the only non-linearity is in the form of a relay or actuator/sensor non-
linearity, there is no difficulty in representing the system in this feedback form. In other cases, the
representation may be less obvious.
It is assumed that the external input is r = 0, and the behaviour of the unforced
system is studied. What is unique about this section is the use of the frequency response of the linear system,
which builds on classical control tools like the Nyquist plot and the Nyquist criterion. The dynamic system
in figure 3.26(a) can be represented in the state-space form:

ẋ = Ax + Bu
y = Cx + Du
u = −φ(y)   (3.42)

where x ∈ ℝⁿ, u and y ∈ ℝᵖ, the pair (A, B) is controllable and (C, A) is observable. G(s) is
the open-loop transfer function of the system, and it can be obtained from the state-space model:

G(s) = C(sI − A)⁻¹B + D   (3.43)
It is useful to define a sector and when a function belongs to one. A continuous function φ
belongs to the sector [K1, K2] if there are two non-negative numbers K1 and K2 such that:

y ≠ 0 ⇒ K1 ≤ φ(y)/y ≤ K2   (3.44)

Geometrically, this condition means that the graph of φ(y) lies between two lines with slopes K1 and K2,
as shown in figure 3.26(b). The definition in equation 3.44 implies that φ(0) = 0 and φ(y)y ≥ 0 (i.e., the
graph of φ(y) is located in the 1st and 3rd quadrants, since K1 and K2 are non-negative). An
important question arises: supposing the non-linearity φ(y) belongs to the sector [K1, K2], and the open-loop
transfer function G(s) is stable (Hurwitz), what conditions should be imposed on φ(y) to ensure
that the closed-loop system remains stable? The following theorems answer this question.
3.3.4.B Popov Criterion

The monovariable case of the Popov criterion is addressed by several authors [45, 46, 47, 4] and
establishes a sufficient condition to prove the stability of non-linear systems in closed loop. Considering the
system presented in figure 3.26(a), if the following conditions are satisfied:

1. Matrix A has all its eigenvalues in the left half of the complex plane (i.e., A is Hurwitz) and D > 0;

2. [A, B] is controllable;

3. The non-linearity φ(y) belongs to the sector [0, K];

then the system is globally asymptotically stable, with φ(y) ∈ [0, K], iff ∃ q ≥ 0 such that:

Re[(1 + jωq)G(jω)] + 1/K > 0   ∀ ω ∈ ℝ   (3.45)
This formulation permits an interesting graphical interpretation: if the Nyquist plot of the system
H(s) = (1 + sq)G(s) lies to the right of the vertical line that passes through the point −1/K, then the
closed-loop system is stable [46].
Loop Transformation

As seen in the previous section, if the non-linearity φ(y) belongs to the sector [0, K], then it
is possible to apply Popov's criterion. Nevertheless, even if the non-linearity belongs to a more general
sector of the form [K1, K2], it is still possible to apply the criterion by performing a loop transformation.
The idea is to transform the loop such that all the conditions for using the criterion are still satisfied. If
this transformation is applied, φ ∈ [K1, K2] is transformed into φ̃ ∈ [0, K2 − K1], and the theorem can
now be used, considering the new linear system G̃(s):

G̃(s) = G(s) / (1 + K1 G(s))   (3.46)
3.3.4.C Circle Criterion

This is a stronger and more general criterion for absolute stability problems, of which the Popov
criterion can be seen as a particular case (when q = 0); it allows the non-linearity
to be time-variant [4]. Thus, considering again the system in figure 3.26(a), if the following conditions
are satisfied:

• Matrix A has no eigenvalues on the jω axis and has p eigenvalues in the right half of the complex
plane;

• The non-linearity φ belongs to the sector [K1, K2] and can be time-variant;

• One of the following conditions is verified:

1. 0 < K1 ≤ K2: the Nyquist plot of G(jω) does not enter the disk D(K1, K2) and encircles it p
times in the counter-clockwise direction, where p is the number of poles of G(s) with positive
real parts;

2. 0 = K1 < K2: G(s) is Hurwitz and the Nyquist plot of G(jω) lies to the right of the vertical line
defined by Re[s] = −1/K2 (the Popov criterion for q = 0);

3. K1 < 0 < K2: G(s) is Hurwitz and the Nyquist plot of G(jω) lies in the interior of the disk
D(K1, K2);

then the closed-loop system is absolutely stable in the finite domain.
3.3.4.D Application of the Theorems to the System in Study
First, in order to apply one of the above theorems, we have to present the system in the state space
model fashion way. Thus, we have to model some real time functions in a transfer function, so we can
get the open loop transfer function. In sum, the system is composed by a PGA (whose gain is changed
through digital potentiometers), the input sinusoidal signal, a rectifier, a first order Low Pass filter and the
non-linear control block. Besides the non-linearity, the only blocks that introduce dynamic to the overall
system are the programmable gain amplifier and the filter. The sinusoid can influence the stability of the
system through its amplitude that we call Vin.
The variable gain amplifier can, in this case, be roughly modelled by a type-0 first-order transfer
function, its response being associated with the settling of the programmed gain:

G1(s) = (1/τ) / (s + 1/τ)   (3.47)

where τ is the time constant of a first-order system. To determine the value of this constant,
one could normally observe the step response of this block and take τ as the time the
output needs to go from 0 to 63.2% of its final value. However, this constant can be determined from the
datasheet of the digital rheostats. As stated before, the digital word D assumes values between
1 and 64, so the time constant can be obtained by considering the worst-case scenario (i.e., changing the
digital word of the device one step at a time, from 1 to 64). The AD5113's datasheet provides the
following times: t2 = 10 ns (CLK low time); t3 = 10 ns (CLK high time); and t4 = 15 ns (U/D setup time).
The first increment takes t = t2 + t3 + t4, whilst each following increment takes just t = t2 + t3, so the
full time is:

time = N × (t2 + t3) + t4   (3.48)

where N is the total number of increments needed. In the worst case, 63 increments are needed, so the full
time, and consequently τ, is:

τ = 63 × 20 ns + 15 ns = 1275 ns   (3.49)
This, in turn, yields the following first-order transfer function:

G1(s) = 785000 / (s + 785000)   (3.50)

This means the dynamics of the system will be governed by the low-pass filter, since the pole of its
transfer function is much closer to the imaginary axis:

G2(s) = 10 / (s + 10)   (3.51)
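The time-constant arithmetic of equations 3.48-3.49 is easy to verify; a sketch using the AD5113 timing figures quoted above:

```python
t2, t3, t4 = 10e-9, 10e-9, 15e-9   # CLK low, CLK high, U/D setup (seconds)

def full_time(n_increments):
    # eq. 3.48: the setup time t4 is paid once, each step costs t2 + t3
    return n_increments * (t2 + t3) + t4

tau = full_time(63)                # worst case: stepping the word from 1 to 64
pole = 1.0 / tau                   # pole of G1(s), approx. 7.85e5 rad/s
```

The pole at 1/τ ≈ 7.85 × 10⁵ rad/s is the value used in equation 3.50.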
Hence, for the stability analysis, the system can be represented by the following MATLAB Simulink [42]
block diagram:

[Figure: Simulink block diagram with the linear blocks in the forward path and the non-linear block in the feedback path.]
Figure 3.27: System for Stability Analysis
The diagram is organised in the way discussed earlier, that is, a linear system with a non-linear
feedback block (Lur'e Problem). The gain and the input sinusoid amplitude are fixed at the maximum
values those variables can assume (16 and 5 V of amplitude, respectively). One of the goals is to know
whether the system will be stable when the maximum possible gain is reached since, in such a case, it will
also be stable for lower gains.
(a) Non-Linearity Plot (b) Centered Non-Linearity Plot
Figure 3.28: Non-Linearity

The non-linearity is represented in figure 3.28(a). However, in order to apply the theorems, this
non-linearity must be contained in the 1st and 3rd quadrants, must be time-invariant (in the case of the Popov
criterion) and memoryless. A slight change can be made to the system to shift the graph and center
it around zero. Thus, by subtracting 200 from the input signal of the non-linearity, the graph of
figure 3.28(a) can be shifted to the left by 200 and consequently satisfy the criteria, with no loss of generality
or change in the dynamics of the overall system². Hence, the final plot of the non-linearity function
is presented in figure 3.28(b), which is an 'on-off' non-linearity with hysteresis and a dead zone.

By the definition of equation 3.44, the above function belongs to the sector:

φ ∈ [0, 1]   (3.52)

²The margins used for this analysis may not be the ones used physically. However, it does not matter which margins are chosen, since it is always possible to shift the characteristic to comply with the sector requirements.
Resorting to MatLab, the state-space model of the system in figure 3.27 is:

ẋ = Ax + Bu
y = Cx
u = −φ(y)   (3.53)

where

A = [ −10   1.286×10¹⁰ ; 0   −7.85×10⁵ ],  B = [ 0 ; 1 ],  C = [ 10   0 ]  and  D = [0]   (3.54)

Taking into account equation 3.43, this yields the following open-loop transfer function:

G(s) = 1.286×10¹¹ / [(s + 785000)(s + 10)]   (3.55)
The non-linearity is time-invariant, so the Popov criterion can be applied. Therefore, according to the
criterion in equation 3.45, taking K = 1 and q = 1, stability is guaranteed if the following inequality is
satisfied:

Re[(1 + jω)G(jω)] > −1/K = −1   ∀ ω ∈ ℝ   (3.56)

Figure 3.29(a) represents the Nyquist diagram for equation 3.57:

H(s) = (1 + s)G(s)   (3.57)
[Figure: Nyquist diagrams of H(s), with the vertical line at R = −1 marked.]
(a) Nyquist Diagram of H(s) for Positive Frequencies  (b) Zoom in Nyquist Diagram for H(s)
Figure 3.29: Nyquist Diagram
Taking a closer look at the neighbourhood of the vertical line at R = −1 (see figure 3.29(b)), it can be
observed that the diagram lies to the right of that line, so the Popov criterion states that this system is
absolutely stable for any non-linearity contained in the sector φ ∈ [0, 1]. Note that the particular choice of
sector resulted in a line that passes through the point R = −1, thus coincidentally matching the classic
Nyquist condition, which guarantees linear system stability. If another sector had been chosen, such as φ ∈ [0, 2]
(in fact, the non-linearity belongs to any sector [0, K1] with K1 ≥ 1), the system would also be stable. The
Popov criterion would fail to guarantee stability only for the sector [0, +∞[, since this would constrain the
Nyquist diagram to lie to the right of the imaginary axis and, as witnessed, that is not the case, at
least for q = 1.
3.4 Summary
This chapter addressed fundamental design aspects for the successful development of the Powermeter device. It started by introducing the architecture of the overall system, revealing to the reader how everything connects together (sensors, microcontroller, host, power rails) so that power measurements can be done seamlessly. The rails composing a common PSU were also studied, in order to ensure that Powermeter is suitable to sense any rail and component of a computing system. Furthermore, it was analysed how the different signals that a PSU works with (AC and DC) could be measured and conditioned, so that a precise and accurate measurement could be attained. Under this topic, a convenient approach to sense AC signals was developed: the AGC. This method comprises several electronic blocks, which filter and dynamically amplify/attenuate the sensed AC signal. With this methodology, it is possible to increase the ADC dynamic range, providing very accurate readings and preventing the loss of very low voltage signals.

A thorough analysis of the system was developed. MatLab was used as the supporting tool to design the controller and validate the system's performance under some relevant case examples. Since the dynamic range is tightly influenced by noise, a study of the system's overall noise was carried out, calculating the ADC's SNR, THD and SFDR, with and without oversampling. Finally, because the system relies on a non-linearity, a theoretical analysis of the system's stability was done, introducing some theorems and frequency-domain tools to deal with this kind of problem, known in the literature as Lur'e problems. The analysis showed that the system is absolutely stable for the chosen sector.

The next chapter will present details about the software API developed under the scope of this thesis, which implements the algorithms necessary for the energy/power computation.
4 Powermeter - Software/Firmware
Contents
4.1 Communication System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2 Firmware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
This chapter focuses on explaining how the microcontroller communicates with the host system, how the acquired data is treated and interpreted, and the algorithms used to allow synchronization between both systems. First, an introduction to the USB communication is provided; then, a study of the latency of the communication system is developed. The previously chosen sampling frequency is validated, taking into account all the constraints that limit it. In the last section of this chapter, the algorithm used to compute the energy of a running application is analysed. Finally, it is also shown how the final API can be used to read power samples.
4.1 Communication System
The communication between the microcontroller and the host is established through the Universal Serial Bus (USB) protocol. The PIC18FX455/X550 device family contains a full-speed and low-speed compatible USB Serial Interface Engine (SIE) that allows fast communication between any USB host and the PIC microcontroller. The full-speed mode was used, which grants at most 12 Mbps. The protocol has a hierarchy of descriptors, which permits, for instance, the configuration of a HID (Human Interface Device) or CDC (Communication Device Class) class device.

For this project, the USB transactions are performed using the HID class specification. The class report descriptor allows the user to specify the type of data to be transferred (bulk, interrupt, isochronous). This class was used since it is the usual choice for custom devices and is relatively simple to configure, while, at the same time, being able to provide a high throughput.
The bulk transfer type was used because it can transfer large amounts of data while guaranteeing data integrity. The communication is interrupt-driven, meaning that the host does not keep polling the device waiting for data, which leads to less communication overhead. The host communicates through USB by using the libusb library API [48], which is an open-source library. The library offers two main interfaces for device I/O: the Synchronous and the Asynchronous interfaces.
The Synchronous interface allows the user to perform a USB transfer with a single function call. When the function call returns, the transfer has completed and the user can parse the results. The user can call specific functions to transfer data, such as libusb_bulk_transfer() and libusb_interrupt_transfer(), which transfer data using a bulk and an interrupt endpoint, respectively. The main advantage of this model is simplicity: everything can be done with a single function call. However, the application will sleep inside a transfer until the transaction has completed; consequently, the entire thread is blocked for that duration.
This limitation can be a problem for a real-time demanding application. Fortunately, there is an interface that seeks to solve that problem: the Asynchronous interface. Asynchronous I/O is a more complex interface but, instead of providing functions that block until the I/O has completed, it presents non-blocking functions, which begin a transfer and then return immediately. Since this project demands a device that does not introduce a significant overhead on the normal user application, the interface used for exchanging large amounts of time-sensitive data was the Asynchronous one. Nevertheless, for specific cases, such as initiating the synchronization between clocks or starting and stopping the sampling process, the Synchronous interface was used, since these kinds of requests do not have time restrictions.
The synchronization between clocks is a requirement and a very important procedure of the system, since it allows associating every acquired sample with the time at which it was sampled, based on the clock of the host. This synchronization permits, for instance, getting an insight into when the power consumption on a server is more or less intensive. Thus, it is a main requirement if one wants to do real-time power characterization of an application. This question will be addressed in the following sections.
4.1.1 Latency
The latency of the system in transferring data is a major concern, since it limits the sampling frequency that the system can attain. Hence, some tests were performed on the latency of the system when exchanging packets of bytes between the microcontroller and the host. The HID class can send 64 bytes per packet. However, if the user requires more bytes to be transferred, the protocol divides the information among various packets of 64 bytes (i.e., for 64 bytes it sends 1 packet, whilst for 65 bytes it is forced to send 2 packets).
Before exploring the test that was conducted and its results, it must be said that, for this work, it was decided to transfer 256 bytes of data. The 256 bytes comprise the header and the time tag of every chunk of samples (5 bytes) and the payload (the acquired samples, which occupy 251 bytes). The 10-bit ADC outputs data through two registers of 1 byte (high-byte and low-byte registers). Thus, with 256 bytes, one can transfer 250/2 = 125 samples. Since the energy computation is partially executed in the MCU, more samples would require a larger accumulator, translating into a higher data processing time.
[Figure 4.1: Round-Trip Time — the host sends Message A (OUT) to the µC and receives Message A (IN) back.]
That being said, the test consisted in exchanging 1000 messages between both systems, by performing a so-called Round-Trip Time (RTT) measurement - see figure 4.1 - and by varying the amount of bytes transferred. Figure 4.2(a) reflects the results of that test, where the abscissa axis is the number of packets transferred. As expected, increasing the number of packets increases the latency, and the relationship between them is linear. The results showed that, for every extra packet, the latency increases by roughly 100 µs.
Figure 4.2(b) also provides a histogram for the exchange of 4 packets of data (256 bytes). This figure makes it possible to see whether all the messages take the same amount of time or whether some take more or less. In particular, the test allows identifying the worst-case time and computing the minimum sampling period that the system can attain. By examining the figure, it can be observed that the majority of messages take 556 µs. However, the worst-case time is 636 µs.
[Figure 4.2: Communication Tests — (a) USB latency as a function of the number of packets; (b) histogram of IN transfers of 256 B (times between 436 µs and 696 µs, with the mode at 556 µs).]
Thus, to calculate the minimum allowed sampling period, the worst-case time (636 µs) must be considered:

T_S,minimum × 125 ≥ 636 µs ⇔ T_S,minimum ≥ 5.09 µs (4.1)
4.1.2 Sampling Frequency Choice of the System
The reader may recall that the frequency of operation used for all the results in chapter 3 was FS = 3.3(3) kHz. That frequency was obtained by performing the following analysis.

The sampling frequency is limited by, at least, three aspects:

1. Minimum time that data takes to be transferred between the processors (T1 = 5.09 µs);

2. Time resolution allowed by the Timer0 module¹ (T2 = 16 µs);

3. Time necessary to compute data within the PIC.
The time necessary to process data is one of the constraints that limit the sampling frequency. This time comprises, among others, the time to manage all the required buffers to save, send and average data, the time required for the energy computation, and the time used by the controller stage to dynamically change the gain of the PGA.
The PIC18F4550 works at 48 MHz, but every instruction takes (typically) 4 clock cycles. Therefore, the instruction rate is 12 million instructions per second. The time spent by the microcontroller's firmware to process all the software routines was measured by using an oscilloscope and a flag within the main loop, whose value alternates between the logic values '0' and '1'. The obtained time was approximately Tspent1 = 900 µs (1.1 kHz). Consequently, the assembly code of the project was analysed and an effort was made to reduce this execution time Tspent1. The most used variables (including flags and buffers) were moved to the microcontroller's access bank, which is two times faster than the normal access mode, and loop unrolling (4x) was applied to the most critical loops. However, the bank has a limited space of 95 bytes and it is also used by the compiler to store temporary data. Therefore, not all data could be placed there, namely the buffer used to average the 32 samples of data. In sum, this resulted in a reduction to 273 µs but, to avoid working at the limit, 300 µs was used instead (FS = 3.3(3) kHz).

¹ The project uses a timer module (Timer0), embedded in the microcontroller, which acts as the clock of the microcontroller.
This is, after all, the bottleneck of the project as regards the sampling frequency of the system, when compared with the other limits referred to at the beginning. For an Intel Core i7 3770K, with 4 cores at 3.5 GHz, which issues 4 IPC, this is a very acceptable frequency, since it corresponds to an inspection window of 16.8 million instructions (the lower the better). This is much better than the resolution achieved by the PowerEgg [26], the WattsUp [27] or the approach discussed in [3], whose sampling frequency is 4 Hz, achieving a resolution of 14 billion instructions.
4.1.3 Types of Data Transferred
There are different kinds of data being transferred between host and device: Asynchronous data and Synchronous data (see table 4.1). As mentioned before, the synchronous interface is used just to initiate some routines, such as the time synchronization between clocks and the sampling process. The asynchronous interface is called to exchange large amounts of data over time between the microcontroller and the host, including data regarding the clocks' synchronization and the samples acquired by the microcontroller.
Asynchronous Transfers: Time Stamps and Sampled Data; Time Synchronization Data
Synchronous Transfers: Initialize Clocks Sync and Sampling Process

Table 4.1: Types of Data Transferred
Hence, it is necessary to distinguish the data sent/received by both systems (microcontroller and host). Therefore, the messages are divided into a Header and a Payload. The Header makes it possible to differentiate the type of data being transferred, while the Payload is the data itself that the device/host wants to transmit. In asynchronous mode, the buffer size is at most 256 bytes, while in synchronous mode it is just 5 bytes. Tables 4.2 and 4.3 show how data is divided in the synchronous and asynchronous modes, respectively.

Bits 0-7: Header | 8-15: Channel 0 | 16-23: Channel 1 | 24-31: Channel 2 | 32-39: Channel 3

Table 4.2: Synchronous Data Structure
In the synchronous structure, the byte named Header can carry the NTP_INIT, START or STOP commands (used to start the clock synchronization, and to start and stop the sampling process, respectively). The payload comprises the bytes referring to Channels 0, 1, 2 and 3, which can be any of the available channels to sample (V230, CPU, HDD and GPU), chosen by the user.
In the case of the asynchronous data structure, the Header byte can be NTP_HEADER or DATA_HEADER, referring to the two kinds of asynchronous data - Time Synchronization Data and Time Stamps and Sampled Data, respectively. The payload is the data exchanged between both processors, which can be data about the clocks' synchronization or samples acquired during the sampling process.

Byte offset 00h: Header (1 byte) + Payload (63 bytes)
Byte offset 40h: Payload (64 bytes)
Byte offset 80h: Payload (64 bytes)
Byte offset C0h: Payload (64 bytes)

Table 4.3: Asynchronous Data Structure
4.1.3.A Synchronous - Clocks Synchronization and Sampling Process Initialization
The clocks' synchronization and the sampling process initialization are done by using the synchronous structure. Table 4.4 presents an example of the structure used to request the start of the sampling activity for the CPU and GPU channels. Channels that are not requested are filled with NOP values. All the commands and headers are specific identifiers, common to both the MCU's and the host's applications, and are referenced through defines saved in a C header file. CPU_CHANNEL and GPU_CHANNEL are macros which refer to one of the thirteen analog input channels (AN0-AN12) of the used microcontroller, to which the sensor outputs are connected.
Bits 0-39: START | CPU_CHANNEL | GPU_CHANNEL | NOP | NOP

Table 4.4: Sampling Process Initialization Command Example
4.1.3.B Asynchronous - Time Synchronization Data

Bits 0-71: NTP_HEADER (1 byte) | Time (8 bytes)

Table 4.5: Host and PIC's Clock Synchronization Data Structure

Asynchronous communication uses a buffer of 256 bytes, divided into packets of 64 bytes. However, the clocks' synchronization process only needs 9 bytes of information (1 byte for the Header and the remaining slots for the Time data) - see table 4.5.
The clocks' synchronization is achieved by using a protocol that resembles the Network Time Protocol (NTP) algorithm, in a client-server fashion. The algorithm has four distinct time stamps: Originate Time Stamp - Time Request Sent by Client (T1); Receive Time Stamp - Time Received by Server (T2); Transmit Time Stamp - Time Reply Sent by Server (T3); and Destination Time Stamp - Time Reply Received by Client (T4).
[Figure 4.3: Synchronization Protocol — the client sends an NTP request at T1, the server receives it at T2 and replies with ACK + T2 + T3 at T3, and the client receives the reply at T4.]
In this situation, the Client is the microcontroller and the Server is the host. Figure 4.3 shows the normal steps for synchronization. The Client requests a synchronization process, saving time T1; then the Server acknowledges and sends times T2 and T3 to the Client. With those time stamps, one can determine the delay (the time elapsed since a message is sent until it arrives at its destination) inherent in a Client-Server communication, by resorting to the following formula:
Delay = (T4 − T1)− (T3 − T2) (4.2)
To compute the time difference between the two clocks (i.e., the offset), we start by assuming that there is no asymmetry in the communication, that is, the Client → Server time and the Server → Client time are the same:

Offset = T2 − (T1 + Delay/2) = [(T2 − T1) + (T3 − T4)]/2 (4.3)
If the offset is positive, the server's clock is ahead with respect to the client's, so the Offset value must be added to the client's time-stamp. On the other hand, if the offset is negative, then the server's clock is lagging with respect to the client's, and the Offset value must be subtracted from the client's time-stamp.
Hence, with the knowledge of the Offset value, we can adjust the clock used to generate the time stamp within the microcontroller and synchronize it with the host's clock. The synchronization process runs long enough to guarantee an offset in the order of microseconds. Although this process allows a very accurate synchronization, over time the two clocks will inevitably lose synchronism. This is mainly due to the asymmetric routes and the network congestion during the host's clock update with the internet time (NTP). This asynchronism is practically linear with time, as can be seen in figure 4.4, where the offset value increases with time.
[Figure 4.4: Offset Change Over Time — the offset grows approximately linearly over 70 minutes; the linear regression slope is 645.78 µs per minute.]
The linear regression of the data gives the rate at which the offset changes. According to it, the offset increases approximately 645.78 µs per minute. Thus, we can adjust the microcontroller's clock to counteract this deviation by adding 645.78 µs every minute. At the expense of increasing the time overhead, it was decided to synchronize the clocks every time a packet is received, allowing a better synchronism, which is an important requirement for real-time power profiling.
4.1.3.C Asynchronous - Time Stamps and Sampled Data

Byte offset 00h: DATA_HEADER (1 byte) | Time Stamp (4 bytes) | Samples
Byte offset 40h: Samples (64 bytes)
Byte offset 80h: Samples (64 bytes)
Byte offset C0h: Samples (64 bytes)

Table 4.6: Sampling Process Data Structure
The data exchanged using the asynchronous interface is the type of data that the application will handle most of the time. Every time the host receives this type of data, it will receive a packet containing not only the specific header, but also a payload containing a time stamp - representing the time the first power sample was taken - and N power samples, which can be related to data coming from one or more sensors. As mentioned, the buffer has a capacity of 256 bytes of data (see table 4.6). Thus, excluding the Header and the Time Stamp, which together occupy 5 bytes, up to 125 data samples can be transferred within a 256-byte buffer (because each sample occupies 2 bytes of data). For instance, for N = 100, if the CPU sensor and the GPU sensor are being sampled, then the received packet will have in total N/2 = 50 samples from the CPU sensor and 50 samples from the GPU sensor. This, however, does not translate into a reduction of the sampling frequency; it only means that data will be exchanged more frequently between the host and the microcontroller.
The time stamp associated with every chunk of data (i.e., N samples) is obtained by using another timer module (Timer1), with a period of 5 µs. This variable is a 32-bit integer and it is synchronized with the lower 32 bits of the host's real-time clock (in microseconds). Since the time stamp resolution is in microseconds and the variable is a 32-bit unsigned integer, the stamp will roll over at some point in time. Despite that, this does not compromise the ability to correctly tag the data, since the lower 32 bits of the host's clock are also used to stamp the data. In fact, every time the host receives a packet, it saves the time of arrival of that packet in a variable. This makes it possible to compare that time with the incoming time stamp, find out whether there was a roll-over and, if that was the case, correct it.
4.2 Firmware
[Figure 4.5: System Diagram - Microcontroller (Acquisition Board ↔ µC ↔ Host)]
The firmware comprises the set of instructions that are preprogrammed in an embedded system (the microcontroller). This set of instructions allows the communication between the hardware system (acquisition board) and the host, establishing a bridge between both systems (figure 4.5). In the following, some of the strategies and procedures used to process data with the PIC microcontroller are introduced.
4.2.1 Buffering Strategy
[Figure 4.6: Dual-Buffer Strategy (A_buffer and B_buffer alternate between storing and sending)]
A dual-buffer strategy (figure 4.6) was used to save and send the sampled data: every sample is saved in a buffer (A_buffer) of size 256 bytes. When this buffer is filled with data, it must be sent to the host while still allowing the sampling process to continue seamlessly. Consequently, another buffer (B_buffer) is necessary to save the incoming samples whilst A_buffer is being used only to transfer data. When B_buffer is filled, it is used to send data while A_buffer becomes responsible for storing the sampled data. Thus, the buffer used to store or to send data alternates between A_buffer and B_buffer over time.
4.2.2 Oversampling and Maximum Search Algorithm
Figure 4.7: Oversampling with Rolling Buffer
As referred before, in chapter 3, oversampling of the acquired samples was performed at f = 3.3(3) kHz, in order to improve the DR. Hence, a rolling average buffer of 8 samples was used (figure 4.7): this buffer has a pointer, which returns to the start of the vector as soon as the buffer is filled with data. While the buffer is not completely filled with samples, the program sends the samples directly, without averaging them. This allows a constant sending of useful data, even at the start of the program. When a new sample arrives, it is added to the accumulator after the oldest sample is subtracted from it. Finally, the output of the accumulator is averaged and the oversampling process is complete.
In the case of the AC sensor, after the oversampling technique, a search for the maximum of the acquired data is also necessary. This is needed because, in order to compute the power demanded from the power supply, the amplitude of the current signal must be known, so that it can be multiplied by the 230 VRMS and by the power factor (= 0.99).
The pseudocode for the maximum search is depicted in algorithm 4.1, which illustrates just the main part of the algorithm. This algorithm returns the peak of the sine wave acquired during AC current sensing. The algorithm is rather simple, fast and, most importantly, effective. It follows the acquired signal and tests whether the current sample is higher than the previous one; if it is, it is declared the temporary maximum. Then, it tests whether the current sample is lower than the temporary maximum, taking into account the delta parameter, which was set to 10 quantization levels after some experiments. If this comes to be true, then the temporary maximum is declared an absolute maximum and a search for the minimum value is conducted from this point on. These two routines are inextricably linked, and the maximum can be found because the search for the minimum updates the temporary maximum value.
Algorithm 4.1 Maximum Search
1: if lookformax then
2:   if Current Sample < temporary_max − delta then
3:     New absolute maximum found
4:     lookformax = 0
5:   end if
6: else
7:   if Current Sample > temporary_min + delta then
8:     New absolute minimum found
9:     lookformax = 1
10:  end if
11: end if
4.3 Software
[Figure 4.8: System Diagram - Host (Acquisition Board ↔ µC ↔ Host)]
The software application comprises the set of instructions that run on the host, necessary to establish the communication with the microcontroller and to process and output relevant data to the user (figure 4.8). This section presents the interfaces, functions and procedures used in the development of the Powermeter application that the end-user has access to. First, the API functions used to output the energy are presented; afterwards, it is explored how the energy computation (and other routines) work together to successfully return the energy and power samples of an application.
4.3.1 Powermeter Application Programming Interface
The API returns the energy that was spent between two points of the code of the user application. For this, the user has to call three major functions, which use the synchronous structure introduced in section 4.1.3:

• powermeter_api_init(char channel1, char channel2, char channel3, char channel4);

• powermeter_api_start(void);

• powermeter_api_stop(void).

To use the first function, the user must choose which channel(s) to sample (CPU, GPU0, HDD or V230). For channels that the user does not want to sample, NOP must be used as the argument. For example, if only the CPU channel must be sensed, the call must be: powermeter_api_init(CPU, NOP, NOP, NOP). This function executes every necessary initialization, including the libusb library initialization, vector and file initialization, and the synchronization between clocks.
The powermeter_api_start(void) function gives the order to start the sampling process, whilst the powermeter_api_stop(void) function, besides stopping the sampling process, also closes the opened files (used to store, in different files, all the samples and time stamps for post-analysis), frees the allocated memory and performs the energy calculation. The latter indirectly calls the energy_calc(long long end_time) function, whose goal is to return the energy spent between the start and stop calls. Table 4.7 summarizes the main functions and their respective features.
Function | Arguments | Features
powermeter_api_init | channels to sample | General initializations (libusb, variables, clock sync)
powermeter_api_start | void | Starts data sampling
powermeter_api_stop | void | Stops sampling, frees allocated memory, calculates energy

Table 4.7: Main API Functions
4.3.2 Energy Calculation
The energy is calculated from the power curve by integrating it over time. Since the operation occurs in the discrete-time domain, the energy spent between the start and stop commands must be computed by using numerical integration methods. There are various methods available to perform one-dimensional integration, based on interpolation functions: the Rectangle Rule (order-0 polynomial), the Trapezoidal Rule (order-1 polynomial) and Simpson's Rule (order-2 polynomial).
Energy = Σ_{n=1}^{N} T_Sampling × (Pwr_{n−1} + Pwr_n)/2 (4.4)

[Figure 4.9: Trapezoidal Rule — the power curve y = f(x) is approximated by trapezoids between consecutive samples Pwr_{n−1}, Pwr_n, ..., Pwr_{N−1}, Pwr_N.]
For this work, the trapezoidal integration rule was used (see figure 4.9), since it is a method which leads to a small integration error when compared to the rectangle rule, and also because it is simple. Simpson's rule would be a better choice to reduce the error. However, it would require more sums, multiplications and divisions by numbers which are not powers of 2, and that constitutes a problem for the PIC18F4550's architecture.
The algorithm for the energy computation is presented as pseudocode in algorithm 4.2. The algorithm starts by adding the energy of some batches of samples, previously calculated in the microcontroller before sending the entire packet to the host: with this procedure, the CPU is not overloaded with extra computation, yielding less power and time overhead when running the Powermeter API.
Algorithm 4.2 Energy Calculation
1: batch_number = index / N_samples
2: for Energy batches below batch_number do
3:   Total_Energy += Energy_Batch
4: end for
5: Calculate Remaining Energy using Trapezoidal Rule
6: return Total_Energy
[Figure 4.10: Energy Computing — batches B_0, B_1, ..., B_N over power samples Pwr_0 ... Pwr_N, where B_n refers to the energy batch of the nth packet and batch_number marks the last full batch.]
The energy batches received from the microcontroller are stored in order in a vector. Each batch corresponds to the energy of a full packet of N received samples, so there is a correlation between the index of a time stamp and the batch it belongs to. For instance, suppose the host receives 100 samples per packet and, at the end, it has received 200 samples of data (resulting in 2 energy batches), with the time at which the sampling process ceased referring to sample number 110. Then, doing an integer division by the number of samples per packet (110/100 = 1), we know that the energy of the first batch can be added directly (observe figure 4.10). From that point on, the energy must be computed by resorting to a numerical integration method, stopping only when the power sample associated with the end time stamp (the time at which the sampling process stopped) is reached. Finally, the total energy is returned.
4.3.2.A Time Stamp Search
Besides the matter of how, specifically, the energy computation is done, there is also the issue of finding the time stamp corresponding to the time at which the sampling process terminated. Thus, a search algorithm is necessary to find the closest time stamp to that time. The pseudocode in algorithm 4.3 presents the algorithm used to solve the problem². To complement it, figure 4.11 illustrates the search process in a block diagram.
Essentially, the algorithm starts by estimating the index of the vector where the end time is likely to be: that is done with the pseudocode in line 1. The reasoning behind it is that the time stamps are equally spaced by the sampling period (T_sampling); thus, subtracting the first time stamp from the end time and dividing by T_sampling shall give a good estimate of the wanted index. Nevertheless, it is likely that the obtained index is not the best choice, so a linear search is conducted over the vector, either in an increasing or decreasing direction, as illustrated in figure 4.11. When a possible time stamp is found by the algorithm, a final comparison is performed between the end time and the two closest time stamps, to discover the one which produces the least absolute error.
² The pseudocode only presents half of the original source code; the other half is similar but, instead, it searches for the time stamp on vector entries above the index rather than below it. The time stamps are saved in a structure containing a vector with the data and a variable providing the actual size of that vector.
Algorithm 4.3 Time Stamp Search
1: index ← (end_time − start_time)/T_sampling
2: if time[index] > end_time then
3:   for i ← index − 1 to 0 do
4:     if time[i] <= end_time then
5:       if time[i + 1] is closer to end_time than time[i] then
6:         index ← i + 1
7:         break
8:       else
9:         index ← i
10:        break
11:      end if
12:    end if
13:  end for
14: end if
[Figure 4.11: Case scenario when time[index] = Tn > end_time (where Tn = T0 + (n − 1) × T_Sampling): the search proceeds downwards from index to i.]
4.4 Summary
In this section, the reader was introduced to subjects about the MCU and the host communication
and the development of the MCU’s firmware and host’s API. Firstly, a briefing about USB communication
was presented, explaining, roughly, how it works how it connects to the rest of the work. For instance,
the latency of USB transactions limit the maximum sampling frequency, that can be attained.
Afterwards, some aspects regarding the microcontroller's firmware were detailed, such as the
buffering strategy used and how the sampling frequency can be configured. The several types of data
interchanged between the host and the microcontroller were also described, along with how each side
"interprets" each type by verifying the header of each received message. The synchronization process was
explained, which consists of a set of messages traded between the two systems in order to synchronize
their clocks.
In the final section of this chapter, the general structure of the Powermeter API was discussed. Within
this subject, the reader had a grasp of the main functions used to operate the tool, understanding
their purposes and in what contexts they must be used. It was also specified how the energy computation
is done, resorting to a numerical integration method. Finally, a description of some important
algorithms (Time Stamp Search and Energy Computation) was provided.
5 Results
Contents
5.1 Calibration
5.2 Non-Linearity Thresholds
5.3 Power Profiling
5.4 PC Energy Consumption Characterization
5.5 Summary
In this chapter, several results are provided. First, the results of the sensor calibration and of the
loop stability tests are analysed, specifying the calibration process and the test conditions. Then, the
device is used to profile specific benchmarks (FFT, LU Matrix Decomposition and RADIX) on a Personal
Computer, and the results are compared with the ones obtained with RAPL in terms of time and power
consumed by the workload. Temporal charts with the instantaneous power measured by both applications
are also given, showing the differences between them. Finally, the device is used to characterize the
power consumed by a common machine under the stress of specific workloads. The results reveal which
rails are correlated with the power demanded by the modules of a computer system (CPU, HDD,
RAM, NIC and others).
5.1 Calibration
5.1.1 Sensors and ADC Calibration
The ADC was calibrated by varying the input voltage with a voltage source, starting from 0 V and
increasing in 200 mV steps until VRef = 5.03 V was reached. A linear regression
model was fitted to the data. Figure 5.1(a) shows the results of the test.
[Figure: (a) ADC digital word vs. input voltage (V); (b) AC sensor output current vs. input current (A); both plots show real and ideal values]
Figure 5.1: ADC and Sensor Calibration
Equation 5.1 represents the linear regression of the data and is used to convert the digital values
back to voltage. The variable D refers to the digital word output by the ADC and VReal is the
voltage obtained after calibration.

VReal = (5033.425088/1024) × D + 7.46460873 mV (5.1)
IReal = 1.0172× IAcq + 0.0848 A (5.2)
A calibration of the conditioning board was also conducted by varying the sensor input current, starting from 0 A to 2 A with steps of 100 mA, using a current source (see figure 5.1(b)). After the data
had been acquired by the ADC, it was converted back to current values and compared with the ideal ones.
For the AC current sensor, the current values are corrected using equation 5.2, where IAcq is the
measured current and IReal is the corrected value. The AGC introduces an offset
for every gain setting, so a zero calibration had to be performed: the inputs of the AGC were shorted to ground
and a conversion to digital values was carried out for every gain setting. Those values are stored in a
look-up table for easy access after every ADC conversion.
Gain         |  1 |  2 |  4 |  8 | 16
Offset (LSB) |  0 |  1 |  2 |  5 | 11

Table 5.1: AGC Calibration
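As a sketch of how these calibrations combine on the host side, the conversion of a raw ADC word into a calibrated value might look as follows (helper names are hypothetical; only equation 5.1, equation 5.2 and the offsets of table 5.1 are taken from the text):

```python
# Offsets (in LSB) measured with the AGC inputs shorted, per gain setting
# (table 5.1).
AGC_OFFSET_LSB = {1: 0, 2: 1, 4: 2, 8: 5, 16: 11}

def adc_to_voltage_mv(word, gain):
    """10-bit ADC word to millivolts (equation 5.1), after removing the
    gain-dependent AGC offset from the look-up table."""
    word -= AGC_OFFSET_LSB[gain]
    return (5033.425088 / 1024) * word + 7.46460873

def correct_ac_current(i_acq):
    """AC sensor calibration of equation 5.2 (amperes)."""
    return 1.0172 * i_acq + 0.0848
```

Storing the AGC offsets in a look-up table, as described above, keeps the per-sample correction to a single subtraction.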
5.2 Non-Linearity Thresholds
In chapter 3, the stability of the proposed AGC system was studied based on generic threshold
values. In this section, it is explained how to derive the best threshold values, so that chattering
(i.e., rapidly repeated gain switching around the threshold value) does not occur while the highest
gain is always attained, so that less noise and fewer measurement errors are introduced. These threshold
values were used to obtain the results presented in later sections.
Two main thresholds were calculated based on the amplitude reached by the AC current after running
some benchmarks (between 0.45 A and 0.53 A, corresponding to 30 mV and 50 mV, respectively).
They are the lower limit T1 = 715 and the upper limit T2 = 959: these thresholds guarantee that every
signal will remain between these limits, which correspond to 3.5 V and 4.7 V, respectively. Moreover, a
third limit was calculated: the maximum gain limit T0 = 540. For every signal amplitude lower than the
limit T0, the gain jumps from G=1 to G=16 directly, allowing a faster loop response and an amplitude
close to the ADC's full range. This limit (in voltage) corresponds to 2.64 V but, without the DC offset,
it actually corresponds to 137 mV. As the reader may recall from section 3.3, the output range of the
current sensors after being amplified by the bandpass filter was found to be [120 mV, 200 mV]. Therefore,
the T0 limit shall always guarantee the maximum amplification gain for that range, without saturating
the output signal. Nevertheless, if the signal after amplification gets above the upper
limit T2, the controller reduces the gain continuously until the amplitude lies within the allowed margins.
In the worst case, this means going from G=16 to G=1/4, which can take 1.275 ms.
Hence, the thresholds were chosen carefully, creating enough hysteresis to prevent chattering and
to always guarantee the maximum signal amplitude at the ADC's terminals, thus improving the DR. Tests
have been conducted to prove the stability of the loop, by injecting a 50 Hz sine wave and
varying its amplitude.
Figure 5.2(a) shows the case where a sine wave of amplitude 330 mV is amplified with a gain of 4.
In figure 5.2(b), an input sine wave of amplitude 3.9 V is attenuated with a gain of 1/2.
(a) Sine Wave Amplification (b) Sine Wave Attenuation
Figure 5.2: Stability Proof
The tests evidence that the loop achieves stability (i.e., does not enter chattering) and at the
same time increases the system's DR, by applying the best gain to the input signal at all times.
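The hysteresis rule described above can be sketched as a simple decision function. This is a simplification, not the actual controller code; the gain ladder with the attenuation settings 1/4 and 1/2 is an assumption inferred from the worst-case gain reduction mentioned in the text:

```python
# AGC decision thresholds (ADC codes) and assumed gain ladder.
T0, T1, T2 = 540, 715, 959
GAINS = [0.25, 0.5, 1, 2, 4, 8, 16]

def next_gain(gain, peak_code):
    """One step of a simplified AGC gain-update rule with hysteresis."""
    i = GAINS.index(gain)
    if peak_code > T2 and i > 0:
        return GAINS[i - 1]              # too large: step the gain down
    if peak_code < T0 and gain == 1:
        return 16                        # very small at G=1: jump straight to G=16
    if peak_code < T1 and i < len(GAINS) - 1:
        return GAINS[i + 1]              # below the lower limit: step up
    return gain                          # inside [T1, T2]: hold (hysteresis band)
```

The hold branch between T1 and T2 is what provides the hysteresis: once the amplified signal sits inside that band, no gain change is issued and chattering cannot occur.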
5.3 Power Profiling
The main goal of this section is to understand how the measuring methods (internal and external
measurements) influence algorithm execution (time overhead) and also to compare the readings given by
Powermeter and RAPL. The measurements obtained with RAPL were performed with a time resolution of 2
ms. In fact, although Intel [37] reports a maximum update rate of 1 kHz, tests with that time
period revealed that some readings were zero. As a result, it was decided to use an update rate
of 500 Hz.
5.3.1 SPLASH-2 Benchmark Tests
The SPLASH-2 benchmark suite, included in the PARSEC suite [31], was used for this evaluation. This
suite offers a variety of workloads: for benchmarking purposes, the FFT, LU Matrix Decomposition
and RADIX sorter workloads were used. These workloads and their respective Makefiles
were modified to include the calls to the Powermeter API or to RAPL, and each algorithm was
executed in a loop 1000 times in a row. The results were averaged to attain statistical significance.
As said before, the main focus of these tests was to evaluate the execution time with and without the power
measurement systems and to compare the energy consumption reported by Powermeter and RAPL.
1. FFT: for this workload, the tests used a data set of 2^20 complex numbers and the
algorithm was distributed over 4 threads (one per physical core);
2. LU: a 1024x1024 matrix of doubles was used as input and the algorithm was distributed over 4
threads (one per physical core);
3. RADIX: for the RADIX workload, a data set of 4292608 32-bit random integers was sorted and the
algorithm was distributed over 4 threads (one per physical core).
Workload | Total Time (µs) | Total Time w/ Powermeter (µs) | Time Overhead (%) | Energy (J)
FFT      | 76639298        | 77474667                      | 1.09              | 2757.21
LU       | 98014713        | 99266712                      | 1.27              | 4742.26
RADIX    | 96320861        | 97226277                      | 0.94              | 4023.43

Table 5.2: Powermeter Time Overhead

Workload | Total Time w/ RAPL (µs) | Time Overhead (%) | Energy (J)
FFT      | 76826729                | 0.24              | 2277.01
LU       | 98648873                | 0.55              | 4257.96
RADIX    | 96438625                | 0.12              | 3698.20

Table 5.3: RAPL Time Overhead
Observing tables 5.2 and 5.3, it can be concluded that the overhead due to the Powermeter or RAPL
metering systems is not significant, since in the worst case they take 1.27 % and 0.55 % more time, respectively,
than the original source code. However, possibly due to the higher amount of data exchanged and
computed (Powermeter works at f = 3.3(3) kHz, whilst RAPL works at f = 500 Hz), Powermeter introduces
more time overhead than RAPL. Regarding the energy measurements of both devices, there is a clear
difference between Powermeter and RAPL, with the former indicating higher energy values.
This is not strange, since Powermeter runs for a longer period of time, has a higher resolution than RAPL and,
unlike RAPL, is not based on PMCs; but mainly it is due to the fact that RAPL does not report all the energy
consumed by the uncore of the CPU.
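The overhead percentages in tables 5.2 and 5.3 follow directly from the timing columns; for instance, for the FFT row of table 5.2:

```python
def time_overhead_pct(t_instrumented_us, t_baseline_us):
    """Relative execution-time overhead of an instrumented run, in percent."""
    return 100.0 * (t_instrumented_us - t_baseline_us) / t_baseline_us

# FFT row of table 5.2: baseline run vs. Powermeter-instrumented run.
fft_overhead = time_overhead_pct(77474667, 76639298)  # ≈ 1.09 %
```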
5.3.2 NAS Parallel Benchmark Tests
In turn, to evaluate the power profiling and the trade-off between energy and time consumption,
the NAS Parallel Benchmarks (NPB) with OpenMP [28] were used. This suite includes
workloads representative of scientific applications, like FT (3-D Fast Fourier Transform), BT (Block Tri-diagonal solver
of the Navier-Stokes equations) and CG (Conjugate Gradient method to compute an approximation to the
smallest eigenvalue of a large, sparse, unstructured matrix). Version NPB 3.3.1 was compiled; it
offers different benchmark classes, S, W, A, B, C, D, E and F, sorted from the smallest to the largest
problem size. In the conducted tests, only class B was used, which offers a standard problem size and
a reasonably high number of iterations (depending on the workload), thus allowing an
application to be profiled for a satisfactory period of time.
Figures 5.3(a) and 5.3(b) present power profiling graphs of the NPB FT benchmark, compiled with -O3
and CLASS=B and run with 4 threads, one per core. In this case, class B translates into a
problem size of a 512x256x256 grid and 20 iterations. Both RAPL and Powermeter were used to profile
this workload. Figure 5.4 shows the different power patterns of the CPU, RAM and HDD over
time for the EP and BT benchmarks.
[Figure: CPU power (W) vs. time (s) measured by Powermeter and RAPL; (a) full run, with startup, initialization and iteration phases marked; (b) in-depth view of one iteration, t ≈ 7.7 s to 8.6 s]
Figure 5.3: Power Profiling
In figure 5.3(a), the fluctuations of the power drained by the CPU over time are clear. It was
found that the number of valleys corresponds exactly to the process iterations executed by the benchmark
(N = 20). The application starts with a warm-up phase and an initialization phase, followed by
N iterations (for CLASS B, N = 20). Both RAPL and Powermeter follow the power variations along
time and it is possible to distinguish the computational stage (when power is at its peak) and the
communication stage (when power comes down). Even so, the RAPL counters indicate less power consumption
than Powermeter and a huge amount of spikes (some of them reaching values above the maximum
power consumption specified by Intel [49]). An in-depth view of one of the process iterations reveals
more details, as can be observed in figure 5.3(b). In this figure, the difference in resolution between
RAPL and Powermeter is clear: while the former outputs an average of the measured power, the
latter indicates the real power at each instant in time and, thereby, many more fluctuations in the power
behavior are visible; three small valleys can also be distinguished in each iteration, which must be related
to inner subroutines of the FT benchmark. Thus, while with Powermeter it is possible to distinguish
several different patterns in the power demanded by an application, with RAPL such level of detail is
lost, making it inapt for the real-time power characterization of high-complexity workloads.
Figures 5.4(a) and 5.4(b) show the power profiling results of the EP and BT benchmarks, illustrating
the power consumption of the CPU, HDD and RAM (for better visualization of these figures, please refer to
figures A.6 and A.7 in the appendix). For each benchmark, the results are displayed
for 1, 2, 4 and 8 threads, running on different processors. The corresponding source code was compiled
with gfortran and the -O3 flag and linked to the Powermeter API. Each subfigure focuses on the power usage for
the first few seconds of the test, in order to clearly show the resulting power behavior patterns.
The embarrassingly parallel benchmark (EP) is essentially computation intensive and communication
free. It consumes a consistent amount of power during its entire execution, since it
is perfectly balanced and each thread executes a CPU-intensive job. By contrast, the BT benchmark
(represented in figure 5.4(b)) is more memory-intensive. As a consequence, its power pattern is not
consistent over the test run. It was found that the power consumed by the CPU and the memory is
[Figure: power (W) of CPU, HDD and RAM (5V) vs. time (s) for 1, 2, 4 and 8 threads; (a) EP CLASS=B, (b) BT CLASS=B]
Figure 5.4: Profiling of NPB Benchmark
interrelated in a way that when memory power goes up, CPU power goes down and vice-versa. In addition,
the number of valleys corresponds exactly to the number of iterations of the workload (in this case N = 200).
Neither test is HDD-intensive, so there are few disk accesses and, therefore, the disk consumes a
constant amount of power over time.
There is a trade-off between energy consumption and time performance that should be considered
to determine the best configuration in number of cores, based on the user's needs. For performance-constrained
systems, the best operating points are those that minimize execution time. For power-constrained
systems, the best operating points are those that minimize power or energy consumption.
For systems where energy efficiency must be optimized or power and performance must be balanced, the
appropriate metric must capture whether the performance gain is worth the additional power requirement.
Thus, the metric used in [50] can be applied to understand the trade-off between energy consumption and
time performance. The metric is called Energy-Delay Product (EDP) and is defined as the product
of the time necessary to execute the code and the respective consumed energy. Thereby, the smaller the
EDP a configuration achieves for an application, the better the efficiency of that configuration for that
application. Figure 5.5 presents the results of this metric for the LU and MG benchmarks, for
1, 2, 4 and 8 threads. The results are normalized with respect to the value obtained for 1 thread (i.e.,
EDP_N = (E_N × D_N)/(E_1 × D_1), where N refers to the number of threads).
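The normalized EDP can be computed directly from the measured energy and delay pairs; for instance, comparing the LU class-B figures quoted in this section (8 threads: D = 81.7 s, E = 4547 J; 4 threads: D = 82.5 s, E = 4233 J):

```python
def normalized_edp(e_n, d_n, e_ref, d_ref):
    """EDP ratio (E_N * D_N) / (E_ref * D_ref); values above 1 mean the
    reference configuration is the more efficient of the two."""
    return (e_n * d_n) / (e_ref * d_ref)

# LU class B: 8 threads vs. 4 threads (figures from the text).
edp_8_vs_4 = normalized_edp(4547, 81.7, 4233, 82.5)  # > 1, so 4 threads win
```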
For the best time performance of the LU benchmark, the code should be divided into 8
threads (D = 81.7 s and E = 4547 J), but for the lowest energy cost it should be divided into 4 threads
(D = 82.5 s and E = 4233 J). According to the results in figure 5.5, the best configuration is to
divide the algorithm into 4 threads, evenly over the 4 physical cores. Therefore, the metric gave priority to
the energy cost at the expense of the small loss in time performance.
In the case of the MG benchmark, the lowest energy cost and the best time performance are both
obtained by dividing the workload into 4 threads (D = 6.5 s and E = 228 J). The EDP metric agrees and
states that the best configuration is to divide it into 4 threads as well.
[Figure: normalized EDP vs. number of threads (1, 2, 4, 8) for the LU and MG benchmarks]
Figure 5.5: EDP Metric
5.4 PC Energy Consumption Characterization
Powermeter was tested on Liliana, a machine hosted by the SiPS group at INESC-ID Lisbon. The machine
features the following characteristics:
Module | Component
MB     | ASUS P8Z77-V LX
CPU    | 3.5 GHz Intel i7 3770K (Ivy Bridge architecture - TDP = 77 W)
RAM    | G.Skill Sniper DDR3 2x8 GB - 1.866 GHz
HDD    | Seagate Barracuda - 2 TB
PSU    | CORSAIR TX750

Table 5.4: Machine's Characteristics
In this section, the goal is to characterize the power consumption of the aforementioned
machine. A few tests were performed, not only to discover which connectors/rails from the PSU are directly
correlated with the power drawn by the CPU or the RAM, for instance, but also to understand how
significant that power consumption is when compared with the total power consumed by the machine.
The first test consists simply in running the API while the machine is in its idle state (i.e., no intensive
workload is running). The power spent while a system is in the idle state accounts for a very large share of
the total power dissipation; however, this power is not considered as used for computing. Active power
corresponds to the extra power dissipated when the system is no longer in idle mode, but in active
mode.
1. CPU: the CPU is powered directly by the four 12V cables connecting to the EPS12V connector. This was
confirmed by experiments and by the ATX12V power supply design guide, so the sensors were
connected directly to these cables. For this test, the LU matrix decomposition provided by the SPLASH-2
benchmark [31] was used. The test consisted in the decomposition of a 4096x4096 matrix of doubles
(128 MB of data) and it was run in a loop 25 times, so that most of the necessary data would be
present in the CPU cache, decreasing memory accesses. The number of processors used during the
test was also varied, so the benchmark was run for P = 1, P = 2, P = 4, P = 6 and P = 8
processors.
2. HDD I/O: the HDD is powered directly by two independent cables (+12V and +5V rails); hence,
by directly measuring these rails, the disk power consumption can be profiled. In this scenario, a
series of tests was performed, consisting of several write operations to the hard disk. Those tests
require that the amount of data being written largely surpasses the total RAM size - at least a dataset
two times larger than the total available RAM (on Liliana, approximately 15778
MB). The Bonnie benchmark utility [30] was used. This benchmark performs a series of writes
and rewrites to a file. At first, the benchmark writes to a file calling the putc() stdio macro
- the loop that does the writing should be small enough to fit into any reasonable I-cache. Next,
it writes the data efficiently, writing blocks of data to the disk by calling write(2). To finish the test, each
chunk of the file is read with read(2), changed, and rewritten with write(2), requiring an lseek(2).
3. RAM Memory Accesses: the memory modules must drain power from one of the power rails connected
to the motherboard. To profile the power spent in memory accesses, the STREAM benchmark [29]
was used, which measures the effective memory bandwidth on one or more cores. A
total of three tests was conducted: 1, 2 and 4 threads. With this, it is expected to grasp which
power rails are correlated with intensive memory accesses. Regarding the dataset, the benchmark
requires the size of the array to be four times larger than the CPU's L3 cache (8
MB).
4. TCP/UDP Data Packets: in this case, the aim is to isolate the power consumed by the NIC, so
the iPerf benchmark [51] was used, which can generate TCP and UDP data packets and measure the
throughput between a server and a client. In order to have a significant set of results, the benchmark
was run for t = 100 seconds. During the test, the Liliana machine was connected as a client to
the Diana machine (another system belonging to SiPS), which acted as a server.
Figures 5.6(a) and 5.6(b) show the power consumption measured during the five different workload
tests on the machine. In figure 5.6(a), each bar corresponds to the system power draw for each
of the workloads. In figure 5.6(b), a stacked bar chart is provided, where for each workload (idle, LU
1, LU 2, ...) the power drawn by a component (CPU or HDD) or a rail (12V, 5V and 3.3V) is presented.
By observing the idle bar of figure 5.6(b), the reader can see how the power is distributed in a
desktop system consuming about 25 W of DC power. Although the CPU only consumes about 8 W, it is
still the major portion of the power consumption and, together with the disk's power, represents more than 50 %
of the system power. The other rails (which are related to the RAM, fans, peripherals and others) consume
together the rest of the system's power (about 12 W).
After a thorough analysis of each bar of figure 5.6(b), several conclusions can be drawn.
For the LU matrix decomposition workload, it is visible that each component/rail
shows an increase in power consumption, but it is the CPU that presents the highest increase in power
usage: while in idle it consumes roughly 8 W, under stress it achieves, on average, 30 W, 45 W,
[Figure: (a) total AC power (W) per workload (idle, LU 1 to LU 8, Bonnie, STREAM 1 to 4, iperf); (b) stacked DC power (W) per workload, broken down into CPU, 12V, 5V, 3.3V and HDD]
Figure 5.6: AC Power Consumption and DC Power Distribution in the System
61 W, 63 W and 65 W, for P = 1, P = 2, P = 4, P = 6 and P = 8 processors, respectively. The
largest instantaneous power measured was about 68 W. This complies with the Intel specification [49],
which states a maximum of 77 W of power draw by the CPU package with no overclocking. Notice that
there is no significant change in power for P = 4, P = 6 and P = 8. This occurs because, for P = 6
and P = 8 processors, the Hyper-Threading technology emulates at most eight cores by having each of
the four physical cores run two threads simultaneously.
According to the results, the tendency is that each doubling of the number of processors requires an
extra 15 W of power, excluding the cases where Hyper-Threading is being used.
The growth of power in every rail was also measured during this test. Since the EATX12V connector
supplies power to the CPU core, the power of the uncore (L3, memory controller) must come from the
motherboard, explaining the growth in power of the 12V rail. Moreover, the CPU fan and the system fan are
also powered by the 12V and 5V rails, hence the surge in power on those rails as well.
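As a rough illustration, the trend just described (about 15 W extra per doubling of active physical cores, flattening once Hyper-Threading takes over) could be modelled as follows; this is a hypothetical fit to the averages quoted above, not a model from the thesis:

```python
import math

def cpu_power_estimate(threads):
    """Rough CPU-package power trend under the LU workload: about 30 W for
    one thread plus ~15 W per doubling of physical cores, flat beyond
    4 threads (Hyper-Threading on a 4-core CPU)."""
    physical = min(threads, 4)
    return 30.0 + 15.0 * math.log2(physical)
```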
For the disk-intensive benchmark, the reader can observe that the changes in all the
rails are insignificant. The only interesting changes occur for the CPU and, as expected, for the HDD. The increase
in CPU power is expected, since the workload runs several writes and rewrites in a loop. During the
test, the HDD consumed more power while writing with the putc() macro, reaching about 10.5 W,
whilst when writing in chunks the power was lower. This was expected, since writing one
char at a time requires more disk accesses than writing a batch of characters at once.
The rewriting consumes more power than just writing, because it also calls the lseek() system
function. Although there was an increase in the power usage, the maximum specified power drawn by
this component (12.75 W) was not reached: the reason may be the use of an insufficiently
large dataset, but most probably it is due to the existence of a buffer, which allows the disk to process
reads/writes in batches, minimizing disk accesses.
Focusing now on the STREAM benchmark, a general rise of the power draw is visible in every rail
and component. Nevertheless, if the reader takes a closer look at the 3.3V, 5V and 12V rails, it can
be perceived that it is only during this workload that these rails show their highest power consumption
values. In the other tests, the power drained by the 5V rail changed a little during the LU workloads, but the
consumption is not very high when compared with the idle power. This is unambiguous proof that the
5V rail is tightly correlated with the power drained by the RAM modules. Although this benchmark is not CPU-intensive
(as the reader can observe in the CPU bar), the 12V and 3.3V rails experience a significant increase
in power along this set of tests. However, since those rails also present a very high power consumption
during the LU workload, this confirms that the 12V rail must power the CPU uncore and the fans.
Nevertheless, when under intensive work, the RAM modules must request extra power from the 12V and 3.3V
rails, explaining the peak values observed under the STREAM benchmark test.
Regarding the last workload, intended to stress the NIC, it is clear that there was an insignificant or even
no difference between the power drained when running the test and in the idle state. This happened for all
measured rails and components, thus it is clear proof that the NIC draws the same power whether
in the active or in the idle state.
[Figure: PSU power efficiency (%) vs. load (%), for loads between roughly 3 % and 12 %]
Figure 5.7: Power Efficiency
The PSU's efficiency was determined as the ratio between the DC power drained by each
workload and the respective AC power consumed. Figure 5.7 illustrates the results regarding the PSU
efficiency, where the abscissa expresses the load associated with each workload. In the idle state or
during the NIC-intensive test, the PSU efficiency is below 40%, which is normal for a load near 3%. As
the load increases, so does the efficiency, achieving a maximum value of 62.62% with a load of 12.15 % (LU,
8 cores). According to the PSU's manufacturer, this unit should provide at least 80 % efficiency for
loads higher than 20%; however, as demonstrated, those values are not achieved for lower loads.
Unfortunately, due to the lack of more power-demanding components in the Liliana machine, it was not
possible to achieve loads higher than 13 %, so the efficiency values guaranteed by the manufacturer
were not attained. Even so, it is predictable that such efficiency is feasible for loads above 20%, since an
efficiency of 62.6% was obtained at only 12% load and the efficiency curve is not linear.
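The efficiency and load figures above reduce to two simple ratios; a minimal sketch, assuming the 750 W rating of the TX750 as the load base:

```python
def psu_efficiency_pct(p_dc_w, p_ac_w):
    """PSU conversion efficiency: DC power delivered over AC power drawn."""
    return 100.0 * p_dc_w / p_ac_w

def psu_load_pct(p_dc_w, rated_w=750.0):
    """Load as a percentage of the PSU's rated output (TX750: 750 W assumed)."""
    return 100.0 * p_dc_w / rated_w
```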
5.5 Summary
In this chapter, a diversified set of results was provided. At the beginning, the results of the calibrations
conducted on the sensors, as well as the non-linearity thresholds, were given. The following sections
were intended to validate the power readings of the Powermeter API by comparing them with the internal counters
of the RAPL system. It was shown that although RAPL is good enough for power profiling a running
workload with a variable power pattern over time, it fails to follow the behavior of the
power curve in detail. In fact, RAPL outputs an average of the energy readings over time, so a lot of
information is lost; in contrast, Powermeter gives a totally new view of the power pattern at a
level previously impractical, due to its accuracy and time resolution. Thus, Powermeter is better suited
for real-time profiling than RAPL.
The chapter also addressed the power profiling and the energy efficiency of several parallel
benchmarks designed by NASA (the NAS Parallel Benchmarks). The results regarding the power profiling of the
CPU, HDD and RAM (5V) were insightful, showing how different the power demand
of distinct applications is. For the EP benchmark, the reader had the opportunity to observe how well
the application is distributed across all 4 physical cores, thus not demonstrating large fluctuations in the
power profile of any of the modules. On the other hand, the profiling of the BT benchmark revealed a
totally different pattern, showing large fluctuations in the CPU and RAM power usage. By using
Powermeter, it was concluded that the fluctuations are related to the number of iterations executed by
the workload and to the accesses to the RAM. It was also determined how efficient a workload is when
taking advantage of parallel execution and it was ascertained that hyperthreading can improve both
time performance and energy efficiency.
Finally, the device was used to characterize the power consumption of a desktop equipped with state-of-the-art
components, by measuring the power usage of all the rails coming from the PSU. The tests indicated
how the power is distributed in the system and how it changes in all rails when micro-benchmarks
engineered to stress specific modules are used. It was shown that hyperthreading induces an insignificant
amount of power overhead and that the RAM is mainly correlated with the 5V rail, while the remaining
rails are mostly used to source the fans, the NIC and other motherboard components. In addition, it was verified
that the NIC consumes a constant amount of power, whether active or at rest. In the end, the efficiency
curve of the PSU was obtained, which revealed that power supplies are very inefficient at low loads.
Thereby, it is important to design a system with a sufficient amount of load, so that the PSU efficiency
reaches more than 80%, reducing the losses.
6 Conclusions
Contents
6.1 Summary and Overall Conclusions
6.2 Future Work
6.1 Summary and Overall Conclusions
The main objective of this thesis was the design and implementation of a measurement device for
real-time monitoring of the power consumption of the main components of a computing system. In
chapter 2, some state-of-the-art applications and devices that seek to solve this problem [23, 26, 27, 13]
were identified. However, those devices are not adequate for the characterization of complex applications
with a high level of computational requirements. Therefore, a novel device that promises to achieve
the aforementioned goal was introduced in the scope of this thesis: the Powermeter. The
conceived device comprises several electronic components, including precision Hall-effect current
sensors and an AGC structure for handling the AC sensor output signal, which dynamically scales the
signal's amplitude, improving the ADC's DR. The system samples at a rate of f = 3.33 kHz, does not add
any significant time overhead and transmits the acquired data at high speed (more than 64 KB/s), while
providing accurate and precise power measurements.
To make this possible, it was fundamental to analyse all the initial requirements and to design electronic
structures fitted for the conditioning of the various types of signals coming from an off-the-shelf
desktop PSU. Furthermore, an AGC block comprising a band-pass filter and a PGA was proposed.
The system was analysed in the spectral domain, computing the expected induced noise density, and it
was evaluated in terms of stability. The results proved that the system achieves a high DR (more than
90 dB) and that it is absolutely stable.
Then, the design of the software part of the conceived architecture was presented in chapter 4. In this chapter, the latency of the system was studied; the adopted sampling frequency was validated against several system constraints; the essential procedures for computing the energy with low time overhead, by offloading part of that computation to the MCU, were described; and the three main functions of the software API that are used to initiate and stop the power readings were presented: powermeter_api_init(), powermeter_api_start() and powermeter_api_stop().
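A minimal host-side usage sketch of those three calls is shown below. The prototypes are assumptions (the thesis names the functions but not their exact signatures), and the bodies are stubs standing in for the real libusb-backed implementation:

```c
/* Stub implementations; the real versions talk to the device over USB.
 * A return value of 0 denotes success, following a common C convention. */
static int powermeter_api_init(void)  { return 0; /* open and configure device */ }
static int powermeter_api_start(void) { return 0; /* begin 3.33 kHz sampling   */ }
static int powermeter_api_stop(double *energy_j)
{
    *energy_j = 0.0;  /* energy accumulated on the MCU is read back here */
    return 0;
}

/* Typical profiling pattern: init, start, run the workload, stop.
 * Returns the measured energy in joules, or -1.0 on failure. */
static double profile_workload(void (*workload)(void))
{
    double energy_j = -1.0;
    if (powermeter_api_init() != 0 || powermeter_api_start() != 0)
        return -1.0;               /* device unavailable */
    if (workload)
        workload();                /* code region being measured */
    if (powermeter_api_stop(&energy_j) != 0)
        return -1.0;
    return energy_j;
}
```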
The results regarding the ADC and sensor calibration were presented in chapter 5. To validate the system's usefulness and reliability, several tests were performed by comparing the readings of computationally intensive workloads obtained with both the Powermeter and internal counters (RAPL). A few parallel benchmarks with different kinds of computational requirements were also profiled. The results served as a proof of concept of the proposed device and several conclusions were drawn: it was demonstrated that RAPL is less detailed and less reliable than the Powermeter for real-time power characterization, since it provides neither enough time resolution nor all of the uncore power consumption of the CPU; and it was concluded that different applications request distinct resources, evidencing different power patterns in the CPU and RAM. Furthermore, it was highlighted that applications whose routines are highly parallelizable attain better energy and time performances, and the EDP metric made it possible to identify the configurations that achieve the best trade-off between energy cost and time performance. Then, the rails that supply specific components were identified (5 V: RAM; 12 V, 5 V and 3.3 V: fans, NIC and others) and it was shown how the power is distributed in a desktop computing system. Finally, it was revealed that for loads below 20% a standard PSU performs poorly, achieving a very low conversion efficiency.
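The EDP-based selection mentioned above amounts to minimizing E times t over the candidate thread configurations. A small illustrative helper (the arrays and names are hypothetical, not measurements from chapter 5):

```c
#include <stddef.h>

/* Return the index of the configuration with the lowest energy-delay
 * product (EDP = energy * time); a lower EDP means a better trade-off
 * between energy cost and time performance. */
static size_t best_edp_config(const double *energy_j, const double *time_s,
                              size_t n)
{
    size_t best = 0;
    for (size_t k = 1; k < n; ++k)
        if (energy_j[k] * time_s[k] < energy_j[best] * time_s[best])
            best = k;
    return best;
}
```

Given per-configuration energies and execution times for, say, 1, 2, 4 and 8 threads, the returned index identifies the thread count with the best energy/performance balance.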
6.2 Future Work
As future work, it would be important to implement the conceived device on a Printed Circuit Board (PCB) with SMD components, in order to reduce its form factor and enhance its noise immunity. It would also be interesting to port the developed software to a more powerful microcontroller, with a higher clock frequency and a larger RAM, so that a higher sampling frequency could be attained.
Nevertheless, the prototype device already fulfils all the necessary requirements for real-time power profiling. Consequently, it would be interesting to combine the device with power-aware strategies, such as DVFS or PWM-based techniques, and to use it for energy-aware scheduling in homogeneous and heterogeneous clusters.
Bibliography
[1] Rong Ge, Xizhou Feng, Shuaiwen Song, Hung-Ching Chang, Dong Li, and Kirk W. Cameron.
PowerPack: Energy Profiling and Analysis of High-Performance Systems and Applications. IEEE
Trans. Parallel Distrib. Syst., 21(5):658–671, 2010. URL http://dblp.uni-trier.de/db/
journals/tpds/tpds21.html#GeFSCLC10.
[2] W.L. Bircher and L.K. John. Complete system power estimation: A trickle-down approach based
on performance events. In Performance Analysis of Systems Software, 2007. ISPASS 2007. IEEE
International Symposium on, pages 158–168, April 2007. doi: 10.1109/ISPASS.2007.363746.
[3] Xizhou Feng, Rong Ge, and Kirk W. Cameron. Power and energy profiling of scientific applications
on distributed systems. In Proceedings of the 19th IEEE International Parallel and Distributed
Processing Symposium (IPDPS’05), 01:34, 2005.
[4] Gustavo Vitorino Monteiro da Silva. Controlo Não Linear. Technical report, Escola Superior de Tecnologia de Setúbal, 2006.
[5] Trevor Pering, Yuvraj Agarwal, Rajesh Gupta, and Roy Want. Coolspots: Reducing the power
consumption of wireless mobile devices with multiple radio interfaces. In Proceedings of the 4th
International Conference on Mobile Systems, Applications and Services, MobiSys ’06, pages 220–
232, New York, NY, USA, 2006. ACM. ISBN 1-59593-195-3. doi: 10.1145/1134680.1134704. URL
http://doi.acm.org/10.1145/1134680.1134704.
[6] Kester Li, Roger Kumpf, Paul Horton, and Thomas Anderson. A Quantitative Analysis of Disk Drive
Power Management in Portable Computers. In Proceedings of the USENIX Winter 1994 Technical
Conference on USENIX Winter 1994 Technical Conference, WTEC’94, pages 22–22, Berkeley, CA,
USA, 1994. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1267074.
1267096.
[7] Krisztian Flautner, Steve Reinhardt, and Trevor Mudge. Automatic performance setting for dynamic
voltage scaling. In MobiCom ’01: Proceedings of the 7th annual international conference on Mobile
computing and networking, pages 260–271, New York, NY, USA, 2001. ACM. ISBN 1-58113-422-3.
[8] Trevor Pering, Tom Burd, and Robert Brodersen. Dynamic voltage scaling and the design of a
Low-Power microprocessor system. In In Power Driven Microarchitecture Workshop, attached to
ISCA98, 1998. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.
53.7554.
[9] Yuvraj Agarwal, Stefan Savage, and Rajesh Gupta. SleepServer: A Software-only Approach for
Reducing the Energy Consumption of PCs Within Enterprise Environments. In Proceedings of
the 2010 USENIX Conference on USENIX Annual Technical Conference, USENIXATC’10, pages
22–22, Berkeley, CA, USA, 2010. USENIX Association. URL http://dl.acm.org/citation.
cfm?id=1855840.1855862.
[10] Luiz André Barroso and Urs Hölzle. The Case for Energy-Proportional Computing. Computer, 40
(12):33–37, December 2007. ISSN 0018-9162. doi: 10.1109/MC.2007.443. URL http://dx.
doi.org/10.1109/MC.2007.443.
[11] Ripal Nathuji and Karsten Schwan. VirtualPower: Coordinated Power Management in Virtual-
ized Enterprise Systems. In Proceedings of Twenty-first ACM SIGOPS Symposium on Operating
Systems Principles, SOSP ’07, pages 265–278, New York, NY, USA, 2007. ACM. ISBN
978-1-59593-591-5. doi: 10.1145/1294261.1294287. URL http://doi.acm.org/10.1145/
1294261.1294287.
[12] Andreas Merkel and Frank Bellosa. Balancing power consumption in multiprocessor systems. In
Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006,
EuroSys ’06, pages 403–414, New York, NY, USA, 2006. ACM. ISBN 1-59593-322-0. doi: 10.1145/
1217935.1217974. URL http://doi.acm.org/10.1145/1217935.1217974.
[13] Jason Flinn and M. Satyanarayanan. PowerScope: A Tool for Profiling the Energy Usage of Mobile
Applications. In Proceedings of the Second IEEE Workshop on Mobile Computer Systems and
Applications, WMCSA ’99, pages 2–, Washington, DC, USA, 1999. IEEE Computer Society. ISBN
0-7695-0025-0. URL http://dl.acm.org/citation.cfm?id=520551.837522.
[14] Abhinav Pathak, Y. Charlie Hu, and Ming Zhang. Where is the Energy Spent Inside My App?: Fine
Grained Energy Accounting on Smartphones with Eprof. In Proceedings of the 7th ACM European
Conference on Computer Systems, EuroSys ’12, pages 29–42, New York, NY, USA, 2012. ACM.
ISBN 978-1-4503-1223-3. doi: 10.1145/2168836.2168841. URL http://doi.acm.org/10.
1145/2168836.2168841.
[15] Intel Corp. Intel Xeon processor. http://www.intel.com/xeon, 2012.
[16] David C. Snowdon, Stefan M. Petters, and Gernot Heiser. Accurate On-line Prediction of Processor and Memory Energy Usage Under Voltage Scaling. In Proceedings of the 7th ACM & IEEE
International Conference on Embedded Software, EMSOFT ’07, pages 84–93, New York, NY, USA,
2007. ACM. ISBN 978-1-59593-825-1. doi: 10.1145/1289927.1289945. URL http://doi.acm.
org/10.1145/1289927.1289945.
[17] Aaron Carroll and Gernot Heiser. An Analysis of Power Consumption in a Smartphone. In
Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference, USENIX-
ATC’10, pages 21–21, Berkeley, CA, USA, 2010. USENIX Association. URL http://dl.acm.
org/citation.cfm?id=1855840.1855861.
[18] John C. McCullough, Yuvraj Agarwal, Jaideep Chandrashekar, Sathyanarayan Kuppuswamy,
Alex C. Snoeren, and Rajesh K. Gupta. Evaluating the Effectiveness of Model-based Power
Characterization. In Proceedings of the 2011 USENIX Conference on USENIX Annual Technical
Conference, USENIXATC’11, pages 12–12, Berkeley, CA, USA, 2011. USENIX Association. URL
http://dl.acm.org/citation.cfm?id=2002181.2002193.
[19] Russ Joseph, David Brooks, and Margaret Martonosi. Live, runtime power measurements as a
foundation for evaluating power/performance tradeoffs. In Workshop on Complexity-Effective Design (WCED), held in conjunction with ISCA, 28, 2001.
[20] David C. Snowdon, Stefan M. Petters, and Gernot Heiser. Power measurement as the basis for
power management. In Proceedings of the 1st Workshop on Operating System Platforms for
Embedded Real-Time Applications (OSPERT), Palma, Mallorca, Spain, jul 2005.
[21] ATX specification, version 2.2. Technical report, 2005. URL http://www.formfactors.org.
[22] Molex Connectors, Accessed 2015. URL http://www.molex.com/molex/index.jsp.
[23] Marcus Hähnel, Björn Döbel, Marcus Völp, and Hermann Härtig. Measuring Energy Consumption for Short Code Paths Using RAPL. SIGMETRICS Perform. Eval. Rev., 40(3):13–17, January 2012. ISSN 0163-5999. doi: 10.1145/2425248.2425252. URL http://doi.acm.org/10.1145/2425248.2425252.
[24] Thanh Do, Suhib Rawshdeh, and Weisong Shi. ptop: A process-level power profiling tool.
In Proceedings of the 2nd Workshop on Power Aware Computing and Systems (HotPower’09), oct
2009.
[25] Fay Chang, Keith I. Farkas, and Parthasarathy Ranganathan. Energy-Driven Statistical Sampling:
Detecting Software Hotspots. In Babak Falsafi and T. N. Vijaykumar, editors, PACS, volume 2325 of
Lecture Notes in Computer Science, pages 110–129. Springer, 2002. ISBN 3-540-01028-9. URL
http://dblp.uni-trier.de/db/conf/pacs/pacs2002.html#ChangFR02.
[26] PowerEgg. URL http://www.itwatchdogs.com.
[27] Watts Up. URL http://www.wattsupmeters.com.
[28] H. Jin, M. Frumkin, and J. Yan. The OpenMP Implementation of NAS Parallel Benchmarks and Its Performance. Technical report, 1999. URL https://www.nas.nasa.gov/assets/pdf/techreports/1999/nas-99-011.pdf.
[29] John D. McCalpin. Stream: Sustainable memory bandwidth in high performance com-
puters. Technical report, University of Virginia, Charlottesville, Virginia, 1991-2007.
URL http://www.cs.virginia.edu/stream/. A continually updated technical report.
[30] Bonnie++ Benchmark Suite, Accessed 2015. URL http://www.coker.com.au/bonnie++/.
[31] Christian Bienia. Benchmarking Modern Multiprocessors. PhD thesis, Princeton University, January
2011.
[32] Hui Chen, Youhuizi Li, and Weisong Shi. Fine-grained power management using process-level
profiling. Sustainable Computing: Informatics and Systems, 2(1):33 – 42, 2012. ISSN 2210-5379.
doi: http://dx.doi.org/10.1016/j.suscom.2012.01.002. URL http://www.sciencedirect.com/
science/article/pii/S2210537912000030.
[33] Tao Li and Lizy Kurian John. Run-time modeling and estimation of operating system power con-
sumption. SIGMETRICS Perform. Eval. Rev., 31(1):160–171, June 2003. ISSN 0163-5999. doi:
10.1145/885651.781048. URL http://doi.acm.org/10.1145/885651.781048.
[34] Aman Kansal, Feng Zhao, Jie Liu, Nupur Kothari, and Arka A. Bhattacharya. Virtual machine
power metering and provisioning. In Proceedings of the 1st ACM Symposium on Cloud Computing,
SoCC ’10, pages 39–50, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0036-0. doi: 10.1145/
1807128.1807136. URL http://doi.acm.org/10.1145/1807128.1807136.
[35] Frank Bellosa. The benefits of event-driven energy accounting in power-sensitive systems. In
Proceedings of the 9th Workshop on ACM SIGOPS European Workshop: Beyond the PC: New
Challenges for the Operating System, EW 9, pages 37–42, New York, NY, USA, 2000. ACM. doi:
10.1145/566726.566736. URL http://doi.acm.org/10.1145/566726.566736.
[36] Gilberto Contreras and Margaret Martonosi. Power prediction for Intel XScale processors using performance monitoring unit events. In Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), pages 221–226. ACM Press, 2005.
[37] Intel Corporation. Intel 64 and IA-32 Architectures Software Developer’s Manual, 2011.
[38] Canturk Isci and Margaret Martonosi. Runtime power monitoring in high-end processors: Method-
ology and empirical data. In Proceedings of the 36th Annual IEEE/ACM International Symposium
on Microarchitecture, MICRO 36, pages 93–, Washington, DC, USA, 2003. IEEE Computer Society.
ISBN 0-7695-2043-X. URL http://dl.acm.org/citation.cfm?id=956417.956567.
[39] Pat Bohrer, Elmootazbellah N. Elnozahy, Tom Keller, Michael Kistler, Charles Lefurgy, Chandler
McDowell, and Ram Rajamony. Power aware computing. chapter The Case for Power Management
in Web Servers, pages 261–289. Kluwer Academic Publishers, Norwell, MA, USA, 2002. ISBN 0-
306-46786-0. URL http://dl.acm.org/citation.cfm?id=783060.783075.
[40] S. Kamil, J. Shalf, and E. Strohmaier. Power efficiency in high performance computing. In Parallel
and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, pages 1–8,
April 2008. doi: 10.1109/IPDPS.2008.4536223.
[41] CORSAIR TX750, Accessed 2015. URL http://www.corsair.com/pt-pt/
enthusiast-series-tx750-v2-80-plus-bronze-certified-750-watt-high-performance-power-supply.
[42] MATLAB. version 8.3.0.532 (R2014a). The MathWorks Inc., Natick, Massachusetts, 2014.
[43] Steve Bowling. Understanding A/D Converter Performance Specifications. Technical report, Mi-
crochip Technology Inc., 2000.
[44] Walt Kester. Understand SINAD, ENOB, SNR, THD, THD + N, and SFDR so You Don’t Get Lost in
the Noise Floor. Technical report, Analog Devices, 2009.
[45] Zhisheng Duan, Jin-Zhi Wang, and Lin Huang. Frequency domain method for the dichotomy
of modified chaos equations. International Journal of Bifurcation and Chaos, 15(08):2485–2505,
2005. doi: 10.1142/S0218127405013435. URL http://www.worldscientific.com/doi/
abs/10.1142/S0218127405013435.
[46] Pedro Bulach Gapski. Analise Convexa do Problema da Estabilidade Absoluta de Sistemas tipo
Lur’e. Master’s thesis, FEE/UNICAMP, June 1994.
[47] M. Vidyasagar. Nonlinear Systems Analysis. Prentice Hall, 2nd edition, 1993. ISBN:0-13-623463-1.
[48] libusb API library, Accessed 2015. URL http://libusb.info/.
[49] Intel Corporation. Intel i7 3770k 3.9 GHz Specifications. Technical report, 2011.
[50] Kristof Du Bois, Tim Schaeps, Stijn Polfliet, Frederick Ryckbosch, and Lieven Eeckhout. SWEEP:
Evaluating Computer System Energy Efficiency Using Synthetic Workloads. In Manolis Katevenis,
Margaret Martonosi, Christos Kozyrakis, and Olivier Temam, editors, HiPEAC, pages 159–166.
ACM, 2011. ISBN 978-1-4503-0241-8. URL http://dblp.uni-trier.de/db/conf/hipeac/
hipeac2011.html#BoisSPRE11.
[51] NLANR/DAST: Iperf - the TCP/UDP bandwidth measurement tool, Accessed 2015. URL http://dast.nlanr.net/Projects/Iperf/.
Appendix A
A.1 Bandpass Filter
[Bode magnitude responses (dB) versus frequency (Hz), comparing the theoretical, nominal and tolerance-bound curves. With 5% resistors, the gain at 50 Hz ranges from 8.96 dB (−5%) to 9.25 dB (+5%); with 1% resistors it is 11.9 dB for both the +1% and −1% bounds.]
(a) Filter Response with 5% Resistors Tolerance. (b) Filter Response with 1% Resistors Tolerance.
Figure A.1: Filter Response
Figure A.2: R2 Parameter Variation with 5% Tolerance
A.2 Analog Implementation - Full Circuit Diagram
Figure A.3: Circuit Layout
A.3 Dynamic Range
Figure A.4: Sinusoid FFT with Oscilloscope
[FFT magnitude spectra (dB) versus frequency (kHz), marking the fundamental, harmonics 2–6 and spurs; DC and noise are excluded from the computation.]
(a) THD: −61.29 dB. (b) SFDR: 61.842 dBFS.
Figure A.5: THD and SFDR for N=8. Fs = 3.333 kHz
A.4 Benchmark Results
[Power (W) versus time (s) for 1, 2, 4 and 8 threads, with traces for CPU, HDD and RAM (5 V).]
Figure A.6: EP CLASS=B Power Profiling
[Power (W) versus time (s) for 1, 2, 4 and 8 threads, with traces for CPU, HDD and RAM (5 V).]
Figure A.7: BT CLASS=B Power Profiling