Powermeter for HPC Systems
André Filipe Gonçalves Duarte
Thesis to obtain the Master of Science Degree in
Electrical and Computer Engineering
Supervisors: Dr. Pedro Filipe Zeferino Tomás and Dr. Nuno Filipe Valentim Roma
Examination Committee
Chairperson: Dr. Nuno Cavaco Gomes Horta
Supervisor: Dr. Pedro Filipe Zeferino Tomás
Member of the Committee: Dr. Francisco André Corrêa Alegria
May 2015
Courage and perseverance have a magical talisman, before which difficulties disappear and obstacles
vanish into air.
John Quincy Adams
Acknowledgments
I would like to thank Professors Pedro Filipe Zeferino Tomás and Nuno Filipe Valentim Roma for all the support, advice and patience they showed over the past months. Without their guidance and belief in me, I am sure it would not have been possible to conclude this important stage of my life. I would also like to thank Professor José Germano for all the useful advice he gave me, and my friends for always being there, even when one thinks they are not. I would like to show my pride and joy for being part of a group of people who I know will follow me for the rest of my life. Thus, thanks to João Pedro Costa e Castro, Ricardo Filipe Tomás Pires, Gonçalo Diogo Gomes Mendes, Gonçalo Gouveia Velez Bidarra Saraiva, Guilherme Costa e Castro and, last but not least, the great Ortonimo (Flávio Jorge dos Santos Lopes).
I cannot end this text without showing my highest esteem for my parents and my sister, for loving me so hard and for doing all they could to help me, even when it seemed I did not show appreciation for it.
Thank you, I love you
Abstract
The fast pace at which technology has been evolving has led to a significant increase in the amount of energy consumed by today's High Performance Computing (HPC) systems. Consequently, it has become highly important to understand how the energy consumption of any given application changes over time, envisaging the possibility of implementing real-time power profiling and resource optimization. The work developed in the scope of this thesis describes the design and prototyping of an acquisition board (and related software API) composed of several Hall sensors and a microcontroller. This board is capable of measuring the amount of power demanded by an HPC system, by monitoring the current that passes through the several rails of the main Power Supply Unit (PSU) of a personal computer. For that purpose, a broad set of conditioning modules was studied and implemented, in order to ensure accurate and precise measurements over an ample dynamic range of the measured signals. In particular, an Automatic Gain Controller (AGC) module was implemented in the acquisition board, embracing both the analog and digital domains of the measurement procedure. The results obtained from the experimental evaluation showed that the conceived device is highly suitable for real-time power profiling of HPC systems under complex workloads, by providing fine-grained measurements of the power consumption over time, hardly attained by other state-of-the-art devices or systems.
Keywords: High Performance Computing Systems, Energy Consumption, Real-time Power
Profiling, In-situ Measurements, Automatic Gain Control, PIC
Resumo
O ritmo acelerado a que as tecnologias se têm desenvolvido levou a um aumento significativo da energia consumida pelos sistemas de computação de alto desempenho (HPC). Consequentemente, é de extrema importância perceber como é que o consumo energético duma aplicação varia ao longo do tempo, visando a caracterização em tempo real da potência consumida pelo sistema e a consequente otimização de recursos. O trabalho que foi desenvolvido no âmbito desta tese descreve o projeto e a prototipagem duma placa de aquisição (e a aplicação de software associada) composta por vários sensores de Hall e um microcontrolador. Esta placa é passível de medir a potência requerida por um sistema HPC, monitorizando a corrente em várias linhas de alimentação provenientes da fonte de alimentação (PSU) de um computador pessoal. Assim, foram estudados e implementados diversos módulos de acondicionamento, com o intuito de garantir medições exatas e precisas sob uma larga gama dinâmica dos sinais medidos. Em particular, foi implementado um módulo de Controlo Automático do Ganho (AGC), fazendo a ligação entre os domínios analógico e digital da placa de aquisição. Os resultados experimentais obtidos revelaram que a placa concebida é particularmente adequada para a caracterização em tempo real da potência consumida por aplicações de elevada complexidade em sistemas HPC, obtendo-se uma precisão de medida ao longo do tempo que dificilmente é alcançada por outros dispositivos modernos e sistemas do mesmo género.
Palavras-chave: Sistemas de Computação de Alto Desempenho, Consumo Energético, Caracterização da Potência em Tempo Real, Medições In situ, Controlo Automático do Ganho, PIC
Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Resumo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvi
List of Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Nomenclature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xx
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Main Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 State-of-the-Art 9
2.1 Real-time Power Measurement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.1 Software-based Power Measurement . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.1.A System Profile-based Power Model . . . . . . . . . . . . . . . . . . . . . 10
2.1.1.B PMC-based Power Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.2 Hardware-based Power Measurement . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3 Powermeter - Architecture Definition and Specifications 19
3.1 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Signal Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.1 DC conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.2 AC conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3 AGC - Automatic Gain Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.1 MatLab Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.2 Analog Domain Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3.2.A Band-pass Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3.2.B Summing Amplifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3.2.C Programmable Gain Amplifier . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.3 Dynamic Range Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3.3.A SNR, THD and SFDR Analysis . . . . . . . . . . . . . . . . . . . . . 41
3.3.3.B Combining the PGA with Oversampling . . . . . . . . . . . . . . . . . . . 45
3.3.4 System Stability Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3.4.A Absolute Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3.4.B Popov Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3.4.C Circle Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.3.4.D Application of the Theorems to the System in Study . . . . . . . . . . . . 50
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4 Powermeter - Software/Firmware 55
4.1 Communication System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.1.1 Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.1.2 Sampling Frequency Choice of the System . . . . . . . . . . . . . . . . . . . . . . 58
4.1.3 Types of Data Transferred . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.1.3.A Synchronous - Clocks Synchronization and Sampling Process Initialization . . 60
4.1.3.B Asynchronous - Time Synchronization Data . . . . . . . . . . . . . . 60
4.1.3.C Asynchronous - Time Stamps and Sampled Data . . . . . . . . . . . 62
4.2 Firmware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2.1 Buffering Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2.2 Oversampling and Maximum Search Algorithm . . . . . . . . . . . . . . . . . . . . 64
4.3 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3.1 Powermeter Application Programming Interface . . . . . . . . . . . . . . . . . . . . 65
4.3.2 Energy Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.3.2.A Time Stamp Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5 Results 69
5.1 Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.1.1 Sensors and ADC Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2 Non-Linearity Thresholds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.3 Power Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.3.1 SPLASH-2 Benchmark Tests . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.3.2 NAS Parallel Benchmark Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.4 PC Energy Consumption Characterization . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6 Conclusions 81
6.1 Summary and Overall Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Bibliography 89
A Appendix A 91
A.1 Bandpass Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
A.2 Analog Implementation - Full Circuit Diagram . . . . . . . . . . . . . . . . . . . . . . . . . 93
A.3 Dynamic Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
A.4 Benchmark Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
List of Tables
2.1 List of available RAPL sensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 PowerPack power meter profile API. (Source [1]) . . . . . . . . . . . . . . . . . . . . . . . 16
3.1 Component Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Gain at the Central Frequency to the Various Tests . . . . . . . . . . . . . . . . . . . . . . 32
3.3 Parameters Variation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4 AD5113 Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5 Digital Word Vs Gain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.6 ADC Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.7 SNR values for each case test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.1 Types of Data Transferred . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2 Synchronous Data Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.3 Asynchronous Data Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.4 Sampling Process Initialization Command Example . . . . . . . . . . . . . . . . . . . . . . 60
4.5 Host and PIC’s Clock Synchronization Data Structure . . . . . . . . . . . . . . . . . . . . 60
4.6 Sampling Process Data Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.7 Main API Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.1 AGC Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.2 Powermeter Time Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.3 RAPL Time Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.4 Machine’s Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
List of Figures
1.1 Motherboard and Component’s Connectors Pin-out . . . . . . . . . . . . . . . . . . . . . . 4
2.1 Propagation of Performance Events. (Source [2]) . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 An example of using multimeter to measure the power. [3] . . . . . . . . . . . . . . . . . . 14
2.3 PowerPack (Source [1]) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Block Diagram of the Hardware Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3 Current Sensor. Source: Allegro MicroSystems . . . . . . . . . . . . . . . . . . . . . . . . 22
3.4 DC Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.5 Subtractor Circuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.6 AGC Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.7 On-off Controller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.8 Loop Response . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.9 Reconstruction of the Input signal with and without AGC . . . . . . . . . . . . . . . . . . . 28
3.10 Loop Response . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.11 Reconstruction of the Input signal with and without AGC . . . . . . . . . . . . . . . . . . . 29
3.12 Full Circuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.13 Band Pass Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.14 Real Vs Theoretical Bode Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.15 Pspice Simulation Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.16 Voltage Converter Schematic (LMC7660 datasheet - Texas Instruments) . . . . . . . . . . 34
3.17 Summing Amplifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.18 Adder Output - Voffset (purple) = 2.5 V; Vsub (yellow) = 1.175 mV amplitude; Vout (cyan)
= 3.662 V . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.19 PGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.20 AD5113’s Pin Configuration and Block Diagram - Analog Devices . . . . . . . . . . . . . . 36
3.21 Averaged Conversion Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.22 Averaging Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.23 PIC18f4550 10-bit ADC’s SNR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.24 FFT of Sinusoidal Signal sampled at Fs = 3.3(3) kHz . . . . . . . . . . . . . . . . . . . . . 44
3.25 FFT of Sinusoidal Signal with samples averaging (Fs = 3.3(3) kHz) . . . . . . . . . . . . . 45
3.26 Non-Linearities Analysis [4] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.27 System for Stability Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.28 Non-Linearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.29 Nyquist Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.1 Round-Trip Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2 Communication Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3 Synchronization Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4 Offset Change Along the Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.5 System Diagram - Microcontroller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.6 Dual-Buffer Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.7 Oversampling with Rolling Buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.8 System Diagram - Host . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.9 Trapezoidal Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.10 Energy Computing (where Bn refers to energy batch of the nth packet) . . . . . . . . . . . 67
4.11 Case Scenario when time[index] = Tn > end time (where Tn = T0 + (n - 1) × TSampling) 68
5.1 ADC and Sensor Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2 Stability Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.3 Power Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.4 Profiling of NPB Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.5 EDP Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.6 AC Power Consumption and DC Power Distribution in the System . . . . . . . . . . . . . 78
5.7 Power Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
A.1 Filter Response . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
A.2 R2 Parameter Variation with 5% Tolerance . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
A.3 Circuit Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
A.4 Sinusoid FFT with Oscilloscope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
A.5 THD and SFDR for N=8. Fs = 3.333 kHz . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
A.6 EP CLASS=B Power Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
A.7 BT CLASS=B Power Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
List of Algorithms
4.1 Maximum Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2 Energy Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.3 Time Stamp Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Glossary
AC Alternating Current
AGC Automatic Gain Control
API Application Programming Interface
CPU Central Processing Unit
DAC Digital-to-Analog Converter
DC Direct Current
DMA Direct Memory Access
DVFS Dynamic Voltage-Frequency Scaling
EDP Energy-Delay Product
FPGA Field-Programmable Gate Array
GPU Graphics Processing Unit
HDD Hard Disk Drive
HPC High-Performance Computing
LLC Last Level Cache
MCU Microcontroller Unit
MSR Model-Specific Registers
NTP Network Time Protocol
OS Operating System
PCB Printed-Circuit Board
PCI Peripheral Component Interconnect
PGA Programmable Gain Amplifier
PLL Phase-Locked Loop
PMC Performance-Monitoring Counters
PSU Power Supply Unit
RAPL Running Average Power Limit
RMS Root Mean Square
SFDR Spurious-Free Dynamic Range
SIE Serial Interface Engine
SMD Surface-Mount Device
SNR Signal-to-Noise Ratio
TCP Transmission Control Protocol
THD Total Harmonic Distortion
TLB Translation Lookaside Buffer
UDP User Datagram Protocol
USB Universal Serial Bus
p.f. power factor
1 Introduction
Contents
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Main Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Energy consumption is an issue in every consumer electronic equipment, even more so since fossil resources are being consumed at a rate far faster than nature's regenerative rhythm, and because of the power consumption constraints imposed on all sorts of equipment to reduce environmental impact. However, this general interest in saving energy has only become a focus of attention for the computing science community in the last few years. This happens mostly due to the tremendous pace at which technology has evolved in the last decade, translating into systems with computing capabilities far more powerful than ever, verifying, until now, Moore's law projections. This, in turn, leads to more and more power being required to feed those systems, and special attention is now being paid to High Performance Computing (HPC) systems by the science community, qualifying energy consumption as a primary concern in the design of electronic systems, and of computer systems in particular. As a result, several research efforts have been made to optimize and carefully manage energy consumption at multiple levels - starting from individual components like wireless radios [5], storage devices [6], and processors [7, 8], up to PCs and servers [9, 10, 11].
For the computer engineer, it is important to figure out where and why an application consumes more power and, with that information, make decisions that can improve the application's energy efficiency. Thus, it is essential to adequately measure the energy spent in computer systems. One of many approaches is to create energy models that allow extrapolating the future energy consumption of the system. Those models are used to schedule applications and resources [12], to adapt application behavior to externally specified energy constraints [13], and to attribute energy usage to the respective software components [14]. Those energy models are usually based on information retrieved from specific power indicators, which are correlated with power measurements taken, a priori, while running some stressful workload (for instance, CPU utilization is a power indicator of the workload of the unit). Instead of developing those models based on statistical information, hardware performance counters, also known as Performance-Monitoring Counters (PMCs), can be used. PMCs are special registers associated with on-board energy sensors for measuring the energy consumption of on-core hardware components. Intel introduced these sensors - calling them "Running Average Power Limit" (RAPL) - with their Sandy Bridge microarchitecture [15]. Although this is a better approach than the one using power indicators, since it takes direct measurements of the system's power into account, it is still based on system events, which do not accurately reflect power consumption [16]. Although that issue can be attenuated with fine-grained instrumentation of single processing units [17], a study by McCullough et al. [18] revealed that affine-based models often perform poorly when it comes to modeling more complex computer systems and workloads, arguing for the need of increased direct measurement by physical instrumentation of power consumption.
In-situ power measurement with meters and special devices is an alternative method to determine power consumption. Digital voltmeters and clamp meters are among the choices [13, 19], but there are also special devices built for this purpose, such as PowerPack [1] and PLEB [20]. In general, all those systems are inadequate, since they do not provide high sampling rates [13, 19], do not permit sampling more than one sensor at a time, or their scalability to other systems is not feasible [13, 19, 1, 20]. Other systems rely on measuring the total system power at the wall socket (AC power), but their applicability is limited in the case of any fine-grained adaptation. Furthermore, power measurements at the system level cannot distinguish between the actual power used and the power wasted due to inefficiencies in the power supply. For all the above reasons, an accurate, reliable, fast and easily integrated energy/power measurement system is necessary, so that it is appropriate for real-time power characterization of any algorithm.
In this dissertation, a hardware system called Powermeter is presented. The Powermeter is a prototype developed within the INESC-ID SiPS group, which is capable of directly measuring the 12 V, 5 V and 3.3 V power rails from the motherboard's power connector, as well as the power of individual components (e.g., Central Processing Unit (CPU), Graphics Processing Unit (GPU) and Hard Disk Drive (HDD)). It can also measure the AC power requested by the system, using an Automatic Gain Controller (AGC), which dynamically amplifies the input signal, improving the Analog-to-Digital Converter (ADC) dynamic range and allowing small variations in the system's power consumption to be distinguished. The design and requirements of the full system, as well as details about the theory and design of the AGC, are presented in later chapters of this dissertation.
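To make the role of the AGC concrete, the following toy sketch illustrates the on-off control idea developed in Chapter 3: the controller raises the Programmable Gain Amplifier (PGA) gain while the sampled peak underutilizes the ADC input range, and backs it off near saturation. The gain set, voltage thresholds and function names below are assumptions for illustration only, not the actual firmware:

```python
# Illustrative on-off AGC sketch (hypothetical thresholds and gain set;
# the real controller is described in Chapter 3).

GAINS = [1, 2, 4, 8, 16]        # assumed programmable gain steps
V_FULL_SCALE = 5.0              # assumed ADC full-scale input [V]
HIGH = 0.9 * V_FULL_SCALE       # near-saturation threshold
LOW = 0.4 * V_FULL_SCALE        # under-utilization threshold

def agc_step(peak_voltage, gain_index):
    """One on-off control step: return the new index into GAINS."""
    if peak_voltage > HIGH and gain_index > 0:
        return gain_index - 1               # back off: avoid clipping
    if peak_voltage < LOW and gain_index < len(GAINS) - 1:
        return gain_index + 1               # boost: use more of the range
    return gain_index                       # inside the dead band: hold

# A weak 0.2 V peak input is progressively amplified until the
# conditioned signal sits comfortably inside the ADC range:
idx = 0
for _ in range(6):
    idx = agc_step(0.2 * GAINS[idx], idx)
print(GAINS[idx])  # 16
```

The dead band between LOW and HIGH prevents the controller from oscillating between two adjacent gain steps when the signal amplitude sits near a threshold.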
1.1 Motivation
Energy (in Joule), which is the physical quantity used for electricity bills (usually in kWh), is defined as the integral of the instantaneous power (in Watt) drained over time:
\[ \text{Energy} = \int_{t_1}^{t_2} P(t)\,dt \tag{1.1} \]
Thus, the average power consumed can be obtained as the ratio between the energy and the period of integration.
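As a concrete illustration of equation (1.1), the energy of a sampled power trace can be approximated numerically with the trapezoidal rule (the scheme later adopted by the Powermeter host software in Section 4.3.2); the sketch below, with made-up sample values, is only illustrative:

```python
# Illustrative sketch: numerical approximation of equation (1.1).
# Given power samples P[k] taken at a fixed period Ts, the energy over
# the interval is approximated with the trapezoidal rule, and the
# average power is the energy divided by the integration period.

def energy_trapezoid(power_samples, ts):
    """Approximate Energy = integral of P(t) dt with the trapezoidal rule."""
    energy = 0.0
    for p_prev, p_next in zip(power_samples, power_samples[1:]):
        energy += 0.5 * (p_prev + p_next) * ts  # area of one trapezoid [J]
    return energy

def average_power(power_samples, ts):
    """Average power = energy / integration period [W]."""
    duration = (len(power_samples) - 1) * ts
    return energy_trapezoid(power_samples, ts) / duration

# Example with a constant 60 W load sampled at 1 kHz for 5 samples:
samples = [60.0, 60.0, 60.0, 60.0, 60.0]
print(energy_trapezoid(samples, 1e-3))  # ~0.24 J over 4 ms
print(average_power(samples, 1e-3))     # ~60.0 W
```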
In desktop computers, power is supplied by a Power Supply Unit (PSU), which converts AC power to DC power and splits the latter among several rails (12 V, 5 V and 3.3 V). Power is drained from these rails to supply several components, such as Random Access Memory (RAM) modules, the Network Interface Chip (NIC), fans and others, and it is important to know whether there are specific rails that mainly power certain components. Thus, the ATX/EPS specifications [21] were analysed.
Every motherboard complying with the ATX12V or EPS12V design directives has a 24-pin main power connector (see figure 1.1(a)), from where power is drained to supply all the basic functions of the motherboard. However, some elements are powered from specific connectors coming from a standard PSU. The connectors that will be described are of the Molex type [22]: Molex connector is the term that defines a two-piece pin and socket interconnection, widely used for connecting power in desktop PCs because of the simplicity, reliability, flexibility and low cost of the Molex design (see figure 1.1).
The connector that powers the CPU is represented in figure 1.1(b). This is usually named the EPS12V connector and provides the necessary power for modern multicore processor packages through four 12 V rails. Power to supply GPUs or Field-Programmable Gate Arrays (FPGAs) is drained from a PCI-Express connector, as illustrated in figure 1.1(c). Older connectors had 6 pins, while the modern ones have 8 pins,
(a) Motherboard Connector Pin-out (b) EPS12V Molex Pin-out
(c) 8 pin PCI Express Pin-out (d) SATA Cable Pin-out
(e) 4 Pin Peripheral Pin-out
Figure 1.1: Motherboard and Component’s Connectors Pin-out
providing all the power to graphic cards. This connector provides power through three 12V power rails.
The HDD is powered by a SATA power connector, accompanied by a data cable. This connector normally has four major rails: 12 V, 5 V, 3.3 V and GND (see figure 1.1(d)); the 3.3 V rail is an extra one that most of the time is not needed. Other peripherals are powered by the 4-pin peripheral power cable, identified in figure 1.1(e). This connector provides power through the 12 V and 5 V rails.
The CPU and PCI Express cables are each connected to a single current sensor, which has specific input connectors for the Molex type. On the other hand, the SATA cable connecting the HDD carries both 12 V and 5 V rails, so measurements have to be done through two different current sensors (one for the 12 V rail and the other for the 5 V one). The other rails connecting directly to the motherboard also have their own current sensors (this includes the 12 V, 5 V and 3.3 V rails).
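Since each monitored rail has its own current sensor, the total DC power drawn from the PSU is simply the sum of voltage times current over the rails. A minimal sketch (the rail voltages come from the ATX/EPS specification; the current readings are hypothetical examples, not real measurements):

```python
# Minimal sketch: total DC power from per-rail current measurements.
# The current readings in the example are hypothetical.

RAIL_VOLTAGES = {"12V": 12.0, "5V": 5.0, "3.3V": 3.3}  # ATX/EPS rails [V]

def total_dc_power(rail_currents):
    """P_total = sum over rails of V_rail * I_rail, in Watt."""
    return sum(RAIL_VOLTAGES[rail] * current
               for rail, current in rail_currents.items())

# Example: 4 A on the 12 V rail, 2 A on 5 V, 1 A on 3.3 V:
print(total_dc_power({"12V": 4.0, "5V": 2.0, "3.3V": 1.0}))  # ~61.3 W
```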
The computing science community has attained remarkable results regarding algorithms' time performance by resorting to multi-core processors (using CPUs, GPUs and FPGAs). However, this kind of parallelism usually implies higher power demands, due to the minimum level of power required by each active core. In addition, different configurations and systems consume different amounts of power under the stress of the same workload. And, even with the same system and configuration, an algorithm has different power demands over time, due to the various tasks running on the processor, which may require distinct resources.
Thus, it would be convenient to somehow understand how and where power is being consumed by a scientific computational application under a given configuration (i.e., running with one or more threads on one or more cores). For instance, with that kind of information in hand, the engineer could choose the best combination of the number of nodes and threads for which the algorithm attains good performance in terms of both time and energy consumption. The engineer could also decide to exploit parallelism for specific jobs with GPUs instead of CPUs, or vice-versa, depending on the performance gain and energy cost. In addition, the real-time power information can be combined with power-aware strategies, such as Dynamic Voltage-Frequency Scaling (DVFS), decreasing the CPU's power consumption.
This problem can be addressed by resorting to energy models [23, 24, 13], which rely on specific power
indicators (such as CPU utilization) to deliver power consumption information. Other approaches are
based on direct power measurements, through the usage of voltmeters and ammeters. All those methods
(which will be addressed in greater detail in further chapters) are inaccurate and/or do not provide enough
time resolution to characterize, in detail, the power consumption of a workload.
Therefore, the main motivation for this thesis is the design of hardware and software components
that cooperate to achieve on-board sensory processing with real-time computation capabilities, so that
the aforementioned problems can be solved.
1.2 Related Work
Measuring power consumption in an accurate and easy way, so as to allow power profiling, is a challenge.
Bircher and John [2] proposed the use of PMCs to find the power of the processor and other devices,
such as memory and disk. Chang et al. [25] proposed energy models to estimate the components’
power consumption, alleging that PMCs are not suitable to build fine-grained energy models and that
the power profiling thread may interfere with other applications using PMCs.
Other researchers [3] defended the use of in-situ measurements, since they provide more accurate
readings and can cover both the total system power and that of individual devices. Data collection can
be done with digital voltmeters and ammeters (PowerScope [13]) or inbuilt sense resistors (Watch-
Dog.com PowerEgg [26], WattsUp? Pro [27]). These approaches usually have low time resolutions (2
or 4 Hz), which leads to a major loss of power information, if we bear in mind that a multi-core processor
can issue billions of instructions per second.
The aforementioned approaches thus have drawbacks, such as not being fast or accurate
enough, failing to fulfill the computer engineer's need for a device that can successfully
characterize, in real time, power consumption variations.
1.3 Objectives
The main purpose of this thesis is to develop an electronic device, to be incorporated in a standard
desktop system, which allows an accurate and precise way of profiling power consumption in real time.
Such an accomplishment would make it possible to analyze and optimize applications and also to compare different
implementation strategies. Therefore, a list of requirements with which the device must
comply, in order to successfully achieve these goals, is defined:
• It must be able to sense all (or almost all) the rails that supply power to a desktop PC with low
power losses;
• It must provide accurate and precise measurements;
• It must not introduce much (ideally any) time overhead and must have a user-friendly interface;
• It must feature a great time resolution, so fast variations in power demand are distinguishable
(Fs > 1 kHz);
• It must exchange data at high rates (> 64 KB/s);
• It should be powered by the host system and composed of the smallest possible number of
components, in order to reduce costs and guarantee a low form factor.
Accordingly, it will be necessary to understand how a common desktop PC is powered, by analysing
the ATX/EPS12V design guide [21], and to design the electronic circuits to acquire the different signals, as
well as an easy-access Application Program Interface (API). Furthermore, the system stability and performance (in terms of
added noise, attained sampling frequency and latency) have to be evaluated. Finally, the device has to be
tested, using several benchmarks and comparing the readings with internal counters
(RAPL). In addition, as proof of concept, this framework will be used to profile the power and energy
consumption of the NAS parallel benchmarks, demonstrating how useful the device can be to
the computer engineer.
1.4 Main Contributions
This dissertation resulted in the following contributions: the development of an
easy-to-integrate board, supplied by the measured system itself and with low power losses; accurate
readings on both AC and DC rails, while introducing small time overheads (in the worst case of the
tests performed, an extra 1.27% of execution time); AC readings supported by an AGC system, which
dynamically amplifies the sensed signal, improving the ADC's dynamic range; sampling at rates hardly
attained by previous devices (Fs = 3.3(3) kHz); and data transfers at high rates, through the Universal
Serial Bus (USB) protocol (more than 64 KB/s).
All these characteristics make the proposed device a specialized tool for real-time power profiling
of complex scientific workloads, making it suitable to be used with common power-aware strategies,
such as DVFS. Moreover, the tests conducted revealed that, with the device, it is possible to distinguish
different power patterns associated with distinct stages of a computing algorithm. It was also used
to classify diverse applications according to their degree of both time and energy performance, and to
understand in which configurations (running with one or more threads per processor) an algorithm attains
better performance.
1.5 Dissertation Outline
This dissertation is organized in six chapters:
• In Chapter 2 the State of the Art is introduced, presenting the common solutions to the
problem, including software-based and hardware-based measurements;
• In Chapter 3 a global view of the system is given and the full hardware characteristics of the Powermeter
device are revealed. The chapter covers the devices used for signal conditioning and sampling,
the structure and analysis of the Automatic Gain Control (AGC), the number of acquisition channels
used, the adopted sampling frequency, among others;
• In Chapter 4 the software and firmware structures are introduced, revealing how the host and the
microcontroller communicate, what information is exchanged
between the two units and in what cases, and the procedures/algorithms necessary to guarantee
system synchronism and reliability;
• In Chapter 5 an evaluation of certain parameters of the device (such as system stability) is made;
the device performance is analysed by testing it with well-known benchmarks (NPB [28],
STREAM [29], Bonnie [30], SPLASH2 [31], ...) intended to stress specific components (CPU, GPU,
HDD, memory I/O, ...); and a characterization of the PC energy consumption is performed. The results
and conclusions of those tests are also included in this chapter;
• In Chapter 6 concluding remarks are presented, based on the results obtained in the earlier chapters.
Furthermore, future work is proposed.
2 State-of-the-Art
Contents
2.1 Real-time Power Measurement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
High power consumption is a problem that the scientific community has been trying to solve, whether
by changing the CPU architecture or by monitoring power and properly adapting the kernel scheduler. This
actually raises two distinct problems: accurately measuring energy/power consumption, and implementing an
algorithm that actually saves energy while not significantly compromising the running time of the application.
In the following sections, only the energy measurement methods will be reviewed, since that is the topic that belongs
to the scope of this dissertation. The advantages and drawbacks of each system will also be pointed
out along the various sections.
2.1 Real-time Power Measurement
Real-time power measurement in computing systems can be accomplished by hardware-based or
software-based methods. Hardware-based power measurement mainly uses different kinds of instruments
to measure the power of a device directly. The result is much more accurate than that of software-based
methods, and it is usually used to evaluate the effectiveness of power saving techniques. However,
hardware-based methods are limited to measuring component-level power profiles. Hence, even though
not as accurate as hardware-based methods, software-based power profiling tries to estimate the power
at different levels by designing a group of power models.
2.1.1 Software-based Power Measurement
Some authors argue that hardware-based methods are hard to integrate and expensive [32, 16].
Thus, a lot of research has been conducted in the area of software measurement and profiling.
Software-based approaches usually build power models to estimate the power dissipation at different
levels: instruction level, program block level, process level, hardware component level, system level and
so forth. These methods first try to find the power indicators that could reflect the power of these software
or hardware units. Then they build the power model with these power indicators and fine-tune its
parameters. It is possible to split power models into two categories, based
on the power indicators they use: system profile-based methods and hardware performance
counter (PMC) based methods.
2.1.1.A System Profile-based Power Model
System profiles, or system events, are a set of statistical performance information supplied by the operating
system (for instance, Linux saves that statistical information under the "/proc" directory). These
events reflect the current state of the hardware and software, including the operating system.
Li and John [33] estimated the power dissipation of the operating system using those kinds of events.
They found that some operating system routines spend constant power, while others, such as
process scheduling and I/O operations, have power dissipation with a linear relationship to the Instructions Per
Cycle (IPC). They built the power model based on IPC, as shown in equation 2.1, in which k1
and k0 are constants obtained from a linear regression step. Finally, they defined the routine-level
operating system power model by adding all the individual routine energies, as shown in equation 2.2.
P = k1 × IPC + k0 (2.1)
E_OS = Σ_i (P_OS_routine,i × T_OS_routine,i) (2.2)
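As an illustration of this kind of model, the following sketch fits the constants of equation 2.1 from training samples and then applies equation 2.2 to two OS routines; all the sample data, and therefore the resulting constants, are made up for the example and are not values from [33].

```python
# Sketch of the routine-level OS power model (Eqs. 2.1 and 2.2).
# Training samples and routine figures below are illustrative only.

def fit_linear(ipc, power):
    """Least-squares fit of P = k1*IPC + k0 (the 'linear regression step')."""
    n = len(ipc)
    mean_x = sum(ipc) / n
    mean_y = sum(power) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(ipc, power))
    den = sum((x - mean_x) ** 2 for x in ipc)
    k1 = num / den
    k0 = mean_y - k1 * mean_x
    return k1, k0

def os_energy(routines, k1, k0):
    """Eq. 2.2: E_OS = sum_i P_routine,i * T_routine,i."""
    return sum((k1 * ipc + k0) * t for ipc, t in routines)

# Hypothetical training samples: (IPC, measured power in watts).
ipc = [0.5, 1.0, 1.5, 2.0]
power = [12.0, 17.0, 22.0, 27.0]          # linear by construction
k1, k0 = fit_linear(ipc, power)
# Two hypothetical OS routines: (IPC, time spent in seconds).
energy = os_energy([(1.2, 0.5), (0.8, 2.0)], k1, k0)
```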
Kansal et al. [34] introduced a virtual machine level power model. They first built power models
to estimate the power of the CPU, memory and disk, and then distributed that power to each virtual
machine based on its utilization of each component. The CPU energy model they proposed is based on
CPU utilization; the memory energy model uses the number of Last Level Cache (LLC) misses; and the
disk energy model relies on the number of bytes that the disk reads and writes. These models are formalized
as follows:
E_cpu = α_cpu × µ_cpu + γ_cpu (2.3)
E_mem(T) = α_mem × N_LLCM(T) + γ_mem (2.4)
E_disk = α_rb × b_R + α_wb × b_W + γ_disk (2.5)
In these three equations, µ_cpu refers to the CPU utilization, N_LLCM to the number of LLC misses in time T,
and b_R and b_W to the amount of bytes read from and written to the disk, respectively.
The parameters α and γ are constants obtained when training the energy model.
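A minimal sketch of how these three models could be evaluated, assuming already trained α and γ constants; the constant values below are entirely made up for the example, not the ones Kansal et al. obtain.

```python
# Illustrative evaluation of the per-component models (Eqs. 2.3-2.5).
# All alpha/gamma constants are made-up placeholders for trained values.

ALPHA_CPU, GAMMA_CPU = 0.3, 5.0      # watts per % utilization, idle watts
ALPHA_MEM, GAMMA_MEM = 2e-6, 1.0     # joules per LLC miss, baseline
ALPHA_RB, ALPHA_WB, GAMMA_DISK = 1e-9, 2e-9, 0.5  # joules per byte

def e_cpu(utilization):                  # Eq. 2.3
    return ALPHA_CPU * utilization + GAMMA_CPU

def e_mem(llc_misses):                   # Eq. 2.4
    return ALPHA_MEM * llc_misses + GAMMA_MEM

def e_disk(bytes_read, bytes_written):   # Eq. 2.5
    return ALPHA_RB * bytes_read + ALPHA_WB * bytes_written + GAMMA_DISK

# Hypothetical interval: 50% CPU, one million LLC misses, 100 MB read/written.
total = e_cpu(50.0) + e_mem(1_000_000) + e_disk(10**8, 10**8)
```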
Chen et al. [32] also used a similar method to compute the power of a process. However, they targeted
laptop computers, so the wireless network card consumption was also considered.
2.1.1.B PMC-based Power Model
As mentioned earlier, hardware performance counters, or PMCs, are a group of registers that are used
to count hardware, software and operating system events. Thus, a large group of proposals use PMCs
to build power models. Bellosa [35] was probably the first to propose the usage of PMCs to
estimate power information, having found that four PMCs (integer operations, floating-point
operations, second-level address strobes and memory transactions) are tightly related to the power dissipation
of a processor. In [36], Contreras et al. construct power models for the Intel PXA255 processor
and memory. For the processor power model, they used the following PMCs: instructions executed,
instruction cache misses and TLB (Translation Lookaside Buffer) misses. They built the memory power
model with two counters: instruction cache misses and data cache misses.
The previous works show that performance events can be directly used to build power models for the CPU
and memory; however, current performance counters do not directly supply useful events that reflect the
activity of other devices. Bircher and John [2] show that processor-related performance events are highly
correlated with the power of other devices, such as the memory, chipset, I/O and disk. To be able to establish
the power model for these subsystems, they first need to understand how
these events are propagated in each subsystem. Figure 2.1 shows the propagation of these performance
events in all the subsystems they defined.
They used in-built resistors to measure the power consumption of those subsystems, acquiring data
with a separate workstation at a rate of ten thousand samples per second. Then, they related the power
information with performance counter samples taken at a much slower rate of one per second. Based on
these measurements, they selected the nine performance events that were the best indicators of the power spent
on those subsystems: cycles, halted cycles, fetched uops, level 3 cache misses, TLB
misses, Direct Memory Access (DMA) accesses, processor memory bus transactions, uncacheable
accesses and interrupts. Equation 2.6 shows the CPU power model they proposed. More details about
the models used can be found in [2].
P_CPU = Σ_{i=1}^{NumCPUs} [ 9.25 + (35.7 − 9.25) × PercentActive_i + 4.31 × FetchedUops_i / Cycle ] (2.6)
In this equation, PercentActive_i is the percentage of CPU utilization, FetchedUops_i is the number of micro-
operations fetched by the processor and Cycle is the core frequency time. The values 35.7 and 9.25
reflect the maximum and minimum power dissipation of one CPU, respectively, and 4.31 is a constant
that gives the relationship between the performance events and the real power.
Figure 2.1: Propagation of Performance Events. (Source [2])
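As a numeric sanity check of equation 2.6, the sketch below evaluates the model for two per-CPU readings; the 9.25, 35.7 and 4.31 constants come from the equation itself, while the utilization and uops figures are invented.

```python
# Sketch of the CPU power model of Eq. 2.6 (Bircher and John [2]):
# P_CPU = sum_i [ 9.25 + (35.7 - 9.25)*PercentActive_i
#                 + 4.31 * FetchedUops_i / Cycle ].

def cpu_power(samples):
    """samples: list of (percent_active, fetched_uops_per_cycle) per CPU."""
    return sum(9.25 + (35.7 - 9.25) * active + 4.31 * uops_per_cycle
               for active, uops_per_cycle in samples)

# Hypothetical two-CPU reading: 80% and 20% active, 1.5 and 0.4 uops/cycle.
p = cpu_power([(0.80, 1.5), (0.20, 0.4)])
```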
In the following, the RAPL system will be addressed in some detail, due to its relevance for the
evaluation of the results of the work conducted in this dissertation.
RAPL
Intel model-specific (or machine-specific) registers (MSRs) are implemented within the x86 and x64
instruction sets as a means for processes to access and modify parameters related to CPU execution [37].
A handful of MSRs are allocated for platform-specific power management within the Sandy Bridge and
successor microarchitectures, and allow access to energy measurements and to the enforcement of power
limits. In particular, Intel refers to these registers as the Running Average Power Limit (RAPL) interfaces. RAPL
provides sensors that allow measuring the power consumption of the CPU-level components listed in
table 2.1. The available counters limit measurements to CPU and memory controller power consumption;
it is, for instance, impossible to measure the energy consumption of I/O devices.
RAPL PKG    Whole CPU package
RAPL PP0    Processor cores only
RAPL PP1    "A specific device in the uncore"
RAPL DRAM   Memory controller

Table 2.1: List of available RAPL sensors
The Intel documentation [37] states that client platforms have access to PKG, PP0 and PP1, while server
platforms (code name Jaketown) may access PKG, PP0 and DRAM. From the above domain definitions,
one expects that energy(PP0) + energy(PP1) = energy(PKG). However, on client Sandy Bridge platforms
PP1 measures the energy of the on-chip graphics processor, as opposed to the entire uncore. On
these machines, energy(PP0) + energy(PP1) ≤ energy(PKG) and energy(PKG) − (energy(PP0) +
energy(PP1)) = energy(uncore).
RAPL has its limitations. Individual cores cannot be measured, and PP0 represents the sum of all
core energies. Similarly, the DRAM and uncore energy data do not distinguish between the various memory
channels or uncore devices. Moreover, since RAPL is based on energy models, it is not as accurate
as direct measurement methods, and the minimum acquisition period of the counters, according to Intel
[37], is about 1 ms. For a four-core processor running at 3.5 GHz, which fetches seven instructions per
clock cycle, that period of time translates into a resolution of 98 million instructions.
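For reference, on a Linux machine with a supported CPU these counters can be sampled without extra hardware through the powercap sysfs interface. The sketch below is a hedged example: the sysfs path is the standard Linux one, but its availability depends on the platform, and the wraparound constant is a placeholder that on a real system should be read from max_energy_range_uj.

```python
# Hedged sketch of sampling the RAPL package-energy counter through the
# Linux powercap interface (/sys/class/powercap/intel-rapl:0/energy_uj).

import time

MAX_ENERGY_UJ = 262_143_328_850   # placeholder for max_energy_range_uj

def average_power(e0_uj, e1_uj, dt_s, max_uj=MAX_ENERGY_UJ):
    """Average watts between two energy_uj readings, handling counter wrap."""
    delta = e1_uj - e0_uj
    if delta < 0:                 # the energy counter wrapped around
        delta += max_uj
    return delta / dt_s / 1e6     # microjoules per second -> watts

def read_energy_uj(path="/sys/class/powercap/intel-rapl:0/energy_uj"):
    with open(path) as f:
        return int(f.read())

if __name__ == "__main__":
    try:
        e0 = read_energy_uj()
        time.sleep(0.1)
        e1 = read_energy_uj()
        print("package power: %.2f W" % average_power(e0, e1, 0.1))
    except OSError:
        print("RAPL powercap interface not available on this machine")
```

Note that this only exposes the modeled CPU/DRAM energy discussed above; it says nothing about the other rails the Powermeter targets.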
2.1.2 Hardware-based Power Measurement
Hardware-based methods use instruments to measure the current or voltage of the hardware devices.
Those measurements are later used to compute the power spent by the measured object. The
instruments used include different types of meters, special hardware devices
that can be embedded into the hardware platforms, and power sensors designed within the hardware itself.
Normally, these methods can only measure component-level power, because the high integration
of the hardware circuits makes the lower-level functional units difficult to measure. Some researchers
[38, 19] rely on micro-benchmarks that stress one or more specific functional units, to isolate
lower-level power.
Power Measurement with Meters
The usage of meters for direct measurement is a straightforward method to understand the power
dissipation of individual devices and of the full system. Some authors [39, 38] make use of power meters to
measure the real power and use it to analyse and validate their research work. Other researchers [1]
measure the hardware components' power and break it down into lower levels, based on indicators
that reflect the activity of these lower-level units. The difference between these methods lies in the type
of meter that is used to do the measurements and in the place where the measurement is done.
One commonly used type of meter is the digital voltmeter. These meters can generally sample the measured
object once each second, and the result can be collected through a serial port connected
to the data collection system. To use this method, we need to disconnect the wire that we want to
measure and insert a small resistance in series (usually less than 0.5 Ω). Finally, we measure the voltage across
the resistor and compute the power on this wire. Figure 2.2 shows an example of this method. Joseph
et al. [19] use voltmeters to indirectly measure the power while executing different benchmarks and to make
power/performance tradeoffs.
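A worked example of the computation this method implies, with illustrative values for the shunt resistance and rail voltage:

```python
# Sense-resistor method: insert a small resistor in series with the rail,
# measure the voltage drop across it, and recover current and power.
# The component values below are illustrative, not from any cited setup.

R_SENSE = 0.1     # ohms, within the "usually less than 0.5 ohm" range
V_RAIL = 12.0     # nominal rail voltage in volts

def rail_power(v_drop, r_sense=R_SENSE, v_rail=V_RAIL):
    """P = I * V, with I = v_drop / r_sense (Ohm's law on the shunt)."""
    current = v_drop / r_sense
    return current * v_rail

p = rail_power(0.05)   # a 50 mV drop means 0.5 A on the 12 V rail
```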
A second kind of meter is the clamp ammeter, which can measure the current without
disconnecting the wire. Normally, a clamp ammeter has a larger measurement range than a voltmeter, so
it can be used to measure the power of systems that draw a much higher current. Kamil et al. [40]
use the direct power measurement method with clamp meters to measure the power of a Cray XT4
supercomputer under several HPC (High Performance Computing) workloads.
Figure 2.2: An example of using a multimeter to measure the power. [3]
Voltmeters, ammeters and clamp meters allow researchers direct access to the DC motherboard
rails. However, this approach does not scale well to large clusters, as it requires a lot of customization for each
machine. Further, the low timing resolution (usually 4 Hz) is not adequate to perform dynamic profiling.
In addition, there is also the possibility of using a power meter, such as the WattsUp? Pro [27]
or the WatchDog.com PowerEgg [26], to measure the AC power. This kind of meter can only measure
the system-level power, because only the power supply is powered by AC. These external power meters
allow the measurement of power consumption at a maximum frequency on the order of twice per second.
Since DC output of a computer power supply is buffered and filtered for stability, fast variations in the DC
load often do not translate to corresponding variations on the AC side of the supply. For these reasons,
these legacy power meter devices are not suitable to monitor system power usage with sufficient detail
to perform dynamic profiling and make in situ decisions for power / performance optimization during
application execution.
Integrating Sensors into Hardware
While direct measurement with meters is simple, it does not supply ways to control the measurement
process, for example, to synchronize the measured power with the monitoring of the
performance metrics. To circumvent this, there are devices that offer such a possibility while producing
more accurate power measurements, such as PLEB 2 [20] and PowerPack [1].
(a) PowerPack Architecture (b) PowerPack Overview
Figure 2.3: PowerPack (Source [1])
In the course of the research following their previous work [3], Ge et al. [1] proposed a power analysis
framework called PowerPack, whose architecture is shown in Figure 2.3(a). They run the
profiled application, the system status profiler, a thread that controls the meters and a group of power
reader threads on the same platform. PowerPack uses a special method to synchronize the measured
power with the process information. First, they implement a set of functions, shown in table 2.2, that are
called by the profiled applications before and after some critical code blocks. The execution of these
functions then triggers the system status profiler and the meter control thread to sample the data. After
that, the power analyzer can simultaneously inspect the collected data. Finally, they propose a method
to map the measured power onto the application code and analyze the energy efficiency in a multi-core
system.
Function                Description
pmeter init             Connect to meter control thread
pmeter log              Set power profile log file and options
pmeter start session    Start a new profile session and label it
pmeter end session      Stop current profile session
pmeter finalize         Disconnect from meter control thread

Table 2.2: PowerPack power meter profile API. (Source [1])
In spite of the accuracy granted by this framework, it still has major drawbacks. As can be seen
in figure 2.3(b), and according to [1], the framework uses external meters such as the WattsUp? Pro [27] to
measure AC power, resistor sensors, and an external computer to save the sampled data. This makes it
impossible to easily scale this framework to other computing systems, and also to use the data
for power-aware strategies.
Another hardware-based power measuring device is PLEB 2 [20]. PLEB 2 is a single-board computer
based on the Intel XScale PXA255. It was custom-designed primarily as a reference to be used in embedded
systems research, and secondarily as a platform for application implementation. The PXA255
was chosen as representative of high-performance CPUs designed for embedded systems. It consists
of a 400 MHz ARMv5TE-compatible core combined with a set of on-chip peripheral units, including
memory, interrupt, DMA and LCD controllers. The main processing core consists of the CPU, SRAM
and flash memory. Three switching power supplies provide power to the core, memory and IO.
The device was designed with power-measurement hardware on board. Each of the three power
supplies (nominally for the CPU core, memory and IO) is instrumented with a current sensor. Each
power supply is well regulated to its designated voltage; therefore, the voltage is assumed to be constant
and the current is proportional to the power (P = IV). The on-board microcontroller has an integrated ADC
and can read the sensors at up to 15 kHz. Since it can only measure one of the sensors at
a time, this equates to a maximum of 5 kHz per individual sensor when all are measured at equal
rates. Samples are transferred from the microcontroller to the PXA255 as they are taken.
This device solves many disadvantages of the other mentioned platforms (PowerScope [13] and PowerPack [1]),
but it still has significant limitations. Communication between the microcontroller and
the PXA255 is via I2C, which transfers data more slowly (400 kbps) than other protocols such as SPI or USB.
Moreover, instead of measuring the current supplied by each power supply, a current sensor could be installed per
device, allowing the system to measure the current consumed by each one. This would mean each IO device would
be individually monitored, allowing users to understand how and why each device consumes power. In
addition, PLEB 2 was a system designed particularly for an ARM platform, so it cannot be used for other kinds
of systems.
2.2 Summary
In this chapter, different ways that researchers came up with to solve the problems related to
energy consumption in computing systems were presented. Some of those methods include
designing new power-efficient components and implementing power-aware strategies such as DVFS.
However, it is crucial to understand, for instance, which components spend more power, which parts of the
source code are demanding more power, and under what conditions that happens. Thus, measuring energy
consumption in an accurate, easy and efficient way is fundamental. Towards achieving that goal, two
major methods were introduced: hardware-based and software-based measurements.
In sum, software methods have the advantage of being easy to integrate in every modern system, of
being able to give consumption at the process level, and of permitting the data to be used at run time for
power-aware policies. Contrastingly, as we saw, those methods lack accuracy when compared to direct
measurements, since they are based on estimations. Hardware methods are more reliable, but can be
very hard to scale to other systems, due to the customization needed, and may be an expensive way
of measuring consumption. Moreover, some of them do not allow the use of power-aware strategies
because of their low sampling frequency.
In the next chapter, details about the framework of the measuring system developed in this dissertation
are provided.
3 Powermeter - Architecture Definition and Specifications
Contents
3.1 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Signal Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3 AGC - Automatic Gain Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
In this chapter, the system architecture of the proposed Powermeter device is presented, together
with a detailed description of its components. Specific aspects of the device are addressed, such as the AC and
DC signal conditioning, including the design of an Automatic Gain Controller used to improve the ADC's
dynamic range. Thereby, an analysis of the ADC's SNR, THD and SFDR is carried out, and some
design criteria are defined. In the end, the reader will have a clear idea of how the device samples
and treats data, so that accurate measurements are attained.
3.1 System Architecture
Figure 3.1: System Architecture
Powermeter is capable of measuring all the available rails coming from the personal computer PSU,
including the CPU EPS12V 6/8-pin, PCI-E 6/8-pin and SATA 4/5-pin connectors. Also, the 12 V, 5 V and 3.3
V rails from the motherboard connector and the input of the PSU connector (the one directly connected
to the AC power socket) are sensed (see figure 3.1). The Powermeter sensor devices are inserted
directly on the power connectors coming from the PSU, by plugging the supply-side power cables into the input
connectors of the Powermeter and connecting the output to wherever the initial connector would plug in
(motherboard, CPU, peripherals and others).
Regarding the sensors, precise low-offset, linear Hall-effect sensors were used, which convert the
magnetic field generated by a current into a proportional voltage. The Hall IC has a copper conduction
path with an internal resistance of 1.2 mΩ, providing low power losses. The device uses the USB
protocol both to power the device and to support communications between it and the host, since USB is a fast,
versatile and reliable interface. Powermeter uses a Microchip PIC18F4550 Microcontroller Unit (MCU),
inserted in a common demonstration board (PICDEM FS USB), to establish communication between
the host system's USB and the current sensors on the device. The system communication is interrupt-
driven, simplifying the microcontroller code and decreasing the time overhead, since there is no need for
the host to poll the device to find out whether there is data to be collected. The demonstration board has a 20
MHz oscillator as input, which is then used to generate (with PLLs) the 48 MHz MCU clock (the USB
peripheral also runs with a 48 MHz clock - Full-Speed USB Mode). This microcontroller also includes
timer modules (which were used to generate the necessary time stamps and the sampling frequency) and
a 10-bit ADC.
3.2 Signal Acquisition
Figure 3.2: Block Diagram of the Hardware Architecture
Figure 3.2 exposes a general view of the proposed acquisition system. The system properly acquires
the output of the DC and AC sensors. The AC acquisition introduces an AGC system, which dynamically
amplifies the input signal, allowing small variations of the signal to be distinguished. The system comprises
an active band-pass filter, which reduces noise and only passes the 50 Hz component of the spectrum,
eliminating in the process the offset imposed by the current sensor, and a controller that dictates whether
the signal gets amplified or attenuated, using a Programmable Gain Amplifier (PGA). Consequently, by
the usage of this novel approach, the ADC's dynamic range gets improved. Low power-loss (1.2 mΩ)
sensors (ACS712/14) were used to sense the AC and DC currents of the several power rails. These sensors
operate based on the Hall-effect principle and output a voltage which is proportional to the magnetic field
generated by the sensed current. The sensors have different sensitivities, depending on the range of
current they are able to sense. For instance, for a 20 A range we have a 100 mV/A sensitivity, whilst for
a 30 A range the sensor gives a 66 mV/A output. However, these sensors introduce an offset approximately
equal to Vcc/2. As a result, and also because we want a good precision in the ADC conversions (i.e., as
many filled bits as possible), the signal coming from the sensor has to be conditioned. AC and DC signal
conditioning require different approaches, as will be described in the next sections.
Figure 3.3: Current Sensor: (a) Hall-Effect Sensor; (b) Allegro's Current Sensor. Source: Allegro MicroSystems
3.2.1 DC conditioning
Figure 3.4: DC Conditioning (sensing and amplification stages)
Figure 3.4 illustrates the DC signal acquisition. In sum, the output generated by the ACS712/14 sensors
passes through an amplifying stage before being acquired by the ADC. For the DC power rails, the
conditioning procedure consists of cancelling the DC offset introduced by the ACS712/14 sensor (which
can vary between 2.330 V and 2.5 V) and applying some gain to the remaining signal, so it can be
successfully acquired by the ADC stage. Thus, the proposed conditioning circuit consists of a subtractor
amplifier stage (figure 3.5).
Figure 3.5: Subtractor Circuit
Equation 3.1 corresponds to the output signal, obtained after analysing the circuit presented in
figure 3.5 for DC input voltages (implying ω = 0, so the capacitor can be 'seen' as an open circuit).

V_Out = −(V_In − V_+) × (R4/R3) + V_+ (3.1)
Hence, for DC voltages the gain is constant and equal to G = R4/R3. However, for higher
frequencies the signal is attenuated at a rate of 20 dB per decade, since the capacitor introduces a pole
in the circuit. The corresponding cut-off frequency can be computed with equation 3.2.

ω_H = 1 / (R4 × C1) (3.2)
This acts as an anti-aliasing filter, which is required when acquiring samples. The V+ voltage can be 2741 or 2848 mV and the gain 4.254 or 4.299, always yielding an output voltage lower than +5 V, but higher than 3.5 V for input voltages greater than 2216 mV. This is always true for input currents greater than zero. These voltages and respective gains were obtained after a study of the best configurations, so as to maximise the amplification of the measured signal without saturating the ADC's range.
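As a numerical check of equation 3.1, the sketch below evaluates the subtractor stage for one of the configurations quoted above (V+ = 2.741 V, G = R4/R3 = 4.254; the 2.216 V test input is taken from the text):

```python
V_PLUS = 2.741   # reference voltage at the non-inverting input [V]
GAIN = 4.254     # R4 / R3

def condition_dc(v_in):
    """Equation 3.1 with the capacitor open (DC): VOut = -(VIn - V+)*G + V+."""
    return -(v_in - V_PLUS) * GAIN + V_PLUS

# A sensor output of 2.216 V lands just below the 5 V ADC ceiling.
print(condition_dc(2.216))
```

Note that a sensor voltage equal to V+ simply passes through unchanged, since the differential term vanishes.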
3.2.2 AC conditioning
By definition, instantaneous electric power is obtained by multiplying the voltage and current:

P(t) = V(t) I(t) (3.3)

For sinusoidal waveforms, this translates into:
V(t) = VM sin(ωt)
I(t) = IM sin(ωt + φ)
P(t) = (VM IM / 2) [cos(φ) − cos(2ωt + φ)]
(3.4)
Thus, the power waveform comprises a time-varying component and a constant component. The constant component, which can be obtained by averaging P(t), corresponds to the effective power delivered to the load, also called Active Power. The active power can also be obtained via complex-amplitude analysis, using the complex power. By definition, the complex power is given by:
S = (1/2) VM IM e^(jφ) = (VM IM / 2) cos(φ) + j (VM IM / 2) sin(φ) (3.5)
where φ is the angle between voltage and current. The real part of equation 3.5 is the Active Power, whilst the imaginary part is the Reactive Power, which corresponds to the maximum value of the power component that oscillates between the mains and the load, resulting from the energy stored in capacitors and/or inductors. Using RMS (Root-Mean-Square) amplitudes, the active power is obtained with equation 3.6.

PActive = VRMS IRMS cos(φ) (3.6)
The ratio between the active power and the Apparent Power (VRMS IRMS) is known as the power factor (p.f.) and, for sinusoidal waves, it corresponds to cos(φ). Equation 3.6 is an important result, since it states that if the RMS amplitudes of voltage and current are known and the p.f. is also known, then the Active Power can be computed. Fortunately, modern PSUs have power factor correction and the CORSAIR TX750 PSU's manufacturer guarantees a steady power factor of 0.99 [41]. Thus, it is only necessary to find the amplitude of the current sine wave at each cycle to compute the active power, since the power factor and voltage RMS value are known (p.f. = 0.99 and VRMS = 230 V for European electric power). This, in fact, means that the power of the AC signal is proportional to the current.
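The computation described above amounts to a one-line formula; a minimal sketch follows (the 0.7 A current amplitude is an arbitrary illustrative value, not a measurement from this work):

```python
import math

V_RMS = 230.0   # European mains RMS voltage [V]
PF = 0.99       # power factor guaranteed by the TX750 PSU

def active_power(i_amplitude):
    """Equation 3.6, with IRMS = IM / sqrt(2) for a sinusoidal current."""
    return V_RMS * (i_amplitude / math.sqrt(2)) * PF

print(active_power(0.7))  # roughly 113 W
```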
3.3 AGC - Automatic Gain Control
The need to measure signals with a wide dynamic range is quite common in the electronics industry, but current technology often has difficulty meeting actual system requirements. Weigh-scale systems typically use load-cell bridge sensors with maximum full-scale outputs of 1 mV to 2 mV. Such systems may require resolutions on the order of 1,000,000 to 1, which, when referred to a 2 mV input, call for a high-performance, low-noise, high-gain amplifier and a sigma-delta modulator. While the actual sensor data typically takes up only a small portion of the input signal range, the system must often be designed to handle fault conditions. This is exactly the problem with the current sensors used here, which output a very low-amplitude voltage signal. Thus, a wide dynamic range, high performance with small inputs, and quick response to fast-changing signals are key requirements. These requirements call for a flexible signal-conditioning block, with low-noise inputs, relatively high gains, and the ability to dynamically change the gain in response to input level changes without affecting performance, while still maintaining a wide dynamic range. Existing sigma-delta technology can provide the dynamic range needed for many applications, but only at the expense of an increased operation rate.
This section presents an alternative approach that uses a successive-approximation sampling 10-bit ADC, combined with an autoranging PGA front end, forming an AGC system (figure 3.6). With a gain that changes automatically based on the analog input value, it uses oversampling to increase the dynamic range of the system to more than 80 dB.
Figure 3.6: AGC Structure
The band-pass filter, introduced earlier, is useful to reduce the input noise of the system in comparison to the input signal and to eliminate the DC component of the Hall sensor. However, the current sensor output can still be a low-amplitude signal, and small variations of the amplitude are eliminated during the quantization process. Therefore, an automatic gain controller is proposed. This stage is supposed to distinguish, over time, small variations in the input signal and amplify them, guaranteeing no loss of the input signal and an improvement of the dynamic range. For the design, it is important to know the minimum and maximum amplitude that the sensor output signal can reach over time; after some experiments, it was determined that it varies between 30 mV and 50 mV. Figure 3.6 shows a diagram with the major blocks that constitute this system. Part of this scheme (the analog part) was implemented with physical elements, like operational amplifiers and digitally controlled potentiometers, whilst the digital part was performed by programming the PIC18F4550.
Starting from the left, the input signal (which is actually the output of the AC current sensor) gets filtered and amplified by the band-pass filter. Then, ignoring for now the subtraction node, the signal is dynamically attenuated or amplified, depending on the range where the signal lies. Since the goal is to always get a signal whose amplitude spans almost the full range of the ADC, if, for instance, the signal has a low amplitude (say 1 V), the AGC amplifies it. However, if its amplitude is very close to or above the ADC's full range (e.g., 4.9 V), then the signal gets attenuated, ensuring at all times that the signal lies within the full range of the ADC. It is the PGA block that amplifies or attenuates the input signal, in conjunction with the controller, which decides whether to amplify, attenuate or keep the current gain. Because the ADC embedded in the PIC18F4550 microcontroller is unipolar (range between 0 and 5 V), the PGA introduces an offset of Vcc/2, shifting the input signal to the middle of the ADC's range. This guarantees that the input signal can be properly amplified without upper or lower saturation. In addition, due to the non-linear characteristic of the ADC, this also reduces quantization errors. Afterwards, in the digital domain, the initial input signal gets recovered, by subtracting the DC offset and cancelling the gain through arithmetic shifts 1. Then, the signal gets averaged with the last 32 samples, thus resulting in a 32-point moving average filter, whose mathematical formulation is given in equation 3.7.
y(n) = (1/32) Σ_{k=0}^{31} x(n − k) (3.7)
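As a sketch, the filter of equation 3.7 can be expressed with a fixed 32-sample window (written here in Python rather than the PIC's C18 firmware; seeding the window with zeros is an assumption about the start-up state):

```python
from collections import deque

class MovingAverage32:
    """32-point moving average filter of equation 3.7, seeded with zeros."""

    def __init__(self):
        # Fixed-length window: appending a new sample drops the oldest one.
        self.window = deque([0.0] * 32, maxlen=32)

    def update(self, x):
        """Push a new sample and return the average of the last 32 samples."""
        self.window.append(x)
        return sum(self.window) / 32.0
```

On the microcontroller, the division by 32 reduces to a 5-bit right shift, in line with the power-of-2 constraint discussed below.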
By using this procedure, it is possible to optimally amplify, even more, the small variations in the current signal which occur over time. The 5-bit DAC is used to convert to the analog domain the digital value obtained after the average computation. This value has a length of 10 bits, thus only the 5 MSBs are used. The digital value is recorded in a variable, so it can be added to the acquired signal to successfully regenerate the original signal.
The controller system could be accomplished in several ways: one alternative would be to design a PI (Proportional-Integral) controller, which increases or decreases the gain according to an error signal (the reference being the maximum voltage we want at the input of the ADC). However,
1 The C18 compiler of the PIC18F4550 implements multiplication and division operations of any length that are not supported by the hardware by calling library functions. Hence, these operations are very time consuming and are not appropriate for a real-time processing project. Consequently, an amplifier that only assumes gains that are powers of 2 was designed. This eases the process of recovering the initial signal, by performing right or left shifts according to the gain imposed by the PGA.
in this particular case, the controller parameters would change over time, since they depend on the amplitude of the input signal, which keeps varying. This would result in different time responses of the loop. The parameters are also dependent on the sampling frequency, thus for every change in the sampling frequency (during the design of the project), new parameters might have to be computed. Another issue is that the required computations are not adequate for the PIC18F4550's 8-bit architecture and would consume too much time: a multiplication and a division of floats can reach 336 and 2712 clock cycles, respectively, while arithmetic shifts of integers require approximately 20 clock cycles.
Figure 3.7: On-off Controller: (a) On-Off Characteristic; (b) On-off Controller Block
In order to meet the requirements, another solution was engineered: the bang-bang controller. The bang-bang controller (also denoted as on-off controller) is a simple and effective solution to this problem. In this project, an 'on-off' non-linearity with hysteresis and a dead-zone was implemented (see figure 3.7).

The controller (figure 3.7(b)) was implemented in the digital domain and changes the gain over time by controlling a PGA (the schematic of the electronic devices which form the PGA will be addressed in section 3.3.2). The controller comprises three states: one where it increases the gain, another where it decreases it, and the dead-zone, where it keeps the current gain unchanged. The controller also includes a maximum-search algorithm and a low-pass filter: the search for the maximum is necessary not only to test if the signal amplitude is within the allowable range, but also to compute the power of the AC signal (remember that the power is proportional to the amplitude of the current signal); the low-pass filter is used to avoid false gain changes due to instantaneous fluctuations in the measured signal. The output of the controller feeds the non-linearity, whose margins were computed taking into account the variations of the current sensor output (the computation of those margins is detailed in chapter 5).
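A minimal sketch of the decision rule follows (in Python; the thresholds are the simulation margins given in section 3.3.1 and the gain list mirrors the power-of-2 PGA settings of section 3.3.2.C, while the actual firmware additionally performs the maximum search, low-pass filtering and hysteresis):

```python
T0, T1 = 0.488, 1.95                  # lower/upper margins of the on-off characteristic [V]
GAINS = [0.25, 0.5, 1, 2, 4, 8, 16]  # power-of-2 PGA gain settings

def next_gain_index(peak, idx):
    """On-off rule with dead-zone: below T0 raise the gain one step,
    above T1 lower it one step, otherwise keep the current gain."""
    if peak < T0 and idx < len(GAINS) - 1:
        return idx + 1
    if peak > T1 and idx > 0:
        return idx - 1
    return idx  # dead-zone: gain unchanged
```

In the real loop, `peak` would be the output of the maximum search after low-pass filtering, and the hysteresis of the non-linearity keeps the gain from chattering near the thresholds.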
3.3.1 MatLab Simulation
Matlab was used to test the controller. The schematic, built in Simulink, is based on the diagram in figure 3.1. For simulation purposes, the lower and upper margins of the on-off characteristic were T0 = 488 mV and T1 = 1.95 V, respectively.
There are, fundamentally, at least two interesting simulation cases:

1. The first case is the behaviour of the circuit for a sinusoidal input with amplitude ranging between 120 mV and 200 mV;

2. The second case starts with a very low-amplitude signal (100 mV) and jumps suddenly to a very high-amplitude signal (5 V).

These tests allow us to understand the behaviour of the loop and observe whether it can properly adjust the amplitude to fit within the margins of the non-linearity, whether the input signal is a very low-voltage signal (first case) or a very high-voltage one (second case), while providing the highest gain possible. It is also interesting to compare the results obtained with an ADC with and without the interaction of the proposed controller.
Sinusoidal Input: 120-200 mV
With the help of a voltmeter and by running some workloads on the PC, it was determined that the output amplitude of the current sensor varies roughly between 30 mV and 50 mV. For the first test, the controller was subjected to a sinusoidal wave whose amplitude suddenly increases from 120 mV to 200 mV. This signal simulates the output signal coming from the current sensor after being filtered by the bandpass filter, which introduces a gain of 4.
Figure 3.8: Loop Response: (a) Input Sinewave; (b) Output Sinewave
Figure 3.8 demonstrates how the loop reacts to the input signal, by amplifying it to a value which guarantees that the limits of the on-off controller are respected. It must be said that the output plot does not include the offset component of 2.5 V. Even when the amplitude changes from 120 mV to 200 mV, the loop still amplifies the signal, but with a lower gain, since the signal is higher. Figure 3.9 compares the output signal, in volts, after being discretized by a common non-ideal 10-bit ADC without any sort of correction (in red), with the result obtained when using the AGC (in blue).
Figure 3.9: Reconstruction of the Input signal with and without AGC
Inspecting the figure, we can point out the differences in amplitude of the resulting signal: in the case of the conversion with no correction, the maximum amplitude of the signal reaches a value lower than it should (for an input of 200 mV we only get approximately 170 mV, whereas by using the AGC we get 197 mV). Ideally, we should get back 200 mV; however, besides the non-linearity of the ADC, we also have the quantization error of 1/2 LSB (approximately 2.44 mV). It is also obvious that, with no correction, the signal gets clipped, due to the unipolarity of the ADC. This test shows the obvious benefit of using such an approach to sample data, since with it we get more precision and no signal loss.
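The resolution benefit quantified above can be reproduced with a short numerical sketch. It isolates the quantization effect only (an idealized round-to-nearest 10-bit ADC; the fixed gain of 8 is an assumption standing in for the setting the AGC loop would reach for a 0.2 V signal, being the largest power of 2 that keeps 8 × 0.2 V + 2.5 V below the 5 V ceiling):

```python
import math

LSB = 5.0 / 1024   # 10-bit unipolar ADC step over the 0-5 V range (~4.88 mV)
OFFSET = 2.5       # mid-range DC offset added before conversion

def adc(v):
    """Idealized 10-bit ADC: clip to [0, 5] V and round to the nearest level."""
    v = min(max(v, 0.0), 5.0)
    return round(v / LSB) * LSB

def reconstruct(x, gain):
    """Amplify and offset, quantize, then undo the offset and the gain."""
    return (adc(gain * x + OFFSET) - OFFSET) / gain

# One period of a 0.2 V, 50 Hz sine sampled at 4 kHz-equivalent resolution.
signal = [0.2 * math.sin(2 * math.pi * 50 * n / 4000.0) for n in range(400)]
err_plain = max(abs(reconstruct(x, 1.0) - x) for x in signal)  # unity gain
err_agc = max(abs(reconstruct(x, 8.0) - x) for x in signal)    # AGC gain of 8
print(err_plain, err_agc)
```

The maximum reconstruction error shrinks by the gain factor, matching the intuition that pre-amplification spreads the small signal over more quantization levels.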
Sinusoidal Input: 50 mV - 5 V
For the second test, the controller was again subjected to a sinusoidal wave, whose amplitude lies between 50 mV and 5 V. The goal of this test is to conclude whether the controller is also able to attenuate the input signal, if needed, so that the output signal meets the pre-conceived specifications.
As the reader can observe in figure 3.10, the loop can successfully correct the signal's amplitude to the desired value. A transitory period is also noticeable at the instant of the instantaneous change in the amplitude of the input signal, from a very low voltage (50 mV) to a very high voltage (5 V). As before, the graphic of figure 3.11 compares the output signal, in volts, after being discretized by a common non-ideal 10-bit ADC without any sort of correction, with the result obtained when using the AGC method.

Inspecting the figure, the reader can observe that the low-voltage signal was lost when performing the conversion without the AGC, since the input signal presents a very low amplitude. Although there are some errors in the resulting signal, due to the transition period from a very low-voltage signal to a very high one, only with the AGC is the input signal successfully recovered. All the other conclusions pointed out in the previous test also apply to this case.
Figure 3.10: Loop Response: (a) Input Sinewave; (b) Output Sinewave
Figure 3.11: Reconstruction of the Input signal with and without AGC
3.3.2 Analog Domain Implementation
It is necessary to materialize the blocks introduced earlier, such as the band-pass filter, the PGA and the subtraction node, by designing electronic circuits that implement their respective functions. Every circuit analysed in the following sections belongs to the full schematic of the proposed system, which is provided in figure 3.12. The layout of the final circuit can also be found in the appendix of this document (please refer to figure A.3 if needed).
Figure 3.12: Full Circuit
3.3.2.A Band-pass Filter
In Europe, the electric power is provided as a voltage signal with 230 VRMS of amplitude and f = 50 Hz of frequency. However, the offset introduced by the Hall sensor must be eliminated, while preserving the 50 Hz component of the spectrum. For that reason, a band-pass filter was chosen to fulfil this need, since it not only rejects the DC component, but also attenuates all frequencies other than the 50 Hz component. Moreover, it also acts as an anti-aliasing filter. The filter specifications are presented below:

• Central Frequency f0 = 50 Hz;
• Bandwidth b = 10 Hz;
• Filter Gain at the central frequency G(f0) = 4 ≈ 12 dB.
The chosen values guarantee a filter with a narrow bandwidth (good selectivity, Q = 5). The value of the gain at this stage influences the rest of the circuitry, namely the PGA gain range: the higher the gain of the filter, the lower the maximum gain of the PGA has to be, in order to guarantee a high dynamic range. Initially, this filter was therefore designed to have a gain of 10. However, after the implementation of the PGA (which is presented in section 3.3.2.C), it was realized that the designed gain was too high and, consequently, it was reduced to 4. A lower gain increases the available bandwidth, resulting in a device with a faster time response.

To implement this filter, a multiple-feedback structure was used, which allows the implementation of a simple and reliable 2nd-order band-pass filter for low quality factors.
VOUT/VIN = − H ω0 s / (s^2 + (ω0/Q) s + ω0^2) (3.8)
Figure 3.13: Band Pass Filter
The transfer function of a second-order bandpass filter is given by equation 3.8, where H is a gain factor, ω0 is the central frequency of the filter and Q is the quality factor. The transfer function of the above circuit is presented in equation 3.9.

VOUT/VIN = − (1/(R1 C4)) s / [s^2 + (1/R5)(1/C3 + 1/C4) s + (1/(R5 C3 C4))(1/R1 + 1/R2)] (3.9)
To get the components' values, it is just a matter of solving a system of equations involving the numerator and the denominator of both transfer functions. The process to get the values is the following:

1. Choose the C3 value;

2. Let k = ω0 C3 and C4 = C3;

3. R1 is then obtained from

1/(R1 C4) = H ω0 ⇔ R1 = 1/(H k); (3.10)

4. Resistor R5 is obtained from

2/(C3 R5) = ω0/Q ⇔ R5 = 2Q/k; (3.11)

5. And, finally, resistor R2 is

R2 = 1/(R5 C3 C4 ω0^2 − 1/R1) = 1/(k (2Q − H)). (3.12)
Before going any further, we must guarantee that the current requested by the input load of the amplifier is not higher than the one the current sensor can provide. The ACS714 current sensor can provide a maximum current of 3 mA, which means the band-pass filter has to present a sufficiently high input impedance, or else the requested current will be too high:

Vmax/Zinput ≤ 3 mA ⇔ Zinput ≥ 3.2 V / 3 mA ≈ 1.07 kΩ (3.13)

where Vmax is the maximum voltage the current sensor will output and Zinput is the load seen by the current sensor, which is equal to Zinput = R1 + R2. Accordingly, we need an input impedance of at least 1.07 kΩ. The Quality Factor is given by Q = f0/b = 50/10 = 5. The gain at the central frequency is influenced by the H factor, but also by the filter quality factor. Thus, to obtain the desired gain, the H factor must be H = G/Q = 4/5 = 0.8.
Thus, choosing H = 0.8, C3 = 0.22 µF (it has to be low, to avoid the use of an electrolytic capacitor) and performing the remaining computations for the resistance values, the nominal values listed in table 3.1 were obtained. Note that the resistances whose values are shown as a sum are realized by connecting two resistors in series.
Component Nominal 5% Tolerance 1% Tolerance
R1 36.172 kΩ 36000 + 180 = 36180 Ω 23.2 + 13 kΩ
R2 738.1955 Ω 680 + 56 = 736 Ω 698 + 40.2 Ω
R5 289.37 kΩ 130 + 160 = 290 kΩ 243 + 46.4 kΩ
C3 = C4 0.22 µF - -
Table 3.1: Component Values
Resorting to Matlab [42] functionalities, an analysis of the filter was performed considering these resistor values. All the components were subjected to deviations with respect to their nominal values, considering 1% and 5% component tolerances. The goal was to understand how the tolerance of the components can influence the response of the filter. The results are illustrated in figures 3.14 and A.1 (in appendix A) and are also summarized in table 3.2.
Figure 3.14: Real Vs Theoretical Bode Diagram (data cursors: REAL at 48.4 Hz, 12 dB; Theoretical at 50 Hz, 12 dB)
Theoretical | Nominal 5% | +5% Deviation | -5% Deviation | Nominal 1% | +1% Deviation | -1% Deviation
Gain (dB): 12 | 12.1 | 9.25 | 8.96 | 12 | 11.9 | 11.9

Table 3.2: Gain at the Central Frequency for the Various Tests
Table 3.2 reflects the change in gain at the filter central frequency, for the components' nominal values and their respective deviations (±1% and ±5%). As can be seen, the worst results are obtained when the resistor tolerances are 5%, where the filter achieves, in the worst case, 8.96 dB, resulting in a relative error of

Error = 100 × (4 − 10^(8.96/20)) / 4 = 29.9% (3.14)
Figure 3.14 displays a comparison between the magnitude response of the theoretical filter and the real filter, realized with real non-ideal components (in this case, the ones indicated in table 3.1 for 1% tolerance). The plot shows that the central frequency deviates from 50 Hz to 48.4 Hz. This suggests the need for a potentiometer to tune the central frequency.
The filter was also tested in PSPICE, where the 5% components were deviated from their nominal values: the idea was to understand whether some components influence the central frequency or the quality factor more than others. The final circuit diagram is presented in figure 3.15(a). The test consisted in injecting a sine wave with 50 mV of amplitude and performing an AC sweep analysis, varying the frequency between 0 and 1 kHz, while also performing a parametric analysis by varying each resistor individually. The results provided in table 3.3 were obtained. The table reflects a change in the central frequency when R2, alone, was deviated, and changes of both the central frequency and the quality factor when R5, alone, was deviated from its nominal value. The quality factor did not change with the deviation of R2.
Figure 3.15: Pspice Simulation Circuits: (a) Circuit Diagram; (b) Offset Voltage Analysis
R2 | f0 [Hz] | R5 | f0 [Hz] | Q
+5% | 49 | +5% | 48.9 | 10.73
-5% | 51.2 | -5% | 51.4 | 9.29

Table 3.3: Parameters Variation
In sum, the results show that the filter specifications do not change significantly, whether 5% or 1% tolerance values are used. However, it is recommended that 1% values be used, since with them the filter response practically did not change. It was also noticeable that changing the R2 value can be used to tune the central frequency, as can R5, but the latter also changes the filter quality factor. As a result, 1% component values were used in this project. In addition, it was necessary to tune the R2 value with a potentiometer, by connecting it in parallel with the resistor.
The operational amplifier used to implement the filter was the MCP6022 IC, from Microchip. This IC has a gain-bandwidth product of 10 MHz, a pole at fc = 10 Hz, an open-loop gain of 120 dB and an offset of VOS = 500 µV. Observing the circuit, and resorting to the superposition theorem, we can calculate the contribution of this offset voltage to the output voltage. By grounding the input signal, imposing a voltage source of 500 µV at V+ to simulate the voltage offset (see figure 3.15(b)) and performing this analysis in DC mode, we have a voltage follower (i.e., the output voltage will be equal to 500 µV). Therefore, our signal will have a 500 µV offset, which is not significant. For this application, where a 12 dB gain was designed, the amplifier attains a bandwidth of BW = 10 MHz / 10^(12/20) = 2.512 MHz, which is more than enough to meet the filter requirements, since we are working with much lower frequencies.
The PIC18F4550 only supplies a positive voltage of about 5 V, which comes from the USB. However, the band-pass filter requires a negative voltage to operate correctly. Consequently, it is necessary to generate a negative voltage from a positive one. The different alternatives to accomplish this require additional ICs: one could use a transformer-coupled split-rail design or an isolated fly-buck converter. Although both alternatives would generate a very steady negative voltage, they rely on inductors, which occupy a great deal of space on a circuit board. Therefore, the LMC7660 switched-capacitor voltage converter was used. The LMC7660 is capable of converting an input voltage between +1.5 V and +10 V to the corresponding -1.5 V to -10 V, requires a low supply current of 200 µA (maximum value), has more than 90% efficiency and only needs two external components (Cp, the pump capacitor, and Cr, the reservoir capacitor). To calculate the reservoir capacitor value, we have to take into account that the operational amplifier will need a typical 1 mA current to drive all its transistors, but the converter also has to provide enough current to the output circuit. Thus, the formula (given in the LMC7660 datasheet) to calculate this capacitor is:

IL = Cr dv/dt ≈ Cr × Vripple(p-p) / (4/FOSC) ⇒ Cr = (4/FOSC) × IL/Vripple(p-p) (3.15)
where IL is the load current, Vripple(p-p) is the accepted peak-to-peak output voltage ripple and FOSC is the oscillator frequency. Thus, operating at the nominal frequency (FOSC = 10 kHz), and choosing IL = 10 mA and Vripple(p-p) = 40 mV, yields:

Cr = 100 µF (3.16)
The value of the pump capacitor is usually the same as that of the reservoir one, so Cp = 100 µF. Having a large pump capacitor is also beneficial to increase the conversion efficiency. A circuit diagram of this converter is shown in figure 3.16.
Figure 3.16: Voltage Converter Schematic (LMC7660 datasheet - Texas Instruments)
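Equation 3.15 is simple enough to verify directly; the sketch below uses the values chosen above:

```python
F_OSC = 10e3       # nominal oscillator frequency [Hz]
I_L = 10e-3        # chosen load current [A]
V_RIPPLE = 40e-3   # accepted peak-to-peak output ripple [V]

# Equation 3.15 solved for the reservoir capacitor.
C_R = (4 / F_OSC) * (I_L / V_RIPPLE)
print(C_R)  # 1e-4 F, i.e. 100 uF
```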
3.3.2.B Summing Amplifier
As mentioned earlier, before the signal can be supplied to the ADC, it is convenient to introduce an offset (equal to Vref+ADC/2 = 5/2 = 2.5 V). A summing amplifier configuration was used,
Figure 3.17: Summing Amplifier
where one of the inputs is the output of the band-pass circuit and the other is the offset itself. This summation only works as it should if all resistors have the same value. The offset is generated by a voltage divider where both resistances are the same, so the division ratio is 2. Furthermore, this resistor divider is connected to a buffer, so that it does not load the next circuit. The values of the resistors are in the order of kΩ, because we do not want high currents being drained from the amplifier. 1% resistors were used and their nominal values are indicated in figure 3.17. Figure 3.18 illustrates the result of the simulation for a case where the output of the band-pass filter is fed into this adder, yielding a sinusoidal wave with a DC component of Vdd/2:
Figure 3.18: Adder Output - Voffset (purple) = 2.5 V; Vsub (yellow) = 1.175 mV amplitude; Vout (cyan)= 3.662 V
3.3.2.C Programmable Gain Amplifier
The PGA is implemented using a subtractor amplifier, whose inputs are the output of the summing amplifier and a DAC. The PGA is referenced to the Vcc/2 common voltage and it can be shown, by resorting to the superposition theorem, that its output voltage is given by equation 3.17.

Vout = (RF/RE)(V2 − V1) + Vcc/2 (3.17)
Figure 3.19: PGA
The equation reveals that, to change the gain of the amplifier, one can simply modify the ratio between resistances RF and RE, varying it over time. This can be accomplished if the RF resistor is replaced by a digitally controlled potentiometer, as illustrated in figure 3.19. The potentiometer used was the AD5113, from Analog Devices: it contains a fixed resistor with a wiper contact that taps the fixed resistor value at a point determined by a digitally controlled UP/DOWN counter.
Figure 3.20: AD5113’s Pin Configuration and Block Diagram - Analog Devices
Figure 3.20 presents the AD5113's block diagram. This device has a three-wire serial input interface: CLK, the serial clock input (negative-edge triggered); CS, the chip select input (active low); and UP/DOWN (U/D), the UP/DOWN direction increment control. Table 3.4 summarizes the operation of the circuit with respect to all combinations of states of those three signals. When CS is taken active low, the clock begins to increment or decrement the internal UP/DOWN counter, depending on the state of the U/D control pin. The UP/DOWN counter value (D) starts at 0x20 at system power-on. Each new clock pulse increments the value of the internal counter by one LSB, until the full scale of 0x40 is reached, as long as the U/D pin is logic high.
The resistance between the wiper and either end point of the fixed resistor provides a constant step size, equal to the end-to-end resistance divided by the number of positions (e.g., Rstep = 80 kΩ/64 = 1.25 kΩ). Just like a common potentiometer, it is possible to use the resistance between the A terminal and the wiper, or between the wiper and the B terminal, of the full resistance. Analog
CS | CLK | U/D | Operation
L | ↓ | H | Wiper Increment Toward Terminal A
L | ↓ | L | Wiper Decrement Toward Terminal B
H | X | X | Wiper Position Fixed

Table 3.4: AD5113 Operation
Devices provides digital potentiometers of this series with 5 kΩ, 10 kΩ and 80 kΩ nominal resistances. As discussed earlier, one of the requirements is that the resulting gain must be a power of 2, so that the computational effort is reduced. The resistance between the wiper and the B terminal is obtained according to the following formula:

RWB = D × RWN/64 + RW (3.18)

where RWN is the nominal resistance, D is the digital word recorded in the UP/DOWN counter and RW is the wiper resistance, which typically assumes the value of 70 Ω. Hence, the gain can be obtained as a function of the digital word (D), the nominal resistance and the wiper resistance, by combining equations 3.17 and 3.18:

Gain = RF/RE = (D/64 × RWN + RW)/RE (3.19)
So, if RE = 5 kΩ and RWN = 80 kΩ (resistances expressed in kΩ), it yields:

Gain = (D/64 × 80 + 0.07)/5 (3.20)
With those values, the gain is a function of the variable D and assumes power-of-2 values (up to the small wiper-resistance error) for D equal to 1, 2, 4, 8, 16, 32 and 64. This is, in turn, very easy to attain in the digital domain.
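The gains and worst-case errors of table 3.5 can be reproduced from equation 3.19. A sketch follows (resistances in kΩ; taking the ideal gain as D/4, i.e. the zero-wiper case, is an interpretive assumption consistent with the table):

```python
R_WN = 80.0   # nominal end-to-end resistance [kOhm]
R_W = 0.07    # wiper resistance [kOhm]

def pga_gain(d, r_e=5.0):
    """Equation 3.19: Gain = (D/64 * RWN + RW) / RE, resistances in kOhm."""
    return (d / 64 * R_WN + R_W) / r_e

for d in (1, 2, 4, 8, 16, 32, 64):
    ideal = d / 4.0  # D/64 * 80 / 5 with an ideal (zero-ohm) wiper
    err = 100 * abs(pga_gain(d, 5.05) - ideal) / ideal
    print(d, round(pga_gain(d, 5.05), 2), round(err, 2))
```

Running it reproduces the worst-case error column for RE = 5.05 kΩ, confirming that the relative error shrinks as the gain setting grows.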
On the other hand, according to the PSU CORSAIR TX750 specifications [41], the AC input can request a maximum current of 10 A. This means that the current sensor output lies between 2.5 V and 3.2 V, corresponding to input currents of 0 A and 10 A, respectively. However, after running some workloads, a minimum current of about 0.45 A and a maximum current of 0.53 A were measured, corresponding to 30 mV and 50 mV at the output of the sensor, respectively. Thus, for the low voltage of 30 mV, the combined gain of the band-pass filter and the PGA outputs a voltage of 4 × 16 × 30 mV = 1.920 V which, when summed with the common voltage of Vcc/2, yields 4.420 V. Therefore, the ADC's input full range is almost attained with the projected gains. Values higher than 4.7 V are not recommended, to prevent the loss of information during the quantization process.
Ideally, there would be no error in the resistor values, nor would a wiper resistance exist. Unfortunately, that is not the case, and an effort must be made to reduce the introduced error. Regarding the wiper resistance, little can be done, since it is inherent to the technology used by Analog Devices, but it is possible to reduce the error regarding the resistor values. First, the resistors to be used as RE will be of 1% precision, since those were the ones available. Moreover, by making a parallel association of two resistors of value 2 × RE, in the best case the error can be reduced from 1% to 0.01%; in the worst case, the error remains 1%. With the values of the resistors set, we can easily control the gain by changing the value of the digital word. The values the digital word can assume, and the respective gains for the worst-case RE resistance values, are represented in the following table:
D | G (Ideal) | G (RE = 5.05 kΩ) | Error | G (RE = 4.9995 kΩ) | Error
1 | 0.25 | 0.26 | 4.55% | 0.26 | 5.61%
2 | 0.50 | 0.51 | 1.78% | 0.51 | 2.81%
4 | 1.00 | 1.00 | 0.40% | 1.01 | 1.41%
8 | 2.00 | 1.99 | 0.30% | 2.01 | 0.71%
16 | 4.00 | 3.97 | 0.64% | 4.01 | 0.36%
32 | 8.00 | 7.93 | 0.82% | 8.01 | 0.19%
64 | 16.00 | 15.86 | 0.90% | 16.02 | 0.10%

Table 3.5: Digital Word Vs Gain
This table reflects the importance of always operating at the highest possible gain, since the smallest
errors (relative to the ideal gain) are attained there for both cases. For every gain setting, the PGA introduces an
offset. Therefore, these offsets were measured and saved in a look-up table in the microcontroller, to
correct such deviations. The results of this test are provided in chapter 5.
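The correction can be pictured as a small look-up step; the offsets below are placeholders, not the values actually measured and stored in the microcontroller:

```python
# Hypothetical per-gain offsets (mV), indexed by the PGA gain setting.
# The dictionary values are illustrative placeholders only.
PGA_OFFSET_MV = {0.25: 0.8, 0.5: 1.0, 1: 1.2, 2: 1.5, 4: 2.1, 8: 3.0, 16: 4.4}

def correct_offset(reading_mv, gain):
    # subtract the offset measured for the currently active gain setting
    return reading_mv - PGA_OFFSET_MV[gain]
```

In the actual firmware the table would be indexed by the digital word rather than by the gain value, but the principle is the same.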
Finally, it is important to explain how the value provided by the ADC gets converted back to an
electrical magnitude. As was shown, an amplifier stage was needed before the signal could be
acquired by the ADC. Hence, in order to obtain the real value of the sampled voltage, it is only a matter
of reverting all the steps performed until the digital value was generated by the ADC:
1. Convert the Value in Binary Format to a Voltage

V_Value = Digital Value × (V_Ref+ − V_Ref−) / ADC Resolution   (3.21)

where Digital Value is the value given by the ADC, V_Ref+ is the positive ADC reference voltage
(= 5 V), V_Ref− is the negative ADC reference voltage (= 0 V) and ADC Resolution is the number of
quantization levels (1024).
2. Calculate the Voltage Value Before Amplification

By analysing the circuit provided in figure 3.5, the input voltage is given by the following expression:

V_in = −(V_Value − V+) × R3/R4 + V+   (3.22)

i.e., it is the ratio between resistors R3 and R4 that dictates the gain. The gain values for each
sensor are stored in an array within the application. It is then necessary to subtract the offset
imposed by the sensor (V_Ref Sensor) to obtain the real voltage value. The offset values of
each sensor were measured with a voltmeter and are stored in a look-up table at the host.

V_Real = V_in − V_Ref Sensor   (3.23)
3. Calculate the Current Value

Finally, the voltage value must be converted back to current, using the current sensor's
sensitivity (mV/A):

I = V_Real / Sensitivity   (3.24)

4. AC Sensor

The above steps are valid for all sensors except the AC sensor. For this sensor, the signal conditioning
differs from the one used with the DC rails. Therefore, only the first and third steps apply, with an
intermediate step in which the gain introduced by the band-pass filter is cancelled:

V_Real = V_Value / Filter Gain   (3.25)
The reader is invited again to observe the full schematic of the system in figure 3.12, if necessary.
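The DC-rail conversion steps above can be sketched in a few lines; the function name and the calibration arguments are illustrative, not the actual host-application code:

```python
# Hedged sketch of the conversion chain in equations (3.21)-(3.24).
VREF_POS, VREF_NEG, ADC_LEVELS = 5.0, 0.0, 1024  # 10-bit ADC, 0-5 V range

def adc_to_current(digital_value, v_plus, r3_over_r4, v_ref_sensor, sensitivity):
    # Step 1: binary code -> voltage at the ADC input (eq. 3.21)
    v_value = digital_value * (VREF_POS - VREF_NEG) / ADC_LEVELS
    # Step 2: undo the inverting amplifier stage (eq. 3.22)
    v_in = -(v_value - v_plus) * r3_over_r4 + v_plus
    # ...and remove the sensor's own offset (eq. 3.23)
    v_real = v_in - v_ref_sensor
    # Step 3: voltage -> current via the sensor sensitivity in V/A (eq. 3.24)
    return v_real / sensitivity
```

With a mid-scale code and a zeroed offset the reconstructed current is 0 A, as expected for an idle rail.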
3.3.3 Dynamic Range Analysis

In ADC applications, the dynamic range is the ratio of the rms value of the full scale to the rms noise,
which is generally measured with the analog inputs shorted together. Commonly expressed in decibels,
it indicates the range of signal amplitudes that the ADC can resolve. For an N-bit ADC, the ideal DR
(Dynamic Range), also called SNR (Signal-to-Noise Ratio) [43], can be calculated as:

DR = 6.021 × N + 1.763 (dB)   (3.26)
One method to improve this parameter is to perform oversampling. As the name implies, oversampling
gathers additional conversion data from the input signal. The standard convention for sampling an
analog signal indicates that the sampling frequency Fs should be at least twice the maximum frequency
(FM) of the input signal. This is known as the Nyquist Theorem:

Fs ≥ 2 × FM   (3.27)
Hence, by using a higher sampling frequency (oversampling), combined with averaging techniques,
the Effective Number Of Bits (ENOB) can be improved. In fact, averaging the oversampled
results also averages the quantization noise, thus improving the SNR and, consequently, the ENOB. The
ENOB is a function of the Signal-to-Noise ratio plus Distortion (SINAD), which is measured with a
sinusoidal input near full-scale applied to the A/D converter. The SINAD is found by computing the ratio
of the RMS level of the input signal to the RMS value of the root-sum-square (RSS) of all noise and
distortion components in the FFT analysis, except for the DC component. The ENOB is calculated by
substituting the ADC's measured SINAD for DR in equation 3.26 and solving the equation for N:
ENOB = (SINAD − 1.763) / 6.021   (3.28)
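These two relations are easy to evaluate numerically; a short sketch (the helper names are illustrative):

```python
def ideal_dr(n_bits):
    # ideal dynamic range / SNR of an N-bit converter (eq. 3.26), in dB
    return 6.021 * n_bits + 1.763

def enob(sinad_db):
    # effective number of bits from a measured SINAD in dB (eq. 3.28)
    return (sinad_db - 1.763) / 6.021
```

For the 10-bit converter used here, `ideal_dr(10)` gives the 61.973 dB theoretical limit quoted later in this section.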
For each additional bit of accuracy, the signal must be oversampled by a factor of four, meaning
that the relationship between the oversampling frequency F_OS and the sampling frequency F_S is:

F_OS = 4^Nb × F_S   (3.29)

where Nb is the desired improvement in the ENOB (for instance, for two bits of improvement, Nb =
2). Figure 3.21 shows how oversampling improves the accuracy of the conversion result. In this diagram,
the input signal is oversampled by four (sample groups are shown in green and purple) and averaged.
The shown sample points illustrate the difference between the raw, noisy signal and the average, the
noise in this example affecting ±3 bits of accuracy on an individual sample. Note that the averaged
values (orange dots) are much closer to the ideal value than most of the single samples.
[Figure: conversion values around the ideal code of 500 over time, showing the input waveform, the individual noisy samples, the sample groups and their averages.]
Figure 3.21: Averaged Conversion Results
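The effect illustrated in figure 3.21 can be reproduced with a toy simulation, assuming uniform noise of ±3 codes on each raw sample (the noise model and values are purely illustrative):

```python
import random
import statistics

random.seed(0)            # deterministic toy run
TRUE_CODE = 500.0         # ideal conversion value

def sample():
    # one noisy reading: the true code plus +/-3 codes of uniform noise
    return TRUE_CODE + random.uniform(-3.0, 3.0)

def averaged_sample(nb):
    # average 4**nb raw readings for nb extra bits of accuracy (eq. 3.29)
    n = 4 ** nb
    return sum(sample() for _ in range(n)) / n

raw_err = statistics.pstdev(sample() - TRUE_CODE for _ in range(2000))
avg_err = statistics.pstdev(averaged_sample(2) - TRUE_CODE for _ in range(2000))
```

Averaging 4² = 16 samples shrinks the error roughly fourfold (√16), i.e. about two extra effective bits.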
As a general rule, doubling the sampling frequency yields, approximately, a 3-dB improvement in
noise performance. Noise improvement can be attained by using post-processing techniques (averag-
ing). When averaging conversion results, there are two approaches that can be taken into account:
normal average or rolling average.
Normal averaging simply consists of acquiring N samples, adding them, and dividing the result by N. When
using normal averaging in an oversampling scenario, the sample data used in the calculation is discarded
after the technique is applied. This process is repeated every time the application needs a new conversion
result. When using averaging techniques, there is a slight delay associated with the calculated
conversion result, since it corresponds to the average of the last N samples. The delay can be calculated
using the formula shown in Equation 3.30.
t_delay = t_sn − t_s0 + t_process   (3.30)
where ts0 is the time at which the first sample of the average occurs, and tsn is when the last sample
occurs. The tprocess time, required to process the sampled data and calculate the average to supply
to the application, is also factored into the equation. Unfortunately, when using this method the ADC
sampling frequency is reduced by the same oversampling factor (i.e., if we are using an oversampling
factor of N = 8 and we are sampling at a rate of 10 kHz, then the effective sampling frequency will be
10/8 = 1.25 kHz).
The rolling average technique consists of using a sample buffer of the N most recent samples in
the averaging calculation, allowing the ADC to sample at its maximum rate (the ADC sample rate is not
reduced by N as in normal averaging), making it ideally suited for applications requiring oversampling
and higher sample rates. However, there is still a delay resulting from populating the sample buffer
and processing the average calculation (this can, however, be done efficiently by using a number of samples
equal to a power of two). Furthermore, in order to always have useful data being processed by the
application, while the buffer is being filled, the samples coming directly from the ADC are used.
Figure 3.22 shows the differences between the two aforementioned techniques for an oversampling ratio
of 4. It can be seen that the rolling average permits operation at the maximum frequency, whilst
the normal average does not, since it has to wait for the buffer to be filled with 4 samples before averaging.
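A minimal rolling-average sketch follows (illustrative, not the actual firmware; a power-of-two window lets the division become a shift in fixed-point code):

```python
from collections import deque

class RollingAverage:
    def __init__(self, n):
        self.n = n
        self.buf = deque(maxlen=n)  # keeps only the N most recent samples

    def update(self, sample):
        self.buf.append(sample)
        # while the buffer is still filling, pass raw ADC samples through
        if len(self.buf) < self.n:
            return sample
        # otherwise average the N most recent samples
        return sum(self.buf) / self.n
```

Every call to `update()` returns a usable value immediately, so the output rate equals the ADC sample rate, matching the behaviour described above.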
[Figure: timing diagrams of both techniques, showing when the first averaged sample becomes available and the associated delay t_delay.]
(a) Normal Average  (b) Rolling Average
Figure 3.22: Averaging Techniques
3.3.3.A SNR, THD and SFDR Analysis

For the following analysis, it is important to know precisely how the ADC performs when
stressed with a common 50 Hz sinusoidal input. Thus, in this section, an analysis of the SNR (already
mentioned), the Total Harmonic Distortion (THD) and the Spurious Free Dynamic Range (SFDR) will
be carried out, and some choices are made, such as the sampling frequency of the system and the post-
processing technique to be used. Therefore, definitions for THD and SFDR need to be provided.
Total Harmonic Distortion is the ratio of the rms value of the fundamental signal to the mean value
of the root-sum-square of its harmonics (generally, only the first 5 harmonics are significant). THD of an
ADC is also generally specified with the input signal close to full-scale, although it can be specified at
any level.
The Spurious Free Dynamic Range is the ratio of the rms value of the signal to the rms value of the
worst spurious signal, regardless of where it falls in the frequency spectrum. The worst spur may or may
not be a harmonic of the original signal. SFDR can be specified with respect to full-scale (dBFS) or with
respect to the actual signal amplitude, also called carrier (dBc).
There is no information available about the dynamic characteristics of the ADC embedded in the PIC18F4550
microcontroller. However, the information about a similar 10-bit ADC from Microchip can be used as a
baseline and compared with the values obtained through measurements. Table 3.6 presents the main
characteristics from the datasheet of the Microchip 10-bit ADC MCP3004/3008, for Vref = 5 V.
Sample Rate    200 kSPS max
DNL and INL    ±1 LSB
SINAD          61 dB
SFDR           78 dB
THD            −76 dB

Table 3.6: ADC Characteristics
[Figure: single-sided amplitude spectra; (a) Vdd/2 input: RMS quantization noise level, FFT processing gain 10 log(4096/2) = 33 dB, FFT noise floor = 94 dB, SNR = 60.887 dB; (b) fundamental, noise, DC and harmonics (excluded).]
(a) Vdd/2  (b) Fs = 1 kHz; VDC = 2.45 V and VAC = 1.715 Vrms
Figure 3.23: PIC18F4550 10-bit ADC's SNR
The SNR of the MCP3004/3008 can be calculated using the SINAD and the THD values, according to the
following formula [44]:

SNR = −10 log[10^(−SINAD/10) − 10^(THD/10)] ≈ 61.45 dB   (3.31)
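Equation 3.31 can be evaluated directly; a sketch (the function name is illustrative):

```python
import math

def snr_from_sinad_thd(sinad_db, thd_db):
    # eq. 3.31: remove the distortion contribution from the SINAD figure
    # (THD is entered as a negative number in dB)
    return -10 * math.log10(10 ** (-sinad_db / 10) - 10 ** (thd_db / 10))
```

Because the distortion power is subtracted out, the result always sits above the SINAD value, as observed in the text.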
As it should be, the SNR is greater than the SINAD value. For the ADC of the PIC18F4550, in order to calculate
some of these parameters, a DC signal with magnitude Vdd/2 was first fed into the ADC and sampled at 1
kHz. Figure 3.23(a) shows the Fast Fourier Transform (FFT) plot of that test, taken with 4096 points. The
scale is referred to the full-scale input voltage (Vref = +5 V). The plot evidences the noise floor of the
system lying around 94 dB. The FFT process reduces the noise floor, so the actual SNR referred to the
RMS quantization noise level equals [43]:

SNR = FFT Noise Floor − 10 log(4096/2) ≈ 60.887 dB   (3.32)
This is, in fact, the SNR of the system. The ENOB is calculated via equation 3.28, resulting in:

ENOB ≈ 9.8   (3.33)

This is a typical value for the ENOB of a 10-bit ADC. However, it was obtained by injecting a constant
value. Accordingly, to have a better representation of the SNR, a sinusoidal signal with f = 50 Hz and an
amplitude close to the full scale (DC offset of 2.45 V and a maximum of 1.715 VRMS), sampled at Fs =
1 kHz, was injected. The respective FFT is presented in figure 3.23(b).
As can be concluded by observing the plot, the ADC behaves poorly and the SNR drops
to approximately 38 dB. In fact, it was observed that the signal was corrupted by noise with the
same amplitude as the harmonics. There can be several reasons for this: extra (non-quantization) noise
introduced by the mains; the voltage reference used by the ADC, which comes from the USB bus and powers the
full system, not being very stable, thus affecting the conversion; or the sampling rate not being high enough.
Hence, the combined effect of the aforementioned causes results in the quantization process being strongly
disturbed during the ADC's conversion. The first cause was, at first, discarded after obtaining the FFT of
the same sinusoid with a digital oscilloscope, where the fundamental, its harmonics and
the noise can be seen, but no other signal (please refer to figure A.4 in the appendix). However, the DSO-X 2024A
digital oscilloscope performs oversampling and sample averaging, so this method is inconclusive regarding
the source of the noise.
The hypothesis that the voltage of the USB bus is unstable is, in fact, true, since it varies with the load of
the PC. A solution would be to supply the board from an external power source, providing a much more
stable voltage reference. Nevertheless, this would require modifications to the PIC's demonstration
board, and it would be hard to come up with an easy-to-integrate meter board if it demanded the use
of an external power source.
Therefore, the sampling frequency was increased to 3.3(3) kHz and new measurements were
taken. Furthermore, post-processing of the samples was performed: averaging N = 8 and N = 16 samples,
using a rolling average buffer technique. The choice of operating at Fs = 3.3(3) kHz is discussed in
chapter 4.

Inspecting the resulting measurements in figures 3.25(a) and 3.25(b), an increase in the SNR value
can be observed (the noise floor gets reduced) and the noise interference previously seen
is now eradicated, even in figure 3.24, where no averaging is performed. Hence, sampling at Fs =
3.3(3) kHz is enough to raise the SNR to approximately 60.4 dB, which is much closer to the value
obtained when injecting a constant DC signal.
As previously referred, averaging the samples can improve the SNR value, and that is effectively
the case when averaging N = 8 and N = 16 samples: figure 3.25(a) shows an improvement
of approximately 9 dB when compared with the SNR obtained with no averaging. This, in
fact, agrees with the rule of thumb which states that for every doubling of averaged samples there is a
noise improvement of roughly 3 dB. This, in turn, yields a final SNR of 69.187 dB, thus surpassing
the theoretical limit of 61.973 dB for a 10-bit ADC. Similar conclusions are obtained analysing figure
[Figure: single-sided amplitude spectrum of the sinusoidal input, with the fundamental, the harmonics and the noise marked; FFT noise floor = 93.51 dBFS, FFT processing gain 10 log(4096/2) = 33 dB, SNR = 60.396 dBFS.]
Figure 3.24: FFT of Sinusoidal Signal sampled at Fs = 3.3(3) kHz
3.25(b), where, in this case, an improvement of 11 dB is observed, instead of the expected 12 dB. This
results in a final SNR of approximately 72 dB.
The THD, SFDR and SINAD were also computed for every case. The ENOB value was calculated
from the values obtained for THD and SNR, using once again equation 3.28. Since the analysis
is similar for every case, only the THD and SFDR plots for one of the cases are included
(please refer to figure A.5 in the appendix of this dissertation). Table 3.7 summarizes the results.
Signal   Vdd/2    Fs = 1 kHz   Fs = 3.3(3) kHz   Fs = 3.3(3) kHz, N = 8   Fs = 3.3(3) kHz, N = 16
SNR      61 dB    38 dB        60.4 dB           69.187 dB                72 dB
THD      N/A      54.22 dB     57.72 dB          61.291 dB                62.64 dB
SFDR     N/A      52 dB        59.82 dB          61.842 dB                63.33 dB
ENOB     N/A      6            8.982             9.778                    10.03

Table 3.7: SNR values for each test case
For this project, it was chosen to operate at FS = 3.3(3) kHz, averaging N = 8 samples with a rolling
buffer technique. Averaging a higher number of samples would introduce a major overhead in
the main loop of the application, due to the buffer and accumulator manipulation, which would force the
usage of a lower sampling frequency. In any case, this option introduces enough noise improvement to exceed
the SNR theoretical limit for a 10-bit ADC. According to the results, it is safe to state that the mains
introduces some undesired spurs, significantly affecting the ADC's conversion. However, it is possible to
bypass this issue by sampling at higher frequencies and performing sample averaging. It is now also
possible to compare these results with the ones indicated in table 3.6, referring to a typical 10-bit ADC.
Thus, the PIC18F4550's ADC has a lower performance than the MCP3004/3008 ADC, although it is important
to bear in mind that the MCP3004/3008 has a maximum sample rate of 200 kSPS, whilst the PIC18F4550's ADC has only 58 kSPS,
[Figure: single-sided amplitude spectra of the averaged sinusoidal signal; (a) N = 8: FFT noise floor = 102.3 dBFS, SNR = 69.2 dBFS; (b) N = 16: FFT noise floor = 104.5 dB, SNR = 72 dB.]
(a) N = 8  (b) N = 16
Figure 3.25: FFT of Sinusoidal Signal with samples averaging (Fs = 3.3(3) kHz)
which affects dynamic performance.
3.3.3.B Combining the PGA with Oversampling

As referred earlier, to maximise the dynamic range of the ADC, a front-end PGA stage can be
added to increase the effective SNR for very small input signals.

Consider a system dynamic range requirement of > 80 dB:
1. First, the minimum rms noise to achieve this dynamic range must be calculated, considering
Vp−p = 5 V (unipolar ADC). The maximum allowable system noise is calculated as:

DR = 20 log(V_FS RMS / V_RMS Noise)   (3.34a)

where V_FS RMS and V_RMS Noise are the full-scale RMS voltage allowed by the ADC and the RMS
noise, respectively. Substituting the values yields:

80 dB = 20 log[(5 / (2√2)) / V_RMS Noise] ⇔ V_RMS Noise = 176.776 µV RMS   (3.34b)
2. The 10-bit, 58 kSPS ADC of the PIC18F4550 works at approximately 3 kSPS. The total rms noise is
simply:

176.776 µV RMS = ND × √BW_max   (3.35a)

where ND refers to the noise density and BW_max to the Nyquist band. Thus, for BW_max = 1.5
kHz:

ND = 176.776 µV / √1500 ≈ 4564.35 nV/√Hz   (3.35b)

This is the amount of noise density, referred to the input (RTI), that can be tolerated by the system.
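Steps 1 and 2 can be reproduced numerically; a sketch of equations 3.34b and 3.35b:

```python
import math

DR_TARGET_DB = 80.0                             # required dynamic range
BW_MAX_HZ = 1500.0                              # Nyquist band at 3 kSPS

v_fs_rms = 5.0 / (2.0 * math.sqrt(2.0))         # full-scale rms of a 5 Vpp sine
v_noise_rms = v_fs_rms / 10 ** (DR_TARGET_DB / 20)   # eq. 3.34b, in volts
nd_limit = v_noise_rms / math.sqrt(BW_MAX_HZ)        # eq. 3.35b, in V/sqrt(Hz)
```

The computed values match the 176.776 µV RMS and 4564.35 nV/√Hz figures above.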
3. It is now possible to verify whether the AGC amplifier chosen a priori is appropriate to
provide sufficient analog front-end gain to achieve the required 80 dB. If it is not suitable, then
another device with better performance must be chosen, or more conservative specifications
regarding the DR must be made.
The input 50 Hz signal is acquired at a frequency of 3.3(3) kHz and subsequently averaged
with N = 8 samples, thus reducing the amount of noise in the system and providing a total SNR of 69.2
dB. Hence, achieving 80 dB of DR requires at least 10.8 dB of improvement, which can
come from the gain provided by the PGA stage. Therefore, this block has to provide a gain of at
least 3.5 without exceeding the ND limit (4564.35 nV/√Hz).

The AD8031 amplifier, combined with variable resistors and mounted as a differential gain amplifier
with RE = 5 kΩ (figure 3.19), is capable of providing a maximum gain of 16, as seen
before, which is higher than required. This amplifier has a noise density of 15 nV/√Hz at f = 1 kHz for unity gain
(G = 1); for higher gains, the noise is usually lower, and it is higher for operation at low frequencies
(< 1 kHz). So, the 15 nV/√Hz value can be roughly used for the following calculations:

ND_PGA ≈ 15 nV/√Hz   (3.36)

Another AD8031 amplifier lies between the band-pass filter and the PGA, providing the
necessary common-mode voltage through a summing configuration (figure 3.17):

ND_Sum = 15 nV/√Hz   (3.37)

The MCP6002 is positioned before the summing amplifier and is used to implement the band-pass
filter (figure 3.13):

ND_Bpass = 15 nV/√Hz   (3.38)
4. Since the complete system's noise budget is 4564 nV/√Hz (RTI), it is useful to calculate the
dominant noise sources to ensure that the limit is not exceeded. The noise densities are referred
to the input of the full circuit, which is the band-pass filter (the reader may recall the full circuit
diagram in figure 3.12).

The AD8031 summing amplifier and the PGA both have an input-referred noise of 15 nV/√Hz,
which, when referred back to the input of the band-pass filter (which can provide a gain of 4), yields
3.75 nV/√Hz of noise each. The band-pass filter has an input-referred noise of 15 nV/√Hz.

The ADC has an SNR of 60.4 dB using 5 V as reference, yielding:

N = (5 / (2√2)) / 10^(60.4/20) ≈ 1688.2 µV RMS   (3.39)

Considering the Nyquist BW (1.5 kHz):

ND_ADC ≈ 1688.2 µV RMS / √1500 ≈ 43.6 µV/√Hz   (3.40)
When referred back to the input, ND_ADC RTI = 681.25 nV/√Hz.

The total RTI noise of the full system is given by the root-sum-square (RSS) of those noise
densities:

Noise Total = √(ND²_PGA RTI + ND²_Sum RTI + ND²_Bpass RTI + ND²_ADC RTI)   (3.41a)

Substituting values yields:

Noise Total = √(2 × (3.75 nV/√Hz)² + (15 nV/√Hz)² + (681.25 nV/√Hz)²)   (3.41b)

⇒ Noise Total ≈ 681.44 nV/√Hz   (3.41c)

This is lower than the maximum allowable noise (4564 nV/√Hz). The total noise contribution suggests
that the initial specification (80 dB) was too conservative and, in fact, the DR can be greater.
Considering this value and using equation 3.34a, the system may achieve a DR of 96.5 dB.
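This last step can be checked in a few lines (densities taken from the text; the helper names are illustrative):

```python
import math

def rss(*densities):
    # root-sum-square combination of uncorrelated noise densities (eq. 3.41a)
    return math.sqrt(sum(d * d for d in densities))

# all densities in nV/sqrt(Hz), referred to the band-pass filter input:
# PGA and summing amp (15 each, referred through the filter gain of 4),
# the band-pass filter itself, and the ADC referred back to the input.
nd_total = rss(15 / 4, 15 / 4, 15, 681.25)

def dr_db(v_fs_rms, v_noise_rms):
    # eq. 3.34a
    return 20 * math.log10(v_fs_rms / v_noise_rms)

# total rms noise over the 1.5 kHz Nyquist band, in volts:
v_noise = nd_total * 1e-9 * math.sqrt(1500)
dr = dr_db(5 / (2 * math.sqrt(2)), v_noise)
```

The ADC term clearly dominates the sum, and the resulting DR matches the 96.5 dB quoted above.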
3.3.4 System Stability Analysis

Since the proposed AGC is non-linear, the theory commonly used for linear feedback systems is not
appropriate. However, under some assumptions, it is possible to analyse the system in the frequency
domain, in a way similar to the one used with linear systems. This section provides some background
on non-linear system problems and some analysis techniques based on sector non-linearities,
important for this type of case, and then applies this theory to the practical case.
3.3.4.A Absolute Stability
A large class of non-linear systems can be represented as a conventional linear, time-invariant system
with a non-linear block in the feedback path (see figure 3.26(a)). In fact, this type of system appears
with some frequency in practical engineering cases, and the stability analysis problem for this kind of
system is often identified as the Lur'e Problem or Absolute Stability Problem [45, 4, 46].
(a) Lur’e Problem (b) Definition of a Sector
Figure 3.26: Non-Linearities Analysis [4]
The process for representing a system in this form depends on the particular system that is involved.
For instance, in the case where the only non-linearity is in the form of a relay or actuator/sensor non-
linearity, there is no difficulty in representing the system in this feedback form. In other cases, the
representation may be less obvious.
It is assumed that the external input is r = 0, and the behaviour of the unforced
system is studied. What is unique about this section is the use of the frequency response of the linear system,
which builds on classical control tools like the Nyquist plot and the Nyquist criterion. The dynamic system
in figure 3.26(a) can be represented in the state-space form:

ẋ = Ax + Bu
y = Cx + Du
u = −φ(y)   (3.42)

where x ∈ ℝⁿ, u and y ∈ ℝᵖ, the pair (A, B) is controllable and (C, A) is observable. G(s) is
the open-loop transfer function of the system, and it can be obtained from the state-space model:

G(s) = C(sI − A)⁻¹B + D   (3.43)
It is useful to define a sector and when a function belongs to one. A continuous function φ
belongs to the sector [K1, K2] if there are two non-negative numbers K1 and K2 such that:

y ≠ 0 ⇒ K1 ≤ φ(y)/y ≤ K2   (3.44)

Geometrically, this condition means that the graph of φ(y) lies between two lines with slopes K1 and K2,
as shown in figure 3.26(b). The definition in equation 3.44 implies that φ(0) = 0 and φ(y)y ≥ 0 (i.e., the
graph of φ(y) is located in the 1st and 3rd quadrants, since K1 and K2 are non-negative). An
important question arises: supposing the non-linearity φ(y) belongs to the sector [K1, K2], and the open-loop
transfer function G(s) is stable (Hurwitz), what conditions should be imposed on φ(y) to ensure
that the closed-loop system remains stable? The following theorems answer this question.
3.3.4.B Popov Criterion

The monovariable case of the Popov criterion is addressed by several authors [45, 46, 47, 4] and
establishes a sufficient condition to prove the stability of non-linear systems in closed loop. Considering the
system presented in figure 3.26(a), if the following conditions are satisfied:

1. Matrix A has all its eigenvalues in the left half of the complex plane (i.e., A is Hurwitz) and D > 0;

2. [A, B] is controllable;

3. The non-linearity φ(y) belongs to the sector [0, K];

then the system is globally asymptotically stable, with φ(y) ∈ [0, K], iff ∃ q ≥ 0 such that:

Re[(1 + jωq)G(jω)] + 1/K > 0   ∀ ω ∈ ℝ   (3.45)
This formulation permits an interesting graphical interpretation: if the Nyquist plot of the system
H(s) = (1 + sq)G(s) lies to the right of the vertical line that passes through the point −1/K, then the
closed-loop system is stable [46].
Loop Transformation

As seen in the previous section, if the non-linearity φ(y) belongs to the sector [0, K], then it
is possible to apply Popov's criterion. Nevertheless, even if the non-linearity belongs to a more general
sector of the form [K1, K2], it is still possible to apply the criterion by performing a loop transformation.
The idea is to transform the loop such that all the conditions for using the criterion are still satisfied. If
this transformation is applied, φ ∈ [K1, K2] is transformed into φ̃ ∈ [0, K2 − K1], and the theorem can
now be used, considering the new linear system G̃(s):

G̃(s) = G(s) / (1 + K1 G(s))   (3.46)
3.3.4.C Circle Criterion

This is a stronger and more general criterion for absolute stability problems, of which the Popov
criterion can be seen as a particular case (when q = 0); it allows the non-linearity
to be time-variant [4]. Thus, considering again the system in figure 3.26(a), if the following conditions
are satisfied:

• Matrix A has no eigenvalues on the jω axis and has p eigenvalues in the right half of the complex
plane;

• The non-linearity φ belongs to the sector [K1, K2] and can be time-variant;

• One of the following conditions is verified:

1. 0 < K1 ≤ K2: the Nyquist plot of G(jω) does not enter the disk D(K1, K2) and encircles it p
times in the counter-clockwise direction, where p is the number of poles of G(s) with positive
real parts;

2. 0 = K1 < K2: G(s) is Hurwitz and the Nyquist plot of G(jω) lies to the right of the vertical line
defined by Re[s] = −1/K2 (the Popov criterion for q = 0);

3. K1 < 0 < K2: G(s) is Hurwitz and the Nyquist plot of G(jω) lies in the interior of the disk
D(K1, K2);

then the closed-loop system is absolutely stable in the finite domain.
3.3.4.D Application of the Theorems to the System in Study
First, in order to apply one of the above theorems, we have to present the system in the state space
model fashion way. Thus, we have to model some real time functions in a transfer function, so we can
get the open loop transfer function. In sum, the system is composed by a PGA (whose gain is changed
through digital potentiometers), the input sinusoidal signal, a rectifier, a first order Low Pass filter and the
non-linear control block. Besides the non-linearity, the only blocks that introduce dynamic to the overall
system are the programmable gain amplifier and the filter. The sinusoid can influence the stability of the
system through its amplitude that we call Vin.
The variable gain amplifier can, in this case, be roughly modelled by a type-0 first-order transfer
function, its response being associated with the settling of the programmed gain:

G1(s) = (1/τ) / (s + 1/τ)   (3.47)

where τ is the time constant of a first-order system. To determine the value of this constant,
one could normally observe the step response of this block and take τ as the time the
output needs to go from 0 to 63.2% of its final value. However, this constant can be determined from the
datasheet of the digital rheostats. As stated before, the digital word D assumes values between
1 and 64, so the time constant can be obtained by considering the worst-case scenario (i.e., changing the
digital word of the device one step at a time, from 1 to 64). The AD5113's datasheet provides the
following times: t2 = 10 ns (CLK low time); t3 = 10 ns (CLK high time); and t4 = 15 ns (U/D setup time).
The first increment takes t = t2 + t3 + t4, whilst each following increment takes just t = t2 + t3, so the
full time is:

time = N × (t2 + t3) + t4   (3.48)

where N is the total number of increments needed. In the worst case, 63 increments are needed, so the full
time, and consequently τ, is:

τ = 63 × 20 ns + 15 ns = 1275 ns   (3.49)
This, in turn, yields the following first-order transfer function:

G1(s) = 785000 / (s + 785000)   (3.50)

This means the dynamics of the system will be governed by the low-pass filter, since the pole of its
transfer function is much closer to the imaginary axis:

G2(s) = 10 / (s + 10)   (3.51)
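The time-constant arithmetic of equations 3.48-3.49 is easy to verify; a sketch using the AD5113 timing figures quoted above:

```python
t2, t3, t4 = 10e-9, 10e-9, 15e-9   # CLK low, CLK high, U/D setup (seconds)

def full_time(n_increments):
    # eq. 3.48: the setup time t4 is paid once, each step costs t2 + t3
    return n_increments * (t2 + t3) + t4

tau = full_time(63)                # worst case: stepping the word from 1 to 64
pole = 1.0 / tau                   # pole of G1(s), approx. 7.85e5 rad/s
```

The pole at 1/τ ≈ 7.85 × 10⁵ rad/s is the value used in equation 3.50.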
Hence, for the stability analysis, the system can be represented by the following MATLAB Simulink [42]
block diagram:

[Figure: Simulink block diagram with the linear blocks in the forward path and the non-linear block in the feedback path.]
Figure 3.27: System for Stability Analysis
The diagram is organised in the way discussed earlier, that is, a linear system with a non-linear
feedback block (Lur'e Problem). The gain and the input sinusoid amplitude are fixed at the maximum
values those variables can assume (16 and 5 V of amplitude, respectively). One of the goals is to know
whether the system will be stable when the maximum possible gain is reached since, in such a case, it will
also be stable for lower gains.
(a) Non-Linearity Plot (b) Centered Non-Linearity Plot
Figure 3.28: Non-Linearity

The non-linearity is represented in figure 3.28(a). However, in order to apply the theorems, this
non-linearity must be contained in the 1st and 3rd quadrants, must be time-invariant (in the case of the Popov
criterion) and memoryless. A slight change can be made to the system to shift the graph and center
it around zero. Thus, by subtracting 200 from the input signal of the non-linearity, the graph of
figure 3.28(a) can be shifted to the left by 200 and consequently satisfy the criteria, with no loss of generality
or change in the dynamics of the overall system². Hence, the final plot of the non-linearity function
is presented in figure 3.28(b), which is an 'on-off' non-linearity with hysteresis and a dead zone.

By the definition of equation 3.44, the above function belongs to the sector:

φ ∈ [0, 1]   (3.52)

²The margins used for this analysis may not be the ones used physically. However, it does not matter which margins are chosen, since it is always possible to shift the characteristic to comply with the sector requirements.
Resorting to MatLab, the state-space model of the system in figure 3.27 is:

ẋ = Ax + Bu
y = Cx
u = −φ(y)   (3.53)

where

A = [ −10   1.286×10¹⁰ ; 0   −7.85×10⁵ ],  B = [ 0 ; 1 ],  C = [ 10   0 ]  and  D = [0]   (3.54)

Taking into account equation 3.43, this yields the following open-loop transfer function:

G(s) = 1.286×10¹¹ / [(s + 785000)(s + 10)]   (3.55)
The non-linearity is time-invariant, so the Popov criterion can be applied. Therefore, according to the
criterion in equation 3.45, taking K = 1 and q = 1, stability is guaranteed if the following inequality is
satisfied:

Re[(1 + jω)G(jω)] > −1/K = −1   ∀ ω ∈ ℝ   (3.56)

Figure 3.29(a) represents the Nyquist diagram for equation 3.57:

H(s) = (1 + s)G(s)   (3.57)
[Figure: Nyquist diagrams of H(s), with the vertical line at R = −1 marked.]
(a) Nyquist Diagram of H(s) for Positive Frequencies  (b) Zoom in Nyquist Diagram for H(s)
Figure 3.29: Nyquist Diagram
Taking a closer look at the neighbourhood of the vertical line at R = −1 (see figure 3.29(b)), it can be
observed that the diagram lies to the right of that line, so the Popov criterion states that this system is
absolutely stable for any non-linearity contained in the sector φ ∈ [0, 1]. Note that the particular choice of
sector resulted in a line that passes through the point R = −1, thus coincidentally matching the classic
Nyquist condition, which guarantees linear system stability. If another sector had been chosen, such as φ ∈ [0, 2]
(in fact, the non-linearity belongs to any sector [0, K1] with K1 ≥ 1), the system would also be stable. The
Popov criterion would fail to guarantee stability only for the sector [0, +∞[, since this would constrain the
Nyquist diagram to lie to the right of the imaginary axis and, as witnessed, that is not the case, at
least for q = 1.
3.4 Summary
This chapter addressed fundamental design aspects for the successful development of the Powermeter device. It started by introducing the architecture of the overall system, revealing to the reader how everything connects together (sensors, microcontroller, host, power rails) so that power measurements can be done seamlessly. The rails composing a common PSU were also studied, in order to ensure that Powermeter is suitable to sense any rail and component of a computing system. Furthermore, it was analysed how the different signals that a PSU works with (AC and DC) could be measured and conditioned, so that a precise and accurate measurement could be attained. Under this topic, a convenient approach to sense AC signals was developed: the AGC. This method comprises several electronic blocks, which filter and dynamically amplify/attenuate the sensed AC signal. With this methodology, it is possible to increase the ADC dynamic range, providing very accurate readings and preventing the loss of very low voltage signals.

A thorough analysis of the system was developed. MatLab was used as the supporting tool to design the controller and validate the system's performance under some relevant case examples. Since the dynamic range is tightly influenced by noise, a study of the system's overall noise was carried out, calculating the ADC's SNR, THD and SFDR, with and without oversampling. Finally, because the system relies on a non-linearity, a theoretical analysis of the system's stability was done, introducing some theorems and frequency-domain tools to deal with this kind of problem, known in the literature as Lur'e problems. The analysis showed that the system is absolutely stable for the chosen sector.

The next chapter will present details about the software API developed under the scope of this thesis, which implements the algorithms necessary for the energy/power computation.
4 Powermeter - Software/Firmware
Contents
4.1 Communication System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2 Firmware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
This chapter focuses on explaining how the microcontroller communicates with the host system, how the acquired data is treated and interpreted, and the algorithms used to allow synchronization between both systems. First, an introduction to the USB communication is provided; then, a study of the latency of the communication system is developed. The previously chosen sampling frequency is validated, taking into account all the constraints that limit it. In the last section of this chapter, the algorithm used to compute the energy of a running application is analysed. Finally, it is also shown how the final API can be used to read power samples.
4.1 Communication System
The communication between the microcontroller and the host is established through the Universal Serial Bus (USB) protocol. The PIC18FX455/X550 device family contains a full-speed and low-speed compatible USB Serial Interface Engine (SIE) that allows fast communication between any USB host and the PIC microcontroller. The full-speed mode was used, which grants at most 12 Mbps. The protocol has a hierarchy of descriptors, which permits, for instance, the configuration of a HID (Human Interface Device) or CDC (Communication Device Class) class device.

For this project, the USB transactions are performed using the HID class specification. The class report descriptor allows the user to specify the type of data to be transferred (bulk, interrupt, isochronous). This class was used since it is the usual choice for custom devices and is relatively simple to configure, while, at the same time, being able to provide a high throughput.
The bulk transfer type was used because it can transfer large amounts of data while guaranteeing data integrity. The communication is interrupt-driven, meaning that the host does not keep polling the device waiting for data, which leads to less communication overhead. The host communicates through USB by using the libusb library API [48], which is an open-source library. The library offers two main interfaces for device I/O: the Synchronous and the Asynchronous interfaces.
The Synchronous interface allows the user to perform a USB transfer with a single function call. When the function call returns, the transfer has completed and the user can parse the results. The user can call specific functions to transfer data, such as libusb_bulk_transfer() and libusb_interrupt_transfer(), which transfer data using a bulk and an interrupt endpoint, respectively. The main advantage of this model is simplicity: everything can be done with a single function call. However, the application will sleep inside a transfer until the transaction has completed; consequently, the entire thread is blocked for that duration.
This limitation can be a problem for a real-time demanding application. Fortunately, there is an interface that seeks to solve that problem: the Asynchronous interface. Asynchronous I/O is a more complex interface but, instead of providing functions that block until the I/O has completed, it presents non-blocking functions, which begin a transfer and then return immediately. Since this project demands a device that does not introduce a significant overhead on the normal user application, the interface used for exchanging large amounts of time-sensitive data was the Asynchronous one. Nevertheless, for specific cases, such as initiating the synchronization between clocks or starting and stopping the sampling process, the Synchronous interface was used, since these kinds of requests do not have time restrictions.
The synchronization between clocks is a requirement and a very important procedure of the system, since it allows associating every acquired sample with the time at which it was sampled, based on the clock of the host. This synchronization permits, for instance, getting an insight into when the power consumption on a server is more or less intensive. Thus, it is a main requirement if one wants to do real-time power characterization of an application. This question will be addressed in the following sections.
4.1.1 Latency
The latency of the system in transferring data is a major concern, since it limits the sampling frequency that the system can attain. Hence, some tests were performed on the latency of the system when exchanging packets of bytes between the microcontroller and the host. The HID class can send 64 bytes per packet. However, if the user requires more bytes to be transferred, the protocol divides the information among various packets of 64 bytes (i.e., for 64 bytes it sends 1 packet, whilst for 65 bytes it is forced to send 2 packets).
Before exploring the test that was conducted and its results, it must be said that, for this work, it was decided to transfer 256 bytes of data. The 256 bytes comprise the header and the time tag of every chunk of samples (5 bytes) and the payload (the acquired samples, which occupy 251 bytes). The 10-bit ADC outputs data through two registers of 1 byte (high-byte and low-byte registers). Thus, with 256 bytes, one can transfer 250/2 = 125 samples. Since the energy computation is partially executed in the MCU, more samples would require a larger accumulator, translating into a higher data processing time.
[Figure 4.1: Round-Trip Time — the host sends Message A (OUT) to the µC and receives Message A (IN) back.]
That being said, the test consisted in exchanging 1000 messages between both systems, by performing a so-called Round-Trip Time (RTT) measurement - see figure 4.1 - and by varying the amount of bytes transferred. Figure 4.2(a) reflects the results of that test, where the abscissa axis is the number of packets transferred. As expected, increasing the number of packets increases the latency, and the relationship between them is linear. The results showed that, for every extra packet, the latency increases by roughly 100 µs.
Figure 4.2(b) also provides a histogram for the exchange of 4 packets of data (256 bytes). This figure makes it possible to see whether all the messages take the same amount of time or whether some take more or less. In particular, the test allows identifying the worst-case time and computing the minimum sampling period that the system can attain. By examining the figure, it can be observed that the majority of messages take 556 µs. However, the worst-case time is 636 µs.
[Figure 4.2: Communication Tests — (a) USB latency as a function of the number of packets; (b) histogram of IN transfers of 256 B (times between 436 µs and 696 µs, with the mode at 556 µs).]
Thus, to calculate the minimum allowed sampling period, the worst-case time (636 µs) must be considered:

T_S,minimum × 125 ≥ 636 µs ⇔ T_S,minimum ≥ 5.09 µs (4.1)
4.1.2 Sampling Frequency Choice of the System
The reader may recall that the frequency of operation used for all the results in chapter 3 was FS = 3.3(3) kHz. That frequency was obtained by performing the following analysis.

The sampling frequency is limited by, at least, three aspects:

1. Minimum time that data takes to be transferred between the processors (T1 = 5.09 µs);

2. Time resolution allowed by the Timer0 module¹ (T2 = 16 µs);

3. Time necessary to compute data within the PIC.
The time necessary to process data is one of the constraints that limit the sampling frequency. This time comprises, among others, the time to manage all the required buffers to save, send and average data, the time required for the energy computation, and the time used by the controller stage to dynamically change the gain of the PGA.
The PIC18F4550 works at 48 MHz, but every instruction takes (typically) 4 clock cycles. Therefore, the instruction rate is 12 million instructions per second. The time spent by the microcontroller's firmware to process all the software routines was measured by using an oscilloscope and a flag within the main loop, whose value alternates between the logic values '0' and '1'. The obtained time was approximately Tspent1 = 900 µs (1.1 kHz). Consequently, the assembly code of the project was analysed and an effort was made to reduce this execution time Tspent1. The most used variables (including flags and buffers) were moved to the microcontroller's access bank, which is two times faster than the normal access mode, and loop unrolling (4x) was applied to the most critical loops. However, the bank has a limited space of 95 bytes and it is also used by the compiler to store temporary data. Therefore, not all data could be placed there, namely the buffer used to average the 32 samples of data. In sum, this resulted in a reduction to 273 µs but, to avoid working at the limit, 300 µs was used instead (FS = 3.3(3) kHz).

¹ The project uses a timer module (Timer0), embedded in the microcontroller, which acts as the clock of the microcontroller.
This is, after all, the bottleneck of the project as regards the sampling frequency of the system, when compared with the other limits referred to at the beginning. For an Intel Core i7 3770K, with 4 cores at 3.5 GHz, which issues 4 IPC, this is a very acceptable frequency, since it corresponds to an inspection window of 16.8 million instructions (the lower the better). This is much better than the resolution achieved by the PowerEgg [26], the WattsUp [27] or the approach discussed in [3], whose sampling frequency is 4 Hz, achieving a resolution of 14 billion instructions.
4.1.3 Types of Data Transferred
There are different kinds of data being transferred between host and device: Asynchronous data and Synchronous data (see table 4.1). As mentioned before, the synchronous interface is used just to initiate some routines, such as the time synchronization between clocks and the sampling process. The asynchronous interface is called to exchange large amounts of data over time between the microcontroller and the host, including data regarding the clocks' synchronization and the samples acquired by the microcontroller.
Asynchronous Transfers: Time Stamps and Sampled Data; Time Synchronization Data
Synchronous Transfers: Initialize Clocks Sync and Sampling Process

Table 4.1: Types of Data Transferred
Hence, it is necessary to distinguish the data sent/received by both systems (microcontroller and host). Therefore, the messages are divided into a Header and a Payload. The Header makes it possible to differentiate the type of data being transferred, while the Payload is the data itself that the device/host wants to transmit. In asynchronous mode, the buffer size is at most 256 bytes, while in synchronous mode it is just 5 bytes. Tables 4.2 and 4.3 show how data is divided in the synchronous and asynchronous modes, respectively.

Bits 0-7: Header | 8-15: Channel 0 | 16-23: Channel 1 | 24-31: Channel 2 | 32-39: Channel 3

Table 4.2: Synchronous Data Structure
In the synchronous structure, the byte named Header can carry the NTP_INIT, START or STOP commands (used to start the clock synchronization, and to start and stop the sampling process, respectively). The payload comprises the bytes referring to Channels 0, 1, 2 and 3, which can be any of the available channels to sample (V230, CPU, HDD and GPU), chosen by the user.
In the case of the asynchronous data structure, the Header byte can be NTP_HEADER or DATA_HEADER, referring to the two kinds of asynchronous data - Time Synchronization Data and Time Stamps and Sampled Data, respectively. The payload is the data exchanged between both processors, which can be data about the clocks' synchronization or samples acquired during the sampling process.

Byte offset 00h: Header (1 byte) + Payload (63 bytes)
Byte offset 40h: Payload (64 bytes)
Byte offset 80h: Payload (64 bytes)
Byte offset C0h: Payload (64 bytes)

Table 4.3: Asynchronous Data Structure
4.1.3.A Synchronous - Clocks Synchronization and Sampling Process Initialization
The clocks' synchronization and the sampling process initialization are done by using the synchronous structure. Table 4.4 presents an example of the structure used to request the start of the sampling activity for the CPU and GPU channels. Channels that are not requested are filled with NOP values. All the commands and headers are specific identifiers, common to both the MCU's and the host's applications, and are referenced through defines saved in a C header file. CPU_CHANNEL and GPU_CHANNEL are macros which refer to one of the thirteen analog input channels (AN0-AN12) of the used microcontroller, to which the sensor outputs are connected.
Bits 0-39: START | CPU_CHANNEL | GPU_CHANNEL | NOP | NOP

Table 4.4: Sampling Process Initialization Command Example
4.1.3.B Asynchronous - Time Synchronization Data

Bits 0-71: NTP_HEADER (1 byte) | Time (8 bytes)

Table 4.5: Host and PIC's Clock Synchronization Data Structure

Asynchronous communication uses a buffer of 256 bytes, divided into packets of 64 bytes. However, the clocks' synchronization process only needs 9 bytes of information (1 byte for the Header and the remaining slots for the Time data) - see table 4.5.
The clocks' synchronization is achieved by using a protocol that resembles the Network Time Protocol (NTP) algorithm, in a client-server fashion. The algorithm has four distinct time stamps: Originate Time Stamp - Time Request Sent by Client (T1); Receive Time Stamp - Time Received by Server (T2); Transmit Time Stamp - Time Reply Sent by Server (T3); and Destination Time Stamp - Time Reply Received by Client (T4).
[Figure 4.3: Synchronization Protocol — the client sends an NTP request at T1, the server receives it at T2 and replies with ACK + T2 + T3 at T3, and the client receives the reply at T4.]
In this situation, the Client is the microcontroller and the Server is the host. Figure 4.3 shows the normal steps for synchronization. The Client requests a synchronization process, saving time T1; then the Server acknowledges and sends times T2 and T3 to the Client. With those time stamps, one can determine the delay (the time elapsed since a message is sent until it arrives at its destination) inherent in a Client-Server communication, by resorting to the following formula:
Delay = (T4 − T1)− (T3 − T2) (4.2)
To compute the time difference between the two clocks (i.e., the offset), we start by assuming that there is no asymmetry in the communication, that is, the Client → Server time and the Server → Client time are the same:

Offset = T2 − (T1 + Delay/2) = [(T2 − T1) + (T3 − T4)]/2 (4.3)
If the offset is positive, the server's clock is ahead with respect to the client's, so the Offset value must be added to the client's time-stamp. On the other hand, if the offset is negative, then the server's clock is lagging with respect to the client's, and the Offset value must be subtracted from the client's time-stamp.
Hence, with the knowledge of the Offset value, we can adjust the clock used to generate the time stamp within the microcontroller and synchronize it with the host's clock. The synchronization process runs long enough to guarantee an offset in the order of microseconds. Although this process allows a very accurate synchronization, over time the two clocks will inevitably lose synchronism. This is mainly due to the asymmetric routes and the network congestion during the host's clock update with the internet time (NTP). This asynchronism is practically linear with time, as can be seen in figure 4.4, where the offset value increases with time.
[Figure 4.4: Offset Change Over Time — the offset grows approximately linearly over 70 minutes; the linear regression slope is 645.78 µs per minute.]
The linear regression of the data gives the rate at which the offset changes. According to it, the offset increases approximately 645.78 µs per minute. Thus, we can adjust the microcontroller's clock to counteract this deviation by adding 645.78 µs every minute. At the expense of increasing the time overhead, it was decided to synchronize the clocks every time a packet is received, allowing a better synchronism, which is an important requirement for real-time power profiling.
4.1.3.C Asynchronous - Time Stamps and Sampled Data

Byte offset 00h: DATA_HEADER (1 byte) | Time Stamp (4 bytes) | Samples
Byte offset 40h: Samples (64 bytes)
Byte offset 80h: Samples (64 bytes)
Byte offset C0h: Samples (64 bytes)

Table 4.6: Sampling Process Data Structure
The data exchanged using the asynchronous interface is the type of data that the application will handle most of the time. Every time the host receives this type of data, it will receive a packet containing not only the specific header, but also a payload containing a time stamp - representing the time the first power sample was taken - and N power samples, which can be related to data coming from one or more sensors. As mentioned, the buffer has a capacity of 256 bytes of data (see table 4.6). Thus, excluding the Header and the Time Stamp, which together occupy 5 bytes, up to 125 data samples can be transferred within a 256-byte buffer (because each sample occupies 2 bytes of data). For instance, for N = 100, if the CPU sensor and the GPU sensor are being sampled, then the received packet will have in total N/2 = 50 samples from the CPU sensor and 50 samples from the GPU sensor. This, however, does not translate into a reduction of the sampling frequency; it only means that data will be exchanged more frequently between the host and the microcontroller.
The time stamp associated with every chunk of data (i.e., N samples) is obtained by using another timer module (Timer1), with a period of 5 µs. This variable is a 32-bit integer and it is synchronized with the lower 32 bits of the host's real-time clock (in microseconds). Since the time stamp resolution is in microseconds and the variable is a 32-bit unsigned integer, the stamp will roll over at some point in time. Despite that, this does not compromise the ability to correctly tag the data, since the lower 32 bits of the host's clock are also used to stamp the data. In fact, every time the host receives a packet, it saves the time of arrival of that packet in a variable. This makes it possible to compare that time with the incoming time stamp, find out whether there was a roll-over and, if that was the case, correct it.
4.2 Firmware
[Figure 4.5: System Diagram - Microcontroller (Acquisition Board ↔ µC ↔ Host)]
The firmware comprises the set of instructions that are preprogrammed in an embedded system (the microcontroller). This set of instructions allows the communication between the hardware system (acquisition board) and the host, establishing a bridge between both systems (figure 4.5). In the following, some of the strategies and procedures used to process data with the PIC microcontroller are introduced.
4.2.1 Buffering Strategy
[Figure 4.6: Dual-Buffer Strategy (A_buffer and B_buffer alternate between storing and sending)]
A dual-buffer strategy (figure 4.6) was used to save and send the sampled data: every sample is saved in a buffer (A_buffer) of size 256 bytes. When this buffer is filled with data, it must be sent to the host while still allowing the sampling process to continue seamlessly. Consequently, another buffer (B_buffer) is necessary to save the incoming samples whilst A_buffer is being used only to transfer data. When B_buffer is filled, it is used to send data while A_buffer becomes responsible for storing the sampled data. Thus, the buffer used to store or to send data alternates between A_buffer and B_buffer over time.
4.2.2 Oversampling and Maximum Search Algorithm
Figure 4.7: Oversampling with Rolling Buffer
As referred before, in chapter 3, oversampling of the acquired samples was performed at f = 3.3(3) kHz, in order to improve the DR. Hence, a rolling average buffer of 8 samples was used (figure 4.7): this buffer has a pointer, which returns to the start of the vector as soon as the buffer is filled with data. While the buffer is not completely filled with samples, the program sends the samples directly, without averaging them. This allows a constant sending of useful data, even at the start of the program. When a new sample arrives, it is added to the accumulator after the oldest sample is subtracted from it. Finally, the output of the accumulator is averaged and the oversampling process is complete.
In the case of the AC sensor, after the oversampling technique, a search for the maximum of the acquired data is also necessary. This is needed because, in order to compute the power demanded from the power supply, the amplitude of the current signal must be known, so that it can be multiplied by the 230 VRMS and by the power factor (= 0.99).
The pseudocode for the maximum search is depicted in algorithm 4.1, which illustrates just the main part of the algorithm. This algorithm returns the peak of the sine wave acquired during AC current sensing. The algorithm is rather simple, fast and, most importantly, effective. It follows the acquired signal and tests whether the current sample is higher than the previous one; if it is, it is declared the temporary maximum. Then, it tests whether the current sample is lower than the temporary maximum, taking into account the delta parameter, which was set to 10 quantization levels after some experiments. If this comes to be true, then the temporary maximum is declared an absolute maximum and a search for the minimum value is conducted from this point on. These two routines are inextricably linked, and the maximum can be found because the search for the minimum updates the temporary maximum value.
Algorithm 4.1 Maximum Search
1: if lookformax then
2:   if Current Sample < temporary_max − delta then
3:     New absolute maximum found
4:     lookformax = 0
5:   end if
6: else
7:   if Current Sample > temporary_min + delta then
8:     New absolute minimum found
9:     lookformax = 1
10:  end if
11: end if
4.3 Software
[Figure 4.8: System Diagram - Host (Acquisition Board ↔ µC ↔ Host)]
The software application comprises the set of instructions that run on the host, necessary to establish the communication with the microcontroller and to process and output relevant data to the user (figure 4.8). This section presents the interfaces, functions and procedures used in the development of the Powermeter application that the end-user has access to. First, the API functions used to output the energy are presented; afterwards, it is explored how the energy computation (and other routines) work together to successfully return the energy and power samples of an application.
4.3.1 Powermeter Application Programming Interface
The API returns the energy that was spent between two points of the code of the user application. For this, the user has to call three major functions, which use the synchronous structure introduced in section 4.1.3:

• powermeter_api_init(char channel1, char channel2, char channel3, char channel4);

• powermeter_api_start(void);

• powermeter_api_stop(void).

To use the first function, the user must choose which channel(s) to sample (CPU, GPU0, HDD or V230). For channels that the user does not want to sample, NOP must be used as the argument. For example, if only the CPU channel must be sensed, the call must be: powermeter_api_init(CPU, NOP, NOP, NOP). This function executes every necessary initialization, including the libusb library initialization, vector and file initialization, and the synchronization between clocks.
The powermeter_api_start(void) function gives the order to start the sampling process, whilst the powermeter_api_stop(void) function, besides stopping the sampling process, also closes the opened files (used to store, in different files, all the samples and time stamps for post-analysis), frees the allocated memory and performs the energy calculation. The latter indirectly calls the energy_calc(long long end_time) function, whose goal is to return the energy spent between the start and stop calls. Table 4.7 summarizes the main functions and their respective features.
Function | Arguments | Features
powermeter_api_init | channels to sample | General initializations (libusb, variables, clock sync)
powermeter_api_start | void | Starts data sampling
powermeter_api_stop | void | Stops sampling, frees allocated memory, calculates energy

Table 4.7: Main API Functions
4.3.2 Energy Calculation
The energy is calculated from the power curve by integrating it over time. Since the operation occurs in the discrete-time domain, the energy spent between the start and stop commands must be computed by using numerical integration methods. There are various methods available to perform one-dimensional integration, based on interpolation functions: the Rectangle Rule (order-0 polynomial), the Trapezoidal Rule (order-1 polynomial) and Simpson's Rule (order-2 polynomial).
Energy = Σ_{n=1}^{N} T_Sampling × (Pwr_{n−1} + Pwr_n)/2 (4.4)

[Figure 4.9: Trapezoidal Rule — the power curve y = f(x) is approximated by trapezoids between consecutive samples Pwr_{n−1}, Pwr_n, ..., Pwr_{N−1}, Pwr_N.]
For this work, the trapezoidal integration rule was used (see figure 4.9), since it is a method which leads to a small integration error when compared to the rectangle rule, and also because it is simple. Simpson's rule would be a better choice to reduce the error. However, it would require more sums, multiplications and divisions by numbers which are not powers of 2, and that constitutes a problem for the PIC18F4550's architecture.
The algorithm for the energy computation is presented as pseudocode in algorithm 4.2. The algorithm starts by adding the energy of some batches of samples, previously calculated in the microcontroller before sending the entire packet to the host: with this procedure, the CPU is not overloaded with extra computation, yielding less power and time overhead when running the Powermeter API.
Algorithm 4.2 Energy Calculation
1: batch_number = index / N_samples
2: for Energy batches below batch_number do
3:   Total_Energy += Energy_Batch
4: end for
5: Calculate Remaining Energy using Trapezoidal Rule
6: return Total_Energy
[Figure 4.10: Energy Computing — batches B_0, B_1, ..., B_N over power samples Pwr_0 ... Pwr_N, where B_n refers to the energy batch of the nth packet and batch_number marks the last full batch.]
The energy batches received from the microcontroller are stored in order in a vector. Each batch corresponds to the energy of a full packet of N received samples, so there is a correlation between the index of a time stamp and the batch it belongs to. For instance, suppose the host receives 100 samples per packet and, at the end, it has received 200 samples of data (resulting in 2 energy batches), with the time at which the sampling process ceased referring to sample number 110. Then, doing an integer division by the number of samples per packet (110/100 = 1), we know that the energy of the first batch can be added directly (observe figure 4.10). From that point on, the energy must be computed by resorting to a numerical integration method, stopping only when the power sample associated with the end time stamp (the time at which the sampling process stopped) is reached. Finally, the total energy is returned.
4.3.2.A Time Stamp Search
Besides the matter of how, specifically, the energy computation is done, there is also the issue of finding the time stamp corresponding to the time at which the sampling process terminated. Thus, a search algorithm is necessary to find the closest time stamp to that time. The pseudocode in algorithm 4.3 presents the algorithm used to solve the problem². To complement it, figure 4.11 illustrates the search process in a block diagram.
Essentially, the algorithm starts by estimating the index of the vector where the end time is likely to be: that is done with the pseudocode in line 1. The reasoning behind it is that the time stamps are equally spaced by the sampling period (T_sampling); thus, subtracting the first time stamp from the end time and dividing by T_sampling shall give a good estimate of the wanted index. Nevertheless, it is likely that the obtained index is not the best choice, so a linear search is conducted over the vector, either in an increasing or decreasing direction, as illustrated in figure 4.11. When a possible time stamp is found by the algorithm, a final comparison is performed between the end time and the two closest time stamps, to discover the one which produces the least absolute error.
² The pseudocode only presents half of the original source code; the other half is similar but, instead, it searches for the time stamp on vector entries above the index rather than below it. The time stamps are saved in a structure containing a vector with the data and a variable providing the actual size of that vector.
Algorithm 4.3 Time Stamp Search
1: index ← (end_time − start_time)/T_sampling
2: if time[index] > end_time then
3:   for i ← index − 1 to 0 do
4:     if time[i] <= end_time then
5:       if time[i + 1] is closer to end_time than time[i] then
6:         index ← i + 1
7:         break
8:       else
9:         index ← i
10:        break
11:      end if
12:    end if
13:  end for
14: end if
[Figure 4.11: Case scenario when time[index] = Tn > end_time (where Tn = T0 + (n − 1) × T_Sampling): the search proceeds downwards from index to i.]
4.4 Summary
In this section, the reader was introduced to subjects about the MCU and the host communication
and the development of the MCU’s firmware and host’s API. Firstly, a briefing about USB communication
was presented, explaining, roughly, how it works how it connects to the rest of the work. For instance,
the latency of USB transactions limit the maximum sampling frequency, that can be attained.
Afterwards, some aspects regarding the microcontroller's firmware were detailed, such as the
buffering strategy used and how the sampling frequency can be configured. The several types of data
interchanged between the host and the microcontroller were also described, along with how each side
"interprets" each type by verifying the header of each received message. The synchronization process was
explained, which consists of a set of messages traded between the two systems in order to synchronize
their clocks.
In the final section of this chapter, the general structure of the Powermeter API was discussed. Within
this subject, the reader had a grasp of the main functions used to operate the tool, understanding
their purposes and in what contexts they must be used. It was also specified how the energy computation
is done, resorting to a numerical integration method. Finally, a description of some important
algorithms (Time Stamp Search and Energy Computation) was provided.
5 Results
Contents
5.1 Calibration
5.2 Non-Linearity Thresholds
5.3 Power Profiling
5.4 PC Energy Consumption Characterization
5.5 Summary
In this chapter, several results are provided. First, the results of the sensor calibration and of the
loop stability tests are analysed, specifying the calibration process and the test conditions. Then, the
device is used to profile specific benchmarks (FFT, LU Matrix Decomposition and RADIX) on a Personal
Computer, and the results are compared with the ones obtained with RAPL in terms of time and power
consumed by the workload. Temporal charts with the instantaneous power measured by both applications
are also given, showing the differences between them. Finally, the device is used to characterize the
power consumed by a common machine under the stress of specific workloads. The results reveal which
rails are correlated with the power demanded by the modules of a computer system (CPU, HDD,
RAM, NIC and others).
5.1 Calibration
5.1.1 Sensors and ADC Calibration
The ADC was calibrated by varying the input voltage with a voltage source, starting from 0 V and
increasing in 200 mV steps until VRef = 5.03 V was reached. A linear regression
model was fitted to the data. Figure 5.1(a) shows the results of the test.
[Figure: (a) ADC digital word vs. input voltage (V); (b) AC sensor output current vs. input current (A); both plots show real and ideal values]
Figure 5.1: ADC and Sensor Calibration
Equation 5.1 represents the linear regression of the data and is used to convert the digital values
back to voltage. The variable D refers to the digital word output by the ADC and VReal is the
voltage obtained after calibration.

VReal = (5033.425088/1024) × D + 7.46460873 mV (5.1)
IReal = 1.0172× IAcq + 0.0848 A (5.2)
A calibration of the conditioning board was also conducted by varying the sensor input current, starting from 0 A to 2 A with steps of 100 mA, using a current source (see figure 5.1(b)). After the data
had been acquired by the ADC, it was converted back to current values and compared with the ideal ones.
For the AC current sensor, the current values are corrected using equation 5.2, where IAcq is the
measured current and IReal is the corrected value. The AGC introduces an offset
for every gain setting, so a zero calibration had to be performed: the inputs of the AGC were shorted to ground
and a conversion to digital values was carried out for every gain setting. Those values are stored in a
look-up table for easy access after every ADC conversion.
Gain         |  1 |  2 |  4 |  8 | 16
Offset (LSB) |  0 |  1 |  2 |  5 | 11

Table 5.1: AGC Calibration
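As a sketch of how these calibrations combine on the host side, the conversion of a raw ADC word into a calibrated value might look as follows (helper names are hypothetical; only equation 5.1, equation 5.2 and the offsets of table 5.1 are taken from the text):

```python
# Offsets (in LSB) measured with the AGC inputs shorted, per gain setting
# (table 5.1).
AGC_OFFSET_LSB = {1: 0, 2: 1, 4: 2, 8: 5, 16: 11}

def adc_to_voltage_mv(word, gain):
    """10-bit ADC word to millivolts (equation 5.1), after removing the
    gain-dependent AGC offset from the look-up table."""
    word -= AGC_OFFSET_LSB[gain]
    return (5033.425088 / 1024) * word + 7.46460873

def correct_ac_current(i_acq):
    """AC sensor calibration of equation 5.2 (amperes)."""
    return 1.0172 * i_acq + 0.0848
```

Storing the AGC offsets in a look-up table, as described above, keeps the per-sample correction to a single subtraction.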
5.2 Non-Linearity Thresholds
In chapter 3, the stability of the proposed AGC system was studied based on generic threshold
values. In this section, it is explained how to derive the best threshold values, so that chattering
(i.e., rapidly repeated gain switching around the threshold value) does not occur while the highest
gain is always attained, so that less noise and fewer measurement errors are introduced. These threshold
values were used to obtain the results presented in later sections.
Two main thresholds were calculated based on the amplitude reached by the AC current after running
some benchmarks (between 0.45 A and 0.53 A, corresponding to 30 mV and 50 mV, respectively).
They are the lower limit T1 = 715 and the upper limit T2 = 959: these thresholds guarantee that every
signal will remain between these limits, which correspond to 3.5 V and 4.7 V, respectively. Moreover, a
third limit was calculated: the maximum gain limit T0 = 540. For every signal amplitude lower than the
limit T0, the gain jumps from G=1 to G=16 directly, allowing a faster loop response and an amplitude
close to the ADC's full range. This limit (in voltage) corresponds to 2.64 V but, without the DC offset,
it actually corresponds to 137 mV. As the reader may recall from section 3.3, the output range of the
current sensors after being amplified by the bandpass filter was found to be [120 mV, 200 mV]. Therefore,
the T0 limit shall always guarantee the maximum amplification gain for that range, without saturating
the output signal. Nevertheless, if the signal after amplification gets above the upper
limit T2, the controller reduces the gain continuously until the amplitude lies within the allowed margins.
In the worst case, this means going from G=16 to G=1/4, which can take 1.275 ms.
Hence, the thresholds were chosen carefully, creating enough hysteresis to prevent chattering and
to always guarantee the maximum signal amplitude at the ADC's terminals, thus improving the DR. Tests
have been conducted to prove the stability of the loop, by injecting a 50 Hz sine wave and
varying its amplitude.
Figure 5.2(a) shows the case where a sine wave of amplitude 330 mV is amplified with a gain of 4.
In figure 5.2(b), an input sine wave of amplitude 3.9 V is attenuated with a gain of 1/2.
(a) Sine Wave Amplification (b) Sine Wave Attenuation
Figure 5.2: Stability Proof
The tests evidence that the loop achieves stability (i.e., does not enter chattering) and at the
same time increases the system's DR, by applying the best gain to the input signal at all times.
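The hysteresis rule described above can be sketched as a simple decision function. This is a simplification, not the actual controller code; the gain ladder with the attenuation settings 1/4 and 1/2 is an assumption inferred from the worst-case gain reduction mentioned in the text:

```python
# AGC decision thresholds (ADC codes) and assumed gain ladder.
T0, T1, T2 = 540, 715, 959
GAINS = [0.25, 0.5, 1, 2, 4, 8, 16]

def next_gain(gain, peak_code):
    """One step of a simplified AGC gain-update rule with hysteresis."""
    i = GAINS.index(gain)
    if peak_code > T2 and i > 0:
        return GAINS[i - 1]              # too large: step the gain down
    if peak_code < T0 and gain == 1:
        return 16                        # very small at G=1: jump straight to G=16
    if peak_code < T1 and i < len(GAINS) - 1:
        return GAINS[i + 1]              # below the lower limit: step up
    return gain                          # inside [T1, T2]: hold (hysteresis band)
```

The hold branch between T1 and T2 is what provides the hysteresis: once the amplified signal sits inside that band, no gain change is issued and chattering cannot occur.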
5.3 Power Profiling
The main goal of this section is to understand how the measuring methods (internal and external
measurements) influence algorithm execution (time overhead) and also to compare the readings given by
Powermeter and RAPL. The measurements obtained with RAPL were performed with a time resolution of 2
ms. In fact, although Intel [37] reports a maximum update rate of 1 kHz, tests with that time
period revealed that some readings were zero. As a result, it was decided to use an update rate
of 500 Hz.
5.3.1 SPLASH-2 Benchmark Tests
The SPLASH-2 benchmark suite, included in the PARSEC suite [31], was used for this evaluation. This
suite offers a variety of workloads: for benchmarking purposes, the FFT, LU Matrix Decomposition
and RADIX sorter workloads were used. These workloads and their respective Makefiles
were modified to include the calls to the Powermeter API or to RAPL, and each algorithm was
executed in a loop 1000 times in a row. The results were averaged to attain statistical significance.
As said before, the main focus of these tests was to evaluate the execution time with and without the power
measurement systems and to compare the energy consumption reported by Powermeter and RAPL.
1. FFT: for this workload, the tests used a data set of 2^20 complex numbers and the
algorithm was distributed over 4 threads (one per physical core);
2. LU: a 1024x1024 matrix of doubles was used as input and the algorithm was distributed over 4
threads (one per physical core);
3. RADIX: for the RADIX workload, a data set of 4292608 32-bit random integers was sorted and the
algorithm was distributed over 4 threads (one per physical core).
Workload | Total Time (µs) | Total Time w/ Powermeter (µs) | Time Overhead (%) | Energy (J)
FFT      | 76639298        | 77474667                      | 1.09              | 2757.21
LU       | 98014713        | 99266712                      | 1.27              | 4742.26
RADIX    | 96320861        | 97226277                      | 0.94              | 4023.43

Table 5.2: Powermeter Time Overhead

Workload | Total Time w/ RAPL (µs) | Time Overhead (%) | Energy (J)
FFT      | 76826729                | 0.24              | 2277.01
LU       | 98648873                | 0.55              | 4257.96
RADIX    | 96438625                | 0.12              | 3698.20

Table 5.3: RAPL Time Overhead
Observing tables 5.2 and 5.3, it can be concluded that the overhead due to the Powermeter or RAPL
metering systems is not significant, since in the worst case they take 1.27 % and 0.55 % more time, respectively,
than the original source code. However, possibly due to the higher amount of data exchanged and
computed (Powermeter works at f = 3.3(3) kHz, whilst RAPL works at f = 500 Hz), Powermeter introduces
more time overhead than RAPL. Regarding the energy measurements of both devices, there is a clear
difference between Powermeter and RAPL, with the former indicating higher energy values.
This is not strange, since Powermeter runs for a longer period of time, has a higher resolution than RAPL and,
unlike RAPL, is not based on PMCs; but mainly it is due to the fact that RAPL does not report all the energy
consumed by the uncore of the CPU.
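The overhead percentages in tables 5.2 and 5.3 follow directly from the timing columns; for instance, for the FFT row of table 5.2:

```python
def time_overhead_pct(t_instrumented_us, t_baseline_us):
    """Relative execution-time overhead of an instrumented run, in percent."""
    return 100.0 * (t_instrumented_us - t_baseline_us) / t_baseline_us

# FFT row of table 5.2: baseline run vs. Powermeter-instrumented run.
fft_overhead = time_overhead_pct(77474667, 76639298)  # ≈ 1.09 %
```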
5.3.2 NAS Parallel Benchmark Tests
In turn, to evaluate the power profiling and the trade-off between energy and time consumption,
the NAS Parallel Benchmarks (NPB) with OpenMP [28] were used. This suite includes
workloads representative of scientific applications, like FT (3-D Fast Fourier Transform), BT (Block Tri-diagonal solver
of the Navier-Stokes equations) and CG (Conjugate Gradient method to compute an approximation to the
smallest eigenvalue of a large, sparse, unstructured matrix). Version NPB 3.3.1 was compiled; it
offers different benchmark classes, S, W, A, B, C, D, E and F, sorted from the smallest to the largest
problem size. In the conducted tests, only class B was used, which offers a standard problem size and
a reasonably high number of iterations (depending on the workload), thus allowing an
application to be profiled for a satisfactory period of time.
Figures 5.3(a) and 5.3(b) present power profiling graphs of the NPB FT benchmark, compiled with -O3
and CLASS=B and run with 4 threads, one per core. In this case, class B translates into a
problem size of a 512x256x256 grid and 20 iterations. Both RAPL and Powermeter were used to profile
this workload. Figure 5.4 shows the different power patterns of the CPU, RAM and HDD over
time for the EP and BT benchmarks.
[Figure: CPU power (W) vs. time (s) measured by Powermeter and RAPL; (a) full run, with startup, initialization and iteration phases marked; (b) in-depth view of one iteration, t ≈ 7.7 s to 8.6 s]
Figure 5.3: Power Profiling
In figure 5.3(a), the fluctuations of the power drained by the CPU over time are clear. It was
found that the number of valleys corresponds exactly to the process iterations executed by the benchmark
(N = 20). The application starts with a warm-up phase and an initialization phase, followed by
N iterations (for CLASS B, N = 20). Both RAPL and Powermeter follow the power variations along
time and it is possible to distinguish the computational stage (when power is at its peak) and the
communication stage (when power comes down). Even so, the RAPL counters indicate less power consumption
than Powermeter and a huge amount of spikes (some of them reaching values above the maximum
power consumption specified by Intel [49]). An in-depth view of one of the process iterations reveals
more details, as can be observed in figure 5.3(b). In this figure, the difference in resolution between
RAPL and Powermeter is clear: while the former outputs an average of the measured power, the
latter indicates the real power at each instant in time and, thereby, many more fluctuations in the power
behavior are visible; three small valleys can also be distinguished in each iteration, which must be related
to inner subroutines of the FT benchmark. Thus, while with Powermeter it is possible to distinguish
several different patterns in the power demanded by an application, with RAPL such level of detail is
lost, making it inapt for the real-time power characterization of high-complexity workloads.
Figures 5.4(a) and 5.4(b) show the power profiling results of the EP and BT benchmarks, illustrating
the power consumption of the CPU, HDD and RAM (for better visualization of these figures, please refer to
figures A.6 and A.7 in the appendix). For each benchmark, the results are displayed
for 1, 2, 4 and 8 threads, running on different processors. The corresponding source code was compiled
with gfortran and the -O3 flag and linked to the Powermeter API. Each subfigure focuses on the power usage for
the first few seconds of the test, in order to clearly show the resulting power behavior patterns.
The embarrassingly parallel benchmark (EP) is essentially computation intensive and communication
free. It consumes a consistent amount of power during its entire execution, since it
is perfectly balanced and each thread executes a CPU-intensive job. By contrast, the BT benchmark
(represented in figure 5.4(b)) is more memory-intensive. As a consequence, its power pattern is not
consistent over the test run. It was found that the power consumed by the CPU and the memory is
[Figure: power (W) of CPU, HDD and RAM (5V) vs. time (s) for 1, 2, 4 and 8 threads; (a) EP CLASS=B, (b) BT CLASS=B]
Figure 5.4: Profiling of NPB Benchmark
interrelated in a way that when memory power goes up, CPU power goes down and vice-versa. In addition,
the number of valleys corresponds exactly to the number of iterations of the workload (in this case N = 200).
Neither test is HDD-intensive, so there are few disk accesses and, therefore, the disk consumes a
constant amount of power over time.
There is a trade-off between energy consumption and time performance that should be considered
to determine the best configuration in number of cores, based on the user's needs. For performance-constrained
systems, the best operating points are those that minimize execution time. For power-constrained
systems, the best operating points are those that minimize power or energy consumption.
For systems where energy efficiency must be optimized or power and performance must be balanced, the
appropriate metric must capture whether the performance gain is worth the additional power requirement.
Thus, the metric used in [50] can be applied to understand the trade-off between energy consumption and
time performance. The metric is called Energy-Delay Product (EDP) and is defined as the product
of the time necessary to execute the code and the respective consumed energy. Thereby, the smaller the
EDP a configuration achieves for an application, the better the efficiency of that configuration for that
application. Figure 5.5 presents the results of this metric for the LU and MG benchmarks, for
1, 2, 4 and 8 threads. The results are normalized with respect to the value obtained for 1 thread (i.e.,
EDP_N = (E_N × D_N)/(E_1 × D_1), where N refers to the number of threads).
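The normalized EDP can be computed directly from the measured energy and delay pairs; for instance, comparing the LU class-B figures quoted in this section (8 threads: D = 81.7 s, E = 4547 J; 4 threads: D = 82.5 s, E = 4233 J):

```python
def normalized_edp(e_n, d_n, e_ref, d_ref):
    """EDP ratio (E_N * D_N) / (E_ref * D_ref); values above 1 mean the
    reference configuration is the more efficient of the two."""
    return (e_n * d_n) / (e_ref * d_ref)

# LU class B: 8 threads vs. 4 threads (figures from the text).
edp_8_vs_4 = normalized_edp(4547, 81.7, 4233, 82.5)  # > 1, so 4 threads win
```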
For the best time performance of the LU benchmark, the code should be divided into 8
threads (D = 81.7 s and E = 4547 J), but for the lowest energy cost it should be divided into 4 threads
(D = 82.5 s and E = 4233 J). According to the results in figure 5.5, the best configuration is to
divide the algorithm into 4 threads, evenly over the 4 physical cores. Therefore, the metric gave priority to
the energy cost at the expense of the small loss in time performance.
In the case of the MG benchmark, the lowest energy cost and the best time performance are both
obtained by dividing the workload into 4 threads (D = 6.5 s and E = 228 J). The EDP metric agrees and
states that the best configuration is to divide it into 4 threads as well.
[Figure: normalized EDP vs. number of threads (1, 2, 4, 8) for the LU and MG benchmarks]
Figure 5.5: EDP Metric
5.4 PC Energy Consumption Characterization
Powermeter was tested on Liliana, a machine hosted by the SiPS group at INESC-ID Lisbon. The machine
features the following characteristics:
Module | Component
MB     | ASUS P8Z77-V LX
CPU    | 3.5 GHz Intel i7 3770K (Ivy Bridge architecture - TDP = 77 W)
RAM    | G.Skill Sniper DDR3 2x8 GB - 1.866 GHz
HDD    | Seagate Barracuda - 2 TB
PSU    | CORSAIR TX750

Table 5.4: Machine's Characteristics
In this section, the goal is to characterize the power consumption of the aforementioned
machine. A few tests were performed, not only to discover which connectors/rails from the PSU are directly
correlated with the power drawn by the CPU or the RAM, for instance, but also to understand how
significant that power consumption is when compared with the total power consumed by the machine.
The first test consists simply in running the API while the machine is in its idle state (i.e., no intensive
workload is running). The power spent while a system is in the idle state accounts for a very large share of
the total power dissipation; however, this power is not considered as used for computing. Active power
corresponds to the extra power dissipated when the system is no longer in idle mode, but in active
mode.
1. CPU: the CPU is powered directly by the four 12V cables connecting to the EPS12V connector. This was
confirmed by experiments and by the ATX12V power supply design guide, so the sensors were
connected directly to these cables. For this test, the LU matrix decomposition provided by the SPLASH-2
benchmark [31] was used. The test consisted in the decomposition of a 4096x4096 matrix of doubles
(128 MB of data) and it was run in a loop 25 times, so that most of the necessary data would be
present in the CPU cache, decreasing memory accesses. The number of processors used during the
test was also varied, so the benchmark was run for P = 1, P = 2, P = 4, P = 6 and P = 8
processors.
2. HDD I/O: the HDD is powered directly by two independent cables (+12V and +5V rails); hence,
by directly measuring these rails, the disk power consumption can be profiled. In this scenario, a
series of tests was performed, consisting of several write operations to the hard disk. Those tests
require that the amount of data being written largely surpasses the total RAM size - at least a dataset
two times larger than the total available RAM (on Liliana, approximately 15778
MB). The Bonnie benchmark utility [30] was used. This benchmark performs a series of writes
and rewrites to a file. At first, the benchmark writes to a file calling the putc() stdio macro
- the loop that does the writing should be small enough to fit into any reasonable I-cache. Next,
it writes the data efficiently, writing blocks of data to the disk by calling write(2). To finish the test, each
chunk of the file is read with read(2), changed, and rewritten with write(2), requiring an lseek(2).
3. RAM Memory Accesses: the memory modules must drain power from one of the power rails connected
to the motherboard. To profile the power spent in memory accesses, the STREAM benchmark [29]
was used, which measures the effective memory bandwidth on one or more cores. A
total of three tests was conducted: 1, 2 and 4 threads. With this, it is expected to grasp which
power rails are correlated with intensive memory accesses. Regarding the dataset, the benchmark
requires the size of the array to be four times larger than the CPU's L3 cache (8
MB).
4. TCP/UDP Data Packets: in this case, the aim is to isolate the power consumed by the NIC, so
the iPerf benchmark [51] was used, which can generate TCP and UDP data packets and measure the
throughput between a server and a client. In order to have a significant set of results, the benchmark
was run for t = 100 seconds. During the test, the Liliana machine was connected as a client to
the Diana machine (another system belonging to SiPS), which acted as a server.
Figures 5.6(a) and 5.6(b) show the power consumption measured during the five different workload
tests on the machine. In figure 5.6(a), each bar corresponds to the system power draw for each
of the workloads. In figure 5.6(b), a stacked bar chart is provided, where for each workload (idle, LU
1, LU 2, ...) the power drawn by a component (CPU or HDD) or a rail (12V, 5V and 3.3V) is presented.
By observing the idle bar of figure 5.6(b), the reader can see how the power is distributed in a
desktop system consuming about 25 W of DC power. Although the CPU only consumes about 8 W, it is
still the major portion of the power consumption and, together with the disk's power, represents more than 50 %
of the system power. The other rails (which are related to the RAM, fans, peripherals and others) consume
together the rest of the system's power (about 12 W).
After a thorough analysis of each bar of figure 5.6(b), several conclusions can be drawn.
For the LU matrix decomposition workload, it is visible that each component/rail
shows an increase in power consumption, but it is the CPU that presents the highest increase in power
usage: while in idle it consumes roughly 8 W, under stress it achieves, on average, 30 W, 45 W,
[Figure: (a) total AC power (W) per workload (idle, LU 1 to LU 8, Bonnie, STREAM 1 to 4, iperf); (b) stacked DC power (W) per workload, broken down into CPU, 12V, 5V, 3.3V and HDD]
Figure 5.6: AC Power Consumption and DC Power Distribution in the System
61 W, 63 W and 65 W, for P = 1, P = 2, P = 4, P = 6 and P = 8 processors, respectively. The
largest instantaneous power measured was about 68 W. This complies with the Intel specification [49],
which states a maximum of 77 W of power draw by the CPU package with no overclocking. Notice that
there is no significant change in power for P = 4, P = 6 and P = 8. This occurs because, for P = 6
and P = 8 processors, the Hyper-Threading technology emulates at most eight cores by having each of
the four physical cores run two threads simultaneously.
According to the results, the tendency is that each doubling of the number of processors requires an
extra 15 W of power, excluding the cases where Hyper-Threading is being used.
The growth of power in every rail was also measured during this test. Since the EATX12V connector
supplies power to the CPU core, the power of the uncore (L3, memory controller) must come from the
motherboard, explaining the growth in power of the 12V rail. Moreover, the CPU fan and the system fan are
also powered by the 12V and 5V rails, hence the surge in power on those rails as well.
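As a rough illustration, the trend just described (about 15 W extra per doubling of active physical cores, flattening once Hyper-Threading takes over) could be modelled as follows; this is a hypothetical fit to the averages quoted above, not a model from the thesis:

```python
import math

def cpu_power_estimate(threads):
    """Rough CPU-package power trend under the LU workload: about 30 W for
    one thread plus ~15 W per doubling of physical cores, flat beyond
    4 threads (Hyper-Threading on a 4-core CPU)."""
    physical = min(threads, 4)
    return 30.0 + 15.0 * math.log2(physical)
```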
For the disk-intensive benchmark, the reader can observe that the changes in all the
rails are insignificant. The only interesting changes occur for the CPU and, as expected, for the HDD. The increase
in CPU power is expected, since the workload runs several writes and rewrites in a loop. During the
test, the HDD consumed more power while writing with the putc() macro, reaching about 10.5 W,
whilst when writing in chunks the power was lower. This was expected, since writing one
char at a time requires more disk accesses than writing a batch of characters at once.
The rewriting consumes more power than just writing, because it also calls the lseek() system
function. Although there was an increase in the power usage, the maximum specified power drawn by
this component (12.75 W) was not reached: the reason may be the use of an insufficiently
large dataset, but most probably it is due to the existence of a buffer, which allows the disk to process
reads/writes in batches, minimizing disk accesses.
Focusing now on the STREAM benchmark, a general rise of the power draw is visible in every rail
and component. Nevertheless, if the reader takes a closer look at the 3.3V, 5V and 12V rails, it can
be perceived that it is only during this workload that these rails show their highest power consumption
values. In the other tests, the power drained by the 5V rail changed a little during the LU workloads, but the
consumption is not very high when compared with the idle power. This is unambiguous proof that the
5V rail is tightly correlated with the power drained by the RAM modules. Although this benchmark is not CPU-intensive
(as the reader can observe in the CPU bar), the 12V and 3.3V rails experience a significant increase
in power along this set of tests. However, since those rails also present a very high power consumption
during the LU workload, this confirms that the 12V rail must power the CPU uncore and the fans.
Nevertheless, when under intensive work, the RAM modules must request extra power from the 12V and 3.3V
rails, explaining the peak values observed under the STREAM benchmark test.
Regarding the last workload, intended to stress the NIC, it is clear that there was an insignificant or even
no difference between the power drained when running the test and in the idle state. This happened for all
measured rails and components, thus it is clear proof that the NIC draws the same power whether
in the active or in the idle state.
[Figure: PSU power efficiency (%) vs. load (%), for loads between roughly 3 % and 12 %]
Figure 5.7: Power Efficiency
The PSU's efficiency was determined as the ratio between the DC power drained by each
workload and the respective AC power consumed. Figure 5.7 illustrates the results regarding the PSU
efficiency, where the abscissa expresses the load associated with each workload. In the idle state or
during the NIC-intensive test, the PSU efficiency is below 40%, which is normal for a load near 3%. As
the load increases, so does the efficiency, achieving a maximum value of 62.62% with a load of 12.15 % (LU,
8 cores). According to the PSU's manufacturer, this unit should provide at least 80 % efficiency for
loads higher than 20%; however, as demonstrated, those values are not achieved for lower loads.
Unfortunately, due to the lack of more power-demanding components in the Liliana machine, it was not
possible to achieve loads higher than 13 %, so the efficiency values guaranteed by the manufacturer
were not attained. Even so, it is predictable that such efficiency is feasible for loads above 20%, since an
efficiency of 62.6% was obtained at only 12% load and the efficiency curve is not linear.
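The efficiency and load figures above reduce to two simple ratios; a minimal sketch, assuming the 750 W rating of the TX750 as the load base:

```python
def psu_efficiency_pct(p_dc_w, p_ac_w):
    """PSU conversion efficiency: DC power delivered over AC power drawn."""
    return 100.0 * p_dc_w / p_ac_w

def psu_load_pct(p_dc_w, rated_w=750.0):
    """Load as a percentage of the PSU's rated output (TX750: 750 W assumed)."""
    return 100.0 * p_dc_w / rated_w
```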
5.5 Summary
In this chapter, a diversified set of results was provided. At the beginning, the results of the calibrations
conducted on the sensors, as well as the non-linearity thresholds, were given. The following sections
were intended to validate the power readings of the Powermeter API by comparing them with the internal counters
of the RAPL system. It was shown that although RAPL is good enough for power profiling a running
workload with a variable power pattern over time, it fails to follow the behavior of the
power curve in detail. In fact, RAPL outputs an average of the energy readings over time, so a lot of
information is lost; in contrast, Powermeter gives a totally new view of the power pattern at a
level previously impractical, due to its accuracy and time resolution. Thus, Powermeter is better suited
for real-time profiling than RAPL.
The chapter also addressed the power profiling and the energy efficiency of several parallel
benchmarks designed by NASA (the NAS Parallel Benchmarks). The results regarding the power profiling of the
CPU, HDD and RAM (5V) were insightful, showing how different the power demand
of distinct applications is. For the EP benchmark, the reader had the opportunity to observe how well
the application is distributed across all 4 physical cores, thus not demonstrating large fluctuations in the
power profile of any of the modules. On the other hand, the profiling of the BT benchmark revealed a
totally different pattern, showing large fluctuations in the CPU and RAM power usage. By using
Powermeter, it was concluded that the fluctuations are related to the number of iterations executed by
the workload and to the accesses to the RAM. It was also determined how efficient a workload is when
taking advantage of parallel execution and it was ascertained that hyperthreading can improve both
time performance and energy efficiency.
Finally, the device was used to characterize the power consumption of a desktop equipped with state-of-the-art
components, by measuring the power usage of all the rails coming from the PSU. The tests indicated
how the power is distributed in the system and how it changes in all rails when micro-benchmarks
engineered to stress specific modules are used. It was shown that hyperthreading induces an insignificant
amount of power overhead and that the RAM is mainly correlated with the 5V rail, while the remaining
rails are mostly used to source the fans, the NIC and other motherboard components. In addition, it was verified
that the NIC consumes a constant amount of power, whether active or at rest. In the end, the efficiency
curve of the PSU was obtained, which revealed that power supplies are very inefficient at low loads.
Thereby, it is important to design a system with a sufficient amount of load, so that the PSU efficiency
reaches more than 80%, reducing the losses.
6 Conclusions
Contents
6.1 Summary and Overall Conclusions
6.2 Future Work
6.1 Summary and Overall Conclusions
The main objective of this thesis was the design and implementation of a measurement device for
real-time monitoring of the power consumption of the main components of a computing system. In
chapter 2, some state-of-the-art applications and devices that seek to solve this problem [23, 26, 27, 13]
were identified. However, those devices are not adequate for the characterization of complex applications
with a high level of computational requirements. Therefore, a novel device that promises to achieve
the aforementioned goal was introduced in the scope of this thesis: the Powermeter. The
conceived device comprises several electronic components, including precision Hall-effect current
sensors and an AGC structure for handling the AC sensor output signal, which dynamically scales the
signal's amplitude, improving the ADC's DR. The system samples at a rate of f = 3.33 kHz, does not add
any significant time overhead and transmits the acquired data at high speed (more than 64 KB/s), while
providing accurate and precise power measurements.
To make this possible, it was fundamental to analyse all the initial requirements and to design electronic
structures fitted for the conditioning of the various types of signals coming from an off-the-shelf
desktop PSU. Furthermore, an AGC block comprising a band-pass filter and a PGA was proposed.
The system was analysed in the spectral domain, computing the expected induced noise density, and it
was evaluated in terms of stability. The results proved that the system achieves a high DR (more than
90 dB) and that it is absolutely stable.
Then, the design of the software part of the conceived architecture was presented in chapter 4. In this chapter, the latency of the system was studied; the adopted sampling frequency was validated against several system constraints; the essential procedures for computing the energy with low time overhead, by offloading part of that computation to the MCU, were described; and the three main functions of the software API that are used to initiate and stop the power readings were presented: powermeter_api_init(), powermeter_api_start() and powermeter_api_stop().
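A minimal host-side usage sketch of those three calls is shown below. The prototypes are assumptions (the thesis names the functions but not their exact signatures), and the bodies are stubs standing in for the real libusb-backed implementation:

```c
/* Stub implementations; the real versions talk to the device over USB.
 * A return value of 0 denotes success, following a common C convention. */
static int powermeter_api_init(void)  { return 0; /* open and configure device */ }
static int powermeter_api_start(void) { return 0; /* begin 3.33 kHz sampling   */ }
static int powermeter_api_stop(double *energy_j)
{
    *energy_j = 0.0;  /* energy accumulated on the MCU is read back here */
    return 0;
}

/* Typical profiling pattern: init, start, run the workload, stop.
 * Returns the measured energy in joules, or -1.0 on failure. */
static double profile_workload(void (*workload)(void))
{
    double energy_j = -1.0;
    if (powermeter_api_init() != 0 || powermeter_api_start() != 0)
        return -1.0;               /* device unavailable */
    if (workload)
        workload();                /* code region being measured */
    if (powermeter_api_stop(&energy_j) != 0)
        return -1.0;
    return energy_j;
}
```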
The results regarding the ADC and sensor calibration were presented in chapter 5. To validate the system's usefulness and reliability, several tests were performed by comparing the readings of computationally intensive workloads obtained with both the Powermeter and internal counters (RAPL). A few parallel benchmarks with different kinds of computational requirements were also profiled. The results served as a proof of concept of the proposed device and several conclusions were drawn: it was demonstrated that RAPL is less detailed and less reliable than the Powermeter for real-time power characterization, since it provides neither enough time resolution nor all of the uncore power consumption of the CPU; and it was concluded that different applications request distinct resources, evidencing different power patterns in the CPU and RAM. Furthermore, it was highlighted that applications whose routines are highly parallelizable attain better energy and time performances, and the EDP metric made it possible to identify the configurations that achieve the best trade-off between energy cost and time performance. Then, the rails that supply specific components were identified (5 V: RAM; 12 V, 5 V and 3.3 V: fans, NIC and others) and it was shown how the power is distributed in a desktop computing system. Finally, it was revealed that for loads below 20% a standard PSU performs poorly, achieving a very low conversion efficiency.
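The EDP-based selection mentioned above amounts to minimizing E times t over the candidate thread configurations. A small illustrative helper (the arrays and names are hypothetical, not measurements from chapter 5):

```c
#include <stddef.h>

/* Return the index of the configuration with the lowest energy-delay
 * product (EDP = energy * time); a lower EDP means a better trade-off
 * between energy cost and time performance. */
static size_t best_edp_config(const double *energy_j, const double *time_s,
                              size_t n)
{
    size_t best = 0;
    for (size_t k = 1; k < n; ++k)
        if (energy_j[k] * time_s[k] < energy_j[best] * time_s[best])
            best = k;
    return best;
}
```

Given per-configuration energies and execution times for, say, 1, 2, 4 and 8 threads, the returned index identifies the thread count with the best energy/performance balance.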
6.2 Future Work
As future work, it would be important to implement the conceived device on a Printed Circuit Board (PCB) with SMD components, in order to reduce its form factor and enhance its noise immunity. It would also be interesting to port the developed software to a more powerful microcontroller, with a higher clock frequency and a larger RAM, so that a higher sampling frequency could be attained.
Nevertheless, the prototype device already fulfils all the necessary requirements for real-time power profiling. Consequently, it would be interesting to combine the device with power-aware strategies, such as DVFS or PWM-based techniques, and to use it for energy-aware scheduling in homogeneous and heterogeneous clusters.
Bibliography
[1] Rong Ge, Xizhou Feng, Shuaiwen Song, Hung-Ching Chang, Dong Li, and Kirk W. Cameron.
PowerPack: Energy Profiling and Analysis of High-Performance Systems and Applications. IEEE
Trans. Parallel Distrib. Syst., 21(5):658–671, 2010. URL http://dblp.uni-trier.de/db/
journals/tpds/tpds21.html#GeFSCLC10.
[2] W.L. Bircher and L.K. John. Complete system power estimation: A trickle-down approach based
on performance events. In Performance Analysis of Systems Software, 2007. ISPASS 2007. IEEE
International Symposium on, pages 158–168, April 2007. doi: 10.1109/ISPASS.2007.363746.
[3] Xizhou Feng, Rong Ge, and Kirk W. Cameron. Power and energy profiling of scientific applications
on distributed systems. In Proceedings of the 19th IEEE International Parallel and Distributed
Processing Symposium (IPDPS’05), 01:34, 2005.
[4] Gustavo Vitorino Monteiro da Silva. Controlo Não Linear. Technical report, Escola Superior de Tecnologia de Setúbal, 2006.
[5] Trevor Pering, Yuvraj Agarwal, Rajesh Gupta, and Roy Want. Coolspots: Reducing the power
consumption of wireless mobile devices with multiple radio interfaces. In Proceedings of the 4th
International Conference on Mobile Systems, Applications and Services, MobiSys ’06, pages 220–
232, New York, NY, USA, 2006. ACM. ISBN 1-59593-195-3. doi: 10.1145/1134680.1134704. URL
http://doi.acm.org/10.1145/1134680.1134704.
[6] Kester Li, Roger Kumpf, Paul Horton, and Thomas Anderson. A Quantitative Analysis of Disk Drive
Power Management in Portable Computers. In Proceedings of the USENIX Winter 1994 Technical
Conference on USENIX Winter 1994 Technical Conference, WTEC’94, pages 22–22, Berkeley, CA,
USA, 1994. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1267074.
1267096.
[7] Krisztian Flautner, Steve Reinhardt, and Trevor Mudge. Automatic performance setting for dynamic
voltage scaling. In MobiCom ’01: Proceedings of the 7th annual international conference on Mobile
computing and networking, pages 260–271, New York, NY, USA, 2001. ACM. ISBN 1-58113-422-3.
[8] Trevor Pering, Tom Burd, and Robert Brodersen. Dynamic voltage scaling and the design of a
Low-Power microprocessor system. In In Power Driven Microarchitecture Workshop, attached to
ISCA98, 1998. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.
53.7554.
[9] Yuvraj Agarwal, Stefan Savage, and Rajesh Gupta. SleepServer: A Software-only Approach for
Reducing the Energy Consumption of PCs Within Enterprise Environments. In Proceedings of
the 2010 USENIX Conference on USENIX Annual Technical Conference, USENIXATC’10, pages
22–22, Berkeley, CA, USA, 2010. USENIX Association. URL http://dl.acm.org/citation.
cfm?id=1855840.1855862.
[10] Luiz André Barroso and Urs Hölzle. The Case for Energy-Proportional Computing. Computer, 40
(12):33–37, December 2007. ISSN 0018-9162. doi: 10.1109/MC.2007.443. URL http://dx.
doi.org/10.1109/MC.2007.443.
[11] Ripal Nathuji and Karsten Schwan. VirtualPower: Coordinated Power Management in Virtual-
ized Enterprise Systems. In Proceedings of Twenty-first ACM SIGOPS Symposium on Operating
Systems Principles, SOSP ’07, pages 265–278, New York, NY, USA, 2007. ACM. ISBN
978-1-59593-591-5. doi: 10.1145/1294261.1294287. URL http://doi.acm.org/10.1145/
1294261.1294287.
[12] Andreas Merkel and Frank Bellosa. Balancing power consumption in multiprocessor systems. In
Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006,
EuroSys ’06, pages 403–414, New York, NY, USA, 2006. ACM. ISBN 1-59593-322-0. doi: 10.1145/
1217935.1217974. URL http://doi.acm.org/10.1145/1217935.1217974.
[13] Jason Flinn and M. Satyanarayanan. PowerScope: A Tool for Profiling the Energy Usage of Mobile
Applications. In Proceedings of the Second IEEE Workshop on Mobile Computer Systems and
Applications, WMCSA ’99, pages 2–, Washington, DC, USA, 1999. IEEE Computer Society. ISBN
0-7695-0025-0. URL http://dl.acm.org/citation.cfm?id=520551.837522.
[14] Abhinav Pathak, Y. Charlie Hu, and Ming Zhang. Where is the Energy Spent Inside My App?: Fine
Grained Energy Accounting on Smartphones with Eprof. In Proceedings of the 7th ACM European
Conference on Computer Systems, EuroSys ’12, pages 29–42, New York, NY, USA, 2012. ACM.
ISBN 978-1-4503-1223-3. doi: 10.1145/2168836.2168841. URL http://doi.acm.org/10.
1145/2168836.2168841.
[15] Intel Corp. Intel Xeon processor. http://www.intel.com/xeon, 2012.
[16] David C. Snowdon, Stefan M. Petters, and Gernot Heiser. Accurate On-line Prediction of Processor and Memory Energy Usage Under Voltage Scaling. In Proceedings of the 7th ACM & IEEE
International Conference on Embedded Software, EMSOFT ’07, pages 84–93, New York, NY, USA,
2007. ACM. ISBN 978-1-59593-825-1. doi: 10.1145/1289927.1289945. URL http://doi.acm.
org/10.1145/1289927.1289945.
[17] Aaron Carroll and Gernot Heiser. An Analysis of Power Consumption in a Smartphone. In
Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference, USENIX-
ATC’10, pages 21–21, Berkeley, CA, USA, 2010. USENIX Association. URL http://dl.acm.
org/citation.cfm?id=1855840.1855861.
[18] John C. McCullough, Yuvraj Agarwal, Jaideep Chandrashekar, Sathyanarayan Kuppuswamy,
Alex C. Snoeren, and Rajesh K. Gupta. Evaluating the Effectiveness of Model-based Power
Characterization. In Proceedings of the 2011 USENIX Conference on USENIX Annual Technical
Conference, USENIXATC’11, pages 12–12, Berkeley, CA, USA, 2011. USENIX Association. URL
http://dl.acm.org/citation.cfm?id=2002181.2002193.
[19] Russ Joseph, David Brooks, and Margaret Martonosi. Live, runtime power measurements as a
foundation for evaluating power/performance tradeoffs. In Workshop on Complexity-Effective Design (WCED), held in conjunction with ISCA, 28, 2001.
[20] David C. Snowdon, Stefan M. Petters, and Gernot Heiser. Power measurement as the basis for
power management. In Proceedings of the 1st Workshop on Operating System Platforms for
Embedded Real-Time Applications (OSPERT), Palma, Mallorca, Spain, jul 2005.
[21] ATX specification, version 2.2. Technical report, 2005. URL http://www.formfactors.org.
[22] Molex Connectors, Accessed 2015. URL http://www.molex.com/molex/index.jsp.
[23] Marcus Hähnel, Björn Döbel, Marcus Völp, and Hermann Härtig. Measuring Energy Consumption for Short Code Paths Using RAPL. SIGMETRICS Perform. Eval. Rev., 40(3):13–17, January 2012. ISSN 0163-5999. doi: 10.1145/2425248.2425252. URL http://doi.acm.org/10.1145/2425248.2425252.
[24] Thanh Do, Suhib Rawshdeh, and Weisong Shi. ptop: A process-level power profiling tool.
In Proceedings of the 2nd Workshop on Power Aware Computing and Systems (HotPower’09), oct
2009.
[25] Fay Chang, Keith I. Farkas, and Parthasarathy Ranganathan. Energy-Driven Statistical Sampling:
Detecting Software Hotspots. In Babak Falsafi and T. N. Vijaykumar, editors, PACS, volume 2325 of
Lecture Notes in Computer Science, pages 110–129. Springer, 2002. ISBN 3-540-01028-9. URL
http://dblp.uni-trier.de/db/conf/pacs/pacs2002.html#ChangFR02.
[26] PowerEgg. URL http://www.itwatchdogs.com.
[27] Watts Up. URL http://www.wattsupmeters.com.
[28] H. Jin, M. Frumkin, and J. Yan. The OpenMP Implementation of NAS Parallel Benchmarks and Its Performance. Technical report, 1999. URL https://www.nas.nasa.gov/assets/pdf/techreports/1999/nas-99-011.pdf.
[29] John D. McCalpin. Stream: Sustainable memory bandwidth in high performance com-
puters. Technical report, University of Virginia, Charlottesville, Virginia, 1991-2007.
URL http://www.cs.virginia.edu/stream/. A continually updated technical report.
[30] Bonnie++ Benchmark Suite, Accessed 2015. URL http://www.coker.com.au/bonnie++/.
[31] Christian Bienia. Benchmarking Modern Multiprocessors. PhD thesis, Princeton University, January
2011.
[32] Hui Chen, Youhuizi Li, and Weisong Shi. Fine-grained power management using process-level
profiling. Sustainable Computing: Informatics and Systems, 2(1):33 – 42, 2012. ISSN 2210-5379.
doi: http://dx.doi.org/10.1016/j.suscom.2012.01.002. URL http://www.sciencedirect.com/
science/article/pii/S2210537912000030.
[33] Tao Li and Lizy Kurian John. Run-time modeling and estimation of operating system power con-
sumption. SIGMETRICS Perform. Eval. Rev., 31(1):160–171, June 2003. ISSN 0163-5999. doi:
10.1145/885651.781048. URL http://doi.acm.org/10.1145/885651.781048.
[34] Aman Kansal, Feng Zhao, Jie Liu, Nupur Kothari, and Arka A. Bhattacharya. Virtual machine
power metering and provisioning. In Proceedings of the 1st ACM Symposium on Cloud Computing,
SoCC ’10, pages 39–50, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0036-0. doi: 10.1145/
1807128.1807136. URL http://doi.acm.org/10.1145/1807128.1807136.
[35] Frank Bellosa. The benefits of event-driven energy accounting in power-sensitive systems. In
Proceedings of the 9th Workshop on ACM SIGOPS European Workshop: Beyond the PC: New
Challenges for the Operating System, EW 9, pages 37–42, New York, NY, USA, 2000. ACM. doi:
10.1145/566726.566736. URL http://doi.acm.org/10.1145/566726.566736.
[36] Gilberto Contreras and Margaret Martonosi. Power prediction for Intel XScale processors using performance monitoring unit events. In Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), pages 221–226. ACM Press, 2005.
[37] Intel Corporation. Intel 64 and IA-32 Architectures Software Developer’s Manual, 2011.
[38] Canturk Isci and Margaret Martonosi. Runtime power monitoring in high-end processors: Method-
ology and empirical data. In Proceedings of the 36th Annual IEEE/ACM International Symposium
on Microarchitecture, MICRO 36, pages 93–, Washington, DC, USA, 2003. IEEE Computer Society.
ISBN 0-7695-2043-X. URL http://dl.acm.org/citation.cfm?id=956417.956567.
[39] Pat Bohrer, Elmootazbellah N. Elnozahy, Tom Keller, Michael Kistler, Charles Lefurgy, Chandler
McDowell, and Ram Rajamony. Power aware computing. chapter The Case for Power Management
in Web Servers, pages 261–289. Kluwer Academic Publishers, Norwell, MA, USA, 2002. ISBN 0-
306-46786-0. URL http://dl.acm.org/citation.cfm?id=783060.783075.
[40] S. Kamil, J. Shalf, and E. Strohmaier. Power efficiency in high performance computing. In Parallel
and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, pages 1–8,
April 2008. doi: 10.1109/IPDPS.2008.4536223.
[41] CORSAIR TX750, Accessed 2015. URL http://www.corsair.com/pt-pt/
enthusiast-series-tx750-v2-80-plus-bronze-certified-750-watt-high-performance-power-supply.
[42] MATLAB. version 8.3.0.532 (R2014a). The MathWorks Inc., Natick, Massachusetts, 2014.
[43] Steve Bowling. Understanding A/D Converter Performance Specifications. Technical report, Mi-
crochip Technology Inc., 2000.
[44] Walt Kester. Understand SINAD, ENOB, SNR, THD, THD + N, and SFDR so You Don’t Get Lost in
the Noise Floor. Technical report, Analog Devices, 2009.
[45] Zhisheng Duan, Jin-Zhi Wang, and Lin Huang. Frequency domain method for the dichotomy
of modified chaos equations. International Journal of Bifurcation and Chaos, 15(08):2485–2505,
2005. doi: 10.1142/S0218127405013435. URL http://www.worldscientific.com/doi/
abs/10.1142/S0218127405013435.
[46] Pedro Bulach Gapski. Analise Convexa do Problema da Estabilidade Absoluta de Sistemas tipo
Lur’e. Master’s thesis, FEE/UNICAMP, June 1994.
[47] M. Vidyasagar. Nonlinear Systems Analysis. Prentice Hall, 2nd edition, 1993. ISBN:0-13-623463-1.
[48] libusb API library, Accessed 2015. URL http://libusb.info/.
[49] Intel Corporation. Intel i7 3770k 3.9 GHz Specifications. Technical report, 2011.
[50] Kristof Du Bois, Tim Schaeps, Stijn Polfliet, Frederick Ryckbosch, and Lieven Eeckhout. SWEEP:
Evaluating Computer System Energy Efficiency Using Synthetic Workloads. In Manolis Katevenis,
Margaret Martonosi, Christos Kozyrakis, and Olivier Temam, editors, HiPEAC, pages 159–166.
ACM, 2011. ISBN 978-1-4503-0241-8. URL http://dblp.uni-trier.de/db/conf/hipeac/
hipeac2011.html#BoisSPRE11.
[51] NLANR/DAST: Iperf - the TCP/UDP bandwidth measurement tool, Accessed 2015. URL http://dast.nlanr.net/Projects/Iperf/.
Appendix A
A.1 Bandpass Filter
[Bode magnitude responses (dB) versus frequency (Hz), comparing the theoretical, nominal and tolerance-bound curves. With 5% resistors, the gain at 50 Hz ranges from 8.96 dB (−5%) to 9.25 dB (+5%); with 1% resistors it is 11.9 dB for both the +1% and −1% bounds.]
(a) Filter Response with 5% Resistors Tolerance. (b) Filter Response with 1% Resistors Tolerance.
Figure A.1: Filter Response
Figure A.2: R2 Parameter Variation with 5% Tolerance
A.2 Analog Implementation - Full Circuit Diagram
Figure A.3: Circuit Layout
A.3 Dynamic Range
Figure A.4: Sinusoid FFT with Oscilloscope
[FFT magnitude spectra (dB) versus frequency (kHz), marking the fundamental, harmonics 2–6 and spurs; DC and noise are excluded from the computation.]
(a) THD: −61.29 dB. (b) SFDR: 61.842 dBFS.
Figure A.5: THD and SFDR for N=8. Fs = 3.333 kHz
A.4 Benchmark Results
[Power (W) versus time (s) for 1, 2, 4 and 8 threads, with traces for CPU, HDD and RAM (5 V).]
Figure A.6: EP CLASS=B Power Profiling
[Power (W) versus time (s) for 1, 2, 4 and 8 threads, with traces for CPU, HDD and RAM (5 V).]
Figure A.7: BT CLASS=B Power Profiling