Kirk W. Cameron SCAPE Laboratory Virginia Tech
1
Kirk W. Cameron
SCAPE Laboratory
Virginia Tech
The past, present, and future of
Green Computing
Enough About Me
• Associate Professor, Virginia Tech
• Co-founder, Green500
• Co-founder, MiserWare
• Founding member, SPECpower
• Consultant for EPA Energy Star for Servers
• IEEE Computer "Green IT" columnist
• Over $4M in federally funded "Green" research
• SystemG Supercomputer
2
3
What is SCAPE?
• Scalable Performance Laboratory
  – Founded 2001 by Cameron
• Vision
  – Improve efficiency of high-end systems
• Approach
  – Exploit/create technologies for high-end systems
  – Conduct quality research to solve important problems
  – When appropriate, commercialize technologies
  – Educate and train the next generation of HPC CS researchers
4
The Big Picture (Today)
• Past: Challenges
  – Need to measure and correlate power data
  – Save energy while maintaining performance
• Present
  – Software/hardware infrastructure for power measurement
  – Intelligent power management (CPU MISER, Memory MISER)
  – Integration with other toolkits (PAPI, Prophesy)
• Future: Research + Commercialization
  – Management Infrastructure for Energy Reduction
  – MiserWare, Inc.
  – Holistic power management
5
1882 - 2001
6
Prehistory
• Embedded systems
• General-purpose microarchitecture
  – Circa 1999, power becomes a disruptive technology
  – Moore's Law + clock-frequency arms race
  – Simulators emerge (e.g., Princeton's Wattch)
  – Related work continues today (CMPs, SMT, etc.)
1882 - 2001
7
2002
8
Server Power
• IBM Austin
  – Energy-aware commercial servers [Keller et al]
• LANL
  – Green Destiny [Feng et al]
• Observations
  – IBM targets commercial apps
  – Feng et al achieve power savings in exchange for performance loss
2002
9
HPC Power
• My observations
  – Power will become disruptive to HPC
  – Laptops outselling PCs
  – Commercial power-aware approaches not appropriate for HPC
2002
Thinking Machines CM-5: 0.005 megawatts (~$4,000/yr)
Residential A/C: 0.015 megawatts (~$12,000/yr)
Intel ASCI Red: 0.850 megawatts (~$680,000/yr)
High-speed train: 10 megawatts (~$8 million/yr)
Earth Simulator: 12 megawatts (~$9.6 million/yr)
Conventional power plant: 300 megawatts
$800,000 per year per megawatt!
10
HPPAC Emerges
• SCAPE Project
  – High-performance, power-aware computing
  – Two initial goals
    • Measurement tools
    • Power/energy savings
  – Big goals… no funding (risked all startup funds)
2002
11
2003 - 2004
12
Cluster Power
• IBM Austin
  – On evaluating request-distribution schemes for saving energy in server clusters, ISPASS '03 [Lefurgy et al]
  – Improving Server Performance on Transaction Processing Workloads by Enhanced Data Placement, SBAC-PAD '04 [Rubio et al]
• Rutgers
  – Energy conservation techniques for disk array-based servers, ICS '04 [Bianchini et al]
• SCAPE
  – High-performance, power-aware computing, SC04
  – Power measurement + power/energy savings
2003 - 2004
13
PowerPack Measurement: scalable, synchronized, and accurate.
[Architecture diagram: a high-performance, power-aware cluster instrumented for hardware power/energy profiling, software power/energy control, and data collection. Multi-meters measure DC power from each node's power supply; a Baytech power strip and its management unit measure AC power from the outlet. Readings flow through a data log into a data repository and data analysis stage as power/energy profiling data. On the software side, PowerPack libraries (profile/control) link with applications and microbenchmarks; a multi-meter control thread drives per-meter (MM) threads, and a DVS control thread drives per-node DVS threads.]
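As a concrete illustration of the measurement side (this is not the PowerPack source; read_meter(), the 100 ms sampling period, and the log format are assumptions), one per-meter sampling thread might look like:

#include <pthread.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical per-meter sampling thread: polls one multi-meter at a fixed
   rate and appends timestamped power readings to a shared data log.
   read_meter() stands in for whatever the real meter protocol provides. */
extern double read_meter(int meter_id);    /* assumed: returns watts */

struct meter_ctx {
    int id;                 /* which multi-meter this thread covers          */
    FILE *log;              /* already-opened data log                       */
    volatile int *stop;     /* raised by the control thread to end sampling  */
};

static void *meter_thread(void *arg)
{
    struct meter_ctx *ctx = arg;
    struct timespec period = { 0, 100 * 1000 * 1000 };   /* 100 ms */

    while (!*ctx->stop) {
        double watts = read_meter(ctx->id);
        fprintf(ctx->log, "%ld %d %.2f\n", (long)time(NULL), ctx->id, watts);
        nanosleep(&period, NULL);
    }
    return NULL;
}
/* The multi-meter control thread would start one of these per meter with
   pthread_create(&tid, NULL, meter_thread, &ctx) and raise *stop at teardown. */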
2003 - 2004
14
After frying multiple components…
15
PowerPack Framework (DC Power Profiling)
Multi-meters + 32-node Beowulf
! Root process connects to the meter host and opens a new power log
if (node .eq. root) then
  call pmeter_init(xmhost, xmport)
  call pmeter_log(pmlog, NEW_LOG)
endif

<CODE SEGMENT>

! Mark the start of a labeled measurement session around the region of interest
if (node .eq. root) then
  call pmeter_start_session(pm_label)
endif

<CODE SEGMENT>

! Stop sampling, close the log, and release the meter connection
if (node .eq. root) then
  call pmeter_pause()
  call pmeter_log(pmlog, CLOSE_LOG)
  call pmeter_finalize()
endif
16
Power Profiles – Single Node
• The CPU is typically the largest consumer of power (under load)

Power consumption distribution at system idle (system power: 39 W):
CPU 14%, memory 10%, disk 11%, NIC 1%, other chipset 8%, fans 23%, power supply 33%

Power consumption distribution when memory-performance bound (171.swim; system power: 59 W):
CPU 35%, memory 16%, disk 7%, NIC 1%, other chipset 5%, fans 15%, power supply 21%
17
Power Profiles – Single Node
[Figure: Power (0–40 W) consumed by CPU, memory, disk, and NIC across workloads: idle, 171.swim (memory-bound), 164.gzip (CPU-bound), cp (disk-bound), and scp (network-bound). Note: only power consumed by the CPU, memory, disk, and NIC is considered here.]
18
NAS PB FT – Performance Profiling
[Timeline: compute, reduce (comm), compute, all-to-all (comm).]
About 50% of time is spent in communication.
19
Power Profile of FT Benchmark (class B, NP=4)
[Figure: CPU, memory, disk, and NIC power (W) over time (0–200 s), showing the startup, initialize, and iteration 1–3 phases.]
Power profiles reflect performance profiles.
20
One FFT Iteration
[Figure: CPU and memory power (W) across one FFT iteration (110–150 s). One iteration breaks down into evolve and fft; fft into cffts1, cffts1, cffts2, and transpose_x_yz; transpose_x_yz into transpose_local, mpi_all-to-all, and transpose_finish; the all-to-all appears as send-recv, send-recv, wait, send-recv.]
21
2005 - present
22
Intuition confirmed
2005 - Present
23
HPPAC Tool Progress
• PowerPack
  – Modularized PowerPack and SysteMISER
  – Extended analytics for applicability
  – Extended to support thermals
• SysteMISER
  – Improved analytics to weigh tradeoffs at runtime
  – Automated cluster-wide DVS scheduling
  – Support for automated power-aware memory
2005 - Present
24
Predicting CPU Power
[Figure: Estimated vs. measured CPU power (0–30 W) over time (0–100 s).]
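The slide does not say how the estimated curve is produced; a common approach, sketched here purely for illustration (the event set and coefficients are made up, not PowerPack's), is a linear model over hardware performance-counter rates that is fit offline against measured power:

/* Illustrative linear CPU power model driven by performance-counter rates.
   The coefficients are placeholders; a real model would be fit against
   measured power traces (e.g., from the multi-meters above). */
struct event_rates {
    double ipc;            /* retired instructions per cycle   */
    double l2_miss_rate;   /* L2 cache misses per instruction  */
    double mem_bw_gbs;     /* memory traffic in GB/s           */
};

double estimate_cpu_power(const struct event_rates *r)
{
    const double p_idle = 10.0;   /* W: baseline/idle power (assumed)        */
    const double a_ipc  = 12.0;   /* W per unit of IPC (assumed)             */
    const double a_l2   = -20.0;  /* memory stalls reduce core activity      */
    const double a_bw   = 0.5;    /* W per GB/s of memory traffic (assumed)  */

    return p_idle + a_ipc * r->ipc + a_l2 * r->l2_miss_rate
                  + a_bw * r->mem_bw_gbs;
}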
2005 - Present
25
Predicting Memory Power
[Figure: Estimated vs. measured memory power (0–12 W) over time (0–100 s).]
26
Correlating Thermals (BT)
2005 - Present
27
Correlating Thermals (MG)
2005 - Present
28
Tempest Results (FT)
2005 - Present
29
SysteMISER
• Our software approach to reduce energy
  – Management Infrastructure for Energy Reduction
• Power/performance
  – measurement
  – prediction
  – control
The Heat Miser.
2005 - Present
30
Power-aware DVS scheduling strategies
CPUSPEED daemon:
[example]$ start_cpuspeed
[example]$ mpirun -np 16 ft.B.16

Internal scheduling:
MPI_Init();
<CODE SEGMENT>
setspeed(600);
<CODE SEGMENT>
setspeed(1400);
<CODE SEGMENT>
MPI_Finalize();

External scheduling:
[example]$ psetcpuspeed 600
[example]$ mpirun -np 16 ft.B.16
NEMO & PowerPack Framework for saving energy
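For reference, frequency changes like those requested by setspeed()/psetcpuspeed can be issued on Linux through the cpufreq sysfs interface (the same files the SystemG notes point to later). This is only a minimal sketch, assuming the userspace governor is active and the caller may write the file:

#include <stdio.h>

/* Request a target frequency (in kHz) for one core via cpufreq sysfs.
   Requires the "userspace" governor and sufficient permissions. */
static int set_cpu_khz(int cpu, long khz)
{
    char path[128];
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", cpu);
    f = fopen(path, "w");
    if (!f)
        return -1;               /* wrong governor or insufficient rights */
    fprintf(f, "%ld\n", khz);
    fclose(f);
    return 0;
}
/* e.g. set_cpu_khz(0, 600000) asks core 0 for 600 MHz, the low setting used
   in the internal-scheduling example above. */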
2005 - Present
31
CPU MISER Scheduling (FT)
36% energy savings, less than 1% performance loss
See SC2004, SC2005 publications.
[Figure: Normalized delay and normalized energy for FT.C.8 under the auto governor, fixed frequencies of 600–1400 MHz, and CPU MISER.]
2005 - Present
32
Where else can we save energy?
• Processor (DVS)
  – Where everyone starts.
• NIC
  – Very small portion of system power
• Disk
  – A good choice (our future work)
• Power supply
  – A very good choice (for an EE or ME)
• Memory
  – Only 20–30% of system power, but…
2005 - Present
33
The Power of Memory
Effects of increased memory on system power (90 W CPU, 9 W per 4 GB DIMM)
[Figure: Percentage of system power consumed by memory vs. by CPUs as the amount of memory per processor grows from 0 to 256 GB.]
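As a rough back-of-the-envelope check on the chart's trend (counting only CPU and DIMM power, per the stated assumptions): at 128 GB per processor, 32 DIMMs × 9 W = 288 W of memory power against 90 W of CPU power, so memory accounts for roughly 288 / (288 + 90) ≈ 76% of that total, and the share keeps growing as more DIMMs are added.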
2005 - Present
34
Memory Management Policies
[Figure: Memory devices online (0–10) over time (20 minutes) under the Default, Static, and Dynamic management policies.]
Memory MISER =
Page Allocation Shaping + Allocation Prediction + Dynamic Control
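A later slide notes that the prediction uses a PID controller; the sketch below is only an illustration of that idea with assumed gains and device sizes, not Memory MISER's actual control loop:

#include <math.h>

/* Illustrative PID loop: each control interval, decide how many memory
   devices (e.g., 2 GB DIMMs/ranks, an assumed size) to keep online so
   that online capacity tracks observed demand. */
struct pid {
    double kp, ki, kd;     /* controller gains (placeholders)   */
    double integral;       /* accumulated error                 */
    double prev_error;     /* error from the previous interval  */
};

int memory_target_devices(struct pid *c, double demand_gb, double online_gb,
                          double device_gb)
{
    double error = demand_gb - online_gb;        /* shortfall (+) or slack (-) */
    double derivative = error - c->prev_error;
    double adjust, target_gb;
    int devices;

    c->integral += error;
    c->prev_error = error;

    adjust = c->kp * error + c->ki * c->integral + c->kd * derivative;
    target_gb = online_gb + adjust;

    devices = (int)ceil(target_gb / device_gb);  /* round up to whole devices  */
    return devices < 1 ? 1 : devices;            /* never take all memory offline */
}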
2005 - Present
35
Memory MISER Evaluation of Prediction and Control
[Figure: Memory online vs. memory demand (0–8 GB) over time (0–35,000 seconds).]
Prediction/control looks good, but are we guaranteeing performance?
2005 - Present
36
Memory MISER Evaluation of Prediction and Control
Stable, accurate prediction using a PID controller.
But what about big (capacity) spikes?
[Figure: Memory online vs. memory demand (0–8 GB) during seconds 22,850–22,950.]
2005 - Present
37
Memory MISER Evaluation of Prediction and Control
Memory MISER guarantees performance in “worst” conditions.
[Figure: Memory online vs. memory used (0–8 GB) during seconds 16,940–17,060.]
2005 - Present
38
Memory MISER Evaluation: Energy Reduction
[Figure: FLASH memory demand: devices online (8–48) and memory demand (0–6 GB) over time (t0–t4), annotated with regions of stable PID control, high-frequency cyclic allocation/deallocation, tiered increases in memory allocations, and pinned OS pages that decrease efficiency.]
30% total system energy savings, less than 1% performance loss
2005 - Present
39
Present - 2012
SystemG Supercomputer @ VT
• 325 Mac Pro computer nodes, each with two 4-core 2.8 GHz Intel Xeon processors.
• Each node has eight gigabytes (GB) of random access memory (RAM). Each core has 6 MB of cache.
• Mellanox 40 Gb/s end-to-end InfiniBand adapters and switches.
• LINPACK result: 22.8 TFLOPS (trillion floating-point operations per second).
• Over 10,000 power and thermal sensors.
• Variable power modes: DVFS control (2.4 and 2.8 GHz), fan-speed control, concurrency throttling, etc.
(Check /sys/devices/system/cpu/cpuX/cpufreq/scaling_available_frequencies.)
• Intelligent power distribution units: Dominion PX (remotely control the servers and network devices; also monitor current, voltage, power, and temperature through Raritan's KVM switches and secure console servers).
SystemG Stats
Deployment Details
[Rack layout diagram]
* 13 racks total, 24 nodes per rack, 8 nodes per layer.
* 5 PDUs per rack (Raritan PDU model DPCS12-20). Each PDU in SystemG has a unique IP address, and users can use IPMI to access and retrieve information from the PDUs and to control them, e.g., remotely shutting down and restarting machines and recording system AC power.
* There are two types of switch:
  1) Ethernet switch: 1 Gb/s Ethernet; 36 nodes share one Ethernet switch.
  2) InfiniBand switch: 40 Gb/s InfiniBand; 24 nodes (one rack) share one IB switch.
Data collection system and LabVIEW
Sample diagram and corresponding front panel from LabVIEW:
A Power Profile for the HPCC Benchmark Suite
Published Papers and Useful Links
Papers:
1. Rong Ge, Xizhou Feng, Shuaiwen Song, Hung-Ching Chang, Dong Li, Kirk W. Cameron, PowerPack: Energy profiling and analysis of High-Performance Systems and Applications, IEEE Transactions on Parallel and Distributed Systems, Apr. 2009.
2. Shuaiwen Song, Rong Ge, Xizhou Feng, Kirk W. Cameron, Energy Profiling and Analysis of the HPC Challenge Benchmarks, The International Journal of High Performance Computing Applications, Vol. 23, No. 3, 265-276 (2009)
NI system set details:
http://sine.ni.com/nips/cds/view/p/lang/en/nid/202545
http://sine.ni.com/nips/cds/view/p/lang/en/nid/202571
46
The future…
• PowerPack
  – Streaming sensor data from any source
• PAPI Integration
  – Correlated to various systems and applications
• Prophesy Integration
  – Analytics to provide a unified interface
• SysteMISER
  – Study effects of power-aware disks and NICs
  – Study effects of emergent architectures (CMT, SMT, etc.)
  – Coschedule power modes for energy savings
Present - 2012
48
Outreach
• See http://green500.org
• See http://thegreengrid.org
• See http://www.spec.org/specpower/
• See http://hppac.cs.vt.edu
49
Acknowledgements
• My SCAPE Team
  – Dr. Xizhou Feng (PhD 2006)
  – Dr. Rong Ge (PhD 2008)
  – Dr. Matt Tolentino (PhD 2009)
  – Mr. Dong Li (PhD student, exp. 2010)
  – Mr. Song Shuaiwen (PhD student, exp. 2010)
  – Mr. Chun-Yi Su, Mr. Hung-Ching Chang
• Funding Sources
  – National Science Foundation (CISE: CCF, CNS)
  – Department of Energy (SC)
  – Intel
50
Thank you very much.
http://scape.cs.vt.edu
Thanks to our sponsors: NSF (Career, CCF, CNS), DOE (SC), Intel