Scalable Cluster Management: Frameworks, Tools, and Systems
David A. Evensky, Ann C. Gentile, Pete Wyckoff,
Robert C. Armstrong, Robert L. Clay, Ron Brightwell
Sandia National Laboratories
Lilith: a tool framework for very large clusters
• Most current tools for clusters are designed as monolithic programs, to do one task well.
• If you need a new task, you need a new tool.
• The Lilith framework allows users to easily construct new tools using a component framework.
Control of large distributed systems
• System administration
• Auditing & job control by users
• Interrogation of processes
• Simple applications
1 sec program on 1000 nodes: 16 min 10 sec
Lilith: Scalable component framework
[Figure: a client attached to a tree of machines, shown in three phases labeled Data Distribution, Execution, and Result Collection]
• Lilith spans a tree of machines executing user-defined code.
• User code (Lilim/Lilly) provides component functionality on a single node
• Lilith provides scalable distribution and result collection
Component Methods
• MO[] distributeOnTree(MO, int[])
– data distribution down the tree
• MO onTree(MO)
– component action on the node
• MO collateOnTree(MO[])
– result collection and condensation
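The three component methods can be read as one recursive pass over the tree. The following is a minimal sketch, not Lilith's actual implementation: only the three method names and their argument shapes come from the slide. The `MO` class here is a stand-in that wraps an int, and `TreeComponent`, `TreeNode`, `CountNodes`, and `TreeSketch` are hypothetical names. Children are visited sequentially for simplicity; a real framework would presumably contact them concurrently over the network.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in for the slide's "MO" type (assumption: it just wraps an int here).
class MO {
    final int value;
    MO(int value) { this.value = value; }
}

// Method names and shapes follow the slide; the interface name is hypothetical.
interface TreeComponent {
    MO[] distributeOnTree(MO input, int[] childIds); // one MO per child
    MO onTree(MO input);                             // action on this node
    MO collateOnTree(MO[] results);                  // condense results
}

// Hypothetical in-memory tree node; a real node would be a remote machine.
class TreeNode {
    final int id;
    final List<TreeNode> children = new ArrayList<>();
    TreeNode(int id) { this.id = id; }

    // Distribute down, act locally, then collate local and child results.
    MO run(TreeComponent c, MO input) {
        int[] ids = children.stream().mapToInt(n -> n.id).toArray();
        MO[] parts = c.distributeOnTree(input, ids);
        MO[] results = new MO[children.size() + 1];
        results[0] = c.onTree(input);
        for (int i = 0; i < children.size(); i++) {
            results[i + 1] = children.get(i).run(c, parts[i]);
        }
        return c.collateOnTree(results);
    }
}

// Example component: counts how many nodes the call reached.
class CountNodes implements TreeComponent {
    public MO[] distributeOnTree(MO input, int[] childIds) {
        MO[] out = new MO[childIds.length];
        Arrays.fill(out, input); // every child gets the same input
        return out;
    }
    public MO onTree(MO input) { return new MO(1); }
    public MO collateOnTree(MO[] results) {
        int sum = 0;
        for (MO r : results) sum += r.value;
        return new MO(sum);
    }
}

class TreeSketch {
    // Builds a binary tree with ids 0..total-1 (children of i are 2i+1, 2i+2).
    static TreeNode build(int id, int total) {
        TreeNode n = new TreeNode(id);
        for (int k = 1; k <= 2; k++) {
            int child = 2 * id + k;
            if (child < total) n.children.add(build(child, total));
        }
        return n;
    }

    public static void main(String[] args) {
        MO result = build(0, 7).run(new CountNodes(), new MO(0));
        System.out.println(result.value); // prints 7
    }
}
```

Because collation happens at every level, each node only handles the results of its own children, which is what keeps result collection from bottlenecking at the client.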
Security
Uses purely Java 2 mechanisms at this time…
User sends credential with call.
LilithHost creates ProtectionDomain from user credential.
LilithHost calls checkPermission.
Sandbox set up similarly using the user credential and Policy File.
[Figure: a method invocation arriving at the LilithHost, which holds the Policy and Keys]
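The credential-to-permission flow can be illustrated with standard `java.security` classes. This is a sketch under stated assumptions, not the LilithHost code: a plain user name stands in for the credential, the policy is an in-memory table rather than a policy file, and `HostSecuritySketch`, `grant`, `mayInvoke`, and the `lilith.` permission-name prefix are all hypothetical. The slide mentions `checkPermission`; the sketch substitutes `ProtectionDomain.implies` so that it does not rely on the Security Manager machinery, which is deprecated in recent Java releases.

```java
import java.security.Permissions;
import java.security.ProtectionDomain;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class HostSecuritySketch {
    // Hypothetical policy table: user name -> method names the user may invoke.
    private final Map<String, Set<String>> policy = new HashMap<>();

    void grant(String user, String method) {
        policy.computeIfAbsent(user, u -> new HashSet<>()).add(method);
    }

    // Build a ProtectionDomain whose static permissions come from the user's
    // policy entry. A fresh Permissions object is created on each call because
    // ProtectionDomain marks the collection it receives as read-only.
    ProtectionDomain domainFor(String user) {
        Permissions perms = new Permissions();
        for (String method : policy.getOrDefault(user, new HashSet<>())) {
            perms.add(new RuntimePermission("lilith." + method));
        }
        return new ProtectionDomain(null, perms);
    }

    // Stand-in for the checkPermission step: is the call allowed for this user?
    boolean mayInvoke(String user, String method) {
        return domainFor(user).implies(new RuntimePermission("lilith." + method));
    }
}
```

For example, after `host.grant("alice", "onTree")`, the call `host.mayInvoke("alice", "onTree")` succeeds, while the same check for a user with no policy entry is refused.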
Prototypical tools
System monitoring tool to track the state of a cluster of machines
PS-tool to get sortable process information from selected nodes of the cluster.
Lilith Lights tool
• Snake toy app
– demo that draws a snake over front panel
– no global repository for state; all info distributed
– Snake’s movement was limited to left half of machine
• program error in declaration of drand48() biased results
Who serves whom?
• Programmers adapt to:
– The OS that runs on the machine
– The system configuration chosen by the admins
– Changing system environments
• Economically driven to heterogeneous distributed computing
• Why can’t the user dictate the software environment as a resource request?
DASE
• Dynamically Adaptive Software Environment
• Provide multi-OS/multi-environment capability
• Manage multiple SW environments
• “save” user environment for reuse later
• Integration with SW component architectures
DASE Service Object Model
[Figure: DASE service object model. Labels: Physical system; Logical partitioning; “system” model; Partitioner; App Object (resource spec, data/map objects); Solver; Visualizer; Mesher; Scheduler; Resource Request]
Flexible Resource Management
[Figure: flexible resource management. Labels: DASE Client; DASE Session Manager; Scheduler/Resource Management; Virtual Machine; Application Environment; Hierarchical Net Booting; RM/VM (three instances); TFlops; PRE; HPVM; Custom; Linux; NT; Components; Frameworks; control; information; App Environment Specification]
Scalable Unit
[Figure: two scalable units. Labels per unit: 100BaseT hub; 16-port Myrinet switch; seven compute nodes; one service node; terminal server; power controller. Other labels: sss0; 8 Myrinet LAN cables; To system support network. Legend: power, serial, Ethernet, Myrinet]
System Support Hierarchy
[Figure: system support hierarchy. sss1 (admin access, master copy of system software) sits above three scalable units; in each unit, an sss0 holds the in-use copy of system software and the nodes NFS-mount root from sss0]
Hardware Management
• Discovery and Control
– Perl scripts that
• control individual devices (power controller, terminal server, machine, switch)
• build a database of configuration info (MAC and IP addresses, serial numbers, etc.)
• Roles
– database is augmented with each component's role in the system (compute, sss0, terminal server, etc.)
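The two-stage process described above (discover devices first, then augment each entry with a role) can be sketched as a small in-memory table. The original tooling is Perl scripts; this Java sketch only illustrates the data shape. The class, method, and field names are hypothetical, and the addresses and serial numbers in the usage note are made up from documentation-reserved ranges.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// One row of the hypothetical configuration database.
class DeviceRecord {
    final String name;
    final String mac;
    final String ip;
    final String serial;
    String role; // null until the role-assignment stage fills it in

    DeviceRecord(String name, String mac, String ip, String serial) {
        this.name = name;
        this.mac = mac;
        this.ip = ip;
        this.serial = serial;
    }
}

class ConfigDatabase {
    // Insertion order is preserved so that queries return a stable order.
    private final Map<String, DeviceRecord> byName = new LinkedHashMap<>();

    // Stage 1: discovery records the raw configuration info.
    void discover(String name, String mac, String ip, String serial) {
        byName.put(name, new DeviceRecord(name, mac, ip, serial));
    }

    // Stage 2: the database is augmented with each component's role.
    void assignRole(String name, String role) {
        DeviceRecord r = byName.get(name);
        if (r == null) throw new IllegalArgumentException("unknown device: " + name);
        r.role = role;
    }

    // Query: names of all devices with the given role, in discovery order.
    List<String> namesWithRole(String role) {
        List<String> out = new ArrayList<>();
        for (DeviceRecord r : byName.values()) {
            if (role.equals(r.role)) out.add(r.name);
        }
        return out;
    }
}
```

A typical use would be `db.discover("n0", "00:00:5e:00:53:01", "192.0.2.1", "SN-001")` followed by `db.assignRole("n0", "compute")`, after which `db.namesWithRole("compute")` lists the nodes that a boot or update operation should target.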
“Virtual Machines”
• Allows arbitrary grouping of scalable units that use the same system software
• Operations to update system software and boot nodes, scalable units, or machines
• Update system software on an SU in 1 min.
• Update system software on 24 SUs in 1.5 min.
• Boot an SU in 5 min. (staged for power drain)
• Boot 24 SUs in 10 min.
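The timings above (24 SUs take only slightly longer than one) suggest that update and boot operations run across scalable units concurrently rather than one after another. The sketch below assumes that reading. It models a virtual machine as a named group of scalable units and applies each operation to all members in parallel. All class and method names are hypothetical, the update and boot bodies are placeholders for the real work, and the staging of boots for power drain is not modeled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of one scalable unit (SU).
class ScalableUnit {
    final String name;
    volatile String softwareVersion = "none";
    volatile boolean booted = false;

    ScalableUnit(String name) { this.name = name; }

    // Placeholder for pushing system software to this SU.
    void update(String version) { softwareVersion = version; }

    // Placeholder for booting the nodes of this SU.
    void boot() { booted = true; }
}

// A "virtual machine": an arbitrary group of SUs sharing system software.
class VirtualMachine {
    final String name;
    private final List<ScalableUnit> units = new ArrayList<>();

    VirtualMachine(String name) { this.name = name; }

    void add(ScalableUnit su) { units.add(su); }

    // Apply the update to every SU in the group concurrently.
    void updateAll(String version) {
        units.parallelStream().forEach(su -> su.update(version));
    }

    // Boot every SU in the group concurrently.
    void bootAll() {
        units.parallelStream().forEach(ScalableUnit::boot);
    }

    // Number of SUs that are booted and running the given version.
    long countReady(String version) {
        return units.stream()
                .filter(su -> su.booted && version.equals(su.softwareVersion))
                .count();
    }
}
```

Grouping by shared system software means a single `updateAll` call keeps every member of the group consistent, which is the property the "same system software" bullet relies on.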
“Virtual Machines”
[Figure: sss1 uses rdist to push system software down to the sss0 of each of three scalable units; each sss0 holds the in-use copy of system software and its nodes NFS-mount root from sss0. Other labels: Linux 2.3; Beta; Alpha; Production; SU configuration database]
http://dancer.ca.sandia.gov
http://www.cplant.ca.sandia.gov
http://www.cs.sandia.gov/cplant