1 Putchong Uthayopas, Thara Angsakul, Jullawadee Maneesilp Parallel Research Group, Computer and...
Putchong Uthayopas, Thara Angsakul, Jullawadee Maneesilp
Parallel Research Group,
Computer and Network System Research Laboratory
Department of Computer Engineering, Faculty of Engineering
Kasetsart University, Bangkok, Thailand
Phone: (662) 942 8555 Ext. 1416
Fax: (662) 5614621
Email: [email protected]

Motivation
• Beowulf clusters have become one of the most widely used platforms for high-performance computing
• Very large and complex Beowulf clusters are starting to appear
• System management is still a challenging task. There is a need for
– An effective way to navigate and interact with cluster components
– Mechanisms and tools to perform collective commands
– Services such as monitoring, fault detection, and recovery
– Special software tools that recognize the special characteristics and needs of the cluster administration task

SCMS: An Extensible Cluster Management Tool for Beowulf Cluster
• A collection of system management tools for Beowulf clusters
• Package includes
– Portable real-time monitoring
– Parallel Unix command
– Alarm system
– Large collection of graphical user interface tools for users and system administrators
  • Checking user status
  • Remote software installation
  • System disk space and process space status
  • Boot up and shutdown nodes
  • Change node configuration remotely
– Web/VRML interface
• Current version 1.1 only supports RedHat Linux

Portable Real-time Monitoring
• Provides global access to node information
– Interfaces with the local OS to get node information
– Collects the information at a single point
– Provides heartbeat and node health diagnostics
– Provides an API for applications to access the information. The API is available in C, Java, and Tcl/Tk.
• System Architecture
– Client/Server
– Layered Architecture
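The heartbeat idea can be sketched as follows. This is a minimal illustration, not the SCMS implementation: the class name HeartbeatTable, the 5-second timeout, and the shape of the per-node report are all assumptions made for the example. Each report from a node doubles as its heartbeat, and a node that stays silent longer than the timeout is treated as down.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class HeartbeatTable:
    """Single collection point for per-node reports (hypothetical design)."""
    timeout: float = 5.0                         # assumed: seconds of silence before a node counts as down
    clock: Callable[[], float] = time.monotonic  # injectable so the logic can be tested without sleeping
    last_seen: Dict[str, float] = field(default_factory=dict)
    info: Dict[str, dict] = field(default_factory=dict)

    def report(self, node: str, stats: dict) -> None:
        """Store the latest stats from a node; the report doubles as a heartbeat."""
        self.last_seen[node] = self.clock()
        self.info[node] = dict(stats)

    def alive(self, node: str) -> bool:
        """A node is alive if it has reported within the timeout window."""
        seen = self.last_seen.get(node)
        return seen is not None and self.clock() - seen <= self.timeout

    def down_nodes(self) -> List[str]:
        """Names of known nodes whose last report is older than the timeout."""
        return sorted(n for n in self.last_seen if not self.alive(n))
```

A monitoring front end would call down_nodes() periodically and read info for the nodes that are still reporting.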

System Architecture
• CMA - Control and Monitoring Agent
– Gets system information from the local operating system on each node
– Portability is achieved using HAL (Hardware Abstraction Layer)
• SMA - System Management Agent
– Runs on the management node to collect information from the CMAs
• RMI - Resource Management Interface
– Library that provides an interface to the functionality of SMA
[Figure: layered architecture diagram. Labels: CMA (several), SMA, System Information Repository, Resource Management API (C, TCL, Java), Configuration Management, Task Scheduling, Performance Monitoring, Parallel Unix command, HAL API, HAL, Local OS (Linux).]
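The layering of CMA over HAL, with the SMA collecting from every CMA, can be sketched as below. The slides do not give the real HAL interface or wire protocol, so the method names (load_average, free_memory_kb), the FakeHAL backend, and the in-process calls standing in for network communication are all assumptions for illustration.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class HAL(ABC):
    """Hypothetical hardware abstraction layer: one subclass per platform."""

    @abstractmethod
    def load_average(self) -> float: ...

    @abstractmethod
    def free_memory_kb(self) -> int: ...


class FakeHAL(HAL):
    """Stand-in backend returning fixed values; a real one would query the local OS."""

    def __init__(self, load: float, free_kb: int) -> None:
        self._load, self._free_kb = load, free_kb

    def load_average(self) -> float:
        return self._load

    def free_memory_kb(self) -> int:
        return self._free_kb


class CMA:
    """Per-node agent: reads local state only through the HAL."""

    def __init__(self, node: str, hal: HAL) -> None:
        self.node, self.hal = node, hal

    def sample(self) -> dict:
        return {"node": self.node,
                "load": self.hal.load_average(),
                "free_kb": self.hal.free_memory_kb()}


class SMA:
    """Management-node agent: gathers one sample per CMA into a single repository."""

    def __init__(self, agents: List[CMA]) -> None:
        self.agents = agents
        self.repository: Dict[str, dict] = {}

    def collect(self) -> Dict[str, dict]:
        # Direct method calls stand in for the network round trip to each node.
        for agent in self.agents:
            s = agent.sample()
            self.repository[s["node"]] = s
        return self.repository
```

Because the CMA touches the platform only through HAL, porting to another operating system in this sketch means writing one new HAL subclass and leaving CMA and SMA unchanged.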

Parallel Unix Command
• Parallel versions of commonly used Unix commands such as pps, pls, prm
• Follows the scalable Unix tool model (Lusk and Gropp 1994)
• Graphical user interface for these commands
– Ease of use
– Filtering output data
[Figure: a single "pps aux" fans out as "ps aux" on each node; arrows are labeled "command" and "data".]
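The fan-out behind a command such as pps can be sketched as follows. This is an illustration under stated assumptions, not the SCMS code: the slides do not say which transport is used, so ssh is assumed here, and the names parallel_command, run_via_ssh, and the keep filter are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_via_ssh(node: str, argv: List[str]) -> str:
    """Assumed transport: run argv on node over ssh and return its stdout."""
    result = subprocess.run(["ssh", node, *argv],
                            capture_output=True, text=True, check=True)
    return result.stdout


def parallel_command(nodes: List[str],
                     argv: List[str],
                     runner: Callable[[str, List[str]], str] = run_via_ssh,
                     keep: Callable[[str], bool] = lambda line: True) -> Dict[str, List[str]]:
    """Run argv on every node concurrently; return {node: output lines passing keep}."""
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        # pool.map preserves the order of nodes, so zip pairs each output with its node.
        outputs = pool.map(lambda n: runner(n, argv), nodes)
        return {n: [line for line in out.splitlines() if keep(line)]
                for n, out in zip(nodes, outputs)}
```

For example, parallel_command(["node1", "node2"], ["ps", "aux"], keep=lambda l: "alice" in l) would return only the process lines mentioning alice, grouped by node, which mirrors the output filtering the GUI offers.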

Alarm System
• Set of daemons that monitor important system parameters
– Processor utilization, memory usage, main board temperature, and more
• Users can specify the alarm condition and the action to be taken
• Issues the alarm and shuts down part of the system if needed
• Notification is sent by email. Future releases will include pager, ICQ, and speech synthesis
[Figure: alarm system diagram. Labels: Detector (several), Alarm Manager, Config, Notification/action.]
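User-specified conditions and actions can be sketched as rule objects checked by a manager. The names AlarmRule and AlarmManager, the reading keys (cpu, temp_c), and the thresholds used in the usage note are assumptions for illustration; the real configuration format is not shown in the slides.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Reading = Dict[str, float]


@dataclass
class AlarmRule:
    """One user-defined rule: when condition(reading) holds, run action."""
    name: str
    condition: Callable[[Reading], bool]
    action: Callable[[str, Reading], None]


class AlarmManager:
    """Checks each detector reading against every configured rule."""

    def __init__(self, rules: Iterable[AlarmRule]) -> None:
        self.rules = list(rules)

    def check(self, node: str, reading: Reading) -> List[str]:
        """Run the action of every rule that matches; return the names that fired."""
        fired = []
        for rule in self.rules:
            if rule.condition(reading):
                rule.action(node, reading)
                fired.append(rule.name)
        return fired
```

A rule such as AlarmRule("overheat", lambda r: r["temp_c"] > 70, notify) would call a hypothetical notify function, for instance one that sends email or starts a shutdown, whenever a detector reports a board temperature above the assumed 70 degree threshold.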

SCMS Utilities
SCMS comes with many GUI utilities
• Node status
• Control Panel
• Disk Space
• Process Status
• Shutdown/Reboot
• Remote login
• User status
• Package Installation

SCMS Screen Shot

KCAP: Web and VRML based Interface for SCMS
• Two versions of the Web interface are available
– KCAP: Normal web interface
– KCAP-VR: VRML interface that allows you to walk through and interact with your cluster
• A Java applet is used to report real-time system information
[Figure: KCAP architecture diagram. Labels: Web Generator, VRML World Generator, VRML World, Web Tree, Web server, External Network, Real time Monitoring, System Config.]

KCAP and KCAP-VR Screen Shot

Future Works
• KSIX: A framework to support parallel tools and applications
• Offers features such as
– Process control, signal delivery
– Naming services
– Event based communication
[Figure: KSIX layer diagram. Labels: Application, MPI, KSIX (Kasetsart System Interconnect eXecutive), Node OS and Node Hardware (four nodes), Interconnection Network.]

SQMS: SMILE Queuing Management System
• Batch scheduler for sequential and parallel tasks
• Static and dynamic load balancing
• Reconfigurable scheduling policy
• Auto docking between clusters
[Figure: SQMS component diagram. Labels: Submitter, Task, TaskQueue, NodeAllocator, Scheduler, Cluster Nodes, RemoteQueue.]
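A batch scheduler with a swappable placement policy can be sketched as below. This is not SQMS itself: the class names echo the diagram labels loosely, the least_loaded policy is one assumed example of a load-balancing policy, and adding 1.0 to a node's load per placed task is an arbitrary bookkeeping assumption.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, List, Optional, Tuple


@dataclass
class Task:
    name: str
    nodes_needed: int = 1   # 1 for a sequential task, more for a parallel one


Policy = Callable[[Dict[str, float], int], Optional[List[str]]]


def least_loaded(load: Dict[str, float], count: int) -> Optional[List[str]]:
    """Example policy: pick the count least-loaded nodes, or None if too few exist."""
    if count > len(load):
        return None
    return sorted(load, key=lambda n: (load[n], n))[:count]


class Scheduler:
    """FIFO task queue plus a replaceable node-allocation policy."""

    def __init__(self, load: Dict[str, float], policy: Policy = least_loaded) -> None:
        self.queue: Deque[Task] = deque()
        self.load = dict(load)
        self.policy = policy

    def submit(self, task: Task) -> None:
        self.queue.append(task)

    def dispatch(self) -> List[Tuple[str, List[str]]]:
        """Place queued tasks in order until the head of the queue cannot be placed."""
        placements = []
        while self.queue:
            task = self.queue[0]
            chosen = self.policy(self.load, task.nodes_needed)
            if chosen is None:
                break
            self.queue.popleft()
            for node in chosen:
                self.load[node] += 1.0   # assumed load estimate per placed task
            placements.append((task.name, chosen))
        return placements
```

Passing a different function as policy changes the placement behavior without touching the queue logic, which is one simple way to read "reconfigurable scheduling policy".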

Beowulf Computing Environment at Kasetsart University, Thailand
SMILE Beowulf Cluster
– 16-node Pentium II/III cluster
• Test bed for cluster technology and support for HPC research activities
PIRUN Beowulf Cluster (Pile of Redundant Universal Nodes)
• 72-node Beowulf system
– PII 500 MHz, 128 MB RAM
• Largest computing system in Thailand
• Installation will be complete in December 1999