Page 1:

Putchong Uthayopas, Thara Angsakul, Jullawadee Maneesilp

Parallel Research Group,
Computer and Network System Research Laboratory
Department of Computer Engineering, Faculty of Engineering
Kasetsart University, Bangkok, Thailand

Phone: (662) 942 8555 Ext. 1416
Fax: (662) 5614621
Email: [email protected]

Page 2:

Motivation

• Beowulf clusters have become one of the most widely used platforms for high-performance computing
• Very large and complex Beowulf clusters are starting to appear
• System management is still a challenging task. There is a need for:
  – An effective way to navigate and interact with cluster components
  – Mechanisms and tools to perform collective commands
  – Services such as monitoring, fault detection, and recovery
  – Special software tools that recognize the special characteristics and needs of cluster administration tasks

Page 3:

SCMS: An Extensible Cluster Management Tool for Beowulf Cluster

• A collection of system management tools for Beowulf clusters
• Package includes
  – Portable real-time monitoring
  – Parallel Unix commands
  – Alarm system
  – Large collection of graphical user interface tools for users and system administrators
    • Checking user status
    • Remote software installation
    • System disk space and process space status
    • Booting up and shutting down nodes
    • Changing node configuration remotely
  – Web/VRML interface
• Current version 1.1 supports only Red Hat Linux

Page 4:

Portable Real-time Monitoring

• Provides global access to node information
  – Interfaces with the local OS to get node information
  – Collects the information at a single point
  – Provides heartbeat and node health diagnostics
  – Provides an API for applications to access the information. The API is available in C, Java, and Tcl/Tk.
• System Architecture
  – Client/Server
  – Layered Architecture

Page 5:

System Architecture

• CMA - Control and Monitoring Agent
  – Gets system information from the local operating system on each node
  – Portability is achieved using HAL (Hardware Abstraction Layer)
• SMA - System Management Agent
  – Runs on the management node to collect information from the CMAs
• RMI - Resource Management Interface
  – Library that provides an interface to the functionality of the SMA

[Diagram labels: CMA (four boxes); SMA; System Information Repository; Resource Management API (C, TCL, Java); Configuration Management; Task Scheduling; Performance Monitoring; Parallel Unix command; Local OS (Linux); HAL; HAL API; CMA]
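The division of labor between HAL, CMA, and SMA on this slide can be illustrated with a small sketch. This is not SCMS code: the class names, method names, and the `FakeHAL` stand-in are all assumptions made for illustration, and the real agents communicate over the network rather than through direct calls.

```python
from abc import ABC, abstractmethod


class HAL(ABC):
    """Hypothetical hardware abstraction layer: one subclass per platform."""

    @abstractmethod
    def hostname(self) -> str: ...

    @abstractmethod
    def load_average(self) -> float: ...


class FakeHAL(HAL):
    """Stand-in HAL returning fixed values so the sketch runs anywhere."""

    def __init__(self, name: str, load: float):
        self._name = name
        self._load = load

    def hostname(self) -> str:
        return self._name

    def load_average(self) -> float:
        return self._load


class CMA:
    """Node-side agent: reads the local system only through the HAL."""

    def __init__(self, hal: HAL):
        self.hal = hal

    def report(self) -> dict:
        return {"node": self.hal.hostname(), "load": self.hal.load_average()}


class SMA:
    """Management-node agent: gathers CMA reports into one repository."""

    def __init__(self):
        self.repository = {}

    def collect(self, agents) -> dict:
        # In a real system this would be a network poll; here it is a call.
        for agent in agents:
            info = agent.report()
            self.repository[info["node"]] = info
        return self.repository


sma = SMA()
repo = sma.collect([CMA(FakeHAL("node1", 0.25)), CMA(FakeHAL("node2", 1.5))])
print(repo["node2"]["load"])
```

Because the CMA touches the operating system only through the `HAL` interface, porting to another platform in this sketch means writing one new `HAL` subclass and leaving the agents unchanged.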

Page 6:

Parallel Unix Command

• Parallel versions of commonly used Unix commands such as pps, pls, and prm
• Follows the scalable Unix tool model (Lusk and Gropp 1994)
• Graphical user interface for these commands
  – Ease of use
  – Filtering of output data

[Diagram labels: pps aux; ps aux (three nodes); command; data]
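The fan-out pattern behind a command like `pps` can be sketched as follows: the same command goes to every node at once, and the per-node output is gathered and optionally filtered. This is an illustrative sketch only, not the SCMS implementation. The function names and the `run_on_node` transport hook are hypothetical, and `fake_runner` replaces real remote execution so the example runs on one machine.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_command(nodes, command, run_on_node, max_workers=8):
    """Run `command` on every node concurrently and return {node: output}.

    `run_on_node(node, command)` is a hypothetical transport hook; a real
    tool would use a remote shell or a node daemon here.
    """
    nodes = list(nodes)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so outputs line up with nodes.
        outputs = pool.map(lambda node: run_on_node(node, command), nodes)
        return dict(zip(nodes, outputs))


def filter_output(results, keyword):
    """Keep only the output lines that contain `keyword`, per node."""
    return {
        node: [line for line in text.splitlines() if keyword in line]
        for node, text in results.items()
    }


def fake_runner(node, command):
    """Stand-in for remote execution so the sketch runs on one machine."""
    return f"{node}: root 1 init\n{node}: alice 42 {command}"


results = parallel_command(["node1", "node2", "node3"], "ps aux", fake_runner)
print(filter_output(results, "alice"))
```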

Page 7:

Alarm System

• Set of daemons that monitor important system parameters
  – Processor utilization, memory usage, main board temperature, and more
• Users can specify the alarm condition and the action to be taken
• Issues the alarm and shuts down part of the system if needed
• Notification is sent by email. Future releases will include pager, ICQ, and speech synthesis

[Diagram labels: Detector (four boxes); Alarm Manager; Config; Notification/action]
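The pairing of a user-specified condition with an action can be sketched as a small rule table that an alarm manager evaluates against each detector reading. This is an assumption-laden illustration, not the SCMS design: `AlarmRule`, `AlarmManager`, the reading keys, and the thresholds are all invented here, and `notify` appends to a list where the real system would send email.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Reading = Dict[str, float]


@dataclass
class AlarmRule:
    """One user-specified rule: a condition on a reading and an action."""
    name: str
    condition: Callable[[Reading], bool]
    action: Callable[[str, Reading], None]


class AlarmManager:
    """Evaluates every rule against a reading and runs matching actions."""

    def __init__(self, rules: List[AlarmRule]):
        self.rules = rules

    def check(self, node: str, reading: Reading) -> List[str]:
        fired = []
        for rule in self.rules:
            if rule.condition(reading):
                rule.action(node, reading)
                fired.append(rule.name)
        return fired


notifications: List[str] = []


def notify(node: str, reading: Reading) -> None:
    # Stand-in for sending email; the thresholds below are hypothetical.
    notifications.append(f"ALARM on {node}: {reading}")


rules = [
    AlarmRule("hot_board", lambda r: r["board_temp_c"] > 70.0, notify),
    AlarmRule("cpu_saturated", lambda r: r["cpu_util"] > 0.95, notify),
]
manager = AlarmManager(rules)
fired = manager.check("node7", {"board_temp_c": 82.0, "cpu_util": 0.40})
print(fired, len(notifications))
```

Keeping the action as a plain callable means a shutdown action or a different notification channel could be attached to a rule without changing the manager.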

Page 8:

SCMS Utilities

SCMS comes with many GUI utilities

• Node status
• Control panel
• Disk space
• Process status
• Shutdown/reboot
• Remote login
• User status
• Package installation

Page 9:

SCMS Screen Shot

Page 10:

KCAP Web and VRML based Interface for SCMS

• Two versions of the Web interface are available
  – KCAP: Normal web interface
  – KCAP-VR: VRML interface that allows you to walk through and interact with your cluster
• A Java applet is used to report real-time system information

[Diagram labels: Web Generator; VRML World Generator; VRML World; Web Tree; Web server; External Network; Real time Monitoring; System Config]

Page 11:

KCAP and KCAP-VR Screen Shot

Page 12:

Future Works

• KSIX: A framework to support parallel tools and applications
• Offers features such as
  – Process control, signal delivery
  – Naming services
  – Event-based communication

[Diagram labels: Application (two boxes); MPI; KSIX (Kasetsart System Interconnect eXecutive); Node OS and Node Hardware (four nodes); Interconnection Network]

Page 13:

SQMS: SMILE Queuing Management System

• Batch scheduler for sequential and parallel tasks
• Static and dynamic load balancing
• Reconfigurable scheduling policy
• Auto docking between clusters

[Diagram labels: Submitter; Task; TaskQueue; NodeAllocator; Scheduler; Cluster Nodes; RemoteQueue]
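One way to picture a reconfigurable scheduling policy is a scheduler that takes the policy as a replaceable function. The slide gives no detail on how SQMS works internally, so everything below is an assumption made for illustration: the `Task` and `Scheduler` classes, the two policies, and the least-loaded node allocation. Dynamic load balancing and docking between clusters are not covered.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    nodes_needed: int  # 1 for a sequential task, more for a parallel task


def fifo_policy(queue):
    """Pick the task that has waited longest (index 0)."""
    return 0


def smallest_first_policy(queue):
    """Pick the task that needs the fewest nodes."""
    return min(range(len(queue)), key=lambda i: queue[i].nodes_needed)


class Scheduler:
    def __init__(self, node_loads, policy=fifo_policy):
        self.node_loads = dict(node_loads)  # node name -> running task count
        self.policy = policy                # can be swapped at run time
        self.queue = []

    def submit(self, task):
        if task.nodes_needed > len(self.node_loads):
            raise ValueError(f"{task.name} needs more nodes than exist")
        self.queue.append(task)

    def dispatch(self):
        """Assign one queued task to the least-loaded nodes, or return None."""
        if not self.queue:
            return None
        task = self.queue.pop(self.policy(self.queue))
        by_load = sorted(self.node_loads, key=self.node_loads.get)
        chosen = by_load[: task.nodes_needed]
        for node in chosen:
            self.node_loads[node] += 1
        return task.name, chosen


sched = Scheduler({"n1": 2, "n2": 0, "n3": 1}, policy=smallest_first_policy)
sched.submit(Task("mpi_job", 2))
sched.submit(Task("serial_job", 1))
print(sched.dispatch())
print(sched.dispatch())
```

Assigning a different function to `sched.policy` changes the order in which queued tasks are picked without touching the allocation logic.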

Page 14:

Beowulf Computing Environment at Kasetsart University, Thailand

SMILE Beowulf Cluster
  – 16-node Pentium II/III cluster
• Test bed for cluster technology and support of HPC research activities

PIRUN Beowulf Cluster (Pile of Redundant Universal Nodes)
• 72-node Beowulf system
  – PII 500 MHz, 128 MB RAM
• Largest computing system in Thailand
• Installation will be complete in December 1999