Issues with Production Grids. Tony Hey, Director of UK e-Science Core Programme.



Transcript of "Issues with Production Grids", Tony Hey, Director of UK e-Science Core Programme.

Slide 1

Issues with Production Grids
Tony Hey, Director of UK e-Science Core Programme

Slide 2
The Grid is a set of core middleware services running on top of high-performance global networks to support research and innovation.

Slide 3
Grids of Grids of Simple Services: overlay and compose Grids of Grids.
- Methods, Services -> Functional Grids
- CPUs, Clusters, MPPs -> Compute Resource Grids
- Databases, Federated Databases, Sensor Nets -> Data Resource Grids

Slide 4
NGS Today
- Projects: e-Minerals, e-Materials, Orbital Dynamics of Galaxies, Bioinformatics (using BLAST), GEODISE project, UKQCD Singlet meson project, Census data analysis, MIAKT project, e-HTPX project, RealityGrid (chemistry)
- Users: Leeds, Oxford, UCL, Cardiff, Southampton, Imperial, Liverpool, Sheffield, Cambridge, Edinburgh, QUB, BBSRC, CCLRC
- Interfaces: OGSI::Lite

Slide 5
NGS Hardware
Compute Cluster:
- 64 dual-CPU Intel 3.06 GHz (1 MB cache) nodes, 2 GB memory per node
- 2x 120 GB IDE disks per node (1 boot, 1 data)
- Gigabit network; Myrinet M3F-PCIXD-2
- Front end (as node); disk server (as node) with 2x Infortrend 2.1 TB U16U SCSI arrays (UltraStar 146Z10 disks)
- PGI compilers; Intel compilers, MKL; PBSPro; TotalView debugger; RedHat ES 3.0
Data Cluster:
- 20 dual-CPU Intel 3.06 GHz nodes, 4 GB memory per node
- 2x 120 GB IDE disks per node (1 boot, 1 data)
- Gigabit network; Myrinet M3F-PCIXD-2
- Front end (as node); 18 TB fibre SAN (Infortrend F16F 4.1 TB fibre arrays, UltraStar 146Z10 disks)
- PGI compilers; Intel compilers, MKL; PBSPro; TotalView debugger
- Oracle 9i RAC; Oracle Application Server; RedHat ES 3.0

Slide 6
NGS Software

Slide 7
RealityGrid AHM Experiment [Figure: Src SH2 domain with ligand]
Measuring protein-peptide binding energies: ΔGbind is vital, e.g. for understanding fundamental physical processes at play at the molecular level and for designing new drugs. Computing a peptide-protein binding energy traditionally takes weeks to months. We have developed a grid-based method to accelerate this process. We computed ΔGbind during the UK AHM, i.e.
in less than 48 hours.

Slide 8
Experiment Details
- A grid-based approach using the RealityGrid steering library enables us to launch, monitor, checkpoint and spawn multiple simulations.
- Each simulation is a parallel molecular dynamics simulation running on a supercomputer-class machine.
- At any given instant we had up to nine simulations in progress (over 140 processors) on machines at 5 different sites, e.g. 1x TG-SDSC, 3x TG-NCSA, 3x NGS-Oxford, 1x NGS-Leeds, 1x NGS-RAL.

Slide 9
Experiment Details (2)
- In all, 26 simulations were run over 48 hours; we simulated over 6.8 ns of classical molecular dynamics in this time.
- Real-time visualization and off-line analysis required bringing back data from simulations in progress.
- We used UKLight between UCL and the TeraGrid machines (SDSC, NCSA).

Slide 10
The e-Infrastructure
[Diagram: UK NGS sites (Leeds, Oxford, RAL, Manchester, UCL) and US TeraGrid sites (PSC, SDSC, NCSA) connected via UKLight through Starlight (Chicago) and Netherlight (Amsterdam); a network PoP and service registry; steering clients on local laptops and a Manchester vncserver at AHM 2004. All sites connected by the production network (not all shown).]

Slide 11
The scientific results: some simulations require extending, and more sophisticated analysis needs to be performed.

Slide 12
... and the problems:
- Restarted the GridService container Wednesday evening
- Numerous quota and permission issues, especially at TG-SDSC
- NGS-Oxford was unreachable Wednesday evening to Thursday morning
- The steerer and launcher occasionally fail
- We were unable to checkpoint two simulations
- The batch queuing systems occasionally did not like our simulations
- 5 simulations died of natural causes
Overall, up to six people were working on this calculation to solve these problems.

Slide 13
NGS Tomorrow
- Grid Operations Support Centre
- Web Services based National Grid Infrastructure

Slide 14
WS-I+ profile (Web Service Grids: An Evolutionary Approach to WSRF)
- WS-I: standards that have broad industry support and multiple interoperable implementations
- Specifications that are
emerging from the standardisation process and are recognised as being useful
- Specifications that have entered, or will enter, a standardisation process but are not yet stable and are still experimental

Slide 15
OMII Vision
- To be the national provider of reliable, interoperable, open-source grid middleware
- Provide a one-stop portal and software repository for grid middleware
- Provide quality-assured software engineering, testing, packaging and maintenance for our products
- Lead the evolution of Grid middleware through a managed programme and wide-reaching collaboration with industry

Slide 16
OMII Distribution, 1 Oct 2004
- A collection of tested, documented and integrated software components for Web Service Grids
- A base built from off-the-shelf Web Services technology
- A package of extensions that can be enabled as required
- An initial set of Web Services for building file-compute collaborative grids
- A technical preview of the Web Service version of the OGSA-DAI database middleware
- Sample applications

Slide 17
OMII future distributions: include the services in previous distributions, plus OMII managed programme contributions:
- Database service
- Workflow service
- Registry service
- Reliable messaging service
- Notification service
- Interoperability with other grids

Slide 18
Why Workflows and Services?
- Workflow = a general technique for describing and enacting a process
- Workflow = describes what you want to do, not how you want to do it
- Web Service = how you want to do it
- Web Service = automated programmatic internet access to applications
Automation:
- Capturing processes in an explicit manner
- Tedium! Computers don't get bored/distracted/hungry/impatient!
- Saves repeated time and effort
- Modification, maintenance, substitution and personalisation
- Easy to share, explain, relocate, reuse and build on
- Available to a wider audience: you don't need to be a coder, just need to know how to do bioinformatics
- Releases scientists/bioinformaticians to do other work
- Records provenance: what the data is like, where it came from, its quality
- Management of data (LSID, Life Science IDentifiers)

Slide 19
Workflow Components
- Scufl: Simple Conceptual Unified Flow Language
- Taverna: writing and running workflows and examining results
- SOAPLAB: makes applications available as Web Services
- Freefluo: workflow engine to run workflows
[Diagram: Freefluo invokes a SOAPLAB Web Service wrapping any application, and other Web Services, e.g. DDBJ BLAST.]

Slide 20
The Williams Workflows [Diagram: panels A, B, C]
- A: Identification of overlapping sequence
- B: Characterisation of nucleotide sequence
- C: Characterisation of protein sequence

Slide 21
The Workflow Experience
- Correct and biologically meaningful results
- Automation: saved time and increased productivity; the process was split into three, and you still require humans!
- Sharing: other people have used, and want to develop, the workflows
- Change of work practices: post hoc analysis (don't analyse data piece by piece; receive all the data at once); data stored and collected in a more standardised manner
- Results amplification
- Results management and visualisation
Have workflows delivered on their promise? YES!

Slide 22
Future UK e-Infrastructure?
LHC, ISIS TS2, HPCx + HECToR, GOSC, regional and campus grids. Users get common access, tools, information and nationally supported services through the NGS, with robust, standards-compliant middleware from the OMII. Integrated internationally. VRE, VLE, IE.

Slide 23
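Slides 18 and 19 describe workflows as the "what" (an ordered process description) and Web Services as the "how" (the applications behind each step). A minimal Python sketch of that separation follows; the service functions are stand-ins for illustration only, not Taverna's, SOAPLAB's or Freefluo's real APIs.

```python
# Minimal sketch of the workflow/service split from Slides 18-19:
# the workflow lists *what* steps to run; each service hides *how*.
# All function names and data shapes here are hypothetical stand-ins.

def fetch_sequence(accession):
    """Stand-in for a data service (e.g. a sequence-database lookup)."""
    return {"id": accession, "seq": "MKTAYIAKQR"}

def run_blast(record):
    """Stand-in for a BLAST web service; returns mock hits."""
    return {"query": record["id"], "hits": ["hitA", "hitB"]}

def summarise(result):
    """Stand-in for an analysis/reporting step."""
    return f"{result['query']}: {len(result['hits'])} hits"

# The workflow itself: just an ordered list of steps (the "what").
workflow = [fetch_sequence, run_blast, summarise]

def enact(workflow, data):
    """A trivial engine: pipe each step's output into the next."""
    for step in workflow:
        data = step(data)
    return data

print(enact(workflow, "AB000095"))  # -> "AB000095: 2 hits"
```

Because the workflow is data (a list of steps), it can be shared, modified and re-run without touching the services themselves, which is the reuse argument Slide 18 makes.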
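The steering pattern of Slide 8 (launch several simulations, monitor them, checkpoint survivors, spawn replacements for failures, as happened on Slide 12) can be sketched roughly as below. The class and method names are illustrative assumptions, not the RealityGrid steering library's actual interface.

```python
# Hedged sketch of the Slide 8 steering loop: launch, monitor,
# checkpoint, respawn. Names are hypothetical, NOT RealityGrid's API.
import random

class Simulation:
    def __init__(self, site):
        self.site = site
        self.step = 0
        self.alive = True

    def advance(self):
        """Run one monitoring interval of the simulation."""
        self.step += 1
        # A few simulations "die of natural causes", as on Slide 12.
        if random.random() < 0.05:
            self.alive = False

    def checkpoint(self):
        """Record enough state to restart later."""
        return (self.site, self.step)

def steer(sites, steps):
    sims = [Simulation(s) for s in sites]       # launch
    checkpoints = []
    for _ in range(steps):                      # monitor loop
        for i, sim in enumerate(sims):
            if sim.alive:
                sim.advance()
                checkpoints.append(sim.checkpoint())
            else:
                sims[i] = Simulation(sim.site)  # respawn at same site
    return checkpoints

cps = steer(["NGS-Oxford", "NGS-Leeds", "TG-SDSC"], steps=10)
```

The point of the pattern is that failures are expected and handled in the loop, so one person (or six, as on Slide 12) can keep dozens of simulations progressing across unreliable sites.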