
Transcript of Berkeley Research Computing Town Hall Meeting: Savio Overview

Berkeley Research Computing

Town Hall Meeting
Savio Overview

SAVIO - The Need Has Been Stated

Inception and design were based on a specific need articulated by Eliot Quataert and nine other faculty:

Dear Graham,

We are writing to propose that UC Berkeley adopt a condominium computing model, i.e., a more centralized model for supporting research computing on campus...

SAVIO - Condo Service Offering

● Purchase into Savio by contributing standardized compute hardware

● An alternative to running a cluster in a closet with grad students and postdocs

● The condo trade-off:
○ Idle resources are made available to others
○ There are no (ZERO) operational costs for administration, colocation, base storage, optimized networking and access methods, and user services

● Scheduler gives priority access to resources equivalent to the hardware contribution

SAVIO - Faculty Computing Allowance

● Provides allocations to run on Savio as well as support to researchers who have not purchased Condo nodes

● 200k Service Units (core hours) annually
● More than just compute:
○ File systems
○ Training/support
○ User services
● PIs request their allocation via survey
● Early user access (based on readiness) now
● General availability planned for fall semester

SAVIO - System Overview

● Similar in design to a typical research cluster
○ Master Node role has been broken out (management, scheduling, logins, file system, etc.)
● Home storage: enterprise level, backed up, with quotas
● Scratch space: large and fast (Lustre)
● Multiple login/interactive nodes
● DTN: Data Transfer Node
● Compute nodes are delineated based on role

SAVIO - System Architecture

SAVIO - Specification

● Hardware
○ Compute Nodes: 20-core, 64GB, InfiniBand
○ BigMem Nodes: 20-core, 512GB, InfiniBand
● Software Stack
○ Scientific Linux 6 (equivalent to Red Hat Enterprise Linux 6)
○ Parallelization: OpenMPI, OpenMP, POSIX threads (see the compile sketch after this list)
○ Intel Compiler
○ SLURM job scheduler
○ Software Environment Modules
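As a hedged illustration of how the stack fits together, here is a minimal OpenMP compile-and-run sketch; the module name intel, the source file hello.c, and the thread count are assumptions, and the OpenMP flag spelling varies across Intel compiler versions:

$ module load intel                  # load the Intel compiler (module name assumed)
$ icc -qopenmp hello.c -o hello      # build a hypothetical program with OpenMP enabled
$ OMP_NUM_THREADS=20 ./hello         # one thread per core on a 20-core node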

SAVIO - OTP

● The biggest security threat that we encounter ...

STOLEN CREDENTIALS

● Credentials are stolen via keyboard sniffers installed on researchers' laptops or workstations, which are incorrectly assumed to be secure

● OTP (One Time Passwords) offers mitigation
● Easy to learn, simple to use, and works on both computers and smartphones!

SAVIO - Future Services

● Serial/HTC Jobs
○ Expanding the initial architecture beyond just HPC
○ Specialized node hardware (12-core, 128GB, PCI flash storage)
○ Designed for jobs that use <= 1 node
○ Nodes are shared between jobs
● GPU nodes
○ GPUs are optimal for massively parallel algorithms
○ Specialized node hardware (8-core, 64GB, 2x Nvidia K80)

Questions

Berkeley Research Computing
Town Hall Meeting

Savio User Environment

SAVIO - Faculty Computing Allowance

● Eligibility requirements
○ Ladder-rank faculty or PIs on the UCB campus
○ In need of compute power to solve a research problem

● Allowance Request Procedure
○ First fill out the Online Requirements Survey
○ The allowance can be used either by the faculty member or by immediate group members
○ For additional cluster accounts, fill out the Additional User Account Request Form

● Allowances
○ New allowances start on June 1st of every year
○ Mid-year requests are granted a prorated allocation
○ A cluster-specific project (fc_projectname) with all user accounts is set up
○ A scheduler account (fc_projectname) with 200K core hours is set up
○ The annual allocation expires on May 31st of the following year

SAVIO - Access

● Cluster access
○ Connect using SSH (server name - hpc.brc.berkeley.edu); see the login example after this list
○ Uses OTP - One Time Passwords (multifactor authentication)
○ Multiple login nodes (randomly distribute users)
● Coming in future
○ NERSC's NEWT REST API for web portal development
○ iPython notebooks & Jupyter hub integration
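A minimal login sketch; myuser is a hypothetical username, and the server name is the one given above:

$ ssh [email protected]
# At the password prompt, enter the one-time password (OTP) from your token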

SAVIO - Data Storage Options

● Storage
○ No local storage on compute nodes
○ All storage accessed over the network
○ Either NFS or Lustre protocol
● Multiple file systems
○ HOME - NFS, 10GB quota, backed up, no purge
○ SCRATCH - Lustre, no quota, no backups, can be purged
○ Project (GROUP) space - NFS, 200GB quota, no backups, no purge
○ No long-term archive

SAVIO - Data Transfers

● Use only the dedicated Data Transfer Node (DTN)
● Server name - dtn.brc.berkeley.edu
● Highly recommend using Globus (web interface) for management
● Many other traditional tools are also supported on the DTN (examples follow this list)
○ SCP/SFTP
○ Rsync
○ BBCP
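Hedged transfer sketches using the DTN name above; the file and directory names, the destination path, and the username myuser are all hypothetical:

$ scp results.tar.gz [email protected]:~/           # one-off copy to your home directory
$ rsync -av dataset/ [email protected]:~/dataset/   # sync a directory, preserving attributes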

SAVIO - Software Support

● Software module farm
○ Many of the most commonly used packages are already available
○ In most cases packages are compiled from source
○ Easy command line tools to browse and access packages ($ module cmd); see the sketch after this list

● Supported package list○ Open Source

■ Tools - octave, gnuplot, imagemagick, visit, qt, ncl, paraview, lz4, git, valgrind, etc.

■ Languages - GNU C/C++/Fortran compilers, Java (JRE), Python, R, etc..

○ Commercial
■ Intel C/C++/Fortran compiler suite, Matlab with an 80-core license for MDCS

● User applications
○ Individual user/group-specific packages can be built from source by users
○ Recommend using GROUP storage space for sharing with others in the group
○ SAVIO consultants are available to answer your questions
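A minimal sketch of the command line module tools mentioned above; the package name gcc is illustrative, so check $ module avail for what is actually installed:

$ module avail        # browse the packages in the module farm
$ module load gcc     # add a package to your environment
$ module list         # show the modules currently loaded
$ module unload gcc   # remove a package from your environment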

SAVIO - Job Scheduler

● SLURM

● Multiple Node Options (partitions)

● Interaction with Scheduler
○ Only with command line tools and utilities (a sample batch script follows the tables below)
○ Online web interfaces for job management can be supported in the future via NERSC's NEWT REST API, iPython/Jupyter, or both

Quality of Service   Max allowed running time/job   Max number of nodes/job
savio_debug          30 minutes                     4
savio_normal         72 hours (i.e., 3 days)        24

Partition      # of nodes   # of cores/node   Memory/node   Local storage
savio          160          20                64 GB         No local storage
savio_bigmem   4            20                512 GB        No local storage
savio_htc      12           12                128 GB        Local PCI flash
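A minimal SLURM batch script sketch built from the partition, QoS, and account names above; the job name, node count, time limit, module name, and application command are hypothetical placeholders:

#!/bin/bash
#SBATCH --job-name=example          # hypothetical job name
#SBATCH --account=fc_projectname    # FCA scheduler account, as described above
#SBATCH --partition=savio           # standard 20-core / 64 GB nodes
#SBATCH --qos=savio_normal          # up to 72 hours and 24 nodes per job
#SBATCH --nodes=2                   # whole nodes are assigned exclusively
#SBATCH --time=01:00:00             # wall-clock limit (hh:mm:ss)

module load openmpi                 # module name assumed; check $ module avail
mpirun ./my_parallel_app            # hypothetical MPI application

Submit with $ sbatch job.sh and check the queue with $ squeue -u $USER.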

SAVIO - Job Accounting

● Jobs gain exclusive access to assigned compute nodes.
● Jobs are expected to be highly parallel and capable of using all the resources on assigned nodes.

For example:

● Running on one standard node for 5 hours uses 1 node * 20 cores * 5 hours = 100 core-hours (or Service Units).

SAVIO - How to Get Help

● Online User Documentation
○ User Guide - http://research-it.berkeley.edu/services/high-performance-computing/user-guide
○ New User Information - http://research-it.berkeley.edu/services/high-performance-computing/new-user-information
● Helpdesk
○ Email: [email protected]
○ Monday - Friday, 9:00 am to 5:00 pm
○ Best effort during non-working hours

Thank you

Questions