Virtual Science in the Cloud

Virtual Science in the Cloud Roy Williams California Institute of Technology


Page 1: Virtual Science in the Cloud

Virtual Science in the Cloud

Roy Williams, California Institute of Technology

Page 2: Virtual Science in the Cloud

humans, clouds, sensors
beginner to expert

sharing
logins and access

click to code to workflow

personal storage
big data and replication

compute and scaling
software as component

interoperability

survey and event
control or autonomous

The New Science

Page 4: Virtual Science in the Cloud

Service Oriented Architecture

service
request
response

client
request
response

registry
1. publish
2. find
3. bind

service contract

Principle: Click or Code
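
The publish / find / bind cycle above can be sketched in a few lines. This is only an in-memory illustration: the ServiceRecord and Registry classes, the keyword search, and the echo service are all hypothetical names invented here, not part of any VO software.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ServiceRecord:
    """Hypothetical registry entry: name, search keywords, callable endpoint."""
    name: str
    keywords: List[str]
    endpoint: Callable[[dict], dict]


class Registry:
    """Hypothetical in-memory registry."""

    def __init__(self) -> None:
        self._records: Dict[str, ServiceRecord] = {}

    def publish(self, record: ServiceRecord) -> None:
        # Step 1: the service advertises itself.
        self._records[record.name] = record

    def find(self, keyword: str) -> List[ServiceRecord]:
        # Step 2: the client searches by keyword.
        return [r for r in self._records.values() if keyword in r.keywords]


def echo_service(request: dict) -> dict:
    # Stand-in service; the "contract" here is simply dict in, dict out.
    return {"status": "ok", "echo": request}


registry = Registry()
registry.publish(ServiceRecord("echo", ["demo", "test"], echo_service))
found = registry.find("demo")
# Step 3: bind, i.e. call the endpoint that the registry returned.
response = found[0].endpoint({"ra": 10.0})
```

The same three calls could sit behind a web form ("click") or a script ("code"), which is the point of the principle on this slide.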

Page 5: Virtual Science in the Cloud

VO Data Services
• Cone Search
  – radius + position gives list of objects
  – encoded as VOTable
• Simple Image Access Protocol
• Simple Spectrum Access Protocol
  – spectra have subtleties, so the protocol is more complicated
• Astronomical Data Query Language
  – For database queries
  – Core SQL functions plus astronomy-specific extensions
  – Sky region, Xmatch
• Table Access Protocol
  – Exposes relational databases
  – What tables
  – What table schema
  – Here is a query in ADQL
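
As a rough sketch of what these data services look like from the client side, the snippet below builds a Cone Search request URL and the parameters of a synchronous TAP query. The RA, DEC and SR parameter names and the REQUEST / LANG / QUERY parameters follow the IVOA protocols as far as I know them; the base URL, the table name catalog.objects and its columns are made up for illustration, and no network call is made.

```python
from urllib.parse import urlencode


def cone_search_url(base_url, ra_deg, dec_deg, radius_deg):
    """Build a Cone Search GET URL: position plus search radius, all in degrees."""
    query = urlencode({"RA": ra_deg, "DEC": dec_deg, "SR": radius_deg})
    separator = "&" if "?" in base_url else "?"
    return f"{base_url}{separator}{query}"


def tap_sync_params(adql):
    """Parameters for a synchronous TAP request carrying an ADQL query."""
    return {"REQUEST": "doQuery", "LANG": "ADQL", "QUERY": adql}


# Hypothetical service address; a real one would come from the registry.
url = cone_search_url("http://example.org/cone", 180.0, -30.5, 0.1)

# ADQL is SQL plus sky-region functions such as CONTAINS, POINT and CIRCLE.
# The table and column names below are placeholders.
adql = (
    "SELECT TOP 10 ra, dec FROM catalog.objects "
    "WHERE 1 = CONTAINS(POINT('ICRS', ra, dec), "
    "CIRCLE('ICRS', 180.0, -30.5, 0.1))"
)
params = tap_sync_params(adql)
```

The response to either request would normally be a VOTable document, which the client then parses.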

Page 6: Virtual Science in the Cloud

VO Compute Services

• Asynchronous
  – May not get immediate answer
  – just get a place to check back
• Security
  – Expensive resources, big requests, sequestered data
  – Strong or Weak or None
• Scalable
  – Graduated path to powerful computation and big data
• Cloud store
  – VOSpace
  – Sharable
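
The asynchronous pattern ("just get a place to check back") might look like the sketch below. AsyncJobService is a hypothetical in-memory stand-in, not a real VO API; the phase names and the JSON fields are assumptions chosen for the example, and the fake job simply advances one step per status check.

```python
import itertools
import json


class AsyncJobService:
    """In-memory stand-in for an asynchronous compute service (hypothetical)."""

    def __init__(self, steps_needed=3):
        self._ids = itertools.count(1)
        self._jobs = {}
        self._steps_needed = steps_needed

    def submit(self, request):
        # No immediate answer: the caller only gets a place to check back.
        job_id = f"job-{next(self._ids)}"
        self._jobs[job_id] = {"request": request, "done_steps": 0}
        return job_id

    def status(self, job_id):
        # Each status check advances the fake job by one step.
        job = self._jobs[job_id]
        job["done_steps"] = min(job["done_steps"] + 1, self._steps_needed)
        finished = job["done_steps"] == self._steps_needed
        return json.dumps({
            "job": job_id,
            "phase": "COMPLETED" if finished else "EXECUTING",
            "progress": job["done_steps"] / self._steps_needed,
        })


def wait_for(service, job_id, max_polls=10):
    """Poll until the job completes; return the decoded status reports."""
    reports = []
    for _ in range(max_polls):
        report = json.loads(service.status(job_id))
        reports.append(report)
        if report["phase"] == "COMPLETED":
            break
    return reports


service = AsyncJobService()
job_id = service.submit({"task": "mosaic"})
reports = wait_for(service, job_id)
```

A real client would sleep between polls and would attach credentials when the service uses the strong or weak security levels listed above.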

Page 7: Virtual Science in the Cloud

VO Registry
• publish -- find -- bind
• Registry Metadata
  – Descriptions of data collections, data delivery services, organizations, etc.
  – Based on Dublin Core with astronomy-specific extensions
  – Represented as XML schema; extensible
  – Contents stored in Resource Registries
  – Registries exchange metadata records through the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
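
Harvesting between registries can be pictured as a series of OAI-PMH ListRecords requests. The sketch below only builds the request URL. The verb, metadataPrefix, from and resumptionToken arguments are standard OAI-PMH; the base URL is a placeholder, and the ivo_vor prefix is my assumption about what a VO registry expects.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix, from_date=None,
                     resumption_token=None):
    """Build an OAI-PMH ListRecords URL for incremental harvesting."""
    if resumption_token is not None:
        # In OAI-PMH a resumptionToken is sent on its own, without the
        # other selection arguments.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date is not None:
            # Only ask for records changed since the last harvest.
            params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


# Hypothetical registry endpoint and assumed metadata prefix.
first_page = list_records_url("http://example.org/oai", "ivo_vor",
                              from_date="2007-03-01")
next_page = list_records_url("http://example.org/oai", "ivo_vor",
                             resumption_token="abc123")
```

A harvester would fetch the first page, read any resumption token from the XML response, and repeat until no token is returned.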

Page 8: Virtual Science in the Cloud

Distributed Registry

Caltech

NCSA

STScI/JHU

HEASARC

Astrogrid

CDS

JapanVO

Ongoing harvesting, March 07 (CfA, ESO, NOAO soon)

ESO

CfA

NOAO

Page 9: Virtual Science in the Cloud

Semantics & Search

• Identifiers: ivo://nasa.gsfc.gcn/SWIFT#BAT_GRB_Pos_374875-722
• Free tags: beard, Fred, pudding
• Controlled Vocab (UCD): phot.flux;em.ir
• Controlled Vocab interop (SKOS)
• Ontology: Greek isA Man, Socrates isA Greek, so Socrates isA Man
• Data Models: Each sky position will have a circular positional error estimate ...
• Text markup: Outflows from <object>NGC 666</object> are irregular ...
• Schema: Columns are Magnitude, Position, Identifier, ...
• Metadata (registry) forms: Full Registry: true; ManagedAuthorities: authority, nasa.heasarc
• Formal service description
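
Two of the items above are simple enough to take apart in code. The sketch below splits the example UCD into words and atoms and breaks the example identifier into authority, resource key and fragment. The function names and the returned dictionary keys are my own, and the parsing is deliberately naive: it does not validate against the UCD vocabulary or the full identifier rules.

```python
def parse_ucd(ucd):
    """Split a UCD into words (';'-separated), each into atoms ('.'-separated)."""
    return [word.strip().split(".") for word in ucd.split(";") if word.strip()]


def parse_ivo_identifier(identifier):
    """Split an ivo:// identifier into authority, resource key and fragment."""
    prefix = "ivo://"
    if not identifier.startswith(prefix):
        raise ValueError(f"not an ivo:// identifier: {identifier!r}")
    rest, _, fragment = identifier[len(prefix):].partition("#")
    authority, _, resource_key = rest.partition("/")
    return {
        "authority": authority,
        "resource_key": resource_key,
        "fragment": fragment,
    }


ucd_words = parse_ucd("phot.flux;em.ir")
ident = parse_ivo_identifier(
    "ivo://nasa.gsfc.gcn/SWIFT#BAT_GRB_Pos_374875-722"
)
```

Here the UCD reads as "photometric flux" qualified by "infrared", and the identifier names one event (the fragment) within the SWIFT resource of the nasa.gsfc.gcn authority.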

Page 10: Virtual Science in the Cloud

Cloud Based Tools
code & presentation
data

Page 11: Virtual Science in the Cloud
Page 12: Virtual Science in the Cloud

Open SkyQuery.net
VO Astronomical Crossmatch Service

• Query builder
• Presentation

Page 13: Virtual Science in the Cloud

Execution

• Query planning
• Query execution
• Workflow

Page 14: Virtual Science in the Cloud

Microlensing
Optical transients
Radio transients
X-ray transients
Gamma transients

Follow-up Scheduler

Telescope
Telescope
Telescope

Authors
Subscribers
International

GCN Broker annotation from archives

Events and annotation disseminated to subscribers in real time with intelligence

skyalert.org

Astronomers
Amateurs
Students

Page 15: Virtual Science in the Cloud

Skyalert

• Push-based workflow
  – Can be cyclic
• Portfolio aggregation by citation
• Annotation as software components
• Stream owner builds template
• Django, Python, jQuery
• now 4 developers via SVN

Page 16: Virtual Science in the Cloud

Skyalert Stream Registry... will be VO registry

Page 17: Virtual Science in the Cloud

Roles

1. browse: query, human computing, WWT/Google
2. subscribe: human or robot
3. author: human or robot
4. annotate: human or robot; contrib software components (archive, mining)

triggers

portfolios db

actions

web

push

inject

IM/tweet/email/TCP

skyalert.org

Page 18: Virtual Science in the Cloud

Trigger

Action

Cyclic workflow graph

CRTS["Geometry"]["Moon angle"] > 30 and SDSS["Photoprimary"]["g-magnitude"] < 18

dynamically loads module
run(triggerEvent, portfolio): <business logic>
can build event and inject recursively

annotator

followup request

send message

Alerts and event cascade
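
A minimal sketch of the trigger and action pair on this slide, assuming a portfolio is a nested dictionary keyed by stream, table and parameter. The run(triggerEvent, portfolio) signature is taken from the slide; the process helper, the event id and the sample values are invented, and the recursive injection of new events is left out.

```python
def trigger(portfolio):
    """The trigger expression from the slide, evaluated on one portfolio."""
    return (portfolio["CRTS"]["Geometry"]["Moon angle"] > 30
            and portfolio["SDSS"]["Photoprimary"]["g-magnitude"] < 18)


def run(triggerEvent, portfolio):
    """Action body; here the business logic only builds a follow-up request."""
    return {"action": "followup request", "event": triggerEvent["id"]}


def process(event, portfolio, rules):
    """Fire every action whose trigger passes; collect what the actions return."""
    return [action(event, portfolio)
            for condition, action in rules if condition(portfolio)]


# Made-up portfolio values for illustration.
portfolio = {
    "CRTS": {"Geometry": {"Moon angle": 45.0}},
    "SDSS": {"Photoprimary": {"g-magnitude": 17.2}},
}
event = {"id": "example-event-1"}
results = process(event, portfolio, [(trigger, run)])
```

In a running system the action module could be loaded by name with importlib.import_module, and an action that builds and injects a new event would feed back into process, which is what makes the workflow graph cyclic.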


skyalert.org

Page 19: Virtual Science in the Cloud

Skyalert-LSST
• Test run for LSST mobile app
• Data service from CRTS and Skyalert
  – gets JSON event list via HTTP
• LSST building Skyalert clone
  – Pasadena and Tucson both get events by Jabber/XMPP
• “Unknown” is now a choice of: Cataclysmic Variable, Supernova, Blazar Outburst, Active Galactic Nucleus Variability, UVCeti Variable, Asteroid, Variable, Mira Variable, High Proper Motion Star, Comet, Eclipsing Variable, Gamma Ray Burst Afterglow, Microlensing, Nova, Planetary Microlensing, RRLyrae Variable, Tidal Disruption Flare

skyalert.org

Page 20: Virtual Science in the Cloud

Tier1 and Tier2 Event Nodes
Evolving in IVOA

• Tier1
  – Immediate Forwarding, Reliable?, Topology?
• Tier2
  – Subscription, Repository, Query, Portfolio, Registry, Machine Learning, Substreams, etc.

Tier1

Tier2

Brokering

Jabber/XMPP or raw socket

Authoring

Distribution

Registry
• Stream definitions
• Event Servers

Page 21: Virtual Science in the Cloud

NSF Teragrid

• World’s largest open distributed cyberinfrastructure
• 11 Resource Provider sites, >2 Petaflop HPC & >27000 CPUs, >3 Petabyte disk, >60 PB tape
• Fast network, Visualization, experiments (VMs, GPUs, FPGAs)
• For US researchers and their collaborators through national peer-review process

Page 22: Virtual Science in the Cloud

Teragrid 2002

user

100s of nodes

purged /scratch

parallel file system

/home

login node

job submission and queueing (Condor, PBS, ..)

metadata node

parallel I/O

global file system

Unix, Globus, C++, ssh, files, MPI, PBS, make

Page 23: Virtual Science in the Cloud

Architectures 2010

• Science Gateway (no architecture!)
• Node farm (Condor)
• Parallel computing
  – Message-passing MPI
  – Shared memory
• Graphics Processing Units
  – 10^4 independent tiny threads
• Data Intensive
  – Flash memory (TG/UCSD)
  – Graywulf (JHU/Pan-STARRS)
• Immediate resources

Page 24: Virtual Science in the Cloud

Science Gateways
• Biology and Biomedicine Science Gateway
• Open Life Sciences Gateway
• The Telescience Project
• Grid Analysis Environment (GAE)
• Neutron Science Instrument Gateway
• TeraGrid Visualization Gateway, ANL
• BIRN
• Open Science Grid (OSG)
• Special PRiority and Urgent Computing Environment (SPRUCE)
• National Virtual Observatory (NVO)
• Arroyo Adaptive Optics
• Linked Environments for Atmospheric Discovery (LEAD)
• Computational Chemistry Grid (GridChem)
• Computational Science and Engineering Online (CSE-Online)
• GEON (GEOsciences Network)
• Network for Earthquake Engineering Simulation (NEES)
• SCEC Earthworks Project
• Network for Computational Nanotechnology and nanoHUB
• GIScience Gateway (GISolve)
• Gridblast Bioinformatics Gateway
• Earth Systems Grid
• Astrophysical Data Repository (Cornell)

Slide courtesy of Nancy Wilkins-Diehr

Page 25: Virtual Science in the Cloud

GPU for molecular modelling

Page 26: Virtual Science in the Cloud

Data valet
load/validate
merge
crawl
replicate
log

User facing
SQL/CasJobs
workbench
privacy/share
stored queries

workflow

workflow

compute

data
head/slice
hot/warm/cold

Fault tolerance: multiple replication, fault workflow
Cost and energy carefully considered
Future: Hadoop/MapReduce

Pan-STARRS PS1

Page 27: Virtual Science in the Cloud

Cloud Supercomputing?

• Teragrid/Globus vs Cloud/Amazon MI

• Both ways to get wholesale computing
• Both provide IaaS, Infrastructure as a Service
• Virtual Machine more popular than CTSS stack
• What about parallelism? I/O speed? GPUs? etc.
  – Watch 3leaf and ScaleMP for these

Page 28: Virtual Science in the Cloud

Science and Web 2.0

• Easy for groups to form and collaborate
• Integrates with user workspace
  – iGoogle and OpenSocial
  – alongside other aspects of their lives
• Use existing tools
  – SlideShare, blogs, Google gadgets, Facebook, Gwave, Flickr, YouTube
• Sharing workspace
• Electronic log
• Provenance
• Virtual Data as “equivalent script”

Page 29: Virtual Science in the Cloud

Science and Web 2.0

• Server delivers only code
  – Browser makes presentation
  – Ajax and Ajaj and HTTP “long poll”
  – jQuery and Google toolkit
  – see WWT and GSky in Skyalert
• “Everything is a wiki”
  – or a wave?
• Visible/editable by group/s

Page 30: Virtual Science in the Cloud

Adaptive Optics Gateway

proposed upgrade of the Palomar AO system to a 56x56 subaperture system

• Adaptive optics simulations
• 30-meter telescope
• Planet finding coronagraph
• 4-day run for 4 sec!
• Parallel parameter sweeps
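
A parallel parameter sweep can be sketched as a grid of independent runs handed to a worker pool. Everything below is illustrative: simulate is a placeholder that returns a made-up score rather than running an adaptive optics simulation, the parameter names are invented, and a thread pool stands in for the batch resources a real gateway would use.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def simulate(params):
    """Placeholder for one simulation run; the score is not a real AO metric."""
    subapertures, wind_speed = params
    return {
        "subapertures": subapertures,
        "wind_speed": wind_speed,
        "score": subapertures / (1.0 + wind_speed),
    }


def sweep(subaperture_values, wind_speeds, workers=4):
    """Run every combination of the two parameters; results keep grid order."""
    grid = list(product(subaperture_values, wind_speeds))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(simulate, grid))


results = sweep([16, 32, 56], [5.0, 10.0])
```

Because the runs do not depend on each other, the same grid could equally be submitted as separate jobs to a node farm, which is how a multi-day simulation becomes practical.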

Page 31: Virtual Science in the Cloud

Arroyo

Page 32: Virtual Science in the Cloud

Arroyo Gateway Architecture

Django

webserver

daemon

MySQL
job definitions and status

local space for results

remote space for results

wholesale computing

1. Use HTML/JS from webserver to create job definition.

2. Daemon is polling & sees new job, makes local space for it.

3. Start job on compute resource & update job status.

4. Fetch & update status of running job. Repeat.

5. Output to remote space.

6. Daemon copies output from remote to local, updates job status.

7. User fetches results from webserver.
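
The numbered steps above amount to a small state machine driven by a polling daemon. The sketch below compresses them into four job states and uses an in-memory SQLite database in place of the MySQL table on the slide, so that it runs anywhere. The table layout, the state names and the function names are assumptions, and no real job is started.

```python
import sqlite3

# Assumed job states, one transition per daemon pass.
NEXT_STATUS = {
    "new": "running",
    "running": "output_remote",
    "output_remote": "done",
}


def make_db():
    """Create the job table (SQLite here, standing in for MySQL)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE jobs (id INTEGER PRIMARY KEY, definition TEXT, status TEXT)"
    )
    return db


def create_job(db, definition):
    """Step 1: the web front end writes a job definition."""
    cur = db.execute(
        "INSERT INTO jobs (definition, status) VALUES (?, 'new')", (definition,)
    )
    db.commit()
    return cur.lastrowid


def daemon_pass(db):
    """One polling pass: advance every unfinished job by one state."""
    rows = db.execute(
        "SELECT id, status FROM jobs WHERE status != 'done'"
    ).fetchall()
    for job_id, status in rows:
        db.execute(
            "UPDATE jobs SET status = ? WHERE id = ?",
            (NEXT_STATUS[status], job_id),
        )
    db.commit()
    return len(rows)


def job_status(db, job_id):
    """What the webserver would show the user."""
    return db.execute(
        "SELECT status FROM jobs WHERE id = ?", (job_id,)
    ).fetchone()[0]


db = make_db()
job_id = create_job(db, '{"subapertures": 56}')
history = [job_status(db, job_id)]
while daemon_pass(db):
    history.append(job_status(db, job_id))
```

Keeping all state in the database means the webserver and the daemon never talk to each other directly, so either can be restarted without losing jobs.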

retail wholesale

RW and J. Bunn

Page 33: Virtual Science in the Cloud

Pegasus workflow

E. Deelman

Page 34: Virtual Science in the Cloud

E. Deelman, G. Berriman, RW, et al

Page 35: Virtual Science in the Cloud

LIGO Grid
• Condor/DAGMan
• now 45,000 jobs per month
• Pegasus for load balancing?

Page 36: Virtual Science in the Cloud

Asynchronous services: User needs feedback

• AJAJ (AJAX but with JSON)

• Detailed progress reports during run

• Strong/weak security model with certificates

Page 37: Virtual Science in the Cloud
Page 38: Virtual Science in the Cloud

Wide-area Mosaicking

Griffith Observatory, Los Angeles

158 feet

Page 39: Virtual Science in the Cloud

Citizen Science

Page 40: Virtual Science in the Cloud

Human Volunteers

• Science Layer
  – Describe what you see in image
  – Each person has level of expertise
  – How to use results most effectively
  – Galaxyzoo.org, citizensky.org good models
• Game Layer
  – Makes people come back
  – Top 10 ranking etc.
  – Anonymous partner a la gwap.com

Page 41: Virtual Science in the Cloud

Human Volunteer Evidence

Donalek et al., arXiv:0810.4945 [astro-ph]

4 of 10 say artifact

artifact
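
One simple way to turn volunteer answers into evidence is a vote weighted by each person's expertise. The sketch below is an assumption about how that could be done, not the method of the cited paper; with equal weights it reproduces the "4 of 10 say artifact" share.

```python
from collections import defaultdict


def weighted_vote(votes):
    """votes: iterable of (label, expertise_weight) pairs.

    Returns a dict mapping each label to its share of the total weight.
    """
    totals = defaultdict(float)
    for label, weight in votes:
        totals[label] += weight
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("no votes with positive weight")
    return {label: weight / grand_total for label, weight in totals.items()}


# Ten volunteers with equal (hypothetical) expertise weights.
votes = [("artifact", 1.0)] * 4 + [("real", 1.0)] * 6
shares = weighted_vote(votes)
```

Raising the weight of volunteers with a good track record would shift the shares toward their answers, which is one reading of "how to use results most effectively".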

Page 42: Virtual Science in the Cloud

RW and C. Donalek

Page 43: Virtual Science in the Cloud

Macromolecule Citizen Science

A. Cunha

Page 44: Virtual Science in the Cloud

Information Fusion

Page 45: Virtual Science in the Cloud

Classic Machine Learning
Metric in “Feature Space”

RW and J. Beck

Feature Vectors
Learning from Training set
Picking relevant lessons

Relevance Vector Machine (Tipping)

Page 46: Virtual Science in the Cloud

New Machine Learning: Information Fusion

• Data Portfolios
  – selected from known set of object types
• Evidence object
  – set of class/prob and prior assumptions
  – may be correlated priors
• Annotator builds evidence
  – from portfolio
  – may include other evidence
• Inference (= Expert System)
  – Combines evidence with cost-benefit
  – Builds Importance
• Alchemy
  – Logic handles complexity
  – Probability handles uncertainty
• Markov Logic Networks
• Matrix Completion
• Influence Diagrams

Page 47: Virtual Science in the Cloud

Automated Decision through Tripod of Data

• Archive
  – nearby radio source escalates p(blazar)
  – nearby galaxy escalates p(supernova)
• Human
  – Crowded field? Artifact present?
  – Can make follow-up observation
• Machine
  – Fuzzy center escalates p(host galaxy)
  – Moving source escalates p(asteroid)
  – Robotic follow-up observation
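
"Escalates p(blazar)" can be read as a Bayesian update: each piece of evidence multiplies the class probabilities by a likelihood factor, and the result is renormalized. The sketch below assumes the pieces of evidence are independent, which the talk itself warns may not hold ("may be correlated priors"), and every number in it is invented for illustration.

```python
def fuse(prior, evidence_list):
    """Multiply class priors by each evidence factor, then renormalize.

    Classes that an evidence item does not mention keep a factor of 1.0.
    Assumes the evidence items are independent of each other.
    """
    posterior = dict(prior)
    for evidence in evidence_list:
        for cls in posterior:
            posterior[cls] *= evidence.get(cls, 1.0)
    total = sum(posterior.values())
    return {cls: p / total for cls, p in posterior.items()}


# Made-up flat prior over four classes.
prior = {"blazar": 0.25, "supernova": 0.25, "asteroid": 0.25, "artifact": 0.25}

# Archive evidence: a nearby radio source makes "blazar" three times as
# likely as before, relative to the other classes (hypothetical factor).
nearby_radio_source = {"blazar": 3.0}

posterior = fuse(prior, [nearby_radio_source])
```

Human and machine annotators would contribute further factors to evidence_list, and a decision step could then weigh the posterior against the cost of a follow-up observation.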

decision

human

machine learning

archive

Page 48: Virtual Science in the Cloud

Lessons Learned

Page 49: Virtual Science in the Cloud

User Interface (wrong)

Register
Wait for account
Read about XML
Read about web services
Learn to use VO Registry
Translate VOTable format
Ask for help
Finally get some help

and now do some science....

Page 50: Virtual Science in the Cloud

User interface (right)
in Darwinian evolution every small change must give benefit

Anonymous
Web form
some science....

Register
more science....

Run bigger job
hey this is interesting ....

Learn the VO structure
Power user

be careful with complex authentication!

Page 51: Virtual Science in the Cloud

Steering the Ship

• Short term Pragmatism
  – useful tools now
  – simple protocols (e.g. cone search)
  – “just use RA and Dec”

vs

• Long term Architecture
  – modular suite of interoperable tools
  – sophisticated protocols (e.g. skynode)
  – sophisticated Space-Time coordinates

Page 52: Virtual Science in the Cloud

Building Information Standards

• Semantics
  – Meaning
  – Usefulness
  – Applicability
• Code
  – Services
  – Interfaces
• Documents
  – Agreements
  – Data Models
  – Tight Schema
  – Loose Schema
• UML
• XSD
• WSDL

A Data Model is a bridge from community to computers

Page 53: Virtual Science in the Cloud

What is a Data Center?

machines
services

doesn’t matter where or how
testing testing testing

do we have enough power and HVAC?

Page 54: Virtual Science in the Cloud

Complex science
Complex machines

• Separate science user from complexity
  – Must have domain science context
• Making simple things simple but
  – Power to scale up
  – Drill-down if wanted
• Machines are not the objective
  – Science through data, compute, sharing

Page 55: Virtual Science in the Cloud

eScience is for People, right?

Summer Schools

Forum
Documentation
Knowledge Base

Social Media
Blog/newsfeed

Help Desk

Education

Getting Started

Campus Champions

Contact Us
Calendar

Advanced Support for Developers