David P. Anderson Space Sciences Laboratory University of California – Berkeley
Public Distributed Computing with BOINC
Public-resource computing
● 1 billion Internet-connected PCs in 2010
● >50% of PCs are privately owned
● Assume 100M participants
– At least 100 PetaFLOPs
– At least 1 Exabyte (10^18) storage
● Problems
– incentive, security, failures, ...
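The aggregate figures above imply a per-host share that is easy to check. A minimal sketch, assuming the totals are spread evenly over the 100M participants (the even split is an assumption for illustration, not a claim from the slide):

```python
# Back-of-envelope check of the slide's aggregate estimates.
PARTICIPANTS = 100_000_000       # "Assume 100M participants" (from the slide)
TOTAL_FLOPS = 100e15             # 100 PetaFLOPs (from the slide)
TOTAL_STORAGE_BYTES = 1e18       # 1 Exabyte (from the slide)


def per_host_share(total, hosts):
    """Even split of an aggregate resource across all hosts (an assumption)."""
    return total / hosts


flops_per_host = per_host_share(TOTAL_FLOPS, PARTICIPANTS)
bytes_per_host = per_host_share(TOTAL_STORAGE_BYTES, PARTICIPANTS)

print(f"{flops_per_host / 1e9:.0f} GFLOPS per host")
print(f"{bytes_per_host / 1e9:.0f} GB per host")
```

Under that assumption the estimate needs about 1 GFLOPS and 10 GB from each participating PC.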
SETI@home
● Started May 1999
● ~600,000 active participants
● ~60 TeraFLOPs
● Problems with current software
– hard to change/add algorithms
– can't share participants w/ other projects
– inflexible data architecture
SETI@home data architecture
[Diagram: current vs. ideal data flow. Recoverable labels: tapes; Internet2 (free); Berkeley, Stanford, USC; commercial Internet; participants; 50 Mbps]
BOINC: Berkeley Open Infrastructure for Network Computing
● Multiple projects
– easy to develop and operate
– independent
● Support wide range of tasks
– computation/storage
– task “topologies”
● Participant features
– can choose projects, resource allocation
– configurable; invisible on participant hosts
– many platforms supported
BOINC server architecture
[Diagram: server components. Labels: work generator; project DB; BOINC DB; timeout/retry; validator; assimilator; file deleter; data servers (several); scheduling server; Web interfaces (PHP)]
BOINC client architecture
[Diagram: client components. Labels: BOINC core client; screensaver; application + BOINC library (shown twice); files, shared memory; messages; schedulers, data servers]
Data architecture
● Files
– immutable, replicated
– may originate on client or project
– may remain resident on client
● Persistent, non-intrusive file transfers
● XML descriptor:
<file_info>
    <name>arecibo_3392474_jun_23_01</name>
    <url>http://ds.ssl.berkeley.edu/a3392474</url>
    <url>http://dt.ssl.berkeley.edu/a3392474</url>
    <md5_cksum>uwi7eyufiw8e972h8f9w7</md5_cksum>
    <nbytes>10000000</nbytes>
</file_info>
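The descriptor carries everything a client needs to fetch a file from any replica and confirm it arrived intact. The following is a minimal sketch of that idea, not BOINC's own client code: `parse_file_info` and `verify_download` are hypothetical helper names, and the checksum on the slide is a placeholder string rather than a real MD5 hex digest.

```python
import hashlib
import xml.etree.ElementTree as ET

# Descriptor copied from the slide; the md5_cksum value is a placeholder.
FILE_INFO_XML = """
<file_info>
  <name>arecibo_3392474_jun_23_01</name>
  <url>http://ds.ssl.berkeley.edu/a3392474</url>
  <url>http://dt.ssl.berkeley.edu/a3392474</url>
  <md5_cksum>uwi7eyufiw8e972h8f9w7</md5_cksum>
  <nbytes>10000000</nbytes>
</file_info>
"""


def parse_file_info(xml_text):
    """Extract the fields of a <file_info> element into a plain dict."""
    root = ET.fromstring(xml_text)
    return {
        "name": root.findtext("name"),
        "urls": [u.text for u in root.findall("url")],  # replicas, in order
        "md5_cksum": root.findtext("md5_cksum"),
        "nbytes": int(root.findtext("nbytes")),
    }


def verify_download(info, data):
    """Return True if the bytes match the descriptor's size and MD5 digest."""
    if len(data) != info["nbytes"]:
        return False
    return hashlib.md5(data).hexdigest() == info["md5_cksum"]


info = parse_file_info(FILE_INFO_XML)
```

Because files are immutable, a size and checksum match is enough to treat a copy from either URL as interchangeable.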
BOINC applications
● Any language (C, C++, Fortran)
● BOINC API
– filename translation
– checkpoint/restart, % done, CPU time
– graphics (based on OpenGL, GLUT)
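The checkpoint/restart and "% done" items can be illustrated with a generic pattern. This sketch is written in Python for readability and does not use the BOINC API; the function names, the JSON state file, and the `stop_after` parameter (which simulates an interruption) are all assumptions made for the example.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_iter": 0, "total": 0}


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a crash never leaves a torn file."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(path, n_iter, checkpoint_every=100, stop_after=None):
    """Sum 0..n_iter-1, resuming from the last checkpoint if one exists.

    Returns (state, fraction_done).
    """
    state = load_checkpoint(path)
    for i in range(state["next_iter"], n_iter):
        if stop_after is not None and i >= stop_after:
            return state, i / n_iter  # simulated interruption
        state["total"] += i
        state["next_iter"] = i + 1
        if (i + 1) % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state, 1.0
```

An interrupted run loses only the work done since the last checkpoint; calling `run` again with the same path picks up from the saved iteration.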
Work units
● Template for a computation
● Resource estimates
– Integer, FP ops; memory; disk space
● Delay bound
– determines retry, client abort
<file_info>
    <name>arecibo_3392474_jun_23_01</name>
    ...
</file_info>
<workunit>
    <name>ar_13323313</name>
    <file_ref>
        <name>arecibo_3392474_jun_23_01</name>
        <open_name>input_file</open_name>
    </file_ref>
    <command_line>-niter 1000</command_line>
</workunit>
Results
● An instance of a computation (completed or not)
● Includes: host ID, claimed/granted credit
<file_info>
    <name>arecibo_3392474_jun_23_01.out</name>
    ...
</file_info>
<result>
    <workunit_name>ar_13323313</workunit_name>
    <file_ref>
        <name>arecibo_3392474_jun_23_01.out</name>
        <open_name>output_file</open_name>
    </file_ref>
</result>
Scheduling
● Work buffering on client
– upper, lower bounds
● Host attributes
– FP/int/mem speeds, disk/memory sizes
– network bandwidth up/down
– fraction of time connected, computing
● Scheduler policy:
– send as much work as requested, subject to feasibility, WU deadlines
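The scheduler policy can be sketched as a greedy loop. This is an illustrative reading of the bullets above, not the actual BOINC scheduler: the `Host` and `WorkUnit` fields, the runtime estimate, and the choice to test only disk space and the delay bound are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Host:
    flops: float         # measured FP speed, operations per second
    disk_free: float     # bytes available for project files
    on_fraction: float   # fraction of time the host is computing


@dataclass
class WorkUnit:
    name: str
    est_flops: float     # estimated FP operations for the computation
    disk_needed: float   # bytes
    delay_bound: float   # seconds allowed before the result is late


def estimated_runtime(wu, host):
    """Wall-clock seconds, stretched by the fraction of time the host computes."""
    return wu.est_flops / (host.flops * host.on_fraction)


def choose_work(host, requested_seconds, candidates):
    """Send work until the request is filled, skipping infeasible work units."""
    sent, queued = [], 0.0
    disk_left = host.disk_free
    for wu in candidates:
        if queued >= requested_seconds:
            break                        # request satisfied
        runtime = estimated_runtime(wu, host)
        if wu.disk_needed > disk_left:
            continue                     # infeasible: not enough disk
        if queued + runtime > wu.delay_bound:
            continue                     # would finish after its deadline
        sent.append(wu)
        queued += runtime
        disk_left -= wu.disk_needed
    return sent
```

On the client side, the work buffer bounds would decide when to call the scheduler and how many seconds of work to request; that part is left out of this sketch.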
Client/server protocol (XML-RPC)
● Request
– Authentication
– Host description
– Persistent file descriptions
– Result descriptions
– Duration of work requested
● Reply
– Application, workunit, result descriptors
– Result acknowledgements
– Preferences
– Control messages (redirect, back off, etc.)
Work sequences
● Handle long (weeks or months) computations with large local state
● Sequence normally stays on one host; move to different host if failure
● Scheduling, redundancy checking are tricky
[Diagram labels: Upload state; Check for abort]
Redundant computing
● Create several results per workunit
● Find “canonical result” with project-specific consensus policy
● Generate additional copies as needed, up to error thresholds
● One result per WU per user
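One possible consensus policy is exact-match majority with a quorum. The slide leaves the policy project-specific, so the sketch below is only an example of such a policy; `find_canonical` and its arguments are hypothetical names.

```python
from collections import Counter


def find_canonical(results, quorum):
    """Pick a canonical output by exact-match voting.

    results: list of (user_id, output) pairs for successfully returned results.
    Returns the agreed output, or None if no output reaches the quorum.
    """
    # Enforce "one result per WU per user": only a user's first result counts.
    by_user = {}
    for user_id, output in results:
        by_user.setdefault(user_id, output)
    if not by_user:
        return None
    output, votes = Counter(by_user.values()).most_common(1)[0]
    return output if votes >= quorum else None
```

For example, with a quorum of 2, `find_canonical([(1, "A"), (2, "A"), (3, "B")], 2)` returns `"A"`, while two matching results from the same user do not form a quorum on their own. Floating-point outputs would usually need a tolerance-based comparison instead of exact equality.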
Participant Credit
● Goals:
– credit for work actually done (CPU, network, storage)
– don't know workunit size in advance
– cheat-proof
● Integration with redundancy
– claimed credit = benchmark * CPU time
– granted credit = minimum claimed credit
● Handling graphics coprocessors
– project-specific benchmarks
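The two credit formulas fit in a few lines. This sketch uses made-up benchmark scores and CPU times; the units are arbitrary and the function names are hypothetical.

```python
def claimed_credit(benchmark, cpu_seconds):
    """Claimed credit = benchmark * CPU time (formula from the slide)."""
    return benchmark * cpu_seconds


def granted_credit(claims):
    """Granted credit = minimum claimed credit among the redundant results."""
    return min(claims)


# Three hosts return results for the same work unit (illustrative numbers).
claims = [
    claimed_credit(2.0, 100.0),   # honest fast host
    claimed_credit(1.5, 120.0),   # honest slower host
    claimed_credit(3.0, 500.0),   # host reporting an inflated CPU time
]
granted = granted_credit(claims)
```

Taking the minimum means a host that inflates its claim cannot raise what anyone is granted; it can only be ignored.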
Work unit lifecycle
● Work generator: create WU, N results
● Timeout check
– create new results if needed
– detect too many errors, too many results without consensus
● Validator
– find canonical result; grant credit
● Assimilator
– merge canonical result into project DB
● File deleter
– delete input and output files when no longer needed
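The timeout check step can be sketched as a function run periodically over each work unit that has no canonical result yet. The state names, the thresholds, and the return values are assumptions made for illustration; they are not taken from the BOINC source.

```python
from dataclasses import dataclass


@dataclass
class Result:
    state: str        # "in_progress", "success", or "error"
    deadline: float   # absolute time by which the result is due


def timeout_check(results, now, target, max_errors, max_success):
    """Decide what to do with a work unit that has no canonical result.

    Returns (action, n_new_results).
    """
    # Results that missed their deadline count as errors.
    for r in results:
        if r.state == "in_progress" and now > r.deadline:
            r.state = "error"
    errors = sum(r.state == "error" for r in results)
    successes = sum(r.state == "success" for r in results)
    if errors > max_errors:
        return ("too_many_errors", 0)
    if successes > max_success:
        return ("no_consensus", 0)       # many results, still no agreement
    live = len(results) - errors
    if live < target:
        return ("create_results", target - live)
    return ("wait", 0)
```

The two failure outcomes correspond to the slide's "too many errors" and "too many results without consensus"; in either case the work unit would be flagged rather than retried forever.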
Participating in a BOINC project
[Sequence diagram between user, project web site, and core client. Steps: create account; email account ID; download core client; enter account ID, project URL; get list of scheduling servers; scheduler RPC]
Windows GUI
● Multi-language
● Operations: suspend/resume, attach/detach projects, etc.
Participant preferences
Project-specific preferences
User-visible web features
● User profiles
– user of the day
● Forums
● Self-moderating FAQs
● Teams
● XML data export (3rd party statistics reporting)
Project configuration file
<boinc>
<config>
    <db_name>ap</db_name>
    <db_passwd></db_passwd>
    <shmem_key>0x35740417</shmem_key>
    <key_dir>/mydisks/a/users/boincadm/keys</key_dir>
    <upload_url>http://setiboinc.ssl.berkeley.edu/ap_cgi/file_upload_handler</upload_url>
    <upload_dir>/mydisks/a/users/boincadm/projects/AstroPulse_Beta/upload</upload_dir>
    <cgi_url>http://setiboinc.ssl.berkeley.edu/ap_cgi</cgi_url>
    <log_dir>/mydisks/a/users/boincadm/projects/AstroPulse_Beta/log</log_dir>
    <disable_account_creation/>
</config>
<daemons>
    <daemon><cmd>feeder -d 1</cmd></daemon>
    <daemon><cmd>validate_test -d 2 -app AstroPulse -quorum 3</cmd></daemon>
    <daemon><cmd>timeout_check -d 2 -app AstroPulse -nerror 10 -ndet 10 -nredundancy 3</cmd></daemon>
    <daemon><cmd>assimilator -d 2 -app AstroPulse</cmd></daemon>
    <daemon><cmd>file_deleter -d 2</cmd></daemon>
</daemons>
<tasks>
    <task><cmd>update_stats -update_users -update_hosts -update_teams</cmd><period>1 hour</period></task>
    <task><cmd>get_load</cmd><period>5 min</period></task>
    <task><cmd>db_count "user"</cmd><output>count_users.out</output><period>5 min</period></task>
    <task><cmd>db_count "result"</cmd><output>count_results_all.out</output><period>5 min</period></task>
</tasks>
</boinc>
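A control program needs to pull the daemon commands and periodic tasks out of this file. The sketch below reads a trimmed copy of the configuration with Python's standard XML parser; `load_config` and `parse_period` are hypothetical helpers, and the period parser only handles the two units that appear on the slide ("min" and "hour").

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the slide's configuration, kept short for the example.
CONFIG_XML = """<boinc>
  <config>
    <db_name>ap</db_name>
    <disable_account_creation/>
  </config>
  <daemons>
    <daemon><cmd>feeder -d 1</cmd></daemon>
    <daemon><cmd>file_deleter -d 2</cmd></daemon>
  </daemons>
  <tasks>
    <task><cmd>get_load</cmd><period>5 min</period></task>
    <task><cmd>db_count "user"</cmd><output>count_users.out</output><period>5 min</period></task>
    <task><cmd>update_stats -update_users -update_hosts -update_teams</cmd><period>1 hour</period></task>
  </tasks>
</boinc>"""

UNIT_SECONDS = {"min": 60, "hour": 3600}  # only the units seen on the slide


def parse_period(text):
    """Convert a period such as '5 min' or '1 hour' to seconds."""
    count, unit = text.split()
    return int(count) * UNIT_SECONDS[unit]


def load_config(xml_text):
    """Return (daemon commands, periodic tasks, boolean flags)."""
    root = ET.fromstring(xml_text)
    daemons = [d.findtext("cmd") for d in root.findall("daemons/daemon")]
    tasks = [
        {
            "cmd": t.findtext("cmd"),
            "output": t.findtext("output"),  # None when the task has no <output>
            "period_s": parse_period(t.findtext("period")),
        }
        for t in root.findall("tasks/task")
    ]
    flags = {
        "disable_account_creation":
            root.find("config/disable_account_creation") is not None,
    }
    return daemons, tasks, flags


daemons, tasks, flags = load_config(CONFIG_XML)
```

Empty elements such as `<disable_account_creation/>` act as boolean switches: their presence is the setting.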
Project control
● Single control program
– enable, disable
– cron
– status
● uses PID files to keep track of daemons
● uses timestamp file for periodic tasks
● uses lockfiles for mutual exclusion
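The lockfile and timestamp-file mechanisms can be sketched with standard file operations. This is a generic illustration rather than the control program's actual code; the function names are hypothetical, and storing the last run time as the file's modification time is an assumption.

```python
import os
import time


def acquire_lock(path):
    """Atomically create the lockfile; return False if it already exists."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))  # record the holder, as a PID file would
    return True


def release_lock(path):
    """Remove the lockfile so another process can acquire it."""
    os.remove(path)


def task_is_due(stamp_path, period_s, now=None):
    """A task is due if it never ran or its period has elapsed since the last run."""
    now = time.time() if now is None else now
    try:
        last = os.path.getmtime(stamp_path)
    except FileNotFoundError:
        return True
    return now - last >= period_s


def mark_run(stamp_path, now=None):
    """Record the run time as the timestamp file's modification time."""
    now = time.time() if now is None else now
    with open(stamp_path, "a"):
        pass  # create the file if it does not exist
    os.utime(stamp_path, (now, now))
```

A cron-driven invocation would take the lock, run every task for which `task_is_due` is true, call `mark_run` for each, and release the lock, so overlapping invocations cannot run the same task twice.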
Python-based testing system
● Create objects representing projects, hosts, applications, work, etc.
● Activate objects to realize (create databases and directories, run servers and clients)
● Simulate various types of failures
● Check correctness of final system state (database, result files, etc.)
host = Host()
user = UserUC()
for i in range(2):
    ProjectUC(users=[user], hosts=[host],
              redundancy=5,
              short_name="test_1sec_%d"%i,
              resource_share=[1, 5][i])
run_check_all()
Monitoring/debugging tools
● All backend processes create log files
– web/grep tool for tracking particular WU/result
● Database browsing tools
– summary of activity; entry point for browsing
● Strip charts
– record, graph measures of system health
● Watchdogs
– detect system failures; ring pager
Summary and status
● BOINC is funded by a 3-year NSF grant
● Computing projects at Space Sciences Lab
– Astropulse (in beta test)
– SETI@home (original, Australian)
● Other projects
– Folding@home
– Climateprediction.net
● Source code is free for noncommercial use