
Transcript of London bosc2010

Page 1: London bosc2010

Dealing with the Data Deluge: What can the Robotics Community Teach us?

Making our pipelines organic, adaptable, and scalable

Darin London

Page 2: London bosc2010

Part I. The Challenges of NextGen Sequencing Data

Page 3: London bosc2010

Datasets

50+ Cell Lines
Each sequenced with up to 2 different technologies (DNaseHS and FAIRE) and 3 different ChIP-Seq antibodies (CTCF, PolII, c-Myc), as well as a Control (Input) for comparison
Most involved multiple biological replicates, and some biological replicates were sequenced multiple times to create technical replicates of the same biological sample
1.3 Gb zipped raw data per Cell_line-Technology-Replicate on average
351 Gb zipped raw sequence data analyzed (and counting...)

Page 4: London bosc2010
Page 5: London bosc2010

Some characteristics of NextGen Sequencing Data

heterogeneous in time:
comes in batches by lane and sample
order of date of sample submission does not fix the order of date of receipt of data

heterogeneous in size:
some samples will produce more data than others
size affects timing of most computational tasks

heterogeneous in quality:
some data will not merit being run through the entire pipeline
some data may merit extra analysis

Page 6: London bosc2010
Page 7: London bosc2010
Page 8: London bosc2010

Part II. A Tale of Two Robots

Page 9: London bosc2010

Meet Shakey

http://www.ai.sri.com/movies/Shakey.ram

The first fully autonomous robot able to reason about its surroundings
Pioneered many algorithms to model data from multiple sensors into a central world map, apply one or more plans of action, and determine appropriate action behaviors to achieve these plans
If science is the 'Art of the Soluble', then Shakey demonstrated the solubility of autonomous robotics to the world.

Page 10: London bosc2010

That being said...

The autonomous systems roving on Mars, fighting in Afghanistan, and cleaning our floors do not have much in common with Shakey.

Page 11: London bosc2010

These systems descend from more practical approaches pioneered in the 1980s by Rodney Brooks and others

In 1986, he introduced the world to Allen, a Behavior-based robot based on the Subsumption Architecture

Page 12: London bosc2010

Behavior-based Robots
Attempt to mimic biological actions, rather than human cognition
Built out of many small modules
Modules act autonomously by continuously sensing the environment for specific signals, and immediately performing a specific action based on that sensory input
Modules arranged hierarchically, with higher layer modules able to mask (subsume) the input or output of lower layer modules (lower layer modules are not aware that they are being subsumed)
There is no central planning module
The intelligence of the system is completely distributed throughout all the smaller subsystems, each designed to achieve certain parts of the overall task list opportunistically as the environment becomes favorable to it acting as it is designed to act
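A minimal sketch of the subsumption idea in Perl (an invented example, not code from the talk): higher-layer modules are checked first, and the first one whose trigger fires masks everything below it.

#!/usr/bin/perl
# minimal subsumption sketch (invented example, not pipeline code)
use strict;
use warnings;

# higher layers come first; the first module whose trigger fires masks
# (subsumes) everything below it, and there is no central planner
my @layers = (
    { name => 'avoid_obstacle', act => sub { $_[0]->{obstacle} ? 'turn away' : undef } },
    { name => 'wander',         act => sub { 'move forward' } }, # lowest layer, default behavior
);

# three fake sensor readings stand in for a continuous sensing loop
foreach my $env ( { obstacle => 0 }, { obstacle => 1 }, { obstacle => 0 } ) {
    foreach my $module (@layers) {
        my $action = $module->{act}->($env);
        next unless defined $action;        # this layer has nothing to say
        print "$module->{name}: $action\n"; # higher layer wins; lower layers never see the signal
        last;
    }
}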

Page 13: London bosc2010

Cost-Benefit Analysis

Benefits of Behavior-based robots over AI:
Easier and cheaper to build
Scale better with existing technology
More easily adaptable, new behaviors emerge with the addition of modules with little or no change to other modules
More fault tolerant, partial behaviors tend to persist even when many modules fail to act

Deficiencies of Behavior-based robots:
'Higher order' reasoning and logic functions are too complex
No capacity to learn from mistakes except through changes, addition, or subtraction of modules

Page 14: London bosc2010

Part III. Making our Bioinformatics Pipelines Organic and Adaptable

Page 15: London bosc2010

Many Bioinformatics Pipelines resemble Shakey by:

Involving centralized controller systems which control every aspect of pipeline behavior
Mixing the logic for selecting tasks from a list together with the logic for performing these tasks

Page 16: London bosc2010

Except that, unlike Shakey, many pipelines:

Have little or no knowledge of their computing environment
Have no, or very little, capacity to:

perform tasks in different orders, opportunistically
temporarily re-focus their work on smaller subsets of the total task list
run tasks in parallel
etc.

Lack intelligent points for human agent inclusion
Are subject to human will at every level

Page 17: London bosc2010

Behavior-based Pipelines are ideal for dealing with heterogeneous data efficiently

Page 18: London bosc2010

They are Modular

Much like Object Oriented Programming
Failure is easy to diagnose and fix
Failure in one module does not (necessarily) impact other module actions
Failure in one module does not (necessarily) require other modules to be rerun, or require complex skipping logic in the pipeline code

Page 19: London bosc2010

They are Adaptable

New analyses should simply require plugging in a new module, with minimal or no 'rewiring' of other modules
Reanalyses should simply require the removal of certain outputs, and possibly a reset of the completion state of a particular task; all downstream tasks should then either react to the presence of new data, or require minimal state manipulation to rerun themselves
Modules can be augmented or replaced as needed, with little or no change to other modules, as long as their original functionality is maintained or assumed by another module

Page 20: London bosc2010

They are Scalable

Modules can be deployed onto as many different machines as are available (servers, nodes on a cluster, nodes in a cloud) to expand throughput
Modules with high resource requirements can be deployed onto separate machines from those with low resource requirements
Modules can be grouped together on different machines, or sets of machines, according to functionality, or data proximity

Page 21: London bosc2010

They act Autonomously

Individual modules can 'react' to data to produce information as soon as the data is made available in the 'environment'
Datasets can be moved through the pipeline at different rates
Modules do not require humans to manage them, but, instead, react and respond to different human inputs at many different places
Humans are really just another intelligent agent in the system
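A minimal sketch of this reactive pattern (assumed, not actual pipeline code): the agent exits quietly until its input appears in the environment, so it can be launched over and over and only does work once the data arrives.

#!/usr/bin/perl
# sketch of an agent reacting to data presence (path and step are hypothetical)
use strict;
use warnings;

my $input = '/nfs/data/example/sequence.txt'; # hypothetical raw data file

exit 0 unless -s $input;       # nothing to react to yet; the next run will try again
print "processing $input\n";   # a real agent would launch its analysis step here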

Page 22: London bosc2010

They can act Opportunistically

Modules can be tied into multiple task-management systems

overall dataset-task list
priority dataset-task list
machine specific dataset-task lists
manual intervention

The priority system can be set to take precedence over the overall system. If priority datasets get backlogged, the system can still opportunistically process items in the overall system until the backlog is cleared; the priority system then regains the focus of some or all machines in the system.
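A minimal sketch of that fallback logic (invented dataset names, not the pipeline's actual runner code):

#!/usr/bin/perl
# choose the next dataset: prefer the priority list, fall back to the overall list
use strict;
use warnings;

my @priority = ( { name => 'K562_FAIRE_1',      runnable => 0 } );  # backlogged, waiting on data
my @overall  = ( { name => 'GM12878_DNaseHS_1', runnable => 1 },
                 { name => 'HelaS3_CTCF_2',     runnable => 1 } );

my ($next) = grep { $_->{runnable} } @priority;              # priority work first, if any is ready
($next)    = grep { $_->{runnable} } @overall unless $next;  # otherwise work opportunistically

if ($next) {
    print "next task: $next->{name}\n";
} else {
    print "nothing runnable right now\n";
}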

Page 23: London bosc2010

They are sensitive to their computing environment, and knowledgeable of the resources they need to work

Modules should know how much memory, file system space, etc. they need
Modules should know about other modules that would compete with them for scarce resources
This may run counter to the ethos of platform neutrality, but, for instance (if you are running on redhat/centos), you can parse /proc/meminfo for memory information (my $meminfo = YAML::LoadFile('/proc/meminfo')), ps for information on other processes running in the environment, df for filesystem information, etc.

Page 24: London bosc2010

These systems have other advantages:

They make it easy to get up and running with 1-2 modules tested on a small dataset, which can then be applied to all other datasets available, and yet to come
They allow for 'partial solutions', e.g. some data will always be produced even if the entire pipeline is not finished (what pipeline is ever 'finished', anyway?), or if one or more parts of the pipeline are discovered to have bugs
New modules can be created, tested against 1 or more datasets, and then 'released to the wild' so that they can autonomously fill in the gaps for all previously received data, and then analyze all data received in the future
Buggy modules can be pulled out of the pipeline, fixed and tested in the same way

Page 25: London bosc2010

Part IV. The IGSP Encode Pipeline

Page 26: London bosc2010

Pipeline designed to generate data for the Encyclopedia of DNA Elements (ENCODE)

http://www.genome.gov/10005107

For both ENCODE and non-ENCODE cell_lines and treatments:
Automates movement of data from sequencing staging to IGSP server
Aligns raw sequence files to hg19 using bwa (previously hg18 using maq)
Generates feature density distributions of whole-genome sequence data aligned to hg19
Generates visual tracks of data in the IGSP internal UCSC Genome Browser
Generates submission tarballs of bam, peaks, parzen bigWigs, and base count bigWigs to be submitted to UCSC

Page 27: London bosc2010

Compute Infrastructure
4 CentOS Compute Nodes: 8 core (2.50 GHz, dual quad core procs), 32GB 1066MHz RAM, Primary 120GB HDD, Secondary 250GB HDD
Duke Shared Cluster Resource: 19 high priority ENCODE nodes, each with 8 cores and 16 GB RAM
Compute nodes connected to DSCR via NFS mounted volume provided by a NetApp NAS array of 42 15k 450GB FC disks exported through a 10G Fibre-E link
Raw data and analytical output stored on two NFS mounted volumes provided by a NetApp NAS array of 14 7.2k SATA disks, 1TB and 750G in size
Each compute node contains its own, locally mounted 230G scratch directory to minimize NFS read-write concurrency issues

Page 28: London bosc2010

Pipeline composed of many different agents, each falling into one of three categories:

Runner Agents: These simply read through a list of datasets and tasks to be done on each dataset, and launch the necessary processing agents required to accomplish each task on the dataset. They do not care whether it is possible for the agent to accomplish the task on the dataset.

Processing Agents: These are small programs designed to perform a specific processing task on a given dataset. In addition, they are designed to know when it is possible to perform the task (based on prerequisites), whether the resources (memory, storage space, etc.) required for it to run are available, and whether other programs running on the system will compete with it in ways which adversely affect its performance.

Page 29: London bosc2010

Main Task List

Composed of a set of worksheets in a Google Spreadsheet. This has a number of advantages:

Allows people all over the world to keep track of what has been done, and what remains to be done
Since the Google Spreadsheet API is also available to agents on any internet connected computer, it can be used by runner and processing agents on any number of servers
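As an illustration (a guess at the layout, based on the runner and agent code shown in Part V), each worksheet row binds the key fields (cellline, technology, replicate) to a dataset and carries one column per goal, plus ready and complete flags:

cellline    technology   replicate   ready   aligned   parzen   complete
GM12878     DNaseHS      1           1       1         r

A blank goal cell means the task still needs to be run; a value such as r, 1, or F (plausibly running, done, and failed) causes the runner to skip that goal.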

Page 30: London bosc2010

Humans are the third type of agent in this system

The google spreadsheet model makes it very easy to plug humans into the overall logic of the system:

arguments, variables, and state switches can be communicated to an agent using meta-fields on the worksheet. The values for these fields can be filled in by humans, or other computer agents
processing agents can be coded to require prerequisite meta-fields which a human must switch on before they run
processing agents can write data to information fields upon completion, failure, or both. This might include changing the state of prerequisite fields required by other agents
processes requiring human intervention can be replaced by computational logic over time, as the logic becomes formalized into one or more agents
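For example (a hypothetical goal and prerequisite field, using only constructor options that appear in the real agent script in Part V), a human approval gate can be expressed as a prerequisite the agent waits on:

#!/usr/bin/perl
# sketch: an agent gated on a human-controlled 'approved' meta-field (hypothetical)
use strict;
use Google::Spreadsheet::Agent;

my $agent = Google::Spreadsheet::Agent->new(
    agent_name      => 'submit',    # hypothetical goal column
    page_name       => 'DNaseHS',
    bind_key_fields => { cellline => 'GM12878', technology => 'DNaseHS', replicate => 1 },
    prerequisites   => ['approved'] # a human switches this field on in the worksheet row
);

$agent->run_my(sub { print "building submission tarball\n"; return 1 });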

Page 31: London bosc2010

Part V. Google::Spreadsheet::Agent

http://search.cpan.org/~dmlond/Google-Spreadsheet-Agent-0.01

Page 32: London bosc2010

#!/usr/bin/perl
use strict;
use File::Basename;
use Getopt::Std;
use Google::Spreadsheet::Agent;
# usually other modules are used

# the goal (spreadsheet column) this agent serves is taken from the script name,
# e.g. parzen_agent.pl -> parzen
my $goal = basename($0);
$goal =~ s/\_agent\.pl//;

my $cell_line = shift or die "cell_line\n";
my $technology = shift or die "technology\n";
my $replicate = shift or die "replicate\n";
my $google_page = ($replicate =~ m/.*\_TP.*/) ? 'combined' : $technology;

my %opts;
getopts('dr:P:', \%opts);
my $debug = $opts{d};
my $data_root; # default presumably set elsewhere in the full script
$data_root = $opts{r} if ($opts{r});
$google_page = $opts{P} if ($opts{P});

my $prerequisites = [];
$prerequisites->[0] = ($replicate =~ m/.*\_TP.*/) ? 'combined' : 'aligned';

my $google_agent = Google::Spreadsheet::Agent->new(
    agent_name      => $goal,
    page_name       => $google_page,
    debug           => $debug,
    max_selves      => 3,
    bind_key_fields => {
        cellline   => $cell_line,
        technology => $technology,
        replicate  => $replicate
    },
    prerequisites   => $prerequisites
);

$google_agent->run_my(\&agent_code);
exit;
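Assuming the script above were saved as parzen_agent.pl (a hypothetical name, chosen to match the parzen helpers called in agent_code on page 34), it would be invoked with the three key-field values as positional arguments, followed by any switches; the cell line, technology, and replicate values here are only examples:

./parzen_agent.pl GM12878 DNaseHS 1 -r /path/to/data_root
./parzen_agent.pl GM12878 DNaseHS 1_TP1 -d -r /path/to/data_root   # a technical replicate is routed to the 'combined' page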

Page 33: London bosc2010

my $min_gigs = 18; # start with an 18G /scratch2 availability requirement
my $gigs_avail = &get_scratch_availability or exit(1);
exit if ($gigs_avail < $min_gigs);

# parse the output of df -h to find the space available on /scratch2
sub get_scratch_availability {
    my $opened = open (my $df_in, '-|', 'df', '-h', '/scratch2');
    unless ($opened) {
        print STDERR "Couldn't check scratch2 usage $!\n";
        return;
    }
    my $in = <$df_in>; # skip first line (the df header)
    $in = <$df_in>;
    chomp $in;
    close $df_in;
    my $gigs_avail = (split /\s+/, $in)[3];
    $gigs_avail =~ s/\D+$//; # strip the trailing unit letter
    return $gigs_avail;
}

use YAML::Any qw/LoadFile/;

my $min_mem = 16; # requires about 16-18G memory to run
exit if (&get_available_memory <= $min_mem);

# /proc/meminfo parses as YAML key: value pairs; available memory is
# MemFree plus Buffers and Cached, converted from kB to GB
sub get_available_memory {
    my $info = LoadFile('/proc/meminfo') or die "Couldn't load meminfo $!\n";
    my $free_mem = $info->{MemFree};
    $free_mem =~ s/\D+$//;
    my $buffers = $info->{Buffers};
    my $cached = $info->{Cached};
    $buffers =~ s/\D+$//;
    $cached =~ s/\D+$//;
    $free_mem += $buffers + $cached;
    $free_mem /= (1024*1024);
    return $free_mem;
}

Page 34: London bosc2010

sub agent_code {
    my $entry = shift; # the spreadsheet row for this cellline-technology-replicate
    my $replicate_root = join('/', $data_root, $cell_line, $technology, 'sequence_'.$replicate);
    my $db_name = getDBName($replicate_root); # getDBName and $generic_apps_dir are defined elsewhere in the full script
    my $build = $entry->{build}; # genome build taken from the spreadsheet row
    my $scratch_root = $replicate_root;
    $scratch_root =~ s/$data_root/\/scratch2/;

    my $helper_command = join(' ',
        join('/', $generic_apps_dir, 'parzen_fseq_helper.pl'),
        $replicate_root,
        join('/', $replicate_root, 'bwa_'.$entry->{build}, 'sequence.final.bed'),
        $cell_line,
        $technology,
        $entry->{sex},
        $entry->{build},
        $db_name
    );

    print STDERR "Running ${helper_command}\n";
    `$helper_command`;
    if ($?) {
        print STDERR "Problem running parzen_helper $!\n";
        return;
    }

    my $parzen_track_name = $db_name . "_parzen";
    my $scratch_parzen_dir = join('/', $scratch_root, 'parzen_'.$build);
    my $parzen_dir = join('/', $replicate_root, 'parzen_'.$build);
    $parzen_dir =~ s/sata2/sata4/;

    my $wiggle_helper = join(' ',
        join('/', $generic_apps_dir, 'parzen_wiggle_helper.pl'),
        $build,
        $parzen_track_name,
        $parzen_dir,
        $scratch_parzen_dir
    );

    print STDERR "Running ${wiggle_helper}\n";
    `$wiggle_helper`;
    if ($?) {
        print STDERR "Problem running wiggle_helper $!\n";
        return;
    }
    return 1;
}

Page 35: London bosc2010

#!/usr/bin/perl
use strict;
use FindBin;
use Google::Spreadsheet::Agent;

my $google_agent = Google::Spreadsheet::Agent->new(
    agent_name      => 'agent_runner',
    page_name       => 'all',
    bind_key_fields => {
        cellline   => 'all',
        technology => 'all',
        replicate  => 'all'
    }
);

# iterate through each page on the database, get runnable rows, and run each runnable on the row
foreach my $page_name ( map { $_->title } $google_agent->google_db->worksheets ) {
    foreach my $runnable_row (
        grep { $_->content->{ready} && !$_->content->{complete} }
        $google_agent->google_db->worksheet({ title => $page_name })->rows
    ) {
        my $row_content = $runnable_row->content;
        foreach my $goal (keys %{$row_content}) {
            next if ($row_content->{$goal}); # r,1,F cause it to skip
            # some of these will skip because they are fields without agents
            my $goal_agent = $FindBin::Bin.'/../agent_bin/'.$goal.'_agent.pl';
            next unless (-x $goal_agent);
            my @cmd = ($goal_agent);
            foreach my $query_field (
                sort {
                    $google_agent->config->{key_fields}->{$a}->{rank}
                        <=>
                    $google_agent->config->{key_fields}->{$b}->{rank}
                } keys %{$google_agent->config->{key_fields}}
            ) {
                next unless ($row_content->{$query_field});
                push @cmd, $row_content->{$query_field};
            }
            system( join(' ', @cmd).'&' );
            sleep 5;
        }
    }
}
exit;

Page 36: London bosc2010

Future Plans

1. Making inter-lab communication more concrete, automatic

2. Each server can have its own 'task' view of a particular google spreadsheet worksheet, in that it can have its own unique set of executable agent_bin scripts tied to a set of fields that systems on other servers would ignore

3. Put some of the runner code, and requirements checking routines, into Google::Spreadsheet::Agent for version 1.1

Page 37: London bosc2010

Acknowledgements

The Institute for Genome Sciences and Policy (IGSP)
The ENCODE Consortium
Terry Furey
Alan Boyle
Greg Crawford
Mark DeLong
Rob Wagner
Peyton Vaughn
Darrin Mann
Alan Cowles