PacMin @ AMPLab All-Hands


description

Describes early work on PacMin, a ploidy-aware, overlap/string-based approach for assembling genomes from long reads.

Transcript of PacMin @ AMPLab All-Hands

Page 1: PacMin @ AMPLab All-Hands

PacMin: rethinking genome analysis with long reads

Frank Austin Nothaft, AMPLab Joint work with Adam Bloniarz

10/14/2014

Page 2: PacMin @ AMPLab All-Hands

Note:

• This talk is mostly speculative.

• I.e., the methods we’ll talk about are partially* implemented.

• This means you have an opportunity to steer the direction of this work!

* I’m being generous to myself.

Page 3: PacMin @ AMPLab All-Hands

Sequencing 101

• Most sequence data today comes from Illumina machines, which perform sequencing-by-synthesis

• We get short (100-250 bp) reads, with high accuracy

• Reads are (usually) paired

http://en.wikipedia.org/wiki/File:Sequencing_by_synthesis_Reversible_terminators.png

Page 4: PacMin @ AMPLab All-Hands

Current Pipelines are Reference Based

• Map subsequences to a “reference genome”

• Compute variants (diffs) against the reference

From “GATK Best Practices”, https://www.broadinstitute.org/gatk/guide/best-practices

Page 5: PacMin @ AMPLab All-Hands

An aside: What is the reference genome?

• Pool together n individuals, and assemble their genomes together

• A few problems:

• How does the reference genome handle polymorphisms?

• What about structural rearrangements?

• Subpopulation specific alternate haplotypes?

• It has gaps. 14 years after the first human reference genome was released, it is still incomplete.*

* This problem is Hard.

Page 6: PacMin @ AMPLab All-Hands

The Sequencing Abstraction

• Sample Poisson-distributed substrings from a larger string

• Reads are more or less unique and correct

It was the best of times, it was the worst of times…

Sampled reads: “It was the”, “the best of”, “times, it was”, “best of times”, “was the worst”, “the worst of”, “worst of times”

Metaphor borrowed from Michael Schatz
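The sampling abstraction above can be sketched in a few lines of Python. This is a toy model (`sample_reads` is an illustrative name, not anything from PacMin): start positions are drawn uniformly, which makes per-position coverage approximately Poisson distributed, matching the abstraction on this slide.

```python
import random

def sample_reads(text, coverage=5, read_len=12, seed=42):
    """Sample fixed-length substrings ("reads") uniformly from a larger string.

    The read count is chosen so each position is covered ~`coverage` times
    on average; uniform start positions give roughly Poisson coverage.
    """
    rng = random.Random(seed)
    n_reads = (coverage * len(text)) // read_len
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(text) - read_len + 1)
        reads.append(text[start:start + read_len])
    return reads

reads = sample_reads("It was the best of times, it was the worst of times")
```

Every sampled read is an exact substring of the original string; the later slides are about what happens when that stops being true.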

Page 7: PacMin @ AMPLab All-Hands

…is a leaky abstraction

• We frequently encounter “gaps” in the sequence

Ross et al, Genome Biology 2013

Page 8: PacMin @ AMPLab All-Hands

…is a leakier abstraction

• We preferentially sequence from “biased” regions:

Ross et al, Genome Biology 2013

Page 9: PacMin @ AMPLab All-Hands

A very leaky abstraction!

• Reads aren’t actually correct

• >2% error (expect 0.1% variation)

• Error probability estimates are cruddy

• Reads aren’t actually unique

• >7% of the genome is not unique (K. Curtis, SiRen)

Page 10: PacMin @ AMPLab All-Hands

The State of Analysis

• We’re really good at calling SNPs!

• But, we’re still pretty bad at calling INDELs and SVs

• And we’re also bad at expressing diffs

• Hence, SMaSH! But really, reference + diff format need to be burnt to the ground and redesigned.

• And, it’s slow: 2 weeks to sequence, 1 week to analyze. Not fast enough for practical clinical use.

Page 11: PacMin @ AMPLab All-Hands

Opportunities

• New read technologies are available

• Provide much longer reads (>10 kbp, vs. 250 bp)

• Different error model… (15% INDEL errors, vs. 2% SNP errors)

• Generally, lower sequence-specific bias

Left: PacBio homepage. Right: Wired, http://www.wired.com/2012/03/oxford-nanopore-sequencing-usb/

Page 12: PacMin @ AMPLab All-Hands

If long reads are available…

• We can use conventional methods:

Carneiro et al, Genome Biology 2012

Page 13: PacMin @ AMPLab All-Hands

But!

• Why not make raw assemblies out of the reads?

[Figure: for all pairs of reads (i, j), find overlapping reads, then find the consensus sequence, e.g. …ACACTGCGACTCATCGACTC…]

• Problems:

1. Overlapping is O(n²), and a single overlap evaluation is expensive anyway

2. Typical algorithms find a single consensus sequence; what if we’ve got polymorphisms?

Page 14: PacMin @ AMPLab All-Hands

Fast Overlapping with MinHashing

• Wonderful realization by Berlin et al.¹: overlapping is similar to the document similarity problem

• Use MinHashing to approximate similarity:

1: Berlin et al, bioRxiv 2014

Per document/read, compute a signature:

1. Cut into shingles
2. Apply random hashes to shingles
3. Take min over all random hashes

Hash into buckets: signatures of length l can be hashed into b buckets, so we expect to compare all elements with similarity ≥ (1/b)^(b/l)

Compare: for two documents with signatures of length l, Jaccard similarity is estimated by (# equal hashes) / l

• Easy to implement in Spark: map, groupBy, map, filter
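A minimal sketch of the signature and comparison steps in plain Python (the function names `shingles`, `minhash_signature`, and `estimated_jaccard` are illustrative, and the random hash family is a standard affine construction, not necessarily the one Berlin et al. use; in Spark, the same logic would be spread across the map/groupBy stages the slide mentions):

```python
import random

def shingles(s, k=4):
    """Cut a string into its set of overlapping k-shingles (k-mers)."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def minhash_signature(s, num_hashes=64, k=4, seed=0):
    """Apply `num_hashes` random hash functions to the shingles and
    keep the minimum value under each one."""
    rng = random.Random(seed)
    # Each "random hash" is h(x) = (a*hash(x) + b) mod p for random a, b.
    p = (1 << 61) - 1
    params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_hashes)]
    sh = shingles(s, k)
    return [min((a * hash(x) + b) % p for x in sh) for (a, b) in params]

def estimated_jaccard(sig1, sig2):
    """Jaccard similarity ≈ fraction of equal signature components."""
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)
```

Because two sets share a minimum under a random hash with probability equal to their Jaccard similarity, averaging over the l hashes gives the (# equal hashes) / l estimator from the slide.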

Page 15: PacMin @ AMPLab All-Hands

Overlaps to Assemblies

• Finding pairwise overlaps gives us a directed graph between reads (lots of edges!)
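For intuition, here is a brute-force sketch of turning reads into that directed graph (illustrative names; this does exact suffix-prefix matching, whereas the point of the previous slide is that MinHashing avoids exactly this all-pairs O(n²) scan):

```python
def overlap_len(a, b, min_len=3):
    """Length of the longest suffix of `a` that is a prefix of `b`."""
    for olen in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:olen]):
            return olen
    return 0

def overlap_graph(reads, min_len=3):
    """Directed edge i -> j when read i's suffix overlaps read j's prefix."""
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                o = overlap_len(a, b, min_len)
                if o:
                    edges[(i, j)] = o
    return edges
```

With real, error-containing long reads the `endswith` test would be replaced by an alignment score, which is why a single overlap evaluation is expensive.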

Page 16: PacMin @ AMPLab All-Hands

Transitive Reduction

• We can find a consensus between clique members

• Or, we can reduce down:

• Via two iterations of Pregel!
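A single-machine sketch of the reduction step (not the Pregel formulation; in GraphX each length-2 path check would become a round of message passing, which is roughly where the "two iterations" come from):

```python
from collections import defaultdict

def transitive_reduction(edges):
    """Remove edge u -> w whenever a two-hop path u -> v -> w exists.

    One pass over length-2 paths; enough for overlap graphs, where a
    transitive edge means two reads both overlap a read between them.
    """
    succ = defaultdict(set)
    for u, w in edges:
        succ[u].add(w)
    reduced = set(edges)
    for u, w in edges:
        for v in succ[u]:
            if v != w and w in succ[v]:
                reduced.discard((u, w))
                break
    return reduced
```

For example, if read a overlaps b, b overlaps c, and a also overlaps c, the a -> c edge is redundant and gets dropped.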

Page 17: PacMin @ AMPLab All-Hands

Actually Making Calls

• From here, we need to call copy number per edge

• Probably via Newton-Raphson based on coverage; we’re not sure yet.

• Then, per position in each edge, call alleles:

Notes: equation is from Li, Bioinformatics 2011

g = genotype state
m = ploidy
𝜖 = probability allele was erroneously observed
k = number of reads observed
l = number of reads observed matching “reference” allele

TBD: the equation assumes biallelic observations at the site and a reference allele; we won’t have either of those conveniences…
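The slide cites Li's equation without reproducing it. A sketch of the biallelic form, using the variable definitions above and taking g as the number of reference alleles in the genotype, with one simplifying assumption: a constant error probability 𝜖 rather than Li's per-read values.

```python
def genotype_likelihood(g, m, eps, k, l):
    """P(data | genotype) at a biallelic site, after Li (Bioinformatics 2011).

    g   = number of reference alleles in the genotype (0..m)
    m   = ploidy
    eps = per-base error probability (constant here, per-read in Li)
    k   = total reads observed at the site
    l   = reads matching the reference allele
    """
    # A read drawn from a reference chromosome matches with prob 1 - eps;
    # one drawn from an alternate chromosome matches with prob eps.
    p_match = (g * (1 - eps) + (m - g) * eps) / m
    return (p_match ** l) * ((1 - p_match) ** (k - l))
```

For a diploid (m = 2) homozygous-reference genotype with 10 reads all matching, this reduces to (1 - 𝜖)¹⁰, as expected. Dropping the biallelic/reference assumptions, per the TBD note, means generalizing p_match to a per-allele emission distribution.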

Page 18: PacMin @ AMPLab All-Hands

Output

• Current assemblers emit FASTA contigs

• In layperson’s speak: long strings

• We’ll emit “multigs”, which we’ll map back to reference graph

• Multig = multi-allelic (polymorphic) contig

• Working with UCSC, who’ve done some really neat work¹ deriving formalisms & building software for mapping between sequence graphs, and with the GA4GH reference variation team

1. Paten et al, “Mapping to a Reference Genome Structure”, arXiv 2014.