Untangling Healthcare With Spark and Dataflow - PhillyETE 2016


Transcript of Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Page 1: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Untangling Healthcare with Spark and Dataflow

Ryan Brush

@ryanbrush

Page 2: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016
Page 3: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Actual depiction of healthcare data

Page 4: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

One out of six dollars

Page 5: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016
Page 6: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Three Acts

(Mostly)

Page 7: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Act I: Making sense of the pieces

Page 8: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

answer = askQuestion (allHealthData)

Page 9: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

8,000 CPT codes

72,000 ICD-10 codes

63,000 SNOMED disease codes

Incomplete, conflicting data sets

No common person identifier

Standard data models and codes interpreted inconsistently

Different meanings in different contexts

55 million patients

3 petabytes of data

How do we make sense of this?

Page 10: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Claims

Medical Records

Pharma

Operational

Link Records Semantic Integration

User-entered annotations

Condition Registries

Quality Measures

Analytics

Rules

Page 11: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Link Records

Claims

Medical Records

Pharma

Operational

Semantic Integration

User-entered annotations

Condition Registries

Quality Measures

Analytics

Rules

Page 12: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Medical Records

Document

Sections

Notes

Addenda

Order

. . .

Normalize Structure

Clean Data

Page 13: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Link Records

Claims

Medical Records

Pharma

Operational

Semantic Integration

User-entered annotations

Condition Registries

Quality Measures

Analytics

Rules

Page 14: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

answer = askQuestion (allHealthData)

Page 15: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

linkedData = link(clean(pharma), clean(claims), clean(records))

normalized = normalize(linkedData)

answer = askQuestion (normalized)

Page 16: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Rein in variance

http://fortune.com/2014/07/24/can-big-data-cure-cancer/

Page 17: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Rein in variance

oral vs. axillary temperature

Page 18: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Join all the things!

Page 19: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

JavaRDD<ExternalRecord> externalRecords = ...

Page 20: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

JavaRDD<ExternalRecord> externalRecords = ...

JavaPairRDD<ExternalRecord, ExternalRecord> cartesian =
    externalRecords.cartesian(externalRecords);

Page 21: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

JavaRDD<Similarity> matches = cartesian.map(t -> {

    ExternalRecord left = t._1();
    ExternalRecord right = t._2();
    double score = recordSimilarity(left, right);

    return Similarity.newBuilder()
        .setLeftRecordId(left.getExternalId())
        .setRightRecordId(right.getExternalId())
        .setScore(score)
        .build();
})
.filter(s -> s.getScore() > THRESHOLD);
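recordSimilarity itself isn't shown in the deck; a minimal sketch of what such a scoring function might look like, assuming hypothetical demographic accessors (getLastName, getBirthDate, getZipCode) on ExternalRecord and illustrative weights:

// Hypothetical similarity scorer: a weighted comparison of demographic
// fields. The field names and weights are assumptions for illustration.
static double recordSimilarity(ExternalRecord left, ExternalRecord right) {
  double score = 0.0;
  if (left.getLastName() != null
      && left.getLastName().equalsIgnoreCase(right.getLastName())) {
    score += 0.4;
  }
  if (left.getBirthDate() != null
      && left.getBirthDate().equals(right.getBirthDate())) {
    score += 0.4;
  }
  if (left.getZipCode() != null
      && left.getZipCode().equals(right.getZipCode())) {
    score += 0.2;
  }
  return score;
}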

Page 22: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

              Person 1   Person 2   Person 3
Person 1         1          0.98       0.12
Person 2                    1          0.55
Person 3                               1

Page 23: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Reassembly: Humpty Dumpty in code

Page 24: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

JavaPairRDD<String,String> idToLink = . . .

JavaPairRDD<String,ExternalRecord> idToRecord = . . .

Page 25: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

JavaPairRDD<String,String> idToLink = . . .

JavaPairRDD<String,ExternalRecord> idToRecord = . . .

JavaRDD<Person> people = idToRecord.join(idToLink)
    .mapToPair(
        // Tuple of universal ID and external record.
        item -> new Tuple2<>(item._2._2, item._2._1))
    .groupByKey()
    .map(SparkExampleTest::mergeExternalRecords);
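mergeExternalRecords is referenced but not shown; a minimal sketch of that step, assuming a hypothetical Person builder with setUniversalId and addSourceRecord methods:

// Hypothetical merge step: fold all external records that share a
// universal person ID into a single Person aggregate.
static Person mergeExternalRecords(
    Tuple2<String, Iterable<ExternalRecord>> grouped) {
  Person.Builder person = Person.newBuilder()
      .setUniversalId(grouped._1);              // the shared universal ID
  for (ExternalRecord record : grouped._2) {
    person.addSourceRecord(record);             // assumed builder method
  }
  return person.build();
}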

Page 26: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

SNOMED:388431003

HCPCS:J1815

ICD10:E13.9

CPT:3046F

SNOMED:43396009, value: 9.4

Page 27: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

[Diagram: the raw codes above mapped to clinical concepts. SNOMED:388431003 and HCPCS:J1815 map to InsulinMed; ICD10:E13.9 maps to DiabetesCondition; CPT:3046F and SNOMED:43396009 (value: 9.4) remain as observations; the concepts roll up to Diabetic.]

Retaking Rules for Developers, Strange Loop 2014
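The deck doesn't show how these code-to-concept mappings are expressed (the Strange Loop talk referenced above covers the rules approach); a minimal sketch of the idea, using a plain lookup table and hypothetical Concept names rather than a rules engine:

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class ConceptMapping {

  // Hypothetical roll-up from raw code systems to clinical concepts.
  enum Concept { INSULIN_MED, DIABETES_CONDITION, DIABETIC }

  static final Map<String, Concept> CODE_TO_CONCEPT = new HashMap<>();
  static {
    CODE_TO_CONCEPT.put("SNOMED:388431003", Concept.INSULIN_MED);
    CODE_TO_CONCEPT.put("HCPCS:J1815", Concept.INSULIN_MED);
    CODE_TO_CONCEPT.put("ICD10:E13.9", Concept.DIABETES_CONDITION);
  }

  // Higher-level concepts are then derived from the lower-level ones.
  static boolean isDiabetic(Set<Concept> concepts) {
    return concepts.contains(Concept.DIABETES_CONDITION)
        || concepts.contains(Concept.INSULIN_MED);
  }
}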

Page 28: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016
Page 29: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016


select * from outcomes where…

Page 30: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016


Page 31: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Start with the questions you want to ask and transform the data to fit.

Page 32: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

But what questions are we asking?

Page 33: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016
Page 34: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

“The problem is we don’t understand the problem.”

-Paul MacCready

Page 35: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

cleanData = clean(allHealthData)

projected = projectForPurpose(cleanData)

answer = askQuestion (projected)

Page 36: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Sepsis

Page 37: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016
Page 38: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Early Lessons

Page 39: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Make no assumptions about your data

Page 40: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Your errors are a signal

Page 41: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Data sources have a signature

Page 42: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

But the latency! And the complexity!

Page 43: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Act II: Putting health care together… fast

Page 44: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

JavaPairRDD<String,String> idToLink = . . .

JavaPairRDD<String,ExternalRecord> idToRecord = . . .

JavaRDD<Person> people = idToRecord.join(idToLink)
    .mapToPair(
        // Tuple of UUID and external record.
        item -> new Tuple2<>(item._2._2, item._2._1))
    .groupByKey()
    .map(SparkExampleTest::mergeExternalRecords);

Page 45: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

JavaPairDStream<String,String> idToLink = . . .

JavaPairDStream<String,ExternalRecord> idToRecord = . . .

idToLink.join(idToRecord);

Page 46: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

JavaPairDStream<String,String> idToLink = . . .

JavaPairDStream<String,ExternalRecord> idToRecord = . . .

StateSpec personAndLinkStateSpec =
    StateSpec.function(new BuildPersonState());

JavaDStream<Tuple2<List<ExternalRecord>,List<String>>> recordsWithLinks =
    idToRecord.cogroup(idToLink)
        .mapWithState(personAndLinkStateSpec);
// And a lot more...
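BuildPersonState is the interesting part and isn't shown; a rough sketch of the shape of such a stateful function. The exact Function3 signature and the Optional type expected by StateSpec.function vary by Spark version (Guava Optional in 1.x, Spark's own in 2.x), so treat the signature as approximate:

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.function.Function3;
import org.apache.spark.streaming.State;
import scala.Tuple2;

// Sketch of a mapWithState function that accumulates record and link
// updates per universal ID across micro-batches.
class BuildPersonState implements
    Function3<String,
              Optional<Tuple2<Iterable<ExternalRecord>, Iterable<String>>>,
              State<Tuple2<List<ExternalRecord>, List<String>>>,
              Tuple2<List<ExternalRecord>, List<String>>> {

  @Override
  public Tuple2<List<ExternalRecord>, List<String>> call(
      String universalId,
      Optional<Tuple2<Iterable<ExternalRecord>, Iterable<String>>> updates,
      State<Tuple2<List<ExternalRecord>, List<String>>> state) {

    // Start from the previously accumulated records and links, if any.
    Tuple2<List<ExternalRecord>, List<String>> current = state.exists()
        ? state.get()
        : new Tuple2<>(new ArrayList<>(), new ArrayList<>());

    // Fold in whatever arrived in this batch for this universal ID.
    if (updates.isPresent()) {
      updates.get()._1.forEach(current._1::add);
      updates.get()._2.forEach(current._2::add);
    }

    state.update(current);
    return current;  // downstream steps merge these into a Person
  }
}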

Page 47: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

[Diagram: record updates and link updates are grouped per key, combined with the previous state, and emitted as updated person records.]

Page 48: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

[Same diagram: grouped record and link updates combined with previous state to produce person records.]

What about deletes?

Page 49: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Stream processing is not “fast” batch processing.

Little reuse beyond core functions

Different pipeline semantics

Must implement compensation logic

Page 50: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Batch Processing

Stream Processing

Reusable Code

Page 51: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

“If you're willing to restrict the flexibility of your approach, you can almost always do something better.”

-John Carmack

Page 52: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

public void rollingMap(EntityKey key, Long version, T value, Emitter emitter);

public void rollingReduce(EntityKey key, S state, Emitter emitter);

Page 53: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

emitter.emit(key,value);

emitter.tombstone(outdatedKey);
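The rolling* API above is internal to the system described in the talk, so the following is only an illustration of how a linking step might use it: emit the newly merged person, and tombstone any previously emitted key the merge supersedes so consumers can compensate. PersonState and its accessors are hypothetical.

// Illustrative only: hypothetical implementation against the internal API.
public void rollingReduce(EntityKey key, PersonState state, Emitter emitter) {
  // Merge whatever records and links have accumulated for this entity.
  Person merged = mergeExternalRecords(state.records(), state.links());
  emitter.emit(key, merged);

  // If linking moved records away from a previously emitted person,
  // retract that person so downstream consumers can compensate.
  for (EntityKey outdatedKey : state.supersededKeys()) {
    emitter.tombstone(outdatedKey);
  }
}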

Page 54: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Domain-Specific API

Batch Host Streaming Host

Reusable Code

Page 55: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

So are we done?

Limited expressiveness

Not composable

Learning curve

Artificial complexity

Page 56: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Act III: Reframing the problem

Page 57: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Batch Processing

Stream Processing

Reusable Code

Page 58: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Batch Processing

Stream Processing

Kappa Architecture

Page 59: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

It’s time for a POSIX of data processing

Page 60: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Make everything a stream!

(If the technology can scale to your volume and historical data.)

(If your problem can be expressed in monoids.)
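To make the monoid caveat concrete: the per-key merge needs to be associative and commutative with an identity value, so that batching and replaying the stream converge on the same answer. A minimal sketch of such a merge for person records, using hypothetical Person accessors:

// Set-union of source record IDs is associative, commutative, and has an
// identity (a Person with no source records), so batching and replay
// order don't change the result. Person and its methods are hypothetical.
static Person mergePeople(Person a, Person b) {
  Set<String> recordIds = new HashSet<>(a.getSourceRecordIds());
  recordIds.addAll(b.getSourceRecordIds());
  return Person.newBuilder()
      .setUniversalId(a.getUniversalId())
      .addAllSourceRecordIds(recordIds)
      .build();
}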

Page 61: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Apache Beam (was: Google Cloud Dataflow)

Page 62: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Potentially a POSIX of data processing

Composable Units (PTransforms)

Unification of Batch and Stream

Spark, Flink, Google Cloud Dataflow runners

Page 63: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Bounded: a fixed dataset

Unbounded: continuously updating dataset

Window: time range of your data to process

Trigger: when to process a time range

Page 64: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

8:00   9:00   10:00   11:00   12:00

Windows and Triggers

Page 65: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

8:00   9:00   10:00   11:00   12:00

Windows and Triggers

Page 66: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

8:00   9:00   10:00   11:00   12:00

Windows and Triggers

Page 67: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

8:00   9:00   10:00   11:00   12:00

Windows and Triggers

Page 68: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

8:01   8:02   8:03   8:04   8:05

Windows and Triggers

Page 69: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

public class LinkRecordsTransform
    extends PTransform<PCollectionTuple, PCollection<Person>> {

  public static final TupleTag<RecordLink> LINKS = new TupleTag<>();
  public static final TupleTag<ExternalRecord> RECORDS = new TupleTag<>();

  @Override
  public PCollection<Person> apply(PCollectionTuple input) { . . . }
}

Page 70: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

apply implementation:

PCollection<KV<String,CoGbkResult>> cogrouped = KeyedPCollectionTuple
    .of(LINKS, idToLinks)
    .and(RECORDS, idToRecords)
    .apply(CoGroupByKey.create());

PCollection<KV<String,ExternalRecord>> uuidToRecs = cogrouped
    .apply(ParDo.of(new LinkExternalRecords()))
    .setCoder(KEY_REC_CODER);

// Combines by key AND window
return uuidToRecs
    .apply(Combine.<String,ExternalRecord,Person>perKey(new PersonCombineFn()))
    .setCoder(KEY_PERSON_CODER)
    .apply(Values.<Person>create());

Page 71: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

PCollection<RecordLink> windowedLinks = . . .

PCollection<ExternalRecord> windowedRecs = . . .

PCollection<Person> people = PCollectionTuple
    .of(LinkRecordsTransform.LINKS, windowedLinks)
    .and(LinkRecordsTransform.RECORDS, windowedRecs)
    .apply(new LinkRecordsTransform());

Page 72: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

PCollection<RecordLink> windowedLinks = . . .

PCollection<ExternalRecord> windowedRecs = . . .

PCollection<Person> people = PCollectionTuple
    .of(LinkRecordsTransform.LINKS, windowedLinks)
    .and(LinkRecordsTransform.RECORDS, windowedRecs)
    .apply(new LinkRecordsTransform());

Page 73: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

PCollection<RecordLink> windowedLinks = links.apply(
    Window.<RecordLink>into(
        FixedWindows.of(Duration.standardMinutes(60))));

PCollection<ExternalRecord> windowedRecs = . . .

PCollection<Person> people = PCollectionTuple
    .of(LinkRecordsTransform.LINKS, windowedLinks)
    .and(LinkRecordsTransform.RECORDS, windowedRecs)
    .apply(new LinkRecordsTransform());

Page 74: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

PCollection<RecordLink> windowedLinks = links.apply(
    Window.<RecordLink>into(
        FixedWindows.of(Duration.standardMinutes(60)))
    .withAllowedLateness(Duration.standardMinutes(15))
    .accumulatingFiredPanes());

PCollection<ExternalRecord> windowedRecs = . . .

PCollection<Person> people = PCollectionTuple
    .of(LinkRecordsTransform.LINKS, windowedLinks)
    .and(LinkRecordsTransform.RECORDS, windowedRecs)
    .apply(new LinkRecordsTransform());

Page 75: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

PCollection<RecordLink> windowedLinks = links.apply(
    Window.<RecordLink>into(
        FixedWindows.of(Duration.standardMinutes(60)))
    .triggering(Repeatedly.forever(
        AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(10))))
    .withAllowedLateness(Duration.standardMinutes(15))
    .accumulatingFiredPanes());

PCollection<ExternalRecord> windowedRecs = . . .

PCollection<Person> people = PCollectionTuple
    .of(LinkRecordsTransform.LINKS, windowedLinks)
    .and(LinkRecordsTransform.RECORDS, windowedRecs)
    .apply(new LinkRecordsTransform());

Page 76: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

PCollection<RecordLink> windowedLinks = links.apply(
    Window.<RecordLink>into(
        SlidingWindows.of(Duration.standardMinutes(120)))
    .triggering(Repeatedly.forever(
        AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(10))))
    .withAllowedLateness(Duration.standardMinutes(15))
    .accumulatingFiredPanes());

PCollection<ExternalRecord> windowedRecs = . . .

PCollection<Person> people = PCollectionTuple
    .of(LinkRecordsTransform.LINKS, windowedLinks)
    .and(LinkRecordsTransform.RECORDS, windowedRecs)
    .apply(new LinkRecordsTransform());

Page 77: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

PCollection<RecordLink> windowedLinks = links.apply(
    Window.<RecordLink>into(new GlobalWindows())
    .triggering(Repeatedly.forever(
        AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(5))))
    .accumulatingFiredPanes());

PCollection<ExternalRecord> windowedRecs = . . .

PCollection<Person> people = PCollectionTuple
    .of(LinkRecordsTransform.LINKS, windowedLinks)
    .and(LinkRecordsTransform.RECORDS, windowedRecs)
    .apply(new LinkRecordsTransform());

Page 78: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Untangling Concerns

what data have I received?
when was that data created?
when should I process that data?
how should I group data to process?
what to do with late-arriving data?
should I emit preliminary results?
how should I amend those results?

Page 79: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Simple Made Easy - Rich Hickey, Strange Loop 2011

Modular

Composable

Easier to reason about

Page 80: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

But some caveats:

Runners at varying levels of maturity

Retraction not yet implemented (see BEAM-91)

APIs may change

Page 81: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Spark offers a rich ecosystem:

MLlib

DataFrames

REPL

Spark SQL

Genome Analysis Toolkit

Page 82: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Two classes of problems:

Large, complex processing pipelines

Exploration and transformation of data

Page 83: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Actual depiction of healthcare data

Page 84: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

[Chart: understanding vs. time, moving from Orientation through Pattern Discovery to Prescriptive Frameworks, comparing Scalable Processing with Web Development.]

Page 85: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

[Same chart: understanding vs. time, with Scalable Processing lagging Web Development on the path toward prescriptive frameworks.]

Page 86: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Focus on the essence rather than the accidents.

Page 87: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016

Questions?

@ryanbrush