Hms nyc* talk
-
Upload
brian-oneill -
Category
Documents
-
view
621 -
download
1
Transcript of Hms nyc* talk
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
The Big Data Quadfecta
Brian O’NeillLead Architect, Health Market Science@boneill42, [email protected]
Taylor GoetzDevelopment Lead, Health Market Science@ptgoetz, [email protected]
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
Quadfecta?
1. Quadfecta• A legendary beirut/beer pong shot that lands on
the tops of four cups simultaneously. Considered the rarest shot in the game, topping even the trifecta, 2-cup knockover-and-sink, and simultaneous 6-cup game-ending double bounce-in.
• Kafka• Storm• Elastic Search• Cassandra
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
3 V’s
Volume Variety Velocity
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
The Use Case
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
Our Mission
Prescriber eligibility and remediation
Eliminate fraud, waste and abuse
Insights into the healthcare space
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
The BusinessBusiness Solutions
Health Care Provider & Facilities
Variety/Velocity• >l2000 of sources
• 6 Million unique HCPs
• 10+ years history
Data Challenges• Constant change in
real world data
• Conflicting & partial info
• Frequent changes to source structure
• Authoritative sources vs. crowdsource
• Predicting source quality
Master Data SolutionsMedical Procedures &
Diagnosis
Volume/Velocity• ~1B claims annually
• +5B records annually
• 5+ years history
Data Challenges• Sources have
incomplete capture
• Overlapping source data
• Statistical projections & biases
• Social media type relationships
Medical Claims Data
CompleteView, Expense Manager,
CompleteSpend
Prescriber Eligibility/Remdi
ation
Analtyics (Influencer Networks)
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
Our SolutionsBusiness
Needs
Finance & LegalBusiness SystemsCompliance Sales & Marketing
SolutionsProvider Data ComplianceData Assessment, Integration &
Enrichment Services
01010011
Market Intelligence
HMSAuthoritative
SourcesPDC Federal StateMedical Claims Web Derived
AdvancedTechnology
Master Data Management
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
Datacenter
¾ Petabytes of raw storage
Virtualized (VMware)
On a SAN
Should we go physical???
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
Under the Hood
Visualization
Dashboard / Reports
Structured Storage
RelationalIndexing
Flexible Storage
NoSQL Graph(s)
Interfacing
Web Services
Distributed Processing
Standardize
Validate
MatchConsolidat
e
Analytics
Data Sources
Government
Web
Customer
I’m happy
User Interface
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
Master Data Management
Harvested
Government
PrivateSchema Change!
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
The Design
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
System of Record
Flexibility (Variety)Scalability (Velocity + Volume)
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
Primitives of Distributed Processing
emit/proce
ss(tuple(…
))
map<key<map<[], value>>
pop(push(v))
index(field, type)
Kafka
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
Design Principles
PatternsIdempotent Operations
Elegantly handle replay
Immutable dataAssertions of facts over time
Anti-PatternsTransactions / Locking
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
State / Counting
Exactly-once semantics for state
Create small batchesOrder batchesBatch 1
Batch 2
Batch 3
Batch 3’
4
13
13
6
Batch Total
1 4
3 4 (wait)
2 10 (+6)
3 23 (+13)
3’ 23 (+0)
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
What we did wrong…
Could not react to transactional changes
Needed extra logic to track what changed
Took too long
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
What we did wrong… (II)
AOP-based triggersWorked well initially.Business Processes captured as side-effects.
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
What we did right.
REST APIs for Loose Coupling
See Virgil:https://github.com/hmsonline/virgil
But really… watch out for Intraverthttps://github.com/zznate/intravert-ug
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
Kafka• Millions of Messages• Replay Enabled• No transactions / Lightning Fast
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
Elastic Search• Edit Distance / Soundex• Native Scalability• Fuzzy Search• Geospatial• Facets
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
Storm• Guaranteed once semantics• Well-designed processing
abstraction• Beats BYODP• Momentum
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
The System
KafkaQueue(s)
Offset
C*
A
BC
C* ES1Kafka
ElasticSearch
ES2C*
REST API
NP. We can route around
it.
NP. Replication Factor > 1.
NP. Rewind!
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
What comes after Quadfecta?
?
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
Real-Time Integration
Real-time CRUD via Web Services
DRPC“Real-time” Queue
Not quite sure?
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
The Storm/C* Bridge
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
Anatomy of a Storm Cluster
NimbusMaster Node
ZookeeperCluster Coordination
SupervisorsWorker Nodes
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
Storm Primatives
StreamsUnbounded sequence of tuples
SpoutsStream Sources
BoltsUnit of Computation
TopologiesCombination of n Spouts and n BoltsDefines the overall “Computation”
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
Storm Spouts
Represents a source (stream) of dataQueues (JMS, Kafka, Kestrel, etc.)Twitter FirehoseSensor Data
Emits “Tuples” (Events) based on source
Primary Storm data structureSet of Key-Value pairs
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
Storm Bolts
Receive Tuples from Spouts or other Bolts
Operate on, or React to DataFunctions/Filters/Joins/AggregationsDatabase writes/lookups
Optionally emit additional Tuples
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
Storm Topologies
Data flow between spouts and bolts
Routing of Tuples between spouts/bolts
Stream “Groupings”
Parallelism of Components
Long-Lived
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
Storm Topologies
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
Storm and Cassandra
Use Cases:Write Storm Tuple data to C*
Computation ResultsPre-compute indices
Read data from C* and emit Storm Tuples
Dynamic Lookupshttp://github.com/hmsonline/storm-cassandra
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
Storm Cassandra Bolt Types
CassandraBolt
CassandraLookupBolt
CassandraBoltWrites data to CassandraAvailable in Batching and Non-Batching
CassandraLookupBoltReads data from Cassandra
http://github.com/hmsonline/storm-cassandra
C*STORM
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
Storm-Cassandra Project
Provides generic Bolts for writing/reading Storm Tuples to/from C*
TupleTuple
Mapper Rows
C*Tuples
ColumnsMapper Columns
STORM
http://github.com/hmsonline/storm-cassandra
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
Storm-Cassandra Project
TupleMapper InterfaceTells the CassandraBolt how to write a tuple to an arbitrary data model
Given a Storm Tuple:Map to Column FamilyMap to Row KeyMap to Columns
http://github.com/hmsonline/storm-cassandra
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
Storm-Cassandra Project
ColumnsMapper InterfaceTells the CassandraLookupBolt how to transform a C* row into a Storm Tuple
Given a C* Row Key and list of Columns:
Return a list of Storm Tuples
http://github.com/hmsonline/storm-cassandra
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
Storm-Cassandra Project
Current State:Version 0.4.0Uses Astyanax ClientSeveral out-of-the-box *Mapper Implementations:
Basic Key-Value ColumnsValue-less ColumnsCounter ColumnsLookup by row keyLookup by range query
Composite Key/Column SupportTrident support
http://github.com/hmsonline/storm-cassandra
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
Storm-Cassandra Project
Future Plans:Switch to CQLEnhanced Trident Support
http://github.com/hmsonline/storm-cassandra
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
Persistent Word Count
http://github.com/hmsonline/storm-cassandra
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
DRPC
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
“Reach” Computation
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
MDM Topology*
*Notional
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
Load Topology
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
Shameless Shoutouts
HMS (https://github.com/hmsonline/)storm-cassandrastorm-elastic-searchstorm-jdbi (coming soon)
ptgoetz (https://github.com/ptgoetz) storm-jmsstorm-signals
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
Next Level : Trident
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
Trident
Provides a higher-level abstraction for stream processing
Constructs for state management and Batching
Adds additional primitives that abstract away common topological patterns
Deprecates transactional topologies
Distributes with Storm
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
Sample Trident Operations
Partition LocalFunctions ( execute(x) x + y )
Filters ( isKeep(x) 0,x )
PartitionAggregateCombiner ( pairwise combining )Reducer ( iterative accumulation )Aggregator ( byoa )
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
A sample topology
TridentTopology topology = new TridentTopology();
TridentState wordCounts =
topology.newStream("spout1", spout)
.each(new Fields("sentence"),
new Split(),
new Fields("word"))
.groupBy(new Fields("word"))
.persistentAggregate(
MemcachedState.opaque(serverLocations),
new Count(),
new Fields("count"))
.parallelismHint(6);
https://github.com/nathanmarz/storm/wiki/Trident-state
1• 8
00.5
93.4
467
• in
fo@
heal
thm
arke
tsci
ence
.com
Trident State
Sequenced writes by batch/transaction id.
SpoutsTransactional
Batch contents never change
OpaqueBatch contents can change
StateTransactional
Store tx_id with counts to maintain sequencing of writes.
OpaqueStore previous value in order to overwrite the current value when contents of a batch change.