Post on 17-Jan-2016
Joe Caserta, President
Elliott Cordo, Chief Architect
September 30, 2015, Javits Center, New York City
Building a Data Lake for Digital Music Dominance
Build a Dynamic Platform – Paradigm Shift
OLD WAY:
• Structure → Ingest → Analyze
• Fixed Capacity
• Monolith
NEW WAY:
• Ingest → Analyze → Structure
• Dynamic Capacity
• Ecosystem
RECIPE:
• Cloud
• Data Lake
• Polyglot Warehouse
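The "ingest, analyze, then structure" ordering can be sketched as schema-on-read: raw records are kept untouched and a schema is applied only when a question is asked. This is a minimal illustration, not the presenters' implementation; the sample records, `PLAY_SCHEMA`, and all function names are hypothetical.

```python
import json

# Made-up sample data: raw events are stored exactly as they arrived.
RAW_LINES = [
    '{"track_id": "t1", "user": "u1", "ms_played": "183000"}',
    '{"track_id": "t2", "user": "u2"}',
    '{"track_id": "t1", "user": "u3", "ms_played": "42000", "device": "ios"}',
]

# Hypothetical schema an analyst defines after exploring the raw data:
# field name -> (cast function, default when the field is absent).
PLAY_SCHEMA = {"track_id": (str, ""), "ms_played": (int, 0)}


def read_raw(lines):
    """Parse raw lines at read time; no schema was enforced at ingest."""
    return [json.loads(line) for line in lines]


def apply_schema(record, schema):
    """Project a raw record onto a schema chosen at read time."""
    row = {}
    for field, (cast, default) in schema.items():
        value = record.get(field)
        row[field] = cast(value) if value is not None else default
    return row


def total_ms_by_track(lines, schema=PLAY_SCHEMA):
    """Analyze step: total play time per track under the late-bound schema."""
    totals = {}
    for record in read_raw(lines):
        row = apply_schema(record, schema)
        totals[row["track_id"]] = totals.get(row["track_id"], 0) + row["ms_played"]
    return totals
```

Fields the schema does not mention (such as `device`) stay in the raw data and can be picked up later by a different schema without re-ingesting anything.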
Move to the Cloud
Existing On-Premise Solution
• Challenges with operations of Hadoop servers in the data center
• Increasing infrastructure complexity
• Keeping up with data growth
Cloud Advantages
• Reduced upfront capital investment
• Faster speed to value
• Elasticity
“Those that go out and buy expensive infrastructure find that the problem scope and domain shift really quickly. By the time they get around to answering the original question, the business has moved on.” - Matt Wood, AWS
Essentially, Servers Suck
But more importantly, think infrastructure as code
• Your servers should be API calls
• Use stateless processes
• Make all resources ephemeral
• Make everything scalable and elastic!
Ephemeral?
Disposable:
• Processing Fleets
• Elastic MapReduce Clusters
• Redshift Clusters
Use distributed services and systems to maintain state and preserve your data:
• Cassandra, Dynamo
• S3
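A minimal sketch of the stateless-process idea, assuming progress is checkpointed to an external store after every record. `StateStore` is a hypothetical in-memory stand-in for a durable service such as S3, Cassandra, or Dynamo, and `process_batch` and `max_records` are illustrative names, not part of any real library.

```python
class StateStore:
    """In-memory stand-in (an assumption) for a durable, distributed store."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value


def process_batch(store, job_id, records, max_records=None):
    """Sum records, keeping all progress in the store and none in the process.

    max_records simulates a worker that is terminated partway through.
    """
    offset = store.get(f"{job_id}/offset", 0)
    total = store.get(f"{job_id}/total", 0)
    end = len(records) if max_records is None else min(len(records), offset + max_records)
    for record in records[offset:end]:
        total += record
        offset += 1
        # Checkpoint after every record so a replacement worker can resume.
        store.put(f"{job_id}/total", total)
        store.put(f"{job_id}/offset", offset)
    return total
```

Because the worker holds no state of its own, a first worker can stop after two records and a brand-new worker can finish the job from the stored offset; re-running a finished job changes nothing.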
Elastic MapReduce (EMR)
Hadoop on Demand
• No operations: if your cluster dies, so what
• Bootstrap whatever processing engine makes sense
• Programmatically estimate instance type and cluster size
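One way to estimate instance type and cluster size programmatically is a simple heuristic on input volume. The sketch below is an assumption throughout: the memory factor, the 150 GiB threshold, the node limits, and the release label are placeholders, not the presenters' rules. It only builds the request dictionary; nothing is launched.

```python
import math

# Illustrative sizing table: memory (GiB) per node for two instance types.
INSTANCE_MEMORY_GIB = {"m3.xlarge": 15, "r3.2xlarge": 61}


def estimate_cluster(input_gib, memory_factor=0.25, max_nodes=50):
    """Pick an instance type and core-node count from the input size.

    Heuristic (an assumption): aim to hold memory_factor of the input in memory.
    """
    needed_gib = input_gib * memory_factor
    instance_type = "r3.2xlarge" if needed_gib > 150 else "m3.xlarge"
    nodes = math.ceil(needed_gib / INSTANCE_MEMORY_GIB[instance_type])
    return instance_type, max(2, min(nodes, max_nodes))


def build_emr_request(name, input_gib):
    """Build keyword arguments for an ephemeral EMR cluster request."""
    instance_type, nodes = estimate_cluster(input_gib)
    return {
        "Name": name,
        "ReleaseLabel": "emr-4.0.0",  # placeholder release
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": nodes + 1,  # core nodes plus one master
            # Ephemeral: the cluster terminates once its steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

# With AWS credentials configured, the request could be submitted with
# boto3.client("emr").run_job_flow(**build_emr_request("nightly", 2000)).
```

Keeping the sizing logic in a pure function makes it easy to test and tune separately from the API call that actually creates the cluster.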
You May Need Some Persistent Servers
If at all possible they should be inherently scalable, distributed, and elastic
Move to a Data Lake Paradigm
Technology:
• Scalable distributed storage: S3
• Pluggable fit-for-purpose processing: EMR
Functional Capabilities:
• Remove barriers from data ingestion and analysis
• Storage and processing for all data
• Tunable Governance
Putting it together: The Big Data Pyramid
[Figure: the Big Data Pyramid. Tiers as labeled on the slide: Big Data Warehouse; Data Science Workspace; Data Lake – Integrated Sandbox; Landing Area – Source Data in “Full Fidelity”. Column headings: Usage Pattern; Data Governance. Other labels: Ingest Raw Data; Organize, Define, Complete; Munging, Blending, Machine Learning; Fully Governed (trusted); Arbitrary/Ad-hoc Queries and Reporting; Metadata, ILM, Security; Data Catalog; Data Integration; Data Quality and Monitoring.]
Data Ingestion and Onboarding
• Incoming to S3:
– Lightweight API wrapper
– Web front end
– Direct writes to S3
• Ingest the data in a reasonable partitioning schema: buckets and keys
• Turn analysts and data scientists loose: late-binding analytics
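A partitioning schema for keys might look like the sketch below, which places each incoming file under a source, dataset, and date prefix. The layout and the names (`build_object_key`, `day_prefix`, `provider_a`, `plays`) are hypothetical; the slides do not specify the actual key structure.

```python
from datetime import datetime


def day_prefix(source, dataset, event_time):
    """Key prefix covering one day of one dataset from one source."""
    return (
        f"{source}/{dataset}/"
        f"year={event_time:%Y}/month={event_time:%m}/day={event_time:%d}/"
    )


def build_object_key(source, dataset, event_time, filename):
    """Full object key: the day prefix followed by the file name."""
    return day_prefix(source, dataset, event_time) + filename

# With boto3, the key could be used as
# s3.put_object(Bucket="<your-bucket>", Key=key, Body=data).
```

Date-based prefixes let a job list or process a single day by prefix instead of scanning the whole bucket.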
But we need to feed the cash register
• Data needs to be refined and mapped:
– Processing Fleet
– EMR
• 80/20 rule: metadata driven when possible
• Abstract away “Big Data”
• And make sure it’s right!
– Automated data quality checks using HAMBOT, soon to be open sourced
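A metadata-driven mapping with an automated quality check could be sketched as follows. This is not HAMBOT, whose interface the slides do not show; the `MAPPING` table, the field names, and the threshold are all hypothetical.

```python
# Hypothetical mapping metadata: target column -> (source field, cast, required).
MAPPING = {
    "track_id": ("trackId", str, True),
    "plays": ("playCount", int, True),
    "territory": ("country", str, False),
}


def map_record(raw, mapping=MAPPING):
    """Refine one raw record into target columns using only the metadata."""
    row = {}
    for target, (source, cast, _required) in mapping.items():
        value = raw.get(source)
        row[target] = cast(value) if value is not None else None
    return row


def quality_report(rows, mapping=MAPPING):
    """Count rows and missing values for each required column."""
    missing = {t: 0 for t, (_s, _c, required) in mapping.items() if required}
    for row in rows:
        for target in missing:
            if row[target] is None:
                missing[target] += 1
    return {"row_count": len(rows), "missing_required": missing}


def passes(report, max_missing_ratio=0.0):
    """Fail empty loads and loads with too many missing required values."""
    if report["row_count"] == 0:
        return False
    worst = max(report["missing_required"].values(), default=0)
    return worst / report["row_count"] <= max_missing_ratio
```

Adding a new feed then means adding a mapping entry rather than writing new code, which is the point of the 80/20 rule on the slide.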
“…any decent sized enterprise will have a variety of different data technologies for different kinds of data. There will still be large amounts of it managed in relational stores, but increasingly we'll be first asking how we want to manipulate the data and only then figuring out what technology is the best bet for it.” - Martin Fowler
Think Data Ecosystem, Not Tech Stack
Polyglot in Practice
Best practices from traditional EDW
• Consolidation
• Data Governance
• Master Data
• Tuned for analytics
Applied to:
• Fit-for-purpose technologies and approaches
• Relational, MPP, Graph, KV, TimeseriesDB, Data Lake
• Apply “tunable governance” and traditional principles
Use the right tool for the job
The Landscape for Digital Dominance
[Figure: architecture diagram. Labels: Data Providers; API; Landing Queue; Near Real-time; Batch; Data Lake; BDW; Data Science; Data Science Clusters; EDW; Graph; RDS Metastore.]
Joe Caserta, President, Caserta Concepts
joe@casertaconcepts.com @joe_Caserta
Elliott Cordo, Chief Architect, Caserta Concepts
elliott@casertaconcepts.com
• Award-winning company
• Transformative Data Strategies
• Modern Data Engineering
• Advanced Architecture
• Innovation Partner
• Strategic Consulting
• Advanced Technical Design
• Build & Deploy Solutions
• BDW Meetup
• New York City
• 3,000+ members
• Knowledge sharing
Data is not important; it’s what you do with it that’s important!
Thank You