Engineering patterns for implementing data science models on big data platforms

Engineering Patterns for Implementing Data Science Models on Big Data Platforms. Hisham Arafat, Digital Transformation Lead Consultant, Solutions Architect, Technology Strategist & Researcher. Riyadh, KSA, 31 January 2017.


Page 1: Engineering patterns for implementing data science models on big data platforms

Engineering Patterns for Implementing Data Science Models on Big Data Platforms

Hisham Arafat
Digital Transformation Lead Consultant, Solutions Architect, Technology Strategist & Researcher

Riyadh, KSA – 31 January 2017

Page 2: Engineering patterns for implementing data science models on big data platforms

http://www.visualcapitalist.com/what-happens-internet-minute-2016/

Big Data…Practical Definition!

• Big Data is the challenge not the solution

• Big Data technologies address that challenge

• Practically:
  • Massive Streams
  • Unstructured
  • Complex Processing

Page 3: Engineering patterns for implementing data science models on big data platforms

Let’s Have a Use Case…Social Marketing

Page 4: Engineering patterns for implementing data science models on big data platforms

Social Marketing…Looks Simple!

Ingest Social Feeds

Build Corpus Metrics

Design Text Mining Model

Deploy All to a Big Data Platform

Application for Marketing Users

What are people saying about our new brand “LemaTea”?

Page 5: Engineering patterns for implementing data science models on big data platforms

Ingest Social Feeds

Build Corpus Metrics

Design Text Mining Model

Deploy All to a Big Data Platform

Application for Marketing Users

Page 6: Engineering patterns for implementing data science models on big data platforms

It’s NOT as Easy as It Looks!

Page 7: Engineering patterns for implementing data science models on big data platforms

Not Only Building an Appropriate Model, but Also Designing a Solution…

Engineering Factors

Page 8: Engineering patterns for implementing data science models on big data platforms

• Interfacing with sources: REST APIs, source HTML,… (text is assumed)
• Parsing to extract: queries, regular expressions,…
• Crawling frequency: every 1 minute, every 1 hour, on event,…
• Document structure: post, post + comments, #, reach, retweets,…
• Metadata: time, date, source, tags, authoritativeness,…
• Transformations: canonicalization, weights, tokenization,…
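The parsing and transformation factors can be made concrete with a small sketch. It assumes plain-text posts; the function names (`canonicalize`, `parse_post`) and the sample post are hypothetical, and a real pipeline would need source-specific parsing rules.

```python
import re
import unicodedata

# Hypothetical patterns: hashtags and simple word tokens.
HASHTAG_RE = re.compile(r"#(\w+)")
TOKEN_RE = re.compile(r"[a-z0-9']+")

def canonicalize(text: str) -> str:
    """Normalize Unicode, collapse whitespace, and lowercase the text."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip().lower()

def parse_post(raw: str) -> dict:
    """Extract the canonical text, hashtags, and tokens from one post."""
    canonical = canonicalize(raw)
    return {
        "text": canonical,
        "hashtags": HASHTAG_RE.findall(canonical),
        "tokens": TOKEN_RE.findall(canonical),
    }

post = parse_post("Loving the   new #LemaTea, such a SWEET taste!")
print(post["hashtags"])
print(post["tokens"])
```

Metadata such as time, source, and reach would be attached to the same record before it is written to the platform.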

- Size: average of 2 KB / doc
- Initial load: 1.5B docs
- Frequency: every 5 minutes
- Throughput: 2 KB * 60,000 docs = 120 MB / load
- Growth per day: ~34 GB
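The ingestion figures can be reproduced with a few lines of arithmetic. This sketch uses the slide's numbers and decimal units (1 MB = 1,000 KB); the variable names are illustrative.

```python
# Figures taken from the slide; decimal units (1 MB = 1,000 KB) are assumed.
DOC_SIZE_KB = 2
DOCS_PER_LOAD = 60_000
LOAD_INTERVAL_MIN = 5

loads_per_day = 24 * 60 // LOAD_INTERVAL_MIN        # number of loads in a day
mb_per_load = DOC_SIZE_KB * DOCS_PER_LOAD / 1000    # size of a single load
gb_per_day = mb_per_load * loads_per_day / 1000     # daily growth
docs_per_day = DOCS_PER_LOAD * loads_per_day        # documents added per day

print(loads_per_day, mb_per_load, round(gb_per_day, 2), docs_per_day)
```

The result, roughly 34.6 GB and about 17M documents per day, matches the growth figures used in the rest of the deck.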

Engineering Factors

Page 9: Engineering patterns for implementing data science models on big data platforms

• Input format: text, encoded text,…
• Document representation: bag of words, ontology,…
• Corpus structures: indexes, reverse indexes,…
• Corpus metrics: doc frequency, inverse doc frequency,…
• Preprocessing: annotation, tagging,…
• File structure: tables, text files, files-day,…
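As an illustration of corpus structures and metrics, the sketch below builds a reverse (inverted) index over a tiny made-up corpus and derives document frequency and inverse document frequency from it. The three documents and the unsmoothed log form of idf are assumptions chosen for brevity.

```python
import math
from collections import defaultdict

# Hypothetical three-document corpus, already canonicalized and tokenizable by spaces.
docs = {
    1: "lematea has a sweet taste",
    2: "sweet tea is not my taste",
    3: "trying lematea today",
}

# Reverse (inverted) index: term -> set of document ids containing it.
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        inverted_index[term].add(doc_id)

def doc_frequency(term: str) -> int:
    """Number of documents that contain the term."""
    return len(inverted_index.get(term, ()))

def inverse_doc_frequency(term: str) -> float:
    """Unsmoothed idf = log(N / df); 0.0 for terms not in the corpus."""
    df = doc_frequency(term)
    return math.log(len(docs) / df) if df else 0.0

print(doc_frequency("lematea"), round(inverse_doc_frequency("today"), 3))
```

At 1.5B documents the same structures no longer fit on a single machine, which is why the storage layout (tables, text files, files per day) becomes an engineering decision.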

- No. of docs: 1.5B + 17M / day
- Processing window: 60K docs per 3 mins
- Processing rate: 20K docs per min
- Final doc size: 2 KB * 5 ~ 10 KB
- Scan rate: 20K docs * 10 KB per min ~ 200 MB/min
- Many overheads need to be added
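The processing-rate figures follow from the window size and the document expansion factor. A minimal sketch of the arithmetic, using the slide's numbers and decimal units; the variable names are illustrative.

```python
# Figures from the slide; decimal units (1 MB = 1,000 KB) are assumed.
DOCS_PER_WINDOW = 60_000
WINDOW_MIN = 3
RAW_DOC_KB = 2
EXPANSION_FACTOR = 5   # processed document is about 5x the raw size

docs_per_min = DOCS_PER_WINDOW // WINDOW_MIN            # processing rate
final_doc_kb = RAW_DOC_KB * EXPANSION_FACTOR            # size after corpus processing
scan_mb_per_min = docs_per_min * final_doc_kb / 1000    # sustained scan rate

print(docs_per_min, final_doc_kb, scan_mb_per_min)
```

This is a lower bound; intermediate files, indexes, and replication add to the actual I/O.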

Engineering Factors

Page 10: Engineering patterns for implementing data science models on big data platforms

• Dimensionality reduction: stemming, lemmatization, noise words,…
• Type of applications: search/retrieval, sentiment analysis,…
• Modeling methods: classifiers, topic modeling, relevance,…
• Model efficiency: confusion matrix, precision, recall,…
• Overheads: intermediate processing, pre-aggregation,…
• File structure: tables, text files, files-day,…
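For the model-efficiency factor, precision and recall can be read directly off a confusion matrix. The sketch below assumes a binary classifier (for example, positive vs. non-positive sentiment); the counts are made up for illustration.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical confusion-matrix counts for a sentiment classifier.
p, r = precision_recall(tp=80, fp=20, fn=10)
print(round(p, 3), round(r, 3))
```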

- No. of docs: 1.5B + 17M / day
- Search for “LemaTea sweet taste”
- No. of tf values to calculate ~ 1.5B * 3 ~ 4.5B
- No. of idf values to calculate ~ 1.5B
- Total calculations for 1 search ~ 6B
- Consider daily growth
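To show where the calculation counts come from, here is a small tf-idf scoring sketch for the query “LemaTea sweet taste”. The four-document corpus is made up, and the length-normalized tf and unsmoothed idf are one common variant, not necessarily the one used in the talk. The counter confirms that a brute-force search evaluates one tf per document per query term.

```python
import math
from collections import Counter

# Hypothetical, already canonicalized corpus.
corpus = [
    "lematea has a sweet taste",
    "sweet tea is not my taste",
    "trying lematea today",
    "coffee in the morning",
]
query = "lematea sweet taste".split()
tokenized = [doc.split() for doc in corpus]
n_docs = len(tokenized)

def idf(term: str) -> float:
    """Unsmoothed inverse document frequency."""
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log(n_docs / df) if df else 0.0

idf_by_term = {term: idf(term) for term in query}

tf_evaluations = 0
scores = []
for tokens in tokenized:
    counts = Counter(tokens)
    score = 0.0
    for term in query:
        tf = counts[term] / len(tokens)   # length-normalized term frequency
        tf_evaluations += 1
        score += tf * idf_by_term[term]
    scores.append(score)

best = max(range(n_docs), key=scores.__getitem__)
print(best, tf_evaluations)
```

With 1.5B documents and three terms, the same loop performs about 4.5B tf evaluations, which is the figure on the slide.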

Engineering Factors

Page 11: Engineering patterns for implementing data science models on big data platforms

• File structure: tables, text files, files-day,…
• File formats: HDFS, Parquet, Avro,…
• Platform technology: Hadoop/YARN, Spark, Greenplum, Flink,…
• Model deployment: Java/Scala, Mahout, MLlib, MADlib, PL/R, FlinkML,…
• Data ingestion: Spring XD, Flume, Sqoop, Google Cloud Dataflow, Kafka/Streaming,…
• Ingestion pattern: real-time, micro batches,…

- Overall storage
- Processing capacity per node
- No. of nodes
- Tables: Hive, HBase, Greenplum
- Individual files: Spark, Flink
- Files-day: Hadoop HDFS
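A rough storage-driven node count can be sketched from the corpus figures used in this deck (1.5B documents, about 17M new documents per day, about 10 KB per processed document). The one-year horizon and 20 TB of usable storage per node are hypothetical inputs; a replication factor of 3 is the HDFS default. Processing capacity per node would need a separate estimate.

```python
# Corpus figures from the deck; horizon and node capacity are hypothetical.
INITIAL_DOCS = 1_500_000_000
DOCS_PER_DAY = 17_000_000
FINAL_DOC_KB = 10
HORIZON_DAYS = 365            # hypothetical planning horizon
REPLICATION_FACTOR = 3        # HDFS default replication
NODE_USABLE_TB = 20           # hypothetical usable storage per node
KB_PER_TB = 1_000_000_000     # decimal units

total_docs = INITIAL_DOCS + DOCS_PER_DAY * HORIZON_DAYS
stored_kb = total_docs * FINAL_DOC_KB * REPLICATION_FACTOR
stored_tb = stored_kb / KB_PER_TB

# Ceiling division using integers only.
nodes = -(-stored_kb // (NODE_USABLE_TB * KB_PER_TB))

print(round(stored_tb, 2), nodes)
```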

Engineering Factors

Page 12: Engineering patterns for implementing data science models on big data platforms

• Workload: no. of requests, request size,…
• Application performance: response time, concurrent requests,…
• Application interfacing: REST APIs, native, messaging,…
• Application implementation: integration, model scoring,…
• Security model: application level, platform level,…

- For 3 search terms ~ 6B calculations
- For 5 search terms ~ 9B calculations
- For 10 concurrent requests ~ 75B
- Resource queuing / prioritization
- Search options like date range
- Access control model
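The workload figures follow a simple rule: one tf value per document per search term, plus one idf pass over the corpus. The sketch below reproduces the 6B and 9B figures. The 75B figure is reproduced under the assumption that the ten concurrent requests average between three and five terms; that reading is a guess, not something the slide states.

```python
N_DOCS = 1_500_000_000

def calculations_per_search(n_terms: int, n_docs: int = N_DOCS) -> int:
    """Brute-force estimate: n_docs tf values per term plus n_docs idf values."""
    return n_docs * n_terms + n_docs

three_terms = calculations_per_search(3)
five_terms = calculations_per_search(5)

# Assumption: 10 concurrent requests, averaging the 3-term and 5-term cost.
concurrent_load = 10 * (three_terms + five_terms) // 2

print(three_terms, five_terms, concurrent_load)
```

Pre-aggregated corpus metrics and date-range filters reduce these numbers considerably, which is why they appear among the factors above.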

Engineering Factors

Page 13: Engineering patterns for implementing data science models on big data platforms

Ongoing Process…Growing Requirements

What if?
• New sources are included
• Wider parsing criteria
• Advanced modeling: POS, word co-occurrence, co-referencing, named entity, relationship extraction,…
• Better response time is needed
• More frequent ingestion

Dynamic Platform: Ingestion, Corpus Processing, Model Processing, Requests Processing

• Larger number of docs
• Increased processing requirements
• Platform expansion
• Overall architecture reconsidered

Page 14: Engineering patterns for implementing data science models on big data platforms

Some Building Blocks

Page 15: Engineering patterns for implementing data science models on big data platforms

What is a Data Science Model?
• Type & format of input data
• Data ingestion
• Transformations and feature engineering
• Modeling methods and algorithms
• Model evaluation and scoring
• Application implementation considerations
• In-Memory vs. In-Database

Page 16: Engineering patterns for implementing data science models on big data platforms

Key Challenges for Data Science Models

Volume → Growth → Scale-out Performance
Stationary → Streams → Data Flow Engines
Batches → Real-time → Event Processing
Structured → Unstructured → Complex Formats
Insights → Responsive → Perspective / Deep Models

Page 17: Engineering patterns for implementing data science models on big data platforms

Traditional Data Management Systems
• Shared I/O
• Shared Processing
• Limited Scalability
• Service Bottlenecks
• High Cost Factor

[Diagram: Database Cluster with Database Service, Shared Buffers, Data Files, I/O channels, and Network]

Page 18: Engineering patterns for implementing data science models on big data platforms

Abstraction of Big Data Platforms
• Parallel Processing
• Shared Nothing
• Linear Scalability
• Distributed Services
• Lower Cost Factor

[Diagram: Master Nodes (Metadata, Metadata Standby) and Data Nodes 1, 2, 3, … n (User data / Replicas, each with its own I/O) joined by a Network Interconnect, with direct access to user data]

Page 19: Engineering patterns for implementing data science models on big data platforms

In a Nutshell

Source: http://dataconomy.com/2014/06/understanding-big-data-ecosystem/

• Huge.
• Overlaps.
• Overloading.
• You need to start with a use case to get your solutions well engineered.

Page 20: Engineering patterns for implementing data science models on big data platforms

Engineered Systems
• Packaged: Hortonworks – Pivotal – Cloudera
• Appliances: EMC DCA – Dell DSSD – Dell VxRack
• Cloud offerings: Azure – AWS – IBM – Google Cloud

Page 21: Engineering patterns for implementing data science models on big data platforms

Engineering Patterns in Implementation

Page 22: Engineering patterns for implementing data science models on big data platforms

Lambda Architecture…Social Marketing
• Generic, scalable, and fault-tolerant data processing architecture.
• Keeps an immutable master dataset while serving low-latency requests.
• Aims at providing linear scalability.

Source: http://lambda-architecture.net/
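The three layers of the Lambda Architecture can be illustrated with a toy, single-process sketch: an append-only master dataset, a batch view recomputed from it, a speed view covering events since the last batch run, and a serving query that merges the two. The class and method names are hypothetical; a real deployment would map these onto components such as Kafka, Hadoop, and Spark.

```python
from collections import Counter

class LambdaPipeline:
    """Toy model of batch, speed, and serving layers for brand-mention counts."""

    def __init__(self):
        self.master = []              # append-only master dataset (never updated in place)
        self.batch_view = Counter()   # recomputed from the full master dataset
        self.speed_view = Counter()   # incremental counts since the last batch run

    def ingest(self, brand: str) -> None:
        """New event: append to the master dataset and update the speed view."""
        self.master.append(brand)
        self.speed_view[brand] += 1

    def run_batch(self) -> None:
        """Batch layer: recompute the view from scratch, then reset the speed view."""
        self.batch_view = Counter(self.master)
        self.speed_view.clear()

    def query(self, brand: str) -> int:
        """Serving layer: merge batch and real-time views."""
        return self.batch_view[brand] + self.speed_view[brand]

pipeline = LambdaPipeline()
for mention in ["lematea", "lematea", "otherbrand"]:
    pipeline.ingest(mention)
before_batch = pipeline.query("lematea")

pipeline.run_batch()
pipeline.ingest("lematea")
after_batch = pipeline.query("lematea")
print(before_batch, after_batch)
```

Because the batch view is always rebuilt from the immutable master dataset, an error in the speed layer is corrected by the next batch run.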

Page 23: Engineering patterns for implementing data science models on big data platforms

Social Marketing…Revisited

Ingest Social Feeds

Build Corpus Metrics

Design Text Mining Model

Deploy All to a Big Data Platform

Application for Marketing Users

What are people saying about our new brand “LemaTea”?

Page 24: Engineering patterns for implementing data science models on big data platforms

Lambda Architecture (cont.)

Source: https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark

Page 25: Engineering patterns for implementing data science models on big data platforms

Lambda Architecture (cont.)

Source: https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark

Page 26: Engineering patterns for implementing data science models on big data platforms

Lambda Architecture (cont.)

Source: https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark

Sequence Files

Page 27: Engineering patterns for implementing data science models on big data platforms

Apache Spark / MLlib
• In-memory distributed processing
• Scala, Python, Java and R
• Resilient Distributed Dataset (RDD)
• MLlib: machine learning algorithms
• SQL and DataFrames / Pipelines
• Streaming
• Big graph analytics

Spark Cluster Mesos HDFS/YARN

Page 28: Engineering patterns for implementing data science models on big data platforms

Apache Spark
• Supports different types of cluster managers
• HDFS / YARN, Mesos, Amazon S3, Standalone, HBase, Cassandra…
• Interactive vs. application mode
• Memory optimization

Source: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-architecture.html

Page 29: Engineering patterns for implementing data science models on big data platforms

Apache Spark

Page 30: Engineering patterns for implementing data science models on big data platforms

Apache Spark MLlib

Page 31: Engineering patterns for implementing data science models on big data platforms

Apache Spark…The Big Picture

Source: https://www.datanami.com/2015/11/30/spark-streaming-what-is-it-and-whos-using-it/

Page 32: Engineering patterns for implementing data science models on big data platforms

Greenplum / MADlib
• Massively Parallel Processing
• Shared Nothing
• Table distribution
  • By Key
  • By Round Robin
• Massively Parallel Data Loading
• Integration with Hadoop
• Native MapReduce
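The two table distribution policies can be illustrated with a small sketch that assigns rows to segments. This is a conceptual model only: the CRC32 hash is a stand-in, not Greenplum's actual hash function, and the function names and sample rows are hypothetical. In Greenplum itself the policy is declared on the table, with `DISTRIBUTED BY (column)` for key distribution and `DISTRIBUTED RANDOMLY` for the non-key alternative.

```python
import zlib
from collections import defaultdict

def distribute_by_key(rows, key, n_segments):
    """Hash distribution: rows with the same key value land on the same segment."""
    segments = defaultdict(list)
    for row in rows:
        seg = zlib.crc32(str(row[key]).encode("utf-8")) % n_segments
        segments[seg].append(row)
    return segments

def distribute_round_robin(rows, n_segments):
    """Round robin: rows are spread evenly regardless of their content."""
    segments = defaultdict(list)
    for i, row in enumerate(rows):
        segments[i % n_segments].append(row)
    return segments

sources = ["twitter", "facebook", "twitter", "blog", "facebook", "twitter"]
rows = [{"doc_id": i, "source": s} for i, s in enumerate(sources)]

by_key = distribute_by_key(rows, "source", n_segments=4)
round_robin = distribute_round_robin(rows, n_segments=4)
print({seg: len(r) for seg, r in sorted(round_robin.items())})
```

Key distribution keeps related rows together, which helps joins and aggregations on that key but can skew segment sizes; round robin balances storage at the cost of more data movement at query time.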

Page 33: Engineering patterns for implementing data science models on big data platforms

Apache MADlib

Page 34: Engineering patterns for implementing data science models on big data platforms

Image Processing…Unusual Way
Massively Parallel, In-Database Image Processing

Source: https://content.pivotal.io/blog/data-science-how-to-massively-parallel-in-database-image-processing-part-1

Page 35: Engineering patterns for implementing data science models on big data platforms

Image Processing…Unusual Way
Massively Parallel, In-Database Image Processing

Source: https://content.pivotal.io/blog/data-science-how-to-massively-parallel-in-database-image-processing-part-1

Page 36: Engineering patterns for implementing data science models on big data platforms

Image Processing…Unusual Way
Massively Parallel, In-Database Image Processing

Source: https://content.pivotal.io/blog/data-science-how-to-massively-parallel-in-database-image-processing-part-1

Page 37: Engineering patterns for implementing data science models on big data platforms

Takeaways
• Data science is not just the algorithms; it includes an end-to-end solution.
• The implementation should consider engineering factors and quantify them so that appropriate components can be selected.
• The Big Data technology landscape is huge and growing; start with a solid use case to identify potential components.
• Abstracting a specific technology helps you get a handle on its pros and cons.
• Be creative in solution design and technology selection, case by case.
• Lambda Architecture, Spark, Spark MLlib, Spark Streaming, Spark SQL, Kafka, Hadoop / YARN, Greenplum, MADlib.

Page 38: Engineering patterns for implementing data science models on big data platforms

Q & A

Page 39: Engineering patterns for implementing data science models on big data platforms

Email: [email protected]: hichawy
LinkedIn: https://eg.linkedin.com/in/hisham-arafat-a7a69230

Thank You