Engineering patterns for implementing data science models on big data platforms

Post on 16-Apr-2017


Engineering Patterns for Implementing Data Science Models on Big Data Platforms

Hisham Arafat, Digital Transformation Lead Consultant, Solutions Architect, Technology Strategist & Researcher

Riyadh, KSA – 31 January 2017

http://www.visualcapitalist.com/what-happens-internet-minute-2016/

Big Data…Practical Definition!

• Big Data is the challenge not the solution

• Big Data technologies address that challenge

• Practically:
  • Massive streams
  • Unstructured
  • Complex processing

Let’s Have a Use Case…Social Marketing

Social Marketing…Looks Simple!

Ingest Social Feeds → Build Corpus Metrics → Design Text Mining Model → Deploy All to a Big Data Platform → Application for Marketing Users

What are people saying about our new brand “LemaTea”?


It’s NOT as Easy as It Looks!

Not Only Building an Appropriate Model, but Also Designing a Solution…

Engineering Factors

• Interfacing with sources: REST APIs, source HTML,… (text is assumed)
• Parsing to extract: queries, regular expressions,…
• Crawling frequency: every 1 minute, 1 hour, on event,…
• Document structure: post, post + comments, #, Reach, Retweets,…
• Metadata: time, date, source, tags, authoritativeness,…
• Transformations: canonicalization, weights, tokenization,…

- Size: average of 2 KB / doc
- Initial load: 1.5B docs
- Frequency: every 5 minutes
- Throughput: 2 KB * 60,000 docs = 120 MB / load
- Grows per day ~ 34 GB
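The throughput figures above can be sanity-checked with a few lines of Python. The constants come from the slide; decimal units (1 MB = 1,000 KB) are assumed, since that is what makes the 120 MB / load figure come out:

```python
# Back-of-envelope ingestion sizing using the slide's figures.
DOC_SIZE_KB = 2                 # average document size
DOCS_PER_LOAD = 60_000          # documents ingested per load
LOADS_PER_DAY = 24 * 60 // 5    # a load every 5 minutes -> 288 loads/day

load_mb = DOC_SIZE_KB * DOCS_PER_LOAD / 1000          # 120 MB per load
growth_gb_per_day = load_mb * LOADS_PER_DAY / 1000    # ~34.6 GB per day

print(f"{load_mb:.0f} MB per load, ~{growth_gb_per_day:.1f} GB/day")
```

Quantifying the pipeline this early is what lets you size storage and network before choosing components.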

Engineering Factors

• Input format: text, encoded text,…
• Document representation: bag of words, ontology,…
• Corpus structures: indexes, reverse indexes,…
• Corpus metrics: doc frequency, inverse doc frequency,…
• Preprocessing: annotation, tagging,…
• Files structure: tables, text files, files/day,…

- No of docs: 1.5B + 17M / day
- Processing window: 60K per 3 mins
- Processing rate: 20K docs per min
- Final doc size = 2 KB * 5 ~ 10 KB
- Scan rate: 20K * 10 KB / min ~ 200 MB/min
- Many overheads need to be added
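As a minimal illustration of the corpus structures and metrics listed above, here is a toy reverse (inverted) index and document-frequency table in plain Python; the three-document corpus and the trivial whitespace tokenization are invented for the example:

```python
from collections import defaultdict

# Toy corpus: doc id -> text (tokenization kept trivial on purpose).
corpus = {
    1: "lematea sweet taste",
    2: "new tea brand lematea",
    3: "sweet green tea",
}

# Reverse (inverted) index: term -> set of doc ids containing it.
index = defaultdict(set)
for doc_id, text in corpus.items():
    for term in text.split():
        index[term].add(doc_id)

# Document frequency: number of docs each term appears in.
df = {term: len(docs) for term, docs in index.items()}

print(sorted(index["lematea"]))  # → [1, 2]
print(df["sweet"])               # → 2
```

At 1.5B documents the same structures exist, but they must be partitioned and built incrementally, which is exactly why the processing-rate and scan-rate numbers above matter.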

Engineering Factors

• Dimensionality reduction: stemming, lemmatization, noisy words,…
• Type of applications: search/retrieval, sentiment analysis,…
• Modeling methods: classifiers, topic modeling, relevance,…
• Model efficiency: confusion matrices, precision, recall,…
• Overheads: intermediate processing, pre-aggregation,…
• Files structure: tables, text files, files/day,…

- No of docs: 1.5B + 17M / day
- Search for “LemaTea sweet taste”
- No of tf to calculate ~ 1.5B * 3 ~ 4.5B
- No of idf to calculate ~ 1.5B
- Total calculations for 1 search ~ 6B
- Consider daily growth
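The operation count above generalizes to a simple formula. This sketch assumes, as the slide's arithmetic does, one tf evaluation per (doc, term) pair plus one idf pass over the corpus for a naive search:

```python
def search_cost(n_docs, query_terms):
    """Rough operation count for one naive tf-idf search:
    one tf per (doc, term) pair plus one idf per document."""
    tf_ops = n_docs * query_terms
    idf_ops = n_docs
    return tf_ops + idf_ops

N = 1_500_000_000                      # 1.5B documents
print(search_cost(N, 3))               # "LemaTea sweet taste" -> ~6B ops
```

The same formula reproduces the 9B figure for a 5-term query quoted later, which is why precomputation and indexing, not raw scanning, have to carry this workload.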

Engineering Factors

• Files structure: tables, text files, files/day,…
• File formats: HDFS, Parquet, Avro,…
• Platform technology: Hadoop/YARN, Spark, Greenplum, Flink,…
• Model deployment: Java/Scala, Mahout, MLlib, MADlib, PL/R, FlinkML,…
• Data ingestion: Spring XD, Flume, Sqoop, G. Data Flow, Kafka/Streaming,…
• Ingestion pattern: real-time, micro-batches,…

- Overall storage
- Processing capacity per node
- No of nodes
- Tables: Hive, HBase, Greenplum
- Individual files: Spark, Flink
- Files/day: Hadoop HDFS

Engineering Factors

• Workload: no of requests, request size,…
• Application performance: response time, concurrent requests,…
• Application interfacing: REST APIs, native, messaging,…
• Application implementation: integration, model scoring,…
• Security model: application level, platform level,…

- For 3 search terms ~ 6B calculations
- For 5 search terms ~ 9B calculations
- For 10 concurrent requests ~ 75B calculations
- Resource queuing / prioritization
- Search options like date range
- Access control model

Engineering Factors

Ongoing Process…Growing Requirements

What if?
• New sources are included
• Wider parsing criteria
• Advanced modeling: POS, word co-occurrence, co-referencing, named entities, relationship extraction,…
• Better response time is needed
• More frequent ingestion

[Diagram: dynamic platform spanning ingestion, corpus processing, model processing and requests processing]

• Larger number of docs
• Increased processing requirements
• Platform expansion
• Overall architecture reconsidered

Some Building Blocks

What is a Data Science Model?
• Type & format of input data
• Data ingestion
• Transformations and feature engineering
• Modeling methods and algorithms
• Model evaluation and scoring
• Application implementation considerations
• In-Memory vs. In-Database

Key Challenges for Data Science Models

Volume → Growth: scale-out performance
Stationary → Streams: data flow engines
Batches → Real-time: event processing
Structured → Unstructured: complex formats
Insights → Responsive: perspective / deep models

Traditional Data Management Systems
• Shared I/O
• Shared Processing
• Limited Scalability
• Service Bottlenecks
• High Cost Factor

[Diagram: database cluster with shared buffers and a single database service, contending over a shared network and I/O for common data files]

Abstraction of Big Data Platforms
• Parallel Processing
• Shared Nothing
• Linear Scalability
• Distributed Services
• Lower Cost Factor

[Diagram: master nodes holding metadata (with a standby) and data nodes 1..n, each with its own I/O, joined by a network interconnect; user data and replicas live on the data nodes and are accessed directly]

In a Nutshell

Source: http://dataconomy.com/2014/06/understanding-big-data-ecosystem/

• Very huge.
• Overlaps.
• Overloading.
• You need to start with a use case to be able to get your solutions well engineered.

Engineered Systems
• Packaged: Hortonworks – Pivotal – Cloudera
• Appliances: EMC DCA – Dell DSSD – Dell VxRack
• Cloud offerings: Azure – AWS – IBM – Google Cloud

Engineering Patterns in Implementation

Lambda Architecture…Social Marketing
• Generic, scalable and fault-tolerant data processing architecture.
• Keeps a master immutable dataset while serving low-latency requests.
• Aims at providing linear scalability.

Source: http://lambda-architecture.net/
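As a minimal sketch of the idea (plain Python, with invented mention-count data): the batch layer recomputes a view over the immutable master dataset, the speed layer covers only the events that arrived since the last batch run, and the serving layer merges the two at query time:

```python
from collections import Counter

master_dataset = []            # append-only log of (brand, mention) events

def batch_view(events):
    """Batch layer: recompute mention counts from the full master dataset."""
    return Counter(brand for brand, _ in events)

def query(batch, realtime):
    """Serving layer: merge the precomputed batch view with the speed layer."""
    return batch + realtime

# A batch run covers everything seen so far...
master_dataset += [("LemaTea", "sweet"), ("LemaTea", "tasty")]
batch = batch_view(master_dataset)

# ...while new events hit only the speed layer until the next batch run.
speed = Counter()
speed["LemaTea"] += 1

print(query(batch, speed)["LemaTea"])  # → 3
```

The batch view stays cheap to serve because it is precomputed, and the immutable master dataset means a buggy model can always be recomputed from scratch.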

Social Marketing…Revisited

Ingest Social Feeds → Build Corpus Metrics → Design Text Mining Model → Deploy All to a Big Data Platform → Application for Marketing Users

What are people saying about our new brand “LemaTea”?

Lambda Architecture (cont.)

Source: https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark



Apache Spark / MLlib
• In-memory distributed processing
• Scala, Python, Java and R
• Resilient Distributed Dataset (RDD)
• MLlib machine learning algorithms
• SQL and DataFrames / Pipelines
• Streaming
• Big graph analytics

[Diagram: Spark cluster on Mesos or HDFS/YARN]
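The RDD flow (flatMap over partitioned data, map to pairs, then reduceByKey) can be mimicked in plain Python to show the shape of the computation; a real job would use pyspark's `SparkContext.parallelize`, which is not assumed here, and the two-partition dataset is invented:

```python
from collections import Counter
from functools import reduce

# In-memory stand-in for an RDD: a list of partitions, each a list of lines.
partitions = [
    ["lematea is sweet", "lematea again"],
    ["sweet green tea"],
]

def count_partition(lines):
    # flatMap(split) then map(word -> (word, 1)) then a local reduceByKey,
    # all collapsed into one Counter per partition.
    return Counter(word for line in lines for word in line.split())

# Each partition is counted independently, then the partials are merged --
# mirroring Spark's per-partition combine followed by a shuffle/reduce.
counts = reduce(lambda a, b: a + b, map(count_partition, partitions))

print(counts["lematea"], counts["sweet"])  # → 2 2
```

Because each partition is processed independently before the merge, the same code parallelizes linearly across executors, which is the property the slide's "linear scalability" claims rest on.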

Apache Spark
• Supports different types of cluster managers
• HDFS / YARN, Mesos, Amazon S3, standalone, HBase, Cassandra,…
• Interactive vs application mode
• Memory optimization

Source: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-architecture.html

Apache Spark

Apache Spark MLlib

Apache Spark…The Big Picture

Source: https://www.datanami.com/2015/11/30/spark-streaming-what-is-it-and-whos-using-it/

Greenplum / MADlib
• Massively parallel processing
• Shared nothing
• Table distribution: by key, by round robin
• Massively parallel data loading
• Integration with Hadoop
• Native MapReduce
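A toy sketch of the two table-distribution policies listed above (plain Python, not Greenplum DDL; `NUM_SEGMENTS`, the helper names, and the sample rows are all illustrative):

```python
NUM_SEGMENTS = 4

def by_key(rows, key):
    """Hash distribution: the same key always lands on the same segment,
    so joins and aggregations on that key need no data movement."""
    segments = [[] for _ in range(NUM_SEGMENTS)]
    for row in rows:
        segments[hash(row[key]) % NUM_SEGMENTS].append(row)
    return segments

def round_robin(rows):
    """Round-robin distribution: an even spread regardless of values,
    useful when no natural distribution key exists."""
    segments = [[] for _ in range(NUM_SEGMENTS)]
    for i, row in enumerate(rows):
        segments[i % NUM_SEGMENTS].append(row)
    return segments

rows = [{"brand": "LemaTea", "n": i} for i in range(8)]
print([len(s) for s in round_robin(rows)])  # → [2, 2, 2, 2]
```

The example also shows the trade-off: with a low-cardinality key (here a single brand), hash distribution piles every row onto one segment, while round robin keeps the load even at the cost of colocated joins.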

Apache MADlib

Image Processing…Unusual Way
Massively Parallel, In-Database Image Processing

Source: https://content.pivotal.io/blog/data-science-how-to-massively-parallel-in-database-image-processing-part-1


Take Aways
• A data science model is not just the algorithms; it is an end-to-end solution.
• The implementation should consider engineering factors and quantify them so appropriate components can be selected.
• The Big Data technology landscape is really huge and growing; start with a solid use case to identify potential components.
• Abstracting away from specific technologies lets you weigh their pros and cons.
• Apply creativity in solution design and technology selection, case by case.
• Covered: Lambda Architecture, Spark, Spark MLlib, Spark Streaming, Spark SQL, Kafka, Hadoop/YARN, Greenplum, MADlib.

Q & A

Email: hiarafat@hotmail.com
Skype: hichawy
LinkedIn: https://eg.linkedin.com/in/hisham-arafat-a7a69230

Thank You