Hybrid Data Management, Integration & Analytics
This document is for informational purposes only and is subject to change at any time without notice. The information in this document is proprietary to Actian and no part of this document may be reproduced, copied, or transmitted in any form or for any purpose without the express prior written permission of Actian.
This document is not intended to be binding upon Actian to any particular course of business, pricing, product strategy, and/or development. Actian assumes no responsibility for errors or omissions in this document. Actian shall have no liability for damages of any kind including without limitation direct, special, indirect, or consequential damages that may result from the use of these materials. Actian does not warrant the accuracy or completeness of the information, text, graphics, links, or other items contained within this material. This document is provided without a warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability, fitness for a particular purpose, or non-infringement.
Disclaimer
Actian Hybrid Data Conference 2018 London
Actian Vector with DataFlow: Using Machine Learning Algorithms for Business Analytics
Vidisha Sharma
Technical Support Engineer
What will be covered?
• What is the impact of Artificial Intelligence/Machine Learning on analytic databases?
• How do Actian Vector and DataFlow support AI/ML workloads?
• Real-world use
© 2018 Actian Corporation
What is the impact of Artificial Intelligence/Machine Learning on analytic databases?
What is Machine Learning?
Machine Learning: A mathematically intensive system that learns a task from experience; its performance improves with more experience.
Traditional programming: A step-by-step procedure that uses a predefined algorithm to solve a specific problem at hand.
[Figure: Traditional Programming vs. AI/Machine Learning]
Machine Learning is everywhere
• Recommender systems (Amazon, Netflix)
• Facebook tagging
• Email spam filtering
• Insurance domain
• Banking domain, and so on
Why is Machine Learning getting popular?
• Increase in computational speed
• Very large volumes of data being generated
• High-dimensional data
• Faster improvement cycles compared to manual programming
What are the implications of AI/ML on the future of analytic databases?
Demographic transformation
✓ 90% of data was generated in the last 5 years.
✓ 1 terabyte of data 8 years ago corresponds to around 7 petabytes of data today.
✓ More and more data is generated by machines, for example Internet of Things devices.
✓ Querying such large amounts of data will need new strategies.
Performance requisites
✓ To process large data volumes, we need databases that offer parallelism, fast ingestion, and fast processing.
✓ Integrating data from different sources will become the main focus.
Change in client needs
✓ Change in data model as the data already exists
✓ Physical -> Logical -> Conceptual
✓ Building new capabilities
Applying AI/ML in use cases
✓ Machine-generated data creates new use cases
✓ Marketing for a business can be done effectively
✓ “What-if” questions can be asked and answered
Why Actian?
✓ Equipped for the paradigm shift
✓ Actian Vector analytic database
✓ DataFlow
How do Vector and DataFlow support AI/ML workloads?
What is Vector and what can it do?
Vector
• A columnar, relational database designed for reporting and analytics
• Delivers extremely high performance, even on a single node
• Easy to install and use
• Runs on 64-bit Linux and Windows
• Excellent concurrency and real-time update characteristics
VectorH
• Scales Vector from a single machine to a cluster, leveraging the HDFS distributed file system and the YARN resource manager
• The result is a fully capable Vector DBMS that takes advantage of clustered hardware for massive performance gains
DataFlow
• Single platform for end-to-end data access, transformation, preparation, and predictive analysis
• Combines the KNIME (open source data mining platform) drag and drop visual workflow environment and the Actian DataFlow platform
• Eliminates memory constraints, as well as the need for data movement into specific data stores before analytics are run
• Executes on a desktop, a remote server, or clusters, including Hadoop clusters
• Transforms, cleanses, and analyzes terabytes of data into actionable insights at record-breaking speed on commodity hardware
DataFlow Concepts
• Operators (nodes) linked together in a directed acyclic graph (DAG)
• Data flows along edges
• Shared nothing architecture
• Provides pipeline parallelism
• Supports data parallelism
• Data scalable
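The concepts above can be illustrated with a minimal Python sketch. It is not the DataFlow API: the operator names (read_rows, clean, to_upper) and the GRAPH dictionary are hypothetical. Operators are nodes in a DAG, the standard library's graphlib orders them, and generators let each row stream along the edges without the whole dataset being held in memory. The sketch is single-threaded, so it only shows the streaming idea behind pipelining; a real engine would run operators concurrently.

```python
from graphlib import TopologicalSorter

# Hypothetical operators: each takes an iterator of rows and yields rows.
def read_rows(_upstream):
    yield from [" alice ", "", " bob "]

def clean(upstream):
    for row in upstream:
        row = row.strip()
        if row:  # drop empty rows
            yield row

def to_upper(upstream):
    for row in upstream:
        yield row.upper()

OPERATORS = {"read": read_rows, "clean": clean, "upper": to_upper}

# Edges of the graph: node -> set of nodes it depends on.
GRAPH = {"read": set(), "clean": {"read"}, "upper": {"clean"}}

def run(graph, operators):
    """Wire operators together in dependency order and return the sink's output."""
    streams = {}
    order = list(TopologicalSorter(graph).static_order())  # raises on cycles
    for node in order:
        deps = graph[node]
        # This sketch supports linear chains only: at most one upstream per node.
        upstream = streams[next(iter(deps))] if deps else iter(())
        streams[node] = operators[node](upstream)
    return list(streams[order[-1]])

result = run(GRAPH, OPERATORS)
print(result)
```

Because the operators are generators, no row is processed until the sink is consumed, and each row passes through all three operators before the next one is read.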
Vector and DataFlow for AI/ML Workloads
▪ Fast parallel data ingestion
▪ Access to analytic routines
▪ Parallel query execution through Vector
▪ Support for multiple higher-level interfaces such as Spark, R, Scala, Python, and other advanced analytics tools
▪ Support for ANSI SQL
▪ Support for visualization/dashboard tools (such as Tableau, Looker, Qlik) based on that standard
▪ Quicker execution cycles for faster iteration
▪ Ability to build a workflow through the KNIME graphical user interface
▪ High speed due to the DataFlow executor
▪ DataFlow can run on Hadoop and non-Hadoop clusters
Integrating Vector, DataFlow and ML
Use case at a glance
Problem statement
Evaluate a driver’s driving behavior, which leads to differential pricing of insurance premiums. Dynamic assessment helps in the approval of claims.
Dataset and expectations
50 million records, 24 variables.
Predict the risk zone of a driver. Potential use of the risk zone profile in prescribing insurance premiums.
Methodology
Unsupervised learning (the k-means algorithm) was used to form homogeneous groups, separating the data into 3 clusters.
Decision trees were used to derive key patterns, which were applied to the clusters to name them:
- Cluster 1: Risk Zone
- Cluster 2: Potential Risk Zone
- Cluster 3: Safe Zone
After the data was labeled, models were trained and tested using various algorithms. Logistic regression gave the best accuracy.
The logistic regression model was used to predict driving behavior.
Implementation steps
Ingest
Read data from CSV and copy to Vector
✓ Read all 50 million rows into Vector using DataFlow
✓ Use the ‘Load Actian Vector on Hadoop Direct’ operator, which loads all data directly into Vector
✓ Took around 7 minutes to add the data to Vector
✓ There are other ways to add this data to Vector, such as vwload or Director
Cluster and Label
Use k-means to label data (convert unsupervised to supervised)
✓ Read about 50% of the data to build the k-means clusters
✓ Use the DataFlow ‘Type Conversion’ operator to change some variables to categorical variables
✓ ‘Cluster Predictor’ assigns input data to the appropriate cluster
✓ ‘Derive Fields’ helps in assigning an appropriate name to each cluster
✓ Write the output to Vector
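The cluster-and-label idea can be sketched in plain Python as follows. This is an illustration only, not the DataFlow operators: the two features per driver, the toy data, and the rule that names clusters by ranking their centroids are all assumptions (the presentation derived the names from decision-tree patterns instead).

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; the first k points seed the centroids (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return [nearest(p, centroids) for p in points], centroids

# Hypothetical features per driver: (speeding events, harsh-braking events).
points = [(1, 0), (9, 8), (5, 4), (0, 1), (8, 9), (4, 5), (1, 1), (9, 9), (5, 5)]
labels, centroids = kmeans(points, k=3)

# Assumed naming rule: higher centroid totals mean riskier driving.
ranked = sorted(range(3), key=lambda c: sum(centroids[c]), reverse=True)
zone_of = dict(zip(ranked, ["Risk Zone", "Potential Risk Zone", "Safe Zone"]))
labeled = [zone_of[label] for label in labels]
print(list(zip(points, labeled)))
```

Once every row carries a zone name, the unlabeled dataset has become a labeled one that a supervised learner can train on.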
Train and Test
Use the labeled data to create a logistic regression model
✓ Read around 1,000,000 rows from the database and pass them to Logistic Regression Learner
✓ Logistic Regression Predictor predicts a target value using a previously built logistic regression model
✓ Time taken to build the model: 2 min, 15 sec
✓ The logistic classification model is written to a PMML file
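A learner/predictor pair of this kind can be sketched in plain Python. This is a simplified stand-in, not the DataFlow operators: it uses a single hypothetical feature, a binary target (1 for a risky driver, 0 otherwise) rather than the three zones, toy data, and batch gradient descent with assumed settings for learning rate and epochs.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(rows, labels, lr=0.5, epochs=500):
    """Fit weight w and bias b for one feature by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(rows)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(rows, labels)]
        w -= lr * sum(e * x for e, x in zip(errors, rows)) / n
        b -= lr * sum(errors) / n
    return w, b

def predict(model, x):
    """Return class 1 when the predicted probability is at least 0.5."""
    w, b = model
    return 1 if sigmoid(w * x + b) >= 0.5 else 0

# Toy labeled data: a hypothetical standardized risk score and its binary label.
xs = [i / 10 for i in range(-20, -9)] + [i / 10 for i in range(10, 21)]
ys = [0] * 11 + [1] * 11

# Train on even-indexed rows, test on odd-indexed rows.
train_x, train_y = xs[0::2], ys[0::2]
test_x, test_y = xs[1::2], ys[1::2]

model = train(train_x, train_y)
correct = sum(predict(model, x) == y for x, y in zip(test_x, test_y))
accuracy = correct / len(test_x)
print(accuracy)
```

Holding out the test rows is what makes the accuracy figure meaningful: the model is scored only on rows it did not see during training.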
Classify
Logistic regression can be used to classify a larger dataset
✓ Read the original 50 million rows
✓ Use the PMML file built in stage 3 for predictions
✓ Took 1 min, 44 sec to classify the remaining 25 million rows and write the classification to the database
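The pattern of training once, saving the model to a file, and reusing it to score a larger dataset can be sketched as follows. This is an assumption-laden illustration: a small JSON file stands in for the PMML file, the coefficients are hypothetical rather than taken from the presentation, and the batch size is arbitrary.

```python
import json
import math
import os
import tempfile

def save_model(path, weight, bias):
    """Persist model parameters (JSON here; the presentation used PMML)."""
    with open(path, "w") as f:
        json.dump({"weight": weight, "bias": bias}, f)

def load_model(path):
    with open(path) as f:
        return json.load(f)

def classify(model, x):
    z = model["weight"] * x + model["bias"]
    return 1 if 1.0 / (1.0 + math.exp(-z)) >= 0.5 else 0

def classify_in_batches(model, rows, batch_size):
    """Yield one list of predictions per batch of input rows."""
    for start in range(0, len(rows), batch_size):
        yield [classify(model, x) for x in rows[start:start + batch_size]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.json")
    save_model(path, weight=2.0, bias=-1.0)  # hypothetical coefficients
    model = load_model(path)

unlabeled = [-1.0, 0.0, 0.4, 0.6, 1.0, 3.0]
batches = list(classify_in_batches(model, unlabeled, batch_size=4))
print(batches)
```

Scoring needs only the stored parameters, not the training data, which is why classifying many rows can be much cheaper than building the model.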
Assessment for 50 million rows (4 GB of data): k-means run time for various combinations
            Vector                    CSV
DataFlow    2 min, 26 sec (93.06%)    8 min, 13 sec (75.06%)
KNIME       10 min, 7 sec (67.17%)    32 min, 6 sec (base)
• k-means with R and CSV hangs and, after 19 minutes, throws many errors
• The advantage should grow further as the data volume increases
• Winning combination: DataFlow with Vector
Quantitative comparison
Using DataFlow and Vector gives a 93% reduction in run time compared with KNIME and CSV.
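As a cross-check, the reductions can be recomputed from the reported timings. The sketch below converts each timing to seconds and uses KNIME with CSV (32 min, 6 sec) as the baseline; the helper names are hypothetical. Computed this way, the figures come out within about 1.5 percentage points of the percentages reported in the presentation. One possible explanation, which is an inference and not stated in the source, is that the reported values treat "min, sec" as a decimal number (for example 2.26 / 32.6).

```python
def to_seconds(minutes, seconds):
    return minutes * 60 + seconds

def reduction_percent(run_seconds, base_seconds):
    """Percentage reduction in run time relative to the baseline."""
    return (1 - run_seconds / base_seconds) * 100

base = to_seconds(32, 6)  # KNIME with CSV, the baseline

timings = {
    "DataFlow + Vector": to_seconds(2, 26),
    "DataFlow + CSV": to_seconds(8, 13),
    "KNIME + Vector": to_seconds(10, 7),
}

reductions = {name: round(reduction_percent(t, base), 2) for name, t in timings.items()}
speedup = round(base / timings["DataFlow + Vector"], 1)

for name, pct in reductions.items():
    print(f"{name}: {pct}% less run time than KNIME + CSV")
print(f"DataFlow + Vector speedup factor: {speedup}x")
```

Expressed as a ratio, a reduction of roughly 92% to 93% corresponds to the DataFlow and Vector run being about 13 times faster than the baseline.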
Visualization
Conclusions
• Data is growing fast
• Sooner or later Machine Learning will be applied almost everywhere
• Tools with high speed and performance will become instrumental in making the right business decisions
• Vector and DataFlow are an ideal combination
Acknowledgements
Saurabh Mishra & J V Kameshwar Rao, Analytics CoE, ERS, HCL Technologies
Thank you!