© 2014 Datameer, Inc. All rights reserved.
End-to-End data mining feature integration, transformation and selection with Datameer
Fastest time to Insights

Rapid Data Integration
• Zero coding data integration
• Wizard-led data integration & No ETL required
• 55+ out-of-the-box adapters
• OpenAPI to create custom data connections
• Schema on read
• Flexible integration methods
• Exception Reporting

Rapid Feature Transformation
• Point & Click Analytics
• Spreadsheet UI
• 270+ pre-built functions
• Visual data profiling
• Drag & Drop Visualization

Powerful Feature Selection
• Out-of-the-box Data Mining on Hadoop (Decision Trees, Column Dependency, …, Pearson, Spearman, …)
• Reuse of own functions written in Java, R, Python, SAS, SPSS and more

Feature discovery, selection and data mining on Big Data in a fraction of the time
Problem
It takes months to integrate, pre-process, merge and select data from a wide range of data sources for the purpose of data mining in the area of credit scoring.

This is due to:

Technical Challenges
• Large number of source systems
• Heterogeneous data formats
• Large data volume
• Evolving systems lead to long integration processes

Organizational Challenges
• Many alignment round trips between SMEs and IT to get the right data in the right form
• Intermediate insights lead to changing requirements, which in turn trigger further change requests at IT
Solution
• All data from the different sources is ingested into a Hadoop-based data lake in its original format, following the pattern “store everything, discover later”.
• Datameer enables SMEs and data scientists to merge data from many data sources.
• Comprehensive and easy-to-use data transformation functions help to understand and clean up data quickly.
• Feature selection functions help spot relationships in data sets and reduce thousands of attributes to a couple of hundred or even fewer, depending on the use case.
• Different sampling techniques are applied to extract data for building predictive models in SAS.
• Datameer’s PMML interface allows those predictive models to be run on Big Data to get more precise rules.
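
The slides do not say which sampling techniques are used to extract data for modeling. As one possible illustration, the sketch below shows stratified sampling in plain Python: it draws the same fraction of rows from every class, so a rare outcome such as a credit default stays represented in the extract. The function name, the record layout and the "default" field are hypothetical and are not part of Datameer or SAS.

```python
import random
from collections import defaultdict


def stratified_sample(rows, label_key, fraction, seed=0):
    """Draw the same fraction of rows from every class of label_key.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # fixed seed keeps the extract reproducible
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)

    sample = []
    for label in sorted(by_class):
        group = by_class[label]
        # Keep at least one row per class so rare classes never vanish.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


# Made-up credit records: 90 non-defaults and 10 defaults.
rows = [{"id": i, "default": int(i % 10 == 0)} for i in range(100)]
sample = stratified_sample(rows, "default", 0.2)
```

With a 20% fraction the extract holds 18 non-defaults and 2 defaults, which preserves the original 9:1 class ratio. A plain random sample of the same size could easily miss the minority class altogether.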
Results
• Datameer reduces the process of data integration, feature transformation and selection from months to merely days.
• Datameer eliminated the overhead processes between IT and business units.
• SMEs and data scientists can use Datameer as a self-service platform for data discovery without going back and forth with IT over ever-changing requirements.
• Predictive data mining delivers better results because models can now be run on Big Data.
Feature Selection & Discovery

Feature Transformation
• Data Cleansing
• Vari. histogram distributions
• Reduce cardinality
• Binning
• …

Feature Selection
• Pearson
• Spearman
• Mutual Information
• Gini
• …

Modeling
• Regression
• Neural Network
• Bayesian Networks
• …

Prediction, Scoring, …
• PMML
• Ensemble

3rd party tools for modeling sampled data and Datameer for executing models on Big Data
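
To make the feature selection stage more concrete, here is a minimal sketch of correlation-based selection using the Pearson and Spearman coefficients named on the slide. It is plain Python and does not reproduce Datameer's built-in functions. The helper names, the column names, the sample values and the 0.5 threshold are all assumptions chosen for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson applied to the ranks."""
    return pearson(ranks(x), ranks(y))


def select_features(columns, target, threshold, score=spearman):
    """Keep columns whose absolute correlation with target meets threshold."""
    scored = {name: abs(score(vals, target)) for name, vals in columns.items()}
    kept = [name for name, s in scored.items() if s >= threshold]
    return sorted(kept, key=lambda name: -scored[name])


# Made-up attributes and target, for illustration only.
columns = {
    "income": [1, 2, 3, 4, 5],
    "income_cubed": [1, 8, 27, 64, 125],
    "noise": [2, 5, 1, 4, 3],
}
target = [10, 20, 30, 40, 50]
selected = select_features(columns, target, threshold=0.5)
```

Spearman works on ranks, so it scores the monotonic but non-linear "income_cubed" column as strongly as "income", while Pearson rates it slightly lower. The "noise" column falls below the threshold under either measure and is dropped.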
Solution Architecture Blueprint
• Data Sources (DB, Hive, …)
• Import Adapters or Data Links
• Workbooks: Filtering, Aggregation, Joins
• Visualization
• Option: Write results to database and import to mining tools to build models
• Option: Export CSV or write to Hive Table for Data Mining Tools
• Export PMML to Datameer
• Workbooks