Using Hadoop as a platform for Master Data Management


Roman Kucera, Ataccama Corporation

Roman Kucera
Head of Technology and Research

Implementing MDM projects for major banks since 2010

Last 12 months spent on expanding Ataccama portfolio into Big Data space, most importantly adopting the Hadoop platform

Ataccama Corporation

Ataccama is a software vendor focused on Data Quality, Master Data Management, Data Governance and now also on Big Data processing in general

Quick Introduction

Why did I decide to give this talk?

Typical MDM quotes at Hadoop conferences:

"There are no MDM tools for Hadoop"

"We have struggled with MDM and Data Quality"

"You do not need MDM, it does not make sense on Hadoop"

My goal is to:

Explain that MDM is necessary, but it does not have to be scary

Show a simplified example

What is Master Data Management?

"Master Data is a single source of basic business data used across multiple systems, applications, and/or processes" (Wikipedia)

Important parts of an MDM solution:

Collection – gathering of all data

Consolidation – finding relations in the data

Storage – persistence of consolidated data

Distribution – providing a consolidated view to consumers

Maintenance – making sure that the data is serving its purpose

… and a ton of Data Quality

How is this related to Big Data?

Traditional MDM using Big Data technologies
Some companies struggle with the performance and/or the price of hardware and DB licenses for their MDM solution
Big Data technologies offer better scalability, especially as data volume and data diversity grow

MDM on Big Data
Adding new data sources that were previously not mastered
Your Hadoop cluster is probably the only place where you have all of the data together, and therefore the only place where you can create the consolidated view

Traditional MDM

Source | Name     | Phone             | Email | Passport
CRM    | John Doe | +1 (245) 336-5468 |       | 985221473
CRM    | Jane Doe | +1 (212) 972-6226 |       | 3206647982

CRM Load

Traditional MDM

Source | Name     | Phone             | Email              | Passport
CRM    | John Doe | +1 (245) 336-5468 |                    | 985221473
CRM    | Jane Doe | +1 (212) 972-6226 |                    | 3206647982
WEBAPP | J. Doe   | 2129726226        | Jane.doe@gmail.com |

CRM Load
WEBAPP Load

Traditional MDM

Source  | Name     | Phone             | Email              | Passport
CRM     | John Doe | +1 (245) 336-5468 |                    | 985221473
CRM     | Jane Doe | +1 (212) 972-6226 |                    | 3206647982
WEBAPP  | J. Doe   | 2129726226        | Jane.doe@gmail.com |
Billing | Doe John |                   | John.doe@yahoo.com | 985221473

CRM Load
WEBAPP Load
Billing Load

Traditional MDM

Source  | Name     | Phone             | Email              | Passport
CRM     | John Doe | +1 (245) 336-5468 |                    | 985221473
CRM     | Jane Doe | +1 (212) 972-6226 |                    | 3206647982
WEBAPP  | J. Doe   | 2129726226        | Jane.doe@gmail.com |
Billing | Doe John |                   | John.doe@yahoo.com | 985221473

Match and Merge (sketched below):

ID | Name     | Phone             | Email              | Passport
1  | John Doe | +1 (245) 336-5468 | John.doe@yahoo.com | 985221473
2  | Jane Doe | +1 (212) 972-6226 | Jane.doe@gmail.com | 3206647982
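To make the Match and Merge step concrete, here is a minimal Python sketch of it (not Ataccama's actual engine): records are grouped when they share a passport number or a normalized phone number, and a deliberately naive survivorship rule builds the golden records. The normalization and survivorship choices are illustrative assumptions.

```python
# Illustrative match-and-merge: group source records that share a passport
# number or a normalized phone number, then build one "golden record" per group.
import re
from collections import defaultdict

records = [
    {"source": "CRM",     "name": "John Doe", "phone": "+1 (245) 336-5468", "email": None,                 "passport": "985221473"},
    {"source": "CRM",     "name": "Jane Doe", "phone": "+1 (212) 972-6226", "email": None,                 "passport": "3206647982"},
    {"source": "WEBAPP",  "name": "J. Doe",   "phone": "2129726226",        "email": "Jane.doe@gmail.com", "passport": None},
    {"source": "Billing", "name": "Doe John", "phone": None,                "email": "John.doe@yahoo.com", "passport": "985221473"},
]

def norm_phone(p):
    # Strip formatting and country code so "+1 (212) 972-6226" == "2129726226".
    digits = re.sub(r"\D", "", p or "")
    return digits[-10:] if digits else None

# Union-find keeps grouping transitive (A~B and B~C => one group).
parent = list(range(len(records)))
def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i
def union(i, j):
    parent[find(i)] = find(j)

# Any two records sharing a key value belong to the same entity.
for key_fn in (lambda r: r["passport"], lambda r: norm_phone(r["phone"])):
    seen = {}
    for i, r in enumerate(records):
        k = key_fn(r)
        if k is None:
            continue
        if k in seen:
            union(i, seen[k])
        else:
            seen[k] = i

# Merge: first non-empty value per attribute wins (a naive survivorship rule;
# real tools use source priority, recency, and quality scores).
groups = defaultdict(list)
for i, r in enumerate(records):
    groups[find(i)].append(r)

for gid, (root, members) in enumerate(sorted(groups.items()), start=1):
    golden = {attr: next((m[attr] for m in members if m[attr]), None)
              for attr in ("name", "phone", "email", "passport")}
    print(gid, golden)
```

Run as-is, this reproduces the two golden records from the table above; the transitivity provided by union-find matters once more than two sources describe the same person.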

MDM on Big Data
The goal is to get all relevant data about a given entity:

John Doe, ID 007
• Links to original source records
• Traditional mastered attributes
• Contact history
• Clickstream in the web app
• Call recordings
• Usage of the mobile app
• Tweets
• Gazillion different classification attributes computed in Hadoop

Single view of…

[Diagram: customer data consolidated from Billing, CRM, Twitter, Email, and Web app & mobile sources]

People say "Let's just store the raw data and do the transformation only when we know the purpose"

But you still need some definition of your business entities: what use is any analysis of your clients' behavior without a definition of a client?

Processes need to relate to some central master data

You may end up with multiple views of the same entity, since some usage purposes may need a different definition than others, but the process of creating these multiple views is exactly the same.

Main parts of a sample solution on Hadoop

Integration of source data
Covered by many other presentations; various tools available

Match and merge to identify real complex entities
Assign a unique identifier to groups of records representing one business-relevant entity
Create golden records

Provide services to other systems
Access Master Data
Manipulate Master Data
Search in Master Data

Profiling
The most important part of Data Integration is knowing your data
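As a small illustration of what profiling means in practice, the Python sketch below computes the kind of per-column statistics a profiling tool would report: fill rate, cardinality, and value "masks". The column names reuse the earlier example, and the mask convention (digits to 9, letters to A) is a simplifying assumption.

```python
# Toy column profiler: fill rate, distinct count, and structural masks,
# the kind of summary a Data Quality tool shows per column.
import re
from collections import Counter

rows = [
    {"name": "John Doe", "phone": "+1 (245) 336-5468", "passport": "985221473"},
    {"name": "Jane Doe", "phone": "+1 (212) 972-6226", "passport": "3206647982"},
    {"name": "J. Doe",   "phone": "2129726226",        "passport": None},
]

def mask(value):
    # Reduce a value to its pattern, e.g. "+9 (999) 999-9999".
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))

for column in ("name", "phone", "passport"):
    values = [r[column] for r in rows if r[column] is not None]
    masks = Counter(mask(v) for v in values)
    print(f"{column}: fill={len(values)}/{len(rows)}, "
          f"distinct={len(set(values))}, masks={masks.most_common(3)}")
```

Inconsistent masks (two phone formats above) are exactly the kind of finding that drives the normalization rules used later in matching.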

Moving the MDM process to Hadoop

The matching itself is the only complicated part
This is where sophisticated tools come in … only there are not many of them that work properly in Hadoop

Common approaches
Simple matching ("group by") is easy to implement using MapReduce for a large batch, or with a simple lookup for small increments (see the sketch below)
Complex matching as implemented in commercial MDM tools typically does not scale well, and it is difficult to implement these methods in Hadoop from scratch – some of them are not scalable even on a theoretical level
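As a minimal illustration of the "group by" approach, here is a plain-Python simulation of the map, shuffle, and reduce phases; the shape of the computation, not the framework, is the point, and the passport matching key and record layout are assumptions.

```python
# "Group by" matching as a MapReduce job, simulated in plain Python.
# Map: emit (matching key, record). Shuffle: group by key.
# Reduce: every record sharing a key gets the same group ID.
from collections import defaultdict

records = [
    {"id": "crm-1",  "passport": "985221473",  "name": "John Doe"},
    {"id": "bill-7", "passport": "985221473",  "name": "Doe John"},
    {"id": "crm-2",  "passport": "3206647982", "name": "Jane Doe"},
]

# Map phase: one (key, value) pair per record with a usable key.
mapped = [(r["passport"], r) for r in records if r["passport"]]

# Shuffle phase: the framework groups pairs by key.
shuffled = defaultdict(list)
for key, record in mapped:
    shuffled[key].append(record)

# Reduce phase: assign one group ID per key.
for group_id, (key, members) in enumerate(sorted(shuffled.items()), start=1):
    for record in members:
        print(group_id, record["id"], record["name"])
```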

Matching options

Rule-based matching
Traditional approach, good for auditability – for every matched record you know exactly why it was matched

Probabilistic matching, machine learning
Serves more like a black box, but with proper training data it can be easier to configure for the multitude of big data sources

Search-based matching
Not really matching, but can be used synergistically to supplement matching – traditional MDM for traditional data sources, then full-text search to find related pieces of information in other (Big Data) sources (see the sketch below)
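A hedged sketch of the search-based idea: build a tiny inverted index over tokens from non-mastered sources, then query it with the attributes of an already-mastered entity to pull back related information. The tokenization and the hit-count scoring are deliberately naive assumptions.

```python
# Toy full-text lookup: index Big Data records by token, then search with
# the attributes of a mastered entity to find related information.
from collections import defaultdict

big_data_records = [
    {"id": "tweet-42", "text": "Great support call today, thanks @acme! - John Doe"},
    {"id": "click-9",  "text": "jane.doe@gmail.com viewed pricing page"},
    {"id": "call-3",   "text": "Caller John Doe asked about passport update 985221473"},
]

def tokens(text):
    raw = (t.strip(".,!@-").lower() for t in text.split())
    return {t for t in raw if t}

# Build the inverted index: token -> record IDs.
index = defaultdict(set)
for rec in big_data_records:
    for tok in tokens(rec["text"]):
        index[tok].add(rec["id"])

def search(master_entity_text):
    # Score each record by how many of the entity's tokens it contains.
    hits = defaultdict(int)
    for tok in tokens(master_entity_text):
        for rec_id in index.get(tok, ()):
            hits[rec_id] += 1
    return sorted(hits.items(), key=lambda kv: -kv[1])

print(search("John Doe 985221473"))  # -> call-3 first, then tweet-42
```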

Complex matching

Problems
Some traditionally efficient algorithms are impossible to run in parallel, even on a theoretical level
Others have quadratic or worse complexity, meaning that these algorithms do not scale well for really big data sets, no matter the platform

Typical solutions
If the data set is not too big, use one of the traditional algorithms that are available on Hadoop
Use a simpler heuristic to limit the candidates for matching, e.g. simple matching on some generic attributes – a "blocking" pass, sketched after this list

Either way, using a proper toolset is highly advised

A proper toolset should also guarantee transitivity and each-to-each matching within a group
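A minimal sketch of the candidate-limiting heuristic: a cheap blocking key (the last four phone digits plus the surname initial is an illustrative choice) partitions the records, and the expensive pairwise comparison runs only inside each small block, never across the full data set.

```python
# Blocking: a cheap key limits candidate pairs, so the expensive comparison
# runs within each block instead of over all n*(n-1)/2 pairs.
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

records = [
    {"id": 1, "name": "John Doe", "phone": "2453365468"},
    {"id": 2, "name": "Jon Doe",  "phone": "2453365468"},
    {"id": 3, "name": "Jane Doe", "phone": "2129726226"},
    {"id": 4, "name": "J. Doe",   "phone": "2129726226"},
]

def blocking_key(r):
    # Cheap, generic attributes: last 4 phone digits + surname initial.
    return (r["phone"][-4:], r["name"].split()[-1][0].upper())

blocks = defaultdict(list)
for r in records:
    blocks[blocking_key(r)].append(r)

# Expensive fuzzy comparison, now only inside each block.
for key, members in blocks.items():
    for a, b in combinations(members, 2):
        score = SequenceMatcher(None, a["name"], b["name"]).ratio()
        if score > 0.5:
            print(f"candidate pair {a['id']}-{b['id']} (name similarity {score:.2f})")
```

The trade-off is recall: two records of one entity that land in different blocks are never compared, which is why real solutions often run several blocking passes with different keys.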

Simple matching with hierarchies

Name     | Social Security Number | Passport   | Matching Group ID
John Doe | 987-65-4320            |            | -
Doe John | 987-65-4320            | 3206647982 | -
J. Doe   |                        | 3206647982 | -

Simple matching with hierarchies

Name     | Social Security Number | Passport   | Matching Group ID
John Doe | 987-65-4320            |            | 1
Doe John | 987-65-4320            | 3206647982 | 1
J. Doe   |                        | 3206647982 | -

Matching by the primary key – Social Security Number

Simple matching with hierarchies

Name     | Social Security Number | Passport   | Matching Group ID
John Doe | 987-65-4320            |            | 1
Doe John | 987-65-4320            | 3206647982 | 1
J. Doe   |                        | 3206647982 | 1

Matching by the secondary key – Passport
Records that did not have a group ID assigned in the first run and that can be matched by a secondary key join the primary group

Simple matching with hierarchies

Finding a perfect match by a key attribute is one of the most basic MapReduce aggregations
If the key attribute is missing, use a secondary key in the same process to expand the original groups
For each set of possible keys, one MapReduce job is generated (see the sketch below)
For small batches or online matching, look up the relevant records from the repository based on the keys and perform matching on that partial dataset
In traditional MDM, this repository was typically an RDBMS
In Hadoop, this can be achieved with HBase or another database with fast direct key-based access
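A minimal Python sketch of the hierarchical key matching walked through in the tables above: pass 1 groups by the primary key (SSN), pass 2 lets still-unmatched records join existing groups via the secondary key (passport). The data mirrors the example; in production each pass would be a MapReduce job rather than a loop.

```python
# Hierarchical key matching: primary key first, then a secondary key that
# expands the groups created by the earlier pass.
records = [
    {"name": "John Doe", "ssn": "987-65-4320", "passport": None,         "group": None},
    {"name": "Doe John", "ssn": "987-65-4320", "passport": "3206647982", "group": None},
    {"name": "J. Doe",   "ssn": None,          "passport": "3206647982", "group": None},
]

next_group = 1
for key in ("ssn", "passport"):          # primary first, then secondary
    key_to_group = {}
    # Existing groups "claim" their key values first...
    for r in records:
        if r["group"] is not None and r[key]:
            key_to_group.setdefault(r[key], r["group"])
    # ...then unassigned records join a claimed group or open a new one.
    for r in records:
        if r["group"] is None and r[key]:
            if r[key] not in key_to_group:
                key_to_group[r[key]] = next_group
                next_group += 1
            r["group"] = key_to_group[r[key]]

for r in records:
    print(r["group"], r["name"])   # -> 1 John Doe / 1 Doe John / 1 J. Doe
```

After the first pass only the two SSN records share group 1; the passport pass then pulls "J. Doe" into the same group, exactly as in the third table.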

Sample tool

Step 1 | Bulk matching

Source 1 [Full Extract] + Source 2 [Full Extract] → Matching Engine [MapReduce] → MDM Repository [HDFS file]

Step 2 | Incremental bulk matching

Source Increment Extract [HDFS file] + Old MDM Repository [HDFS file] → Matching Engine [MapReduce] → New MDM Repository [HDFS file]

Step 3 | Online MDM Services

Online or Microbatch [Increment] → Matching Engine [Non-Parallel Execution] ↔ MDM Repository [Online Accessible DB]

1. An online request comes in through the designated interface
2. The matching engine asks the MDM repository for all related records, based on the defined matching keys
3. The repository returns all relevant records that were previously stored
4. The matching engine computes the matching on the available dataset and stores the new results (changes) back into the repository (see the sketch below)
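A hedged sketch of that request flow, with a plain dict standing in for the online repository (HBase in the slides; the key layout and function names here are assumptions for illustration, not the HBase API).

```python
# Online matching flow: look up candidates by matching key, match on the
# small candidate set, write changes back. A dict plays the repository.
repository = {
    # matching key -> previously stored records
    "passport:985221473": [{"id": "m-1", "name": "John Doe", "passport": "985221473"}],
}

def matching_keys(record):
    # Derive lookup keys from the incoming record (illustrative key set).
    keys = []
    if record.get("passport"):
        keys.append("passport:" + record["passport"])
    return keys

def online_upsert(record):
    # Steps 1-2: ask the repository for all related records by matching key.
    candidates = []
    for key in matching_keys(record):
        candidates.extend(repository.get(key, []))     # step 3: relevant records
    # Step 4: match on the tiny candidate set and store results back.
    if candidates:
        record["master_id"] = candidates[0]["id"]      # join the existing master
    else:
        record["master_id"] = "m-new"                  # open a new master
    for key in matching_keys(record):
        repository.setdefault(key, []).append(record)
    return record["master_id"]

print(online_upsert({"id": "web-9", "name": "J. Doe", "passport": "985221473"}))  # -> m-1
```

Because only the records sharing a key are fetched, each request touches a handful of rows, which is what makes the non-parallel execution viable online.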

Step 4 | Complex Scenario

Inputs: Online or Microbatch [Increment], or Source 1 [Full Extract]
The Matching Engine routes by size:
SMALL DATASET → [Non-Parallel Execution], with a direct Get from the MDM Repository [Online Accessible DB]
LARGE DATASET → [MapReduce], with a Full scan of the repository
Both paths then Update Repository

Step 4 | Complex Scenario (continued)

Same flow, with one addition: a Delta Detection [MapReduce] step is applied to Source 1 [Full Extract] first, so that only new or changed records continue into the Matching Engine (a delta-detection sketch follows)
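A minimal sketch of the delta detection idea, assuming changes are detected by comparing a content fingerprint of each record against the previous extract; the hashing scheme is an illustrative choice.

```python
# Delta detection: compare today's full extract against yesterday's by a
# record fingerprint, and forward only new or changed records to matching.
import hashlib
import json

def fingerprint(record):
    # Stable hash of the record's content (key order normalized).
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

old_extract = {
    "c-1": {"name": "John Doe", "phone": "2453365468"},
    "c-2": {"name": "Jane Doe", "phone": "2129726226"},
}
new_extract = {
    "c-1": {"name": "John Doe", "phone": "2453365468"},   # unchanged -> skipped
    "c-2": {"name": "Jane Doe", "phone": "2125550000"},   # changed   -> forwarded
    "c-3": {"name": "Jim Doe",  "phone": "2125551111"},   # new       -> forwarded
}

old_fp = {rid: fingerprint(r) for rid, r in old_extract.items()}
delta = [{**r, "id": rid} for rid, r in new_extract.items()
         if old_fp.get(rid) != fingerprint(r)]
print(delta)   # only c-2 and c-3 reach the matching engine
```

As a MapReduce job this is a join of the two extracts on record ID, which keeps the expensive matching run proportional to the change rate rather than the source size.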

Typical MDM services for consumers

Insert, update (upsert)
The record is matched against the existing repository and the results are stored back

Identify
Similar to upsert, but it does not store the results back into the repository

Search
Using a fulltext (or other) index to find master entities

Fetch
Get all the information on a master record identified by its ID

Scan
Get all master records for batch analysis (a sketch of this service interface follows)
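To make the service catalogue concrete, here is a hedged Python sketch of the interface shape such a service layer might expose; the class and method names are assumptions for illustration, not Ataccama's API.

```python
# Hypothetical MDM service facade mirroring the five operations above.
from typing import Iterable, Optional

class MdmServices:
    def upsert(self, record: dict) -> str:
        """Match the record against the repository, store results, return the master ID."""
        ...

    def identify(self, record: dict) -> Optional[str]:
        """Like upsert, but read-only: return the master ID without persisting anything."""
        ...

    def search(self, query: str) -> list[str]:
        """Full-text (or other index) lookup returning matching master IDs."""
        ...

    def fetch(self, master_id: str) -> dict:
        """Return everything known about one master record."""
        ...

    def scan(self) -> Iterable[dict]:
        """Stream all master records, e.g. as input for batch analysis."""
        ...
```

The split between upsert and identify matters operationally: identify lets analytical consumers resolve entities without mutating the repository.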

Questions?

For more information, visit us at the Ataccama booth!