Hw09 Hadoop Based Data Mining Platform For The Telecom Industry

Parallel Data Mining Platform in the Telecom Industry: Big Cloud based Parallel Data Mining Platform

Friday, Oct 2, 2009 NYC

Research Institute of China Mobile Communication Corporation

Feng Cao


Outline

Introduction

BC-PDM architecture

Features compared between phase I and phase II

Conclusions

Future works




Large scale data in China Mobile Communication Corporation (CMCC)

Subscribers: 500 million

Subscribers’ CDR (call detail record) data

5~8 TB/day in CMCC

For a branch company (> 20 million subscribers):

Voice: 100 million * 1 KB = 100 GB/day

SMS: 100~200 million * 1 KB = 100~200 GB/day

……

Network signaling data, for a branch company (> 20 million subscribers):

GPRS signaling data: 48 GB/day

3G signaling data: 300 GB/day

Voice, SMS signaling data, ……
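The per-branch volumes above follow from record count times record size. A minimal sketch of that arithmetic, assuming decimal units (1 KB = 1000 bytes, 1 GB = 1000^3 bytes) and the 1 KB average record size quoted on the slide; the function name is illustrative only:

```python
KB = 1000          # assumption: decimal units, as the round figures on the slide suggest
GB = 1000 ** 3

def daily_volume_gb(records_per_day, bytes_per_record=1 * KB):
    """Daily data volume in GB for a given record count and average record size."""
    return records_per_day * bytes_per_record / GB

# Figures quoted for a branch company with more than 20 million subscribers
voice_gb = daily_volume_gb(100_000_000)      # voice CDRs
sms_low_gb = daily_volume_gb(100_000_000)    # SMS, lower bound
sms_high_gb = daily_volume_gb(200_000_000)   # SMS, upper bound

print(voice_gb, sms_low_gb, sms_high_gb)
```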




Large Scale Data Applications and current solution

The requirements:

Precision marketing: Analysis of User Behavior, Customer Churn Prediction, Service Association Analysis, ……

Network optimization: Network QoS Analysis, Signalling Data Analysis, ……

Service Optimization and Log Processing: Spam Message Filtering, ……

Current solution:

Commercial database / data warehouse systems

Commercial data mining tools: Clementine, Enterprise Miner, Intelligent Miner

Most run on Unix servers, with data stored in storage arrays




What is BASS

BASS (Business Analysis Support System) is a BI system that supports CMCC's enterprise decision-making, marketing management analysis, and sales.

BASS includes a data extraction layer, a data processing layer, a data display layer, and an application layer.

Main operations in the data processing layer: data extraction from other systems, data transfer, data gathering, data statistics, …

BASS is built on a database system and most operations are performed inside the database, so it implements ELT (Extract, Load, Transform) rather than ETL.




Challenges and limitations of BASS

Hardware investment is large and expansion is costly: 62% of the investment is in hardware.

Because Unix servers follow different standards, expansion means buying entirely new Unix servers rather than just adding a few.

IT system management is complex: one Unix server cannot support a BASS, so every branch subsystem has about 3-5 servers, such as an ETL server, a database server, an interface server, and a display server.

The database is overloaded: ELT puts heavy pressure on the database, and in a branch company one server cannot support all operations.

Data backup is not well supported: offline data backup (5 branches) takes a lot of time, online data backup (8 branches) consumes a lot of resources, and file backup (18 branches) restores slowly.




What is BC-PDM

BC-PDM: Big Cloud based Parallel Data Mining Platform

A data mining solution for large-scale data analysis:

Massive scalability - based on Hadoop

Low cost - commodity machines and free software

Customization - tailored to application requirements

Easy to use - similar user interface to commercial tools




BC-PDM Architecture

•Large Scale Data Process

•Large Scale Data Mining

•Excellent scalability

•Large Scale Storage

•High performance

•High Availability

•Low Price

[Architecture diagram; surviving labels: Data mining App, DE, DT]




Features of BC-PDM (I)

BC-PDM (phase I)

Workflow management: GUI with drag operations for application modeling design, job monitoring, flow configuration

ETL (14 ETL operations in 6 categories, based on MapReduce): statistics, attribute processing, data sampling, query, data processing, redundant data processing

Data mining algorithms (9 algorithms in 3 categories, based on MapReduce): clustering, classification, association analysis

BC-PDM (phase II)

DE (Data Exploration): simple data analysis and preview

ETL (25 more operations): simulates SQL operations, supporting Join, Group by, Expression, case when, Update, etc.

Data mining algorithms (4 more): classification, sequence association analysis

Target: a general data analysis and data mining platform/tool




Features of BC-PDM (II)

BC-PDM (phase I)

Visualization: text, decision tree, pie chart, and histogram

BC-PDM (phase II)

Web-based GUI: provides SaaS mode for users

Data transfer tool: provides data upload and download tools for SaaS

Security: multi-tenant and user groups for branches, ACL for data access

Target: a general data analysis and data mining platform/tool




Case I - MapReduce based ETL

Function - Redundancy removal: delete duplicate records in a CDR and keep a single unique one.

Input data: define the target fields (one or all)

Map (MapTasker 1, MapTasker 2, ..., MapTasker n): set the target fields as Key and the other fields as Value

Reduce (ReduceTasker 1, ..., ReduceTasker m): group records with the same key, read from the value list and write once

Output data




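Key technical solution - parallel ETL - redundancy removal

Data flow: input data set -> Map function (the whole row is the Key, the Value is empty text) -> Reduce function (rows with the same Key are output as a single record) -> output of the reduce function

Function: the redundancy removal operation finds two or more completely identical records among all data samples and deletes them, keeping only one of the identical records.

Targets: 1) parallelize redundancy removal on data tables; 2) correctness: results identical to the serial version; 3) near-linear speedup, with TB-scale data processed in a time on the order of a thousand seconds

Reference solution: serial redundancy removal in a database

Our solution:

1) Map splits the data to be processed into blocks, and each block is assigned to one processing node. The map input key is the default (the offset of each line) and the value is the text of that line, so each line in a block is read in turn. The map task emits intermediate <key, value> pairs in which the key is the text of the whole line and the value is empty text.

2) For data with the same key, reduce outputs the key (the whole line) with an empty value, so that identical records are kept only once. The reduce output is stored in the distributed file system.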
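The redundancy removal described on these two slides can be sketched in a few lines. This is a single-process illustration of the map, shuffle and reduce steps, not BC-PDM's Hadoop code; the function names and the sample CDR rows are made up, and in target-field mode the full row is kept as the value (rather than only the other fields) to keep the sketch short.

```python
from collections import defaultdict

def map_record(line, target_fields=None, sep=","):
    """Map step: emit (key, value) for one input row."""
    if target_fields is None:
        return line, ""                      # whole row is the key, value is empty text
    fields = line.split(sep)
    key = sep.join(fields[i] for i in target_fields)
    return key, line                         # simplification: full row carried as value

def dedup(lines, target_fields=None, sep=","):
    """Run map, group by key (stand-in for the shuffle), then reduce."""
    groups = defaultdict(list)
    for line in lines:
        key, value = map_record(line, target_fields, sep)
        groups[key].append(value)
    # Reduce step: for each key, read the value list and write exactly once
    return [key if target_fields is None else values[0]
            for key, values in groups.items()]

# Hypothetical CDR rows: caller, callee, duration in seconds
cdrs = [
    "1380001,1390002,60",
    "1380001,1390002,60",
    "1380001,1390003,15",
    "1380001,1390002,75",
]
whole_row_result = dedup(cdrs)                       # duplicates on the entire row
by_caller_callee = dedup(cdrs, target_fields=[0, 1]) # duplicates on selected fields
print(whole_row_result)
print(by_caller_callee)
```

In a real job the grouping is done by the framework's shuffle, so each reducer only sees the value list for its own keys.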




Case II - MapReduce based DM Algorithm

Function - Association: discover association rules in data. It iteratively generates candidate k-length item sets from frequent item sets of length k−1.

Input data

Map (MapTasker 1, MapTasker 2, ..., MapTasker n): set the frequent (k-1)-length item sets as Key and their occurrence counts as Value

Reduce (ReduceTasker 1, ..., ReduceTasker m): group records with the same key and sum

Output data: rules that satisfy both the minimum support value and the minimum confidence value




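Key technical solution - parallel association rule algorithm - PApriori

Flow: Start -> Map: generate all candidate k-item sets in parallel (candidate k-item sets 1, 2, ..., n) -> Reduce: merge the candidate k-item sets and compute the frequent k-item sets Lk that satisfy the minimum support and confidence -> Is Lk empty? If no, K = K + 1 and repeat; if yes, return all frequent item sets -> End

Function: Apriori discovers associations among attributes using a strategy based on counting frequent item sets.

Targets: 1) parallelize the search for frequent k-item sets; 2) correctness: results identical to the serial version; 3) good scalability, with TB-scale data processed in a time on the order of a thousand seconds

Reference solution: the serial Apriori algorithm

Our solution:

1) Use the Map/Reduce mechanism with level-by-level iteration to discover frequent item sets, parallelizing the search for each set of frequent k-item sets.

2) Convert the data into intermediate Key/Value pairs, where the key is a candidate k-item set and the value is its count. Merge the output of all processing nodes; item sets that meet the minimum support threshold become the frequent k-item sets.

3) Generate strong association rules from the frequent item sets and output the rules that meet the minimum confidence threshold.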
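The three steps of the solution can be illustrated with a small single-process sketch. It is not the PApriori implementation: the partitions stand in for data blocks on different nodes, all function names are hypothetical, and the sample transactions are invented. Support is an absolute count here.

```python
from collections import Counter
from itertools import combinations

def candidates(frequent_prev, k):
    """Build candidate k-item sets whose (k-1)-subsets are all frequent."""
    prev = set(frequent_prev)
    items = sorted({i for s in prev for i in s})
    return [c for c in map(frozenset, combinations(items, k))
            if all(frozenset(s) in prev for s in combinations(c, k - 1))]

def map_count(partition, cands):
    """Map: count each candidate item set inside one data partition."""
    counts = Counter()
    for transaction in partition:
        t = set(transaction)
        for c in cands:
            if c <= t:
                counts[c] += 1
    return counts

def reduce_sum(partials, min_support):
    """Reduce: sum per-partition counts and keep item sets meeting min_support."""
    total = Counter()
    for p in partials:
        total.update(p)
    return {c: n for c, n in total.items() if n >= min_support}

def frequent_itemsets(partitions, min_support):
    """Iterate level by level until no frequent k-item set is found."""
    items = {i for part in partitions for t in part for i in t}
    cands, k, result = [frozenset([i]) for i in sorted(items)], 1, {}
    while cands:
        level = reduce_sum([map_count(p, cands) for p in partitions], min_support)
        if not level:
            break
        result.update(level)
        k += 1
        cands = candidates(level.keys(), k)
    return result

def rules(frequent, min_conf):
    """Emit (antecedent, consequent, confidence) for rules meeting min_conf."""
    out = []
    for itemset, n in frequent.items():
        for r in range(1, len(itemset)):
            for ante in map(frozenset, combinations(itemset, r)):
                conf = n / frequent[ante]
                if conf >= min_conf:
                    out.append((ante, itemset - ante, conf))
    return out

# Hypothetical service-subscription transactions split over two partitions
partitions = [
    [["sms", "gprs", "ring"], ["sms", "gprs"]],
    [["sms", "ring"], ["gprs", "ring"], ["sms", "gprs", "ring"]],
]
freq = frequent_itemsets(partitions, min_support=3)
strong = rules(freq, min_conf=0.7)
```

Each pass over the data corresponds to one MapReduce job, so the number of jobs grows with the length of the longest frequent item set.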




Experiment Environment

Hardware:

•256 nodes

•Datanode: 1-way 4-core Xeon 2.5 GHz / 8 GB memory / 4*250 GB SATA II

•Namenode/JobTracker: 2-way 2-core AMD Opteron 2.6 GHz / 16 GB memory / 4*146 GB SAS

•Network: Gbps switch (now all 256 nodes connected on a 264-port switch)

Software:

•OS: RHEL 5.2

•Hadoop 0.19.1

•Programming languages: JDK 1.6 / Linux shell

•Tools: Eclipse 3.3




Evaluation of BC-PDM(Phase I)

Correctness: compared with SPSS Clementine, it satisfies application requirements

Performance (16 nodes compared with a general Unix server)

Key technology evaluation:

The performance of parallel ETL improves about 12 to 16 times

The performance of data mining ETL improves about 10 to 50 times

With 256 nodes, it can store, process and mine data at the level of hundreds of TB

Typical application:

The performance of the 3 applications of the Shanghai Branch Company improves 3 to 7 times

Scalability:

Parallel ETL has excellent scalability

Parallel data mining algorithms have good scalability

Data mining applications:

User cluster analysis: find the differences between user groups and characterize the groups to support precision marketing

Service association: find the associations among new value-added services to work out how to recommend new services to customers




Conclusions

The BC-PDM framework integrates data mining applications on the MapReduce and HDFS platform.

In BC-PDM phase I, we implemented 14 ETL operations and 9 data mining algorithms. Our practice and experimental results verified that data mining applications on MapReduce can handle large-scale data and effectively shorten response time.

To meet BASS's requirements, BC-PDM phase II supports SaaS mode and more features.

In phase II, we use a Map chain to optimize performance, especially for sequences of operations.
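The slides do not detail how the Map chain works. One plausible reading, similar in spirit to Hadoop's ChainMapper, is that consecutive per-record operations are composed into a single map pass instead of running as separate jobs. A minimal sketch under that assumption, with hypothetical step names and sample rows:

```python
def chain(*steps):
    """Compose per-record map steps into one function.

    Each step takes one record and returns a list of output records
    (an empty list drops the record).
    """
    def run(record):
        records = [record]
        for step in steps:
            records = [out for r in records for out in step(r)]
        return records
    return run

def parse(line):
    """Split a comma-separated row into fields."""
    return [line.split(",")]

def keep_long_calls(fields):
    """Filter: keep calls of at least 30 seconds."""
    return [fields] if int(fields[2]) >= 30 else []

def project(fields):
    """Projection: keep caller and duration only."""
    return [(fields[0], int(fields[2]))]

# Three ETL operations executed in one pass over the input
pipeline = chain(parse, keep_long_calls, project)
lines = ["a,b,60", "a,c,15", "d,b,45"]
result = [out for line in lines for out in pipeline(line)]
print(result)
```

Chaining avoids writing intermediate results to the distributed file system between operations, which is where a sequence of small jobs tends to lose time.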




Future works

BC-PDM phase II is under development and faces several challenges:

Data privacy protection: if SaaS is offered to the public, security is the most important issue.

Migrating online systems from SQL to BC-PDM is difficult.

How to improve the user-friendliness of BC-PDM.

The workflow and API for designers are not as flexible as SQL.

General evaluation: correctness, performance, scalability

Application evaluation:

ETL process, rather than ELT: choose typical cases in BASS and use BC-PDM to implement the whole process from source data to the database layer; output the results to the business database for the display layer, because the display layer needs SQL support; use real data and compare with the real system.

Data mining: use the data mining results in the whole BI process; check the results through marketing.




People (cloud computing team from CMRI)

Shaoling Sun

Zhiguo Luo

Meng Xu

Dan Gao

Chao Deng

Ling Qian

Jinyu Han

Leitao Guo

Xu Wang

Zhihong Zhang

Ji Qi

Min Hu

Hongwei Sun

Peng Zhao




Collaborations are welcome!

Thanks and Questions?


Cloud Computing E-Channel (in Chinese)

http://labs.chinamobile.com/cloud
