Hw09 Hadoop Based Data Mining Platform For The Telecom Industry
Parallel Data Mining Platform in Telecom Industry: Big Cloud based Parallel Data Mining Platform
Friday, Oct 2, 2009 NYC
Research Institute of China Mobile Communication Corporation
Feng Cao
Internal material. Keep confidential.
Outline
Introduction
BC-PDM architecture
Features compared between phase I and phase II
Conclusions and future work
Large scale data in China Mobile Communication Corporation (CMCC)
Subscribers: 500 million
Subscribers' CDR (call detail record) data
5~8TB/day in CMCC
For a branch company (> 20 million subscribers)
Voice: 100 million * 1KB = 100GB/day
SMS: 100~200 million * 1KB = 100~200GB/day
……
Network signaling data, for a branch company (> 20 million subscribers)
GPRS signaling data: 48GB/day
3G signaling data: 300GB/day
Voice, SMS signaling data, ……
Large Scale Data Applications and Current Solution
The requirements
Precision marketing:
Analysis of user behavior
Customer churn prediction
Service association analysis
……
Network optimization:
Network QoS analysis
Signaling data analysis
……
Service optimization and log processing:
Spam message filtering
……
Current solution
Commercial database / data warehouse systems
Commercial data mining tools: Clementine, Enterprise Miner, Intelligent Miner
Most run on Unix servers, with data stored in storage arrays
What is BASS
BASS (Business Analysis Support System) is a BI system for CMCC that supports enterprise decision-making, marketing management analysis, and sales.
BASS includes a data extract layer, a data process layer, a data display layer, and an application layer.
Main operations in the data process layer:
Data extraction from other systems
Data transformation
Data gathering
Data statistics
…
Because BASS is built on a database system, most operations are performed inside the database, which amounts to ELT (Extract, Load, Transform) rather than ETL.
Challenges and limitations of BASS
Hardware investment is large, and expansion is costly.
62% of the investment goes to hardware.
Because Unix servers differ in their specifications, expansion means buying entirely new Unix servers rather than just adding a few to the existing ones.
Management of the IT system is complex.
One Unix server cannot support a BASS; every branch subsystem has about 3-5 servers, such as an ETL server, a database server, an interface server, and a display server.
The database is overloaded.
ELT puts heavy pressure on the database; in a branch company, one server cannot support all operations.
Data backup is not well supported.
Offline data backup (5 branches) takes a lot of time, online data backup (8 branches) consumes a lot of resources, and file backup (18 branches) is slow to restore.
What is BC-PDM
BC-PDM: Big Cloud based Parallel Data Mining Platform
A data mining solution for large-scale data analysis
Massive scalability - based on Hadoop
Low cost - commodity machines and free software
Customization - tailored to application requirements
Easy to use - similar user interface to commercial tools
BC-PDM Architecture
•Large Scale Data Process
•Large Scale Data Mining
•Excellent scalability
•Large Scale Storage
•High performance
•High Availability
•Low Price
[Architecture diagram; recoverable labels: Data mining App, DE, DT]
Features of BC-PDM (I)
BC-PDM (phase I)
Workflow management
GUI - drag operations for application modeling and design
Job monitoring
Flow configuration
ETL (14 different ETL operations from 6 categories, based on MapReduce)
Statistics, attribute processing, data sampling, query, data processing, redundant data processing
Data mining algorithms (9 algorithms from 3 categories, based on MapReduce)
Clustering, classification, association analysis
BC-PDM (phase II)
DE (Data Exploration)
Simple data analysis and preview
ETL (25 more)
Simulates SQL operations: supports Join, Group by, Expression, Case when, Update, etc.
Data mining algorithms (4 more)
Classification, sequence association analysis
Targeting a general data analysis and data mining platform/tools
Features of BC-PDM (II)
BC-PDM (phase I)
Visualization
Text, decision tree, pie chart, and histogram
BC-PDM (phase II)
Web-based GUI
Provides SaaS mode for users
Data transfer tool
Provides data upload and download tools for SaaS
Security
Multi-tenant and user groups for branches, ACL for data access
Targeting a general data analysis and data mining platform/tools
Case I – MapReduce-based ETL
Function: redundancy removal. Deletes duplicate records in a CDR and keeps a single unique one.
Input data
Define the target fields (one or all)
Map (MapTasker 1, MapTasker 2, ..., MapTasker n): set the target fields as the key and the other fields as the value
Reduce (ReduceTasker 1, ..., ReduceTasker m): for each key, read from the value list and write once
Output data
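Key technical solution - Parallel ETL - Redundancy removal
Input data set
Map function: use the whole row as the key; the value is empty text
Reduce function: for rows with the same key, output only one record
Output of reduce function
Function: redundancy removal deletes records that are exactly identical (two or more copies) across all data samples and keeps only one of them.
Targets: 1) parallelize redundancy removal on data tables; 2) results exactly match the serial results; 3) near-linear speedup, with TB-scale data processed in about a thousand seconds.
Reference solution: serial redundancy removal in a database.
Our solution:
1) Map splits the data to be processed into blocks, each handled by one processing node. The map input key is the default one, the offset of each row, and the value is the text of that row, so each block is read row by row. The map task emits intermediate <key, value> pairs in which the key is the text of the whole row and the value is empty text.
2) For rows with the same key, reduce outputs the whole row as the key and an empty value, so that only one copy of identical records is kept. The reduce output is stored in the distributed file system.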
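The map and reduce steps for redundancy removal can be sketched in plain Python, simulated in one process rather than run on Hadoop. This is only an illustration under assumptions: the function names (map_record, shuffle, remove_redundancy), the comma separator, and the idea of passing key_fields to cover the "one or all target fields" variant are hypothetical and are not taken from BC-PDM itself.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Sequence, Tuple


def map_record(line: str, key_fields: Optional[Sequence[int]] = None,
               sep: str = ",") -> Tuple[str, str]:
    """Map step: emit one (key, value) pair for one input row.

    key_fields=None: the whole row is the key and the value is empty text.
    Otherwise: the target fields form the key, the other fields the value.
    """
    if key_fields is None:
        return line, ""
    fields = line.split(sep)
    key = sep.join(fields[i] for i in key_fields)
    value = sep.join(f for i, f in enumerate(fields) if i not in key_fields)
    return key, value


def shuffle(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group values by key, standing in for the framework's shuffle phase."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def remove_redundancy(lines: Iterable[str],
                      key_fields: Optional[Sequence[int]] = None,
                      sep: str = ",") -> List[Tuple[str, str]]:
    """Run map, shuffle, and reduce; reduce writes each key exactly once."""
    groups = shuffle(map_record(line, key_fields, sep) for line in lines)
    # Reduce step: keep the first value seen for each key, in sorted key order.
    return [(key, groups[key][0]) for key in sorted(groups)]
```

Calling remove_redundancy(rows) treats each whole row as the key, while remove_redundancy(rows, key_fields=[0]) keeps one record per distinct value of the first field. When only some fields form the key, which of the competing records survives depends on arrival order, so this sketch simply keeps the first one.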
Case II – MapReduce-based DM algorithm
Function: association. Discovers association rules in data. It iteratively generates candidate k-length item sets from frequent item sets of length k−1.
Input data
Map (MapTasker 1, MapTasker 2, ..., MapTasker n): set the frequent (k-1)-length item sets as the key and their occurrence counts as the value
Reduce (ReduceTasker 1, ..., ReduceTasker m): for each key, read from the value list and sum
Output data
Output the rules that satisfy both the minimum support value and the minimum confidence value
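Key technical solution - Parallel association rule algorithm - PApriori
Flow:
Start
Map: generate all candidate k-itemsets in parallel (candidate k-itemsets 1, candidate k-itemsets 2, ..., candidate k-itemsets n)
Reduce: merge the candidate k-itemsets and compute the frequent k-itemsets Lk that satisfy the minimum support and confidence
Is Lk empty?
No: K = K + 1, then repeat
Yes: return all frequent itemsets
End
Function: Apriori discovers association relationships between attributes by counting frequent itemsets.
Targets: 1) parallelize the search for frequent k-itemsets; 2) results exactly match the serial results; 3) excellent scalability, with TB-scale data processed in about a thousand seconds.
Reference solution: the serial Apriori algorithm.
Our solution:
1) Use Map/Reduce with level-by-level iteration to discover frequent itemsets, parallelizing the search for the frequent k-itemsets at each level.
2) Emit the data as intermediate key/value pairs in which the key is a candidate k-itemset and the value is the itemset count. The outputs of the processing nodes are merged, and the itemsets that meet the minimum support threshold become the frequent k-itemsets.
3) Generate strong association rules from the frequent itemsets, and output the rules that meet the minimum confidence threshold.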
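The level-by-level scheme above can be sketched in plain Python, with each data partition standing in for one map task. This is a hedged, in-process illustration rather than the PApriori implementation: the function names, the use of an absolute count (min_count) as the support threshold, and the service names in the usage note are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def generate_candidates(frequent_prev, k):
    """Join frequent (k-1)-itemsets into candidate k-itemsets and prune any
    candidate that has a (k-1)-subset which is not frequent."""
    prev = set(frequent_prev)
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in prev for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
    return candidates


def map_count(partition, candidates):
    """Map step: count every candidate itemset inside one data partition."""
    counts = Counter()
    for transaction in partition:
        for cand in candidates:
            if cand <= transaction:
                counts[cand] += 1
    return counts


def reduce_counts(partial_counts, min_count):
    """Reduce step: sum per-partition counts, keep itemsets meeting support."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return {itemset: n for itemset, n in total.items() if n >= min_count}


def parallel_apriori(partitions, min_count):
    """Iterate k = 1, 2, ... until no frequent k-itemset remains."""
    items = {item for part in partitions for t in part for item in t}
    candidates = {frozenset([item]) for item in items}
    frequent = {}
    k = 1
    while candidates:
        level = reduce_counts(
            (map_count(part, candidates) for part in partitions), min_count
        )
        if not level:
            break
        frequent.update(level)
        k += 1
        candidates = generate_candidates(level, k)
    return frequent


def strong_rules(frequent, min_confidence):
    """Derive rules lhs -> rhs whose confidence meets the threshold."""
    rules = []
    for itemset, count in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                confidence = count / frequent[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, confidence))
    return rules
```

As a usage example, partitions could be a list such as [[{"sms", "gprs"}, {"sms", "gprs", "ring"}], [{"sms", "ring"}, {"gprs"}]], passed to parallel_apriori(partitions, min_count=2) and then to strong_rules(frequent, 0.9). In a real Hadoop job the candidate set would have to be distributed to every mapper for each level, which this sketch does not model.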
Experiment Environment
Hardware
•256 nodes
•Datanode: 1-way 4-core XEON 2.5G / 8GB Mem / 4*250G SATA II
•Namenode/JobTracker: 2-way 2-core AMD Opteron 2.6GHz / 16G Mem / 4*146G SAS
•Network: Gbps switch (now all 256 nodes connected on a 264-port switch)
Software
•OS: RHEL 5.2
•Hadoop 0.19.1
•Programming language: JDK 1.6 / Linux shell
•Tools: Eclipse 3.3
Evaluation of BC-PDM (Phase I)
Correctness
Compared to SPSS Clementine, it satisfies the application requirements
Performance (16 nodes compared to a general Unix server)
Key technology evaluation
The performance of parallel ETL improves about 12 to 16 times
The performance of data mining ETL improves about 10 to 50 times
With 256 nodes, it can store, process, and mine data at the scale of hundreds of TB
Typical application
The performance of the 3 applications of the Shanghai branch company improves 3 to 7 times
Scalability
Parallel ETL has excellent scalability
The parallel data mining algorithms have good scalability
Data mining applications
User cluster analysis: finds the differences between user groups and characterizes the groups to support precision marketing
Service association: finds the associations among new value-added services to work out how to recommend new services to customers
Conclusions
The BC-PDM framework integrates data mining applications on the MapReduce and HDFS platform.
In BC-PDM phase I, we implemented 14 ETL operations and 9 data mining algorithms. Our practice and experimental results verified that data mining applications on MapReduce can handle large-scale data and effectively shorten response time.
To meet BASS's requirements, BC-PDM phase II supports SaaS mode and more features.
In phase II, we use Map chain to optimize performance, especially for sequences of operations.
Future work
BC-PDM phase II is under development and faces some challenges
Data privacy protection: if SaaS is offered to the public, security is most important.
Migrating an online system from SQL to BC-PDM is difficult.
How to improve the user-friendliness of BC-PDM
The workflow and API for designers are not as flexible as SQL.
General evaluation
Correctness
Performance
Scalability
Application evaluation
ETL process, rather than ELT
Choose typical cases in BASS and use BC-PDM to implement the whole process from source data to the database layer
Output the results to the business database for the display layer, because the display layer needs SQL support
Use real data and compare with the real system
Data mining
Apply the data mining results to the whole BI process
Check the results through marketing
People (cloud computing team from CMRI)
Shaoling Sun
Zhiguo Luo
Meng Xu
Dan Gao
Chao Deng
Ling Qian
Jinyu Han
Leitao Guo
Xu Wang
Zhihong Zhang
Ji Qi
Min Hu
Hongwei Sun
Peng Zhao
Collaborations are welcome!
Thanks and Questions?
Cloud Computing E-Channel (in Chinese)
http://labs.chinamobile.com/cloud