Big data
Web-browsing data, social network communications, sensor data -> behavior data
Google and Facebook, for example, are Big Data companies.
Challenges
• Big data processing
• Extracting useful information that reflects user behavior from massive logs
• Instance data management
• Data analysis
Opportunities
• Behavior data (such as web logs) can be used to improve and support business processes.
• Data mining, process mining, and so on
[Figure: Big Data platform architecture. Labels: unstructured data; cloud storage with a distributed file system (HDFS) and key-value databases (HBase, Cassandra, MongoDB); Big Data processing with cloud computing (Map/Reduce framework); Big Data access (Hive, NoSQL); analytic applications (BI/reporting, data mining, machine learning, process mining).]
[Figure: the same stack applied to web data. Labels: web data, raw data, distributed file system (HDFS), NoSQL (Cassandra), cloud computing (Map/Reduce framework), instance data, process mining.]
Case study: Search Engine Company
• News, Page, Image, Maps, Music, Navigation
Dataset: 66 million clicks in one month, 2.2 million clicks per day -> generate behavior in 10 minutes
User behavior:
• Visiting path (Referer)
• Searching result effectiveness
• Abs Clicking Behavior
• Source and destination of user visits
• Robot Behavior Reorganization and Analysis
• Visiting page layout
• Behavior comparison and product improvement
• User grouping and recommendation
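[Figure: page-transition diagram of the search engine. Pages: Home page, Image home page, News home page, Commentary home page, Web results page, Commentary results page, Image results page, Image transition page, News transition page, News topic page, News results page, External page. Actions: Web search, Image search, News search, Commentary search, Page switch, Web result click, Image click, News click, Click full text.]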
Data features
• It contains massive information in a well recorded format
• Large scale with large growth potential
• Real-time analysis
Existing tools
Data extraction: XESame, ProM Import
Process mining: ProM
1) With a large data set, analysis is slow and crashes in most situations.
2) Offline analysis -> real-time analysis
[Figure labels: Cloud storage / non-relational DB; Instance data (XES); Extracting data from cloud]
System Structure
Log processing
Understandable model
Extracting useful information that reflects user behavior from massive logs
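[Figure: converting raw log records to event logs with Map/Reduce]
Incoming event stream: A E B D C F D A E F G
Map input (CaseID+Timestamp+Activity):
CaseID1+T1+A, CaseID1+T3+E, CaseID2+T3+B, CaseID3+T2+D, CaseID2+T1+C, CaseID3+T3+F, CaseID1+T2+D, CaseID2+T4+A, CaseID3+T1+E, CaseID4+T1+F, CaseID2+T2+G
After sort and partition (grouped by case, ordered by timestamp):
CaseID1: T1+A, T2+D, T3+E -> trace A D E
CaseID2: T1+C, T2+G, T3+B, T4+A -> trace C G B A
CaseID3: T1+E, T2+D, T3+F -> trace E D F
CaseID4: T1+F -> trace F
Reduce output: ADE CGBA EDF in one log, F in the next
If the number of events in an XLog exceeds 5000, output one XLog, to avoid exceeding the computer's heap size.
Stages: Map, Sort and Partition, Reduce
Output files: XESName_0.xes, XESName_1.xes, XESName_2.xes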
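The Map, Sort and Partition, and Reduce steps on this slide can be sketched in plain Python. This is a minimal single-process illustration, not the Hadoop job used in the project: the function names are hypothetical, the symbolic timestamps T1..T4 are compared as strings (real timestamps would need parsing), and the "files" are only a dictionary from file name to traces rather than real XES output.

```python
from collections import defaultdict

# Records transcribed from the slide, in arrival order: "CaseID+Timestamp+Activity".
RECORDS = [
    "CaseID1+T1+A", "CaseID1+T3+E", "CaseID2+T3+B", "CaseID3+T2+D",
    "CaseID2+T1+C", "CaseID3+T3+F", "CaseID1+T2+D", "CaseID2+T4+A",
    "CaseID3+T1+E", "CaseID4+T1+F", "CaseID2+T2+G",
]


def map_phase(records):
    """Map: emit (case_id, (timestamp, activity)) for every raw record."""
    for record in records:
        case_id, timestamp, activity = record.split("+")
        yield case_id, (timestamp, activity)


def sort_and_partition(pairs):
    """Group events by case ID and order each group by timestamp."""
    groups = defaultdict(list)
    for case_id, event in pairs:
        groups[case_id].append(event)
    return {case_id: sorted(groups[case_id]) for case_id in sorted(groups)}


def reduce_phase(grouped, max_events=5000):
    """Reduce: collect traces and close a log once it holds more than max_events events."""
    chunk, count = [], 0
    for case_id, events in grouped.items():
        chunk.append((case_id, [activity for _, activity in events]))
        count += len(events)
        if count > max_events:
            yield chunk
            chunk, count = [], 0
    if chunk:
        yield chunk


def build_logs(records, max_events=5000, name="XESName"):
    """Run all three steps; return {file name: list of (case_id, trace)}."""
    grouped = sort_and_partition(map_phase(records))
    chunks = reduce_phase(grouped, max_events)
    return {f"{name}_{i}.xes": chunk for i, chunk in enumerate(chunks)}


if __name__ == "__main__":
    # A small threshold is used here only so that the split becomes visible.
    for file_name, traces in build_logs(RECORDS, max_events=9).items():
        print(file_name, traces)
```

With the threshold lowered to 9 events, the first three cases land in one log and the last case in a second one, which mirrors the split shown in the figure. Closing a log only at case boundaries keeps every trace in a single file.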
fileSize | logNum | OnePCTime | MapReduceTime | MapNum | ReduceNum
8.84 MB | 36,422 | 5 s 921 ms | 7 s | 3 | 15
65.8 MB | 218,177 | 30 s 846 ms | 25 s | 3 | 15
112 MB | 772,241 | 48 s 559 ms | 30 s | 3 | 15
One day (371 MB) | 2,200,000 | 2.5 minutes | 1.3 minutes | 40 | 15
One week | 15,000,000 | 20 minutes (expected) | 2.5 minutes | 280 | 15
One month | 66,000,000 | 2 hours (expected) | 6 minutes | 1200 | 15
Cluster: CPU Intel Xeon 2.40 GHz, RAM 2 GB, 14 nodes
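To read the timings as speedups, the single-PC time can be divided by the Map/Reduce time for each row. The sketch below only restates the table's numbers in seconds; the one-week and one-month single-PC times are the slide's own estimates, and the names `MEASUREMENTS` and `speedup` are illustrative.

```python
# (data set, single-PC seconds, Map/Reduce seconds), transcribed from the table.
MEASUREMENTS = [
    ("8.84 MB", 5.921, 7.0),
    ("65.8 MB", 30.846, 25.0),
    ("112 MB", 48.559, 30.0),
    ("One day", 2.5 * 60, 1.3 * 60),
    ("One week (single-PC time expected)", 20 * 60, 2.5 * 60),
    ("One month (single-PC time expected)", 2 * 3600, 6 * 60),
]


def speedup(single_pc_seconds, mapreduce_seconds):
    """Ratio above 1 means the Map/Reduce run was faster than the single PC."""
    return single_pc_seconds / mapreduce_seconds


if __name__ == "__main__":
    for label, single_pc, mapreduce in MEASUREMENTS:
        print(f"{label}: {speedup(single_pc, mapreduce):.2f}x")
```

On the smallest file the ratio is below 1, so the cluster is slower than one PC there, while the ratio grows with the data size.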
Process Discovery
• Alpha miner
• Heuristic miner
• Fuzzy miner
• Sequence model
One instance (case) is defined as a single visit by one visitor.
• IP+UA
• CookieID
The activity definition varies with the analysis requirements.
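As a rough illustration of the case definition, the sketch below builds a visitor key from either CookieID or IP+UA and then counts directly-follows pairs, the relation that the Alpha miner and Heuristic miner start from. The record field names (`ip`, `ua`, `cookie_id`) are assumptions, not the project's log schema, and splitting a visitor's events into separate visits is not shown.

```python
from collections import Counter


def case_key(record, use_cookie=False):
    """Identify the visitor by CookieID, or by IP plus user agent (UA)."""
    if use_cookie:
        return record["cookie_id"]
    return f'{record["ip"]}|{record["ua"]}'


def directly_follows(traces):
    """Count how often activity a is immediately followed by activity b."""
    counts = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical log record, for illustration only.
    print(case_key({"ip": "10.0.0.1", "ua": "Mozilla", "cookie_id": "c42"}))
    # Four small example traces written as strings of single-letter activities.
    print(directly_follows(["ADE", "CGBA", "EDF", "F"]))
```

A single-event trace such as F contributes no pairs, so it affects start and end activities but not the directly-follows counts.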
Behavior analysis
User behavior pattern | Range | Activity | Data selection
Interaction between channels | all | ContentType |
Web Map visiting path | all | Referer/URL |
Webpage layout | news | ContentType+PageType+Block | (Channel = news) AND (PageType = 195)
Webpage layout | image | ContentType+PageType+Block | (Channel = image) AND (PageType = 435)
Searching result | all | |
Behavior grouping | all | |
Registration | | |
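The data-selection and activity columns can be read as a filter followed by a key-building step. The following sketch assumes each log record is a dictionary with `Channel`, `ContentType`, `PageType`, and `Block` fields; the sample records and their values are invented for illustration and are not taken from the data set.

```python
def select(records, channel, page_type):
    """Data selection, e.g. (Channel = news) AND (PageType = 195)."""
    return [r for r in records
            if r["Channel"] == channel and r["PageType"] == page_type]


def activity_name(record, fields=("ContentType", "PageType", "Block")):
    """Build the activity label by joining the chosen fields with '+'."""
    return "+".join(str(record[f]) for f in fields)


# Hypothetical records, for illustration only.
SAMPLE = [
    {"Channel": "news", "ContentType": "news", "PageType": 195, "Block": "top"},
    {"Channel": "news", "ContentType": "news", "PageType": 200, "Block": "side"},
    {"Channel": "image", "ContentType": "image", "PageType": 435, "Block": "grid"},
]

if __name__ == "__main__":
    for record in select(SAMPLE, channel="news", page_type=195):
        print(activity_name(record))
    # Coarser activity for the "interaction between channels" pattern.
    print([activity_name(r, fields=("ContentType",)) for r in SAMPLE])
```

Changing the `fields` tuple is how the same log yields coarse activities (ContentType only) for channel interaction and fine-grained ones for page-layout analysis.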
Conclusion
This is a good project for getting into the data analysis field, combining web data analysis, process mining, and cloud computing technology.
Future work:
1) More algorithms and technologies should be applied to this data set.
2) Behavior comparison and user recommendation still need to be accomplished.
3) Can process mining analyze behavior that does not follow a fixed pattern?

1) Log sampling
2) Detect incorrect entries in logs before applying analysis techniques to them.
3) Extend ProM or XESame with a function for converting data from a key-value database or cloud storage to an event log.