Leveraging Hadoop to mine customer insights in a developing market
Leveraging Hadoop in Wikimart
Roman Zykov
Head of analytics, http://wikimart.ru
London, Big Data World Europe, 20th September 2012
Key problem
To be or not to be….
Hadoop
Introduction
Key tasks for Wikimart
What
• BI tasks
• Web analytics (in-house solution)
• Recommendations on site
• Data services for marketing
Who
• Core analytics team
• Analytics members in other departments
• IT site operations
Problem
Too time-consuming or too expensive?
• Data volume
• # of data services
[Diagram labels: DATA, Standalone, Map Reduce]
Our idea
New platform for “Big Data” tasks only
• Start research on Map Reduce software
• First patient - recommendation engine
Difficulties
- no planned budget -> Hadoop is free
- no experts -> learn it
- no hardware -> virtual cluster
Requirements for Hadoop
• Easily scalable
• Easy deployment
• Easy integration
• Less low-level Java coding
• SQL-like queries
Data flow
[Diagram labels: Data feeds, DWH]
Accomplishments
Recommendations
• Collaborative filtering (item-to-item on browsing history, Pig)
• Similar products (item attributes, Pig)
• Most popular items (browsing history + orders, HiveQL)
• Internal and external search recommendations (HiveQL)
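The item-to-item collaborative filtering listed above can be illustrated with a small sketch. The slides say only that it runs on browsing history in Pig; the Python below is a stand-in for the idea, and the cosine score on co-views, the function name `item_to_item`, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def item_to_item(sessions, top_n=3):
    """sessions: dict mapping a visitor id to the list of item ids they viewed."""
    views = defaultdict(int)     # item -> number of visitors who viewed it
    co_views = defaultdict(int)  # (item_a, item_b) -> visitors who viewed both

    for items in sessions.values():
        unique = sorted(set(items))
        for item in unique:
            views[item] += 1
        for a, b in combinations(unique, 2):
            co_views[(a, b)] += 1

    similar = defaultdict(list)
    for (a, b), both in co_views.items():
        # Cosine similarity between the two items' binary "viewed by" vectors.
        score = both / sqrt(views[a] * views[b])
        similar[a].append((b, score))
        similar[b].append((a, score))

    # Keep the top_n neighbours per item, best score first, ties broken by name.
    return {
        item: sorted(neighbours, key=lambda pair: (-pair[1], pair[0]))[:top_n]
        for item, neighbours in similar.items()
    }


# Hypothetical browsing history: visitor id -> viewed items.
history = {
    "u1": ["phone", "case", "charger"],
    "u2": ["phone", "case"],
    "u3": ["phone", "laptop"],
}
recommendations = item_to_item(history)
```

Here `recommendations["phone"]` ranks "case" first, because two of the three visitors who viewed the phone also viewed the case.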
Some statistics after 1 year
• >10% of revenue
• 3 months to launch
• Tens of gigabytes are processed daily in 2 hours
• 1 crash only (cluster lost power)
Decision: invest in a hardware cluster
End user
Internal high-level languages
• HiveQL
• Pig
Reporting
• Pre-aggregated data for OLAP
• RDBMS - front end
• OLAP and reporting software should support HiveQL
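As a rough illustration of the pre-aggregation point above, the sketch below collapses raw order rows into one summary row per date and category, the kind of small table an RDBMS front end or OLAP tool could serve quickly. The column names, the grouping keys, and the sample rows are assumptions, not taken from the slides.

```python
from collections import Counter


def pre_aggregate(events):
    """Collapse raw (date, category, amount) rows into one row per (date, category)."""
    totals = Counter()  # (date, category) -> summed amount
    counts = Counter()  # (date, category) -> number of raw rows
    for date, category, amount in events:
        totals[(date, category)] += amount
        counts[(date, category)] += 1
    return [
        {"date": d, "category": c, "orders": counts[(d, c)], "revenue": totals[(d, c)]}
        for d, c in sorted(totals)
    ]


# Hypothetical raw rows as they might sit in the cluster.
raw_orders = [
    ("2012-09-01", "electronics", 1200),
    ("2012-09-01", "electronics", 800),
    ("2012-09-01", "books", 300),
    ("2012-09-02", "books", 150),
]
summary = pre_aggregate(raw_orders)
```

Four raw rows become three summary rows; the heavy scan happens offline, and reports read only the summary.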
Data Integration
• Sqoop
• Parallel data exchange with RDBMS (MS SQL, MySQL, Oracle, Teradata…)
• Incremental updates
• HDFS, Hive, HBase
• Talend Open Studio
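The idea behind the incremental updates mentioned above can be sketched in a few lines: remember the highest value of a check column already imported, and copy only rows above it on the next run. Sqoop's append mode works along similar lines with a check column and a last value, but the code below is an illustration of the principle only, not Sqoop's implementation; `incremental_import` and the sample tables are hypothetical.

```python
def incremental_import(source_rows, target, last_value, check_column="id"):
    """Append only rows whose check_column exceeds the last imported value.

    Returns the new high-water mark to store for the next run.
    """
    new_rows = [row for row in source_rows if row[check_column] > last_value]
    target.extend(new_rows)
    return max((row[check_column] for row in new_rows), default=last_value)


# Hypothetical source table in the RDBMS and an empty target in the cluster.
orders_table = [{"id": 1, "total": 500}, {"id": 2, "total": 120}]
warehouse = []

# First run: nothing imported yet, so both rows are copied.
mark = incremental_import(orders_table, warehouse, last_value=0)

# A new order arrives; the second run copies only that row.
orders_table.append({"id": 3, "total": 90})
mark = incremental_import(orders_table, warehouse, last_value=mark)
```

This simple form only picks up newly appended rows; rows updated in place would need a different check, such as a last-modified timestamp.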
Hadoop vs RDBMS
• Never a replacement for an RDBMS:
  • Latency
  • Weak capabilities of HiveQL vs SQL
• Suited only to some offline processing tasks:
  • Machine learning
  • Queries on big tables
  • ….
• Real time: NoSQL
Hadoop myth
Terabytes?
Petabytes?
Big tasks!
Conclusion
• Hadoop is not Rocket Science
• Intermediate data can be Big Data
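One plausible reading of "intermediate data can be Big Data", assumed here rather than stated in the slides, is the pair-generation step of item-to-item recommendations: a visitor who viewed n items contributes n(n-1)/2 item pairs, so the intermediate data grows quadratically while the input grows linearly. The visitor count below is hypothetical.

```python
def pair_count(items_viewed):
    """Number of unordered item pairs produced by one visitor's history."""
    return items_viewed * (items_viewed - 1) // 2


visitors = 100_000  # hypothetical number of visitors

for items_viewed in (5, 20, 50):
    raw_rows = visitors * items_viewed               # input: one row per view
    pair_rows = visitors * pair_count(items_viewed)  # intermediate: one row per pair
    print(f"{items_viewed} items each: {raw_rows:,} view rows -> {pair_rows:,} pair rows")
```

At 50 views per visitor, each visitor yields 1,225 pairs, so a modest input becomes an intermediate data set roughly 25 times larger.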
Starter kit
• Hadoop management system
• Virtual hardware (cloud, virtual servers, etc.)
• Offline data tasks
• Pig or HiveQL
• Sqoop: import data from existing data sources