Posted on 07-Jan-2017
StudySapuri Data Analytics Platform with Treasure Data
Tetsuo Yamabe, Recruit Marketing Partners Co., Ltd.
(C) Recruit Marketing Partners Co.,Ltd. All rights reserved.
About Me
Tetsuo Yamabe
Data Engineer / Ph.D. (Eng)
Communication Design Group, Business Development Department, Online Learning Development Office, Education & Learning Business Division
• Joined RMP in Aug. 2015
• 10 months of TD experience
• Data analytics platform development for our online learning service (a.k.a. StudySapuri)
• From 980 JPY / month
• Individual and in-class business models
[Figure: Individual / In class]
http://www.slideshare.net/Seigen/ss-61816140
Adaptive Learning for personalized LX
Collaborative research with Matsuo Lab. at Tokyo Univ.
Outline
1. Background
2. Platform Migration and TD
3. Technical Details
4. Challenges and Future Work
5. Conclusion
1. Background
[Diagram: companies involved: Recruit Marketing Partners, Recruit Technologies, Quipper]
Quipper
• “Distributors of Wisdom”
  ‒ Japanese EdTech company launched in London
  ‒ Teacher-student communication support system
• Worldwide presence in the global education scene
  ‒ London, Tokyo, Manila, Jakarta, Mexico City
  ‒ Open culture with strong engineering competence
  ‒ Acquired by Recruit Marketing Partners in Apr. 2015
[Diagram: infrastructure before and after 2016.2.25: Recruit private cloud (before), AWS (after)]
2. Platform Migration and TD
Before “Quipper Migration”
• Main usage
  ‒ KPI monitoring
  ‒ Ad hoc user activity analytics
• Used together with private Hadoop
  ‒ WebHive
Before “Quipper Migration”
[Diagram: raw tables/logs (member attributes, activity logs) and transformed tables; teams shown: Data, Ops]
Extract, Transform and Load Pattern
Pros
• Easy to use (simple schema, aggregated information)
• Easy to maintain (data team perspective)
• Reduced size of information and logs
Cons
• Inflexible: data sources and schema definitions are fixed
• Bloating tables
• Black-boxed transformation
• Communication cost across divisions/companies
After “Quipper Migration”
[Diagram: raw tables/logs (member attributes, activity logs), scooped tables and transformed tables; teams shown: Data, Infra, Dev]
Extract, Load and Transform Pattern
Pros
• You have everything you need/want
• Fully aggregated data in TD
Cons
• Duplicated business logic
• Batch process maintenance cost
• Data volume and load time
• Learning cost (app data and internal architecture)
• Contents Performance Monitoring
• Customer Support Support
• Students Performance Report
• Class Status Report
• KPI Monitoring
• Salesman Support
• Developer Support
• Prototyping New Feature
• Data Science Support
Fact Sheet
• 50+ tables are imported daily by Embulk
• 30+ Hive queries are invoked by Luigi
• 10+ Presto queries are scheduled in the TD web console
• 20+ reports are delivered to 5 business divisions
3. Technical Details
System Overview
[Diagram labels: Application (client side), TD SDK; Application (server side), streaming insert, Kinesis, Lambda; Databases, bulk import; PlazmaDB; DataTank; join w/ FDW; payment logs; video info]
Featured Topics
• Client-side events
  ‒ SPA event tracking
  ‒ Customized TD tag
• Server-side events
  ‒ Streaming insert with Kinesis + Lambda
• td-client-python
  ‒ Durability improvement
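The server-side path (streaming insert with Kinesis + Lambda) could look roughly like the sketch below. The deck does not show the real handler, so this is only an illustration: it assumes events are put on the stream as JSON, and `send_to_td` is a hypothetical callable standing in for whatever actually uploads rows to TD.

```python
import base64
import json


def decode_kinesis_records(event):
    """Decode the base64-encoded JSON payloads of a Kinesis-triggered Lambda event."""
    rows = []
    for record in event.get("Records", []):
        payload = base64.b64decode(record["kinesis"]["data"])
        rows.append(json.loads(payload))
    return rows


def make_handler(send_to_td):
    """Build a Lambda handler; send_to_td is a hypothetical uploader (assumption)."""
    def handler(event, context):
        rows = decode_kinesis_records(event)
        if rows:
            send_to_td(rows)
        return {"forwarded": len(rows)}
    return handler


if __name__ == "__main__":
    # Build a sample event shaped like the one Lambda receives from Kinesis.
    body = json.dumps({"user_id": "user_0001", "action": "play"}).encode()
    sample = {"Records": [{"kinesis": {"data": base64.b64encode(body).decode()}}]}
    received = []
    handler = make_handler(received.extend)
    print(handler(sample, None), received)
```

Passing the uploader in as an argument keeps the decoding logic testable without any AWS or TD credentials.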
Featured Topics
• DataTank
  ‒ Isolate sensitive information from PlazmaDB
  ‒ Data mart store to connect BI
• Luigi
  ‒ Define data transforming jobs with table dependencies
  ‒ Invoke Embulk commands inside Luigi jobs
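The idea of ordering transform jobs by table dependency can be sketched without Luigi itself. In Luigi each task declares its upstream tasks through `requires()`; the library-free sketch below only mimics that ordering with the standard library. All table names and config paths are hypothetical, and the Embulk command is built but not executed.

```python
from graphlib import TopologicalSorter

# Hypothetical table dependencies: each key is built from the tables in its set.
DEPENDENCIES = {
    "raw_members": set(),
    "raw_activity_logs": set(),
    "transformed_daily_activity": {"raw_members", "raw_activity_logs"},
    "report_kpi": {"transformed_daily_activity"},
}

# Hypothetical Embulk config files for the tables that are bulk-imported.
EMBULK_CONFIGS = {
    "raw_members": "config/raw_members.yml",
    "raw_activity_logs": "config/raw_activity_logs.yml",
}


def build_order(dependencies):
    """Return table names so that every table comes after the tables it depends on."""
    return list(TopologicalSorter(dependencies).static_order())


def command_for(table):
    """Return the Embulk command for an imported table, or None for a query-built table."""
    config = EMBULK_CONFIGS.get(table)
    return ["embulk", "run", config] if config else None


if __name__ == "__main__":
    for table in build_order(DEPENDENCIES):
        print(table, command_for(table))
```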
Featured Topics
• Bulk import
  ‒ Cross import from MongoDB and PostgreSQL to PlazmaDB and DataTank
    • embulk-input-mongodb
    • embulk-input-postgresql
    • embulk-filter-insert
    • embulk-filter-eval
    • embulk-output-td
    • embulk-output-postgresql
4. Challenges and Future Work
Scooped raw tables
Transformed tables
Report tables / marts
Scheduled queries in web console
• Select all without conditions
• Assign column names in Japanese
• Export results to Google Spreadsheet
Transform tables in Luigi tasks
Record Set Versioning at Transforming Phase
[Diagram: partition-based versioning pattern. Tables A, B and C each accumulate one record set (user_0001, user_0002, user_0003) per day (2016/03/31, 2016/04/01, 2016/04/02); each day's record set is appended to the same table.]
Record Set Versioning at Transforming Phase
[Diagram: table-based versioning pattern. A new table is created per day (Table A_yyyymmdd, Table B_yyyymmdd, Table C_yyyymmdd) for 2016/03/31, 2016/04/01 and 2016/04/02, each holding that day's record set (user_0001, user_0002, user_0003).]
Record Set Versioning at Transforming Phase
• Table-based versioning doesn’t fit TD
  ‒ A growing number of tables degrades query performance
  ‒ A UNION over all the tables is needed
  ‒ Appending and removing tables is not realistic
• Partition-based versioning with a “once a day” rule
  ‒ Drop the daily partition before inserting records
  ‒ An ALTER TABLE capability would be helpful to invoke drop partition in a query
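The “drop the daily partition, then insert” rule makes a daily job safe to re-run. A minimal sketch of that behavior follows, with the caveat that SQLite and a plain `dt` column are stand-ins chosen for illustration (an assumption); TD's own partitioning and deletion commands are not shown in the deck and are not modeled here.

```python
import sqlite3


def load_daily_partition(conn, dt, user_ids):
    """Replace the record set for one day: drop the day's rows, then insert."""
    with conn:  # one transaction, so a failed insert does not leave the day empty
        conn.execute("DELETE FROM table_a WHERE dt = ?", (dt,))
        conn.executemany(
            "INSERT INTO table_a (dt, user_id) VALUES (?, ?)",
            [(dt, user_id) for user_id in user_ids],
        )


def count_rows(conn, dt):
    """Count the rows currently stored for one day."""
    return conn.execute(
        "SELECT COUNT(*) FROM table_a WHERE dt = ?", (dt,)
    ).fetchone()[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table_a (dt TEXT, user_id TEXT)")
    users = ["user_0001", "user_0002", "user_0003"]
    load_daily_partition(conn, "2016-04-01", users)
    load_daily_partition(conn, "2016-04-01", users)  # re-run the same day
    print(count_rows(conn, "2016-04-01"))
```

Running the load twice for the same day leaves a single copy of that day's record set, which is the point of the “once a day” rule.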
Reuse Application’s Business Logic
• Frequently used clauses should be defined as a common UDF or view
  ‒ Incl. schema definitions, constant definitions, etc.
  ‒ TD is missing both UDF and view features
• Pre-transform complicated tables on the application side before loading into TD?
  ‒ Hybrid approach
  ‒ Reuse application code
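Without UDFs or views, one possible workaround is to keep frequently used clauses and constants in a single module and render queries from them. The sketch below illustrates that idea only; the clause, the plan codes, and the table and column names are all hypothetical and do not come from the actual application.

```python
# Hypothetical shared definitions; ideally these live in one module that
# both the application and the analytics jobs import.
ACTIVE_MEMBER_CLAUSE = "status = 'active' AND deleted_at IS NULL"
PLAN_NAMES = {1: "individual", 2: "in_class"}


def plan_case_expression(column):
    """Render a CASE expression mapping plan codes to names from the shared constants."""
    branches = " ".join(
        f"WHEN {code} THEN '{name}'" for code, name in sorted(PLAN_NAMES.items())
    )
    return f"CASE {column} {branches} ELSE 'unknown' END"


def active_members_by_plan_query(table):
    """Build a query that reuses the shared clause instead of repeating it by hand."""
    return (
        f"SELECT {plan_case_expression('plan_code')} AS plan, COUNT(*) AS members "
        f"FROM {table} WHERE {ACTIVE_MEMBER_CLAUSE} GROUP BY 1"
    )


if __name__ == "__main__":
    print(active_members_by_plan_query("members"))
```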
Other topics
• Increasing users across divisions
  ‒ Account management (incl. dev/ops/biz)
  ‒ Contention for Presto resources
  ‒ Large file delivery via web console
• Presto/Hive query testing framework
  ‒ Test against a small dataset with a Presto/Hive SQL interface?
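One way to approximate testing a query against a small dataset is to run it on an in-memory fixture. In the sketch below SQLite stands in for Presto/Hive, which is a substitution rather than the real engines, so only SQL that all of them accept can be checked this way; the table, columns and query are hypothetical.

```python
import sqlite3

# Query under test; restricted to SQL that SQLite and Presto/Hive both accept.
DAILY_ACTIVE_USERS = """
SELECT dt, COUNT(DISTINCT user_id) AS dau
FROM activity_logs
GROUP BY dt
ORDER BY dt
"""


def run_on_fixture(query, rows):
    """Run a query against a tiny in-memory fixture table and return all result rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE activity_logs (dt TEXT, user_id TEXT)")
        conn.executemany("INSERT INTO activity_logs VALUES (?, ?)", rows)
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    fixture = [
        ("2016-04-01", "user_0001"),
        ("2016-04-01", "user_0001"),  # duplicate activity by the same user
        ("2016-04-01", "user_0002"),
        ("2016-04-02", "user_0003"),
    ]
    print(run_on_fixture(DAILY_ACTIVE_USERS, fixture))
```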
5. Conclusion
Success Factors
• TD let us focus on understanding the application and communicating with Quipper engineers
  ‒ Fully managed Hadoop service
  ‒ Quick responses from customer support
• Different DBs, but still in the same TD
  ‒ No extra cost for cross-database JOINs
  ‒ Continuous analytics with JukenSapuri data
Success Factors
• Quipper’s culture and strong skills were really helpful in setting up a data analytics platform for their application
  ‒ The global market already had a BQ-based platform
  ‒ Open information and communication
• Slack x GitHub x Google Drive
  ‒ Clean, readable code
  ‒ HRT: Humanity, Respect, and Trust
• Cultural convergence between Quipper and RMP
Conway’s Law?
[Diagram: Data, Infra and Dev teams; casual open communication over chat + PR]
Beyond Monitoring and Reporting
• Sophisticated machine learning with Hivemall
• Realtime data processing and feed to the application
“Distributors of Wisdom” x “Delivering the best learning to the ends of the earth”