Hi, I’m Eugene. I’m here to share about my data science journey and what I do at Lazada.
4th April 2016
SMU Masters of IT in Business
Studied Psychology and Business at Singapore Management University (SMU); wanted to use data to create positive impact
Collected and analyzed tweets to provide insight on tweet share and sentiment for an electronics conglomerate
Then, was transferred to the workforce analytics team, working on data from IBM’s 450k employees to build…
Job recommendation engine to increase internal transfers, skill renewal, satisfaction, and reduce attrition
Written and verbal communication from essays and presentations (SMU), and briefs and stakeholder engagement with industry leaders (MTI)
Skill sets needed to be a data scientist and how I acquired them
- Statistics
- Experimental Design
- SPSS & R
- Communication
- Teamwork
More R via MOOCs:
- Data Analysis and Statistical Inference (Duke)
- Computing for Data Analysis (Johns Hopkins)
Python via MOOCs:
- Computer Science and Programming in Python (MIT)
- Interactive Programming in Python (Rice)
Machine Learning via MOOCs:
- Machine Learning (Stanford)
- Statistical Learning (Stanford)
- Social and Economic Networks (Stanford)
- Text Mining and Analytics (Urbana-Champaign)
Distributed storage and processing via MOOCs:
- Mining Massive Datasets (Stanford)
- Big Data with Apache Spark (UC Berkeley)
- Scalable Machine Learning with Apache Spark (UC Berkeley)
Volunteer for things people don’t want to do
- Volunteered for project on Twitter tracking with $0 budget
Twitter project: Connect to API, download tweets 24/7 over 2 weeks, analyze tweets; learnt how to:
- Work with APIs
- Recover from failure automatically
- Work with data that can’t fit in memory
- Run text analytics and sentiment analysis
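Two of the lessons from the Twitter project, recovering from failure automatically and working with data that can’t fit in memory, can be sketched roughly as below. This is only an illustration, not the code used in the project: `with_retries`, `count_mentions`, the `fetch` callable, and the choice of `ConnectionError` as the failure to retry on are all hypothetical names and assumptions.

```python
import time
from collections import Counter
from typing import Callable, Iterable


def with_retries(fetch: Callable[[], list], max_attempts: int = 5,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> list:
    """Call fetch(), retrying with exponential backoff on connection errors."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
    raise ValueError("max_attempts must be at least 1")


def count_mentions(lines: Iterable[str], keywords: Iterable[str]) -> Counter:
    """Count tweets mentioning each keyword, reading one tweet at a time."""
    lowered_keywords = [k.lower() for k in keywords]
    counts: Counter = Counter()
    for line in lines:  # an open file works here, so memory use stays flat
        text = line.lower()
        for keyword in lowered_keywords:
            if keyword in text:
                counts[keyword] += 1
    return counts
```

Passing an open file handle as `lines` (one tweet per line, an assumed storage format) means only one tweet is held in memory at a time, which is the point of the streaming approach.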
Skill sets to be a better data scientist (what I’m focusing on now)
- Statistics
- Experimental Design
- SPSS & R
- Communication
- Teamwork
- Python
- SQL
- Machine Learning
- Distributed Storage & Processing
My journey so far…
- Statistics
- Experimental Design
- SPSS & R
- Communication
- Teamwork
- Python
- SQL
- Machine Learning
- Distributed Storage & Processing
- Finding use cases
- Software Engineering
- Designing data products
- Spark & Scala
So what can you do?
- Get very good at basic SQL
- Get very good at either R or Python
- Understand basic machine learning techniques
- Understand distributed systems and processing
- Improve communication by writing and sharing
- Get experience by doing projects on machine learning and distributed processing (e.g., open data, volunteering, Kaggle)
A rough guide to each role
Engineers: Collect, store, maintain
Scientists: Explore, prepare, model
Tool Developers: Expose, integrate, platform-ize
Lines may blur between roles
Product-related:
- Product Categorization
- Attribute Extraction
- Spam Detection
- Image Quality Checking
Product categorization
Product title & description
Machine Learning Categorization
Rules-based Categorization
Crowd Categorization
Product Category
Quality Checking and Validation
Sufficient confidence
If insufficient confidence
API for self-service
Production
Scheduled batch jobs
Product Category
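One plausible reading of the categorization flow above is a confidence-based fallback: accept the machine learning category when confidence is sufficient, otherwise try rules, and otherwise send the product to the crowd. The sketch below illustrates that reading only. The order of the fallbacks, the 0.8 threshold, the `RULES` table, and the `model` callable returning `(category, confidence)` are all assumptions, not details from the talk.

```python
from typing import Callable, List, Optional, Tuple

CONFIDENCE_THRESHOLD = 0.8  # assumed cut-off, not a figure from the talk

# Hypothetical keyword rules for the rules-based step.
RULES = {"iphone": "Mobiles", "sofa": "Furniture"}


def rules_categorize(title: str) -> Optional[str]:
    """Return the category of the first matching keyword rule, if any."""
    lowered = title.lower()
    for keyword, category in RULES.items():
        if keyword in lowered:
            return category
    return None


def categorize(title: str,
               model: Callable[[str], Tuple[str, float]],
               crowd_queue: List[str],
               threshold: float = CONFIDENCE_THRESHOLD) -> Tuple[Optional[str], str]:
    """Return (category, source); queue the title for the crowd if unresolved."""
    category, confidence = model(title)
    if confidence >= threshold:  # sufficient confidence: accept the prediction
        return category, "machine_learning"
    rule_category = rules_categorize(title)  # insufficient: fall back to rules
    if rule_category is not None:
        return rule_category, "rules"
    crowd_queue.append(title)  # still unresolved: hand over to the crowd
    return None, "crowd"
```

Returning the source alongside the category makes it easy to measure how much traffic each step handles, which is useful input for the quality checking step.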
Product Ranking for onsite display
Product Data
Purchase Data
Behavioral Data (e.g., clickstream)
Other Data (e.g., ratings, etc)
Merging datasets
Feature Engineering
Model product rankings
Data Cleaning
Rule-based modifiers
Measurement & A/B Testing
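The measurement step is not detailed in the slides. As one common way to read an A/B test on a conversion metric, here is a minimal sketch of a two-proportion z-test using only the standard library. The function name and the choice of this particular test are assumptions for illustration, not the method used at Lazada.

```python
import math
from typing import Tuple


def two_proportion_z_test(conv_a: int, n_a: int,
                          conv_b: int, n_b: int) -> Tuple[float, float]:
    """Compare conversion rates of control (a) and variant (b).

    Returns (z, two-sided p-value) under the pooled-variance normal
    approximation, which assumes reasonably large samples.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p_value
```

A call such as `two_proportion_z_test(100, 1000, 150, 1000)` compares a 10% control rate against a 15% variant rate; a small p-value suggests the difference is unlikely to be chance alone.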
Recommendations for newsletter subscribers
Product Data
Purchase Data
Behavioral Data (e.g., clickstream)
Other Data (e.g., ratings, etc)
Merging datasets
Feature Engineering
Data Cleaning
Customer Segmentation
Forecasted Top Sellers
Recommendations Newsletter Creation
Measurement & A/B Testing
Rule-based modifiers
Coding Breakdown
- Data Preparation, 50%
- Modeling, 20%
- Productionizing, 30%
Majority of time spent coding (thankfully)
- Coding, 55%
- Engagement, 30%
- Others, 15%
Data Preparation
- Merging data
- Imputing nulls
- Removing duplicates
- Handling outliers
- Fixing formats
- Etc, etc, etc
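Several of the data preparation steps above can be illustrated on a toy dataset with the standard library alone. The sketch below covers fixing formats, removing duplicates, imputing nulls, and handling outliers; merging is left out. The record schema (`sku`, `price`), the median imputation, and the 1.5 × IQR capping rule are assumptions chosen for illustration, not the choices made in the actual pipelines.

```python
import statistics
from typing import Optional


def parse_price(raw) -> Optional[float]:
    """Fixing formats: turn '1,299.00' or 1299 into a float; keep None as None."""
    if raw is None or raw == "":
        return None
    return float(str(raw).replace(",", ""))


def prepare(records: list) -> list:
    """Clean a list of {'sku': ..., 'price': ...} dicts (hypothetical schema)."""
    # Removing duplicates: keep the first record seen for each SKU.
    seen, rows = set(), []
    for record in records:
        if record["sku"] not in seen:
            seen.add(record["sku"])
            rows.append({"sku": record["sku"],
                         "price": parse_price(record["price"])})

    known = [row["price"] for row in rows if row["price"] is not None]
    if len(known) < 2:
        return rows  # too little data to impute or to estimate quartiles

    # Imputing nulls: fill missing prices with the median of the known ones.
    median = statistics.median(known)
    # Handling outliers: cap prices more than 1.5 * IQR outside the quartiles.
    q1, _, q3 = statistics.quantiles(known, n=4)
    low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    for row in rows:
        price = median if row["price"] is None else row["price"]
        row["price"] = min(max(price, low), high)
    return rows
```

Capping rather than dropping outliers keeps every product in the dataset, which matters when each row must still receive a category or ranking downstream.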
Deploying to production
- Proof-of-concept
- Developing API
- Scheduling jobs
- Continuous integration
- Fixing bugs
Engagement (with stakeholders)
- Roadmap planning (quarterly)
- Aligning solution with problem
- Explaining and getting buy-in