Machine Learning for Big Data (CSE 547 / STAT 548)
(Or how to do really kickass research in the age of big data)
Course Staff
Instructor:
• Sham Kakade
TAs:
• Yao Lu
• John Thickstun
CONTENT
What is the course about?
Course Structure
• 5 “case studies”
– Estimating Click Probabilities
– Document Retrieval
– fMRI Prediction
– Collaborative Filtering
– Document Mixed Membership Modeling
• Not comprehensive, but a sample of tasks and associated solution methods
• Methods broadly applicable beyond these case studies
1. Estimating Click Probabilities
• Goal: Predict whether a person clicks on an ad
• Basic method: logistic regression, online learning
[Figure: Query, Ad Info, and Features of user feed a MODEL that outputs “Yes!” or “No”]
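A minimal sketch of the basic method named above, logistic regression trained online with one stochastic gradient step per example. The feature names, the toy stream, and the learning rate are hypothetical and chosen only for illustration; the course's own setup may differ.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_click(weights, features):
    """Return the estimated P(click) for a sparse feature dict {name: value}."""
    z = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return sigmoid(z)

def sgd_update(weights, features, clicked, learning_rate=0.1):
    """One online step: update the weights in place from a single example."""
    error = clicked - predict_click(weights, features)
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) + learning_rate * error * value

# Hypothetical toy stream of (features, clicked) pairs, repeated to mimic a long log.
stream = [
    ({"bias": 1.0, "query_ad_match": 1.0}, 1),
    ({"bias": 1.0}, 0),
    ({"bias": 1.0, "query_ad_match": 1.0, "returning_user": 1.0}, 1),
    ({"bias": 1.0, "returning_user": 1.0}, 0),
] * 200

weights = {}
for features, clicked in stream:
    sgd_update(weights, features, clicked)

p_match = predict_click(weights, {"bias": 1.0, "query_ad_match": 1.0})
p_no_match = predict_click(weights, {"bias": 1.0})
```

Because each update touches only the features present in the current example, the model can be trained on a stream without holding the data set in memory.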
1. Estimating Click Probabilities
• Challenge I: Overfitting, high-dimensional feature space
• Advanced method: L2 regularization, hashing
[Figure: Query, Ad Info, and Features of user feed a MODEL]
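One way to combine the two advanced methods above is the hashing trick with an L2 penalty: feature names are hashed into a fixed number of buckets, so the weight vector never grows, and the penalty shrinks weights toward zero. The sketch below is an illustration under assumptions: the bucket count, the use of MD5 as the hash, the signed-hash variant, and applying the penalty only to the weights touched by each example are all choices made here, not taken from the slides.

```python
import hashlib
import math

NUM_BUCKETS = 2 ** 10  # assumed fixed model size, independent of the vocabulary

def bucket_and_sign(feature_name):
    """Map a feature name to (bucket index, +1/-1 sign) using a stable hash."""
    digest = hashlib.md5(feature_name.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "little")
    index = value % NUM_BUCKETS                   # low bits pick the bucket
    sign = 1.0 if (value >> 63) & 1 else -1.0     # top bit picks the sign
    return index, sign

def hash_features(raw_features):
    """Turn {name: value} into a sparse {bucket: value} vector."""
    hashed = {}
    for name, value in raw_features.items():
        index, sign = bucket_and_sign(name)
        hashed[index] = hashed.get(index, 0.0) + sign * value
    return hashed

def sgd_step_l2(weights, hashed, label, learning_rate=0.1, l2=1e-3):
    """One logistic-regression SGD step; returns the prediction made before the update.

    The L2 penalty is applied only to the weights this example touches,
    which is a common approximation for sparse data.
    """
    z = sum(weights[i] * v for i, v in hashed.items())
    p = 1.0 / (1.0 + math.exp(-z))
    for i, v in hashed.items():
        gradient = (p - label) * v + l2 * weights[i]
        weights[i] -= learning_rate * gradient
    return p

weights = [0.0] * NUM_BUCKETS
example = hash_features({"query=shoes": 1.0})  # hypothetical feature name
for _ in range(50):
    last_prediction = sgd_step_l2(weights, example, label=1)
```

A feature name that was never seen in training still hashes to a valid bucket, so prediction code does not need a vocabulary lookup.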
1. Estimating Click Probabilities
• Challenge II: Dimension of feature space changes
– New word, new user attribute, etc.
• Advanced method: sketching, hashing
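The slide does not say which sketch is meant. As one well-known example, a count-min sketch keeps approximate counts in a table of fixed size, so new words or attributes can arrive without changing the memory footprint. The class below is a sketch under assumptions: the width, the depth, and the salted SHA-256 hashing are illustrative choices.

```python
import hashlib

class CountMinSketch:
    """Approximate counter with `depth` hash rows of `width` cells each."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One stable hash per row, derived by salting the item with the row index.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "little") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum over rows
        # never underestimates the true count.
        return min(self.table[row][col] for row, col in self._cells(item))

sketch = CountMinSketch()
for word in ["ad", "click", "ad", "query", "ad"]:
    sketch.add(word)
```

The estimate is an upper bound on the true count; a wider table lowers the chance that collisions inflate it.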
2. Document Retrieval
• Goal: Retrieve documents of interest
• Methods: fast K-NN, k-means, mixture models, Hadoop
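Of the methods listed, k-means is the simplest to show in a few lines. The sketch below is plain Lloyd's algorithm on toy 2-D points standing in for document vectors; the data, the seed, and the initialization from randomly sampled points are assumptions for illustration.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: alternate an assignment step and a centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct data points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

# Hypothetical document vectors forming two well-separated groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centroids, assignments = kmeans(points, k=2)
```

For retrieval, clusters like these could be used to restrict a nearest-neighbor search to the query's closest cluster, which trades some accuracy for speed.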
3. fMRI Prediction
• Goal: Predict word probability from fMRI image
• Challenge: p >> n (feature dimension >> sample size)
• Methods: L1 regularization (LASSO), parallel learning
[Figure: fMRI image → MODEL → “HAMMER” or “HOUSE”]
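To illustrate why L1 regularization helps when p >> n, here is a minimal LASSO solver using cyclic coordinate descent with soft-thresholding. The toy design matrix is constructed by hand (3 samples, 7 features, only feature 0 informative) and the penalty value is arbitrary; none of it comes from the course data.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, clipping small values to exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(X, y, lam, sweeps=100):
    """Minimize (1/(2n)) * ||y - Xw||^2 + lam * ||w||_1 one coordinate at a time."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = [float(v) for v in y]  # equals y - Xw because w starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n + col_sq[j] * w[j]
            new_wj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_wj - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * X[i][j]
                w[j] = new_wj
    return w

# Toy data: feature 0 drives y; the other columns are orthogonal to it.
columns = [[1, 2, 3], [1, 1, -1], [3, 0, -1], [2, -1, 0],
           [0, 3, -2], [1, -2, 1], [-1, -1, 1]]
X = [list(row) for row in zip(*columns)]  # 3 samples x 7 features
y = [2, 4, 6]                             # y = 2 * feature 0
w = lasso_coordinate_descent(X, y, lam=0.5)
```

The soft-threshold step is what sets the uninformative weights to exactly zero, and the penalty also shrinks the informative weight slightly below its least-squares value of 2.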
3. fMRI Prediction
• Goal: Predict fMRI image for given stimulus
• Challenge: zero shot learning (generalization)
• Methods: features of words, Mechanical Turk, graphical LASSO
[Figure: Features of word → MODEL; example words GIRAFFE and HORSE]
4. Collaborative Filtering
• Goal: Find movies of interest to a user based on movies watched by the user and others
• Methods: matrix factorization, latent factor models, GraphLab
City of God
Wild Strawberries
The Celebration
La Dolce Vita
Women on the Verge of a Nervous Breakdown
What do I recommend???
4. Collaborative Filtering
• Challenge: Cold-start problem (new movie or user)
• Methods: use features of movie/user
[Figure: IN THEATERS]
5. Document Mixed Membership
• Challenge: Document may belong to multiple clusters
• Methods: mixed membership models (e.g., LDA), distributed Gibbs, stochastic variational inference
[Figure: a document associated with the topics EDUCATION, FINANCE, and TECHNOLOGY]
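To make the mixed membership idea concrete, here is a small single-machine collapsed Gibbs sampler for LDA. It is a sketch under assumptions: the documents are toy lists of word ids, the hyperparameters are arbitrary, and the distributed Gibbs and stochastic variational methods named above are not shown.

```python
import random

def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              sweeps=50, seed=0):
    """Collapsed Gibbs sampling for LDA; each doc is a list of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]               # tokens per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics
    assignments = []
    for d, doc in enumerate(docs):  # random initial topic for every token
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)
    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                z = assignments[d][n]
                # Remove this token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][n] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word

def topic_proportions(counts, alpha=0.1):
    """Smoothed per-document topic mixture from its topic counts."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]

# Hypothetical corpus over a 6-word vocabulary; the third doc mixes both word groups.
docs = [[0, 1, 0, 1, 2], [3, 4, 5, 4, 3], [0, 1, 3, 4, 2, 5]]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2, vocab_size=6)
mixtures = [topic_proportions(row) for row in doc_topic]
```

Each entry of `mixtures` is a distribution over topics, so a single document can belong to several clusters at once instead of being assigned to exactly one.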
Scalability
• Throughout case studies, introduce notions of parallel learning and distributed computations
Assumed Background
Official Prereq (strict): CSE 546 or STAT 535
Specific topics:
• Linear and logistic regression, ridge regression, LASSO
• Basic optimization (e.g., gradient descent, SGD)
• Perceptron algorithm
• K-NN, k-means, EM algorithm
Comfortable with:
• Java or Python
• Probabilistic and statistical reasoning
Computational and mathematical maturity
LOGISTICS
How is the course going to operate?
Website and Catalyst
• Course website: courses.cs.washington.edu/courses/cse547/16sp/index.html
• Canvas:
– Used for all discussions
– Post all questions there (unless personal)
– Homework collection
– Personal: [email protected]
Reading
• No req’d textbook, but background reading in:
“Machine Learning: A Probabilistic Perspective”
Kevin P. Murphy
• Readings will be from papers linked to on course website
• Please do reading before lecture on topic
Homework
• 4 HWs, approx one for each case study
• Collaboration allowed, but write-ups and coding must be done individually
• You must submit your code.
• On due date, due at beginning of class time
• Allowed 2 “late days” for entire quarter
• YOU MUST SUBMIT ALL HW TO PASS THE COURSE (EVEN IF IT IS FOR 0 CREDIT)
• 3rd assignment (the “midterm”) must be completed individually
Project
• Individual, or teams of two
• New work, but can be connected to research
• Schedule:
– Proposal (1 page) – April 19
– Progress report (3 pages) – May 12
– Poster presentation – Thursday, June 2, 9:00-11:00am (??)
– Final report (8 pages, NIPS format) – June 7
Grading
• HWs 1, 2, 4 (15% each)
• HW 3 (20%) – midterm exam
• Final project (35%)
• GRADING QUESTIONS: All regrading and policy-change requests must be made by email at [email protected]. In-person discussions with TAs or instructors are limited to knowledge-based questions. Regrading may raise or lower the score on any part of the HW set.
Support/Resources
• Office Hours
– TBD
• Discussion Board
Conclusion
• I like Big Data and I cannot lie
[INSERT SONG HERE]
Or, let’s just carry on with the first lecture…