Recommending Semantic Nearest Neighbors Using Storm and Dato
Uploaded by: ashok-venkatesan
Category: Engineering
![Page 1: Recommending Semantic Nearest Neighbors Using Storm and Dato](https://reader034.fdocuments.us/reader034/viewer/2022051503/5a6899cb7f8b9a4a258b6669/html5/thumbnails/1.jpg)
Presented at Dato Conf, SF
Personalization @ StumbleUpon
Recommending "Semantic Nearest Neighbors"
using Storm and Dato
![Page 2: Recommending Semantic Nearest Neighbors Using Storm and Dato](https://reader034.fdocuments.us/reader034/viewer/2022051503/5a6899cb7f8b9a4a258b6669/html5/thumbnails/2.jpg)
OVERVIEW
![Page 3: Recommending Semantic Nearest Neighbors Using Storm and Dato](https://reader034.fdocuments.us/reader034/viewer/2022051503/5a6899cb7f8b9a4a258b6669/html5/thumbnails/3.jpg)
StumbleUpon – Choose Topics, Discover Content
![Page 4: Recommending Semantic Nearest Neighbors Using Storm and Dato](https://reader034.fdocuments.us/reader034/viewer/2022051503/5a6899cb7f8b9a4a258b6669/html5/thumbnails/4.jpg)
Bookmark, Organize and Share
![Page 5: Recommending Semantic Nearest Neighbors Using Storm and Dato](https://reader034.fdocuments.us/reader034/viewer/2022051503/5a6899cb7f8b9a4a258b6669/html5/thumbnails/5.jpg)
Recommendations – Matching User With Content
1. Understand User 2. Understand Content 3. Recommend 4. Get Feedback
Figure labels (topics): Television, Music, Animals, Dogs, Photography, Movies, Arts, Humor
Figure labels (sources): Trending, Friends, Likeminded Users, Experts
![Page 6: Recommending Semantic Nearest Neighbors Using Storm and Dato](https://reader034.fdocuments.us/reader034/viewer/2022051503/5a6899cb7f8b9a4a258b6669/html5/thumbnails/6.jpg)
Architecture Overview
Stages: 1. Ingestion 2. Check Quality 3. Offline Computations 4. Recs 5. Online Computation
Components: New Content, Ingestion Queue, Discovery Queue, Content Analysis, Cold Start Model, Rec Models, Recommendation Engine, Event Queue, Event Processors
Data stores: MySQL, HBase, ES
![Page 7: Recommending Semantic Nearest Neighbors Using Storm and Dato](https://reader034.fdocuments.us/reader034/viewer/2022051503/5a6899cb7f8b9a4a258b6669/html5/thumbnails/7.jpg)
CONTEXTUAL RECS
![Page 8: Recommending Semantic Nearest Neighbors Using Storm and Dato](https://reader034.fdocuments.us/reader034/viewer/2022051503/5a6899cb7f8b9a4a258b6669/html5/thumbnails/8.jpg)
Problem
• Problem:
  – Recommend items based on the topics discovered in the page the user is currently on
• Strategies:
  – Find semantically similar items
  – Find items that dig further into a specific topic
  – Find items that dig further into a broader topic
  – Others…
![Page 9: Recommending Semantic Nearest Neighbors Using Storm and Dato](https://reader034.fdocuments.us/reader034/viewer/2022051503/5a6899cb7f8b9a4a258b6669/html5/thumbnails/9.jpg)
Constraints/SLAs
• Very quick "Ingestion to Recommendation" turnaround time (tens of seconds)
  – Adopt stream processing with at-least-once processing guarantees
  – Build idempotent subsystems
  – Capitalize on non-linearity wherever possible
• Low latency retrieval of recs (tens of ms)
  – Pre-compute recs
  – Retrieve recs in Θ(1) time
• Horizontally scalable design
  – Utilize distributed processing systems/data stores
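To illustrate why idempotent subsystems pair well with at-least-once delivery, here is a minimal sketch in Python. It is not the StumbleUpon implementation: `RecStore`, `url_id` and `handle` are hypothetical names, and an in-memory dict stands in for whatever key-value store holds the pre-computed recs.

```python
import hashlib


class RecStore:
    """In-memory stand-in (assumption) for a key-value store of pre-computed recs."""

    def __init__(self):
        self._recs = {}  # url id -> list of recommended URLs

    def put(self, key, recs):
        # Overwriting by key makes the write idempotent: replaying the same
        # message leaves the store in the same state.
        self._recs[key] = list(recs)

    def get(self, key):
        # Hash-table lookup, expected constant time.
        return self._recs.get(key, [])

    def __len__(self):
        return len(self._recs)


def url_id(url):
    """Deterministic key, so a redelivered URL maps to the same record."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def handle(store, message):
    """Process one (url, recs) message; safe to call more than once."""
    url, recs = message
    store.put(url_id(url), recs)


store = RecStore()
msg = ("http://example.com/a", ["http://example.com/b", "http://example.com/c"])
handle(store, msg)
handle(store, msg)  # simulated redelivery under at-least-once semantics
```

Because the write is keyed on a deterministic id and overwrites rather than appends, the second delivery changes nothing, and serving a rec is a single key lookup.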
![Page 10: Recommending Semantic Nearest Neighbors Using Storm and Dato](https://reader034.fdocuments.us/reader034/viewer/2022051503/5a6899cb7f8b9a4a258b6669/html5/thumbnails/10.jpg)
Approach Overview
• (Offline) Utilize a high quality dataset to build a topic model
• (Online) For each URL ingested,
  – Extract text features that summarize the document
• Use pre-built topic models for
  – Filtering noisy keywords
  – Finding general topics
  – Finding specific topics
  – Computing topic hashes
• Compute similarity/relevance
• Store for quick retrieval
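The last three steps (topic hashes, similarity, quick retrieval) could be sketched as below. The slides do not define "topic hash", so this is an assumption made for illustration: the hash is the sorted indices of a document's top-k topics, candidates are the documents sharing that hash, and cosine similarity over topic distributions ranks them. All function names and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def topic_hash(topic_probs, k=2):
    """Bucket key (assumed definition): indices of the k most probable topics."""
    top = sorted(range(len(topic_probs)), key=lambda i: -topic_probs[i])[:k]
    return tuple(sorted(top))


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_index(docs):
    """Group document ids by topic hash so candidates are found by key lookup."""
    index = defaultdict(list)
    for doc_id, probs in docs.items():
        index[topic_hash(probs)].append(doc_id)
    return index


def similar_items(doc_id, docs, index, n=5):
    """Rank documents in the same bucket by cosine similarity, best first."""
    bucket = index[topic_hash(docs[doc_id])]
    scored = [(cosine(docs[doc_id], docs[o]), o) for o in bucket if o != doc_id]
    scored.sort(reverse=True)
    return [o for _, o in scored[:n]]


# Toy topic distributions over four topics (made-up numbers).
docs = {
    "a": [0.70, 0.20, 0.05, 0.05],
    "b": [0.60, 0.30, 0.05, 0.05],
    "c": [0.05, 0.05, 0.20, 0.70],
    "d": [0.25, 0.65, 0.05, 0.05],
}
index = build_index(docs)
neighbours_of_a = similar_items("a", docs, index)
```

Bucketing by hash keeps the similarity computation to a small candidate set, which is one way to stay within a tight ingestion-to-recommendation budget.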
![Page 11: Recommending Semantic Nearest Neighbors Using Storm and Dato](https://reader034.fdocuments.us/reader034/viewer/2022051503/5a6899cb7f8b9a4a258b6669/html5/thumbnails/11.jpg)
Feature Extraction
Pipeline stages (diagram labels): Wikipedia Annotation², Detect Language, Parse, Noun Chunking¹, Cleanup, Remove Boilerplate, Coalesce Tags, Compute Tag Score
¹ Manning, Christopher D., et al. "The Stanford CoreNLP natural language processing toolkit." Proceedings of the 52nd Annual Meeting of the ACL: System Demonstrations. 2014.
² Milne, David, and Ian H. Witten. "An open-source toolkit for mining Wikipedia." Artificial Intelligence 194 (2013): 222-239.
![Page 12: Recommending Semantic Nearest Neighbors Using Storm and Dato](https://reader034.fdocuments.us/reader034/viewer/2022051503/5a6899cb7f8b9a4a258b6669/html5/thumbnails/12.jpg)
Topic Modeling
³ Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet allocation." Journal of Machine Learning Research 3 (2003): 993-1022.
image courtesy: http://parkcu.com/blog/latent-dirichlet-allocation/
![Page 13: Recommending Semantic Nearest Neighbors Using Storm and Dato](https://reader034.fdocuments.us/reader034/viewer/2022051503/5a6899cb7f8b9a4a258b6669/html5/thumbnails/13.jpg)
LDA with Topic Associations
• Similar to constrained clustering, LDA can be run with topic associations⁴.
• Perform hierarchical/agglomerative clustering on SU's taxonomy to obtain K=75 clusters of topic sets.
• Use the topic sets as possible labels for the latent topic z.
• The words themselves are not learnt for the specific topic they have been mapped to.
⁴ Andrzejewski, David, and Xiaojin Zhu. "Latent Dirichlet Allocation with topic-in-set knowledge." Proceedings of the NAACL HLT 2009 Workshop on Semi-Supervised Learning for Natural Language Processing. Association for Computational Linguistics, 2009.
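A toy collapsed Gibbs sampler shows how topic-in-set constraints restrict the latent topic z for a word. This is a sketch of the general technique from the cited paper with hard constraints, not the model built for the talk; the function name, hyperparameters and the tiny corpus are all made up for illustration.

```python
import random


def gibbs_lda_topic_in_set(docs, allowed, n_topics, vocab_size,
                           alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA with hard topic-in-set constraints.

    docs:    list of documents, each a list of word ids
    allowed: dict word_id -> set of permitted topic ids; words that are
             absent from the dict may take any topic
    """
    rng = random.Random(seed)
    all_topics = list(range(n_topics))
    n_dk = [[0] * n_topics for _ in docs]          # doc-topic counts
    n_kw = [[0] * vocab_size for _ in all_topics]  # topic-word counts
    n_k = [0] * n_topics                           # tokens per topic
    z = []

    # Initialise every token with a topic drawn from its permitted set.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.choice(sorted(allowed.get(w, all_topics)))
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Standard collapsed Gibbs weights, evaluated only for the
                # permitted topics; every other topic gets probability zero.
                cand = sorted(allowed.get(w, all_topics))
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                    / (n_k[t] + vocab_size * beta)
                    for t in cand
                ]
                k = rng.choices(cand, weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    return z, n_kw


# Toy corpus: word ids 0-1 are tied to topic 0, ids 2-3 to topic 1, id 4 is free.
docs = [[0, 1, 0, 4], [2, 3, 3, 4], [0, 2, 4, 1]]
allowed = {0: {0}, 1: {0}, 2: {1}, 3: {1}}
z, n_kw = gibbs_lda_topic_in_set(docs, allowed, n_topics=2, vocab_size=5)
```

Constrained words can only ever be counted under their permitted topics, while unconstrained words such as id 4 are assigned by the usual sampling rule.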
![Page 14: Recommending Semantic Nearest Neighbors Using Storm and Dato](https://reader034.fdocuments.us/reader034/viewer/2022051503/5a6899cb7f8b9a4a258b6669/html5/thumbnails/14.jpg)
Example
Topic Associations (Pre LDA)
Top Words in a Topic (Post LDA)
![Page 15: Recommending Semantic Nearest Neighbors Using Storm and Dato](https://reader034.fdocuments.us/reader034/viewer/2022051503/5a6899cb7f8b9a4a258b6669/html5/thumbnails/15.jpg)
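Using the Topic Models
• Choose relevant topics
  – Rank/Threshold by [formula not captured] to get [formula not captured]
• Filtering noisy tags
  – Rank/Threshold by [formula not captured]
• Getting specific words
  – Rank/Threshold by [formula not captured]
• Getting general words
  – Rank/Threshold by [formula not captured]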
![Page 16: Recommending Semantic Nearest Neighbors Using Storm and Dato](https://reader034.fdocuments.us/reader034/viewer/2022051503/5a6899cb7f8b9a4a258b6669/html5/thumbnails/16.jpg)
Graphlab-Create I
Image Courtesy: https://dato.com/products/create/technology.html
![Page 17: Recommending Semantic Nearest Neighbors Using Storm and Dato](https://reader034.fdocuments.us/reader034/viewer/2022051503/5a6899cb7f8b9a4a258b6669/html5/thumbnails/17.jpg)
Graphlab-Create II
• Allows fast prototyping on a single machine
  – Python interface to a C++ backend
  – Scalable data structures (tabular and graphs) made available
  – Out-of-core implementation of standard ML algorithms
  – Makes basic data engineering and visualization tasks easy
• Easy to deploy microservices (predictive services) around models built using GraphLab Create/pandas/scikit-learn
  – RESTful API hosted over a Tornado server
  – Distributed cache
  – Amazon CloudWatch for monitoring (for AWS deploys)
• (Con) Debugging the service can be difficult
![Page 18: Recommending Semantic Nearest Neighbors Using Storm and Dato](https://reader034.fdocuments.us/reader034/viewer/2022051503/5a6899cb7f8b9a4a258b6669/html5/thumbnails/18.jpg)
Storm Basics⁵
• Distributed Realtime Computation System
  – Fault Tolerant, Scalable and Guaranteed Processing
  – Master --> ZooKeeper --> Worker Nodes
• Workers
  – Spout: stream sources
  – Bolt: computation units
• Data Flow
  – Streams: unbounded sequence of tuples
  – Topologies: a network of spouts and bolts
⁵ http://www.slideshare.net/ptgoetz/cassandra-and-storm-at-health-market-sceince
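The spout/bolt/topology vocabulary can be made concrete with a few Python generators. This is only a conceptual sketch and does not use the Storm API: real Storm runs spouts and bolts in parallel across workers and tracks acknowledgements, none of which is modelled here. The function names and the `pages` dict are hypothetical.

```python
def url_spout(urls):
    """Spout: a source that emits tuples into a stream."""
    for url in urls:
        yield (url,)


def fetch_bolt(stream, pages):
    """Bolt: consumes (url,) tuples and emits (url, text) tuples."""
    for (url,) in stream:
        yield (url, pages.get(url, ""))


def tag_bolt(stream):
    """Bolt: consumes (url, text) tuples and emits (url, tags) tuples."""
    for url, text in stream:
        yield (url, sorted(set(text.lower().split())))


def run_topology(urls, pages):
    """Topology: spouts and bolts wired into a graph (here a simple chain)."""
    return list(tag_bolt(fetch_bolt(url_spout(urls), pages)))


# Stand-in for fetched page text, keyed by URL (made-up data).
pages = {"u1": "Dogs and more dogs", "u2": "Jazz music"}
result = run_topology(["u1", "u2"], pages)
```

Each stage only sees tuples from the stage before it, which is what lets a real topology distribute the stages across machines.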
![Page 19: Recommending Semantic Nearest Neighbors Using Storm and Dato](https://reader034.fdocuments.us/reader034/viewer/2022051503/5a6899cb7f8b9a4a258b6669/html5/thumbnails/19.jpg)
Architecture
SIMILAR ITEM TOPOLOGY
Components: URLs, Kafka Broker, Webpage Surveyor Service, TMS*, Models, S3
Processing: Fetch Page HTML, HTML to Text, Text to (Tags, Concepts), Merge, Build Topic Model
Numbered steps:
1. Topic Model Query
2.a. Load ES
2.b. Get Similar Items
3. Load Similar Items for quick lookup
*TMS – Topic Model Service
![Page 20: Recommending Semantic Nearest Neighbors Using Storm and Dato](https://reader034.fdocuments.us/reader034/viewer/2022051503/5a6899cb7f8b9a4a258b6669/html5/thumbnails/20.jpg)
Some Numbers I
• Number of Storm Workers: 3
• Number of ES Nodes: 3
• Training:
  – Document Size: 2M
  – Vocabulary: 400K
  – Time: ~8s/iteration (16 cores)
• Predictive service performance:
  – Peak requests handled: 200/min
  – Avg response time: 110 ms
• URL turnaround time: 10s
• Number of URLs ingested: 70/min
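As a rough reading of these figures, the per-minute rates can be converted to per-second rates, and Little's law (mean requests in flight = arrival rate × mean response time) gives the average concurrency on the predictive service at peak. The variable names are illustrative, and the calculation assumes the average response time also holds at peak load.

```python
urls_per_min = 70
peak_requests_per_min = 200
avg_response_s = 0.110

# Convert the per-minute rates from the slide to per-second rates.
urls_per_sec = urls_per_min / 60
peak_requests_per_sec = peak_requests_per_min / 60

# Little's law: mean requests in flight = arrival rate x mean response time.
mean_in_flight = peak_requests_per_sec * avg_response_s

print(f"{urls_per_sec:.2f} URLs/s, {peak_requests_per_sec:.2f} requests/s, "
      f"{mean_in_flight:.2f} requests in flight on average")
```

Under that assumption, fewer than one request is in flight on average even at peak.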
![Page 21: Recommending Semantic Nearest Neighbors Using Storm and Dato](https://reader034.fdocuments.us/reader034/viewer/2022051503/5a6899cb7f8b9a4a258b6669/html5/thumbnails/21.jpg)
Some Numbers II
![Page 22: Recommending Semantic Nearest Neighbors Using Storm and Dato](https://reader034.fdocuments.us/reader034/viewer/2022051503/5a6899cb7f8b9a4a258b6669/html5/thumbnails/22.jpg)
THANKS. QUESTIONS?