Practical Machine Learning for Smarter Search with Solr and Spark
![Page 1: Practical Machine Learning for Smarter Search with Solr and Spark](https://reader035.fdocuments.us/reader035/viewer/2022081520/58f2d2101a28ab91188b457f/html5/thumbnails/1.jpg)
Practical Machine Learning for Smarter Search
with Solr and Spark
Jake Mannix
@pbrane
Lead Data Engineer, Lucidworks
![Page 2: Practical Machine Learning for Smarter Search with Solr and Spark](https://reader035.fdocuments.us/reader035/viewer/2022081520/58f2d2101a28ab91188b457f/html5/thumbnails/2.jpg)
$ whoami
Now:
• Lucidworks, Office of the CTO: applied ML / data engineering R&D
Previously:
• Allen Institute for AI: semantic search on academic research publications
• Twitter: account search, user interest modeling, content recommendations
• LinkedIn: profile search, generic entity-to-entity recommender systems
Prehistory:
• other software companies, algebraic topology, particle cosmology
![Page 3: Practical Machine Learning for Smarter Search with Solr and Spark](https://reader035.fdocuments.us/reader035/viewer/2022081520/58f2d2101a28ab91188b457f/html5/thumbnails/3.jpg)
Overview
• Why Spark and Solr for Data Engineering?
• Quick intro to Solr
• Quick intro to Spark
• Example: Many NewsGroups
  • data exploration
  • clustering: unsupervised ML
  • classification: supervised ML
  • recommender: collaborative filtering + content-based
  • search ranking
![Page 4: Practical Machine Learning for Smarter Search with Solr and Spark](https://reader035.fdocuments.us/reader035/viewer/2022081520/58f2d2101a28ab91188b457f/html5/thumbnails/4.jpg)
Practical Data Science with Spark and Solr
Why does Solr need Spark?
Why does Spark need Solr?
![Page 5: Practical Machine Learning for Smarter Search with Solr and Spark](https://reader035.fdocuments.us/reader035/viewer/2022081520/58f2d2101a28ab91188b457f/html5/thumbnails/5.jpg)
Why do data engineering with Solr and Spark?
Solr
• Data exploration and visualization
• Easy ingestion and feature selection
• Powerful ranking features
• Quick and dirty classification and clustering
• Simple operation and scaling
• Stats and math built in
Spark
• General purpose batch/streaming compute engine
• Whole collection analysis!
• Fast, large scale iterative algorithms
• Advanced machine learning: MLlib, Mahout, Deeplearning4j
• Lots of integrations with other big data systems
![Page 6: Practical Machine Learning for Smarter Search with Solr and Spark](https://reader035.fdocuments.us/reader035/viewer/2022081520/58f2d2101a28ab91188b457f/html5/thumbnails/6.jpg)
Why does Spark need Solr?
Typical Hadoop / Spark data-engineering task, start with some data on HDFS:
$ hdfs dfs -ls /user/jake/mail/lucene-solr-user/2015
…
-rw-r--r-- 1 jake staff 6304388 Feb 4 18:22 part-00001.lzo
-rw-r--r-- 1 jake staff 7977085 Feb 4 18:22 part-00002.lzo
-rw-r--r-- 1 jake staff 7210817 Feb 4 18:22 part-00003.lzo
-rw-r--r-- 1 jake staff 1215048 Feb 4 18:22 part-00004.lzo
Now what? What’s in these files?
![Page 7: Practical Machine Learning for Smarter Search with Solr and Spark](https://reader035.fdocuments.us/reader035/viewer/2022081520/58f2d2101a28ab91188b457f/html5/thumbnails/7.jpg)
Solr gives you:
• random access data store
• full-text search
• fast aggregate statistics
• just starting out: no HDFS / S3 necessary!
• world-class multilingual text analytics:
  • no more: tokens = str.toLowerCase().split("\\s+")
• relevancy / ranking
• realtime HTTP service layer
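To make the "no more `split("\\s+")`" bullet concrete, here is a small plain-Python sketch (not Solr's analyzers) that compares whitespace splitting with a simple Unicode-aware regex tokenizer. The sample sentence and both function names are made up for illustration.

```python
import re

def whitespace_tokens(text):
    """Naive tokenizer from the slide: lowercase, then split on whitespace."""
    return text.lower().split()

def word_tokens(text):
    """Slightly better: extract runs of Unicode word characters, dropping punctuation."""
    return re.findall(r"\w+", text.lower())

# Made-up sample; "\u00ef" is the precomposed character i-with-diaeresis.
sample = "Solr's analyzers handle punctuation, casing, and na\u00efve tokens!"

naive = whitespace_tokens(sample)
better = word_tokens(sample)

print(naive)   # punctuation stays glued to tokens such as "punctuation,"
print(better)  # punctuation is gone, but "solr's" is split into "solr" and "s"
```

Even the regex version has no stemming, stop-word handling, or language-specific rules, which is the gap a real analysis chain is meant to fill.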
![Page 8: Practical Machine Learning for Smarter Search with Solr and Spark](https://reader035.fdocuments.us/reader035/viewer/2022081520/58f2d2101a28ab91188b457f/html5/thumbnails/8.jpg)
Solr Key Features
• Full text search (information retrieval)
• Facets/Guided Nav galore!
• Lots of data types
• Spelling, auto-complete, highlighting
• Cursors
• More Like This
• De-duplication
• Apache Lucene
• Grouping and Joins
• Stats, expressions, transformations and more
• Language detection
• Extensible
• Massive Scale/Fault tolerance
![Page 9: Practical Machine Learning for Smarter Search with Solr and Spark](https://reader035.fdocuments.us/reader035/viewer/2022081520/58f2d2101a28ab91188b457f/html5/thumbnails/9.jpg)
Why Spark for Solr?
• Spark-shell: a Big Data REPL with all your fave JVM libs!
• Build the index in parallel very, very quickly!
• Aggregations
• Boosts, stats, iterative global computations
• Offline compute to update index with additional info (e.g. PageRank, popularity)
• Whole corpus analytics and ML: clustering, classification, CF, rankers
• General-purpose distributed computation
• Joins with other storage (Cassandra, HDFS, DB, HBase)
![Page 10: Practical Machine Learning for Smarter Search with Solr and Spark](https://reader035.fdocuments.us/reader035/viewer/2022081520/58f2d2101a28ab91188b457f/html5/thumbnails/10.jpg)
Spark Key Features
• General purpose, high powered cluster computing system
• Modern, faster alternative to MapReduce
• 3x faster w/ 10x less hardware for Terasort
• Great for iterative algorithms
• APIs for Java, Scala, Python and R
• Rich set of add-on libraries for machine learning, graph processing, integrations with SQL and other systems
• Deploys: Standalone, Hadoop YARN, Mesos, AWS, Docker, …
![Page 11: Practical Machine Learning for Smarter Search with Solr and Spark](https://reader035.fdocuments.us/reader035/viewer/2022081520/58f2d2101a28ab91188b457f/html5/thumbnails/11.jpg)
Example: Many NewsGroups
• Initial exploration of ASF mailing-list archives
• Index it into Solr
• Explore a bit deeper: unsupervised Spark ML
• Exploit labels: predictive analytics
![Page 12: Practical Machine Learning for Smarter Search with Solr and Spark](https://reader035.fdocuments.us/reader035/viewer/2022081520/58f2d2101a28ab91188b457f/html5/thumbnails/12.jpg)
Many NewsGroups: Initial Exploration
• Initial exploration of ASF mailing-list archives
  • index into Solr: just need to turn your records into JSON
  • facet:
    • fields with low cardinality or with sensible ranges
    • document size histogram
    • projects, authors, dates
  • find: broken fields, automated content, expected data missing, errors
  • now: load into a Spark RDD via SolrRDD:
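The "turn your records into JSON" and "facet" steps on this slide can be sketched with nothing but the Python standard library. The sketch only builds the HTTP requests and does not send them. The Solr address, the `mail` collection, and the `project` / `author` field names are assumptions for illustration; `q`, `rows`, `wt`, `facet`, `facet.field`, `facet.limit`, and `facet.mincount` are standard Solr request parameters.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs

SOLR = "http://localhost:8983/solr"  # assumption: a local Solr instance
COLLECTION = "mail"                   # hypothetical collection name

def update_request(records):
    """Return (url, body) for posting a list of dicts to Solr's JSON update handler."""
    url = f"{SOLR}/{COLLECTION}/update?commit=true"
    return url, json.dumps(records)

def facet_url(fields, limit=20):
    """Build a /select URL that returns facet counts only (rows=0)."""
    params = [
        ("q", "*:*"),             # match every document
        ("rows", "0"),            # counts only, no documents
        ("wt", "json"),
        ("facet", "true"),
        ("facet.limit", str(limit)),
        ("facet.mincount", "1"),  # hide empty buckets
    ]
    params += [("facet.field", f) for f in fields]  # one parameter per field
    return f"{SOLR}/{COLLECTION}/select?{urlencode(params)}"

# Hypothetical record shape for one mailing-list message.
records = [{"id": "m1", "project": "lucene-solr", "author": "jake", "body": "hello"}]
update_url, update_body = update_request(records)
select_url = facet_url(["project", "author"])

print(update_url)
print(select_url)
```

Facet counts on low-cardinality fields such as project or author are a quick way to spot the broken fields and missing data the slide mentions.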
![Page 13: Practical Machine Learning for Smarter Search with Solr and Spark](https://reader035.fdocuments.us/reader035/viewer/2022081520/58f2d2101a28ab91188b457f/html5/thumbnails/13.jpg)
Many NewsGroups: Initial Exploration
• cleanup/filtering via Spark DataFrame operations:
• create thread groups:
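As a rough idea of what cleanup and thread grouping involve, here is a plain-Python sketch rather than the Spark DataFrame code the slide refers to. It assumes threads can be keyed by a subject line with reply prefixes stripped, and that each message has hypothetical `id`, `subject`, and `body` fields; the real data may use other keys, such as reply headers.

```python
import re
from collections import defaultdict

# One or more leading "Re:" / "Fw:" / "Fwd:" prefixes, in any case.
REPLY_PREFIX = re.compile(r"^\s*((re|fwd?)\s*:\s*)+", re.IGNORECASE)

def thread_key(subject):
    """Normalize a subject so a message and its replies share one key."""
    return REPLY_PREFIX.sub("", subject).strip().lower()

def group_threads(messages):
    """Drop messages with empty bodies, then group message ids by thread key."""
    threads = defaultdict(list)
    for msg in messages:
        if not msg.get("body", "").strip():
            continue  # cleanup step: skip empty messages
        threads[thread_key(msg["subject"])].append(msg["id"])
    return dict(threads)

# Made-up sample messages.
messages = [
    {"id": "m1", "subject": "Faceting on dates", "body": "How do I ...?"},
    {"id": "m2", "subject": "Re: Faceting on dates", "body": "Use facet.range."},
    {"id": "m3", "subject": "RE: re: Faceting on dates", "body": "Thanks!"},
    {"id": "m4", "subject": "Build failed", "body": ""},
]

threads = group_threads(messages)
print(threads)
```

In Spark the same shape would typically be a filter followed by a group-by on the derived key.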
![Page 14: Practical Machine Learning for Smarter Search with Solr and Spark](https://reader035.fdocuments.us/reader035/viewer/2022081520/58f2d2101a28ab91188b457f/html5/thumbnails/14.jpg)
Many NewsGroups: Initial Exploration
• try other text analyzers: (no more str.split("\\s+")!)
ref: Lucidworks blog on LuceneTextAnalyzer by Steve Rowe
![Page 15: Practical Machine Learning for Smarter Search with Solr and Spark](https://reader035.fdocuments.us/reader035/viewer/2022081520/58f2d2101a28ab91188b457f/html5/thumbnails/15.jpg)
Many NewsGroups: Exploratory Data Science
• Unsupervised machine learning:
  • clustering documents with KMeans
  • extract topics with Latent Dirichlet Allocation
  • learn word vectors with Word2Vec
![Page 16: Practical Machine Learning for Smarter Search with Solr and Spark](https://reader035.fdocuments.us/reader035/viewer/2022081520/58f2d2101a28ab91188b457f/html5/thumbnails/16.jpg)
Many NewsGroups: Exploratory Data Science
• Vectorize and run KMeans:
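The original Spark code for this step is not reproduced here, so the following is only a plain-Python sketch of the idea: turn tokenized documents into TF-IDF vectors, then run Lloyd's k-means with squared Euclidean distance. The four tiny documents are made up, and the deterministic "first k documents" initialization is a simplification chosen for readability; Spark ML's KMeans uses a different initialization by default.

```python
import math
from collections import Counter, defaultdict

def tfidf_vectors(docs):
    """Turn tokenized docs into L2-normalized sparse TF-IDF dicts."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        vec = {t: c * math.log((n + 1) / (df[t] + 1)) for t, c in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in set(a) | set(b))

def mean(vectors):
    """Component-wise mean of sparse vectors."""
    total = defaultdict(float)
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}

def kmeans(vectors, k, iterations=10):
    """Lloyd's algorithm; centroids start as the first k vectors."""
    centroids = [dict(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every vector.
        assignment = [min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors]
        # Update step: recompute each non-empty cluster's centroid.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignment

docs = [
    "solr facet query index".split(),
    "spark rdd dataframe job".split(),
    "solr index query schema".split(),
    "spark job dataframe cluster".split(),
]
clusters = kmeans(tfidf_vectors(docs), k=2)
print(clusters)
```

On a real corpus the vectorization would come from a proper text analyzer rather than whitespace splitting, and the cluster ids could be written back to the index as a field to facet on.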
![Page 17: Practical Machine Learning for Smarter Search with Solr and Spark](https://reader035.fdocuments.us/reader035/viewer/2022081520/58f2d2101a28ab91188b457f/html5/thumbnails/17.jpg)
Many NewsGroups: Exploratory Data Science
• Build topic models with LDA:
![Page 18: Practical Machine Learning for Smarter Search with Solr and Spark](https://reader035.fdocuments.us/reader035/viewer/2022081520/58f2d2101a28ab91188b457f/html5/thumbnails/18.jpg)
Many NewsGroups: Exploratory Data Science
• Build word vector representations with Word2Vec:
![Page 19: Practical Machine Learning for Smarter Search with Solr and Spark](https://reader035.fdocuments.us/reader035/viewer/2022081520/58f2d2101a28ab91188b457f/html5/thumbnails/19.jpg)
Many NewsGroups: Supervised Learning
• Now for some real Data Science:
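Which classifier the talk used is not recorded in this transcript. As a stand-in, here is a minimal multinomial Naive Bayes with add-one smoothing in plain Python that predicts which mailing list a message came from. The list names used as labels and the training sentences are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """labeled_docs: list of (label, tokens). Returns the counts the model needs."""
    label_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in labeled_docs:
        label_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, term_counts, vocab

def predict_nb(model, tokens):
    """Return the label with the highest log posterior for the given tokens."""
    label_counts, term_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        total_terms = sum(term_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for t in tokens:
            if t in vocab:  # ignore terms never seen in training
                # Add-one (Laplace) smoothed log likelihood.
                score += math.log((term_counts[label][t] + 1) / (total_terms + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training = [
    ("lucene-solr-user", "facet query returns wrong counts".split()),
    ("lucene-solr-user", "schema field type for dates".split()),
    ("spark-user", "rdd join is slow on cluster".split()),
    ("spark-user", "dataframe groupby out of memory".split()),
]
model = train_nb(training)

print(predict_nb(model, "slow facet query".split()))
print(predict_nb(model, "rdd join".split()))
```

The mailing list each message was posted to acts as a free label, which is what makes this corpus convenient for supervised experiments.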
![Page 20: Practical Machine Learning for Smarter Search with Solr and Spark](https://reader035.fdocuments.us/reader035/viewer/2022081520/58f2d2101a28ab91188b457f/html5/thumbnails/20.jpg)
Many NewsGroups: Next steps?
• What else could you do?
  • Try other classification algorithms, cross-validate to pick!
  • Recommender systems
    • content-based:
      • mail-thread as “item”, head msgs grouped by replier as “user” profile
      • search query of users against items to recommend
    • collaborative-filtering:
      • users replying to a head msg “rate” it positively
      • train a Spark ML ALS RecSys model
  • Train search rankers on click logs
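For the collaborative-filtering idea, the first step is turning reply events into (user, item, rating) triples. The sketch below, in plain Python, treats each reply as implicit positive feedback and uses the reply count as the rating. That weighting, the sample data, and the function names are assumptions. Spark ML's ALS expects integer user and item ids, so the sketch also maps names to integers.

```python
from collections import Counter

def implicit_ratings(replies):
    """replies: iterable of (replier, thread_id). Returns sorted (user, item, rating) triples."""
    counts = Counter(replies)
    return sorted((user, thread, float(n)) for (user, thread), n in counts.items())

def index_ids(values):
    """Map each distinct value to a stable integer id."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

# Made-up reply log: who replied in which thread.
replies = [("alice", "t1"), ("bob", "t1"), ("alice", "t1"), ("alice", "t2")]

ratings = implicit_ratings(replies)
user_ids = index_ids(u for u, _, _ in ratings)
item_ids = index_ids(i for _, i, _ in ratings)
rows = [(user_ids[u], item_ids[i], r) for u, i, r in ratings]

print(ratings)
print(rows)
```

Rows of this shape are what a matrix-factorization recommender such as ALS trains on. Since replies are implicit signals, an implicit-feedback mode would be the natural setting to try first.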
![Page 21: Practical Machine Learning for Smarter Search with Solr and Spark](https://reader035.fdocuments.us/reader035/viewer/2022081520/58f2d2101a28ab91188b457f/html5/thumbnails/21.jpg)
Resources
• spark-solr: https://github.com/Lucidworks/spark-solr
• Company: http://www.lucidworks.com
• Our blog: http://www.lucidworks.com/blog
• Apache Solr: http://lucene.apache.org/solr
• Apache Spark: http://spark.apache.org
• Fusion: http://www.lucidworks.com/products/fusion
• Twitter: @pbrane