Spark search datapalooza
Transcript of Spark search datapalooza
![Page 1: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/1.jpg)
Spark Search
Taka Shinagawa — sparksearch.org
IBM Datapalooza, November 10–12, 2015
![Page 2: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/2.jpg)
Speaker’s Background
SF-based Analytics Engineer developing “Spark Search” as a personal project
• Data Mining, Analytics, NLP, Machine Learning for finance, online advertising, e-commerce
• Web/Mobile/Social App Development (Ruby on Rails, JS)
• OS Kernel, Network Security, Applied Crypto, OpenSSL
• Cross-Language Information Retrieval — Research Project for TREC
Experience at Fortune 500 corporations, multiple early- and late-stage startups, and in academic research
![Page 3: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/3.jpg)
Today’s Topic — Spark Search
Spark Search is:
• Spark + Lucene integration, in development in Scala ("Spark Search" = Lucene + Spark in this presentation)
• For "offline transactions" such as interactive search, analytics and machine learning (feature engineering/extraction)
• NOT for real-time search or "online transactions" such as web service & application backends
![Page 4: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/4.jpg)
Spark Search — Project Status
• Pre-release
• Finalizing the features and APIs
• Testing for reliability and bigger data
• Creating examples and documentation
• Looking for feedback from users
![Page 5: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/5.jpg)
Outline
1. Background
• Lucene, Spark
• Big Data Search Challenge — Why we need Spark Search
2. Spark Search
• Architecture
• Operations
• Performance
3. Demo with Reddit Data
Q&A
![Page 6: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/6.jpg)
1. Background
![Page 7: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/7.jpg)
1.1. Lucene & Spark
![Page 8: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/8.jpg)
Apache Lucene
•Originally created by Doug Cutting in 1999
•Very popular IR (search engine) library used by:
•Apache Solr, ElasticSearch
•Many open source and commercial products
•High-performance Indexing and Search
•Language Analyzers for many languages
•Pluggable ranking models (vector space, BM25)
•Lots of Features
•https://lucene.apache.org
![Page 9: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/9.jpg)
Indexing: Documents to Inverted Index

Documents:

| ID | Text |
|----|------|
| 1 | The Apache Lucene project develops open-source search software |
| 2 | spellchecking, hit highlighting and advanced analysis/tokenization capabilities |
| 3 | spellchecking, hit highlighting and advanced analysis/tokenization capabilities |
| 4 | subproject with the aim of collecting and distributing free materials |

Language Analysis (Tokenizing, Normalization, Filtering, Stopwords Removal), then Indexing, produces the inverted index:

| Term | Documents | Freq |
|------|-----------|------|
| advanced | [2,3] | 2 |
| aim | [4] | 1 |
| analysis | [2,3] | 2 |
| apache | [1] | 1 |
| capability | [2,3] | 2 |
| collect | [4] | 1 |
| develop | [1] | 1 |
| distribute | [4] | 1 |
| free | [4] | 1 |
| highlight | [2,3] | 2 |
| … | … | … |
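The analysis-and-indexing pipeline on this slide can be sketched in plain Python (an illustrative stand-in, not Lucene: tokenization and stopword removal are greatly simplified, and there is no stemming, unlike the table above where e.g. "develops" becomes "develop"):

```python
from collections import defaultdict

STOPWORDS = {"the", "and", "with", "of"}

def analyze(text):
    """Simplified analysis chain: tokenize, lowercase, strip punctuation,
    drop stopwords (no stemming)."""
    tokens = text.lower().replace("/", " ").split()
    words = (t.strip(".,-") for t in tokens)
    return [w for w in words if w and w not in STOPWORDS]

def build_index(docs):
    """Inverted index: map each term to the set of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

docs = {
    1: "The Apache Lucene project develops open-source search software",
    2: "spellchecking, hit highlighting and advanced analysis/tokenization capabilities",
}
index = build_index(docs)
print(sorted(index["lucene"]))         # [1]
print(sorted(index["spellchecking"]))  # [2]
```

Searching this structure is a dictionary lookup per term, which is why full-text search over an index is so much faster than scanning every document.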
![Page 10: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/10.jpg)
What’s Spark — Distributed Computing Architecture
• Driver Program (SparkContext) submits tasks
• Cluster Manager (YARN, Mesos, Standalone)
• Worker Nodes: each runs an Executor with a Cache, executing Tasks
• File System (HDFS, S3, LocalFS, etc.)

Spark Documentation: https://spark.apache.org/docs/latest/cluster-overview.html
![Page 11: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/11.jpg)
What’s Spark — Data Computation Flow

Data (HDFS, S3, LocalFS, HBase, Cassandra, etc.)
→ RDD 1 (partitions) → Transformation 1 → RDD 2 (partitions) → … → Transformation n → RDD n (partitions) → Action → Result
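The lazy transformation / eager action model can be mimicked with plain-Python generators (an analogy only, not Spark code): nothing runs until the terminal action pulls data through the pipeline.

```python
# Transformations (map/filter) are lazy generators; the action (sum)
# is the first thing that actually pulls data through the pipeline.
data = range(1, 6)                   # source data: 1..5
rdd1 = (x * 2 for x in data)         # "transformation 1" (nothing runs yet)
rdd2 = (x for x in rdd1 if x > 4)    # "transformation 2" (still lazy)
result = sum(rdd2)                   # "action": the whole chain executes now
print(result)  # 6 + 8 + 10 = 24
```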
![Page 12: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/12.jpg)
Spark Dataframe
• Distributed collection of data with schema/columns (RDD + schema) — called SchemaRDD in Spark 1.0–1.2, renamed to DataFrame since Spark 1.3
• Similar to data frames in R and Python pandas
• Integrated with Spark SQL
• DataFrames can read data from external sources (e.g. JSON, Parquet, CSV, Solr, Cassandra) through the Data Sources API
http://spark.apache.org/docs/latest/sql-programming-guide.html
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
![Page 13: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/13.jpg)
1.2. Big Data Search Challenge
Why we need Spark Search
![Page 14: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/14.jpg)
Lucene is Very Useful for Data Scientists, too
• For small data, sequential text search or string pattern matching is fast enough; this doesn’t work for big data
• With a Lucene index, we can perform full-text search on big data much more quickly
• Lucene provides ranking scores, text similarities and faceting, as well as NLP-related features (e.g. language analyzers, n-grams) for analytics & ML
• A Lucene index is basically a term-document matrix (vectorization is already done)
![Page 15: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/15.jpg)
Challenges for Data Scientists
• However, many data scientists and analytics experts are NOT Lucene, Solr, Elasticsearch experts
• Learning curve for non-search specialists (Solr 5 has become easier :-)
• For big data, many Solr and ES client libraries are not very scalable — problems appear with millions of documents
• Data scientists wouldn't set up a large computer cluster just for search
![Page 16: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/16.jpg)
Challenges with Indexing Big Data
• Lucene is just a library, without built-in distributed computing tools or a front-end. It doesn’t scale by itself, and hands-on tools are not included. That’s why Solr and ES exist
• Solr/ES are not as easy as we might expect for typical data scientists to index and search big data
• Solr/ES connectors for Spark are helpful, but:
  • the network connection can easily become the bottleneck
  • the best data locality requires running two clusters on the same nodes
![Page 17: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/17.jpg)
Hypothesis from My Past Experience
Lucene is faster on a single system without parallelism
•My impression: Solr & ES have lots of overhead
Analogy: small lighter car vs large heavier car with the same engine
![Page 18: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/18.jpg)
Experiments
• Goal: get a rough idea of the per-node performance overhead of Lucene vs Solr vs ES (minimum parallelism on Solr/ES)
•NOT to measure absolute performance numbers with the most optimized settings
• Indexing millions of short documents (each line of a Wikipedia dump file as a document) with Lucene, Solr and Elasticsearch (no XML parser used)
•The simplest setup — 1 shard, no replication, 1 node (on my laptop), no parallelism at Solr/ES level
![Page 19: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/19.jpg)
Experiment Environment

Lucene
• Lucene 5.3.1, Java 8
• 1 shard, no replica
• Tested with RAMDirectory (memory), SimpleFSDirectory (magnetic disk), NIOFSDirectory (magnetic disk)

Solr
• Solr 5.3.1, Java 8, SolrJ library
• 1 shard, no replication, 1 node, cloud mode (aka SolrCloud)

Elasticsearch
• Elasticsearch 1.7.2 (uses Lucene 4.10.4)
• 1 shard, no replication, 1 node, Transport Client
• Bulk API (https://www.elastic.co/guide/en/elasticsearch/client/java-api/current/bulk.html)
![Page 20: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/20.jpg)
Lucene (Result)
• Linearly scalable on a single node
• The performance gap between RAMDirectory and FSDirectory is NOT significant — smaller than I expected
• Batch indexing doesn’t improve performance much (unlike Solr and ES)

[Chart: indexing time (sec, 0–160) vs. number of documents (1M–5M) for Lucene (RAMDirectory), Lucene (RAMDirectory, 500K batch), Lucene (NIOFSDirectory), Lucene (SimpleFSDirectory)]
![Page 21: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/21.jpg)
Solr (Result)
• Indexing without batching is a bad idea for millions of docs (it took some time to figure this out)
• Indexing with 500K-doc batches (clear documentation about batching and batch sizes is needed)

[Chart: indexing time (sec, 0–260) vs. number of documents (1M–5M) for Solr (1 shard, no replica, commit every 500K, cloud mode)]
![Page 22: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/22.jpg)
Elasticsearch (Result)
• Bulk Request doesn’t scale — must use BulkProcessor
• Indexing Performance Tips (https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html)
• Various errors and JVM exceptions

[Chart: indexing time (sec, 0–300) vs. number of documents (1M–5M) for Elasticsearch with the default 5MB, 10MB and 15MB batches (1 shard, no replica)]
![Page 23: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/23.jpg)
Elasticsearch (Result)
•Occasional Data Loss — problems in the Jepsen Test seemingly still exist with version 1.7.2
https://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-0
•Recommendation: Always check the number of documents in ES after indexing. Store data somewhere else (e.g. MySQL, Cassandra, HBase, HDFS, S3)
• (5M - 4,984,827) = 15,173 documents missing :-(
![Page 24: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/24.jpg)
Experiment Results (indexing locally, minimum parallelism with Solr/ES)

[Chart: indexing time (seconds, 0–300) vs. number of documents (1M–5M). Lucene (RAMDirectory, NIOFSDirectory) is much faster than Solr (1 shard, no replica, commit every 500K, cloud mode) and Elasticsearch (default 5MB batch, 1 shard, no replica)]
![Page 25: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/25.jpg)
Findings from the Experiment
1. Solr and ES are slower than Lucene for indexing on a single machine with minimum parallelism (lots of overhead)
2. Solr and ES require expertise and fine tuning for big data indexing
3. Indexing big data with the existing tools is not as easy as we wish

How about running Lucene inside Spark?

[Chart: indexing throughput (documents/sec, 0–50,000) for Lucene (RAMDirectory, 500K batch), Lucene (RAMDirectory), Lucene (FSDirectory), Solr, Elasticsearch]
![Page 26: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/26.jpg)
2. Spark Search
![Page 27: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/27.jpg)
Spark Search Overview

• Spark runs on YARN/Mesos; each Executor runs Tasks that build and query Lucene indexes
• Lucene index stores (Directory implementations): Memory (RAMDirectory), LocalFS (FSDirectory), HDFS (HdfsDirectory), Tachyon (TachyonDirectory)
![Page 28: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/28.jpg)
Spark Search — Indexing

Data (HDFS, S3, LocalFS) → RDD (Partitions 1–4) → LuceneRDD
• Transformations for indexing: an Indexer runs on each partition
• Each partition produces its own Lucene index/shard (Directory), written to an index store
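The per-partition indexing pattern above (one Indexer per partition, one shard per partition) can be sketched in plain Python; `build_shard` is a hypothetical stand-in for a Spark task running a Lucene IndexWriter.

```python
from collections import defaultdict

def build_shard(partition):
    """One Spark task's work: index its partition of (doc_id, text) pairs
    into a local shard (stand-in for a Lucene IndexWriter + Directory)."""
    shard = defaultdict(set)
    for doc_id, text in partition:
        for term in text.lower().split():
            shard[term].add(doc_id)
    return shard

# An "RDD" of documents, already split into partitions (data invented).
partitions = [
    [(1, "apache lucene search")],
    [(2, "spark search engine")],
    [(3, "lucene index"), (4, "spark rdd")],
]
# mapPartitions analogue: each partition becomes one independent index/shard.
shards = [build_shard(p) for p in partitions]
print(sorted(shards[2]["lucene"]))  # [3]
```

Because each shard is built independently, indexing parallelizes across partitions with no coordination between tasks.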
![Page 29: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/29.jpg)
Spark Search — Search

Query → LuceneRDD: an IndexSearcher runs against each index/shard (Directory) as a transformation
→ RDD of per-shard results (documents + scores), collected with an action
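Combining per-shard hits into a global top-k result, as the diagram implies, can be sketched with the standard library's heapq (scores and shard layout are invented for illustration):

```python
import heapq

# Per-shard search results as (score, doc_id) pairs, one list per shard,
# as each partition's IndexSearcher would return them (values invented).
shard_hits = [
    [(0.91, "doc3"), (0.40, "doc7")],
    [(0.85, "doc1"), (0.12, "doc9")],
    [(0.77, "doc5")],
]
# The "action" merges all shard results into a global top-k by score.
top2 = heapq.nlargest(2, (hit for shard in shard_hits for hit in shard))
print(top2)  # [(0.91, 'doc3'), (0.85, 'doc1')]
```

Since each shard only needs to return its own top k hits, the merge step moves little data to the driver.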
![Page 30: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/30.jpg)
Spark Search vs others — Indexing (small 5M docs, single machine, minimum parallelism on Solr/ES, 1 partition with SparkSearch)

[Chart: indexing time (seconds, 0–300) vs. number of documents (1M–5M) for Lucene (RAMDirectory), SparkSearch (1 partition, RAMDirectory, cached), SparkSearch (1 partition, RAMDirectory, no cache), Solr (1 shard, no replica, commit every 500K, cloud mode), Elasticsearch (default 5MB batch, 1 shard, no replica)]
![Page 31: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/31.jpg)
SparkSearch Indexing Speed (small 5M docs, single machine, minimum parallelism on Solr/ES, 1 partition with SparkSearch)

[Chart: indexing throughput (documents/sec, 0–100,000) for SparkSearch (memory cached), SparkSearch (without cache), Lucene (RAMDirectory), Solr, Elasticsearch]
![Page 32: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/32.jpg)
Indexing and Querying — RDD (API Preview)
Indexing
val rdd = sc.textFile("/tmp/inputData.txt")
val luceneRDD = LuceneRDD(rdd).index

Querying
val count = luceneRDD.count("*:*")
val results = luceneRDD.search("body:lucene")
val docs = results.take(100) // get top 100 docs
val docsAll = results.collect() // get all docs
!!! These APIs are pre-release versions. Subject to change
![Page 33: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/33.jpg)
Indexing and Querying — Dataframe (API Preview)
Indexing
val df = sqlContext.read.json("/tmp/inputData.json")
val luceneDf = LuceneDataFrame(df, "/tmp/luceneIndex", "simplefs").index

Querying
val counts = luceneDf.count("body:lucene")
val results = luceneDf.search("body:lucene")
val docs = results.take(100) // get top 100 docs
val docsAll = results.collect() // get all docs
!!! These APIs are pre-release versions. Subject to change
![Page 34: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/34.jpg)
“More Like This” Query
!!! The API parameters are pre-release versions. Subject to change
“How MoreLikeThis Works in Lucene” http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/
luceneDf.moreLikeThis("apple orange california", "body", 20)
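Conceptually, MoreLikeThis extracts the most significant terms from the input text and turns them into an OR query. A rough stdlib sketch of the idea (not Lucene's actual implementation, which also weighs terms by document frequency):

```python
from collections import Counter

STOPWORDS = {"the", "a", "and", "of", "to", "in"}

def more_like_this_query(text, field, max_terms):
    """Pick the most frequent non-stopword terms from the input text and
    build an OR query over them (a rough imitation of MoreLikeThis)."""
    terms = [t for t in text.lower().split() if t not in STOPWORDS]
    top = [term for term, _ in Counter(terms).most_common(max_terms)]
    return " OR ".join(f"{field}:{term}" for term in top)

print(more_like_this_query("apple orange california apple", "body", 2))
# body:apple OR body:orange
```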
![Page 35: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/35.jpg)
Term Faceting
!!! The API parameters are pre-release versions. Subject to change
Example — Term Faceting at an E-Commerce Website
![Page 36: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/36.jpg)
Value Range Faceting
!!! The API parameters are pre-release versions. Subject to change
Example — Value Range Faceting at an E-Commerce Website
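Value-range faceting buckets a numeric field into fixed intervals and counts documents per bucket. A stdlib sketch of the idea (field values are invented; this is not the SparkSearch API):

```python
from collections import Counter

def range_facets(values, lo, hi, interval):
    """Count values per [start, start + interval) bucket."""
    counts = Counter((v - lo) // interval for v in values if lo <= v < hi)
    return {f"[{lo + b * interval}, {lo + (b + 1) * interval})": n
            for b, n in sorted(counts.items())}

# A hypothetical numeric field (e.g. comment scores), one value per document.
scores = [3, 17, 42, 48, 55, 97, 12, 41]
print(range_facets(scores, 0, 100, 10))
# {'[0, 10)': 1, '[10, 20)': 2, '[40, 50)': 3, '[50, 60)': 1, '[90, 100)': 1}
```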
![Page 37: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/37.jpg)
Date/Time Range Faceting
!!! The API parameters are pre-release versions. Subject to change
![Page 38: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/38.jpg)
Term Counts for Search Result
scala> df2.termFreqAll("textField:java", "textField", "label").show
!!! The API parameters are pre-release versions. Subject to change
![Page 39: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/39.jpg)
Term Vector — HashingTF for MLlib
scala> df2.hashingTF("*:*", "textField", 20).show
!!! The API parameters are pre-release versions. Subject to change
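The hashingTF call above maps terms into a fixed-size term-frequency vector via the hashing trick. A stdlib sketch of the idea (Spark MLlib's HashingTF uses a different, stable hash function):

```python
def hashing_tf(terms, num_features=20):
    """Hashing trick: bucket = hash(term) % num_features; the vector
    counts term occurrences per bucket (collisions are tolerated)."""
    vec = [0] * num_features
    for term in terms:
        vec[hash(term) % num_features] += 1
    return vec

# Note: Python's str hash is randomized per process, so bucket positions
# vary between runs; the vector length and total counts do not.
vec = hashing_tf("spark lucene spark".split())
print(len(vec), sum(vec))  # 20 3
```

The fixed vector size means no vocabulary dictionary has to be built or shared across the cluster, which is why this step is very fast.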
![Page 40: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/40.jpg)
SparkSearch Query Performance
• Query time on shards can be predicted from their Lucene index sizes
• Better to have even/similar shard sizes — one big shard’s query time affects overall query performance
• Querying the Lucene index is faster than querying the Dataframe directly
  • the inverted index provides efficiency
![Page 41: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/41.jpg)
3. Scalability Test
![Page 42: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/42.jpg)
Dataset — Public Reddit Comments Corpus
• The entire dataset — 1TB, 1.65B comments on Reddit in JSON format
• One JSON record per line
• The dataset used in this example — 100GB, 177.5M (177,544,909) Reddit comments (10% of the entire dataset); multiple files combined into a single 100GB file
https://archive.org/details/2015_reddit_comments_corpus
![Page 43: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/43.jpg)
Dataset Schema

scala> reddit.printSchema
root
 |-- archived: boolean (nullable = true)
 |-- author: string (nullable = true)
 |-- author_flair_css_class: string (nullable = true)
 |-- author_flair_text: string (nullable = true)
 |-- body: string (nullable = true)
 |-- controversiality: long (nullable = true)
 |-- created_utc: string (nullable = true)
 |-- distinguished: string (nullable = true)
 |-- downs: long (nullable = true)
 |-- edited: string (nullable = true)
 |-- gilded: long (nullable = true)
 |-- id: string (nullable = true)
 |-- link_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- parent_id: string (nullable = true)
 |-- removal_reason: string (nullable = true)
 |-- retrieved_on: long (nullable = true)
 |-- score: long (nullable = true)
 |-- score_hidden: boolean (nullable = true)
 |-- subreddit: string (nullable = true)
 |-- subreddit_id: string (nullable = true)
 |-- ups: long (nullable = true)

22 fields; the “body” field stores the Reddit comments
![Page 44: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/44.jpg)
A Sample JSON Record from the Dataset
{"ups":3,"subreddit":"pics","id":"c0ifz1b","author_flair_text":null,"distinguished":null,"created_utc":"1262984678","author":"DrK","link_id":"t3_an7yh","author_flair_css_class":null,"subreddit_id":"t5_2qh0u","controversiality":0,"body":"Learn to use [Lucene syntax](http://lucene.apache.org/java/2_3_2/queryparsersyntax.html) or use http://www.searchreddit.com/","score_hidden":false,"edited":false,"score":3,"retrieved_on":1426178579,"archived":true,"parent_id":"t1_c0iftql","gilded":0,"downs":0,"name":"t1_c0ifz1b"}
![Page 45: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/45.jpg)
Spark Search for 100GB Data on AWS
How long does it take to index 100GB?
Computing Resources (AWS EC2)
•8 instances (r3.4xlarge), No VPC
•CPUs: 16 vCPU x 8 = 128 vCPUs
•Memory: 122GB x 8 = 976GB
•SSD Storage: 320GB x 8
![Page 46: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/46.jpg)
Steps — Indexing 100GB with RDD
100GB JSON data on S3 → read by line → RDD (128 partitions) → indexed to Memory (RAMDirectory) and to SSD (NIOFSDirectory)
![Page 47: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/47.jpg)
Results — Indexing 100GB with RDD
• Read 100GB JSON from S3 into RDD (128 partitions): 1.7 min
• Index to Memory (RAMDirectory) and SSD (NIOFSDirectory): 2.5–2.6 min with a cached RDD, 6–6.1 min with the RDD not cached
![Page 48: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/48.jpg)
Steps — Indexing 100GB (177.5M JSON records) with Dataframe
100GB JSON data on S3 → read by JSON record → Dataframe (20 partitions) → repartition → Dataframe (128 partitions) → indexed to SSD (NIOFSDirectory)
![Page 49: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/49.jpg)
Results — Indexing 100GB (177.5M JSON records) with Dataframe
• Read 100GB JSON from S3 into Dataframe (20 partitions): 5.8 min
• Repartition to Dataframe (128 partitions): 5.2 min
• Index to SSD (NIOFSDirectory): 3.1 min (cached DF), 7.4 min (repartitioned DF not cached), 12 min (indexing the 20-partition Dataframe directly)
![Page 50: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/50.jpg)
Findings from the 100GB Scalability Test
• Spark Search is fast and scalable — Spark is a great distributed computing engine for Lucene
• Reliability and performance depend on the index store (type of Lucene Directory)
• FSDirectory on SSD is the most reliable store with good performance
• RAMDirectory can be improved for better reliability and performance
• Both indexing and query times can be estimated from each partition’s input data and Lucene index size
![Page 51: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/51.jpg)
3. Demo with Reddit Dataset
![Page 52: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/52.jpg)
Dataset for This Demo
• Small subset (January 2011) of the entire dataset
• 2.88M (2,884,096) Reddit comments
• 1.6GB JSON file
• Needed to choose a rather small dataset due to a Zeppelin bug
https://archive.org/details/2015_reddit_comments_corpus
![Page 53: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/53.jpg)
Load Data into DataFrame and Repartition
!!! The API parameters are pre-release versions. Subject to change
![Page 54: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/54.jpg)
Index the DataFrame
!!! The API parameters are pre-release versions. Subject to change
![Page 55: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/55.jpg)
Full-Text Search for Reddit Comments
!!! The API parameters are pre-release versions. Subject to change
![Page 56: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/56.jpg)
“More Like This” Search for Reddit Comments
!!! The API parameters are pre-release versions. Subject to change
![Page 57: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/57.jpg)
“More Like This” Search with Term Boosts
!!! The API parameters are pre-release versions. Subject to change
![Page 58: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/58.jpg)
Find Top Co-occurrence Terms
!!! The API parameters are pre-release versions. Subject to change
![Page 59: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/59.jpg)
Term Faceting
!!! The API parameters are pre-release versions. Subject to change
![Page 60: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/60.jpg)
Term Faceting
!!! The API parameters are pre-release versions. Subject to change
![Page 61: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/61.jpg)
Term Faceting
!!! The API parameters are pre-release versions. Subject to change
![Page 62: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/62.jpg)
Value Range Faceting (0 to 100, Interval=1)
!!! The API parameters are pre-release versions. Subject to change
![Page 63: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/63.jpg)
Value Range Faceting (0 to 500, Interval=10)
!!! The API parameters are pre-release versions. Subject to change
![Page 64: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/64.jpg)
Value Range Faceting (0 to 100, Interval=1)
!!! The API parameters are pre-release versions. Subject to change
![Page 65: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/65.jpg)
Value Range Faceting (0 to 500, Interval=10)
!!! The API parameters are pre-release versions. Subject to change
![Page 66: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/66.jpg)
Popular Terms in the Document
!!! The API parameters are pre-release versions. Subject to change
![Page 67: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/67.jpg)
Term Counts for Search Result
!!! The API parameters are pre-release versions. Subject to change
![Page 68: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/68.jpg)
Term Vector — HashingTF for MLlib (Very Fast)
!!! The API parameters are pre-release versions. Subject to change
![Page 69: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/69.jpg)
Summary
• SparkSearch (Lucene+Spark) is very fast, efficient and scalable — Spark adds little overhead
• Simple APIs — just a single line for indexing
• Seamless integration with Spark RDD & DataFrame
• All indexing and search operations can be performed from the Spark command line or an application
![Page 70: Spark search datapalooza](https://reader031.fdocuments.us/reader031/viewer/2022030308/58eefcd41a28ab46768b4599/html5/thumbnails/70.jpg)
Questions & Feedback [email protected]
Love to hear your feedback, help and ideas — contact me!
Updates: @sparksearch
www.SparkSearch.org