Solr
by NNNN (周世恩), Code & Coffee, 2013/11/1
What is Solr?
What is Lucene?
• Full-featured text search
• High performance
• Index size: roughly 20-30% of the size of the indexed text
• Small RAM requirements (~1MB)
• Powerful, accurate and efficient search algorithms
• 100% Java (^^)
Lucene(cont.)
• Multiple Analyzer / Tokenizer
• Fields Searching
• Merge results
• Flexible faceting, highlighting, joins and result grouping
• Typo-tolerant suggesters (you have to build them yourself, of course)
• Customizable ranking models (VSM, BM25)
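To make the BM25 bullet concrete, here is a minimal sketch of the textbook BM25 formula in plain Python. It is not Lucene's implementation; the function name `bm25_scores`, the toy documents, and the defaults `k1=1.2`, `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` (textbook BM25)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy corpus, already tokenized.
docs = [
    "solr is a search platform".split(),
    "lucene is a search library written in java".split(),
    "coffee and code".split(),
]
scores = bm25_scores(["search", "solr"], docs)
```

The first document matches both query terms and is short, so it should rank highest; the third matches nothing and scores zero.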
Lucene(Query)
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/Query.html
http://www.ibm.com/developerworks/cn/java/j-lo-lucene1/fig001.jpg
Where is index file stored?
• Memory
• File System
• HDFS
• Set the FileSystem config to HDFS
#Note
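• Only indexed fields can be searched
• A field can be stored only, without being indexed
• Multiple types are supported (Long, Int, String, Text...)
• The Tokenizer (Analyzer) has to be decided at indexing time
• Supports searching and indexing at the same time?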
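The indexed/stored distinction in the notes above can be sketched with a toy inverted index. This is an illustration only, not Lucene's data structure; `ToyIndex`, `tokenize`, the schema dictionary, and the example document are all hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Stand-in for an analyzer: lowercase and split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

class ToyIndex:
    def __init__(self, schema):
        # schema maps field name -> {"indexed": bool, "stored": bool}
        self.schema = schema
        self.postings = defaultdict(set)  # (field, token) -> set of doc ids
        self.stored = {}                  # doc id -> stored field values

    def add(self, doc_id, doc):
        self.stored[doc_id] = {}
        for field, value in doc.items():
            opts = self.schema[field]
            if opts["indexed"]:
                # The tokenizer is applied here, at indexing time.
                for token in tokenize(value):
                    self.postings[(field, token)].add(doc_id)
            if opts["stored"]:
                self.stored[doc_id][field] = value

    def search(self, field, term):
        ids = sorted(self.postings.get((field, term.lower()), set()))
        return [(i, self.stored[i]) for i in ids]

schema = {
    "title": {"indexed": True, "stored": True},
    "body": {"indexed": True, "stored": False},   # searchable, not returned
    "url": {"indexed": False, "stored": True},    # returned, not searchable
}
index = ToyIndex(schema)
index.add(1, {
    "title": "What is Solr?",
    "body": "An enterprise search platform",
    "url": "http://example.com/solr",
})
hits_body = index.search("body", "platform")
hits_url = index.search("url", "http")
```

Searching `body` finds the document but returns only the stored `title` and `url`; searching the unindexed `url` field finds nothing.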
#Note
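• Shake before use
• Know from the start exactly which fields you need
• Reduce the chance of having to rebuild the index (with an RDB you just run one command)
A Lucene index consists of many files; lose one of them and you are done (GG)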
What is Solr?
An awesome, enterprise-grade, free
Search Platform
It has every feature you would expect from Lucene
Solr also adds....
• A nice-looking Admin Interface!
• REST-like API (easy to integrate with other apps)
• Dynamic clustering
• Database integration
• Geospatial search (Google Map?)
• Tunable cache size
Remember the advantages of the cloud...
• Highly reliable
• Scalable
• Fault tolerant
• Distributed indexing
• Replication
• Load-balanced
• Automated failover and recovery(?)
Commonly used config files
• schema.xml (defines each field)
• solrconfig.xml (defines the URI of each handler)
• jetty.xml (!)
• solr.xml (defines the number of cores)
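As a rough illustration of what schema.xml holds, the sketch below parses a small, hypothetical schema fragment with Python's standard library and lists which fields are searchable. The field names, types, and the `summarize_fields` helper are assumptions; a real schema.xml also declares field types, which are omitted here.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment in the style of a Solr 4.x schema.xml.
SCHEMA_XML = """
<schema name="example" version="1.5">
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="title" type="text_general" indexed="true" stored="true"/>
    <field name="body" type="text_general" indexed="true" stored="false"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</schema>
"""

def summarize_fields(schema_xml):
    """Return {field name: {"type", "indexed", "stored"}} for each <field>."""
    root = ET.fromstring(schema_xml)
    summary = {}
    for field in root.iter("field"):
        summary[field.get("name")] = {
            "type": field.get("type"),
            "indexed": field.get("indexed") == "true",
            "stored": field.get("stored") == "true",
        }
    return summary

fields = summarize_fields(SCHEMA_XML)
searchable = sorted(name for name, f in fields.items() if f["indexed"])
returned = sorted(name for name, f in fields.items() if f["stored"])
```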
Real-time indexing?
Near Real-time indexing
• Documents are available for search almost immediately after being indexed...
• It still only counts once there has been a commit (....)
https://cwiki.apache.org/confluence/display/solr/Near+Real+Time+Searching
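One way to avoid issuing an explicit commit for every document is to ask Solr to commit within a time window. The sketch below only builds the JSON body of an `add` command carrying `commitWithin`; it sends nothing over the network. The helper name and the document fields are hypothetical, and the exact update endpoint and accepted JSON shape should be checked against the Solr version in use.

```python
import json

def build_add_command(doc, commit_within_ms=None):
    """Build the JSON body of an 'add' update command for a single document."""
    add = {"doc": doc}
    if commit_within_ms is not None:
        # Ask Solr to make the document searchable within this many milliseconds.
        add["commitWithin"] = commit_within_ms
    return json.dumps({"add": add})

# Hypothetical document; the field names must exist in your schema.xml.
body = build_add_command(
    {"id": "doc-1", "title": "Near real-time search"},
    commit_within_ms=1000,
)
decoded = json.loads(body)
```

The body would then be POSTed to the core's update handler with `Content-Type: application/json`.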
Searching
• Query
"id:{id} AND name:{Name} OR title:{text}"
• Highlighting
• Projection
• Sorting(asc, desc)
• Output format: JSON, CSV, XML
• Others: spellcheck, Wildcard Query, +-*/
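The search features listed above map onto request parameters of the select handler: `q` for the query, `fl` for projection, `sort`, `wt` for the output format, and `hl` / `hl.fl` for highlighting. The sketch below only builds such a URL. The base URL assumes the default core of the Solr 4.x example setup, and `build_select_url` and the field names are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_select_url(base_url, query, fields=None, sort=None,
                     highlight_field=None, fmt="json", rows=10):
    """Assemble a select URL from common Solr query parameters."""
    params = {"q": query, "wt": fmt, "rows": rows}
    if fields:
        params["fl"] = ",".join(fields)      # projection: which fields to return
    if sort:
        params["sort"] = sort                # e.g. "id asc" or "score desc"
    if highlight_field:
        params["hl"] = "true"
        params["hl.fl"] = highlight_field    # field to generate snippets from
    return base_url + "?" + urlencode(params)

url = build_select_url(
    "http://localhost:8983/solr/collection1/select",  # assumed example core
    'id:42 AND name:"Solr" OR title:search',
    fields=["id", "title"],
    sort="id asc",
    highlight_field="title",
)
# Decode the query string again to check what would be sent.
decoded = parse_qs(urlsplit(url).query)
```

`urlencode` takes care of escaping spaces, colons, and quotes in the query string, which is easy to get wrong when concatenating by hand.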
Sample Output
Import Data From DB
• Configured in solrconfig.xml
http://wiki.apache.org/solr/DataImportHandler
The difference between Solr and an RDB
• Solr is for indexed text or lots of unstructured docs.
• Solr is optimized for searching, not for storage and retrieval of individual records.
http://stackoverflow.com/questions/5814050/solr-or-database
Distributed Search cluster
• Set up Solr on many machines, then pick one of them to combine the results
• Does this need to be set in the config?
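The "pick one machine to combine the results" step can be sketched as a merge of per-shard hit lists. In Solr's classic distributed search, the list of shards is typically passed with the `shards` request parameter; the code below does not call Solr and only shows the merge idea. The scores, document ids, and the function name are made up.

```python
import heapq
from itertools import islice

def merge_shard_results(shard_results, top_k):
    """Merge hit lists (each sorted by descending score) into one top-k list."""
    merged = heapq.merge(*shard_results, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, top_k))

# Hypothetical (score, doc id) hits returned by two shards.
shard_a = [(0.92, "a-17"), (0.40, "a-03")]
shard_b = [(0.88, "b-21"), (0.75, "b-02"), (0.10, "b-09")]
top = merge_shard_results([shard_a, shard_b], top_k=3)
```

Because each shard already returns its hits sorted, the coordinating node only needs a streaming merge and can stop after `top_k` hits.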
Distributed Solr Cluster & Load balancer
http://wiki.apache.org/solr/SolrReplication
#Note
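• You can first build the index directory with a Java application that uses Lucene's indexing, then hand it over to Solr
• Before Solr performs an update, make sure no other application is in the middle of writing to the index, otherwise GG
• While indexing, whether with Solr or Lucene, do not carelessly delete the write-lock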
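Why deleting the write-lock is dangerous can be seen in a toy single-writer lock. This is not how Lucene implements its locks (it can also use native OS file locks); it is a sketch of the idea using an exclusively created lock file. The class name `ToyWriteLock` and the temporary directory are assumptions.

```python
import os
import tempfile

class ToyWriteLock:
    """Single-writer lock based on exclusive creation of a lock file."""

    def __init__(self, index_dir, name="write.lock"):
        self.path = os.path.join(index_dir, name)
        self.held = False

    def acquire(self):
        try:
            # O_EXCL makes creation fail if the lock file already exists.
            fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            return False
        os.close(fd)
        self.held = True
        return True

    def release(self):
        if self.held:
            os.remove(self.path)
            self.held = False

index_dir = tempfile.mkdtemp()        # stands in for an index directory
first = ToyWriteLock(index_dir)
second = ToyWriteLock(index_dir)
got_first = first.acquire()           # first writer takes the lock
got_second = second.acquire()         # second writer is refused
first.release()
got_second_retry = second.acquire()   # succeeds once the lock is released
second.release()
```

If someone deletes the lock file by hand while the first writer is still active, the second writer would acquire it and both would write to the same index at once.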
Live Demo
The gods once said this is dangerous
Download the latest version of Solr, then:
$: cd solr-4.4.0/example
$: java -Xmx2048m -jar start.jar
The End
Useful tools
• Luke (for inspecting an index)
• Apache Tika
• Apache Hadoop
• Apache Tomcat
BBQ(Bonus)
• Customize tokenizer
• Document Boosting
• Field Boosting
• Field aliasing / renaming