Information Retrieval API: Lucene
Transcript of Lucene

  1. Search Engines
  2. How does a search engine work? Internet search engines are special sites on the web that are designed to help people find information on the World Wide Web. Every search engine operates in the following order: web crawling, indexing, searching.
  3. A search engine uses software called spiders (crawlers), which comb the internet looking for documents and their web addresses.
  4. The documents and web addresses are collected and sent to the search engine's indexing software.
  5. The indexing software extracts information from the documents, storing it in a database.
  6. When you perform a search by entering keywords, the database is searched for documents that match.
  7. What is Lucene? Lucene is an open-source, highly scalable information retrieval (IR) library. Information retrieval refers to the process of searching for documents, for information within documents, or for metadata about documents.
  8. Overview of how Lucene works
  9. ANALYSIS: analysis converts text data into the fundamental unit of searching, which is called a term. During analysis, the text data goes through multiple operations: extracting the words, removing common words, ignoring punctuation, reducing words to their root form, lowercasing, etc. Analysis happens just before indexing and query parsing. It converts text data into tokens, and these tokens are added as terms in the Lucene index.
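The analysis chain described above (split into words, lowercase, drop stop words) can be sketched in plain Java. This is a simplified illustration of the idea, not Lucene's actual Analyzer API, and the stop-word list here is a made-up minimal one:

```java
import java.util.*;
import java.util.regex.Pattern;

public class ToyAnalyzer {
    // A tiny illustrative stop-word list; real analyzers ship a larger one.
    private static final Set<String> STOP_WORDS =
            Set.of("the", "a", "an", "and", "of", "to");

    // Split on non-letter characters, lowercase each token,
    // and drop stop words, mimicking the analysis chain above.
    public static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : Pattern.compile("[^\\p{L}]+").split(text)) {
            String term = token.toLowerCase();
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Prints: [quick, brown, fox, lazy, dog]
        System.out.println(analyze("The Quick Brown Fox, and the Lazy Dog!"));
    }
}
```

Each surviving string in the returned list is what the slides call a token; once written into the index it becomes a term.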
  10. [Diagram: text is extracted from HTML, PDF, MS Word, and XML documents; the extracted text then goes through analysis and into the index. The analysis and indexing steps are performed by Lucene.]
  11. Lucene Analyzers: an analyzer in Lucene is a tokenizer + a stemmer + a stop-word filter. For example, analyzing "XY&Z Corporation - xyz@example.com": 1) WhitespaceAnalyzer splits tokens at whitespace: [XY&Z] [Corporation] [-] [xyz@example.com] 2) SimpleAnalyzer divides text at non-letter characters and lowercases it: [xy] [z] [corporation] [xyz] [example] [com] 3) StopAnalyzer removes stop words (not useful for searching) and lowercases the text: [xy] [z] [corporation] [xyz] [example] [com] 4) StandardAnalyzer tokenizes text based on a sophisticated grammar that recognizes e-mail addresses, acronyms, Chinese/Japanese/Korean characters, and alphanumerics; it also lowercases the text and removes stop words: [xy&z] [corporation] [xyz@example] [com]
  12. 5) Metaphone Replacement Analyzer: it replaces each incoming token with its metaphone code. Two phrases that sound similar yet are spelled completely differently are tokenized and encoded the same way. For example, "The quick brown fox jumped over the lazy dogs" is encoded as [0] [KK] [BRN] [FKS] [JMPT] [OFR] [0] [LS] [TKS]. If a user then searches for "Tha quik brown phox jumpd ovvar tha lazi dogz", it is encoded into the same codes as above, so an exact match is found.
  13. INDEXING: the process of converting text data into a format that facilitates rapid searching; a simple analogy is the index of a book. For indexing, the data should be available in plain-text format.
  14. [Diagram: core indexing classes. A Document made up of several Fields passes through an Analyzer to the IndexWriter, which writes the index to a Directory.]
  15. Directory: the Directory class represents the location of a Lucene index. It is an abstract class that lets its subclasses store the index as they see fit. IndexWriter: a class that either creates or maintains an index. Its constructor accepts a Boolean that determines whether a new index is created or an existing index is opened. It provides methods to add, delete, or update documents in the index. IndexWriter creates a lock file for the directory to prevent index corruption from simultaneous index updates.
  16. Field: the class that actually holds the textual content to be indexed. The Field class encapsulates a field name and its value. Lucene provides options to specify whether a field should be indexed or analyzed and whether its value should be stored.
  17. Document: a Document represents a collection of fields. You can think of it as a virtual document, a chunk of data such as a web page, an email message, or a text file, that you want to make retrievable at a later time. Analyzer: responsible for preprocessing the text data and converting it into tokens that are stored in the index.
  18. Lucene Indexes: every Lucene index consists of one or more segments. Each segment is a standalone index in itself, holding a subset of all indexed documents. At search time, each segment is visited separately and the results are combined. Each segment, in turn, consists of multiple files of the form _X.<ext>, where X is the segment's name. There is also a segments file, segments_N, that references all live segments. The value N, called the generation, is an integer that increases by one every time a change is committed to the index.
  19. Index Structure: _0.fnm _0.fdt _0.fdx _0.frq _0.tis _0.tii _0.prx _0.nrm _0_1.del _1.fnm _1.fdt _1.fdx [...] segments_3
  20. A Lucene index has many separate segments, and Lucene must search each segment separately and then combine the results, which is a performance issue, so the index needs to be optimized. The methods optimize(), optimize(int maxNumSegments), optimize(boolean doWait), and optimize(int maxNumSegments, boolean doWait) merge segments, trading a large one-time cost for faster searching.
  21. Fascinating Lucene: the inverted index. Lucene stores its input in a data structure known as an inverted index. What makes this structure inverted is that it uses the tokens extracted from the input documents as lookup keys instead of treating the documents as the central entities.
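A miniature inverted index can be sketched in plain Java: tokens extracted from documents become map keys, each pointing to the set of IDs of the documents that contain them. This is a toy model of the concept, not Lucene's on-disk segment format:

```java
import java.util.*;

public class ToyInvertedIndex {
    // term -> IDs of the documents that contain it (the "postings list")
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Tokenize naively on whitespace, lowercase, and record each term
    // against the document's ID.
    public void addDocument(int docId, String text) {
        for (String token : text.toLowerCase().split("\\s+")) {
            postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
    }

    // Because the structure is keyed by term, a single-term search
    // is just a map lookup instead of a scan over every document.
    public SortedSet<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), new TreeSet<>());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument(0, "lucene is a search library");
        index.addDocument(1, "a search engine crawls the web");
        System.out.println(index.search("search")); // [0, 1]
        System.out.println(index.search("web"));    // [1]
    }
}
```

The lookup cost depends on the number of query terms, not on the total number of documents, which is what makes the inverted layout fast for search.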
  22. Searching in Lucene: searching is the process of looking for words in the index and finding the documents that contain those words.
  23. Core Searching Classes. Searcher: an abstract base class that has various overloaded search methods. The search methods return an ordered collection of documents ranked by computed scores; Lucene calculates a score for each document that matches a given query. Term: the most fundamental unit for searching, composed of two elements: the text of the word and the name of the field in which the text occurs. Term objects are also involved in indexing, but there they are created by Lucene internals.
  24. ScoreDoc: a simple pointer to a document contained in the search results; it encapsulates the position of the document in the index and the score Lucene computed for it. TopDocs: encapsulates the total number of search results and an array of ScoreDoc objects.
  25. Querying Lucene Indexes: Query is an abstract base class for queries. Queries are used as strategies to look up the indexes and return the matching documents. Some of the query types are: 1) TermQuery: the most elementary way to search an index is for a specific term. A term is the smallest indexed piece, consisting of a field name and a text-value pair.
  26. 2) WildcardQuery: wildcard queries let you search for terms with missing pieces. Two standard wildcard characters are used: * matches zero or more characters (for example, to search for test, tests, or tester, use the search test*), and ? matches a single character (for example, to search for "text" or "test", use the search te?t). 3) RangeQuery: range queries match all documents whose field values lie between the lower and upper bounds specified by the query. They can be inclusive or exclusive: inclusive range queries are denoted by square brackets ([ ]), exclusive range queries by curly braces ({ }). For example, date:[20020101 TO 20030101] finds documents whose date fields have values between 20020101 and 20030101, inclusive.
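The wildcard semantics above can be illustrated by translating a pattern into a regular expression. This is a toy sketch of term matching, not how Lucene's WildcardQuery is implemented internally:

```java
import java.util.regex.Pattern;

public class ToyWildcard {
    // Translate a Lucene-style wildcard pattern into a regex:
    // '*' matches any run of characters, '?' matches exactly one;
    // everything else is matched literally.
    public static boolean matches(String pattern, String term) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            switch (c) {
                case '*': regex.append(".*"); break;
                case '?': regex.append('.'); break;
                default:  regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.matches(regex.toString(), term);
    }

    public static void main(String[] args) {
        System.out.println(matches("test*", "tester")); // true
        System.out.println(matches("te?t", "text"));    // true
        System.out.println(matches("te?t", "toast"));   // false
    }
}
```

In a real index, such a matcher would be applied while enumerating the stored terms, and each matching term's documents would be collected.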
  27. 4) FuzzyQuery: Lucene supports fuzzy searches based on the Levenshtein distance (edit distance) algorithm. To do a fuzzy search, append the tilde (~) symbol to a single-word term. FuzzyQuery matches terms "close" to a specified base term: you specify a maximum allowed edit distance, and any terms within that edit distance of the base term (and therefore the documents containing them) are matched. For example, to search for a term similar in spelling to "roam", use the fuzzy search roam~. 5) BooleanQuery: Boolean operators allow terms to be combined through logic operators. Lucene supports AND, OR, and NOT as Boolean operators.
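The Levenshtein (edit) distance behind FuzzyQuery counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one term into another. A standard dynamic-programming version in plain Java (Lucene itself uses faster automaton-based matching, so this is only a sketch of the metric):

```java
public class EditDistance {
    // Classic dynamic-programming Levenshtein distance:
    // dp[i][j] = edits needed to turn the first i chars of a
    // into the first j chars of b.
    public static int levenshtein(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) dp[i][0] = i;
        for (int j = 0; j <= b.length(); j++) dp[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                dp[i][j] = Math.min(dp[i - 1][j - 1] + cost,        // substitute
                           Math.min(dp[i - 1][j] + 1,              // delete
                                    dp[i][j - 1] + 1));            // insert
            }
        }
        return dp[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // "foam" is one substitution away from "roam", so a fuzzy
        // search roam~ with max distance 1 would match it.
        System.out.println(levenshtein("roam", "foam")); // 1
    }
}
```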
  28. 7) Boosting a query: boosting lets you control the relevance of a document (which terms or clauses are "more important") by boosting a term. The higher the boost factor, the more relevant the term, and therefore the higher the scores of the documents that contain it. To boost a term, use the caret (^) symbol followed by a boost factor (a number) at the end of the term you are searching for. For example, if you are searching for IIT (BHU) Varanasi and you want the term "Varanasi" to be more relevant, boost it using the ^ symbol with a boost factor next to the term. Query syntax: IIT (BHU) Varanasi^4
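The effect of a boost factor can be illustrated with a toy scorer in plain Java: each matching query term contributes its boost to the document's score, so a document that matches a boosted term can outrank one matching several unboosted terms. Real Lucene scoring is far more involved (term frequencies, norms, etc.); this models only the boost's role, and all names here are illustrative:

```java
import java.util.*;

public class ToyBoostScorer {
    // Toy scoring: a document's score is the sum, over query terms,
    // of the term's boost when the document contains the term.
    public static double score(Set<String> docTerms,
                               Map<String, Double> queryBoosts) {
        double score = 0.0;
        for (Map.Entry<String, Double> e : queryBoosts.entrySet()) {
            if (docTerms.contains(e.getKey())) {
                score += e.getValue();
            }
        }
        return score;
    }

    public static void main(String[] args) {
        // Query "iit bhu varanasi^4": "varanasi" boosted to 4, others default 1.
        Map<String, Double> boosts = Map.of("iit", 1.0, "bhu", 1.0, "varanasi", 4.0);
        Set<String> docA = Set.of("iit", "bhu");        // two unboosted matches
        Set<String> docB = Set.of("varanasi", "guide"); // one boosted match
        System.out.println(score(docA, boosts)); // 2.0
        System.out.println(score(docB, boosts)); // 4.0, so docB ranks higher
    }
}
```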
  29. Luke: the Lucene Index Toolbox, a GUI tool for browsing and inspecting Lucene indexes.
  30. Applications of Lucene: searchable email, online documentation search, version control and content management, content search... and the list goes on.
  31. THANK YOU