Introduction to Lucene
Debapriyo Majumdar
Information Retrieval – Spring 2015
Indian Statistical Institute Kolkata
Slides: debapriyo/teaching/ir2015/slides/Lucene.pdf


Open source search engines

§ Academic
  – Terrier (Java, University of Glasgow)
  – Indri, Lemur (C++, and Java too, UMass & CMU)
  – Zettair (University of Melbourne)
§ Apache project (non-academic)
  – Lucene
  – Apache license, legally easier for commercial use
§ Lucene
  – Java search engine library, with many features
  – Ports/integration to other languages available (C/C++, C#, Python, Ruby, …)
  – Other projects on top of Lucene: Solr and others
  – Used by: LinkedIn, Twitter, CiteSeer, …

Lucene: overview

[Figure: data-flow diagram of indexing and searching]
§ Indexing path: Documents → Lucene document (User: document building) → Tokens (Lucene: Analyzing) → Index and dictionary (Lucene: Indexing)
§ Search path: Text query → Lucene query (User: query building, Lucene: Analyzing) → Search results (Lucene: Searching) → Results display (User: reading search results), shown to the User through the UI

User = Programmer
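The stages in the overview can be imitated in plain Java to make the data flow concrete. This is only a sketch under stated assumptions, not how Lucene is implemented: the class `ToyIndex`, its methods, and the lowercase-and-split analysis rule are all hypothetical names and choices made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

/** Toy in-memory index (hypothetical; not Lucene code). */
final class ToyIndex {
    // Original text by document number, so results can be displayed.
    private final List<String> stored = new ArrayList<>();
    // Dictionary: term -> sorted document numbers containing it.
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    /** Analyzing: lowercase, then split on non-letters (an assumption of this sketch). */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!piece.isEmpty()) tokens.add(piece);
        }
        return tokens;
    }

    /** Indexing: adds every token of the text to the dictionary; returns the document number. */
    int addDocument(String text) {
        int docId = stored.size();
        stored.add(text);
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    /** Searching: analyzes the query the same way; returns documents containing every query term. */
    List<Integer> search(String query) {
        TreeSet<Integer> result = null;
        for (String term : analyze(query)) {
            TreeSet<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) result = new TreeSet<>(docs);
            else result.retainAll(docs);
        }
        if (result == null) return new ArrayList<>();
        return new ArrayList<>(result);
    }

    /** Reading results: returns the stored text of a document. */
    String document(int docId) {
        return stored.get(docId);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument("Arkadeep is a highly motivated student");
        index.addDocument("A student of Lucene");
        for (int docId : index.search("motivated STUDENT")) {
            System.out.println(docId + ": " + index.document(docId));
        }
    }
}
```

The query goes through the same `analyze` step as the documents, which is why the upper-case query term still matches.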

Lucene document building

§ A document is a collection of Fields
§ Document: CV of a student, several fields
  – Last_Name: Banerjee
  – First_Name: Arkadeep
  – Age: 24
  – Gender: M
  – Institute: ISI Kolkata
  – Location: Kolkata
  – Description: Arkadeep is a highly motivated student who simply loves challenge. He is …
§ A field can be a text, a number, a range

Building document

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.IntField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    …

    Document doc = new Document();
    doc.add(new StringField("Last_Name", "Banerjee", …));
    doc.add(new StringField("First_Name", "Arkadeep", …));
    doc.add(new IntField("Age", 24, …));
    doc.add(new TextField("Description", description, …));

Now, let’s understand the fields

Building document

§ Field.Store
  – NO: Don’t store the field value in the index
  – YES: Store the field value in the index
§ Field.Index
  – ANALYZED: Tokenize with an Analyzer
  – NOT_ANALYZED: Do not tokenize
  – NO: Do not index this field
  – Other options
  – With the typed field classes used here the indexing choice is implied: StringField is indexed as a single token (not tokenized), TextField is tokenized with the Analyzer

    doc.add(new StringField("Last_Name", "Banerjee", Field.Store.YES));
    doc.add(new StringField("First_Name", "Arkadeep", Field.Store.YES));
    doc.add(new IntField("Age", 24, Field.Store.YES));
    doc.add(new TextField("Description", desc, Field.Store.NO));

Indexing

§ IndexWriter: primary class for indexing
  – Stores the index in a directory
  – RAMDirectory also possible (in memory)

    StandardAnalyzer analyzer = new StandardAnalyzer();
    FSDirectory dir = FSDirectory.open(new File("directory_path"));
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
    IndexWriter w = new IndexWriter(dir, config);  // write into the directory opened above
    w.addDocument(document);
    w.close();  // commits pending changes and releases the write lock

Lucene document building

§ A document is a collection of Fields
§ Document: CV of a student, several fields
  – Last_Name: Banerjee
  – First_Name: Arkadeep
  – Age: 24
  – Gender: M
  – Institute: ISI Kolkata
  – Location: Kolkata
  – Description: Arkadeep is a highly motivated student who simply loves challenge. He is …
§ A field can be a text, a number, a range

What would we like to search with?

Lucene Analyzer

§ Tokenizes the text in the Fields
§ Common Analyzers
  – WhitespaceAnalyzer: splits tokens on whitespace
  – SimpleAnalyzer: splits tokens on non-letters, and then lowercases
  – StopAnalyzer: same as SimpleAnalyzer, but also removes stop words
  – StandardAnalyzer: most sophisticated analyzer that knows about certain token types, lowercases, removes stop words, ...

Analysis example

§ “The quick brown fox jumped over the lazy dog”
§ WhitespaceAnalyzer
  – [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
§ SimpleAnalyzer
  – [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
§ StopAnalyzer
  – [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
§ StandardAnalyzer
  – [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
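The first three rows of the example can be reproduced with a few lines of plain Java. This is a rough imitation for illustration, not Lucene's analyzer code: the class `ToyAnalyzers` is hypothetical, and its stop list is a small assumed subset rather than Lucene's actual stop set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Plain-Java imitation of three analyzers (hypothetical; not Lucene code). */
final class ToyAnalyzers {
    // Assumption: a tiny illustrative stop list, not Lucene's real one.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    /** Splits on runs of whitespace and keeps case (WhitespaceAnalyzer row). */
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) tokens.add(piece);
        }
        return tokens;
    }

    /** Splits on non-letters, then lowercases (SimpleAnalyzer row). */
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}]+")) {
            if (!piece.isEmpty()) tokens.add(piece.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    /** Same as simple(), then drops stop words (StopAnalyzer row). */
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeIf(STOP_WORDS::contains);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumped over the lazy dog";
        System.out.println(whitespace(text));
        System.out.println(simple(text));
        System.out.println(stop(text));
    }
}
```

StandardAnalyzer is left out on purpose: its grammar-based tokenization is more involved than a regular-expression split.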

Analysis example 2

§ “XY&Z Corporation - xyz@example.com”
§ WhitespaceAnalyzer
  – [XY&Z] [Corporation] [-] [xyz@example.com]
§ SimpleAnalyzer
  – [xy] [z] [corporation] [xyz] [example] [com]
§ StopAnalyzer
  – [xy] [z] [corporation] [xyz] [example] [com]
§ StandardAnalyzer
  – [xy&z] [corporation] [xyz@example.com]

Searching

§ IndexReader and IndexSearcher

    int k = 10;  // number of top hits to return
    StandardAnalyzer analyzer = new StandardAnalyzer();
    IndexReader reader = DirectoryReader.open(FSDirectory.open(new File("./index")));
    IndexSearcher searcher = new IndexSearcher(reader);

    // query is the query string entered by the user
    Query q = new QueryParser("Description", analyzer).parse(query);

    TopDocs docs = searcher.search(q, k);
    ScoreDoc[] hits = docs.scoreDocs;

Query and QueryParser

    QueryParser parser = new QueryParser(Version.LUCENE_40, "Last_Name", new StandardAnalyzer());
    Query query = parser.parse(q);  // q is the query string

§ QueryParser
  – Need to parse the query in the same way the documents were indexed
  – Tell the parser which field to use by default (field-based search)
  – Use the same analyzer

TopDocs and ScoreDoc

    TopDocs docs = searcher.search(q, k);
    ScoreDoc[] hits = docs.scoreDocs;

§ Search returns TopDocs
  – Reference to the top ranked documents returned by search
§ TopDocs has ScoreDoc(s)
  – Each ScoreDoc refers to a single matching document (its document number and score)
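As a rough picture of what "top ranked documents" means, the sketch below picks the k best-scoring hits from an array of scores using a bounded min-heap. It is a hypothetical plain-Java illustration, not Lucene's collector code: the names `ToyTopDocs` and `Hit` are made up, and treating a score of 0 as "no match" is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Toy top-k selection (hypothetical; not Lucene code). */
final class ToyTopDocs {
    /** One hit: a document number and its score, like the two ScoreDoc values used in the slides. */
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /** Returns the k best-scoring hits, best first; scores[i] is the score of document i. */
    static List<Hit> topK(float[] scores, int k) {
        if (k <= 0) return new ArrayList<>();
        // Min-heap on score: the weakest of the current top k sits at the head.
        PriorityQueue<Hit> heap =
                new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));
        for (int doc = 0; doc < scores.length; doc++) {
            if (scores[doc] <= 0f) continue; // assumption: score 0 means "no match"
            heap.add(new Hit(doc, scores[doc]));
            if (heap.size() > k) heap.poll(); // evict the weakest hit
        }
        List<Hit> hits = new ArrayList<>(heap);
        hits.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return hits;
    }

    public static void main(String[] args) {
        float[] scores = {0.5f, 0f, 2.0f, 1.5f, 0.1f};
        for (Hit hit : topK(scores, 2)) {
            System.out.println("doc " + hit.doc + " score " + hit.score);
        }
    }
}
```

Keeping only k entries in the heap means memory stays proportional to k, however many documents match.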

Getting the results

    System.out.println("Found " + hits.length + " hits.");
    for (int i = 0; i < hits.length; ++i) {
        int docId = hits[i].doc;
        Document d = searcher.doc(docId);
        System.out.println((i + 1) + ". " + d.get("First_Name") + " " + d.get("Last_Name"));
    }

§ Get the required fields from the documents

Adding/deleting Documents

    void addDocument(Document d);
    void addDocument(Document d, Analyzer a);

Important: Need to ensure that Analyzers used at indexing time are consistent with Analyzers used at searching time

    // deletes docs containing term or matching
    // query. The term version is useful for
    // deleting one document.
    void deleteDocuments(Term term);
    void deleteDocuments(Query query);

Index format

§ Each Lucene index consists of one or more segments
  – A segment is a standalone index for a subset of documents
  – All segments are searched
  – A segment is created whenever IndexWriter flushes adds/deletes
§ Periodically, IndexWriter will merge a set of segments into a single segment
  – Policy specified by a MergePolicy
  – Segments are grouped into levels
  – Segments within a group are roughly equal size (in log space)
  – Once a level has enough segments, they are merged into a segment at the next level up
§ Explicitly invoke optimize() to merge segments (forceMerge() from Lucene 3.5 onwards)
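The level-based merging described above can be modelled with a small plain-Java simulation. It is a simplified sketch, not Lucene's MergePolicy implementation: the class `ToySegmentMerger` and its `mergeFactor` parameter are hypothetical, and a segment's level here is simply the number of merges that produced it rather than a value computed from its size.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of level-based segment merging (hypothetical; not Lucene code). */
final class ToySegmentMerger {
    // Hypothetical knob: how many segments a level holds before they are merged.
    private final int mergeFactor;
    // levels.get(i) holds the document counts of the segments at level i.
    private final List<List<Integer>> levels = new ArrayList<>();

    ToySegmentMerger(int mergeFactor) {
        if (mergeFactor < 2) throw new IllegalArgumentException("mergeFactor must be at least 2");
        this.mergeFactor = mergeFactor;
    }

    /** A flush writes one new segment at level 0, which may trigger cascading merges. */
    void flush(int docCount) {
        add(0, docCount);
    }

    private void add(int level, int docCount) {
        while (levels.size() <= level) levels.add(new ArrayList<>());
        List<Integer> segments = levels.get(level);
        segments.add(docCount);
        if (segments.size() == mergeFactor) {
            int merged = 0;
            for (int size : segments) merged += size;
            segments.clear();
            add(level + 1, merged); // the merged segment moves one level up
        }
    }

    /** Document counts of all live segments, highest level first. */
    List<Integer> segmentSizes() {
        List<Integer> sizes = new ArrayList<>();
        for (int level = levels.size() - 1; level >= 0; level--) {
            sizes.addAll(levels.get(level));
        }
        return sizes;
    }

    public static void main(String[] args) {
        ToySegmentMerger index = new ToySegmentMerger(3);
        for (int i = 0; i < 10; i++) index.flush(10);
        System.out.println(index.segmentSizes()); // prints [90, 10]
    }
}
```

With a merge factor of 3, ten flushes of 10 documents each leave two segments instead of ten, which is the point of merging: fewer segments to search.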

Searching a changing index

    Directory dir = FSDirectory.open(...);
    IndexReader reader = IndexReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);

The reader above does not reflect changes to the index unless you reopen it. Reopening is more resource efficient than opening a new IndexReader.

    IndexReader newReader = reader.reopen();
    if (reader != newReader) {
        reader.close();
        reader = newReader;
        searcher = new IndexSearcher(reader);
    }

reopen() is the Lucene 3.x call; in Lucene 4.x the equivalent is DirectoryReader.openIfChanged(reader), which returns null when the index has not changed.

Near-real-time search

    IndexWriter writer = ...;
    IndexReader reader = writer.getReader();
    IndexSearcher searcher = new IndexSearcher(reader);

    // Now let us say there's a change to the index using writer
    writer.addDocument(newDoc);

    // reopen() and getReader() force writer to flush
    IndexReader newReader = reader.reopen();
    if (reader != newReader) {
        reader.close();
        reader = newReader;
        searcher = new IndexSearcher(reader);
    }

writer.getReader() is the Lucene 3.x call; in Lucene 4.x a near-real-time reader is obtained with DirectoryReader.open(writer, true).

Query Syntax

Query expression                      Document matches if…
java                                  Contains the term java in the default field
java junit                            Contains the term java or junit or both in the default field
java OR junit                           (the default operator can be changed to AND)
+java +junit                          Contains both java and junit in the default field
java AND junit
title:ant                             Contains the term ant in the title field
title:extreme -subject:sports         Contains extreme in the title and not sports in subject
(agile OR extreme) AND java           Boolean expression matches
title:"junit in action"               Phrase matches in title
title:"junit action"~5                Proximity matches (within 5) in title
java*                                 Wildcard matches
java~                                 Fuzzy matches
lastmodified:[1/1/09 TO 12/31/09]     Range matches

Programmatically constructed queries

§ TermQuery, TermRangeQuery
§ NumericRangeQuery
§ PrefixQuery
§ BooleanQuery
§ PhraseQuery, WildcardQuery

Source and acknowledgements

§ Slides by Manning and Nayak: http://web.stanford.edu/class/cs276/handouts/lecture-lucene.pptx
§ The Lucene tutorial website: http://www.lucenetutorial.com
§ Apache Lucene: http://lucene.apache.org