[F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene
![Page 1: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/1.jpg)
Information Access with Apache Lucene
Pierpaolo Basile, PhD
Research Assistant, University of Bari Aldo Moro
co-founder QuestionCube s.r.l.
pierpaolo.basile@{uniba.it, questioncube.com}
CODE CAMP: JULY 20 - AUGUST 1, 2012
S. VITO DEI NORMANNI (BR)
![Page 2: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/2.jpg)
Outline
• Search Engine
– Vector Space Model
– Inverted Index
• Apache Lucene
– Indexing
– Searching
• Advanced topics
– Relevance feedback
– Document similarity
– Apache Tika
![Page 3: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/3.jpg)
SEARCH ENGINE
![Page 4: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/4.jpg)
Searching
Collection of documents
User query
Relevant documents
![Page 5: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/5.jpg)
Information Retrieval
Information Retrieval (IR) is the area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as that of searching structured storage, relational databases, and the World Wide Web.
Wikipedia
![Page 6: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/6.jpg)
Information Retrieval Process
[Figure: process diagram. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Index (inverted index). Data exchanged between them: docs, text, logical view, query, retrieved documents, ranked documents, user feedback.]
![Page 7: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/7.jpg)
Information Retrieval Model
<D, Q, F, R(qi, dj)>
• D: document representation
• Q: query representation
• F: query/document representation framework
• R(qi, dj): ranking function
![Page 8: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/8.jpg)
Bag-of-words representation
Document/query as unordered collection of words
John likes to watch movies. Mary likes too. John also likes to watch football games.
{"John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5, "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10}
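The slide's map assigns an index to each distinct word. A closely related and common variant stores the number of occurrences instead. A minimal sketch of that variant follows; the class name `BagOfWords` and the tokenization rule (split on non-letters) are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class BagOfWords {
    static final String SAMPLE = "John likes to watch movies. Mary likes too. "
            + "John also likes to watch football games.";

    /** Counts how often each word occurs; word order is discarded. */
    static Map<String, Integer> count(String text) {
        Map<String, Integer> bag = new LinkedHashMap<>();
        // Assumed tokenization: split on any run of non-letter characters.
        for (String token : text.split("[^A-Za-z]+")) {
            if (!token.isEmpty()) {
                bag.merge(token, 1, Integer::sum);
            }
        }
        return bag;
    }

    public static void main(String[] args) {
        // Prints each distinct word with its number of occurrences.
        System.out.println(count(SAMPLE));
    }
}
```

The map holds ten distinct words, the same vocabulary as the slide; only the stored value differs (count instead of index).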
![Page 9: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/9.jpg)
Vector Space Model
Query/Document Framework
• Algebraic model
• Documents/queries are represented as vectors in a geometric space
• Terms are vector components
![Page 10: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/10.jpg)
Vector Space Model
$$d_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$$
$$q = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$$
Each vector component corresponds to a term
The component value $w_{i,j}$ is the weight of the term $t_i$ in the document $d_j$
![Page 11: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/11.jpg)
Vector Space Model
dj = (hit:5, refresh:2)
q1 = (refresh:1)
q2 = (hit:1)
[Figure: dj, q1 and q2 drawn as vectors in a plane with axes "hit" and "refresh"]
![Page 12: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/12.jpg)
Vector Space Model
Ranking function
• Vector similarity between query and document computed by cosine
$$R(q, d_j) = \cos\theta = \frac{d_j \cdot q}{\lVert d_j \rVert \, \lVert q \rVert}$$
![Page 13: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/13.jpg)
Vector Space Model
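dj = (hit:5, refresh:2)
q1 = (refresh:1)
q2 = (hit:1)
[Figure: the same vectors in the plane with axes "hit" and "refresh", with the angles θ1 and θ2 between dj and the query vectors]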
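The cosine ranking for the example vectors can be sketched in a few lines of plain Java. The class name `CosineSimilarity` and the component order (hit, refresh) are choices made for this illustration, not part of the slides.

```java
class CosineSimilarity {
    /** Cosine of the angle between two vectors of the same dimension. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        // Convention assumed here: a zero vector is similar to nothing.
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] dj = {5, 2}; // components: (hit, refresh)
        double[] q1 = {0, 1}; // refresh:1
        double[] q2 = {1, 0}; // hit:1
        System.out.println("cos(dj, q1) = " + cosine(dj, q1));
        System.out.println("cos(dj, q2) = " + cosine(dj, q2));
    }
}
```

Since dj has weight 5 on "hit" and only 2 on "refresh", it scores higher for q2 (5/√29) than for q1 (2/√29).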
![Page 14: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/14.jpg)
Term-Document matrix
D1:
• ‘It’s a friend of mine — a Cheshire Cat,’ said Alice: ‘allow me to introduce it.’
• ‘It’s the oldest rule in the book,’ said the King. ‘Then it ought to be Number One,’ said Alice.
D2:
• Alice watched the White King as he slowly struggled up from bar to bar, till at last she said, ‘Why, you’ll be hours and hours getting to the table, at that rate.’
• Alice looked round eagerly, and found that it was the Red Queen. ‘She’s grown a good deal!’ was her first remark.
D3:
• In the pool of light was a billiards table, with two figures moving around it. Alice walked toward them, and as she approached they turned to look at her.
• Alice lay back, and closed her eyes. There was the Red Queen again, with that incessant grin. Or was it the Cheshire cat's grin?
![Page 15: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/15.jpg)
Term-Document matrix
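| Term | D1 | D2 | D3 |
|---|---|---|---|
| Cheshire Cat | 1 | 0 | 1 |
| Alice | 2 | 2 | 2 |
| book | 1 | 0 | 0 |
| King | 1 | 1 | 0 |
| table | 0 | 1 | 1 |
| Queen | 0 | 1 | 1 |
| grin | 0 | 0 | 2 |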
![Page 16: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/16.jpg)
Term-Document matrix
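Query: Alice AND Queen

| Term | D1 | D2 | D3 | Query |
|---|---|---|---|---|
| Cheshire Cat | 1 | 0 | 1 | 0 |
| Alice | 2 | 2 | 2 | 1 |
| book | 1 | 0 | 0 | 0 |
| King | 1 | 1 | 0 | 0 |
| table | 0 | 1 | 1 | 0 |
| Queen | 0 | 1 | 1 | 1 |
| grin | 0 | 0 | 2 | 0 |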
![Page 17: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/17.jpg)
Inverted Index
![Page 18: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/18.jpg)
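Inverted Index

| Dictionary | Posting list (document:term count) |
|---|---|
| Alice | D1:2 D2:2 D3:2 |
| … | |
| book | D1:1 |
| … | |
| Cheshire Cat | D1:1 D3:1 |
| … | |
| grin | D3:2 |
| … | |
| King | D1:1 D2:1 |
| … | |
| Queen | D2:1 D3:1 |
| … | |
| table | D2:1 D3:1 |
| … | |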
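How such a dictionary with posting lists could be built is sketched below in plain Java. The class name `InvertedIndexDemo` and the representation of each document as a list of index terms are assumptions for illustration; the term lists reproduce the counts of the term-document matrix.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class InvertedIndexDemo {
    /** Maps each term to its posting list: document id -> term count. */
    static Map<String, Map<String, Integer>> build(Map<String, List<String>> docs) {
        // Sorted dictionary, ignoring case as in the slide.
        Map<String, Map<String, Integer>> index = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            for (String term : doc.getValue()) {
                index.computeIfAbsent(term, t -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    /** The three example documents, reduced to their index terms. */
    static Map<String, List<String>> sampleDocs() {
        Map<String, List<String>> docs = new LinkedHashMap<>();
        docs.put("D1", Arrays.asList("Cheshire Cat", "Alice", "Alice", "book", "King"));
        docs.put("D2", Arrays.asList("Alice", "Alice", "King", "table", "Queen"));
        docs.put("D3", Arrays.asList("Cheshire Cat", "Alice", "Alice", "table", "Queen", "grin", "grin"));
        return docs;
    }

    public static void main(String[] args) {
        // Prints one line per dictionary entry with its posting list.
        build(sampleDocs()).forEach((term, postings) ->
                System.out.println(term + " -> " + postings));
    }
}
```

Looking up a term is a single dictionary access, and only the documents that contain the term are touched, which is the point of inverting the term-document matrix.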
![Page 19: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/19.jpg)
Inverted Index
Posting List: the list attached to each dictionary term, with one entry per document that contains the term and the term count in that document (for example, Alice: D1:2 D2:2 D3:2)
![Page 27: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/27.jpg)
Inverted Index
Posting List and Position List: besides the term count, each posting can store the positions at which the term occurs in the document
Term positions in D1 shown on the slide: 14, 49; 1, 40; 18, 34; 33
![Page 28: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/28.jpg)
Inverted Index: query processing
AND (intersection ∩)
Alice AND King -> (D1, D2)
<Alice> ∩ <King>
<Alice>: D1:2 D2:2 D3:2
<King>: D1:1 D2:1
![Page 29: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/29.jpg)
Inverted Index: query processing
OR (union ∪)
grin OR King -> (D1, D2, D3)
<grin> ∪ <King>
<grin>: D3:2
<King>: D1:1 D2:1
![Page 30: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/30.jpg)
Inverted Index: query processing
NOT (complement \)
Alice NOT grin -> (D1, D2)
<Alice> \ <grin>
<Alice>: D1:2 D2:2 D3:2
<grin>: D3:2
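The three Boolean operators map directly onto set operations over the document ids of the posting lists. A minimal sketch, assuming a hypothetical class `BooleanRetrieval` and using `java.util` sets rather than the linear merge of sorted posting lists that a real engine would typically use:

```java
import java.util.Arrays;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

class BooleanRetrieval {
    /** AND: documents present in both posting lists (intersection). */
    static SortedSet<String> and(Set<String> a, Set<String> b) {
        TreeSet<String> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    /** OR: documents present in at least one posting list (union). */
    static SortedSet<String> or(Set<String> a, Set<String> b) {
        TreeSet<String> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    /** NOT: documents in the first list but not in the second (difference). */
    static SortedSet<String> not(Set<String> a, Set<String> b) {
        TreeSet<String> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }

    // Document ids taken from the posting lists of the example index.
    static final Set<String> ALICE = new TreeSet<>(Arrays.asList("D1", "D2", "D3"));
    static final Set<String> KING = new TreeSet<>(Arrays.asList("D1", "D2"));
    static final Set<String> GRIN = new TreeSet<>(Arrays.asList("D3"));

    public static void main(String[] args) {
        System.out.println("Alice AND King -> " + and(ALICE, KING));
        System.out.println("grin OR King   -> " + or(GRIN, KING));
        System.out.println("Alice NOT grin -> " + not(ALICE, GRIN));
    }
}
```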
![Page 31: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/31.jpg)
Term-weight
• Measures the term relevance in a document
– component value in the document representation
– TF*IDF
• TF (term frequency): term occurrences in the document
• IDF (inverse document frequency): inverse to the number of documents in which the term occurs
$$\mathrm{tf*idf}(t, d, D) = \mathrm{tf}(t, d) \cdot \log \frac{|D|}{|\{d \in D : t \in d\}|}$$
![Page 32: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/32.jpg)
Term-weight
• Measures the term relevance in a document
– component value in the document representation
– TF*IDF
• TF (term frequency): term occurrences in the document
• IDF (inverse document frequency): inverse to the number of documents in which the term occurs
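$$\mathrm{tf*idf}(t, d, D) = \mathrm{tf}(t, d) \cdot \log \frac{|D|}{|\{d \in D : t \in d\}|}$$
• $|D|$: number of documents in the collection
• the logarithmic factor is the idf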
![Page 33: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/33.jpg)
TF*IDF insights
• increases proportionally to the term frequency in the document
• decreases with the number of documents in which the term occurs
– common words are generally more frequent in the collection
• IDF depends on the collection, TF on the document
![Page 34: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/34.jpg)
Inverted index/TF*IDF
• TF: computed by term occurrences in the posting list
• IDF: computed by the posting list cardinality
Alice: D1:2 D2:2 D3:2, |D| = 100
$$\mathrm{tf*idf}(\mathrm{Alice}, D1) = 2 \cdot \log \frac{100}{3}$$
TF = 2: the term count stored in the posting D1:2
![Page 35: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/35.jpg)
Inverted index/TF*IDF
• TF: computed by term occurrences in the posting list
• IDF: computed by the posting list cardinality
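Alice: D1:2 D2:2 D3:2, |D| = 100
$$\mathrm{tf*idf}(\mathrm{Alice}, D1) = 2 \cdot \log \frac{100}{3}$$
IDF = log(100/3): the posting list of Alice has 3 entries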
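The same computation can be sketched in plain Java. The slides do not state the base of the logarithm; the natural logarithm is assumed here, and the class name `TfIdf` is hypothetical.

```java
class TfIdf {
    /**
     * tf*idf weight of a term in a document.
     *
     * @param termFrequency     term count in the document (from the posting)
     * @param documentFrequency number of documents containing the term
     *                          (length of the posting list)
     * @param collectionSize    number of documents in the collection, |D|
     */
    static double tfIdf(int termFrequency, int documentFrequency, int collectionSize) {
        if (documentFrequency <= 0 || collectionSize <= 0) {
            throw new IllegalArgumentException("frequencies must be positive");
        }
        // Assumption: natural logarithm; another base only rescales all weights.
        return termFrequency * Math.log((double) collectionSize / documentFrequency);
    }

    public static void main(String[] args) {
        // Alice in D1: tf = 2, posting list of length 3, |D| = 100
        System.out.println("tf*idf(Alice, D1) = " + tfIdf(2, 3, 100));
    }
}
```

A term that occurs in every document gets weight 0 regardless of its frequency, since log(1) = 0.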
![Page 36: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/36.jpg)
APACHE LUCENE
http://lucene.apache.org
![Page 37: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/37.jpg)
Apache Lucene
• Apache Software Foundation project
– http://lucene.apache.org
• What Lucene is
– Search library: indexing and searching Application Programming Interface (API)
• What Lucene is not
– Search engine (no crawling, server, user interface, etc.)
![Page 38: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/38.jpg)
Lucene
• Full-text search
• High performance
• Scalable
• Cross-platform
• 100%-pure Java
• Ports to other languages: .NET, Python
![Page 39: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/39.jpg)
Information Retrieval Process
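[Figure: the Information Retrieval process diagram shown earlier. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Index (inverted index). Data exchanged between them: doc, text, logical view, query, retrieved documents, ranked documents, user feedback.]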
![Page 40: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/40.jpg)
Information Retrieval Process
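[Figure: the same process diagram, with each component replaced by the corresponding class in org.apache.lucene]
• Text Operations: Analyzer/Tokenizer/Filter
• Indexing: IndexWriter
• Ranking: Similarity
• Searching: IndexSearcher
• Query Operations: QueryParser
• Doc: Document
• Index access: IndexReader
• Unchanged: User Interface, Index (inverted index), text, query, logical view, retrieved documents, ranked documents, user feedback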
![Page 41: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/41.jpg)
Lucene Model 1/2
• Based on Vector Space Model
• Features:
– multi-field document
– tf (term frequency): number of matching terms in field
– lengthNorm: number of tokens in field
– idf: inverse document frequency
– coord: coordination factor, number of matching terms
– field boost
– query clause boost
![Page 42: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/42.jpg)
Lucene Model 2/2
• coord(q,d): score factor based on how many of the query terms are found in the specified document
• queryNorm(q): normalizing factor used to make scores between queries comparable
• boost(t): term boost
• norm(t,d): encapsulates a few (indexing time) boost and length factors
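For reference, the factors listed above are combined roughly as follows in the scoring formula documented for the default Similarity of Lucene 3.x. The slide's own formula did not survive extraction, so this is a reconstruction from the library documentation rather than from the slide; check the Javadoc of the version in use.

```latex
\mathrm{score}(q, d) = \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q) \cdot
  \sum_{t \in q} \Big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot
  \mathrm{boost}(t) \cdot \mathrm{norm}(t, d) \Big)
```

The sum runs over the query terms, so tf, idf, boost and norm are per-term factors, while coord and queryNorm scale the whole score.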
![Page 43: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/43.jpg)
Lucene
TopDocs
![Page 44: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/44.jpg)
Document Indexing
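import java.io.File;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
// Open the directory that holds the index files
FSDirectory dir = FSDirectory.open(new File("/home/user/index_test"));
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
// CREATE replaces any index already present in the directory
config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
IndexWriter writer = new IndexWriter(dir, config);
// One document with three stored and analyzed fields
Document doc = new Document();
doc.add(new Field("super_name", "Sandman", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("name", "William Baker", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("category", "superhero", Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(doc);
// Closing the writer commits the pending changes
writer.close();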
![Page 45: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/45.jpg)
Field Options
• Field.Store
– NO: field value is not stored
– YES: field value is stored
• Field.Index
– ANALYZED: field is analyzed by the analyzer
– NO: not indexed
– NOT_ANALYZED: indexed but not analyzed
![Page 47: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/47.jpg)
Document: Fields Representation
http://www.bbc.co.uk/news/technology-15365207
![Page 48: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/48.jpg)
Document: Fields Representation
• Date: Field.Store.YES, Field.Index.NOT_ANALYZED
• Authors: Field.Store.YES, Field.Index.ANALYZED
• Title: Field.Store.NO, Field.Index.ANALYZED
• Abstract: Field.Store.NO, Field.Index.ANALYZED
• Content: Field.Store.NO, Field.Index.ANALYZED
• Comments: Field.Store.NO, Field.Index.ANALYZED
• URL (http://www.bbc.co.uk/news/technology-15365207): Field.Store.YES, Field.Index.NO
![Page 49: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/49.jpg)
Text Operations
• Tokenization
– split text into tokens
• Stop word elimination
– remove common words and closed-class words (e.g. function words)
• Stemming
– reduce inflected (or sometimes derived) words to their stem
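The first two operations can be sketched without any library. The class name `TextOperations` and the tiny stop list are assumptions made for illustration (the list is not Lucene's), and stemming is left out because it needs a real stemmer such as Porter's.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class TextOperations {
    // Tiny illustrative stop list; a real one is much longer.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to", "in"));

    /** Tokenizes on whitespace, lowercases, and drops stop words. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The quick brown fox jumped over the lazy dogs"));
    }
}
```

The same chain has to be applied to documents at indexing time and to queries at search time, otherwise query terms would not match the indexed terms.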
![Page 50: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/50.jpg)
Analyzers
• KeywordAnalyzer
– Returns a single token
• WhitespaceAnalyzer
– Splits tokens at whitespace
• SimpleAnalyzer
– Divides text at non-letter characters and lowercases
• StopAnalyzer
– Divides text at non-letter characters, lowercases, and removes stop words
• StandardAnalyzer
– Tokenizes based on a sophisticated grammar that recognizes e-mail addresses, acronyms, etc.; lowercases and removes stop words (optional)
![Page 51: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/51.jpg)
Analyzers
The quick brown fox jumped over the lazy dogs
• WhitespaceAnalyzer:
[The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
• SimpleAnalyzer:
[the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
• StopAnalyzer:
[quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
• StandardAnalyzer:
[quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
![Page 52: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/52.jpg)
Analyzers
"XY&Z Corporation - xyz@example.com"
• WhitespaceAnalyzer:
[XY&Z] [Corporation] [-] [xyz@example.com]
• SimpleAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]
• StandardAnalyzer:
[xy&z] [corporation] [xyz@example.com]
![Page 53: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/53.jpg)
Tokenizers
• A Tokenizer is a TokenStream whose input is a Reader
• Tokenizers break field text into tokens
• StandardTokenizer
– source string: "full-text lucene.apache.org"
– "full" "text" "lucene.apache.org"
• WhitespaceTokenizer
– "full-text" "lucene.apache.org"
• LetterTokenizer
– "full" "text" "lucene" "apache" "org"
![Page 54: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/54.jpg)
TokenFilters
• A TokenFilter is a TokenStream whose input is another token stream
• LowerCaseFilter
• StopFilter
• LengthFilter
• PorterStemFilter
– stemming: reducing words to root form
– rides, ride, riding => ride
– country, countries => countri
![Page 55: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/55.jpg)
Analyzer
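import java.io.Reader;
import java.util.ArrayList;
import java.util.Set;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.util.Version;
public class MyAnalyzer2 extends Analyzer {
    // Stop word set built from Lucene's default English list
    private Set<?> sw = StopFilter.makeStopSet(Version.LUCENE_36, new ArrayList<Object>(StopAnalyzer.ENGLISH_STOP_WORDS_SET));
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Chain: whitespace tokenization, then lowercasing, then stop word removal
        TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_36, reader);
        ts = new LowerCaseFilter(Version.LUCENE_36, ts);
        ts = new StopFilter(Version.LUCENE_36, ts, sw);
        return ts;
    }
}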
![Page 56: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/56.jpg)
TermVectors
• Store information about term frequency, positions and offsets
• Relevance Feedback and “More Like This”
• Clustering
• Similarity between two documents
• Highlighter
– needs offsets info
![Page 57: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/57.jpg)
Lucene TermVector
• TermFreqVector is a representation of all of the terms and term counts in a specific Field of a Document instance
• As a tuple:
– termFreq = <term, term count_D>
– <fieldName, <…, termFreq_i, termFreq_i+1, …>>
• As Java:
– public String getField();
– public String[] getTerms();
– public int[] getTermFrequencies();
![Page 58: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/58.jpg)
Lucene TermPositionVector
• TermPositionVector is a representation of all of the terms, term counts, term position and term offsets in a specific Field of a Document instance
• As Java:
– public String getField();
– public String[] getTerms();
– public int[] getTermFrequencies();
– public int[] getTermPositions(int index);
– public TermVectorOffsetInfo[] getOffsets(int index);
![Page 59: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/59.jpg)
Store TermVector
• During indexing, create a Field that stores Term Vectors:
– new Field("text", text, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES);
• Options are:
– Field.TermVector.YES
– Field.TermVector.NO
– Field.TermVector.WITH_POSITIONS
– Field.TermVector.WITH_OFFSETS
– Field.TermVector.WITH_POSITIONS_OFFSETS
![Page 60: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/60.jpg)
Term Vector: Positions and Offsets
| Position | Token | Offsets (start-end) |
|---|---|---|
| 0 | The | 0-3 |
| 1 | quick | 4-9 |
| 2 | brown | 10-15 |
| 3 | fox | 16-19 |
| 4 | jumped | 20-26 |
| 5 | over | 27-31 |
| 6 | the | 32-35 |
| 7 | lazy | 36-40 |
| 8 | dogs | 41-45 |
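The difference between the two notions can be reproduced with a small plain-Java sketch: the position counts tokens, the offsets count characters (start inclusive, end exclusive). The class name `PositionsAndOffsets` and the whitespace tokenization are assumptions for illustration and do not use the Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

class PositionsAndOffsets {
    /** Returns one entry per token, formatted as position:token[start-end]. */
    static List<String> describe(String text) {
        List<String> out = new ArrayList<>();
        int position = 0;
        int i = 0;
        int n = text.length();
        while (i < n) {
            // Skip whitespace between tokens.
            while (i < n && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i >= n) {
                break;
            }
            int start = i;
            // Consume the token up to the next whitespace.
            while (i < n && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            out.add(position + ":" + text.substring(start, i) + "[" + start + "-" + i + "]");
            position++;
        }
        return out;
    }

    public static void main(String[] args) {
        for (String entry : describe("The quick brown fox jumped over the lazy dogs")) {
            System.out.println(entry);
        }
    }
}
```

Positions are what phrase and proximity queries rely on, while offsets let a highlighter cut the matching characters out of the original text.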
![Page 61: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/61.jpg)
Example
• Main method with three parameters
– Input_dir: directory to index
– Index_dir: directory containing the index files
– Flag (string) corresponding to different Field.TermVector... options
  • t: tokenize
  • tv: term vectors
  • tvp: term vectors with positions
  • tvo: term vectors with positions and offsets
• The constructor of the Index class takes the three parameters as input and indexes the directory
– Filters only ".txt" files
![Page 62: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/62.jpg)
Lucene Search
• IndexReader: interface for accessing an index
• QueryParser: parses the user query
• IndexSearcher: implements search over a single IndexReader
– search(Query query, int numDoc)
– TopDocs -> result of search
• TopDocs.scoreDocs returns an array of retrieved documents
![Page 63: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/63.jpg)
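Lucene Search
import java.io.File;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
Directory dir = FSDirectory.open(new File(args[0]));
IndexReader reader = IndexReader.open(dir);
IndexSearcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
QueryParser parser = new QueryParser(Version.LUCENE_36, "text", analyzer);
Query query = parser.parse(args[1]);
// Retrieve at most the 100 best-scoring documents
TopDocs topDocs = searcher.search(query, 100);
System.out.println("matches: " + topDocs.totalHits);
for (int i = 0; i < topDocs.scoreDocs.length; i++) {
    Document doc = reader.document(topDocs.scoreDocs[i].doc);
    System.out.println("name: " + doc.get("name") + ", score: " + topDocs.scoreDocs[i].score);
    System.out.println();
}
searcher.close();
// The searcher does not close a reader that was passed to its constructor
reader.close();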
![Page 64: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/64.jpg)
Lucene QueryParser
• Example:
– queryParser.parse("name:SpiderMan");
• good for human-entered queries and debugging
• does text analysis and constructs appropriate queries
• not all query types supported
![Page 65: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/65.jpg)
QUERY SYNTAX 1/2
• title:"The Right Way" AND text:go
– Phrase query + term query
• Wildcard Searches
– tes* (test - tests - tester)
– te?t (test – text)
– te*t (tempt)
– tes?
• Fuzzy Searches (Levenshtein Distance) (TERM)
– roam~ (foam – roams)
– roam~0.8
• Range Searches
– mod_date:[20020101 TO 20030101]
– title:{Aida TO Carmen}
• Proximity Searches (PHRASE)
– "jakarta apache"~10
![Page 66: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/66.jpg)
QUERY SYNTAX 2/2
• Boosting a Term
– jakarta^4 apache
– "jakarta apache"^4 "Apache Lucene"
• Boolean Operator
– NOT, OR, AND
– + required operator
– - prohibit operator
• Grouping by ( )
• Field Grouping
– title:(+return +"pink panther")
• Escaping Special Characters by \
![Page 67: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/67.jpg)
QUERY (BY CODE) 1/2
TermQuery query = new TermQuery(new Term("name", "Spider-Man"));
• explicit, no escaping necessary
• does not do text analysis for you
• Query
– TermQuery
– BooleanQuery
– PhraseQuery
– FuzzyQuery / WildcardQuery /
![Page 68: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/68.jpg)
QUERY (BY CODE) 2/2
TermQuery tq1 = new TermQuery(new Term("name", "pippo"));
TermQuery tq2 = new TermQuery(new Term("name", "pluto"));
BooleanQuery bq = new BooleanQuery();
// "pippo" must occur; "pluto" is optional
bq.add(tq1, BooleanClause.Occur.MUST);
bq.add(tq2, BooleanClause.Occur.SHOULD);
![Page 69: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/69.jpg)
DELETING DOCUMENTS
IndexWriter
– deleteDocuments(Term t)
– deleteDocuments(Term... t)
– deleteDocuments(Query t)
– deleteDocuments(Query... t)
– updateDocument(Term t, Document d)
– updateDocument(Term t, Document d, Analyzer a)
– deleting does not immediately reclaim space
![Page 70: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/70.jpg)
ADVANCED TOPICS
Relevance feedback
Document similarity
Apache Tika
![Page 71: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/71.jpg)
Relevance feedback
• Improve retrieval performance using information about document relevance
– Explicit: relevance indicated by the user (assessor)
– Implicit: from user behavior
– Blind (or pseudo): using information about the top k retrieved documents
![Page 72: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/72.jpg)
Relevance feedback
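[Figure: feedback loop. The user query goes to the Searcher, which returns the retrieved documents; relevance feedback on them produces a new query, which the Searcher uses to return the new retrieved documents.]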
![Page 73: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/73.jpg)
Rocchio Algorithm
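$$Q_{new} = \alpha Q + \beta \frac{1}{|D_{rel}|} \sum_{D_j \in D_{rel}} D_j - \gamma \frac{1}{|D_{norel}|} \sum_{D_k \in D_{norel}} D_k$$
• $Q$: original query
• $D_j$: relevant document
• $D_k$: non-relevant document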
![Page 74: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/74.jpg)
Rocchio Algorithm (blind)
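$$Q_{new} = \alpha Q + \beta \frac{1}{|D_{rel}|} \sum_{D_j \in D_{rel}} D_j - \gamma \frac{1}{|D_{norel}|} \sum_{D_k \in D_{norel}} D_k$$
In blind (pseudo) relevance feedback only the relevant documents are known (they are assumed to be the top K documents)
(generally adopted by search engines)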
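A minimal plain-Java sketch of the Rocchio update over sparse term vectors follows. The class name `Rocchio`, the weight names `alpha`, `beta`, `gamma`, the binary term weights and the two sample documents are all assumptions made for illustration; negative weights are not clipped here, although implementations often do that.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class Rocchio {
    /** Qnew = alpha*Q + beta*centroid(relevant) - gamma*centroid(nonRelevant). */
    static Map<String, Double> expand(Map<String, Double> query,
                                      List<Map<String, Double>> relevant,
                                      List<Map<String, Double>> nonRelevant,
                                      double alpha, double beta, double gamma) {
        Map<String, Double> result = new TreeMap<>();
        addScaled(result, query, alpha);
        for (Map<String, Double> doc : relevant) {
            addScaled(result, doc, beta / relevant.size());
        }
        for (Map<String, Double> doc : nonRelevant) {
            addScaled(result, doc, -gamma / nonRelevant.size());
        }
        return result;
    }

    private static void addScaled(Map<String, Double> target,
                                  Map<String, Double> vector, double factor) {
        for (Map.Entry<String, Double> e : vector.entrySet()) {
            target.merge(e.getKey(), factor * e.getValue(), Double::sum);
        }
    }

    /** Blind feedback example: only top-ranked (assumed relevant) documents. */
    static Map<String, Double> example() {
        Map<String, Double> q = new HashMap<>();
        q.put("alice", 1.0);
        Map<String, Double> d1 = new HashMap<>();
        d1.put("alice", 1.0);
        d1.put("cheshire cat", 1.0);
        Map<String, Double> d2 = new HashMap<>();
        d2.put("alice", 1.0);
        d2.put("king", 1.0);
        d2.put("book", 1.0);
        return expand(q, Arrays.asList(d1, d2),
                Collections.<Map<String, Double>>emptyList(), 1.0, 0.5, 0.0);
    }

    public static void main(String[] args) {
        System.out.println(example());
    }
}
```

With an empty non-relevant list the third term contributes nothing, which is the blind (pseudo) feedback case: the new query keeps the original term and gains the terms of the top documents with smaller weights.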
![Page 75: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/75.jpg)
Relevance feedback
Q = Alice
It’s a friend of mine — a Cheshire Cat,’ said Alice: ‘allow me to introduce it.
It’s the oldest rule in the book,’ said the King. ‘Then it ought to be Number One,’ said Alice
Alice lay back, and closed her eyes. There was the Red Queen again, with that incessant grin. Or was it the Cheshire cat's grin?
Alice looked round eagerly, and found that it was the Red Queen. ‘She’s grown a good deal!’ was her first remark.
Qnew = Alice Cheshire Cat King book
It’s a friend of mine — a Cheshire Cat,’ said Alice: ‘allow me to introduce it.
It’s the oldest rule in the book,’ said the King. ‘Then it ought to be Number One,’ said Alice
Alice book was reprinted and published in 1866.
…
![Page 76: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/76.jpg)
Relevance feedback in Lucene
• Possible using TermVector
– Retrieve the top K documents
– Build the Qnew using TermVector
– Re-query using Qnew
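The three steps can be sketched in plain Java. The sketch assumes the term vectors of the top K documents have already been read from the index (term vectors must be stored at indexing time) and are represented here as term-to-frequency maps. The class `BlindFeedbackSketch`, the method `expandQuery`, and the selection of new terms by summed raw frequency are hypothetical simplifications, not Lucene API; a real system would more likely weight terms as in Rocchio or with TF-IDF.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative blind relevance feedback: expand a query from top-K term vectors. */
final class BlindFeedbackSketch {

    /**
     * Builds Qnew as the original terms followed by the most frequent
     * new terms found in the term vectors of the top K documents.
     */
    static String expandQuery(List<String> queryTerms,
                              List<Map<String, Integer>> topDocVectors,
                              int maxNewTerms) {
        // Sum the term frequencies over the top K documents.
        Map<String, Integer> totals = new HashMap<>();
        for (Map<String, Integer> vector : topDocVectors) {
            vector.forEach((term, freq) -> totals.merge(term, freq, Integer::sum));
        }
        // Candidate expansion terms: everything not already in the query.
        List<String> candidates = new ArrayList<>(totals.keySet());
        candidates.removeAll(queryTerms);
        candidates.sort((a, b) -> {
            int byFreq = Integer.compare(totals.get(b), totals.get(a)); // higher frequency first
            return byFreq != 0 ? byFreq : a.compareTo(b);               // ties: alphabetical
        });
        List<String> expanded = new ArrayList<>(queryTerms);
        expanded.addAll(candidates.subList(0, Math.min(maxNewTerms, candidates.size())));
        return String.join(" ", expanded);
    }
}
```

The returned string would then be parsed and submitted as the new query. Terms are assumed to be already analyzed (lowercased, stop words removed), as they would be in an index.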
![Page 77: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/77.jpg)
Document similarity
• Documents are represented by vectors
– Build the document vector using TermVector
– Compute the similarity using cosine similarity
• Implement a “similar to this document” feature
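As a rough illustration of the cosine step, the sketch below compares two documents represented as term-to-frequency maps, standing in for the term vectors read from the index. The class and method names are hypothetical, and raw term frequencies are used as weights for simplicity; TF-IDF weights would fit the Vector Space Model more closely.

```java
import java.util.Map;

/** Illustrative cosine similarity between two sparse term-frequency vectors. */
final class CosineSketch {

    /** Euclidean norm of a term-frequency vector. */
    static double norm(Map<String, Integer> vector) {
        double sum = 0.0;
        for (int freq : vector.values()) {
            sum += (double) freq * freq;
        }
        return Math.sqrt(sum);
    }

    /** cos(a, b) = (a . b) / (|a| * |b|); returns 0 if either vector is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * (double) other; // only shared terms contribute
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (normA * normB);
    }
}
```

A “similar to this document” feature could rank the other documents by this score against the selected one and return the highest-scoring ones.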
![Page 78: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/78.jpg)
Apache Tika
• Extract metadata and text from several file formats
– XML, HTML, OpenOffice, Microsoft Office, PDF
– MBOX format (email)
– Compressed files
– EXIF metadata from images
– Metadata from audio files (MP3)
• An Apache project, like Lucene
![Page 79: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/79.jpg)
Extract metadata and text
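import java.io.File;
import java.io.FileInputStream;
import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;

// The class name is illustrative; the file to analyze is passed as args[0].
public class TikaExtract {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        // Detect the media type of the file
        String type = tika.detect(new File(args[0]));
        System.out.println("File type: " + type);
        // Extract the text; Tika fills in the metadata while parsing
        Metadata metadata = new Metadata();
        String text = tika.parseToString(new FileInputStream(new File(args[0])), metadata);
        System.out.println("Metadata");
        for (String name : metadata.names()) {
            String value = metadata.get(name);
            if (value != null) {
                System.out.println(name + "=" + value);
            }
        }
        System.out.println("Text: " + text);
    }
}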
![Page 80: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/80.jpg)
Apache Tika + Lucene
[Diagram: Files/URLs → Tika → metadata and text → Lucene]
![Page 81: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/81.jpg)
CONCLUSIONS
![Page 82: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/82.jpg)
Conclusions
• A popular IR model
– Vector Space Model
• Lucene
– provides an API for building search engines
• Apache Tika
– extracts metadata and text from files and URLs
• LET’S GO BUILD YOUR SEARCH ENGINE!
![Page 83: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/83.jpg)
Lucene Contrib
• analyzers: additional analyzers
– Arabic, Chinese, …
• highlighter: a set of classes for highlighting matching terms in search results
• Snowball: stemming analyzer based on the Snowball library http://snowball.tartarus.org
• spellchecker: tools for spellchecking and suggestions with Lucene
![Page 84: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/84.jpg)
Lucene-related projects
• Apache Solr: open source enterprise search platform from the Apache Lucene project
– Full-Text Search Capabilities, High Volume Web Traffic, HTML Administration Interfaces
• Apache Nutch: web-search engine based on Solr and Lucene
– crawler, link-graph database, HTML
![Page 85: [F5 Hit Refresh] Pierpaolo Basile - Accesso alle informazioni con apache lucene](https://reader033.fdocuments.us/reader033/viewer/2022052506/557d62c0d8b42ac43c8b45c2/html5/thumbnails/85.jpg)
Thank you for your attention!
Questions?