Introduction To Apache Lucene

Introduction to Apache Lucene

Sumit Luthra

Agenda What is Apache Lucene ?

Focus of Apache Lucene

Lucene Architecture

Core Indexing Classes

Core Searching Classes

Questions & Answers

What is Apache Lucene? Apache Lucene is a high-performance, full- featured text search engine library written entirely in Java.”

Also known as Information Retrieval Library.

Lucene is specifically an API, not an application.

Open Source

Focus Indexing Documents

Searching Documents

Note : You can use Lucene to provide consistent full-text indexing across both database objects and documents in various formats (Microsoft Office documents, PDF, HTML, text, emails and so on).

Lucene Architecture

Raw Content

Acquire content

Build document

Analyze document

Index document

Search UI

Build query

Render results

Run query

Indexing DocumentsIndexWriter writer = new IndexWriter(directory, analyzer, true);

Document doc = new Document();doc.add(new Field(“content", “Hello World”,

Field.Store.YES, Field.Index.TOKENIZED));doc.add(new Field(“name", “filename.txt",

Field.Store.YES, Field.Index.TOKENIZED));doc.add(new Field(“path", “http://myfile/",

Field.Store.YES, Field.Index.TOKENIZED));// [...]

writer.addDocument(doc);

writer.close();

Core indexing classes

IndexWriter

Directory

FSDirectory

RAMDirectory

DbDirectory

FileSwitchDirectory

JEDirectory

AnalyzersTokenizes the input text

Common Analyzers

– WhitespaceAnalyzerSplits tokens on whitespace

– SimpleAnalyzerSplits tokens on non-letters, and then lowercases

– StopAnalyzerSame as SimpleAnalyzer, but also removes stop words

– StandardAnalyzerMost sophisticated analyzer that knows about certain token types, lowercases, removes stop words, ...

Analysis examples• “The quick brown fox jumped over the lazy dog”

• WhitespaceAnalyzer

– [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]

• SimpleAnalyzer

– [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]

• StopAnalyzer

– [quick] [brown] [fox] [jumped] [over] [lazy] [dog]

• StandardAnalyzer

– [quick] [brown] [fox] [jumped] [over] [lazy] [dog]

More analysis examples• “XY&Z Corporation – xyz@example.com”

• WhitespaceAnalyzer

– [XY&Z] [Corporation] [-] [xyz@example.com]

• SimpleAnalyzer

– [xy] [z] [corporation] [xyz] [example] [com]

• StopAnalyzer

– [xy] [z] [corporation] [xyz] [example] [com]

• StandardAnalyzer

– [xy&z] [corporation] [xyz@example.com]

Document & FieldsA Document is the atomic unit of indexing and

searching, It contains Fields

Fields have a name and a value

– You have to translate raw content into Fields

– Examples: Title, author, date, abstract, body, URL, keywords, ...

– Different documents can have different fields

Field optionsField.Store

– NO : Don’t store the field value in the index

– YES : Store the field value in the index

Field.Index

– ANALYZED : Tokenize with an Analyzer

– NOT_ANALYZED : Do not tokenize

– NO : Do not index this field

Searching an Index

IndexSearcher searcher = new IndexSearcher(directory);QueryParser parser = new QueryParser(Version, field_name

,analyzer);Query query = parser.parse(WORD_SEARCHED);

TopDocs hits = searcher.search(query, noOfHits);

ScoreDoc[] document = hits.scoreDocs;

Document doc = searcher.doc(0); // look at first matchSystem.out.println(“name=" + doc.get(“name"));searcher.close();

Core searching classes

IndexSearcher

QueryParser

TopDocs

ScoreDoc

IndexSearcherConstructor:

– IndexSearcher(Directory d);

• // Deprecated

– IndexSearcher(IndexReader r);

• Construct an IndexReader with static method IndexReader.open(dir)

Query• TermQuery

– Constructed from a Term

• TermRangeQuery

• NumericRangeQuery

• PrefixQuery

• BooleanQuery

• PhraseQuery

• WildcardQuery

• FuzzyQuery

• MatchAllDocsQuery

QueryParser• Constructor

– QueryParser(Version matchVersion, String defaultField, Analyzer analyzer);

• Parsing methods

– Query parse(String query) throwsParseException;

– ... and many more

QueryParser syntax examplesQuery expression Document matches if…

java Contains the term java in the default field

java junitjava OR junit

Contains the term java or junit or both in the default field (the default operator can be changed to AND)

+java +junit

java AND junit

Contains both java and junit in the default field

title:ant Contains the term ant in the title field

title:extreme –subject:sports Contains extreme in the title and not sports in subject

(agile OR extreme) AND java Boolean expression matches

title:”junit in action” Phrase matches in title

title:”junit action”~5 Proximity matches (within 5) in title

java* Wildcard matches

java~ Fuzzy matches

lastmodified:[1/1/09 TO 12/31/09]

Range matches

TopDocs Class containing top N ranked searched documents/results that match a given query.

ScoreDocArray of ScoreDoc containing documents/resultsthat match a given query.

You will require lucene-core-x.y.jar for this demo.

Demo of simple indexing and searching using Apache Lucene

Any Questions ?

Thank You.

Introduction To Apache Lucene

Technology

Transcript of Introduction To Apache Lucene

Apache Lucene - a library retrieving data for millions …gotocon.com/dl/goto-amsterdam-2011/slides/SimonWillnauer_Apache...Apache Lucene - a library retrieving data for millions of

Apache Lucene/Solr Meetup 07/28/2010cdn.harmonyapp.com/assets/.../lucenesolrmeetupjuly2010short.pdf · Apache Lucene/Solr Meetup 07/28/2010 1 Monday, August 9, 2010. See you in your

Apache Lucene 5 - FOSDEM · 2015-02-18 · Apache Lucene 5 New Features and Improvements for Apache Solr and Elasticsearch ... My Background •Committer and PMC member of Apache

EPL 660: Lab 2 Apache Lucene Part II Department of ...

Introducing Apache Lucene with two demos

Introduction to Open Source Search with Apache Lucene and Solr

Apache UIMA Lucene CAS Indexer Documentation v2.3.1

Apache lucene

Apache Lucene for Java EE Developers

Apache Lucene - a library retrieving data for millions of ... · Apache Lucene - a brief introduction •A fulltext search library entirely written in Java •An ASF Project since

Hibernate Search - Apache Lucene™ Integration · Hibernate Search Apache Lucene™ Integration Reference Guide Emmanuel Bernard Hardy Ferentschik Gustavo Fernandes Sanne Grinovero

PANGAEA - Providing access to geoscientific data using Apache Lucene Java

Apache Lucene: Searching the Web and Everything Else - Daniel Naber

Apache Lucene 6 - Berlin Buzzwords Background • Committer and PMC member of Apache Lucene and Solr - main focus is on development of Lucene Core. • Implemented fast numerical search

Suche mit Apache Lucene & Co Christian Meder - inovex GmbH

Open Source Search: Die Welt von Apache Lucene - WJax 2009

Apache Lucene: Searching the Web and Everything Else (Jazoon07)

Apache Solr/Lucene: Looking Ahead

Intelligent Apps with Apache Lucene, Mahout and Friends

Text Classification with Lucene/Solr, Apache Hadoop and LibSVM