What’s New in Apache Solr 1.4

Open Source Search: A Lucid Imagination Technical White Paper

http://www.lucidimagination.com/developers/whitepapers/whats-new-solr-14


© 2009 by Lucid Imagination, Inc., under the terms of a Creative Commons license, as detailed at http://www.lucidimagination.com/Copyrights-and-Disclaimers/. Version 1.02, published 26 October 2009.

Solr, Lucene, ApacheCon and their logos are trademarks of the Apache Software Foundation.


Abstract

Apache Solr is the definitive application development implementation for Lucene, and it is the leading open source search platform.

Solr 1.3 set a high bar for functionality, extensibility, and performance. Since then, Solr committers and contributors have been hard at work making a good thing even better.

This white paper describes the new features and improvements in the latest version, Apache Solr 1.4. In the simplest terms, Solr is now faster and better than before. Central components of Solr have been improved to cut the time needed for processing queries and indexing documents. The goal: to provide a powerful, versatile search application server with ever better scalability, performance and relevancy. New features include streamlined caching, smarter handling of index changes, faster faceting, enhanced data import capabilities, speedier numeric range queries, duplicate detection and more.


Table of Contents

Introduction
Performance Improvements
    Streamlined Caching
    Scalable Concurrent File Access
    Smarter Handling of Index Changes
    Faster Faceting
    Streaming Updates for SolrJ
    What Else Is New for Solr 1.4 Performance
Feature Improvements
    Solr Becomes an Omnivore
    DataImportHandler Enhancements
    Smoother Replication
    More Choices for Logging
    Multiselect Faceting
    Speedier Range Queries
    Duplicate Detection
    New Request Handler Components
    What Else Is New with Solr 1.4 Features
Get Started & Resources
Next Steps
APPENDIX: Choosing Lucene or Solr


Introduction

Apache Solr is the definitive application development implementation for Apache Lucene, and it is the leading open source search platform. If you imagine Lucene as a high-performance race car engine, then Solr is all the things that make that engine usable, such as a chassis, gas pedal, steering wheel, seat, and much more.

Solr makes it easy to develop sophisticated, fast search applications with advanced features such as faceting. Solr builds on Lucene, which provides core indexing and search, as well as spellchecking, hit highlighting, and advanced processing capabilities. Both Solr and Lucene are developed at the Apache Software Foundation.

Lucene currently ranks among the top 15 open source projects and is one of the top 5 Apache projects, with installations at over 4,000 companies. Lucene and Solr downloads have grown nearly tenfold over the past three years; Solr is the fastest-growing Lucene subproject. Lucene and Solr offer an attractive alternative to proprietary licensed search and discovery software vendors.1

Solr 1.3 set a high bar for functionality, extensibility, and performance. Since then, Solr engineers have been hard at work making a good thing even better. This white paper describes the new features and improvements in the latest version, Solr 1.4. In the simplest terms, Solr is now faster and better than before. Central components of Solr have been improved to cut the time needed for processing queries and indexing documents. Many new features have been added, all with the goal of providing users with the information they want as fast as possible.

1 See the Appendix for a discussion of when to choose Lucene or Solr.

Performance Improvements

Solr 1.4 increases Solr’s speed with numerous improvements in key areas. Some of these enhancements are high-performance replacements for standard off-the-shelf Java platform components. Much as a car hobbyist replaces stock parts of an engine, the architects and programmers working on Solr have replaced crucial components to make Solr 1.4 run faster than ever for many common operations.

Streamlined Caching

Solr caches data from its index as an optimization, because reading from memory is always faster than reading from the file system. Over the duration of a single faceting request, the cache might be accessed hundreds or even thousands of times. Previously, the cache implementation was a synchronized LinkedHashMap from the Java platform API.

Solr 1.4 uses a new class, ConcurrentLRUCache, which is specifically designed to minimize the overhead of synchronization. Anecdotal evidence suggests that this implementation can double query throughput in some circumstances.
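In solrconfig.xml, this surfaces as a cache implementation class you can select per cache. A minimal sketch, assuming the solr.FastLRUCache class name that fronts the new concurrent implementation; the sizes here are illustrative, not recommendations:

```xml
<!-- A filter cache backed by the new concurrent LRU implementation.
     size/initialSize/autowarmCount are illustrative tuning values. -->
<filterCache class="solr.FastLRUCache"
             size="16384"
             initialSize="4096"
             autowarmCount="4096"/>
```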

Scalable Concurrent File Access

In the past, Solr used the Java platform’s RandomAccessFile to read data from index files. Reading a portion of a file involves calling seek() to find the right part of the file, and read() to actually retrieve the data.

Multithreaded access to the same file meant that the seek() and read() pairs had to be synchronized. If the data to be read isn’t already in the operating system cache, things get worse: the synchronization causes all other reading threads to wait while the data is retrieved from disk.


The Java New I/O (NIO) API offers a much better solution. NIO’s FileChannel includes a read() method that, in essence, performs a seek() and a read() in a single operation:

public int read(ByteBuffer dst, long position)

Solr 1.4 uses this NIO method (via Lucene’s NIOFSDirectory) to read index files.2
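The positional read is plain JDK NIO, so it can be demonstrated outside Solr. In this sketch (not Solr code), the read never touches a shared file-position pointer, which is why concurrent callers need no synchronization:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class PositionalRead {
    // Read 'length' bytes starting at 'position' with no separate seek():
    // FileChannel's positional read does not move the channel's own file
    // position, so concurrent readers never block one another on a seek.
    static String readAt(Path file, long position, int length) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(length);
            int n = ch.read(buf, position);   // seek + read in one call
            return new String(buf.array(), 0, Math.max(n, 0));
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = java.nio.file.Files.createTempFile("nio-demo", ".txt");
        java.nio.file.Files.write(tmp, "hello world".getBytes());
        System.out.println(readAt(tmp, 6, 5)); // prints "world"
    }
}
```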

Smarter Handling of Index Changes

Solr generally keeps a big pile of documents in an existing index. New documents are periodically added, but usually the number of new documents is small compared with the size of the index. Solr (via Lucene) stores the index as a collection of segments; as new documents are added, most of the segments remain unchanged.

Solr 1.4 is very much aware that, for the most part, index segments don’t change. Consequently, Solr is much smarter about reusing unchanged segments, which results in less memory churn, less disk access, and better performance.

2 On Windows, the older RandomAccessFile implementation is used because of a bug in the Windows NIO implementation.

[Figure: reopen() creating a new index view; unchanged index segments on disk are reused, and only new segments are read.]


One example is reloading an index. Previously, the entire index was loaded again, which is expensive in time and resources. Now, Solr 1.4 is smart enough to reuse index segments that haven’t changed, resulting in a much more efficient reload of a modified index. This means that adding new documents to an index and making them available comes at a lower resource cost. The figure above illustrates the mechanism.

Many other optimizations have been made with respect to index segments. The field cache, for example, is now split so there is one field cache per segment. Again, this results in much more efficient processing of index updates, because the field caches for unchanged segments do not need to be touched.

Faster Faceting

One of Solr’s killer features is faceting, the ability to quickly narrow and drill down into search results by categories. Solr uses UnInvertedField to keep a mapping between documents and field values so it can provide faceting information in response to queries.

For multivalued fields, Solr 1.4 includes a new implementation of UnInvertedField that can be 50 times faster and 5 times smaller than its predecessor. Single-valued fields still use either the enum or fieldcache method.

Streaming Updates for SolrJ

SolrJ is the API that Java client applications use to work with Solr. The Solr 1.4 version of SolrJ includes an optimized implementation, StreamingUpdateSolrServer, which is useful for indexing many documents at a time.


For bulk updates, consider switching to the new implementation. In one simple test, the number of documents indexed per second jumped from 231 to 25,000 when using the new implementation.
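A minimal SolrJ sketch of the idea, assuming a Solr instance at localhost:8983; the queue size (20) and thread count (4) passed to the constructor are illustrative tuning knobs, and the field names are hypothetical:

```java
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        // Buffers documents and streams them to Solr over a pool of
        // connections instead of issuing one HTTP request per add.
        SolrServer server =
            new StreamingUpdateSolrServer("http://localhost:8983/solr", 20, 4);
        for (int i = 0; i < 10000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("name", "document " + i);
            server.add(doc);   // queued and streamed, not sent one-by-one
        }
        server.commit();       // make the documents searchable
    }
}
```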

What Else Is New for Solr 1.4 Performance

In addition to these important performance enhancements in Solr 1.4, there are several more, including:

• Binary format for updates, much more compact than XML, now available for SolrJ.
• omitTermFreqAndPositions can be applied to a field so that Solr does not compute the number of terms and the list of positions for that field, which saves time and space for nontext fields.
• Queries that don’t sort by score can eliminate scoring, which speeds up queries.
• Filters now apply before the main query, which makes queries 300% faster in some cases.
• A new filter implementation for small result sets, so it runs smaller and faster.

Feature Improvements

Aside from performance improvements, Solr 1.4 sports a variety of great new features. As an open source project, Solr is largely created by the people who use it, so the new features are the ones that the community cares about most passionately.

Solr Becomes an Omnivore

Solr can’t give you good results unless you give it good data. Normally you feed Solr XML documents corresponding to the structure of your schema. This works fine, and if all your data consists of XML documents, they can be fed directly to Solr or easily transformed into the correct input.

Of course, reality is always messy. Chances are that many documents you want to include in your Solr index are in other file formats, like PDF or Microsoft Word. Fortunately, Solr 1.4 knows how to deal with the mess.


Solr 1.4 can now ingest these other types of documents using a feature called Solr Cell.3 Solr Cell uses another open source project, Tika, to read documents in a variety of formats and convert them to an XHTML stream. Solr parses the stream to produce a document, which is then indexed.

Here are a few of the formats that Tika understands:

• PDF
• OpenDocument (OpenOffice formats)
• Microsoft OLE 2 Compound Document (Word, PowerPoint, Excel, Visio, etc.)
• HTML
• RTF
• gzip
• ZIP
• Java Archive (JAR) files
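Documents are typically sent to Solr Cell over HTTP. A sketch, assuming a Solr instance at localhost:8983 with the extracting handler mapped to /update/extract, and a hypothetical local file report.pdf; the literal.id parameter supplies the unique key for the extracted document:

```
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" \
     -F "file=@report.pdf"
```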

DataImportHandler Enhancements

DataImportHandler knows how to index data pulled from relational databases or XML files. The details of what is indexed and how it happens are configured in solrconfig.xml. Solr 1.4 contains some extremely useful upgrades to DataImportHandler.

The first is the ability to push data into DataImportHandler. In Solr 1.3, DataImportHandler was pull-only. This meant that the only possible way to push data to Solr was to use the update XML or CSV format, which meant you couldn’t take advantage of any of DataImportHandler’s capabilities. In Solr 1.4, a new component called ContentStreamDataSource allows you to use DataImportHandler’s features for indexing pushed content.

Another powerful enhancement in Solr 1.4 is the ability to listen for import events. All you need to do is provide an implementation of the EventListener interface and let Solr know about it in solrconfig.xml. When importing begins and ends, your listener will be notified.

3 The name is based on the acronym Content Extraction Library (CEL). This feature is also known by its more technical name, ExtractingRequestHandler.

Solr 1.4 also brings the ability to control error handling in DataImportHandler. For each entity, you can control what happens when an error occurs. The choices for error handling are as follows:

• abort: The import is stopped and all changes are rolled back.
• skip: The current document is skipped.
• continue: The import continues as if the error did not occur.
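As a sketch of how the per-entity setting looks, assuming the attribute-based DataImportHandler configuration; the entity, query, and field names here are hypothetical:

```xml
<dataConfig>
  <dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/tmp/example" user="sa"/>
  <document>
    <!-- onError="skip": a document that fails is dropped,
         but the rest of the import keeps going -->
    <entity name="item" onError="skip"
            query="select id, name from item">
      <field column="id" name="id"/>
      <field column="name" name="name"/>
    </entity>
  </document>
</dataConfig>
```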

DataImportHandler contains many more enhancements and optimizations in Solr 1.4, including new data sources, new entity processors, and new transformers.

Smoother Replication

Replication is a fancy name for making a copy of a Solr index, which at its heart is just a matter of copying files. Making copies of an index is useful for two reasons. The first is simply to create a backup. The second is to place the same index on multiple Solr servers, which is necessary if you want to distribute incoming requests to improve performance.

Prior to Solr 1.4, replication was implemented with shell scripts, and consequently would only work effectively on platforms with a shell, like Linux. It relied on the Unix rsync utility and on the OS providing hard links, which could require cumbersome scripting and excluded tiered deployments on Windows platforms.

In Solr 1.4, replication has been abstracted and implemented entirely at the Java platform layer, which means it works (and works the same) wherever the Java platform runs. This is great news for anyone using Solr: backups can be performed in the same way on any Solr instance, regardless of hardware or operating system, and configuring replication across multiple Solr instances is similarly uniform. Replication does not require a backup; the index is copied from one live index to another.

Replication and backups are configured in solrconfig.xml. Add a couple of lines if you just want to make a backup: you can choose to back up upon Solr startup or after every commit or optimize. In addition, you can use an HTTP command to request a backup at any time.
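Assuming the replication handler is registered at /replication, as in the master configuration shown below, an on-demand backup is just an HTTP request; hostname and port are illustrative:

```
http://localhost:8983/solr/replication?command=backup
```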


If you need to replicate an index across multiple servers, the configuration is pretty simple. Set it up in the master server’s solrconfig.xml like this:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

You can choose to replicate on startup, after commits, or after optimization. The confFiles element specifies configuration files you want to replicate to slaves.

Once the master configuration is done, point the slaves at the master, something like this:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://masterhostname:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>

The slaves periodically query the master to see if the index has changed. If so, they pull down the changes and apply them. That’s all!

More Choices for Logging

Logging is a crucial capability in a server application. Administrators examine logs to monitor Solr instances and figure out how to make them run optimally. Up until now, Solr used the logging facility included with the Java Development Kit (JDK).

Solr 1.4 uses a more flexible logging framework, SLF4J. SLF4J can bind to several logging implementations, including log4j, Jakarta Commons Logging (JCL), and JDK logging. This binding can be changed simply by swapping JAR files on the classpath.


This is the best possible kind of upgrade. The default configuration, binding SLF4J to JDK logging, provides the same functionality as previous releases of Solr. However, you now have the option of easily plugging in log4j or JCL if you prefer.

Multiselect Faceting

Faceting is the ability to group search results by certain fields. Solr 1.4 adds support for multiselect faceting, the ability to narrow search results by multiple facets at once.

Solr’s support is generic and includes the ability to tag filters and to exclude filters by tag when faceting. A sample query string might look like this:

q=index replication&facet=true
&fq={!tag=proj}project:(lucene OR solr)
&facet.field={!ex=proj}project
&facet.field={!ex=src}source

To see this in action, check out the search facility that Lucid Imagination provides for searching technical knowledge resources on Solr, Lucene, and all its subprojects: http://search.lucidimagination.com/.

Speedier Range Queries

Solr can process queries that include numeric ranges, which means it can answer questions like “Which hats are between size 56 and 64?” and “Which swimming pools are less than 10 meters long?”

In Solr 1.4, standard range queries can now use a prefix tree, or trie. Numbers are placed into the tree based on their digits, which makes range queries faster than comparing each complete number. Thus, for example, 175 is indexed as hundreds:1 tens:17 ones:175. The results have been observed to be up to 40 times faster than standard range queries.

To take advantage of fast range queries, use the TrieField type in your schema. The implementation takes care of the details, and you will notice that range queries are significantly faster.

The illustration below shows an example of a prefix tree, where the leaves of the tree hold the actual term values and all the descendants of a node share a common prefix associated with that node. Bold circles mark all the nodes relevant to retrieving the range from 215 to 977.


Let’s look at another example, this time in the schema. The type attribute in the schema’s field type declaration tells Solr which numeric type you will represent with TrieField. Here are a few declarations that show how to use TrieField for various numeric types:

<fieldType name="tint" class="solr.TrieField" type="integer"
           omitNorms="true" positionIncrementGap="0" indexed="true" stored="false"/>
<fieldType name="tlong" class="solr.TrieField" type="long"
           omitNorms="true" positionIncrementGap="0" indexed="true" stored="false"/>
<fieldType name="tdouble" class="solr.TrieField" type="double"
           omitNorms="true" positionIncrementGap="0" indexed="true" stored="false"/>
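To use one of these types, declare a field with it and then query it with ordinary range syntax; the field name here is hypothetical:

```xml
<!-- A trie-encoded integer field for fast range queries -->
<field name="size" type="tint" indexed="true" stored="true"/>
```

A query such as q=size:[56 TO 64] then runs against the trie-encoded terms, with no change to the query syntax.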

Duplicate Detection

With large sets of documents to be indexed, it is important to detect documents that are identical or nearly identical, so that each document is added to the index only once. Solr 1.4 offers this capability, named document duplicate detection, or deduplication. The more technical name is SignatureUpdateProcessor.

SignatureUpdateProcessor creates a message digest, or hash value, from some or all of the fields of a document. The hash value acts like a fingerprint for the document and can be quickly compared to the hash values of other documents.


Several hashing algorithms are available: MD5Signature and Lookup3Signature are both useful for exact matching, while TextProfileSignature (from the Apache Nutch project) is a fuzzy hashing implementation that detects documents that are nearly equivalent.
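Deduplication is wired up as an update processor chain in solrconfig.xml. A sketch following the pattern on the Solr Deduplication wiki; the signature field and source field names are illustrative:

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- the computed hash is stored in this field -->
    <str name="signatureField">id</str>
    <bool name="overwriteDupes">false</bool>
    <!-- fields that contribute to the fingerprint (illustrative names) -->
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.processor.LogUpdateProcessorFactory"/>
  <processor class="solr.processor.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```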

New Request Handler Components

New request handler components are available in Solr 1.4:

• ClusteringComponent uses Carrot2 to dynamically cluster the top N search results, something like dynamically discovered facets.
• TermsComponent returns indexed terms and their document frequency in a field, useful for auto-suggest and the like.
• TermVectorComponent returns term information per document (term frequency, positions).
• StatsComponent computes statistics on numeric fields: min, max, sum, sumOfSquares, count, missing, mean, stddev.
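As a sketch of how one of these is used, StatsComponent is triggered with request parameters rather than special query syntax; the host and the numeric field name (price) here are illustrative:

```
http://localhost:8983/solr/select?q=*:*&rows=0&stats=true&stats.field=price
```

The response then contains a stats section with the min, max, sum, count, and other values listed above.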

What Else Is New with Solr 1.4 Features

Solr 1.4 has many other new features. A few of them are listed here:

• Ranges over arbitrary functions: {!frange l=1 u=2}sqrt(sum(a,b))
• Nested queries, for function queries too
• solrjs: a JavaScript client library
• commitWithin: a document must be committed within x milliseconds
• Binary field type
• Merging one index into another
• A SolrJ client for load balancing and failover
• Field globbing for some params: hl.fl=*_text
• DoubleMetaphone, an Arabic stemmer, etc.
• VelocityResponseWriter: template responses using Velocity


Get Started & Resources

http://www.lucidimagination.com/blog/2009/02/05/looking-forward-to-new-features-in-solr-14/
http://wiki.apache.org/solr/SolrReplication
http://wiki.apache.org/solr/ExtractingRequestHandler
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika
http://www.lucidimagination.com/blog/tag/range-queries/
http://www.slf4j.org/manual.html
http://wiki.apache.org/solr/Deduplication
http://shalinsays.blogspot.com/2009/09/whats-new-in-dataimporthandler-in-solr.html

Next Steps

For more information on how Lucid Imagination can help your employees, customers, and partners find the information they need more quickly, effectively, and at lower cost, please visit http://www.lucidimagination.com/ to access blog posts, articles, and reviews of dozens of successful implementations.

Certified Distributions from Lucid Imagination are complete, supported bundles of software that include additional bug fixes and performance enhancements, along with our free 30-day Get Started program. Coupled with one of our support subscriptions, a Certified Distribution can provide a complete environment to develop, deploy, and maintain commercial-grade search applications. Certified Distributions are available at www.lucidimagination.com/Downloads.

Please e-mail specific questions to:

Support and Service: [email protected]
Sales and Commercial: [email protected]
Consulting: [email protected]

Or call: 1.650.353.4057


APPENDIX: Choosing Lucene or Solr

The great improvements in the capabilities of Lucene and Solr open source search technology have created rapidly growing interest in using them as alternatives to other search applications. As is often the case with open source technology, online community documentation provides rich details on features and variations, but does little to provide explicit direction on which technology would be the best choice. So when is Lucene preferable to Solr, and vice versa?

There is in fact no single answer, as Lucene and Solr bring very similar underlying technology to bear on somewhat distinct problems. Solr is versatile and powerful: a full-featured, production-ready search application server requiring little formal software programming. Lucene presents a collection of directly callable Java libraries, with fine-grained control of machine functions and independence from higher-level protocols.

In choosing which might be best for your search solution, the key questions to consider are application scope, deployment environment, and software development preferences.

If you are new to developing search applications, you should start with Solr. Solr provides scalable search power out of the box, whereas Lucene requires solid information retrieval experience and some meaningful heavy lifting in Java to take advantage of its capabilities. In many instances, Solr doesn’t even require any real programming.

Solr is essentially the “serverization” of Lucene, and many of its abstract functions are highly similar, if not just the same. If you are building an app for the enterprise sector, for instance, you will find Solr an almost 100% match to your business requirements: it comes ready to run in a servlet container such as Tomcat or Jetty, and ready to scale in a production Java environment. Its RESTful interfaces and XML-based configuration files can greatly accelerate application development and maintenance. In fact, Lucene programmers have often reported that they find Solr to contain “the same features I was going to build myself as a framework for Lucene, but already very well implemented.” If you start with Solr and find yourself using a lot of the features it provides out of the box, you will likely be better off using Solr’s well-organized extension mechanisms instead of starting from scratch with Apache Lucene.

If, on the other hand, you don’t want to make any calls via HTTP, and want to have all of your resources controlled exclusively by Java API calls that you write, Lucene may be a better choice. Lucene works best when constructing and embedding a state-of-the-art search engine, allowing programmers to assemble and compile inside a native Java application. Some programmers set aside the convenience of Solr in order to more directly control the large set of sophisticated features with low-level access, data, or state manipulation, and choose Lucene instead, for example for byte-level manipulation of segments or intervention in data I/O. Investment at this lower level enables development of extremely sophisticated, cutting-edge text search and retrieval capabilities.

As for features, the latest version of Solr generally encapsulates the latest version of Lucene. As the two are in many ways functional siblings, spending time gaining a solid understanding of how Lucene works internally can help you understand Apache Solr and its extension of Lucene’s workings.

No matter which you choose, the power of open source search is yours to harness. More information on both Lucene and Solr can be found at http://www.lucidimagination.com.