Hippo Get Together presentation: Solr integration

Solr integration

April 20, 2012 • Ard Schrijvers • a.schrijvers@onehippo.com / ard@apache.org

About me: Ard Schrijvers

1. Working at Hippo since 2001
2. Email: a.schrijvers@onehippo.com / ard@apache.org
3. Worked primarily on:
   1. HST
   2. Hippo Repository / Jackrabbit
   3. Lucene
   4. Cocoon
   5. Slide
4. Apache committer of Jackrabbit and Cocoon

Outline

1. The current search (HST / repo) architecture
2. The current problems / shortcomings / mismatches
3. What we are trying to improve, the objectives
4. Solr integration to rescue
5. A very fast demo
6. Wrap up
7. Questions

Current search architecture

So, an HSTQuery

is translated to an XPath query,

which is delegated to the repository, which returns a JCR NodeIterator,

which the HST binds back to HippoBeans

That sounds doable and not too complex

Is it?

Well, it is ....... very complex

Reasons:

1. Back in the days when Jackrabbit 1 started, Lucene was at version 1.4
2. The first JSR-170 spec imposed some very harsh constraints: a save must result in directly updated search results
3. Support for XPath / SQL was needed. However, Lucene likes flattened data, while JCR with XPath / SQL is all about hierarchical data
4. JCR Nodes != Documents

Outline

1. The current search (HST / repo) architecture
2. The current problems / shortcomings / mismatches
3. What we are trying to improve, the objectives
4. Solr integration to rescue
5. A short HOWTO as developer
6. A very fast demo
7. Wrap up
8. Questions

Current problems / shortcomings / mismatches

1. JCR Nodes are indexed instead of Documents (#nodes >> #documents)
2. A search result only returns Nodes (Rows): what if you want something else, like auto-completion?
3. Very hard to customize, and only to a very limited extent
4. A single index for an entire workspace
5. Support for very complex XPath / SQL queries comes at a price in CPU, memory and complexity
6. Only JCR Nodes and properties are indexed: no 'derived' field indexes
7. To index external sources, the sources need to be stored in the repository
8. Range queries (and others) easily blow up
9. Getting the number of hits is complex

Extra problem

JCR Nodes != Documents

For example: a news document contains a link to an author document. The news document should be found through the author name.
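To make the mismatch concrete, here is a minimal sketch in plain Java (no JCR or Solr dependency) of flattening a news node, together with the author node it links to, into one flat set of fields, so that a search on the author name finds the news document. The `Node` class, the `flatten` and `matches` helpers and the `author.name` field naming are all hypothetical, chosen for illustration only; they are not Jackrabbit or HST APIs, and only one level of links is followed.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/**
 * Sketch only: Node, flatten() and matches() are hypothetical names used
 * for illustration; they are not Jackrabbit or HST classes.
 */
public class DocumentFlattening {

    /** Minimal stand-in for a JCR-like node: own properties plus named links to other nodes. */
    static final class Node {
        final Map<String, String> properties = new LinkedHashMap<>();
        final Map<String, Node> links = new LinkedHashMap<>();
    }

    /**
     * Flattens a node and the nodes it links to (one level deep) into a single
     * field map. Linked properties get the link name as prefix, e.g. "author.name".
     */
    static Map<String, String> flatten(Node node) {
        Map<String, String> fields = new LinkedHashMap<>(node.properties);
        for (Map.Entry<String, Node> link : node.links.entrySet()) {
            for (Map.Entry<String, String> p : link.getValue().properties.entrySet()) {
                fields.put(link.getKey() + "." + p.getKey(), p.getValue());
            }
        }
        return fields;
    }

    /** True if any field value contains the term, ignoring case. */
    static boolean matches(Map<String, String> fields, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        for (String value : fields.values()) {
            if (value.toLowerCase(Locale.ROOT).contains(needle)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Node author = new Node();
        author.properties.put("name", "Jane Doe");

        Node news = new Node();
        news.properties.put("title", "Solr integration announced");
        news.links.put("author", author);

        // The news node alone does not contain the author name ...
        System.out.println(matches(news.properties, "jane"));
        // ... but the flattened document does.
        System.out.println(matches(flatten(news), "jane"));
    }
}
```

Indexing per node would put the author name in a different index entry than the news title; flattening at document level is what lets one query hit both.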

Objectives

1. Fix all the 9+ problems / shortcomings / mismatches from previous slides
2. Easy to use and customize
3. Satisfied customers
4. Satisfied partners
5. Scalable searches: CPU, memory and large document numbers
6. Document oriented
7. Integration with HST ContentBeans (HippoBeans)
8. Index external sources
9. Control the SIZE of the index yourself
10. Don't invent but integrate (with out-of-the-box features supported by a large community)

Objective: Fix all the 9 problems / shortcomings / mismatches from previous slides

Easy: Solr integration to rescue

Objective: Easy to use and customize

YOU will be in the driver's seat

No more complete dependence on what the sometimes not so smAR&D Hippo team thought was good for YOU

You decide 'from where', 'what', 'how' and 'when' to index

1. from where: which sources (JCR, webpages, database, NoSQL store, Nuxeo, Alfresco, anything)
2. what: which parts of a document (not JCR node) or external source
3. how:
   1. which analyzer
   2. index on document level, property level or both
   3. store the text
4. when: when do you want to index

But of course, out-of-the-box support and tooling ready to be used by YOU

1. Default hippo repository indexer & observer
2. ContentBean (HippoBean) annotations for indexing
3. Binding search results to ContentBeans
4. Deployment support
5. Clustering support

Objective: Satisfied customers

Most likely they will just be satisfied

If they are not satisfied enough you can:

1. Easily customize it (aka tune it until the cows come home)
2. Hire anyone with Solr experience: all our partners have Solr experience

Still not satisfied?

Let them pay too much for a Google Search appliance, Autonomy or any of the other 'useless to pay for software'

Objective: Satisfied partners

Although on thin ice here, I strongly believe in this because:

1. Our partners frequently have good knowledge about Solr
2. Our partners depend less on the current search limitations
3. Our partners can pitch with their Solr knowledge
4. Our partners can sell more Hippo implementations
5. Our partners will earn more on Hippo and have happier developers
6. Hippo will earn more through HES, which will satisfy partners again, because Hippo can spend more on AR&D ==> more features

Objective: Scalable searches

1. Using Solr to do the searches
2. Not the complex JCR hierarchical searches
3. Document oriented instead of JCR Nodes (#docs << #nodes)

Objective: Document oriented

What do we want to search for?

Exactly: Documents!!

A Document == A HippoBean != JCR Node

So let's index HippoBeans (ContentBeans)

Objective: Integration with ContentBeans (HippoBeans)

As a developer ....

how am I going to index my beans?

I know how to write HippoBeans, that's all I ever did in my life

How do you expect me to index my beans?

Annotate your getters with

@IndexField or

@IndexField(name="foo")

And account for them in the Solr schema.xml:

    <field name="title" type="text_general" indexed="true" stored="true"/>
    <field name="summary" type="text_general" indexed="true" stored="true"/>

An example:

    @Node(jcrType="demosite:textdocument")
    public class TextBean extends BaseDocument {

        @IndexField
        public String getTitle() {
            return getProperty("demosite:title");
        }

        // explicit field name ("samenvatting" is Dutch for "summary")
        @IndexField(name="samenvatting")
        public String getSummary() {
            return getProperty("demosite:summary");
        }
    }
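As a rough idea of what an indexer could do with such annotations, the sketch below collects annotated getters through reflection into field name/value pairs. It is not the HST implementation: the `IndexField` annotation is redeclared locally as a stand-in, `collectFields` and `defaultName` are hypothetical helpers, and deriving the default field name from the getter name (`getTitle` becomes `title`) is an assumption.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

public class IndexFieldScan {

    /** Hypothetical stand-in for the real annotation; only the optional name is modelled. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface IndexField {
        String name() default "";
    }

    /** Example bean mirroring the slide, returning constants instead of JCR properties. */
    public static class TextBean {
        @IndexField
        public String getTitle() { return "Hello Solr"; }

        @IndexField(name = "samenvatting")
        public String getSummary() { return "A short summary"; }

        // Not annotated, so it is not picked up.
        public String getInternalNote() { return "not indexed"; }
    }

    /** Invokes every annotated no-arg public getter and maps field name to value. */
    static Map<String, Object> collectFields(Object bean) {
        Map<String, Object> fields = new TreeMap<>();
        for (Method m : bean.getClass().getMethods()) {
            IndexField ann = m.getAnnotation(IndexField.class);
            if (ann == null || m.getParameterCount() != 0 || m.getReturnType() == void.class) {
                continue;
            }
            String name = ann.name().isEmpty() ? defaultName(m.getName()) : ann.name();
            try {
                fields.put(name, m.invoke(bean));
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot read " + m.getName(), e);
            }
        }
        return fields;
    }

    /** Assumed default: strip a leading "get" and lower-case the first letter. */
    static String defaultName(String getter) {
        String base = getter.startsWith("get") && getter.length() > 3
                ? getter.substring(3) : getter;
        return Character.toLowerCase(base.charAt(0)) + base.substring(1);
    }

    public static void main(String[] args) {
        System.out.println(collectFields(new TextBean()));
    }
}
```

Whatever field names come out of such a scan still have to exist in schema.xml, which is why a renamed field like `samenvatting` needs its own `<field>` entry.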

Another example:

    @Node(jcrType="demosite:textdocument")
    public class TextBean extends BaseDocument {

        @IndexField
        public String getTitle() {
            return getProperty("demosite:title");
        }

        @IndexField
        public String getSummary() {
            return getProperty("demosite:summary");
        }

        // index the name of the linked author document as a field of this document
        @IndexField
        public String getAuthor() {
            return getLinkedBean("demosite:author", Author.class).getAuthor();
        }
    }

Another example:

    @Node(jcrType="demosite:textdocument")
    public class TextBean extends BaseDocument {

        @IndexField
        public String getTitle() {
            return getProperty("demosite:title");
        }

        @IndexField
        public String getSummary() {
            return getProperty("demosite:summary");
        }

        @ReIndexOnChange
        @IndexField
        public Author getAuthor() {
            return getLinkedBean("demosite:author", Author.class);
        }
    }

Another example: Setters

    @Node(jcrType="demosite:textdocument")
    public class TextBean extends BaseDocument {

        private String title;
        private String summary;

        @IndexField
        public String getTitle() {
            return title == null ? getProperty("demosite:title") : title;
        }

        public void setTitle(String title) {
            this.title = title;
        }

        @IndexField
        public String getSummary() {
            return summary == null ? getProperty("demosite:summary") : summary;
        }

        public void setSummary(String summary) {
            this.summary = summary;
        }
    }

Bonus : What can we achieve with the Setters?

That's all you need to do

And the HST binds some extra indexing fields like

1. The path
2. The canonicalUUID
3. The name
4. The localized name
5. The depth
6. The class hierarchy (including interfaces)
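Two of these extra fields, the depth and the class hierarchy, are easy to picture in plain Java. The sketch below is an assumption about how such values could be derived, not the HST code: the hypothetical `depth` helper counts non-empty path segments, and the hypothetical `classHierarchy` helper walks superclasses and interfaces, skipping `java.lang.Object`.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class DerivedFields {

    /** Depth of an absolute path: the number of non-empty segments. */
    static int depth(String path) {
        int depth = 0;
        for (String segment : path.split("/")) {
            if (!segment.isEmpty()) {
                depth++;
            }
        }
        return depth;
    }

    /** Names of the type itself, its superclasses and all interfaces, without Object. */
    static Set<String> classHierarchy(Class<?> type) {
        Set<String> names = new LinkedHashSet<>();
        collect(type, names);
        return names;
    }

    private static void collect(Class<?> type, Set<String> names) {
        if (type == null || type == Object.class) {
            return;
        }
        names.add(type.getName());
        for (Class<?> iface : type.getInterfaces()) {
            collect(iface, names);
        }
        // getSuperclass() is null for interfaces, which ends the recursion.
        collect(type.getSuperclass(), names);
    }

    public static void main(String[] args) {
        System.out.println(depth("/content/documents/news/article"));
        System.out.println(classHierarchy(java.util.ArrayList.class));
    }
}
```

With the class hierarchy stored as a multi-valued field, a query could be restricted to a bean type and everything that extends or implements it.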

Objective: Index external sources

You can

1. Push them directly to Solr
2. Push them to a HST JAX-RS resource that binds to a ContentBean and commits to Solr
3. Crawl from the HST, bind to ContentBeans and commit them to Solr

A ContentBean does *not* need a JCR Node!

ContentBean interface:

    public interface ContentBean {

        @IndexField(name="id")
        String getPath();

        void setPath(String path);
    }

An example : GoGreenProductBean in Testsuite

    public class GoGreenProductBean implements ContentBean {

        private String path;
        private String title;
        private String summary;
        private String description;

        public String getPath() { return path; }

        public void setPath(final String path) { this.path = path; }

        @IndexField
        public String getTitle() { return title; }

        public void setTitle(String title) { this.title = title; }

        @IndexField
        public String getSummary() { return summary; }

        public void setSummary(String summary) { this.summary = summary; }

        @IndexField
        public String getDescription() { return description; }

        public void setDescription(String description) { this.description = description; }
    }

And add the GoGreenProductBean to Solr

    List<GoGreenProductBean> gogreenBeans = new ArrayList<GoGreenProductBean>();
    // FILL THE gogreenBeans LIST

    // NOW ADD TO INDEX
    HippoSolrManager solrManager = HstServices.getComponentManager().getComponent(
            HippoSolrManager.class.getName(), SOLR_MODULE_NAME);
    try {
        solrManager.getSolrServer().addBeans(gogreenBeans);
        UpdateResponse commit = solrManager.getSolrServer().commit();
    } catch (IOException e) {
        e.printStackTrace();
    } catch (SolrServerException e) {
        e.printStackTrace();
    }

Objective: Control the SIZE of the index yourself

JCR / Jackrabbit / Hippo-Repository has a generic one-fits-all index (or one-fits-none index), which easily grows very large and can hardly be customized

However, search is domain specific

Just index what is needed for the customer

Objective: Don't invent but integrate

Use Solr

Use Solrj client

Expose the Solrj SolrQuery

For example:

    HippoSolrManager solrManager = ...
    String query = ...
    HippoQuery hippoQuery = solrManager.createQuery(query);
    hippoQuery.setLimit(pageSize);
    hippoQuery.setOffset((page - 1) * pageSize);

    // hippoQuery.getSolrQuery() is the SolrQuery object
    // include scoring
    hippoQuery.getSolrQuery().setIncludeScore(true);
    hippoQuery.getSolrQuery().setHighlight(true);
    hippoQuery.getSolrQuery().setHighlightFragsize(200);
    hippoQuery.getSolrQuery().addHighlightField("title");
    hippoQuery.getSolrQuery().addHighlightField("summary");
    hippoQuery.getSolrQuery().addHighlightField("htmlContent");
    HippoQueryResult result = hippoQuery.execute(true);
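The offset passed to `setOffset` in this example is plain arithmetic on a one-based page number. A minimal sketch of that calculation, using the hypothetical helpers `offset` and `pageCount` (they are not part of the Hippo or Solrj API):

```java
public class Paging {

    /** Zero-based index of the first hit shown on a one-based page. */
    static int offset(int page, int pageSize) {
        if (page < 1 || pageSize < 1) {
            throw new IllegalArgumentException("page and pageSize must be >= 1");
        }
        // multiplyExact fails loudly instead of silently overflowing
        return Math.multiplyExact(page - 1, pageSize);
    }

    /** Number of pages needed to show totalHits hits, rounding up. */
    static int pageCount(long totalHits, int pageSize) {
        if (totalHits < 0 || pageSize < 1) {
            throw new IllegalArgumentException("totalHits must be >= 0 and pageSize >= 1");
        }
        return Math.toIntExact((totalHits + pageSize - 1) / pageSize);
    }

    public static void main(String[] args) {
        int pageSize = 10;
        for (int page = 1; page <= 3; page++) {
            System.out.println("page " + page + " starts at hit " + offset(page, pageSize));
        }
    }
}
```

Validating the page number up front avoids sending a negative offset to the search backend when a request carries `page=0`.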

Solr integration to rescue

No further comments :-)

A very fast demo

Setup: ~75,000 long Wikipedia docs in the repository

............... doing the demo .................

That was : a very fast demo

Wrap up

I think that with the Solr integration

1. Developers will be happier
2. Customers will be happier
3. Partners will be happier
4. Hippo will be happier

And finally, last and least

5. Infra will be happier because the servers stop sweating

Questions?

Check out the example at: http://svn.onehippo.org/repos/hippo/hippo-cms7/testsuite/trunk
