Hippo get together presentation solr integration

Post on 26-Jan-2015

By Ard Schrijvers

Solr integration

April 20, 2012
Ard Schrijvers • a.schrijvers@onehippo.com / ard@apache.org

About me: Ard Schrijvers

1. Working at Hippo since 2001
2. Email: a.schrijvers@onehippo.com / ard@apache.org
3. Worked primarily on:
   1. HST
   2. Hippo Repository / Jackrabbit
   3. Lucene
   4. Cocoon
   5. Slide
4. Apache committer of Jackrabbit and Cocoon

Outline

1. The current search (HST / repo) architecture
2. The current problems / shortcomings / mismatches
3. What we are trying to improve, the objectives
4. Solr integration to rescue
5. A very fast demo
6. Wrap up
7. Questions

Current search architecture

An HSTQuery is translated to an XPath query, which is delegated to the repository, which returns a JCR NodeIterator, which the HST binds back to HippoBeans.
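To make the translation step concrete, here is a minimal sketch of the kind of JCR XPath statement a free-text search ends up as. The class `XPathSketch` and its methods are hypothetical and not part of the HST; the real translation also handles scopes, filters and ordering and is considerably more involved. The node type `demosite:textdocument` is borrowed from the bean examples in this deck.

```java
// Hypothetical helper, NOT part of the HST API: it only illustrates the shape
// of the JCR XPath statement that a simple full-text search could result in.
final class XPathSketch {

    // Escape a single quote inside an XPath string literal by doubling it.
    // Assumption: the special characters of the jcr:contains search
    // expression itself are ignored in this sketch.
    static String escape(String text) {
        return text.replace("'", "''");
    }

    // Full-text query over all nodes of the given node type, best score first.
    static String fullText(String nodeType, String text) {
        return "//element(*, " + nodeType + ")"
                + "[jcr:contains(., '" + escape(text) + "')]"
                + " order by @jcr:score descending";
    }

    public static void main(String[] args) {
        System.out.println(fullText("demosite:textdocument", "hippo"));
    }
}
```

The repository executes such a statement and returns Nodes, which is why every hit still has to be bound back to a bean afterwards.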

That sounds doable and not too complex, is it?

Well, it is ....... very complex

Reasons:

1. Back in the days when Jackrabbit 1 started, Lucene was at version 1.4
2. The first JSR-170 spec imposed some very harsh constraints: a save must result in directly updated search results
3. Support for XPath / SQL was needed. However, Lucene likes flattened data, whereas JCR with XPath / SQL is all about hierarchical data
4. JCR Nodes != Documents

Current problems / shortcomings / mismatches

1. JCR Nodes are indexed instead of Documents (#nodes >> #documents)
2. A search result only returns Nodes (Rows): what if you want something else, like auto-completion?
3. Very hard to customize, and only to a very limited extent
4. A single index for an entire workspace
5. Support for very complex XPath / SQL queries at a price of CPU, memory and complexity
6. Only JCR Nodes and properties are indexed: no 'derived' field indexes
7. To index external sources, the sources need to be stored in the repository
8. Range queries (and others) easily blow up
9. Getting the number of hits is complex

Extra problem: JCR Nodes != Documents

For example: a news document contains a link to an author document. The news document should be found through the author name.

Objectives

1. Fix all the 9+ problems / shortcomings / mismatches from previous slides
2. Easy to use and customize
3. Satisfied customers
4. Satisfied partners
5. Scalable searches: CPU, memory and large document numbers
6. Document oriented
7. Integration with HST ContentBeans (HippoBeans)
8. Index external sources
9. Control the SIZE of the index yourself
10. Don't invent but integrate (with out-of-the-box features supported by a large community)

Objective: Fix all the 9 problems / shortcomings / mismatches from previous slides

Easy: Solr integration to rescue

Objective: Easy to use and customize

YOU will be in the driver's seat

No more complete dependence on what the sometimes not so smAR&D Hippo team thought was good for YOU

You decide 'from where', 'what', 'how' and 'when' to index

1. from where: which sources (JCR, web pages, database, NoSQL store, Nuxeo, Alfresco, anything)
2. what: which parts of a document (not JCR node) or external source
3. how:
   1. which analyzer
   2. index on document level, property level or both
   3. store the text
4. when: when do you want to index

But of course, out-of-the-box support and tooling, ready to be used by YOU

1. Default Hippo repository indexer & observer
2. ContentBean (HippoBean) annotations for indexing
3. Binding search results to ContentBeans
4. Deployment support
5. Clustering support

Objective: Satisfied customers

HOW?

EASY

Most likely they will just be satisfied

If they are not satisfied enough you can:

1. Easily customize it (aka tune it till the cows come home)
2. Hire anyone with Solr experience: all our partners have Solr experience

Still not satisfied?

Let them pay too much for a Google Search Appliance, Autonomy or any of the other 'useless to pay for software'

Objective: Satisfied partners

Although on thin ice here, I strongly believe in this because:

1. Our partners frequently have good knowledge about Solr
2. Our partners depend less on the current search limitations
3. Our partners can pitch with their Solr knowledge
4. Our partners can sell more Hippo implementations
5. Our partners will earn more on Hippo and have happier developers
6. Hippo will earn more through HES, which will satisfy partners again, because Hippo can spend more on AR&D ==> more features

Objective: Scalable searches

1. Using Solr to do the searches
2. Not the complex JCR hierarchical searches
3. Document oriented instead of JCR Nodes (#docs << #nodes)

Objective: Document oriented

What do we want to search for?

Exactly, Documents!!

A Document == A HippoBean != JCR Node

So let's index HippoBeans (ContentBeans)

Objective: Integration with ContentBeans (HippoBeans)

As a developer .... how am I going to index my beans?

I know how to write HippoBeans, that's all I ever did in my life

How do you expect me to index my beans?

Annotate your getters with

    @IndexField

or

    @IndexField(name="foo")

And account for them in the Solr schema.xml:

    <field name="title" type="text_general" indexed="true" stored="true"/>
    <field name="summary" type="text_general" indexed="true" stored="true"/>

An example:

    @Node(jcrType="demosite:textdocument")
    public class TextBean extends BaseDocument {

        @IndexField
        public String getTitle() {
            return getProperty("demosite:title");
        }

        @IndexField(name="samenvatting")
        public String getSummary() {
            return getProperty("demosite:summary");
        }
    }
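How annotated getters could end up as index fields can be sketched with plain reflection. Everything below is an assumption for illustration: the `IndexField` annotation is redeclared locally as a stand-in for the HST one, `SketchTextBean` returns fixed strings instead of JCR properties, and `IndexFieldSketch` is a hypothetical helper, not the HST indexer. The default field name is assumed to be derived from the getter name.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

// Local stand-in for the HST annotation so the sketch compiles on its own.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface IndexField {
    String name() default "";
}

// Hypothetical bean: fixed strings instead of JCR properties.
class SketchTextBean {
    @IndexField
    public String getTitle() { return "Solr integration"; }

    @IndexField(name = "samenvatting")
    public String getSummary() { return "Indexing beans"; }

    // No annotation, so this getter is not indexed.
    public String getIgnored() { return "not indexed"; }
}

final class IndexFieldSketch {

    // getTitle -> title; callers guarantee the name is longer than "get".
    static String defaultName(Method getter) {
        String bare = getter.getName().substring(3);
        return Character.toLowerCase(bare.charAt(0)) + bare.substring(1);
    }

    // Collect field name -> value for every annotated, no-argument getter.
    static Map<String, Object> toFields(Object bean) {
        Map<String, Object> fields = new TreeMap<>();
        for (Method m : bean.getClass().getMethods()) {
            IndexField ann = m.getAnnotation(IndexField.class);
            boolean getter = m.getParameterCount() == 0
                    && m.getName().startsWith("get")
                    && m.getName().length() > 3;
            if (ann == null || !getter) {
                continue;
            }
            String name = ann.name().isEmpty() ? defaultName(m) : ann.name();
            try {
                fields.put(name, m.invoke(bean));
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot read " + m.getName(), e);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(toFields(new SketchTextBean()));
    }
}
```

Running `main` prints `{samenvatting=Indexing beans, title=Solr integration}`: the explicit `name` wins over the getter-derived default, and each resulting name has to exist as a field in schema.xml.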

Another example:

    @Node(jcrType="demosite:textdocument")
    public class TextBean extends BaseDocument {

        @IndexField
        public String getTitle() {
            return getProperty("demosite:title");
        }

        @IndexField
        public String getSummary() {
            return getProperty("demosite:summary");
        }

        @IndexField
        public String getAuthor() {
            return getLinkedBean("demosite:author", Author.class).getAuthor();
        }
    }

Another example:

    @Node(jcrType="demosite:textdocument")
    public class TextBean extends BaseDocument {

        @IndexField
        public String getTitle() {
            return getProperty("demosite:title");
        }

        @IndexField
        public String getSummary() {
            return getProperty("demosite:summary");
        }

        @ReIndexOnChange
        @IndexField
        public Author getAuthor() {
            return getLinkedBean("demosite:author", Author.class);
        }
    }

Another example: Setters

    @Node(jcrType="demosite:textdocument")
    public class TextBean extends BaseDocument {

        private String title;
        private String summary;

        @IndexField
        public String getTitle() {
            return title == null ? getProperty("demosite:title") : title;
        }

        public void setTitle(String title) {
            this.title = title;
        }

        @IndexField
        public String getSummary() {
            return summary == null ? getProperty("demosite:summary") : summary;
        }

        public void setSummary(String summary) {
            this.summary = summary;
        }
    }

Bonus: What can we achieve with the Setters?

That's all you need to do

And the HST binds some extra indexing fields like

1. The path
2. The canonicalUUID
3. The name
4. The localized name
5. The depth
6. The class hierarchy (including interfaces)

Objective: Index external sources

You can

1. Push them directly to Solr
2. Push them to an HST JAX-RS resource that binds to a ContentBean and commits to Solr
3. Crawl from the HST, bind to ContentBeans and commit them to Solr

A ContentBean does *not* need a JCR Node!

ContentBean interface:

    public interface ContentBean {

        @IndexField(name="id")
        String getPath();

        void setPath(String path);
    }

An example: GoGreenProductBean in Testsuite

    public class GoGreenProductBean implements ContentBean {

        private String path;
        private String title;
        private String summary;
        private String description;

        public String getPath() { return path; }

        public void setPath(final String path) { this.path = path; }

        @IndexField
        public String getTitle() { return title; }

        public void setTitle(String title) { this.title = title; }

        @IndexField
        public String getSummary() { return summary; }

        public void setSummary(String summary) { this.summary = summary; }

        @IndexField
        public String getDescription() { return description; }

        public void setDescription(String description) { this.description = description; }
    }

And add the GoGreenProductBean to Solr

    {
        List<GoGreenProductBean> gogreenBeans = new ArrayList<GoGreenProductBean>();
        // FILL THE gogreenBeans LIST

        // NOW ADD TO INDEX
        HippoSolrManager solrManager = HstServices.getComponentManager().getComponent(
                HippoSolrManager.class.getName(), SOLR_MODULE_NAME);
        try {
            solrManager.getSolrServer().addBeans(gogreenBeans);
            UpdateResponse commit = solrManager.getSolrServer().commit();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (SolrServerException e) {
            e.printStackTrace();
        }
    }
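The 'FILL THE gogreenBeans LIST' step is where an external source gets bound to beans. A minimal sketch, assuming the source delivers rows of `{url, title}`: `ProductBeanSketch` is a reduced local stand-in for the bean on the slide, `ExternalSourceSketch.toBeans` is a hypothetical helper, and the example.com URL is made up. No JCR Node is involved anywhere.

```java
import java.util.ArrayList;
import java.util.List;

// Reduced local stand-in for GoGreenProductBean: only path and title.
class ProductBeanSketch {
    private String path;
    private String title;

    public String getPath() { return path; }
    public void setPath(String path) { this.path = path; }

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }
}

final class ExternalSourceSketch {

    // Each row is {url, title}. The url doubles as the unique id, because
    // getPath() is the getter mapped to the "id" field.
    static List<ProductBeanSketch> toBeans(List<String[]> rows) {
        List<ProductBeanSketch> beans = new ArrayList<>();
        for (String[] row : rows) {
            if (row.length < 2 || row[0].isEmpty()) {
                continue; // a document without an id cannot be indexed
            }
            ProductBeanSketch bean = new ProductBeanSketch();
            bean.setPath(row[0]);
            bean.setTitle(row[1]);
            beans.add(bean);
        }
        return beans;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"http://www.example.com/products/1", "Bamboo bike"});
        System.out.println(toBeans(rows).get(0).getTitle());
    }
}
```

A list built this way is the kind of list that is handed to `addBeans` in the snippet above.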

Objective: Control the SIZE of the index yourself

JCR / Jackrabbit / Hippo-Repository has a generic one-fits-all-index (or one-fits-none-index), which easily grows very large and can hardly be customized

However, search is domain specific

Thus, just index what is needed for the customer

Objective: Don't invent but integrate

Use Solr

Use Solrj client

Expose the Solrj SolrQuery

For example:

    HippoSolrManager solrManager = ...
    String query = ...
    HippoQuery hippoQuery = solrManager.createQuery(query);
    hippoQuery.setLimit(pageSize);
    hippoQuery.setOffset((page - 1) * pageSize);

    // hippoQuery.getSolrQuery() is the SolrQuery object
    // include scoring
    hippoQuery.getSolrQuery().setIncludeScore(true);
    hippoQuery.getSolrQuery().setHighlight(true);
    hippoQuery.getSolrQuery().setHighlightFragsize(200);
    hippoQuery.getSolrQuery().addHighlightField("title");
    hippoQuery.getSolrQuery().addHighlightField("summary");
    hippoQuery.getSolrQuery().addHighlightField("htmlContent");
    HippoQueryResult result = hippoQuery.execute(true);
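The offset above is plain paging arithmetic on a 1-based page number. A small sketch with the hypothetical helper `PagingSketch`, which is not part of the Hippo or Solrj API:

```java
// Hypothetical paging helper, only to spell out the arithmetic.
final class PagingSketch {

    // 1-based page number to 0-based offset: (page - 1) * pageSize.
    static int offset(int page, int pageSize) {
        if (page < 1 || pageSize < 1) {
            throw new IllegalArgumentException("page and pageSize must be >= 1");
        }
        return (page - 1) * pageSize;
    }

    // Number of pages needed for the given hit count, rounding up.
    static long pageCount(long totalHits, int pageSize) {
        if (totalHits < 0 || pageSize < 1) {
            throw new IllegalArgumentException("invalid hit count or page size");
        }
        return (totalHits + pageSize - 1) / pageSize;
    }

    public static void main(String[] args) {
        System.out.println(offset(3, 10));    // third page starts at hit 20
        System.out.println(pageCount(21, 10)); // 21 hits need 3 pages
    }
}
```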

Solr integration to rescue

No further comments :-)

A very fast demo

Setup: ~75,000 long Wikipedia docs in the repository

............... doing the demo .................

That was: a very fast demo

Wrap up

I think that with the Solr integration

1. Developers will be happier
2. Customers will be happier
3. Partners will be happier
4. Hippo will be happier

And finally, last and least

5. Infra will be happier because the servers stop sweating

Questions?

Check out the example at: http://svn.onehippo.org/repos/hippo/hippo-cms7/testsuite/trunk