Open Source Search: An Analysis

Post on 27-Jun-2015

1.237 views 2 download

Tags:

description

Slides from my talk at PHP Conference 2012, giving a brief overview of four key players in the Open Source Search market

Transcript of Open Source Search: An Analysis

Open Source SearchAn analysis and comparison from a developer’s perspective

The Contenders

The Data

Report Buyer product catalogue:

• Text fields: title, subtitle, summary, toc• Product code and ISBN• Supplier, category, type and

availability• Publication date and price

Apache Solr

Enterprise class search engineScalable and based on Apache

LuceneREST-ful API or PECL extensionFast, transactional full-text indexingFaceted and geospatial searchRich document indexingComes with simple web interfaceBuilt-in caching of queries and

responsesNumerous plug-ins

Apache Solr: Installation

Available as system packages Uses Tomcat or Jetty Requires a restart on configuration

change Packages install as a service

Apache Solr: Configuration

Specify database location Memory settings Query caching options Request handler setup Search components and plug-ins Spell checker configuration

Apache Solr: Data Schema <!-- Report Buyer fields --><field name="item_guid" type="string" indexed="true" stored="true"

required="true" /><field name="name" type="text" indexed="true" stored="true" required="true"

boost="75" omitNorms="false" /><field name="subtitle" type="text" indexed="true" stored="true" required="false"

boost="25" omitNorms="false" /><field name="summary" type="text" indexed="true" stored="false" boost="1"

omitNorms="false" /><field name="toc" type="text" indexed="true" stored="false" boost="1"

omitNorms="false" /><field name="isbn" type="string" indexed="true" stored="false" boost="200"

omitNorms="false" /><field name="product_code" type="string" indexed="true" stored="true" boost="200"

omitNorms="false" /><field name="publish_date" type="tdate" indexed="true" stored="true" /><field name="price" type="tfloat" indexed="true" stored="true" /><field name="availability" type="boolean" indexed="true" stored="true" /><field name="link" type="string" indexed="false" stored="true" /><field name="text" type="text" indexed="true" stored="false" multiValued="true"/>

<copyField source="name" dest="text"/><copyField source="subtitle" dest="text"/><copyField source="summary" dest="text"/><copyField source="toc" dest="text"/>

<uniqueKey>item_guid</uniqueKey><defaultSearchField>text</defaultSearchField>

Apache Solr: Indexing Options

Data Import Handler REST-ful API PHP PECL Extension Third-party libraries, like Solarium

Apache Solr: PHP PECL Indexer

<?php$solr_options = array('secure' => false, 'hostname' => 'localhost', 'port' => 8080);$solr = new SolrClient($solr_options);$doc = new SolrInputDocument();while ($row = mysql_fetch_array($result, MYSQL_ASSOC)){

$doc = new SolrInputDocument();$row['publish_date'] = strftime('%Y-%m-%dT00:00:01Z', strtotime($row['publish_date']));foreach ($row as $key => $value) {

$doc->addField($key, $value);}$updateResponse = $solr->addDocument($doc);$response = $updateResponse->getResponse();if ($response->responseHeader->status != 0) {

print "Error importing into Solr: "; print_r($response);

}}

$solr->commit();?>

Apache Solr: RESTful indexing

POST to http://localhost:8080/solr/update?commit=true

<add><doc>

<field name="item_guid">a34bbff9e17ada79658c72fde90c7369</field><field name="name">Research Report on China's Corn Industry</field><field name="price">1265</field>etc

</doc></add>

Apache Solr: PHP Querying

$solr_options = array('secure' => false, 'hostname' => 'localhost', 'port' => 8080);$solr = new SolrClient($solr_options);$query = new SolrQuery();$query->setQuery("research in china");$query->setFacet(true);$query->addFacetField('availability');

$query->addField('item_guid')->addField('name')->addField('publish_date')->addField('subtitle')->addField('product_code')->addField('availability')->addField('price');

$query->addSortField('publish_date', SolrQuery::ORDER_DESC);

$query_response = $solr->query($query); $response = $query_response->getResponse();

print "Found ".$response->response->numFound." results, for {$query_string} in ".$response->responseHeader->QTime." ms:\n\n";

foreach ($response->response->docs as $position=>$doc_data) {$download = ($doc_data['availability'] == '1') ? 'Yes' : 'No';print "{$position} - Date:{$pub_date} - {$doc_data['product_code']} - D/L:{$download} £".sprintf("%5d", $doc_data['price'])." - {$doc_data['name']}\n";

}print "Facets for instant ".$response->facet_counts->facet_fields->availability->false;

Apache Solr: RESTful Queries

http://localhost:8080/solr/select/?q=research%20%in%20china&indent=on&hl=true&hl.fl=item_guid,name,publish_date,subtitle,product_code,availability,price&facet=true&facet.field=availability&wt=json

{ "responseHeader":{ "status":0, "QTime":20, "params":{

"facet":"true", "indent":"on", "q":"research \u0000 china","hl.fl":"item_guid,name,publish_date,subtitle,product_code,availability,price","facet.field":"availability", "wt":"json", "hl":"true"}},

"response":{"numFound":197481,"start":0,"docs":[{ "item_guid":"e68cf64921a02e926137d78d2c52da35", "name":"Market Research Report on China Civil Aero Industry", "product_code":"SFC00076", "price":190.0, "availability":false, "type":10,"link": "/industry_manufacturing/plant_heavy_equipment/market_research_report_china_civil_aero_industry.html", "publish_date":"2008-07-22T00:00:01Z"}

}

Apache Solr: Comparison Points

More features than other products Responsive, busy mailing list Large team of developers Good PHP libraries for integration Several books available Fairly heavy footprint

Elasticsearch: Features

Also built on Apache Lucene JSON-based Distributed, scalable server model Easy to configure, or configuration

free Faceting and highlight support Auto type detection Multiple indexes CouchDB integration

Elasticsearch: Installation Installation

Download and unpack zip file Run elasticsearch/bin/elasticsearch

Elasticsearch: Configuration

No schema is required - almost No configuration is required - almost

Elasticsearch: Accesing the system

GET http://localhost:9200/ HTTP/1.0{

"ok" : true, "name" : "Test", "version" : { "number" : "0.18.7", "snapshot_build" : false }, "tagline" : "You Know, for Search", "cover" : "DON'T PANIC", "quote" : { "book" : "The Hitchhiker's Guide to the Galaxy", "chapter" : "Chapter 27", "text1" : "\"Forty-two,\" said Deep Thought, with infinite majesty and calm.", "text2" : "\"The Answer to the Great Question, of Life, the Universe and Everything\"" }}

Elasticsearch: Creating an Index

curl -XPUT http://localhost:9200/reports/ -d '{

"index:" {"analysis": {

"analyzer": {"my_analyzer": {

"tokenizer": "standard","filter": ["standard", "lowercase",

"my_stemmer"]}

},"filter": {

"my_stemmer": {"type": "stemmer","name": "english"

}}

}}

}'

Elasticsearch: Mapping the data

<?phprequire_once("ElasticSearch.php");$es = new ElasticSearch;$es->index = 'reports';$type = 'report';$mappings = array($type => array('properties' => array(

'_id' => array('type' => 'string', 'path' => 'item_guid'),'item_guid' => array('type' => 'string', 'store' => 'yes', 'index' =>

'not_analyzed'),'name' => array('type' => 'string', 'store' => 'no', 'boost' => 75),'subtitle' => array('type' => 'string', 'store' => 'yes', 'boost' => 25),'summary' => array('type' => 'string', 'store' => 'yes', 'boost' => 10),'toc' => array('type' => 'string', 'store' => 'no'),'product_code' => array('type' => 'string', 'store' => 'yes', 'boost' =>

200, 'index' => 'not_analyzed'),'isbn' => array('type' => 'string', 'store' => 'yes', 'boost' => 200, 'index'

=> 'not_analyzed'),)));

$json = json_encode($mappings);

$es->map($type, $json);?>

Elasticsearch: Indexing

<?phprequire_once("ElasticSearch.php");$es = new ElasticSearch;$es->index = 'reports';$type = 'report';

$sql = "SELECT `item_guid`, `name`, `subtitle`, `summary`, `toc`, `supplier`,`product_code`, `isbn`, `category`, `price`, `availibility` as `availability`, `type`, `link`, `publish_date` FROM `rb_search`";

$result = read_query($sql);

while ($row = mysql_fetch_array($result, MYSQL_ASSOC)){

$es->add($type, $row['item_guid'], json_encode($row));}?>

Elasticsearch: Inspection

GET http://localhost:9200/reports/report/_count/

{"count":260349,"_shards":{"total":1,"successful":1,"failed":0}}

Elasticsearch: Querying

<?phprequire_once("ElasticSearch.php");$es = new ElasticSearch;

$es->index = 'reports';$type = 'report';

$query = array('fields' => array('item_guid', 'name', 'subtitle'),'query' => array(

'term' => array('name' => 'research'),),

'facets' => array('availability' => array(

'terms' => array('field' => 'availability'))

));

$result = $es->query($type, json_encode($query));?>

Elasticsearch: PHP APIs

Nicholas Ruflin's elastica Raymond Julin's elasticsearch Niranjan Uma Shankar's

elasticsearch-php

Elasticsearch: Comparison Points

Very fast indexing Auto-scaling architecture Elegant REST approach Flexible zero configuration model Poor documentation No feature list, conceptual model or

introduction All data is stored, meaning large

indices

Sphinx Indexes MySQL, MSSQL, XML or

ODBC Querying through Sphinx PHP API Searching through SQL queries or

API Scalable to index 6TB of data in

16bn documents and 2000 queries/sec

Used by Craigslist, Boardreader Runs as a storage engine in MySQL

Sphinx: Installation

Install from system packages or source

Source tarball is needed to get PHP SphinxAPI

No other software needed Runs as a service in Ubuntu

Sphinx: Indexing Types

Plain index - fast search, slow update Real-time index - fast update, less

efficient Distributed - combination of both

methods

Sphinx: Configuration

index rb_test{

# index typetype = rtpath = /mnt/data_indexed/sphinx/rb_test# define the fields we're indexingrt_field = namert_field = subtitlert_field = summaryrt_field = toc

#define the fields we want to get back outrt_attr_string = item_guidrt_attr_string = supplierrt_attr_string = product_codert_attr_string = isbnrt_attr_string = categoryrt_attr_uint = pricert_attr_string = linkrt_attr_timestamp = publish_date

# morphology preprocessors to applymorphology = stem_enhtml_strip = 1html_index_attrs = img=alt,title; a=title;html_remove_elements = style, script

}

Spinx: Indexing the data

<?phprequire_once("mysql.inc.php");$sql = "SELECT conv(mid(md5(`item_guid`), 1, 16), 16, 10) AS `id`, `item_guid`,

`name`, `subtitle`, `summary`, `toc`, `supplier`, `product_code`, `isbn`,

`category`, `price`, `availibility` as `availability`, `type`, `link`,

UNIX_TIMESTAMP(`publish_date`) AS `publish_date` FROM `rb_search`";$result = read_query($sql);$sphinx = mysql_connect("127.0.0.1:9306", "", "", true);while ($row = mysql_fetch_array($result, MYSQL_ASSOC)) {

foreach ($row as $key=>$value) {$row[$key] = mysql_escape_string($value);

}$sql = "REPLACE INTO `rb_search` (`id`, `title`, `subtitle`,`availability`, `type`, `price`, `publish_date`, `item_guid`, `supplier`, `product_code`, `isbn`, `category`, `link`, `summary`, `toc`)

VALUES ('{$row['id']}', '{$row['name']}', '{$row['subtitle']}',

'{$row['availability']}', '{$row['type']}','{$row['price']}', '{$row['publish_date']}', '{$row['item_guid']}', '{$row['supplier']}', '{$row['product_code']}', '{$row['isbn']}', '{$row['category']}', '{$row['link']}','{$row['summary']}', '{$row['toc']}')";mysql_query($sql, $sphinx);

}?>

Sphinx: Querying the Index

mysql --host=127.0.0.1 --port=9306

Welcome to the MySQL monitor. Commands end with ; or \g.Your MySQL connection id is 1Server version: 2.0.3-id64-release (r3043)

mysql> select item_guid, title, subtitle, price from rb_search where match('china pharmaceutical') and price > 100 and price < 300 limit 2\G

************************** 1. row *************************** id: 5228810066049016302 weight: 6671 price: 220item_guid: cc74cb075aa37696198e87850f033398 title: North China Pharmaceutical Group Corp-Therapeutic Competitors Report subtitle: *************************** 2. row *************************** id: 3548867347418583847 weight: 6662 price: 190item_guid: 6ce04df0fb277aa3ff596c2ca00c81a9 title: China Pharmaceutical Industry Report subtitle: 2006-20072 rows in set (0.01 sec)

Sphinx: Comparison Points

Fastest indexing of all engines Really simple interface via SQL Document IDs must be unsigned

integers No faceting support Good support in forums

Xapian Deployed as a C++ library Bindings provided to connect to PHP Available in most package

repositories Binding need to be compiled

separately Query Parser, similar to other

engines Stemming and faceted search Server replication

Xapian: Installation

Install from system packages Compile PHP bindings from source No other software needed Runs on demand

Xapian: Configuration concepts

No configuration required Define-and-go schema Documents Terms Values Document data

Xapian: Indexing Data

<?php$xapian_db = new XapianWritableDatabase($xapian,

Xapian::DB_CREATE_OR_OVERWRITE);$xapian_term_generator = new XapianTermGenerator();$xapian_term_generator->set_stemmer(new XapianStem("english"));

while ($row = mysql_fetch_array($result, MYSQL_ASSOC)) {$doc = new XapianDocument();

$xapian_term_generator->set_document($doc);foreach ($xapian_term_weights as $field => $weight) {

$xapian_term_generator->index_text($row[$field], $weight);}

$xapian_term_generator->index_text($row['name'], 75, 'S:');$doc->add_boolean_term('CODE:' . $row['product_code']);

$doc->add_value($xapian_value_slots['price'], Xapian::sortable_serialise($row['price']));$doc->add_value($xapian_value_slots['publish_date'], strftime("%Y%m%d", strtotime($row['publish_date'])));

// add in additional values that we're going to use for facets

$doc->add_value($xapian_value_slots['availability'], $row['availability']);$doc->set_data(serialize($doc_data));$docid = 'Q'.$row['item_guid'];$xapian_db->replace_document($docid, $doc);

}?>

Xapian: Querying the Index

<?php$xapian_db = new XapianDatabase($xapian);$query_parser = new XapianQueryParser();$query_parser->set_stemmer(new XapianStem("english"));$query_parser->set_default_op(XapianQuery::OP_AND);

$dvrProcessor = new XapianDateValueRangeProcessor($xapian_value_slots['publish_date'], 'date:');$query_parser->add_valuerangeprocessor($dvrProcessor);

$query_parser->add_prefix("code", "CODE:");$query_parser->add_prefix("category", "CATEGORY:");$query_parser->add_prefix("title", "S:");$query = $query_parser->parse_query('“Medical devices” NEAR china NOT russian price:10..150

category:medical');

$enquire = new XapianEnquire($xapian_db);$enquire->set_query($query);$matches = $enquire->get_mset($offset, $pagesize);while (!($start->equals($end))) {

$doc = $start->get_document();$price = Xapian::sortable_unserialise($doc->get_value($xapian_value_slots['price']));$start->next();

}?>

Xapian: PHP APIs

Only one option available from Xapian

Requires additional compilation due to licensing

Not very well documented API

Xapian: Comparison Points

Reasonably fast indexing Very flexible implementation Faceting and range searching Good Quick Start guide Responsive mailing list Third-party paid support

In Summary

Every project has different needs Not one search product fits all Fastest to index was Sphinx Most feature-rich: Solr The next steps are up to you