AWS Webcast - Build a Scalable Search Engine with the New Amazon CloudSearch
London Amazon CloudSearch Meetup Jon Handler
© 2012 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified or distributed in whole or in part without the express consent of Amazon.com, Inc.
London Amazon CloudSearch Meetup
Jon Handler
Agenda
CloudSearch technical overview (Jon Handler, Amazon CloudSearch Solution Architect)
NakedWines and CloudSearch (Matt Reid, Developer at NakedWines)
Searching Wikipedia with Amazon CloudSearch (Iain Fletcher, Search Technologies)
Building UI with CloudSearch (Stefan Olafsson, Co-Founder, Twigkit)
What is Search
Shoes
Do You Want Search With That?
Build your own – database, home-rolled, site search
Open source
Legacy enterprise search
Search Challenges
Complex, expertise required
Costly, often with up-front expenditure
Long time to market, innovation and experimentation are slowed
Operational overhead is undifferentiated work
Amazon CloudSearch
Pay for infrastructure you need when you need it
Low cost
No need to guess capacity
Experiment fast with low risk
We do the undifferentiated heavy lifting
Go global in minutes
Amazon CloudSearch Architecture
DNS / Load Balancing
SEARCH SERVICE: Search API, Console
DOCUMENT SERVICE: DocSvc API, Command Line Tools, Console
CONFIG SERVICE: AWS Query, Config API, Command Line Tools, Console
Search Domain
Automatic Scaling
[Diagram: a grid of search instances, one per index partition and copy, running from Index Partition 1, Copy 1 to Index Partition n, Copy n]
DATA: Document Quantity and Size
TRAFFIC: Search Request Volume and Complexity
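One way to read the grid: the number of search instances is the number of index partitions multiplied by the number of copies. The sketch below assumes that data volume drives the partition count and traffic drives the copy count, as the two diagram labels suggest. The function name and the capacity figures are hypothetical, chosen only for illustration, and are not published service limits.

```python
import math

def instances_needed(total_docs, docs_per_partition, peak_qps, qps_per_copy):
    """Estimate the grid size as (partitions, copies, instances).

    docs_per_partition and qps_per_copy are hypothetical capacity
    figures, not CloudSearch limits.
    """
    # More data than one partition can hold means more partitions.
    partitions = max(1, math.ceil(total_docs / docs_per_partition))
    # More traffic than one copy can serve means more copies.
    copies = max(1, math.ceil(peak_qps / qps_per_copy))
    return partitions, copies, partitions * copies

print(instances_needed(5_000_000, 2_000_000, 120, 50))
```

With these made-up inputs the grid is 3 partitions by 3 copies.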
[Diagram: a grid of search instances (Index Partition 1 to n, Copy 1 to n), shown with:]
Compute
Storage
Load Balancing
Security
Text Search
Highly Relevant Results
Faceted Drilldown
Integer Range Searching
Complex Queries
Search Query Processing
[Diagram labels: Query, Matching, Filtering, Sorting, Ranking; example document IDs 564, 726, 123]
Reference Architecture
Create An Amazon CloudSearch Domain
Text fields for matching user terms
Result enabled to retrieve source data
Literal fields for Faceting
Facet enabled to retrieve facets
Search enabled for narrowing
Integer fields for ranking, narrowing
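The three field types can be pictured as a small table of options. The sketch below is an illustrative Python structure, not the format of the configuration API; the field names come from the Kindle example document used in this talk, and the option keys ("result", "facet", "search") are shorthand assumed here.

```python
# Hypothetical field layout for the product example in this talk.
INDEX_FIELDS = {
    # Text fields: matched against user terms; result-enabled to return source data.
    "title":       {"type": "text", "result": True},
    "description": {"type": "text", "result": True},
    # Literal field: facet-enabled for facet counts, search-enabled for narrowing.
    "category":    {"type": "literal", "facet": True, "search": True},
    # Integer field: usable in rank expressions and range narrowing.
    "price":       {"type": "uint"},
}

def fields_with(option):
    """Return the names of fields that have the given option switched on."""
    return sorted(name for name, cfg in INDEX_FIELDS.items() if cfg.get(option))

print(fields_with("result"))
print(fields_with("facet"))
```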
Configure the Domain
Data Preparation and Upload
Search Documents
Extract
SDF Batch
Amazon CloudSearch
POST
CloudSearch SDF
[
  { "type": "add",
    "id": "b007oznzg0",
    "version": 1,
    "lang": "en",
    "fields": {
      "title": "Kindle Paperwhite",
      "description": "World's most advanced e-reader",
      "category": ["Electronics", "eBook Readers"],
      "price": 11900
    }
  },
  ...
]
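A minimal sketch of building SDF operations in Python. The helper names are hypothetical. Defaulting the version to the current Unix time is an assumption made here as one simple way to get increasing version numbers; the slide itself uses a literal version of 1.

```python
import json
import time

def sdf_add(doc_id, fields, version=None, lang="en"):
    """Build an SDF "add" operation (helper name is hypothetical)."""
    if version is None:
        version = int(time.time())  # assumption: timestamp as version
    return {"type": "add", "id": doc_id, "version": version,
            "lang": lang, "fields": fields}

def sdf_delete(doc_id, version=None):
    """Build an SDF "delete" operation."""
    if version is None:
        version = int(time.time())
    return {"type": "delete", "id": doc_id, "version": version}

batch = [
    sdf_add("b007oznzg0", {
        "title": "Kindle Paperwhite",
        "description": "World's most advanced e-reader",
        "category": ["Electronics", "eBook Readers"],
        "price": 11900,
    }, version=1),
]
body = json.dumps(batch)  # the JSON text that would be uploaded
print(body)
```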
Document Service API
http(s)://< document service endpoint >/2011-02-01/documents/batch
Accept: application/json
Content-Length: 1176
Content-Type: application/json
Host: doc.imdb-movies-rr2f34ofg56xneuemujamut52i.us-east-1.cloudsearch.amazonaws.com

[
  { "type": "add",
    "id": "b007oznzg0",
    "version": 1,
    "lang": "en",
    "fields": {
      "title": "Kindle Paperwhite",
      "description": "World's most advanced e-reader",
      "category": ["Electronics", "eBook Readers"],
      "price": 11900
    }
  },
  { "type": "delete", "id": "tt0434409", "version": 1337648735 }
]
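A sketch of the same request using only the Python standard library. The function name is hypothetical and the endpoint string is a placeholder, not a real domain. The request is built but not sent, so the example runs without network access.

```python
import json
import urllib.request

API_VERSION = "2011-02-01"  # version string shown in the slide's URL

def build_batch_request(doc_endpoint, batch):
    """Build (but do not send) a POST request for an SDF batch."""
    body = json.dumps(batch).encode("utf-8")
    url = f"https://{doc_endpoint}/{API_VERSION}/documents/batch"
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
    )

batch = [
    {"type": "add", "id": "b007oznzg0", "version": 1, "lang": "en",
     "fields": {"title": "Kindle Paperwhite", "price": 11900}},
    {"type": "delete", "id": "tt0434409", "version": 1337648735},
]
# "doc.example-domain.invalid" is a placeholder endpoint.
request = build_batch_request("doc.example-domain.invalid", batch)
print(request.get_method(), request.full_url)
# To actually upload: urllib.request.urlopen(request)
```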
Search Service API
http(s)://< search service endpoint>/2011-02-01/search?
Simple searches
• q= text
Boolean combination of fields
• bq= (or field:'value1' (and field:'value2' field:'value3'))
Faceting
• facet= comma separated list of facet fields
Pagination
• start=, size=
Customized ranking
• rank= sort results based on the rank expression provided
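The parameters above combine into a single query string. A minimal sketch of assembling the URL, with a hypothetical function name and a placeholder endpoint; `urlencode` takes care of escaping spaces and quotes in `bq` expressions.

```python
from urllib.parse import urlencode

def search_url(search_endpoint, q=None, bq=None, facet=None,
               start=0, size=10, rank=None):
    """Assemble a search URL from the parameters listed on the slide."""
    params = {}
    if q:
        params["q"] = q                    # simple text search
    if bq:
        params["bq"] = bq                  # boolean query expression
    if facet:
        params["facet"] = ",".join(facet)  # comma separated facet fields
    params["start"] = start                # pagination offset
    params["size"] = size                  # page size
    if rank:
        params["rank"] = rank              # rank expression or field
    return f"https://{search_endpoint}/2011-02-01/search?{urlencode(params)}"

# "search.example.invalid" is a placeholder endpoint.
url = search_url("search.example.invalid", q="kindle",
                 facet=["category"], size=5)
print(url)
print(search_url("search.example.invalid",
                 bq="(and title:'kindle' category:'Electronics')"))
```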
Search Results
{
  "rank": "-text_relevance",
  "match-expr": "(label 'kindle paperwhite')",
  "hits": {
    "found": 204,
    "start": 0,
    "hit": [
      { "id": "sontsst12cf5f88b42" },
      { "id": "sopvopr12ab017f082" },
      { "id": "sorzrpw12ac468a13b" }
    ]
  },
  ...
}
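A short sketch of reading such a response in Python. The sample below reproduces only the fields shown on the slide (the elided part is left out), and the function name is hypothetical.

```python
import json

# The fields visible on the slide, serialized as a response body.
SAMPLE_RESPONSE = json.dumps({
    "rank": "-text_relevance",
    "match-expr": "(label 'kindle paperwhite')",
    "hits": {
        "found": 204,
        "start": 0,
        "hit": [
            {"id": "sontsst12cf5f88b42"},
            {"id": "sopvopr12ab017f082"},
            {"id": "sorzrpw12ac468a13b"},
        ],
    },
})

def summarize(response_text):
    """Return (total matches, offset of this page, ids on this page)."""
    hits = json.loads(response_text)["hits"]
    return hits["found"], hits["start"], [hit["id"] for hit in hits["hit"]]

found, start, ids = summarize(SAMPLE_RESPONSE)
print(f"{found} matches, showing {len(ids)} from offset {start}: {ids}")
```

Note that "found" is the total number of matches, while "hit" holds only the current page.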
Customizing Ranking
Rank expressions
• Compute a score for each document
• &rank=<function>
E.g. recency based
Customizing Ranking With Queries
Define rank expressions in your query
• &rank-recency=text_relevance + (1 / (2012 - year)) * 100
• &rank=-recency
Uses
• A/B testing
• User-customized searches
• Geo-searching
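To see what the recency expression does, it can be evaluated locally. This is only an illustration of the arithmetic, not how the service computes scores; the documents and their text_relevance values are made up. The expression divides by (2012 - year), so the sketch rejects documents from 2012 or later rather than guess how the service treats that case.

```python
def recency_score(text_relevance, year, current_year=2012):
    """Evaluate text_relevance + (1 / (2012 - year)) * 100 locally."""
    age = current_year - year
    if age <= 0:
        # The slide's expression is undefined here (division by zero).
        raise ValueError("expression is undefined for year >= current_year")
    return text_relevance + (1 / age) * 100

# Hypothetical documents: (id, text_relevance, year)
docs = [("a", 200, 2011), ("b", 250, 2002), ("c", 180, 2010)]

# rank=-recency sorts by the expression in descending order.
ranked = sorted(docs, key=lambda d: recency_score(d[1], d[2]), reverse=True)
for doc_id, relevance, year in ranked:
    print(doc_id, recency_score(relevance, year))
```

Document "a" has lower text relevance than "b" but is one year old, so its 100-point recency boost puts it first.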
IMDB DATA DEMO
Pricing
Get started for just $2.40/day; $75/month
AWS Calculator http://calculator.s3.amazonaws.com/calc5.html
Free Trial
Wrap Up
Powerful search is a critical component of today's applications
Amazon CloudSearch makes adding search easy
Create a domain, POST documents, GET search results
Resources and Q&A
Amazon CloudSearch Overview Page
http://aws.amazon.com/cloudsearch/
• FAQs
• Community Forum
• Documentation & Getting Started Tutorial (IMDb)
Contact our EU business development team• http://aws.amazon.com/contact-us
Thank You
Jon Handler
Searching Wikipedia with Amazon CloudSearch
Iain Fletcher
Search Engine Expertise
Microsoft SharePoint/FAST
Google Search Appliance
Solr
Amazon CloudSearch
LucidWorks
Attivio
Exalead
Autonomy
MarkLogic
elasticsearch
Vivisimo
Sinequa
Hadoop
Sphinx
…
400+ Customers
Agenda
Project Background
High-level Architecture
Summary & Observations
Project Background
Amazon contracted with Search Technologies to help with beta-testing, prior to the launch of Amazon CloudSearch
Decision to use Wikipedia as a convenient data set for testing purposes
High-level Architecture
Indexing
Wikipedia provides content in a series of large xml files
Amazon CloudSearch ingests xml in a specified form
Various content processing tasks to perform
• Splitting into individual documents
• Date normalization
• Metadata extraction & mapping
• Cleanup, etc.
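To make the splitting and date normalization steps concrete, here is a minimal sketch in plain Python. It is not Aspire and not the pipeline used in the project. The sample XML is a simplified, made-up stand-in for a dump file, the element names (page, title, revision, timestamp, text) are assumed, and tags are matched by local name in case the real files declare an XML namespace.

```python
import io
import xml.etree.ElementTree as ET
from datetime import datetime

SAMPLE = b"""<mediawiki>
  <page>
    <title>Search engine</title>
    <revision>
      <timestamp>2012-05-01T12:34:56Z</timestamp>
      <text>A search engine is ...</text>
    </revision>
  </page>
  <page>
    <title>Wine</title>
    <revision>
      <timestamp>2011-11-20T08:00:00Z</timestamp>
      <text>Wine is ...</text>
    </revision>
  </page>
</mediawiki>"""

def local(tag):
    """Strip an optional '{namespace}' prefix from a tag name."""
    return tag.rsplit("}", 1)[-1]

def child_text(elem, *path):
    """Follow child elements by local name and return the final text."""
    for name in path:
        elem = next((c for c in elem if local(c.tag) == name), None)
        if elem is None:
            return ""
    return elem.text or ""

def iter_pages(stream):
    """Stream one dict per <page>, without loading the whole file."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        stamp = child_text(elem, "revision", "timestamp")
        parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
        yield {
            "title": child_text(elem, "title"),
            "date": int(parsed.strftime("%Y%m%d")),  # normalized to an integer
            "text": child_text(elem, "revision", "text"),
        }
        elem.clear()  # release the page's children once it has been emitted

pages = list(iter_pages(io.BytesIO(SAMPLE)))
print([(p["title"], p["date"]) for p in pages])
```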
We used Aspire for these tasks
Aspire in Brief
Based on Apache Felix / OSGi
• Thread-safe, multi-threaded, distributable
• Any number of pipelines, conditional branching
• Plug-in components individually testable & upgradable
• In use with FAST ESP, FS4SP, Solr, Amazon CloudSearch, GSA
• Tested with Elasticsearch and SP 2013
XML Input
Indexing
Streaming Wikipedia Dump Files directly into CloudSearch
500 docs/second achieved without much effort
• Using 4 x XL instances of CloudSearch
• 1 x XL EC2 instance for Aspire
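Streaming a large dump means grouping documents into upload batches as they arrive. A minimal sketch of size-bounded batching follows; the 5 MB default is an assumption about the per-batch limit, to be checked against the service documentation, and the function name is hypothetical. A single document larger than the limit is still emitted alone.

```python
import json

def batches(docs, max_bytes=5 * 1024 * 1024):
    """Group documents into lists whose JSON encoding stays under max_bytes.

    max_bytes defaults to an assumed 5 MB batch limit.
    """
    current, size = [], 2              # 2 bytes for the enclosing "[" and "]"
    for doc in docs:
        # +2 covers the ", " separator json.dumps writes between items.
        encoded = len(json.dumps(doc).encode("utf-8")) + 2
        if current and size + encoded > max_bytes:
            yield current
            current, size = [], 2
        current.append(doc)
        size += encoded
    if current:
        yield current

# Tiny limit so the grouping is visible with small sample documents.
docs = [{"id": str(i), "fields": {"t": "x" * 50}} for i in range(10)]
out = list(batches(docs, max_bytes=200))
print([len(b) for b in out])
```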
Searching
Amazon CloudSearch provides a RESTful/XML interface for search purposes
For the Wikipedia project, we needed a UI
• Chose to use Twigkit
• Wrote a Java API for CloudSearch
• The Java API is freely downloadable (with source) at http://www.searchtechnologies.com/java-api-amazon-cloudsearch.html
Searching
Supports navigators and relevancy customization
• E.g. a “PageRank” style link analysis was performed
Limits set high: E.g. retrieve 500,000 results in a single list, delivered in just a few seconds
• Hugely useful for analysis applications
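Where one very large list is not wanted, the start= and size= parameters of the Search Service API allow the same result set to be walked page by page. A minimal sketch, assuming the response shape shown in the earlier Search Results slide; `fetch_page` is a hypothetical callable that performs the HTTP request, replaced here by an in-memory fake so the example runs offline.

```python
def iter_hit_ids(fetch_page, size=1000):
    """Yield every hit id by advancing start= one page at a time."""
    start = 0
    while True:
        page = fetch_page(start, size)
        hits = page["hits"]["hit"]
        for hit in hits:
            yield hit["id"]
        start += len(hits)
        # Stop on an empty page or once all matches have been seen.
        if not hits or start >= page["hits"]["found"]:
            return

# In-memory stand-in for the search service, for illustration only.
ALL_IDS = [f"doc{i}" for i in range(25)]

def fake_fetch(start, size):
    return {"hits": {"found": len(ALL_IDS), "start": start,
                     "hit": [{"id": i} for i in ALL_IDS[start:start + size]]}}

ids = list(iter_hit_ids(fake_fetch, size=10))
print(len(ids))
```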
So, what does it look like?
wikipedia.searchtechnologies.com
Summary & Observations
A capable and scalable “raw” engine
• xml in, RESTful/xml out
• Easy to set up – much the same as an EC2 instance
• Elastic scalability
Summary & Observations
Cost effective
• From $75 per month, including management / maintenance
Extremely convenient
• Switch on / off at leisure
• Promotes experimentation & agility
Iain Fletcher
For further details, see Paul Nelson’s blog at www.searchtechnologies.com