NOVEMBER 16-18, 2016 • SEVILLE, SPAIN
Building a Search Engine for the Cuban Web
Jorge Luis Betancourt
Search/Crawl Engineer
Who am I
Jorge Luis Betancourt González
Search/Crawl Engineer
Apache Nutch Committer & PMC
Apache Solr/ES enthusiast
Agenda
• Introduction & motivation
• Technologies used
• Customizations
• Conclusions and future work
Introduction / Motivation
Cuba: Internet vs. intranet
Global search engines can’t access documents hosted on the Cuban intranet
Writing your own web search engine from scratch?
Or …
Common search engine features
Web search: HTML & documents (PDF, DOC)
• highlighting
• filters (facets)
• suggestions
• autocorrection
Image search (size, format, color, objects)
• thumbnails
• filters (facets)
• show metadata
• match text with images
News search (alerting, notifications)
• near real time
• email, push, SMS
How to fulfill these requirements?
At its core, a search engine stores some information and retrieves this information when a question is received: store, then query.
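The store/query idea can be illustrated with a toy inverted index. This is a minimal sketch and not how Solr or Lucene are implemented; the class name `TinyIndex` and the sample documents are assumptions made for the example.

```python
from collections import defaultdict
import re


class TinyIndex:
    """Toy inverted index: maps each term to the ids of the documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.docs = {}                    # document id -> original text

    @staticmethod
    def tokenize(text):
        # Lowercase the text and split it into word tokens.
        return re.findall(r"\w+", text.lower())

    def store(self, doc_id, text):
        # "Store": remember the document and add it to each term's posting set.
        self.docs[doc_id] = text
        for term in self.tokenize(text):
            self.postings[term].add(doc_id)

    def query(self, question):
        # "Query": return the ids of documents containing every term asked for.
        terms = self.tokenize(question)
        if not terms:
            return []
        result = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(result)


# Hypothetical sample documents, only for illustration.
index = TinyIndex()
index.store(1, "Apache Nutch is a web crawler")
index.store(2, "Apache Solr is an index server")
index.store(3, "A web interface for search")

print(index.query("apache"))
print(index.query("web crawler"))
```

A real engine adds ranking, analysis chains and on-disk structures on top of this basic term-to-documents mapping.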
Open Source to the rescue …
• crawler
• index server
• web interface
Apache Nutch
“Nutch is a well matured, production ready Web crawler. Enables fine grained configuration, relying on Apache Hadoop™ data structures, which are great for batch processing.”
• Highly scalable
• Highly extensible
• Pluggable parsing, protocols, storage, indexing, scoring
• Active community
• Apache License
Apache Solr
• Total downloads: 8M+
• Monthly downloads: 250,000+
• Apache License
• Highly modular
• Based on Lucene
• Great community
• Stability / Scalability
• Battle tested
Image search and thumbnails
• Custom parser & indexer to store the image thumbnail
• Custom parser, indexer & scoring to identify and store the text related to an image
(diagram: an img element next to p and h1 elements)
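One possible way to associate text with an image is sketched below. It is not the custom Nutch parser from the talk (Nutch plugins are written in Java); it is a standalone illustration using Python's standard `html.parser`. The heuristic (alt text, latest heading, next non-empty paragraph), the class name `ImageTextExtractor` and the sample HTML are all assumptions.

```python
from html.parser import HTMLParser


class ImageTextExtractor(HTMLParser):
    """Collects, for each <img>, its alt text, the latest heading and the next paragraph."""

    HEADINGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.images = []         # one dict per image found
        self._current = None     # tag whose text is being collected ("h1", "p", ...)
        self._buffer = []        # text chunks of the tag being collected
        self._last_heading = ""  # text of the most recent heading
        self._pending = []       # images still waiting for a following paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attrs = dict(attrs)
            image = {
                "src": attrs.get("src") or "",
                "alt": attrs.get("alt") or "",
                "heading": self._last_heading,
                "paragraph": "",
            }
            self.images.append(image)
            self._pending.append(image)
        elif tag in self.HEADINGS or tag == "p":
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return
        # Collapse runs of whitespace into single spaces.
        text = " ".join("".join(self._buffer).split())
        if tag in self.HEADINGS:
            self._last_heading = text
        elif text:
            # A non-empty paragraph closed: attach it to the images waiting for one.
            for image in self._pending:
                image["paragraph"] = text
            self._pending = []
        self._current = None


# Hypothetical page fragment, only for illustration.
html = """
<h1>Havana at night</h1>
<img src="/img/malecon.jpg" alt="The Malecon">
<p>The seafront   promenade after sunset.</p>
<p>Unrelated paragraph.</p>
"""
parser = ImageTextExtractor()
parser.feed(html)
print(parser.images)
```

The collected fields could then be indexed next to the image so that a text query can match it; how each field is weighted is a scoring decision.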
How does it work?
(diagram: steps numbered 1 to 3 over a page containing h1, p and img elements)
News search (NRT & alerting)
Nutch is not really suited for this task: the batch nature of its Hadoop jobs doesn’t fit well in this scenario
Our topology
http://news-site.com → RSS → fetch → parse → index
• RSS: parses the RSS feed and outputs the news links to be processed according to the SC protocol (https://github.com/commoncrawl/news-crawl)
• monitor: flaxsearch/luwak
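The two ideas in this topology, extracting links from an RSS feed and matching each new document against registered queries (the stored-query approach that Luwak provides on top of Lucene), can be sketched as follows. This is not the news-crawl or Luwak API; `extract_items`, `AlertMonitor`, the feed content and the query names are assumptions for illustration, and the matching is a plain all-terms-present check.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed, only for illustration.
FEED = """<rss version="2.0"><channel>
  <title>Example news</title>
  <item><title>New crawler released</title><link>http://news-site.com/a</link></item>
  <item><title>Weather report for Havana</title><link>http://news-site.com/b</link></item>
</channel></rss>"""


def extract_items(feed_xml):
    """Return (title, link) pairs for every <item> of an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    return [
        (
            item.findtext("title", default="").strip(),
            item.findtext("link", default="").strip(),
        )
        for item in root.iter("item")
    ]


class AlertMonitor:
    """Keeps registered queries and matches each incoming document against them."""

    def __init__(self):
        self.queries = {}  # query id -> set of required lowercase terms

    def register(self, query_id, terms):
        self.queries[query_id] = {t.lower() for t in terms}

    def match(self, text):
        # A query matches when all of its terms appear in the document text.
        words = set(text.lower().split())
        return sorted(qid for qid, terms in self.queries.items() if terms <= words)


monitor = AlertMonitor()
monitor.register("crawler-news", ["crawler"])
monitor.register("havana-weather", ["weather", "havana"])

# For each news link, the ids of the registered queries its title matches.
alerts = {link: monitor.match(title) for title, link in extract_items(FEED)}
print(alerts)
```

Matching documents against queries, instead of queries against documents, is what lets an alert fire as soon as a news item arrives, without waiting for a batch job.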
Querying the data
Apache Solr
• Solr has full support for highlighting (3 implementations)
• Powerful faceting capabilities (even more in recent releases)
• Autocorrection support based on the index content
• Awesome scalability (SolrCloud, classic master-slave replication)
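As an illustration of how these features are requested, the sketch below builds a Solr select URL that turns on highlighting, faceting and spellchecking through the standard `hl`, `hl.fl`, `facet`, `facet.field` and `spellcheck` request parameters. The host, the core name `web` and the field names `content`, `type` and `host` are assumptions, and spellchecking only works if the request handler has a spellcheck component configured.

```python
from urllib.parse import urlencode, parse_qs, urlsplit


def build_search_url(base_url, user_query, facet_fields=("type", "host")):
    """Build a Solr select URL with highlighting, facets and spellcheck enabled."""
    params = [
        ("q", user_query),
        ("wt", "json"),
        ("hl", "true"),        # enable highlighting
        ("hl.fl", "content"),  # field to highlight (hypothetical field name)
        ("facet", "true"),     # enable faceting
    ]
    # One facet.field parameter per field to facet on (hypothetical field names).
    params += [("facet.field", field) for field in facet_fields]
    params.append(("spellcheck", "true"))
    return base_url + "?" + urlencode(params)


# Hypothetical host and core name.
url = build_search_url("http://localhost:8983/solr/web/select", "cuban music")
print(url)
```

Keeping the parameters as a list of pairs allows `facet.field` to be repeated, which a plain dictionary would not.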
Other features - monitoring
We needed a way of monitoring our infrastructure. Without a great Internet connection you can’t send GB of logs to a cloud environment, so …
• (and facets) analytical tool
• (and logs)
• (and metrics) time series store
• (and logs) parsing & aggregation
Banana (Kibana port) for visualizations
Infrastructure
• Nutch crawlers
• Solr master
• Solr replicator
• Web
• Connections: HTTP, JAVABIN
Some usage stats
• Less than 10,000 visits
• Around 600 unique visitors
Future work
• Apply deep learning techniques to process the raw images and combine the results with the current approach
• Increase the number of signals that we get from our crawlers (correlate even more crawl-related events)