Building a Search Engine for the Cuban Web

NOVEMBER 16-18, 2016 • SEVILLE, SPAIN Building a Search Engine for the Cuban Web Jorge Luis Betancourt Search/Crawl Engineer


Who am I

Jorge Luis Betancourt González

Search/Crawl Engineer

Apache Nutch Committer & PMC

Apache Solr/ES enthusiast


Agenda

• Introduction & motivation

• Technologies used

• Customizations

• Conclusions and future work


Introduction / Motivation

Cuba: Internet vs. Intranet

Global search engines can’t access documents hosted on the Cuban Intranet


Writing your own web search engine from scratch? Or …


Common search engine features

1. Web search: HTML & documents (PDF, DOC)

• highlighting
• filters (facets)
• suggestions
• autocorrection

2. Image search (size, format, color, objects)

• thumbnails
• filters (facets)
• show metadata
• match text with images

3. News search (alerting, notifications)

• near real time
• email, push, SMS


How to fulfill these requirements?

At its core, a search engine stores some information (store) and retrieves it when a question is received (query).
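
The store/query idea can be illustrated with a toy inverted index. This is only a minimal sketch to make the concept concrete: the class name `TinyIndex` and the whitespace tokenization are assumptions for illustration, and a real index server such as Solr/Lucene does far more (analysis, scoring, persistence).

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: maps each term to the ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.docs = {}                    # document id -> original text

    def store(self, doc_id, text):
        """Store a document and index each of its lowercase terms."""
        self.docs[doc_id] = text
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def query(self, question):
        """Return the sorted ids of documents containing every query term."""
        terms = question.lower().split()
        if not terms:
            return []
        matches = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(matches)


index = TinyIndex()
index.store(1, "Cuban web search engine")
index.store(2, "image search with thumbnails")
result = index.query("search engine")  # ids of documents containing both terms
print(result)
```

Storing is a write into the term-to-documents map and querying is an intersection of the posting sets, which is why the two operations can be scaled and tuned separately.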


Open Source to the rescue …

• index server

• crawler

• web interface


Apache Nutch

“Nutch is a well matured, production ready Web crawler. Enables fine grained configuration, relying on Apache Hadoop™ data structures, which are great for batch processing.”


Apache Nutch

• Highly scalable

• Highly extensible

• Pluggable parsing, protocols, storage, indexing, scoring

• Active community

• Apache License


Apache Solr

Total downloads: 8M+

Monthly downloads: 250,000+

• Apache License

• Highly modular

• Based on Lucene

• Great community

• Stability / Scalability

• Battle tested


Back to the list of features


Image search and thumbnails

• Custom parser & indexer to store the image thumbnail

• Custom parser, indexer & scoring to identify and store the text related to an image (surrounding elements such as h1 and p)
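
One way to picture "text related to an image" is to walk the HTML and attach nearby text to each `img`. The sketch below is not the Nutch plugin from the talk (which would be written in Java); it is a hedged Python illustration using only the standard library, and the class name `ImageContextParser` and the choice of "nearest preceding h1, alt text, next paragraph" are assumptions.

```python
from html.parser import HTMLParser


class ImageContextParser(HTMLParser):
    """Collects, for each <img>, its alt text, the nearest preceding <h1>
    and the first <p> that is closed after the image."""

    def __init__(self):
        super().__init__()
        self.images = []          # one dict per <img> found
        self._last_heading = ""   # text of the most recent <h1>
        self._current_tag = None  # h1 or p whose text is being read
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attrs = dict(attrs)
            self.images.append({
                "src": attrs.get("src") or "",
                "alt": attrs.get("alt") or "",
                "heading": self._last_heading,
                "paragraph": "",
            })
        elif tag in ("h1", "p"):
            self._current_tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current_tag:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current_tag:
            return
        text = " ".join("".join(self._buffer).split())  # normalize whitespace
        if tag == "h1":
            self._last_heading = text
        elif self.images and not self.images[-1]["paragraph"]:
            # Attach the paragraph to the most recent image if it has none yet.
            self.images[-1]["paragraph"] = text
        self._current_tag = None


html = """
<h1>Havana at night</h1>
<img src="/img/malecon.jpg" alt="Malecon seawall">
<p>The seawall is a popular meeting place.</p>
"""
parser = ImageContextParser()
parser.feed(html)
doc = parser.images[0]  # fields that could be indexed alongside the thumbnail
print(doc)
```

Each resulting dictionary could become one indexed image document, with the heading, alt text and paragraph stored as separate fields so that scoring can weight them differently.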


How does it work?

[Diagram: page elements (img, p, h1) with numbered steps 1 to 3]


News search (NRT & alerting)

Nutch is not really suited for this task: the batch nature of the Hadoop jobs doesn’t fit well in this scenario.
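
Alerting reverses the usual search flow: the queries are stored, and every newly arrived document is matched against them. The following is a minimal sketch of that idea only; `AlertRegistry` and its all-terms-must-match rule are assumptions for illustration and do not reproduce the API of any monitoring library.

```python
class AlertRegistry:
    """Stores user queries and matches each new document against them."""

    def __init__(self):
        self.queries = {}  # query id -> set of required lowercase terms

    def register(self, query_id, text):
        """Remember a query; all of its terms must occur for a match."""
        self.queries[query_id] = set(text.lower().split())

    def match(self, document):
        """Return the sorted ids of stored queries satisfied by the document."""
        words = set(document.lower().split())
        return sorted(qid for qid, terms in self.queries.items() if terms <= words)


alerts = AlertRegistry()
alerts.register("baseball-fans", "baseball final")
alerts.register("weather", "hurricane")
hits = alerts.match("Industriales reach the baseball final in Havana")
print(hits)  # ids of the subscriptions to notify by email, push or SMS
```

Because matching happens once per incoming document rather than once per crawl cycle, notifications can go out in near real time instead of waiting for a batch job to finish.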


Our topology

Pipeline: http://news-site.com → RSS → fetch → parse → index

• RSS: parses the RSS feed and outputs the news links to be processed according to the SC protocol (https://github.com/commoncrawl/news-crawl)

• monitor: flaxsearch/luwak
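
The RSS step of the topology can be sketched as follows. This is a standalone Python illustration with the standard library, not code from the news-crawl project; the function name `extract_news_links` and the sample feed are assumptions.

```python
import xml.etree.ElementTree as ET


def extract_news_links(rss_xml):
    """Return (title, link) pairs for every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if link:  # items without a link cannot be fetched
            links.append((title, link))
    return links


# Hypothetical feed used only to exercise the function.
feed = """<rss version="2.0">
  <channel>
    <title>Example news</title>
    <link>http://news-site.com</link>
    <item><title>First story</title><link>http://news-site.com/a</link></item>
    <item><title>Second story</title><link>http://news-site.com/b</link></item>
  </channel>
</rss>"""

news_links = extract_news_links(feed)
print(news_links)
```

The links produced here are what the later fetch, parse and index steps would consume, one news article at a time.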


Querying the data


Apache Solr

• Solr has full support for highlighting (3 implementations)

• powerful faceting capabilities (even more in recent releases)

• autocorrection support based on the index content

• awesome scalability (SolrCloud, classic master-slave replication)
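
A single request can ask Solr for highlighting, facets and spelling corrections at once. The sketch below only builds the request URL with standard Solr query parameters; the collection name `web` and the field names `content`, `host` and `type` are hypothetical, and spellcheck is assumed to be enabled on the request handler in `solrconfig.xml`.

```python
from urllib.parse import urlencode


def build_select_url(base_url, collection, user_query):
    """Build a /select URL that enables highlighting, faceting and spellcheck."""
    params = [
        ("q", user_query),
        ("hl", "true"),                  # highlighting
        ("hl.fl", "content"),            # field to highlight (hypothetical name)
        ("facet", "true"),               # filters shown next to the results
        ("facet.field", "host"),         # hypothetical facet fields
        ("facet.field", "type"),
        ("spellcheck", "true"),          # autocorrection from the index content
        ("spellcheck.collate", "true"),  # ask for a corrected full query
        ("wt", "json"),
    ]
    return f"{base_url}/{collection}/select?{urlencode(params)}"


url = build_select_url("http://localhost:8983/solr", "web", "la habana")
print(url)
```

Keeping the parameters in a list of pairs allows `facet.field` to be repeated, which is how Solr expects several facet fields to be requested.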


The features, once again


Other features - monitoring

We needed a way of monitoring our infrastructure. Without a great Internet connection you can’t send GBs of logs to a cloud environment, so …

• (and facets) analytical tool

• (and logs)

• (and metrics) time series store

• (and logs) parsing & aggregation
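
Log parsing and aggregation can be pictured as turning raw lines into per-minute counts that are small enough to index locally. This is a minimal sketch under stated assumptions: the log line format, the regular expression `LOG_LINE` and the function `aggregate` are hypothetical and not taken from the talk.

```python
import re
from collections import Counter

# Assumed line format: "YYYY-MM-DD HH:MM:SS LEVEL message"
LOG_LINE = re.compile(
    r"^(?P<minute>\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2} (?P<level>[A-Z]+) (?P<msg>.*)$"
)


def aggregate(lines):
    """Count log events per (minute, level); unparseable lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            counts[(match.group("minute"), match.group("level"))] += 1
    return counts


sample = [
    "2016-11-16 10:00:01 INFO fetching http://news-site.com/a",
    "2016-11-16 10:00:42 ERROR fetch failed",
    "2016-11-16 10:01:05 INFO indexing done",
    "not a log line",
]
counts = aggregate(sample)
print(counts)
```

Aggregated counts like these are the kind of compact time series that can be stored and charted on site, instead of shipping gigabytes of raw logs elsewhere.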


Banana (Kibana port) for visualizations


Infrastructure

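[Architecture diagram: Nutch crawlers, Solr master, Solr replica ("Replicador") and the web front end; components communicate over HTTP, with one link using JAVABIN; steps numbered 1 and 2]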


Some usage stats

• fewer than 10,000 visits

• around 600 unique visitors


Future work

• Apply deep learning techniques to process the raw images and combine the results with the current approach

• Increase the number of signals that we get from our crawlers (correlate even more crawl-related events)


Thanks

Questions?


@jorgelbg