A customized web search engine

Supervised by Dr. Mohamed A. El-Rashidy and Eng. Ahmed Ghozia, Dept. of Computer Science & Engineering, Faculty of Electronic Engineering, Menoufiya University.

description

A customized web search engine is a graduation project. This presentation explains what a search engine is and describes the open source software used in the project.

Transcript of A customized web search engine [autosaved]

Page 1: A customized web search engine [autosaved]

Supervised By

Dr. Mohamed A. El-Rashidy Eng. Ahmed Ghozia

Dept. of Computer Science & Engineering

Faculty of Electronic Engineering,

Menoufiya University.

Page 2: A customized web search engine [autosaved]

The main purpose of this project is to build our own search engine that should suffice for our needs as a nation

This project attempts to add customized features to the search engine, such as building and developing a time-based search engine meant to deal with local and international news

Page 3: A customized web search engine [autosaved]

Question: What is a search engine?

How does a web search engine work?

Web crawler, Indexing, Ranking

Lucene, Nutch, Solr

Who uses Solr?

Set up Nutch for web crawling

Set up Solr for search

Running Nutch in Eclipse for development

Experiments

Page 4: A customized web search engine [autosaved]

Answer: Software that

builds an index on text

answers queries using that index

A search engine offers

Scalability

Relevance Ranking

Integration of different data sources (email, web pages, files, databases, ...)

Page 5: A customized web search engine [autosaved]

A search engine operates in the following order:

1. Web crawling

2. Indexing

3. Ranking

Page 6: A customized web search engine [autosaved]

A web crawler is a program or automated script which browses the World Wide Web

It is used to create a copy of all the visited pages for later processing by a search engine

It starts with a list of URLs to visit, called the seeds

URLs are visited recursively according to a set of policies:

A selection policy

A re-visit policy

A politeness policy

A parallelization policy
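
The seed-and-frontier idea above can be sketched in a few lines. This is only an illustration under stated assumptions, not how Nutch implements crawling: the `FAKE_WEB` link graph, the `crawl` function, and its `depth` and `top_n` parameters are hypothetical names, and the "web" is an in-memory dictionary so the sketch runs without network access. Only a selection policy is modeled (skip URLs already seen, fetch at most `top_n` per round); re-visit, politeness, and parallelization policies are left out.

```python
# Hypothetical in-memory "web": each URL maps to the links found on that page.
FAKE_WEB = {
    "http://a.example/": ["http://a.example/news", "http://b.example/"],
    "http://a.example/news": ["http://a.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": [],
}


def crawl(seeds, depth, top_n, fetch_links=FAKE_WEB.get):
    """Crawl in `depth` rounds, fetching at most `top_n` URLs per round."""
    visited = []          # URLs fetched so far, in fetch order
    seen = set(seeds)     # URLs already queued, so nothing is queued twice
    frontier = list(seeds)
    for _ in range(depth):
        # Selection policy: take the first top_n URLs, keep the rest for later.
        batch, rest = frontier[:top_n], frontier[top_n:]
        new_links = []
        for url in batch:
            visited.append(url)
            for link in fetch_links(url) or []:
                if link not in seen:
                    seen.add(link)
                    new_links.append(link)
        frontier = rest + new_links
    return visited


if __name__ == "__main__":
    for url in crawl(["http://a.example/"], depth=2, top_n=5):
        print(url)
```

A real crawler would replace `fetch_links` with an HTTP fetch plus link extraction, and would add delays per host to stay polite.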

Page 7: A customized web search engine [autosaved]

The indexing process covers how data is collected, parsed, and stored to facilitate fast and accurate search query evaluation.

The process involves the following steps

Data collection

Data traversal

Indexing

Page 8: A customized web search engine [autosaved]

Indexing process:

Convert document

Extract text and metadata

Normalize text (stop words, stemming)

Write (inverted) index

Example:

Document 1: "Apache Lucene at Jazoon"

Document 2: "Jazoon conference"

Index (terms lowercased, the stop word "at" dropped):

apache -> 1

conference -> 2

jazoon -> 1, 2

lucene -> 1
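
The example index above can be reproduced with a short sketch. It is a simplified illustration, not Lucene's analyzer chain: `STOP_WORDS`, `tokenize`, and `build_index` are hypothetical names, the stop word list is deliberately tiny, and stemming is not performed.

```python
import re
from collections import defaultdict

# Tiny illustrative stop word list (an assumption, not Lucene's default set).
STOP_WORDS = {"a", "an", "and", "at", "of", "the"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


docs = {1: "Apache Lucene at Jazoon", 2: "Jazoon conference"}
index = build_index(docs)

if __name__ == "__main__":
    for term, doc_ids in index.items():
        print(term, "->", ", ".join(str(d) for d in doc_ids))
```

Looking up a query term is then a single dictionary access, which is why an inverted index makes search fast.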

Page 9: A customized web search engine [autosaved]

The web search engine responds to a query that a user enters to satisfy his or her information needs
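
As a rough illustration of how results might be ordered for a query, the sketch below scores documents with a simple TF-IDF-style weight. This is an assumption made for illustration only: it is not the scoring formula used by Lucene or Solr, and `tokenize` and `rank` are hypothetical names.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (no stop words or stemming here)."""
    return text.lower().split()


def rank(query, docs):
    """Return ids of matching documents, best score first."""
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    scores = {}
    for doc_id, counts in term_counts.items():
        score = 0.0
        for term in tokenize(query):
            # Document frequency: in how many documents the term occurs.
            df = sum(1 for c in term_counts.values() if term in c)
            if df and counts[term]:
                # Rare terms weigh more than terms found in every document.
                score += counts[term] * math.log(1 + n_docs / df)
        if score > 0:
            scores[doc_id] = score
    # Sort by descending score, breaking ties by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


docs = {1: "Apache Lucene at Jazoon", 2: "Jazoon conference"}

if __name__ == "__main__":
    print(rank("lucene jazoon", docs))
```

Document 1 matches both query terms and so ranks ahead of document 2, which matches only "jazoon".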

Page 11: A customized web search engine [autosaved]

Lucene is a high-performance, scalable information retrieval (IR) library

It lets you add search capabilities to your applications.

free, open source project implemented in Java

With Lucene, you can index and search email messages, mailing-list archives, instant messenger chats, your wiki pages…the list goes on.

Page 12: A customized web search engine [autosaved]

Nutch: web search engine software

Open source web crawler

Coded entirely in the Java programming language

Advantages:

Scalability

Crawler Politeness

Crawler Management

Quality

Page 13: A customized web search engine [autosaved]

Solr is an open source enterprise search platform based on the Apache Lucene project.

Powerful full-text search, hit highlighting, faceted search

Database integration, and rich document (e.g., Word, PDF) handling

Page 15: A customized web search engine [autosaved]

Download a binary package (apache-nutch-bin.zip)

cd apache-nutch-1.X/

bin/nutch crawl urls -dir crawl -depth 3 -topN 5

Now you should be able to see the following directories created:

crawl/crawldb

crawl/linkdb

crawl/segments

Page 16: A customized web search engine [autosaved]

If you already have a Solr core set up and wish to index to it, use:

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5

The following pages show how to set up your Solr instance and index your crawl data.

Page 17: A customized web search engine [autosaved]

Download binary file (apache-Solr-bin.zip)

cd ${APACHE_SOLR_HOME}/example

java -jar start.jar

After you start Solr, you should be able to access the admin console at the following link:

http://localhost:8983/solr/admin/

Integrate Solr with Nutch:

cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/

Page 18: A customized web search engine [autosaved]

Restart Solr with the command "java -jar start.jar" under ${APACHE_SOLR_HOME}/example

Run the Solr index command:

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
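
Once the crawl data is indexed, a search can be sent to Solr over HTTP. The sketch below only builds the query URL, so it runs without a Solr server. It assumes the example instance exposes its standard select handler at /solr/select on localhost:8983; the exact path may differ by Solr version and core setup, and `build_select_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def build_select_url(query, base="http://localhost:8983/solr", rows=10):
    """Build a Solr select URL for `query` (assumed handler path: /select)."""
    params = {
        "q": query,     # the user's search terms
        "rows": rows,   # maximum number of results to return
        "wt": "json",   # ask Solr for a JSON response
    }
    return f"{base}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Paste the printed URL into a browser while Solr is running.
    print(build_select_url("egypt news"))
```

Opening the printed URL in a browser while Solr is running should return the matching documents, provided the handler path matches your installation.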

Page 20: A customized web search engine [autosaved]

Crawling the Egyptian Universities

Page 21: A customized web search engine [autosaved]

Crawling the Arabic news websites

Page 22: A customized web search engine [autosaved]

Crawling the Arabic news websites

Page 24: A customized web search engine [autosaved]

Mustafa Mohammed Ahmed Elkhiat

