Transcript of Decentralised Web Search (yacy.net/material/YaCy_FOSS_ASIA_2010.pdf, 2014-04-08, 23 pages)

Page 1

Michael Christen, http://yacy.net

FOSS ASIA 2010, Ho Chi Minh City, Vietnam: Web Search Engine For Everyone

Decentralised Web Search

Web Search Engine Software For Everyone: we can remove the dependency on (a) large search engine provider.

Page 2


Topics

• Search Engine Technology: how large-scale search engines are made available for everyone using peer-to-peer technology

• Demonstration: what you can do in just five minutes: installation, crawling, searching, monitoring, scheduling

• System Components and Development: details about search appliance components like scheduler, document parser, administration and visualization. Easy integration into a web page. APIs for external index queries and external index feeding components.

Page 3


Search Engine Components

Retrieval, Indexing, Storage and Search Components

[Diagram labels: Crawler, Indexing, Database, Search Interface; URL Crawl Stack, Double Link Check, Text Analysis, Stop Words Check, words, links, Reverse Word Index, URL References, Word; crawl depth 0 (Start-URL), depth 1, depth 2; filtering, parsing; ranking, verification, visualisation]

YaCy has an integrated NoSQL Database. The database stores a Reverse Word Index, Metadata and the source documents.
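The reverse word index and the stop words check named on this slide can be illustrated with a minimal Python sketch. It is not YaCy's implementation: the stop word list, the tokenisation and the function names `build_reverse_index` and `search` are assumptions made for illustration only.

```python
from collections import defaultdict

# Illustrative stop word list (an assumption, not the list YaCy uses)
STOP_WORDS = {"a", "an", "and", "the", "of", "for"}

def build_reverse_index(documents):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for word in text.lower().split():
            word = word.strip(".,;:!?")
            if word and word not in STOP_WORDS:   # stop words check
                index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every non-stop word of the query."""
    words = [w for w in query.lower().split() if w not in STOP_WORDS]
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

docs = {
    "http://example.org/a": "Decentralised web search for everyone",
    "http://example.org/b": "A web crawler and the index",
}
index = build_reverse_index(docs)
```

With this toy data, `search(index, "web search")` intersects the URL sets stored for "web" and "search", which is the lookup a reverse word index makes cheap.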

Page 4


Large Search Cluster: Model

Efficient search engines are constructed using a matrix of many small search engines.

[Diagram: Search Engine Cluster, a matrix of 15 Search Engine boxes. Horizontal scaling: more documents. Vertical scaling: more queries per second.]
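The matrix model can be sketched in a few lines of Python. This is a toy under stated assumptions: the class `SearchMatrix`, the SHA-1 based column choice and the round-robin row choice are hypothetical and only illustrate the two scaling directions from the slide (columns split the documents, rows are replicas that share the query load).

```python
import hashlib

class SearchMatrix:
    """Toy matrix of small search engines: columns split documents, rows are replicas."""

    def __init__(self, columns, rows):
        self.columns = columns
        self.rows = rows
        # cells[row][column] is one small search engine: word -> set of URLs
        self.cells = [[{} for _ in range(columns)] for _ in range(rows)]
        self.next_row = 0

    def _column_for(self, url):
        # Assumption: documents are assigned to a column by a hash of the URL
        digest = hashlib.sha1(url.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % self.columns

    def add_document(self, url, text):
        column = self._column_for(url)
        for row in range(self.rows):            # every replica row gets a copy
            engine = self.cells[row][column]
            for word in text.lower().split():
                engine.setdefault(word, set()).add(url)

    def query(self, word):
        row = self.next_row                     # round-robin over replica rows
        self.next_row = (self.next_row + 1) % self.rows
        hits = set()
        for engine in self.cells[row]:          # ask every column in that row
            hits |= engine.get(word.lower(), set())
        return hits
```

Adding columns lets the matrix hold more documents, while adding rows lets more queries be answered in parallel, because each query only touches one row.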

Page 5


Large Search Cluster in Data Center

Usually such search engine clusters are hosted by one organization in a data center.

[Diagram: the matrix of 15 Search Engine boxes.]

Page 6


Large Search Cluster: Decentralised

Imagine you can take the software outside the data center and connect the peers in a decentralised way.

[Diagram: 15 Search Engine boxes.]

Page 7


Decentralised Search with YaCy 1/3

YaCy is a search engine appliance that can be used either in a data center or as a decentralised network of private peers.

[Diagram: 15 peers in three rows of five.]

How can a search matrix be distributed? The peers are ordered using an ordering on peer hashes. The hash ordering is closed at the end, and the resulting network can be drawn as a circle...
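A closed ordering on peer hashes can be sketched as a ring lookup in Python. The hash function (SHA-1), the class `PeerRing` and its method names are assumptions for illustration; the slide does not say how YaCy computes or compares peer hashes.

```python
import bisect
import hashlib

def ring_position(name):
    """Hash a name to an integer position on the ring (SHA-1 is an assumption)."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest()[:8], "big")

class PeerRing:
    def __init__(self, peer_names):
        # Peers sorted by hash position: this is the ordering on peer hashes
        self.peers = sorted((ring_position(n), n) for n in peer_names)
        self.positions = [p for p, _ in self.peers]

    def peer_at(self, position):
        """First peer at or after the position, wrapping around at the end."""
        i = bisect.bisect_left(self.positions, position)
        return self.peers[i % len(self.peers)][1]

    def successor(self, key):
        """Peer responsible for a key such as a word or a URL."""
        return self.peer_at(ring_position(key))
```

The modulo in `peer_at` is what closes the ordering: a position past the last peer wraps around to the first one, so the line of peers becomes a circle.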

Page 8


Decentralised Search with YaCy 2/3

A 'Folded' Search Matrix

[Diagram: 16 peers; labels DHT-Store and DHT-Read.]

This peer (as an example) fetches some web pages and distributes index fragments to other peers.

A peer that searches for information can directly access the peers holding the corresponding index.

YaCy peers store index fragments according to a 'folded' ordering on word hashes and URL hashes in a distributed hash table (DHT). The index is distributed redundantly so that it is preserved when some peers are not available. The redundancy also helps to increase search performance.
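Redundant storage of index fragments can be sketched as follows. This is a simplified Python illustration, not YaCy's 'folded' ordering: the hash function, the redundancy of 3, and the names `responsible_peers` and `ToyDHT` are all assumptions.

```python
import hashlib

def h(value):
    """Integer hash used for both peers and words (SHA-1 is an assumption)."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:8], "big")

def responsible_peers(word, peers, redundancy=3):
    """Pick `redundancy` consecutive peers on the ring, starting at the word's position."""
    ordered = sorted(peers, key=h)
    start = next((i for i, p in enumerate(ordered) if h(p) >= h(word)), 0)
    count = min(redundancy, len(ordered))
    return [ordered[(start + k) % len(ordered)] for k in range(count)]

class ToyDHT:
    def __init__(self, peers, redundancy=3):
        self.peers = list(peers)
        self.redundancy = redundancy
        self.storage = {p: {} for p in self.peers}   # peer -> word -> set of URLs
        self.offline = set()                          # peers that are not available

    def store(self, word, url):
        # DHT-Store: send the fragment to every responsible peer
        for peer in responsible_peers(word, self.peers, self.redundancy):
            self.storage[peer].setdefault(word, set()).add(url)

    def read(self, word):
        # DHT-Read: ask the responsible peers directly, skip unavailable ones
        for peer in responsible_peers(word, self.peers, self.redundancy):
            if peer not in self.offline:
                return self.storage[peer].get(word, set())
        return set()
```

Because each fragment lives on several peers, a read still succeeds while at least one of them is online, and a searching peer needs no central coordinator to find them.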

Page 9


Decentralised Search with YaCy 3/3

The 'default' YaCy Search Engine Network

[Diagram labels: DHT-Store, DHT-Read]

Peer Types:

• Junior: behind firewall or router

• Senior: has open server port

• Principal: publishes seed-lists

Page 10


YaCy Search Cluster in a Data Center

'Sciencenet': search engine for scientific content at the Karlsruhe Institute of Technology: http://sciencenet.fzk.de

• 300 million documents

• 34 computers running YaCy in its own network

Page 11


Decentralised Search for Everyone

Search Engine @Home

More than 1 billion documents

People run their own YaCy search peer at home and create independent search for everyone.

Page 12


Benefits

Impact of running your own search engine:

• become independent from large search engine operators

• keep company secrets: search tracks can reveal industrial research targets

• your personal relevance: you can create a ranking method for your personal needs

• same rights for all people: everyone can run a search engine

Page 13


Demo: Users

• geoclub.de

• linuxtag.org

• linux-club.de

• fsfe.org

Page 14


Use Cases

• Decentralised Peer-to-Peer Web Search: search engines for everyone

• High-Performance Search Clusters: generic search portals for any need

• Internet Search Portal for a project: combining wikis, blogs, forums and portal pages

• Alert-Service for News using RSS: create a News-Feed using recent search results for a specific topic

• Intranet Search Appliance: search in local web servers and file shares

Page 15


Search Interface

• Standards: SRU

• APIs: Opensearch (search results with RSS), JSON, AJAX tools

• Tools: search widget, ready-to-use code snippets to embed search everywhere

• API for search results is RSS (Opensearch) and JSON

• Facets: Domains, Authors

• every link is verified before it is displayed: the content is loaded, parsed and used for search snippet generation

Page 16


Document Retrieval and Parser

A search engine should support people in the search for documents in unstructured formats: this needs a kind of 'understanding' of content.

Connection

• load and crawl from: HTTP, HTTPS, FTP, filesystem, SMB shares

• import from: Dublin Core / XML files, OAI-PMH, Wikimedia dumps, SQL databases

Interpretation

• find metadata (headline, author, date, locations)

• find links of different kinds (text, images, movies etc.)

• store statistical data for search suggestions

Parsing

• read document formats: HTML, XHTML, RSS, RDF, XHTML+RDFa, FOAF, vCard, Flash, PDF, PS, Word, Excel, Visio, PowerPoint, OpenOffice, RTF, csv, gzip, zip, tar, rar, bzip2, 7zip, images (EXIF), torrent files

Page 17


Search Result Ranking

a prototype discussion about ranking

Q: do you use the same ranking as G**gle?
A: no. PR is difficult and sometimes useless (e.g. in intranets)

Q: then you cannot be better?
A: we have many ranking criteria and users can mix them.

Q: but is this better?
A: what is 'better'? G**gle defines 'better' as: 'most people like it'

Q: I have an idea: ....
A: suddenly people think about their personal relevance requirements.....

Q: then what is the best ranking?
A: do experiments! If you run your own search engine, then you may need your own ranking. Different contents may need different rankings.

Q: every peer?
A: when doing a remote search, the remote peer uses your own ranking too!

[Callouts: "that's what lucene has"; "similar to G**gle PR"; "in YaCy, you can combine many weighted attributes"]
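Combining many weighted attributes can be pictured as a weighted sum per search result. The Python sketch below is only an illustration: the attribute names (`term_frequency`, `recency`, `title_match`), their values and the weights are invented, and YaCy's actual ranking criteria are not listed on this slide.

```python
def score(attributes, weights):
    """Weighted sum of a result's attribute values; unknown attributes count as 0."""
    return sum(weight * attributes.get(name, 0.0) for name, weight in weights.items())

def rank(results, weights):
    """Sort result URLs (url -> attribute dict) by descending score."""
    return sorted(results, key=lambda url: score(results[url], weights), reverse=True)

# Hypothetical attribute values for two search results
results = {
    "http://example.org/fresh": {"term_frequency": 0.2, "recency": 0.9, "title_match": 0.0},
    "http://example.org/exact": {"term_frequency": 0.8, "recency": 0.1, "title_match": 1.0},
}

# Two personal ranking profiles: the same results, different weights
prefer_recent = {"term_frequency": 1.0, "recency": 5.0}
prefer_title = {"term_frequency": 1.0, "title_match": 3.0}
```

The two profiles order the same two results differently, which is the point of the discussion above: there is no single best ranking. In this picture, a remote search would send the weights along with the query so that the remote peer can apply the searcher's profile.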

Page 18


Parts of a Search Appliance

• Search Engine: retrieval, indexing, storage and search components

• Data Visualisation: index creation process, system load, link structure, p2p net configuration

• Scheduler and Steering: automatic scheduled re-indexing and back-up of search appliance set-up

• Database Administration: crawl queues, robots.txt, rss feeds, scheduler data, p2p connections, network messages

Page 19


Search Interface Integration

How to integrate a YaCy Search Portal: just copy-paste the code snippet into your web page source code.

Code Snippet Example #1: a search window in an iframe

    <iframe name="target2"
            src="http://141.52.175.43:8080/yacysearch.html?display=2&resource=local"
            width="100%" height="180" frameborder="0" scrolling="auto"
            id="target2"></iframe>

Code Snippet Example #2: a search box (points to new page)

The YaCy administration interface offers more code snippets. An example from /ConfigSearchBox.html looks like:

    <form method="get" accept-charset="UTF-8" action="http://141.52.175.43:8080/yacysearch.html">
      <div>
        <div>MySearch</div>
        <input type="text" name="query" value="" maxlength="80" />
        <input type="hidden" name="verify" value="true" />
        <input type="hidden" name="maximumRecords" value="10" />
        <input type="hidden" name="meanCount" value="5" />
        <input type="hidden" name="resource" value="local" />
        <input type="hidden" name="urlmaskfilter" value=".*" />
        <input type="hidden" name="prefermaskfilter" value="" />
        <input type="hidden" name="display" value="2" />
        <input type="hidden" name="nav" value="all" />
        <input type="submit" name="Enter" value="Search" />
      </div>
    </form>

Your YaCy peer provides help pages with code snippets for easy integration!

Page 20


External Index Retrieval

Standards: the YaCy-internal Dublin Core Metadata Format fits very well into the RSS format for search result data in the Opensearch standard. If wanted, JSON can also be used as export format.

How to get Opensearch/JSON Search Results:
- do a normal web search in YaCy
- replace the 'html' extension of the result page URL with 'rss'
- for JSON, replace the 'html' extension with 'json'

    > curl "http://localhost:8080/yacysearch.rss?query=foaf&maximumRecords=10"
    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type='text/xsl' href='/yacysearch.xsl' version='1.0'?>
    <rss version="2.0" xmlns:yacy="http://www.yacy.net/" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
    <!-- very short example -->
    <item>
      <title>Friend of a Friend (FOAF) project</title>
      <link>http://www.foaf-project.org/</link>
      <pubDate>Fri, 23 May 2008 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title>FOAF - Wikipedia</title>
      <link>http://de.wikipedia.org/wiki/FOAF</link>
      <pubDate>Tue, 08 Jan 2008 01:00:00 +0100</pubDate>
    </item>
    <item>
      <link>http://microformats.org/wiki/xfn-to-foaf</link>
      <pubDate>Fri, 09 May 2008 02:00:00 +0200</pubDate>
    </item>
    </rss>

SRU Standard for Queries: http://www.loc.gov/standards/sru/specs/search-retrieve.html

Opensearch Standard: http://www.opensearch.org
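A client for this interface might look like the Python sketch below. It uses only the `query` and `maximumRecords` parameters shown in the curl example; the helper names `search_url` and `parse_items`, the shortened `SAMPLE` feed and its `<channel>` wrapper are assumptions for illustration. The sample is parsed from a string so that the sketch runs without a YaCy peer.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

def search_url(base, query, fmt="rss", maximum_records=10):
    """Build a result URL; `fmt` replaces the 'html' extension of yacysearch.html."""
    params = urlencode({"query": query, "maximumRecords": maximum_records})
    return f"{base}/yacysearch.{fmt}?{params}"

def parse_items(rss_text):
    """Return (title, link) pairs; a missing <title> becomes None."""
    root = ET.fromstring(rss_text)
    return [(item.findtext("title"), item.findtext("link")) for item in root.iter("item")]

# Shortened sample feed modelled on the example above (the <channel> wrapper is assumed)
SAMPLE = """<rss version="2.0" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
  <channel>
    <item>
      <title>Friend of a Friend (FOAF) project</title>
      <link>http://www.foaf-project.org/</link>
    </item>
    <item>
      <link>http://microformats.org/wiki/xfn-to-foaf</link>
    </item>
  </channel>
</rss>"""

items = parse_items(SAMPLE)
```

Against a running peer, the text returned for `search_url("http://localhost:8080", "foaf")` could be passed to `parse_items` in the same way; passing `fmt="json"` yields the JSON variant of the URL.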

Page 21


External Index Feeding

Standards: YaCy can import standard Dublin Core Metadata XML files as input for indexing.

How to import Dublin Core Files: just place the XML files into a hand-over directory at DATA/SURROGATES/in/

    <?xml version="1.0" encoding="utf-8"?>
    <!-- YaCy surrogate using dublin core notion -->
    <surrogates xmlns:dc="http://purl.org/dc/elements/1.1/">
      <record>
        <dc:title><![CDATA[Alan Smithee]]></dc:title>
        <dc:identifier>http://de.wikipedia.org/wiki/Alan_Smithee</dc:identifier>
        <dc:description>
          <![CDATA['''Alan Smithee''' ist ein Anagramm von „The Alias Men“.]]>
        </dc:description>
        <dc:language>de</dc:language>
        <dc:date>2009-04-14T00:00:00Z</dc:date> <!-- date is in ISO 8601 -->
      </record>
    </surrogates>

The Dublin Core XML File Standard: http://dublincore.org/documents/dc-xml-guidelines/
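An external feeder could generate such a surrogate file programmatically. The Python sketch below builds the same element structure with the standard library; the function name `build_surrogates` and the chosen field list are assumptions, and `ElementTree` escapes text instead of wrapping it in CDATA sections, which is equivalent XML.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)   # serialise the namespace with the "dc" prefix

def build_surrogates(records):
    """Build a <surrogates> document from dicts keyed by Dublin Core field names."""
    root = ET.Element("surrogates")
    for fields in records:
        record = ET.SubElement(root, "record")
        for name in ("title", "identifier", "description", "language", "date"):
            if name in fields:
                ET.SubElement(record, f"{{{DC_NS}}}{name}").text = fields[name]
    return ET.tostring(root, encoding="unicode")

xml_text = build_surrogates([{
    "title": "Alan Smithee",
    "identifier": "http://de.wikipedia.org/wiki/Alan_Smithee",
    "language": "de",
    "date": "2009-04-14T00:00:00Z",   # ISO 8601, as in the example above
}])
```

Writing `xml_text` to a file below DATA/SURROGATES/in/ would then hand the records over for import; the file name is left to the caller.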

Page 22


Installation

License: GPL, Free Software

• Download from http://yacy.net

YaCy for Windows, YaCy for Mac, YaCy for Debian, YaCy for Linux / generic (tar.gz). There are simple installers for Windows and Mac and a Debian release, but it is easy to just install the generic release because it contains everything that is needed.

• Just extract the package, then start the start script

• Administration using the web interface

YaCy is a web application. The administration can be done completely using the built-in web interface with your web browser. Just open http://localhost:8080. The main configuration is done when you select your use case (Distributed P2P Web Search, Portal Search, Intranet Search) after just two clicks.

• Support

We have a web forum: http://forum.yacy.de. Some information can be found at the wiki: http://wiki.yacy.de ...or contact me: [email protected]

Page 23


what you can do

• learn about search engine technology and teach other people

• create your own search portal

• be creative! -- we listen to your ideas

• help -- make a translation of the administration interface!