Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.
![Page 1: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/1.jpg)
Next generation search engines
Paolo Ferragina, Dipartimento di Informatica, Pisa
![Page 2: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/2.jpg)
Our journey today!
Web search engines
XML search engines
Basic research on data compression, indexing and mining
![Page 3: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/3.jpg)
Much interest...
More than 85% of users arrive at a site from a search engine (SE)
• Web searches: 45% Google, 29% Yahoo, 13% MSN, 5% ASK,...
• Toolbar searches: 49.6% Google, 46.1% Yahoo,...
SE impact on: Web structure, knowledge and understanding, social behavior... and marketing !!
33% of users believe that “the results of a query are the best place to buy things” !!
Ads (4B$ in USA, 2B€ in Europe, 180M€ in Italy)
• Paid search: 65% Google, 25% Yahoo, 8% MSN,...
• Portal search: 15% Yahoo, 10% MSN, 7% AOL-Google,...
![Page 4: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/4.jpg)
Goal of a Search Engine
Retrieve the docs that are “relevant” for the user query
• Doc: Word or PDF file, web page, email, blog, e-book,...
• Query: “bag of words” paradigm
• Relevant ?!?
...We face many difficulties, especially on the Web!!!
![Page 5: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/5.jpg)
The Web is huge: 8 billion pages [Google]
We need to “rank” the results !!
![Page 6: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/6.jpg)
Web is heterogeneous
Languages/Encodings
• Hundreds of languages: 55 (Jul01)
• Home pages:
• In 1997: English 82%, the next 15 take 13%
• In 2001: English 53%, the next 9 take 30%
Distributed authorship
• Millions of people creating pages with their own style…
• Not all have the purest motives in providing high-quality information: commercial motives drive “spamming”.
Extracting “significant data” is difficult !!
![Page 7: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/7.jpg)
Web is highly dynamic [154 sites, 2004]
[Figure: plot, normalized wrt first week]
A “good” coverage of the indexed Web is difficult !!
![Page 8: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/8.jpg)
User Queries are “difficult”
Query composition:
• Short: 2.54 terms on average (2001); 80% have fewer than 3 terms
• Imprecise terms
• 78% of the queries are not modified
Query results:
• Users are lazy: 85% look at just one page of results
![Page 9: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/9.jpg)
User Needs are “varied”
Informational – want to learn about something (~40%), e.g. Asthma
Navigational – want to go to a page (~25%), e.g. Alitalia
Transactional – want to do something (~35%)
• Access a service, e.g. NY weather
• Downloads, e.g. Mars surface images
• Shop, e.g. Nikon CoolPix
![Page 10: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/10.jpg)
Evolution of Search Engines
First generation -- use only on-page, web-text data
• Word frequency and language
• 1995-1997: AltaVista, Excite, Lycos, etc.
Second generation -- use off-page, web-graph data
• Link (or connectivity) analysis
• Anchor-text (how people refer to a page)
• 1998: Google, now everyone
Third generation -- answer “the need behind the query”
• Focus on “user need”, rather than on query
• Integrate multiple data-sources
• Click-through data
• Query mining
• No winner yet !! Various players: Google, Yahoo, MSN, Ask,…
Fourth generation -- Information Supply [Andrei Broder, VP emerging search tech, Yahoo! Research]
![Page 11: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/11.jpg)
Yesterday.....
...Today
![Page 12: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/12.jpg)
Yesterday...
...Today
All these tools are built upon a Search Engine
![Page 13: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/13.jpg)
Structure of (Web) Search Engines
[Figure: architecture diagram with the components Web, Crawler, Page archive, Page Analyzer, Control, Indexer, Indexing data structures, Query, Query resolver, Ranker]
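To make the roles of some of these components concrete, here is a minimal Python sketch of the indexing and query side only (no crawler, no page archive). Everything in it is an illustrative assumption: the function names, the three toy pages, the whitespace tokenizer standing in for the page analyzer, and the term-frequency score standing in for a real ranker.

```python
from collections import Counter, defaultdict

def analyze(text):
    """Toy page analyzer: lowercase and split on whitespace."""
    return text.lower().split()

def build_index(pages):
    """Toy indexer: inverted index mapping term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in pages.items():
        for term, tf in Counter(analyze(text)).items():
            index[term][doc_id] = tf
    return index

def resolve(index, query):
    """Toy query resolver: documents that contain every query term."""
    postings = [set(index.get(term, {})) for term in analyze(query)]
    return set.intersection(*postings) if postings else set()

def rank_results(index, query, docs):
    """Toy ranker: sort by summed term frequency, ties broken by doc id."""
    terms = analyze(query)
    def score(doc_id):
        return sum(index[t][doc_id] for t in terms)
    return sorted(docs, key=lambda d: (-score(d), d))

def search(index, query):
    return rank_results(index, query, resolve(index, query))

# Hypothetical page archive, for illustration only.
PAGES = {
    "p1": "pisa tower pisa",
    "p2": "tower of london",
    "p3": "pisa university",
}
INDEX = build_index(PAGES)
print(search(INDEX, "pisa"))        # ['p1', 'p3']
print(search(INDEX, "pisa tower"))  # ['p1']
```

A real engine would replace the score with signals such as link analysis and anchor text, as described for the second generation of search engines.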
![Page 14: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/14.jpg)
Size of search engines [2005]
Google vs Yahoo: 20-30% sharing of results
![Page 15: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/15.jpg)
Ranking: Google vs Yahoo!
![Page 16: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/16.jpg)
Ranking: Google.com - Google.cn
![Page 17: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/17.jpg)
Not only Web Searches...
• Clustering engines: Vivisimo, SnakeT,...
• Suggestions
• Products
• Local searches
• News, Blogs, ....
![Page 18: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/18.jpg)
Web search and mining
We +
![Page 19: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/19.jpg)
Yahoo! World
Search: Yahoo! Image, Yahoo! Video, Yahoo! Local, Yahoo! News, Yahoo! Shopping Search
Communication: Yahoo! Mail, Yahoo! Messenger, My Web, Yahoo! Personals, Yahoo! 360º, Yahoo! Photos, Flickr, delicious, ..., Yahoo! Answers
Content: Yahoo! Sports, Yahoo! Finance, Yahoo! Music, Yahoo! Movies, Yahoo! News, Yahoo! Games, My Yahoo!
Mobile: Yahoo! Mobile, Yahoo! Go
Commerce: Yahoo! Shopping, Yahoo! Autos, Yahoo! Auctions, Yahoo! Travel
Small Business: Yahoo! Small Business, Yahoo! Domains, Yahoo! Web Hosting, Yahoo! Merchant Solutions, Yahoo! Business Email, HotJobs
Advertising: Yahoo! Search Marketing, Yahoo! Publisher Network
[source: R. Baeza-Yates]
![Page 20: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/20.jpg)
Yahoo! numbers [April, ‘06]
15 languages, 20 countries, 6B users
Each day:
• 1 million new accounts
• 3.4 billion page views
• 10 Tb of data processed (total, 20Pb)
• 2 billion Mail+Messenger sent
[source: R. Baeza-Yates]
![Page 21: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/21.jpg)
Yahoo! Research Barcelona
Starting date: May 2006, Barcelona
Director: Ricardo Baeza-Yates
Areas: Web Mining and Web Search
People: more than 10 and… fast growing !!
Why me ? First academic grant in Europe
Three-year project on “Data compression and indexing on hierarchical memories”
[source: R. Baeza-Yates]
![Page 22: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/22.jpg)
Data to be mined or searched
Crawled data (large, heterogeneous, …)
• Web Pages & Links
• Blogs
• Items for sale: Shopping, Travel, etc.
• RSS Feeds
Produced data (high quality, sparse,…)
• Yahoo’s Web: YCars, YHealth, Ytravel,…
• Edited news, purchased news,…
Direct interaction (quality??)
• Social links
• Tagged content
[source: R. Baeza-Yates]
![Page 23: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/23.jpg)
What is Flickr ?
![Page 24: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/24.jpg)
![Page 25: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/25.jpg)
![Page 26: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/26.jpg)
The wisdom of the crowd can be used to improve the search and extraction process
![Page 27: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/27.jpg)
Observed data
Query logs: spelling, synonyms, phrases (named entities), substitutions,…
Clicks: relevance, intent, …
“There is a new type of economics that has emerged and that the world doesn't understand,”
“Web usage data is an amazing leading indicator because it tells you where intent is heading”
U. Fayyad, Yahoo Chief Data Officer
![Page 28: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/28.jpg)
Our future goals…
Deploy user actions, e.g. queries + clicks + …
Implicit semantic information
It's free and unbiased
Large volume
… the Semantic Web Hypothesis - Explicit Semantic Information
Obstacle - Us
Possible uses:
• Query suggestion
• Query disambiguation
• Adv suggestions
• Web-site design
...
![Page 29: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/29.jpg)
XML search and mining
We +
![Page 30: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/30.jpg)
An XML excerpt
<dblp>
  <book>
    <author> Donald E. Knuth </author>
    <title> The TeXbook </title>
    <publisher> Addison-Wesley </publisher>
    <year> 1986 </year>
  </book>
  <article>
    <author> Donald E. Knuth </author>
    <author> Ronald W. Moore </author>
    <title> An Analysis of Alpha-Beta Pruning </title>
    <pages> 293-326 </pages>
    <year> 1975 </year>
    <volume> 6 </volume>
    <journal> Artificial Intelligence </journal>
  </article>
  ...
</dblp>
![Page 31: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/31.jpg)
The literature on XML indexing...
Various tools are available
• TreSy [Cribecu, 1997]
• eXist [TU Darmstadt, 2002]
• GalaTex [AT&T, 2004]
Some of their limitations
• Run on a single machine
• Use a lot of computational resources (time, space,…)
• Limit the indexable XML document structure
XML document types
• data centric [relational data: DB exports]
• text centric [literary texts, reports, emails, news, …]
![Page 32: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/32.jpg)
Our proposal: Tauro
Application Level
Query interface
• XML based
Query solver
• analysis + optimization
Result retriever
• indexing data structure
Data Collection manager
• data compression
• snippet extraction
![Page 33: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/33.jpg)
The first scenario: Client-Server
Context of use : Biblio search,...
![Page 34: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/34.jpg)
The second scenario: Peer-to-Peer
Context of use: Collaborative search
![Page 35: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/35.jpg)
Our goal...
Exploit the power of the crowd
• The largest library of XML tagged text collections
…and the power of search engines
• A suite of search + text mining tools
• Syntactic text comparison
• Motif extraction for text pattern identification
• Concept identification via LSI
![Page 36: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/36.jpg)
Already loaded, you will find rare texts in editions and translations from the 15th and 16th centuries
![Page 37: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/37.jpg)
My documents 5
![Page 38: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/38.jpg)
You may compose sophisticated queries
![Page 39: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/39.jpg)
![Page 40: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/40.jpg)
![Page 41: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/41.jpg)
![Page 42: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/42.jpg)
![Page 43: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/43.jpg)
![Page 44: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/44.jpg)
You can visually compose sophisticated structural queries
![Page 45: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/45.jpg)
Stay in touch...
http://signum.sns.it
Everything at the fingertips of humanists
Nokia 770, Origami (Microsoft), SmartPhones, …
![Page 46: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/46.jpg)
Basic research
Recurrent themes of this talk
• Large volume of data
• Efficient search
• Hierarchical memory systems: L1-L2 caches, RAM, (Multi-)Disks, (Web) Network, …
Basic algorithmic tools
• Indexing data structures
• Data compression
Do we face a paradoxical situation ?
![Page 47: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/47.jpg)
Six years ago... [now, J. ACM 05]
Opportunistic Data Structures with Applications
P. Ferragina, G. Manzini
Survey by Navarro-Makinen cites more than 50 papers on the subject !!
![Page 48: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/48.jpg)
[December 2003] [January 2005]
![Page 49: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/49.jpg)
Joint effort with Navarro’s group at Univ. Chile
Some figures over hundreds of MBs of data:
• Count(P) takes a few milliseconds
• Locate(P) takes a few milliseconds for each occurrence of P
• Space is about [bzip ~ 20%]:
• 22% (supports just Count ops)
• 35% (Count, Locate ops)
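As a rough illustration of what Count(P) and Locate(P) compute, the following Python sketch runs backward search over the Burrows-Wheeler transform of a text. It is only a sketch under stated assumptions: the function names are hypothetical, the suffix array is built naively and stored in full, and rank is computed by scanning, so it shows the logic of the two operations but none of the space or time figures above. Positions returned by `locate` are 0-based.

```python
def build_fm_index(text):
    """Naive construction: suffix array, BWT, and the C array of a text."""
    text += "\0"  # sentinel, smaller than every other character
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # char preceding each sorted suffix
    C, total = {}, 0  # C[c] = number of characters in text smaller than c
    for c in sorted(set(text)):
        C[c] = total
        total += text.count(c)
    return sa, bwt, C

def backward_search(bwt, C, pattern):
    """Half-open range [lo, hi) of suffix-array rows prefixed by pattern."""
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):
        if c not in C:
            return 0, 0
        lo = C[c] + bwt[:lo].count(c)  # rank of c in bwt[0:lo], by scanning
        hi = C[c] + bwt[:hi].count(c)
        if lo >= hi:
            return 0, 0
    return lo, hi

def count(bwt, C, pattern):
    lo, hi = backward_search(bwt, C, pattern)
    return hi - lo

def locate(sa, bwt, C, pattern):
    lo, hi = backward_search(bwt, C, pattern)
    return sorted(sa[lo:hi])

sa, bwt, C = build_fm_index("abracadabra")
print(count(bwt, C, "abra"))       # 2
print(locate(sa, bwt, C, "abra"))  # [0, 7]
```

In a compressed index the scan-based rank and the full suffix array would be replaced by succinct structures; that replacement is where the space savings come from.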
![Page 50: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/50.jpg)
Compressed index for XML [Ferragina et al, WWW ’06]
Query (counting) time 8 ms, Navigation time 3 ms
[Figure: bar chart, axis from 0% to 60%, over the datasets DBLP, Pathways, News; series: Huffword, XPress, XQzip, XBzipIndex, XBzip]
UniPi is patenting it !!
![Page 51: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/51.jpg)
Next generation search engines
Paolo Ferragina
University of Pisa
Thanks !!
![Page 52: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/52.jpg)
An XML excerpt
<dblp>
  <book>
    <author> Donald E. Knuth </author>
    <title> The TeXbook </title>
    <publisher> Addison-Wesley </publisher>
    <year> 1986 </year>
  </book>
  <article>
    <author> Donald E. Knuth </author>
    <author> Ronald W. Moore </author>
    <title> An Analysis of Alpha-Beta Pruning </title>
    <pages> 293-326 </pages>
    <year> 1975 </year>
    <volume> 6 </volume>
    <journal> Artificial Intelligence </journal>
  </article>
  ...
</dblp>
It is verbose !
![Page 53: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/53.jpg)
A tree interpretation...
XML document exploration: Tree navigation
XML document search: Labeled subpath searches
Subset of XPath [W3C]
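The tree view can be tried directly on the DBLP excerpt with Python's standard `xml.etree.ElementTree`, which supports a limited subset of XPath. This is only an illustration of tree navigation and labeled subpath search on an uncompressed, fully parsed document; it is not the compressed index discussed in this talk, and the variable names are assumptions.

```python
import xml.etree.ElementTree as ET

# The excerpt from the slides, without the elided part.
DOC = """
<dblp>
  <book>
    <author> Donald E. Knuth </author>
    <title> The TeXbook </title>
    <publisher> Addison-Wesley </publisher>
    <year> 1986 </year>
  </book>
  <article>
    <author> Donald E. Knuth </author>
    <author> Ronald W. Moore </author>
    <title> An Analysis of Alpha-Beta Pruning </title>
    <pages> 293-326 </pages>
    <year> 1975 </year>
    <volume> 6 </volume>
    <journal> Artificial Intelligence </journal>
  </article>
</dblp>
"""

root = ET.fromstring(DOC)

# Tree navigation: labels of the children of the root.
child_tags = [child.tag for child in root]

# Labeled subpath search: every <author> under a <book>.
book_authors = [a.text.strip() for a in root.findall("./book/author")]

# Descendant search: every <author> anywhere in the tree.
all_authors = [a.text.strip() for a in root.findall(".//author")]

# Subpath plus content search: titles of articles whose year is 1975.
titles_1975 = [
    art.findtext("title").strip()
    for art in root.findall("./article")
    if art.findtext("year").strip() == "1975"
]

print(child_tags)    # ['book', 'article']
print(book_authors)  # ['Donald E. Knuth']
print(titles_1975)   # ['An Analysis of Alpha-Beta Pruning']
```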
![Page 54: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/54.jpg)
The Problem
We wish to devise a compressed representation for a labeled tree T that efficiently supports some operations:
• Navigational operations
• Subpath and content searches
• Visualization operation
Summary indexes (like Dataguide, 1-index or 2-index): large space, and do not support “content” searches
XML-aware compressors (like XMill, XmlPpm, ScmPpm,...): need the whole decompression
XML-queriable compressors (like XPress, XGrind, XQzip,...): poor compression and scan of the whole (compressed) file
XML-native search engines might exploit this tool as a core block for query optimization and (compressed) storage
![Page 55: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/55.jpg)
A transform for “labeled trees” [Ferragina et al, IEEE Focs ’05]
We propose the XBW-transform, which linearizes a labeled tree T into 2 arrays such that:
• the compression of T reduces to the compression of these two arrays (via e.g. gzip, bzip2, ppm,...)
• the indexing of T reduces to implementing simple rank/select query operations over these two arrays
A = a b a a a c b c d a b e c d ...
Rank(a, 7) = number of a's in A[1,7] = 4
Select(a, 2) = position of the 2nd a = 3
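The two operations can be written down naively in Python to check the numbers on this slide. The sketch uses 1-based positions to match the slide, scans the array linearly, and takes only the visible prefix of A; practical indexes answer the same queries over compressed arrays without scanning.

```python
def rank(A, c, i):
    """Number of occurrences of c in A[1..i] (1-based, inclusive)."""
    return A[:i].count(c)

def select(A, c, k):
    """1-based position of the k-th occurrence of c in A, or None if absent."""
    seen = 0
    for pos, symbol in enumerate(A, start=1):
        if symbol == c:
            seen += 1
            if seen == k:
                return pos
    return None

A = "abaaacbcdabecd"  # the visible prefix of the array on the slide
print(rank(A, "a", 7))    # 4
print(select(A, "a", 2))  # 3
```

The two operations are inverse to each other in the sense that `rank(A, c, select(A, c, k)) == k` whenever the k-th occurrence exists.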
![Page 56: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/56.jpg)
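The XBW-Transform
Step 1. Visit the tree in pre-order. For each node, write down its label and the labels on its upward path.
[Figure: example labeled tree rooted at C, shown next to the two arrays below]
S, node labels (a permutation of the tree nodes): C B D c a c A b a D c B D b a
S, upward labeled paths (one per node, empty for the root): (empty), C, BC, DBC, DBC, BC, C, AC, AC, AC, DAC, C, BC, DBC, BC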
![Page 57: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/57.jpg)
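The XBW-Transform
Step 2. Stably sort according to S (the upward labeled paths).
[Figure: example labeled tree rooted at C, shown next to the two sorted arrays below]
S, node labels after sorting: C b a D D c D a B A B c c a b
S, upward labeled paths after sorting: (empty), AC, AC, AC, BC, BC, BC, BC, C, C, C, DAC, DBC, DBC, DBC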
![Page 58: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/58.jpg)
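The XBW-Transform
Step 3. Add a binary array Slast marking the rows corresponding to last children.
[Figure: example labeled tree rooted at C, shown next to the arrays below]
Slast: 1 0 0 1 0 1 0 1 0 0 1 1 0 1 1
S, node labels: C b a D D c D a B A B c c a b
S, upward labeled paths: (empty), AC, AC, AC, BC, BC, BC, BC, C, C, C, DAC, DBC, DBC, DBC
XBW: the two arrays Slast and S (node labels)
XBW takes optimal space
XBW can be built and inverted in optimal time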
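The three steps can be sketched in a few lines of Python. This is a naive illustration, not the optimal-time construction: it materializes every upward path explicitly and sorts them. The nested-tuple tree encoding and the function names are assumptions, and the example tree is the one implied by the pre-order labels and upward paths on these slides.

```python
def leaf(label):
    return (label, [])

def xbw(tree):
    """Return (s_last, s_alpha) for a tree given as (label, [children])."""
    rows = []  # one row per node: (upward path, label, is last child)

    def visit(node, upward_path, is_last):
        label, children = node
        rows.append((upward_path, label, is_last))      # Step 1: pre-order
        for i, child in enumerate(children):
            # The child's upward path starts at its parent and ends at the root.
            visit(child, (label,) + upward_path, i == len(children) - 1)

    visit(tree, (), True)             # the root counts as a last child
    rows.sort(key=lambda row: row[0])  # Step 2: Python's sort is stable
    s_last = "".join("1" if row[2] else "0" for row in rows)  # Step 3
    s_alpha = "".join(row[1] for row in rows)
    return s_last, s_alpha

# Example tree from the slides: root C with children B, A, B.
TREE = ("C", [
    ("B", [("D", [leaf("c"), leaf("a")]), leaf("c")]),
    ("A", [leaf("b"), leaf("a"), ("D", [leaf("c")])]),
    ("B", [("D", [leaf("b")]), leaf("a")]),
])

s_last, s_alpha = xbw(TREE)
print(s_last)   # 100101010011011
print(s_alpha)  # CbaDDcDaBABccab
```

Because the sort is stable, nodes with the same upward path keep their pre-order (left to right) order, which is what allows the tree to be rebuilt from the two arrays.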
![Page 59: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/59.jpg)
An illustrative example
Pcdata
Tags, Attributes and the symbol =
XBW is compressible:
S and Spcdata are locally homogeneous
Slast has some structure
![Page 60: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/60.jpg)
XBzip = XBW + PPMd [Ferragina et al, WWW ’06]
[Figure: bar chart, axis from 0% to 25%, over the datasets DBLP, Pathways, News; series: gzip, bzip2, ppmdi, xmill + ppmdi, scmppm, XBzip]
![Page 61: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/61.jpg)
A general algorithmic paradigm
Basic approach (…now only for text and labelled trees)
• Transform the input data into a few arrays
• Index (+compress) them to support Rank/Select
A lot of interest around it:
• Theory: Soda ’06 (2), Cpm ’06 (2), Icalp ’06 (2), DCC ’06 (1)
• Experimental: Wea ’06 (2)
You can test it:
http://pizzachili.di.unipi.it or http://pizzachili.dcc.uchile.cl
![Page 62: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/62.jpg)
![Page 63: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/63.jpg)
A general algorithmic paradigm
Basic (magic ?!?) approach
• Transform the input data into a few arrays
• Index (+compress) them to support Rank/Select ops
A lot of interest around it:
• Theory: Soda ’06 (2), Cpm ’06 (2), Icalp ’06 (2), DCC ’06 (1)
• Experimental: Wea ’06 (2)
A = a b a a a c b c d a b e c d ...
Rank(a, 7) = number of a's in A[1,7] = 4
Select(a, 2) = position of the 2nd a = 3
![Page 64: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/64.jpg)
![Page 65: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/65.jpg)
![Page 66: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/66.jpg)
![Page 67: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/67.jpg)
![Page 68: Next generation search engines Paolo Ferragina Dipartimento di Informatica, Pisa.](https://reader035.fdocuments.us/reader035/viewer/2022062318/55168023550346a2698b5d8a/html5/thumbnails/68.jpg)