Thumbnail Summarization Techniques For Web Archives
Transcript of Thumbnail Summarization Techniques For Web Archives
Ahmed AlSum*
Stanford University Libraries
Stanford CA, [email protected]
Michael L. Nelson
Old Dominion University
Norfolk VA, [email protected]
The 36th European Conference on Information Retrieval. ECIR 2014, Amsterdam, Netherlands, 2014
*Ahmed AlSum did this work while he was a PhD student at Old Dominion University
ECIR 2014 Amsterdam, Netherlands
What is a Web Archive?
http://www.cs.odu.edu
Thumbnails in Web Archive
[Figure panels: Internet Archive, UK Web Archive]
Memento Terminology
• Original Resource (URI-R, R): http://www.amazon.com
• Memento (URI-M, M): http://web.archive.org/web/20110411070244/http://amazon.com
• TimeMap (URI-T, TM)
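The example URI-M on this slide embeds both the capture time and the original URI-R. A minimal sketch of pulling the two apart, assuming the Wayback-style layout `<archive>/web/<14-digit timestamp>/<URI-R>`; the function name `split_uri_m` and the regular expression are illustrative, not part of the Memento specification:

```python
import re
from datetime import datetime, timezone

# Assumed layout of a Wayback-style URI-M:
#   <archive prefix>/web/<YYYYMMDDhhmmss>/<URI-R>
URI_M_PATTERN = re.compile(r"^https?://[^/]+/web/(?P<ts>\d{14})/(?P<uri_r>.+)$")


def split_uri_m(uri_m):
    """Return (capture datetime in UTC, URI-R) for a Wayback-style URI-M."""
    match = URI_M_PATTERN.match(uri_m)
    if match is None:
        raise ValueError("not a Wayback-style URI-M: %r" % uri_m)
    captured = datetime.strptime(match.group("ts"), "%Y%m%d%H%M%S")
    return captured.replace(tzinfo=timezone.utc), match.group("uri_r")


captured, uri_r = split_uri_m(
    "http://web.archive.org/web/20110411070244/http://amazon.com"
)
print(captured.isoformat(), uri_r)
```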
Thumbnails Creation Challenges
• Scalability in Time
  • IA may need 361 years to create a thumbnail for each memento using one hundred machines.
• Scalability in Space
  • IA will need 355 TB to store one thumbnail per memento.
• Page quality
Thumbnails Usage Challenges
• This is a partial view of the first 700 thumbnails out of 10,500 available mementos for www.apple.com
From 10,500 Mementos to 69 Thumbnails.
How many thumbnails do we need?
www.unfi.com on the live Web
40 Thumbnails are good.
METHODOLOGY
Visual Similarity and Text Similarity
[Figure labels: Similar, Different, HTML Text]
Correlation between Visual Similarity and Text Similarity
• Text Similarity
  • SimHash
  • DOM Tree
  • Embedded resources
  • Memento Datetime (Capture time)
• Visual Similarity
Text Similarity
SimHash
• Computes 64-bit SimHash fingerprints with k = 4 for two pages
  • Full HTML text ✔
  • The main content from the web page
  • All the text
  • Templates including the text
  • The template excluding the text
• Calculate the differences using Hamming Distance
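The fingerprint-and-compare step can be sketched as follows. This is a minimal illustration, not the authors' implementation: whitespace tokenization, the MD5-derived 64-bit token hash, and the function names are assumptions, and the slide does not say what k = 4 refers to, so the sketch does not use it.

```python
import hashlib


def simhash64(text):
    """64-bit SimHash of whitespace-separated tokens (tokenization is an assumption)."""
    counts = [0] * 64
    for token in text.split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")  # first 64 bits of the digest
        for bit in range(64):
            counts[bit] += 1 if (token_hash >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(64):
        if counts[bit] > 0:  # a majority of tokens set this bit
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


page_1 = "<html><body><h1>Store</h1><p>Welcome back</p></body></html>"
page_2 = "<html><body><h1>Store</h1><p>Welcome again</p></body></html>"
print(hamming_distance(simhash64(page_1), simhash64(page_2)))
```

A small Hamming distance suggests the two HTML texts are near-duplicates; identical inputs always give a distance of 0.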
Text Similarity
DOM Tree
• Transform each web page into a DOM tree
• Calculate the difference using Levenshtein Distance
  • Levenshtein distance: the minimum number of insert, update, and delete operations needed to turn one into the other.
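A rough sketch of the idea using only the standard library. It simplifies the slide's method: instead of comparing full DOM trees, it flattens each page to its sequence of start tags and applies the classic Levenshtein distance to those sequences. The flattening and the names `tag_sequence` and `levenshtein` are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects start-tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def levenshtein(a, b):
    """Minimum number of inserts, updates, and deletes turning sequence a into b."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # update x to y (free if equal)
            ))
        previous = current
    return previous[-1]


old = tag_sequence("<html><body><p>a</p></body></html>")
new = tag_sequence("<html><body><div></div><p>a</p></body></html>")
print(levenshtein(old, new))
```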
Text Similarity
Embedded resources
• Extract the embedded resources for each page
• Calculate the total number of new resources that have been added and the resources that have been removed.
• For example, the difference between M1 and M2:
  • Addition of 5 resources (2 JavaScript files and 3 images)
  • Removal of 2 resources (1 JavaScript file and 1 image).
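The added/removed count above amounts to two set differences. A minimal sketch, assuming each memento's embedded resources are already available as a list of URIs; the function name and the example URIs in the usage note are hypothetical:

```python
def resource_difference(resources_m1, resources_m2):
    """Return (added, removed) resource counts going from M1 to M2."""
    old, new = set(resources_m1), set(resources_m2)
    added = new - old      # present in M2 only
    removed = old - new    # present in M1 only
    return len(added), len(removed)


# Hypothetical resource lists reproducing the slide's example counts.
m1 = ["/css/site.css", "/js/old.js", "/img/old.png"]
m2 = ["/css/site.css", "/js/a.js", "/js/b.js",
      "/img/1.png", "/img/2.png", "/img/3.png"]
added, removed = resource_difference(m1, m2)
print(added, removed, added + removed)
```

With these lists, M2 adds 5 resources (2 JavaScript files, 3 images) and removes 2 (1 JavaScript file, 1 image), for a total difference of 7.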
Text Similarity
Memento datetime
• Calculate the difference between the record capture time for both pages in seconds.
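A small sketch of this feature, assuming the capture times are available as `Memento-Datetime` header values in the usual HTTP date format; the function name and the two sample dates are made up for illustration:

```python
from email.utils import parsedate_to_datetime


def datetime_difference_seconds(memento_datetime_1, memento_datetime_2):
    """Absolute difference between two HTTP-date capture times, in seconds."""
    first = parsedate_to_datetime(memento_datetime_1)
    second = parsedate_to_datetime(memento_datetime_2)
    return abs((second - first).total_seconds())


print(datetime_difference_seconds(
    "Mon, 11 Apr 2011 07:02:44 GMT",
    "Tue, 12 Apr 2011 07:02:44 GMT",
))
```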
Visual Similarity
• Measurement: the number of different pixels between two thumbnails
• To compare two thumbnails,
  • Resize them into different dimensions: 64x64, 128x128, 256x256, and 600x600.
  • Calculate the Manhattan distance and Zero distance between each pair
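The two distances can be sketched on plain pixel lists. This assumes both thumbnails have already been resized to the same dimensions and flattened to grayscale values (the resizing itself needs an image library and is not shown), and it reads "Zero distance" as the count of pixel positions that differ, which is an interpretation of the slide rather than a definition it gives.

```python
def manhattan_distance(pixels_a, pixels_b):
    """Sum of absolute per-pixel differences."""
    if len(pixels_a) != len(pixels_b):
        raise ValueError("thumbnails must have the same number of pixels")
    return sum(abs(a - b) for a, b in zip(pixels_a, pixels_b))


def zero_distance(pixels_a, pixels_b):
    """Number of pixel positions whose values differ (assumed meaning)."""
    if len(pixels_a) != len(pixels_b):
        raise ValueError("thumbnails must have the same number of pixels")
    return sum(1 for a, b in zip(pixels_a, pixels_b) if a != b)


# Two tiny made-up 2x2 grayscale thumbnails, flattened row by row.
thumb_1 = [0, 10, 255, 7]
thumb_2 = [0, 12, 250, 7]
print(manhattan_distance(thumb_1, thumb_2), zero_distance(thumb_1, thumb_2))
```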
Correlation between Visual Similarity and Text Similarity
[Figure panels: SimHash, DOM tree, Embedded resources, Memento Datetime]
SimHash [Charikar 2002], DOM tree [Pawlik 2011], Memento Datetime [Van de Sompel 2013]
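The slides do not say which correlation coefficient was used. As one possible way to relate a text-similarity feature to the visual distance, the sketch below computes the Pearson coefficient over paired measurements; the choice of Pearson and the sample numbers are assumptions, not results from the talk.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return covariance / math.sqrt(var_x * var_y)


# Made-up paired measurements for consecutive memento pairs.
simhash_distances = [1, 3, 8, 20, 25]
pixel_distances = [40, 90, 300, 900, 1100]
print(round(pearson(simhash_distances, pixel_distances), 3))
```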
SELECTION ALGORITHMS
Threshold Grouping
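The transcript preserves only the slide title here, not the algorithm. One plausible reading, given as an assumption rather than the authors' exact procedure, is a single pass over the TimeMap in time order: a memento starts a new group, and gets a thumbnail, only when its SimHash fingerprint is more than a threshold away from the last selected memento. The function name and the threshold value are hypothetical.

```python
def threshold_grouping(fingerprints, threshold):
    """Indices of mementos selected as group representatives (assumed procedure).

    fingerprints: SimHash fingerprints of the mementos, in capture-time order.
    threshold: maximum Hamming distance still counted as "same group".
    """
    selected = []
    last_fingerprint = None
    for index, fingerprint in enumerate(fingerprints):
        if last_fingerprint is None:
            is_new_group = True
        else:
            distance = bin(fingerprint ^ last_fingerprint).count("1")
            is_new_group = distance > threshold
        if is_new_group:
            selected.append(index)
            last_fingerprint = fingerprint
    return selected


# Made-up 4-bit fingerprints standing in for 64-bit ones.
timemap = [0b0000, 0b0001, 0b1111, 0b1110, 0b0000]
print(threshold_grouping(timemap, threshold=2))
```

Because each memento is compared only with the most recent representative, this reading can process new captures as they arrive.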
Clustering technique
• Input:
  • TimeMap with n mementos
  • A set of features.
    • For example, F = {SimHash, Memento-Datetime}
• Task:
  • Cluster n mementos in K clusters.
Clustering technique
[Figure panels: SimHash Feature; SimHash and Datetime Features]
Park, H.-S., & Jun, C.-H. (2009). A simple and fast algorithm for K-medoids clustering. Expert Systems with Applications, 36(2, Part 2), 3336–3341.
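A compact sketch of K-medoids clustering in the alternating assign/update style of the cited paper. It is simplified: the initial medoids are just the first K items rather than the paper's initialization, and using each cluster's medoid as the thumbnail to keep is an assumption. The `distance` argument is any pairwise function, for example a Hamming distance on SimHash fingerprints.

```python
def k_medoids(points, k, distance, max_iter=100):
    """Return {medoid index: [member indices]} for k clusters.

    Simplification: the first k points are the initial medoids.
    """
    medoids = list(range(k))
    clusters = {}
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, point in enumerate(points):
            nearest = min(medoids, key=lambda m: distance(point, points[m]))
            clusters[nearest].append(i)
        # Update step: the new medoid minimizes total distance within its cluster.
        new_medoids = []
        for medoid, members in clusters.items():
            if not members:  # keep a medoid that attracted no points
                new_medoids.append(medoid)
                continue
            new_medoids.append(min(
                members,
                key=lambda c: sum(distance(points[c], points[o]) for o in members),
            ))
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return clusters


# Made-up one-dimensional feature values for six mementos.
values = [1, 2, 3, 100, 101, 102]
print(k_medoids(values, 2, lambda a, b: abs(a - b)))
```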
Time Normalization
Selection Algorithms Comparison
                         Threshold Grouping   K clustering   Time Normalization
TimeMap Reduction        27%                  9% to 12%      23%
Image Loss               28                   78 - 101       109
# Features               1 feature            1 or more      1 feature
Preprocessing required   Yes                  Yes            No
Efficient processing     Medium               Extensive      Light
Incremental              Yes                  No             Yes
Online/offline           Both                 Both           Both
Generalization outside the Web Archive
• Get k thumbnails from a website that has n pages
Conclusions
• We explored the similarity between the text and the visual appearance of the web page.
  • We found that SimHash and Levenshtein distance have the highest correlation with visual similarity.
• We presented three algorithms to select k thumbnails from n mementos per TimeMap.
[email protected] @aalsum