Summarizing archival collections using storytelling techniques
![Page 1: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/1.jpg)
Summarizing archival collections using storytelling techniques
Yasmin AlNoamany, Michele C. Weigle, Michael L. Nelson
Old Dominion University, Web Science & Digital Libraries Research Group
www.cs.odu.edu/~mln/
@phonedude_mln
Research Funded by IMLS LG-71-15-0077-15
Dodging the Memory Hole Los Angeles, CA, 2016-10-14
![Page 2: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/2.jpg)
2
Archive-It, a subscription-based service, allows creation of collections
> 3,000 collections
> 340 institutions
> 10B archived pages
![Page 3: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/3.jpg)
3
Collection title
Collection categorization based on the
curator
Seed URI
Metadata about the collection
Text search box
The group that the resource belongs to
List of the seed
URIs
Timespan of the resource
and the number of times it has been captured
![Page 4: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/4.jpg)
4
Collection understanding and collection summarization are not currently supported. It is not easy to answer “What’s in that collection?” or “How is this collection different from others?”
![Page 5: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/5.jpg)
5
There is more than one collection about “Egyptian Revolution”
• “2010-2011 Arab Spring” https://archive-it.org/collections/3101
• “North Africa & the Middle East 2011-2013” https://archive-it.org/collections/2349
• “Egypt Revolution and Politics” https://archive-it.org/collections/2358
![Page 6: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/6.jpg)
6
One of at least seven Human Rights collections…
![Page 7: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/7.jpg)
7
![Page 8: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/8.jpg)
8
![Page 9: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/9.jpg)
9
Our early attempts at collection understanding tried to include everything…
“Visualizing digital collections at Archive-It”, JCDL 2012.
http://ws-dl.blogspot.com/2012/08/2012-08-10-ms-thesis-visualizing.html
![Page 10: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/10.jpg)
10
1000s of seeds X 1000s of archived pages == Conventional Vis Methods Not Applicable
![Page 11: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/11.jpg)
11
Idea: Storytelling
![Page 12: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/12.jpg)
12
Stories in literature
Story elements: setting, characters, sequence, exposition, conflict, climax, resolution
Once upon a time
http://www.learner.org/interactives/story/
![Page 13: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/13.jpg)
13
Stories in social media
“It's hard to define a story, but I know it when I see it” (Alexander, 2008)
basically, just arranging web pages in time
![Page 14: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/14.jpg)
14
“Storytelling” is becoming a popular technique in social media
![Page 15: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/15.jpg)
15
What are the limitations of storytelling services?
![Page 16: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/16.jpg)
16
The Egyptian Revolution on Storify
![Page 17: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/17.jpg)
17
Bookmarking, not preserving!
![Page 18: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/18.jpg)
18
Despite these limitations, how do we combine storytelling & archives?
![Page 19: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/19.jpg)
19
Use interface people already know how to use to summarize collections
Archived collections
Storytelling services
Archived enriched stories
![Page 20: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/20.jpg)
20
We sample k mementos from N pages of the collection (k << N) to create a summary story
[Figure: seed URIs (S1, S2, S3, S4) from the Web are archived in Archive-It Collections X, Y, and Z; sampled mementos from a collection form a Story]
![Page 21: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/21.jpg)
21
Yasmin hand-crafted stories to summarize the Egyptian Revolution collection for her son, Yousof
https://storify.com/yasmina_anwar/the-egyptian-revolution-on-archive-it-collection
https://storify.com/yasmina_anwar/the-story-of-the-egyptian-revolution-from-archive-
![Page 22: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/22.jpg)
22
How do we generate this automatically?
![Page 23: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/23.jpg)
23
Collections have two dimensions: {Fixed, Sliding} × {Page, Time}
[Figure: grid of mementos; axes: URI and Time, with capture times t1, t2, t3, t4, t5, t6, …, tk]
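The two dimensions give four story types. As a minimal sketch (the function name and the `(uri, day)` tuple representation are assumptions made for illustration, not part of the talk), a set of selected mementos can be labeled by checking whether the page and the capture day vary:

```python
from typing import Iterable, Tuple


def story_type(mementos: Iterable[Tuple[str, str]]) -> str:
    """Label a selection of (original URI, capture day) pairs.

    "Fixed" means every memento shares the same value on that dimension;
    "Sliding" means the value varies across the selection.
    """
    pairs = list(mementos)
    uris = {uri for uri, _ in pairs}
    days = {day for _, day in pairs}
    page = "Fixed Page" if len(uris) == 1 else "Sliding Page"
    time = "Fixed Time" if len(days) == 1 else "Sliding Time"
    return f"{page}, {time}"


# One page captured on several days.
print(story_type([("bbc.co.uk", "2011-02-01"), ("bbc.co.uk", "2011-02-11")]))
```

Treating a calendar day as the unit of "fixed time" is a simplification chosen for this sketch.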
![Page 24: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/24.jpg)
24
Fixed Page, Fixed Time
A desktop Chrome user-agent: http://www.cnn.com/2014/02/24/world/africa/egypt-politics/index.html?hpt=wo_c2
Android Chrome user-agent: http://www.cnn.com/2014/02/24/world/africa/egypt-politics/index.html?hpt=wo_c2
Schneider and McCown, “First Steps in Archiving the Mobile Web: Automated Discovery of Mobile Websites”, JCDL 2013.
Kelly et al., “A Method for Identifying Personalized Representations in Web Archives”, D-Lib Magazine 2013.
![Page 25: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/25.jpg)
25
Fixed Page, Sliding Time
[Figure: captures of the same page on Feb 1, Feb 1, Feb 2, Feb 4, Feb 5, Feb 7, Feb 9, Feb 11, Feb 11]
![Page 26: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/26.jpg)
26
Sliding Page, Fixed Time
Feb. 11, 2011: Mubarak resigns
![Page 27: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/27.jpg)
27
Sliding Page, Sliding Time
[Figure: captures of different pages on Jan 25, Jan 27, Jan 31, Feb 2, Feb 4, Feb 7, Feb 10, Feb 11, Feb 11]
![Page 28: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/28.jpg)
28
The Dark and Stormy Archives (DSA) framework
Establish a baseline
• Characteristics of human-generated stories
• Characteristics of Archive-It collections
Reduce the candidate pool of archived pages
• Exclude duplicates
• Exclude off-topic pages
• Exclude non-English language pages
Select good representative pages
• Dynamically slice the collection
• Cluster the pages in each slice
• Select high-quality pages from each cluster
• Order pages by time
• Visualize
https://pbs.twimg.com/media/BQcpj7ACMAAHRp4.jpg
![Page 29: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/29.jpg)
29
Establish a baseline of social media stories
“Characteristics of Social Media Stories”, TPDL 2015, IJDL 2016.
![Page 30: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/30.jpg)
30
What is the length of a story (the number of resources per story)?
This story has 31 resources
![Page 31: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/31.jpg)
31
What are the types of resources that compose a story?
Quotes
Video
This story has:
• 19 quotes
• 8 images
• 4 videos
![Page 32: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/32.jpg)
32
What are the most frequently used domains?
Twitter.com
Twitter.com
Twitter.com
This story has:
• 90% twitter.com
• 7% instagram.com
• 3% facebook.com
![Page 33: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/33.jpg)
33
The top 25 domains represent 92% of all domains
![Page 34: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/34.jpg)
34
What differentiates a popular story? (popular = stories with the top 25% of views)
19,795 views 64 views
![Page 35: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/35.jpg)
35
The distributions for the features of the stories
• Based on Kruskal-Wallis test, at the p ≤ 0.05 significance level, the popular and the unpopular stories are different in terms of most of the features
• Compared with unpopular stories, popular stories tend to have:
  • more web elements (medians of 28 vs. 21)
  • a longer timespan (5 hours vs. 2 hours)
![Page 36: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/36.jpg)
36
Do popular stories have a lower decay rate?
The 75th percentile of decay rate per popular story is 10% of the resources, while it is 15% in the unpopular stories
![Page 37: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/37.jpg)
37
We found that 28 mementos is a good number for the resources in the stories.
![Page 38: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/38.jpg)
38
Establish a baseline of current Archive-It collections
"Characteristics of Social Media Stories. What makes a good story?", International Journal on Digital Libraries 2016.
![Page 39: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/39.jpg)
39
The mean and median number of URIs in a collection
This collection has 435 seed URIs
![Page 40: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/40.jpg)
40
The mean and median number of mementos per URI
This seed URI has 16 mementos
![Page 41: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/41.jpg)
41
The most frequently used domains
abcnews.go.com
blogspot.com
This collection has 30% abcnews.com, 10% blogspot.com, 3% facebook.com
![Page 42: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/42.jpg)
42
Archive-It top 25 is fundamentally different than Storify top 25
![Page 43: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/43.jpg)
43
Archive-It top 25 is fundamentally different than Storify top 25
Twitter is #10 not #1
![Page 44: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/44.jpg)
44
What we archive and what we share on social media are different subsets of the web (seeds != shares)
see also: Brunelle, et al., “The impact of JavaScript on archivability”, IJDL 2015
![Page 45: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/45.jpg)
45
Detecting off-topic pages
“Detecting Off-Topic Pages in Web Archives”, TPDL 2015, IJDL 2016.
![Page 46: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/46.jpg)
46
Archive-It provides their partners with tools that allow them to build themed collections
![Page 47: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/47.jpg)
47
Archive-It tools are about HTTP events / mechanics, not “content”
![Page 48: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/48.jpg)
48
These tools won’t detect that > 60% of mementos of hamdeensabahy.com are off-topic
May 13, 2012: The page started as on-topic.
May 24, 2012: Off-topic due to a database error.
Mar. 21, 2013: Not working because of financial problems.
May 21, 2013: On-topic again.
June 5, 2014: The site has been hacked.
Oct. 10, 2014: The domain has expired.
http://wayback.archive-it.org/2358/*/http://hamdeensabahy.com
![Page 49: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/49.jpg)
49
How do we automatically detect off-topic pages?
![Page 50: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/50.jpg)
50
Textual content: cosine similarity, intersection of the most frequent terms, Jaccard similarity

| Method | Similarity |
|---|---|
| cosine | 0.7 |
| TF-Intersection | 0.6 |
| Jaccard | 0.5 |
![Page 51: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/51.jpg)
51
Textual content: cosine similarity, intersection of the most frequent terms, Jaccard similarity

| Method | Similarity |
|---|---|
| cosine | 0.7 |
| TF-Intersection | 0.6 |
| Jaccard | 0.5 |

| Method | Similarity |
|---|---|
| cosine | 0.0 |
| TF-Intersection | 0.0 |
| Jaccard | 0.0 |
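The three textual measures can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the implementation evaluated in the talk: the tokenizer is a simple regular expression, cosine similarity uses raw term frequencies rather than any particular weighting, and the `top_n=20` cutoff for TF-Intersection is a hypothetical default.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between raw term-frequency vectors."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a, b):
    """Size of the shared vocabulary divided by the combined vocabulary."""
    sa, sb = set(tokens(a)), set(tokens(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def tf_intersection(a, b, top_n=20):
    """Share of the first text's most frequent terms also frequent in the second."""
    top_a = {t for t, _ in Counter(tokens(a)).most_common(top_n)}
    top_b = {t for t, _ in Counter(tokens(b)).most_common(top_n)}
    return len(top_a & top_b) / len(top_a) if top_a else 0.0
```

In use, the first memento of a URI would serve as the reference text and each later memento would be compared against it; a page replaced by an error message shares no terms with the reference, so all three measures drop to 0.0.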
![Page 52: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/52.jpg)
52
Semantics of the text: Web-based kernel function using the search engine (SE)
Sahami and Heilman, A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets, WWW 2006
![Page 53: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/53.jpg)
53
Semantics of the text: Web-based kernel function using the search engine (SE)

| Method | Similarity |
|---|---|
| SE-Kernel | 0.7 |
Sahami and Heilman, A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets, WWW 2006
![Page 54: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/54.jpg)
54
Structural methods: no. of words, content-length
Word count: 100 → 109

| Method | % change |
|---|---|
| WordCount | 0.09 |
![Page 55: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/55.jpg)
55
Structural methods: no. of words, content-length
Word count: 100 → 109
Word count: 100 → 5

| Method | % change |
|---|---|
| WordCount | 0.09 |

| Method | % change |
|---|---|
| WordCount | -0.95 |
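The two values on the slide follow from the relative change in word count with respect to the first memento: (109 − 100) / 100 = 0.09 and (5 − 100) / 100 = −0.95. A minimal sketch, assuming whitespace tokenization (the function name is hypothetical):

```python
def word_count_change(first_text, later_text):
    """Relative change in word count of a later memento versus the first one."""
    base = len(first_text.split())
    if base == 0:
        raise ValueError("the reference memento has no words")
    return (len(later_text.split()) - base) / base


reference = "word " * 100
print(word_count_change(reference, "word " * 109))  # slight growth
print(word_count_change(reference, "word " * 5))    # page mostly emptied
```

A large negative change suggests that the page content was replaced by something much shorter, such as an error or parked-domain page.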
![Page 56: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/56.jpg)
56
We built a gold standard data set to evaluate the methods
![Page 57: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/57.jpg)
57
We manually labeled 15,760 mementos

| Collection | URI-Rs | URI-Ms | Off-topic URI-Ms |
|---|---|---|---|
| Egypt Revolution and Politics | 136 | 6,886 | 384 |
| Occupy Movement | 255 | 6,570 | 458 |
| Columbia Univ. Human Rights collection | 198 | 2,304 | 94 |
![Page 58: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/58.jpg)
58
Evaluated 6 methods + combos at 21 thresholds
Averaged the results at each threshold over the three gold standard collections
| Similarity Measure | Threshold | FP | FN | FP+FN | ACC | F1 | AUC |
|---|---|---|---|---|---|---|---|
| (Cosine,WordCount) | (0.10,-0.85) | 24 | 10 | 34 | 0.987 | 0.906 | 0.968 |
| (Cosine,SEKernel) | (0.10,0.00) | 6 | 35 | 40 | 0.990 | 0.901 | 0.934 |
| Cosine | 0.15 | 31 | 22 | 53 | 0.983 | 0.881 | 0.961 |
| (WordCount,SEKernel) | (-0.80,0.00) | 14 | 27 | 42 | 0.985 | 0.818 | 0.885 |
| WordCount | -0.85 | 6 | 44 | 50 | 0.982 | 0.806 | 0.870 |
| SEKernel | 0.05 | 64 | 83 | 147 | 0.965 | 0.683 | 0.865 |
| Bytes | -0.65 | 28 | 133 | 161 | 0.962 | 0.584 | 0.746 |
| Jaccard | 0.05 | 74 | 86 | 159 | 0.962 | 0.538 | 0.809 |
| TF-Intersection | 0.00 | 49 | 104 | 153 | 0.967 | 0.537 | 0.740 |
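To make the table columns concrete, here is a sketch of how a combined (Cosine, WordCount) decision and the FP, FN, ACC, and F1 figures could be computed. Two points are assumptions of this sketch rather than statements from the slides: a memento is flagged when either measure falls below its threshold (an OR combination), and the comparison is strict. "Positive" here means "off-topic".

```python
def is_off_topic(cosine_sim, wc_change, cos_threshold=0.10, wc_threshold=-0.85):
    """Flag a memento when either measure falls below its threshold (assumed OR rule)."""
    return cosine_sim < cos_threshold or wc_change < wc_threshold


def evaluate(predicted, actual):
    """Compare predicted off-topic flags with gold-standard labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"FP": fp, "FN": fn, "ACC": (tp + tn) / len(pairs), "F1": f1}


# Hypothetical (cosine, word-count change) scores with hand-assigned labels.
scores = [(0.7, 0.09), (0.0, -0.95), (0.5, -0.90), (0.05, 0.0)]
gold = [False, True, True, False]
print(evaluate([is_off_topic(c, w) for c, w in scores], gold))
```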
![Page 59: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/59.jpg)
59
Average precision of 0.89 on 18 different Archive-It collections
(Cosine,WordCount) with (0.10,-0.85) thresholds
![Page 60: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/60.jpg)
60
How do we dynamically divide the collections into appropriate slices? (in other words, how do we pick just 28?)
![Page 61: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/61.jpg)
61
We expected most collections to look like this…
The Global Food Crisis collection at Archive-It
![Page 62: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/62.jpg)
62
This is what we found
Egypt Revolution and Politics
Human Rights
April 16 Archive Virginia Tech Shooting
Jasmine Revolution 2011
Wikileaks Document Release
![Page 63: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/63.jpg)
63
Selecting representative pages for generating stories (skipping clustering details, but goal is k=28)
![Page 64: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/64.jpg)
64
Quality metrics for selecting mementos
• In the DSA, memento quality Mq is calculated as follows: Mq = (1 − wm*Dm) + wql*Sql + wqc*Sqc
• Dm is the memento damage (Brunelle, JCDL 2014)
• Sql is the snippet quality based on the URI level
• Sqc is the snippet quality based on URI category
• wm, wql, wqc are the weights of memento damage, level, and category
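The formula translates directly into code. In the sketch below the weight values and the example scores are hypothetical placeholders (the slides do not give them); only the shape of the formula comes from the slide.

```python
def memento_quality(d_m, s_ql, s_qc, w_m=0.4, w_ql=0.3, w_qc=0.3):
    """Mq = (1 - wm*Dm) + wql*Sql + wqc*Sqc, with placeholder weights."""
    return (1 - w_m * d_m) + w_ql * s_ql + w_qc * s_qc


# Hypothetical candidates from one cluster: (URI-M label, Dm, Sql, Sqc).
cluster = [
    ("homepage capture", 0.30, 0.2, 0.8),
    ("article capture", 0.05, 1.0, 0.8),
    ("social media capture", 0.10, 0.6, 0.1),
]
best = max(cluster, key=lambda c: memento_quality(c[1], c[2], c[3]))
print(best[0])
```

Lower damage and higher snippet scores both raise Mq, so within a cluster the candidate with the highest score is the one selected as its representative.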
![Page 65: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/65.jpg)
65
We prefer a higher quality memento (Dm)
http://wayback.archive-it.org/2358/20110201231457/http://news.blogs.cnn.com/category/world/egypt-world-latest-news/
http://wayback.archive-it.org/2358/20110201231622/http://www.bbc.co.uk/news/world/middle_east/
Brunelle et al. Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources, JCDL 2014
![Page 66: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/66.jpg)
66
We prefer pages with attractive snippets
https://wayback.archive-it.org/2358/20110207193404/http://news.blogs.cnn.com/2011/02/07/egypt-crisis-country-to-auction-treasury-bills/
https://wayback.archive-it.org/2358/20110207194425/http://www.cnn.com/2011/WORLD/africa/02/07/egypt.google.executive/index.html?hpt=T1
![Page 67: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/67.jpg)
67
We prefer deep links over high level domains (Sql)
Feb. 11, 2011: the homepage of BBC on Storify
Feb. 11, 2011: the homepage of BBC Middle East section on Storify
Feb. 11, 2011: the article of BBC on Storify
https://wayback.archive-it.org/2358/20110211191429/http://www.bbc.co.uk/
https://wayback.archive-it.org/2358/20110211192204/http://www.bbc.co.uk/news/world-middle-east-12433045
https://wayback.archive-it.org/2358/20110211191942/http://www.bbc.co.uk/news/world/middle_east/
![Page 68: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/68.jpg)
68
Social media pages may not produce good snippets (Sqc)
http://wayback.archive-it.org/1784/20100131023240/http:/twitter.com/Haitifeed/
http://wayback.archive-it.org/2358/20141225080305/https:/www.facebook.com/elshaheeed.co.uk
![Page 69: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/69.jpg)
69
Visualizing stories in Storify
![Page 70: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/70.jpg)
70
Remember Yasmin’s hand-crafted stories?
![Page 71: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/71.jpg)
71
Remember Yasmin’s hand-crafted stories?
![Page 72: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/72.jpg)
72
We extract the metadata of the pages and order them chronologically
{
  "elements": [
    {
      "permalink": "http://wayback.archive-it.org/694/20070523182134/http://www.usatoday.com/news/nation/2007-04-16-virginia-tech_N.htm",
      "type": "link",
      "source": {"href": "http://www.usatoday.com", "name": "www.usatoday.com @ 23, May 2007"}
    },
    {
      "permalink": "http://wayback.archive-it.org/694/20070530182159/http://www.time.com/time/specials/2007/vatech_victims",
      "type": "link",
      "source": {"href": "http://www.time.com", "name": "www.time.com @ 30, May 2007"}
    },
    {
      "permalink": "http://wayback.archive-it.org/694/20070530182206/http://www.collegiatetimes.com/",
      "type": "link",
      "source": {"href": "http://www.collegiatetimes.com", "name": "www.collegiatetimes.com @ 30, May 2007"}
    },
    {
      "permalink": "http://wayback.archive-it.org/694/20070606234248/http://hokies416.wordpress.com/",
      "type": "link",
      "source": {"href": "http://hokies416.wordpress.com", "name": "hokies416.wordpress.com @ 06, Jun 2007"}
    },
    …
    {
      "permalink": "http://wayback.archive-it.org/694/20070620234329/http://www.hokiesports.com/april16/",
      "type": "link",
      "source": {"href": "http://www.hokiesports.com", "name": "www.hokiesports.com @ 20, Jun 2007"}
    }
  ],
  "description": "This is an automatically generated story from Archive-It collection.",
  "title": "April 16 Archive"
}
Using the Storify API, we override the default metadata to generate more attractive snippets
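A payload of this shape can be assembled from a list of URI-Ms alone, because the capture datetime is embedded in each Archive-It URI-M as a 14-digit timestamp. The sketch below only builds the JSON object; it does not call the Storify API, and the function names and the regular expression are assumptions made for illustration.

```python
import json
import re
from datetime import datetime
from urllib.parse import urlsplit

# Assumed URI-M layout: wayback.archive-it.org/<collection>/<14-digit time>/<URI-R>
URI_M = re.compile(r"^https?://wayback\.archive-it\.org/\d+/(\d{14})/(.+)$")


def story_element(uri_m):
    """Return (capture datetime, story element) for one Archive-It URI-M."""
    match = URI_M.match(uri_m)
    if match is None:
        raise ValueError(f"not an Archive-It URI-M: {uri_m}")
    captured = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    host = urlsplit(match.group(2)).netloc
    element = {
        "permalink": uri_m,
        "type": "link",
        "source": {
            "href": f"http://{host}",
            "name": f"{host} @ {captured.strftime('%d, %b %Y')}",
        },
    }
    return captured, element


def build_story(uri_ms, title, description):
    """Order the mementos chronologically and wrap them in a story object."""
    dated = sorted((story_element(u) for u in uri_ms), key=lambda pair: pair[0])
    return {
        "elements": [element for _, element in dated],
        "description": description,
        "title": title,
    }


story = build_story(
    [
        "http://wayback.archive-it.org/694/20070530182159/http://www.time.com/time/specials/2007/vatech_victims",
        "http://wayback.archive-it.org/694/20070523182134/http://www.usatoday.com/news/nation/2007-04-16-virginia-tech_N.htm",
    ],
    title="April 16 Archive",
    description="This is an automatically generated story from Archive-It collection.",
)
print(json.dumps(story, indent=2))
```

The month abbreviation produced by `%b` depends on the process locale; the sketch assumes the default C locale.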
![Page 73: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/73.jpg)
73
Example of an automatically generated story
Notice the good metadata: images, titles with dates, favicons
![Page 74: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/74.jpg)
74
Evaluating the Dark and Stormy Archive framework (how good are the automatically generated stories?)
![Page 75: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/75.jpg)
75
Evaluation is tricky! (two perfectly good stories could have non-overlapping k=28 elements!)
• We use human evaluators (via Amazon's Mechanical Turk) to compare:
  • Human-generated stories
  • DSA (automatically) generated stories
  • Randomly generated stories
• Successful evaluation means:
  • Human and DSA stories are indistinguishable
  • Human and DSA stories are better than Random
![Page 76: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/76.jpg)
76
Our guidelines for expert archivists at Archive-It for generating stories from the collections
![Page 77: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/77.jpg)
77
We received 23 stories for 10 Archive-It collections
SPST is “Sliding Page, Sliding Time”
SPFT is “Sliding Page, Fixed Time”
FPST is “Fixed Page, Sliding Time”
![Page 78: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/78.jpg)
78
https://storify.com/mturk_exp/3649b1s-57218803f5db94d11030f90b
• Generated by domain experts
• Sliding Page, Sliding Time
• The Boston Marathon Bombing collection
![Page 79: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/79.jpg)
79
Automatically generated stories from archived collections
1. Obtain the seed list and the TimeMap of URIs from the front-end interface of Archive-It
2. Extract the HTML of the mementos from the WARC files (locally hosted at ODU) and download the collections that we do not have in the ODU mirror from Archive-It
3. Extract the text of the page using the Boilerpipe library
4. Eliminate the off-topic pages based on the best-performing method ((Cosine, WordCount) with the suggested thresholds (0.1, −0.85))
5. Exclude duplicates in each TimeMap
6. Eliminate the non-English language pages
7. Slice the collection dynamically and then cluster the mementos of each slice using the DBSCAN algorithm
8. Apply the quality metrics to select the best representative pages
9. Sort the selected mementos chronologically, then put them and their metadata in a JSON object
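The clustering step names DBSCAN. For readers unfamiliar with it, here is a compact pure-Python version of the textbook algorithm. It is a sketch, not the DSA code: the distance function, `eps`, and `min_pts` are left to the caller, and the one-dimensional example values are made up.

```python
def dbscan(points, eps, min_pts, distance):
    """Return one cluster label per point; -1 marks noise."""
    noise = -1
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j in range(len(points))
                if distance(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = noise  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point of the current cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


values = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 9.0]
print(dbscan(values, eps=0.15, min_pts=2, distance=lambda a, b: abs(a - b)))
```

For mementos, the points would be page texts and the distance could be derived from a text similarity, for example one minus cosine similarity; that pairing is an assumption of this note. DBSCAN does not need the number of clusters in advance, which suits slices whose topical structure is unknown.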
![Page 80: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/80.jpg)
80
https://storify.com/mturk_exp/3649b0s
• Automatically generated story
• Sliding Page, Sliding Time
• The Boston Marathon Bombing collection
![Page 81: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/81.jpg)
81
Random stories
28 mementos were randomly selected from each collection before excluding off-topic and duplicate pages
![Page 82: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/82.jpg)
82
https://storify.com/mturk_exp/3649b2s-57227227bb79048c2d0388dc
• Randomly generated story
• Sliding Page, Sliding Time
• The Boston Marathon Bombing collection
![Page 83: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/83.jpg)
83
https://storify.com/mturk_exp/3649bads
If someone prefers this story, we exclude their results
• Poorly generated story
• The same memento, 28 times
• The Boston Marathon Bombing collection
![Page 84: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/84.jpg)
84
MT experiment setup
• Three HITs for each story (69 HITs to evaluate 23 stories); two comparisons per HIT:
  • HIT1: human vs. automatic, human vs. poor
  • HIT2: human vs. random, human vs. poor
  • HIT3: random vs. automatic, automatic vs. poor
• 15 distinct turkers with master qualification (i.e., high acceptance rate) for each HIT
• We rejected the submissions that preferred the poorly generated stories and the HITs that were completed in less than 10 seconds (mean time per HIT = 7 minutes)
• 989 out of 1,035 (69*15) valid HITs
• We awarded the turker $0.50 per HIT
https://www.mturk.com/mturk/help?helpPage=worker#what_is_master_worker
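The counts and the two rejection rules above can be checked with a short sketch. The `Submission` record and its field names are hypothetical; the numbers (23 stories, 3 HITs each, 15 turkers, 989 valid, 10-second minimum) come from the slide.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    preferred_poor_story: bool   # chose the deliberately poor story in a comparison
    seconds_to_complete: float


def is_valid(submission, min_seconds=10):
    """Apply both rejection rules from the experiment setup."""
    return (not submission.preferred_poor_story
            and submission.seconds_to_complete >= min_seconds)


total_hits = 23 * 3                    # three HITs per story
total_submissions = total_hits * 15    # 15 distinct turkers per HIT
valid_share = 989 / total_submissions

print(total_hits, total_submissions, round(valid_share, 3))
print(is_valid(Submission(False, 420)), is_valid(Submission(True, 420)))
```

So roughly 96% of the submitted HITs passed both checks.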
![Page 85: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/85.jpg)
85
A sample HIT
![Page 86: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/86.jpg)
86
DSA == Human
(Human, DSA) > Random
![Page 87: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/87.jpg)
87
Automatic versus Human
Sliding Page, Sliding Time Sliding Page, Fixed Time Fixed Page, Sliding Time
![Page 88: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/88.jpg)
88
Human versus Random
Sliding Page, Sliding Time Sliding Page, Fixed Time Fixed Page, Sliding Time
![Page 89: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/89.jpg)
89
Automatic versus Random
Sliding Page, Sliding Time Sliding Page, Fixed Time Fixed Page, Sliding Time
![Page 90: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/90.jpg)
90
Success!
DSA-generated stories are just as good as stories generated by human experts
![Page 91: Summarizing archival collections using storytelling techniques](https://reader031.fdocuments.us/reader031/viewer/2022013013/5879108b1a28ab6f658b6be7/html5/thumbnails/91.jpg)
91
Use interface people already know how to use to summarize collections
Archived collections
Storytelling services
Archived enriched stories
All the code, datasets, papers, slides, etc.: http://bit.ly/YasminPhD