Clustering Search to Navigate A Case Study of the Canadian World Wide Web as a Historical Resource
![Page 1: Clustering Search to Navigate A Case Study of the Canadian World Wide Web as a Historical Resource](https://reader034.fdocuments.us/reader034/viewer/2022051818/549f8906ac7959504c8b48a3/html5/thumbnails/1.jpg)
Ian Milligan (@ianmilligan1) Assistant Professor of History [email protected]
Clustering Search to Navigate: A Case Study of the Canadian World Wide Web as a Historical Resource
![Page 2: Clustering Search to Navigate A Case Study of the Canadian World Wide Web as a Historical Resource](https://reader034.fdocuments.us/reader034/viewer/2022051818/549f8906ac7959504c8b48a3/html5/thumbnails/2.jpg)
Why?
Historians need to think about computational methods in an era of web archives.
![Page 3: Clustering Search to Navigate A Case Study of the Canadian World Wide Web as a Historical Resource](https://reader034.fdocuments.us/reader034/viewer/2022051818/549f8906ac7959504c8b48a3/html5/thumbnails/3.jpg)
Estimated holdings:
Internet Archive: ~10,240 TB
Library of Congress: ~200 TB
![Page 4: Clustering Search to Navigate A Case Study of the Canadian World Wide Web as a Historical Resource](https://reader034.fdocuments.us/reader034/viewer/2022051818/549f8906ac7959504c8b48a3/html5/thumbnails/4.jpg)
The 80TB Wide Web Scrape
[March - December 2011]
![Page 5: Clustering Search to Navigate A Case Study of the Canadian World Wide Web as a Historical Resource](https://reader034.fdocuments.us/reader034/viewer/2022051818/549f8906ac7959504c8b48a3/html5/thumbnails/5.jpg)
Wayback Machine
or WARC files?
![Page 6: Clustering Search to Navigate A Case Study of the Canadian World Wide Web as a Historical Resource](https://reader034.fdocuments.us/reader034/viewer/2022051818/549f8906ac7959504c8b48a3/html5/thumbnails/6.jpg)
Building a .ca sample:
622,365 distinct URLs / 8,512,275 overall URLs = 7.31% in case study
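The percentage on this slide is a simple ratio of the two URL counts. As a quick check, the sketch below recomputes it from the figures quoted above; the variable names are illustrative and not taken from the talk.

```python
# Counts taken from the slide; the variable names are illustrative.
distinct_ca_urls = 622_365
overall_urls = 8_512_275

# Share of the overall URL list that falls inside the .ca case study.
share = distinct_ca_urls / overall_urls
share_percent = round(share * 100, 2)

print(f"{share_percent}% in case study")
```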
![Page 7: Clustering Search to Navigate A Case Study of the Canadian World Wide Web as a Historical Resource](https://reader034.fdocuments.us/reader034/viewer/2022051818/549f8906ac7959504c8b48a3/html5/thumbnails/7.jpg)
WARC: Web ARChive file format (ISO 28500:2009)
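To make the format concrete, here is a minimal sketch of reading one uncompressed WARC/1.0 record: a version line, named header fields, a blank line, then a content block whose size is given by Content-Length. The sample record, the example.ca URL, and the function name `read_warc_record` are assumptions made up for illustration. Real WARC files are usually gzip-compressed and hold many records, which this sketch does not handle.

```python
import io

# Hand-written sample record for illustration only (not from the talk's data).
payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hello</html>"
sample_record = b"".join([
    b"WARC/1.0\r\n",
    b"WARC-Type: response\r\n",
    b"WARC-Target-URI: http://example.ca/\r\n",
    b"WARC-Date: 2011-03-01T00:00:00Z\r\n",
    b"Content-Length: " + str(len(payload)).encode("ascii") + b"\r\n",
    b"\r\n",          # blank line ends the header
    payload,
    b"\r\n\r\n",      # two CRLFs terminate the record
])


def read_warc_record(stream):
    """Read one record and return (version, headers dict, content block)."""
    version = stream.readline().strip().decode("utf-8")
    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            break  # blank line (or end of input) marks the end of the header
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()
    block = stream.read(int(headers["Content-Length"]))
    return version, headers, block


version, headers, block = read_warc_record(io.BytesIO(sample_record))
print(version, headers["WARC-Type"], headers["WARC-Target-URI"], len(block))
```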
![Page 8: Clustering Search to Navigate A Case Study of the Canadian World Wide Web as a Historical Resource](https://reader034.fdocuments.us/reader034/viewer/2022051818/549f8906ac7959504c8b48a3/html5/thumbnails/8.jpg)
filesdump.py available at https://github.com/ianmilligan1/Historian-WARC-1/tree/master/WARC/warc-tools-mandel
WARC File
WARC-Tools/Lynx (warcfilter.py, warchtmlindex.py and filesdump.py)
Indexing
CDX Files (finding aids)
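A CDX file works as a finding aid because each line describes one capture, with a header line naming the columns by single-letter codes. The sketch below parses such a file using a handful of codes from the commonly documented CDX legend. The sample lines, the file name `sample.warc.gz`, and the Python field names are invented for illustration, and the CDX files used in the talk may carry different columns.

```python
# A few field letters from the commonly documented CDX legend.
# The Python-side names are illustrative choices, not an official schema.
FIELD_NAMES = {
    "N": "massaged_url",
    "b": "date",
    "a": "original_url",
    "m": "mime_type",
    "s": "response_code",
    "g": "file_name",
}

# Invented sample: a header line followed by two capture lines.
sample_cdx = (
    " CDX N b a m s g\n"
    "example.ca/ 20110301000000 http://example.ca/ text/html 200 sample.warc.gz\n"
    "example.ca/logo.png 20110301000005 http://example.ca/logo.png image/png 200 sample.warc.gz\n"
)


def parse_cdx(text):
    """Return one dict per capture line, keyed by the header's field letters."""
    lines = [line for line in text.splitlines() if line.strip()]
    letters = lines[0].split()[1:]  # drop the leading "CDX" marker
    names = [FIELD_NAMES.get(letter, letter) for letter in letters]
    return [dict(zip(names, line.split())) for line in lines[1:]]


captures = parse_cdx(sample_cdx)
html_pages = [c["original_url"] for c in captures if c["mime_type"] == "text/html"]
print(html_pages)
```

Filtering on a column such as the MIME type, as in the last lines, is one way an index like this can narrow a large crawl before any record is opened.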
![Page 9: Clustering Search to Navigate A Case Study of the Canadian World Wide Web as a Historical Resource](https://reader034.fdocuments.us/reader034/viewer/2022051818/549f8906ac7959504c8b48a3/html5/thumbnails/9.jpg)
Full Text Index
Clustering Workbench
Other sorts of text analysis
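As a rough illustration of what a full-text index does, the sketch below builds a tiny inverted index that maps each word to the pages containing it. It is a toy stand-in, not the indexing software used in the talk; the page texts, URLs, and function names are invented for illustration.

```python
import re
from collections import defaultdict

# Invented page texts standing in for text extracted from archived pages.
pages = {
    "http://example.ca/a": "Green party policy on the environment",
    "http://example.ca/b": "Hockey scores and hockey news",
    "http://example.ca/c": "Environment Canada weather forecast",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in docs.items():
        for token in tokenize(text):
            index[token].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


index = build_index(pages)
print(sorted(search(index, "environment")))
```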
![Page 10: Clustering Search to Navigate A Case Study of the Canadian World Wide Web as a Historical Resource](https://reader034.fdocuments.us/reader034/viewer/2022051818/549f8906ac7959504c8b48a3/html5/thumbnails/10.jpg)
![Page 11: Clustering Search to Navigate A Case Study of the Canadian World Wide Web as a Historical Resource](https://reader034.fdocuments.us/reader034/viewer/2022051818/549f8906ac7959504c8b48a3/html5/thumbnails/11.jpg)
![Page 12: Clustering Search to Navigate A Case Study of the Canadian World Wide Web as a Historical Resource](https://reader034.fdocuments.us/reader034/viewer/2022051818/549f8906ac7959504c8b48a3/html5/thumbnails/12.jpg)
![Page 13: Clustering Search to Navigate A Case Study of the Canadian World Wide Web as a Historical Resource](https://reader034.fdocuments.us/reader034/viewer/2022051818/549f8906ac7959504c8b48a3/html5/thumbnails/13.jpg)
![Page 14: Clustering Search to Navigate A Case Study of the Canadian World Wide Web as a Historical Resource](https://reader034.fdocuments.us/reader034/viewer/2022051818/549f8906ac7959504c8b48a3/html5/thumbnails/14.jpg)
https://github.com/ianmilligan1/Historian-WARC-1/tree/master/WARC/warc-tools-mandel
WARC File
WARC-Tools/Lynx (warchtmlindex.py and filesdump.py)
Indexing
![Page 15: Clustering Search to Navigate A Case Study of the Canadian World Wide Web as a Historical Resource](https://reader034.fdocuments.us/reader034/viewer/2022051818/549f8906ac7959504c8b48a3/html5/thumbnails/15.jpg)
Downside is you still have to know what you’re looking for.
![Page 16: Clustering Search to Navigate A Case Study of the Canadian World Wide Web as a Historical Resource](https://reader034.fdocuments.us/reader034/viewer/2022051818/549f8906ac7959504c8b48a3/html5/thumbnails/16.jpg)
![Page 17: Clustering Search to Navigate A Case Study of the Canadian World Wide Web as a Historical Resource](https://reader034.fdocuments.us/reader034/viewer/2022051818/549f8906ac7959504c8b48a3/html5/thumbnails/17.jpg)
Playing with images?
![Page 18: Clustering Search to Navigate A Case Study of the Canadian World Wide Web as a Historical Resource](https://reader034.fdocuments.us/reader034/viewer/2022051818/549f8906ac7959504c8b48a3/html5/thumbnails/18.jpg)
Ian Milligan Assistant Professor of History [email protected]
Thanks (to you all and to funders).
http://ianmilligan.ca/