"Unique," "Descriptive," and Other Damned Lies: The Challenges of Identifying Related Records
![Page 1: "Unique," "Descriptive," and Other Damned Lies: The Challenges of Identifying Related Records](https://reader030.fdocuments.us/reader030/viewer/2022021503/5876020c1a28ab4a508b5baf/html5/thumbnails/1.jpg)
HATHITRUST A Shared Digital Repository
“Unique,” “Descriptive,” and Other Damned Lies: The Challenges of Identifying Related Records
Valerie Glenn and Bill Dueber
LITA Forum
November 14, 2015
![Page 2: "Unique," "Descriptive," and Other Damned Lies: The Challenges of Identifying Related Records](https://reader030.fdocuments.us/reader030/viewer/2022021503/5876020c1a28ab4a508b5baf/html5/thumbnails/2.jpg)
Overview
• Introduction/Background
• What we’re trying to do & why
• What is a Federal Government Document?
• What’s been done
• Next steps
![Page 3: "Unique," "Descriptive," and Other Damned Lies: The Challenges of Identifying Related Records](https://reader030.fdocuments.us/reader030/viewer/2022021503/5876020c1a28ab4a508b5baf/html5/thumbnails/3.jpg)
Background
• 2011 Constitutional Convention – Ballot Initiative #4
• Resolved: “that HathiTrust facilitate collective action to create a comprehensive digital corpus of U.S. federal publications including those issued by GPO and other federal agencies”
• Resolved: “that HathiTrust develop a process of catalog record review to ensure accurate and full display of U.S. federal publications including those issued by GPO and other federal agencies”
![Page 4: "Unique," "Descriptive," and Other Damned Lies: The Challenges of Identifying Related Records](https://reader030.fdocuments.us/reader030/viewer/2022021503/5876020c1a28ab4a508b5baf/html5/thumbnails/4.jpg)
What are we trying to do?
• Define the corpus of US federal documents
• Identify documents that aren’t in the HathiTrust Digital Library
• Find documents and digitize them
![Page 5: "Unique," "Descriptive," and Other Damned Lies: The Challenges of Identifying Related Records](https://reader030.fdocuments.us/reader030/viewer/2022021503/5876020c1a28ab4a508b5baf/html5/thumbnails/5.jpg)
What is a Federal Document?
• How the HathiTrust Digital Library defines federal document
• How the Registry defines federal document
• How libraries identified federal documents
• Examples out-of-scope/bad records:
  • Uncharted
  • Other governments’ documents
  • Organizations with “United States” or “national” in their name
  • Reprints / reproductions
![Page 6: "Unique," "Descriptive," and Other Damned Lies: The Challenges of Identifying Related Records](https://reader030.fdocuments.us/reader030/viewer/2022021503/5876020c1a28ab4a508b5baf/html5/thumbnails/6.jpg)
What’s Been Done
• Matching on Identifiers
  • OCLC #
  • LCCN
  • ISSN
  • SuDoc Call number
• “Duplicates”
• Related (parts of the same series, etc.)
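Matching on an identifier such as the OCLC number might look roughly like the sketch below. The record shape (`:id` and `:oclc` keys) and the method names are hypothetical, invented for illustration; the only outside fact relied on is that OCLC numbers commonly appear with prefixes such as `(OCoLC)`, `ocm`, `ocn`, or `on`, and with leading zeros. As the title of this talk warns, a shared identifier is only a hint, so the groups returned here are candidates for review, not confirmed duplicates.

```ruby
# Reduce a raw OCLC string to its bare digits so that variant forms compare equal.
# Returns nil when the string contains no digits at all.
def normalize_oclc(raw)
  digits = raw.to_s.gsub(/\D/, "")   # drop prefixes like "(OCoLC)", "ocm", "ocn"
  digits = digits.sub(/\A0+/, "")    # drop leading zeros
  digits.empty? ? nil : digits
end

# Group record ids by normalized OCLC number and keep only numbers
# shared by more than one record (the candidate matches).
# Assumed (hypothetical) record shape: { id: "a", oclc: ["ocm00012345"] }
def group_by_oclc(records)
  groups = Hash.new { |hash, key| hash[key] = [] }
  records.each do |record|
    record[:oclc].map { |raw| normalize_oclc(raw) }.compact.uniq.each do |number|
      groups[number] << record[:id]
    end
  end
  groups.select { |_number, ids| ids.size > 1 }
end
```

The same pattern would apply to LCCN, ISSN, or SuDoc numbers, each with its own normalization rules, which are not shown here.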
![Page 7: "Unique," "Descriptive," and Other Damned Lies: The Challenges of Identifying Related Records](https://reader030.fdocuments.us/reader030/viewer/2022021503/5876020c1a28ab4a508b5baf/html5/thumbnails/7.jpg)
Enumeration and Chronology
Image found at http://goo.gl/qkrd0Q
![Page 8: "Unique," "Descriptive," and Other Damned Lies: The Challenges of Identifying Related Records](https://reader030.fdocuments.us/reader030/viewer/2022021503/5876020c1a28ab4a508b5baf/html5/thumbnails/8.jpg)
Quick Record-matching Quiz #1
•Mathematical preparation for general physics with calculus / by Davidson, Ronald C. Published: 1973
•A textbook of oral pathology by Shafer, William G. Published: 1974
![Page 9: "Unique," "Descriptive," and Other Damned Lies: The Challenges of Identifying Related Records](https://reader030.fdocuments.us/reader030/viewer/2022021503/5876020c1a28ab4a508b5baf/html5/thumbnails/9.jpg)
Quick Record-matching Quiz #2
•Mathematical preparation for general physics with calculus / by Davidson, Ronald C. Published: 1973
•Mathematical preparation for general physics with calculus / by Davidson, Ronald Published: 1973
![Page 10: "Unique," "Descriptive," and Other Damned Lies: The Challenges of Identifying Related Records](https://reader030.fdocuments.us/reader030/viewer/2022021503/5876020c1a28ab4a508b5baf/html5/thumbnails/10.jpg)
Quick Record-matching Quiz #3
What is the most reliable unique identifier in all of Libraryland?
OCLC Number
![Page 12: "Unique," "Descriptive," and Other Damned Lies: The Challenges of Identifying Related Records](https://reader030.fdocuments.us/reader030/viewer/2022021503/5876020c1a28ab4a508b5baf/html5/thumbnails/12.jpg)
FEEL BAD!!!!!
![Page 13: "Unique," "Descriptive," and Other Damned Lies: The Challenges of Identifying Related Records](https://reader030.fdocuments.us/reader030/viewer/2022021503/5876020c1a28ab4a508b5baf/html5/thumbnails/13.jpg)
Enum/Chron
![Page 14: "Unique," "Descriptive," and Other Damned Lies: The Challenges of Identifying Related Records](https://reader030.fdocuments.us/reader030/viewer/2022021503/5876020c1a28ab4a508b5baf/html5/thumbnails/14.jpg)
FEEL BAD!!!!!
![Page 15: "Unique," "Descriptive," and Other Damned Lies: The Challenges of Identifying Related Records](https://reader030.fdocuments.us/reader030/viewer/2022021503/5876020c1a28ab4a508b5baf/html5/thumbnails/15.jpg)
Examples
1985
v. 3
NO. 1-12 1963-64
This stuff we can parse with a few dozen lines of ruby, or even regex.
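For the three easy cases on this slide, a regex-based sketch might look like the following. This is not the parser described in the talk; the method names and the shape of the returned hashes are assumptions made for illustration, and anything the patterns do not recognize simply comes back as `nil`.

```ruby
# Expand a possibly abbreviated end year ("64") relative to a four-digit start year.
def expand_year(first_year, suffix)
  return suffix.to_i if suffix.length == 4
  candidate = (first_year / 100) * 100 + suffix.to_i
  candidate += 100 if candidate < first_year   # e.g. 1999-01 means 2001
  candidate
end

# Parse only the simplest enum/chron shapes: a bare year, a single volume,
# or a number range followed by a year range. Returns nil for everything else.
def parse_simple_enumchron(text)
  case text.strip
  when /\A(\d{4})\z/
    { year: $1.to_i }
  when /\Av\.?\s*(\d+)\z/i
    { volume: $1.to_i }
  when /\Ano\.?\s*(\d+)-(\d+)\s+(\d{4})-(\d{2}|\d{4})\z/i
    first_number, last_number = $1.to_i, $2.to_i
    first_year, year_suffix = $3.to_i, $4
    { numbers: first_number..last_number,
      years: first_year..expand_year(first_year, year_suffix) }
  end
end
```

Returning `nil` rather than guessing keeps the unparsed strings visible, which matters once the examples get messier.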
![Page 16: "Unique," "Descriptive," and Other Damned Lies: The Challenges of Identifying Related Records](https://reader030.fdocuments.us/reader030/viewer/2022021503/5876020c1a28ab4a508b5baf/html5/thumbnails/16.jpg)
Examples
V. 138 NO. 125-127 PT. 2 SEP 15-17 1992
NO. 3-4, 8, 13, 15-20, 22-23, 25, 27-28, 30-31, 33, 39-41, 43-44:V. 1, 45-46, 48-58, 63-66, 68-81, 83-91, 95, 99-113, 115-128, 130-133, 135-136, 144-145, 147-148, 151, 155, 157, 159, 162-164, 173-174, 178, 180, 182, 185, 190, 195, 198-199, 201-202, 205, 207-208
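A long comma-separated list like the one above can be partly expanded into individual issue numbers. The sketch below is a hypothetical helper, not the talk's parser: it handles plain numbers and simple ranges, and sets aside any token it does not understand (such as `43-44:V. 1`) instead of guessing at its meaning.

```ruby
# Expand "NO. 3-4, 8, 15-20" into individual numbers.
# Tokens that are neither a number nor a simple range are returned
# untouched under :unparsed so they can be reviewed by hand.
def expand_number_list(text)
  body = text.sub(/\A\s*nos?\.?\s*/i, "")   # strip a leading "NO." / "NOS."
  numbers = []
  unparsed = []
  body.split(",").each do |token|
    token = token.strip
    if (match = token.match(/\A(\d+)-(\d+)\z/))
      numbers.concat((match[1].to_i..match[2].to_i).to_a)
    elsif token.match?(/\A\d+\z/)
      numbers << token.to_i
    else
      unparsed << token
    end
  end
  { numbers: numbers.uniq.sort, unparsed: unparsed }
end
```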
![Page 18: "Unique," "Descriptive," and Other Damned Lies: The Challenges of Identifying Related Records](https://reader030.fdocuments.us/reader030/viewer/2022021503/5876020c1a28ab4a508b5baf/html5/thumbnails/18.jpg)
Examples
V. 33:NO. 36-54+SS1-4;SUP. ;ANNUAL SUMM. 1984
![Page 19: "Unique," "Descriptive," and Other Damned Lies: The Challenges of Identifying Related Records](https://reader030.fdocuments.us/reader030/viewer/2022021503/5876020c1a28ab4a508b5baf/html5/thumbnails/19.jpg)
Examples
31-40D
V. 45:NO. 7-9V. 45:NO. 7-92008
2011:pt.1 (1.501-1.640) = P.1 (1.501-1.640)/2011
V 11-13,14b/d no 11ab - 14 Jul 93 + abs 1992/93 c-f not e index
![Page 20: "Unique," "Descriptive," and Other Damned Lies: The Challenges of Identifying Related Records](https://reader030.fdocuments.us/reader030/viewer/2022021503/5876020c1a28ab4a508b5baf/html5/thumbnails/20.jpg)
Examples
982
NOS. 9-1461 WITH MANY EXCEPTIONS
![Page 21: "Unique," "Descriptive," and Other Damned Lies: The Challenges of Identifying Related Records](https://reader030.fdocuments.us/reader030/viewer/2022021503/5876020c1a28ab4a508b5baf/html5/thumbnails/21.jpg)
So...where are we?
• Parser up over 1000 lines with a long way to go
• “parse” about 65% of enumchron (3.5M)
• Not at all sure they’re all right
• ...or how to compare them
• ...or how to do gap detection
• ...or what to do with the other 35%
![Page 22: "Unique," "Descriptive," and Other Damned Lies: The Challenges of Identifying Related Records](https://reader030.fdocuments.us/reader030/viewer/2022021503/5876020c1a28ab4a508b5baf/html5/thumbnails/22.jpg)
FEEL BAD!!!!!
![Page 23: "Unique," "Descriptive," and Other Damned Lies: The Challenges of Identifying Related Records](https://reader030.fdocuments.us/reader030/viewer/2022021503/5876020c1a28ab4a508b5baf/html5/thumbnails/23.jpg)
Next steps
• Refine enum/chron parsing
• String matching
• Automated gap detection
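As a rough illustration of what gap detection could mean once enumeration has been reduced to plain integers (which, as the earlier slides show, is the hard part), the sketch below reports the missing runs between the lowest and highest held numbers. The method name and the integer-only input are assumptions; real holdings would also need volumes, parts, and dates to be taken into account.

```ruby
# Given the issue numbers that are held, return the missing runs between
# the smallest and largest held number, as an array of ranges.
def find_gaps(held_numbers)
  sorted = held_numbers.uniq.sort
  gaps = []
  sorted.each_cons(2) do |previous, following|
    gaps << ((previous + 1)..(following - 1)) if following - previous > 1
  end
  gaps
end
```

This only finds gaps inside the held span; it cannot tell whether issues before the first or after the last held number ever existed.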
![Page 24: "Unique," "Descriptive," and Other Damned Lies: The Challenges of Identifying Related Records](https://reader030.fdocuments.us/reader030/viewer/2022021503/5876020c1a28ab4a508b5baf/html5/thumbnails/24.jpg)
How to find out more
• HathiTrust Registry of US Federal Government Documents: http://www.hathitrust.org/usdocs_registry
• Contact Bill: [email protected] / @billdueber
• Contact Valerie: [email protected] / @vdglenn
![Page 25: "Unique," "Descriptive," and Other Damned Lies: The Challenges of Identifying Related Records](https://reader030.fdocuments.us/reader030/viewer/2022021503/5876020c1a28ab4a508b5baf/html5/thumbnails/25.jpg)
Thank you! Questions?