Corpus Assembly as Text Data Integration from Digital Libraries and the Web

Corpus Assembly as Text Data Integration from Digital Libraries and the Web. Udo Hahn & Tinghui Duan, Jena University Language & Information Engineering (JULIE) Lab (https://julielab.de/), DFG Graduate School „Romanticism as a Model“ (http://modellromantik.uni-jena.de), Friedrich Schiller University Jena, Germany. JCDL '19, Session 1A: Generation and Linking, Jun 3 2019, Urbana-Champaign, IL.


Page 1

Corpus Assembly as Text Data Integration from Digital Libraries and the Web

Jena University Language & Information Engineering (JULIE) Lab

https://julielab.de/

DFG Graduate School „Romanticism as a Model“

http://modellromantik.uni-jena.de

Friedrich Schiller University Jena, Germany

Jun 3 2019 – Urbana-Champaign, IL

JCDL '19 – Session 1A – Generation and Linking

Udo Hahn & Tinghui Duan

Page 2

Jena/Halle, Germany

Allgemeine Literatur-Zeitung (1785-1849)

Very important historical text source for literary studies in German Romanticism (1790-1830)

General Literature Gazette, ALZ

Page 3

Allgemeine Literatur-Zeitung (1785-1849)

Corpus

• Analyse

Research Result

Page 4

Allgemeine Literatur-Zeitung (1785-1849)

Traditional Workflow

Printed Book

• Scan

Scanned Picture

• OCR

Full Text

• Encode

• Assemble

Corpus

• Analyse

Research Result

Page 5

Allgemeine Literatur-Zeitung (1785-1849)

Traditional Workflow

Printed Book

• Scan

Scanned Picture

• OCR

Full Text

• Encode

• Assemble

Corpus

• Analyse

Research Result

315 Volumes

≈ 150,000 Pages

≈ 150,000,000 Tokens

Page 6

Allgemeine Literatur-Zeitung (1785-1849)

Traditional Workflow

Printed Book

• Scan

Scanned Picture

• OCR

Full Text

• Encode

• Assemble

Corpus

• Analyse

Research Result

Cost- and Time-Consuming

315 Volumes

≈ 150,000 Pages

≈ 150,000,000 Tokens

Page 7

Allgemeine Literatur-Zeitung (1785-1849)

Alternative Workflow

Digital Libraries

Full Text

• Encode

• Assemble

Corpus

• Analyse

Research Result

Page 8

Germany: Bavarian State Library

Scattered Digital Resources of ALZ

Page 9

Austria: Austrian National Library

Germany: Bavarian State Library

Scattered Digital Resources of ALZ

Page 10

Austria: Austrian National Library

Switzerland: University of Lausanne

Germany: Bavarian State Library

Scattered Digital Resources of ALZ

Page 11

UK: University of Oxford

Austria: Austrian National Library

Switzerland: University of Lausanne

Germany: Bavarian State Library

Scattered Digital Resources of ALZ

Page 12

USA: Harvard University, Indiana University, New York Public Library, Princeton University, Stanford University, University of Illinois, University of Michigan

UK: University of Oxford

Austria: Austrian National Library

Switzerland: University of Lausanne

Germany: Bavarian State Library

Scattered Digital Resources of ALZ

Page 15

USA: Harvard University, Indiana University, New York Public Library, Princeton University, Stanford University, University of Illinois, University of Michigan

UK: University of Oxford

Austria: Austrian National Library

Switzerland: University of Lausanne

Germany: Bavarian State Library

Scattered Digital Resources of ALZ

1,200+ Volumes

600,000+ Pages

600,000,000+ Tokens
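
To put these totals in perspective, they can be compared with the size of the printed ALZ reported earlier in the talk (315 volumes, about 150,000 pages, about 150,000,000 tokens). The following is a minimal Python sketch of that arithmetic. It treats the "+" figures as lower bounds and assumes, as the slides suggest but do not state outright, that the surplus stems from the same volumes being digitized by several libraries. The variable names are illustrative.

```python
# Size of the printed ALZ as given on the slides (approximate figures).
printed = {"volumes": 315, "pages": 150_000, "tokens": 150_000_000}

# Lower bounds for the digitized material found across all libraries.
scattered = {"volumes": 1_200, "pages": 600_000, "tokens": 600_000_000}

# Ratio of scattered digital material to the printed total, per unit.
redundancy = {unit: scattered[unit] / printed[unit] for unit in printed}

for unit, factor in redundancy.items():
    print(f"{unit}: at least {factor:.1f}x the printed total")
```

On these figures the digital holdings amount to roughly four times the printed material, which is why evaluating and selecting among competing versions becomes a necessary step.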

Page 16

Proposed Workflow

Digital Libraries and the Web

• Collect

• Correct Metadata

Page 17

Proposed Workflow

Digital Libraries and the Web

• Collect

• Correct Metadata

https://archive.org/details/bub_gb_udTjAAAAMAAJ/
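
The item URL above follows the Internet Archive's `details/<identifier>` pattern, and the Internet Archive serves item metadata as JSON under `https://archive.org/metadata/<identifier>`. The sketch below only derives that metadata URL from an item URL, as a possible first step of the Collect and Correct Metadata stage. It makes no network request, the function names are hypothetical, and it is not the authors' collection code.

```python
from urllib.parse import urlparse


def identifier_from_details_url(url):
    """Extract the item identifier from an archive.org 'details' URL."""
    parts = [part for part in urlparse(url).path.split("/") if part]
    if len(parts) < 2 or parts[0] != "details":
        raise ValueError(f"not an archive.org details URL: {url}")
    return parts[1]


def metadata_url(identifier):
    """Build the Internet Archive metadata endpoint URL for an item."""
    return f"https://archive.org/metadata/{identifier}"


item_url = "https://archive.org/details/bub_gb_udTjAAAAMAAJ/"
item_id = identifier_from_details_url(item_url)
print(item_id)
print(metadata_url(item_id))
```

Fetching and inspecting the returned JSON (for example with `urllib.request` and `json`) would be the next step; which fields need correcting is not specified on the slide.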

Page 18

Proposed Workflow

Digital Libraries and the Web

• Collect

• Correct Metadata

Full-Texts

• Evaluate

• Select

Page 19

Proposed Workflow

Digital Libraries and the Web

• Collect

• Correct Metadata

Full-Texts

• Evaluate

• Select

14 different full-text versions for this page!
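
The slides do not state how competing versions are scored. As one possible illustration of the Evaluate and Select steps, the Python sketch below ranks candidate OCR versions of a page by the share of their tokens found in a word list, a common proxy for OCR quality. The metric, the function names, the lexicon and the sample strings are all assumptions made for this example, not the authors' method.

```python
import re


def lexicon_hit_ratio(text, lexicon):
    """Share of alphabetic tokens in `text` that occur in `lexicon`."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in lexicon for token in tokens) / len(tokens)


def select_best_version(versions, lexicon):
    """Return the key of the version with the highest lexicon hit ratio."""
    return max(versions, key=lambda source: lexicon_hit_ratio(versions[source], lexicon))


# Toy data, invented for illustration only.
lexicon = {"die", "allgemeine", "literatur", "zeitung"}
versions = {
    "library_a": "Die Allgemeine Literatur-Zeitung",
    "library_b": "Dle Allgemcine Literatur-Zeltung",
}

best = select_best_version(versions, lexicon)
print(best)
```

With real data, the lexicon would need to cover historical German spelling, since a modern word list would penalize correct period orthography.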

Page 20

Proposed Workflow

Digital Libraries and the Web

• Collect

• Correct Metadata

Full-Texts

• Evaluate

• Select

Best-Quality Full-Texts

• Encode

• Assemble

Page 21

Proposed Workflow

Digital Libraries and the Web

• Collect

• Correct Metadata

Full-Texts

• Evaluate

• Select

Best-Quality Full-Texts

• Encode

• Assemble

Target-Corpus

Page 22

Result

Digital Libraries and the Web

• Collect

• Correct Metadata

Full-Texts

• Evaluate

• Select

Best-Quality Full-Texts

• Encode

• Assemble

Target-Corpus

261 Volumes

126,612 Pages

120,369,005 Tokens

Page 23

Result

Digital Libraries and the Web

• Collect

• Correct Metadata

Full-Texts

• Evaluate

• Select

Best-Quality Full-Texts

• Encode

• Assemble

Target-Corpus

315 Volumes

≈ 150,000 Pages

≈ 150,000,000 Tokens

261 Volumes

126,612 Pages

120,369,005 Tokens

≈ 82% coverage
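
The slide does not say which unit the ≈ 82% refers to. As a quick check, the Python sketch below divides the assembled corpus figures by the approximate totals for the printed ALZ. Because those totals are rounded, the resulting ratios are approximate as well; the variable names are illustrative.

```python
# Approximate size of the printed ALZ, as given on the slide.
printed = {"volumes": 315, "pages": 150_000, "tokens": 150_000_000}

# Size of the assembled target corpus, as given on the slide.
corpus = {"volumes": 261, "pages": 126_612, "tokens": 120_369_005}

# Coverage of the printed ALZ by the assembled corpus, per unit.
coverage = {unit: corpus[unit] / printed[unit] for unit in printed}

for unit, share in coverage.items():
    print(f"{unit}: {share:.1%}")
```

All three ratios fall between roughly 80% and 85%, which is consistent with the ≈ 82% reported.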

Page 24

Result

Digital Libraries and the Web

• Collect

• Correct Metadata

Full-Texts

• Evaluate

• Select

Best-Quality Full-Texts

• Encode

• Assemble

Target-Corpus

The Largest Corpus for German Romanticism

https://github.com/JULIELab/ALZ

315 Volumes

≈ 150,000 Pages

≈ 150,000,000 Tokens

261 Volumes

126,612 Pages

120,369,005 Tokens

≈ 82% coverage

Page 25

Problems

• Restricted Accessibility

• Heterogeneous Digitizing Conditions and OCR-Qualities

Page 26

Conclusion

• The Largest Corpus for German Romanticism

• Big Potential of DLs for Computational Literary Studies

• More Cooperation Between DLs Desirable

• Better Metadata and OCR-Quality are Desirable

Page 27

Corpus Assembly as Text Data Integration from Digital Libraries and the Web

Jena University Language & Information Engineering (JULIE) Lab

https://julielab.de/

DFG Graduate School „Romanticism as a Model“

http://modellromantik.uni-jena.de

Friedrich Schiller University Jena, Germany

Udo Hahn & Tinghui Duan

Thank you!