Leipziger Rektoratsreden 1871 1933 -...

Leipziger Rektoratsreden 1871 – 1933
Insights into Six Decades of Scientific Practice

Thomas Efer*, Jens Blecher**, Gerhard Heyer*

* Natural Language Processing Group, University of Leipzig
** Leipzig University Archive

*{efer|heyer}@informatik.uni-leipzig.de ; **[email protected]

Abstract: The aim of this paper is to introduce university archives as valuable sources for document-centric historical research. This comprises the history of science as well as the history of society. Using the example of Leipzig, a university city with an outstanding wealth of archived academic material, we want to stress the great and in many cases not yet digitally explored potential of such sources.

We then focus on a collection of annual administrative speeches called "Rektoratsreden", which span over 60 important years of Leipzig’s university life. We discuss some of the possibilities for content analysis using methods of Natural Language Processing (NLP). The focus lies on facilitating access to larger corpora. We present a minimalist process chain for a distant-reading, explorative approach to the Rektoratsreden corpus. For more general considerations we also highlight some of the digitization efforts that took place in Leipzig and reflect on how archive material as well as archival workflows can benefit from research infrastructure and vice versa.

Keywords: Science Archives, University History, Named Entity Recognition, Research Infrastructure

Introduction

Historically minded researchers of all fields should not miss the opportunity to visit their local university’s archive. With the chance to study, firsthand, the witnesses of a time long past, one is easily immersed in learning about past academic habits. While it may then seem as if time had stood still, time by its very nature never stands still in an archive. New items keep arriving constantly and new methods have to be applied to keep up with these dynamics. We can observe the advent of a digital age of archival work. To highlight the special role that the archive plays for Leipzig’s university, we first have to explain briefly where it comes from.

The University of Leipzig originates from one of the latest medieval university foundations in Europe. In a tense time between Czech and German nation-building and early European ideas of church reformation, a massive dispute at the University of Prague led to the emigration of many scholars and subsequently to the founding of a new university in Leipzig.

From the very beginning, and rooted in the special founding circumstances, there was a strong urge to record and preserve written testimonials of all the relevant documents, certificates, matriculation lists, official seals and insignia. In that spirit, the university archive was already established within the first statutes of 1409, with the head of the university (the "Rector") as the responsible holder.

Over the course of the centuries, the University of Leipzig ranked among the six largest universities in Germany. In a flourishing industrial and cultural environment, Leipzig eventually rose to be one of the world’s leading universities around 1900. The turning point came with World War I, which disconnected Leipzig from its international peers. The ensuing economic difficulties in the Weimar Republic shrank its capacities further. The following Nazi regime and the devastating World War II destroyed the intellectual and material foundations of science. In the times of the GDR, strict ideological barriers were imposed upon the "Karl-Marx-Universität", until German reunification finally brought the freedom for a reorientation.


Inventory of the Leipzig University Archive

Today, after six centuries of German and European scientific development, there are about 140 million sheets of paper stored in the Leipzig University Archive, accounting for more than 7000 meters of shelf space. This vast amount of source material is searched by about 800 researchers each year, leading to a new publication almost every week.

Each month about 1500 new (physical) files are stored in the archive. In recent years, about 50,000 digital files (totalling ca. 800 GB) were collected; all have to be stored and conserved for at least 30 years. The inventory is managed with 1.2 million database entries. Around half of them describe digitized pages from university files. The early digitization efforts were cost-intensive and of uncertain benefit¹. While the quality and accessibility of digitized documents have improved, the costs of producing them are still relatively high. Only about 5% of the stored university files have been digitized and made available online so far.

Not only from the perspective of corpus-based studies, many of the archival tasks resemble the requirements for general research infrastructure: easy and reliable access, long-term conservation, complete metadata handling, and the like are common concerns in the e-humanities and beyond. The European infrastructure projects for arts, humanities and linguistics, foremost CLARIN² and DARIAH³, aim at a level of service completeness that covers the complete resource lifecycle and therefore have to deal with archival aspects to a great extent. The NLP group in Leipzig is part of CLARIN-D, the German CLARIN initiative. CLARIN-D operates several data and computing centers, of which two (Garching and Jülich) are responsible for archiving. This decentralized system of distributed labour within a service-oriented environment is an important aspect of modern infrastructures of that scale. Today regional university archives are in the process of planning a cloud-based distributed infrastructure connecting Halle, Jena and Leipzig. It can only be to everyone’s advantage to develop synergies, ensure a high level of compatibility and actively exchange experiences between such initiatives. Consistent service orientation, an open source policy and a lot of standards work are the technological and organizational keys to sustainable and future-proof infrastructures.

Documents that are contiguous in time and content are often of special historical interest. Several corpora with diachronic features are available in Leipzig’s university archive. Among the already digitized ones are, for example, university newspaper corpora from the GDR era, such as the official university newspaper (1957-1991) or the science-related newspaper "Wissenschaftliche Zeitschrift" (1951-1991). In the following section we present a corpus that dates back earlier and covers the "Golden Age" of our university, when it was still organized with an intact academic self-administration.

Rektoratsreden-Corpus

The yearly transfer of administrative power from the rector to an elected successor was a long-standing tradition in Leipzig. Set within a solemn ceremony, usually held on Reformation Day, it formed the ritual highlight of the academic year. Its main parts were the inaugural speech ("Antrittsrede") of the new rector and the annual report ("Jahresbericht") of the outgoing rector. The annual report contained important news, events, staff changes and the like, while the new rector used his speech for science communication, presenting his own field of study. The possible insights from both speeches differ, but they complement each other in an interesting way.

The speeches were published in written form soon after the event. In 2004, in the preparation phase for the university’s 600th anniversary, 123 of those "Rektoratsreden" speeches were compiled into a two-volume edition [HL09]. Based on the digitized texts from 2340 scanned images of the original prints⁴, the edition was created in a manual process. A textual correction was performed, retaining the original grammar and spelling, but correcting obvious printer’s or typographical errors as well as faulty OCR results.

¹ At the end of the 1990s, some TIFF files of 50 MB size were hard to handle, while today their resolution is considered too low for many purposes.

² http://www.clarin.eu

³ http://dariah.eu/

⁴ OCR processing and manual copying of Fraktur-typeset documents

The principle of the edition work was to reproduce the speeches unabridged, in authentic length and depth. In order to keep a reasonable page count, no extensive commentary or biographical data on the speakers was added. This process resulted in 1790 printed pages for the corpus, consisting of ca. 700,000 tokens in about 5.1 MB of plain text. To give access to the vast amount of information stored in that two-volume edition, a comprehensive index with about 6500 entries for people, places and topics was created. The manually performed index creation accounted for more than two thirds of the whole project phase.

The edition covers the time from 1871, when the first speech was published in print, to the year 1933, when a shift from freely elected rectors to appointed rectors took place. Afterwards, the speeches were merely occasions for demonstrating conformity with Nazi propaganda, rather than offering the accustomed reflections on the inner structure, events and positions of the university.

Supplementary resources for corpus analysis

The documents collected at the archive not only form diachronic corpora of prose but also hold large lists of structured or semi-structured data that can be used to analyze those corpora. While the digitization of (mostly) handwritten lists is a very time- and resource-consuming task, it can provide a valuable framework of data entries to which other documents can be linked.

In Leipzig, several of these lists have been made digitally accessible: there are digital editions of matriculation lists (containing names and, for entries after 1810, also a lot of biographical information), databases of employee lists and bursary data ("Quästur") and other special lists⁵. These lists are not only a great entry point for researchers but also constitute an invaluable input of diachronic and entity-centric data that exactly matches the time and place of the corpora. The university archive is already working on a research-friendly GIS⁶ application to hold about 280,000 person-related entries with geo-temporal anchors. An ongoing cooperation with the archive of the University of Prague can further extend the data volume. The collected data already powers several online research interfaces, such as a statistical overview of the historical popularity of certain given names⁷.

Digitized lists are not the only means of retrieving names and name variants that were popular in the past. The corpora themselves offer the chance for further extraction and validation of names, for example by the extraction of context-based rules as presented in [SR12].

Setup for a first explorative analysis

The goal of the analysis prototype was to showcase how a minimalist NLP-based approach to the corpus can result in new means of browsing the texts in a somewhat more topic-oriented manner. Instead of an in-depth analysis, novel method development or a display of recent algorithms, we simply want to encourage experiments with standard tools and promising corpora as an interdisciplinary "brainstorming".

The idea was to extract Named Entities from within the speeches and use them as anchor points for navigation, creating an alternative, largely automated way of indexing the resources. Without using any pre-existing indices and corresponding page references, we want to show how to build basic cross-document links across sources and entities.

⁵ e.g. list of detention cell ("Karzer") punishments: http://www.archiv.uni-leipzig.de/digitale-archivalien/datenbanken/karzerstrafen/

⁶ Geographical Information System

⁷ http://www.archiv.uni-leipzig.de/vornamensuche/

Figure 1: Occurrences of famous archaeologist Georg Steindorff (rector 1923/24), as highlighted in the Gephi tool. From the density of the greyed-out nodes in the background it can be seen that the graph is much sparser at the bottom, where the non-administrative inaugural speeches (which naturally contain fewer references to people) are laid out.

For the task of Named Entity Extraction we assembled a simple process chain using the GATE framework and specifically the robust ANNIE workflow package [CMBT02]. Basic look-up annotations were created using a gazetteer that contained all given names and family names from the above-mentioned databases as well as a short, manually compiled list of titles and abbreviations. Using GATE’s annotation processing language "JAPE", we then created simple rules for a context-aware matching of the basic annotations to recognise mentions of people’s full names. For simplicity we skipped more advanced methods of the ANNIE toolbox such as co-reference resolution, which would have required POS tagging that is relatively hard to produce for such a special corpus with significant diachronic features. The JAPE-based rules followed an intuitive and rather conservative approach, focusing on precision rather than recall, although neither value has been scientifically evaluated⁸. Possible entities of type "Person" for which only a family name could be retrieved (e.g. people being introduced with just "Professor" or "our dear colleague" instead of a given name) were disregarded in the later analyses.
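The gazetteer look-up and the conservative full-name rules described in this section can be illustrated with a minimal sketch. It is written in plain Python rather than GATE/JAPE, so it is not the rule set used for the prototype; the tiny gazetteers, the title list and the function name `find_full_names` are assumptions made purely for illustration. A match requires at least one known given name directly followed by a known family name, optionally preceded by titles, so mentions with a family name only are skipped.

```python
import re

# Hypothetical mini-gazetteers for illustration; the real lists were
# derived from the archive databases and are far larger.
GIVEN_NAMES = {"Georg", "Max", "Ernst", "Leberecht", "Wilhelm"}
FAMILY_NAMES = {"Steindorff", "Klinger", "Wagner", "Wundt"}
TITLES = {"Dr.", "med.", "phil.", "Prof.", "Professor"}

# A token is a run of letters, optionally followed by a period ("Dr.").
TOKEN = re.compile(r"[^\W\d_]+\.?")


def find_full_names(text):
    """Return full names: optional titles, >= 1 given name, one family name."""
    tokens = TOKEN.findall(text)
    names = []
    i = 0
    while i < len(tokens):
        j = i
        # Skip over any leading titles and abbreviations.
        while j < len(tokens) and tokens[j] in TITLES:
            j += 1
        start = j
        # Collect consecutive given names.
        while j < len(tokens) and tokens[j] in GIVEN_NAMES:
            j += 1
        # Accept only if a given name was found and a family name follows.
        if start < j < len(tokens) and tokens[j].rstrip(".") in FAMILY_NAMES:
            names.append(" ".join(tokens[start:j] + [tokens[j].rstrip(".")]))
            i = j + 1
        else:
            i += 1
    return names
```

Requiring a given name before the family name trades recall for precision, in line with the approach described above: a phrase such as "Professor Wundt" yields no entity in this sketch.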

For simplicity we decided not to extract other NE types: the added complexity, e.g. resulting from the frequent ambiguity between place names and family names, would be too great. As a further idea we initially planned to also define abstract concepts such as "war" and observe their context, including a corresponding impact on the terminology used. We skipped such analyses because of the serious manual effort needed to provide a fitting terminological or ontological framework with a respective linking to the involved terms used from the 1870s to the 1930s. Such an approach of conceptual terminology extraction will certainly be addressed in further work.

In order to construct a "bigger picture" of the people mentioned in the corpus, a fitting representation of the extracted NE annotations was sought. Each speech document was therefore regarded as an (addressable) unit for which the occurrence of people was recorded. The extracted links were then exported as a graph structure. We chose the GEXF format⁹ since it is well integrated with Gephi [BHJ09], our graph-analytics tool of choice, and because its "Dynamic Graph" feature allows for a later extension of our output with temporal constraints.
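As a rough illustration of such an export, the following sketch writes speech-to-person links as a static GEXF 1.2 document using only the Python standard library. It is an assumption-laden example, not the exporter used for the prototype: the function name `occurrences_to_gexf`, the `s:`/`p:` id prefixes and the speech identifiers in the usage note are hypothetical.

```python
import xml.etree.ElementTree as ET

GEXF_NS = "http://www.gexf.net/1.2draft"


def occurrences_to_gexf(occurrences):
    """Build a GEXF tree from a mapping {speech_id: [person_name, ...]}."""
    root = ET.Element("gexf", {"xmlns": GEXF_NS, "version": "1.2"})
    graph = ET.SubElement(
        root, "graph", {"mode": "static", "defaultedgetype": "undirected"}
    )
    nodes = ET.SubElement(graph, "nodes")
    edges = ET.SubElement(graph, "edges")
    seen = set()

    def add_node(node_id, label):
        # Each speech and each person becomes exactly one node.
        if node_id not in seen:
            seen.add(node_id)
            ET.SubElement(nodes, "node", {"id": node_id, "label": label})

    edge_id = 0
    for speech, people in occurrences.items():
        add_node("s:" + speech, speech)
        for person in sorted(set(people)):
            add_node("p:" + person, person)
            # One edge per (speech, person) occurrence link.
            ET.SubElement(
                edges,
                "edge",
                {"id": str(edge_id), "source": "s:" + speech, "target": "p:" + person},
            )
            edge_id += 1
    return ET.ElementTree(root)
```

Calling `occurrences_to_gexf(data).write("occurrence.gexf", encoding="utf-8", xml_declaration=True)` would produce a file that Gephi should be able to open; temporal attributes for the dynamic mode are deliberately left out of this sketch.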

Sub-units of speeches, such as paragraphs, could allow for a better understanding of the "relatedness" between people, but unfortunately the corpus is too sparse on the temporal axis, with only 2 documents per year.

⁸ The edition’s index could not efficiently be used as a gold standard because it mixes different NE types. Instead, the rules were iteratively altered to boost the perceived precision until moderately sized random samples no longer contained errors (while still finding reasonably many different full names). While this trial-and-error principle inarguably lacks methodological elegance, it may generally be a good approach for interdisciplinary work, allowing for a lot of communication on the processing quality in tight feedback loops and keeping all parties close to the data as well as to the algorithm.

⁹ http://gexf.net/


Results

With our prototype we could extract 2391 named entities of type "Person" in a fully automated way. This includes a few name deviations denoting identical persons (e.g. "Dr. med. Ernst Leberecht Wagener" instead of the correct family name "Wagner"). Fortunately, their low Levenshtein distance makes it possible to automatically compile a small set of candidates for merging both name labels into a single representative for the entity; this set can then be processed manually.
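The candidate compilation mentioned above can be sketched as follows. This is an illustrative example only: the threshold `max_distance=1` and the function names are assumptions, and the pairwise comparison is quadratic in the number of names, which is acceptable for a few thousand labels but not for much larger sets.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance with unit costs for insertion, deletion, substitution."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


def merge_candidates(names, max_distance=1):
    """Return pairs of distinct name labels within max_distance edits."""
    unique = sorted(set(names))
    return [
        (a, b)
        for a, b in combinations(unique, 2)
        if levenshtein(a, b) <= max_distance
    ]
```

The resulting pairs are only suggestions; as stated above, the decision whether two labels denote the same person remains a manual step.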

Figure 2: Force-directed layout of the occurrence graph. Left: full graph; right: filtered by node degree 2 and greater, revealing more of the structure.

We created two complementary graphs out of the extracted data: first an "occurrence graph" with all the speeches and the recognized people as nodes, and second a "co-occurrence graph" of just people. The occurrence graph works just like an index. While an index entry usually points to a page number, people entries here point to speeches (Figure 1). Figure 2 shows the visual effect of filtering the occurrence graph so that only 680 nodes with a degree of two and greater remain (people mentioned in more than one speech). Only via those nodes can a navigation step from one speech to another be performed. To get comfortably from one person to another via the speeches mentioning both, we constructed the co-occurrence graph, which connects two people if they both appear in the same speech; the edge weight is the number of such speeches. Figure 3 shows the co-occurring nodes for Leipzig-born artist Max Klinger. While there may not be obvious semantically motivated connections between those people (such as Karl Marx co-occurring with Friedrich Engels), the graph nevertheless depicts some kind of temporal and topical similarity that can then be further investigated by reading the passages where both appear in the speeches.
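Both graph constructions can be expressed compactly. The sketch below assumes the extraction result is available as a mapping from a speech identifier to the list of recognized people; the function names and the data layout are hypothetical and only serve to illustrate the degree filter and the edge-weight definition given above.

```python
from collections import Counter
from itertools import combinations


def people_with_min_degree(occurrences, min_degree=2):
    """People mentioned in at least min_degree different speeches."""
    degree = Counter(
        person for people in occurrences.values() for person in set(people)
    )
    return {person for person, count in degree.items() if count >= min_degree}


def cooccurrence_weights(occurrences):
    """Map each unordered pair of people to their number of shared speeches."""
    weights = Counter()
    for people in occurrences.values():
        # Sorting makes (a, b) and (b, a) fall onto the same key.
        for pair in combinations(sorted(set(people)), 2):
            weights[pair] += 1
    return weights
```

Filtering the resulting counter for weights of 2 or more would give the kind of neighbourhood view shown for a single person in Figure 3.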

Figure 3: Co-occurrences for Max Klinger with edge weights of 2 or more (being mentioned together in at least 2 speeches)

We believe that NLP methods can be valuable tools to facilitate the creation of indices for people, places and topics for classical print publications. When the rule set is extended to gain more recall, the lower precision can easily be compensated for in later manual analyses in the edition process. But printed editions do not have to be the final result of edition work. Online editions can provide highly interlinked hypertextual documents. Specific corpus browsers with an explorative focus could encourage a more "casual" browsing and research behaviour, widening the possible range of readers beyond the purely academic world.


A next step should be the integration with other resources such as linked data hubs like Freebase¹⁰ or the Personennormdaten (PND) identifiers from the German National Library. Using such unambiguous anchors for entities enables inter-corpus linking. Identity-aware matching methods such as [RTHB12] can be employed. The narrow geographical focus of our corpus may allow for a simpler disambiguation in many cases by just finding "Leipzig" in full-text descriptions or by finding a corresponding edge in the semantic (hyper)graph towards the resource for the city of Leipzig. Further restrictions can be made, e.g. by rejecting all people born after the date of the speech that mentioned their name.
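The two disambiguation heuristics named above could be combined roughly as in the following sketch. It does not use any real Freebase or PND interface: the candidate records are plain dictionaries with hypothetical `birth_year` and `description` fields, and falling back to all temporally plausible candidates when none mentions the place is a design assumption of this example.

```python
def plausible_candidates(candidates, speech_year, place="Leipzig"):
    """Filter linked-data candidates for a name mentioned in a speech.

    Candidates born after the speech are rejected; among the rest,
    those whose description mentions `place` are preferred.
    """
    alive = [
        c for c in candidates
        if c.get("birth_year") is None or c["birth_year"] <= speech_year
    ]
    local = [c for c in alive if place in c.get("description", "")]
    # Fall back to all temporally plausible candidates if none is local.
    return local or alive
```

Candidates without a known birth year are kept, since missing metadata should not by itself rule out a match.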

Outlook and future possibilities

We showcased a merely quantitative, entity-based analysis of a diachronic scientific corpus. This allows for a rather distant view on "texts" and "protagonists", resembling the publication-author model which is central to bibliometrics. This discipline employs various forms of network analysis with a focus on temporal aspects and can therefore be seen as related work with many suitable approaches to diachronic document collections. It would be most interesting to extend bibliometric methodology to work on NLP-processed documents and to combine the information extracted from within the texts with classical bibliographic information where possible. In a way this could even constitute a next step towards a better understanding of Science Dynamics [SBB12].

The inventory of science archives is applicable in multiple fields of the e-humanities: not only can historical studies and social sciences gain insights from the material, other disciplines may benefit too, such as historical linguistics, which can extract knowledge about academic language and its development. To meet all scientists’ quantitative requirements it is necessary to digitize more documents, which is costly and often comes with caveats such as insufficient OCR quality, especially for older typefaces and of course handwriting.

Working on the Rektoratsreden corpus, it became evident that the introduction of NLP methods in archival and editorial work can drastically reduce manual workload, especially in indexing, linking, classification and general processing of digitized documents. In return, the historical data from the archives can enhance the applicability of NLP methods to historical texts. Within the University of Leipzig, a cooperation beyond the exchange of data and methods is planned. We want to encourage a fruitful collaboration of computer scientists and their respective university archives throughout the country. Therefore we want to establish a close exchange on the level of archival and research infrastructure in Leipzig in order to prepare the way for other university locations. As mentioned earlier, archival infrastructure and research infrastructure share a large set of common concerns that should lead to synergies, not necessarily by mixing data or workflows but by collaborating on standardizing and defining methods. Compatibility (or even better, interoperability) between both systems is an important topic. For example, any archive infrastructure that uses a service-oriented approach and a common metadata categorization system like IsoCat [BKSU+10] could easily be extended to interact with the CLARIN infrastructure. Another aspect of digital archive work is the inclusion of the ever-growing contemporary archive material that is now created mostly in digital form. Infrastructure projects for "digital born documents" are especially needed. A "private cloud" infrastructure based on open source technology and hosted across several regional university locations is already planned.

An ongoing general goal of NLP is to find novel ways to transform documents into a suitable representation for further processing in the emerging field of Visual Analytics [KKEM10]. New, scalable methods from that field can produce alternative views on document collections, allowing for explorative and quantitative research while retaining the ability to access every textual occurrence in place. After all, better accessibility of university archive material could also attract interest from researchers in various non-historical fields and contribute to fruitful reflections on the roots and development of science itself as well as its possible forms of institutionalization.

¹⁰ http://www.freebase.com/


References

[BHJ09] BASTIAN, M. ; HEYMANN, S. ; JACOMY, M.: Gephi: An Open Source Software for Exploring and Manipulating Networks. In: International AAAI Conference on Weblogs and Social Media, 2009

[BKSU+10] BROEDER, Daan ; KEMPS-SNIJDERS, Marc ; UYTVANCK, Dieter V. ; WINDHOUWER, Menzo ; WITHERS, Peter ; WITTENBURG, Peter ; ZINN, Claus: A Data Category Registry- and Component-based Metadata Framework. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10). Valletta, Malta : European Language Resources Association (ELRA), May 2010. ISBN 2–9517408–6–7

[CMBT02] CUNNINGHAM, Hamish ; MAYNARD, Diana ; BONTCHEVA, Kalina ; TABLAN, Valentin: GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. In: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL’02), 2002

[HL09] HÄUSER, F. ; (LEIPZIG), Universität: Die Leipziger Rektoratsreden 1871-1933: Herausgegeben zum 600-jährigen Gründungsjubiläum der Universität im Jahr 2009. de Gruyter, 2009. ISBN 9783110209198

[KKEM10] KEIM, Daniel A. (Hrsg.) ; KOHLHAMMER, Joern (Hrsg.) ; ELLIS, Geoffrey (Hrsg.) ; MANSMANN, Florian (Hrsg.): Mastering The Information Age - Solving Problems with Visual Analytics. Eurographics, 2010 http://www.vismaster.eu/book/

[RTHB12] RIZZO, Giuseppe ; TRONCY, Raphael ; HELLMANN, Sebastian ; BRUEMMER, Martin: NERD meets NIF: Lifting NLP extraction results to the linked data cloud. In: LDOW, 5th Workshop on Linked Data on the Web, April 16, 2012, Lyon, France

[SBB12] SCHARNHORST, Andrea (Hrsg.) ; BÖRNER, Katy (Hrsg.) ; BESSELAAR, Peter van d. (Hrsg.): Models of Science Dynamics: Encounters Between Complexity Theory and Information Sciences. Berlin : Springer, 2012. ISBN 978–3–642–23068–4

[SR12] SCHLAF, Antje ; REMUS, Robert: Learning Categories and their Instances by Contextual Features. In: Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC’12), 2012
