Challenges in the linguistic exploitation of specialized republishable web corpora
Adrien Barbaresi
Berlin-Brandenburg Academy of Sciences and Humanities (BBAW)
RESAW conference 2015, Aarhus – June 10, 2015
Adrien Barbaresi (BBAW) Specialized republishable web corpora 2015-06-10 1 / 15
Outline
• Context
  • Specialized web corpora
  • Construction and availability
• Challenges
  • Metadata extraction
  • Quality assessment of content
  • Licensing and republishing
Context Specialized web corpora
Text corpora
Text collections
in German
gathered on the Web
used by linguists
available via a web interface
“Specialized” corpora
Definition
The corpora focus on a particular text genre or source.
Goal for linguists: better coverage of specific written text types and genres not found in “traditional” corpora.
Construction
1 Discovery and download: web crawling techniques
2 Stored in a processed version: linguistic corpus
3 Standardized formats: interoperability within the research community
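Step 2 (from downloaded page to processed text) can be illustrated with a minimal sketch that strips markup from an HTML page using only Python's standard library. The class and function names are hypothetical and the slides do not describe the actual toolchain; real boilerplate removal is considerably more involved.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content (illustrative only)."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text chunks of an HTML string, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```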
Two cases of republishable corpora
“Standard” case: German political speeches
Chancellery | 1,831 speeches | 1998–2012
Presidency | 1,442 speeches | 1984–2012
https://adrien.barbaresi.eu/corpora/speeches/
“Borderline” case: German blogs under Creative Commons licenses
Blogs | 250,000 documents | ∼100 million tokens
https://kaskade.dwds.de/dstar/blogs/
Context Construction and availability
File formats
1 Web archives (HTML, no WARC to this day)
⇒ linguistic processing toolchain
2a XML TEI format (https://tei-c.org)
2b Browser-friendly HTML documents
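To make format 2a more concrete, here is a minimal sketch that wraps one document in a TEI skeleton with Python's standard library. The helper names, the sample fields and the selection of header elements are assumptions made for illustration; the slides do not specify which TEI elements the corpora actually use, and the output is not claimed to validate against a TEI schema.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace


def _sub(parent, tag, text=None, **attrib):
    """Append a namespaced child element (hypothetical helper)."""
    node = ET.SubElement(parent, f"{{{TEI_NS}}}{tag}", attrib)
    node.text = text
    return node


def build_tei(title, author, date, paragraphs):
    """Return a TEI document as a string: header metadata plus one <p> per paragraph."""
    tei = ET.Element(f"{{{TEI_NS}}}TEI")
    file_desc = _sub(_sub(tei, "teiHeader"), "fileDesc")
    title_stmt = _sub(file_desc, "titleStmt")
    _sub(title_stmt, "title", title)
    _sub(title_stmt, "author", author)
    _sub(_sub(file_desc, "publicationStmt"), "date", date, when=date)
    _sub(_sub(file_desc, "sourceDesc"), "p", "Downloaded from the Web")
    body = _sub(_sub(tei, "text"), "body")
    for paragraph in paragraphs:
        _sub(body, "p", paragraph)
    return ET.tostring(tei, encoding="unicode")
```

Keeping the date in a machine-readable `when` attribute is what later allows queries and plots by date.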
Interface to the political speeches: static HTML documents
Interface to the blogs: querying architecture @DWDS
Challenges Metadata extraction
Data quality
Even small or rare mistakes, in date encoding for instance, may cause the application to be disregarded or discarded by researchers in the humanities.
Potentially erroneous metadata in “one size fits all” web corpora may undermine the relevance of web texts for linguistic purposes.
→ “Hi-Fi” web corpora promote web sources and the modernization of research methodology
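One simple safeguard against date-encoding mistakes is a plausibility check before a document enters the corpus. The sketch below assumes dates are stored as ISO strings (YYYY-MM-DD); the function name and the default bounds are hypothetical and would have to be set per corpus.

```python
from datetime import date, datetime


def plausible_date(raw, earliest=date(1995, 1, 1), latest=date(2015, 6, 10)):
    """Return a date object if `raw` is a well-formed date inside the window, else None."""
    try:
        parsed = datetime.strptime(raw.strip(), "%Y-%m-%d").date()
    except (ValueError, AttributeError):
        # Malformed string, impossible calendar date, or missing value.
        return None
    return parsed if earliest <= parsed <= latest else None
```

A value such as 1970-01-01 is well formed but falls outside the window, so it is rejected instead of silently distorting any analysis by date.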
Examples: quality of metadata
Figure: Relative frequency of lemma “Google” in the blog corpus, classified by date
Figure: Relative frequency of lemma “Zuckerberg” in the blog corpus, classified by date
Querying and plotting software (DDC & DiaCollo): Bryan Jurish (BBAW)
http://odo.dwds.de/~moocow/software/
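The quantity behind such plots can be sketched as the number of hits for a lemma divided by the number of tokens in the same period. This is a toy illustration in plain Python, not the DDC or DiaCollo implementation; the function name, the input layout and the per-million scaling are assumptions.

```python
from collections import Counter


def relative_frequency_by_year(docs, lemma):
    """docs: iterable of (year, list_of_lemmas). Returns {year: hits per million tokens}."""
    hits, totals = Counter(), Counter()
    for year, lemmas in docs:
        totals[year] += len(lemmas)
        hits[year] += sum(1 for tok in lemmas if tok == lemma)
    # Years without any tokens are left out to avoid dividing by zero.
    return {y: 1_000_000 * hits[y] / totals[y] for y in sorted(totals) if totals[y]}
```

A document carrying a wrong date adds its hits and its tokens to the wrong year, which is why metadata errors show up directly in frequency curves of this kind.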
Challenges Quality assessment of content
Example: text quality (query: “document” in blog corpus)
Challenges Licensing and republishing
Last but not least: License issues
Different countries, different laws (public domain in the USA, political speeches in Germany etc.)
To be sure: check content and licenses
Manual content checks for the blogs
2727 blog candidates
1766 blogs can be used without restriction (65 %), since all the textual content qualifies for archiving:
• At least something on the website
• It is a blog
• Mostly written in German
• Under CC license
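The four criteria above act as a conjunction: a blog is kept only if every check passes. A minimal sketch of such a record follows, together with the arithmetic behind the 65 % figure; the class and field names are hypothetical, and the checks themselves were carried out manually.

```python
from dataclasses import dataclass


@dataclass
class BlogCheck:
    """Outcome of the manual checks for one blog candidate (illustrative field names)."""
    has_content: bool
    is_blog: bool
    mostly_german: bool
    cc_licensed: bool

    def usable(self):
        # All four criteria must hold for the blog to be archived.
        return all((self.has_content, self.is_blog, self.mostly_german, self.cc_licensed))


# Figures from the slide: 1766 usable blogs out of 2727 candidates.
candidates, usable_blogs = 2727, 1766
share_percent = round(100 * usable_blogs / candidates)
```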
CC license terms (blog corpus)
Most frequent license types:
652 BY-NC-SA
532 BY-NC-ND
351 BY-SA
282 BY
129 BY-NC
58 BY-ND
Remarks
• Theoretically, the CC license cannot be overridden by another once the content has been published
• The usage of *-ND might be a problem
• Differences between countries are not supposed to be a concern
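To gauge how much of the material the *-ND remark concerns, the counts above can be tallied by license element. The sketch uses only the six figures listed on the slide; how these counts relate to the 1766 usable blogs is not stated, so the shares are computed over the listed types only. The function name is hypothetical.

```python
# Counts of the most frequent license types, as listed on the slide.
LICENSE_COUNTS = {
    "BY-NC-SA": 652,
    "BY-NC-ND": 532,
    "BY-SA": 351,
    "BY": 282,
    "BY-NC": 129,
    "BY-ND": 58,
}


def count_with_element(counts, element):
    """Sum the counts of all license types that include a given element (e.g. 'ND')."""
    return sum(n for name, n in counts.items() if element in name.split("-"))


total = sum(LICENSE_COUNTS.values())
no_derivatives = count_with_element(LICENSE_COUNTS, "ND")
non_commercial = count_with_element(LICENSE_COUNTS, "NC")
nd_share = no_derivatives / total  # fraction of listed licenses forbidding derivatives
```

Among the listed types, 590 of 2004 carry the ND element, that is slightly under 30 %.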
Thank you for your attention
@adbarbaresi
http://purl.org/adrien-barbaresi
Document under CC BY-SA 4.0 license