Page 1: Working Group 3: The Classroom (I)

EMELD Conference, July 2004, Detroit

Annotations, Unicode
http://emeld.org/school/classroom/annotation/
http://emeld.org/school/classroom/unicode/

The Work Room
http://emeld.org/school/workroom

Chairman: Dafydd Gibbon, Universität Bielefeld

with contributions from Deborah W. Anderson, Berkeley

Page 2: Working Group objectives

● Goal: website
  – accurate
  – reasonably complete
  – clear enough to interest and guide the linguist who does not have extensive technical training
● Focus: structure of the existing School and its "rooms"
● Questions:
  – What works, what should be added?
  – Which tutorials should be added?
  – Which references should be included in the bibliography?
  – Which software should be included in the Tool Room?
  – How do you find the navigation and design of The School?
● Any suggestions? Are the right-hand-side menus helpful?
● Deliverable: report

Page 3: Working Group members

Anthony Aristar, Wayne State University
Steven Bird, University of Melbourne
Chin-Chuan Cheng, Academia Sinica
Dafydd Gibbon, Universität Bielefeld
David Harrison, Yale University
Will Lewis, California State University, Fresno
Jim Mason, Rosetta Project
Donald Salting, North Dakota University
Joszef Szakos, Providence University & National DongHua University College of Aboriginal studies
Dietmar Zaefferer, Ludwig-Maximilians-Universität München

Megan Zdrojkowski, Workshop liaison

Page 4: Working Group preliminary notes

Task: Review of the Unicode, Annotation and Work Room areas of The Classroom.

Classroom: The Classroom is designed to contain lessons and (eventually) online tutorials on recommended practices. The Classroom is central to the website, so its clarity and accuracy are important. We would appreciate corrections and suggestions (especially about tutorials that we should develop). The annotation section is particularly in need of additional content; the Unicode section needs expert review, since many linguists are eager for information on Unicode.

Work Room: The Work Room is intended to be a place where linguists can work on their own documentation online, using facilities on the E-MELD servers. At present we have only a few online facilities, e.g. the FIELD tool for lexical input, the OLAC Repository Editor, and Charwrite (http://emeld.org/tools/charwrite.cfm), an ontology of linguistic concepts developed by the E-MELD team at the U. of Arizona.

Page 5: Annotation basics

Definitions:
  "Annotation is a [ kind ] with [ specifics ]"
  The current EMELD definition is biased towards written text and is idiosyncratic by international standards.

Types:
  inline vs. standoff
  interlinear glossing
  multitier labelling
  POS markup
  treebank

Terms:
  alignment, annotation, labelling, markup, transcription, treebank, ...

Page 6: Annotation: some EAGLES definitions (1)

Alignment:
  Domain: speech recognition.
  Hyperonyms: evaluation method.
  Def.: A function over two strings yielding a measure of their similarity in terms of insertions, deletions and substitutions. In determining the performance of a continuous speech recognition system, the response of the recogniser has to be compared to the transcription of the utterance presented to the system. In this process, the two word strings have to be aligned in order to compare them.
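The alignment definition on this slide can be illustrated with a minimal sketch: a word-level edit distance that counts the insertions, deletions and substitutions separating a recogniser's response from the reference transcription. The function name `edit_distance` and the sample sentences are illustrative assumptions; the EAGLES text does not prescribe a particular algorithm, and the standard Levenshtein recurrence is used here as one plausible reading.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Minimum number of word insertions, deletions and substitutions
    needed to turn `reference` into `hypothesis` (Levenshtein distance)."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dist[i][j] = distance between the first i reference words
    # and the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return dist[len(ref)][len(hyp)]


transcription = "the cat sat on the mat"  # what was said (assumed example)
response = "the cat sat on mat"           # what the recogniser returned
errors = edit_distance(transcription, response)
```

Dividing `errors` by the number of words in the transcription gives the usual word error rate for the utterance.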

Page 7: Annotation: some EAGLES definitions (2)

Annotation:
  Domain: corpus representation.
  Hyperonyms: description, representation, characterisation.
  Hyponyms: part of speech annotation, POS annotation, segmental annotation, prosodic annotation.
  Synonyms: labelling, markup.
  Def.:
    1. Symbolic description of a speech signal or text by assigning categories to intervals or points in the speech signal or to substrings or positions in the text.
    2. Process of obtaining a symbolic representation of signal data, the act of adding additional types of linguistic information to the transcription (representation) of a text or discourse.
    3. The material added to a corpus by means of (a): e.g. part-of-speech tags.

Page 8: Annotation: some EAGLES definitions (3)

Tagging scheme:
  Domain: dialogue representation.
  Def.: A list of annotation tags together with their definitions and the guidelines needed to map them on to a corpus.

Tagset:
  Domain: dialogue representation.
  Def.: The set of tags used for labelling words in a particular language and in a particular corpus.

Page 9: Annotation - back to basics

Distinctions:
  Relation ALIGNMENT(object, annotation) over a pair:
    object to be annotated
    annotation
    relation

Data:
  Empirical principle: primary data, corpus
  Hermeneutic principle: secondary, descriptive, interpretative data

Domains:
  speech recordings: audio, video, ....
    e.g. set of temporally ordered pairs of transcriptions and (possibly paired) time-stamps.
  text: authentic orthographic, transcription (note different ontological status)
    e.g. pair of word and tag (POS markup), or phrase and tag (treebank)
  artefacts: digital: photos, scans; analog: books, notes, objects...
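A minimal sketch of the two example domains on this slide, assuming hypothetical names (`Interval`, `speech_tier`, `pos_standoff`, `resolve`) and made-up labels and time-stamps. It shows a speech tier as temporally ordered, time-stamped labels, and a POS annotation of text in both inline form (word paired with tag) and standoff form (tags pointing into an untouched primary text by character offset).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """One labelled stretch of the speech signal."""
    start: float  # seconds from the start of the recording
    end: float
    label: str


# Speech domain: temporally ordered transcriptions paired with time-stamps.
speech_tier = [
    Interval(0.00, 0.31, "the"),
    Interval(0.31, 0.74, "cat"),
    Interval(0.74, 1.20, "sleeps"),
]

# Text domain, inline: each word is stored together with its tag.
pos_inline = [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]

# Text domain, standoff: the primary text is left untouched and each
# annotation refers to it by (start offset, end offset, tag).
text = "the cat sleeps"
pos_standoff = [(0, 3, "DET"), (4, 7, "NOUN"), (8, 14, "VERB")]


def resolve(primary: str, annotations):
    """Pair each standoff tag with the substring it points at."""
    return [(primary[start:end], tag) for start, end, tag in annotations]
```

Resolving the standoff annotations against the text recovers the inline pairs, which is one way to check that the offsets are still valid after the primary text has been edited.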

Page 10: Unicode update by Deborah Anderson (1)

The Linguistic Society of America may seek some sort of relationship with the Unicode Consortium. Larry Hyman had told me on Wednesday that LSA was going to write a letter regarding this, but per the President of Unicode, nothing has been received as yet. I'll follow up. I gave a talk at the recent Assoc. for Literary and Linguistic Computing in Gothenburg, Sweden, encouraging more universities and professional associations to participate in Unicode, but received no positive (or negative) response at all. TEI may investigate becoming a liaison (?) member, though I will have to follow up on this too.

I had a document at the last Unicode Technical Committee meeting in June on the use of subscripts and superscripts in linguistics, a topic which theoretically could be handled with Unicode codepoints or markup (or "style" feature in a font). The UTC never got to it. I will raise it again, because it brings up important questions for linguists in terms of how to encode their data (and decisions directly impact the development of a font, i.e., should linguists use Unicode subscripts "1", "2", "3" in the font or instead use the "style" feature, which means the subscripts may not necessarily be retained across the Web when doing searches).
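The encoding choice raised on this slide can be made concrete with a small sketch using Python's standard `unicodedata` module. The sample strings and the helper name `searchable` are assumptions for illustration. An encoded subscript is a distinct character that travels with the text, whereas a digit that is only styled as a subscript is an ordinary digit once the styling is lost.

```python
import unicodedata

encoded = "h\u2082"  # 'h' followed by U+2082 SUBSCRIPT TWO
styled = "h2"        # what survives of a styled subscript once markup is dropped

# The encoded subscript keeps its identity as a character ...
subscript_name = unicodedata.name(encoded[1])

# ... so a naive string comparison does not treat the two forms as equal.
naive_match = encoded == styled


def searchable(s: str) -> str:
    """Fold compatibility characters (including subscripts) to their plain
    equivalents, as a search index might do before matching."""
    return unicodedata.normalize("NFKC", s)


folded_match = searchable(encoded) == searchable(styled)
```

Compatibility normalization is one way a search tool can match both forms, at the cost of discarding the subscript distinction in the index.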

Page 11: Unicode update by Deborah Anderson (2)

There will be a panel on Unicode at this year's LSA, assuming I get everyone's abstract in on time (I am organizing it). The session will include a demonstration of a new free Mac Unicode font for linguists that ought to be ready by January, if not long before. I'm trying to get funding for a comparable one for the PC using OpenType technology.

A slew of new phonetic characters have been accepted for Unicode 5.0, assuming a positive balloting by ISO national bodies. I can provide a list of these for those who are interested. Peter Constable may pursue getting some horizontal tone bars into Unicode.

I am very happy to report that there is a new full Unicode Consortium member from Africa, representing "Agence Intergouvernementale de la Francophonie." It is wonderful to finally have a representative from Africa in Unicode. Two African scripts were just approved for Unicode 5.0: Tifinagh (used by the "Berbers") and Ethiopic Extensions (critically needed for Ethiopia).

Page 12: Unicode update by Deborah Anderson (3)

A new proposal author has started to get involved in Unicode, Chris Harvey, who is working with Peter Brand to get those characters into Unicode that are needed for Native American orthographies (particularly those in Canada). I left a handout at Leanne Hinton's "Stabilizing Indigenous Languages" conference in June with some recommendations for those groups developing orthographies (from the viewpoint of Unicode, particularly Ken Whistler). I would be happy to send a copy of this to EMELD and make it available. It contained basic Unicode info with tips (e.g. avoid using all caps).

Things are improving in terms of Unicode support generally, though people need to be running a very recent OS in order to get the best results.

Most of the following info is about Windows, but Mac has made some headway. In April, I spoke to a developer for RedHat, Owen Taylor, about Pango, an open source text rendering framework. I am happy to see more work is taking place on Linux, and I would expect it to continue.

Page 13: Unicode update by Deborah Anderson (4)

Fonts:
  SIL Doulos: has diacritic stacking capability (however, the user must have the appropriate software installed, see SIL website)
  Mac Unicode: Another font is being developed for the Mac which will be freely available later this year. It is aimed at providing linguists with the full range of IPA and those symbols used by Americanists, as well as other Latin block repertoires.
  OpenType: possibly an OpenType version for the PC.
  SIL Gentium: The free Gentium font from SIL offers precomposed forms and the creator, Victor Gaultney, seems open to adding new symbols/letters if requested.

Rendering engines:
  Uniscribe: From a search on the Web, I found this comment by a fairly reputable source: "In Windows, the Uniscribe processor, usp10.dll, which is in the \windows\system32 directory handles complex script processing. Any application and font needing complex script shaping in Windows can make use of it."
  Graphite (SIL): Runs on Windows and is free, allows diacritic stacking.

Page 14: Unicode update by Deborah Anderson (5)

Input methods:

Windows: Windows now has a way to set up customizable keyboards for those running Windows XP, Windows 2000, and Windows Server 2003: http://www.microsoft.com/globaldev/tools/msklc.mspx
  Also Tavultesoft Keyman.

Mac: It is now possible to install an XML-based keyboard layout: http://developer.apple.com/technotes/tn2002/tn2056.html
  Other: http://wordherd.com/keyboards/
  SIL Ukelele: http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=ukelele
  IPA kb layout: http://www.floodlight.net/MacOSX/IPAKeys.shtml

Page 15: Unicode discussion: basics

1. Views on Unicode (code, glyph, linguistic category, association):
   1. Official views of the Unicode consortium.
   2. Linguistic views of characters as units of language.
2. Unicode as an abstract character representation language.
   - structural properties of characters
     - compositional character definitions (poss. for missing chars)
     - holistic character definitions
     - diacritic stacking
   - interpretation of characters
     - mapping to glyphs (character "semantics")
     - mapping to linguistic categories ("orthography", "transcription")
3. Use of Unicode in fonts: full implementations, partial implementations...
4. Availability of Unicode-ready font processors:
   - input widgets
   - rendering engines
5. Terminology: characters, codes, glyphs, ...
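The contrast between compositional and holistic character definitions on this slide can be sketched with Python's standard `unicodedata` module. The variable names and example characters are assumptions chosen for illustration. A holistic (precomposed) character is a single code point, a compositional one is a base letter followed by combining marks, and normalization converts between the two where a precomposed form exists.

```python
import unicodedata

holistic = "\u00e9"        # LATIN SMALL LETTER E WITH ACUTE, one code point
compositional = "e\u0301"  # 'e' + COMBINING ACUTE ACCENT, two code points

# The two spellings are different code point sequences ...
same_codes = holistic == compositional

# ... but canonically equivalent: NFC composes, NFD decomposes.
composed = unicodedata.normalize("NFC", compositional)
decomposed = unicodedata.normalize("NFD", holistic)

# Diacritic stacking: open e + COMBINING TILDE + COMBINING ACUTE ACCENT.
# No precomposed character exists for this combination, so even NFC keeps
# three code points and the rendering engine has to stack the marks.
stacked = "\u025b\u0303\u0301"
stacked_nfc = unicodedata.normalize("NFC", stacked)
```

This is why compositional definitions can cover characters that are missing as single code points, provided the font and rendering engine support stacking.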

Page 16: Unicode discussion: why Unicode?

1. The alien font problem:
   - missing font, maybe unknown, untraceable, hacked, ad hoc, ...
   - wrong or unclear interpretation of character codes as glyphs
   - incomplete font
   - missing 8th bit in information exchange
   - browser problem: overriding of document metadata by browser metadata header
2. The squares and wingdings problem:
   Rendering engines choose default renderings for alien fonts:
   - squares (most common)
   - wingdings (MS-PowerPoint)
3. The competing philosophies problem:
   - Unicode is somewhat biased towards western alphabetic scripts.
   - East Asian scripts have different linguistic interpretations, different structural principles, different character definition conventions:
     - BIG-5, GB, compositional?
   - "Unicode has problems" - Tony
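Two of the bullets under the alien font problem can be reproduced at the byte level. The sketch below is an illustration under stated assumptions: it uses character encodings (Latin-1 and UTF-8) as a stand-in for legacy fonts, and the helper name `strip_eighth_bit` is hypothetical. It shows what happens when the 8th bit is lost in exchange, and when character codes are interpreted under the wrong convention.

```python
def strip_eighth_bit(data: bytes) -> bytes:
    """Simulate a 7-bit channel by clearing the high bit of every byte."""
    return bytes(b & 0x7F for b in data)


original = "\u00e9"  # 'é'

# Missing 8th bit: the Latin-1 byte 0xE9 becomes 0x69, which is 'i'.
latin1_bytes = original.encode("latin-1")
seven_bit = strip_eighth_bit(latin1_bytes).decode("ascii")

# Wrong interpretation of codes: UTF-8 bytes read as if they were Latin-1
# turn one character into two unrelated ones.
utf8_bytes = original.encode("utf-8")
misread = utf8_bytes.decode("latin-1")
```

In both cases the data still displays without an error, which is what makes these problems hard to trace after the fact.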

Page 17: Unicode discussion: problems

"Unicode has problems" – Tony

The sorting problem

The competing philosophies problem:
  - Unicode is somewhat biased towards western alphabetic scripts.
  - East Asian scripts have different linguistic interpretations, different structural principles, different character definition conventions:

The national identity problem: e.g. BIG-5, GB

Page 18: Work Room: Metadata

Recommendations:
  Definition: "metadata"?
    - File header information
    - Library catalogue card information
    - Can use controlled vocabulary to describe the content of a resource.
    - Schema for annotation?
  Description: Shorter description of OLAC - catalogue information collection (harvesting)
  Clarify role: EMELD wants metadata NOT data
  Terminology: Annotation is not metadata

Page 19: Work Room: Metadata

Recommendation:
  Start with metadata - different MD systems for different domains
  Define metadata with
    an appropriate metaphor
    direct characterisation
  Branch to
    metadata content
      Does an "abstract" constitute metadata?
    metadata conventions: DC, OLAC, IMDI, ...
    metadata processing
      editing
      harvesting
      searching
  Distinguish between different uses of "controlled vocabulary"

Page 20: Work Room: Terminology

Recommendation:
  Definition of terminology vs. everyday language
  References to Terminology Science
    - cutting edge - ML extraction of domain-specific terms from corpora
  Typical features of terminology:
    - controlled vocabulary
    - one-to-one mapping of terms and concepts
    - terminology relative to a specific domain or community
      here: linguistic terminologies
    - "MIT school of notational variation"

Page 21: Bibliography: Annotation, Unicode

WATCH THIS SPACE ...

Page 22: Tools: Platforms

Desktops + Laptops
  Software:
    office type tools
    speech tools
    XML technologies
    custom tools
  Functionality:
    unlimited: speech, text, archive, web, ...

PDAs
  Software:
    office type tools
    custom tools
  Functionality:
    Metadata entry
    Interview materials, questionnaires
    Note-taking:

Page 23: Tools: Speech

Speech annotation:
  Available:
    Audio: Praat, esps-waves+ (Xwaves), Transcriber, WaveSurfer, SoundFileSystems ...
    Video: TASX, Elan, ...
    Sound archive systems: emu
  Needed:
    Automatic (semiautomatic) segmentation tool
    Signal file chunking tool
    Speech concordancer
    Integration of annotator with support tools for specific tiers:
      morphological analysis
      GOLD ontology (cf. ontoELAN)

Page 24: Tools: Lexicon

Lexical DBMS:
  Available:
    Shoebox
    FIELD
    Morphological analysers
  Needed:
    Corpus wordlist extractor
    KWIC concordancer
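The two tools listed as needed on this slide can be sketched in a few lines each. This is a minimal illustration, not a proposal for the actual E-MELD tools; the function names `tokenize`, `wordlist` and `kwic`, the lower-casing, and the simple `\w+` tokenisation are all assumptions, and real field data would need language-specific tokenisation.

```python
import re
from collections import Counter


def tokenize(text: str):
    """Split text into lower-cased word tokens (Unicode-aware \\w)."""
    return re.findall(r"\w+", text.lower())


def wordlist(text: str):
    """Corpus wordlist: (word, frequency) pairs, most frequent first."""
    return Counter(tokenize(text)).most_common()


def kwic(text: str, keyword: str, width: int = 2):
    """Keyword-in-context lines: (left context, keyword, right context),
    with `width` words of context on each side."""
    tokens = tokenize(text)
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


corpus = "The dog saw the cat. The cat ran."  # assumed toy corpus
frequencies = wordlist(corpus)
concordance = kwic(corpus, "cat")
```

Both functions work on the same token list, so a wordlist entry can be used directly as the keyword for a concordance lookup.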

Page 25: Tools: Data entry

Spreadsheets:
  Available:
    OpenOffice Calc
    MS-Excel
  Functionality:
    Table object + appropriate style
    Flexible sorting
    Eyeball consistency checking
    Quick to get used to - flat learning curve
    In some ways more, in others less flexible than DBMS
  Result:
    Convenient entry tools for lexical databases
    Export as Character Separated Values, e.g. tab delimited
    Export as XML / HTML
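The export path on this slide, from a spreadsheet via tab-delimited text to XML, can be sketched with the Python standard library. The element names `lexicon` and `entry`, the function name `tsv_to_xml`, and the sample columns are assumptions for illustration; the sketch also assumes that every column header is a valid XML element name.

```python
import csv
import io
import xml.etree.ElementTree as ET


def tsv_to_xml(tsv_text: str) -> str:
    """Convert tab-delimited rows (first row = column headers) into XML,
    with one <entry> per row and one child element per column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    root = ET.Element("lexicon")
    for row in reader:
        entry = ET.SubElement(root, "entry")
        for field, value in row.items():
            ET.SubElement(entry, field).text = value
    return ET.tostring(root, encoding="unicode")


# Assumed spreadsheet export: three columns, two lexical entries.
exported = "headword\tpos\tgloss\nKatze\tnoun\tcat\nHund\tnoun\tdog\n"
xml_text = tsv_to_xml(exported)
```

Because the column headers become element names, consistent headers in the spreadsheet translate directly into a consistent XML structure.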

Page 26: Tools: Text editing

General editor:
  Available:
    OpenOffice (native file format: XML)
    MS-Word (export to XML)
  Problems:
    Most people don't (know how to) use styles, which support the production of structured documents and automatic export into XML etc.

Structured editor (e.g. XML)

Metadata form editor:
  ...

Page 27: Tools: Characters

Fonts:
  Available:
    uncountable...
  Needed:
    Full Unicode fonts
    Compatibility across operating systems

Character input gadgets:
  Available:
    Charwrite
    Östen's widget
  Needed:
    Touchscreen, graphics tablet, .... ?

Rendering engines:
  [see notes from email with Deborah Anderson]

Page 28: Tutorial issues

● Kinds of user

● Use cases

● Frequently Asked Questions

● Catalogue/index of available teaching/learning materials

● Quiz

Page 29: Browser ergonomics better practice

Browser problems:
  stylesheet (esp. font) overriding
  text overwriting between frames, table cells
  use of Java
  use of JavaScript
  use of graphics:
    Avoid text in graphics!

Professional web design
  Site map
  Click depth
  Small information chunks, not long texts

Distinguish:
  - programmer
  - information architect
  - graphics designer

Page 30: Browser ergonomics better practice

Display of information:
  Is GOLD displayed optimally?

Page 31: Conclusion

● Great start

● More work needed