Christopher Manning Computer Science and Linguistics, Stanford University
![Page 1: Christopher Manning Computer Science and Linguistics, Stanford University](https://reader035.fdocuments.us/reader035/viewer/2022062422/56812cad550346895d915d4f/html5/thumbnails/1.jpg)
Kirrkirr: Transforming the representation of lexical information
Experiments with endangered language dictionaries
Christopher Manning, Computer Science and Linguistics, Stanford University
(with Jane Simpson, Kevin Jansz, University of Sydney, and Nitin Indurkhya, Nanyang Technological University)
http://www.sultry.arts.usyd.edu.au/kirrkirr/
![Page 2: Christopher Manning Computer Science and Linguistics, Stanford University](https://reader035.fdocuments.us/reader035/viewer/2022062422/56812cad550346895d915d4f/html5/thumbnails/2.jpg)
Research Program: Lexicon
A language is more than individual words with a definition: it is a vast network of associations between words and within and across the concepts represented by words
The aim of this work is to provide a wide variety of users – not just linguists – with a better understanding of this conceptual map.
Traditional paper dictionaries offer very limited ways for making such networks visible
On a computer, there are no such limitations to the way information can be displayed.
![Page 3: Christopher Manning Computer Science and Linguistics, Stanford University](https://reader035.fdocuments.us/reader035/viewer/2022062422/56812cad550346895d915d4f/html5/thumbnails/3.jpg)
Research: Computational Lexicography
Dictionaries on computers are now common
– But there has been insufficient work trying to utilize the potential of the new medium
– Most present a plain, search-oriented representation of the paper version
Goal: fun dictionary tools that are effective for browsing and language learning (cf. Kegl 1995)
– Like flicking through a paper dictionary, but better
– Innovative ways for representing and linking dictionary information, through creative use of computer software
– Should improve user supports and incidental learning
Focus: exploration/dissemination, not creation
![Page 4: Christopher Manning Computer Science and Linguistics, Stanford University](https://reader035.fdocuments.us/reader035/viewer/2022062422/56812cad550346895d915d4f/html5/thumbnails/4.jpg)
Initial focus: Warlpiri
Warlpiri is an Australian Aboriginal language spoken in the Tanami desert (NW of Alice Springs)
There are a number of factors influencing this choice:
– Rich lexical materials have been collected by linguists over decades (Ken Hale, MIT, from the 1950s, Simpson, Nash, Laughren, Hoogenraad) resulting in the most comprehensive lexical databases for any Australian language
– Warlpiri is the first language of a relatively large community of people. There is reasonable vernacular literacy
– Until now, results haven’t been produced in a format usable by the community (only raw printouts) – which is not really acceptable. Fixing this is also good science: for subtle linguistic judgments, one needs speaker involvement.
![Page 5: Christopher Manning Computer Science and Linguistics, Stanford University](https://reader035.fdocuments.us/reader035/viewer/2022062422/56812cad550346895d915d4f/html5/thumbnails/5.jpg)
Educational goals
Dictionary structure and usability are often dictated by professional linguists, while the needs of others (speakers, semi-speakers, young users, second language learners) are not met. Focus: school kids.
The low level of literacy in the region makes an e-dictionary potentially more useful than a paper edition
• less dependent on good knowledge of spelling and alphabetical order
• builds on captivating qualities of computers
• multimedia content and the pronunciations of words are a considerable help as well
![Page 6: Christopher Manning Computer Science and Linguistics, Stanford University](https://reader035.fdocuments.us/reader035/viewer/2022062422/56812cad550346895d915d4f/html5/thumbnails/6.jpg)
Kirrkirr: A Warlpiri dictionary browser
(Jansz 1998; Jansz, Manning and Indurkhya 1999, 2000)
An environment for the interactive exploration of dictionaries.
Although our current work has just been with Warlpiri, the design is general – any XML dictionary
Attempts to more fully utilise graphical interfaces, hypertext, multimedia, and different ways of indexing and accessing information
Written in Java, it can either be run over the web (needs bandwidth) or locally (here Java’s main advantage is cross-platform support: Win/Mac/Unix)
– originally JDK1.1.6+Swing, now Java 2
![Page 7: Christopher Manning Computer Science and Linguistics, Stanford University](https://reader035.fdocuments.us/reader035/viewer/2022062422/56812cad550346895d915d4f/html5/thumbnails/7.jpg)
Overview
Kirrkirr provides various modules
– Animated network layout of word relationships
– Formatted dictionary entries
– A notes facility for ‘jotting in the margin’ annotations
– Multimedia: audio, pictures
– Semantic domain browsing
– Advanced searching interfaces
others in planning: formatting (XSL) editing, figuration patterns, terminology sets
These attempt to cater to users with different interests and competence levels
![Page 8: Christopher Manning Computer Science and Linguistics, Stanford University](https://reader035.fdocuments.us/reader035/viewer/2022062422/56812cad550346895d915d4f/html5/thumbnails/8.jpg)
(Kirrkirr screen shot)
![Page 9: Christopher Manning Computer Science and Linguistics, Stanford University](https://reader035.fdocuments.us/reader035/viewer/2022062422/56812cad550346895d915d4f/html5/thumbnails/9.jpg)
The lexical database
Original text materials are stored in an ad hoc format of markup using backslash codes with some (rather odd) nesting of structural tags [origin: troff]
These are converted to XML using an error-correcting stack-based parser (written in Perl)
– The inconsistency and flexibility of dictionary entries actually made this a surprisingly difficult task.
– Innumerable structural errors/inconsistencies/typos from years of hand maintenance in text editors and via regexps
– Innumerable problems with link consistency
– Heuristic content-sensitive parser imposes data integrity
– After much grief – Software Engineering 101 – we now have a ‘one click’ process for regenerating the whole system
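The stack-based, error-correcting conversion described above can be illustrated with a minimal sketch. The source does not show the original backslash format or the Perl parser, so the markup here is an assumption made up for illustration: `\tag` opens a field and `\/tag` closes it. The recovery rules (close any inner fields left open, drop stray closing codes, close everything at end of input) are likewise only one plausible policy.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical markup (not the real Warlpiri source format):
#   \tag  opens a field, \/tag closes it, anything else is text.
TOKEN = re.compile(r"\\(/?)(\w+)|([^\\]+)")


def to_xml(text):
    """Convert backslash-coded text to well-formed XML, repairing nesting errors."""
    out = []    # pieces of the XML output
    stack = []  # names of the fields that are currently open
    for match in TOKEN.finditer(text):
        closing, name, chars = match.groups()
        if chars is not None:
            # Plain text: escape XML special characters, skip pure whitespace.
            if chars.strip():
                out.append(escape(chars.strip()))
        elif not closing:
            # Opening code: push it and emit a start tag.
            stack.append(name)
            out.append(f"<{name}>")
        elif name in stack:
            # Closing code: first close any inner fields that were left open.
            while True:
                top = stack.pop()
                out.append(f"</{top}>")
                if top == name:
                    break
        # A closing code with no matching open field is silently dropped.
    # End of input: close whatever is still open.
    while stack:
        out.append(f"</{stack.pop()}>")
    return "".join(out)
```

For example, `to_xml(r"\entry \hw headword-1 \gloss some gloss text \/hw \/bogus \/entry")` closes the unterminated `gloss` field before `hw` and ignores the stray `\/bogus`, so the result always parses as XML even when the input nesting is inconsistent.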
![Page 10: Christopher Manning Computer Science and Linguistics, Stanford University](https://reader035.fdocuments.us/reader035/viewer/2022062422/56812cad550346895d915d4f/html5/thumbnails/10.jpg)
Data representation: XML
Use of XML gives a clear structure to the lexical data
Separates structure of the data from its presentation
Much of the recent enthusiasm for XML has involved textual representation of simple and rigid structures well-represented in RDB records (e-commerce, etc.)
But dictionary entries thoroughly exploit XML
– rich hierarchical structure
– entities vary greatly depending on the word being defined
Result remains a portable, tangible text file
Use of standards for field linguistic data means: many (free) tools are available, extra functionality comes for free, one can interoperate with mainstream software.
![Page 11: Christopher Manning Computer Science and Linguistics, Stanford University](https://reader035.fdocuments.us/reader035/viewer/2022062422/56812cad550346895d915d4f/html5/thumbnails/11.jpg)
Kirrkirr’s XML Index Process
[Diagram. Labelled components: XML-formatted Warlpiri dictionary file (<DICTIONARY><ENTRY>...</ENTRY> ...</DICTIONARY>), accessed across the file system or web; XML parser; index in memory (word, file position); XML Document Object Model; Kirrkirr dictionary browser]
We currently use ad hoc indexing of the XML for efficient access (but expect to move to XQL, as it matures).
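A word-to-file-position index of the kind named on this slide might look like the following sketch. It is not Kirrkirr's actual indexing code (which is in Java and not shown in the source); the `<HW>` and `<GLOSS>` element names and the sample data are assumptions for illustration. The idea is to scan the file once, remember the byte range of each entry, and later parse only the entry that is asked for.

```python
import re
import xml.etree.ElementTree as ET

# Assumed element names: <ENTRY> comes from the slide, <HW> (headword) is hypothetical.
ENTRY = re.compile(rb"<ENTRY>.*?</ENTRY>", re.DOTALL)
HEADWORD = re.compile(rb"<HW>(.*?)</HW>", re.DOTALL)


def build_index(data):
    """Map each headword to the (start, end) byte offsets of its entry."""
    index = {}
    for match in ENTRY.finditer(data):
        headword = HEADWORD.search(match.group(0))
        if headword:
            word = headword.group(1).decode("utf-8").strip()
            index[word] = (match.start(), match.end())
    return index


def fetch_entry(data, index, word):
    """Parse only the bytes of one entry into an element tree."""
    start, end = index[word]
    return ET.fromstring(data[start:end])


# Made-up sample data, not real dictionary content.
SAMPLE = (
    b"<DICTIONARY>\n"
    b"<ENTRY><HW>alpha</HW><GLOSS>first</GLOSS></ENTRY>\n"
    b"<ENTRY><HW>beta</HW><GLOSS>second</GLOSS></ENTRY>\n"
    b"</DICTIONARY>\n"
)
```

With a file on disk, the same offsets could be used with `seek(start)` and `read(end - start)` so that the whole dictionary never has to be held as a parsed tree in memory.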
![Page 12: Christopher Manning Computer Science and Linguistics, Stanford University](https://reader035.fdocuments.us/reader035/viewer/2022062422/56812cad550346895d915d4f/html5/thumbnails/12.jpg)
Visualization of dictionary information
For dictionaries with simple textual content behind them, there is little that can be done but an on-line reflection of a printed page
But we would like to be able to do more
– we want to know a word’s relationships to other words, and the patterning in these relationships
In a computational approach, the program can mediate between lexical data and the user
The interface can select from and choose how to present information (according to the user’s preferences and abilities) – in many different ways
![Page 13: Christopher Manning Computer Science and Linguistics, Stanford University](https://reader035.fdocuments.us/reader035/viewer/2022062422/56812cad550346895d915d4f/html5/thumbnails/13.jpg)
Perils of visualisation
![Page 14: Christopher Manning Computer Science and Linguistics, Stanford University](https://reader035.fdocuments.us/reader035/viewer/2022062422/56812cad550346895d915d4f/html5/thumbnails/14.jpg)
Graph-based visualisation
(Jansz 1998; Jansz, Manning and Indurkhya 1999)
Classic graph layout problem
Adapts work by Eades et al. (1998) and Huang et al. (1998) on visualisation and navigation of WWW document linkages
Uses the spring algorithm. Big advantage is that it is an iterative updating algorithm, and so gives an easy interactivity:
– it wiggles and people can play with it, clicking to sprout nodes
A major goal was clarity and simplicity of the graph: the software maintains a set of focus nodes to prevent overcrowding
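To show why an iterative spring layout lends itself to interaction, here is a minimal sketch of one generic force-directed update step. It is not the Kirrkirr implementation or the exact formulation of the cited work; the force laws and the constants `rest_len`, `k_spring` and `k_repel` are assumptions chosen for illustration. Every pair of nodes repels, every edge acts as a spring, and each call moves the nodes a little, so a display can redraw after every step and new nodes can be added between steps.

```python
import math


def spring_step(pos, edges, rest_len=1.0, k_spring=0.1, k_repel=0.05):
    """Return new positions after one update of a simple spring model."""
    force = {node: [0.0, 0.0] for node in pos}
    nodes = list(pos)

    # Repulsion between every pair of nodes (inverse-square, an assumed choice).
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            dist = math.hypot(dx, dy) or 1e-9
            push = k_repel / dist ** 2
            force[a][0] += push * dx / dist
            force[a][1] += push * dy / dist
            force[b][0] -= push * dx / dist
            force[b][1] -= push * dy / dist

    # Spring force along each edge, pulling towards the rest length.
    for a, b in edges:
        dx = pos[b][0] - pos[a][0]
        dy = pos[b][1] - pos[a][1]
        dist = math.hypot(dx, dy) or 1e-9
        pull = k_spring * (dist - rest_len)
        force[a][0] += pull * dx / dist
        force[a][1] += pull * dy / dist
        force[b][0] -= pull * dx / dist
        force[b][1] -= pull * dy / dist

    return {n: (pos[n][0] + force[n][0], pos[n][1] + force[n][1]) for n in pos}


def layout(pos, edges, steps=100):
    """Run several update steps; an interactive display would redraw after each."""
    for _ in range(steps):
        pos = spring_step(pos, edges)
    return pos
```

"Sprouting" a node then amounts to adding a new key to `pos` and a new pair to `edges` and continuing to call `spring_step`; the existing nodes shift gradually instead of the whole picture being recomputed from scratch.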
![Page 15: Christopher Manning Computer Science and Linguistics, Stanford University](https://reader035.fdocuments.us/reader035/viewer/2022062422/56812cad550346895d915d4f/html5/thumbnails/15.jpg)
Kirrkirr network display
![Page 16: Christopher Manning Computer Science and Linguistics, Stanford University](https://reader035.fdocuments.us/reader035/viewer/2022062422/56812cad550346895d915d4f/html5/thumbnails/16.jpg)
Formatted dictionary entries
Are produced automatically and online from the XML by using XSL(T), a tree transformation language
XSL allows easy modelling of some user preferences
One can leave out information such as part of speech, or detailed definitions, or rearrange it
We provide several stylesheets to choose from
This issue is surprisingly important: many users find information overload confusing and demotivating
Can produce a bilingual or monolingual dictionary
Can also use this for print dictionaries (via RTF or TeX). We have produced a couple of samples.
![Page 17: Christopher Manning Computer Science and Linguistics, Stanford University](https://reader035.fdocuments.us/reader035/viewer/2022062422/56812cad550346895d915d4f/html5/thumbnails/17.jpg)
Formatted dictionary entries
![Page 18: Christopher Manning Computer Science and Linguistics, Stanford University](https://reader035.fdocuments.us/reader035/viewer/2022062422/56812cad550346895d915d4f/html5/thumbnails/18.jpg)
Rich typology of link types
The semantic links present in a dictionary (synonym, antonym, hyponym, subentry, variant, coverbs, …) solve a major problem of the web: we have many link types each with a clear semantic interpretation
We use consistent colour-coding of text and network edges to show these link types
Gives a richer browsing experience
You can tell where you are going before clicking
Dictionary-given links are supplemented by links derived from collocational analysis of Warlpiri texts
– uses log-likelihood ratios (Dunning 1993)
– works reasonably successfully from 1/4 million words
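The log-likelihood ratio used for collocations can be sketched as the G² statistic over a 2×2 contingency table of adjacent word pairs, which is the usual way Dunning's measure is computed. The source does not say how the Warlpiri texts were tokenised or which pairs were counted, so the adjacent-bigram counting in `bigram_table` and both function names are assumptions for illustration.

```python
import math


def bigram_table(tokens, w1, w2):
    """Count adjacent pairs into a 2x2 table for the candidate collocation (w1, w2)."""
    k11 = k12 = k21 = k22 = 0
    for first, second in zip(tokens, tokens[1:]):
        if first == w1 and second == w2:
            k11 += 1  # w1 followed by w2
        elif first == w1:
            k12 += 1  # w1 followed by something else
        elif second == w2:
            k21 += 1  # something else followed by w2
        else:
            k22 += 1  # neither
    return k11, k12, k21, k22


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 = 2 * sum(observed * ln(observed / expected)) over the four cells."""
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    g2 = 0.0
    for observed, i, j in cells:
        if observed > 0:  # an empty cell contributes nothing
            expected = rows[i] * cols[j] / total
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2
```

A pair whose words co-occur exactly as often as chance predicts scores 0, and higher scores indicate stronger association; ranking candidate pairs by this score and keeping the top ones is one way such derived links could be selected.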
![Page 19: Christopher Manning Computer Science and Linguistics, Stanford University](https://reader035.fdocuments.us/reader035/viewer/2022062422/56812cad550346895d915d4f/html5/thumbnails/19.jpg)
Educational advantages/usability
Work (at PARC and elsewhere: Pirolli et al. 1996) has stressed the role for browsing as well as searching in information access
It provides a context for learning
A student can opportunistically explore words that are related in various ways
Important semantic relationships can be understood
People continually see alphabetical order and word spellings, but don’t need to know them to use Kirrkirr
Use of “fuzzy spelling” in searches supports users with poor spelling. It usually finds what you wanted.
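The slides do not say how fuzzy spelling is implemented, so the following is only a stand-in that shows the general idea using Python's standard `difflib` similarity matching instead of whatever Kirrkirr actually does (a real system might, for instance, use rules for typical spelling confusions in the language). The function name, the `cutoff` value and the sample headword list are assumptions.

```python
import difflib


def fuzzy_lookup(query, headwords, limit=5, cutoff=0.6):
    """Return up to `limit` headwords that look similar to a possibly misspelt query."""
    # Compare case-insensitively, but return the headwords as originally spelt.
    by_key = {}
    for headword in headwords:
        by_key.setdefault(headword.lower(), headword)
    keys = difflib.get_close_matches(
        query.lower(), list(by_key), n=limit, cutoff=cutoff
    )
    return [by_key[key] for key in keys]


# Sample strings only; not taken from the Warlpiri dictionary data.
HEADWORDS = ["kirrkirr", "dictionary", "network"]
```

Here `fuzzy_lookup("kirkir", HEADWORDS)` still finds `kirrkirr` despite the missing letters, while a query that resembles nothing in the list returns an empty result.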
![Page 20: Christopher Manning Computer Science and Linguistics, Stanford University](https://reader035.fdocuments.us/reader035/viewer/2022062422/56812cad550346895d915d4f/html5/thumbnails/20.jpg)
Other components
Multimedia (currently pictures and audio)
– Can hear pronunciations: gives a much better understanding of pronunciation than phonetic symbols
– pictures of plants and animals are more intelligible than descriptions
– (future: videos of Warlpiri sign language …)
Advanced search page
– search various fields, regular expressions, fuzzy spelling, etc.
Notes:
– one can annotate dictionary entries (to correct or personalise)
![Page 21: Christopher Manning Computer Science and Linguistics, Stanford University](https://reader035.fdocuments.us/reader035/viewer/2022062422/56812cad550346895d915d4f/html5/thumbnails/21.jpg)
User study
Mim Corris (Yuendumu, Willowra), Jane Simpson (Lajamanu)
User testing with primary and (lower) secondary students (observation, and dictionary tasks)
Observation of trainee Warlpiri literacy workers
Comments from teachers, other adults, etc.
Qualitative ethnographic study of dictionary use. (Doing anything much else would be difficult.)
Initial reactions are very enthusiastic
Students used it voluntarily during lunch breaks
Could use as a basis for classroom activities (better with some further development: games and puzzles)
![Page 22: Christopher Manning Computer Science and Linguistics, Stanford University](https://reader035.fdocuments.us/reader035/viewer/2022062422/56812cad550346895d915d4f/html5/thumbnails/22.jpg)
A positive anecdote
“One of the introductory Warlpiri literacy students, who had not been very interested in the literacy class, spent nearly 3/4 hour looking at Kirrkirr apparently in absorbed concentration. She wasn’t especially interested in the sound and picture possibilities. She moved between words, scrolling along the list, typing in the search, clicking on the words in the network pane. She wasn’t even put off when the dictionary definitions stopped appearing – looking at the networks of words instead. … After the Kirrkirr demo she asked if she could have a printed dictionary to take away with her to use in camp to learn the words. I interpret this as a desire to learn words in her own time and place.”
![Page 23: Christopher Manning Computer Science and Linguistics, Stanford University](https://reader035.fdocuments.us/reader035/viewer/2022062422/56812cad550346895d915d4f/html5/thumbnails/23.jpg)
Endangered language dictionaries
(Corris, Manning, Poetsch, and Simpson 1999). Based on 72 people.
We’ve been testing both paper and electronic dictionary desires, use and usability
– competing goals: documentation dictionaries vs. maintenance/learning dictionaries (linguist vs. other user)
– symbolic organization vs. practically useful organization
– lack of understanding of dictionaries, and limited literacy can make paper dictionaries ineffective
• 45–60 minutes for 12 dictionary lookups…
– lack of electricity makes e-dictionaries ineffective in some places (e.g., Indonesia, but OK in Australian schools)
E-dictionaries can solve many usability issues
– font size, amount of info, ‘infinite’ space, easy lookup, sound, customizability
![Page 24: Christopher Manning Computer Science and Linguistics, Stanford University](https://reader035.fdocuments.us/reader035/viewer/2022062422/56812cad550346895d915d4f/html5/thumbnails/24.jpg)
Interim Conclusions
Kirrkirr is a prototype of what one can do to develop new ways to organize and visualize lexicons
We have addressed the challenge of making dictionary information accessible and usable in the creation of an application which mediates between well-structured data and users’ needs and insights in searching/browsing and presentation
It allows web distribution of information from a server
We’ve begun making it available in Warlpiri schools; more results to follow.
![Page 25: Christopher Manning Computer Science and Linguistics, Stanford University](https://reader035.fdocuments.us/reader035/viewer/2022062422/56812cad550346895d915d4f/html5/thumbnails/25.jpg)
Kirrkirr: Experiences with a flexible software interface to indigenous dictionaries
Christopher Manning, Computer Science and Linguistics, Stanford University
(with Kevin Jansz, University of Sydney, and Nitin Indurkhya, Nanyang Technological University)
http://www.sultry.arts.usyd.edu.au/kirrkirr/