Post on 25-Feb-2016
GSK: Development and Distribution of Resources
Hitoshi ISAHARA
GSK: Gengo Shigen Kyokai (Language Resource Association)
National Institute of Information and Communications Technology (NICT)
Licensing and Distribution of Resources and Applications
Regional Conference on Localized ICT Development and Dissemination across Asia
Jan. 15, Vientiane, Laos
Organizing Creation & Utilization of Language Corpora
Creating language corpora is costly.
Using them requires a system for distributing corpora.
Some activities started in the early 1990s:
1992: LDC in the U.S.A.
1995: ELRA in Europe
Japanese Activities
GSK: Gengo Shigen Kyokai (Language Resource Association)
Launched in 1999
Reformed as an NPO in 2003
Project accepted in 2005 for 3 years
Text corpora are its main concern at present.
NII-SRC distributes speech corpora.
GSK and NII-SRC
Language Resource Association (GSK)
A nonprofit organization collecting and distributing text and speech corpora.
http://www.gsk.or.jp/
NII-Speech Resources Consortium (NII-SRC)
Collects and distributes most major speech corpora.
http://research.nii.ac.jp/src/eng/
These two organizations aim to play central roles in collecting and distributing speech and language corpora in Japan.
JEITA (Japan Electronics and Information Technology Industries Association)
Knowledge Information Processing Technologies Committee
Language Resource Sub-committee
NICT: National Institute of Information and Communications Technology
Natural Language Processing Portal Site
SHACHI: Language Resource Metadata DB
NII: National Institute of Informatics
GSK
NII-SRC
TCL
Purpose of GSK
Collection, distribution, investigation, research, and standardization of electronic data and software tools necessary for the promotion of science, technology, education and industry concerning natural language.
GSK Organization
President
Two vice presidents
11 board members
25 steering committee members
All are voluntary workers.
No-fee Distribution
Provider
User
GSK
Agreement
Distribution permission
Corpus
Payment
As a rule, the cost of handling corpora falls on the user, though the corpus itself is free of charge.
Agency
Agency
Commission
GSK
Request form
Payment
Agreement
Provider
User
Corpus providers entrust GSK with handling the requests received from users. GSK mediates between users and providers.
Advertising
Provider
User
GSK
Ad request
Ad rate
Payment
Agreement
Publicity
Corpus providers entrust GSK with advertising useful information about their data or corpora.
Some Examples of GSK Corpora
JEITA Multimodal Corpus
Japanese Web N-gram Version 1
CICC Multilingual Dictionary
IPAL Lexicon of Basic Japanese
JEITA Multimodal Corpus
A corpus of person-to-person task-oriented dialogues.
Includes 80 min. of video for 9 conversations on the topics of “faces” and “travel”.
Speech data are transcribed and annotated with morphemes, dialogue structure, and prosody.
Contained in 1 DVD-R (800 MB).
Japanese Web N-gram Version 1
N-grams extracted from Google's crawl of publicly available Japanese webpages.
Pages requiring special permission to browse or marked noarchive/noindex are not included.
N-grams (1-grams to 7-grams) with frequency greater than 20 were extracted from approximately 20 billion sentences.
Contained in 6 DVD-Rs (26 GB after gzip compression).
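The extraction described above (count all 1-grams to 7-grams, keep those with frequency greater than 20) can be illustrated with a minimal Python sketch. This is not the pipeline actually used to build the corpus; the function names `count_ngrams` and `filter_by_frequency` are hypothetical, and the input is assumed to be already segmented into tokens (real Japanese text would first need a word segmenter, which is outside this sketch).

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple

NGram = Tuple[str, ...]


def count_ngrams(sentences: Iterable[Sequence[str]], max_n: int = 7) -> Counter:
    """Count every n-gram of length 1..max_n over pre-tokenized sentences."""
    counts: Counter = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            # Slide a window of width n across the sentence.
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def filter_by_frequency(counts: Counter, threshold: int = 20) -> Dict[NGram, int]:
    """Keep only n-grams whose frequency is strictly greater than threshold."""
    return {ngram: c for ngram, c in counts.items() if c > threshold}


# Toy input (romanized tokens, assumed for illustration only).
sentences = (
    [["gengo", "shigen", "kyokai"]] * 21
    + [["gengo", "shigen"]] * 3
    + [["corpus"]] * 5
)

counts = count_ngrams(sentences)
frequent = filter_by_frequency(counts)

for ngram, freq in sorted(frequent.items(), key=lambda kv: (-kv[1], kv[0])):
    print(" ".join(ngram), freq)
```

At the scale of the real corpus (about 20 billion sentences) an in-memory counter like this would not be practical; the sketch only shows the counting and thresholding logic.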
CICC Multilingual Dictionary
A collection of Malay, Indonesian, Chinese, and Thai dictionaries containing 50,000 basic words with POS tags; some contain English translations.
A technical term dictionary for each language is also available.
Contained in 1 CD-ROM for each language.
CICC: Center of the International Cooperation for Computerization
IPAL Lexicon of Basic Japanese
Contains 861 verbs, 136 adjectives, and 1,081 nouns, with a glossary.
English translations are also provided for the nouns contained in the glossary.
Contained in 1 CD-ROM.
Summary
1. There are several distributors of language resources in Japan.
2. GSK is the only language resource consortium qualified as an NPO in Japan.
3. GSK plans to collaborate with the Language Grid Project.